SCD implementation from Source based CDC table
Consider a scenario where we want to capture all or any changes in the source table to be maintained as history in our data warehouse dimension table. So if we consider day end batch load we will miss out any intermediate changes to the master data as the OLTP system doesn't maintains version. But what if we have already captured the changes via database CDC mechanism. Then based on the operation type and operation order of CDC based source table we can ascertain the corresponding operations to the dimension table. We can even avoid lookup to the target dimension table for SCD Type 2 implementation with optimum performance.
Database Change Data Capture Mechanism
Let us consider an OLTP MSSQL table say ERP_PUBLISHER which has data related to books and publisher. The database schema is enabled for CDC using procedure sys.sp_cdc_enable_db proc. Next using the procedure sys.sp_cdc_enable_table, @source_name = N'ERP_PUBLISHER' and @capture_instance = N'PUBLISHER' we have the source based CDC table i.e. PUBLISHER in place. So any DML performed in the source table ERP_PUBLISHER will be captured in the CDC table PUBLISHER. Next we will use this CDC table and import in Data Services as our source to load slowly changing dimension table of Type 2 in target data warehouse. Here the source side natural key is PUB_ID, and the Surrogate key of dimension table is PUB_KEY.
Implementation Logic
The implementation logic is a bit tricky, as we need to consider a few scenarios based on the Operation Type & Operation Order. Let's go through them first followed by ETL implementation using SAP Data Services. Lets consider the below scenarios with respect to PUB_ID:
1. A single source record having a natural key K1, with operation code as INSERT.
This basically means that a master record is created in the source; This also means there is no entry for the same in target dimension table.
- We will simply insert this record K1 with a new Surrogate key and Active Flag marked as True/Yes and the record validity/end date is set to NULL OR some future date like- 2050.12.31
2. Let's consider a single source record having a natural key K2, with operation code as UPDATE.
This basically means that a few attributes of an existing master record in the source have been modified/updated; This also means there is a record for the same key K2 in target dimension table, with Active Flag marked as True/Yes and the record validity/end date is set to NULL OR some future date like- 2050.12.31.
- In this case we will insert this record K2 with a new Surrogate key and Active Flag marked as True/Yes and the record validity/end date is set to NULL OR some future date like- 2050.12.31
- Next we have to lookup the target dimension table with natural key K2 and with Active Flag marked as True/Yes and the record validity/end date as NULL OR some future date like- 2050.12.31. Next with the Surrogate key, returned from the target lookup we will update the dimension record with Active Flag marked as False/No and the record validity/end date is set to SYSDATE.
Alternatively, we can Update that dimension record where the PUB_ID is K2 and Active Flag as True/Yes with Active Flag marked as False/No and the record validity/end date is set to SYSDATE.
3. Let's consider a single source record having a natural key K3, with operation code as DELETE.
This basically means that an existing master record has been deleted; This also means there is a record for the same in target dimension table with Active Flag marked as True/Yes and the record validity/end date is set to NULL OR some future date like- 2050.12.31
- In this case we have to lookup the target dimension table with natural key K3 and with Active Flag marked as True/Yes and the record validity/end date as NULL OR some future date like- 2050.12.31. Next with the Surrogate key, returned from the target lookup we will update the dimension record with Active Flag marked as False/No and the record validity/end date is set to SYSDATE.
Alternatively, we can Update that dimension record where the PUB_ID is K3 and Active Flag as True/Yes with Active Flag marked as False/No and the record validity/end date is set to SYSDATE.
4. Let's consider two source records having a natural key K4, with operation codes as INSERT followed by one with UPDATE.
This basically means that a master record was created in the source and later a few attributes have been modified/updated for the same; This also means neither new or changed values are present in target dimension table.
- For the record with operation code INSERT, we will simply insert this record K4 with a new Surrogate key and Active Flag marked as False/No and the record validity/end date is set to SYSDATE.
- For the record with operation code UPDATE, We will simply insert this record K4 with a new Surrogate key and Active Flag marked as True/Yes and the record validity/end date is set to NULL OR some future date like- 2050.12.31
So this holds true for scenarios where a single insert may be accompanied by multiple updates for the same source natural key. So for all the records the dimension table will have the corresponding entries with Active Flag marked as False/No and the record validity/end date is set to SYSDATE expect for the last update record. Hence the DI_SEQUENCE_NUMBER is critical to ascertain the DML execution order within the group of source natural key.
5. Let's consider two source records having a natural key K5, with operation codes as INSERT followed by one DELETE.
This basically means that a master record was created in the source and at a later point of time it was deleted; This also means neither new or deleted values are present in target dimension table.
- For the record with operation code INSERT, we will simply insert this record K5 with a new Surrogate key and Active Flag marked as False/No and the record validity/end date is set to SYSDATE.
- For the record with operation code DELETE, We will simply discard this record.
So this holds true for scenarios where a single insert may be accompanied by multiple updates for the same source natural key and finally followed by a DELETE operation. So for all the records the dimension table will have the corresponding entries with Active Flag marked as False/No and the record validity/end date is set to SYSDATE. As mentioned earlier we will discard the last record with operation code as DELETE. Hence the DI_OPERATION_TYPE & DI_SEQUENCE_NUMBER is critical to ascertain the DML execution type & order within the group of source natural key.
6. Let's consider two source records having a natural key K6, with operation codes as UPDATE and DELETE.
This basically means that a few attributes of an existing master record have been modified/updated and later the master record was deleted; This also means there is a record for the same in target dimension table with Active Flag marked as True/Yes and the record validity/end date is set to NULL OR some future date like- 2050.12.31
- In this case we will insert this record K6 with a new Surrogate key and Active Flag marked as False/No and the record validity/end date is set to SYSDATE.
- Next we have to lookup the target dimension table with natural key K6 and with Active Flag marked as True/Yes and the record validity/end date as NULL OR some future date like- 2050.12.31. Next with the Surrogate key, returned from the target lookup we will update the dimesion record with Active Flag marked as False/No and the record validity/end date is set to SYSDATE.
Alternatively, we can Update that dimension record where the PUB_ID is K6 and Active Flag as True/Yes with Active Flag marked as False/No and the record validity/end date is set to SYSDATE.
NOTE
It is important to group by the source records based on the Source Natural Key followed by DI_SEQUENCE_NUMBER; Also the DI_OPERATION_TYPE is critical to process the source records accordingly.
ETL LOGIC
- We need to flag records to identify whether they are the LAST record or not within the Source Natural Key group.
- So for LAST records within the Source Natural Key groups will be loaded in the target with Active Flag marked as True/Yes and the record validity/end date as NULL OR some future date like- 2050.12.31.
- We need to flag records to identify whether they have any presence in the dimension table or not;
ETL Implementation
Below is the sample implementation job.
Below is the sample Dataflow for CDC Source & SCD Type2 Target implementation.
1. We will first sort the CDC Table data in the reverse order of DI_SEQUENCE_NUMBER to mark the last record within each Source Natural Key group.
Also we will ignore the records having operation codes for before image of updates. Again we need to consider only those records having the DI_SEQUENCE_NUMBER greater than the previous batch loads DI_SEQUENCE_NUMBER.
We will use a Pre-Script SQL function along with a Global Variable
$DI_SEQUENCE_NUMBER to get the last run DI_SEQUENCE_NUMBER stored in a load control table for each day's load.
Considering Publisher ID is the source natural key of the Source CDC table PUBLISHER, select source data set where
PUBLISHER.DI_SEQUENCE_NUMBER > $DI_SEQUENCE_NUMBER
AND PUBLISHER.DI_OPERATION_TYPE <> 'B'
ORDER BY PUB_ID, DI_SEQUENCE_NUMBER DESC
Note: This operation can be pushed down to the source database.
Within the dataflow another pipeline/flow will write the maximum DI_SEQUENCE_NUMBER for the day's load/batch run into this load control table.
2. NL_FLAG: Not Last Flag- So this flag will help to identify the last DML operation performed within a Source Natural Key group. Hence all other not last records will be inserted in DIM table with ACTIVE_FLAG as ‘N’, since SCD Type 2 implementation.
If the Previous Row Value of the Natural key is equal to Current Natural key value then we will flag it as 'NL' else 'L'
3. Again we Reverse Sort the data set based on DI_SEQUENCE_NUMBER to mark the last DML operation within each Source Natural Key group.
Considering Publisher ID is the source natural key, select source data set ORDER BY PUB_ID, DI_SEQUENCE_NUMBER.
4. EXS_FLAG: Existing Record Flag- So this flag will help to identify whether the input natural key is present in the dimension table or not.
- If the Previous Row Value of the Natural key is equal to Current Natural key value then we will flag it as 'NA'
- If the Previous Row Value of the Natural key is not equal to Current Natural key value and operation type is ‘I’ then we will flag it as 'NEW'; If a new published id comes in it will be obviously the first DML within the publisher id group having operation type as ‘I’, may be followed by other updates or deletes.
- If the Previous Row Value of the Natural key is not equal to Current Natural key value and operation type is ‘D’ then we will flag it as 'DEL'; If a publisher id comes in with operation type as ‘D’, which means this publisher info is already there in DIM table, we just need to make the target record as inactive
- For all other records we mark them as 'EXS'. These are the records where the Previous Row Value of the Natural key is not equal to Current Natural key value and operation type is ‘U’
5. FIL_FLAG: Filter deleted records- For records having operation type as ‘D’ but being accompanied by other updates/insert we will filter out that record.
If the Previous Row Value of the Natural key is equal to Current Natural key value and operation type is ‘D’ then we will flag it as 'Y' else 'N'
6. Next we select only those records in the next Query transform where FIL_FLAG = 'N'
7. Next we set the values of the history columns for the dimension table as below:
- VALID_FROM - sysdate()
- VALID_TO - if NL_FLAG = 'NL' then sysdate() else NULL
- ACTIVE_FLAG - if NL_FLAG = 'NL' then 'N' else 'Y'
ACTIVE_FLAG: General active record marker for SCD Type 2 Dimension table.
8. Next we use a CASE transform with two labels:
- NEW - EXS_FLAG IN ( 'NEW', 'NA', 'EXS' )
- EXS - EXS_FLAG IN ( 'EXS', 'DEL' )
9. Next for the case NEW records we will insert the records in the Dimension table with a Key Generation transform before the target table for the surrogate key generation.
10. For the case EXS record we will stage the PUB_ID in a temporary staging table say TEMP_PUB_ID_UPD.
Alternatively we can lookup the dimension table using a New lookup_ext function call, where the PUB_ID equals the input PUB_ID and and ACTIVE_FLAG = 'Y'. After that with the return PUB_KEY, surrogate key of the dimension table from lookup call, we will use a Map_Operation transform to update the target dimension tables existing rows to mark VALID_TO as sysdate() and ACTIVE_FLAG as 'N'
11. Next we will use a post dataflow script task to update records in the dimension table based on the PUB_ID in the temporary table.
Using Post Update Script
MSSQL Target
UPDATE PUBLISHER_DIM
SET PUBLISHER_DIM.ACTIVE_FLAG = 'N',
PUBLISHER_DIM.VALID_TO = GETDATE()
FROM TEMP_PUB_ID_UPD T
WHERE T.PUB_ID = PUBLISHER_DIM.PUB_ID
AND PUBLISHER_DIM.ACTIVE_FLAG = 'Y'
ORACLE Target
UPDATE (
SELECT /*+ PARALLEL(D, 8) */ T.PUB_ID, D.ACTIVE_FLAG, D.VALID_TO, D.PUB_ID
FROM TEMP_PUB_ID_UPD T, PUBLISHER_DIM D
WHERE T.PUB_ID = D.PUB_ID
AND D.ACTIVE_FLAG = 'Y'
)
SET D.ACTIVE_FLAG = 'N'
AND D.VALID_TO = SYSDATE()