Map_CDC_Operation in Data Services
Map CDC Operation is used to support relational or mainframe changed-data capture (CDC). This transform reads the source DML changes i.e. INSERT/UPDATE/DELETE and reflects the same in the target system. So the transform maps all the source rows logged as insert to an insert, all updates to update and the deletes to delete in the target table.
Whenever a CDC based source table is imported in Data Services, it generates two more additional columns namely DI_OPERATION_TYPE and DI_SEQUENCE_NUMBER.
- The DI_OPERATION_TYPE column holds the corresponding database DML operation codes for insert,update & delete. Possible OP codes are I - Insert, B - Before Image of update, U - Update, D - Delete
- The DI_SEQUENCE_NUMBER column holds the information of the order/sequence of the database DML operations. For example if we want to update or delete the row first and then insert it, it doesn't makes sense; Hence the execution order has to be maintained. The order of DML command execution is very important.
Transform Properties
- Sequencing column should be mapped to the column which specifies an integer used to order table rows or DML execution ordering in source.
- Row operation column should be mapped to a column which specifies the database DML operation codes.
- Input already sorted by sequencing column - This transform by default assumes that the input data is already sorted based on the value selected in the Sequencing column box. If we deselect this check box, Data Services will re-sort the input data based on the Sequencing column. Use the re-sort capability of this transform only when necessary as it impacts job performance.
So let's consider a scenario where the Source relational database Oracle/MSSQL is having CDC enabled and a particular table have been selected for CDC; Next we want to reflect these changes to a same table structure in another remote target database. In this case we can use Map_CDC_Operation transform.
Implementation
Let us consider we have a CDC based source table called PUBLISHER and we want to replicate the same to a different database using Map_CDC_Operation. Find below the implementation Job & Dataflow for the same:
1. The initialization script is used to get the maximum value of DI_SEQUENCE_NUMBER extracted during last execution time. After each execution the value of DI_SEQUENCE_NUMBER is preserved in a Replication control table.
#MSSQL Replication Control Table
$DI_SEQUENCE_NUMBER = cast( sql( 'DS_TGT_REPO',
'SELECT ISNULL( MAX( DI_SEQUENCE_NUMBER ), -1 ) FROM DW_LOAD_CONTROL
WHERE TARGET_NAME = {$TARGET_NAME}' ), 'int' );
2. Here PUBLISHER is our MSSQL CDC based Source table. The Query transform extracts only those tuples where the DI_SEQUENCE_NUMBER is greater than the last maximum DI_SEQUENCE_NUMBER extracted during last run. Also we have selected DI_SEQUENCE_NUMBER column in the ORDER BY tab as sort order Ascending
3. Next we have the Map_CDC_Operation transform followed by the target table which is a replica of the Source DB table.
The above functionality can also be achieved with a CASE transform plus three Map_Operation transforms, although the Map_CDC_Operation is just more convenient.
NOTE
- While using a relational or mainframe CDC source table, the DI_OPERATION_TYPE & DI_SEQUENCE_NUMBER columns are automatically selected as the Row operation column and the Sequencing column of the Map_CDC_Operation respectively.
- While using a relational or mainframe CDC source table, the DI_OPERATION_TYPE & DI_SEQUENCE_NUMBER columns appear in the input schema, and are automatically removed in the output schema. To propagate these columns to the output, create additional columns that map to them prior to coming into the Map_CDC_Operation transform.
- The Relational CDC sources supported by the transform includes Oracle and SQL Server.
- This transform is typically the last object before the target in a data flow because it produces INPUT, UPDATE and DELETE operation codes. Data Services produces a warning if other objects are used in between Map_CDC_Operation & Target.
USE CASE
A Map_CDC_Operation can be used to replicate source CDC table changes to another table of similar structure in a remote database or staging area of a data warehouse. Also we can use this transform to replicate/sync a SAP HANA database target table from/with a CDC enabled Source table.
In this context, the source based CDC tables can be optimally used to load slowly changing dimension tables of Data warehouse. In this context it is relevant to highlight that, since the DML operations codes/indicators are coming along with the source records, it might not at all be required to lookup the target dimension table anymore during dimension loading.
What happens in Slowly Changing Dimension Type1
Database Change data capture, records insert, update, and delete activity that is applied to a table. So based on the DI_OPERATION_TYPE & DI_SEQUENCE_NUMBER, we will INSERT or UPDATE records accordingly to target dimension table of SCD TYPE 1. Here we may filter out the records with DI_OPERATION_TYPE as 'D' i.e. delete at the source, because we usually don't delete records from dimension table.
What happens in Slowly Changing Dimension Type2
1. If a changed record comes from the source, which is not present in the target dimension table:
- A new entry is created in the dimension table with Active Flag marked as True/Yes and the record validity/end date is set to NULL OR some future date like- 2050.12.31
2. If a changed record comes from the source, which is present in the target dimension table:
- A new entry is created in the dimension table with Active Flag marked as True/Yes and the record validity/end date is set to NULL OR some future date like- 2050.12.31
- The existing active record for the natural key in the dimension table is modified/updated to mark the Active Flag as False/No and the record validity/end date is set to SYSDATE.
SCD Type 1 and SCD Type 2 implementation might not be achieved directly via Map_CDC_Operation. But consider a case where we have a CDC based source table and our target DW dimension table is a slowly changing of Type1 or Type2, then the question remains do we need to again lookup the target dimension table. Check out the next article on Slowly Changing Dimension Type 1 & Type 2 implementation of CDC based Source table.