For this we will assume our source table is SCD2_SOURCE and SCD2_TARGET is our Target. The source and target table definition are given below:
Source Table Definition
|COLUMN NAME||DATA TYPE|
Target Table Definition
|COLUMN NAME||DATA TYPE||REMARKS / MEANING|
|EMP_KEY||NUMBER(29,0)||Surrogate Key Column|
|EMP_NO||NUMBER(4,0)||Natural Key/ Source Identifier|
|ENAME||VARCHAR2(10 BYTE)||SCD Column|
|JOB||VARCHAR2(9 BYTE)||SCD Column|
|DEPTNAME||VARCHAR2(20 BYTE)||SCD Column|
|START_DATE||DATE||start date of the recordset|
|END_DATE||DATE||end date of the recordset|
|ACTIVE_FLAG||CHAR(1 BYTE)||Signifies the most recent record|
As in case of any SCD Type 2 implementation, here we need to first find out the set of SCD2 records which qualify for either INSERT or INSERT/UPDATE. Based on this approach, a typical mapping will contain expression, router and update strategy transformations but will not contain any lookup transformation. Instead, we will be using the ORA_HASH function in source qualifier for comparing records.
We have modified the Source Qualifier override query as shown below and have introduced two additional columns: TGT_EMP_KEY and FLAG
SELECT SRC.EMPNO, A.EMP_KEY TGT_EMP_KEY , SRC.ENAME, SRC.JOB, SRC.HIREDATE, SRC.SAL, SRC.DEPTNAME, DECODE(NVL(A.EMPNO,1), 1, 'I','U')FLAG FROM SCD2_SOURCE SRC, (SELECT * FROM SCD2_TARGET WHERE ACTIVE_FLAG = 'Y') A WHERE SRC.EMPNO = A.EMPNO(+) AND ORA_HASH(SRC.ENAME||SRC.JOB||SRC.SAL||SRC.DEPTNAME) != NVL(ORA_HASH(A.ENAME|| A.JOB||A.SAL||A.DEPTNAME),00)
In case you are wondering, ORA_HASH is a Oracle database function that computes a hash value for a given expression (In this case the concatenated column names). This function is useful for operations such as comparing and analyzing a subset of data, generating a random sample etc. The function returns a NUMBER value.
Thereafter we have used an Expression transformation as shown
Next we would use a ROUTER where there would be two groups created one would route the records for Inert i.e. FLAG = ‘I’ OR FLAG = ‘U’ and the next group would be for Update only i.e. FLAG = ‘U’.
The two output group of the router would be followed by UPDATE STRATEGIES. The UPDATE Strategy connecting the INSERT flow has to be set with a DD_INSERT while the Update Strategy connecting the UPDATE flow has to be set with a DD_UPDATE.
The INSERT path then Inserts, records, into Target table along with a Sequence generator whose output connects to the Surrogate key column of the target. The update path updates the END_DATE with the current date and ACTIVE_FLAG with ‘N’ based on the EMP_KEY returned by the lookup.
Benefits of Hash Comparison method for SCD
Some of the benefits of implementing SCD2 using this approach are:
- First the records which participate in the entire ETL are the ones affected. That is, we are not carrying baggage of extra records through some of the transformations and then filtering the ones actually required
- Second we are doing away with the Lookup transformation on the target which usually becomes an overhead in terms of cache space and cache creation time
- Third we can skip the comparison of the column values after the lookup
- Fourth, firing the query directly on the Source would prove to be better taking database performance into consideration
Limitations of this method
This approach has some limitations as well. Some of these are:
- Both Source and Target has to be on Oracle database and should have connectivity in between
- Works better if both Source and Target exists in the same database. For eg. If the Target is a Data Mart Dimension and Source in a stage table both in same DB but different schema.