Logo DWBI.org Login / Sign Up
Sign Up
Have Login?
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Login
New Account?
Recovery
Go to Login
By continuing you indicate that you agree to Terms of Service and Privacy Policy of the site.
Informatica

Implementing SCD2 in Informatica Using ORA_HASH at Source

Updated on Sep 30, 2020

In this article we shall see how we can implement SCD type2 in Informatica using ORA_HASH, which is an ORACLE function that computes hash value for a given expression. We can use this feature to find the existence of any change in any of the SCD column.

For this we will assume our source table is SCD2_SOURCE and SCD2_TARGET is our Target. The source and target table definition are given below:

Source Table Definition

COLUMN NAMEDATA TYPE
EMPNONUMBER(4,0)
ENAMEVARCHAR2(10)
JOBVARCHAR2(9)
HIREDATEDATE
SALNUMBER(7,2)
DEPTNAMEVARCHAR2(20)
LOAD_DATEDATE

Target Table Definition

COLUMN NAMEDATA TYPEREMARKS / MEANING
EMP_KEYNUMBER(29,0)Surrogate Key Column
EMPNONUMBER(4,0)Natural Key/ Source Identifier
ENAMEVARCHAR2(10)SCD Column
JOBVARCHAR2(9)SCD Column
HIREDATEDATENo Action
SALNUMBER(7,2)SCD Column
DEPTNAMEVARCHAR2(20)SCD Column
START_DATEDATEstart date of the recordset
END_DATEDATEend date of the recordset
ACTIVE_FLAGCHAR(1)Signifies the most recent record
LOAD_DATEDATE

SCD Mapping

As in case of any SCD Type 2 implementation, here we need to first find out the set of SCD2 records which qualify for either INSERT or INSERT/UPDATE. Based on this approach, a typical mapping will contain expression, router and update strategy transformations but will not contain any lookup transformation. Instead, we will be using the ORA_HASH function in source qualifier for comparing records.

We have modified the Source Qualifier override query as shown below and have introduced two additional columns: TGT_EMP_KEY and FLAG

SELECT SRC.EMPNO, 
       A.EMP_KEY TGT_EMP_KEY ,
       SRC.ENAME,
       SRC.JOB, 
       SRC.HIREDATE, 
       SRC.SAL, 
       SRC.DEPTNAME, 
       DECODE(NVL(A.EMPNO,1), 1, 'I','U')FLAG
FROM
	   SCD2_SOURCE SRC, 
	   (SELECT * FROM SCD2_TARGET WHERE ACTIVE_FLAG = 'Y') A
WHERE 
	   SRC.EMPNO = A.EMPNO(+)
	   AND ORA_HASH(SRC.ENAME||SRC.JOB||SRC.SAL||SRC.DEPTNAME) !=
	   NVL(ORA_HASH(A.ENAME|| A.JOB||A.SAL||A.DEPTNAME),00)

In case you are wondering, ORA_HASH is a Oracle database function that computes a hash value for a given expression (In this case the concatenated column names). This function is useful for operations such as comparing and analyzing a subset of data, generating a random sample etc. The function returns a NUMBER value.

Thereafter we have used an Expression transformation as shown

Expression Transformation
Expression Transformation

Next we would use a ROUTER where there would be two groups created one would route the records for Inert i.e. FLAG = ‘I’ OR FLAG = ‘U’ and the next group would be for Update only i.e. FLAG = ‘U’.

Router Transformation
Router Transformation

The two output group of the router would be followed by UPDATE STRATEGIES. The UPDATE Strategy connecting the INSERT flow has to be set with a DD_INSERT while the Update Strategy connecting the UPDATE flow has to be set with a DD_UPDATE.

The INSERT path then Inserts, records, into Target table along with a Sequence generator whose output connects to the Surrogate key column of the target. The update path updates the END_DATE with the current date and ACTIVE_FLAG with ‘N’ based on the EMP_KEY returned by the lookup.

Benefits of Hash Comparison method for SCD

Some of the benefits of implementing SCD2 using this approach are:

  • First the records which participate in the entire ETL are the ones affected. That is, we are not carrying baggage of extra records through some of the transformations and then filtering the ones actually required
  • Second we are doing away with the Lookup transformation on the target which usually becomes an overhead in terms of cache space and cache creation time
  • Third we can skip the comparison of the column values after the lookup
  • Fourth, firing the query directly on the Source would prove to be better taking database performance into consideration

Limitations of this method

This approach has some limitations as well. Some of these are:

  • Both Source and Target has to be on Oracle database and should have connectivity in between
  • Works better if both Source and Target exists in the same database. For eg. If the Target is a Data Mart Dimension and Source in a stage table both in same DB but different schema.