Informatica

Implementing SCD2 in Informatica Using ORA_HASH at Source

Saurav Mitra

In this article we shall see how we can implement SCD type2 in Informatica using ORA_HASH, which is an ORACLE function that computes hash value for a given expression. We can use this feature to find the existence of any change in any of the SCD column.

For this we will assume our source table is SCD2_SOURCE and SCD2_TARGET is our Target. The source and target table definition are given below:

Source Table Definition

COLUMN NAMEDATA TYPE
EMPNO NUMBER(4,0)
ENAME VARCHAR2(10)
JOB VARCHAR2(9)
HIREDATE DATE
SAL NUMBER(7,2)
DEPTNAME VARCHAR2(20)
LOAD_DATE DATE

Target Table Definition

COLUMN NAMEDATA TYPEREMARKS / MEANING
EMP_KEY NUMBER(29,0) Surrogate Key Column
EMP_NO NUMBER(4,0) Natural Key/ Source Identifier
ENAME VARCHAR2(10 BYTE) SCD Column
JOB VARCHAR2(9 BYTE) SCD Column
HIREDATE DATE No Action
SAL NUMBER(7,2) SCD Column
DEPTNAME VARCHAR2(20 BYTE) SCD Column
START_DATE DATE start date of the recordset
END_DATE DATE end date of the recordset
ACTIVE_FLAG CHAR(1 BYTE) Signifies the most recent record
LOAD_DATE TIMESTAMP(6) TIMESTAMP

SCD Mapping

As in case of any SCD Type 2 implementation, here we need to first find out the set of SCD2 records which qualify for either INSERT or INSERT/UPDATE. Based on this approach, a typical mapping will contain expression, router and update strategy transformations but will not contain any lookup transformation. Instead, we will be using the ORA_HASH function in source qualifier for comparing records.

We have modified the Source Qualifier override query as shown below and have introduced two additional columns: TGT_EMP_KEY and FLAG

SELECT SRC.EMPNO, 
       A.EMP_KEY TGT_EMP_KEY ,
       SRC.ENAME,
       SRC.JOB, 
       SRC.HIREDATE, 
       SRC.SAL, 
       SRC.DEPTNAME, 
       DECODE(NVL(A.EMPNO,1), 1, 'I','U')FLAG
FROM
	   SCD2_SOURCE SRC, 
	   (SELECT * FROM SCD2_TARGET WHERE ACTIVE_FLAG = 'Y') A
WHERE 
	   SRC.EMPNO = A.EMPNO(+)
	   AND ORA_HASH(SRC.ENAME||SRC.JOB||SRC.SAL||SRC.DEPTNAME) !=
	   NVL(ORA_HASH(A.ENAME|| A.JOB||A.SAL||A.DEPTNAME),00)

In case you are wondering, ORA_HASH is a Oracle database function that computes a hash value for a given expression (In this case the concatenated column names). This function is useful for operations such as comparing and analyzing a subset of data, generating a random sample etc. The function returns a NUMBER value.

Thereafter we have used an Expression transformation as shown

Expression Transformation

Next we would use a ROUTER where there would be two groups created one would route the records for Inert i.e. FLAG = ‘I’ OR FLAG = ‘U’ and the next group would be for Update only i.e. FLAG = ‘U’.

Router Transformation

The two output group of the router would be followed by UPDATE STRATEGIES. The UPDATE Strategy connecting the INSERT flow has to be set with a DD_INSERT while the Update Strategy connecting the UPDATE flow has to be set with a DD_UPDATE.

The INSERT path then Inserts, records, into Target table along with a Sequence generator whose output connects to the Surrogate key column of the target. The update path updates the END_DATE with the current date and ACTIVE_FLAG with ‘N’ based on the EMP_KEY returned by the lookup.

Benefits of Hash Comparison method for SCD

Some of the benefits of implementing SCD2 using this approach are:

  • First the records which participate in the entire ETL are the ones affected. That is, we are not carrying baggage of extra records through some of the transformations and then filtering the ones actually required
  • Second we are doing away with the Lookup transformation on the target which usually becomes an overhead in terms of cache space and cache creation time
  • Third we can skip the comparison of the column values after the lookup
  • Fourth, firing the query directly on the Source would prove to be better taking database performance into consideration

Limitations of this method

This approach has some limitations as well. Some of these are:

  • Both Source and Target has to be on Oracle database and should have connectivity in between
  • Works better if both Source and Target exists in the same database. For eg. If the Target is a Data Mart Dimension and Source in a stage table both in same DB but different schema.