This article lists down the challenges faced in real-time with the use of custom Sequencing logic using Expression while enabling Pipeline Partitioning in Informatica.
Suppose we have a transaction table
JOB_ORDER_TRANS that has details about Job Orders created by the Marketing team in a product company. This transaction table has a sequence key (PK) in order to perform an update post the data load as demanded by the business and this PK is loaded using the custom sequencing logic mentioned in the following section.
This transaction table sources data from a stage table (JOB_ORDER_STG) that sends two million rows for each run i.e. two million rows are processed by ETL everyday using Incremental approach to load into this target table.
Existing code that loads this transaction table does not follow any partitioning strategy to read data from stage table because of which ETL takes more than one hour to complete the load, hence going beyond the agreed SLA. Sometimes, during the end of the month, the load on server that hosts the Integration Service further pushes the data load of this batch job by another 10 – 15 mins extra.
Refer below the structure of source and target tables and the sequencing logic followed in this code.
The sequencing logic followed in this codes based on using “reusable expression and lookup transformation” to generate the Primary Key. We have discussed this approach in our earlier article.
Since this code with Reusable Expression and Lookup based sequencing logic has performance concerns in Production, ETL development team has planned to increase the performance of the job by following the partition strategy to implement parallel data processing.
It’s decided to implement “Pass Through” partitioning at Informatica level by partitioning source data using
GEO_CD column (Geography Id) from Stage source. Implementing partitioning at Informatica level enables parallel processing and the mapping to load Stage to Transaction table run multiple instances, one for each Geo Code partition.
Refer the article on Informatica Partitions to know about Informatica Partition options.
Challenges in using existing Sequence Logic
If we implement Pipeline partitioning at Informatica level, parallel data processing will create a pipeline for each partition. When the session runs, pipeline of individual partition runs in parallel and improves the session performance.
With the previously described sequencing logic, each pipeline partition calls the reusable lookup, get the same maximum value from the target table and generate sequence using Reusable expression. This behaviour generates duplicate sequence values across partition pipelines since each partition pipeline calls the lookup simultaneously and generates sequence in parallel.
Refer below the duplicate PKs loaded in the table.
Solution - Hybrid Sequencing
In order to overcome this issue of loading duplicate PK values in target table, a hybrid approach to generate sequence that uses the combination of below logic is followed.
- Sequencing using Reusable Expression and Lookup (discussed above)
- Sequence Generator (with Reset option)
Reason why this Hybrid approach is followed instead of just using Sequence Generator alone is to avoid draining of sequence keys in Production as the source volume is incremental for every run. Implementing Hybrid Sequencing requires minor tweaks on Reusable Lookup and Expression transformations and a new Sequence Generator transformation.
Reusable Lookup (Hybrid)
Reusable lookup has one I/P port (dummy), one Return Port to return the maximum Primary Key value and one Lookup port to match with I/P dummy port. There is no change on the ports.
But, Lookup Override returns 0 instead of 1 as Max(PK). It’s the only change on Lookup transformation. Lookup Override used in this transformation is shown below
SELECT 1 AS DUMMY, NVL(MAX($$LkpCol_Nm),0) as TARGET_PK FROM $$LkpTbl_Nm
Reusable Expression (Hybrid)
This Reusable expression has a minor change to accept
NextVal from Sequence Generator to add it with Max (PK) from Reusable Lookup.
V_LKP_MAX_PK = IIF(V_LKP_MAX_PK=0, :LKP.LKP_TARGET_HYBRID(1), V_LKP_MAX_PK)
O_MAX_PK = V_LKP_MAX_PK + I_NEXTVAL
I_NEXTVAL refers the input port from Sequence Generator.
Configure the Sequence Generator by setting “Reset” option so that the sequence will be reset to start from 1 for every run.
Below picture shows the usage of hybrid sequencing approach in the mapping that loads Job Order Transaction table with pipeline partitioning enabled.
Refer below the PK loaded without duplicates as a result of Hybrid Implementation: