Informatica

Challenges of Informatica Partitioning with Sequencing

Saurav Mitra

In the previous article, we showed how surrogate keys can be generated without using Sequence Generator transformation. However, if Informatica partitioning is implemented for such cases, then since each partition pipeline will call the lookup simultaneously, we will end up generating duplicate sequence numbers. In this article, we will see how we may resolve this issue.

This article lists down the challenges faced in real-time with the use of custom Sequencing logic using Expression while enabling Pipeline Partitioning in Informatica.

Suppose we have a transaction table JOB_ORDER_TRANS that has details about Job Orders created by the Marketing team in a product company. This transaction table has a sequence key (PK) in order to perform an update post the data load as demanded by the business and this PK is loaded using the custom sequencing logic mentioned in the following section.

This transaction table sources data from a stage table (JOB_ORDER_STG) that sends two million rows for each run i.e. two million rows are processed by ETL everyday using Incremental approach to load into this target table.

Unpartitioned Code

Existing code that loads this transaction table does not follow any partitioning strategy to read data from stage table because of which ETL takes more than one hour to complete the load, hence going beyond the agreed SLA. Sometimes, during the end of the month, the load on server that hosts the Integration Service further pushes the data load of this batch job by another 10 – 15 mins extra.

Refer below the structure of source and target tables and the sequencing logic followed in this code.

JOB_ORDER_STG

ColumnType
Region_CdString
Geo_CdString
Channel_NmNumber
Agent_IdNumber
Cust_idNumber
Order_NbrString
Order_DtDate
Create_DtDate

JOB_ORDER_STG

ColumnType
Order_key (PK)Number
Region_keyString
Geo_keyString
Channel_keyNumber
Agent_keyNumber
Cust_keyNumber
Order_NbrString
Order_DtDate
Create_DtDate
Update_DtDate

The sequencing logic followed in this codes based on using “reusable expression and lookup transformation” to generate the Primary Key. We have discussed this approach in our earlier article.

Problem Statement

Since this code with Reusable Expression and Lookup based sequencing logic has performance concerns in Production, ETL development team has planned to increase the performance of the job by following the partition strategy to implement parallel data processing.

It’s decided to implement “Pass Through” partitioning at Informatica level by partitioning source data using GEO_CD column (Geography Id) from Stage source. Implementing partitioning at Informatica level enables parallel processing and the mapping to load Stage to Transaction table run multiple instances, one for each Geo Code partition.

Refer the article on Informatica Partitions to know about Informatica Partition options.

Challenges in using existing Sequence Logic

If we implement Pipeline partitioning at Informatica level, parallel data processing will create a pipeline for each partition. When the session runs, pipeline of individual partition runs in parallel and improves the session performance.

With the previously described sequencing logic, each pipeline partition calls the reusable lookup, get the same maximum value from the target table and generate sequence using Reusable expression. This behaviour generates duplicate sequence values across partition pipelines since each partition pipeline calls the lookup simultaneously and generates sequence in parallel.

Refer below the duplicate PKs loaded in the table.

ORDER_KEYREGION_KEYGEO_KEYCHANNEL_KEYAGENT_KEYCUST_KEYORDER_NBRORDER_DT
1100105650088109736189801-04-2016 19:04
2100105750088284752456701-04-2016 19:04
1500206450088203842437801-04-2016 19:00
2500206781088139471680301-04-2016 19:04

Solution - Hybrid Sequencing

In order to overcome this issue of loading duplicate PK values in target table, a hybrid approach to generate sequence that uses the combination of below logic is followed.

  • Sequencing using Reusable Expression and Lookup (discussed above)
  • Sequence Generator (with Reset option)

Reason why this Hybrid approach is followed instead of just using Sequence Generator alone is to avoid draining of sequence keys in Production as the source volume is incremental for every run. Implementing Hybrid Sequencing requires minor tweaks on Reusable Lookup and Expression transformations and a new Sequence Generator transformation.

Reusable Lookup (Hybrid)

Reusable lookup has one I/P port (dummy), one Return Port to return the maximum Primary Key value and one Lookup port to match with I/P dummy port. There is no change on the ports.

But, Lookup Override returns 0 instead of 1 as Max(PK). It’s the only change on Lookup transformation. Lookup Override used in this transformation is shown below

SELECT 1 AS DUMMY, NVL(MAX($$LkpCol_Nm),0) as TARGET_PK FROM $$LkpTbl_Nm

Reusable Expression (Hybrid)

This Reusable expression has a minor change to accept NextVal from Sequence Generator to add it with Max (PK) from Reusable Lookup.

Port Expressions:

V_LKP_MAX_PK = IIF(V_LKP_MAX_PK=0, :LKP.LKP_TARGET_HYBRID(1), V_LKP_MAX_PK)
O_MAX_PK = V_LKP_MAX_PK + I_NEXTVAL

I_NEXTVAL refers the input port from Sequence Generator.

Sequence Generator:

Configure the Sequence Generator by setting “Reset” option so that the sequence will be reset to start from 1 for every run.

Mapping Snapshot:

Below picture shows the usage of hybrid sequencing approach in the mapping that loads Job Order Transaction table with pipeline partitioning enabled.

Refer below the PK loaded without duplicates as a result of Hybrid Implementation:

ORDER_KEYREGION_KEYGEO_KEYCHANNEL_KEYAGENT_KEYCUST_KEYORDER_NBRORDER_DT
1100105650088109736189801-04-2016 19:04
2100105750088284752456701-04-2016 19:04
3500206450088203842437801-04-2016 19:00
4500206781088139471680301-04-2016 19:04