PowerCenter Informatica Pipeline Partition
The PowerCenter® Partitioning Option increases the performance of PowerCenter through parallel data processing. This option provides a thread-based architecture and automatic data partitioning that optimizes parallel processing on multiprocessor and grid-based hardware environments.
There are different Types of Informatica Partitions, eg.
- Database partitioning
- Hash auto-keys
- Hash user keys
- Key range
Informatica Pipeline Partitioning Explained
Each mapping contains one or more pipelines. A pipeline consists of a source qualifier, all the transformations and the target. When the Integration Service runs the session, it can achieve higher performance by partitioning the pipeline and performing the extract, transformation, and load for each partition in parallel. Basically a partition is a pipeline stage that executes in a single reader, transformation, or writer thread.
The number of partitions in any pipeline stage equals the number of threads in the stage. By default, the Integration Service creates one partition in every pipeline stage. If we have the Informatica Partitioning option, we can configure multiple partitions for a single pipeline stage.
Setting partition attributes includes partition points, the number of partitions, and the partition types. In the session properties we can add or edit partition points. When we change partition points we can define the partition type and add or delete partitions(number of partitions).
We can set the following attributes to partition a pipeline:
Partition points mark thread boundaries and divide the pipeline into stages. A stage is a section of a pipeline between any two partition points. The Integration Service redistributes rows of data at partition points. When we add a partition point, we increase the number of pipeline stages by one. Increasing the number of partitions or partition points increases the number of threads.
We cannot create partition points at Source instances or at Sequence Generator transformations.
Number of partitions:
A partition is a pipeline stage that executes in a single thread. If we purchase the Partitioning option, we can set the number of partitions at any partition point. When we add partitions, we increase the number of processing threads, which can improve session performance. We can define up to 64 partitions at any partition point in a pipeline. When we increase or decrease the number of partitions at any partition point, the Workflow Manager increases or decreases the number of partitions at all partition points in the pipeline. The number of partitions remains consistent throughout the pipeline. The Integration Service runs the partition threads concurrently.
The Integration Service creates a default partition type at each partition point. If we have the Partitioning option, we can change the partition type. The partition type controls how the Integration Service distributes data among partitions at partition points.
We can define the following partition types here: Database partitioning, Hash auto-keys, Hash user keys, Key range, Pass-through, Round-robin.
- Database partitioning:
The Integration Service queries the database system for table partition information. It reads partitioned data from the corresponding nodes in the database.
The Integration Service processes data without redistributing rows among partitions. All rows in a single partition stay in the partition after crossing a pass-through partition point. Choose pass-through partitioning when we want to create an additional pipeline stage to improve performance, but do not want to change the distribution of data across partitions.
The Integration Service distributes data evenly among all partitions. Use round-robin partitioning where we want each partition to process approximately the same numbers of rows i.e. load balancing.
- Hash auto-keys:
The Integration Service uses a hash function to group rows of data among partitions. The Integration Service groups the data based on a partition key. The Integration Service uses all grouped or sorted ports as a compound partition key. We may need to use hash auto-keys partitioning at Rank, Sorter, and unsorted Aggregator transformations.
- Hash user keys:
The Integration Service uses a hash function to group rows of data among partitions. We define the number of ports to generate the partition key.
- Key range:
The Integration Service distributes rows of data based on a port or set of ports that we define as the partition key. For each port, we define a range of values. The Integration Service uses the key and ranges to send rows to the appropriate partition. Use key range partitioning when the sources or targets in the pipeline are partitioned by key range.
Points to consider while using Informatica partitions
- We cannot create a partition key for hash auto-keys, round-robin, or pass-through types partitioning
- If you have bitmap index defined upon the target and you are using pass-through partitioning to, say Update the target table - the session might fail as bitmap index creates serious locking problem in this scenario
- Partitioning considerably increases the total DTM buffer memory requirement for the job. Ensure you have enough free memory in order to avoid memory allocation failures
- When you do pass-through partitioning, Informatica will try to establish multiple connection requests to the database server. Ensure that database is configured to accept high number of connection requests
- As an alternative to partitioning, you may also use native database options to increase degree of parallelism of query processing. For example in Oracle database you can either specify PARALLEL hint or alter the DOP of the table in subject.
- If required you can even combine Informatica partitioning with native database level parallel options - e.g. you create 5 pass-through pipelines, each sending query to Oracle database with PARALLEL hint.