- You really can not do the aggregation in the source qualifier SQL query (e.g. Flat File source)
- Fields used for aggregation are derived inside the mapping
Tuning Aggregator Transformation
If you have to do the aggregation using Informatica aggregator, then ensure that all the columns used in the group by are sorted in the same order of group by and “Sorted Input” option is checked in the aggregator properties. Ensuring the input data is sorted is absolutely must in order to achieve better performance and we will soon know why.
Other things that need to be checked to increase aggregator performance are –
- Check if “Case-Sensitive String Comparison” option is really required. Keeping this option checked (default) slows down the aggregator performance
- Enough memory (RAM) is available to do the in memory aggregation. See below section for details.
- Aggregator cache is partitioned
How to (and when to) set aggregator Data and Index cache size
As I mentioned before also, my advice is to leave the “Aggregator Data Cache Size” and “Aggregator Index Cache Size” options as Auto (default) in the transformation level and if required, set either of the followings in the session level (under “Config Object” tab) to allow Informatica allocate enough memory automatically for the transformation:
- Maximum Memory Allowed For Auto Memory Attributes
- Maximum Percentage of Total Memory Allowed For Auto Memory Attributes
However if you do have to set Data Cache/ Index Cache size yourself, please note that the value you set here is actually RAM memory requirement (and not disk space requirement) and hence, your mapping will fail if Informatica can not allocate the entire memory in RAM at the session initiation. And yes, this can happen often because you never know what other jobs are running in the server and what amount of RAM other jobs are really occupying while you run this job.
Having understood the risk, let’s now see the benefit of manually configuring the Index and Data Cache sizes. If you leave the index and data cache sizes to auto then if Informatica does not get enough memory during session run time, your job will not fail, instead Informatica will page-out the data to hard disk level. Since I/O performance of hard disk drive is 1000~ times slower than RAM, paging out to hard disk drive will have performance penalty. So by setting data and index cache size manually you can ensure that Informatica block this memory in the beginning of session run so that the cache is not paged-out to disk and the entire aggregation actually take place in RAM. Do this at your own risk.
Manually configuring index and data cache sizes can be beneficial if ensuring consistent session performance is your highest priority compared to session stability and operational steadiness. Basically you risk your operations (since it creates high chance of session failure) to obtain optimized performance.
The best way to determine the data and index cache size(s) is to check the session log of already executed session. Session log clearly shows these sizes in bytes. But this size depends on the row count. So keep some buffer (around 20% in most cases) on top of these sizes and use those values for the configuration.
Other way to determine Index and Data Cache sizes are, of course, to use the inbuilt Cache-size calculator accessible in session level.
Using the Informatica Aggregator cache size calculator is a bit difficult (and lot inaccurate). The reason is to calculate cache size properly you will need to know the number of groups that the aggregator is going to process. The definition of number of groups is as below:
No. of Groups = Multiplication of cardinality values of each group by column
This means, suppose you group by store and product, and there are total 150 distinct stores and 10 distinct products, then no. of groups will be 150 X 10 = 1500.
This is inaccurate because, in most cases you can not ascertain how many distinct stores and product data will come on each load. You might have 150 stores and 10 products, but there is no guarantee that all the product will come on all the load. Hence the cache size you determine in this method is quite approximate.
You can, however, calculate the cache size in both the two methods discussed here and take the max of the values to be in safer side.