List of available transforms
- Data Integrator- Data_Transfer, Date_Generation, Effective_Date, Hierarchy_Flattening, History_Preserving, Key_Generation, Map_CDC_Operation, Pivot (Columns to Rows), Reverse Pivot (Rows to Columns), Table_Comparison, XML_Pipeline
- Data Quality- Associate, Country ID, Data Cleanse, DSF2 Walk Sequencer, Geocoder, Global Address Cleanse, Global Suggestion Lists, Match, USA Regulatory Address Cleanse, User-Defined
- Platform- Case, Map_Operation, Merge, Query, Row_Generation, SQL, Validation
- Text Data Processing- Entity_Extraction
Our approach is to get a detailed knowledge on all the above transforms starting with the mostly commonly used ones. So we will start with the Query transform.
Query transform is used to retrieve a data set based on the input schema that satisfies conditions that we specify. A query transform is similar to a SQL SELECT statement. The Query transform is used to perform the following operations: -
- Maps column from input Schema to output Schema.
- Perform transformations and functions on the source data.
- Assign Primary Keys to output Schema columns.
- Add New Output columns, Nested Schemas, and Function Calls to the output Schema.
- Perform Data Nesting and Unnesting with Sub Schemas of the Output Schema. Also assign Make Current Schema.
- Generate Distinct result set output for the input Schema.
- Join data from Multiple Input Source Schemas. Equi Join as well as Outer Join is supported.
- Filter input Source Data.
- Performs Aggregation based on input column groups.
- Generate sorted dataset based on source input column order.
- Also we can generate DTD, XML Schema or File Format based on the Input or Output Schema.
Case transform is used to divide or route the input data set into multiple output data sets based on the defined logical expression. It is used to implement IF-THEN-ELSE logic at dataflow level. This transform accepts only one source input. We can define multiple labels and their corresponding CASE expression. For input rows that do not satisfy any of the CASE conditions, we may select to output those records using the DEFAULT case. For that we need to select the check box Produce default output when all expressions are false.
Two other featured properties of this transform are Row can be TRUE for one case only and Preserve expression order. If we select the option Row can be TRUE for one case only, then a row is passed to the first case whose expression returns TRUE. Otherwise, the row is passed to all the cases whose expression returns TRUE. Preserve expression order option is available only when the Row can be TRUE for one case only option is checked. We can select this option if expression order is important to us because there is no way to guarantee which expression will evaluate to TRUE first.
Merge transform is used to combine multiple input dataset with the same schemas into a single output dataset of the same schema. It is equivalent to SQL UNION ALL statement. In order to eliminate duplicate records from output dataset basically to attain UNION operation, add a Query transform with DISTINCT option enabled after the Merge transform.
Validation transform is used to filter or replace the source dataset based on criteria or validation rules to produce desired output dataset. It enables to create validation rules on the input dataset, and generate the output based on whether they have passed or failed the validation condition. This transform is typically used for NULL ckecking for mandatory fields, Pattern matching, existence of value in reference table, validate datatype, etc.
The Validation transform can generate three output dataset Pass, Fail, and RuleViolation. The Pass Output schema is identical with the Input schema. The Fail Output schema has two more columns, DI_ERRORACTION and DI_ERRORCOLUMNS. The RuleViolation has three columns DI_ROWID, DI_RULENAME and DI_COLUMNNAME.
Map_Operation transform allows conversions between data manipulation operations like INSERT, UPDATE, DELETE & REJECT. It enables to change the operation codes of input data sets to produce the desired output row type. There are 4 operation codes for any input row type – Normal, Update, Insert and Delete. In addition, the DISCARD option can be assigned to the output row type. Discarded rows are not passed through to the output of the transform.
If the output record is flagged NORMAL or INSERT, then it inserts a new record in the target table. If it is marked as UPDATE it basically overwrites an existing row in the target table. If the input record is flagged as Delete it does not load the records in the target table. But if the output row type is set to DELETE then it deletes the corresponding records present in target. If the row is marked as DISCARD then no records are passed to the output of the transform.
Row_Generation transform produces a dataset with a single column. The column values start with the number that we specify in the Row number starts at option. The value then increments by one to specified number of rows as set in the Row count option. This transform does not allow any input data set.
SQL transform is used to submit or perform standard SQL operations on database server. The SQL transform supports a single SELECT statement only. This transform does not allow any input data set. Use this transform when other built-transforms cannot perform the required SQL operation. Try to use this transform as your last option as it not optimised for performance and also reduces readability.
Key_Generation transform helps to generate artificial keys for new rows in a table. The transform looks up the maximum existing key value of the surrogate key column from the table; And uses it as the starting value to generate new keys for new rows in the input dataset. The transform expects a column with the same name as the Generated key column of the source table to be a part of the input schema.
The source table must be imported into the DS repository before defining the source table for this transform. The fully qualified Table name e.g. DATASTORE.OWNER.TABLE should be specified. Also we can set the Increment value i.e. the interval between the generated key values. By default it is 1. We can also use a variable placeholder for this option. We will be using this transform frequently while populating surrogate key values of slowly changing dimension tables.
Table_Comparison transform helps to compare two data sets and generates the difference between them as a resultant data set with rows flagged as INSERT, UPDATE, or DELETE. This transform can be used to ensure rows are not duplicated in a target table, or to compare the changed records of a data warehouse dimension table. It helps to detect and forward all changes or the latest ones that have occurred since the last time the comparison table was updated. We will be using this transform frequently while implementing slowing changing dimensions and while designing dataflows for recovery.
The source table must be already imported into the DS repository. The fully qualified Table name e.g. DATASTORE.OWNER.TABLE should be specified. Also set the input dataset columns that uniquely identify each row as Input primary key columns. These columns must be present in the comparison table with the same column names and datatypes. If the primary key value from the input data set does not match a value in the comparison table, DS generates an INSERT statement. Else it generates an UPDATE row with the values from the input dataset row after comparing all the columns in the input data set that are also present in the comparison table apart from the primary key columns. As per your requirement select only the required subset of non-key Compare columns which will give performance improvement.
If the Input primary key columns have duplicate keys, the transform arbitrarily chooses any of the rows to compare during dataflow processing i.e. order of the input rows are ignored. Selecting the Input contains duplicate keys check box provides a method of handling duplicate keys in the input data set.
If the comparison table contains rows with the same primary keys, the transform arbitrarily chooses any of the rows to compare. Specify the column of the comparison table with unique keys i.e. by design contains no duplicate keys as the Generated key column. A generated key column indicates which row of a set containing identical primary keys is to be used in the comparison. This provides a method of handling duplicate keys in the comparison table.
For an UPDATE, the output data set will contain the largest key value found for the given primary key. And for a DELETE, the output data set can include all duplicate key rows or just the row with the largest key value.
When we select the check box Detect deleted row(s) from comparison table the transform flags rows of the comparison table with the same key value as DELETE. When we select the options of the transforms - Generated key column, Detect deleted row(s) from comparison table and Row-by-row select or the Sorted input comparison method; Additional section appears to specify how to handle DELETE rows with duplicate keys. i.e. Detect all rows or Detect row with largest generated key value
Apart from all these properties there are three methods for accessing the comparison table namely Row-by-row select, Cached comparison table and Sorted input. Below is the brief on when to select which option.
- Row-by-row select option is best if the target table is large compared to the number of rows the transform will receive as input. In this case for every input row the transform fires a SQL to lookup the target table.
- Cached comparison table option is best when we are comparing the entire target table. DS uses pageable cache as the default. If the table fits in the available memory, we can change the Cache type property of the dataflow to In-Memory.
- Sorted input option is best when the input data is pre sorted based on the primary key columns. DS reads the comparison table in the order of the primary key columns using sequential read only once. NOTE: The order of the input data set must exactly match the order of all primary key columns in the Table_Comparison transform.
- The transform only considers rows flagged as NORMAL as Input dataset.
- Cautious when using real datatype columns in this transform as comparison results are unpredictable for this datatype.