Incremental Loading for Dimension Table
In our previous article we have discussed the concept of incremental loading in general. In this article we will see how to perform incremental loading for dimension tables.
Should we do incremental loading for dimensions?
In a dimensional model, we may perform incremental loading for dimension tables also. One may argue that this wont be necessary as data volume in dimension tables are not as high as the data volumes in the fact tables, hence we can simply do a full load every time.
I personally do not agree to this argument. This is because during the last few years I have seen tremendous growth in the data in dimension tables and things can get quite heavy especially if we are trying to load SCD type 2 dimensions. Anyway, without much ado, let's delve deep.
Standard Method of Loading
Like before, for our purpose we will assume we have the below customer table in our source system from where we need to perform the data loading.
CustomerID CustomerName Type LastUpdatedDate
1 John Individual 22-Mar-2012
2 Ryan Individual 22-Mar-2012
3 Bakers' Corporate 23-Mar-2012
As discussed in the previous article, a typical SQL query to extract data incrementally from this source system will be like this:
SELECT t.*
FROM Customer t
WHERE t.lastUpdatedDate > (select nvl(
max(b.loaded_until),
to_date('01-01-1900', 'MM-DD-YYYY')
)
from batch b
where b.status = 'Success'
);
Here "batch" is a separate table which stores the date until which we have successfully extracted the data.
Batch_ID Loaded_Until Status
1 22-Mar-2012 Success
2 23-Mar-2012 Success
Which one to use: "Entry Date" / "Load Date" or "Last Update Date"?
In an incremental load methodology, we should extract the record when it is first created and after that whenever the record is updated. Therefore, we should always look for "last update date" column for extracting records. This is because, "entry date" or "load date" columns in the source systems are not enough to determine if the record is updated in the later point in time.
Often source systems maintain 2 different columns as load_date and last_update_date. When extracting data based on "last update date", ensure that source systems always populate "last updated date" field with "load date" when the record is first created.
What are the benefits of incremental loading of dimension tables?
Once we extract records incrementally based on their last update date, we can compare each record with the target based on their natural keys and determine if the record is a new record or updated record.
However, if we do not extract incrementally (and every time extract all the records from source), then the number of records to compare against target will be much higher resulting into performance degradation. If we are doing incremental loading, records that do not have any change will not come - only new or updatable records will come. But if we are doing full load, everything will come irrespective of any change.