What is Incremental Loading and why is it required
In almost all data warehousing scenario, we extract data from one or more source systems and keep storing them in the data warehouse for future analysis. The source systems are generally OLTP systems which store everyday transactional data. Now when it comes to loading these transactional data to data warehouse, we have 2 ways to accomplish this, Full Load or Incremental Load.
To understand these two loads better, consider a simple scenario. Let's say my source system in RDBMS - that is, a database - and I have 2 tables, customer and Sales.
In the customer table I have details of all my customers in this format:
CustomerID CustomerName Type Entry Date 1 John Individual 22-Mar-2012 2 Ryan Individual 22-Mar-2012 3 Bakers' Corporate 23-Mar-2012
In the sales table, I have the details of product sold to customers. This is how the sales table looks like:
ID CustomerID ProductDescription Qty Revenue Sales Date 1 1 White sheet (A4) 100 4.00 22-Mar-2012 2 1 James Clip (Box) 1 2.50 22-Mar-2012 3 2 Whiteboard Marker 1 2.00 22-Mar-2012 4 3 Letter Envelop 200 75.00 23-Mar-2012 5 1 Paper Clip 12 4.00 23-Mar-2012
As you can see, above tables store data for 2 consecutive days - 22 Mar and 23 Mar. On 22 Mar, I had only 2 customers (John and Ryan) who made 3 transactions in the sales table. Next day, I have got one more customer (Bakers') and I have recorded 2 transactions - one from Bakers' and 1 from my old customer John.
Also imagine, we have a data warehouse which is loaded everyday in the night with the data from this system.
FULL LOAD METHOD FOR LOADING DATA WAREHOUSE
In case we are to opt for full load method for loading, we will read the 2 source tables (Customer and Sales) everyday in full. So,
On 22 Mar 2012: We will read 2 records from Customer and 3 records from Sales and load all of them in the target.
On 23 Mar 2012: We will read 3 records from customer (including the 2 older records) and 5 records from sales (including 3 old records) and will load or update them in the target data warehouse.
As you can clearly guess, this method of loading unnecessarily read old records that we need not read as we have already processed them before. Hence we need to implement a smarter way of loading.
INCREMENTAL LOAD METHOD FOR LOADING DATA WAREHOUSE
In case of incremental loading, we will only read those records that are not already read and loaded into our target system (data warehouse). That is, on 22 March, we will read 2 records from customer and 3 records from sales - however - on 23 March, we will read 1 record from customer and 2 records from sales.
But how do we ensure that we "only" read those records that are not "already" read? How do we know which records are already read and which records are not?
This is a tricky question but the answer is, fortunately, easy!
We can make use of "entry date" field in the customer table and "sales date" field in the sales table to keep track of this. After each loading we will "store" the date until which the loading has been performed in some data warehouse table and next day we only extract those records that has a date greater than our stored date. Let's create a new table to store this date. We will call this table as "Batch"
Batch_ID Loaded_Until Status 1 22-Mar-2012 Success 2 23-Mar-2012 Success
Once we have done this, all we have to do to perform incremental or delta loading is to rite our data extraction SQL queries in this format:
Customer Table Extraction SQL
SELECT t.* FROM Customer t WHERE t.entry_date > (select nvl( max(b.loaded_until), to_date('01-01-1900', 'MM-DD-YYYY') ) from batch b where b.status = 'Success' );
Sales Table Extraction SQL
SELECT t.* FROM Sales t WHERE t.sales_date > ( select nvl( max(b.loaded_until), to_date('01-01-1900', 'MM-DD-YYYY') ) from batch b where b.status = 'Success' );
Okay, now at this point you may wonder and ask
How does the above query work?
On First day (22 Mar):
There wont be any record in our batch table since we have not loaded any batch yet. So "SELECT max(b.loaded_until)" will return NULL. That is why we have put one NVL() function to replace the NULL with a very old historical date - 01 Jan 1900 in this case.
So in the first day, we are asking the select query to extract all the data having entry date (or sales date) greater than 01-Jan-1900. This will essentially extract everything from the table.
Once 22 Mar loading is complete, we will make one entry in the batch table (entry 1) to mark the successful extraction of records.
Second Day (23 Mar):
Next day, the query "SELECT max(b.loaded_until)" will return me 22-Mar-2012. So in effect, above queries will reduce to this:
Customer Table Extraction SQL
SELECT t.* FROM Customer t WHERE t.entry_date > '22-Mar-2012';
Sales Table Extraction SQL
SELECT t.* FROM Sales t WHERE t.sales_date > '22-Mar-2012';
As you can understand, this will ensure that only 23-Mar records are extracted from the table thereby performing a successful incremental loading. After this loading is complete successfully, we will make one more entry in the batch table (entry number 2).
Why MAX() is used in the above query?
When we try to load 23 Mar data, there was only one entry in the batch table (that of 22nd). But when we go to load 24th data or any data after that, there will be multiple entries in the batch table. We must take the max of these entries.
Why status field is created in batch table?
This is because it might so happen that 23rd load has failed. So when we start loading again on 24th, we must take into consideration both 23rd data and 24th data.
Batch_ID Loaded_Until Status 1 22-Mar-2012 Success 2 23-Mar-2012 Fail 3 24-Mar-2012 Success
In the above case, 23rd batch load was a failure. That is why next day we have selected all the data after 22-Mar (including 23rd and 24th Mar).
Now that we have discussed the general concepts of Incremental loading, next please read Incremental Loading for Dimension Table and Incremental Loading for Fact Tables - where we will discuss specific approaches.