Incremental Loading for Fact Tables
In the previous articles, we have discussed the general concepts of incremental data loading as well as how to perform incremental data loading for dimension tables. In this article we will discuss the methods and issues of loading data incrementally in Fact tables of a data warehouse.
METHOD OF LOADING
Generally speaking, incremental loading for Fact tables is relatively easier as, unlike dimension tables, here you do not need to perform any look-up on your target table to find out if the source record already exists in the target or not. All you need to do is to select incremental records from source (as shown below for the case of "sales" table) and load them as it is to target (you may need to perform lookup to dimensional tables to assign respective surrogate keys - but that's a different story).
Like before we will assume we have a "sales" table in the source
Sales Table
ID CustomerID ProductDescription Qty Revenue Sales Date
1 1 White sheet (A4) 100 4.00 22-Mar-2012
2 1 James Clip (Box) 1 2.50 22-Mar-2012
3 2 Whiteboard Marker 1 2.00 22-Mar-2012
4 3 Letter Envelop 200 75.00 23-Mar-2012
5 1 Paper Clip 12 4.00 23-Mar-2012
Given this table, a typical extraction query will be like this:
SELECT t.*
FROM Sales t
WHERE t.sales_date > (select nvl(
max(b.loaded_until),
to_date('01-01-1900', 'MM-DD-YYYY')
)
from batch b
where b.status = 'Success'
);
where "batch" is a separate table maintained at target system having minimal structure and data like below
Batch_ID Loaded_Until Status
1 22-Mar-2012 Success
2 23-Mar-2012 Success
However, things may get pretty complicated if your fact is a special type of fact called "snapshot fact". Let's understand them below.
Loading Incident Fact
Incident fact is the normal fact that we encounter mostly (and that we have seen above in our sales table example). Records in these types of facts are only loaded if there are transactions coming from the source. For example, if at all there is one sale that happens in the source system, then only a new sales record will come. They are dependent on some real "incident" to happen in the source hence the name incident fact.
Loading Snapshot Fact
As opposed to incident fact, snapshot facts are loaded even if there is no real business incident in the source. Let me show you what I mean by using the above example of customer and sales tables in OLTP. Let's say I want to build a fact that would show me total revenue of sales from each customer for each day. In effect, I want to see the below data in my fact table.
Sales fact table (This is what I want to see in my target fact table)
Date Customer Revenue
22-Mar-2012 John 6.50
22-Mar-2012 Ryan 2.00
23-Mar-2012 John 10.50
23-Mar-2012 Ryan 2.00
23-Mar-2012 Bakers' 75.00
As you see, even if no sales was made to Ryan on 23-Mar, we still show him here with the old data. Similarly for John, even if goods totaling to $4.00 was sold to him on 23-Mar, his record shows the cumulative total of $10.50.
Now obviously the next logical question is how to load this fact using incremental loading? Because incremental loading only brings in incremental data - that is on 23rd March, we will only have Bakers' and John's records and that too with that day's sales figures. We won't have Ryan record in the incremental set.
Why not a full load
You can obviously opt-in for full load mechanism as that would solve this problem but that would take the toll on your loading performance.
Then what's the solution?
One way to resolve this issue is: creating 2 incremental channels of loading for the fact. 1 channel will bring in incremental data from source and the other channel will bring in incremental data from the target fact itself. Let's see how does it happen below. We will take the example for loading 23-Mar data.
Channel 1: Incremental data from source
Customer Revenue
John 4.00
Bakers' 75.00
Channel 2: Incremental data from target fact table (last day's record)
Customer Revenue
John 6.50
Ryan 2.00
Next we can perform a FULL OUTER JOIN between the above two sets to come to below result
John 10.50
Ryan 2.00
Bakers' 75.00