Implementing a Rapidly Changing Dimension
Handling rapidly changing dimensions is tricky because of the performance implications. This article describes an approach to handling rapidly changing dimensions in a data warehouse.
In the past we have learnt how to design various slowly changing dimensions. The problem with a Type 2 slowly changing dimension is that every change in a dimensional attribute adds a row to the table. If the attributes change often (that is to say, the dimension is rapidly changing), the table quickly becomes bulky and causes considerable performance issues. Hence the typical SCD Type 2 solution may not be a good fit for rapidly changing scenarios.
There are other methods to handle rapidly changing dimensions, and one of them is discussed in this article. Bear in mind that this is not the only method for rapidly changing scenarios, nor is it the best one for every kind of scenario. A data modeler is encouraged to be innovative and come up with other novel approaches.
Junk Dimension
The method that we are going to consider here assumes that not all the attributes of a dimension table change rapidly. A few attributes may change quite often while the others seldom change. If we separate the fast-changing attributes from the slowly changing ones and move them into another table, while keeping the slowly changing attributes in the original table, the dimension table no longer bulks up.
So let’s take an example to see how it works. Let’s say the CUSTOMER dimension has the following columns:
- CUSTOMER_KEY
- CUSTOMER_NAME
- CUSTOMER_GENDER
- CUSTOMER_MARITAL_STATUS
- CUSTOMER_TIER
- CUSTOMER_STATUS
While attributes like name, gender and marital status rarely or never change, let’s assume that customer tier and status change every month based on the customer’s buying pattern. If we keep status and tier in the same SCD Type 2 customer dimension table, we risk filling up the table far too quickly. Instead, we can pull those two attributes out into a separate table, which some people refer to as a JUNK DIMENSION. Here is what our junk dimension looks like. In this case, it has 3 columns, as shown below.
Junk Dimension Structure
- SEGMENTATION_KEY
- TIER
- STATUS
The column SEGMENTATION_KEY is a surrogate key and acts as the primary key of the table. Since we have removed status and tier from our main dimension table, the dimension table now looks like this:
- CUSTOMER_KEY
- CUSTOMER_NAME
- CUSTOMER_GENDER
- CUSTOMER_MARITAL_STATUS
Next, we must link the customer dimension above to our newly created JUNK dimension. Note that we cannot simply pull the primary key of the JUNK dimension (which we are calling SEGMENTATION_KEY) into the customer dimension as a foreign key. If we did, every change in a customer’s tier or status would require a new record in the customer dimension to refer to the changed key, which would again increase the data volume of the dimension table. We solve this problem by creating one more mini table between the original customer dimension and the junk dimension. This mini dimension table acts as a bridge between them. We also put “start date” and “end date” columns in this mini table so that we can track the history. Here is what our new mini table looks like:
Mini Dimension Structure
- CUSTOMER_KEY
- SEGMENTATION_KEY
- START_DATE
- END_DATE
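This table does not require a surrogate key. However, one may include a “CURRENT FLAG” column in the table if required. The whole model now consists of three tables: the customer dimension, the junk dimension, and the mini dimension that links the two through CUSTOMER_KEY and SEGMENTATION_KEY.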
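The three tables described so far can be sketched as an in-memory SQLite schema built from Python. This is only an illustration: the table names, the column types, the choice of (CUSTOMER_KEY, START_DATE) as the mini dimension's primary key, and the convention that END_DATE is NULL for the current row are assumptions, not part of the model above.

```python
import sqlite3

# Hypothetical table names and column types; column names follow the article.
DDL = """
CREATE TABLE customer_dim (
    customer_key            INTEGER PRIMARY KEY,
    customer_name           TEXT,
    customer_gender         TEXT,
    customer_marital_status TEXT
);
CREATE TABLE segmentation_junk_dim (
    segmentation_key INTEGER PRIMARY KEY,   -- surrogate key
    tier             TEXT NOT NULL,
    status           TEXT NOT NULL,
    UNIQUE (tier, status)                   -- one row per combination
);
CREATE TABLE customer_segmentation_mini_dim (
    customer_key     INTEGER NOT NULL
        REFERENCES customer_dim (customer_key),
    segmentation_key INTEGER NOT NULL
        REFERENCES segmentation_junk_dim (segmentation_key),
    start_date       TEXT NOT NULL,         -- ISO-8601 date string
    end_date         TEXT,                  -- assumed NULL while current
    PRIMARY KEY (customer_key, start_date)  -- no surrogate key needed
);
"""


def create_schema(conn):
    """Create the customer, junk and mini dimension tables."""
    conn.executescript(DDL)


conn = sqlite3.connect(":memory:")
create_schema(conn)

# List the tables that now exist in the database.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

Tier and status appear only in the junk dimension, so the customer dimension carries no column that changes monthly.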
Maintaining the Junk Dimension
If the number of attributes in the junk dimension and the number of possible distinct values per attribute (cardinality) are not very large, we can pre-populate the junk dimension once and for all. In our earlier example, let’s say the only possible values of status are “Active” and “Inactive” and the only possible values of tier are “Platinum”, “Gold” and “Silver”. That means there can be only 3 × 2 = 6 distinct combinations in this table. We can pre-populate the table with these 6 records, with segmentation keys 1 to 6, and assign one key to each customer based on the customer’s status and tier values.
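As a minimal sketch of this pre-population step, the snippet below generates every tier and status combination and loads it into an in-memory SQLite table. The table name, the helper `lookup_segmentation_key`, and the order in which keys 1 to 6 are assigned to the combinations are assumptions made for the example.

```python
import sqlite3
from itertools import product

TIERS = ["Platinum", "Gold", "Silver"]
STATUSES = ["Active", "Inactive"]

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE segmentation_junk_dim (
        segmentation_key INTEGER PRIMARY KEY,
        tier             TEXT NOT NULL,
        status           TEXT NOT NULL,
        UNIQUE (tier, status)
    )
""")

# Cartesian product of the attribute values: 3 tiers x 2 statuses = 6 rows.
combos = list(product(TIERS, STATUSES))

# Assign surrogate keys 1..6 in generation order (an arbitrary choice).
rows = [(key, tier, status)
        for key, (tier, status) in enumerate(combos, start=1)]
conn.executemany("INSERT INTO segmentation_junk_dim VALUES (?, ?, ?)", rows)


def lookup_segmentation_key(conn, tier, status):
    """Return the pre-populated key for a given tier/status combination."""
    row = conn.execute(
        "SELECT segmentation_key FROM segmentation_junk_dim "
        "WHERE tier = ? AND status = ?",
        (tier, status)).fetchone()
    return row[0]


print(len(rows))
print(lookup_segmentation_key(conn, "Gold", "Active"))
```

Because the table is loaded once, the load process only has to look up an existing key when a customer's tier or status changes; it never inserts into the junk dimension.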
How does this Junk dimension help?
Since the connection between the segmentation key and the customer key is maintained in the mini dimension table, frequent changes in tier and status do not change the number of records in the customer dimension table. Whenever a customer’s status or tier changes, the END_DATE of that customer’s current row in the mini dimension is set, and a new row is added (with START_DATE = date of the change) recording the current relation between the customer and the segmentation.
It’s also worth mentioning that in this schema we can manage the original customer dimension table using either the SCD Type 1 or the Type 2 method, but we must take extra care to update the mini dimension as well whenever a key changes in the original dimension table.
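The change handling described here can be sketched with SQLite as follows. The functions `change_segmentation` and `segmentation_as_of`, the sample customer and dates, and the convention that END_DATE is NULL for the current row and otherwise equals the date of the next change (so that a row is valid from START_DATE up to, but not including, END_DATE) are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE segmentation_junk_dim (
    segmentation_key INTEGER PRIMARY KEY, tier TEXT, status TEXT);
CREATE TABLE customer_segmentation_mini_dim (
    customer_key INTEGER, segmentation_key INTEGER,
    start_date TEXT, end_date TEXT);
-- Hypothetical sample data: one customer, currently Gold/Active.
INSERT INTO customer_dim VALUES (1, 'Alice');
INSERT INTO segmentation_junk_dim VALUES
    (1, 'Gold', 'Active'), (2, 'Platinum', 'Active');
INSERT INTO customer_segmentation_mini_dim VALUES (1, 1, '2024-01-01', NULL);
""")


def change_segmentation(conn, customer_key, tier, status, change_date):
    """Record a tier/status change without touching customer_dim."""
    new_key = conn.execute(
        "SELECT segmentation_key FROM segmentation_junk_dim "
        "WHERE tier = ? AND status = ?", (tier, status)).fetchone()[0]
    # Close the customer's current row in the mini dimension.
    conn.execute(
        "UPDATE customer_segmentation_mini_dim SET end_date = ? "
        "WHERE customer_key = ? AND end_date IS NULL",
        (change_date, customer_key))
    # Open a new current row pointing at the new segmentation key.
    conn.execute(
        "INSERT INTO customer_segmentation_mini_dim VALUES (?, ?, ?, NULL)",
        (customer_key, new_key, change_date))


def segmentation_as_of(conn, customer_key, as_of_date):
    """Return (tier, status) that was valid for the customer on a date."""
    return conn.execute(
        "SELECT j.tier, j.status "
        "FROM customer_segmentation_mini_dim m "
        "JOIN segmentation_junk_dim j "
        "  ON j.segmentation_key = m.segmentation_key "
        "WHERE m.customer_key = ? AND m.start_date <= ? "
        "  AND (m.end_date IS NULL OR ? < m.end_date)",
        (customer_key, as_of_date, as_of_date)).fetchone()


change_segmentation(conn, 1, "Platinum", "Active", "2024-02-01")

customer_rows = conn.execute("SELECT COUNT(*) FROM customer_dim").fetchone()[0]
mini_rows = conn.execute(
    "SELECT COUNT(*) FROM customer_segmentation_mini_dim").fetchone()[0]
print(customer_rows, mini_rows)
print(segmentation_as_of(conn, 1, "2024-01-15"))
print(segmentation_as_of(conn, 1, "2024-02-01"))
```

After the change, the customer dimension still holds a single row for the customer, while the mini dimension holds two rows that together carry the tier and status history.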