What is data retention and Purging?
There are certain requirement to purge / archive / delete the data in data warehouse after a certain period of time, often termed as retention period of the data warehouse. Once a the retention period is reached, data from the data warehouse are purged or deleted or archived into separate place usually comprising of low cost storage medium (e.g. tape drive).
Why data purging is required?
In a idealistic scenario, we assume data warehouse to store data for good. However there are some reasons why this might not be a good idea in a real scenario:
- There are cost overhead associated with the amount of data that we store. This includes the cost of storage medium, infrastructure and human resources necessary to manage the data
- There is direct impact of data volume to the performance of a data warehouse. More data means more time consuming sorting and searching operations
- End users of the data warehouse in the business side may not be interested in the very old fact and figures. Data might lose its importance and relevance with the changing business landscape. Retaining such impertinent data may not be required
Variable purging requirement
The concept of data purging may not be applicable for all types of data in the warehouse. Typically one may only want to purge the transactional data and may not want to purge any master data. Similarly the retention period of the data warehouse may not be same for all types of data. One may set a variable retention period such that all detail level transactional data will be purged after 2 years and all aggregated transaction data will be purged after 7 years.
Rolling window for data purging
Process of purging is typically implemented as a rolling window wherein whatever data falls under the purging window are only purged. So suppose if the retention period is set as 2 years and purging process is executed bi-monthly then after every 15 days, whenever the purging process runs it deletes all the records older than 2 years as of that day.
Purging Master / Reference data
Purging of master data is not recommended unless:
- There is specific reason why the master data will never be required in the future
- History is maintained for the change in the master data (e.g. in the form of Slowly Changing Dimension etc.)
If history of the changes in the master data are maintained in the data warehouse, it might be necessary at some point to purge the records corresponding to old histories of the master data. When this is done, one must also keep in mind to clear all the corresponding transactional records pertaining to the master data. To understand the situation, consider this: I have a customer C1 for whom I have two records in my master data table with surrogate keys as K1 and K2. K1 corresponds to the customer C1's attributes as of year 1996 and K2 corresponds to the customer C1's attribute as of year 2000.
|Surrogate Key||Customer Name||Customer Tier||Year|
Now suppose there are 10 transactions pertaining to this customer C1 in the year 1996. In this circumstance, if we decide to purge the master record K1, then all the 10 transaction records would become orphan (meaning the key integrity between the master and transaction detail records will break). Ideally one must also purge these 10 transaction records while purging the corresponding master record.
In certain Relational databases (e.g. Oracle), this issue can be automatically taken care if foreign key constraints are enabled in the transaction detail table with "ON DELETE CASCADE" option.
Purging Transactional data
Transactional data purging may be necessary when
- Corresponding master data records (or parent record) are also purged
- Data crosses the designated retention period
As discussed above, there can be variable period of retention for the transaction data. Aggregated transaction data may be retained for a longer period of time than the detailed transactions.
Challenges faced during data warehouse data purging
Time. Purging takes time. And the process of purging may also slowdown legitimate daily activities of the data warehouse (e.g. loading may get slower/report may take more time to refresh etc.) Therefore one must plan and allocate enough time when the purging activity can be done. Weekends, holidays etc. can be a good time for data purging as number of users accessing the data warehouse are typically less at those times.
On the technical front one must also consider post purging activities that are required to be performed in the data warehouse. Since purging frees up lot of space - additional administrative activities are required to be performed to make use of newly freed-up space. This includes, but not limited to activities such as:
- Tablespace re-organization
- Index rebuilding
- Defragmentation / consolidation etc.