By the typical definition of data warehouse, we expect the data warehouse to be non-volatile in nature for its entire design life time. As long as it remain operation, all data loaded in the data warehouse should remain there for the purpose of analysis. However, this is often not the case.

What is data retention and Purging?

There are certain requirement to purge / archive / delete the data in data warehouse after a certain period of time, often termed as retention period of the data warehouse. Once a the retention period is reached, data from the data warehouse are purged or deleted or archived into separate place usually comprising of low cost storage medium (e.g. tape drive).

Why data purging is required?

In a idealistic scenario, we assume data warehouse to store data for good. However there are some reasons why this might not be a good idea in a real scenario:

  • There are cost overhead associated with the amount of data that we store. This includes the cost of storage medium, infrastructure and human resources necessary to manage the data
  • There is direct impact of data volume to the performance of a data warehouse. More data means more time consuming sorting and searching operations
  • End users of the data warehouse in the business side may not be interested in the very old fact and figures. Data might lose its importance and relevance with the changing business landscape. Retaining such impertinent data may not be required

Variable purging requirement

The concept of data purging may not be applicable for all types of data in the warehouse. Typically one may only want to purge the transactional data and may not want to purge any master data. Similarly the retention period of the data warehouse may not be same for all types of data. One may set a variable retention period such that all detail level transactional data will be purged after 2 years and all aggregated transaction data will be purged after 7 years.

Rolling window for data purging

Process of purging is typically implemented as a rolling window wherein whatever data falls under the purging window are only purged. So suppose if the retention period is set as 2 years and purging process is executed bi-monthly then after every 15 days, whenever the purging process runs it deletes all the records older than 2 years as of that day.

Purging Master / Reference data

Purging of master data is not recommended unless:

  • There is specific reason why the master data will never be required in the future
  • History is maintained for the change in the master data (e.g. in the form of Slowly Changing Dimension etc.)

If history of the changes in the master data are maintained in the data warehouse, it might be necessary at some point to purge the records corresponding to old histories of the master data. When this is done, one must also keep in mind to clear all the corresponding transactional records pertaining to the master data. To understand the situation, consider this: I have a customer C1 for whom I have two records in my master data table with surrogate keys as K1 and K2. K1 corresponds to the customer C1's attributes as of year 1996 and K2 corresponds to the customer C1's attribute as of year 2000.

Surrogate KeyCustomer NameCustomer TierYear

Now suppose there are 10 transactions pertaining to this customer C1 in the year 1996. In this circumstance, if we decide to purge the master record K1, then all the 10 transaction records would become orphan (meaning the key integrity between the master and transaction detail records will break). Ideally one must also purge these 10 transaction records while purging the corresponding master record.

In certain Relational databases (e.g. Oracle), this issue can be automatically taken care if foreign key constraints are enabled in the transaction detail table with "ON DELETE CASCADE" option.

Purging Transactional data

Transactional data purging may be necessary when

  • Corresponding master data records (or parent record) are also purged
  • Data crosses the designated retention period

As discussed above, there can be variable period of retention for the transaction data. Aggregated transaction data may be retained for a longer period of time than the detailed transactions.

Challenges faced during data warehouse data purging

Time. Purging takes time. And the process of purging may also slowdown legitimate daily activities of the data warehouse (e.g. loading may get slower/report may take more time to refresh etc.) Therefore one must plan and allocate enough time when the purging activity can be done. Weekends, holidays etc. can be a good time for data purging as number of users accessing the data warehouse are typically less at those times.

On the technical front one must also consider post purging activities that are required to be performed in the data warehouse. Since purging frees up lot of space - additional administrative activities are required to be performed to make use of newly freed-up space. This includes, but not limited to activities such as:

  • Tablespace re-organization
  • Index rebuilding
  • Defragmentation / consolidation etc.

Have a question on this subject?

Ask questions to our expert community members and clear your doubts. Asking question or engaging in technical discussion is both easy and rewarding.

Are you on Twitter?

Start following us. This way we will always keep you updated with what's happening in Data Analytics community. We won't spam you. Promise.

  • What is Data Warehousing?

    A data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis.

  • Why people Hate Project Managers – A must read for would-be managers

    "Project Managers" are inevitable. Love them or hate them, but if you are in a project, you have to accept them. They are Omnipresent in any project. They intervene too much on technical things without much knowledge. They create unrealistic...

  • Top 5 Challenges of Data Warehousing

    Data warehousing projects are one of its kinds. All data warehousing projects do not pose same challenges and not all of them are complex but they are always different. This article illustrates the top 5 challenges that often plague modern data...

  • A road-map on Testing in Data Warehouse

    Testing in data warehouse projects are till date a less explored area. However, if not done properly, this can be a major reason for data warehousing project failures - especially in user acceptance phase. Given here a mind-map that will help a...

  • OLTP and OLAP

    OLTP stands for On-line Transaction Processing and OLAP stands for On-line Analytical Processing. We may have heard these definition many times. But do we really understand the difference between them. Let's explore further on these two kinds of...

  • Data Retention and Purging in a Data Warehouse

    By the typical definition of data warehouse, we expect the data warehouse to be non-volatile in nature for its entire design life time. As long as it remain operation, all data loaded in the data warehouse should remain there for the purpose of...

  • Top 10 things you must know before designing a data warehouse

    This paper outlines some of the most important (and equally neglected) things that one must consider before and during the design phase of a data warehouse. In our experience, we have seen data warehouse designers often miss out on these items...

  • Enterprise Data Warehouse Data Reconciliation Methodology

    An enterprise data warehouse often fetches records from several disparate systems and store them centrally in an enterprise-wide warehouse. But what is the guarantee that the...

  • What is a data warehouse - A 101 guide to modern data warehousing

    This article discusses data warehousing from a holistic standpoint and quickly touches upon all the relevant concepts that one needs to know. Start here if you do not know where to start from.

  • Decision Support System (DSS)

    Decision Support System (DSS) is a class of information systems (including but not limited to computerized systems) that support business and organizational decision-making activities. A properly designed DSS is an interactive software-based system...