Why do we need Staging Area during ETL Load
We have a simple data warehouse that takes data from a few RDBMS source systems and load the data in dimension and fact tables of the warehouse. I wonder why we have a staging layer in between. Why can’t we process everything on the fly and push them in the data warehouse?
Last night, I received this question from one of the members of DWBIConcepts community over email and thought of discussing the pros and cons of having a staging layer in this article.
Really staging area is not a necessity if we can handle it on the fly. But can we? Here are a few reasons why you can’t avoid a staging area:
- Source systems are only available for extraction during a specific time slot which is generally lesser than your overall data loading time. It’s a good idea to extract and keep things at your end before you lose the connection to the source systems.
- You want to extract data based on some conditions which require you to join two or more different systems together. E.g. you want to only extract those customers who also exist in some other system. You will not be able to perform a SQL query joining two tables from two physically different databases.
- Various source systems have different allotted timing for data extraction.
- Data warehouse’s data loading frequency does not match with the refresh frequencies of the source systems.
- Extracted data from the same set of source systems are going to be used in multiple places (data warehouse loading, ODS loading, third-party applications etc.)
- ETL process involves complex data transformations that require extra space to temporarily stage the data
- There is specific data reconciliation / debugging requirement which warrants the use of staging area for pre, during or post load data validations
Clearly staging area gives lot flexibility during data loading. Shouldn't we have a separate staging area always then? Is there any impact of having a stage area? Yes there are a few.
- Staging area increases latency – that is the time required for a change in the source system to take effect in the data warehouse. In lot of real time / near real time applications, staging area is rather avoided.
- Data in the staging area occupies extra space.
To me, in all practical senses, the benefit of having a staging area outweighs its problems. Hence, in general I will suggest designating a specific staging area in data warehousing projects.