What is a data warehouse - A 101 guide to modern data warehousing
This article discusses data warehousing from a holistic standpoint and quickly touches upon all the relevant concepts that one needs to know. Start here if you do not know where to start from.
Since this site is dedicated to data warehousing professionals, it was plain for us to assume that visitors of this site are already quite familiar with the basic concepts of data warehousing and business intelligence. Creating basic tutorials on these subjects were perhaps too obvious for us to do and that explains why there are not many articles on the basic concepts and definitions of data warehousing in this site. But something began to strike us.
We began to realize that there are still a lot of craving among our readers to understand the basic. And wikipedia definition of data warehousing is not enough for them - partly because of the inadequate elaboration and partly because of the heterogeneous background and experiences of our users who are still struggling to grasp the fundamentals. So we decided to produce a set of comprehensive and basic ground up articles for our readers to enable them venture deep into the subject of data warehousing. So let's begin with what is data warehousing, and what is business intelligence and why should we care.
What is data warehousing?
Data warehousing is the science of storing data for the purpose of meaningful future analysis.
Yes, it is a science (not much art involved!) and it deals with the mechanism of electronically storing and retrieving data so that some analysis can be performed on that data to corroborate / support a business decision or to predict a business outcome.
What is Business Intelligence?
Business Intelligence, on the other hand, is simply the art and science of presenting historical data in a meaningful way (often by using different data visualization techniques).
Raw data stored in databases turns into valuable information through the implementation of Business Intelligence processes.
Why data warehouse?
DW technologies provide historical, current and predictive views of business operations by analyzing the present and historical business data. Data analysis is often done using visualization techniques that turn complex data into images that tells compelling story. Raw data by this process of analysis help management take right decisions.
As an example of how modern visualization techniques are helping to unlock the complex and hidden information stored deep inside the data, visit http://www.gapminder.org
GapMinder is a small tool conceptualized by Prof. Hans Rosling, that analyzes complex socio-economical data to reveal world's most important trends.
To further demonstrate the need of data warehousing, consider this.
Let's imagine a company called "FairShop" that has 1000 retail outlets across USA. The company has built one data warehouse to store the data collected from all the shop outlets so that they can analyze the data to gather business intelligence.
The company collects raw sales data from all of their outlet shops (through a process called ETL) and then load them into a place called Data warehouse or data mart (at this point don't bother too much about the exact meaning and differences of data mart and data warehouse - we will get to it in detail later).
Once the data is there in data warehouse (or data mart) business intelligence techniques are applied to that data for analysis and reporting. Since the company now has the sales and purchase information from all their shops in a centralized place, it can easily use this data to answer some rudimentary questions about their business e.g. what shop makes highest sales, which product is most popular across the shop, what is the stock balance etc.
It is very common to talk about both data warehousing and business intelligence together since business intelligence in this context refers to analyzing the data in data warehouse. As Wikipedia puts it:
"Common functions of business intelligence technologies are reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining and predictive analytics."
Building the definition of data warehouse
There are couple of points to notice from the typical retail scenario shown above. These points form the base of our discussion on the definition.
- Objective of building a data warehouse is to store data that are required for analysis and reporting
- For a data warehouse to function, it should be supplemented with a process that can collect and load data into it (e.g. ETL)
- In a data warehouse, data actually flows from the source to target - so the contents of the data warehouse would be different between 2 given points in time
Now if I tell you that, the definition of data warehouse can be constructed from the above 3 points - that shouldn't surprise you. But what will surprise you is - a lot of these points are not really considered in the classic definition of data warehouse.
So let's discuss the classic definitions of data warehouse first.
Classic Definition of Data Warehouse - A peek in the history
The history of data warehouse dates back to 1960. Without going into detail, here we will quickly touch upon a few noteworthy events of the history of data warehouse. In 1970, ACNielsen, a global marketing research company, published sales data pertaining to retail industry in the form of dimensional data mart.
Earlier than this, the concept of data warehouse for analysis was only a subject of academic pursuit.
Along the same time, the concept of decision support system were gradually developing and people started to realize that data stored in operational systems (e.g. data stored in the individual stores of a retail chain) are not easy to analyze in the time of decision making. So in 1983, Teradata introduced a database management system specifically designed for decision support.
In this decade and next, several people experimented with several designs of data warehouse and some of them were quite successful. In the year 1992, one of them, named Bill Inmon published a book - Building the Data Warehouse - which among other things, gave us a widely accepted definition of what a data warehouse is. We will soon jump into that definition. But before that let me mention one more author - Ralph Kimball - who 4 years later in 1996 wrote another book - Data warehouse toolkit - showing us yet another approach of defining and building a data warehouse. Since then, both Inmon and Kimball approaches are widely accepted and implemented throughout the globe.
So How did Bill Inmon defined a data warehouse? Here it is:
A data warehouse is a subject oriented, non-volatile, integrated, time variant collection of data in support of management's decisions.
Now let's understand this definition.
Explanation on the classic definition of data warehouse
Subject Oriented: This means a data warehouse has a defined scope and it only stores data under that scope. So for example, if the sales team of your company is creating a data warehouse - the data warehouse by definition is required to contain data related to sales (and not the data related to production management for example)
Non-volatile: This means that data once stored in the data warehouse are not removed or deleted from it and always stay there no matter what.
Integrated: This means that the data stored in a data warehouse make sense. Fact and figures are related to each other and they are integrable and projects a single point of truth.
Time variant: This means that data is not constant, as new and new data gets loaded in the warehouse, data warehouse also grows in size.
Identifying a data warehouse based on its definition
4 simple terms by Inmon defined data warehouse succinctly. Let's now check how the definition help us identify a data warehouse from other types of data stores.
Is a book written on how to create 17 different peas pudding, a data warehouse? It is subject oriented (deals with peas pudding), Non-volatile (deals with fixed 17 methods that are there to stay), integrated (makes sense). But it's not time variant. Answer is it's not a data warehouse.
So is the folder on my desktop named "Account Statements" a data warehouse? Subject oriented (deals with financial accounting), non-volatile (until I manually delete it), Time variant (every month new account statements pour in) but it's not integrated (one file in the folder containing the account statement from bank XYZ for the month of May has no way to link to the other file in the folder containing the account statement of the bank ABC for the month of June). So - not a data warehouse.
So is the sales information collected in one store of a big retail chain a data warehouse? It's subject oriented, time variant and integrated (after all there is a relational database behind). But is it non-volatile? Mostly not. And even if it is - there is a fifth factor. Is it being used for the purpose of management decision making? Surely not. After all who will take an enterprise wide management decision based on the data collected from a single store?
A broader definition of data warehouse
The classic definition that we discussed above does not focus much on the purpose of the data warehouse. The purpose is something which distinguishes a data warehouse from a data mart if you will and help us understand the need of the data warehouse. The purpose of a data warehouse, as we discussed before, is to render a timely data-driven insight that was otherwise inconceivable directly from the raw data. A data warehouse which stores data, is time variant and subject oriented and integrated yet does not solve this purpose - is no better than just a data dump.
An alternative (and more concurrent) definition of data warehouse will be:
A data warehouse is an electronically stored collection of integrated data that can be used for the purpose of intelligent analysis.
Dropping the time variance from the above definition broadens the coverage of the definition and omission of non-volatility condition makes the definition more realistic rather than idealistic. There are many data that are not time variant (historical and scientific data) but can be stored in a data warehouse for analysis. Similarly modern data warehouses are purged regularly when the data lose its purpose. Adding a sense of purpose in the definition enables us to create a more reliable and goal-oriented data warehouse.
Schematic View of a data warehouse
The diagram above shows a typical schematic structure of a data warehouse. As one can see here, most data warehouses collect data from multiple sources to form one integrated warehouse. Before loading to the warehouse, these data often need special treatment which is done in the ETL layer (ETL - Extraction, Transformation, Loading). ETL layer is mostly responsible for 2 types of treatments on the data:
- Data Integration - So that some links can be established between data coming from separate systems, and
- Qualitative Treatment - so that the validity and quality of the data can be checked (and if required corrected) before loading to the data warehouse.