What is Data Integration (DI)?
In this tutorial we will learn what is meant by the term "Data Integration" (DI), how data integration is done, and why the need for data integration often requires us to build a data warehouse.
What is Data Integration (DI)?
Data Integration is the process of combining heterogeneous data sources into a single queryable schema so as to obtain a unified view of the data.
Why is Data Integration required?
Large companies and enterprises often maintain separate departmental databases to store the data pertaining to each specific department. Although such separation of data provides better manageability and security, it makes cross-departmental analysis of these datasets difficult, if not impossible.
For example, if the marketing department and the sales department maintain two secluded databases, it might not be possible to analyze the effect of a marketing department's advertising campaign on the sales of a product. Similarly, if the HR department and the production department maintain individual databases, it might not be possible to analyze the correlation between yearly incentives and employee productivity.
Data integration provides a mechanism to combine the data from different departments into a single queryable schema.
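As a rough illustration of what a single queryable schema makes possible, consider the sketch below. It assumes the marketing campaign data and the sales data have already been integrated into one database; all table names, column names and figures are hypothetical.

```python
import sqlite3

# Illustrative only: assumes marketing and sales data have already been
# integrated into one queryable schema (here, a single in-memory SQLite DB).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE campaigns (campaign_id INTEGER, product TEXT, start_month TEXT);
CREATE TABLE sales     (product TEXT, month TEXT, units_sold INTEGER);

INSERT INTO campaigns VALUES (1, 'WidgetX', '2023-03');
INSERT INTO sales VALUES ('WidgetX', '2023-02', 120),
                         ('WidgetX', '2023-03', 310),
                         ('WidgetX', '2023-04', 290);
""")

# Cross-functional query: monthly sales for products that had a campaign.
# This join is only possible because both datasets live in one schema.
rows = conn.execute("""
    SELECT s.product, s.month, s.units_sold
    FROM sales s
    JOIN campaigns c ON c.product = s.product
    ORDER BY s.month
""").fetchall()

for product, month, units in rows:
    print(product, month, units)
```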
Below is a list of examples where data integration is required. The list, however, is not comprehensive:
- Cross functional analysis - as discussed in the example above
- Finding correlation - statistical intelligence / scientific applications
- Sharing information - legal or regulatory requirements, e.g. sharing customers' credit information among banks
- Maintaining a single point of truth - higher management overseeing several departments may need to see a single picture of the business
- Merger of businesses - after a merger, two companies want to aggregate their individual data assets
How can data integration be done? - Different approaches to data integration
There are two major approaches to data integration, commonly known as the "tight coupling" approach and the "loose coupling" approach.
Tight Coupling: Data Warehousing
In the tight coupling approach, which is often implemented through data warehousing, data is pulled from disparate sources into a single physical location through the process of ETL (Extraction, Transformation and Loading). The single physical location provides a uniform interface for querying the data, and the ETL layer maps the data from the sources so as to provide a semantically uniform data warehouse.
This approach is called tight coupling since the data is tightly coupled with the physical repository at the time of query.
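The following is a minimal sketch of such an ETL flow, assuming two hypothetical source databases and an in-memory SQLite warehouse; the table layouts and the transformation rule are invented purely for illustration.

```python
import sqlite3

def build_sources():
    """Create two hypothetical, independent departmental databases (illustrative only)."""
    marketing = sqlite3.connect(":memory:")
    marketing.execute("CREATE TABLE campaign (id INTEGER, prod_name TEXT, spend REAL)")
    marketing.execute("INSERT INTO campaign VALUES (1, 'WidgetX', 5000.0)")

    sales = sqlite3.connect(":memory:")
    sales.execute("CREATE TABLE orders (order_id INTEGER, product TEXT, amount REAL)")
    sales.execute("INSERT INTO orders VALUES (101, 'WidgetX', 79.0)")
    return marketing, sales

def etl(marketing, sales, warehouse):
    """Extract from both sources, transform to a common shape, load into the warehouse."""
    # Extract: pull raw rows from each source system
    campaigns = marketing.execute("SELECT prod_name, spend FROM campaign").fetchall()
    orders = sales.execute("SELECT product, amount FROM orders").fetchall()

    # Transform: map differing source schemas onto one semantically uniform fact layout
    facts = [("campaign_spend", product, value) for product, value in campaigns] + \
            [("order_amount", product, value) for product, value in orders]

    # Load: data is physically copied into the warehouse ("tight coupling")
    warehouse.execute("CREATE TABLE IF NOT EXISTS fact (measure TEXT, product TEXT, value REAL)")
    warehouse.executemany("INSERT INTO fact VALUES (?, ?, ?)", facts)

warehouse = sqlite3.connect(":memory:")
etl(*build_sources(), warehouse)
print(warehouse.execute("SELECT * FROM fact").fetchall())
```

Once loaded, all queries run against the warehouse alone, which is why this approach trades data freshness for faster, source-independent query processing.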
Loose Coupling: Virtual Mediated Schema
In contrast to the tight coupling approach, a virtual mediated schema provides an interface that takes a query from the user, transforms it into a form the source databases can understand, and then sends the query directly to the source databases to obtain the result. In this approach, the data does not actually reside in the mediated schema; it remains only in the source databases. The mediated schema, however, contains several "adapters" or "wrappers" that connect back to the source systems in order to bring the data to the front end. This approach is often implemented through a middleware (EAI, Enterprise Application Integration) architecture.
To understand "Virtual Mediated Schema" approach, consider this example. Let's say one database stores weather data of a country for past 50 years. Another database contains crop production data of the country for each year. A user might want to query - "In which year crop production is lowest in-spite of more than average rainfall?". The question can be subdivided into 3 different questions as follows:
- What is the average rainfall across all the years?
- In which years was the recorded rainfall higher than the average?
- Out of the years obtained from the above query, which year has the least crop production?
The first 2 queries can be sent straight to the weather database and the last query can be sent to the crop production database to get the answers. In a mediated schema, the weather and crop production data are not available in the schema itself; instead, the schema has the necessary "adapters" to send the queries back to the appropriate databases and combine the result sets.
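Below is a minimal sketch of this loose coupling approach, assuming in-memory adapters in place of real weather and crop production databases; the adapter interface and the data are invented purely to show how the mediator decomposes the question and combines the results.

```python
# Loose coupling sketch: the mediator holds no data itself; it only knows how
# to dispatch sub-queries to source "adapters" and combine the result sets.

class WeatherAdapter:
    """Wrapper around the (hypothetical) weather database."""
    def __init__(self, rainfall_by_year):
        self._rainfall = rainfall_by_year              # {year: rainfall in mm}

    def average_rainfall(self):
        return sum(self._rainfall.values()) / len(self._rainfall)

    def years_above(self, threshold):
        return [y for y, r in self._rainfall.items() if r > threshold]

class CropAdapter:
    """Wrapper around the (hypothetical) crop production database."""
    def __init__(self, production_by_year):
        self._production = production_by_year          # {year: production in tonnes}

    def min_production_year(self, years):
        return min(years, key=lambda y: self._production[y])

class Mediator:
    """Virtual mediated schema: decomposes the user's question into sub-queries."""
    def __init__(self, weather, crop):
        self.weather, self.crop = weather, crop

    def lowest_crop_despite_good_rain(self):
        avg = self.weather.average_rainfall()            # sub-query 1 -> weather DB
        wet_years = self.weather.years_above(avg)        # sub-query 2 -> weather DB
        return self.crop.min_production_year(wet_years)  # sub-query 3 -> crop DB

weather = WeatherAdapter({2019: 900, 2020: 1200, 2021: 1150, 2022: 700})
crop = CropAdapter({2019: 50, 2020: 65, 2021: 40, 2022: 55})
print(Mediator(weather, crop).lowest_crop_despite_good_rain())   # -> 2021
```

Note that the mediator stores nothing itself; if a new source system is added, only a new adapter needs to be written, which is the agility advantage listed in the comparison below.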
Comparison between Data Warehouse and Mediated Schema approaches
Following are the advantages and disadvantages of tight coupling versus loose coupling.
Advantages
| Tight Coupling (data warehouse) | Loose Coupling (mediated schema) |
|---|---|
| Independence (lesser dependency on source systems, since data is physically copied over) | Data freshness (low latency - almost real-time) |
| Faster query processing | Higher agility (when a new source system comes in or an existing source system changes, only the corresponding adapter is created or changed, largely without affecting the other parts of the system) |
| Complex query processing | Lower cost (a lot of infrastructure cost can be saved, since data localization is not required) |
| Advanced data summarization and storage possible | |
| High volume data processing | |
Disadvantages
| Tight Coupling (data warehouse) | Loose Coupling (mediated schema) |
|---|---|
| Latency (since data needs to be loaded using ETL) | Semantic conflicts (the meaning of a measure such as "net profit" can differ across systems, so semantic awareness is often necessary in the mediated schema) |
| Costlier (data localization, infrastructure, security) | Slower query response (due to network/bandwidth issues, non-localization of data, workload on the source systems, etc.) |
| | High dependency on the data sources |
Below are a few suggested readings, from both internal and external sources, which we believe will help you get a better grasp of the subject:
1. Modern definition of Data warehousing
2. Semantic Integration Problem - Wikipedia Article
3. Enterprise Information Integration - Wikipedia article
4. Data Integration Blueprint and Modeling: Techniques for a Scalable and Sustainable Architecture