Top 5 Challenges of Data Warehousing
Data warehousing projects are one of its kinds. All data warehousing projects do not pose same challenges and not all of them are complex but they are always different. This article illustrates the top 5 challenges that often plague modern data warehousing developments. Knowing these challenges upfront is your best bet to avoid them.
Data warehousing is different. They are different because unlike many of the software projects, data warehousing projects are not developed keeping a front-end application in mind. For the most part of it, these projects are heavily dependent on the backend infrastructure in order to support the front-end client reporting. Moreover, number of different stake holders involved in data warehousing projects is usually more than any typical IT project.
But these are not the only reasons why doing data warehousing is difficult. In the below list we show the top 5 reasons which actually make things complex on the practical ground.
Ensuring Acceptable Data Quality
More often than not, a data warehouse consumes data from disparate sources. Most of these data sources are legacy systems maintained by the client. These systems are usually managed by different people pertaining to different business departments. Since the business lines supported by these systems are different, the users of one system are often oblivious to the features or capacities of the other system. Because of this, a lot of business processes and data are duplicated across systems and the semantics are different in them. For example, the definition and calculation of revenue in “direct sales” department may be different from that of “Retail Sales” department. The list of customers maintained in “sales” department may be different in quantity and metadata quality with the list of customers maintained in “marketing” department.
Disparate data sources add to data inconsistency
When a data warehouse comes in between and tries to integrate the data from such systems, it encounters issues such as inconsistent data, repetitions, omissions and semantic conflicts. All these issues lead to data quality challenges. Resolving these issues and conflicts become difficult due to limited knowledge of business users outside the scope of their own systems.
The next reason which causes data quality issues is the fact that many a times data in source systems are stored in non-structured format like as in, flat files and MS Excel. These types of data structures are inherently susceptible to issues such as redundancy and data duplication.
Unstabilized source systems
The issues of data quality do not always originate from legacy systems. In fact, data quality issues may become more disastrous in case if a source system is comparatively new and has not fully stabilized yet at the time of data warehouse development. In some rare cases, data warehouses are built simultaneously with the source systems. In those cases, instability and vulnerability of source systems often wreck the overall development of data warehouse and ruins the data quality of it. This is because any bug in the source systems potentially injects unwarranted defects in data warehouse. Given any possibility, any plan of building data warehouse simultaneously with source systems should always be avoided, in my opinion.
This is why creating data warehouse for an organization with good master data management, relational database source systems, and cross-trained and knowledgeable users is often easier.
Ensuring acceptable Performance
Prioritizing performance
Many designers and users often forget about performance when they first conceive the plan to implement a data warehouse for their business. As is often the case, such oversight cripples the usability of a data warehouse when it is finally built. Indeed, little can be done to improve the performance of a data warehouse in the post-go-live period. This is because performance objectives are easier to be designed than to be tuned. Data warehouses should be built for performance rather than tuned for performance.
Setting realistic goal
Achieving the performance objectives is not easy. In the first place, setting up performance objectives itself is a challenging task. An untrained user can easily drift towards setting up some performance goals that are unrealistic for a given data warehousing scenario. Hence for the users of the data warehouse, it is generally considered safe to set up the performance goals in terms of practical usability requirements. A crude example will be, if one business user requires a specific report to be available at 9 AM daily then that should be given as the performance requirement by the users instead of stating requirements such as, the report must not run for more than 15 minutes.
Once reasonable performance goals are setup, the next task is to finding ways to achieve those goals. People often tend to believe that performance of a system depends on the hardware infrastructure and hardware augmentation is a good way for boosting performance. This understanding is incorrect. While it is true that a better hardware will generally ensure a better performance, the performance of a system is in fact more fundamental than this. Performance is directly dependent on the complexity of the system which, in turn, depends on the design. To give a relevant example, think of join operation in database. A nested-loop join can have a worst case complexity of O ( n*n ) whereas a merge-join can do the same thing only in O (nlogn). It’s easy to see that for a practical value of n (n being number of rows); one of these joining algorithms may run thousand times faster than the other. If the design of your system facilitates the database to perform a merge join instead of a nested-loop join, then that would give a huge performance benefit to your system. That would be something which is quite unachievable only by augmenting hardware infrastructure. Hardware augmentation cannot achieve the same level of performance boost since it would not be possible to increase the hardware by thousand times.
Performance by design
Performance is a consequence of design. So performance goals can be best addressed at the time of designing. If that’s not done, meeting up performance criteria can be an overwhelming challenge.
Like anything in data warehousing, performance should be subjected to testing – commonly termed as SPT or system performance testing.
Testing the data warehouse
Testing in data warehousing is a real challenge. A typical 20% time allocation on testing is just not enough. One of the reasons why testing is tricky is due to the reason that a top level object in data warehouse (e.g. BI reports) typically has high amount of dependency. For example, one cross subject area report built over a dimensional data warehouse will be dependent on data from many conformed dimensions and multiple fact tables that themselves are dependent on data from staging layer (if any) and multiple disparate source systems.
Test planning
Because of such high dependencies, regression testing requires lot of planning. Making the data available for re-testing for a certain component may not be possible as fresh data loading often changes the surrogate keys of dimension tables thereby breaking the referential integrity of the data. Thus continuing fresh testing along regression testing becomes impossible.
One solution is to plan the testing activities in batches that are in-line with the batches of data loading. This needs to be planned keeping in mind the availability of the data from dependent source systems as every source system may not provide data in the same extraction frequencies and windows.
No automated testing
Till date, there is no full-proof generic solution available for automation testing in data warehouses. Not that it is impossible. But it is very difficult given the lack of standardization in how the metadata are defined and design approaches are followed in different data warehousing projects. There are a few commercial solutions that depend on metadata of the data warehouse but they require considerable customization efforts to make them workable. Unavailability of automated testing opportunity also implies that right kind of skill set will be necessary in the testing team to perform such tasks.
Read more about data warehouse testing here
Reconciliation of data
Reconciliation is a process of ensuring correctness and consistency of data in a data warehouse. Unlike testing, which is predominantly a part of software development life cycle, reconciliation is a continuous process that needs to be carried out even after the development cycle is over.
Reconciliation is complex
Reconciliation is challenging because of two reasons. The first one is – complexity of the development. Generally a few critical measures are chosen from the business for the purpose of reconciliation. Imagine the measure is – “net sales amount”. This measure is calculated independently and separately in the source system end and data warehouse end to check if they tally. In order to develop this, one must imitate the entire transformation logic that are there in the data warehouse and applicable on this measure. Obviously one can check the existing logic from the developed ETL layers, nonetheless developing this is technically involved.
The second reasons that makes reconciliation challenging is the fact that, reconciliation process must also comply with performance requirement – which is more stringent than usual. I will explain why that is so. The reconciliation is like a certificate on the correctness of loaded data. A successful reconciliation gives the necessary confidence to the users for trusting the data for their business. Thus, it is imperative that reconciliation process gets completed by the time the business users intend to use the data. Considering that reconciliation can only start after the completion of data loading and should get finished before users start using the data, leaves this with very little time for execution. But even within that short time, the process needs to calculate functionally the same measures that are calculated in full-blown ETL process of data warehouse.
Read more about reconciliation here
User Acceptance
Last but not the least is the challenges of making a newly built data warehouse acceptable to the users. No matter how good or great you think your data warehouse is, unless the users accept and use it wholeheartedly the project will be considered as failure. In fact, most of the data warehouse projects fail in this phase alone.
Reluctant users
A new data warehouse brings with it new set of process and practices for the users. In many cases, business users need to forsake their long standing practice and habits of using their legacy systems to adapt themselves with the new processes. Humans, by nature are not very comfortable to adapting to changes, especially if they do not see great value propositions for doing so. Their reluctance or lack of interest in using a new kind of reporting system can render the data warehouse practically useless. The challenge here is to make them accept the data warehouse organically and seamlessly.
Users training, simplification of processes and designs, taking confidence building measures such as reconciliation processes etc. can help users come into terms with this new system easily.