Performance of a data warehouse is as important as the correctness of its data, because unacceptable performance can render the warehouse useless. There is a growing awareness that it is far more effective to build performance in from the beginning than to tune it at the end. In this article we present a few points you may consider for optimally building the data model of a data warehouse. We will restrict ourselves to performance considerations for dimensional modeling.
Good design, cleanly crafted code and optimal logic/algorithms will give you far better performance than anything you can achieve by augmenting your hardware.
You can spend literally millions of dollars on hardware, but hardware alone will never scale a problem down as efficiently as good design can. So stop wasting money on bigger and better hardware; instead, invest more in a good data architect / modeler.
Surrogate Key is not mandatory
A surrogate key is used in place of a natural key because the natural key may not be trusted (it may fail to satisfy the fundamental properties of a key, e.g. uniqueness, non-nullability) and may not be standardized in size and data type with the other keys of your data warehouse. But this does not mean it is mandatory to replace all your natural keys with standardized surrogate keys. What if your natural key already has the properties of your data-warehouse-level candidate keys? Do you still want to replace it with a surrogate key?
Consider the fact that the introduction of surrogate keys brings an additional burden on data loading. Surrogate keys in dimension tables necessitate an additional lookup operation when loading fact data, and lookup is a very costly affair, if not the costliest.
A lot of the time, the source of your dimension data is an RDBMS source system. Since the system is an RDBMS, you are already assured of the quality of its keys, and perhaps the source key is already numeric, just like the other keys of your data warehouse. Don't let this natural key die in your ETL process in favor of a standardized surrogate key. Bring it as-is into your data warehouse and, while reporting, base your joins on this key.
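The extra lookup step that surrogate keys force on every fact load can be sketched in a few lines. This is only an illustration; the table shapes, column names and the in-memory lookup structure are all hypothetical.

```python
# Sketch of the extra lookup step that surrogate keys impose on fact loading.
# All names and structures here are hypothetical illustrations.

# The dimension load has already mapped natural keys to surrogate keys:
customer_dim = {
    "CUST-001": 1,   # natural key -> surrogate key
    "CUST-002": 2,
}

def load_fact_row(source_row):
    """Each incoming fact row pays one lookup per surrogate-keyed dimension."""
    surrogate = customer_dim[source_row["customer_natural_key"]]  # the extra lookup
    return {"customer_key": surrogate, "revenue": source_row["revenue"]}

row = load_fact_row({"customer_natural_key": "CUST-002", "revenue": 150.0})
print(row)
```

Multiply that lookup by hundreds of millions of fact rows and several dimensions, and the cost becomes clear; keeping a trustworthy natural key avoids the step entirely.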
KISS your data model
Not literally. KISS is a very important design principle which stands for "Keep It Simple and Sweet" (although some people favor another four-letter word for the second "S"). Ask yourself whether you really require the plethora of attributes you have kept in your dimension table, or whether you really need that additional table you are thinking of introducing into your model. Don't do it because you can; do it only when you have to. Such a minimalistic approach makes a lot of difference in the end and will get you through the tough times.
Keep it simple – nothing succeeds like simplicity.
If you don't believe me, let me give you an example. Suppose you have decided to keep one column in your fact table just because it is available in the source and you think it may be required in the future, although you are fully aware that there is no explicit reporting requirement on this column as of now. We will call this column "A" and assume its data type is NUMBER(10).
The mere introduction of this column, which currently serves no purpose other than giving you a false sense of "wholesomeness", will require:
- Additional Programming effort - mapping effort from source to staging and staging to data warehouse
- Additional testing effort - 1 additional test case to be written and tested by the testers to check the correctness of the data
- Additional space requirement – If your fact table is going to contain 100 million records in the next 2–3 years (not a big number by any standard), it is going to take the following extra space:
extra space = space for storing one number × number of records in the table
            = ( ROUND( p / 2 ) + 1 + s ) bytes × 100,000,000
              ( where p = precision and s = 1 if the number is negative, 0 otherwise )
            = ( ROUND( 10 / 2 ) + 1 + 0 ) bytes × 100,000,000
            = 600,000,000 bytes, i.e. over half a gigabyte
And since most data warehouses keep a mirrored backup, the actual space requirement will be double this (nearly 1.2 GB).
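The arithmetic above can be checked in a few lines (the per-value formula is the Oracle-style NUMBER storage estimate used in the calculation):

```python
# Verifying the back-of-the-envelope space calculation for one NUMBER(10) column.
p, s = 10, 0                             # precision 10, non-negative values
bytes_per_value = round(p / 2) + 1 + s   # Oracle-style NUMBER storage estimate
rows = 100_000_000
extra_bytes = bytes_per_value * rows
print(extra_bytes)                       # 600000000 bytes, over half a gigabyte
mirrored_bytes = 2 * extra_bytes         # roughly 1.2 GB with a mirrored backup
print(mirrored_bytes)
```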
Now think about this for a moment. If such an innocent-looking NUMBER(10) column can add so much extra space and processing burden, consider what may happen if you keep introducing attributes with data types like VARCHAR(50) or bigger, DATE, etc. If you can resist the urge of such frivolity and take a considerate, minimalist approach, you can easily save a terabyte of space and numerous coding / testing efforts throughout the building process.
Reduce Complexity
This is similar to the above point where we stressed the simplicity of data models, but it is not exactly the same. While simplicity makes your model considerably lightweight, reducing complexity makes it easier to maintain. There are several ways of reducing complexity in a data model.
Reducing pre-calculated values
We often make the mistake of creating extra columns in our tables to hold the result of a complex calculation. Let me give you a simple example. You need a measure called "Yearly Bonus" which is determined using the rule below:
For grade A employees, bonus = 0.83 × yearly salary
For grade B and C employees, bonus = 0.69 × yearly salary
Obviously you could create one column in your data model to store the bonus of each employee along with their salary. However, that would be the wrong thing to do unless you plan to use that column elsewhere during data loading. It is much wiser to calculate such columns on the fly at reporting time. This reduces the space requirement and improves loading time.
The rule of thumb is, don’t use your data model to solve a programming problem.
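The on-the-fly approach can be sketched with an in-memory SQLite database; the table and column names are illustrative, and the bonus rule lives in a CASE expression inside the reporting query rather than in a stored column:

```python
import sqlite3

# Sketch: compute "Yearly Bonus" at report time instead of storing it.
# Table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, grade TEXT, yearly_salary REAL)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("Alice", "A", 100000.0), ("Bob", "B", 80000.0), ("Carol", "C", 60000.0)],
)

# The bonus rule lives in the reporting query, not in an extra stored column.
rows = conn.execute("""
    SELECT name,
           CASE WHEN grade = 'A' THEN 0.83 * yearly_salary
                ELSE 0.69 * yearly_salary
           END AS yearly_bonus
    FROM employee
""").fetchall()
print(rows)
```

No extra column is loaded, tested or backed up; the calculation costs a few CPU cycles only when a report actually asks for it.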
Removing Unnecessary Constraints
It is not a good idea to put database-level constraints such as check constraints, not-null constraints etc. on tables meant for batch data loading. They make things slower (albeit slightly) and create maintenance problems. Instead, try to ensure data quality in the ETL process itself.
I know the whole idea of maintaining primary key / foreign key constraints in a data warehouse is itself widely debated, and there are different schools of thought on this. But as far as I am concerned, I would probably not enforce integrity through database-level keys when I can easily enforce it in my ETL (batch processing) logic.
Avoid Snowflaking
Why do you snowflake when de-normalization is the goal of dimensional modeling? Clarity, brevity, reducing data redundancy – all these arguments look very lame when you weigh them against the cost of maintaining the foreign-key relation (an additional lookup operation) during data loading.
Only snowflake if you intend to provide an aggregated (fact) table with the same granularity as your snowflaked dimension.
Choose the attributes of SCD Type 2 dimensions judiciously
In a typical implementation, SCD Type 2 tables preserve history by adding a new row to the table whenever a change happens. Because of this, SCD Type 2 tables tend to grow larger day by day, degrading performance. This matters because the growth of dimension tables hurts performance even more than the growth of fact tables, as dimension tables are typically conformed and used across different schemas. In order to control this growth, it is very important to include in the SCD implementation only those columns whose history we really want to preserve.
This means that if you have one hundred columns in the source system for a specific table, don't start tracking SCD changes for all of them. Rather, carefully consider the list of columns whose history you really need to preserve – track only those as Type 2 in your SCD implementation, and treat the other columns as Type 1 (so that even if they change, no new row is added to the table).
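The selective-tracking logic can be sketched as follows. The column lists, row shapes and flag names are hypothetical; the point is simply that only changes to the Type 2 columns create a new version row, while Type 1 changes overwrite in place and the table does not grow:

```python
# Sketch of selective SCD tracking: only TYPE2_COLS changes add history rows.
# Column lists and row shapes are hypothetical.
TYPE2_COLS = {"address", "customer_segment"}   # history preserved
TYPE1_COLS = {"phone", "email"}                # overwritten in place

def apply_change(current_row, incoming):
    """Return (resulting_rows, table_grew) for one dimension member."""
    changed = {c for c in incoming if incoming[c] != current_row.get(c)}
    if changed & TYPE2_COLS:
        # Type 2 change: close the current row and add a new version.
        closed = dict(current_row, is_current=0)
        new = dict(current_row, **incoming)
        new["is_current"] = 1
        return [closed, new], True
    # Type 1 only: overwrite in place, no growth of the table.
    return [dict(current_row, **incoming)], False

rows, grew = apply_change(
    {"cust_id": 7, "address": "Old St", "phone": "123", "is_current": 1},
    {"phone": "999"},                # only a Type 1 attribute changed
)
print(grew)   # False - the table does not grow
```

The fewer columns you put in `TYPE2_COLS`, the slower the dimension grows.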
Don't Create a Snapshot Fact and an SCD Type 2 table for the same purpose
This is admittedly a little complex to explain. A snapshot fact always shows the latest (or last known) state of the measures. The latest records of an SCD Type 2 dimension do the same. The only difference is that a fact shows the state of the measures whereas an SCD Type 2 table shows the state of the attributes. But there are cases where this difference becomes very blurry.
Where do you store attributes such as the "number of telephone lines" or the "number of bank accounts" of a customer? Is the number of telephone lines an attribute of the customer itself? Not quite. So we would not store it in the dimension table; we need to store it in a fact table. But here comes the problem – all such records in the fact table get different surrogate keys for the same customer, since the key comes from an SCD Type 2 table. Because of this key variance, it is impossible to join two such facts in a single query if they were not loaded at the same time. To preserve key invariance, the dimension table needs to be self-joined twice in the middle, and this causes a lot of performance issues.
Consider Indexing Carefully
Indexes are used to speed up data retrieval from a relational database. As part of physical data modeling (PDM design), a data modeler should make a considered effort to carefully evaluate and suggest an initial indexing scheme for the data warehouse. It is not possible to fix a rigid indexing scheme in such an early phase of the project, but it is important to at least come up with a basic scheme while doing the data model; the same scheme can later be fine-tuned by adding or dropping a few indexes.
Databases come with different indexing methodologies which vary quite a lot in nature. In Oracle, for example, b-tree indexes are the most common. It is a good idea to put b-tree indexes on the columns used for joining (e.g. the key columns of the table). There is a special index type – the bitmap index – which is useful for low-cardinality columns (i.e. columns with a lot of duplicate values) used in joins. Then of course there are other special indexes such as function-based indexes. You may read my other article if you want to know more about indexing.
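The effect of indexing a key column can be seen even in a small SQLite sketch (SQLite's b-tree indexes stand in for Oracle's here, and the table names are illustrative): the query plan switches from a full table scan to an index search.

```python
import sqlite3

# Sketch: an index on a join/key column lets the optimizer avoid a full scan.
# Table and index names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (customer_key INTEGER, revenue REAL)")
conn.execute("CREATE INDEX ix_sales_cust ON sales_fact (customer_key)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                 [(i % 100, float(i)) for i in range(1000)])

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT sum(revenue) FROM sales_fact WHERE customer_key = 42"
).fetchall()
print(plan)   # the plan mentions ix_sales_cust instead of a full table scan
```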
Speaking of performance, a bitmap index can make data retrieval from a fact table significantly faster if a special technique called "star transformation" is utilized. In this method, you create one bitmap index on each foreign key column of the fact table (and set the Oracle initialization parameter STAR_TRANSFORMATION_ENABLED to true). When a data warehouse satisfies these conditions, the majority of the star queries running in it will use a query execution strategy known as the star transformation, which provides very efficient performance for star queries.
One caveat: bitmap indexes may cause your data loading to fail due to locking if you try to load data in parallel.
In other databases, e.g. SQL Server, you have clustered and non-clustered indexing schemes. A clustered index is much faster for retrieval, since it reorganizes the physical layout of the data in line with the index keys, but it may cause a lot of fragmentation and may require frequent rebuilding. Non-clustered indexes can be used where the table already has a clustered index, or for other general-purpose uses.
The problem with all indexes is the same: they make data loading (INSERT operations) slower, besides taking a huge amount of space. So don't fill up your data warehouse with lots of them. Use them prudently and monitor them often; if you see that a particular index is not being used much, you may remove it.
Use Partitioning
If you have not realized it already, let me tell you – you're going to love it. Partitioning is a great thing, and you aren't going to get anywhere without its help. So as a data modeler, it's your job to come up with an appropriate partitioning scheme for your tables while you do the physical data modeling.
The concept of partitioning is not restricted to the database. In fact, you can implement partitioning completely independently of the database. Consider "vertical partitioning". Let's say you have one table called "Product" with 5 attributes, and two more tables called "Retail_Product" and "Healthcare_Product", each carrying 50 additional attributes that apply only to the retail or healthcare segment of products. The three tables share the same set of keys, but their other columns differ, and you use only the table you need instead of a single table with all the attributes. This concept of vertical partitioning can be extended to horizontal partitioning as well.
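The vertical-partitioning layout can be sketched with SQLite (column names such as `shelf_life_days` and `fda_approved` are invented for illustration). A healthcare report joins only the tables it needs, so the 50 retail-only attributes never widen the rows it reads:

```python
import sqlite3

# Sketch of vertical partitioning: segment-specific attributes live in
# separate tables that share the product key. Names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product            (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE retail_product     (product_key INTEGER PRIMARY KEY, shelf_life_days INTEGER);
CREATE TABLE healthcare_product (product_key INTEGER PRIMARY KEY, fda_approved INTEGER);
""")
conn.execute("INSERT INTO product VALUES (1, 'Aspirin')")
conn.execute("INSERT INTO healthcare_product VALUES (1, 1)")

# The healthcare report never touches retail_product at all.
row = conn.execute("""
    SELECT p.name, h.fda_approved
    FROM product p JOIN healthcare_product h USING (product_key)
""").fetchone()
print(row)   # ('Aspirin', 1)
```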
When it comes to partitioning, the most popular choice is obviously database-level horizontal partitioning. In this type of partitioning, the database creates separate partitions based on a list or range of values in one column. A preferred choice of column is the "TIME" or "DAY" key of your fact table. You can easily range-partition on this column to segregate the data of each year or month into separate partitions. Since most business queries on this table will ask about a specific date, month or year, the database will access only the specific partition containing the data for that query instead of the entire table (this is called partition pruning). So, for example, a SQL query like the one below will access only one partition (that of January 2012) instead of reading the whole table, which may contain 5 years of data:
SELECT sum(revenue) FROM Sales WHERE Month_Key = 201201 -- results for Jan 2012, using intelligent key
The partitioning we discussed above is called range partitioning. There is one more fundamental type of partitioning, called "list" partitioning, wherein the data pertaining to a given list of values is kept in one partition. This is a good fit when the partitioned column can only take a fixed list of values. Let's say you are building a data model where your sales fact table contains data from 4 divisions – north, east, west and south. You can then use list partitioning on your division column:
SELECT sum(revenue) FROM Sales WHERE Division = 'north';
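The pruning behavior that both queries above rely on can be sketched in plain Python (the per-division structure is purely illustrative): the filter value selects one partition, and only that partition's rows are ever scanned.

```python
# Sketch of partition pruning: the engine reads only the partition whose
# key matches the filter, never the whole table. Structures are illustrative.
partitions = {                       # one partition per division (list partitioning)
    "north": [120.0, 80.0],
    "east":  [50.0],
    "west":  [70.0, 30.0],
    "south": [200.0],
}

def sum_revenue(division):
    # Pruning: only the matching partition is scanned.
    return sum(partitions[division])

print(sum_revenue("north"))   # 200.0 - touched 2 rows out of 6
```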
You can use one more feature of partitioning to speed up your data loading: "partition exchange". Partition exchange, although not supported by all databases, lets you exchange one partition of a table with that of another. If the first table is your staging table and the second is your data warehouse table, this method lets you load data into the warehouse table very quickly just by exchanging partitions. Partition exchange is quick because it does not physically move the data.
Partitions not only help improve performance; they also make maintenance a much easier affair. For example, data purging from fact tables can be done very quickly by dropping partitions rather than using a conventional delete operation.
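Why partition exchange is nearly instantaneous can be sketched with two dictionaries standing in for the staging and warehouse tables (a real database, e.g. Oracle's `ALTER TABLE ... EXCHANGE PARTITION`, swaps segment pointers in its metadata the same way):

```python
# Sketch of partition exchange: a metadata (reference) swap, no data movement.
# The dicts and keys here are only an illustration.
staging = {"2024-01": [("sale", 10.0), ("sale", 20.0)]}
warehouse = {"2024-01": [], "2023-12": [("sale", 5.0)]}

def exchange_partition(key):
    # Swap the references; the row lists themselves are never copied.
    staging[key], warehouse[key] = warehouse[key], staging[key]

exchange_partition("2024-01")
print(len(warehouse["2024-01"]))   # 2 - the staged rows now belong to the warehouse
```

However many rows the partition holds, the swap is a constant-time operation, which is exactly why this loading technique scales so well.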
Avoid Fact to Fact Join
Make a conscious effort in your design to avoid future fact-to-fact joins. This is tricky. A lot of the time we end up creating data models where one fact table contains information pertaining to one specific subject area. But what if we want to do cross-subject-area reporting? Let's say you have a sales fact and a marketing fact, and you want to evaluate the effect of a certain marketing campaign on your sales. How do you do it? One solution is to write a SQL query that joins the two fact tables. This works, but it takes its toll on performance. A better solution is to implement a special fact table containing only the related measures from more than one subject area. So consider those situations from the very beginning and make the necessary arrangements in your data model.
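The consolidated-fact idea can be sketched with SQLite (all table and column names are invented for illustration): the cross-subject join is paid once, during ETL, and reports then read a single pre-joined fact table.

```python
import sqlite3

# Sketch: pre-joining related measures into one consolidated fact at load
# time, so reports never pay for a fact-to-fact join. Names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact     (customer_key INTEGER, revenue REAL);
CREATE TABLE marketing_fact (customer_key INTEGER, campaign_cost REAL);
CREATE TABLE sales_marketing_fact
    (customer_key INTEGER, revenue REAL, campaign_cost REAL);
INSERT INTO sales_fact VALUES (1, 500.0);
INSERT INTO marketing_fact VALUES (1, 50.0);
""")

# The expensive join runs once during ETL, not in every report query:
conn.execute("""
    INSERT INTO sales_marketing_fact
    SELECT s.customer_key, s.revenue, m.campaign_cost
    FROM sales_fact s JOIN marketing_fact m USING (customer_key)
""")
row = conn.execute("SELECT * FROM sales_marketing_fact").fetchone()
print(row)   # (1, 500.0, 50.0)
```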
Give some attention to your server infrastructure
Server infrastructure is mis-configured more often than you might think. And even when it is not, it's a good idea to review it as you go along. There are a lot of parameters that need to be tweaked specifically for a data warehousing environment, so if your DBA has no prior data warehouse experience, s/he may need some guidance. Look carefully through options such as asynchronous filesystem I/O, the shared pool size of the database, cache sizes (of the database and the I/O subsystems), the database block size, database-level multiblock read/write options, etc.