Understanding Database Normalization with easy Examples
Normalization is a method of organizing data in a database to reduce data redundancy and eliminate any undesirable anomalies such as Update and deletion anomaly. There could be various degree or form of normalization and in this article we will learn them step by step with easy examples.
What is database normalization
Formally stated,
Normalization is the systematic method to design a database schema such that duplicate and redundant data is avoided.
There are number of normalization level starting from level 1 to level 5. Each normal form describes how to get rid of a specific problem in database design.We will explore these levels by taking example of an imaginary dataset for a retail shop. Consider a retail shop wants to keep track of products and customers. So they decide to design a database. In their first attempt they design something like this -
Date | Product | Customer | City |
01 Jan | Orange, Apple | Ram | Lucedale |
01 Jan | Edible Oil | Harry | Spring Valley |
02 Jan | Grape Juice, Butter | Emily | Wellton |
02 Jan | French Loaf | Ram | Lucedale |
But this scheme of storing data has a few problems. For example, there are multiple values in single cell, there are data duplication or redundancy etc. So let us see how normalization helps here.
1st Normal Form
First Normal form requires that value of each attribute contains only a single value from the domain of that attribute. In short, this means you can only put one value for each cell of your table.
In the above table, we have put both 'Orange' and 'Apple' and 'Grape Juice' and 'Butter' in the same cells. This offends the First normal form. We can normalize this problem away by putting all the products into separate rows. See below,
Date | Product | Customer | City |
01 Jan | Orange | Ram | Lucedale |
01 Jan | Apple | Ram | Lucedale |
01 Jan | Edible Oil | Harry | Spring Valley |
02 Jan | Grape Juice | Emily | Wellton |
02 Jan | Butter | Emily | Wellton |
02 Jan | French Loaf | Ram | Lucedale |
2nd Normal Form
As you see in the above example, there are lot of data redundancy. The customer "Ram" appears 3 times in the table and "Emily" appears twice. Problem is - everytime this customer is appearing, its address is also appearing. This redundancy may create many problems. For example, if "Ram" change his address today, then you will need to update his address in all the rows where it appears (in this case 3 rows). 2nd Normal Form help you get rid of this issue. Let us see what 2nd Normal Form really mean.
A table is in 2nd Normal Form if and only if it is in First Normal Form and every non-prime attribute of the table is dependent on the whole of a candidate key
Let us now understand this statement. Look at the above table and try to determine the key of this table. Key is something that uniquely identifies each record of the table. Clearly column "Date" is not the key as "Date" is not unique. Similary neither column "Product Name" nor "Customer Name" is a key. However, the combination of "Date", "Product Name" and "Customer Name" together form the key of this table as that combination is unique and hence it uniquely identifies each record of the table. These columns or attributes are called "Prime Attribute" and they form the candidate key. The other column(s) like "Customer City" is the "non-prime attribute".
Having armed ourselves with the definition of key, non-prime attributes etc., let us now look back to the definition of 2nd Normal Form. It says that every non-prime attribute must be dependent on the whole of a key. In our case this means, the non-prime attribute "Customer City" must be dependent on the whole combination of the key. But "Customer City" is only related to the "Customer Name" but has no connection with "Product Name" or "Date" etc. So the non-prime attribute is not dependent on the whole of the candidate key.
In order to bring this table to 2nd Normal form, we will need to break this table like this
Date | Product | Customer |
01 Jan | Orange | Ram |
01 Jan | Apple | Ram |
01 Jan | Edible Oil | Harry |
02 Jan | Grape Juice | Emily |
02 Jan | Butter | Emily |
02 Jan | French Loaf | Ram |
And then we have a separate customer table like this
Customer | City |
Ram | Lucedale |
Harry | Spring Valley |
Emily | Wellton |
3rd Normal Form
Third Normal form requires that every non-prime attribute of the table must be non-transitively dependent on the every key of the table.
To simplify the above statement consider this situation. Suppose we decide to store customers' zip code along with customer's city in the second table like below:
Customer | City | Zip |
Ram | Lucedale | 230 110 |
Harry | Spring Valley | 219 266 |
Emily | Wellton | 269 041 |
If you look at this table, the table is in 2NF form. "Customer City" and "Zip" both are non-prime attributes and both of them are dependent on the whole of the key which is "Customer Name" in this case. But this table is not in third normal form yet. Because third normal form requires every non-prime attribute of the table must be dependent non-transitively, i.e., Directly on the key. City is not directly dependent on Customer Name. If you knew Zip code, you could also derive the name of the city. So city is actually dependent on the key through the Zip code (transitive dependency).
Zip | City |
230 110 | Lucedale |
219 266 | Spring Valley |
269 041 | Wellton |
Once we separate out the city and zip code detail in a new table like above, we can just refer the Zip code in our customer table. In this case, all our tables are in third normal form.