DWBI.org wants to publish your tech tutorial!

Help us build a comprehensive collection of beautifully-written tutorials about Data Analytics, Big Data and Business Intelligence. You don’t have to be an experienced writer. If you have technical knowledge and a knack for explaining things, our editors will help with writing and publication. Submit a writing sample to become a our community author, get published on our rich knowledge base with over a million unique page views each year, and make up to USD $100. Email us at service@dwbi.org to know more.

What we look for

  • Technical expertise and best practices
  • Correct and comprehensive commands
  • Clear explanation
  • Friendly, concise, and informative style

What do you get

  • Based on the size, importance and quality of your article, we will pay you USD $30 ~ USD $50.
  • Chance to publish your article in DWBI.org
  • Recognition as a technical author

Contact us at service@dwbi.org

Normalization is a method of organizing data in a database to reduce data redundancy and eliminate any undesirable anomalies such as Update and deletion anomaly. There could be various degree or form of normalization and in this article we will learn them step by step with easy examples.

What is database normalization

Formally stated,

Normalization is the systematic method to design a database schema such that duplicate and redundant data is avoided.

There are number of normalization level starting from level 1 to level 5. Each normal form describes how to get rid of a specific problem in database design.We will explore these levels by taking example of an imaginary dataset for a retail shop. Consider a retail shop wants to keep track of products and customers. So they decide to design a database. In their first attempt they design something like this -

DateProduct NameCustomer NameCustomer City
01 JanOrange, AppleRamLucedale
01 JanEdible OilHarrySpring Valley
02 JanGrape Juice, ButterEmilyWellton
02 JanFrench LoafRamLucedale

But this scheme of storing data has a few problems. For example, there are multiple values in single cell, there are data duplication or redundancy etc. So let us see how normalization helps here.

1st Normal Form

First Normal form requires that value of each attribute contains only a single value from the domain of that attribute. In short, this means you can only put one value for each cell of your table.

In the above table, we have put both 'Orange' and 'Apple' and 'Grape Juice' and 'Butter' in the same cells. This offends the First normal form. We can normalize this problem away by putting all the products into separate rows. See below,

DateProduct NameCustomer NameCustomer City
01 JanOrangeRamLucedale
01 JanAppleRamLucedale
01 JanOrangeHarrySpring Valley
02 JanGrape JuiceEmilyWellton
02 JanButterEmilyWellton
02 JanFrench LoafRamLucedale

2nd Normal Form

As you see in the above example, there are lot of data redundancy. The customer "Ram" appears 3 times in the table and "Emily" appears twice. Problem is - everytime this customer is appearing, its address is also appearing. This redundancy may create many problems. For example, if "Ram" change his address today, then you will need to update his address in all the rows where it appears (in this case 3 rows). 2nd Normal Form help you get rid of this issue. Let us see what 2nd Normal Form really mean.

A table is in 2nd Normal Form if and only if it is in First Normal Form and every non-prime attribute of the table is dependent on the whole of a candidate key

Let us now understand this statement. Look at the above table and try to determine the key of this table. Key is something that uniquely identifies each record of the table. Clearly column "Date" is not the key as "Date" is not unique. Similary neither column "Product Name" nor "Customer Name" is a key. However, the combination of "Date", "Product Name" and "Customer Name" together form the key of this table as that combination is unique and hence it uniquely identifies each record of the table. These columns or attributes are called "Prime Attribute" and they form the candidate key. The other column(s) like "Customer City" is the "non-prime attribute".

Having armed ourselves with the definition of key, non-prime attributes etc., let us now look back to the definition of 2nd Normal Form. It says that every non-prime attribute must be dependent on the whole of a key. In our case this means, the non-prime attribute "Customer City" must be dependent on the whole combination of the key. But "Customer City" is only related to the "Customer Name" but has no connection with "Product Name" or "Date" etc. So the non-prime attribute is not dependent on the whole of the candidate key.

In order to bring this table to 2nd Normal form, we will need to break this table like this

DateProduct NameCustomer Name
01 JanOrangeRam
01 JanAppleRam
01 JanOrangeHarry
02 JanGrape JuiceEmily
02 JanButterEmily
02 JanFrench LoafRam

And then we have a separate customer table like this

Customer NameCustomer City
HarrySpring Valley

3rd Normal Form

Third Normal form requires that every non-prime attribute of the table must be non-transitively dependent on the every key of the table.

To simplify the above statement consider this situation. Suppose we decide to store customers' zip code along with customer's city in the second table like below:

Customer NameCustomer CityZip
RamLucedale230 110
HarrySpring Valley219 266
EmilyWellton269 041

If you look at this table, the table is in 2NF form. "Customer City" and "Zip" both are non-prime attributes and both of them are dependent on the whole of the key which is "Customer Name" in this case. But this table is not in third normal form yet. Because third normal form requires every non-prime attribute of the table must be dependent non-transitively, i.e., Directly on the key. City is not directly dependent on Customer Name. If you knew Zip code, you could also derive the name of the city. So city is actually dependent on the key through the Zip code (transitive dependency).

230 110Lucedale
219 266Spring Valley
269 041Wellton

Once we separate out the city and zip code detail in a new table like above, we can just refer the Zip code in our customer table. In this case, all our tables are in third normal form

Have a question on this subject?

Ask questions to our expert community members and clear your doubts. Asking question or engaging in technical discussion is both easy and rewarding.

Are you on Twitter?

Start following us. This way we will always keep you updated with what's happening in Data Analytics community. We won't spam you. Promise.