In my previous article, “Fools Guide to Big Data”, we discussed the origin of Big Data and the need for big data analytics. We also noted that Big Data is data that is too large, complex and dynamic for conventional data tools (such as an RDBMS) to compute, store, manage and analyze within a practical timeframe. In the next few articles, we will familiarize ourselves with the tools and techniques for processing Big Data.

Today we will begin our discussion with “Map Reduce”. Map Reduce is a programming model, just like Object-Oriented Programming (OOP) or Functional Programming (FP). Like any other programming model, Map Reduce has specific use cases where it applies and many other cases where it does not. Map Reduce is targeted at cases where we need to process large datasets with a parallel, distributed algorithm on a cluster.

Let’s say you need to process a huge amount of data. You could process the individual records of the dataset serially through one program, but suppose you want to make the processing faster. One thing you could do is break the large chunk of data into smaller chunks and process them in parallel.

### Divide and Conquer Rule

You can write the mapper program in any language, such as Java, C++ or Python.

Let’s take a real example. Suppose you need to find the prime numbers in a set of 100 billion numbers. You have written a C (or Java, or Python) program that can detect whether a given number is prime. You could run this program on all 100 billion numbers one by one through some kind of loop, but you know it would take a very long time to complete. So, to make the processing faster, you decide to use 100 computers in parallel. You put a copy of your program on each of these 100 computers and give each of them 1 billion numbers to crunch. This way, you can accomplish the entire task in roughly one-hundredth of the time. This is an example of parallel computing using a cluster of 100 computers. Each computer generates its own list of prime numbers, and at the end you collate one final list comprising all the numbers from the 100 different lists.
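As a toy sketch of this divide-and-conquer idea in Python, scaled down for illustration (50 numbers, 5 chunks, and local worker processes standing in for the 100 machines):

```python
from concurrent.futures import ProcessPoolExecutor

def is_prime(n):
    """Return True if n is a prime number (simple trial division)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def find_primes(chunk):
    """The worker program: filter one chunk for primes, independently of other chunks."""
    return [n for n in chunk if is_prime(n)]

if __name__ == "__main__":
    numbers = list(range(1, 51))
    # Break the big list into smaller chunks of 10 numbers each
    chunks = [numbers[i:i + 10] for i in range(0, len(numbers), 10)]
    # Process the chunks in parallel, one worker per chunk
    with ProcessPoolExecutor(max_workers=5) as pool:
        partial_lists = pool.map(find_primes, chunks)
    # Collate the partial lists into one final answer
    primes = [p for part in partial_lists for p in part]
    print(primes)
```

Because `find_primes` never needs to look outside its own chunk, the chunks can just as well be sent to 100 different machines; the collation step at the end is the only place all the results meet.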

Now, if you understand the above model of breaking a big task into smaller chunks and finally collating the answers, then you understand what Map Reduce essentially does. “Map Reduce” is just the name for a programming paradigm in which a big task is broken into smaller tasks, called Maps, which run in parallel with each other; in the next step, called Reduce, the outputs of the Maps are put together to form the final answer.

As you can see, this is very easy to understand. But you may be wondering why Map Reduce has become such an important paradigm all of a sudden. After all, the basic principle behind it is simple, if not obvious. Map Reduce has gained so much traction because of the fundamental nature of Big Data. Big Data is, well, big. If you process it serially, it takes ages, so breaking the big data chunk into smaller chunks makes processing faster. But then you need to ensure that your way of processing actually supports dividing the record set into smaller chunks and processing them in parallel. Not all types of processing support that (say you need to calculate a running cumulative sum, where each record depends on the one before it). But many other operations do (e.g. sum, average, max, min), and Map Reduce helps you perform those faster than ever.
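A quick Python sketch of why an operation like sum parallelizes cleanly: each chunk can be reduced to a partial result on its own, and the partial results combine into exactly the answer the serial program would have produced.

```python
data = list(range(1, 101))  # the numbers 1..100

# Break the data into 4 independent chunks of 25 records each
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

# "Map": each chunk independently produces a partial sum
partial_sums = [sum(c) for c in chunks]

# "Reduce": combine the partial results into the final answer
total = sum(partial_sums)

assert total == sum(data)  # identical to serial processing
print(total)  # 5050
```

A cumulative (running) sum over the same data would not decompose this way, because the value for each record depends on every record before it.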

### Simple Example of Map-Reduce Program

It's not straightforward to see the benefit of the Map Reduce approach to Big Data processing without a comparable example. So I will take one real-life problem and try to solve it in both the traditional and the Big Data way. I hope this example will help you appreciate the applicability of Map Reduce.

### Sentiment Analysis

The CEO of a large corporation is determined to make his organization more customer-centric by increasing overall customer satisfaction with the products the company sells. He has taken some internal measures to lift consumer satisfaction, but he is unsure how to measure changes in consumer satisfaction on a timely, trustworthy basis.

One way to determine consumer satisfaction is to check the feedback/complaint emails the company receives every day, in the hundreds of thousands. If someone could read all those emails and tell him how many complaints versus compliments they get on a daily basis, then he could track whether the number of complaints is dropping day by day. So he calls in his CIO to discuss the problem.

The CIO says that it’s easy to implement. All they need to do is create a database to store the contents of the emails, and then write some procedural code that reads data from the table and determines the overall sentiment of each email. When the CEO asks how exactly they would determine the overall sentiment of a message, the CIO describes the following:

First they will build a static list of positive and negative words. For example, their lists may look something like below:

| Positive | Negative |
|------------|----------|
| Wonderful | Awful |
| Liked | Hated |
| Appreciate | Terrible |

Next, every day the contents of each email are read from the database into a string variable. The program then searches the string to check whether any of the positive or negative sentiment words appear in the email. Each positive sentiment word counts as +1 and each negative sentiment word counts as -1. Finally, all the counts are summed up to produce the sentiment score for the day. The CEO can track this score every day to follow the overall trend.
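A minimal Python sketch of the scoring logic the CIO describes. The word lists come from the table above; the sample emails are made up for illustration:

```python
POSITIVE = {"wonderful", "liked", "appreciate"}
NEGATIVE = {"awful", "hated", "terrible"}

def sentiment_score(email_text):
    """Count +1 for every positive word and -1 for every negative word."""
    score = 0
    for word in email_text.lower().split():
        word = word.strip(".,!?")  # ignore trailing punctuation
        if word in POSITIVE:
            score += 1
        elif word in NEGATIVE:
            score -= 1
    return score

# Illustrative emails standing in for the day's feedback
emails = [
    "I liked the product, the support was wonderful!",
    "Awful packaging. I hated the delivery experience.",
]
daily_score = sum(sentiment_score(e) for e in emails)
print(daily_score)  # +2 from the first email, -2 from the second: 0
```

In the CIO's design, the loop over `emails` would instead be a serial pass over rows in the relational database, which is exactly where the performance problem described next comes from.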

### The curse of the relational databases

The CEO was ecstatic to hear this plan, so he asked if he could get it done within the next couple of days. Unfortunately, the CIO told him that building such a system would take at least a month. The CIO further informed him that, given the data volume they need to process, they may not be able to generate this report every day, as processing would simply take too long. The CEO wanted to know exactly where the problem was. He knew they had a few spare servers, so he asked whether they could make it faster by utilizing the additional hardware.

But the CIO explained, “The bottleneck is the database. Relational databases do not scale out easily, so you can’t really take advantage of our spare servers when everything funnels through a single database. And we can’t make the development effort much smaller either, since we need to write ETL code to read data from the emails and transform it for storage in the relational database.”

“So what if we try to keep the relational database out of our design?” asked the CEO.

### Map-Reduce Strikes back

After a day of brainstorming, the idea of Map Reduce strikes the IT engineering team. To use the Map Reduce approach, they don’t need to make many adjustments to their initial design; all they need to do is remove the RDBMS from it. Instead, they write a map program that analyzes the overall sentiment of an email message by checking its individual words against the predefined good and bad word lists, and a reduce program that sums the sentiment scores to come up with the overall sentiment. Since the map program can read the flat files containing the emails, they don’t need to normalize the data into an RDBMS, and since the map programs can run in parallel with each other, they can now use the spare hardware to cut the execution time. There is no single point of failure in this design, and scaling up is rather easy as well!
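The redesigned flow can be sketched in Python along map and reduce lines. Reading the actual flat files and distributing the work across machines is omitted; the point is that each map call is independent, so only the reduce step needs to see all the outputs:

```python
from functools import reduce
from operator import add

POSITIVE = {"wonderful", "liked", "appreciate"}
NEGATIVE = {"awful", "hated", "terrible"}

def map_email(email_text):
    """Map step: score one email, independently of all other emails."""
    score = 0
    for word in email_text.lower().split():
        word = word.strip(".,!?")
        if word in POSITIVE:
            score += 1
        elif word in NEGATIVE:
            score -= 1
    return score

def reduce_scores(scores):
    """Reduce step: fold the per-email scores into one daily total."""
    return reduce(add, scores, 0)

# Illustrative emails standing in for flat files on different machines
emails = [
    "The service was wonderful, I appreciate the quick response",
    "Terrible product, I hated it",
    "Liked the new design",
]
# Each map_email call is independent, so this map could be spread
# across the spare servers; only the reduce gathers the results.
daily_sentiment = reduce_scores(map(map_email, emails))
print(daily_sentiment)
```

Compare this with the database version: nothing here requires a central store, so adding hardware means simply handing more files to more map workers.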

## Map-Reducible Problem

As I mentioned before, not all problems are map-reducible. Only if the individual maps can run independently can you take full advantage of this paradigm. If the processing of one map depends on the output of another map, then Map Reduce won’t solve the problem directly. However, it may be possible to break a complex problem that is not map-reducible in itself into multiple logical sub-problems that are, so we may encounter a setup with multiple consecutive layers of map and reduce programs. I will show examples of such problems in later articles.
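As a small illustration of recasting a problem to make it map-reducible, consider the average. Naively averaging per-chunk averages of unequal chunks gives the wrong answer, but if each map emits a (partial sum, count) pair, the reduce can combine the pairs correctly (a Python sketch with made-up numbers):

```python
data = [4, 8, 15, 16, 23, 42, 7, 9]
chunks = [data[:4], data[4:]]

# Map: each chunk emits a (partial_sum, count) pair rather than
# a partial average, so chunks of different sizes combine correctly
partials = [(sum(c), len(c)) for c in chunks]

# Reduce: combine the pairs, then finish the average at the very end
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
average = total / count
print(average)  # 124 / 8 = 15.5
```

The trick is choosing what each map emits so that the reduce step has everything it needs; the same idea underlies many multi-stage map-reduce designs.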

## A framework to handle map-reduce – Hadoop

Once you determine that a problem is map-reducible and divide it into multiple chunks to run on several computers in parallel, it may happen that one of your maps (or reducers) encounters a failure (disk failure, network failure, etc.). In such cases, it is imperative that you keep track of the failure and run the job again on another healthy machine. When you have a big problem consisting of thousands of map and reduce programs, it becomes too difficult to keep track of all such failures and rerun those jobs manually.

Another problem is that on such a huge map-reduce platform with a huge data volume, it is often difficult to transport the data to the individual machines running the maps in a timely, efficient manner. For these reasons, it is better to use a software framework that keeps track of all the running maps and re-executes any job that has failed. It would be even better if that framework could also keep track of data locality and perform the necessary data transportation automatically.

Fortunately, Hadoop is exactly such a framework: a distributed computing framework, designed to run on clusters of commodity hardware and to handle the above issues. My next article will be on Hadoop. Stay tuned.

### Recommended Books

I strongly recommend reading this book to delve deeper into the world of data-intensive text processing using the Map Reduce paradigm.

This book covers both Map Reduce fundamentals and the design of map-reducible algorithms quite thoroughly.

