In my previous article – “Fools guide to Big Data” – we have discussed about the origin of Bigdata and the need of big data analytics. We have also noted that Big Data is data that is too large, complex and dynamic for any conventional data tools (such as RDBMS) to compute, store, manage and analyze within a practical timeframe. In the next few articles, we will familiarize ourselves with the tools and techniques for processing Bigdata.
Map Reduce Paradigm
Today we will begin our discussion with “Map Reduce”. “Map Reduce” is a programming model just like any other programming model such as Object Oriented Programming - OOP, Functional Programming - FP etc. Like any other programming model, “Map Reduce” has some specific use-cases where it can be applied and many other cases where “Map Reduce” won’t be applicable. Map reduce is actually targeted towards cases where we need to process large datasets with parallel, distributed algorithm on a cluster setup.
Let’s say, you need to process huge amount of data. You can process individual records of the data set serially through one software program but let’s say you want to make the processing faster. One of the things you could do is to break the large chunk of data into smaller chunks and process them in parallel.
Divide and Conquer Rule
You can write mapper program in any language such as Java, C++, Python etc
Let’s take one real example. Suppose you need to find the prime numbers from a set of 100 billion numbers. You have written one C (Or, Java Or, Python) program which can detect whether a given number is prime number or not. You could run this program on all the 100 billion numbers one by one through some kind of loop to evaluate the numbers – but you know that it would take a very long time to complete. So, to make the processing faster you decided to use 100 computers in parallel. You put the copy of your program to all these 100 computers and gave them each 1 billion numbers to crunch. This way, you could accomplish your entire task one hundredth of a time. This is an example of parallel computing where you are using a cluster of 100 computers. Each of these computers will then generate its own list of prime numbers and at the end of it, you will need to collate one final list comprising of all the numbers from the 100 different lists.
Now if you understand the above model of breaking big tasks in smaller chunks and finally collating the answers out of them, then you understand what map reduce essentially does. “Map Reduce” is just the name – it essentially actually upholds a paradigm of programming where a big task is broken in smaller tasks, called Maps which are then run in parallel to each other and finally in the next step – called “Reduce” - the output of each Map are put together to form the final answer.
Advent of Map Reduce
As you see this is very easy to understand. But you may be wondering why Map-Reduce has become so important paradigm all of a sudden. After all, the basic principle behind Map-Reduce is very easy, if not the most obvious. Actually Map-Reduce has gained so much traction because of the fundamental nature of Bigdata. Bigdata is, well big. If you need to do some processing, it would take ages. So breaking the big data chunk to smaller chunks will make processing faster. But then you need to ensure that your “way of processing” actually supports dividing the record set in smaller chunks and processing them in parallel. Not all types of processing will support that (let’s say you need to calculate the cumulative summation – for example). But there are many other types of operation which will support such “way of processing” (e.g. sum, average, max, min etc.) and “Map-Reduce” just helps you to perform those faster than ever.
Simple Example of Map-Reduce Program
It's not straight forward to see the benefit of Map-Reduce approach to Bigdata processing without seeing a comparable example. So, I thought I will take one real life problem and then try to solve this in both traditional and bigdata way. I hope you will appreciate the applicability of Map Reduce with this example.
CEO of one large corporation is determined to make his organization more customer-centric by increasing the overall customer satisfaction level of the products that the company sells. He has taken some internal measures to uplift the consumer satisfaction level but he is unsure about how he could measure the changes in the satisfaction level of the consumers in timely, trustworthy basis.
One way to determine the satisfaction levels of the consumers is to check the feedback/complaints emails that they receive everyday in hundreds of thousands. If someone could read all those emails and tell him how many complains versus compliments they get on a daily basis then he could track if the number of complaints is dropping day by day or not. So he calls in his CIO to discuss this problem.
The CIO says that it’s easy to implement. All they need to do is, create a database to store the contents of the emails and then they could easily write some procedural programming code that would read data from the table and determine the overall sentiment of the email. When the CEO asked how exactly they would determine the overall sentiment of the message, CIO described him the following:
First they will build a static list of positive and negative words. For example, their lists may look something like below:
|Good Sentiment List||Bad Sentiment List|
Next, every day the contents of each email will be read from the database and stored into a string variable. Then they will perform a search operation in the string to check if any of the positive or negative sentiment words appear in the email. If a positive sentiment word appears, the program will count it as +1 and if the negative sentiment word appears then the program will count it as -1. They will finally sum-up all the counts to determine the final sentiment score of the day. CEO can track the sentiment score everyday to track the overall trend.
The curse of the relational databases
CEO was way too ecstatic to hear this plan, so he enquired if he could get this done within next couple of days. Unfortunately CIO told him that building such a system will take at least a month. CIO further informed him that, looking at the data volume they need to process – they may not be able to generate this report every day as it would take long time to process the data. CEO wanted to know exactly where the problem is – he knew that they had a few spare servers and hardware – so he enquired if they could make it faster by utilizing some additional hardware?
But CIO explained, “The bottleneck is the database – relational databases are not scalable. You can’t really take full advantage of our multiple spare servers since our database is single. Similarly, we can’t really make our development effort any less since we need to write ETL codes to read data from emails and then transform that data to store in relational databases.”
“So what if we try to keep the relational database out of our design?” asked the CEO.
Map-Reduce Strikes back
After a day of brain storming, the idea of map-reduce strikes the IT engineering team. To use the map-reduce approach, they don’t really need to make a lot of adjustments to their initial design proposal. All they need to do is to remove the RDBMS out of the design. Instead they would write a map program that will analyze the overall sentiment of the email message by checking individual words of the email against the predefined good or bad word lists and reduce will do the summation of the sentiment scores to come up with overall sentiment. Since the map program can read the flatfiles containing the emails, they need not normalize the data in RDBMS and since each map program can run parallel to each other, they can now use the extra spare hardware that they had in order to scale-up the execution time. There is no single point of failure in this design, scaling up would be rather easy as well!
As I mentioned before, not all problems are map-reducible. If individual maps can run independently only then it would be possible to take full advantage of this paradigm. But if processing of one map is dependent on the input from another map, then map-reduce won’t solve the problem. However, it might be possible to break a complex problem which may not be map-reducible in itself, in multiple logical sub-problems that are map-reducible. So we may encounter a map-reduce setup where there are multiple consecutive layers of maps and reduce programs one by one. I will show the examples of such problems in some later articles.
A framework to handle map-reduce – Hadoop
Once you determine that a problem is map-reducible and divide them in multiple chunks to run in several computers in parallel, it may happen that one of your maps (or reducer) encounters a failure (disk failure, network failure etc). In such cases, it is imperative that you keep track of this failure and run the job again in another healthy machine. When you have a big problem consisting of thousands of maps and reduce programs, it might become too difficult to keep track of all such failures and rerun those jobs manually.
Another problem is, on such a huge map-reduce platform with huge data volume, it’s often difficult to efficiently transport the data for the purpose of processing to the individual machines running the maps timely. Because of this reason, it is better to use another software framework that can keep track of all the running maps and re-execute any job that has previously failed. It would be even better if that framework can also keep track of the data locality and make necessary data transportation automatically.
Fortunately, Hadoop is the answer. It is a high-performance-computing (HPC) framework designed to handle the above issues. My next article will be on Hadoop. Stay tuned.