Assume we have a 500 MB data file and the requirement is to count the occurrences of each word in it. You can think of various options, as shown below, to achieve this.
1. SQL Script:
Break the data into one word per line, store the words in a column of a table, then apply <code>SELECT</code> and <code>GROUP BY</code> to get the unique words and their counts of occurrence. However, breaking each line into words and then applying sorting and grouping functions to obtain the result would be time-consuming.
2. UNIX Script:
Using Unix shell programming, replace the space/tab delimiters with newlines, order the data with the <code>sort</code> command and find frequencies with the <code>uniq</code> command. However, performance will suffer as the data grows, because this approach relies on sorting algorithms to order the data.
In both of the aforesaid approaches, disk I/O and CPU become bottlenecks as the data grows over time. To overcome these performance issues, Hadoop MapReduce, which can process Big Data across multiple nodes by leveraging the power of HDFS, can be used.
MapReduce can process the following types of data effectively.
- Structured: Data stored in relational database columns, or data that has some structure defined for it. E.g. customer data in an Oracle table, or call records in a flat file with a header.
- Semi-Structured: Data that doesn’t reside in relational tables but has some properties that define the fields within it. E.g. call records, log files.
- Unstructured: This represents the major portion of the data we have. It has neither a valid structure nor a schema defined for it. E.g. images, multimedia.
The term MapReduce actually refers to two distinct tasks that Hadoop programs perform. They are:
1. Map: It takes a set of data and converts it into another set by breaking the input down into (key, value) pairs.
E.g. Consider an example that calculates the number of occurrences of words in a file. The Map step forms the below <Key, Value> pairs based on the input text.
<apple, 1> <apple, 1> <mango, 1> <mango, 1>
Here, apple is the Key and 1 is the Value.
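The Map step above can be sketched in a few lines of Python. This is a standalone illustration of the idea, not the actual Hadoop API; the function name is purely illustrative.

```python
def map_words(line):
    """Emit a (word, 1) pair for every word occurrence in the input line."""
    return [(word, 1) for word in line.split()]

pairs = map_words("apple apple mango mango")
# Every occurrence becomes its own pair, so ('apple', 1) appears twice.
```

Note that the Mapper does no counting at all; it only tags each word with the value 1 and leaves the aggregation to the Reducer.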
2. Reduce: It takes the data processed by the Map() function and aggregates it to produce the desired output.
E.g. The Reducer aggregates the <Key, Value> pairs as below:
<apple, 2> <mango, 2>
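The Reduce step can likewise be sketched in Python. Again this is an illustrative sketch, not the Hadoop API: the framework hands the Reducer one key together with all of that key's values, and the Reducer sums them.

```python
def reduce_counts(key, values):
    """Sum all the values emitted for a single key."""
    return (key, sum(values))

reduce_counts("apple", [1, 1])   # → ("apple", 2)
reduce_counts("mango", [1, 1])   # → ("mango", 2)
```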
As the name implies, the Reduce job is performed after the Map job, because Reduce accepts its input from the Map job. Map and Reduce jobs can be written in Java, Python or any supported programming language. Both the input and output of a MapReduce job are stored in the Hadoop Distributed File System (HDFS). To know more about HDFS and the commands used, refer to the Apache Hadoop guide.
You can relate MapReduce functionality to the real-world “Exit Polling” activity, which asks voters whom they voted for.
- In terms of “Exit Polling”, the Mapper is analogous to the data collected by the agent from the voters, and the Reducer is analogous to the count of votes polled for each candidate.
- The agent who collects the data from voters and assigns a value against each candidate is the Mapper, whereas the agent who aggregates the number of votes polled per candidate is the Reducer. Multiple Mappers may be executed here, since data collection happens at multiple polling stations.
In the above flow, the big data file from the local file system (either Windows or Linux) should be copied to the Hadoop Distributed File System for processing. When the MapReduce job is invoked using any compatible programming language (e.g. Java/Python), this source file copied to HDFS is fed as input to the Mapper. The outcome of the Mapper is then fed to the Reducer to get the desired output.
Map and Reduce functions are explained in the following sections in detail along with other associated functions of map and reduce.
Every MapReduce program has two components: one is the Mapper and the other, typically, is a Reducer.
As said earlier, the Mapper has the Map function that transforms input data into (key, value) pairs. It accepts its input in the form of data splits.
- A Mapper instance is created for each map task.
- A Mapper is called for every input data split from the input file. An InputSplit refers to a block of source data assigned to the Mapper for processing. These input splits may span multiple nodes in the Hadoop cluster, and every node runs Mappers to process its data blocks.
- If an InputSplit has N lines, the Map function will be executed N times by the MapReduce framework.
- An InputSplit is read by the RecordReader record by record. By default, each line is considered a record.
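The RecordReader behavior described above can be sketched as a simple generator in Python. This is only a conceptual model of the default line-based record reading, with illustrative names, not Hadoop's RecordReader class.

```python
def record_reader(split_text):
    """Yield one record at a time from an input split.
    By default, each line is one record."""
    for line in split_text.splitlines():
        yield line

split = "apple mango\nmango apple"
records = list(record_reader(split))
# This split has 2 records, so the Map function would be invoked 2 times.
```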
- Upon completion of the map tasks on the individual nodes, the intermediate results of the Map operations from every node are sent for further processing.
After the Map operation and before the Reduce operation, MapReduce performs Partitioning, Shuffling and Sorting of the intermediate results from the individual map tasks.
- Partitioning: It determines which key from the intermediate map output should go to which Reducer. Records having the same key must go to the same Reducer, and all values for the same key are reduced together, irrespective of which Mapper they come from.
Note: The number of Partitioners is equal to the number of Reducers.
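Hadoop's default partitioner assigns a record to a Reducer by hashing its key modulo the number of Reducers. A minimal Python sketch of that idea (using a simple deterministic byte-sum in place of Java's hashCode, purely for illustration):

```python
def partition(key, num_reducers):
    """Map a key to a reducer index: hash(key) mod num_reducers.
    A deterministic byte-sum stands in for the real hash here."""
    return sum(key.encode("utf-8")) % num_reducers
```

Because the function depends only on the key, every record with the same key lands on the same Reducer, which is exactly the guarantee Partitioning provides.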
- Shuffling: It transfers the partitioned intermediate data between the nodes, so that all map output records for a given key arrive at the node running the corresponding Reducer.
- Sorting: The shuffled data is sorted by key (using merge sort) before being fed to the Reducer as input.
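The combined effect of the shuffle and sort phases can be sketched in Python: group the map output values by key, then order the keys. This is a single-process simulation of what the framework does across nodes, with illustrative names.

```python
from collections import defaultdict

def shuffle_and_sort(map_output):
    """Group (key, value) pairs by key and return the groups in key order,
    mimicking the framework's shuffle and sort phases."""
    groups = defaultdict(list)
    for key, value in map_output:
        groups[key].append(value)
    return sorted(groups.items())

shuffle_and_sort([("mango", 1), ("apple", 1), ("mango", 1)])
# → [("apple", [1]), ("mango", [1, 1])]
```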
The Reducer aggregates the Mapper output that has been partitioned and sorted.
- A Reducer instance is created for each reduce task. By default, the number of Reducers is 1, but it can be customized by the user.
- For each key in the partition, the Reducer’s reduce task is invoked once. The reduce task receives a key along with an iterator over all the values associated with that key. The output of the Reducer is not re-sorted.
- RecordWriter reads the Reducer output and writes the result to the output file.
Example to demonstrate MapReduce:
Let’s consider a word count example and see how it is processed by the MapReduce framework.
Source file content:
Consider the below contents in a file that would be used as a source to perform word count task using MapReduce framework in Hadoop.
apple orange banana choco banana burger orange butter choco butter apple burger
In this example, the file in HDFS is split into blocks and processed by two different nodes. One file block is assigned per node to perform the MapReduce task of getting the word count.
A map task is executed for every InputSplit and generates <Key, Value> pairs for it. The intermediate results of the map tasks are then partitioned and shuffled to assign them to the appropriate Reducers. The <Key, Value> pairs are then sorted and passed to the Reducers as input, which count the words by aggregating the values grouped by key.
The final Reducer output is then loaded into HDFS in the form of a result file.
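The whole flow above can be simulated end-to-end in plain Python on the sample file content. This single-process sketch only models the phases conceptually (assume each string in `source` is one record from an input split); it is not how a real Hadoop job is written or run.

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit (word, 1) for every word occurrence."""
    return [(word, 1) for word in line.split()]

def reducer(key, values):
    """Reduce phase: sum all values for one key."""
    return (key, sum(values))

def run_job(lines):
    # Map phase: one mapper call per record.
    intermediate = [pair for line in lines for pair in mapper(line)]
    # Shuffle/sort phase: group values by key, order the keys.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one reducer call per key.
    return dict(reducer(k, v) for k, v in sorted(groups.items()))

source = ["apple orange banana choco banana burger",
          "orange butter choco butter apple burger"]
print(run_job(source))
# Every word occurs twice in the sample, so each count is 2.
```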