The Hadoop, since its inception is changing the way the enterprises store, process and analyse data. MapReduce is the core part of the Hadoop framework and we can also call it as the core processing engine of Hadoop. It is a programming model designed to process large amount of data in parallel by dividing the load across multiple nodes in a Hadoop cluster. Continuing on our previous discussion on MapReduce, let's go deeper in this article.

Why MapReduce?

Assume we have a 500 MB data file and the requirement is to count the occurrence of words in it. You can think of various options as shown below to achieve this.

  1. SQL Script:

    Break the data into one word per line and store them in a column of a table, then apply SELECT and GROUP BY to get the unique words and their count of occurrence. But, this involves breaking down each line into words, then applying sorting and grouping functions in order to get the result would be time consuming.

  2. UNIX Script:

    Using Unix Shell programming, replace space/tab delimiter with new line, order the data using Sort and find frequencies using Uniq command. But, this approach will have a toss on the performance if the data grows because it involves sorting algorithms to sort the data.

In both of the aforesaid approaches, Disk I/O and CPU will be bottlenecks if data grows over a period of time. In order to overcome these performance issues, Hadoop MapReduce with the capability of processing Big Data in multiple nodes by leveraging the power of HDFS can be used.

MapReduce Concepts:

MapReduce is highly powerful to process the following types of data.

  • Structured:

    Data stored in relational database in columns or data that has some structure defined to it. E.g. Customer data in Oracle table or Call records in a flat file with a header.

  • Semi-Structured:

    Data that doesn’t reside in relational tables but has some properties to define the fields within the data. E.g. Call Records, Log file.

  • Unstructured:

    It represents major portion of the data we have. Data neither have a valid structure nor have a valid schema defined to it. E.g. Image, multimedia.

The term MapReduce actually refers two different tasks that Hadoop programs usually perform. They are

  1. Map

    It takes a set of data and converts into another set by breaking down the input into (key, value) pairs.

    E.g. Consider an example that calculates the number of occurrences of words in a file. Map Reduce forms the below <Key, Value> pairs based on input text.

    <apple, 1>
    <apple, 1>
    <mango, 1>
    <Mango, 1>

    Here, apple is the Key and 1 is the Value.

  2. Reduce

    It takes the data processed by Map() function and aggregates them to produce the desired output.

    E.g. Reducer aggregates the <Key, Value> pairs as below:-

    <apple, 2> 
    <mango, 2>

As the name implies, Reduce job is performed after the Map job because Reduce has to accept the input from Map job. Map and Reduce jobs can be written in Java, Python or any supporting programming languages. Both the input and output of MapReduce job are stored in Hadoop Distributed File System (HDFS). To know more about HDFS and the commands used, refer Apache Hadoop guide.

You can relate MapReduce functionality to the real world “Exit Polling” activity that takes the opinion of voters to whom they voted for.

  • In terms of “Exit Polling”, Mapper is analogous to the data collected by the Agent from the voters and Reducer is analogous to the count of votes polled to candidates.
  • Agent who collects the data from voters and assign value against candidates is the Mapper whereas the Agent who aggregates the number of votes polled per candidate is the Reducer. Multiple mappers may be executed here since the data collection happens at multiple every polling stations.

The overall flow of mapreduce looks like below:-

In the above flow, big data file from the local file system (either Windows/Linux) should be copied to Hadoop Distributed File System for processing. When MapReduce job is invoked using any compatible programming language (e.g. Java/Python), this source file copied to HDFS should be fed as an input to the mapper job. The outcome of the mapper job should be fed to the reducer job to get the desired output.

Map and Reduce functions are explained in the following sections in detail along with other associated functions of map and reduce. Every mapreduce program has two components, one is Mapper and another one is typically a Reducer.

  1. Mapper:

    As said earlier, Mapper has the Map function that transforms input data into (key, value) pairs. It accepts the input in the form of data splits.

    • A Mapper instance is created for each map task.
    • A mapper is called for every input data split from the input file. InputSplit refers a block of source data assigned to the mapper for processing. These input splits may span across multiple nodes in Hadoop cluster and every node runs Mappers to process these data blocks.
    • If InputSplit has N number of lines, Map function will be executed N number of times by the MapReduce framework.
    • InputSplit is read by RecordReader record by record. By default, each line is considered as a record.
    • Upon the completion of map tasks in individual nodes, the intermediate results of Map operations from every node will be sent for further processing.

    Before Reduce operation after Map operation, MapReduce performs Shuffling, Partitioning and Sorting of the intermediate results from individual map tasks.

    1. Partitioning:

      It determines which Key from intermediate map output should go to which Reducer. Records having same key should go to the same Reducer. All values for the same key should always be reduced together irrespective of which mapper from which it comes from. Note: The number of Partitioners is equal to the number of Reducers.

    2. Shuffling:

      It shuffles the Partitioner data between the nodes so that map outputs are partitioned based on keys.

    3. Sorting:

      Shuffled data will be sorted (merge sort) based on Keys to fed to the Reducer as input.

  2. Reducer:

    Reducer aggregates the output of the mapper that’s partitioned and sorted.

    • A Reducer instance is created for each reduce task. By default, the number of reducer is 1 but can be customized by the user.
    • For each key in the partition, Reducer’s reduce task is invoked. Reduce task receives a key as well as iterator for all the values associated with the key. The output of the reducer will not be re-sorted again.
    • RecordWriter reads the Reducer output and writes the result to the output file.

Example to demonstrate MapReduce:

Let’s consider a word count example and see how it is processed by MapReduce framework.

Source file content:

Consider the below contents in a file that would be used as a source to perform word count task using MapReduce framework in Hadoop.

apple orange banana
choco banana burger
orange butter choco
butter apple burger
  • This file in HDFS is split into blocks and processed by two different nodes in the above example. One file block per node is assigned to perform MapReduce task to get the word count.
  • A map task is executed for every InputSplit and generates <Key, Value> pairs for each InputSplit. The intermediary result of Map task is then shuffled and partitioned to assign them to the appropriate Reducers.
  • Shuffled <Key, Value> pairs are then sorted and passed to the Reducers as Input to perform count of words by aggregating the values grouping by keys.
  • Final Reducer output is then loaded into HDFS in the form of a result file.

Have a question on this subject?

Ask questions to our expert community members and clear your doubts. Asking question or engaging in technical discussion is both easy and rewarding.

Are you on Twitter?

Start following us. This way we will always keep you updated with what's happening in Data Analytics community. We won't spam you. Promise.

  • Oracle Installation for SQOOP Import

    We would like to perform practical test of Apache SQOOP import/export utility between ORACLE relational database & Apache HADOOP file system, let us quickly setup an ORACLE server. For that we will be using cloud based services/servers as we did...

  • Hadoop DataLake Implementation Part 9

    In this article we will load our final fact table i.e. stock.

  • Hadoop DataLake Implementation Part 2

    Now that we are familiar with HDP stack, in this article we are going to access HDP sandbox command line, Ambari Web UI, Hive & Ranger to create a user for our implementation setup.

  • Install FLUME In Client Node of Hadoop Cluster

    Apache Flume is a distributed, robust, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data or streaming event data from different sources to a centralized data store. Its main goal is to...

  • Hadoop DataLake Implementation Part 10

    In this article we will create oozie workflow to orchestrate the daily loading of showroom dimension table from MySQL source to HDFS using Sqoop, followed by Loading data from HDFS to Hive warehouse using Hive and finally housekkeping & archive.

  • Configuring MySQL as Hive Metastore

    In the previous article, we have learnt How to Install and Configure Hive with default Derby metastore. However, an embedded derby based metastore can process only one request at a time. Since this is very restrictive, we will setup a traditional...

  • Install HBASE in Hadoop Cluster

    Apache HBase provides large-scale tabular storage for Hadoop using the Hadoop Distributed File System (HDFS). Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable. HBase is used in cases...

  • Hadoop DataLake Implementation Part 6

    In this article we will load the Customer data in the Hive warehouse as SCD Type 1. This time we will follow a different approach to implement Insert/Update or Merge strategy using Hive QL, rather than SQOOP Merge utility

  • Hadoop MapReduce Basics

    The Hadoop, since its inception is changing the way the enterprises store, process and analyse data. MapReduce is the core part of the Hadoop framework and we can also call it as the core processing engine of Hadoop. It is a programming model...

  • Hadoop DataLake Implementation Part 8

    In this article we will load our first fact table into Hive warehouse which is sales transactions.