In this article we will learn about the Apache Hadoop framework architecture. The basic components of the Apache Hadoop HDFS & MapReduce engine are discussed in brief.

Hadoop Architecture

Hadoop Distributed File System: It is a high throughput, fault-tolerant distributed file system designed to run on commodity machines. HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications. A typical file in HDFS is gigabytes to terabytes in size. It supports tens of millions of files in a single instance.

HDFS is designed to support very large files. Applications dealing with large data sets write their data only once, but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files.

HDFS has a master/slave architecture. An HDFS cluster usually consists of a single NameNode or a master server and a number of DataNodes or slave nodes.


HDFS Block: A block is the smallest unit of data that can be stored or retrieved from the disk. Hadoop distributed file system stores the data in terms of fixed sized blocks. A typical block size used by HDFS is 64 MB. Internally, a large file is split/divided into one or more blocks and these blocks are stored in a set of DataNodes. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will be distributed on multiple DataNodes in a Hadoop cluster. The NameNode maintains the metadata of all the blocks. Blocks are easy to replicate between the DataNodes and thus provide fault tolerance and high availability. Hadoop framework replicates each block across multiple nodes. The default replication factor is 3. In case of any node failure or block corruption, the same block can be read from another node.

NameNode: NameNode is the centrepiece of the hadoop system. There is only one NameNode in and it coordinates everything in a hadoop cluster. The NameNode manages the file system namespace operations like opening, closing, and renaming files and directories. It stores the metadata information of the data blocks. It also determines the mapping of blocks to DataNodes. This metadata is stored permanently on to local disk in the form of namespace image and edit log file. However the NameNode does not store this information persistently. The NameNode creates the block to DataNode mapping when it is restarted. The NameNode is a Single Point of Failure for the HDFS Cluster. If the NameNode crashes, then the entire hadoop system goes down. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds the successful requests by returning a list of relevant DataNode servers where the data resides.

DataNode: It stores the blocks of data and retrieves them. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. DataNode instances can talk to each other, which is what they do when they are replicating data. The DataNodes also reports the blocks information to the NameNode periodically. Client applications can also talk directly to a DataNode, once the NameNode has provided the location of the data.

Secondary NameNode: This is optional node hosted on a separate machine and is responsible to copy and merge the namespace image and edit log from the NameNode periodically. In case if the name node crashes, then the namespace image stored in Secondary NameNode can be used to restart the NameNode.

Above the file systems comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker assigns work to available Task Tracker nodes in the cluster, striving to keep the work as close to the data as possible (data locality). With a rack-aware file system, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This minimizes network congestion and increases the overall throughput of the system.

JobTracker: It is responsible to handle the Job requests, submitted by the client applications. It schedules the MapReduce tasks to specific TaskTrackers (DataNodes) nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack. Job Tracker also checks for any failed tasks or time out and reschedules the failed tasks to a different TaskTracker. The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted. JobTracker can be configured to run on the NameNode machine or on a separate node.

TaskTracker: TaskTracker runs on the DataNodes. TaskTrackers are responsible to run the map, reduce or shuffle tasks assigned by the NameNode. It also reports back the status of the tasks to the NameNode. The TaskTracker spawns a separate JVM processes to do the actual work, to prevent the TaskTracker itself from failing if the running job crashes the JVM. The TaskTracker monitors these spawned processes and captures the output and exit codes. When the process finishes, the TaskTracker notifies the JobTracker about the success or failure of the job. The TaskTrackers also send out heartbeat messages to the JobTracker every few minutes, to reassure the JobTracker that it is still alive.

It is recommended not to host the DataNode, JobTracker or TaskTracker services on the same system as that of the NameNode.

Have a question on this subject?

Ask questions to our expert community members and clear your doubts. Asking question or engaging in technical discussion is both easy and rewarding.

Are you on Twitter?

Start following us. This way we will always keep you updated with what's happening in Data Analytics community. We won't spam you. Promise.

  • Install PIG In Client Node of Hadoop Cluster

    Apache Pig is a platform for analyzing large data sets. Pig Latin is the high level programming language that, lets us specify a sequence of data transformations such as merging data sets, filtering them, grouping them, and applying functions to...

  • Install FLUME In Client Node of Hadoop Cluster

    Apache Flume is a distributed, robust, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data or streaming event data from different sources to a centralized data store. Its main goal is to...

  • Install HBASE in Hadoop Cluster

    Apache HBase provides large-scale tabular storage for Hadoop using the Hadoop Distributed File System (HDFS). Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable. HBase is used in cases...

  • Hadoop DataLake Implementation Part 3

    To complete our implementation setup we will create the source tables based on the downloaded datafiles. Let us first load the SQL files in MySQL server under a new database called ‘sales’. We will simulate this database schema as our OLTP source...

  • Hadoop DataLake Implementation Part 6

    In this article we will load the Customer data in the Hive warehouse as SCD Type 1. This time we will follow a different approach to implement Insert/Update or Merge strategy using Hive QL, rather than SQOOP Merge utility

  • SQOOP import from MySQL

    In this article we will use Apache SQOOP to import data from MySQL database. For that let us create a MySql database & user and dump some data quickly. Let us download a MySQL database named Sakila Db from internet to get started. Next we will...

  • Install SPARK in Hadoop Cluster

    Apache Spark is a fast and general purpose engine for large-scale data processing over a distributed cluster. Apache Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing. Spark run programs up to 100x...

  • Hadoop DataLake Implementation Part 9

    In this article we will load our final fact table i.e. stock.

  • SQOOP Merge & Incremental Extraction from Oracle

    Let us check how to perform Incremental Extraction & Merge using Sqoop. The SQOOP Merge utility allows to combine two datasets where entries in one dataset should overwrite entries of an older dataset. For example, an incremental import run in...

  • How to Setup Hadoop Multi Node Cluster - Step By Step

    Setting up Hadoop in a single machine is easy, but no fun. Why? Because Hadoop is not meant for a single machine. Hadoop is meant to run on a computing cluster comprising of many machines. Running HDFS and MapReduce on a single machine is great for...