In this article we will learn about the Apache Hadoop framework architecture. The basic components of the Apache Hadoop HDFS & MapReduce engine are discussed in brief.

Hadoop Architecture

Hadoop Distributed File System: It is a high throughput, fault-tolerant distributed file system designed to run on commodity machines. HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications. A typical file in HDFS is gigabytes to terabytes in size. It supports tens of millions of files in a single instance.

HDFS is designed to support very large files. Applications dealing with large data sets write their data only once, but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files.

HDFS has a master/slave architecture. An HDFS cluster usually consists of a single NameNode or a master server and a number of DataNodes or slave nodes.


HDFS Block: A block is the smallest unit of data that can be stored or retrieved from the disk. Hadoop distributed file system stores the data in terms of fixed sized blocks. A typical block size used by HDFS is 64 MB. Internally, a large file is split/divided into one or more blocks and these blocks are stored in a set of DataNodes. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will be distributed on multiple DataNodes in a Hadoop cluster. The NameNode maintains the metadata of all the blocks. Blocks are easy to replicate between the DataNodes and thus provide fault tolerance and high availability. Hadoop framework replicates each block across multiple nodes. The default replication factor is 3. In case of any node failure or block corruption, the same block can be read from another node.

NameNode: NameNode is the centrepiece of the hadoop system. There is only one NameNode in and it coordinates everything in a hadoop cluster. The NameNode manages the file system namespace operations like opening, closing, and renaming files and directories. It stores the metadata information of the data blocks. It also determines the mapping of blocks to DataNodes. This metadata is stored permanently on to local disk in the form of namespace image and edit log file. However the NameNode does not store this information persistently. The NameNode creates the block to DataNode mapping when it is restarted. The NameNode is a Single Point of Failure for the HDFS Cluster. If the NameNode crashes, then the entire hadoop system goes down. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds the successful requests by returning a list of relevant DataNode servers where the data resides.

DataNode: It stores the blocks of data and retrieves them. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. DataNode instances can talk to each other, which is what they do when they are replicating data. The DataNodes also reports the blocks information to the NameNode periodically. Client applications can also talk directly to a DataNode, once the NameNode has provided the location of the data.

Secondary NameNode: This is optional node hosted on a separate machine and is responsible to copy and merge the namespace image and edit log from the NameNode periodically. In case if the name node crashes, then the namespace image stored in Secondary NameNode can be used to restart the NameNode.

Above the file systems comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker assigns work to available Task Tracker nodes in the cluster, striving to keep the work as close to the data as possible (data locality). With a rack-aware file system, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This minimizes network congestion and increases the overall throughput of the system.

JobTracker: It is responsible to handle the Job requests, submitted by the client applications. It schedules the MapReduce tasks to specific TaskTrackers (DataNodes) nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack. Job Tracker also checks for any failed tasks or time out and reschedules the failed tasks to a different TaskTracker. The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted. JobTracker can be configured to run on the NameNode machine or on a separate node.

TaskTracker: TaskTracker runs on the DataNodes. TaskTrackers are responsible to run the map, reduce or shuffle tasks assigned by the NameNode. It also reports back the status of the tasks to the NameNode. The TaskTracker spawns a separate JVM processes to do the actual work, to prevent the TaskTracker itself from failing if the running job crashes the JVM. The TaskTracker monitors these spawned processes and captures the output and exit codes. When the process finishes, the TaskTracker notifies the JobTracker about the success or failure of the job. The TaskTrackers also send out heartbeat messages to the JobTracker every few minutes, to reassure the JobTracker that it is still alive.

It is recommended not to host the DataNode, JobTracker or TaskTracker services on the same system as that of the NameNode.

Have a question on this subject?

Ask questions to our expert community members and clear your doubts. Asking question or engaging in technical discussion is both easy and rewarding.

Are you on Twitter?

Start following us. This way we will always keep you updated with what's happening in Data Analytics community. We won't spam you. Promise.

  • Hadoop DataLake Implementation Part 6

    In this article we will load the Customer data in the Hive warehouse as SCD Type 1. This time we will follow a different approach to implement Insert/Update or Merge strategy using Hive QL, rather than SQOOP Merge utility

  • Hadoop MapReduce Basics

    The Hadoop, since its inception is changing the way the enterprises store, process and analyse data. MapReduce is the core part of the Hadoop framework and we can also call it as the core processing engine of Hadoop. It is a programming model...

  • Apache Hadoop Architecture

    In this article we will learn about the Apache Hadoop framework architecture. The basic components of the Apache Hadoop HDFS & MapReduce engine are discussed in brief.

  • Oracle Installation for SQOOP Import

    We would like to perform practical test of Apache SQOOP import/export utility between ORACLE relational database & Apache HADOOP file system, let us quickly setup an ORACLE server. For that we will be using cloud based services/servers as we did...

  • Hadoop DataLake Implementation Part 10

    In this article we will create oozie workflow to orchestrate the daily loading of showroom dimension table from MySQL source to HDFS using Sqoop, followed by Loading data from HDFS to Hive warehouse using Hive and finally housekkeping & archive.

  • Hadoop DataLake Implementation

    In this multi-series article we will learn how to implement an Enterprise DataLake using Apache Hadoop, an open-source, java-based software framework for reliable, scalable & distributed computing. Apache Hadoop addresses the limitations of...

  • Hadoop DataLake Implementation Part 5

    In this article we will load the showroom master data from MySQL source system to HDFS using Sqoop as SCD Type 1.

  • Hadoop DataLake Implementation Part 3

    To complete our implementation setup we will create the source tables based on the downloaded datafiles. Let us first load the SQL files in MySQL server under a new database called ‘sales’. We will simulate this database schema as our OLTP source...

  • Install PIG In Client Node of Hadoop Cluster

    Apache Pig is a platform for analyzing large data sets. Pig Latin is the high level programming language that, lets us specify a sequence of data transformations such as merging data sets, filtering them, grouping them, and applying functions to...

  • Hadoop DataLake Implementation Part 2

    Now that we are familiar with HDP stack, in this article we are going to access HDP sandbox command line, Ambari Web UI, Hive & Ranger to create a user for our implementation setup.