Apache Hadoop is a leading big data platform. It is an open-source, Java-based software framework for reliable, scalable, distributed computing. Hadoop enables the distributed processing of very large data sets across clusters of commodity machines (low-cost hardware) using simple programming models.

Design Paradigm

The Hadoop framework library is designed to detect and handle failures at the application layer, on top of a cluster of computers, each of which may be prone to failure. This is how the software library delivers high availability and resilience rather than relying on high-end hardware.

It is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Hadoop enables a computing solution that is scalable, flexible, fault tolerant and cost effective. It can process data sets ranging in size from gigabytes to petabytes, making it a complete platform for large-scale analytics. Hadoop changed the economics and the dynamics of large-scale computing.

Hadoop Modules

Hadoop is composed of four core components—Hadoop Common, Hadoop Distributed File System (HDFS), YARN and Hadoop MapReduce.

  • Hadoop Common: The library module containing the common utilities that support the other Hadoop components, including file system and OS-level abstractions.
  • Hadoop Distributed File System (HDFS): A Java-based distributed, scalable and portable file system that provides reliable storage of diverse data and high-throughput access to application data across all the nodes in a Hadoop cluster. It links together the file systems on many local nodes to create a single file system.
  • Hadoop YARN: Yet Another Resource Negotiator – The next-generation framework for job scheduling and cluster resource management. It assigns CPU, memory and storage to applications running on a Hadoop cluster. It enables application frameworks other than MapReduce to run on Hadoop, opening up new possibilities.
  • Hadoop MapReduce: A YARN-based framework for writing applications that process large amounts of structured and unstructured data in parallel on clusters of thousands of machines, in a reliable and fault-tolerant manner (a minimal word-count sketch follows this list).
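To make the MapReduce programming model concrete, below is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API. The input and output paths are passed as command-line arguments and are illustrative assumptions only; running it requires a Hadoop cluster (or a local pseudo-distributed setup).

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input split it is given.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word across all mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // reuse the reducer for local aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```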

In theory, Hadoop can be used for any workload that is batch-oriented rather than real-time, highly data-intensive, and able to benefit from parallel processing of data.

Essentially, Hadoop accomplishes two tasks: massive data storage and fast, distributed processing.

  • Distributed Data Storage: HDFS (see the read/write sketch after this list)
  • Distributed Data Processing: MapReduce
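To illustrate the storage side, here is a minimal sketch that writes and then reads a file through the HDFS Java client API (FileSystem). It assumes the cluster configuration (fs.defaultFS) is picked up from core-site.xml/hdfs-site.xml on the classpath; the path /user/demo/hello.txt is hypothetical and used only for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // Loads core-site.xml / hdfs-site.xml from the classpath,
    // so fs.defaultFS points at the cluster's NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path used only for illustration.
    Path file = new Path("/user/demo/hello.txt");

    // Write: HDFS splits the stream into blocks and replicates
    // each block across DataNodes in the cluster.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello from hdfs");
    }

    // Read: the client asks the NameNode for block locations and
    // reads the blocks directly from the DataNodes that hold them.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }

    fs.close();
  }
}
```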

Benefits of Hadoop

  • Low cost: The open-source framework is free and uses commodity hardware to store and process very large volumes of varied data.
  • Massive storage: The Hadoop framework can store huge amounts of data by breaking the data into blocks and storing them on clusters of low-cost commodity hardware.
  • Computing power: Its distributed computing model offers local computation and storage to quickly process large volumes of data. We can increase the processing power simply by adding computing nodes to the cluster.
  • Scalability: We can easily ramp up the system simply by adding more nodes to a cluster, with little administration required.
  • Storage flexibility: We can store a variety of data, whether structured, semi-structured or unstructured. We can store as much data as we want and decide how to use it later.
  • Resilient framework: Data and application processing are protected against hardware failure by inherent data protection and self-healing capabilities. If a node goes down, jobs are automatically redirected to other nodes so that the distributed computation does not fail, and multiple copies of all data are stored automatically (a minimal replication sketch follows this list).
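As a rough illustration of that last point, the sketch below shows two common ways to control HDFS block replication from Java: the dfs.replication configuration property for newly created files, and FileSystem.setReplication for an existing file. The path is hypothetical and the replication factors are only example values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // dfs.replication controls how many copies of each block HDFS keeps
    // for files created by this client (the stock default is 3).
    conf.setInt("dfs.replication", 3);

    FileSystem fs = FileSystem.get(conf);
    // The replication factor of an existing file can also be changed later;
    // the path below is hypothetical.
    fs.setReplication(new Path("/user/demo/hello.txt"), (short) 4);
    fs.close();
  }
}
```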


