Apache Hadoop is an open-source, Java-based software framework for reliable, scalable and distributed computing. It enables the distributed processing of very large data sets across clusters of commodity machines (low-cost hardware) using simple programming models.

Design Paradigm

The Hadoop framework library is designed to detect and handle failures at the application layer, on top of a cluster of computers, each of which may be prone to failure. In this way the software library itself delivers high availability and resilience, rather than relying on high-end hardware.

It is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
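The failure-handling idea above can be sketched as a toy scheduler in Python: when a task fails on one node, the framework re-runs it on another. This is only an illustration of the design paradigm, not Hadoop's actual API; all names here are made up.

```python
def run_on_cluster(task, nodes, max_attempts=4):
    """Toy scheduler: re-run a task on another node if the current one fails."""
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]
        try:
            return node(task)
        except RuntimeError:
            continue  # node failed; reschedule the task on the next node
    raise RuntimeError("task failed on all attempts")

# Two dead nodes and one healthy one: the job still completes.
def dead_node(task):
    raise RuntimeError("node down")

def healthy_node(task):
    return task * 2

result = run_on_cluster(21, [dead_node, dead_node, healthy_node])
print(result)  # 42
```

The point is that failure is treated as the normal case and handled in software, which is what lets Hadoop run on unreliable commodity machines.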

Hadoop enables a computing solution that is scalable, flexible, fault-tolerant and cost-effective. It can process big data ranging in size from gigabytes to petabytes, making it a complete solution for large-scale analytics. Hadoop changed the economics and the dynamics of large-scale computing.

Hadoop Modules

Hadoop is composed of four core components—Hadoop Common, Hadoop Distributed File System (HDFS), YARN and Hadoop MapReduce.

  • Hadoop Common: The library module containing the common utilities that support the other Hadoop components, providing file-system and OS-level abstractions.
  • Hadoop Distributed File System: A Java-based distributed, scalable, and portable file system that provides reliable data storage of diverse data and high-throughput access to application data across all the nodes in a Hadoop cluster. It links together the file systems on many local nodes to create a single file system.
  • Hadoop YARN: Yet Another Resource Negotiator – The next-generation framework for job scheduling and cluster resource management. It assigns CPU, memory and storage to applications running on a Hadoop cluster. It enables application frameworks other than MapReduce to run on Hadoop, opening up new possibilities.
  • Hadoop MapReduce: A YARN-based framework for writing applications that process large amounts of structured and unstructured data in parallel on a cluster of thousands of machines, in a reliable and fault-tolerant manner.
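The MapReduce programming model described above can be illustrated with a minimal Python sketch: a map phase emits key-value pairs, the framework shuffles them by key, and a reduce phase aggregates each group. This is a local simulation of the model for clarity, not the Hadoop MapReduce API; all function names are illustrative.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Run the user's map function over every input record."""
    for record in records:
        yield from map_fn(record)

def shuffle(pairs):
    """Group intermediate (key, value) pairs by key, as the framework does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Apply the user's reduce function to each key and its grouped values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# The classic word-count job expressed in the MapReduce model.
def word_mapper(line):
    for word in line.split():
        yield word, 1

def count_reducer(word, counts):
    return sum(counts)

lines = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), count_reducer)
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In real Hadoop the map and reduce tasks run on different machines and the shuffle moves data over the network, but the user only writes the two simple functions.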

Theoretically, Hadoop can be used for any sort of work that is batch-oriented rather than real-time, very data-intensive, and able to benefit from parallel processing of data.

Essentially, it accomplishes two tasks: massive data storage and faster processing.

  • Distributed Data Storage: HDFS
  • Distributed Data Processing: MapReduce
Benefits of Hadoop

  • Low cost: The open-source framework is free and uses commodity hardware to store and process very large volumes of varied data.
  • Massive storage: The Hadoop framework can store huge amounts of data by breaking it into blocks and distributing them across clusters of low-cost commodity hardware.
  • Computing power: Its distributed computing model pairs local computation with local storage to process large volumes of data quickly. Processing power can be increased simply by adding computing nodes to the cluster.
  • Scalability: The system can easily be ramped up by adding more nodes to a cluster, with little administration required.
  • Storage flexibility: Hadoop can store a variety of data, whether structured, semi-structured or unstructured. We can store as much data as we want and decide how to use it later.
  • Resilient framework: Data and application processing are protected against hardware failure by inherent data protection and self-healing capabilities. If a node goes down, jobs are automatically redirected to other nodes so the distributed computation does not fail, and multiple copies of all data are stored automatically.
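The "massive storage" and "resilient framework" points above rest on two HDFS mechanisms: files are split into fixed-size blocks, and each block is replicated (three copies by default) on different nodes. A toy Python sketch of those two ideas, with illustrative names and a simple round-robin placement rather than HDFS's actual rack-aware policy:

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a file into fixed-size blocks, as HDFS does (128 MB by default)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (toy round-robin)."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

data = b"x" * 1000
blocks = split_into_blocks(data, block_size=256)   # 4 blocks; the last is 232 bytes
nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(blocks, nodes)
print(len(blocks), placement[0])  # 4 ['node1', 'node2', 'node3']
```

Because every block lives on several nodes, losing one machine loses no data, and computation can be sent to whichever node already holds a copy of the block.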

