Introduction to Apache Hadoop
Apache Hadoop is an open-source, Java-based software framework for reliable, scalable, distributed computing, and one of the most widely used big data platforms. It allows the distributed processing of very large data sets across clusters of commodity machines (low-cost hardware) using simple programming models.
Design Paradigm
The Hadoop framework library is designed to detect and handle failures at the application layer, on top of a cluster of computers, each of which may be prone to failure. In this way the library delivers high availability and resilience in software, rather than relying on high-end, failure-resistant hardware.
It is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Hadoop enables a computing solution that is scalable, flexible, fault-tolerant, and cost-effective. It can process data sets ranging in size from gigabytes to petabytes, making it a practical foundation for large-scale analytics, and it changed the economics and dynamics of large-scale computing.
Hadoop Modules
Hadoop is composed of four core components: Hadoop Common, the Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce.
- Hadoop Common: The library of common utilities that support the other Hadoop components, providing file-system and OS-level abstractions.
- Hadoop Distributed File System: A Java-based distributed, scalable, and portable file system that provides reliable storage of diverse data and high-throughput access to application data across all the nodes in a Hadoop cluster. It links together the file systems on many local nodes to present them as a single file system (see the Java sketch after this list).
- Hadoop YARN: Yet Another Resource Negotiator – The next-generation framework for job scheduling and cluster resource management. It assigns CPU, memory and storage to applications running on a Hadoop cluster. It enables application frameworks other than MapReduce to run on Hadoop, opening up new possibilities.
- Hadoop MapReduce: A YARN-based framework for writing applications that process large amounts of structured and unstructured data in parallel on clusters of thousands of machines, in a reliable and fault-tolerant manner.
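To make the "single file system" idea concrete, here is a minimal sketch of writing and reading a file through the standard HDFS Java client (`org.apache.hadoop.fs.FileSystem`). The file path and contents are hypothetical, and the NameNode address is assumed to be supplied by `core-site.xml` on the classpath.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS (e.g. hdfs://namenode:8020) from core-site.xml if present.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits large files into blocks and
        // replicates each block across DataNodes behind this one call.
        Path path = new Path("/user/demo/hello.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back through the same single-file-system view.
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buf = new byte[64];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
    }
}
```

The application addresses the file by path exactly as it would on a local file system; block placement and replication happen underneath.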
In theory, Hadoop can be used for any kind of work that is batch-oriented rather than real-time, is very data-intensive, and benefits from parallel processing of data.
Essentially, it accomplishes two tasks: massive data storage and fast, distributed processing (a word-count sketch after the list below illustrates the processing side).
- Distributed Data Storage: HDFS
- Distributed Data Processing: MapReduce
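The classic illustration of the MapReduce programming model is word count: mappers emit `(word, 1)` pairs from their local input splits, and reducers sum the counts for each word. Below is a sketch using the standard `org.apache.hadoop.mapreduce` API; class names and the input/output paths passed on the command line are placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs on the node holding each input split, emitting (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: receives all counts for a given word and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The framework moves the computation to the nodes that already hold the data, which is what lets the processing side scale along with the storage side.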
Benefits of Hadoop
- Low cost: The open-source framework is free and uses commodity hardware to store and process very large volumes of varied data.
- Massive storage: The Hadoop framework can store huge amounts of data by breaking it into blocks and distributing those blocks across clusters of low-cost commodity hardware.
- Computing power: Its distributed computing model offers local computation and storage to quickly process large volumes of data. We can increase the processing power simply by adding computing nodes to the cluster.
- Scalability: We can easily grow the system simply by adding more nodes to the cluster, with little administration required.
- Storage flexibility: We can store a variety of data whether structured, semi-structured or unstructured. We can store as much data as we want and decide how to use it later.
- Resilient framework: Data and application processing are protected against hardware failure by inherent data protection and self-healing capabilities. If a node goes down, jobs are automatically redirected to other nodes so that the distributed computation does not fail, and multiple copies of all data are stored automatically (the sketch below shows how that replication factor is controlled).
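As a rough sketch of how that replication is controlled through the HDFS Java API: the default replication factor can be set in the client configuration, and individual files can be given a higher factor after the fact. The file path and counts here are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Replication factor for files created through this client
        // (3 is the usual default; the cluster-wide default lives in hdfs-site.xml).
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);

        // Raise the replication factor for one frequently read file (hypothetical path),
        // so it survives more simultaneous node failures.
        fs.setReplication(new Path("/user/demo/important.csv"), (short) 5);
    }
}
```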