Hadoop DataLake Implementation
In this multi-part article series we will learn how to implement an Enterprise DataLake using Apache Hadoop, an open-source, Java-based software framework for reliable, scalable and distributed computing. Apache Hadoop addresses the limitations of traditional computing, helps businesses overcome real challenges, and powers new types of Big Data analytics.
Apache Hadoop enables distributed parallel processing of very large data sets across clusters of commodity machines (low-cost hardware) using simple programming models. The framework scales horizontally: capacity is added simply by adding worker nodes to the cluster. Hadoop was designed from the start to process any kind of data, whether structured, semi-structured or unstructured, and it does not enforce a schema or structure on the data being stored. Its combination of distributed storage and distributed processing is essential for unstructured data analysis, owing to the size and complexity of such data. The Hadoop framework library is designed to detect and handle failures at the application layer, on top of a cluster of computers each of which may be prone to failure. In this way the software library provides high availability and resilience instead of simply relying on high-end hardware.
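To give a feel for the "simple programming models" mentioned above, here is the classic MapReduce word-count job written against Hadoop's Java API. The input and output paths are hypothetical placeholders you would replace with real HDFS locations:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Hypothetical HDFS paths -- substitute your own input and output locations.
    FileInputFormat.addInputPath(job, new Path("/data/input"));
    FileOutputFormat.setOutputPath(job, new Path("/data/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The developer writes only the map and reduce functions; the framework handles splitting the input, scheduling the tasks across the cluster's worker nodes, and re-running any task that fails.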
Hadoop DataLake
Here we will design and implement an enterprise data lake: an enterprise-wide system for storing vast amounts of data and analysing disparate sources of data in their native formats, on clusters of commodity hardware. The goal is to break the information silos in the enterprise by bringing all the data into a single place for analysis, without the restrictions of schema, security, or authorization. Given the variety of data involved, Hadoop is an ideal platform for a data lake thanks to its support for many different file formats, dynamically structured data, and a wide range of tools for data access and querying. Data is simply copied into HDFS and can then be queried with different tools. All the existing data, as well as incremental data from the various source systems, will be loaded into HDFS for data analytics. In a more technical sense, a data lake is a set of tools for ingesting, transforming, storing, securing, recovering, accessing, and analysing all the relevant data of the company. A data lake solution does not require any changes to the existing source systems; rather, it fits neatly into the existing IT landscape. ETL offloading from the current EDW platform to a parallel, open-source, cost-effective, scale-out environment like Hadoop is another typical use case.
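As a small illustration of the "copy into HDFS, query later" ingestion model, the following sketch uses Hadoop's Java FileSystem API to land a local file in the lake in its native format. The NameNode URI, file paths and lake directory layout are all assumptions for illustration, not fixed conventions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IngestToDataLake {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address -- on a real cluster this normally
    // comes from core-site.xml rather than being hard-coded.
    conf.set("fs.defaultFS", "hdfs://sandbox-hdp.hortonworks.com:8020");

    try (FileSystem fs = FileSystem.get(conf)) {
      // Land the raw file as-is; no schema is imposed at write time.
      Path source = new Path("file:///tmp/orders.csv");         // hypothetical local file
      Path target = new Path("/datalake/raw/sales/orders.csv"); // hypothetical lake layout
      fs.mkdirs(target.getParent());
      fs.copyFromLocalFile(source, target);
      System.out.println("Ingested " + source + " into " + target);
    }
  }
}
```

Because no schema is applied on write, the same raw file can later be read by Hive, Pig, Spark or any other tool that understands its format (schema-on-read).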
Hortonworks Hadoop Distribution
For the choice of Hadoop flavour for enterprise deployment, we will be using the Hortonworks Hadoop Distribution. Hortonworks is an industry-leading innovator that creates, distributes and supports enterprise-ready open data platforms (100% open-source Apache Hadoop) and modern data applications that deliver actionable intelligence from all data, both data-in-motion and data-at-rest (Open Enterprise Hadoop). Powered by the world's most popular Hadoop distribution, Hortonworks makes Hadoop fast, easy, and secure, so customers can focus on results. For organizations that want enterprise-grade Hadoop, Hortonworks Hadoop is a smart, trusted choice. HDP and HDF have the flexibility to run a variety of enterprise workloads (for example, batch processing, interactive SQL, enterprise search and advanced analytics) while meeting enterprise requirements such as integration with existing systems, robust security, governance, data protection, and management.
About HDP
Hortonworks Data Platform is the industry's only truly secure, enterprise-ready open-source Apache™ Hadoop® distribution based on a centralized architecture (YARN). HDP addresses the complete needs of data-at-rest, powers real-time customer applications and delivers robust big data analytics that accelerate decision making and innovation.
YARN and the Hadoop Distributed File System (HDFS) are the cornerstone components of Hortonworks Data Platform (HDP) for data-at-rest. While HDFS provides scalable, fault-tolerant, cost-efficient storage for a big data lake, YARN provides the centralized architecture that enables multiple workloads to be processed simultaneously, supplying resource management and a pluggable architecture that enables a wide variety of data access methods.
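To make YARN's role as the cluster's central resource manager concrete, here is a minimal sketch using the YarnClient API to ask the ResourceManager for the currently running worker nodes. It assumes a valid yarn-site.xml is available on the client's classpath so the ResourceManager address can be resolved:

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterStatus {
  public static void main(String[] args) throws Exception {
    // Picks up the ResourceManager address from yarn-site.xml on the classpath.
    YarnConfiguration conf = new YarnConfiguration();
    try (YarnClient yarn = YarnClient.createYarnClient()) {
      yarn.init(conf);
      yarn.start();

      // Report on every NodeManager currently in RUNNING state.
      List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
      System.out.println("Running NodeManagers: " + nodes.size());
      for (NodeReport node : nodes) {
        System.out.println(node.getNodeId()
            + "  capacity=" + node.getCapability()
            + "  used=" + node.getUsed());
      }
    }
  }
}
```

The same ResourceManager that answers this query is also the component that allocates containers to MapReduce, Hive, Spark and other workloads, which is what allows them to share the cluster simultaneously.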
With regard to operations, Ambari is an open-source management platform for provisioning, managing, monitoring, and securing the Hortonworks Data Platform. It enables Hadoop to fit seamlessly into an enterprise IT landscape.
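Ambari also exposes a REST API (the same one its web UI uses). As a hedged sketch, the snippet below lists the clusters a given Ambari server manages; the host, port 8080 and admin:admin credentials are assumptions matching a default sandbox installation and should be adjusted for your environment:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class AmbariClusters {
  public static void main(String[] args) throws Exception {
    // Assumed sandbox endpoint and credentials -- adjust for your installation.
    URL url = new URL("http://sandbox-hdp.hortonworks.com:8080/api/v1/clusters");
    String auth = Base64.getEncoder()
        .encodeToString("admin:admin".getBytes("UTF-8"));

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    conn.setRequestProperty("Authorization", "Basic " + auth);
    // Ambari requires this header on write calls (CSRF guard); harmless on reads.
    conn.setRequestProperty("X-Requested-By", "ambari");

    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // JSON listing of managed clusters
      }
    } finally {
      conn.disconnect();
    }
  }
}
```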
For this lab setup, download the Hortonworks Data Platform Sandbox and follow the installation guide to deploy it properly. To get started with the sandbox, check the sandbox documentation. Alternatively, you may set up an HDP cluster in the AWS cloud.
Once you are familiar with HDP, follow this guide to build an Enterprise Data Lake in Hadoop.