In this multi-part article series we will learn how to implement an Enterprise Data Lake using Apache Hadoop, an open-source, Java-based software framework for reliable, scalable & distributed computing. Apache Hadoop addresses the limitations of traditional computing, helps businesses overcome real challenges, and powers new types of Big Data analytics.

Apache Hadoop allows distributed parallel processing of very large data sets across clusters of commodity machines (low-cost hardware) using simple programming models. The framework scales horizontally: capacity grows simply by adding worker nodes to the cluster. Hadoop is designed to process any kind of data, whether structured, semi-structured or unstructured, and does not enforce a schema or structure on the data being stored. Its combination of distributed storage and distributed processing is essential for analysing unstructured data, owing to its size and complexity. The Hadoop framework library is designed to detect and handle failures at the application layer, on top of a cluster of computers each of which may be prone to failure. This is how the software library provides high availability and resilience instead of simply relying on high-end hardware.
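
To make the "simple programming models" claim concrete, below is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API. The input and output HDFS paths are illustrative placeholders, not paths used later in this series. Packaged into a jar and submitted with the hadoop jar command, the framework splits the input across the cluster, runs the map and reduce tasks in parallel and transparently retries any task that fails.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Classic word-count: map emits (word, 1) pairs, reduce sums them per word.
    public class WordCount {

      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);          // emit (word, 1)
          }
        }
      }

      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);          // emit (word, total count)
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output HDFS paths are placeholders for this sketch.
        FileInputFormat.addInputPath(job, new Path("/data/raw/sample_text"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out/wordcount"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }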

Hadoop DataLake

Here we will design and implement an enterprise data lake: an enterprise-wide system for storing vast amounts of data and analysing disparate sources of data in their native formats, on clusters of commodity hardware. The goal is to break the information silos in the enterprise by bringing all the data into a single place for analysis, without the restrictions of schema, security or authorization. Given the variety of data in an enterprise, Hadoop is an ideal platform for a data lake, thanks to its support for many file formats, its flexible data structures and its wide range of tools for data access and querying. Data is simply copied into HDFS and can then be queried with different tools. All the existing data as well as incremental data from the various source systems will be loaded into HDFS for data analytics. In a more technical sense, a data lake is a set of tools for ingesting, transforming, storing, securing, recovering, accessing and analysing all the relevant data of the company. A data lake solution does not require any changes to the existing source systems; rather, it fits neatly into the existing IT landscape. Offloading ETL from the current EDW platform to a parallel, open-source, cost-effective, scale-out environment like Hadoop is another typical use case.
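
As a small illustration of that "copy in, then query" workflow, the following sketch lands a local extract in HDFS through the Hadoop FileSystem API and then counts the rows through HiveServer2 over JDBC. The host names, HDFS paths and the showroom_raw table are assumed placeholders, and the sketch assumes an external Hive table has already been defined over the raw directory.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Copy a local extract into the data lake, then query it through HiveServer2.
    // Host names, paths and the table name are placeholders for this sketch.
    public class LakeIngestAndQuery {

      public static void main(String[] args) throws Exception {
        // 1. Land the raw file in HDFS exactly as it arrived from the source system.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        try (FileSystem fs = FileSystem.get(conf)) {
          fs.copyFromLocalFile(
              new Path("/tmp/showroom_extract.csv"),      // local staging file
              new Path("/data/lake/raw/showroom/"));       // raw zone in the lake
        }

        // 2. Query the same data through HiveServer2 (assumes an external Hive
        //    table named showroom_raw is defined over the raw directory).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver.example.com:10000/default", "hive", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT COUNT(*) FROM showroom_raw")) {
          while (rs.next()) {
            System.out.println("Rows landed: " + rs.getLong(1));
          }
        }
      }
    }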

Hortonworks Hadoop Distribution

When it comes to selecting the right Hadoop flavour for enterprise deployment, we will be using the Hortonworks Hadoop Distribution. Hortonworks is an industry-leading innovator that creates, distributes and supports enterprise-ready open data platforms (100% open-source Apache Hadoop) and modern data applications that deliver actionable intelligence from all data: data-in-motion and data-at-rest (Open Enterprise Hadoop). Powered by the world’s most popular Hadoop distribution, Hortonworks makes Hadoop fast, easy and secure, so customers can focus on results. For organizations that want enterprise-grade Hadoop, Hortonworks is the smart, trusted choice. HDP & HDF have the flexibility to run a variety of enterprise workloads (for example batch processing, interactive SQL, enterprise search and advanced analytics) while meeting enterprise requirements such as integration with existing systems, robust security, governance, data protection and management.

About HDP

Hortonworks Data Platform is the industry's only truly secure, enterprise-ready open-source Apache™ Hadoop® distribution based on a centralized architecture (YARN). HDP addresses the complete needs of data-at-rest, powers real-time customer applications and delivers robust big data analytics that accelerate decision making and innovation.

YARN and the Hadoop Distributed File System (HDFS) are the cornerstone components of the Hortonworks Data Platform (HDP) for data-at-rest. While HDFS provides the scalable, fault-tolerant, cost-efficient storage for the big data lake, YARN provides the centralized architecture that enables us to process multiple workloads simultaneously. YARN provides resource management and a pluggable architecture that enables a wide variety of data access methods.
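
As a brief illustration of that centralized view, the sketch below uses the YarnClient API to list every application the ResourceManager knows about, whichever engine (MapReduce, Tez, Spark, etc.) submitted it. The ResourceManager address shown is a placeholder; on a real client it would normally be picked up from yarn-site.xml.

    import java.util.List;

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    // List every application registered with the ResourceManager,
    // regardless of which processing engine submitted it.
    public class ListYarnApplications {

      public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        // Placeholder address; normally resolved from yarn-site.xml on the client.
        conf.set("yarn.resourcemanager.address", "resourcemanager.example.com:8050");

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();
        try {
          List<ApplicationReport> apps = yarnClient.getApplications();
          for (ApplicationReport app : apps) {
            System.out.printf("%s  %-10s  %-12s  %s%n",
                app.getApplicationId(),
                app.getApplicationType(),
                app.getYarnApplicationState(),
                app.getName());
          }
        } finally {
          yarnClient.stop();
        }
      }
    }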

With regard to operations, Ambari is an open-source management platform for provisioning, managing, monitoring and securing the Hortonworks Data Platform. It enables Hadoop to fit seamlessly into an enterprise IT landscape.
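
Besides the web UI, Ambari exposes a REST API that scripts and monitoring tools can call. The following sketch polls the state of the HDFS service; the Ambari host, the cluster name (Sandbox) and the default admin/admin credentials are assumptions taken from a typical sandbox setup and should be changed in any real deployment.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Base64;

    // Poll the Ambari REST API for the state of the HDFS service.
    // Host, cluster name and credentials are sandbox-style placeholders.
    public class AmbariServiceCheck {

      public static void main(String[] args) throws Exception {
        String endpoint =
            "http://ambari.example.com:8080/api/v1/clusters/Sandbox/services/HDFS";
        String auth = Base64.getEncoder()
            .encodeToString("admin:admin".getBytes("UTF-8"));

        HttpURLConnection con =
            (HttpURLConnection) new URL(endpoint).openConnection();
        con.setRequestMethod("GET");
        con.setRequestProperty("Authorization", "Basic " + auth);

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), "UTF-8"))) {
          String line;
          while ((line = in.readLine()) != null) {
            System.out.println(line);   // JSON document describing the service
          }
        } finally {
          con.disconnect();
        }
      }
    }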

For this lab setup, download the Hortonworks Data Platform Sandbox. Follow the installation guide to deploy the sandbox properly, and check the sandbox documentation to get started. Alternatively, you may set up an HDP cluster in the AWS cloud.

Once you are familiar with the HDP, follow this guide to build an Enterprise Data Lake in Hadoop.


  • Introduction to Apache Hadoop

    Apache Hadoop is the next big data platform. It is an open-source, Java-based software framework for reliable, scalable & distributed computing. Apache Hadoop allows distributed processing of very large data sets across clusters of...

  • Hadoop MapReduce Basics

    Hadoop, since its inception, has been changing the way enterprises store, process and analyse data. MapReduce is the core part of the Hadoop framework; we can also call it the core processing engine of Hadoop. It is a programming model...

  • SQOOP import from Oracle

    In this article we will use Apache Sqoop to import data from an Oracle database. Now that we have an Oracle server ready in our cluster, let us log in to the EdgeNode. Next we will configure Sqoop to import this data into the HDFS file system followed by direct...

  • Install SQOOP in Client Node of Hadoop Cluster

    Sqoop is an open-source software product of the Apache Software Foundation in the Hadoop ecosystem, designed to transfer data between Hadoop and relational databases or mainframes. Sqoop can be used to import data from a relational database...

  • SQOOP import from MySQL

    In this article we will use Apache Sqoop to import data from a MySQL database. For that, let us create a MySQL database and user and quickly load some data. Let us download the sample MySQL database named Sakila from the internet to get started. Next we will...

  • Install SPARK in Hadoop Cluster

    Apache Spark is a fast and general-purpose engine for large-scale data processing over a distributed cluster. Apache Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing. Spark runs programs up to 100x...

  • Set up Client Node (Gateway Node) in Hadoop Cluster

    Once we have our multi-node Hadoop cluster up and running, let us create an EdgeNode or GatewayNode. Gateway nodes are the interface between the Hadoop cluster and the outside network. Edge nodes are used to run client applications and cluster...

  • SQOOP Merge & Incremental Extraction from Oracle

    Let us check how to perform Incremental Extraction & Merge using Sqoop. The Sqoop merge utility allows us to combine two datasets, where entries in one dataset overwrite entries of an older dataset. For example, an incremental import run in...

  • Configuring MySQL as Hive Metastore

    In the previous article, we learnt how to install and configure Hive with the default Derby metastore. However, an embedded Derby-based metastore can process only one request at a time. Since this is very restrictive, we will set up a traditional...

  • Hadoop DataLake Implementation Part 10

    In this article we will create an Oozie workflow to orchestrate the daily loading of the showroom dimension table from the MySQL source to HDFS using Sqoop, followed by loading the data from HDFS into the Hive warehouse using Hive, and finally housekeeping & archival.