In this multi-part series we will learn how to implement an enterprise data lake using Apache Hadoop, an open-source, Java-based software framework for reliable, scalable and distributed computing. Apache Hadoop addresses the limitations of traditional computing, helps businesses overcome real challenges, and powers new types of Big Data analytics.

Apache Hadoop allows distributed parallel processing of very large data sets across clusters of commodity machines (low-cost hardware) using simple programming models. The framework scales horizontally: capacity grows by adding worker nodes to the cluster. Hadoop was designed from the start to process any kind of data, whether structured, semi-structured or unstructured, and it does not enforce a schema or structure on the data it stores. Hadoop combines distributed storage with a distributed processing framework, which is essential for analysing unstructured data given its size and complexity. The Hadoop framework library is designed to detect and handle failures at the application layer, on top of a cluster of computers, each of which may be prone to failure. This is how the software library provides high availability and resilience instead of simply relying on high-end hardware.
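To make the "simple programming models" concrete, here is a minimal sketch of the map, shuffle and reduce steps behind a word count. It is plain Python running in a single process, not Hadoop's Java API; the function names and the sample input are our own assumptions, and each inner list only stands in for an input split that a worker node would process.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, the step that sits between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values collected for one key into a single count."""
    return key, sum(values)


def word_count(splits: List[List[str]]) -> Dict[str, int]:
    """Run map over every split, then shuffle and reduce the combined output."""
    pairs: List[Tuple[str, int]] = []
    for split in splits:  # in a real cluster, each split is handled by a different worker
        for line in split:
            pairs.extend(map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Two hypothetical input splits, one line each.
    sample_splits = [["big data on hadoop"], ["hadoop stores big data"]]
    print(word_count(sample_splits))
```

Because the map step looks at one line at a time and the reduce step at one key at a time, the same logic can be spread over many machines, which is what lets the framework scale by adding worker nodes.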

Hadoop DataLake

Here we will design and implement an enterprise data lake: an enterprise-wide system for storing vast amounts of data and analysing disparate data sources in their native formats, on clusters of commodity hardware. The goal is to break down the information silos in the enterprise by bringing all the data into a single place for analysis without the restrictions of schema, security, or authorization. Given the variety of existing data, Hadoop represents an ideal platform for data lakes, thanks to its support for many file formats, dynamic data structures, and a variety of tools for data access and querying. Data is simply copied into HDFS and can then be queried with different tools. All existing data, as well as incremental data from the various source systems, will be loaded into HDFS for data analytics. In a more technical sense, a data lake is a set of tools for ingesting, transforming, storing, securing, recovering, accessing, and analysing all the relevant data of the company. A data lake solution does not require any changes to the existing source systems; rather, it fits into the existing IT landscape. Offloading ETL from the current EDW platform to a parallel, open-source, cost-effective, scale-out environment like Hadoop is another typical use case.
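The idea of storing data in its native format and applying structure only at query time (often called schema-on-read) can be sketched in a few lines. This is an illustration under assumptions, not part of the HDP setup: the in-memory dictionary stands in for raw files copied into HDFS, and the file names, fields and values are hypothetical.

```python
import csv
import io
import json
from typing import Dict, Iterator

# Raw content kept exactly as the (hypothetical) source systems produced it.
RAW_FILES: Dict[str, str] = {
    "orders.csv": "order_id,amount\n1,250.00\n2,99.50\n",
    "clicks.json": '{"order_id": "3", "amount": "12.25", "page": "/home"}\n',
}


def read_records(name: str, raw: str) -> Iterator[Dict[str, str]]:
    """Parse one raw file according to its native format."""
    if name.endswith(".csv"):
        yield from csv.DictReader(io.StringIO(raw))
    elif name.endswith(".json"):
        for line in raw.splitlines():  # one JSON document per line
            if line.strip():
                yield json.loads(line)
    else:
        raise ValueError(f"unsupported format: {name}")


def apply_schema(record: Dict[str, str]) -> Dict[str, object]:
    """Project a raw record onto the schema this particular analysis needs."""
    return {"order_id": int(record["order_id"]), "amount": float(record["amount"])}


def query_total_amount(files: Dict[str, str]) -> float:
    """Sum the amount field across every file, whatever its storage format."""
    return sum(
        apply_schema(record)["amount"]
        for name, raw in files.items()
        for record in read_records(name, raw)
    )


if __name__ == "__main__":
    print(query_total_amount(RAW_FILES))
```

Nothing was converted or validated when the files were stored; the extra `page` field in the JSON record is simply ignored by this query and remains available to any other analysis that needs it.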

Hortonworks Hadoop Distribution

For the enterprise deployment we will use the Hortonworks Hadoop distribution. Hortonworks creates, distributes and supports enterprise-ready open data platforms (100% open source Apache Hadoop) and modern data applications that deliver actionable intelligence from all data, both data-in-motion and data-at-rest (Open Enterprise Hadoop). Powered by the world’s most popular Hadoop distribution, Hortonworks makes Hadoop fast, easy, and secure, so customers can focus on results. For organizations that want enterprise-grade Hadoop, Hortonworks Hadoop is the smart, trusted choice. HDP and HDF have the flexibility to run a variety of enterprise workloads (for example, batch processing, interactive SQL, enterprise search and advanced analytics) while meeting enterprise requirements such as integration with existing systems, robust security, governance, data protection, and management.

About HDP

Hortonworks Data Platform (HDP) is the industry's only truly secure, enterprise-ready open source Apache™ Hadoop® distribution based on a centralized architecture (YARN). HDP addresses the complete needs of data-at-rest, powers real-time customer applications and delivers robust big data analytics that accelerate decision making and innovation.

YARN and the Hadoop Distributed File System (HDFS) are the cornerstone components of Hortonworks Data Platform (HDP) for data-at-rest. While HDFS provides scalable, fault-tolerant, cost-efficient storage for the data lake, YARN provides the centralized architecture that lets us process multiple workloads simultaneously. YARN provides the resource management and pluggable architecture that enable a wide variety of data access methods.

With regard to operations, Ambari is an open source management platform for provisioning, managing, monitoring, and securing the Hortonworks Data Platform. It enables Hadoop to fit seamlessly into an enterprise IT landscape.
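Besides its web UI, Ambari exposes a REST API, and a quick way to check that a cluster is reachable is to list the clusters it manages. The sketch below assumes Ambari listens on port 8080 and accepts HTTP basic authentication; the host, user and password are placeholders, and the response layout (an `items` list with a `Clusters.cluster_name` field) should be verified against your Ambari version.

```python
import base64
import json
import urllib.request
from typing import List


def build_clusters_request(host: str, user: str, password: str,
                           port: int = 8080) -> urllib.request.Request:
    """Prepare a GET request for Ambari's cluster listing endpoint."""
    url = f"http://{host}:{port}/api/v1/clusters"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def list_cluster_names(request: urllib.request.Request,
                       timeout: float = 10.0) -> List[str]:
    """Send the request and return the cluster names found in the response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # Assumed response shape: {"items": [{"Clusters": {"cluster_name": "..."}}]}
    return [item["Clusters"]["cluster_name"] for item in payload.get("items", [])]


if __name__ == "__main__":
    # Placeholder values; replace them with those of your own sandbox or cluster.
    req = build_clusters_request("localhost", "admin", "your-password")
    print(req.full_url)
    # Uncomment once an Ambari server is actually running:
    # print(list_cluster_names(req))
```

Keeping request construction separate from the network call makes it easy to inspect the URL and headers before sending anything to a live server.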

For this lab setup, download the Hortonworks Data Platform Sandbox and follow the installation guide to deploy it properly. To get started, check the sandbox documentation. Alternatively, you may set up an HDP cluster in the AWS cloud.

Once you are familiar with HDP, follow this guide to build an enterprise data lake in Hadoop.
