Sure enough, you have heard the term, "Big Data" many times before. There is no dearth of information in the Internet and printed medium about this. But guess what, this term still remains vaguely defined and poorly understood. This essay is our effort to describe big data in simple technical language, stripping-off all the marketing lingo and sales jargons. Shall we begin?
In my previous article – “Fools guide to Big Data” – we have discussed about the origin of Bigdata and the need of big data analytics. We have also noted that Big Data is data that is too large, complex and dynamic for any conventional data tools (such as RDBMS) to compute, store, manage and analyze within a practical timeframe. In the next few articles, we will familiarize ourselves with the tools and techniques for processing Bigdata.
The Apache Hadoop is next big data platform. Apache Hadoop is an open-source, java-based framework software for reliable, scalable & distributed computing. The Apache Hadoop allows distributed processing of very large data sets across clusters of commodity machines (low-cost hardware computers) using simple programming models.
In this article we will learn about the Apache Hadoop framework architecture. The basic components of the Apache Hadoop HDFS & MapReduce engine are discussed in brief.
Setting up Hadoop in a single machine is easy, but no fun. Why? Because Hadoop is not meant for a single machine. Hadoop is meant to run on a computing cluster comprising of many machines. Running HDFS and MapReduce on a single machine is great for learning about these systems, but to do useful work we need to run Hadoop on multiple nodes. There are a few options when it comes to staring a Hadoop cluster, from building our own to running on rented hardware, or using any offering that provides Hadoop as a service in the cloud. But, how can we - the learners, the beginners, the amateurs - take advantage of multi-node Hadoop cluster? Well, allow us to show you.
The Hadoop, since its inception is changing the way the enterprises store, process and analyse data. MapReduce is the core part of the Hadoop framework and we can also call it as the core processing engine of Hadoop. It is a programming model designed to process large amount of data in parallel by dividing the load across multiple nodes in a Hadoop cluster. Continuing on our previous discussion on MapReduce, let's go deeper in this article.
Once we have our multi-node hadoop cluster up and running, let us create an EdgeNode or a GatewayNode. Gateway nodes are the interface between the Hadoop cluster and the outside network. Edge nodes are used to run client applications and cluster administration tools.
In the previous article, we have shown how to setup a client node. Once this is done, now let's put Hadoop to use for some big data analytics purpose. One way to do that is by using Hive which let's us run SQL queries against the big data. A command line tool and JDBC driver are provided out of the box to connect users to Hive. Let's install Hive now.
In the previous article, we have learnt How to Install and Configure Hive with default Derby metastore. However, an embedded derby based metastore can process only one request at a time. Since this is very restrictive, we will setup a traditional relational database (in this case MySQL) to handle multi-client concurrency. In this article we will configure MySQL database as Hive metadata Metastore.
Sqoop is an open source software product of the Apache Software Foundation in the hadoop ecosystem, designed to transfer data between Hadoop and relational databases or mainframes. Sqoop can be used to import data from a relational database management system (RDBMS) such as MySQL , Oracle, MSSQL, PostgreSQL or a mainframe into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
In this article we will use Apache SQOOP to import data from MySQL database. For that let us create a MySql database & user and dump some data quickly. Let us download a MySQL database named Sakila Db from internet to get started. Next we will configure sqoop to import this data in HDFS file system followed by direct import into Hive tables.
We would like to perform practical test of Apache SQOOP import/export utility between ORACLE relational database & Apache HADOOP file system, let us quickly setup an ORACLE server. For that we will be using cloud based services/servers as we did previously using Digital Ocean.
In this article we will use Apache SQOOP to import data from Oracle database. Now that we have an oracle server in our cluster ready, let us login to EdgeNode. Next we will configure sqoop to import this data in HDFS file system followed by direct import into Hive tables.
Let us check how to perform Incremental Extraction & Merge using Sqoop. The SQOOP Merge utility allows to combine two datasets where entries in one dataset should overwrite entries of an older dataset. For example, an incremental import run in last-modified mode will generate multiple datasets in HDFS where successively newer data appears in each dataset. The merge tool will "flatten" two datasets into one, taking the newest available records for each primary key or merge key.
Apache Pig is a platform for analyzing large data sets. Pig Latin is the high level programming language that, lets us specify a sequence of data transformations such as merging data sets, filtering them, grouping them, and applying functions to records or groups of records.
Apache Flume is a distributed, robust, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data or streaming event data from different sources to a centralized data store. Its main goal is to deliver log data from various application or web servers to Apache Hadoop's HDFS. Flume supports a large set of sources and destinations types.
Apache HBase provides large-scale tabular storage for Hadoop using the Hadoop Distributed File System (HDFS). Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable. HBase is used in cases where we require random, realtime read/write access to Big Data. We can host very large tables (billions of rows X millions of columns) atop clusters of commodity hardware using HBase. In this article we will Install HBase in a fully distributed hadoop cluster.
Apache Spark is a fast and general purpose engine for large-scale data processing over a distributed cluster. Apache Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing. Spark run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). To check SPARK in action let us first install SPARK on Hadoop YARN.
In this article we will use Apache Flume to gather stream access log data from our remote Web Server into Hadoop Distributed File System. We will be analyzing the access log in a real-time basis. So we have to setup Flume such that it collects the access log information from the web server and pushes it to the hadoop cluster. Once the data is in our HDFS, we can analayze better using HIVE. Lets check the multiple Flume agent configurations.
In this multi-series article we will learn how to implement an Enterprise DataLake using Apache Hadoop, an open-source, java-based software framework for reliable, scalable & distributed computing. Apache Hadoop addresses the limitations of traditional computing, helps businesses overcome real challenges, and powers new types of Big Data analytics.
Now that we are familiar with HDP stack, in this article we are going to access HDP sandbox command line, Ambari Web UI, Hive & Ranger to create a user for our implementation setup.
To complete our implementation setup we will create the source tables based on the downloaded datafiles. Let us first load the SQL files in MySQL server under a new database called ‘sales’. We will simulate this database schema as our OLTP source system.
Now that our dummy OLTP source system & Hadoop HDFS directory structure is ready, we will first load the ‘dates’ data file in HDFS and further to a hive table.
In this article we will load the showroom master data from MySQL source system to HDFS using Sqoop as SCD Type 1.
In this article we will load the Customer data in the Hive warehouse as SCD Type 1. This time we will follow a different approach to implement Insert/Update or Merge strategy using Hive QL, rather than SQOOP Merge utility
In this article we will load our master data table ‘Product’ as Slowly Changing Dimension of Type 2 to maintain full history, so as to analyze the sales and stocks data with reference to the historical master data.
In this article we will load our first fact table into Hive warehouse which is sales transactions.
In this article we will load our final fact table i.e. stock.
In this article we will create oozie workflow to orchestrate the daily loading of showroom dimension table from MySQL source to HDFS using Sqoop, followed by Loading data from HDFS to Hive warehouse using Hive and finally housekkeping & archive.