Hadoop DataLake Implementation Part 2

Updated on Oct 03, 2020

Now that we are familiar with HDP stack, in this article we are going to access HDP sandbox command line, Ambari Web UI, Hive & Ranger to create a user for our implementation setup.

Once the HDP sandbox and/or cluster is Up & Running, use any preferred SSH client tool to login to the HDP machine (sandbox-hdp.hortonworks.com) with user root and password hadoop. At first login it may prompt you to reset root password. Once you are able to login to the HDP node, follow the next steps.

In our lab setup the HDP 2.6.4 Sandbox is running on VMWare with a IP address as 192.168.136.139 and ssh access on port 2222. So change your IP address accordingly in the rest of the lab setup. To know more about the sandbox stack details check the file under /opt/sandbox.info (e.g. OS Version: CentOS release 6.9)

Note:

Sandbox Host: ssh root@192.168.136.139 -p22

Sandbox HDP: ssh root@192.168.136.139 -p2222

Before we proceed let us add host alias entry in the hosts file, in windows - C:\Windows\System32\drivers\etc\hosts

192.168.136.139 sandbox-hdp.hortonworks.com

First let us create a unix user named edw_user with group as hdfs followed by the user’s home directory in HDFS. Set the password of edw_user to hadoop.

ssh root@192.168.136.139 -p 2222

useradd -G hdfs edw_user

passwd edw_user

sudo -u hdfs hdfs dfs -mkdir /user/edw_user

sudo -u hdfs hdfs dfs -chown -R edw_user:hdfs /user/edw_user

sudo -u hdfs hdfs dfs -chmod -R 770 /user/edw_user

exit

Ambari Sandbox

Login to Ambari Web Admin Console http://sandbox-hdp.hortonworks.com:8080 with user admin & password admin. If you are unable to login you may require to reset the admin login password. If so, then from the shell command line console use ambari-admin-password-reset and then reset the password accordingly. First verify that all the components under each of the services namely- HDFS, YARN, MapReduce2, Hive, Oozie, Zookeeper, Flume & Ranger are up & running. If not, start all the relevant components & services.

Next under Ambari Views click on the Files View link and verify that the edw_user folder & permissions under the user directory are all good.

Next go to admin-> Manage Ambari. Under User + Group Management, click on Users. Next click on Create Local User. Enter the user credentials as below and save:

Username: edw_user
Type: Local
Status: Active
Ambari Admin: No
Password: hadoop

Next click on Roles under Sandbox. Add edw_user under the Role Service Operator. Next click on Views. Go to Workflow Manager view. Under the Local Cluster Permissions Grant Use permission to the Service Operator role. Once done Restart the Ambari Server.

ssh root@192.168.136.139 -p 2222

ambari-server stop

ambari-server start

exit

Ranger

Next click on Ranger Service, select Quick Links and go to Ranger Admin UI http://sandbox-hdp.hortonworks.com:6080. Login to Ranger with username admin & password admin. Under Settings -> Users/Groups verify that edw_user User & Group has been created automatically.

Next go to Access Manager->Resource Based Policies. Next click on Sandbox_hive Policies. Click on Add New Policy to create a new Hive Policy, as per below:

Policy Type: Access
Policy Name: policy for edw_user; enabled
Database: sales_analytics; include
Table: *; include
Hive Column: *; include
Audit Logging: Yes
Description: Policy for edw_user for hive access
Select User: edw_user
Permissions: All

Lab Setup

In this Lab we will demonstrate a typical analytical use case of car sales and stocks. Download the sample data files for this exercise from the links mentioned below. There is one data file which contains date attributes in order to create a conformed date dimension table. Apart from that there are 4 SQL files to create & load sample database table entities like customer, product, showroom, sales & stocks. In this exercise we will use the MySQL database available in the HDP Sandbox to simulate as our dummy source system. Further we will load these structured dataset’s into Hadoop HDFS, followed by creating an analytical system in Hadoop as Hive warehouse. Once the data is loaded as Hadoop Hive Tables, we can use SQL Query to analyse our data, same like as we do in a typical data warehouse environment. In this article we will also demonstrate the initial setup & initial data loading as well as delta setup & incremental data loading. Also we will learn how to create & schedule a workflow for daily data loading in Hive Warehouse. This exercise will provide an end-to-end Data Lake implementation overview using Hadoop.

As a part of Initial Setup, we will create a directory and will download the below data files.

https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/dates.psv

https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/showroom.sql

https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/customer.sql

https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/product.sql

https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/sales.sql

https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/stocks.sql

ssh edw_user@192.168.136.139 -p 2222
mkdir /home/edw_user/sampledata
cd /home/edw_user/sampledata
wget https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/dates.psv
wget https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/showroom.sql
wget https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/customer.sql
wget https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/product.sql
wget https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/sales.sql
wget https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/stocks.sql
ls -ltra /home/edw_user/sampledata
exit

Once the files are downloaded we should have the following files in the /home/edw_user/sampledata directory - dates.psv, showroom.sql, customer.sql, product.sql, sales.sql, stocks.sql.

Implementation Approach

Firstly we will be load the initial dataset into Hadoop cluster followed by incremental/delta data loading.

For that we will load the Dates data file and create a hive managed table to act as our conformed date dimension.

Next we will on-board the Showroom & Customer data from MySQL RDBMS to Hadoop using Apache Sqoop.

Hive managed tables for Showroom & Customer data will be maintained as Slowly Changing Dimension of Type 1, i.e. Insert else Update.

We will look into two sets of approach while loading these two tables, to implement Merge logic- one using Sqoop Merge & another using Hive SQL. With the assumption that the Customer dimension is a big table and Hive OVERWRITE table is not an option, rather APPEND is the right way to go.

The hive OVERWRITE option requires moving the complete record set from source to Hadoop. While this approach may work for smaller data sets, it may be prohibitive at scale. The APPEND option can limit data movement only to new or updated records. As true Inserts and Updates are not yet available in Hive, we need to consider a process of preventing duplicate records as Updates are appended to the cumulative record set.

HDFS follows the write-once, read-many approach for its files and applications. It assumes that a file in HDFS once written will not be modified, though it can be access ‘n’ number of times. This assumption enables high throughput data access and also simplifies data coherency issues.

With regards to Product data, we assume that we want to maintain it as a Slowly Changing Dimension of Type 2 in our Hadoop system, so as to analyse the sales and stock data with regards to historical data of product master data.

We will manage the SCD type 2 implementation in the RDBMS system ( consider another schema of the MySQL database, let’s assume it as an EDW managed table ) which takes care of incremental extraction as well as history maintenance.

We will bulk export the data from EDW as flat-files from the SCD Type2 table, import into HDFS & finally a Hive managed table to act as the Product Dimension table.

With regards to the Sales & Stock transaction data Hive managed partitioned tables will be used. For data on-boarding we will use Sqoop to extract/load the initial as well as daily delta dataset from OLTP to Hadoop.

Here we will use Pig Script to replace the Natural keys with the Surrogate keys for the Product master data present for each transaction records, to enable analysis of Sales transaction based on product history.

Data will be appended on a daily basis to the partitioned Hive fact tables.

With regards to workflow and scheduling to orchestrate the data loading process in HDFS & Hive, we will use Apache Oozie. For the ease of use we will leverage the Workflow Manager View of Hortonworks. So overall, the components being used for this exercise are- HDFS, Sqoop, Pig, Hive and Oozie. Hence basic understanding or overview of these tools is expected.

In the next article, we will build our OLTP source system based on the downloaded data files. After which we will setup and configure HDFS directory, Sqoop Metastore,Oozie & Hive Optimization.

Hadoop DataLake Implementation Part 2

Ambari Sandbox

Ranger

Lab Setup

Implementation Approach

PrimeChess.org

Top 10 Articles