Now that we are familiar with the HDP stack, in this article we are going to access the HDP sandbox command line, the Ambari Web UI, Hive and Ranger to create a user for our implementation setup.

Once the HDP sandbox and/or cluster is up and running, use any preferred SSH client to log in to the HDP machine (sandbox-hdp.hortonworks.com) with user root and password hadoop. At first login it may prompt you to reset the root password. Once you are able to log in to the HDP node, follow the next steps.

In our lab setup the HDP 2.6.4 Sandbox is running on VMware with the IP address 192.168.136.139 and SSH access on port 2222; adjust the IP address accordingly in the rest of the lab setup. To know more about the sandbox stack details, check the file /opt/sandbox.info (e.g. OS Version: CentOS release 6.9).

Note:
Sandbox Host: ssh root@192.168.136.139 -p 22
Sandbox HDP: ssh root@192.168.136.139 -p 2222

Before we proceed, let us add a host alias entry in the hosts file (on Windows: C:\Windows\System32\drivers\etc\hosts):

192.168.136.139 sandbox-hdp.hortonworks.com

First let us create a Unix user named edw_user with the group hdfs, followed by the user's home directory in HDFS. Set the password of edw_user to hadoop.

ssh root@192.168.136.139 -p 2222
useradd -G hdfs edw_user
passwd edw_user
sudo -u hdfs hdfs dfs -mkdir /user/edw_user
sudo -u hdfs hdfs dfs -chown -R edw_user:hdfs /user/edw_user
sudo -u hdfs hdfs dfs -chmod -R 770 /user/edw_user
exit
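
As an optional sanity check (a minimal check; adjust the IP address to your setup), confirm that the OS user exists and that the HDFS home directory carries the expected owner and permissions:

ssh root@192.168.136.139 -p 2222
id edw_user
sudo -u hdfs hdfs dfs -ls /user | grep edw_user
exit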

Ambari Sandbox

Log in to the Ambari Web Admin Console at http://sandbox-hdp.hortonworks.com:8080 with user admin and password admin. If you are unable to log in, you may need to reset the admin password; if so, run ambari-admin-password-reset from the shell and set the password accordingly. First verify that all the components under each of the services, namely HDFS, YARN, MapReduce2, Hive, Oozie, ZooKeeper, Flume and Ranger, are up and running. If not, start all the relevant components and services.
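
If the default admin/admin login does not work, the password reset mentioned above can be run from the sandbox shell (a quick reference; the script prompts interactively for the new password):

ssh root@192.168.136.139 -p 2222
ambari-admin-password-reset
exit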

Next, under Ambari Views, click on the Files View link and verify that the edw_user folder and its permissions under the /user directory are correct.

Next go to admin -> Manage Ambari. Under User + Group Management, click on Users, then click on Create Local User. Enter the user credentials as below and save:

Username: edw_user
Type: Local
Status: Active
Ambari Admin: No
Password: hadoop

Next click on Roles under the Sandbox cluster and add edw_user to the Service Operator role. Then click on Views and go to the Workflow Manager view. Under Local Cluster Permissions, grant the Use permission to the Service Operator role. Once done, restart the Ambari server:

ssh root@192.168.136.139 -p 2222
ambari-server stop
ambari-server start
exit

Ranger

Next, in Ambari, click on the Ranger service, select Quick Links and go to the Ranger Admin UI at http://sandbox-hdp.hortonworks.com:6080. Log in to Ranger with username admin and password admin. Under Settings -> Users/Groups, verify that the edw_user user and group have been created automatically.
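
As an optional command-line check (a sketch only: it assumes the default admin/admin credentials and the internal /service/xusers/users endpoint, whose path may vary between Ranger versions), the Ranger REST API can confirm that the user was synced:

curl -s -u admin:admin http://sandbox-hdp.hortonworks.com:6080/service/xusers/users | grep edw_user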

Next go to Access Manager -> Resource Based Policies and click on Sandbox_hive Policies. Click on Add New Policy to create a new Hive policy as below:

Policy Type: Access
Policy Name: policy for edw_user; enabled
Database: sales_analytics; include
Table: *; include
Hive Column: *; include
Audit Logging: Yes
Description: Policy for edw_user for hive access
Select User: edw_user
Permissions: All
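
Once the policy is saved, you can optionally confirm that edw_user can reach HiveServer2 from the command line. This is a minimal check, assuming the default HiveServer2 port 10000 and the non-Kerberized sandbox authentication; note that the sales_analytics database referenced in the policy will only be created later in this series:

ssh edw_user@192.168.136.139 -p 2222
beeline -u "jdbc:hive2://sandbox-hdp.hortonworks.com:10000/default" -n edw_user -p hadoop -e "SHOW DATABASES;"
exit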

Lab Setup

In this lab we will demonstrate a typical analytical use case of car sales and stocks. Download the sample data files for this exercise from the links below. One data file contains date attributes and will be used to create a conformed date dimension table. Apart from that, there are five SQL files to create and load the sample source entities: customer, product, showroom, sales and stocks. In this exercise we will use the MySQL database available in the HDP Sandbox to simulate our source system. We will load these structured datasets into HDFS and then build an analytical system on top of them as a Hive warehouse. Once the data is available as Hive tables, we can use SQL queries to analyse our data, just as we would in a typical data warehouse environment. We will also demonstrate the initial setup and initial data loading as well as the delta setup and incremental data loading, and we will learn how to create and schedule a workflow for daily data loading into the Hive warehouse. This exercise provides an end-to-end Data Lake implementation overview using Hadoop.

As part of the initial setup, we will create a directory and download the data files listed below:

https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/dates.psv
https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/showroom.sql
https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/customer.sql
https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/product.sql
https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/sales.sql
https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/stocks.sql

ssh edw_user@192.168.136.139 -p 2222
mkdir /home/edw_user/sampledata
cd /home/edw_user/sampledata
wget https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/dates.psv
wget https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/showroom.sql
wget https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/customer.sql
wget https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/product.sql
wget https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/sales.sql
wget https://s3-ap-southeast-1.amazonaws.com/dwbi-datalake/dataset/stocks.sql
ls -ltra /home/edw_user/sampledata
exit

Once the downloads complete, the /home/edw_user/sampledata directory should contain the following files: dates.psv, showroom.sql, customer.sql, product.sql, sales.sql and stocks.sql.
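
To get a quick feel for the data before loading it, you can peek at the date file (assuming, as the .psv extension suggests, that it is pipe-delimited):

ssh edw_user@192.168.136.139 -p 2222
head -5 /home/edw_user/sampledata/dates.psv
exit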

Implementation Approach

First we will load the initial dataset into the Hadoop cluster, followed by incremental/delta data loading.

  • We will load the dates data file and create a Hive managed table to act as our conformed date dimension.
  • Next we will on-board the showroom and customer data from the MySQL RDBMS to Hadoop using Apache Sqoop.
  • Hive managed tables for the showroom and customer data will be maintained as Slowly Changing Dimensions of Type 1, i.e. insert else update.
  • We will look into two approaches to implement the merge logic while loading these two tables: one using Sqoop merge and another using Hive SQL, under the assumption that the customer dimension is a big table, so a full Hive OVERWRITE is not an option and APPEND is the right way to go.

The Hive OVERWRITE option requires moving the complete record set from the source to Hadoop. While this approach may work for smaller data sets, it can be prohibitive at scale. The APPEND option limits data movement to only new or updated records. As true inserts and updates are not yet available in Hive, we need a process that prevents duplicate records as updates are appended to the cumulative record set.
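
To make the append-and-reconcile idea concrete, the sketch below appends delta rows to a base table and exposes only the latest version of each key through a view. It is illustrative only: customer_base, customer_incr, id, name and modified_date are hypothetical names, not the actual tables built later in this series.

ssh edw_user@192.168.136.139 -p 2222
beeline -u "jdbc:hive2://sandbox-hdp.hortonworks.com:10000/default" -n edw_user -p hadoop <<'EOF'
-- append the incremental extract to the cumulative base table
INSERT INTO TABLE customer_base SELECT * FROM customer_incr;
-- the reconciliation view keeps only the most recent row per business key
CREATE VIEW IF NOT EXISTS customer_reconciled AS
SELECT id, name, modified_date
FROM (
  SELECT id, name, modified_date,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY modified_date DESC) AS rn
  FROM customer_base
) t
WHERE rn = 1;
EOF
exit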

HDFS follows a write-once, read-many approach for its files and applications. It assumes that a file in HDFS, once written, will not be modified, though it can be read any number of times. This assumption enables high-throughput data access and also simplifies data coherency issues.

  • With regard to the product data, we will maintain it as a Slowly Changing Dimension of Type 2 in our Hadoop system, so as to analyse the sales and stock data against the historical versions of the product master data.
  • We will manage the SCD Type 2 implementation in the RDBMS (another schema of the MySQL database, which we treat as an EDW managed table), taking care of incremental extraction as well as history maintenance.
  • We will bulk export the data from the EDW SCD Type 2 table as flat files, import them into HDFS and finally into a Hive managed table that acts as the product dimension table.
  • For the sales and stock transaction data, Hive managed partitioned tables will be used. For data on-boarding we will use Sqoop to extract/load the initial as well as the daily delta dataset from the OLTP system to Hadoop.
  • We will use a Pig script to replace the natural product key in each transaction record with the surrogate key from the product dimension, to enable analysis of sales transactions based on product history (a minimal sketch follows this list).
  • Data will be appended on a daily basis to the partitioned Hive fact tables.
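
The sketch referenced in the list above shows how a Pig script can swap the natural product key in each transaction record for the surrogate key of the current dimension row. All paths, relation names and field layouts here are illustrative assumptions, not the actual layout used later in this series:

ssh edw_user@192.168.136.139 -p 2222
cat > /home/edw_user/sampledata/resolve_product_key.pig <<'EOF'
-- load staged sales records and the product dimension (pipe-delimited, hypothetical layouts)
sales     = LOAD '/user/edw_user/sales_stage' USING PigStorage('|')
            AS (sale_id:int, prod_code:chararray, amount:double);
product   = LOAD '/user/edw_user/dim_product' USING PigStorage('|')
            AS (prod_key:int, prod_code:chararray, curr_flag:chararray);
-- keep only the current version of each product (SCD2 current-flag = 'Y')
curr_prod = FILTER product BY curr_flag == 'Y';
-- join on the natural key and project the surrogate key into the fact record
joined    = JOIN sales BY prod_code, curr_prod BY prod_code;
resolved  = FOREACH joined GENERATE sales::sale_id, curr_prod::prod_key, sales::amount;
STORE resolved INTO '/user/edw_user/sales_resolved' USING PigStorage('|');
EOF
pig /home/edw_user/sampledata/resolve_product_key.pig
exit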

To orchestrate and schedule the data loading process in HDFS and Hive, we will use Apache Oozie. For ease of use we will leverage the Hortonworks Workflow Manager View. So overall, the components used in this exercise are HDFS, Sqoop, Pig, Hive and Oozie; a basic understanding of these tools is therefore expected.
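
For reference, the workflows that the Workflow Manager View builds can also be submitted with the Oozie command-line client (a sketch; job.properties stands for a hypothetical properties file defining the workflow path and cluster parameters):

ssh edw_user@192.168.136.139 -p 2222
oozie job -oozie http://sandbox-hdp.hortonworks.com:11000/oozie -config job.properties -run
# check the status of the submitted workflow using the job id returned above
oozie job -oozie http://sandbox-hdp.hortonworks.com:11000/oozie -info <job-id>
exit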

In the next article, we will build our OLTP source system based on the downloaded data files, after which we will set up and configure the HDFS directories, the Sqoop metastore, Oozie and Hive optimization settings.

