To complete our implementation setup, we will create the source tables based on the downloaded data files. Let us first load the SQL files into the MySQL server under a new database called ‘sales’. We will simulate this database schema as our OLTP source system. Once the scripts are executed, we will have three master data entities (customer, product and showroom) and two transaction data tables (sales and stocks). Let us also assume that we want to maintain history for the product master data in a separate schema ‘sales_dw’, which simulates a dummy OLAP data warehouse layer. For this lab exercise we will use the MySQL database available in the HDP Sandbox. Log in to MySQL with user root and password hadoop.

OLTP Source Setup

ssh edw_user@192.168.136.139 -p 2222
mysql -u root -p
show databases;
use mysql;
select user,host from user;
create database sales;
create database sales_dw;
show databases;
exit;

Next, create and load the data tables based on the downloaded data files. When prompted for a password, enter hadoop.

mysql -u root -p sales < /home/edw_user/sampledata/showroom.sql
mysql -u root -p sales < /home/edw_user/sampledata/customer.sql
mysql -u root -p sales < /home/edw_user/sampledata/product.sql
mysql -u root -p sales < /home/edw_user/sampledata/sales.sql
mysql -u root -p sales < /home/edw_user/sampledata/stocks.sql

Verify the dummy OLTP tables are created and loaded in the ‘sales’ schema.

mysql -u root -p
use sales;
show tables;
# Dummy data-set preparation
update sales, product set sales.amount = product.price where sales.product_id = product.id;
commit;
select count(1) from sales;
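# Optional spot-check (illustrative): after the update, sales amounts should match product prices
select s.product_id, s.amount, p.price from sales s join product p on s.product_id = p.id limit 5;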
exit;

Now let us create the dummy Product Dimension table as an SCD Type 2 in the ‘sales_dw’ schema.

mysql -u root -p
use sales_dw;
create table if not exists dim_product (
prod_key INT NOT NULL AUTO_INCREMENT,
id INT,
code VARCHAR(50),
category VARCHAR(6),
make VARCHAR(50),
model VARCHAR(50),
year VARCHAR(50),
color VARCHAR(50),
price INT,
currency VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP,
start_date DATE,
end_date DATE,
active_flag VARCHAR(1),
load_date DATE,
PRIMARY KEY(prod_key)
);

# Initial load Product Dimension
INSERT INTO dim_product (id, code, category, make, model, year, color, price, currency, update_date, create_date, start_date, end_date, active_flag, load_date) 
SELECT id, code, category, make, model, year, color, price, currency, update_date, create_date, now(), '2050-12-31', 'Y', now() FROM sales.product;
show tables;
select count(1) from dim_product;
exit;

exit

The maintenance and loading of the Product Dimension are not covered in this article.
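For illustration only, a change to a product record would typically be handled by expiring the current dimension row and inserting a fresh active row; a minimal sketch, where product id 101 is a hypothetical example:

# Illustrative sketch only; id 101 is a hypothetical product
update dim_product set end_date = curdate(), active_flag = 'N' where id = 101 and active_flag = 'Y';
insert into dim_product (id, code, category, make, model, year, color, price, currency, update_date, create_date, start_date, end_date, active_flag, load_date)
select id, code, category, make, model, year, color, price, currency, update_date, create_date, now(), '2050-12-31', 'Y', now() from sales.product where id = 101;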

HDFS Directory Setup

We will create two directories: staging and archive. Staging will host all the data files imported from the external source systems. Once the data files are processed, they will be moved to the archive directory. We will set the replication factor for the archive directory to a low value of 1. Based on your needs, you may even choose not to archive the data files at all. Additionally, we will create the corresponding subdirectories for dates, showroom, customer, product, sales and stocks in the staging as well as the archive area. We will also create two more directories for Hive/Pig scripts and Oozie workflows. Below are the HDFS filesystem commands to achieve this; all of this is a one-time activity.

ssh edw_user@192.168.136.139 -p 2222
hdfs dfs -mkdir -p /user/edw_user/sales/staging
hdfs dfs -mkdir -p /user/edw_user/sales/archive
hdfs dfs -setrep -R 1 /user/edw_user/sales/archive
hdfs dfs -mkdir -p /user/edw_user/sales/scripts
hdfs dfs -mkdir -p /user/edw_user/sales/workflows

hdfs dfs -mkdir -p /user/edw_user/sales/staging/date
hdfs dfs -mkdir -p /user/edw_user/sales/archive/date
hdfs dfs -mkdir -p /user/edw_user/sales/staging/showroom
hdfs dfs -mkdir -p /user/edw_user/sales/archive/showroom
hdfs dfs -mkdir -p /user/edw_user/sales/staging/customer
hdfs dfs -mkdir -p /user/edw_user/sales/archive/customer
hdfs dfs -mkdir -p /user/edw_user/sales/staging/product
hdfs dfs -mkdir -p /user/edw_user/sales/archive/product
hdfs dfs -mkdir -p /user/edw_user/sales/staging/sales
hdfs dfs -mkdir -p /user/edw_user/sales/archive/sales
hdfs dfs -mkdir -p /user/edw_user/sales/staging/stocks
hdfs dfs -mkdir -p /user/edw_user/sales/archive/stocks
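
Optionally, confirm the directory layout with a recursive listing:

hdfs dfs -ls -R /user/edw_user/sales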

Next we will also create a password file for Sqoop to connect to the MySQL database schemas sales and sales_dw.

echo -n "hadoop" > /home/edw_user/sampledata/.password
hdfs dfs -put /home/edw_user/sampledata/.password /user/edw_user/sales/
hdfs dfs -chmod 400 /user/edw_user/sales/.password
rm -rf /home/edw_user/sampledata/.password
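
This password file can later be passed to Sqoop via the --password-file option instead of an interactive prompt. As a quick, optional connectivity check (the MySQL hostname is assumed to be the sandbox host):

sqoop list-tables --connect jdbc:mysql://sandbox-hdp.hortonworks.com/sales --username root --password-file /user/edw_user/sales/.password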

exit

Hive Optimization

The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage, and a command line tool and JDBC driver are provided to connect users to Hive. Both managed and unmanaged (external) Hive tables will be used for our analytics. Before proceeding any further, we will set up the Hive environment to support interactive SQL. Go to the Ambari Web Admin Console and click on the Hive service, followed by Configs. Using the Filter textbox, search for each of the properties below and set the values accordingly.

Set the Hive optimization properties in the Ambari Hive configuration as below:

set hive.execution.engine=tez;
set hive.cbo.enable=true;
set hive.stats.autogather=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.compress.output=true;
set hive.exec.compress.intermediate=true;
set hive.server2.enable.doAs=true;
set hive.support.concurrency=true;

Once done, Save the configuration and perform a restart of all the affected services and components.
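
Once the services are back up, the effective value of any of these properties can be confirmed later from the sandbox shell, for example:

beeline -u "jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" -n root -e "set hive.execution.engine;"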

Next we will create the Hive database to host our sales analytics dimension and fact tables. Connect to the Beeline CLI using root as the username; if prompted for a password, just hit Enter (the password is blank). Check the Ambari Hive Summary tab to get the HiveServer2 JDBC URL. Let us create a Hive schema named ‘sales_analytics’.

ssh edw_user@192.168.136.139 -p 2222
beeline
!connect jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2 root
create database sales_analytics;
show databases;
!quit

exit

Sqoop Metastore

Sqoop is a tool designed to transfer data between Hadoop and relational databases. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance. Prior to using Sqoop we will set up the Sqoop metastore, backed by a HyperSQL (HSQLDB) database, to save the Sqoop jobs. It is recommended to run the metastore as the sqoop user.

The first step is to configure the metastore server details in sqoop-site.xml. The relevant properties are sqoop.metastore.server.location and sqoop.metastore.server.port. We will use port 16001.

Using the Ambari Web UI, add the below properties to the Sqoop configuration under the Custom sqoop-site section:

sqoop.metastore.server.location=/home/sqoop/meta-store/shared.db
sqoop.metastore.server.port=16001
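
For reference, these two entries end up in sqoop-site.xml roughly as below (a sketch of what Ambari generates):

<property>
	<name>sqoop.metastore.server.location</name>
	<value>/home/sqoop/meta-store/shared.db</value>
</property>
<property>
	<name>sqoop.metastore.server.port</name>
	<value>16001</value>
</property>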

Once done, save the changes. Next click on Service Actions followed by Refresh Configs, and then restart all the affected services and components.

Next we will initialize the Sqoop metastore server: start the server process, redirect stdout and stderr to a file, and leave it running in the background. Log in to the ssh console as root and switch to the sqoop user to run the metastore server.

ssh root@192.168.136.139 -p 2222
su sqoop
cd /home/sqoop
mkdir -p /home/sqoop/meta-store/shared.db
cd /home/sqoop/meta-store/shared.db
nohup sqoop-metastore &>> shared.db.out &
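
Optionally verify that the metastore is listening on the configured port (assuming netstat is available on the sandbox):

netstat -tln | grep 16001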

To shut down the metastore gracefully, run sqoop-metastore --shutdown as the sqoop user. Next, modify the shared.db.properties file under the /home/sqoop/meta-store directory as shown below, followed by a restart of the sqoop-metastore server.

hsqldb.default_table_type=cached

cd /home/sqoop/meta-store/shared.db
nohup sqoop-metastore &>> shared.db.out &
exit

exit

In order for Sqoop clients to connect to the shared metastore server, create a client-side Sqoop configuration file named sqoop-conf.xml, which will later be used by the Oozie workflows to trigger Sqoop jobs. Define the Sqoop client configuration file as below. Log in to the ssh console with the edw_user account:

ssh edw_user@192.168.136.139 -p 2222
vi /home/edw_user/sampledata/sqoop-conf.xml

<configuration>  
	<property>  
		<name>sqoop.metastore.client.enable.autoconnect</name>  
		<value>false</value>  
	</property>  
	<property>  
		<name>sqoop.metastore.client.autoconnect.url</name>  
		<value>jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop</value>  
	</property>  
	<property>  
		<name>sqoop.metastore.client.autoconnect.username</name>  
		<value>sa</value>  
	</property>  
	<property>  
		<name>sqoop.metastore.client.autoconnect.password</name>  
		<value></value>  
	</property>  
	<property>  
		<name>sqoop.metastore.client.record.password</name>  
		<value>false</value>  
	</property>
</configuration>


:wq

Finally, initialize and test the Sqoop metastore by listing the saved jobs (the list will be empty at this point).

sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --list
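
With the metastore up, saved Sqoop jobs can be created against it and re-executed later by Oozie; a minimal sketch, where the job name jb_stg_showroom and its import options are purely illustrative:

sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --create jb_stg_showroom -- import --connect jdbc:mysql://sandbox-hdp.hortonworks.com/sales --username root --password-file /user/edw_user/sales/.password --table showroom --target-dir /user/edw_user/sales/staging/showroom -m 1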

exit

Oozie Setup

We will create Oozie workflows to orchestrate the daily data loading from the source systems/data warehouse to Hadoop HDFS/Hive. We will load the dimension tables followed by the fact tables. Oozie will internally execute the required HDFS commands, Sqoop jobs, and Pig & Hive scripts.

Using the Ambari Web UI, stop the Oozie service. Next, log in to the ssh console as the root user to configure Oozie to execute Sqoop jobs and Hive scripts. Back up the existing Oozie war configuration and rebuild it.

ssh root@192.168.136.139 -p 2222
cd /usr/hdp/current/oozie-server
cp /usr/share/java/mysql-connector-java-5.1.17.jar /usr/hdp/current/oozie-server/libext/

Next we will upload the MySQL JDBC jar and the Hadoop configuration files to HDFS under the Oozie shared library.

hdfs dfs -ls /user/oozie/share/lib/

Found 1 items
drwxr-xr-x - oozie hdfs 0 2018-02-01 10:31 /user/oozie/share/lib/lib_20180201102929

hdfs dfs -put /usr/share/java/mysql-connector-java-5.1.17.jar /user/oozie/share/lib/lib_20180201102929/sqoop
hdfs dfs -put /home/edw_user/sampledata/sqoop-conf.xml /user/oozie/share/lib/lib_20180201102929/sqoop
hdfs dfs -put /etc/hive/conf/hive-site.xml /user/oozie/share/lib/lib_20180201102929/hive/hive-conf.xml
hdfs dfs -put /etc/tez/conf/tez-site.xml /user/oozie/share/lib/lib_20180201102929/hive/tez-conf.xml

sudo -u hdfs hdfs dfs -chown oozie:hdfs /user/oozie/share/lib/lib_20180201102929/sqoop/mysql-connector-java-5.1.17.jar
sudo -u hdfs hdfs dfs -chown oozie:hdfs /user/oozie/share/lib/lib_20180201102929/sqoop/sqoop-conf.xml
sudo -u hdfs hdfs dfs -chown oozie:hdfs /user/oozie/share/lib/lib_20180201102929/hive/hive-conf.xml
sudo -u hdfs hdfs dfs -chown oozie:hdfs /user/oozie/share/lib/lib_20180201102929/hive/tez-conf.xml
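
For context, a later Oozie Sqoop action can pick up this client configuration through a job-xml element; a minimal hedged sketch (the action name and saved-job name are illustrative, not part of this setup):

<action name="sqoop-stage-showroom">
	<sqoop xmlns="uri:oozie:sqoop-action:0.4">
		<job-tracker>${jobTracker}</job-tracker>
		<name-node>${nameNode}</name-node>
		<job-xml>/user/oozie/share/lib/lib_20180201102929/sqoop/sqoop-conf.xml</job-xml>
		<command>job --exec jb_stg_showroom --meta-connect jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop</command>
	</sqoop>
	<ok to="end"/>
	<error to="fail"/>
</action>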

cd /root
mkdir /root/bkp
mv /usr/hdp/current/oozie-server/oozie-server/webapps/oozie.war /root/bkp
mv /usr/hdp/current/oozie-server/oozie-server/webapps/oozie /root/bkp
/usr/hdp/current/oozie-server/bin/oozie-setup.sh prepare-war

exit

Now start Oozie service via Ambari Web UI.
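
Once Oozie is running again, its status can be verified from the sandbox shell (the URL assumes the default Oozie port 11000):

oozie admin -oozie http://sandbox-hdp.hortonworks.com:11000/oozie -status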

We have completed all the essential prerequisite steps to configure our Hadoop data lake. Now follow the next articles to start loading data into the Hive tables.

