Hadoop DataLake Implementation Part 3
To complete our implementation setup, we will create the source tables based on the downloaded data files. Let us first load the SQL files into the MySQL server under a new database called 'sales'. This database schema will simulate our OLTP source system.
Once the scripts are executed, we will have three master data entities (customer, product and showroom) and two transaction data tables (sales and stocks). Let us also assume that we want to maintain history for the product master data in a separate schema, 'sales_dw', to simulate a dummy OLAP data warehouse layer. For this lab exercise we will use the MySQL database available in the HDP Sandbox. Log in to MySQL with user root and password hadoop.
OLTP Source Setup
ssh root@192.168.136.139 -p 2222
mysql -u root -p
show databases;
use mysql;
select user,host from user;
create database sales;
create database sales_dw;
show databases;
exit;
Next, create and load the data tables from the downloaded data files. When prompted for a password, enter hadoop.
mysql -u root -p sales < /home/edw_user/sampledata/showroom.sql
mysql -u root -p sales < /home/edw_user/sampledata/customer.sql
mysql -u root -p sales < /home/edw_user/sampledata/product.sql
mysql -u root -p sales < /home/edw_user/sampledata/sales.sql
mysql -u root -p sales < /home/edw_user/sampledata/stocks.sql
Verify the dummy OLTP tables are created and loaded in the ‘sales’ schema.
mysql -u root -p
use sales;
show tables;
# Dummy data-set preparation
update sales, product set sales.amount = product.price where sales.product_id = product.id;
commit;
select count(1) from sales;
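# Optional check (sketch, not part of the original walkthrough): sales rows whose
# product_id has no match in product did not pick up a price; ideally this returns 0.
select count(1) from sales s left join product p on s.product_id = p.id where p.id is null;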
exit;
Now let us create the dummy Product Dimension table (SCD Type 2) in the 'sales_dw' schema.
mysql -u root -p
use sales_dw;
create table if not exists dim_product (
prod_key INT NOT NULL AUTO_INCREMENT,
id INT,
code VARCHAR(50),
category VARCHAR(6),
make VARCHAR(50),
model VARCHAR(50),
year VARCHAR(50),
color VARCHAR(50),
price INT,
currency VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP,
start_date DATE,
end_date DATE,
active_flag VARCHAR(1),
load_date DATE,
PRIMARY KEY(prod_key)
);
# Initial load Product Dimension
INSERT INTO dim_product (id, code, category, make, model, year, color, price, currency, update_date, create_date, start_date, end_date, active_flag, load_date)
SELECT id, code, category, make, model, year, color, price, currency, update_date, create_date, now(), '2050-12-31', 'Y', now() FROM sales.product;
show tables;
select count(1) from dim_product;
exit;
exit
The maintenance and loading of the Product Dimension are not covered in this article.
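For orientation only, a minimal sketch of how a Type 2 change might later be applied for a single changed product (the id value 101 is purely hypothetical): the current record is expired and a fresh active record is inserted.
-- expire the currently active version of the changed product
update sales_dw.dim_product set end_date = curdate(), active_flag = 'N' where id = 101 and active_flag = 'Y';
-- insert the new active version from the source table
insert into sales_dw.dim_product (id, code, category, make, model, year, color, price, currency, update_date, create_date, start_date, end_date, active_flag, load_date)
select id, code, category, make, model, year, color, price, currency, update_date, create_date, curdate(), '2050-12-31', 'Y', curdate() from sales.product where id = 101;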
HDFS Directory Setup
We will create two directories: staging and archive. Staging will host all the data files imported from the external source systems; once the data files are processed, they will be moved to the archive directory. We will set the replication factor for the archive directory to a low value of 1. Depending on your needs, you can also skip archiving the data files altogether. Additionally, we will create the corresponding subdirectories for dates, showroom, customer, product, sales and stocks in both the staging and archive areas, plus two more directories for Hive/Pig scripts and Oozie workflows. Below are the HDFS filesystem commands to achieve this; it is a one-time activity.
ssh edw_user@192.168.136.139 -p 2222
hdfs dfs -mkdir -p /user/edw_user/sales/staging
hdfs dfs -mkdir -p /user/edw_user/sales/archive
hdfs dfs -setrep -R 1 /user/edw_user/sales/archive
hdfs dfs -mkdir -p /user/edw_user/sales/scripts
hdfs dfs -mkdir -p /user/edw_user/sales/workflows
hdfs dfs -mkdir -p /user/edw_user/sales/staging/date
hdfs dfs -mkdir -p /user/edw_user/sales/archive/date
hdfs dfs -mkdir -p /user/edw_user/sales/staging/showroom
hdfs dfs -mkdir -p /user/edw_user/sales/archive/showroom
hdfs dfs -mkdir -p /user/edw_user/sales/staging/customer
hdfs dfs -mkdir -p /user/edw_user/sales/archive/customer
hdfs dfs -mkdir -p /user/edw_user/sales/staging/product
hdfs dfs -mkdir -p /user/edw_user/sales/archive/product
hdfs dfs -mkdir -p /user/edw_user/sales/staging/sales
hdfs dfs -mkdir -p /user/edw_user/sales/archive/sales
hdfs dfs -mkdir -p /user/edw_user/sales/staging/stocks
hdfs dfs -mkdir -p /user/edw_user/sales/archive/stocks
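Optionally, sanity-check the layout before moving on; the recursive listing below should show the staging and archive trees created above. (The replication factor of files landed under archive later on appears in the second column of hdfs dfs -ls output.)
hdfs dfs -ls -R /user/edw_user/sales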
Next, we will create a password file for Sqoop to connect to the MySQL schemas 'sales' and 'sales_dw'.
echo -n "hadoop" > /home/edw_user/sampledata/.password
hdfs dfs -put /home/edw_user/sampledata/.password /user/edw_user/sales/
hdfs dfs -chmod 400 /user/edw_user/sales/.password
rm -rf /home/edw_user/sampledata/.password
exit
Hive Optimization
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage, and a command line tool and JDBC driver are provided to connect users to Hive. Both managed and unmanaged (external) Hive tables will be used for our analytics. Before proceeding any further we will set up the Hive environment to support interactive SQL. Go to the Ambari Web Admin Console and click on the Hive service, followed by Configs. Using the Filter textbox, search for each of the properties below and set its value accordingly.
Set the Hive optimization properties in the Ambari Hive configuration as below:
set hive.execution.engine=tez;
set hive.cbo.enable=true;
set hive.stats.autogather=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.compress.output=true;
set hive.exec.compress.intermediate=true;
set hive.server2.enable.doAs=true;
set hive.support.concurrency=true;
Once done, save the configuration and restart all the affected services and components.
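After the restart, you can cross-check that the values took effect from any Hive session (for example from Beeline, which we connect to below); issuing set with a property name but no value prints its current value:
set hive.execution.engine;
set hive.vectorized.execution.enabled;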
Next we will create the Hive database to host our sales analytics dimension and fact tables. Connect to the Beeline CLI using root as the username; the password is blank, so just hit Enter if prompted. Check the Ambari Hive Summary tab to get the HiveServer2 JDBC URL. Let us create a Hive schema named 'sales_analytics'.
ssh root@192.168.136.139 -p 2222
beeline
!connect jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2 root
create database sales_analytics;
show databases;
!quit
exit
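The dimension and fact DDLs themselves come in the next articles. Purely as an illustrative sketch (table and column names here are hypothetical), a fact table in this schema could be declared to take advantage of the ORC format, dynamic partitioning and bucketing settings enabled above:
create table if not exists sales_analytics.sketch_fact_sales (
id int,
showroom_id int,
customer_id int,
product_id int,
amount double
)
partitioned by (sale_date string)
clustered by (product_id) into 4 buckets
stored as orc;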
Sqoop Metastore
Sqoop is a tool designed to transfer data between Hadoop and relational databases. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance. Before using Sqoop, we will set up the Sqoop metastore, backed by a HyperSQL (HSQLDB) database, to save the Sqoop jobs. It is recommended to run the metastore as the sqoop user.
The first step is to configure the metastore server details in sqoop-site.xml. The relevant properties are sqoop.metastore.server.location and sqoop.metastore.server.port; we will use port 16001.
Using the Ambari Web UI, add the properties below under Custom sqoop-site in the Sqoop configuration:
sqoop.metastore.server.location=/home/sqoop/meta-store/shared.db
sqoop.metastore.server.port=16001
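Ambari stores these as key/value pairs; in the generated sqoop-site.xml they correspond to property entries of the form below (shown here only for reference):
<property>
  <name>sqoop.metastore.server.location</name>
  <value>/home/sqoop/meta-store/shared.db</value>
</property>
<property>
  <name>sqoop.metastore.server.port</name>
  <value>16001</value>
</property>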
Save the changes, then click Service Actions followed by Refresh configs, and restart all the affected services and components.
Next we will initialize the Sqoop metastore server: start the server process, redirect stdout and stderr to a file, and leave it running in the background. Log in to the SSH console as root and switch to the sqoop user to run the metastore server.
ssh root@192.168.136.139 -p 2222
su sqoop
cd /home/sqoop
mkdir -p /home/sqoop/meta-store/shared.db
cd /home/sqoop/meta-store/shared.db
nohup sqoop-metastore &>> shared.db.out &
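Optionally, confirm that the metastore came up and is listening on port 16001 (the netstat check assumes net-tools is available in the sandbox):
tail shared.db.out
netstat -tnlp | grep 16001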
To shut down the metastore gracefully, use sqoop-metastore --shutdown as the sqoop user.
Next, shut down the running metastore, modify the shared.db.properties file under the /home/sqoop/meta-store directory as below, and then start the sqoop-metastore server again.
hsqldb.default_table_type=cached
cd /home/sqoop/meta-store/shared.db
nohup sqoop-metastore &>> shared.db.out &
exit
exit
For Sqoop clients to connect to the shared metastore server, create a client-side Sqoop configuration file, sqoop-conf.xml, which will be used later by the Oozie workflows to trigger Sqoop jobs. Define the Sqoop client configuration file as below. Log in to the SSH console with the edw_user account:
ssh edw_user@192.168.136.139 -p 2222
vi /home/edw_user/sampledata/sqoop-conf.xml
<configuration>
  <property>
    <name>sqoop.metastore.client.enable.autoconnect</name>
    <value>false</value>
  </property>
  <property>
    <name>sqoop.metastore.client.autoconnect.url</name>
    <value>jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop</value>
  </property>
  <property>
    <name>sqoop.metastore.client.autoconnect.username</name>
    <value>sa</value>
  </property>
  <property>
    <name>sqoop.metastore.client.autoconnect.password</name>
    <value></value>
  </property>
  <property>
    <name>sqoop.metastore.client.record.password</name>
    <value>false</value>
  </property>
</configuration>
Finally initialize & test the Sqoop Metastore.
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --list
exit
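The listing will be empty at this point, which simply confirms that the client can reach the metastore. Purely as an illustration of what the next articles will store in it (the job name and import options are placeholders), a saved Sqoop job would be created and executed roughly like this, reusing the HDFS password file created earlier:
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" \
 --create jb_stg_showroom -- import \
 --connect jdbc:mysql://sandbox-hdp.hortonworks.com/sales --username root \
 --password-file /user/edw_user/sales/.password \
 --table showroom --target-dir /user/edw_user/sales/staging/showroom \
 --delete-target-dir -m 1
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --exec jb_stg_showroom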
Oozie Setup
We will create Oozie workflows to orchestrate the daily data loading from the source systems/data warehouse to Hadoop HDFS/Hive. We will load the dimension tables first, followed by the fact tables, and use Oozie to execute the required HDFS commands, Sqoop jobs, and Pig and Hive scripts.
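The actual workflow definitions are built in the following parts. Purely as a hedged sketch of their general shape (the workflow name and the saved-job reference are placeholders), a workflow with a single Sqoop action that executes a job from the shared metastore looks roughly like this:
<workflow-app name="wf_sketch" xmlns="uri:oozie:workflow:0.5">
  <start to="sqoop-node"/>
  <action name="sqoop-node">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- placeholder: run a saved job from the shared Sqoop metastore -->
      <command>job --meta-connect jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop --exec jb_stg_showroom</command>
    </sqoop>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Sqoop action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>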
Using the Ambari Web UI, stop the Oozie service. Next, log in to the SSH console as root to configure Oozie to execute Sqoop jobs and Hive scripts. Back up the existing Oozie WAR, then rebuild it.
ssh root@192.168.136.139 -p 2222
cd /usr/hdp/current/oozie-server
cp /usr/share/java/mysql-connector-java-5.1.17.jar /usr/hdp/current/oozie-server/libext/
Next we will upload the MySQL JDBC jar, the Sqoop client configuration and the Hive/Tez configuration files to HDFS under the Oozie shared library.
hdfs dfs -ls /user/oozie/share/lib/
Found 1 items
drwxr-xr-x - oozie hdfs 0 2018-02-01 10:31 /user/oozie/share/lib/lib_20180201102929
hdfs dfs -put /usr/share/java/mysql-connector-java-5.1.17.jar /user/oozie/share/lib/lib_20180201102929/sqoop
hdfs dfs -put /home/edw_user/sampledata/sqoop-conf.xml /user/oozie/share/lib/lib_20180201102929/sqoop
hdfs dfs -put /etc/hive/conf/hive-site.xml /user/oozie/share/lib/lib_20180201102929/hive/hive-conf.xml
hdfs dfs -put /etc/tez/conf/tez-site.xml /user/oozie/share/lib/lib_20180201102929/hive/tez-conf.xml
sudo -u hdfs hdfs dfs -chown oozie:hdfs /user/oozie/share/lib/lib_20180201102929/sqoop/mysql-connector-java-5.1.17.jar
sudo -u hdfs hdfs dfs -chown oozie:hdfs /user/oozie/share/lib/lib_20180201102929/sqoop/sqoop-conf.xml
sudo -u hdfs hdfs dfs -chown oozie:hdfs /user/oozie/share/lib/lib_20180201102929/hive/hive-conf.xml
sudo -u hdfs hdfs dfs -chown oozie:hdfs /user/oozie/share/lib/lib_20180201102929/hive/tez-conf.xml
cd /root
mkdir /root/bkp
mv /usr/hdp/current/oozie-server/oozie-server/webapps/oozie.war /root/bkp
mv /usr/hdp/current/oozie-server/oozie-server/webapps/oozie /root/bkp
/usr/hdp/current/oozie-server/bin/oozie-setup.sh prepare-war
exit
Now start the Oozie service via the Ambari Web UI.
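Because jars and configuration files were added directly under the shared library path, it is a good idea to refresh the sharelib and confirm Oozie is healthy once it is back up; for example (the URL assumes the default sandbox Oozie port 11000):
oozie admin -oozie http://sandbox-hdp.hortonworks.com:11000/oozie -status
oozie admin -oozie http://sandbox-hdp.hortonworks.com:11000/oozie -sharelibupdate
oozie admin -oozie http://sandbox-hdp.hortonworks.com:11000/oozie -shareliblist sqoop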
We have completed all the essential prerequisite steps to configure our Hadoop data lake. Follow the next articles to start loading data into the Hive tables.