
Hadoop DataLake Implementation Part 6

Updated on Oct 03, 2020

In this article we will load the Customer data into the Hive warehouse as SCD Type 1. This time we will follow a different approach to implement the Insert/Update (merge) strategy, using HiveQL rather than the Sqoop merge utility.

Load Customer Dimension Table

Using Sqoop we will now load the customer data, both the initial/base and the incremental datasets, from MySQL to HDFS.

ssh edw_user@192.168.136.139 -p 2222
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" \
--create jb_stg_customer \
-- import \
--bindir ./ \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://sandbox-hdp.hortonworks.com:3306/sales \
--username root \
--password-file /user/edw_user/sales/.password \
--table customer \
--fetch-size 1000 \
--as-textfile \
--fields-terminated-by '|' \
--target-dir /user/edw_user/sales/staging/customer \
--incremental lastmodified \
--check-column update_date \
--append \
--split-by id \
--num-mappers 2

Once the job is created, verify and execute it. This Sqoop job will later be scheduled on a daily basis via an Oozie workflow.

sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --list
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --show jb_stg_customer
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --exec jb_stg_customer
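
Optionally, before moving on we can confirm that the extracted files have landed in the staging directory (the same path used as the Sqoop target directory above):

hdfs dfs -ls /user/edw_user/sales/staging/customer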

Now we will define a Hive external table for the customer staging data as well as the final Hive managed ORC dimension table. Connect to the Beeline CLI using edw_user as the username and hadoop as the password. We will connect to the Hive schema 'sales_analytics'.

One time setup

beeline
!connect jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics edw_user

CREATE EXTERNAL TABLE IF NOT EXISTS ext_customer (
id INT,
first_name VARCHAR(50),
last_name VARCHAR(50),
gender VARCHAR(50),
dob DATE,
company VARCHAR(50),
job VARCHAR(50),
email VARCHAR(50),
country VARCHAR(50),
state VARCHAR(50),
address VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/edw_user/sales/staging/customer';

SELECT * FROM ext_customer LIMIT 10;

Next we define our final Customer dimension table as a Hive managed ORC table.

CREATE TABLE IF NOT EXISTS dim_customer(
id INT,
first_name VARCHAR(50),
last_name VARCHAR(50),
gender VARCHAR(50),
dob DATE,
company VARCHAR(50),
job VARCHAR(50),
email VARCHAR(50),
country VARCHAR(50),
state VARCHAR(50),
address VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP
)
STORED AS ORC 
TBLPROPERTIES ("orc.compress"="SNAPPY");
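
To confirm that the dimension table was created with ORC storage and Snappy compression, the table definition can be inspected from Beeline:

DESCRIBE FORMATTED dim_customer;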

Let us create a Hive view over the staging and dimension tables that returns a single reconciled record set: for every "id" present in the staging data the incoming record takes precedence, while ids not present in staging are carried over unchanged from the dimension table. Once the initial or incremental dataset is in HDFS, we will follow the steps below to load our final customer dimension.

CREATE VIEW vw_recon_customer AS
SELECT * FROM ext_customer
UNION ALL
SELECT * FROM dim_customer
WHERE NOT EXISTS (SELECT 1
FROM ext_customer
WHERE dim_customer.id = ext_customer.id);
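
Note that the view passes the staging rows through as-is. If a single incremental extract could carry more than one row for the same "id" (for example, a customer updated twice between loads), a windowed deduplication could be layered on top. The view below is only an illustrative sketch and not part of the workflow; the name vw_latest_customer is hypothetical:

CREATE VIEW vw_latest_customer AS
SELECT id, first_name, last_name, gender, dob, company, job, email,
country, state, address, update_date, create_date
FROM (
SELECT c.*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY update_date DESC) AS rn
FROM vw_recon_customer c
) ranked
WHERE rn = 1;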

!quit

Using the view we will populate an intermediate table and then reload the final customer dimension table from it. Next we will write a script to load the data from the external table into the Hive managed table. This script will later be scheduled via the Oozie workflow manager.

Initial & Daily Script

vi /home/edw_user/sampledata/load_customer.hql
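-- load_customer.hql: rebuild dim_customer (SCD Type 1 full refresh)
-- Step 1: materialise the reconciliation view into an interim table
-- Step 2: recreate dim_customer and overwrite it from the interim table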

DROP TABLE IF EXISTS interim_customer;
CREATE TABLE interim_customer AS SELECT * FROM vw_recon_customer;

DROP TABLE IF EXISTS dim_customer;
CREATE TABLE IF NOT EXISTS dim_customer(
id INT,
first_name VARCHAR(50),
last_name VARCHAR(50),
gender VARCHAR(50),
dob DATE,
company VARCHAR(50),
job VARCHAR(50),
email VARCHAR(50),
country VARCHAR(50),
state VARCHAR(50),
address VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP
)
STORED AS ORC 
TBLPROPERTIES ("orc.compress"="SNAPPY");

INSERT OVERWRITE TABLE dim_customer SELECT * FROM interim_customer;

ANALYZE TABLE dim_customer COMPUTE STATISTICS FOR COLUMNS;

Copy the script to HDFS so that the Oozie workflow can reference it later:

hdfs dfs -put /home/edw_user/sampledata/load_customer.hql /user/edw_user/sales/scripts

Execute the script to trigger the initial data load.

beeline -u jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics -n edw_user -p hadoop -d org.apache.hive.jdbc.HiveDriver -f "/home/edw_user/sampledata/load_customer.hql"
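
As a quick sanity check (a sketch using the same connection details), the reloaded dimension can be counted directly from the command line:

beeline -u jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics -n edw_user -p hadoop -e "SELECT COUNT(*) FROM dim_customer;"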

Finally, move the customer data files to the archive directory. This script will also be used later by Oozie.

vi /home/edw_user/sampledata/archive_customer.sh

hdfs dfs -mkdir /user/edw_user/sales/archive/customer/`date +%Y%m%d`
hdfs dfs -mv /user/edw_user/sales/staging/customer/* /user/edw_user/sales/archive/customer/`date +%Y%m%d`

Copy the script to HDFS and run it once for the current load:

hdfs dfs -put /home/edw_user/sampledata/archive_customer.sh /user/edw_user/sales/scripts
sh /home/edw_user/sampledata/archive_customer.sh
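
If needed, verify that the staging files have been moved into today's archive folder and that the staging directory is now empty:

hdfs dfs -ls /user/edw_user/sales/archive/customer/`date +%Y%m%d`
hdfs dfs -ls /user/edw_user/sales/staging/customer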
exit

In the next article we will load the Product dimension as SCD Type 2.