Hadoop DataLake Implementation Part 6
In this article we will load the Customer data into the Hive warehouse as an SCD Type 1 dimension. This time we will follow a different approach, implementing the insert/update (merge) strategy with HiveQL rather than the Sqoop merge utility.
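Conceptually, an SCD Type 1 merge is an upsert: incoming records overwrite existing dimension records with the same business key, and no history of the previous attribute values is kept. A minimal Python sketch of that logic (the function name and the sample rows are illustrative, not part of the implementation below):

```python
# SCD Type 1 upsert: staged rows win on key collisions; no history is kept.
def scd_type1_merge(dimension, staged, key="id"):
    """Return the merged dimension rows, with staged rows replacing
    existing rows that share the same key value."""
    merged = {row[key]: row for row in dimension}   # current dimension state
    for row in staged:                              # apply incremental extract
        merged[row[key]] = row                      # insert new / overwrite changed
    return list(merged.values())

dim = [{"id": 1, "state": "NY"}, {"id": 2, "state": "CA"}]
stg = [{"id": 2, "state": "TX"}, {"id": 3, "state": "WA"}]
print(scd_type1_merge(dim, stg))
```

The Hive view and interim-table steps later in this article reproduce exactly this behavior in HiveQL.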
Load Customer Dimension Table
Using Sqoop we will now load the customer data, both the initial/base and the incremental datasets, from MySQL to HDFS.
ssh edw_user@192.168.136.139 -p 2222
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" \
--create jb_stg_customer \
-- import \
--bindir ./ \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://sandbox-hdp.hortonworks.com:3306/sales \
--username root \
--password-file /user/edw_user/sales/.password \
--table customer \
--fetch-size 1000 \
--as-textfile \
--fields-terminated-by '|' \
--target-dir /user/edw_user/sales/staging/customer \
--incremental lastmodified \
--check-column update_date \
--append \
--split-by id \
--num-mappers 2
Once the job is created, verify and execute it. This Sqoop job will later be scheduled on a daily basis via an Oozie workflow.
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --list
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --show jb_stg_customer
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --exec jb_stg_customer
Now we will define a Hive external table for the Customer staging data, as well as the final Hive-managed ORC dimension table. Connect to the Beeline CLI using edw_user as the username and hadoop as the password. We will connect to the Hive schema 'sales_analytics'.
One-time setup
beeline
!connect jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics edw_user
CREATE EXTERNAL TABLE IF NOT EXISTS ext_customer (
id INT,
first_name VARCHAR(50),
last_name VARCHAR(50),
gender VARCHAR(50),
dob DATE,
company VARCHAR(50),
job VARCHAR(50),
email VARCHAR(50),
country VARCHAR(50),
state VARCHAR(50),
address VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/edw_user/sales/staging/customer';
SELECT * FROM ext_customer LIMIT 10;
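The external table simply projects the pipe-delimited text files that Sqoop wrote into the staging directory: each line is split on '|' and the fields map to the declared columns in order. A rough Python illustration (column list abbreviated, sample values hypothetical):

```python
# How Hive's ROW FORMAT DELIMITED ... FIELDS TERMINATED BY '|' maps a
# Sqoop-exported text line onto columns, in declared order.
columns = ["id", "first_name", "last_name", "gender", "dob"]  # abbreviated
line = "7|John|Doe|Male|1985-03-12"                           # sample staged row
record = dict(zip(columns, line.split("|")))
print(record["first_name"], record["dob"])
```

Note that at this layer everything is still text; Hive applies the declared column types (INT, DATE, TIMESTAMP) on read.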
Now we define our final Customer dimension table as a Hive-managed ORC table.
CREATE TABLE IF NOT EXISTS dim_customer(
id INT,
first_name VARCHAR(50),
last_name VARCHAR(50),
gender VARCHAR(50),
dob DATE,
company VARCHAR(50),
job VARCHAR(50),
email VARCHAR(50),
country VARCHAR(50),
state VARCHAR(50),
address VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");
Let us create a Hive view over the staging and dimension tables that resolves each unique "id" to a single record: a newly staged record always takes precedence, and an existing dimension record is retained only when that "id" is absent from the staging data. Once the initial or incremental dataset is in HDFS, we will follow the steps below to load our final customer dimension.
CREATE VIEW vw_recon_customer AS
SELECT * FROM ext_customer
UNION ALL
SELECT * FROM dim_customer
WHERE NOT EXISTS (SELECT 1
FROM ext_customer
WHERE dim_customer.id = ext_customer.id);
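The view's UNION ALL plus NOT EXISTS amounts to the following set logic, sketched here in Python with dicts standing in for Hive rows (sample data is illustrative):

```python
# Equivalent of vw_recon_customer: every staged row, plus every dimension
# row whose id does not appear in the staging data (the NOT EXISTS branch).
def recon_customer(ext_customer, dim_customer):
    staged_ids = {row["id"] for row in ext_customer}
    kept_dim = [row for row in dim_customer if row["id"] not in staged_ids]
    return ext_customer + kept_dim

dim = [{"id": 1, "state": "NY"}, {"id": 2, "state": "CA"}]
stg = [{"id": 2, "state": "TX"}, {"id": 3, "state": "WA"}]
print(recon_customer(stg, dim))
```

In other words, staged inserts and updates win, and untouched dimension rows pass through unchanged, which is the SCD Type 1 behavior we want.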
!quit
Using the view we will populate an intermediate table, and then reload the final customer dimension table from it. Now we will write a script to load the data from the external table into the Hive-managed table. This script will later be used in the Oozie workflow manager to schedule the load.
Initial & Daily Script
vi /home/edw_user/sampledata/load_customer.hql
DROP TABLE IF EXISTS interim_customer;
CREATE TABLE interim_customer AS SELECT * FROM vw_recon_customer;
DROP TABLE IF EXISTS dim_customer;
CREATE TABLE IF NOT EXISTS dim_customer(
id INT,
first_name VARCHAR(50),
last_name VARCHAR(50),
gender VARCHAR(50),
dob DATE,
company VARCHAR(50),
job VARCHAR(50),
email VARCHAR(50),
country VARCHAR(50),
state VARCHAR(50),
address VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");
INSERT OVERWRITE TABLE dim_customer SELECT * FROM interim_customer;
ANALYZE TABLE dim_customer COMPUTE STATISTICS FOR COLUMNS;
hdfs dfs -put /home/edw_user/sampledata/load_customer.hql /user/edw_user/sales/scripts
Execute the script to trigger the initial data load.
beeline -u jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics -n edw_user -p hadoop -d org.apache.hive.jdbc.HiveDriver -f "/home/edw_user/sampledata/load_customer.hql"
Finally, move the customer data files to an archive directory. This script will also be used later by Oozie.
vi /home/edw_user/sampledata/archive_customer.sh
hdfs dfs -mkdir -p /user/edw_user/sales/archive/customer/$(date +%Y%m%d)
hdfs dfs -mv /user/edw_user/sales/staging/customer/* /user/edw_user/sales/archive/customer/$(date +%Y%m%d)
hdfs dfs -put /home/edw_user/sampledata/archive_customer.sh /user/edw_user/sales/scripts
sh /home/edw_user/sampledata/archive_customer.sh
exit
In the next article we will load the Product dimension as SCD Type 2.