Hadoop DataLake Implementation Part 5
In this article we will load the showroom master data from the MySQL source system into HDFS using Sqoop, maintaining it as an SCD Type 1 dimension.
Load Showroom Dimension Table
The Sqoop import utility imports data from a relational database system into HDFS. Imports and exports can be performed repeatedly by issuing the same command multiple times, which is the expected pattern when using the incremental import capability. Sqoop allows you to define saved jobs to make this easier: a saved job records the configuration information required to execute a Sqoop command at a later time.
Incremental imports are performed by comparing the values in a check column against a reference value for the most recent import. For example, if the --incremental append argument was specified, along with --check-column id and --last-value 100, all rows with id > 100 will be imported. If an incremental import is run from the command line, the value which should be specified as --last-value in a subsequent incremental import will be printed to the screen for your reference. If an incremental import is run from a saved job, this value will be retained in the saved job.
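As an illustration only (this is not the mode used in this article), an append-mode incremental import might look roughly like the following; the table name some_table and the id check column are hypothetical placeholders:
sqoop import \
--connect jdbc:mysql://sandbox-hdp.hortonworks.com:3306/sales \
--username root \
--password-file /user/edw_user/sales/.password \
--table some_table \
--incremental append \
--check-column id \
--last-value 100 \
--target-dir /user/edw_user/sales/staging/some_table \
--num-mappers 1
When run as a saved job, Sqoop retains and reuses the stored last value automatically, as described above.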
ssh edw_user@192.168.136.139 -p 2222
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" \
--create jb_stg_showroom \
-- import \
--bindir ./ \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://sandbox-hdp.hortonworks.com:3306/sales \
--username root \
--password-file /user/edw_user/sales/.password \
--table showroom \
--fetch-size 1000 \
--as-textfile \
--fields-terminated-by '|' \
--target-dir /user/edw_user/sales/staging/showroom \
--incremental lastmodified \
--check-column update_date \
--merge-key id \
--num-mappers 1
Once the job is created, verify and execute it. The MySQL password is supplied via a password file (a sketch of creating this file follows the option list below). The merge of incremental/delta changes is already taken care of by Sqoop's incremental and merge-key options.
--incremental (mode): Specifies how Sqoop determines which rows are new (append or lastmodified).
--check-column (col): Specifies the column to be examined when determining which rows to import.
--last-value (value): Specifies the maximum value of the check column from the previous import.
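The --password-file argument above expects the MySQL password to be stored as a file in HDFS. If that file does not already exist, it can be created roughly as follows; the password value "hadoop" is only a placeholder and should be replaced with the actual MySQL root password:
# write the password (placeholder value) from stdin into the HDFS password file
echo -n "hadoop" | hdfs dfs -put - /user/edw_user/sales/.password
# restrict access to the file owner
hdfs dfs -chmod 400 /user/edw_user/sales/.password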
The commands below list, show, and execute the Sqoop job. Execute the job to perform the initial data load; it will later be scheduled on a daily basis via an Oozie workflow.
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --list
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --show jb_stg_showroom
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --exec jb_stg_showroom
If the job needs to be deleted, use:
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --delete jb_stg_showroom
Now let’s define a Hive external table for the Showroom dataset in staging. Connect to the Beeline CLI using edw_user as the username and hadoop as the password. We will connect to the Hive schema ‘sales_analytics’.
One Time Setup
beeline
!connect jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics edw_user
CREATE EXTERNAL TABLE IF NOT EXISTS ext_showroom (
id INT,
code VARCHAR(40),
name VARCHAR(50),
operation_date DATE,
staff_count INT,
country VARCHAR(50),
state VARCHAR(50),
address VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/edw_user/sales/staging/showroom';
SELECT * FROM ext_showroom LIMIT 10;
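Optionally, a quick row count on the staged data can be compared against the source table in MySQL:
SELECT COUNT(*) FROM ext_showroom;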
Next, let’s define the final Hive table for the Showroom dimension as a Hive-managed ORC table; it will be loaded in a later step.
CREATE TABLE IF NOT EXISTS dim_showroom(
id INT,
code VARCHAR(40),
name VARCHAR(50),
operation_date DATE,
staff_count INT,
country VARCHAR(50),
state VARCHAR(50),
address VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");
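Optionally, the table definition and ORC properties can be inspected before leaving the session:
DESCRIBE FORMATTED dim_showroom;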
!quit
Now we will write a script to load the data from the external table into the Hive-managed table. Because Sqoop has already merged the delta changes into the staging files, a full INSERT OVERWRITE of the dimension implements SCD Type 1: the latest values simply replace the previous ones. This script will later be used in the Oozie workflow manager to schedule the load.
Initial/Delta Setup
vi /home/edw_user/sampledata/load_showroom.hql
-- SCD Type 1: overwrite the dimension with the latest merged data from staging
INSERT OVERWRITE TABLE dim_showroom SELECT * FROM ext_showroom;
-- refresh column-level statistics for the optimizer
ANALYZE TABLE dim_showroom COMPUTE STATISTICS FOR COLUMNS;
hdfs dfs -put /home/edw_user/sampledata/load_showroom.hql /user/edw_user/sales/scripts
Execute the script to trigger the initial data load.
beeline -u jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics -n edw_user -p hadoop -d org.apache.hive.jdbc.HiveDriver -f "/home/edw_user/sampledata/load_showroom.hql"
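To optionally confirm the load without opening an interactive session, a quick row count can be run with the same connection parameters:
beeline -u jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics -n edw_user -p hadoop -e "SELECT COUNT(*) FROM dim_showroom;"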
Finally, we will write and execute a shell script to copy and archive the showroom data files. This will also be used later by Oozie.
vi /home/edw_user/sampledata/archive_showroom.sh
# create a date-stamped archive directory and copy the staged showroom files into it
hdfs dfs -mkdir /user/edw_user/sales/archive/showroom/`date +%Y%m%d`
hdfs dfs -cp /user/edw_user/sales/staging/showroom/* /user/edw_user/sales/archive/showroom/`date +%Y%m%d`
hdfs dfs -put /home/edw_user/sampledata/archive_showroom.sh /user/edw_user/sales/scripts
sh /home/edw_user/sampledata/archive_showroom.sh
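Optionally, verify that today's date-stamped archive directory has been populated:
hdfs dfs -ls /user/edw_user/sales/archive/showroom/`date +%Y%m%d`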
exit
In our next article we will load the customer dimension table as SCD Type 1. We will use a different approach to deal with incremental/delta extraction and SCD implementation.