In this article we will load the showroom master data from the MySQL source system into HDFS using Sqoop, as SCD Type 1.

Load Showroom Dimension Table

The Sqoop import utility helps to import data from a relational database system. Imports and exports can be performed repeatedly by issuing the same command multiple times, which is especially common when using the incremental import capability. Sqoop allows you to define saved jobs to make this process easier. A saved job records the configuration information required to execute a Sqoop command at a later time.

Incremental imports are performed by comparing the values in a check column against a reference value for the most recent import. For example, if the --incremental append argument was specified, along with --check-column id and --last-value 100, all rows with id > 100 will be imported. If an incremental import is run from the command line, the value which should be specified as --last-value in a subsequent incremental import will be printed to the screen for your reference. If an incremental import is run from a saved job, this value will be retained in the saved job.
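For illustration, an append-mode incremental import along those lines might look like the sketch below; the orders table name is hypothetical, and the connection details simply mirror the showroom job that follows:

# hypothetical append-mode import: picks up rows with id > 100 since the last run
sqoop import \
--connect jdbc:mysql://sandbox-hdp.hortonworks.com:3306/sales \
--username root \
--password-file /user/edw_user/sales/.password \
--table orders \
--incremental append \
--check-column id \
--last-value 100 \
--target-dir /user/edw_user/sales/staging/orders \
--num-mappers 1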

ssh edw_user@sandbox-hdp.hortonworks.com -p 2222
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" \
--create jb_stg_showroom \
-- import \
--bindir ./ \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://sandbox-hdp.hortonworks.com:3306/sales \
--username root \
--password-file /user/edw_user/sales/.password \
--table showroom \
--fetch-size 1000 \
--as-textfile \
--fields-terminated-by '|' \
--target-dir /user/edw_user/sales/staging/showroom \
--incremental lastmodified \
--check-column update_date \
--merge-key id \
--num-mappers 1

Once the job is created, verify & execute it. The MySQL password is provided via a password file. The data merge for incremental/delta changes is already taken care of by Sqoop's merge & incremental options.

--incremental (mode): Specifies how Sqoop determines which rows are new. (append / lastmodified ).
--check-column (col): Specifies the column to be examined when determining which rows to import.
--last-value (value): Specifies the maximum value of the check column from the previous import.
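The --password-file option expects the password to already be staged in HDFS. A minimal sketch to create it is shown below; the actual MySQL password value is a placeholder and the restrictive permissions are an assumption:

# write the MySQL password (no trailing newline) into HDFS and restrict access
echo -n "<mysql_password>" | hdfs dfs -put - /user/edw_user/sales/.password
hdfs dfs -chmod 400 /user/edw_user/sales/.password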

The commands below help to list, show & execute the Sqoop job. Execute the job to perform the initial data load. This Sqoop job will later be scheduled on a daily basis via an Oozie workflow.

sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --list
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --show jb_stg_showroom
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --exec jb_stg_showroom

If there is a need to delete the job, use:

sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --delete jb_stg_showroom

Now let’s define a Hive external table for the showroom dataset in staging. Connect to the Beeline CLI using edw_user as the username and hadoop as the password. We will connect to the Hive schema ‘sales_analytics’.

One Time Setup

beeline
!connect jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics edw_user
CREATE EXTERNAL TABLE IF NOT EXISTS ext_showroom (
id INT,
code VARCHAR(40),
name VARCHAR(50),
operation_date DATE,
staff_count INT,
country VARCHAR(50),
state VARCHAR(50),
address VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/edw_user/sales/staging/showroom';

SELECT * FROM ext_showroom LIMIT 10;

Next let’s define & load our final Hive table for the showroom dimension, as a Hive-managed ORC table.

CREATE TABLE IF NOT EXISTS dim_showroom(
id INT,
code VARCHAR(40),
name VARCHAR(50),
operation_date DATE,
staff_count INT,
country VARCHAR(50),
state VARCHAR(50),
address VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP
)
STORED AS ORC 
TBLPROPERTIES ("orc.compress"="SNAPPY");

!quit

Now we will write a script to load the data from the external table into the Hive-managed table. This script will be used later in the Oozie workflow manager to schedule the load.

Initial/Delta Setup

vi /home/edw_user/sampledata/load_showroom.hql

INSERT OVERWRITE TABLE dim_showroom SELECT * FROM ext_showroom;
ANALYZE TABLE dim_showroom COMPUTE STATISTICS FOR COLUMNS;

:wq


hdfs dfs -put /home/edw_user/sampledata/load_showroom.hql /user/edw_user/sales/scripts

Execute the script to trigger the initial data load.

beeline -u jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics -n edw_user -p hadoop -d org.apache.hive.jdbc.HiveDriver -f "/home/edw_user/sampledata/load_showroom.hql"
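Optionally, a quick row count on the dimension table confirms that the load went through (an illustrative check only):

beeline -u jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics -n edw_user -p hadoop -e "SELECT COUNT(*) FROM dim_showroom"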

Finally we will write & execute a shell script to copy & archive the showroom data files. This will also be used later by Oozie.

vi /home/edw_user/sampledata/archive_showroom.sh

hdfs dfs -mkdir /user/edw_user/sales/archive/showroom/`date +%Y%m%d`
hdfs dfs -cp /user/edw_user/sales/staging/showroom/* /user/edw_user/sales/archive/showroom/`date +%Y%m%d`

:wq


hdfs dfs -put /home/edw_user/sampledata/archive_showroom.sh /user/edw_user/sales/scripts
sh /home/edw_user/sampledata/archive_showroom.sh
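To verify the archive, list the date-stamped directory created by the script (the date suffix follows the same date +%Y%m%d pattern used in the script):

hdfs dfs -ls /user/edw_user/sales/archive/showroom/`date +%Y%m%d`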

exit

In our next article we will load the customer dimension table as SCD Type 1. We will use a different approach to deal with incremental/delta extraction and SCD implementation.


