In this article we will load the showroom master data from the MySQL source system into HDFS using Sqoop and maintain the showroom dimension as SCD Type 1.

Load Showroom Dimension Table

The Sqoop import utility imports data from a relational database system into HDFS. Imports and exports can be performed repeatedly by issuing the same command multiple times, which is the expected pattern when using the incremental import capability. Sqoop makes this easier by letting you define saved jobs: a saved job records the configuration information required to execute a Sqoop command at a later time.

Incremental imports are performed by comparing the values in a check column against a reference value for the most recent import. For example, if the --incremental append argument was specified, along with --check-column id and --last-value 100, all rows with id > 100 will be imported. If an incremental import is run from the command line, the value which should be specified as --last-value in a subsequent incremental import will be printed to the screen for your reference. If an incremental import is run from a saved job, this value will be retained in the saved job.
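For illustration only, an append-mode incremental import along the lines described above might look like the command below (the table and check column here are hypothetical and not part of this load):

sqoop import \
--connect jdbc:mysql://sandbox-hdp.hortonworks.com:3306/sales \
--username root \
--password-file /user/edw_user/sales/.password \
--table some_table \
--target-dir /user/edw_user/sales/staging/some_table \
--incremental append \
--check-column id \
--last-value 100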

ssh edw_user@sandbox-hdp.hortonworks.com -p 2222
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" \
--create jb_stg_showroom \
-- import \
--bindir ./ \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://sandbox-hdp.hortonworks.com:3306/sales \
--username root \
--password-file /user/edw_user/sales/.password \
--table showroom \
--fetch-size 1000 \
--as-textfile \
--fields-terminated-by '|' \
--target-dir /user/edw_user/sales/staging/showroom \
--incremental lastmodified \
--check-column update_date \
--merge-key id \
--num-mappers 1

Once the job is created, verify and execute it. The MySQL password is supplied through the password file on HDFS rather than on the command line. The merging of incremental/delta changes is handled by Sqoop itself via the incremental lastmodified mode and the --merge-key option (a standalone merge example follows the option list below).
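The password file itself is assumed to have been created on HDFS during the environment setup earlier in this series. If it has not, it can be created along these lines (the password value below is a placeholder):

echo -n 'your_mysql_password' > .password
hdfs dfs -put .password /user/edw_user/sales/.password
hdfs dfs -chmod 400 /user/edw_user/sales/.password
rm .password

Note that echo -n avoids a trailing newline, which Sqoop would otherwise treat as part of the password.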

--incremental (mode): Specifies how Sqoop determines which rows are new (append or lastmodified).
--check-column (col): Specifies the column to be examined when determining which rows to import.
--last-value (value): Specifies the maximum value of the check column from the previous import.
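When --merge-key is supplied together with --incremental lastmodified, Sqoop runs a merge step internally after each delta import, collapsing old and new records on the key. For reference only, the standalone merge tool does roughly the same job; the directories, jar and class names below are placeholders produced by Sqoop codegen, not values from this setup:

sqoop merge \
--new-data /user/edw_user/sales/staging/showroom_delta \
--onto /user/edw_user/sales/staging/showroom \
--target-dir /user/edw_user/sales/staging/showroom_merged \
--jar-file showroom.jar \
--class-name showroom \
--merge-key id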

The commands below list, show and execute the saved Sqoop job. Execute the job now to perform the initial data load; later in this series the same job will be scheduled on a daily basis via an Oozie workflow.

sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --list
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --show jb_stg_showroom
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --exec jb_stg_showroom

If there is a need to delete the job, use:

sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --delete jb_stg_showroom

Now let’s define a Hive external table for the Showroom dataset in staging. Connect to the Beeline CLI with edw_user as the username and hadoop as the password, and switch to the Hive schema ‘sales_analytics’.

One Time Setup

beeline
!connect jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics edw_user
CREATE EXTERNAL TABLE IF NOT EXISTS ext_showroom (
id INT,
code VARCHAR(40),
name VARCHAR(50),
operation_date DATE,
staff_count INT,
country VARCHAR(50),
state VARCHAR(50),
address VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/edw_user/sales/staging/showroom';

SELECT * FROM ext_showroom LIMIT 10;

Next, let’s define and load our final Hive table for the Showroom dimension as a Hive-managed ORC table.

CREATE TABLE IF NOT EXISTS dim_showroom(
id INT,
code VARCHAR(40),
name VARCHAR(50),
operation_date DATE,
staff_count INT,
country VARCHAR(50),
state VARCHAR(50),
address VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP
)
STORED AS ORC 
TBLPROPERTIES ("orc.compress"="SNAPPY");

!quit

Now we will write a script to load the data from the external table into the Hive-managed table. This script will later be used in the Oozie workflow manager to schedule the load.

Initial/Delta Setup

vi /home/edw_user/sampledata/load_showroom.hql

INSERT OVERWRITE TABLE dim_showroom SELECT * FROM ext_showroom;
ANALYZE TABLE dim_showroom COMPUTE STATISTICS FOR COLUMNS;

:wq


hdfs dfs -put /home/edw_user/sampledata/load_showroom.hql /user/edw_user/sales/scripts

Execute the script to trigger the initial data load.

beeline -u jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics -n edw_user -p hadoop -d org.apache.hive.jdbc.HiveDriver -f "/home/edw_user/sampledata/load_showroom.hql"
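As an optional sanity check, a quick row count against the dimension table confirms that the load completed:

beeline -u jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics -n edw_user -p hadoop -e "SELECT COUNT(*) FROM dim_showroom;"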

Finally, we will write and execute a shell script to copy the showroom data files into a date-stamped archive directory. This script will also be used later by Oozie.

vi /home/edw_user/sampledata/archive_showroom.sh

# Copy the staged showroom files into a date-stamped archive directory
hdfs dfs -mkdir -p /user/edw_user/sales/archive/showroom/`date +%Y%m%d`
hdfs dfs -cp /user/edw_user/sales/staging/showroom/* /user/edw_user/sales/archive/showroom/`date +%Y%m%d`

:wq


hdfs dfs -put /home/edw_user/sampledata/archive_showroom.sh /user/edw_user/sales/scripts
sh /home/edw_user/sampledata/archive_showroom.sh
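To verify the archive (optional), list the date-stamped directory created by the script:

hdfs dfs -ls /user/edw_user/sales/archive/showroom/`date +%Y%m%d`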

exit

In our next article we will load the customer dimension table as SCD Type 1, using a different approach for the incremental/delta extraction and the SCD implementation.

