In this article we will load the Customer data into the Hive warehouse as an SCD Type 1 dimension. This time we will follow a different approach, implementing the Insert/Update or Merge strategy using HiveQL rather than the Sqoop Merge utility.

Load Customer Dimension Table

Using Sqoop, we will now load the customer data, both the initial/base and the incremental datasets, from MySQL to HDFS.

ssh edw_user@sandbox-hdp.hortonworks.com -p 2222
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" \
--create jb_stg_customer \
-- import \
--bindir ./ \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://sandbox-hdp.hortonworks.com:3306/sales \
--username root \
--password-file /user/edw_user/sales/.password \
--table customer \
--fetch-size 1000 \
--as-textfile \
--fields-terminated-by '|' \
--target-dir /user/edw_user/sales/staging/customer \
--incremental lastmodified \
--check-column update_date \
--append \
--split-by id \
--num-mappers 2

Once the job is created, verify and execute it. This Sqoop job will be scheduled on a daily basis via an Oozie workflow.

sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --list
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --show jb_stg_customer
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --exec jb_stg_customer
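
Optionally, once the job has run, a quick HDFS listing of the staging directory (the same target directory specified in the job definition above) confirms that the delimited data files have landed:

hdfs dfs -ls /user/edw_user/sales/staging/customer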

Now we will define a Hive external table for the Customer staging data, as well as the final Hive managed ORC dimension table. Connect to the Beeline CLI using edw_user as the username and hadoop as the password. We will connect to the Hive schema ‘sales_analytics’.

One-time Setup

beeline
!connect jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics edw_user

CREATE EXTERNAL TABLE IF NOT EXISTS ext_customer (
id INT,
first_name VARCHAR(50),
last_name VARCHAR(50),
gender VARCHAR(50),
dob DATE,
company VARCHAR(50),
job VARCHAR(50),
email VARCHAR(50),
country VARCHAR(50),
state VARCHAR(50),
address VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/edw_user/sales/staging/customer';

SELECT * FROM ext_customer LIMIT 10;

Now we define our final Customer dimension table as a Hive managed ORC table.

CREATE TABLE IF NOT EXISTS dim_customer(
id INT,
first_name VARCHAR(50),
last_name VARCHAR(50),
gender VARCHAR(50),
dob DATE,
company VARCHAR(50),
job VARCHAR(50),
email VARCHAR(50),
country VARCHAR(50),
state VARCHAR(50),
address VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP
)
STORED AS ORC 
TBLPROPERTIES ("orc.compress"="SNAPPY");
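
As an optional check (not part of the original steps), DESCRIBE FORMATTED confirms the storage format and the SNAPPY compression property we have just set:

DESCRIBE FORMATTED dim_customer;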

Let us create a Hive view based on the staging and dimension tables, which will show only one record for each unique “id”, based on the latest “update_date” value. Once the initial or incremental dataset is in HDFS, we will follow the steps below to load our final customer dimension.

CREATE VIEW vw_recon_customer AS
SELECT * FROM ext_customer
UNION ALL
SELECT * FROM dim_customer
WHERE NOT EXISTS (SELECT 1
FROM ext_customer
WHERE dim_customer.id = ext_customer.id);
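
As a quick sanity check (an optional query, not part of the original flow), a count against the view confirms that it resolves and returns the expected number of customer records:

SELECT COUNT(*) FROM vw_recon_customer;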

!quit

Using the view, we will populate an intermediate table and then load the final customer dimension table. Now we will write a script to load the data from the external table into the Hive managed table. This script will be used later in the Oozie workflow manager to schedule the load.

Initial & Daily Script

vi /home/edw_user/sampledata/load_customer.hql

DROP TABLE IF EXISTS interim_customer;
CREATE TABLE interim_customer AS SELECT * FROM vw_recon_customer;

DROP TABLE IF EXISTS dim_customer;
CREATE TABLE IF NOT EXISTS dim_customer(
id INT,
first_name VARCHAR(50),
last_name VARCHAR(50),
gender VARCHAR(50),
dob DATE,
company VARCHAR(50),
job VARCHAR(50),
email VARCHAR(50),
country VARCHAR(50),
state VARCHAR(50),
address VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP
)
STORED AS ORC 
TBLPROPERTIES ("orc.compress"="SNAPPY");

INSERT OVERWRITE TABLE dim_customer SELECT * FROM interim_customer;

ANALYZE TABLE dim_customer COMPUTE STATISTICS FOR COLUMNS;

:wq


hdfs dfs -put /home/edw_user/sampledata/load_customer.hql /user/edw_user/sales/scripts

Execute the script to trigger the initial data load.

beeline -u jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics -n edw_user -p hadoop -d org.apache.hive.jdbc.HiveDriver -f "/home/edw_user/sampledata/load_customer.hql"
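
Once the load completes, an optional row-count check on the dimension table, using the same connection details, confirms the data is in place:

beeline -u jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics -n edw_user -p hadoop -e "SELECT COUNT(*) FROM dim_customer;"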

Finally, move the customer data files to the archive directory. This script will also be used later by Oozie.

vi /home/edw_user/sampledata/archive_customer.sh

hdfs dfs -mkdir -p /user/edw_user/sales/archive/customer/`date +%Y%m%d`
hdfs dfs -mv /user/edw_user/sales/staging/customer/* /user/edw_user/sales/archive/customer/`date +%Y%m%d`

:wq


hdfs dfs -put /home/edw_user/sampledata/archive_customer.sh /user/edw_user/sales/scripts
sh /home/edw_user/sampledata/archive_customer.sh
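
To verify the archive step, an optional listing of today's archive directory (using the same date format as in the script) should show the moved files:

hdfs dfs -ls /user/edw_user/sales/archive/customer/`date +%Y%m%d`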
exit

In the next article we will load the Product dimension as SCD Type 2.

