In this article we will load the Customer data into the Hive warehouse as an SCD Type 1 dimension. This time we will follow a different approach to implement the insert/update (merge) strategy, using HiveQL rather than the Sqoop merge utility.

Load Customer Dimension Table

Using Sqoop we will now load the customer data, the initial/base as well as the incremental dataset, from MySQL to HDFS.

ssh edw_user@sandbox-hdp.hortonworks.com -p 2222
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" \
--create jb_stg_customer \
-- import \
--bindir ./ \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://sandbox-hdp.hortonworks.com:3306/sales \
--username root \
--password-file /user/edw_user/sales/.password \
--table customer \
--fetch-size 1000 \
--as-textfile \
--fields-terminated-by '|' \
--target-dir /user/edw_user/sales/staging/customer \
--incremental lastmodified \
--check-column update_date \
--append \
--split-by id \
--num-mappers 2

Once the job is created, verify and execute it. This Sqoop job will be scheduled on a daily basis via an Oozie workflow.

sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --list
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --show jb_stg_customer
sqoop job --meta-connect "jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop" --exec jb_stg_customer
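Optionally, once the job has executed, you can confirm that the import landed data files under the staging directory (the --target-dir used in the job definition above). This is just a quick sanity check:

hdfs dfs -ls /user/edw_user/sales/staging/customer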

Now we will define a Hive external table for the Customer staging data as well as the final Hive-managed ORC dimension table. Connect to the Beeline CLI using edw_user as the username and hadoop as the password. We will connect to the Hive schema ‘sales_analytics’.

One Time Setup

beeline
!connect jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics edw_user

CREATE EXTERNAL TABLE IF NOT EXISTS ext_customer (
id INT,
first_name VARCHAR(50),
last_name VARCHAR(50),
gender VARCHAR(50),
dob DATE,
company VARCHAR(50),
job VARCHAR(50),
email VARCHAR(50),
country VARCHAR(50),
state VARCHAR(50),
address VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/edw_user/sales/staging/customer';

SELECT * FROM ext_customer LIMIT 10;

Now we define our final Customer Dimension table as a Hive-managed ORC table.

CREATE TABLE IF NOT EXISTS dim_customer(
id INT,
first_name VARCHAR(50),
last_name VARCHAR(50),
gender VARCHAR(50),
dob DATE,
company VARCHAR(50),
job VARCHAR(50),
email VARCHAR(50),
country VARCHAR(50),
state VARCHAR(50),
address VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP
)
STORED AS ORC 
TBLPROPERTIES ("orc.compress"="SNAPPY");
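If you want to double-check the storage format and compression, a DESCRIBE from the same Beeline session will show the ORC storage handler and the SNAPPY table property we set above:

DESCRIBE FORMATTED dim_customer;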

Let us create a Hive view over the staging and dimension tables that returns only one record for each unique “id”: the incoming staging record takes precedence over the existing dimension record, which effectively keeps the row with the latest “update_date”. Once the initial or incremental dataset is in HDFS, we will follow the steps below to load our final customer dimension.

CREATE VIEW vw_recon_customer AS
SELECT * FROM ext_customer
UNION ALL
SELECT * FROM dim_customer
WHERE NOT EXISTS (SELECT 1
FROM ext_customer
WHERE dim_customer.id = ext_customer.id);
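Before moving on, you may want to verify that the view really does expose a single record per “id”. The aggregate below is only an optional sanity query, not part of the load; if each customer appears once it returns no rows, while any duplicates (for example, multiple updates for the same customer in one staging extract) would be listed:

SELECT id, COUNT(*) AS cnt
FROM vw_recon_customer
GROUP BY id
HAVING COUNT(*) > 1;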

!quit

Using the view we will populate an intermediate table, which is then used to load the final customer dimension table. Now we will write a script to load the data from the external table into the Hive-managed table. This script will be used later in the Oozie workflow manager to schedule the load.

Initial & Daily Script

vi /home/edw_user/sampledata/load_customer.hql

-- Rebuild the intermediate table from the reconciliation view
DROP TABLE IF EXISTS interim_customer;
CREATE TABLE interim_customer AS SELECT * FROM vw_recon_customer;

-- Recreate and reload the final dimension table
DROP TABLE IF EXISTS dim_customer;
CREATE TABLE IF NOT EXISTS dim_customer(
id INT,
first_name VARCHAR(50),
last_name VARCHAR(50),
gender VARCHAR(50),
dob DATE,
company VARCHAR(50),
job VARCHAR(50),
email VARCHAR(50),
country VARCHAR(50),
state VARCHAR(50),
address VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP
)
STORED AS ORC 
TBLPROPERTIES ("orc.compress"="SNAPPY");

INSERT OVERWRITE TABLE dim_customer SELECT * FROM interim_customer;

ANALYZE TABLE dim_customer COMPUTE STATISTICS FOR COLUMNS;

:wq


hdfs dfs -put /home/edw_user/sampledata/load_customer.hql /user/edw_user/sales/scripts

Execute the script to trigger the initial data load.

beeline -u jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics -n edw_user -p hadoop -d org.apache.hive.jdbc.HiveDriver -f "/home/edw_user/sampledata/load_customer.hql"
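Optionally, a quick count on the reloaded dimension confirms that the merge produced the expected number of customers. The one-liner below is just a sanity check run through Beeline's -e option and is not part of the scheduled flow:

beeline -u jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics -n edw_user -p hadoop -e "SELECT COUNT(*) FROM dim_customer;"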

Finally, move the customer data files to the archive directory. This script will also be used later by Oozie.

vi /home/edw_user/sampledata/archive_customer.sh

# Archive today's staging files into a dated directory (-p avoids an error if it already exists)
hdfs dfs -mkdir -p /user/edw_user/sales/archive/customer/`date +%Y%m%d`
hdfs dfs -mv /user/edw_user/sales/staging/customer/* /user/edw_user/sales/archive/customer/`date +%Y%m%d`

:wq


hdfs dfs -put /home/edw_user/sampledata/archive_customer.sh /user/edw_user/sales/scripts
sh /home/edw_user/sampledata/archive_customer.sh
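To confirm the archive step worked, you can list today's dated archive directory before exiting:

hdfs dfs -ls /user/edw_user/sales/archive/customer/`date +%Y%m%d`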
exit

In the next article we will load Product dimension as SCD Type 2.

