In this article we will load our master data table ‘Product’ as a Type 2 Slowly Changing Dimension (SCD Type 2) to maintain full history, so that the sales and stock data can later be analysed with reference to the historical master data.

Load Product Dimension Table

SCD Type 2 handling for this dimension is maintained on the RDBMS side. So we will export the table as a text file and then import it into Hadoop HDFS. Once the file is in HDFS we will define the Hive tables on top of it. In a real project you would typically use the bulk export utility of the RDBMS, and the exported data file would need to be transferred securely from the database server to the HDFS cluster (a sketch of such a transfer follows the export steps below). Login with user root and password hadoop.

ssh root@sandbox-hdp.hortonworks.com -p 2222
rm -rf /var/lib/mysql-files/dim_product.psv

mysql -u root -p
use sales_dw;
SHOW VARIABLES LIKE "secure_file_priv";
SELECT * INTO OUTFILE '/var/lib/mysql-files/dim_product.psv' FIELDS TERMINATED BY '|' FROM dim_product;
exit;

mv /var/lib/mysql-files/dim_product.psv /home/edw_user/sampledata
exit
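
In the sandbox the database and the Hadoop services run on the same machine, so a local mv is enough. As noted above, in a real deployment the exported file would be copied securely from the database server to an edge node of the cluster. A minimal sketch using scp is shown below; the hostname db-server.example.com and the dbadmin user are assumptions for illustration only.

# Hypothetical secure copy from the database server to the HDFS edge node
# (illustrative hostnames/paths; not needed in the single-machine sandbox)
scp -P 22 dbadmin@db-server.example.com:/var/lib/mysql-files/dim_product.psv /home/edw_user/sampledata/dim_product.psv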

Next, login to the ssh console as edw_user and follow the steps below to upload the data file to the HDFS filesystem.

ssh edw_user@sandbox-hdp.hortonworks.com -p 2222
hdfs dfs -put /home/edw_user/sampledata/dim_product.psv /user/edw_user/sales/staging/product
hdfs dfs -tail /user/edw_user/sales/staging/product/dim_product.psv
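
As an optional sanity check (not part of the original walkthrough), the number of records staged in HDFS can be compared with the row count of the source table in MySQL:

hdfs dfs -cat /user/edw_user/sales/staging/product/dim_product.psv | wc -l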

Now let’s define a Hive external table for the Product dataset in staging. Connect to the Beeline CLI using edw_user as the username and hadoop as the password. We will connect to the Hive schema ‘sales_analytics’.

One time setup

beeline
!connect jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics edw_user

CREATE EXTERNAL TABLE IF NOT EXISTS ext_dim_product (
prod_key INT,
id INT,
code VARCHAR(50),
category VARCHAR(6),
make VARCHAR(50),
model VARCHAR(50),
year VARCHAR(50),
color VARCHAR(50),
price INT,
currency VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP,
start_date DATE,
end_date DATE,
active_flag VARCHAR(1),
load_date DATE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/edw_user/sales/staging/product';

SELECT * FROM ext_dim_product LIMIT 10;
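
If the columns appear shifted or NULL, the delimiter or column order in the DDL most likely does not match the exported file. A simple count can also be compared against the source table; this is an optional verification step, not part of the original flow:

SELECT COUNT(*) FROM ext_dim_product;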

Next let’s define & load our final hive table for Showroom dimension, as Hive managed ORC table.

CREATE TABLE IF NOT EXISTS dim_product(
prod_key INT,
id INT,
code VARCHAR(50),
category VARCHAR(6),
make VARCHAR(50),
model VARCHAR(50),
year VARCHAR(50),
color VARCHAR(50),
price INT,
currency VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP,
start_date DATE,
end_date DATE,
active_flag VARCHAR(1),
load_date DATE
)
STORED AS ORC 
TBLPROPERTIES ("orc.compress"="SNAPPY");
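
Optionally, the storage format and compression of the managed table can be confirmed before loading any data (an extra verification step, not in the original flow):

DESCRIBE FORMATTED dim_product;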

!quit

Now we will write a script to load the data from the external table into the Hive managed table. This script will be reused later in the Oozie workflow manager to schedule the load.

Initial & Daily Script

vi /home/edw_user/sampledata/load_product.hql

-- Full refresh: SCD Type 2 history is already maintained at the source, so simply overwrite
INSERT OVERWRITE TABLE dim_product SELECT * FROM ext_dim_product;
-- Gather column statistics for the cost-based optimizer
ANALYZE TABLE dim_product COMPUTE STATISTICS FOR COLUMNS;

:wq


hdfs dfs -put /home/edw_user/sampledata/load_product.hql /user/edw_user/sales/scripts

Execute the script to trigger the initial data load.

beeline -u jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics -n edw_user -p hadoop -d org.apache.hive.jdbc.HiveDriver -f "/home/edw_user/sampledata/load_product.hql"
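
Once the dimension is loaded, the SCD Type 2 attributes (start_date, end_date, active_flag) can be used from Beeline either to pick the current version of each product or to resolve the version that was valid on a given business date. Below is a minimal sketch; the literal date and the assumption that the flag holds 'Y' for active rows are for illustration only.

-- Current version of every product (assuming active_flag stores 'Y'/'N')
SELECT prod_key, id, code, price FROM dim_product WHERE active_flag = 'Y';

-- Version of each product valid on a given date (example date)
SELECT prod_key, id, code, price FROM dim_product WHERE DATE '2018-01-15' BETWEEN start_date AND end_date;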

Finally, move the product data files to the archive directory. This script will also be reused later by Oozie.

vi /home/edw_user/sampledata/archive_product.sh

# Create a dated archive directory and move the staged product files into it
hdfs dfs -mkdir /user/edw_user/sales/archive/product/`date +%Y%m%d`
hdfs dfs -mv /user/edw_user/sales/staging/product/* /user/edw_user/sales/archive/product/`date +%Y%m%d`

:wq


hdfs dfs -put /home/edw_user/sampledata/archive_product.sh /user/edw_user/sales/scripts
sh /home/edw_user/sampledata/archive_product.sh

exit

In the next article we will load the Sales fact table. There we will remap the product natural key to the surrogate key for the transaction records using a Pig script.

