In this article we will load our master data table ‘Product’ as a Slowly Changing Dimension of Type 2 to maintain full history, so that the sales and stock data can be analyzed with reference to the historical master data.

Load Product Dimension Table

We have already maintained the SCD Type 2 implementation on the RDBMS side. So we will export the table as a text file and then import it into Hadoop HDFS. Once the file is in HDFS we will define the Hive tables on top of it. In reality you may use the bulk export utility of your RDBMS; also, the export data file needs to be transferred securely from the database server machine to the HDFS cluster. Login with user root and password hadoop.

ssh root@sandbox-hdp.hortonworks.com -p 2222
rm -rf /var/lib/mysql-files/dim_product.psv

mysql -u root -p
use sales_dw;
SHOW VARIABLES LIKE "secure_file_priv";
SELECT * INTO OUTFILE '/var/lib/mysql-files/dim_product.psv' FIELDS TERMINATED BY '|' FROM dim_product;
exit;

mv /var/lib/mysql-files/dim_product.psv /home/edw_user/sampledata
exit
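
In the sandbox everything runs on a single machine, so a local mv is sufficient. On a real deployment the export file would instead be copied securely from the database server to an edge node of the Hadoop cluster, for example with scp. The sketch below is only illustrative; the hostname hadoop-edge is a placeholder for your environment, not part of the sandbox setup.

# hypothetical secure copy from the database server to a Hadoop edge node
scp /var/lib/mysql-files/dim_product.psv edw_user@hadoop-edge:/home/edw_user/sampledata/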

Next, login to the ssh console as edw_user and follow the steps below to upload the data file to the HDFS filesystem.

ssh edw_user@sandbox-hdp.hortonworks.com -p 2222
hdfs dfs -put /home/edw_user/sampledata/dim_product.psv /user/edw_user/sales/staging/product
hdfs dfs -tail /user/edw_user/sales/staging/product/dim_product.psv
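
As a quick sanity check, the record count in the staged file can be compared against the row count of dim_product in MySQL. Assuming the export wrote one line per record, a simple line count is enough:

# staged record count; this should match SELECT COUNT(*) FROM dim_product in MySQL
hdfs dfs -cat /user/edw_user/sales/staging/product/dim_product.psv | wc -l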

Now let’s define a Hive external table for the Product dataset in staging. Connect to the Beeline CLI using edw_user as the username and hadoop as the password. We will connect to the Hive schema ‘sales_analytics’.

One time setup

beeline
!connect jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics edw_user

CREATE EXTERNAL TABLE IF NOT EXISTS ext_dim_product (
prod_key INT,
id INT,
code VARCHAR(50),
category VARCHAR(6),
make VARCHAR(50),
model VARCHAR(50),
year VARCHAR(50),
color VARCHAR(50),
price INT,
currency VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP,
start_date DATE,
end_date DATE,
active_flag VARCHAR(1),
load_date DATE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/edw_user/sales/staging/product';

SELECT * FROM ext_dim_product LIMIT 10;

Next let’s define and load our final Hive table for the Product dimension, as a Hive-managed ORC table.

CREATE TABLE IF NOT EXISTS dim_product(
prod_key INT,
id INT,
code VARCHAR(50),
category VARCHAR(6),
make VARCHAR(50),
model VARCHAR(50),
year VARCHAR(50),
color VARCHAR(50),
price INT,
currency VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP,
start_date DATE,
end_date DATE,
active_flag VARCHAR(1),
load_date DATE
)
STORED AS ORC 
TBLPROPERTIES ("orc.compress"="SNAPPY");
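
Optionally, before quitting Beeline you may verify that the table was created with the expected ORC storage format and Snappy compression:

DESCRIBE FORMATTED dim_product;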

!quit

Now we will write a script to load the data from the external table into the Hive managed table. This script will be used later in the Oozie workflow manager to schedule the load.

Initial & Daily Script

vi /home/edw_user/sampledata/load_product.hql

INSERT OVERWRITE TABLE dim_product SELECT * FROM ext_dim_product;
ANALYZE TABLE dim_product COMPUTE STATISTICS FOR COLUMNS;

:wq


hdfs dfs -put /home/edw_user/sampledata/load_product.hql /user/edw_user/sales/scripts

Execute the script to trigger the initial data load.

beeline -u jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics -n edw_user -p hadoop -d org.apache.hive.jdbc.HiveDriver -f "/home/edw_user/sampledata/load_product.hql"
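
Since dim_product is a Type 2 dimension, a product may have several versions distinguished by start_date, end_date and active_flag. The queries below, which can be run from Beeline against the sales_analytics schema, are only a sketch: they assume current records are flagged 'Y' and use a hypothetical reference date, both of which may differ in your setup.

-- current version of each product (assumes active records are flagged 'Y')
SELECT id, code, price, start_date, end_date FROM dim_product WHERE active_flag = 'Y' LIMIT 10;

-- point-in-time lookup: the version effective on a given (hypothetical) date
SELECT id, code, price FROM dim_product WHERE CAST('2018-01-01' AS DATE) BETWEEN start_date AND end_date;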

Finally, move the product datafiles to the archive directory. This script will also be used later by Oozie.

vi /home/edw_user/sampledata/archive_product.sh

hdfs dfs -mkdir /user/edw_user/sales/archive/product/`date +%Y%m%d`
hdfs dfs -mv /user/edw_user/sales/staging/product/* /user/edw_user/sales/archive/product/`date +%Y%m%d`

:wq


hdfs dfs -put /home/edw_user/sampledata/archive_product.sh /user/edw_user/sales/scripts
sh /home/edw_user/sampledata/archive_product.sh
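
To confirm the move, list the day’s archive directory; the directory name is derived from the current date:

hdfs dfs -ls /user/edw_user/sales/archive/product/`date +%Y%m%d`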

exit

In the next article we will load the Sales fact table. There we will remap the product natural key to the surrogate key for the transaction records using a Pig script.

