Hadoop DataLake Implementation Part 7
In this article we will load our master data table ‘Product’ as a Slowly Changing Dimension of Type 2 (SCD Type 2), so that full history is maintained and the sales and stock data can be analysed against the master data as it was at any point in time.
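Once loaded, the SCD Type 2 columns make point-in-time analysis straightforward: every change to a product produces a new row with its own validity window. Below is a sketch of such a lookup, using the dim_product columns defined later in this article; the as-of date is only an example, and it assumes end_date holds a high default value for the open row (the current version can always be fetched with active_flag = 'Y').
SELECT prod_key, id, make, model, price
FROM dim_product
WHERE DATE '2017-06-30' BETWEEN start_date AND end_date;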
Load Product Dimension Table
The SCD Type 2 logic is already maintained on the RDBMS side, so we simply export the table as a text file and then import it into HDFS. Once the file is in HDFS we will define the Hive tables on top of it. In practice you would use the bulk export utility of your RDBMS, and the export file would need to be transferred securely from the database server to the HDFS cluster. Login with user root and password hadoop.
ssh root@192.168.136.139 -p 2222
rm -rf /var/lib/mysql-files/dim_product.psv
mysql -u root -p
use sales_dw;
SHOW VARIABLES LIKE "secure_file_priv";
SELECT * INTO OUTFILE '/var/lib/mysql-files/dim_product.psv' FIELDS TERMINATED BY '|' FROM dim_product;
exit;
mv /var/lib/mysql-files/dim_product.psv /home/edw_user/sampledata
exit
Next, log in to the SSH console as edw_user and follow the steps below to upload the data file to the HDFS filesystem.
ssh edw_user@192.168.136.139 -p 2222
hdfs dfs -put /home/edw_user/sampledata/dim_product.psv /user/edw_user/sales/staging/product
hdfs dfs -tail /user/edw_user/sales/staging/product/dim_product.psv
Now let’s define a Hive external table for the Product dataset in staging. Connect to the Beeline CLI with edw_user as the username and hadoop as the password; we will connect to the Hive schema ‘sales_analytics’.
One time setup
beeline
!connect jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics edw_user
CREATE EXTERNAL TABLE IF NOT EXISTS ext_dim_product (
prod_key INT,
id INT,
code VARCHAR(50),
category VARCHAR(6),
make VARCHAR(50),
model VARCHAR(50),
year VARCHAR(50),
color VARCHAR(50),
price INT,
currency VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP,
start_date DATE,
end_date DATE,
active_flag VARCHAR(1),
load_date DATE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/edw_user/sales/staging/product';
SELECT * FROM ext_dim_product LIMIT 10;
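As an optional sanity check on the staged data, you can also confirm that both current and historical versions of the products have arrived, for example by counting rows per active flag:
SELECT active_flag, COUNT(*) AS row_cnt
FROM ext_dim_product
GROUP BY active_flag;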
Next let’s define our final Hive table for the Product dimension as a Hive managed ORC table; the data will be loaded into it by the script in the next step.
CREATE TABLE IF NOT EXISTS dim_product(
prod_key INT,
id INT,
code VARCHAR(50),
category VARCHAR(6),
make VARCHAR(50),
model VARCHAR(50),
year VARCHAR(50),
color VARCHAR(50),
price INT,
currency VARCHAR(50),
update_date TIMESTAMP,
create_date TIMESTAMP,
start_date DATE,
end_date DATE,
active_flag VARCHAR(1),
load_date DATE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");
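If you want to confirm that the table has been created as ORC with SNAPPY compression, you can inspect it from the same Beeline session:
DESCRIBE FORMATTED dim_product;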
!quit
Now we will write a script to load the data from the external table into the Hive managed table. This script will later be used in the Oozie workflow manager to schedule the load.
Initial & Daily Script
vi /home/edw_user/sampledata/load_product.hql
-- Full refresh: SCD Type 2 history is already maintained at the RDBMS side
INSERT OVERWRITE TABLE dim_product SELECT * FROM ext_dim_product;
-- Gather column statistics for the cost-based optimizer
ANALYZE TABLE dim_product COMPUTE STATISTICS FOR COLUMNS;
hdfs dfs -put /home/edw_user/sampledata/load_product.hql /user/edw_user/sales/scripts
Execute the script to trigger the initial data load.
beeline -u jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics -n edw_user -p hadoop -d org.apache.hive.jdbc.HiveDriver -f "/home/edw_user/sampledata/load_product.hql"
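Optionally, verify the load by checking the row count of the managed table against the staged data, for example:
beeline -u jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics -n edw_user -p hadoop -e "SELECT COUNT(*) FROM dim_product;"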
Finally, move the product data files to the archive directory. This script will also be used later by Oozie.
vi /home/edw_user/sampledata/archive_product.sh
# Create a dated archive directory and move the processed staging files into it
hdfs dfs -mkdir /user/edw_user/sales/archive/product/`date +%Y%m%d`
hdfs dfs -mv /user/edw_user/sales/staging/product/* /user/edw_user/sales/archive/product/`date +%Y%m%d`
hdfs dfs -put /home/edw_user/sampledata/archive_product.sh /user/edw_user/sales/scripts
sh /home/edw_user/sampledata/archive_product.sh
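If you wish, confirm that the staging directory is now empty and that the data file sits under the dated archive folder:
hdfs dfs -ls -R /user/edw_user/sales/archive/product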
exit
In the next article we will load the Sales fact table. There we will remap the product natural key to the corresponding surrogate key for the transaction records using a Pig script.