Hadoop DataLake Implementation Part 4

Updated on Oct 03, 2020

Now that our dummy OLTP source system & Hadoop HDFS directory structure is ready, we will first load the ‘dates’ data file in HDFS and further to a hive table.

Load Date Dimension Table

Login to ssh console with user edw_user. Going forward we are going to use the edw_user account to perform all the tasks. Put the data file from local filesystem to HDFS dates directory. Date Dimension table setup will be a one-time setup only.

ssh edw_user@192.168.136.139 -p 2222
hdfs dfs -put /home/edw_user/sampledata/dates.psv /user/edw_user/sales/staging/date
hdfs dfs -tail /user/edw_user/sales/staging/date/dates.psv

Now we will define & load the Date Dimension table in Hive. Connect to Beeline CLI using edw_user as username and password as hadoop. We will connect to hive schema ‘sales_analytics’.

beeline
!connect jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics edw_user

An external table is a table for which Hive does not manage storage. If you delete an external table, only the definition in Hive is deleted. The data remains. An internal table is a table that Hive manages. If you delete an internal table, both the definition in Hive and the data are deleted. First we will create a hive external table to project structure to the dates data file in HDFS followed by a Hive managed ORC table.

CREATE EXTERNAL TABLE IF NOT EXISTS ext_date (
year_number INT,
month_number INT,
day_of_year_number INT,
day_of_month_number INT,
day_of_week_number INT,
week_of_year_number INT,
day_name VARCHAR(20),
month_name VARCHAR (20),
quarter_number INT,
quarter_name VARCHAR(2),
year_quarter_name VARCHAR(10),
weekend_ind VARCHAR(1),
days_in_month_qty INT,
date_sk INT,
day_desc VARCHAR(10),
week_sk INT,
day_date DATE,
week_name VARCHAR(10),
week_of_month_number INT,
week_of_month_name VARCHAR(10),
month_sk INT,
quarter_sk INT,
year_sk INT,
year_sort_number VARCHAR(4),
day_of_week_sort_name VARCHAR(10)
) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/edw_user/sales/staging/date'
TBLPROPERTIES ("skip.header.line.count"="1");

SELECT * FROM ext_date LIMIT 10;

Next let’s define our final table for date dimension, as Hive managed ORC table. The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. Using ORC files improves performance when Hive is reading, writing, and processing data. Queries on tables with Optimized column storage format performs 10-100x faster. Definitely the way to go to store your Hive data for optimal performance. Including compression (10x for zip), predicate pushdown (skipping blocks based on where conditions), column storage (only the needed columns are read). Use ORC storage in HDP, Parquet in case of CDH.

CREATE TABLE IF NOT EXISTS dim_date(
year_number INT,
month_number INT,
day_of_year_number INT,
day_of_month_number INT,
day_of_week_number INT,
week_of_year_number INT,
day_name VARCHAR(20),
month_name VARCHAR (20),
quarter_number INT,
quarter_name VARCHAR(2),
year_quarter_name VARCHAR(10),
weekend_ind VARCHAR(1),
days_in_month_qty INT,
date_sk INT,
day_desc VARCHAR(10),
week_sk INT,
day_date DATE,
week_name VARCHAR(10),
week_of_month_number INT,
week_of_month_name VARCHAR(10),
month_sk INT,
quarter_sk INT,
year_sk INT,
year_sort_number VARCHAR(4),
day_of_week_sort_name VARCHAR(10)
)
STORED AS ORC 
TBLPROPERTIES ("orc.compress"="SNAPPY");

ORC File provides various compression types. We can use SNAPPY compression for time based performance or ZLIB compression for resource performance (Save Drive Storage Space). Now we need to move the data from the external table to a Hive-managed ORC table. Using Hive to convert an external table into an ORC file format is very efficient and fast because the conversion is a parallel and distributed action, and no standalone ORC conversion tool is necessary. Once the data is in ORC table, we will drop the hive external table and move the date’s data file from staging to archive directory.

INSERT OVERWRITE TABLE dim_date SELECT * FROM ext_date;
ANALYZE TABLE dim_date COMPUTE STATISTICS FOR COLUMNS;
SELECT * FROM dim_date LIMIT 10;
DROP TABLE ext_date;
!quit

Finally move the dates datafile to archive directory.

hdfs dfs -mv /user/edw_user/sales/staging/date/dates.psv /user/edw_user/sales/archive/date

exit

In the next article we will load the showroom dimension table as SCD Type 1 using Sqoop & Hive.

Hadoop DataLake Implementation Part 4

Load Date Dimension Table

PrimeChess.org

Top 10 Articles