Now that our dummy OLTP source system and Hadoop HDFS directory structure are ready, we will first load the ‘dates’ data file into HDFS and then into a Hive table.

Load Date Dimension Table

Log in to the SSH console as the edw_user account. Going forward, we will use edw_user to perform all the tasks. Put the data file from the local filesystem into the HDFS date staging directory. The date dimension is a one-time setup.

ssh edw_user@sandbox-hdp.hortonworks.com -p 2222
hdfs dfs -put /home/edw_user/sampledata/dates.psv /user/edw_user/sales/staging/date
hdfs dfs -tail /user/edw_user/sales/staging/date/dates.psv
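
Optionally, a quick listing of the same staging path confirms that the file landed in HDFS:

hdfs dfs -ls /user/edw_user/sales/staging/date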

Now we will define and load the date dimension table in Hive. Connect to the Beeline CLI using edw_user as the username and hadoop as the password. We will connect to the Hive schema ‘sales_analytics’.

beeline
!connect jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics edw_user
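
Alternatively, the same connection can be made in a single step from the shell; this is just a convenience sketch using standard Beeline options (-u for the JDBC URL, -n for the user, -p for the password):

beeline -u "jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics" -n edw_user -p hadoop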

An external table is a table for which Hive does not manage the storage: if you drop an external table, only its definition is deleted from Hive and the data remains in HDFS. An internal (managed) table is one that Hive fully manages: dropping it deletes both the definition and the data. First we will create a Hive external table that projects a schema onto the dates data file in HDFS, followed by a Hive-managed ORC table.

CREATE EXTERNAL TABLE IF NOT EXISTS ext_date (
year_number INT,
month_number INT,
day_of_year_number INT,
day_of_month_number INT,
day_of_week_number INT,
week_of_year_number INT,
day_name VARCHAR(20),
month_name VARCHAR(20),
quarter_number INT,
quarter_name VARCHAR(2),
year_quarter_name VARCHAR(10),
weekend_ind VARCHAR(1),
days_in_month_qty INT,
date_sk INT,
day_desc VARCHAR(10),
week_sk INT,
day_date DATE,
week_name VARCHAR(10),
week_of_month_number INT,
week_of_month_name VARCHAR(10),
month_sk INT,
quarter_sk INT,
year_sk INT,
year_sort_number VARCHAR(4),
day_of_week_sort_name VARCHAR(10)
) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/edw_user/sales/staging/date'
TBLPROPERTIES ("skip.header.line.count"="1");

SELECT * FROM ext_date LIMIT 10;
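
As an optional check, the table metadata can be inspected to confirm that Hive registered it as an external table pointing at the staging directory:

DESCRIBE FORMATTED ext_date;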

Next, let’s define our final date dimension table as a Hive-managed ORC table. The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data and improves performance when Hive reads, writes, and processes it. Queries on ORC tables can run 10 to 100 times faster thanks to compression (roughly 10x), predicate pushdown (skipping blocks based on WHERE conditions), and columnar storage (only the needed columns are read), making it the recommended format for Hive data on HDP; on CDH, Parquet is the equivalent choice.

CREATE TABLE IF NOT EXISTS dim_date(
year_number INT,
month_number INT,
day_of_year_number INT,
day_of_month_number INT,
day_of_week_number INT,
week_of_year_number INT,
day_name VARCHAR(20),
month_name VARCHAR(20),
quarter_number INT,
quarter_name VARCHAR(2),
year_quarter_name VARCHAR(10),
weekend_ind VARCHAR(1),
days_in_month_qty INT,
date_sk INT,
day_desc VARCHAR(10),
week_sk INT,
day_date DATE,
week_name VARCHAR(10),
week_of_month_number INT,
week_of_month_name VARCHAR(10),
month_sk INT,
quarter_sk INT,
year_sk INT,
year_sort_number VARCHAR(4),
day_of_week_sort_name VARCHAR(10)
)
STORED AS ORC 
TBLPROPERTIES ("orc.compress"="SNAPPY");

ORC supports several compression codecs: SNAPPY favours read/write speed, while ZLIB compresses harder and saves disk space. Now we need to move the data from the external table into the Hive-managed ORC table. Doing this conversion in Hive is efficient and fast because it runs as a parallel, distributed job, and no standalone ORC conversion tool is needed. Once the data is in the ORC table, we will drop the external table and move the dates data file from the staging directory to the archive directory.

INSERT OVERWRITE TABLE dim_date SELECT * FROM ext_date;
ANALYZE TABLE dim_date COMPUTE STATISTICS FOR COLUMNS;
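-- Optional check (sketch): inspect the column statistics computed above, e.g. for date_sk
DESCRIBE FORMATTED dim_date date_sk;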
SELECT * FROM dim_date LIMIT 10;
DROP TABLE ext_date;
!quit

Finally, move the dates data file to the archive directory and exit the SSH session.

hdfs dfs -mv /user/edw_user/sales/staging/date/dates.psv /user/edw_user/sales/archive/date
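
To verify the move, list the archive directory (path as used above); the dates.psv file should now appear there:

hdfs dfs -ls /user/edw_user/sales/archive/date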

exit

In the next article we will load the showroom dimension table as SCD Type 1 using Sqoop & Hive.

