Now that our dummy OLTP source system & Hadoop HDFS directory structure are ready, we will first load the ‘dates’ data file into HDFS and then into a Hive table.

Load Date Dimension Table

Log in to the SSH console as user edw_user. Going forward we will use the edw_user account to perform all the tasks. Put the data file from the local filesystem into the HDFS date staging directory. The Date Dimension table setup is a one-time activity.

ssh edw_user@sandbox-hdp.hortonworks.com -p 2222
hdfs dfs -put /home/edw_user/sampledata/dates.psv /user/edw_user/sales/staging/date
hdfs dfs -tail /user/edw_user/sales/staging/date/dates.psv

Now we will define & load the Date Dimension table in Hive. Connect to the Beeline CLI using edw_user as the username and hadoop as the password. We will connect to the Hive schema ‘sales_analytics’.

beeline
!connect jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics edw_user

An external table is a table for which Hive does not manage the storage. If you delete an external table, only the table definition in Hive is deleted; the data remains. An internal (managed) table is a table that Hive manages: if you delete an internal table, both the definition in Hive and the data are deleted. First we will create a Hive external table to project a structure onto the dates data file in HDFS, followed by a Hive-managed ORC table.

CREATE EXTERNAL TABLE IF NOT EXISTS ext_date (
year_number INT,
month_number INT,
day_of_year_number INT,
day_of_month_number INT,
day_of_week_number INT,
week_of_year_number INT,
day_name VARCHAR(20),
month_name VARCHAR(20),
quarter_number INT,
quarter_name VARCHAR(2),
year_quarter_name VARCHAR(10),
weekend_ind VARCHAR(1),
days_in_month_qty INT,
date_sk INT,
day_desc VARCHAR(10),
week_sk INT,
day_date DATE,
week_name VARCHAR(10),
week_of_month_number INT,
week_of_month_name VARCHAR(10),
month_sk INT,
quarter_sk INT,
year_sk INT,
year_sort_number VARCHAR(4),
day_of_week_sort_name VARCHAR(10)
) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/edw_user/sales/staging/date'
TBLPROPERTIES ("skip.header.line.count"="1");

SELECT * FROM ext_date LIMIT 10;
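
To confirm how Hive classifies the table, an optional check (not part of the original steps) is to inspect the table metadata with DESCRIBE FORMATTED; the Table Type attribute should read EXTERNAL_TABLE here, whereas the managed ORC table created below will show MANAGED_TABLE.

DESCRIBE FORMATTED ext_date;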

Next let’s define our final date dimension table as a Hive-managed ORC table. The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data; using ORC files improves performance when Hive is reading, writing, and processing data. Queries on tables stored in this optimized columnar format can run 10-100x faster, thanks to compression (around 10x with zlib), predicate pushdown (skipping blocks of rows based on WHERE conditions) and columnar storage (only the columns needed by a query are read), making it the format of choice for Hive data on HDP; on CDH, use Parquet instead.

CREATE TABLE IF NOT EXISTS dim_date(
year_number INT,
month_number INT,
day_of_year_number INT,
day_of_month_number INT,
day_of_week_number INT,
week_of_year_number INT,
day_name VARCHAR(20),
month_name VARCHAR(20),
quarter_number INT,
quarter_name VARCHAR(2),
year_quarter_name VARCHAR(10),
weekend_ind VARCHAR(1),
days_in_month_qty INT,
date_sk INT,
day_desc VARCHAR(10),
week_sk INT,
day_date DATE,
week_name VARCHAR(10),
week_of_month_number INT,
week_of_month_name VARCHAR(10),
month_sk INT,
quarter_sk INT,
year_sk INT,
year_sort_number VARCHAR(4),
day_of_week_sort_name VARCHAR(10)
)
STORED AS ORC 
TBLPROPERTIES ("orc.compress"="SNAPPY");
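
The table above uses SNAPPY compression. As an illustrative sketch only (dim_date_zlib is a hypothetical table, not used in this series, and the column list is abbreviated for brevity), the same definition with ZLIB compression would differ just in the table property:

CREATE TABLE IF NOT EXISTS dim_date_zlib(
date_sk INT,
day_date DATE
-- ... remaining dim_date columns ...
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");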

ORC files support multiple compression codecs: SNAPPY is faster for reads and writes, while ZLIB compresses better and saves more storage space. Now we need to move the data from the external table into the Hive-managed ORC table. Using Hive for this conversion is efficient and fast, because it runs as a parallel, distributed job and no standalone ORC conversion tool is necessary. Once the data is in the ORC table, we will drop the Hive external table and move the dates data file from the staging directory to the archive directory.

INSERT OVERWRITE TABLE dim_date SELECT * FROM ext_date;
ANALYZE TABLE dim_date COMPUTE STATISTICS FOR COLUMNS;
SELECT * FROM dim_date LIMIT 10;
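-- Optional sanity check (an illustrative addition, not in the original steps): compare
-- row counts between the external staging table and the ORC table before dropping ext_date.
SELECT COUNT(*) FROM ext_date;
SELECT COUNT(*) FROM dim_date;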
DROP TABLE ext_date;
!quit

Finally, move the dates data file to the archive directory.

hdfs dfs -mv /user/edw_user/sales/staging/date/dates.psv /user/edw_user/sales/archive/date

exit

In the next article we will load the showroom dimension table as SCD Type 1 using Sqoop & Hive.


