In the previous article, we showed how to set up a client node. With that done, let's put Hadoop to use for some big data analytics. One way to do that is with Hive, which lets us run SQL queries against big data. A command line tool and JDBC driver are provided out of the box to connect users to Hive. Let's install Hive now.

The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Hive provides an SQL-like query language called HiveQL with schema-on-read, which implicitly converts queries into MapReduce, Apache Tez, or Spark jobs. HiveQL does not strictly follow the SQL-92 standard.
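
As a quick illustration (you can try this once Hive is installed later in this article), the engine that Hive compiles queries into can be checked and switched per session through the hive.execution.engine property. The first command below prints the engine currently in use (mr unless configured otherwise); the second switches the session to Tez, which will only work if Tez is actually installed on the cluster:

hive> SET hive.execution.engine;
hive> SET hive.execution.engine=tez;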

Hive Architecture

Internally, Hive is architected around the following components:

  • Metastore

    Stores metadata for each table, such as its schema, partitions and location.

  • Compiler

    Compiles HiveQL queries and generates the Hadoop MapReduce stages and tasks.

  • Optimiser

    Generates an optimised directed acyclic graph (DAG) to perform the various transformations such as aggregations, joins etc. (see the EXPLAIN example after this list).

  • Executor

    Interacts with YARN to manage the execution of the jobs.

  • Driver

    Acts as a controller that receives the HiveQL statement, opens a session, collects the query results and finally stores the necessary metadata generated during the execution of the statement.
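
A rough way to see the compiler and optimiser at work is the EXPLAIN statement, which prints the plan (stages/DAG) that Hive generates for a query. The example below is just a sketch and assumes a table such as the gdp_usa table created later in this article already exists:

hive> EXPLAIN SELECT COUNT(1), SUM(gdp) FROM gdp_usa;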

Hive also has a server component, known as HiveServer2, which enables remote clients to execute queries against Hive and returns the results. HiveServer2 is based on Thrift RPC and supports multi-client concurrency and authentication, exposing JDBC/ODBC APIs so that analytics tools such as Tableau, QlikView etc. can connect to it.
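
For reference only (we will not need it for the rest of this article), a minimal way to try HiveServer2 once the installation below is complete is to start the server and connect with the bundled Beeline client over JDBC. The sketch assumes the default Thrift port 10000 and no authentication configured:

root@EdgeNode:/usr/local# $HIVE_HOME/bin/hiveserver2 &
root@EdgeNode:/usr/local# $HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000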

Install Latest Stable Apache Hive

Get the Latest Stable Apache Hive Package from: http://www-us.apache.org/dist/hive/stable-2/

root@EdgeNode:~# cd /usr/local/
root@EdgeNode:/usr/local# wget https://www.apache.org/dist/hive/stable-2/apache-hive-2.1.0-bin.tar.gz
root@EdgeNode:/usr/local# tar -xzvf apache-hive-2.1.0-bin.tar.gz >> /dev/null
root@EdgeNode:/usr/local# mv apache-hive-2.1.0-bin /usr/local/hive
root@EdgeNode:/usr/local# rm apache-hive-2.1.0-bin.tar.gz

Next we need to set up the Hive environment script and point it to the Hadoop home and Hive configuration directories. Create a copy of the hive-env.sh.template file as hive-env.sh and add the two lines below:

root@EdgeNode:/usr/local# cp /usr/local/hive/conf/hive-env.sh.template /usr/local/hive/conf/hive-env.sh
root@EdgeNode:/usr/local# vi /usr/local/hive/conf/hive-env.sh

In the file, set HADOOP_HOME and HIVE_CONF_DIR as below:

HADOOP_HOME=/usr/local/hadoop
HIVE_CONF_DIR=/usr/local/hive/conf

Set the Hive environment variables in the .bashrc file. Append the lines below to the file and then source it.

root@EdgeNode:/usr/local#  vi ~/.bashrc

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
root@EdgeNode:/usr/local# source ~/.bashrc
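
Optionally, we can verify that the Hive binaries are now on the PATH; the exact version string printed may differ from your setup:

root@EdgeNode:/usr/local# hive --version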

Next we need to create the Hive warehouse in HDFS. Find below the HDFS commands to create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set them to chmod g+w.

root@EdgeNode:/usr/local# hadoop fs -mkdir /tmp
root@EdgeNode:/usr/local# hadoop fs -mkdir -p /user/hive/warehouse
root@EdgeNode:/usr/local# hadoop fs -chmod g+w /tmp
root@EdgeNode:/usr/local# hadoop fs -chmod g+w /user/hive/warehouse
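
Optionally, verify that the directories were created and carry the group-write permission:

root@EdgeNode:/usr/local# hadoop fs -ls /
root@EdgeNode:/usr/local# hadoop fs -ls /user/hive/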

First we will check whether Hive is working in the default embedded mode, with Derby as the metastore database; later we will configure and use MySQL as the metastore.

Run the schematool command below as an initialisation step. This creates a metastore folder in the current directory where Derby keeps the Hive metadata. For that we will create a hive_work directory and initialise the Derby metastore there. From then on, we will invoke the Hive CLI from this directory, otherwise the utility will not be able to find/load the metadata stored in the metastore.

root@EdgeNode:/usr/local# cd $HIVE_HOME
root@EdgeNode:/usr/local/hive# mkdir hive_work
root@EdgeNode:/usr/local/hive# cd hive_work
root@EdgeNode:/usr/local/hive/hive_work# $HIVE_HOME/bin/schematool -dbType derby -initSchema
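
Optionally, the schema just created can be cross-checked with schematool's info mode, which prints the metastore schema version. Run it from the same hive_work directory so that Derby can find the metastore folder:

root@EdgeNode:/usr/local/hive/hive_work# $HIVE_HOME/bin/schematool -dbType derby -info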

Once the Derby metastore initialisation has completed successfully, let's get started with a few Hive commands.

root@EdgeNode:/usr/local/hive/hive_work# hive

Check the Hive databases and see whether any tables exist in the warehouse.

hive> SHOW DATABASES;
hive> USE default;
hive> SHOW TABLES;

Let us try to create Hive tables based on the sample data we loaded into HDFS earlier.

hive> CREATE EXTERNAL TABLE stg_gdp_usa(
rec_dt DATE, gdp DECIMAL(10,2)
) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/analysis/gdp/'
TBLPROPERTIES ("skip.header.line.count"="1");

hive> SHOW TABLES;
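
Because the external table is schema-on-read, Hive simply projects the declared columns over whatever files already sit under /analysis/gdp/ in HDFS. As a quick sanity check (assuming the sample GDP data from the earlier article is present there):

hive> SELECT * FROM stg_gdp_usa LIMIT 5;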


hive> CREATE TABLE gdp_usa(
rec_dt DATE, gdp DECIMAL(10,2)
)
STORED AS ORC;


hive> FROM stg_gdp_usa stg
INSERT OVERWRITE TABLE gdp_usa 
SELECT stg.*;


hive> SELECT COUNT(1), SUM(gdp) FROM gdp_usa;
hive> quit;

Now Hive is ready with the default Derby metastore on the client node. In the next article, we will configure Hive to work with a MySQL-based metastore. Stay tuned.

