In the previous article, we showed how to set up a client node. With that done, let's put Hadoop to use for some big data analytics. One way to do that is with Hive, which lets us run SQL queries against the big data. A command line tool and a JDBC driver are provided out of the box to connect users to Hive. Let's install Hive now.

The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Hive provides an SQL-like query language called HiveQL with schema-on-read, and implicitly converts queries into MapReduce, Apache Tez or Spark jobs. HiveQL does not strictly follow the SQL-92 standard.
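As an illustration of this, once Hive is installed the execution engine can be switched per session via the hive.execution.engine property (valid values are mr, tez and spark, provided the corresponding engine is actually available on the cluster); this is only a sketch, and the rest of this article simply relies on the default MapReduce engine:

hive> SET hive.execution.engine;
hive> SET hive.execution.engine=tez;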

Hive Architecture

Internally, Hive is architected with the following components:

  • Metastore

    Stores metadata for each of the tables, such as their schema, partitions and locations.

  • Compiler

    Compiles HiveQL queries and generates the Hadoop MapReduce stages and tasks.

  • Optimiser

    Generates an optimised directed acyclic graph (DAG) to perform the various transformations like aggregations, joins etc.

  • Executor

    Interacts with the YARN job tracker to manage the execution of the jobs.

  • Driver

    Acts like a controller that receives the HiveQL statement, opens a session, collects the query results, and finally stores the necessary metadata generated during the execution of the HiveQL statement.

Hive also has a server-side component known as HiveServer2, which enables remote clients to execute queries against Hive and serves back the results. HiveServer2 supports multi-client concurrency and authentication and is based on Thrift RPC, supporting API calls from JDBC/ODBC clients. This allows analytics tools such as Tableau, QlikView etc. to connect to HiveServer2.
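For example, once HiveServer2 is running (it is started with $HIVE_HOME/bin/hiveserver2 and listens on port 10000 by default), a remote client can connect over JDBC using Hive's bundled Beeline shell; the host name EdgeNode and the user below are just assumptions for this setup, and we will not be using HiveServer2 in the rest of this article:

root@EdgeNode:/usr/local# $HIVE_HOME/bin/beeline -u jdbc:hive2://EdgeNode:10000 -n root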

Install Latest Stable Apache Hive

Get the Latest Stable Apache Hive Package from: http://www-us.apache.org/dist/hive/stable-2/

root@EdgeNode:~# cd /usr/local/
root@EdgeNode:/usr/local# wget https://www.apache.org/dist/hive/stable-2/apache-hive-2.1.0-bin.tar.gz
root@EdgeNode:/usr/local# tar -xzvf apache-hive-2.1.0-bin.tar.gz >> /dev/null
root@EdgeNode:/usr/local# mv apache-hive-2.1.0-bin /usr/local/hive
root@EdgeNode:/usr/local# rm apache-hive-2.1.0-bin.tar.gz

Next we need to set up the Hive environment script and point it to the Hadoop home and Hive configuration directory. Create a copy of the hive-env.sh.template file as hive-env.sh and add the below two lines:

root@EdgeNode:/usr/local# cp /usr/local/hive/conf/hive-env.sh.template /usr/local/hive/conf/hive-env.sh
root@EdgeNode:/usr/local# vi /usr/local/hive/conf/hive-env.sh

In the file, set HADOOP_HOME and HIVE_CONF_DIR as below:

HADOOP_HOME=/usr/local/hadoop
HIVE_CONF_DIR=/usr/local/hive/conf

Set the Hive environment variables in the .bashrc file. Append the below lines to the file and then source the environment file.

root@EdgeNode:/usr/local#  vi ~/.bashrc

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
root@EdgeNode:/usr/local# source ~/.bashrc

Next we need to create the Hive warehouse in HDFS. Find below the HDFS commands to create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set them to chmod g+w:

root@EdgeNode:/usr/local# hadoop fs -mkdir /tmp
root@EdgeNode:/usr/local# hadoop fs -mkdir -p /user/hive/warehouse
root@EdgeNode:/usr/local# hadoop fs -chmod g+w /tmp
root@EdgeNode:/usr/local# hadoop fs -chmod g+w /user/hive/warehouse

First we will check that Hive works in the default embedded mode with Derby as the metastore database; later we will configure and use MySQL as the metastore.

Run the schematool command below as an initialisation step. It creates a metastore folder in the current directory where Derby keeps the Hive metadata, so we will first create a hive_work directory and initialise the Derby metastore from there. From then on we must invoke the Hive CLI from this directory, otherwise the utility will not be able to find/load the metadata stored in the metastore.

root@EdgeNode:/usr/local# cd $HIVE_HOME
root@EdgeNode:/usr/local/hive# mkdir hive_work
root@EdgeNode:/usr/local/hive# cd hive_work
root@EdgeNode:/usr/local/hive/hive_work# $HIVE_HOME/bin/schematool -dbType derby -initSchema
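Optionally, you can verify that the schema was created by querying the metastore version with schematool's -info option (run it from the same hive_work directory so that Derby can find its metastore folder):

root@EdgeNode:/usr/local/hive/hive_work# $HIVE_HOME/bin/schematool -dbType derby -info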

Once the Derby metastore initialisation has completed successfully, let's get started with a few Hive commands.

root@EdgeNode:/usr/local/hive/hive_work# hive

Check the Hive databases and see whether any tables exist in the warehouse.

hive> SHOW DATABASES;
hive> USE default;
hive> SHOW TABLES;

Let us try to create Hive tables based on the sample data we loaded earlier into HDFS.

hive> CREATE EXTERNAL TABLE stg_gdp_usa(
rec_dt DATE, gdp DECIMAL(10,2)
) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/analysis/gdp/'
TBLPROPERTIES ("skip.header.line.count"="1");

hive> SHOW TABLES;
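Assuming the GDP sample file from the earlier article is available under /analysis/gdp/ in HDFS, a quick way to check that the external table reads the data correctly is to select a few rows:

hive> SELECT * FROM stg_gdp_usa LIMIT 5;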


hive> CREATE TABLE gdp_usa(
rec_dt DATE, gdp DECIMAL(10,2)
)
STORED AS ORC;


hive> FROM stg_gdp_usa stg
INSERT OVERWRITE TABLE gdp_usa 
SELECT stg.*;


hive> SELECT COUNT(1), SUM(gdp) FROM gdp_usa;
hive> quit;
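If you prefer, Hive queries can also be run non-interactively from the shell with the CLI's -e option, which is handy for scripting; the query below is only an illustrative aggregation on the table we just loaded (remember to run it from the hive_work directory so the Derby metastore is found):

root@EdgeNode:/usr/local/hive/hive_work# hive -e "SELECT YEAR(rec_dt), AVG(gdp) FROM gdp_usa GROUP BY YEAR(rec_dt);"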

Now Hive is ready with the default Derby metastore on the client node. In the next article, we will configure Hive to work with a MySQL-based metastore. Stay tuned.

