Install Hive in Client Node of Hadoop Cluster
In the previous article, we showed how to set up a client node. With that done, let's put Hadoop to use for some big data analytics. One way to do that is with Hive, which lets us run SQL queries against the big data. A command line tool and JDBC driver are provided out of the box to connect users to Hive. Let's install Hive now.
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Hive provides an SQL-like query language called HiveQL with schema-on-read, which implicitly converts queries into MapReduce, Apache Tez or Spark jobs. HiveQL does not strictly follow the SQL-92 standard.
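As a quick illustration (usable once Hive is installed later in this article), you can switch the execution engine and ask Hive to show the plan it generates for a query; some_table below is just a placeholder, and the valid engine values are mr, tez and spark:

hive> SET hive.execution.engine=mr;
hive> EXPLAIN SELECT COUNT(1) FROM some_table;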
Hive internally is architected with the following components:

Metastore: Stores the metadata for each table, such as its schema, partitions and location.

Compiler: Compiles HiveQL queries and generates the Hadoop map-reduce stages and tasks.

Optimizer: Generates an optimised directed acyclic graph (DAG) to perform various transformations like aggregations, joins etc.

Executor: Interacts with the YARN job tracker to manage the execution of the jobs.

Driver: Acts like a controller that receives the HiveQL statement, opens a session, collects the query results and finally stores the necessary metadata generated during the execution of the HiveQL statement.
Hive also has a server-side component known as Hive Server 2, which enables remote clients to execute queries against Hive and serves back the results. Hive Server 2 is built on Thrift RPC and supports multi-client concurrency and authentication, allowing API calls from JDBC/ODBC clients. This lets other analytics tools such as Tableau, QlikView etc. connect to Hive Server 2.
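For instance, the Beeline client that ships with Hive can connect to Hive Server 2 over JDBC; a minimal sketch, assuming Hive Server 2 is running on the EdgeNode with the default port 10000:

root@EdgeNode:~# $HIVE_HOME/bin/beeline -u jdbc:hive2://EdgeNode:10000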
Install Latest Stable Apache Hive
Get the latest stable Apache Hive package from:
root@EdgeNode:~# cd /usr/local/
root@EdgeNode:/usr/local# wget https://www.apache.org/dist/hive/stable-2/apache-hive-2.1.0-bin.tar.gz
root@EdgeNode:/usr/local# tar -xzvf apache-hive-2.1.0-bin.tar.gz >> /dev/null
root@EdgeNode:/usr/local# mv apache-hive-2.1.0-bin /usr/local/hive
root@EdgeNode:/usr/local# rm apache-hive-2.1.0-bin.tar.gz
Next we need to set up the Hive environment script and set the Hadoop home and Hive config directory. Create a copy of the hive-env.sh.template file as hive-env.sh and add the two lines shown below:
root@EdgeNode:/usr/local# cp /usr/local/hive/conf/hive-env.sh.template /usr/local/hive/conf/hive-env.sh
root@EdgeNode:/usr/local# vi /usr/local/hive/conf/hive-env.sh
In the file, set HADOOP_HOME and HIVE_CONF_DIR as below:
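The exact values depend on where Hadoop was installed during the earlier client node setup; assuming it lives under /usr/local/hadoop, the two lines would be:

export HADOOP_HOME=/usr/local/hadoop
export HIVE_CONF_DIR=/usr/local/hive/conf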
Set the Hive environment variables in the .bashrc file. Append the lines below to the file and source the environment file.
root@EdgeNode:/usr/local# vi ~/.bashrc
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
root@EdgeNode:/usr/local# source ~/.bashrc
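As a quick sanity check that the hive binary is now on the PATH:

root@EdgeNode:/usr/local# hive --version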
Next we need to create the Hive warehouse in HDFS. Find below the HDFS commands to create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set them group writable:
root@EdgeNode:/usr/local# hadoop fs -mkdir /tmp
root@EdgeNode:/usr/local# hadoop fs -mkdir -p /user/hive/warehouse
root@EdgeNode:/usr/local# hadoop fs -chmod g+w /tmp
root@EdgeNode:/usr/local# hadoop fs -chmod g+w /user/hive/warehouse
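You can verify that the directories exist with the expected group write permission:

root@EdgeNode:/usr/local# hadoop fs -ls /
root@EdgeNode:/usr/local# hadoop fs -ls /user/hive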
First we will check that Hive is working in the default embedded mode, with Derby as the metastore database; later we will configure and use MySQL as the metastore.
Run the schematool command below as an initialisation step. This creates a metastore_db folder in the current directory, where Derby keeps the Hive metadata. For that we will create a hive_work directory and initialise the Derby metastore there. Thereafter, whenever we use Hive we must invoke the CLI from this directory, else the utility will not be able to find/load the metadata available in the metastore.
root@EdgeNode:/usr/local# cd $HIVE_HOME
root@EdgeNode:/usr/local/hive# mkdir hive_work
root@EdgeNode:/usr/local/hive# cd hive_work
root@EdgeNode:/usr/local/hive/hive_work# $HIVE_HOME/bin/schematool -dbType derby -initSchema
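To confirm the schema was created, schematool can also report the metastore version; run it from the same hive_work directory:

root@EdgeNode:/usr/local/hive/hive_work# $HIVE_HOME/bin/schematool -dbType derby -info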
Once the Derby metastore initialisation completes successfully, let's get started with a few Hive commands.
Check the Hive databases and whether any tables exist in the warehouse.
hive> SHOW DATABASES;
hive> USE default;
hive> SHOW TABLES;
Let us try to create Hive tables based on the sample data we loaded earlier into HDFS.
hive> CREATE EXTERNAL TABLE stg_gdp_usa(
    rec_dt DATE,
    gdp DECIMAL(10,2)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/analysis/gdp/'
TBLPROPERTIES ("skip.header.line.count"="1");

hive> SHOW TABLES;

hive> CREATE TABLE gdp_usa(
    rec_dt DATE,
    gdp DECIMAL(10,2)
)
STORED AS ORC;

hive> FROM stg_gdp_usa stg
INSERT OVERWRITE TABLE gdp_usa
SELECT stg.*;

hive> SELECT COUNT(1), SUM(gdp) FROM gdp_usa;

hive> quit;
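If the counts look off, a quick spot-check of the staging table (run from the same hive_work directory) helps verify the tab delimiter and the header-skip setting:

hive> SELECT * FROM stg_gdp_usa LIMIT 5;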
Now your Hive is ready with the default Derby metastore in the client node. In the next article, we will configure Hive to work with a MySQL based metastore. Stay tuned.