Install Hive in Client Node of Hadoop Cluster
In the previous article, we showed how to set up a client node. With that done, let's put Hadoop to use for some big data analytics. One way to do that is with Hive, which lets us run SQL queries against the big data. A command line tool and JDBC driver are provided out of the box to connect users to Hive. Let's install Hive now.
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Hive provides an SQL-like query language called HiveQL with schema-on-read, which implicitly converts queries into MapReduce, Apache Tez or Spark jobs. HiveQL does not strictly follow the SQL-92 standard.
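As a quick illustration (usable once Hive is installed later in this article), you can switch the execution engine and ask Hive to show the plan it generates for a query; some_table below is just a placeholder, and the valid engine values are mr, tez and spark:

hive> SET hive.execution.engine=mr;
hive> EXPLAIN SELECT COUNT(1) FROM some_table;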
Hive internally is architected with the following components:

Metastore: Stores the metadata for each table, such as its schema, partitions and location.

Compiler: Compiles HiveQL queries and generates the Hadoop map-reduce stages and tasks.

Optimizer: Generates an optimised directed acyclic graph (DAG) to perform various transformations like aggregations, joins etc.

Executor: Interacts with the YARN job tracker to manage the execution of the jobs.

Driver: Acts like a controller that receives the HiveQL statement, opens a session, collects the query results and finally stores the necessary metadata generated during the execution of the HiveQL statement.
Hive also has a server-side component known as Hive Server 2, which enables remote clients to execute queries against Hive and serves back the results. Hive Server 2 is built on Thrift RPC and supports multi-client concurrency and authentication, allowing API calls from JDBC/ODBC clients. This lets other analytics tools such as Tableau, QlikView etc. connect to Hive Server 2.
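For instance, the Beeline client that ships with Hive can connect to Hive Server 2 over JDBC; a minimal sketch, assuming Hive Server 2 is running on the EdgeNode with the default port 10000:

root@EdgeNode:~# $HIVE_HOME/bin/beeline -u jdbc:hive2://EdgeNode:10000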
Install Latest Stable Apache Hive
Get the latest stable Apache Hive package from:
root@EdgeNode:~# cd /usr/local/
root@EdgeNode:/usr/local# wget https://www.apache.org/dist/hive/stable-2/apache-hive-2.1.0-bin.tar.gz
root@EdgeNode:/usr/local# tar -xzvf apache-hive-2.1.0-bin.tar.gz >> /dev/null
root@EdgeNode:/usr/local# mv apache-hive-2.1.0-bin /usr/local/hive
root@EdgeNode:/usr/local# rm apache-hive-2.1.0-bin.tar.gz
Next we need to set up the Hive environment script and set the Hadoop home and Hive config directory. Create a copy of the hive-env.sh.template file as hive-env.sh and add the two lines shown below:
root@EdgeNode:/usr/local# cp /usr/local/hive/conf/hive-env.sh.template /usr/local/hive/conf/hive-env.sh
root@EdgeNode:/usr/local# vi /usr/local/hive/conf/hive-env.sh
In the file, set HADOOP_HOME and HIVE_CONF_DIR as below:
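The exact values depend on where Hadoop was installed during the earlier client node setup; assuming it lives under /usr/local/hadoop, the two lines would be:

export HADOOP_HOME=/usr/local/hadoop
export HIVE_CONF_DIR=/usr/local/hive/conf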
Set the Hive environment variables in the .bashrc file. Append the lines below to the file and source the environment file.
root@EdgeNode:/usr/local# vi ~/.bashrc
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
root@EdgeNode:/usr/local# source ~/.bashrc
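As a quick sanity check that the hive binary is now on the PATH:

root@EdgeNode:/usr/local# hive --version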
Next we need to create the Hive warehouse in HDFS. Find below the HDFS commands to create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set them group writable:
root@EdgeNode:/usr/local# hadoop fs -mkdir /tmp
root@EdgeNode:/usr/local# hadoop fs -mkdir -p /user/hive/warehouse
root@EdgeNode:/usr/local# hadoop fs -chmod g+w /tmp
root@EdgeNode:/usr/local# hadoop fs -chmod g+w /user/hive/warehouse
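You can verify that the directories exist with the expected group write permission:

root@EdgeNode:/usr/local# hadoop fs -ls /
root@EdgeNode:/usr/local# hadoop fs -ls /user/hive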
First we will check that Hive is working in the default embedded mode, with Derby as the metastore database; later we will configure and use MySQL as the metastore.
Run the schematool command below as an initialisation step. This creates a metastore_db folder in the current directory, where Derby keeps the Hive metadata. For that we will create a hive_work directory and initialise the Derby metastore there. Thereafter, whenever we use Hive we must invoke the CLI from this directory, else the utility will not be able to find/load the metadata available in the metastore.
root@EdgeNode:/usr/local# cd $HIVE_HOME
root@EdgeNode:/usr/local/hive# mkdir hive_work
root@EdgeNode:/usr/local/hive# cd hive_work
root@EdgeNode:/usr/local/hive/hive_work# $HIVE_HOME/bin/schematool -dbType derby -initSchema
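To confirm the schema was created, schematool can also report the metastore version; run it from the same hive_work directory:

root@EdgeNode:/usr/local/hive/hive_work# $HIVE_HOME/bin/schematool -dbType derby -info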
Once the Derby metastore initialisation completes successfully, let's get started with a few Hive commands.
Check the Hive databases and whether any tables exist in the warehouse.
hive> SHOW DATABASES;
hive> USE default;
hive> SHOW TABLES;
Let us try to create Hive tables based on the sample data we loaded earlier into HDFS.
hive> CREATE EXTERNAL TABLE stg_gdp_usa(
    rec_dt DATE,
    gdp DECIMAL(10,2)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/analysis/gdp/'
TBLPROPERTIES ("skip.header.line.count"="1");

hive> SHOW TABLES;

hive> CREATE TABLE gdp_usa(
    rec_dt DATE,
    gdp DECIMAL(10,2)
)
STORED AS ORC;

hive> FROM stg_gdp_usa stg
INSERT OVERWRITE TABLE gdp_usa
SELECT stg.*;

hive> SELECT COUNT(1), SUM(gdp) FROM gdp_usa;

hive> quit;
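If the counts look off, a quick spot-check of the staging table (run from the same hive_work directory) helps verify the tab delimiter and the header-skip setting:

hive> SELECT * FROM stg_gdp_usa LIMIT 5;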
Now your Hive is ready with the default Derby metastore in the client node. In the next article, we will configure Hive to work with a MySQL based metastore. Stay tuned.