Apache HBase provides large-scale tabular storage for Hadoop on top of the Hadoop Distributed File System (HDFS). It is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable. HBase is used when we need random, real-time read/write access to Big Data. With HBase we can host very large tables (billions of rows by millions of columns) atop clusters of commodity hardware. In this article we will install HBase in a fully distributed Hadoop cluster.

HBase scales by splitting the rows of each table into regions, which are contiguous ranges of row keys. Each region is hosted by exactly one Region Server at a time. Writes are held, sorted, in memory until they are flushed to disk. Reads merge the rows in memory with the flushed files. Reads and writes to a single row are consistent. A row is a map of key-value cells identified by one row key, and operations on a single row are atomic. The in-memory data gets flushed to disk periodically, but it doesn't have to be flushed into just a single file. It can be split across different store files with different properties, so that reads look at just a subset of the files. This design option is called Column Families: a column family groups columns into separate physical files. HBase offers neither joins nor secondary indexes out of the box.
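
To make the read path described above more concrete, here is a toy sketch in shell. It is not how HBase is implemented. It only assumes two hypothetical text files, one standing in for the in-memory edits and one for a flushed store file, with tab-separated lines of the form rowkey, timestamp, value. The function name merged_read is our own. A read merges both sources and keeps the newest value per row key.

```shell
#!/bin/sh
# Toy model, not HBase code. Every input line is: rowkey<TAB>timestamp<TAB>value
# merged_read FILE...: print "rowkey<TAB>value" with the newest value per row key,
# the way a read combines in-memory edits with already flushed store files.
merged_read() {
    tab=$(printf '\t')
    # Order by row key, then newest timestamp first; awk keeps the first line per key.
    LC_ALL=C sort -t "$tab" -k1,1 -k2,2nr "$@" |
        awk -F '\t' '!seen[$1]++ { print $1 "\t" $3 }'
}
```

For example, if memstore.tsv holds a newer value for row 010 than storefile.tsv, then merged_read memstore.tsv storefile.tsv prints only the newer one for that row.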

HBase Installation

We will configure our cluster to host the HBase Master Server on our NameNode and the Region Servers on our DataNodes. Apache ZooKeeper is a prerequisite for HBase. In this case we will let HBase manage its own ZooKeeper instance, and we will configure the NameNode to host the ZooKeeper quorum. So let us ssh into our NameNode.

Master Server Setup

Get the latest stable release of the HBase package from the site:
http://www-us.apache.org/dist/hbase/stable
At the time of writing this article, HBase 1.2.3 is the latest stable version. We will install HBase under the /usr/local/ directory.

root@NameNode:~# cd /usr/local
root@NameNode:/usr/local/# wget http://www-us.apache.org/dist/hbase/stable/hbase-1.2.3-bin.tar.gz
root@NameNode:/usr/local/# tar -xzf hbase-1.2.3-bin.tar.gz
root@NameNode:/usr/local/# mv hbase-1.2.3 /usr/local/hbase
root@NameNode:/usr/local/# rm hbase-1.2.3-bin.tar.gz

Set the HBase environment variables in the .bashrc file. Append the lines below to the file and then source it.

root@NameNode:/usr/local#  vi ~/.bashrc

export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hbase/lib/*:.
root@NameNode:/usr/local# source ~/.bashrc

Next we need to configure the HBase environment script and set the Java home. We will also configure HBase to manage its own ZooKeeper instance. Open the hbase-env.sh file and append the lines below.

root@NameNode:/usr/local/hbase/conf# vi hbase-env.sh

export JAVA_HOME=/usr/lib/jvm/java-7-oracle/jre
export HBASE_MANAGES_ZK=true

Next we will configure the site-specific properties of HBase in the file hbase-site.xml.

root@NameNode:/usr/local/hbase/conf# vi hbase-site.xml
hbase-site.xml
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://NameNode:8020/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/usr/local/zookeeper</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>NameNode</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
</configuration>

Next we have to list the DataNodes that will host the Region Servers, one hostname per line, in the file regionservers.

root@NameNode:/usr/local/hbase/conf# vi regionservers

DataNode1
DataNode2

Additionally, we will create the local directory in which ZooKeeper keeps its data. This is the path we set as hbase.zookeeper.property.dataDir in hbase-site.xml.

root@NameNode:/usr/local/hbase/conf# mkdir -p /usr/local/zookeeper
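
Before moving on, it can be worth double-checking that hbase-site.xml really contains the values we intended, for instance that hbase.zookeeper.property.dataDir matches the directory just created. The helper below is a minimal sketch under our own assumptions: get_hbase_prop is a hypothetical name, and it scans the file line by line rather than parsing XML properly, so it relies on each name and value element sitting on one line as in the file above.

```shell
#!/bin/sh
# get_hbase_prop FILE NAME: print the <value> that belongs to <name>NAME</name>.
# Line-oriented sketch, not a real XML parser: it prints the first <value>
# found on or after the line holding the requested name.
get_hbase_prop() {
    awk -v name="$2" '
        index($0, "<name>" name "</name>") { found = 1 }
        found && /<value>/ {
            sub(/.*<value>/, "")     # drop everything up to the opening tag
            sub(/<\/value>.*/, "")   # drop the closing tag and what follows
            print
            exit
        }
    ' "$1"
}
```

A possible use is get_hbase_prop /usr/local/hbase/conf/hbase-site.xml hbase.zookeeper.property.dataDir, which for the configuration above should print /usr/local/zookeeper. A property that is not in the file prints nothing.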

Region Server Setup

Now we have to configure our DataNodes to act as Region Servers. In our case we have two DataNodes. We will secure copy the hbase directory with the binaries and configuration files from the NameNode to the DataNodes.

root@NameNode:/usr/local/hbase/conf# cd /usr/local 
root@NameNode:/usr/local# scp -r hbase DataNode1:/usr/local
root@NameNode:/usr/local# scp -r hbase DataNode2:/usr/local
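
With more than a couple of DataNodes, typing one scp per host gets tedious. As a sketch, the copy can be driven from the regionservers file we just wrote. The function name push_hbase and the DRY_RUN switch are our own inventions, and the sketch assumes the file holds one hostname per line and that root can ssh to every host.

```shell
#!/bin/sh
# push_hbase HOSTFILE: copy /usr/local/hbase to every host listed in HOSTFILE.
# Set DRY_RUN=1 to print the scp commands instead of running them.
push_hbase() {
    while IFS= read -r host || [ -n "$host" ]; do
        [ -n "$host" ] || continue          # skip empty lines
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "scp -r /usr/local/hbase $host:/usr/local"
        else
            # stdin is redirected so scp cannot swallow the rest of the host list
            scp -r /usr/local/hbase "$host:/usr/local" < /dev/null
        fi
    done < "$1"
}
```

Running it first with DRY_RUN=1, as in DRY_RUN=1 push_hbase /usr/local/hbase/conf/regionservers, shows which commands would be issued before anything is copied.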

Next we need to update the HBase environment configuration on all the DataNodes. Append the three lines below to the .bashrc file on both DataNodes.

root@NameNode:/usr/local# ssh root@DataNode1
root@DataNode1:~# vi ~/.bashrc

export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hbase/lib/*:.


root@DataNode1:~# source ~/.bashrc
root@DataNode1:~# exit

Repeat the above step for all the other DataNodes.

We are now done with the installation and configuration, so it's time to start the HBase services.

root@NameNode:/usr/local# $HBASE_HOME/bin/start-hbase.sh

Let us validate the services running in NameNode as well as in the DataNodes.

root@NameNode:/usr/local# jps

5721 NameNode
5943 SecondaryNameNode
6103 ResourceManager
6217 JobHistoryServer
6752 HQuorumPeer
6813 HMaster
7031 Jps
root@NameNode:/usr/local# ssh root@DataNode1
root@DataNode1:~# jps

3869 DataNode
4004 NodeManager
4196 HRegionServer
4444 Jps

root@DataNode1:~# exit
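
Reading jps output by eye works for two DataNodes, but the same check can be scripted. The following is a minimal sketch: missing_daemons is a hypothetical helper of ours that reads jps output on standard input and prints every expected daemon name that does not appear in it.

```shell
#!/bin/sh
# missing_daemons NAME...: read `jps` output ("PID ClassName" per line) on stdin
# and print each expected NAME that is not running. No output means all are up.
missing_daemons() {
    running=$(awk '{ print $2 }')           # second column holds the process name
    for name in "$@"; do
        printf '%s\n' "$running" | grep -qxF "$name" || echo "$name"
    done
}
```

On the NameNode we could run jps | missing_daemons HQuorumPeer HMaster, and for a DataNode something like ssh root@DataNode1 jps | missing_daemons HRegionServer. The second form assumes jps is on the PATH of a non-interactive ssh session, which may not hold on every system.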

Quickly validate the installation.

root@NameNode:/usr/local# hbase version
HBase 1.2.3
Source code repository git://kalashnikov.att.net/Users/stack/checkouts/hbase.git.commit revision=bd63744624a26dc3350137b564fe746df7a721a4
Compiled by stack on Mon Aug 29 15:13:42 PDT 2016
From source with checksum 0ca49367ef6c3a680888bbc4f1485d18

Now let us start HBase shell and check some commands.

root@NameNode:~# $HBASE_HOME/bin/hbase shell

hbase(main):001:0> status
1 active master, 0 backup masters, 2 servers, 0 dead, 1.0000 average load

hbase(main):002:0> list
TABLE
0 row(s) in 0.0610 seconds

=> []
hbase(main):003:0> exit

Configure EdgeNode to access HBase

Let us configure our EdgeNode (client node) to access HBase. Going forward we will look at Hive and HBase interaction. Log on to the EdgeNode and secure copy the HBase directory from the NameNode.

root@EdgeNode:~# cd /usr/local
root@EdgeNode:/usr/local# scp -r root@NameNode:/usr/local/hbase /usr/local/

After that we will set the environment variables accordingly.

root@EdgeNode:/usr/local# vi ~/.bashrc

export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hbase/lib/*:.
root@EdgeNode:/usr/local# source ~/.bashrc

Next, log in to the HBase shell. We will create a table with several column families, put some data in the table, and then scan and get the data.

root@NameNode:~# $HBASE_HOME/bin/hbase shell

hbase(main):001:0> create 'cdr', 'index', 'customer', 'type', 'timing', 'usage', 'correspondent', 'network'
0 row(s) in 1.8530 seconds

=> Hbase::Table - cdr
hbase(main):002:0> put 'cdr', '010', 'index:customerindex', '0'
0 row(s) in 0.3470 seconds

hbase(main):003:0> put 'cdr', '010', 'index:customercount', '1'
0 row(s) in 0.0110 seconds

hbase(main):004:0> put 'cdr', '010', 'index:patterncdrindex', '0'
0 row(s) in 0.0210 seconds

hbase(main):005:0> put 'cdr', '010', 'index:customercdrcount', '10'
0 row(s) in 0.0190 seconds

hbase(main):006:0> put 'cdr', '010', 'index:customerpatternduration', '900'
0 row(s) in 0.0200 seconds

hbase(main):007:0> put 'cdr', '010', 'index:customerprofileduration', '900'
0 row(s) in 0.0090 seconds

hbase(main):008:0> put 'cdr', '010', 'index:profilemarker', 'Profile #1'
0 row(s) in 0.0070 seconds

hbase(main):009:0> put 'cdr', '010', 'index:patternmarker', 'Pattern #1 - 10 outgoing voice calls of 1-30 and toward the same corresp.'
0 row(s) in 1.2350 seconds

hbase(main):010:0> put 'cdr', '010', 'customer:cust_imsi', '208100000000000'
0 row(s) in 0.0210 seconds

hbase(main):011:0> put 'cdr', '010', 'customer:cust_isdn', '0600000000'
0 row(s) in 0.0110 seconds

hbase(main):012:0> put 'cdr', '010', 'customer:cust_imei', '350000000000000'
0 row(s) in 0.0240 seconds

hbase(main):013:0> put 'cdr', '010', 'customer:custoperator', 'FRAF2'
0 row(s) in 0.0160 seconds

hbase(main):014:0> put 'cdr', '010', 'type:calltype', 'MOC'
0 row(s) in 0.0340 seconds

hbase(main):015:0> put 'cdr', '010', 'type:callservice', 'Voice'
0 row(s) in 0.0090 seconds

hbase(main):016:0> list 'cdr'
TABLE
cdr
1 row(s) in 0.2020 seconds

hbase(main):017:0> scan 'cdr'
ROW                                           COLUMN+CELL
 010                                          column=customer:cust_imei, timestamp=1473821545112, value=350000000000000
 010                                          column=customer:cust_imsi, timestamp=1473821544911, value=208100000000000
 010                                          column=customer:cust_isdn, timestamp=1473821544985, value=0600000000
 010                                          column=customer:custoperator, timestamp=1473821546676, value=FRAF2
 010                                          column=index:customercdrcount, timestamp=1473821256779, value=10
 010                                          column=index:customercount, timestamp=1473821252725, value=1
 010                                          column=index:customerindex, timestamp=1473821252538, value=0
 010                                          column=index:customerpatternduration, timestamp=1473821286370, value=900
 010                                          column=index:customerprofileduration, timestamp=1473821286440, value=900
 010                                          column=index:patterncdrindex, timestamp=1473821252822, value=0
 010                                          column=index:patternmarker, timestamp=1473821469254, value=Pattern #1 - 10 outgoing voice calls of 1-30 and toward the same corresp.
 010                                          column=index:profilemarker, timestamp=1473821286471, value=Profile #1
 010                                          column=type:callservice, timestamp=1473821566933, value=Voice
 010                                          column=type:calltype, timestamp=1473821566853, value=MOC
1 row(s) in 0.0670 seconds

hbase(main):018:0> get 'cdr', '010'
COLUMN                                        CELL
 customer:cust_imei                           timestamp=1473821545112, value=350000000000000
 customer:cust_imsi                           timestamp=1473821544911, value=208100000000000
 customer:cust_isdn                           timestamp=1473821544985, value=0600000000
 customer:custoperator                        timestamp=1473821546676, value=FRAF2
 index:customercdrcount                       timestamp=1473821256779, value=10
 index:customercount                          timestamp=1473821252725, value=1
 index:customerindex                          timestamp=1473821252538, value=0
 index:customerpatternduration                timestamp=1473821286370, value=900
 index:customerprofileduration                timestamp=1473821286440, value=900
 index:patterncdrindex                        timestamp=1473821252822, value=0
 index:patternmarker                          timestamp=1473821469254, value=Pattern #1 - 10 outgoing voice calls of 1-30 and toward the same corresp.
 index:profilemarker                          timestamp=1473821286471, value=Profile #1
 type:callservice                             timestamp=1473821566933, value=Voice
 type:calltype                                timestamp=1473821566853, value=MOC
14 row(s) in 0.1050 seconds

hbase(main):019:0> exit
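
Typing each put by hand is fine for a demo row, but the commands follow a fixed pattern and can be generated. The sketch below assumes a hypothetical tab-separated input with one "family:qualifier" and one value per line. The function name gen_puts is ours, and no escaping is attempted, so values must not contain single quotes.

```shell
#!/bin/sh
# gen_puts TABLE ROWKEY: read "family:qualifier<TAB>value" lines on stdin and
# print one HBase shell put command per line.
# Assumption: values contain no single quotes (nothing is escaped here).
gen_puts() {
    while IFS="$(printf '\t')" read -r column value || [ -n "$column" ]; do
        [ -n "$column" ] || continue        # skip empty lines
        printf "put '%s', '%s', '%s', '%s'\n" "$1" "$2" "$column" "$value"
    done
}
```

The generated commands can be reviewed first and then piped into the shell, for example gen_puts cdr 010 < cells.tsv | hbase shell -n, where cells.tsv is a hypothetical input file and -n is the non-interactive mode of the HBase shell, assuming your HBase version provides it.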

Check the HBase Master Web UI at http://10.0.0.1:16010. In our case the HBase Master runs on the NameNode.

In the next article, we will learn about Apache Spark for in-memory big data analytics.

