Apache HBase provides large-scale tabular storage for Hadoop using the Hadoop Distributed File System (HDFS). Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable. HBase is used where we require random, real-time read/write access to Big Data, and it can host very large tables (billions of rows by millions of columns) atop clusters of commodity hardware. In this article we will install HBase in a fully distributed Hadoop cluster.

HBase scales by splitting the rows of a table into regions, each hosted by exactly one Region Server. Writes are held (sorted) in memory until they are flushed to disk, and reads merge the in-memory data with the flushed files. Reads and writes to a single row are consistent: a row is an atomic, sorted Key-Value map identified by a single row key. A row does not have to be flushed into a single file; it can be split across different store files with different properties, and reads can then look at just a subset of them. This design option is called Column Families: a column family groups columns into separate physical files. Like most distributed databases, HBase has neither joins nor secondary indexes.
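As an illustration of the column-family design, the sketch below uses a hypothetical 'weblog' table (not part of this installation) to show how families are declared at table-creation time, how each family can carry its own settings such as the number of versions kept or the compression codec, and how a read can be limited to a single family:

hbase(main):001:0> create 'weblog', {NAME => 'content', VERSIONS => 1, COMPRESSION => 'GZ'}, {NAME => 'meta', VERSIONS => 3}
hbase(main):002:0> put 'weblog', 'row1', 'meta:status', '200'
hbase(main):003:0> scan 'weblog', {COLUMNS => 'meta'}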

HBase Installation

We will configure our cluster to host the HBase Master server on the NameNode and the Region Servers on the DataNodes. Apache ZooKeeper is a prerequisite for HBase; in this case we will configure HBase to manage its own ZooKeeper instance, and the NameNode will host the ZooKeeper quorum. So let us ssh and log in to the NameNode.

Master Server Setup

Get the latest stable release of the HBase package from the site:
http://www-us.apache.org/dist/hbase/stable
At the time of writing this article, HBase 1.2.3 is the latest stable version. We will install HBase under the /usr/local/ directory.

root@NameNode:~# cd /usr/local
root@NameNode:/usr/local/# wget http://www-us.apache.org/dist/hbase/stable/hbase-1.2.3-bin.tar.gz
root@NameNode:/usr/local/# tar -xzvf hbase-1.2.3-bin.tar.gz >> /dev/null
root@NameNode:/usr/local/# mv hbase-1.2.3 /usr/local/hbase
root@NameNode:/usr/local/# rm hbase-1.2.3-bin.tar.gz

Set the HBase environment variables in the .bashrc file. Append the lines below to the file and then source it.

root@NameNode:/usr/local#  vi ~/.bashrc

export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hbase/lib/*:.
root@NameNode:/usr/local# source ~/.bashrc

Next we need to configure the HBase environment script and set the Java home. We will also configure HBase to manage its own ZooKeeper instance. Open the hbase-env.sh file and append the following lines.

root@NameNode:/usr/local/hbase/conf# vi hbase-env.sh

export JAVA_HOME=/usr/lib/jvm/java-7-oracle/jre
export HBASE_MANAGES_ZK=true

Next we will configure the site-specific properties of HBase in the file hbase-site.xml.

root@NameNode:/usr/local/hbase/conf# vi hbase-site.xml
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://NameNode:8020/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/usr/local/zookeeper</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>NameNode</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>
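Note that the host and port in hbase.rootdir must match the fs.defaultFS setting of the Hadoop cluster (here the NameNode RPC port 8020). If in doubt, it can be checked with the command below; in this cluster the value should come back as hdfs://NameNode:8020.

root@NameNode:/usr/local/hbase/conf# hdfs getconf -confKey fs.defaultFS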

Next we have to list the DataNodes which will host the Region Servers in the file regionservers.

root@NameNode:/usr/local/hbase/conf# vi regionservers

DataNode1
DataNode2

Additionally we will create a local directory for ZooKeeper to maintain its data files.

root@NameNode:/usr/local/hbase/conf# mkdir -p /usr/local/zookeeper

Region Server Setup

Now we have to configure our DataNodes to act as Region Servers. In our case we have two DataNodes. We will secure copy the hbase directory with the binaries and configuration files from the NameNode to the DataNodes.

root@NameNode:/usr/local/hbase/conf# cd /usr/local 
root@NameNode:/usr/local# scp -r hbase DataNode1:/usr/local
root@NameNode:/usr/local# scp -r hbase DataNode2:/usr/local

Next we need to update the HBase environment configuration on all the DataNodes. Append the lines below to the .bashrc file on both DataNodes.

root@NameNode:/usr/local# ssh root@DataNode1
root@DataNode1:~# vi ~/.bashrc

export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hbase/lib/*:.


root@DataNode1:~# source ~/.bashrc
root@DataNode1:~# exit

Repeat the above step for all the other DataNodes.
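If the cluster has many DataNodes, the copy and the .bashrc updates can be scripted from the NameNode instead; a minimal sketch, assuming passwordless ssh as root and the same hostnames used above:

root@NameNode:/usr/local# for node in DataNode1 DataNode2; do
>   scp -r /usr/local/hbase ${node}:/usr/local
>   ssh ${node} "echo 'export HBASE_HOME=/usr/local/hbase' >> ~/.bashrc; echo 'export PATH=\$PATH:\$HBASE_HOME/bin' >> ~/.bashrc; echo 'export CLASSPATH=\$CLASSPATH:/usr/local/hbase/lib/*:.' >> ~/.bashrc"
> done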

With the installation and configuration complete, it is time to start the HBase services.

root@NameNode:/usr/local# $HBASE_HOME/bin/start-hbase.sh

Let us validate the services running on the NameNode as well as on the DataNodes.

root@NameNode:/usr/local# jps

5721 NameNode
5943 SecondaryNameNode
6103 ResourceManager
6217 JobHistoryServer
6752 HQuorumPeer
6813 HMaster
7031 Jps
root@NameNode:/usr/local# ssh root@DataNode1
root@DataNode1:~# jps

3869 DataNode
4004 NodeManager
4196 HRegionServer
4444 Jps

root@DataNode1:~# exit

Quickly validate the installation.

root@NameNode:/usr/local# hbase version
HBase 1.2.3
Source code repository git://kalashnikov.att.net/Users/stack/checkouts/hbase.git.commit revision=bd63744624a26dc3350137b564fe746df7a721a4
Compiled by stack on Mon Aug 29 15:13:42 PDT 2016
From source with checksum 0ca49367ef6c3a680888bbc4f1485d18
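As an additional sanity check (assuming the hdfs client is available on the NameNode), you can confirm that HBase has created its root directory under the hbase.rootdir path configured earlier; the listing should show HBase's internal entries such as data, WALs and hbase.id.

root@NameNode:/usr/local# hdfs dfs -ls /hbase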

Now let us start the HBase shell and try out a few commands.

root@NameNode:~# $HBASE_HOME/bin/hbase shell

hbase(main):001:0> status
1 active master, 0 backup masters, 2 servers, 0 dead, 1.0000 average load

hbase(main):002:0> list
TABLE
0 row(s) in 0.0610 seconds

=> []
hbase(main):003:0> exit

Configure EdgeNode to access HBase

Let us configure the EdgeNode, or Client Node, to access HBase; going forward we will look at Hive & HBase interaction. Log on to the EdgeNode and secure copy the HBase directory from the NameNode.

root@EdgeNode:~# cd /usr/local
root@EdgeNode:/usr/local# scp -r root@NameNode:/usr/local/hbase /usr/local/

After that we will set the environment variables accordingly.

root@EdgeNode:/usr/local# vi ~/.bashrc

export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hbase/lib/*:.
root@EdgeNode:/usr/local# source ~/.bashrc

Next, log in to the HBase shell. We will create a table with column families, put some data into the table, and then scan and get the data.

root@EdgeNode:~# $HBASE_HOME/bin/hbase shell

hbase(main):001:0> create 'cdr', 'index', 'customer', 'type', 'timing', 'usage', 'correspondent', 'network'
0 row(s) in 1.8530 seconds

=> Hbase::Table - cdr
hbase(main):002:0> put 'cdr', '010', 'index:customerindex', '0'
0 row(s) in 0.3470 seconds

hbase(main):003:0> put 'cdr', '010', 'index:customercount', '1'
0 row(s) in 0.0110 seconds

hbase(main):004:0> put 'cdr', '010', 'index:patterncdrindex', '0'
0 row(s) in 0.0210 seconds

hbase(main):005:0> put 'cdr', '010', 'index:customercdrcount', '10'
0 row(s) in 0.0190 seconds

hbase(main):006:0> put 'cdr', '010', 'index:customerpatternduration', '900'
0 row(s) in 0.0200 seconds

hbase(main):007:0> put 'cdr', '010', 'index:customerprofileduration', '900'
0 row(s) in 0.0090 seconds

hbase(main):008:0> put 'cdr', '010', 'index:profilemarker', 'Profile #1'
0 row(s) in 0.0070 seconds

hbase(main):009:0> put 'cdr', '010', 'index:patternmarker', 'Pattern #1 - 10 outgoing voice calls of 1-30 and toward the same corresp.'
0 row(s) in 1.2350 seconds

hbase(main):010:0> put 'cdr', '010', 'customer:cust_imsi', '208100000000000'
0 row(s) in 0.0210 seconds

hbase(main):011:0> put 'cdr', '010', 'customer:cust_isdn', '0600000000'
0 row(s) in 0.0110 seconds

hbase(main):012:0> put 'cdr', '010', 'customer:cust_imei', '350000000000000'
0 row(s) in 0.0240 seconds

hbase(main):013:0> put 'cdr', '010', 'customer:custoperator', 'FRAF2'
0 row(s) in 0.0160 seconds

hbase(main):014:0> put 'cdr', '010', 'type:calltype', 'MOC'
0 row(s) in 0.0340 seconds

hbase(main):015:0> put 'cdr', '010', 'type:callservice', 'Voice'
0 row(s) in 0.0090 seconds

hbase(main):016:0> list 'cdr'
TABLE
cdr
1 row(s) in 0.2020 seconds

hbase(main):017:0> scan 'cdr'
ROW                                           COLUMN+CELL
 010                                          column=customer:cust_imei, timestamp=1473821545112, value=350000000000000
 010                                          column=customer:cust_imsi, timestamp=1473821544911, value=208100000000000
 010                                          column=customer:cust_isdn, timestamp=1473821544985, value=0600000000
 010                                          column=customer:custoperator, timestamp=1473821546676, value=FRAF2
 010                                          column=index:customercdrcount, timestamp=1473821256779, value=10
 010                                          column=index:customercount, timestamp=1473821252725, value=1
 010                                          column=index:customerindex, timestamp=1473821252538, value=0
 010                                          column=index:customerpatternduration, timestamp=1473821286370, value=900
 010                                          column=index:customerprofileduration, timestamp=1473821286440, value=900
 010                                          column=index:patterncdrindex, timestamp=1473821252822, value=0
 010                                          column=index:patternmarker, timestamp=1473821469254, value=Pattern #1 - 10 outgoing voice calls of 1-30 and toward the same corresp.
 010                                          column=index:profilemarker, timestamp=1473821286471, value=Profile #1
 010                                          column=type:callservice, timestamp=1473821566933, value=Voice
 010                                          column=type:calltype, timestamp=1473821566853, value=MOC
1 row(s) in 0.0670 seconds

hbase(main):018:0> get 'cdr', '010'
COLUMN                                        CELL
 customer:cust_imei                           timestamp=1473821545112, value=350000000000000
 customer:cust_imsi                           timestamp=1473821544911, value=208100000000000
 customer:cust_isdn                           timestamp=1473821544985, value=0600000000
 customer:custoperator                        timestamp=1473821546676, value=FRAF2
 index:customercdrcount                       timestamp=1473821256779, value=10
 index:customercount                          timestamp=1473821252725, value=1
 index:customerindex                          timestamp=1473821252538, value=0
 index:customerpatternduration                timestamp=1473821286370, value=900
 index:customerprofileduration                timestamp=1473821286440, value=900
 index:patterncdrindex                        timestamp=1473821252822, value=0
 index:patternmarker                          timestamp=1473821469254, value=Pattern #1 - 10 outgoing voice calls of 1-30 and toward the same corresp.
 index:profilemarker                          timestamp=1473821286471, value=Profile #1
 type:callservice                             timestamp=1473821566933, value=Voice
 type:calltype                                timestamp=1473821566853, value=MOC
14 row(s) in 0.1050 seconds

hbase(main):019:0> exit
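For reference, scans and gets can also be restricted to a single column family or column, which is where the column-family layout pays off on reads; a short sketch against the same 'cdr' table:

hbase(main):001:0> scan 'cdr', {COLUMNS => 'customer'}
hbase(main):002:0> get 'cdr', '010', 'type:calltype'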

Check the HBase Web UI at http://10.0.0.1:16010. In our case the HBase Master runs on the NameNode.

In the next article, we will learn about Apache Spark for in-memory big data analytics.

