How to Setup Hadoop Multi Node Cluster - Step By Step

Updated on Oct 03, 2020

Setting up Hadoop in a single machine is easy, but no fun. Why? Because Hadoop is not meant for a single machine. Hadoop is meant to run on a computing cluster comprising of many machines. Running HDFS and MapReduce on a single machine is great for learning about these systems, but to do useful work we need to run Hadoop on multiple nodes. There are a few options when it comes to staring a Hadoop cluster, from building our own to running on rented hardware, or using any offering that provides Hadoop as a service in the cloud. But, how can we - the learners, the beginners, the amateurs - take advantage of multi-node Hadoop cluster? Well, allow us to show you.

Got Nodes?

If you don't have multiple machines to create your cluster, you can provision them from cloud

To create a multi-node cluster we will need - well - multiple nodes, that is multiple machines. As long as you have access to few spare linux machines - you are fine. But in case you don't - it's ludicrously easily to get a few machines up and running in cloud within minutes.

We will assume that you don't actually have any machine other than the computer you are reading this article on. And we will show you how you can still create your very own Hadoop cluster with multiple machines. On the other hand, if you actually happen to own a few spare machines (at least 3), you don't need to follow the immediate next part, just note down the IP Addresses of your machine and follow us from the next heading.

Provisioning Nodes from Cloud

We are going to use a very simple, fast, reliable and outrageously popular cloud service - Digital Ocean. The moment you start using it, you fall in love with it. No bells, no whistles. Simple machines for developers. Ok, here's what you do -

Go to Digital Ocean now. Sign-up and create your account. Don't worry, if you access Digital Ocean using this Link, you will get $10 credit - that's enough for running 2 machines free-of-charge for a whole month.

Logon to your account and click on "Create Droplet" (they call the machines droplet)

Choose Ubuntu 16.04.1 x64 bit as the image for the droplet

Choose a size of your choice (we are using 2GB /40GB Disk)

Select a datacenter region. Remember that create all the droplets in the same region for droplet to droplet private communication.

In the additional option, select "Private Networking"

Enter the droplet hostname as "NameNode".

Keep other things as default and click "Create"

Once our droplet is ready, you will receive an email with the Static IP address of your new droplet along with password for the root user account.

Name Node versus Data Node

There will be two types of nodes in a Hadoop cluster - NameNode and DataNode. If you had installed Hadoop in a single machine, you could have installed both of them in a single computer, but in a multi-node cluster they are usually on different machines. In our cluster, we will have one name node and multiple data nodes. DataNodes store the actual data of Hadoop, while the NameNode stores the metadata information.

Cluster Topology

We will build our clusters with 3 machines, 2 of which will be used as DataNode while the remaining one will be used as NameNode. The below picture illustrates the network topology along with the IP addresses:

Creating the NameNode

Now once your 1st machine is ready in cloud (or locally, if you had spares) note down the IP address of the machine that you want to configure as the NameNode. In my case, the IP address is 10.0.0.1. Make sure to change this IP addresses according to your IP address in the below commands:

Open up your terminal and connect to the droplet using SSH. (If you are on Windows, you may use Putty for this purpose).

$> ssh root@10.0.0.1

Install Java Environment

Hadoop requires Java as pre requisite. Let us update the system & install Oracle Java 1.7. During the installation, if it prompts for a confirmation, press Y.

root@NameNode:~# apt-get update
root@NameNode:~# add-apt-repository ppa:webupd8team/java
root@NameNode:~# apt-get update
root@NameNode:~# apt-get install oracle-java7-installer
root@NameNode:~# java -version

# Output looks like below
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)

Setup machine alias in host file

Next we need to modify the host file to put the alias of the machine name. In /etc/hosts file delete everything and put the below two lines. In case if you are using Digital Ocean put the Private IP address of the droplet, not the public IPv4 address, so as to facilitate private networking i.e. droplet to droplet communication within Digital Ocean. * Make sure to change this IP addresses according to your IP address.

root@NameNode:~# vi /etc/hosts

10.0.0.1 NameNode

Setup SSH Server

The hadoop control scripts rely on SSH to perform cluster-wide operations. For example, there is a script for stopping and starting all the daemons in the clusters. To work seamlessly, SSH needs to be setup to allow password-less & passphrase-less login for the root/hadoop user from machines in the cluster. The simplest way to achieve this is to generate a public/private key pair, and it will be shared across the cluster.

root@NameNode:~# apt-get install openssh-server

Generate an SSH key for the root user

root@NameNode:~# ssh-keygen -t rsa -P ""

If prompted for SSH key file name, Enter file in which to save the key (/root/.ssh/id_rsa) and press ENTER. Next put the key to authorized_keys directory for future password-less access.

root@NameNode:~# cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
root@NameNode:~# chmod 700 ~/.ssh
root@NameNode:~# chmod 600 ~/.ssh/authorized_keys

Connect and validate ssh password-less login to localhost. If you are being prompted to accept the connection, select yes.

root@NameNode:~# ssh localhost

Once okay, type exit to come out of the localhost.

Download and Install Hadoop distribution

Check first the Latest Stable Hadoop Release Available at: Apache Hadoop.

In the time of writing this article, Hadoop 2.7.2 is the latest stable version. We will install it under /usr/local/ directory. After that we will also create few additional directories like namenode for hadoop to store all namenode information and namesecondary to store the checkpoint images.

root@NameNode:~# cd /usr/local/
root@NameNode:/usr/local/# wget http://www.us.apache.org/dist/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
root@NameNode:/usr/local/# tar -xzvf hadoop-2.7.2.tar.gz >> /dev/null
root@NameNode:/usr/local/# mv hadoop-2.7.2 /usr/local/hadoop
root@NameNode:/usr/local/# mkdir -p /usr/local/hadoop_work/hdfs/namenode
root@NameNode:/usr/local/# mkdir -p /usr/local/hadoop_work/hdfs/namesecondary

Setup Environment Variables

We will setup some environment variables in .bashrc so that every time we restart our machines, it knows where to find Java or Hadoop installation location inside the machine. To do this, first we need to find out where JAVA has been installed. If you have followed this tutorial, it's likely that Java has been installed within one of the subdirectories under /usr/lib/jvm/ directory. So, please browse to this location, and confirm that Java is there:

root@NameNode:/usr/local/# cd /usr/lib/jvm/java-7-oracle/jre
root@NameNode:/usr/lib/jvm/java-7-oracle/jre# java -version

If the above command runs fine, then Java is there inside the default-java directory.

Now open .bashrc

root@NameNode:/usr/lib/jvm/java-7-oracle/jre# vi ~/.bashrc

And put these lines at the end of your .bashrc file (Press SHIFT + G to directly go to the end of the file):

export JAVA_HOME=/usr/lib/jvm/java-7-oracle/jre
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:.

export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.egd=file:/dev/../dev/urandom"

Save the .bashrc file and source it like below:

root@NameNode:/usr/lib/jvm/java-7-oracle/jre# source ~/.bashrc

Setup JAVA_HOME under hadoop environment

It is suggested that you also setup the JAVA_HOME environment variable under Hadoop environment file. Open the hadoop-end.sh file,

root@NameNode:/usr/lib/jvm/java-7-oracle/jre# vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Inside the file, find the line export JAVA_HOME=${JAVA_HOME}. Replace the line like below

export JAVA_HOME=/usr/lib/jvm/java-7-oracle/jre

Confirm Hadoop is installed

At this point, we should confirm if hadoop command is accessible from the terminal

root@NameNode:/usr/lib/jvm/java-7-oracle/jre# cd $HADOOP_HOME/etc/hadoop

root@NameNode:/usr/local/hadoop/etc/hadoop# hadoop version

# Output looks like below
Hadoop 2.7.2
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r b165c4fe8a74265c792ce23f546c64604acf0e41
Compiled by jenkins on 2016-01-26T00:08Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-2.7.2.jar

Configuring the NameNode

Hadoop makes use of some XML based configuration file where it reads all the runtime parameters from. These configuration files are located under /usr/local/hadoop/etc/hadoop folder. We will configure/set some minimal options just to get us started on the Hadoop cluster. In this configuration we will use YARN as the cluster management framework.

Configure core-site.xml

This XML configuration file lets you setup site specific properties, such as I/O settings that are common to HDFS and MapReduce. Open the file and put the following properties:

root@NameNode:/usr/local/hadoop/etc/hadoop# vi core-site.xml

<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://NameNode:8020/</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
</configuration>

Configure hdfs-site.xml

We will tell hadoop where is the name node directory (which we created previously at the end of Hadoop installation) and how many backup copies of the data files to be created in the system (called replication) inside this file under the configuration tag.

root@NameNode:/usr/local/hadoop/etc/hadoop# vi hdfs-site.xml

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration></configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_work/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_work/hdfs/datanode</value>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>file:/usr/local/hadoop_work/hdfs/namesecondary</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
</configuration>

Configure mapred-site.xml

This controls the configuration settings for MapReduce daemons. Here we need to ensure that we will be using YARN framework. Also we will configure the MapReduce Job History server.

Copy the template file mapred-site.xml.template:

root@NameNode:/usr/local/hadoop/etc/hadoop# cp mapred-site.xml.template mapred-site.xml

root@NameNode:/usr/local/hadoop/etc/hadoop# vi mapred-site.xml

Then add the following properties

<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>NameNode:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>NameNode:19888</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/user/app</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Djava.security.egd=file:/dev/../dev/urandom</value>
</property>
</configuration>

Configure yarn-site.xml

This XML configuration file lets you setup YARN site specific properties for Resource Manager & Node Manager. Open the file:

root@NameNode:/usr/local/hadoop/etc/hadoop# vi yarn-site.xml

Then put the following properties under configuration:

<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>NameNode</value>
</property>
<property>
<name>yarn.resourcemanager.bind-host</name>
<value>0.0.0.0</value>
</property>
<property>
<name>yarn.nodemanager.bind-host</name>
<value>0.0.0.0</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>file:/usr/local/hadoop_work/yarn/local</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>file:/usr/local/hadoop_work/yarn/log</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>hdfs://NameNode:8020/var/log/hadoop-yarn/apps</value>
</property>
</configuration>

Setup the Master

Now you need to tell Hadoop NameNode the hostname of Secondary name node. In our case both NameNode & Secondary NameNode resides on the same machine. We do it by editing the masters file:

root@NameNode:/usr/local/hadoop/etc/hadoop# vi masters

Inside the file, put one line:

NameNode

Next, we will format the name node before starting anything now,

root@NameNode:/usr/local/hadoop/etc/hadoop# /usr/local/hadoop/bin/hadoop namenode -format

You should see a success command saying "Storage directory /usr/local/hadoop_work/hdfs/namenode has been successfully formatted"

Creating the Data Nodes

Next, we need to create the datanodes now. If you are using DigitalOcean, login and create two droplets with the minimal configurations as shown below (do not forget to check the private networking option on). If you are using your own network cluster, make sure you have two spare machines ready with hostnames as DataNode1 and DataNode2.

Adding the Data Nodes in Master

Now, at this point we have our master up and running and we have 2 data nodes machines ready as well. The hostnames and the IP addresses of all these machines are known to us (Your IPs will be different than mine). Again note if you are using DigitalOcean put the Private IP address not public IPv4 address.

Node	Hostname	IP
Name Node	NameNode	10.0.0.1
Data Node	DataNode1	10.0.100.1
Data Node	DataNode2	10.0.100.2

Next, we need to tell the master about these new data nodes.

root@NameNode:/usr/local/hadoop/etc/hadoop# vi slaves

Inside the file, remove everything and put the lines as below. Note: If you have more DataNodes, you have to according list their entries in the slaves file, so that the master is aware of the respective slaves:

DataNode1
DataNode2

Also add and append these hostnames in the /etc/hosts file of the master node by adding the following entries. Note use the Private IP address of the DataNodes accordingly.

root@NameNode:/usr/local/hadoop/etc/hadoop# vi /etc/hosts

10.0.100.1 DataNode1
10.0.100.2 DataNode2

Configuring the DataNodes

Next we need to login to the data nodes and perform the following tasks in each of the data nodes

Installing JAVA in data node

Updating /etc/hosts

Environment variable configuration

Hadoop installation

Remember that you need to do the above 4 tasks in all of your data nodes. Below, we are showing them for the first node. Login to your first data node machine and perform these tasks

$> ssh root@10.0.100.1

Installing JAVA in DataNode

root@DataNode1:~# apt-get update
root@DataNode1:~# add-apt-repository ppa:webupd8team/java
root@DataNode1:~# apt-get update
root@DataNode1:~# apt-get install oracle-java7-installer

Updating /etc/hosts

Open the /etc/hosts file

root@DataNode1:~# vi /etc/hosts

Remove the contents of the file (if any), and add the following lines. Remember to put proper Private IP address of your nodes/droplets

10.0.0.1 NameNode
10.0.100.1 DataNode1
10.0.100.2 DataNode2

Environment variable configuration

Open the .bashrc file,

DataNode1# vi ~/.bashrc

Then append the following lines at the end of the file

export JAVA_HOME=/usr/lib/jvm/java-7-oracle/jre
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:.

export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.egd=file:/dev/../dev/urandom"

Hadoop Installation

This time we will not install Hadoop from beginning as we have already installed it in the NameNode. Instead, we will just copy the installation directories from the NameNode to the DataNodes. In order for us to copy that, we will first need to include the public key of the NameNode to the authorised keys of all the data nodes so that we can leverage SSH file copy program (scp) for copying.

root@NameNode:/usr/local/hadoop/etc/hadoop# cat ~/.ssh/id_rsa.pub

#Output Public Key looks something like below
ssh-rss AA21....

....l5Mq4r root@NameNode

Once you have copied the above key (from the beginning ssh-rsa to the end of the file), login to the DataNodes one by one and paste the key in the authorised keys file. During the same time, also create a data node and yarn directories in the individual data nodes. They will be used by Datanode & Node manager daemons respectively.

root@DataNode1:~# vi ~/.ssh/authorized_keys

(append the copied key at the end of this file)

root@DataNode1:~# mkdir -p /usr/local/hadoop_work/hdfs/datanode
root@DataNode1:~# mkdir -p /usr/local/hadoop_work/yarn/local
root@DataNode1:~# mkdir -p /usr/local/hadoop_work/yarn/log

(remember to run the above code block in all the datanodes one by one)

Once done the NameNode can now communicate with the DataNodes via password less ssh.

Connect and validate ssh password-less login to all the Datanodes from the NameNode. If you are being prompted to accept the connection, select yes.

root@NameNode:/usr/local/hadoop/etc/hadoop# ssh DataNode1

Once okay, type exit come out of the remote host.

Now it's time to copy the hadoop installation i.e. the binaries as well as the our site specific configuration files from NameNode to DataNode. Login to NameNode and run the below command:

root@NameNode:/usr/local/hadoop/etc/hadoop# cd /usr/local 
root@NameNode:/usr/local# scp -r hadoop DataNode1:/usr/local
root@NameNode:/usr/local# scp -r hadoop DataNode2:/usr/local

Note: You have to do the same for any additional DataNodes if any. For our case we have only two data nodes.

It's time to start the hadoop distributed file system in the cluster first from the NameNode. If prompted for authentication, select yes.

root@NameNode:/usr/local# $HADOOP_HOME/sbin/start-dfs.sh

Let us first check the daemons running in the NameNode as well as the DataNodes in the Hadoop cluster. Ideally you should see NameNode & Secondary Name node started as java processes in NameNode and DataNode java process in DataNode's. Login to the individual nodes and check the running java processes using jps

root@NameNode:/usr/local# jps

root@DataNode1:~# jps

root@DataNode2:~# jps

Once the Namenode & Datanodes starts successfully, we have to create few directories in hadoop filesystem which has been listed in our site specific configuration files. This HDFS directories will be used by YARN Map Reduce Staging, YARN Log & Job History Server.

root@NameNode:/usr/local# hadoop fs -mkdir /tmp
root@NameNode:/usr/local# hadoop fs -chmod -R 1777 /tmp

root@NameNode:/usr/local# hadoop fs -mkdir /user
root@NameNode:/usr/local# hadoop fs -chmod -R 1777 /user
root@NameNode:/usr/local# hadoop fs -mkdir /user/app
root@NameNode:/usr/local# hadoop fs -chmod -R 1777 /user/app

root@NameNode:/usr/local# hadoop fs -mkdir -p /var/log/hadoop-yarn
root@NameNode:/usr/local# hadoop fs -chmod -R 1777 /var/log/hadoop-yarn
root@NameNode:/usr/local# hadoop fs -mkdir -p /var/log/hadoop-yarn/apps
root@NameNode:/usr/local# hadoop fs -chmod -R 1777 /var/log/hadoop-yarn/apps


# Now Verify the HDFS File Structure
root@NameNode:/usr/local# hadoop fs -ls -R /

# Output should look like below:
drwxrwxrwt - root supergroup 0 /tmp
drwxr-xr-x - root supergroup 0 /user
drwxrwxrwt - root supergroup 0 /user/app
drwxr-xr-x - root supergroup 0 /var
drwxr-xr-x - root supergroup 0 /var/log
drwxr-xr-x - root supergroup 0 /var/log/hadoop-yarn
drwxr-xr-x - root supergroup 0 /var/log/hadoop-yarn/apps

Now we need to start the YARN cluster framework. We should execute the below command from the Node hosting the Resource Manager. In our case the Resource Manager is in the same NameNode.

root@NameNode:/usr/local# $HADOOP_HOME/sbin/start-yarn.sh

Now we will start the MapReduce History Server. We should execute the below command from the Node hosting the History Server. In our case the History Server is in the same NameNode. Before that quickly edit the $HADOOP_HOME/etc/hadoop/mapred-site.xml file in the NameNode. Replace the hostname in the value for the property names as below from NameNode to 0.0.0.0:

- mapreduce.jobhistory.address from NameNode:10020 to 0.0.0.0:10020.
- mapreduce.jobhistory.webapp.address from NameNode:19888 to 0.0.0.0:19888

root@NameNode:/usr/local# $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver

Let's check again the daemons running in the NameNode as well as the DataNodes in the Hadoop cluster. Ideally you should see NameNode, Secondary NameNode & ResourceManager started as java processes in NameNode; And DataNode & NodeManager java process in DataNode's. Login to the individual nodes and check the running java processes using jps

root@NameNode:/usr/local# jps

# Output should look like below:
3619 NameNode
3864 SecondaryNameNode
4041 ResourceManager
4240 JobHistoryServer

root@DataNode1:~# jps
# Output should look like below:
2650 DataNode
2787 NodeManager

root@DataNode2:~# jps
# Output should look like below:
2613 DataNode
2749 NodeManager

Great the cluster is up & running. Time to test the hadoop file system by uploading some dummy data file in HDFS. From the NameNode we issue the below commands

root@NameNode:/usr/local# hadoop fs -mkdir /analysis
root@NameNode:/usr/local# hadoop fs -ls /

Create a small data file and try to load it in HDFS cluster.

root@NameNode:/usr/local# echo "Some, dummy, file" > /root/dummy.csv 
root@NameNode:/usr/local# hadoop fs -put /root/dummy.csv /analysis/dummy.csv
root@NameNode:/usr/local# hadoop fs -ls /analysis
root@NameNode:/usr/local# hadoop fs -tail /analysis/dummy.csv

Finally check the cluster status in the web browser. Remember to use the Public IP Address of the NameNode, ResourceManager & HistoryServer respectively. In our case all are in the NameNode machine with IP 10.0.0.1

Go to -

http://10.0.0.1:50070
http://10.0.0.1:8088http://10.0.0.1:19888

Let's check the hdfs file system, replication etc from the web browser by going to the http://10.0.0.1:50070/explorer.html

Next we will run a sample hadoop map-reduce utility on our hadoop YARN cluster.

root@NameNode:/usr/local# hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 2 4

In the next article, we will setup a client node (also called Edge Node or Gateway node), followed by Hive and Sqoop installation.