Once we have our multi-node Hadoop cluster up and running, let us create an edge node, also called a gateway node. Gateway nodes are the interface between the Hadoop cluster and the outside network, and edge nodes are used to run client applications and cluster administration tools.

The edge node does not have to be part of the cluster; however, if it sits outside the cluster (meaning it does not run any Hadoop service roles), it still needs some basic components, such as the Hadoop binaries and the current cluster configuration files, in order to submit jobs to the cluster. We will install client tools on our EdgeNode, namely Hive, Sqoop, Flume, Pig, Oozie, etc. Before that, let's set up the EdgeNode.

Let's set up an edge node from which clients can access the Hadoop cluster and submit jobs. Spawn a droplet in DigitalOcean (4 GB Memory / 40 GB Disk / NYC3 - Ubuntu 16.04.1 x64) named EdgeNode, with Private Networking turned on so that it can communicate with the other droplets. Our Hadoop cluster now looks like this:

Node          Hostname     IP
Name Node     NameNode     10.0.0.1
Data Node     DataNode1    10.0.100.1
Data Node     DataNode2    10.0.100.2
Client Node   EdgeNode     10.0.100.3

The setup of the EdgeNode will be similar to that of any Hadoop node in the cluster, although no Hadoop cluster services will run on this node, and there will be no entry for it in the NameNode's masters or slaves file.
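
For instance, the slaves file on the NameNode keeps listing only the DataNodes; the path below assumes the standard Hadoop 2.x layout under /usr/local/hadoop used later in this guide:

    root@NameNode:~# cat /usr/local/hadoop/etc/hadoop/slaves
    DataNode1
    DataNode2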

Setup steps involved:

  1. Install Java Runtime Environment:

    root@EdgeNode:~# apt-get update
    root@EdgeNode:~# add-apt-repository ppa:webupd8team/java
    root@EdgeNode:~# apt-get update
    root@EdgeNode:~# apt-get install oracle-java7-installer
    root@EdgeNode:~# java -version
    java version "1.7.0_80"
    Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
    Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
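
    The JAVA_HOME path used in step 5 can be confirmed by resolving the java binary; with the Oracle installer it typically points under /usr/lib/jvm/java-7-oracle:

    root@EdgeNode:~# readlink -f $(which java)
    /usr/lib/jvm/java-7-oracle/jre/bin/java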
    
  2. Set up machine aliases in the hosts file:

    Modify the /etc/hosts file as below:

    127.0.0.1	localhost
    10.0.100.3	EdgeNode
    10.0.0.1	NameNode
    10.0.100.1	DataNode1
    10.0.100.2	DataNode2
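
    A quick connectivity check confirms that the aliases resolve over the private network (hostnames as per the table above):

    root@EdgeNode:~# ping -c 2 NameNode
    root@EdgeNode:~# ping -c 2 DataNode1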
    
  3. Set up the SSH server:

    The EdgeNode requires password-less access to the NameNode, so SSH needs to be set up to allow password-less login from the EdgeNode to the NameNode machine in the cluster. The simplest way to achieve this is to generate a public/private key pair and share the public key with the master node.

    root@EdgeNode:~# apt-get install openssh-server
    root@EdgeNode:~# ssh-keygen -t rsa -P ""
    
    root@EdgeNode:~# cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
    root@EdgeNode:~# chmod 700 ~/.ssh
    root@EdgeNode:~# chmod 600 ~/.ssh/authorized_keys
    

    Now copy the public key of the EdgeNode and append it to the /root/.ssh/authorized_keys file on the NameNode, for example as shown below.
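
    One way to do this from the EdgeNode itself (this assumes root password login to the NameNode is still allowed at this point; ssh-copy-id root@NameNode would achieve the same result):

    root@EdgeNode:~# cat /root/.ssh/id_rsa.pub | ssh root@NameNode "cat >> /root/.ssh/authorized_keys"
    root@EdgeNode:~# ssh NameNode hostname

    The second command should now return the NameNode's hostname without prompting for a password.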

  4. Get the Hadoop distribution & configuration files:

    Let us copy all the Hadoop binaries and configuration files present on the NameNode into our EdgeNode, so that we have the same version of the Hadoop binaries as the cluster together with the configuration details of our cluster.

    root@EdgeNode:~# cd /usr/local
    root@EdgeNode:~# scp -r root@NameNode:/usr/local/hadoop /usr/local/
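
    Once the copy completes, it is worth checking that the EdgeNode now has the same Hadoop release as the cluster (2.7.2 in this guide); the full path is needed here because the environment variables are only set in the next step:

    root@EdgeNode:/usr/local# /usr/local/hadoop/bin/hadoop version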
    
  5. Set up environment variables:

    Now open ~/.bashrc and add the following lines at the end of the file (in vi, press SHIFT + G to jump straight to the end):

    root@EdgeNode:/usr/local# vi ~/.bashrc
    
    export JAVA_HOME=/usr/lib/jvm/java-7-oracle/jre
    export PATH=$PATH:$JAVA_HOME/bin
    export HADOOP_HOME=/usr/local/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin
    export PATH=$PATH:$HADOOP_HOME/sbin
    export HADOOP_MAPRED_HOME=$HADOOP_HOME
    export HADOOP_COMMON_HOME=$HADOOP_HOME
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export YARN_HOME=$HADOOP_HOME
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
    export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:.
    

    Source the ~/.bashrc file:

    root@EdgeNode:/usr/local# source ~/.bashrc
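
    With the environment sourced, the hadoop command should now be on the PATH and point at the copied installation:

    root@EdgeNode:/usr/local# echo $HADOOP_HOME
    /usr/local/hadoop
    root@EdgeNode:/usr/local# hadoop version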
    
  6. Confirm the Hadoop cluster is accessible from the EdgeNode:

    Time to test access to the Hadoop file system:

    root@EdgeNode:~# hadoop fs -ls /
    

    Time to test a Hadoop MapReduce job:

    root@EdgeNode:~# hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 2 4
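
    As a further sanity check that writes land on the cluster's HDFS rather than the local disk, you can round-trip a small file (the directory name below is just an arbitrary example):

    root@EdgeNode:~# hadoop fs -mkdir -p /tmp/edgenode_test
    root@EdgeNode:~# hadoop fs -put /etc/hosts /tmp/edgenode_test/
    root@EdgeNode:~# hadoop fs -cat /tmp/edgenode_test/hosts
    root@EdgeNode:~# hadoop fs -rm -r /tmp/edgenode_test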
    

Good job. Your client node is ready. Buckle up for installing Hive on the client node next.

