Once we have our multi-node hadoop cluster up and running, let us create an EdgeNode or a GatewayNode. Gateway nodes are the interface between the Hadoop cluster and the outside network. Edge nodes are used to run client applications and cluster administration tools.

The edge node does not have to be part of the cluster, however if it is outside of the cluster (meaning it doesn't have any specific Hadoop service roles running on it), it will need some basic components such as Hadoop binaries and current Hadoop cluster config files to submit jobs on the cluster. We will install client tools in our EdgeNode namely, HIVE, SQOOP, FLUME, PIG, OOZIE etc. Before that let’s setup an EdgeNode.

Lets set up an edge node for clients to access the Hadoop Cluster for submitting jobs. Spawn a droplet in DigitalOcean 4 GB Memory / 40 GB Disk / NYC3 - Ubuntu 16.04.1 x64 named as EdgeNode with Private Networking On, so that it can communicate with the other droplets. So currently, our hadoop cluster looks like below:

Name NodeNameNode10.0.0.1
Data NodeDataNode110.0.100.1
Data NodeDataNode210.0.100.2
Client NodeEdgeNode10.0.100.3

The setup of EdgeNode will be similar to any hadoop node in the cluster, although there will be no hadoop cluster services running on this node. There will be no entry in the NameNode’s Masters or Salves file for this edge node.

Setup steps involved:

  1. Install Java Runtime Environment:

    root@EdgeNode:~# apt-get update
    root@EdgeNode:~# add-apt-repository ppa:webupd8team/java
    root@EdgeNode:~# apt-get install oracle-java7-installer
    root@EdgeNode:~# java -version
    java version "1.7.0_80"
    Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
    Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
  2. Setup machine alias in host file:

    Modify /etc/hosts file as below	localhost	EdgeNode	NameNode	DataNode1	DataNode2
  3. Setup SSH Server:

    EdgeNode requires password less access to NameNode. SSH needs to be setup, to allow password-less login from EdgeNode to NameNode machine in the cluster. The simplest way to achieve this is to generate a public/private key pair, and the public key will be shared with the master node.

    root@EdgeNode:~# apt-get  install openssh-server
    root@EdgeNode:~# ssh-keygen -t rsa -P ""
    root@EdgeNode:~# cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
    root@EdgeNode:~# chmod 700 ~/.ssh
    root@EdgeNode:~# chmod 600 ~/.ssh/authorized_keys

    Now copy the Public Key of the EdgeNode & paste in the /root/.ssh/authorized_keys file in the NameNode.

  4. Getting Hadoop distribution & Configuration files:

    Let us get all the hadoop binaries and configuration files present in the NameNode into our EdgeNode, so that we have the same version of Hadoop binaries as that in the cluster and the configuration details of our cluster.

    root@EdgeNode:~# cd /usr/local
    root@EdgeNode:~# scp -r root@NameNode:/usr/local/hadoop /usr/local/
  5. Setup Environment Variables:

    Now open .bashrc and put these lines at the end of your .bashrc file (Press SHIFT + G to directly go to the end of the file):

    root@EdgeNode:/usr/local# vi ~/.bashrc
    export JAVA_HOME=/usr/lib/jvm/java-7-oracle/jre
    export PATH=$PATH:$JAVA_HOME/bin
    export HADOOP_HOME=/usr/local/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin
    export PATH=$PATH:$HADOOP_HOME/sbin
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
    export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:.

    Source the ~/.bashrc file:

    root@EdgeNode:/usr/local# source ~/.bashrc
  6. Confirm Hadoop Cluster is accessible from EdgeNode:

    Time to test the hadoop file system.

    root@EdgeNode:~# hadoop fs -ls /

    Time to test a hadoop map-reduce job.

    root@EdgeNode:~# hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 2 4

Good job. Your client node is ready. Buckle up for installing hive in client node next.

Have a question on this subject?

Ask questions to our expert community members and clear your doubts. Asking question or engaging in technical discussion is both easy and rewarding.

Are you on Twitter?

Start following us. This way we will always keep you updated with what's happening in Data Analytics community. We won't spam you. Promise.

  • Hadoop DataLake Implementation Part 7

    In this article we will load our master data table ‘Product’ as Slowly Changing Dimension of Type 2 to maintain full history, so as to analyze the sales and stocks data with reference to the historical master data.

  • Configuring MySQL as Hive Metastore

    In the previous article, we have learnt How to Install and Configure Hive with default Derby metastore. However, an embedded derby based metastore can process only one request at a time. Since this is very restrictive, we will setup a traditional...

  • Hadoop DataLake Implementation Part 3

    To complete our implementation setup we will create the source tables based on the downloaded datafiles. Let us first load the SQL files in MySQL server under a new database called ‘sales’. We will simulate this database schema as our OLTP source...

  • Install SQOOP in Client Node of Hadoop Cluster

    Sqoop is an open source software product of the Apache Software Foundation in the hadoop ecosystem, designed to transfer data between Hadoop and relational databases or mainframes. Sqoop can be used to import data from a relational database...

  • Hadoop DataLake Implementation Part 9

    In this article we will load our final fact table i.e. stock.

  • Hadoop DataLake Implementation Part 8

    In this article we will load our first fact table into Hive warehouse which is sales transactions.

  • Fools Guide to Big data - What is Big Data

    Sure enough, you have heard the term, "Big Data" many times before. There is no dearth of information in the Internet and printed medium about this. But guess what, this term still remains vaguely defined and poorly understood. This essay is our...

  • Hadoop DataLake Implementation Part 10

    In this article we will create oozie workflow to orchestrate the daily loading of showroom dimension table from MySQL source to HDFS using Sqoop, followed by Loading data from HDFS to Hive warehouse using Hive and finally housekkeping & archive.

  • Hadoop DataLake Implementation Part 6

    In this article we will load the Customer data in the Hive warehouse as SCD Type 1. This time we will follow a different approach to implement Insert/Update or Merge strategy using Hive QL, rather than SQOOP Merge utility

  • SQOOP import from Oracle

    In this article we will use Apache SQOOP to import data from Oracle database. Now that we have an oracle server in our cluster ready, let us login to EdgeNode. Next we will configure sqoop to import this data in HDFS file system followed by direct...