In this article we will use Apache Flume to gather streaming access log data from our remote web server into the Hadoop Distributed File System. We will be analyzing the access log on a near real-time basis, so we have to set up Flume such that it collects the access log information from the web server and pushes it to the Hadoop cluster. Once the data is in HDFS, we can analyze it better using Hive. Let us check the multiple Flume agent configurations.

In this article we will configure a multi-agent Flume flow. To set up a multi-tier flow, the Avro sink of the first hop must point to the Avro source of the next hop, so that the first Flume agent forwards events to the next Flume agent. Here we link the Avro sink of the weblog agent to the Avro source of the HDFS agent. As a result, the events coming from the external web server source eventually get stored in HDFS.
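As a minimal sketch of that linkage (agent and component names here are only illustrative; the actual configurations used in this article follow below), the sink of the first agent simply points at the host and port on which the second agent's Avro source listens:

# Hop 1 (web server side): Avro sink forwarding to the collector host
agent1.sinks.fwd.type = avro
agent1.sinks.fwd.hostname = <collector-host>
agent1.sinks.fwd.port = 4545

# Hop 2 (collector side): Avro source listening on the same port
agent2.sources.collect.type = avro
agent2.sources.collect.bind = 0.0.0.0
agent2.sources.collect.port = 4545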

Install FLUME on the Web Server

We have Nginx as our web server on a remote machine. In order to stream the access log from the web server to HDFS, we first need to install and configure Flume on the web server. So log in to our WebServer node via SSH and check whether we are able to ping our Hadoop edge machine. In our case the EdgeNode IP is 10.0.100.3.

root@WebSrv:~# ping 10.0.100.3
PING 10.0.100.3 (10.0.100.3) 56(84) bytes of data.
64 bytes from 10.0.100.3: icmp_seq=1 ttl=63 time=0.887 ms
64 bytes from 10.0.100.3: icmp_seq=2 ttl=63 time=0.395 ms

We will install FLUME in the directory /usr/local. As a prerequisite, we also need to install Java.
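Flume 1.6 needs a Java runtime (Java 1.7 is recommended). A quick way to check whether Java is already available on the web server:

root@WebSrv:~# java -version

If the command fails, install a JDK/JRE first; in our setup Oracle Java 7 is assumed to be already installed under /usr/lib/jvm/java-7-oracle, which we will reference later in flume-env.sh.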

root@WebSrv:~# cd /usr/local

Next, we will download a recent stable release of FLUME from the Apache site below:
http://www-us.apache.org/dist/flume/stable/
Then follow the installation steps below:

root@WebSrv:/usr/local# wget http://www-us.apache.org/dist/flume/stable/apache-flume-1.6.0-bin.tar.gz
root@WebSrv:/usr/local# tar -xzvf apache-flume-1.6.0-bin.tar.gz >> /dev/null
root@WebSrv:/usr/local# mv apache-flume-1.6.0-bin /usr/local/flume
root@WebSrv:/usr/local# rm apache-flume-1.6.0-bin.tar.gz

Next we will set the FLUME environment variables in the .bashrc file. Append the lines below, then save and quit.

root@WebSrv:/usr/local# vi ~/.bashrc


export FLUME_HOME=/usr/local/flume
export PATH=$PATH:$FLUME_HOME/bin

export CLASSPATH=$CLASSPATH:/usr/local/flume/lib/*:.

Source the environment file.

root@WebSrv:/usr/local# source ~/.bashrc
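
As a quick sanity check that the variables are now in effect:

root@WebSrv:/usr/local# echo $FLUME_HOME
root@WebSrv:/usr/local# which flume-ng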

Next we need to set JAVA_HOME in the Flume environment file.

root@WebSrv:/usr/local# cd flume/conf
root@WebSrv:/usr/local/flume/conf# cp flume-env.sh.template flume-env.sh

Open the file flume-env.sh and set JAVA_HOME as below:

root@WebSrv:/usr/local/flume/conf# vi flume-env.sh

export JAVA_HOME=/usr/lib/jvm/java-7-oracle/jre
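
If you are unsure where Java lives on your machine (the java-7-oracle path above is specific to our setup), one way to locate it is to resolve the java binary and strip the trailing /bin/java from the result:

root@WebSrv:/usr/local/flume/conf# readlink -f $(which java)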

Now let us validate that Flume has been installed properly:

root@WebSrv:/usr/local/flume/conf# cd $FLUME_HOME/bin
root@WebSrv:/usr/local/flume/bin# flume-ng version
Flume 1.6.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: 2561a23240a71ba20bf288c7c2cda88f443c2080
Compiled by hshreedharan on Mon May 11 11:15:44 PDT 2015
From source with checksum b29e416802ce9ece3269d34233baf43f

Configure Multiple Flume Agents

Once FLUME is installed, we have to create the source-side Flume agent to collect the streaming access log data and forward it to another agent running on the EdgeNode of our Hadoop cluster. The IP of our EdgeNode is 10.0.100.3.

root@WebSrv:/usr/local/flume/bin# vi $FLUME_HOME/conf/WebSrvAccLogAgent.conf
# WebSrvAccLogAgent

# Naming the components of the current agent.
WsAccLog.sources = NginxAccess
WsAccLog.sinks = AvroSink
WsAccLog.channels = MemChannel

# Source Configuration
WsAccLog.sources.NginxAccess.type = exec
WsAccLog.sources.NginxAccess.command = tail -F /srv/intellip.com/logs/access.log
WsAccLog.sources.NginxAccess.batchSize = 1
WsAccLog.sources.NginxAccess.interceptors = itime

# Timestamp Interceptor
WsAccLog.sources.NginxAccess.interceptors.itime.type = timestamp

# Sink Configuration (Send to Flume Collector Agent on Hadoop Edge Node)
WsAccLog.sinks.AvroSink.type = avro
WsAccLog.sinks.AvroSink.hostname = 10.0.100.3
WsAccLog.sinks.AvroSink.port = 4545

# Channel Configuration
WsAccLog.channels.MemChannel.type = memory

# Bind Source & Sink to the Channel
WsAccLog.sources.NginxAccess.channels = MemChannel
WsAccLog.sinks.AvroSink.channel = MemChannel
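
One optional tweak: the memory channel holds at most 100 events by default. For a busier web server you may want to enlarge the buffer; the numbers below are only illustrative and should be sized to your traffic and available memory.

# Optional: enlarge the in-memory buffer
WsAccLog.channels.MemChannel.capacity = 10000
WsAccLog.channels.MemChannel.transactionCapacity = 1000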

Before we start the source-side Flume agent, we have to configure the next Flume agent running on the EdgeNode of the Hadoop cluster. Log in to our EdgeNode using another SSH session.

root@EdgeNode:~# cd $FLUME_HOME
root@EdgeNode:/usr/local/flume# vi conf/EdgeAccLogAgent.conf
# EdgeAccLogAgent

# Naming the components of the current agent.
EdgeAccLog.sources = AvroSource
EdgeAccLog.sinks = HDFS
EdgeAccLog.channels = MemChannel

# Source Configuration
EdgeAccLog.sources.AvroSource.type = avro
EdgeAccLog.sources.AvroSource.bind = 0.0.0.0
EdgeAccLog.sources.AvroSource.port = 4545

# Sink Configuration (Write to HDFS)
EdgeAccLog.sinks.HDFS.type = hdfs
EdgeAccLog.sinks.HDFS.hdfs.path = /flume_analytics/nginx/access_log
EdgeAccLog.sinks.HDFS.hdfs.fileType = DataStream
EdgeAccLog.sinks.HDFS.hdfs.rollCount = 1000

# Channel Configuration
EdgeAccLog.channels.MemChannel.type = memory

# Bind Source & Sink to the Channel
EdgeAccLog.sources.AvroSource.channels = MemChannel
EdgeAccLog.sinks.HDFS.channel = MemChannel
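
Optionally, because the upstream agent stamps each event with a timestamp header, the HDFS sink can partition the output path by date and roll files by time or size instead of (or in addition to) event count. The values below are illustrative only; if you use a dated path, remember that the hadoop fs commands later in this article would then need to point at the dated sub-directories.

# Optional: date-partitioned path and time-based rolling
EdgeAccLog.sinks.HDFS.hdfs.path = /flume_analytics/nginx/access_log/%Y-%m-%d
EdgeAccLog.sinks.HDFS.hdfs.rollInterval = 300
EdgeAccLog.sinks.HDFS.hdfs.rollSize = 0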

Next let us create the destination directory in HDFS where the log files will be parked.

root@EdgeNode:/usr/local/flume# hadoop fs -mkdir -p /flume_analytics/nginx/access_log

First of all, start the Flume agent on the EdgeNode.

root@EdgeNode:/usr/local/flume# $FLUME_HOME/bin/flume-ng agent --conf $FLUME_HOME/conf/ -f $FLUME_HOME/conf/EdgeAccLogAgent.conf -n EdgeAccLog
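
Optionally, before starting the web server side agent, we can confirm from the WebSrv machine that the collector's Avro port is now reachable (using netcat, if it is installed):

root@WebSrv:~# nc -zv 10.0.100.3 4545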

Next start the flume agent on the WebServer side.

root@WebSrv:/usr/local/flume/bin# $FLUME_HOME/bin/flume-ng agent --conf $FLUME_HOME/conf/ -f $FLUME_HOME/conf/WebSrvAccLogAgent.conf -n WsAccLog
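
Both agents above run in the foreground, which is convenient while testing. To keep an agent running after logging out, one common approach (our choice, not the only one) is to start it with nohup and let it log via the log4j settings shipped in $FLUME_HOME/conf/log4j.properties:

root@WebSrv:/usr/local/flume/bin# nohup $FLUME_HOME/bin/flume-ng agent --conf $FLUME_HOME/conf/ -f $FLUME_HOME/conf/WebSrvAccLogAgent.conf -n WsAccLog > /dev/null 2>&1 &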

Once both agents start successfully, let us check the HDFS file system to validate whether the streaming access log events have been successfully collected and written by Flume. Open another session, log in to the EdgeNode via SSH, and validate the results.

root@EdgeNode:~# hadoop fs -ls /flume_analytics/nginx/access_log

#Sample Output
/flume_analytics/nginx/access_log/FlumeData.1473927556551
/flume_analytics/nginx/access_log/FlumeData.1473927556552
/flume_analytics/nginx/access_log/FlumeData.1473927593779
/flume_analytics/nginx/access_log/FlumeData.1473927593780
...

root@EdgeNode:~# hadoop fs -tail /flume_analytics/nginx/access_log/FlumeData.1473927556551
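
The FlumeData file names shown above are only samples (the numeric suffix is a timestamp-based counter assigned by the HDFS sink). For a rough count of how many access log events have landed so far, something like this can be used:

root@EdgeNode:~# hadoop fs -cat /flume_analytics/nginx/access_log/FlumeData.* | wc -l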

We have successfully configured multiple Flume agents. In the next article we will further analyze the access log files in HDFS using Hive.

