Apache Flume is a distributed, robust, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data or streaming event data from different sources to a centralized data store. Its main goal is to deliver log data from various application or web servers to Apache Hadoop's HDFS. Flume supports a large set of source and destination types.

The basic components of Flume are:

  • Agent: A JVM process that runs Flume. It receives Events from a Source or from other agents and forwards them to the next destination, either a Sink or another agent. A Flume deployment can have multiple agents.
  • Event: A single datum or unit of data that flows through a Flume agent. An Event carries a byte-array payload accompanied by an optional set of headers.
  • Source: Receives/consumes Events in a specific format, delivered by an external source such as a web server.
    E.g. HTTP Source, JMS Source, Avro Source, Thrift Source, Twitter 1% firehose Source, etc.
  • Sink: Removes an Event from the Channel and either puts it into an external repository or forwards it to the Source at the next hop of the flow.
    E.g. HDFS Sink, Hive Sink, HBase Sink, Avro Sink, Thrift Sink, etc.
  • Channel: A passive, transient store that holds an Event from a Source until that Event is consumed by a Sink.
    E.g. Memory Channel, File Channel, JDBC Channel, etc.
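
Putting these components together: every agent definition in a Flume configuration file follows the same wiring pattern. A minimal hypothetical skeleton (the agent name a1 and the component names are placeholders, not from any real deployment):

```
# Name the components of agent "a1"
a1.sources = src1
a1.sinks = snk1
a1.channels = ch1

# Declare each component's type
a1.sources.src1.type = <source-type>
a1.sinks.snk1.type = <sink-type>
a1.channels.ch1.type = memory

# Bind them: a Source can feed several Channels (plural key),
# but a Sink drains exactly one Channel (singular key)
a1.sources.src1.channels = ch1
a1.sinks.snk1.channel = ch1
```

We will fill in a concrete version of this skeleton later in the article.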

Install FLUME

First we will ssh login to our EdgeNode. We will install Flume under /usr/local, the same base directory where we already have our Hadoop binaries, configuration files and other client tools. Later we will look into a use case with HDFS as our Sink.

root@EdgeNode:~# cd /usr/local

Next, we will download a recent stable release of Flume from the Apache site below:
http://www-us.apache.org/dist/flume/stable/
Then follow the installation steps below:

root@EdgeNode:/usr/local# wget http://www-us.apache.org/dist/flume/stable/apache-flume-1.6.0-bin.tar.gz
root@EdgeNode:/usr/local# tar -xzvf apache-flume-1.6.0-bin.tar.gz >> /dev/null
root@EdgeNode:/usr/local# mv apache-flume-1.6.0-bin /usr/local/flume
root@EdgeNode:/usr/local# rm apache-flume-1.6.0-bin.tar.gz

Next we will set the Flume environment variables in the .bashrc file. Append the lines below, then save and quit.

root@EdgeNode:/usr/local# vi ~/.bashrc


export FLUME_HOME=/usr/local/flume
export PATH=$PATH:$FLUME_HOME/bin

export CLASSPATH=$CLASSPATH:/usr/local/flume/lib/*:.

Source the environment file.

root@EdgeNode:/usr/local# source ~/.bashrc
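
To confirm the variables took effect, a quick sanity check (paths assumed from the steps above):

```shell
# Verify FLUME_HOME is set and its bin directory is on PATH
export FLUME_HOME=/usr/local/flume          # as appended to ~/.bashrc above
export PATH="$PATH:$FLUME_HOME/bin"
case ":$PATH:" in
  *":$FLUME_HOME/bin:"*) echo "flume bin on PATH" ;;
  *)                     echo "flume bin missing from PATH" ;;
esac
```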

Next we need to set JAVA_HOME in the Flume environment file.

root@EdgeNode:/usr/local# cd flume/conf
root@EdgeNode:/usr/local/flume/conf# cp flume-env.sh.template flume-env.sh

Open the file flume-env.sh and set JAVA_HOME as below:

root@EdgeNode:/usr/local/flume/conf# vi flume-env.sh

export JAVA_HOME=/usr/lib/jvm/java-7-oracle/jre
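
The path above assumes the Oracle JDK 7 package location; if your JVM lives elsewhere, JAVA_HOME is simply the java binary's path with the trailing /bin/java removed. A small sketch (the example path is illustrative):

```shell
# Derive a JAVA_HOME candidate from an absolute path to the java binary.
# In practice you might obtain the path with: readlink -f "$(which java)"
JAVA_BIN="/usr/lib/jvm/java-7-oracle/jre/bin/java"   # illustrative path
JAVA_HOME_GUESS="${JAVA_BIN%/bin/java}"              # strip trailing /bin/java
echo "$JAVA_HOME_GUESS"                              # /usr/lib/jvm/java-7-oracle/jre
```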

Now let us validate that Flume is installed properly:

root@EdgeNode:/usr/local/flume/conf# cd $FLUME_HOME/bin
root@EdgeNode:/usr/local/flume/bin# flume-ng version
Flume 1.6.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: 2561a23240a71ba20bf288c7c2cda88f443c2080
Compiled by hshreedharan on Mon May 11 11:15:44 PDT 2015
From source with checksum b29e416802ce9ece3269d34233baf43f

We have now installed and configured Flume. Next, let us create a simple Flume configuration file to gather some streaming data.

Flume Configuration

Let us create a sample Flume agent configuration. A Flume configuration file is a Java properties file of key-value pairs. Since a single configuration file can define multiple agents, each agent's properties are scoped by its unique agent name. To configure an agent we need to define its Source, Sink & Channel, and bind the Source & Sink to that Channel. For our testing purposes we will use a Sequence Generator Source. This test source continuously generates events containing a counter that starts at 0 and increments by 1.

root@EdgeNode:/usr/local/flume/bin# vi $FLUME_HOME/conf/SeqGenAgent.conf
# Naming the components of the current agent.
SeqGen.sources = SeqSource
SeqGen.sinks = HDFS
SeqGen.channels = MemChannel

# Source Configuration
SeqGen.sources.SeqSource.type = seq

# Sink Configuration
SeqGen.sinks.HDFS.type = hdfs
SeqGen.sinks.HDFS.hdfs.path = /flume_analytics/seqgen_data
SeqGen.sinks.HDFS.hdfs.fileType = DataStream
SeqGen.sinks.HDFS.hdfs.rollCount = 10000

# Channel Configuration
SeqGen.channels.MemChannel.type = memory

# Bind Source & Sink to the Channel 
SeqGen.sources.SeqSource.channels = MemChannel
SeqGen.sinks.HDFS.channel = MemChannel
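
The sink above rolls to a new file after every 10,000 events and otherwise uses Flume's defaults. A few other HDFS Sink properties that are commonly tuned (the values below are illustrative, not recommendations):

```
# Optional HDFS sink tuning (property names from the Flume user guide)
SeqGen.sinks.HDFS.hdfs.filePrefix = seqgen      # file name prefix (default: FlumeData)
SeqGen.sinks.HDFS.hdfs.rollInterval = 300       # also roll every 300 s; 0 disables time-based rolls
SeqGen.sinks.HDFS.hdfs.rollSize = 0             # 0 disables size-based rolling
SeqGen.sinks.HDFS.hdfs.batchSize = 1000         # events written per flush to HDFS
```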

Before we start the Flume agent, let us create the destination directory in HDFS.

root@EdgeNode:/usr/local/flume/bin# hadoop fs -mkdir -p /flume_analytics/seqgen_data

Now start the agent, passing the agent name (-n) that matches the name used in the configuration file:

root@EdgeNode:/usr/local/flume/bin# flume-ng agent --conf $FLUME_HOME/conf/ -f $FLUME_HOME/conf/SeqGenAgent.conf -n SeqGen

Once it starts successfully, let us check the HDFS file system to validate that the generated events have been successfully collected & written by Flume. Open another session & ssh login to the EdgeNode to validate the results.

root@EdgeNode:~# hadoop fs -ls /flume_analytics/seqgen_data

#Sample Output
/flume_analytics/seqgen_data/FlumeData.1473598922914
/flume_analytics/seqgen_data/FlumeData.1473598922915
/flume_analytics/seqgen_data/FlumeData.1473598922916
...
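
The numeric suffix on each file name starts from an epoch-millisecond timestamp taken when the sink begins writing, and is incremented by one for each subsequent file, which is why the sample names above differ by 1. To see roughly when a run started, the suffix can be decoded (GNU date assumed):

```shell
# Decode the millisecond suffix of a FlumeData file name
f="FlumeData.1473598922914"
ms="${f#FlumeData.}"                 # -> 1473598922914
date -u -d "@$((ms / 1000))" +%F     # prints 2016-09-11 (UTC date of the run)
```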

root@EdgeNode:~# hadoop fs -tail /flume_analytics/seqgen_data/FlumeData.1473598922914

We have successfully completed configuring our first Flume agent. Let us now stop the agent running in our previous shell session using Ctrl+C. (Note that Ctrl+Z only suspends the process; Ctrl+C terminates it.)

In this article we have done some basic hands-on work with streaming data. Let's now level up our skills with HBase, the non-relational, distributed, column-oriented NoSQL database that runs on HDFS.

