Sqoop is an open source software product of the Apache Software Foundation in the Hadoop ecosystem, designed to transfer data between Hadoop and relational databases or mainframes. Sqoop can be used to import data from a relational database management system (RDBMS) such as MySQL, Oracle, MS SQL Server or PostgreSQL, or from a mainframe, into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.

Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
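To give a flavour of what this looks like in practice, below is a minimal, hypothetical import invocation; the connection URL, credentials and table name are placeholders, and the -m flag sets the number of parallel map tasks Sqoop will run:

root@EdgeNode:~# sqoop import \
    --connect jdbc:mysql://dbserver:3306/salesdb \
    --username dbuser -P \
    --table customers \
    --target-dir /user/root/customers \
    -m 4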

Sqoop Installation in EdgeNode

Before installing Sqoop, a release of Hadoop must be installed and configured. We will install Sqoop under the same directory path, /usr/local, where we already have our Hadoop binaries and configuration files. Sqoop is invoked through the program launch capability provided by Hadoop: the sqoop command-line program is a wrapper that runs the bin/hadoop script shipped with Hadoop. So before installation, check that the $HADOOP_HOME, $HADOOP_COMMON_HOME, $HADOOP_MAPRED_HOME and $HADOOP_CONF_DIR environment variables are already set to the paths of the Hadoop installation.
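A quick sanity check on the EdgeNode; each of these should print a valid path:

root@EdgeNode:~# echo $HADOOP_HOME $HADOOP_COMMON_HOME $HADOOP_MAPRED_HOME $HADOOP_CONF_DIR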

Get the latest stable version of Sqoop from:
http://www-us.apache.org/dist/sqoop/1.4.6/

Log in to the EdgeNode and follow the installation steps below:

root@EdgeNode:~# cd /usr/local/
root@EdgeNode:/usr/local# wget http://www-us.apache.org/dist/sqoop/1.4.6/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
root@EdgeNode:/usr/local# tar -xzf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
root@EdgeNode:/usr/local# mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha /usr/local/sqoop
root@EdgeNode:/usr/local# rm sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz

Next we will set the Sqoop environment variables in the .bashrc file. Append the lines below, then save and quit.

root@EdgeNode:/usr/local# vi ~/.bashrc


export SQOOP_HOME=/usr/local/sqoop
export PATH=$PATH:$SQOOP_HOME/bin

# Use /dev/urandom for SecureRandom so that JDBC connections do not block waiting for entropy
export HADOOP_OPTS=-Djava.security.egd=file:/dev/../dev/urandom

Source the environment file.

root@EdgeNode:/usr/local# source ~/.bashrc
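To confirm the new variables have taken effect:

root@EdgeNode:/usr/local# echo $SQOOP_HOME
root@EdgeNode:/usr/local# which sqoop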

Next we need to configure the Sqoop environment script, which is invoked whenever the Sqoop utilities are executed.

root@EdgeNode:/usr/local# cd sqoop/conf
root@EdgeNode:/usr/local/sqoop/conf# cp sqoop-env-template.sh sqoop-env.sh

Now set the paths for the environment variables accordingly in the sqoop-env.sh file:

root@EdgeNode:/usr/local/sqoop/conf# vi sqoop-env.sh


export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive

Sqoop is a collection of related tools. A by-product of the sqoop import utility is a generated Java class that can encapsulate one row of the imported table. This class is used during the import process by Sqoop itself, and its Java source code is also provided for use in subsequent MapReduce processing of the data. The class can serialize and deserialize data to and from the SequenceFile format, and it can also parse the delimited-text form of a record. These abilities allow us to quickly develop MapReduce applications that process the HDFS-stored records.

So let us create an additional work directory for Sqoop that will host the .java, .class, .jar and .avsc (Avro schema) files generated for the tables whenever a Sqoop import is invoked.

root@EdgeNode:/usr/local/sqoop/conf# cd $SQOOP_HOME/
root@EdgeNode:/usr/local/sqoop# mkdir sqoop_work
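The generated artifacts can later be directed into this directory with Sqoop's --outdir and --bindir options. As a sketch, the standalone codegen tool generates the row class for a table without importing any data; the connection URL, credentials and table name below are placeholders:

root@EdgeNode:/usr/local/sqoop# sqoop codegen \
    --connect jdbc:mysql://dbserver:3306/salesdb \
    --username dbuser -P \
    --table customers \
    --outdir $SQOOP_HOME/sqoop_work \
    --bindir $SQOOP_HOME/sqoop_work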

Now let us validate that Sqoop is installed properly:

root@EdgeNode:/usr/local/sqoop# cd $SQOOP_HOME/bin
root@EdgeNode:/usr/local/sqoop/bin# sqoop-version
INFO sqoop.Sqoop: Running Sqoop version: 1.4.6

Check the available Sqoop utility commands:

root@EdgeNode:/usr/local/sqoop/bin# sqoop help
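Help on any individual tool is available by passing its name, for example:

root@EdgeNode:/usr/local/sqoop/bin# sqoop help import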

RDBMS JDBC Drivers for Sqoop

In preparation for upcoming requirements, we will download the JDBC driver JAR files for the MySQL and Oracle databases and move them into the Sqoop library directory. In the next articles we will use Sqoop to extract data from MySQL and from Oracle, load it into the HDFS cluster, and load it further into Hive tables.

root@EdgeNode:/usr/local/sqoop/bin# cd /usr/local/sqoop
root@EdgeNode:/usr/local/sqoop# wget http://cdn.mysql.com//Downloads/Connector-J/mysql-connector-java-5.0.8.tar.gz
root@EdgeNode:/usr/local/sqoop# tar -xzf mysql-connector-java-5.0.8.tar.gz
root@EdgeNode:/usr/local/sqoop# mv mysql-connector-java-5.0.8/mysql-connector-java-5.0.8-bin.jar /usr/local/sqoop/lib/
root@EdgeNode:/usr/local/sqoop# rm -rf mysql-connector-java-5.0.8
root@EdgeNode:/usr/local/sqoop# rm mysql-connector-java-5.0.8.tar.gz
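Once a MySQL server is reachable from the EdgeNode, the driver can be smoke-tested with the list-databases tool; the hostname and credentials here are placeholders:

root@EdgeNode:/usr/local/sqoop# sqoop list-databases \
    --connect jdbc:mysql://dbserver:3306/ \
    --username dbuser -P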

For the Oracle JDBC driver, go to the website below:
http://www.oracle.com/technetwork/apps-tech/jdbc-112010-090769.html
Once you accept the license agreement and select ojdbc6.jar, copy the URL for the download so that we can download it directly on the server. Next we rename the file as required.

root@EdgeNode:/usr/local/sqoop# wget http://download.oracle.com/otn/utilities_drivers/jdbc/11204/ojdbc6.jar?AuthParam=1472102440_acf2d63e2d3673651122947bf8cba738
root@EdgeNode:/usr/local/sqoop# mv ojdbc6.jar?AuthParam=1472102440_acf2d63e2d3673651122947bf8cba738 ojdbc6.jar
root@EdgeNode:/usr/local/sqoop# mv ojdbc6.jar  /usr/local/sqoop/lib/
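Similarly, once an Oracle server is available, the driver can be verified with the list-tables tool; the host, SID and credentials below are placeholders:

root@EdgeNode:/usr/local/sqoop# sqoop list-tables \
    --connect jdbc:oracle:thin:@dbserver:1521:orcl \
    --username scott -P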

Finally, we have installed and configured Sqoop. In the next series of articles we will see Sqoop in real action, importing relational database data into Hive.

