Sqoop is an open-source software product of the Apache Software Foundation in the Hadoop ecosystem, designed to transfer data between Hadoop and relational databases or mainframes. Sqoop can be used to import data from a relational database management system (RDBMS) such as MySQL, Oracle, Microsoft SQL Server or PostgreSQL, or from a mainframe, into the Hadoop Distributed File System (HDFS), transform the data with Hadoop MapReduce, and then export the data back into an RDBMS.

Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

Sqoop Installation on the EdgeNode

Before installing Sqoop, a release of Hadoop must be installed and configured. We will install Sqoop under the same directory path where we already have our Hadoop binaries & configuration files, /usr/local. Sqoop is invoked through the program launch capability provided by Hadoop: the sqoop command-line program is a wrapper that runs the bin/hadoop script shipped with Hadoop. So, before installation, check that the $HADOOP_HOME, $HADOOP_COMMON_HOME, $HADOOP_MAPRED_HOME and $HADOOP_CONF_DIR environment variables are already set to the paths of the Hadoop installation.
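
A quick way to confirm this (a minimal check; the variables should resolve to your own Hadoop installation, /usr/local/hadoop in this series):

root@EdgeNode:~# env | grep -E 'HADOOP_(HOME|COMMON_HOME|MAPRED_HOME|CONF_DIR)'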

Get the latest stable version of Sqoop from:
http://www-us.apache.org/dist/sqoop/1.4.6/

Log in to the EdgeNode & follow the installation steps below:

root@EdgeNode:~# cd /usr/local/
root@EdgeNode:/usr/local# wget http://www-us.apache.org/dist/sqoop/1.4.6/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
root@EdgeNode:/usr/local# tar -xzvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz >> /dev/null
root@EdgeNode:/usr/local# mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha /usr/local/sqoop
root@EdgeNode:/usr/local# rm sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz

Next we will set the Sqoop environment variables in the .bashrc file. Append the lines below, then save and quit.

root@EdgeNode:/usr/local# vi ~/.bashrc


export SQOOP_HOME=/usr/local/sqoop
export PATH=$PATH:$SQOOP_HOME/bin

export HADOOP_OPTS=-Djava.security.egd=file:/dev/../dev/urandom

Source the environment file.

root@EdgeNode:/usr/local# source ~/.bashrc
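
Optionally, verify that the new variables are picked up by the current shell; $SQOOP_HOME should resolve to /usr/local/sqoop and the sqoop launcher should be found on the PATH:

root@EdgeNode:/usr/local# echo $SQOOP_HOME
root@EdgeNode:/usr/local# which sqoop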

Next we need to configure the Sqoop environment script, which is invoked when Sqoop utilities are executed.

root@EdgeNode:/usr/local# cd sqoop/conf
root@EdgeNode:/usr/local/sqoop/conf# cp sqoop-env-template.sh sqoop-env.sh

Now set the paths for the environment variables accordingly in the sqoop-env.sh file:

root@EdgeNode:/usr/local/sqoop/conf# vi sqoop-env.sh


export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive

Sqoop is a collection of related tools. A by-product of the sqoop import utility is a generated Java class which can encapsulate one row of the imported table. This class is used during the import process by Sqoop itself. The Java source code for this class is also provided, for use in subsequent MapReduce processing of the data. This class can serialize and deserialize data to and from the SequenceFile format. It can also parse the delimited-text form of a record. These abilities allow you to quickly develop MapReduce applications that process the HDFS-stored records.

So let us create an additional work directory for Sqoop that will host the java, class, jar and avsc (Avro schema) files generated for the corresponding tables when sqoop import is invoked.

root@EdgeNode:/usr/local/sqoop/conf# cd $SQOOP_HOME/
root@EdgeNode:/usr/local/sqoop# mkdir sqoop_work
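
During an import, the generated artifacts can be redirected into this work directory with the --outdir (Java source and Avro schema files) and --bindir (compiled classes and jar) options. The command below is only a sketch for later use; the connection details and table name (dbserver, salesdb, sqoop_user, customers) are placeholders, and the actual imports are covered in the next article:

root@EdgeNode:/usr/local/sqoop# sqoop import \
  --connect jdbc:mysql://dbserver/salesdb \
  --username sqoop_user -P \
  --table customers \
  --outdir /usr/local/sqoop/sqoop_work \
  --bindir /usr/local/sqoop/sqoop_work \
  --target-dir /user/root/customers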

Now let us validate that Sqoop is installed properly:

root@EdgeNode:/usr/local/sqoop# cd $SQOOP_HOME/bin
root@EdgeNode:/usr/local/sqoop/bin# sqoop-version
INFO sqoop.Sqoop: Running Sqoop version: 1.4.6

Check the available sqoop utility commands:

root@EdgeNode:/usr/local/sqoop/bin# sqoop help
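
The output lists the available Sqoop tools (import, import-all-tables, export, list-databases, list-tables, eval, codegen, version, etc.). Help for an individual tool can be shown the same way, for example:

root@EdgeNode:/usr/local/sqoop/bin# sqoop help import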

RDBMS JDBC Drivers for Sqoop

As a part of an upcoming requirement, we will download the JDBC driver jar files for the MySQL & Oracle databases and move them into the Sqoop library directory. In the next article we will extract data from MySQL and also from Oracle using Sqoop, put the data into the HDFS cluster and further load it into Hive tables.

root@EdgeNode:/usr/local/sqoop/bin# cd /usr/local/sqoop
root@EdgeNode:/usr/local/sqoop# wget http://cdn.mysql.com//Downloads/Connector-J/mysql-connector-java-5.0.8.tar.gz
root@EdgeNode:/usr/local/sqoop# tar -xzf mysql-connector-java-5.0.8.tar.gz >> /dev/null
root@EdgeNode:/usr/local/sqoop# mv mysql-connector-java-5.0.8/mysql-connector-java-5.0.8-bin.jar /usr/local/sqoop/lib/
root@EdgeNode:/usr/local/sqoop# rm -rf mysql-connector-java-5.0.8
root@EdgeNode:/usr/local/sqoop# rm mysql-connector-java-5.0.8.tar.gz
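
Once a MySQL server is reachable from the EdgeNode, the driver and connectivity can be smoke-tested with the list-databases tool. This is only a sketch; the host and credentials (dbserver, sqoop_user) are placeholders for your environment:

root@EdgeNode:/usr/local/sqoop# sqoop list-databases --connect jdbc:mysql://dbserver:3306/ --username sqoop_user -P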

For the Oracle JDBC driver, go to the website below:
http://www.oracle.com/technetwork/apps-tech/jdbc-112010-090769.html
Once you accept the license agreement and select ojdbc6.jar, copy the download URL so that we can download the file directly on the server. Next we rename the file as required.

root@EdgeNode:/usr/local/sqoop# wget http://download.oracle.com/otn/utilities_drivers/jdbc/11204/ojdbc6.jar?AuthParam=1472102440_acf2d63e2d3673651122947bf8cba738
root@EdgeNode:/usr/local/sqoop# mv ojdbc6.jar?AuthParam=1472102440_acf2d63e2d3673651122947bf8cba738 ojdbc6.jar
root@EdgeNode:/usr/local/sqoop# mv ojdbc6.jar  /usr/local/sqoop/lib/
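
Similarly, once the Oracle driver is in place, connectivity to a reachable Oracle instance can be verified. Again only a sketch; the host, service name and user (oradbserver, ORCL, scott) are placeholders:

root@EdgeNode:/usr/local/sqoop# sqoop list-tables --connect jdbc:oracle:thin:@//oradbserver:1521/ORCL --username scott -P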

Finally, we have installed and configured Sqoop. In the next series of articles we will see Sqoop in real action, importing relational database data into Hive using Sqoop.

