Install SQOOP in Client Node of Hadoop Cluster
Sqoop is an open source software product of the Apache Software Foundation in the hadoop ecosystem, designed to transfer data between Hadoop and relational databases or mainframes. Sqoop can be used to import data from a relational database management system (RDBMS) such as MySQL , Oracle, MSSQL, PostgreSQL or a mainframe into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
Sqoop Installation in EdgeNode
Before installing Sqoop, a release of Hadoop must be installed and configured. We will install Sqoop in the same directory path where we already have our Hadoop binaries & configuration files /usr/local. You invoke Sqoop through the program launch capability provided by Hadoop. The sqoop command-line program is a wrapper which runs the bin/hadoop script shipped with Hadoop. So before installation check $HADOOP_HOME, $HADOOP_COMMON_HOME, $HADOOP_MAPRED_HOME, $HADOOP_CONF_DIR environment variables are already set to the path of Hadoop installation.
Get the Latest stable version of Sqoop from:
http://www-us.apache.org/dist/sqoop/1.4.6/
Login to EdgeNode & follow the Installation steps as below:
root@EdgeNode:~# cd /usr/local/
root@EdgeNode:/usr/local# wget http://www-us.apache.org/dist/sqoop/1.4.6/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
root@EdgeNode:/usr/local# tar -xzvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz >> /dev/null
root@EdgeNode:/usr/local# mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha /usr/local/sqoop
root@EdgeNode:/usr/local# rm sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
Next we will set the SQOOP Environment variables in the .bashrc file. Append the below lines, save and quit.
root@EdgeNode:/usr/local# vi ~/.bashrc
export SQOOP_HOME=/usr/local/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
export HADOOP_OPTS=-Djava.security.egd=file:/dev/../dev/urandom
Source the environment file.
root@EdgeNode:/usr/local# source ~/.bashrc
Next we need to configure the Sqoop environment script, which is invoked when scoop utilities are executed.
root@EdgeNode:/usr/local# cd sqoop/conf
root@EdgeNode:/usr/local/sqoop/conf# cp sqoop-env-template.sh sqoop-env.sh
Now set the paths for the environment variables accordingly, in sqoop-env.sh file
root@EdgeNode:/usr/local/sqoop/conf# vi sqoop-env.sh
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive
Sqoop is a collection of related tools. A by-product of the sqoop import utility is a generated Java class which can encapsulate one row of the imported table. This class is used during the import process by Sqoop itself. The Java source code for this class is also provided, for use in subsequent MapReduce processing of the data. This class can serialize and deserialize data to and from the SequenceFile format. It can also parse the delimited-text form of a record. These abilities allow, to quickly develop MapReduce applications to process the HDFS-stored records.
So let us create an additional work directory for sqoop that will host a number of java, class, jar, avsc (Avro Schema) files corresponding to the tables when sqoop import is invoked.
root@EdgeNode:/usr/local/sqoop/conf# cd $SQOOP_HOME/
root@EdgeNode:/usr/local/sqoop# mkdir sqoop_work
Now let us validate sqoop is installed properly:
root@EdgeNode:/usr/local/sqoop# cd $SQOOP_HOME/bin
root@EdgeNode:/usr/local/sqoop/bin# sqoop-version
INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Check the available sqoop utility commands:
root@EdgeNode:/usr/local/sqoop/bin# sqoop help
RDBMS JDBC Drivers for Sqoop
As a part of the future requirement we will download & move the JDBC jar files for MySQL & Oracle databases to the Sqoop library. In the next article we will extract data from MySQL & also from Oracle using Sqoop and put them in HDFS cluster & further in HIVE tables.
root@EdgeNode:/usr/local/sqoop/bin# cd /usr/local/sqoop
root@EdgeNode:/usr/local/sqoop# wget http://cdn.mysql.com//Downloads/Connector-J/mysql-connector-java-5.0.8.tar.gz
root@EdgeNode:/usr/local/sqoop# tar -xzf mysql-connector-java-5.0.8.tar.gz >> /dev/null
root@EdgeNode:/usr/local/sqoop# mv mysql-connector-java-5.0.8/mysql-connector-java-5.0.8-bin.jar /usr/local/sqoop/lib/
root@EdgeNode:/usr/local/sqoop# rm -rf mysql-connector-java-5.0.8
root@EdgeNode:/usr/local/sqoop# rm mysql-connector-java-5.0.8.tar.gz
For Oracle JDBC driver, go to the below website:
http://www.oracle.com/technetwork/apps-tech/jdbc-112010-090769.html
Once you accept and license agreement and select ojdbc6.jar, copy the URL for the download so that we can directly download in the server. Next we rename the file as required.
root@EdgeNode:/usr/local/sqoop# wget http://download.oracle.com/otn/utilities_drivers/jdbc/11204/ojdbc6.jar?AuthParam=1472102440_acf2d63e2d3673651122947bf8cba738
root@EdgeNode:/usr/local/sqoop# mv ojdbc6.jar?AuthParam=1472102440_acf2d63e2d3673651122947bf8cba738 ojdbc6.jar
root@EdgeNode:/usr/local/sqoop# mv ojdbc6.jar /usr/local/sqoop/lib/
Finally we have installed and configured SQOOP. In the next series of articles we will see Sqoop in real action. Let's import relational database data to hive using SQOOP.