In this article we will use Apache Sqoop to import data from a MySQL database. To get started, let us create a MySQL database and user and quickly load some sample data by downloading the Sakila database from the internet. Next we will configure Sqoop to import this data into the HDFS file system, followed by a direct import into Hive tables.

Sample MySQL Database for SQOOP Import

Let us download the publicly available sakila database from the link below:
http://downloads.mysql.com/docs/sakila-db.tar.gz

root@EdgeNode:/usr/local/sqoop# cd $SQOOP_HOME/sqoop_work
root@EdgeNode:/usr/local/sqoop/sqoop_work# wget http://downloads.mysql.com/docs/sakila-db.tar.gz
root@EdgeNode:/usr/local/sqoop/sqoop_work# tar -xzvf sakila-db.tar.gz >> /dev/null
root@EdgeNode:/usr/local/sqoop/sqoop_work# ls sakila-db

	sakila-schema.sql
	sakila-data.sql

Next, log in to the MySQL database as root and execute the downloaded SQL files to create the sakila database along with its sample tables and data.

root@EdgeNode:/usr/local/sqoop/sqoop_work# mysql -u root -p
mysql> SOURCE /usr/local/sqoop/sqoop_work/sakila-db/sakila-schema.sql;
mysql> SOURCE /usr/local/sqoop/sqoop_work/sakila-db/sakila-data.sql;

Next we will create a user (e.g. sakila_user with password hadoop1234), grant the user access to the sakila database, and provide these credentials during the Sqoop import.

mysql> CREATE USER 'sakila_user'@'localhost' IDENTIFIED BY 'hadoop1234';
mysql> GRANT ALL ON sakila.* TO 'sakila_user'@'localhost';
mysql> CREATE USER 'sakila_user'@'%' IDENTIFIED BY 'hadoop1234';
mysql> GRANT ALL ON sakila.* TO 'sakila_user'@'%';
mysql> CREATE USER 'sakila_user'@'10.0.100.3' IDENTIFIED BY 'hadoop1234';
mysql> GRANT ALL ON sakila.* TO 'sakila_user'@'10.0.100.3';
mysql> FLUSH PRIVILEGES;
mysql> quit;

Next, log in as the new user and verify access to the database and its tables:

root@EdgeNode:/usr/local/sqoop/sqoop_work# mysql -usakila_user -p
mysql> SHOW DATABASES;
mysql> USE sakila;
mysql> SHOW TABLES;
mysql> quit;

Finally, delete the downloaded archive and the extracted directory:

root@EdgeNode:/usr/local/sqoop/sqoop_work# rm sakila-db.tar.gz
root@EdgeNode:/usr/local/sqoop/sqoop_work# rm -rf sakila-db

SQOOP Import to HDFS

Now it's time to explore Sqoop. Let us first invoke the Sqoop utility to list the databases accessible to a user in MySQL.

root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-databases --connect jdbc:mysql://10.0.100.3:3306/ --username sakila_user --password hadoop1234

To avoid exposing the password on the command line, use the -P option, which will prompt for the password as below:

root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-databases --connect jdbc:mysql://10.0.100.3:3306/ --username sakila_user -P

Alternatively, we can use --options-file, which keeps the connection string, user and password details in a separate file, as below:

root@EdgeNode:/usr/local/sqoop/sqoop_work# vi import_mysql.txt
	
list-databases
--connect
jdbc:mysql://10.0.100.3:3306/
--username
sakila_user
--password
hadoop1234
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop --options-file /usr/local/sqoop/sqoop_work/import_mysql.txt

Now let's use another Sqoop utility to list the available tables in a database.

root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-tables --connect jdbc:mysql://10.0.100.3:3306/sakila --username sakila_user --password hadoop1234

Now it's finally time to bring some data from a MySQL table into HDFS using Sqoop import. There are many options to control the import process, which will be discussed in more detail later. But before that, take a look at the import statement below:

root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import \
--bindir ./ \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://10.0.100.3:3306/sakila \
--username sakila_user \
--password hadoop1234 \
--table actor \
--columns "actor_id, first_name, last_name, last_update" \
--as-avrodatafile \
--compression-codec snappy \
--target-dir /user/root/actor \
--split-by actor_id \
--fetch-size 1000 \
--append \
--num-mappers 1 \
--validate

What do these option flags mean (a sketch using a few of the alternatives follows this list):

  • --bindir: Directory where Sqoop places the generated Java source, class and jar files, along with the Avro schema file, for the imported table.
  • --driver: Optional; tells Sqoop which JDBC driver class to use.
  • --connect: JDBC connection string pointing to the MySQL server and the database to connect to.
  • --username: The user to log in to the database.
  • --password: The database user's password.
  • --table: Table to read/import from database.
  • --columns: List of columns to import from table. This is optional.
  • --as-avrodatafile: Instructs Sqoop to import data as Avro data files. Other options that can be used are:
    1. --as-textfile: Imports data as plain text. This is the default if not mentioned in import.
    2. --as-sequencefile: Imports data to Sequence Files.
    3. --as-parquetfile: Imports data to Parquet Files.
  • --compress or -z: Enables compression.
  • --compression-codec: Defines the Hadoop compression codec to use. By default data is not compressed; if compression is enabled, the default codec is gzip.
  • --target-dir: HDFS destination directory.
  • --split-by: Column of the table used to split work units.
  • --fetch-size: Number of entries to read from database at once.
  • --append: Appends data to an existing dataset already in HDFS. Another option that can be used instead is:
    1. --delete-target-dir: Delete the import target directory if it exists
  • --num-mappers or -m: Number of map tasks used to import data from the database in parallel.
  • --validate: Validates the import by comparing the row counts of the source and the target after the copy.
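
To illustrate a few of the alternatives above, here is a hedged sketch (not executed as part of this exercise; the target directory /user/root/actor_text is a hypothetical choice) that would import the same actor table as plain text with the default gzip compression, replacing any previous output instead of appending:

# hypothetical alternative run: plain-text output, default gzip compression, target directory replaced
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import \
--bindir ./ \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://10.0.100.3:3306/sakila \
--username sakila_user \
--password hadoop1234 \
--table actor \
--as-textfile \
--compress \
--target-dir /user/root/actor_text \
--delete-target-dir \
--split-by actor_id \
--num-mappers 2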

Once the sqoop import is successful, let us check the data in HDFS.

root@EdgeNode:/usr/local/sqoop/sqoop_work# cd $HADOOP_HOME/bin
root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -ls /user/root/actor

	/user/root/actor/part-m-00000.avro
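
Because we passed --bindir ./, Sqoop also left its code-generation artifacts in the sqoop_work directory. As a quick check (the exact set of generated files may vary with the Sqoop version), we can list them and inspect the generated Avro schema, which we will reuse shortly:

# the listing typically includes actor.avsc, actor.java and actor.jar (may vary by version)
root@EdgeNode:/usr/local/hadoop/bin# ls /usr/local/sqoop/sqoop_work/actor.*
root@EdgeNode:/usr/local/hadoop/bin# cat /usr/local/sqoop/sqoop_work/actor.avsc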

Hive Table

The next step we might be interested in is loading this data into our Hive database. Since we imported the data as Avro, we will create an Avro-based table in Hive. For that we first need to copy the Avro schema file into HDFS.

root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -mkdir -p /user/root/schema
root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -copyFromLocal /usr/local/sqoop/sqoop_work/actor.avsc /user/root/schema/actor.avsc

Now let us create and load the Hive table.

root@EdgeNode:/usr/local/hadoop/bin# cd $HIVE_HOME/hive_work
root@EdgeNode:/usr/local/hive/hive_work# hive
hive> SHOW DATABASES;
hive> USE default;

hive> CREATE TABLE actor
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/root/schema/actor.avsc');


hive> LOAD DATA INPATH '/user/root/actor/part-*.avro' OVERWRITE INTO TABLE actor;
hive> select count(1) from actor;
hive> select * from actor limit 10;
hive> quit;

SQOOP Import to Hive

The main function of Sqoop import is to upload data into files in HDFS. However, if we have a Hive metastore associated with our HDFS cluster, Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data's layout in Hive. Importing data into Hive is as simple as adding the --hive-import option to the Sqoop command line. If the Hive table already exists, specify the --hive-overwrite option to indicate that the existing table in Hive must be replaced. After the data is imported into HDFS (or if that step is omitted), Sqoop generates a Hive script containing a CREATE TABLE operation defining your columns using Hive's types, and a LOAD DATA INPATH statement to move the data files into Hive's warehouse directory.
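
For illustration only, had we imported the actor table as '|'-delimited text with --hive-import, the generated script would look roughly like the sketch below (column types are simplified; the actual script depends on Sqoop's type mapping and the import options used):

-- illustrative sketch of a Sqoop-generated Hive script, not produced by this article's commands
CREATE TABLE IF NOT EXISTS actor (
  actor_id INT,
  first_name STRING,
  last_name STRING,
  last_update STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
-- with --hive-overwrite the LOAD statement would include the OVERWRITE keyword
LOAD DATA INPATH '/user/root/actor' INTO TABLE actor;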

The script is executed by calling the installed copy of hive on the machine where Sqoop is run. If you have multiple Hive installations, or hive is not in your $PATH, use the --hive-home option to identify the Hive installation directory; Sqoop will then use $HIVE_HOME/bin/hive. Let us now import another table and load it directly into Hive.

root@EdgeNode:/usr/local/hive/hive_work# cd $SQOOP_HOME/sqoop_work

root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import \
--bindir ./ \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://10.0.100.3:3306/sakila \
--username sakila_user \
--password hadoop1234 \
--table actor_info \
--as-textfile \
--num-mappers 2 \
--split-by actor_id \
--target-dir /user/hive/warehouse/actor_info \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--hive-import \
--hive-table actor_info

This will create the actor_info table in the default Hive database, as configured in hive-site.xml. Now let us check the Hive table.

root@EdgeNode:/usr/local/sqoop/sqoop_work# cd $HIVE_HOME/hive_work
root@EdgeNode:/usr/local/hive/hive_work# hive
hive> SHOW DATABASES;
hive> USE default;
hive> select count(1) from actor_info;
hive> select * from actor_info limit 10;
hive> quit;
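
If this import ever needs to be re-run to refresh the table, the --hive-overwrite option described earlier tells Hive to replace the existing rows. A hedged sketch of such a re-run (not executed here) stages under the default target directory and adds --delete-target-dir to clear any leftovers:

# hypothetical re-run; replaces the rows of the existing actor_info Hive table
root@EdgeNode:/usr/local/hive/hive_work# cd $SQOOP_HOME/sqoop_work
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import \
--bindir ./ \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://10.0.100.3:3306/sakila \
--username sakila_user \
--password hadoop1234 \
--table actor_info \
--as-textfile \
--num-mappers 2 \
--split-by actor_id \
--delete-target-dir \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--hive-import \
--hive-overwrite \
--hive-table actor_info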

Note that the delimited data files in HDFS backing the Hive table will be under the warehouse directory, as configured in hive-site.xml:

root@EdgeNode:/usr/local/sqoop/sqoop_work# $HADOOP_HOME/bin/hadoop fs -ls /user/hive/warehouse/actor_info

NOTE: Sqoop currently does not support loading data imported with --as-avrodatafile or --as-sequencefile directly into Hive.

SQOOP Import Database

Finally, just one more Sqoop utility we might check at this point is import-all-tables. In cases where we want to bring an entire database schema into HDFS, we can use the import-all-tables utility. First let us look at the command, followed by the prerequisites that must be met for it to execute successfully.

root@EdgeNode:/usr/local/sqoop/sqoop_work# cd $SQOOP_HOME/sqoop_work
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import-all-tables \
--bindir ./ \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://10.0.100.3:3306/sakila \
--username sakila_user \
--password hadoop1234 \
--exclude-tables actor,actor_info \
--as-textfile \
--num-mappers 1 \
--direct \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--hive-import 

Now let us check the hive tables:

root@EdgeNode:/usr/local/sqoop/sqoop_work# cd $HIVE_HOME/hive_work
root@EdgeNode:/usr/local/hive/hive_work# hive

hive> SHOW DATABASES;
hive> USE default;
hive> SHOW TABLES;
hive> select count(1) from inventory;
hive> select * from film_actor limit 10;
hive> quit;

Pre-requisites for import-all-tables

For the import-all-tables utility to execute successfully, the following conditions must be met:

  • Each table must have a single-column primary key (see the workaround sketch after this list for tables that do not).
  • You must intend to import all columns of each table.
  • You must not use a non-default splitting column, nor impose any conditions via a WHERE clause.
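
For example, if import-all-tables were run with more than one mapper, a table with a composite primary key such as film_actor could not be split automatically. A hedged workaround sketch (not executed in this article) is to exclude such a table and import it separately with an explicit split column:

# hypothetical workaround for a composite-key table, imported on its own with an explicit --split-by
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import \
--bindir ./ \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://10.0.100.3:3306/sakila \
--username sakila_user \
--password hadoop1234 \
--table film_actor \
--as-textfile \
--split-by actor_id \
--num-mappers 2 \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--hive-import \
--hive-table film_actor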

There are many other Sqoop features that need to be discussed in detail; we will cover them in separate articles. In the next article, let's learn how to bring data from the most popular RDBMS, Oracle, into Hadoop using Sqoop.

