In this article we will use Apache SQOOP to import data from MySQL database. For that let us create a MySql database & user and dump some data quickly. Let us download a MySQL database named Sakila Db from internet to get started. Next we will configure sqoop to import this data in HDFS file system followed by direct import into Hive tables.

Sample MySQL Database for SQOOP Import

Let us download the online available database sakila db from the below link:
http://downloads.mysql.com/docs/sakila-db.tar.gz

root@EdgeNode:/usr/local/sqoop# cd $SQOOP_HOME/sqoop_work
root@EdgeNode:/usr/local/sqoop/sqoop_work# wget http://downloads.mysql.com/docs/sakila-db.tar.gz
root@EdgeNode:/usr/local/sqoop/sqoop_work# tar -xzvf sakila-db.tar.gz >> /dev/null
root@EdgeNode:/usr/local/sqoop/sqoop_work# ls sakila-db

	sakila-schema.sql
	sakila-data.sql

Next login to the MySql database using root & then we will execute the downloaded sql files to create a database sakila, sample tables along with data.

root@EdgeNode:/usr/local/sqoop/sqoop_work# mysql -u root -p
mysql> SOURCE /usr/local/sqoop/sqoop_work/sakila-db/sakila-schema.sql;
mysql> SOURCE /usr/local/sqoop/sqoop_work/sakila-db/sakila-data.sql;

Next we will create a user (e.g. password hadoop1234 ), grant user the access to the sakila database and will provide these credentials during Sqoop import.

mysql> CREATE USER 'sakila_user'@'localhost' IDENTIFIED BY 'hadoop1234';
mysql> GRANT ALL ON sakila.* TO 'sakila_user'@'localhost';
mysql> CREATE USER 'sakila_user'@'%' IDENTIFIED BY 'hadoop1234';
mysql> GRANT ALL ON sakila.* TO 'sakila_user'@'%';
mysql> CREATE USER 'sakila_user'@'10.0.100.3' IDENTIFIED BY 'hadoop1234';
mysql> GRANT ALL ON sakila.* TO 'sakila_user'@'10.0.100.3';
mysql> FLUSH PRIVILEGES;
mysql> quit;

Next login as the new user and check the database access and tables;

root@EdgeNode:/usr/local/sqoop/sqoop_work# mysql -usakila_user -p
mysql> SHOW DATABASES;
mysql> USE sakila;
mysql> SHOW TABLES;
mysql> quit;

Finally delete the downloaded files:

root@EdgeNode:/usr/local/sqoop/sqoop_work# rm sakila-db.tar.gz
root@EdgeNode:/usr/local/sqoop/sqoop_work# rm -rf sakila

SQOOP Import to HDFS

Now it’s time to explore Sqoop. Let us first invoke the Sqoop utility to list the databases accessible to an user in MySQL.

root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-databases --connect jdbc:mysql://10.0.100.3:3306/ --username sakila_user --password hadoop1234

In order to secure password, Use the option -P, which will prompt for password as below:

root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-databases --connect jdbc:mysql://10.0.100.3:3306/ --username sakila_user -P

Alternatively we can also use --options-file, which keeps the server user and password information as below:

root@EdgeNode:/usr/local/sqoop/sqoop_work# vi import_mysql.txt
	
list-databases
--connect
jdbc:mysql://10.0.100.3:3306/
--username
sakila_user
--password
hadoop1234
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop --options-file /usr/local/sqoop/sqoop_work/import_mysql.txt

Now let’s use another Sqoop utility to list down the available tables in a database.

root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-tables --connect jdbc:mysql://10.0.100.3:3306/sakila --username sakila_user --password hadoop1234

Now it’s finally time to bring some data from MySQL table to HDFS using SQOOP import. We have a lot of options to control the import process which will be discussed in more detail at a later point. But before that take a look at the import statement below:

root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import \
--bindir ./ \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://10.0.100.3:3306/sakila \
--username sakila_user \
--password hadoop1234 \
--table actor \
--columns "actor_id, first_name, last_name, last_update" \
--as-avrodatafile \
--compression-codec snappy \
--target-dir /user/root/actor \
--split-by actor_id \
--fetch-size 1000 \
--append \
--num-mappers 1 \
--validate

What does these option flags mean:

  • --bindir will be used by sqoop to generate the java, java class, java archive, avro schema file corresponding to the import table requested.
  • --driver is optional which directs sqoop to use the driver class file in the library to be used.
  • --connect followed by the jdbc connection credentials to the MySql databases and the database to connect to.
  • --username is the user to login to the database
  • --password is the database user’s login password credential
  • --table: Table to read/import from database.
  • --columns: List of columns to import from table. This is optional.
  • --as-avrodatafile: Instructs sqoop to Imports data to Avro Data Files. Other options that can be used are-
    1. --as-textfile: Imports data as plain text. This is the default if not mentioned in import.
    2. --as-sequencefile: Imports data to Sequence Files.
    3. --as-parquetfile: Imports data to Parquet Files.
  • --compress or -z: Enable compression
  • --compression-codec: Defines the hadoop compression codec to use. By default, data is not compressed. If compress is enabled default compression codec is gzip.
  • --target-dir: HDFS destination directory.
  • --split-by: Column of the table used to split work units.
  • --fetch-size: Number of entries to read from database at once.
  • --append: Append data to an existing dataset already in HDFS. Other options that can be used are-
    1. --delete-target-dir: Delete the import target directory if it exists
  • --num-mappers or –m: Use n number of map tasks to import data in parallel from database.
  • --validate: Validate and confirm if the import process was successful.

Once the sqoop import is successful, let us check the data in HDFS.

root@EdgeNode:/usr/local/sqoop/sqoop_work# cd $HADOOP_HOME/bin
root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -ls /user/root/actor

	/user/root/actor/part-m-00000.avro

Hive Table

The next step we might be interested on is to load the data in our hadoop Hive database. Since we have imported the data previously as AVRO, we will be creating an avro based table in Hive. For that we need to move the avro schema file into HDFS.

root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -mkdir -p /user/root/schema
root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -copyFromLocal /usr/local/sqoop/sqoop_work/actor.avsc /user/root/schema/actor.avsc

Now let us create & load the hive table.

root@EdgeNode:/usr/local/hadoop/bin# cd $HIVE_HOME/hive_work
root@EdgeNode:/usr/local/hive/hive_work# hive
hive> SHOW DATABASES;
hive> USE default;

hive> CREATE TABLE actor
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/root/schema/actor.avsc');


hive> LOAD DATA INPATH '/user/root/actor/part-*.avro' OVERWRITE INTO TABLE actor;
hive> select count(1) from actor;
hive> select * from actor limit 10;
hive> quit;

SQOOP import to HIVE

The main function of Sqoop import is to upload data into files in HDFS. However, if we have a Hive metastore associated with our HDFS cluster, Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data’s layout in Hive. Importing data into Hive is as simple as adding the --hive-import option to Sqoop command line. If the Hive table already exists, specify the --hive-overwrite option to indicate that existing table in hive must be replaced. After the data is imported into HDFS or this step is omitted, Sqoop will generate a Hive script containing a CREATE TABLE operation defining your columns using Hive’s types, and a LOAD DATA INPATH statement to move the data files into Hive’s warehouse directory.

The script will be executed by calling the installed copy of hive on the machine where Sqoop is run. If you have multiple Hive installations, or hive is not in your $PATH, use the --hive-home option to identify the Hive installation directory. Sqoop will use $HIVE_HOME/bin/hive from here. Let us now import another table and load it directly to hive.

root@EdgeNode:/usr/local/hive/hive_work# cd $SQOOP_HOME/sqoop_work

root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import \
--bindir ./ \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://10.0.100.3:3306/sakila \
--username sakila_user \
--password hadoop1234 \
--table actor_info \
--as-textfile \
--num-mappers 2 \
--split-by actor_id \
--target-dir /user/hive/warehouse/actor_info \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--hive-import \
--hive-table actor_info

This will create the actor_info table in the default hive database as in hive-site.xml. Now let us check the hive table.

root@EdgeNode:/usr/local/sqoop/sqoop_work# cd $HIVE_HOME/hive_work
root@EdgeNode:/usr/local/hive/hive_work# hive
hive> SHOW DATABASES;
hive> USE default;
hive> select count(1) from actor_info;
hive> select * from actor_info limit 10;
hive> quit;

Note that the delimited data files in HDFS backing the hive table, will be under the warehouse directory as configured in hive-site.xml

root@EdgeNode:/usr/local/sqoop/sqoop_work# $HADOOP_HOME/bin/hadoop fs –ls /user/hive/warehouse/actor_info

NOTE: * Sqoop currently does not support --as-avrodatafile and --as-sequencefile formats to be loaded in Hive.

SQOOP Import Database

Finally, just one more sqoop utility we might check at this point is import-all-tables. Cases we might be interested to bring an entire database schema to HDFS, we can use import-all-tables utility. First check the utility followed by the pre-requisites for this utility to execute successfully.

root@EdgeNode:/usr/local/sqoop/sqoop_work# cd $SQOOP_HOME/sqoop_work
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import-all-tables \
--bindir ./ \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://10.0.100.3:3306/sakila \
--username sakila_user \
--password hadoop1234 \
--exclude-tables actor,actor_info \
--as-textfile \
--num-mappers 1 \
--direct \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--hive-import 

Now let us check the hive tables:

root@EdgeNode:/usr/local/sqoop/sqoop_work# cd $HIVE_HOME/hive_work
root@EdgeNode:/usr/local/hive/hive_work# hive

hive> SHOW DATABASES;
hive> USE default;
hive> SHOW TABLES:
hive> select count(1) from inventory;
hive> select * from film_actor limit 10;
hive> quit;

Pre-requisite for import-all-tables

For the import-all-tables utility to execute successfully, the following conditions must be met:

  • Each table must have a single-column primary key.
  • All columns of each table will be imported.
  • Do not use non-default splitting column, nor impose any conditions via a WHERE clause.

There are lots of other actions in Sqoop that needs to be discussed in detail. We will cover them as separate articles. But in the next article, let's learn how to bring data from the most popular RDBMS - Oracle to Hadoop using SQOOP.


Have a question on this subject?

Ask questions to our expert community members and clear your doubts. Asking question or engaging in technical discussion is both easy and rewarding.

Are you on Twitter?

Start following us. This way we will always keep you updated with what's happening in Data Analytics community. We won't spam you. Promise.

  • Stream Webserver Log into Hdfs using FLUME

    In this article we will use Apache Flume to gather stream access log data from our remote Web Server into Hadoop Distributed File System. We will be analyzing the access log in a real-time basis. So we have to setup Flume such that it collects the...

  • Install SPARK in Hadoop Cluster

    Apache Spark is a fast and general purpose engine for large-scale data processing over a distributed cluster. Apache Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing. Spark run programs up to 100x...

  • Hadoop DataLake Implementation Part 4

    Now that our dummy OLTP source system & Hadoop HDFS directory structure is ready, we will first load the ‘dates’ data file in HDFS and further to a hive table.

  • SQOOP import from Oracle

    In this article we will use Apache SQOOP to import data from Oracle database. Now that we have an oracle server in our cluster ready, let us login to EdgeNode. Next we will configure sqoop to import this data in HDFS file system followed by direct...

  • Install PIG In Client Node of Hadoop Cluster

    Apache Pig is a platform for analyzing large data sets. Pig Latin is the high level programming language that, lets us specify a sequence of data transformations such as merging data sets, filtering them, grouping them, and applying functions to...

  • Hadoop DataLake Implementation Part 2

    Now that we are familiar with HDP stack, in this article we are going to access HDP sandbox command line, Ambari Web UI, Hive & Ranger to create a user for our implementation setup.

  • Apache Hadoop Architecture

    In this article we will learn about the Apache Hadoop framework architecture. The basic components of the Apache Hadoop HDFS & MapReduce engine are discussed in brief.

  • How to Setup Hadoop Multi Node Cluster - Step By Step

    Setting up Hadoop in a single machine is easy, but no fun. Why? Because Hadoop is not meant for a single machine. Hadoop is meant to run on a computing cluster comprising of many machines. Running HDFS and MapReduce on a single machine is great for...

  • Hadoop DataLake Implementation Part 7

    In this article we will load our master data table ‘Product’ as Slowly Changing Dimension of Type 2 to maintain full history, so as to analyze the sales and stocks data with reference to the historical master data.

  • Understanding Map-Reduce with Examples

    In my previous article – “Fools guide to Big Data” – we have discussed about the origin of Bigdata and the need of big data analytics. We have also noted that Big Data is data that is too large, complex and dynamic for any conventional data tools...