SQOOP import from MySQL
In this article we will use Apache Sqoop to import data from a MySQL database. To get started, we will create a MySQL database and user and quickly load some sample data, using the freely available Sakila database. We will then configure Sqoop to import this data into the HDFS file system, followed by a direct import into Hive tables.
Sample MySQL Database for SQOOP Import
Let us download the publicly available sakila database from the link below:
http://downloads.mysql.com/docs/sakila-db.tar.gz
root@EdgeNode:/usr/local/sqoop# cd $SQOOP_HOME/sqoop_work
root@EdgeNode:/usr/local/sqoop/sqoop_work# wget http://downloads.mysql.com/docs/sakila-db.tar.gz
root@EdgeNode:/usr/local/sqoop/sqoop_work# tar -xzvf sakila-db.tar.gz >> /dev/null
root@EdgeNode:/usr/local/sqoop/sqoop_work# ls sakila-db
sakila-schema.sql
sakila-data.sql
Next, log in to MySQL as root and execute the downloaded SQL files to create the sakila database and its sample tables, along with the data.
root@EdgeNode:/usr/local/sqoop/sqoop_work# mysql -u root -p
mysql> SOURCE /usr/local/sqoop/sqoop_work/sakila-db/sakila-schema.sql;
mysql> SOURCE /usr/local/sqoop/sqoop_work/sakila-db/sakila-data.sql;
Next we will create a user sakila_user (with password hadoop1234), grant it access to the sakila database, and use these credentials during the Sqoop import.
mysql> CREATE USER 'sakila_user'@'localhost' IDENTIFIED BY 'hadoop1234';
mysql> GRANT ALL ON sakila.* TO 'sakila_user'@'localhost';
mysql> CREATE USER 'sakila_user'@'%' IDENTIFIED BY 'hadoop1234';
mysql> GRANT ALL ON sakila.* TO 'sakila_user'@'%';
mysql> CREATE USER 'sakila_user'@'10.0.100.3' IDENTIFIED BY 'hadoop1234';
mysql> GRANT ALL ON sakila.* TO 'sakila_user'@'10.0.100.3';
mysql> FLUSH PRIVILEGES;
mysql> quit;
Next, log in as the new user and verify access to the database and its tables:
root@EdgeNode:/usr/local/sqoop/sqoop_work# mysql -usakila_user -p
mysql> SHOW DATABASES;
mysql> USE sakila;
mysql> SHOW TABLES;
mysql> quit;
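Optionally, it helps to note a few row counts now so they can be compared with the Hive tables after the import. A quick, purely illustrative sanity check (reconnecting as the same user) could look like this:
root@EdgeNode:/usr/local/sqoop/sqoop_work# mysql -usakila_user -p sakila
mysql> SELECT COUNT(*) FROM actor;
mysql> SELECT COUNT(*) FROM film_actor;
mysql> quit;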
Finally delete the downloaded files:
root@EdgeNode:/usr/local/sqoop/sqoop_work# rm sakila-db.tar.gz
root@EdgeNode:/usr/local/sqoop/sqoop_work# rm -rf sakila-db
SQOOP Import to HDFS
Now it's time to explore Sqoop. Let us first invoke the Sqoop utility that lists the databases accessible to a user in MySQL.
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-databases --connect jdbc:mysql://10.0.100.3:3306/ --username sakila_user --password hadoop1234
To keep the password off the command line, use the -P option, which will prompt for the password as below:
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-databases --connect jdbc:mysql://10.0.100.3:3306/ --username sakila_user -P
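Depending on the Sqoop version, another way to keep the password off the command line is --password-file, which reads it from a file, typically stored on HDFS with restrictive permissions. A sketch, using an illustrative file path of our own choosing (echo -n avoids a trailing newline, which would otherwise be treated as part of the password):
root@EdgeNode:/usr/local/sqoop/sqoop_work# echo -n "hadoop1234" > sakila.password
root@EdgeNode:/usr/local/sqoop/sqoop_work# $HADOOP_HOME/bin/hadoop fs -copyFromLocal sakila.password /user/root/sakila.password
root@EdgeNode:/usr/local/sqoop/sqoop_work# $HADOOP_HOME/bin/hadoop fs -chmod 400 /user/root/sakila.password
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-databases --connect jdbc:mysql://10.0.100.3:3306/ --username sakila_user --password-file /user/root/sakila.password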
Alternatively, we can use --options-file, which keeps the server, user, and password information in a file, as below:
root@EdgeNode:/usr/local/sqoop/sqoop_work# vi import_mysql.txt
list-databases
--connect
jdbc:mysql://10.0.100.3:3306/
--username
sakila_user
--password
hadoop1234
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop --options-file /usr/local/sqoop/sqoop_work/import_mysql.txt
Now let's use another Sqoop utility to list the tables available in a database.
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-tables --connect jdbc:mysql://10.0.100.3:3306/sakila --username sakila_user --password hadoop1234
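Before importing a full table, it can also help to preview a few rows directly from MySQL. Sqoop ships an eval tool (meant for evaluation and debugging rather than production use) that runs an ad-hoc query and prints the result to the console; a quick sketch:
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop eval \
--connect jdbc:mysql://10.0.100.3:3306/sakila \
--username sakila_user \
--password hadoop1234 \
--query "SELECT actor_id, first_name, last_name FROM actor LIMIT 5"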
Now it's finally time to bring some data from a MySQL table to HDFS using Sqoop import. There are many options to control the import process, which will be discussed in more detail later. But before that, take a look at the import statement below:
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import \
--bindir ./ \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://10.0.100.3:3306/sakila \
--username sakila_user \
--password hadoop1234 \
--table actor \
--columns "actor_id, first_name, last_name, last_update" \
--as-avrodatafile \
--compression-codec snappy \
--target-dir /user/root/actor \
--split-by actor_id \
--fetch-size 1000 \
--append \
--num-mappers 1 \
--validate
What do these option flags mean (a minimal variant that relies on Sqoop's defaults is sketched after this list):
- --bindir: Directory where Sqoop writes the generated Java source, compiled class, JAR, and Avro schema files for the requested table.
- --driver: Optional; explicitly names the JDBC driver class Sqoop should use.
- --connect: The JDBC connection string, identifying the MySQL server and the database to connect to.
- --username: The database user to log in as.
- --password: That user's login password.
- --table: Table to read/import from database.
- --columns: List of columns to import from table. This is optional.
- --as-avrodatafile: Instructs Sqoop to import data as Avro data files. Other options that can be used are-
--as-textfile: Imports data as plain text. This is the default if not mentioned in import.
--as-sequencefile: Imports data to Sequence Files.
--as-parquetfile: Imports data to Parquet Files.
- --compress or -z: Enable compression
- --compression-codec: Defines the hadoop compression codec to use. By default, data is not compressed. If compress is enabled default compression codec is gzip.
- --target-dir: HDFS destination directory.
- --split-by: Column of the table used to split work units.
- --fetch-size: Number of entries to read from database at once.
- --append: Append data to an existing dataset already in HDFS. Other options that can be used are-
--delete-target-dir: Delete the import target directory if it exists
- --num-mappers or -m: Use n map tasks to import data in parallel from the database.
- --validate: Validate and confirm if the import process was successful.
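Most of these flags are optional. A minimal import of the same table that relies on Sqoop's defaults (plain-text output with comma-separated fields) could be as short as the sketch below; the target directory is just an illustrative choice, and a single mapper keeps the example simple:
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import \
--connect jdbc:mysql://10.0.100.3:3306/sakila \
--username sakila_user \
--password hadoop1234 \
--table actor \
--target-dir /user/root/actor_text \
--num-mappers 1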
Once the Avro import above completes successfully, let us check the data in HDFS.
root@EdgeNode:/usr/local/sqoop/sqoop_work# cd $HADOOP_HOME/bin
root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -ls /user/root/actor
/user/root/actor/part-m-00000.avro
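As a side note on the --append flag from the option list: a second run into a directory that already exists would normally fail, but with --append Sqoop adds new part files next to the existing ones. A purely illustrative follow-up to the minimal text import sketched earlier, restricted with a hypothetical --where condition:
root@EdgeNode:/usr/local/hadoop/bin# sqoop import \
--connect jdbc:mysql://10.0.100.3:3306/sakila \
--username sakila_user \
--password hadoop1234 \
--table actor \
--where "actor_id > 100" \
--target-dir /user/root/actor_text \
--append \
--num-mappers 1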
Hive Table
The next step we might be interested in is loading the data into our Hadoop Hive database. Since we previously imported the data as Avro, we will create an Avro-based table in Hive. For that, we need to move the Avro schema file into HDFS.
root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -mkdir -p /user/root/schema
root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -copyFromLocal /usr/local/sqoop/sqoop_work/actor.avsc /user/root/schema/actor.avsc
Now let us create and load the Hive table.
root@EdgeNode:/usr/local/hadoop/bin# cd $HIVE_HOME/hive_work
root@EdgeNode:/usr/local/hive/hive_work# hive
hive> SHOW DATABASES;
hive> USE default;
hive> CREATE TABLE actor
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/root/schema/actor.avsc');
hive> LOAD DATA INPATH '/user/root/actor/part-*.avro' OVERWRITE INTO TABLE actor;
hive> select count(1) from actor;
hive> select * from actor limit 10;
hive> quit;
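As a design note: LOAD DATA INPATH moves the part files into the Hive warehouse directory. An alternative is an external table pointed at the import directory, so the files stay where Sqoop wrote them. A sketch, using a hypothetical table name and assuming a fresh import still sitting under /user/root/actor (the LOAD DATA above would already have moved those files):
hive> CREATE EXTERNAL TABLE actor_ext
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/root/actor'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/root/schema/actor.avsc');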
SQOOP import to HIVE
The main function of Sqoop import is to upload data into files in HDFS. However, if we have a Hive metastore associated with our HDFS cluster, Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data's layout in Hive. Importing data into Hive is as simple as adding the --hive-import option to the Sqoop command line. If the Hive table already exists, specify the --hive-overwrite option to indicate that the existing table in Hive must be replaced. After the data is imported into HDFS (or if this step is omitted), Sqoop generates a Hive script containing a CREATE TABLE operation that defines your columns using Hive's types, and a LOAD DATA INPATH statement to move the data files into Hive's warehouse directory.
The script is executed by calling the installed copy of Hive on the machine where Sqoop is run. If you have multiple Hive installations, or hive is not in your $PATH, use the --hive-home option to identify the Hive installation directory; Sqoop will use $HIVE_HOME/bin/hive from there. Let us now import another table and load it directly into Hive.
root@EdgeNode:/usr/local/hive/hive_work# cd $SQOOP_HOME/sqoop_work
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import \
--bindir ./ \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://10.0.100.3:3306/sakila \
--username sakila_user \
--password hadoop1234 \
--table actor_info \
--as-textfile \
--num-mappers 2 \
--split-by actor_id \
--target-dir /user/hive/warehouse/actor_info \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--hive-import \
--hive-table actor_info
This will create the actor_info table in the default Hive database, as configured in hive-site.xml. Now let us check the Hive table.
root@EdgeNode:/usr/local/sqoop/sqoop_work# cd $HIVE_HOME/hive_work
root@EdgeNode:/usr/local/hive/hive_work# hive
hive> SHOW DATABASES;
hive> USE default;
hive> select count(1) from actor_info;
hive> select * from actor_info limit 10;
hive> quit;
Note that the delimited data files in HDFS backing the Hive table will be under the warehouse directory configured in hive-site.xml:
root@EdgeNode:/usr/local/sqoop/sqoop_work# $HADOOP_HOME/bin/hadoop fs -ls /user/hive/warehouse/actor_info
NOTE: Sqoop currently does not support loading the --as-avrodatafile and --as-sequencefile formats into Hive.
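For completeness, the --hive-overwrite and --hive-home options mentioned earlier were not needed above. A re-run that replaces the contents of the existing Hive table might look like the sketch below, using an illustrative staging directory for the intermediate files; --delete-target-dir is added so any leftover intermediate directory from an earlier run does not make the job fail, and --hive-home would only be needed if hive is not on the PATH:
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://10.0.100.3:3306/sakila \
--username sakila_user \
--password hadoop1234 \
--table actor_info \
--as-textfile \
--num-mappers 2 \
--split-by actor_id \
--target-dir /user/root/actor_info_stage \
--delete-target-dir \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--hive-import \
--hive-overwrite \
--hive-table actor_info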
SQOOP Import Database
Finally, one more Sqoop utility worth checking at this point is import-all-tables. When we want to bring an entire database schema into HDFS, we can use the import-all-tables utility. First look at the utility in action, followed by the prerequisites for it to execute successfully.
root@EdgeNode:/usr/local/sqoop/sqoop_work# cd $SQOOP_HOME/sqoop_work
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import-all-tables \
--bindir ./ \
--driver com.mysql.jdbc.Driver \
--connect jdbc:mysql://10.0.100.3:3306/sakila \
--username sakila_user \
--password hadoop1234 \
--exclude-tables actor,actor_info \
--as-textfile \
--num-mappers 1 \
--direct \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--hive-import
Now let us check the hive tables:
root@EdgeNode:/usr/local/sqoop/sqoop_work# cd $HIVE_HOME/hive_work
root@EdgeNode:/usr/local/hive/hive_work# hive
hive> SHOW DATABASES;
hive> USE default;
hive> SHOW TABLES;
hive> select count(1) from inventory;
hive> select * from film_actor limit 10;
hive> quit;
Pre-requisite for import-all-tables
For the import-all-tables utility to execute successfully, the following conditions must be met:
- Each table must have a single-column primary key (see the sketch after this list for one way to handle tables that do not).
- All columns of each table are imported.
- You must not use a non-default splitting column, nor impose any conditions via a WHERE clause.
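If some tables do not satisfy the single-column primary key requirement, they can either be skipped with --exclude-tables (as in the example above) or, depending on the Sqoop version, imported anyway by adding --autoreset-to-one-mapper, which falls back to a single mapper for tables that lack a usable split column. A sketch:
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import-all-tables \
--connect jdbc:mysql://10.0.100.3:3306/sakila \
--username sakila_user \
--password hadoop1234 \
--autoreset-to-one-mapper \
--as-textfile \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--hive-import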
There are many other Sqoop actions that need to be discussed in detail; we will cover them in separate articles. But in the next article, let's learn how to bring data from the most popular RDBMS, Oracle, to Hadoop using Sqoop.