In this article we will use Apache SQOOP to import data from an Oracle database. Now that we have an Oracle server ready in our cluster, let us log in to the EdgeNode. We will configure Sqoop to import this data into the HDFS file system, followed by a direct import into Hive tables.

SQOOP Import to HDFS

Now it's time to explore Sqoop. Let us first invoke the Sqoop utility to list the databases accessible to a user in Oracle. Note that the java.security.egd settings passed via HADOOP_OPTS and mapred.child.java.opts below are a common workaround for Oracle JDBC connections hanging due to slow /dev/random entropy.

root@EdgeNode:~# cd $SQOOP_HOME/sqoop_work
root@EdgeNode:/usr/local/sqoop/sqoop_work# export HADOOP_OPTS=-Djava.security.egd=file:/dev/../dev/urandom
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-databases -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" --connect "jdbc:oracle:thin:@10.0.100.4:1521:XE" --username "SYSTEM" --password "hadoop1234"

To avoid exposing the password on the command line, use the -P option, which prompts for the password instead:

root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-databases -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" --connect "jdbc:oracle:thin:@10.0.100.4:1521:XE" --username "SYSTEM"  -P
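
Another option, available in Sqoop 1.4.4 and later, is --password-file, which reads the password from a file on HDFS or the local file system. The file should contain only the password, with no trailing newline, and should be protected with restrictive permissions; the path used below is just an example:

root@EdgeNode:/usr/local/sqoop/sqoop_work# echo -n "hadoop1234" > oracle.password
root@EdgeNode:/usr/local/sqoop/sqoop_work# hadoop fs -put oracle.password /user/root/oracle.password
root@EdgeNode:/usr/local/sqoop/sqoop_work# hadoop fs -chmod 400 /user/root/oracle.password
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-databases -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" --connect "jdbc:oracle:thin:@10.0.100.4:1521:XE" --username "SYSTEM" --password-file /user/root/oracle.password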

Alternatively, we can use --options-file, which keeps the connection string, user and password details in a file, as below:

root@EdgeNode:/usr/local/sqoop/sqoop_work# vi import_oracle.txt
	
--connect
"jdbc:oracle:thin:@10.0.100.4:1521:XE"
--username
"SYSTEM"  
--password
"hadoop1234"
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-databases -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" --options-file /usr/local/sqoop/sqoop_work/import_oracle.txt
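
Since the options file holds the password in plain text, it is worth restricting its permissions so that only the owner can read it:

root@EdgeNode:/usr/local/sqoop/sqoop_work# chmod 600 /usr/local/sqoop/sqoop_work/import_oracle.txt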

Now let's use another Sqoop utility to list the tables available in a database schema.

root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-tables -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" --connect jdbc:oracle:thin:@10.0.100.4:1521/xe --username "HR" --password "hrpass1234"

Now it's finally time to bring some data from an Oracle table into HDFS using Sqoop import. There are many options to control the import process, which will be discussed in more detail later. For now, take a look at the import statement below:

root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" \
--bindir ./ \
--connect "jdbc:oracle:thin:@10.0.100.4:1521/XE" \
--username "HR" \
--password "hrpass1234" \
--table HR.JOBS \
--as-avrodatafile \
--compression-codec snappy \
--target-dir /user/root/jobs \
--fetch-size 1000 \
--append \
--num-mappers 1 \
--validate
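
Apart from the data, Sqoop's code-generation step leaves a few artifacts in our working directory (because of --bindir ./): the generated Java class and, since this is an Avro import, the HR_JOBS.avsc schema file, which we will need shortly to define the Hive table. A quick listing confirms they are there:

root@EdgeNode:/usr/local/sqoop/sqoop_work# ls HR_JOBS.*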

Once the Sqoop import completes successfully, let us check the data in HDFS.

root@EdgeNode:/usr/local/sqoop/sqoop_work# cd $HADOOP_HOME/bin
root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -ls /user/root/jobs
	/user/root/jobs/part-m-00000.avro

root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -cat /user/root/jobs/part-m-00000.avro
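
Since the part file is in binary Avro format, the cat output is not fully readable. If the avro-tools jar is available on the edge node (the jar location and version below are only an assumption), we can copy the file locally and dump its records as JSON:

root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -copyToLocal /user/root/jobs/part-m-00000.avro /tmp/part-m-00000.avro
root@EdgeNode:/usr/local/hadoop/bin# java -jar /usr/local/avro/avro-tools-1.8.2.jar tojson /tmp/part-m-00000.avro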

Hive Table

The next step we might be interested in is loading this data into our Hive database. Since we imported the data as Avro, we will create an Avro-based table in Hive. For that we need the Avro schema file (HR_JOBS.avsc) that Sqoop generated in our sqoop_work directory during the import; let us copy it into HDFS.

root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -mkdir -p /user/root/schema
root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -copyFromLocal /usr/local/sqoop/sqoop_work/HR_JOBS.avsc /user/root/schema/jobs.avsc
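
Optionally, we can take a quick look at the schema file in HDFS to confirm it was copied correctly:

root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -cat /user/root/schema/jobs.avsc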

Now let us create & load the hive table.

root@EdgeNode:/usr/local/hadoop/bin# cd $HIVE_HOME/hive_work
root@EdgeNode:/usr/local/hive/hive_work# hive
hive> SHOW DATABASES;
hive> USE default;

hive> CREATE TABLE jobs
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/root/schema/jobs.avsc');
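
As a side note, on Hive 0.14 and later the same Avro-backed table can be declared more concisely with the STORED AS AVRO shorthand; this is an equivalent alternative to the DDL above, not an additional step:

hive> CREATE TABLE jobs
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/root/schema/jobs.avsc');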

hive> LOAD DATA INPATH '/user/root/jobs/part-*.avro' OVERWRITE INTO TABLE jobs;
hive> select count(1) from jobs;
hive> select * from jobs limit 10;
hive> quit;

SQOOP Import to Hive

Let us now import another table using Sqoop and load it directly into our Hive warehouse.

root@EdgeNode:/usr/local/hive/hive_work# cd $SQOOP_HOME/sqoop_work
root@EdgeNode:/usr/local/sqoop/sqoop_work# export HADOOP_OPTS=-Djava.security.egd=file:/dev/../dev/urandom
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" \
--bindir ./ \
--connect "jdbc:oracle:thin:@10.0.100.4:1521/XE" \
--username "HR" \
--password "hrpass1234" \
--num-mappers 1 \
--as-textfile \
--target-dir /user/hive/warehouse/job_history \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--table HR.JOB_HISTORY \
--hive-import \
--hive-table job_history

This will create the job_history table in the default Hive database, as configured in hive-site.xml. Now let us check the Hive table.

root@EdgeNode:/usr/local/sqoop/sqoop_work# cd $HIVE_HOME/hive_work
root@EdgeNode:/usr/local/hive/hive_work# hive
hive> SHOW DATABASES;
hive> USE default;
hive> select count(1) from job_history;
hive> select * from job_history;
hive> quit;

Note that the delimited data files in HDFS backing the Hive table will be under the warehouse directory configured in hive-site.xml.

root@EdgeNode:/usr/local/hive/hive_work# cd $HADOOP_HOME/bin
root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -ls /user/hive/warehouse/job_history
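
To peek at the raw pipe-delimited records backing the table, we can cat one of the part files (the file name below assumes the single-mapper import above; adjust it to whatever the -ls command actually reports):

root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -cat /user/hive/warehouse/job_history/part-m-00000 | head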

SQOOP Import Database

Finally, one more Sqoop utility we will check at this point is import-all-tables. In cases where we want to bring an entire database schema into HDFS, we can use the import-all-tables utility.

root@EdgeNode:/usr/local/hadoop/bin# cd $SQOOP_HOME/sqoop_work
root@EdgeNode:/usr/local/sqoop/sqoop_work# export HADOOP_OPTS=-Djava.security.egd=file:/dev/../dev/urandom
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import-all-tables -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" \
--bindir ./ \
--connect "jdbc:oracle:thin:@10.0.100.4:1521/XE" \
--username "HR" \
--password "hrpass1234" \
--num-mappers 1 \
--exclude-tables "HR.JOBS","HR.JOB_HISTORY" \
--as-textfile \
--direct \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--hive-import

Now let us check the hive tables:

root@EdgeNode:/usr/local/sqoop/sqoop_work# cd $HIVE_HOME/hive_work
root@EdgeNode:/usr/local/hive/hive_work# hive
hive> SHOW DATABASES;
hive> USE default;
hive> show tables;
hive> select count(1) from inventory;
hive> select * from inventory limit 10;
hive> quit;

In our next article we will check another Sqoop utility called merge. We will perform an incremental extraction from an Oracle source table, followed by a merge.


