In this article we will use Apache SQOOP to import data from an Oracle database. Now that we have an Oracle server ready in our cluster, let us log in to the EdgeNode. We will first configure Sqoop to import the data into the HDFS file system, followed by a direct import into Hive tables.

SQOOP Import to HDFS

Now it’s time to explore Sqoop. Let us first invoke the Sqoop utility to list the databases accessible to a user in Oracle. Note that the java.security.egd setting below points the JVM at /dev/urandom; without it, the Oracle JDBC driver can hang while waiting for entropy from the blocking /dev/random device.

root@EdgeNode:~# cd $SQOOP_HOME/sqoop_work
root@EdgeNode:/usr/local/sqoop/sqoop_work# export HADOOP_OPTS=-Djava.security.egd=file:/dev/../dev/urandom
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-databases -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" --connect "jdbc:oracle:thin:@10.0.100.4:1521:XE" --username "SYSTEM" --password "hadoop1234"

To avoid exposing the password on the command line (and in the shell history), use the -P option, which will prompt for the password as below:

root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-databases -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" --connect "jdbc:oracle:thin:@10.0.100.4:1521:XE" --username "SYSTEM"  -P
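
A step up from -P in a scripted setup is the --password-file option, which reads the password from a file on HDFS with restricted permissions. A minimal sketch, assuming we choose /user/root/oracle.password as the location (note the -n flag on echo; a trailing newline would otherwise become part of the password):

root@EdgeNode:/usr/local/sqoop/sqoop_work# echo -n "hadoop1234" > oracle.password
root@EdgeNode:/usr/local/sqoop/sqoop_work# hadoop fs -put oracle.password /user/root/oracle.password
root@EdgeNode:/usr/local/sqoop/sqoop_work# hadoop fs -chmod 400 /user/root/oracle.password
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-databases -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" --connect "jdbc:oracle:thin:@10.0.100.4:1521:XE" --username "SYSTEM" --password-file /user/root/oracle.password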

Alternatively, we can use --options-file, which keeps the connection URL, username, and password information in a file, as below:

root@EdgeNode:/usr/local/sqoop/sqoop_work# vi import_oracle.txt
	
--connect
"jdbc:oracle:thin:@10.0.100.4:1521:XE"
--username
"SYSTEM"  
--password
"hadoop1234"
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-databases -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" --options-file /usr/local/sqoop/sqoop_work/import_oracle.txt
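
Since the options file holds the password in plain text, it is worth locking its permissions down so that only the owner can read it:

root@EdgeNode:/usr/local/sqoop/sqoop_work# chmod 600 import_oracle.txt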

Now let’s use another Sqoop utility to list the available tables in a database.

root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-tables -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" --connect jdbc:oracle:thin:@10.0.100.4:1521/xe --username "HR" --password "hrpass1234"
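
If we want to run an ad-hoc query against Oracle before committing to a full import, the sqoop eval utility executes a SQL statement and prints the result to the console; a quick sketch (the query itself is just an illustration):

root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop eval -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" --connect "jdbc:oracle:thin:@10.0.100.4:1521/xe" --username "HR" --password "hrpass1234" --query "SELECT COUNT(*) FROM HR.JOBS"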

Now it’s finally time to bring some data from an Oracle table into HDFS using SQOOP import. There are many options to control the import process, which we will discuss in more detail later. For now, take a look at the import statement below:

root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" \
--bindir ./ \
--connect "jdbc:oracle:thin:@10.0.100.4:1521/XE" \
--username "HR" \
--password "hrpass1234" \
--table HR.JOBS \
--as-avrodatafile \
--compression-codec snappy \
--target-dir /user/root/jobs \
--fetch-size 1000 \
--append \
--num-mappers 1 \
--validate
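
As a small preview of those control options, --columns and --where let us restrict the import to a subset of columns and rows. A hedged sketch, assuming the standard HR.JOBS layout (the column list, predicate, and target directory below are illustrative):

root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" \
--connect "jdbc:oracle:thin:@10.0.100.4:1521/XE" \
--username "HR" \
--password "hrpass1234" \
--table HR.JOBS \
--columns "JOB_ID,JOB_TITLE,MAX_SALARY" \
--where "MAX_SALARY > 10000" \
--target-dir /user/root/jobs_subset \
--num-mappers 1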

Once the Sqoop import of HR.JOBS into /user/root/jobs is successful, let us check the data in HDFS.

root@EdgeNode:/usr/local/sqoop/sqoop_work# cd $HADOOP_HOME/bin
root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -ls /user/root/jobs
	/user/root/jobs/part-m-00000.avro

root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -cat /user/root/jobs/part-m-00000.avro
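
Since Avro is a binary container format, -cat will mostly print gibberish. To view the records as JSON we can run the avro-tools jar against the HDFS file; a sketch assuming the avro-tools jar (downloadable from the Apache Avro site) has been placed under /usr/local:

root@EdgeNode:/usr/local/hadoop/bin# hadoop jar /usr/local/avro-tools-1.8.1.jar tojson /user/root/jobs/part-m-00000.avro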

Hive Table

The next step we might be interested in is loading the data into our Hadoop Hive database. Since we imported the data as Avro, we will create an Avro-based table in Hive. For that we need to move the Avro schema file, which Sqoop generated during the import, into HDFS.

root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -mkdir -p /user/root/schema
root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -copyFromLocal /usr/local/sqoop/sqoop_work/HR_JOBS.avsc /user/root/schema/jobs.avsc
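
To confirm that the schema file landed intact, we can simply cat it back from HDFS:

root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -cat /user/root/schema/jobs.avsc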

Now let us create and load the Hive table.

root@EdgeNode:/usr/local/hadoop/bin# cd $HIVE_HOME/hive_work
root@EdgeNode:/usr/local/hive/hive_work# hive
hive> SHOW DATABASES;
hive> USE default;

hive> CREATE TABLE jobs
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/root/schema/jobs.avsc');

hive> LOAD DATA INPATH '/user/root/jobs/part-*.avro' OVERWRITE INTO TABLE jobs;
hive> select count(1) from jobs;
hive> select * from jobs limit 10;
hive> quit;
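
As a side note, on Hive 0.14 and later the same table can be declared with the STORED AS AVRO shorthand, which resolves the Avro SerDe and input/output formats for us; a minimal equivalent sketch:

hive> CREATE TABLE jobs
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/root/schema/jobs.avsc');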

SQOOP Import to Hive

Let us now import another table using SQOOP and load it directly into our Hadoop Hive warehouse.

root@EdgeNode:/usr/local/hive/hive_work# cd $SQOOP_HOME/sqoop_work
root@EdgeNode:/usr/local/sqoop/sqoop_work# export HADOOP_OPTS=-Djava.security.egd=file:/dev/../dev/urandom
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" \
--bindir ./ \
--connect "jdbc:oracle:thin:@10.0.100.4:1521/XE" \
--username "HR" \
--password "hrpass1234" \
--num-mappers 1 \
--as-textfile \
--target-dir /user/hive/warehouse/job_history \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--table HR.JOB_HISTORY \
--hive-import \
--hive-table job_history

This will create the job_history table in the default Hive database, as configured in hive-site.xml. Now let us check the Hive table.

root@EdgeNode:/usr/local/sqoop/sqoop_work# cd $HIVE_HOME/hive_work
root@EdgeNode:/usr/local/hive/hive_work# hive
hive> SHOW DATABASES;
hive> USE default;
hive> select count(1) from job_history;
hive> select * from job_history;
hive> quit;
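
Keep in mind that --hive-import appends rows if the Hive table already exists, and a re-run will also fail if the HDFS target directory is still present. A hedged sketch of a repeatable refresh, using the --delete-target-dir and --hive-overwrite flags (the staging directory below is our choice):

root@EdgeNode:/usr/local/hive/hive_work# cd $SQOOP_HOME/sqoop_work
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" \
--connect "jdbc:oracle:thin:@10.0.100.4:1521/XE" \
--username "HR" \
--password "hrpass1234" \
--num-mappers 1 \
--table HR.JOB_HISTORY \
--delete-target-dir \
--target-dir /user/root/job_history_stage \
--hive-import \
--hive-overwrite \
--hive-table job_history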

Note that the delimited data files in HDFS backing the Hive table will be under the warehouse directory configured in hive-site.xml:

root@EdgeNode:/usr/local/hive/hive_work# cd $HADOOP_HOME/bin
root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -ls /user/hive/warehouse/job_history
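
We can also peek at the pipe-delimited rows directly (the part file name below follows the usual single-mapper output convention):

root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -cat /user/hive/warehouse/job_history/part-m-00000 | head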

SQOOP Import Database

Finally, one more Sqoop utility we will check at this point is import-all-tables. Whenever we want to bring an entire database schema into HDFS, we can use the import-all-tables utility.

root@EdgeNode:/usr/local/hadoop/bin# cd $SQOOP_HOME/sqoop_work
root@EdgeNode:/usr/local/sqoop/sqoop_work# export HADOOP_OPTS=-Djava.security.egd=file:/dev/../dev/urandom
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import-all-tables -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" \
--bindir ./ \
--connect "jdbc:oracle:thin:@10.0.100.4:1521/XE" \
--username "HR" \
--password "hrpass1234" \
--num-mappers 1 \
--exclude-tables "HR.JOBS","HR.JOB_HISTORY" \
--as-textfile \
--direct \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--hive-import
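
When the goal is only to land the whole schema in HDFS rather than Hive, the --warehouse-dir option gives every table its own sub-directory under one common parent; a sketch with a parent directory of our choosing:

root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import-all-tables -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" \
--connect "jdbc:oracle:thin:@10.0.100.4:1521/XE" \
--username "HR" \
--password "hrpass1234" \
--num-mappers 1 \
--as-textfile \
--fields-terminated-by '|' \
--warehouse-dir /user/root/hr_dump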

Now let us check the Hive tables created by the import-all-tables run above:

root@EdgeNode:/usr/local/sqoop/sqoop_work# cd $HIVE_HOME/hive_work
root@EdgeNode:/usr/local/hive/hive_work# hive
hive> SHOW DATABASES;
hive> USE default;
hive> show tables;
hive> select count(1) from inventory;
hive> select * from inventory limit 10;
hive> quit;

In our next article we will check another Sqoop utility called Merge. We will perform an incremental extraction from an Oracle source table, followed by a Merge.

