SQOOP Import from Oracle
In this article we will use Apache Sqoop to import data from an Oracle database. Now that we have an Oracle server ready in our cluster, let us log in to the EdgeNode. We will first configure Sqoop to import the data into the HDFS file system, followed by a direct import into Hive tables.
SQOOP Import to HDFS
Now it’s time to explore Sqoop. Let us first invoke the Sqoop list-databases utility to list the databases accessible to a user in Oracle.
root@EdgeNode:~# cd $SQOOP_HOME/sqoop_work
root@EdgeNode:/usr/local/sqoop/sqoop_work# export HADOOP_OPTS=-Djava.security.egd=file:/dev/../dev/urandom
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-databases -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" --connect "jdbc:oracle:thin:@10.0.100.4:1521:XE" --username "SYSTEM" --password "hadoop1234"
To avoid exposing the password on the command line, use the -P option, which will prompt for the password instead:
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-databases -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" --connect "jdbc:oracle:thin:@10.0.100.4:1521:XE" --username "SYSTEM" -P
Alternatively, we can use --options-file, which keeps the connection string, username and password in a separate file:
root@EdgeNode:/usr/local/sqoop/sqoop_work# vi import_oracle.txt
--connect
"jdbc:oracle:thin:@10.0.100.4:1521:XE"
--username
"SYSTEM"
--password
"hadoop1234"
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-databases -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" --options-file /usr/local/sqoop/sqoop_work/import_oracle.txt
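Another alternative worth noting is --password-file, which reads the password from a permission-restricted file in HDFS. The sketch below assumes Sqoop 1.4.4 or later; the file name oracle.password and its HDFS location are only placeholders:
root@EdgeNode:/usr/local/sqoop/sqoop_work# echo -n "hadoop1234" > oracle.password
root@EdgeNode:/usr/local/sqoop/sqoop_work# hadoop fs -put oracle.password /user/root/oracle.password
root@EdgeNode:/usr/local/sqoop/sqoop_work# hadoop fs -chmod 400 /user/root/oracle.password
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-databases -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" --connect "jdbc:oracle:thin:@10.0.100.4:1521:XE" --username "SYSTEM" --password-file /user/root/oracle.password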
Now let us use another Sqoop utility, list-tables, to list the tables available to a database user.
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop list-tables -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" --connect jdbc:oracle:thin:@10.0.100.4:1521/xe --username "HR" --password "hrpass1234"
Now it’s finally time to bring some data from an Oracle table into HDFS using Sqoop import. There are many options to control the import process, which we will discuss in more detail later. For now, take a look at the import statement below:
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" \
--bindir ./ \
--connect "jdbc:oracle:thin:@10.0.100.4:1521/XE" \
--username "HR" \
--password "hrpass1234" \
--table HR.JOBS \
--as-avrodatafile \
--compression-codec snappy \
--target-dir /user/root/jobs \
--fetch-size 1000 \
--append \
--num-mappers 1 \
--validate
Once the Sqoop import is successful, let us check the data in HDFS.
root@EdgeNode:/usr/local/sqoop/sqoop_work# cd $HADOOP_HOME/bin
root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -ls /user/root/jobs
/user/root/jobs/part-m-00000.avro
root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -cat /user/root/jobs/part-m-00000.avro
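Since Avro is a binary container format, hadoop fs -cat will show the embedded schema followed by mostly unreadable bytes. To inspect the records as JSON, one option (a sketch; the avro-tools jar version and path below are assumptions, so adjust them to whatever is installed) is to copy a part file to the local file system and run the avro-tools tojson command on it:
root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -copyToLocal /user/root/jobs/part-m-00000.avro /tmp/part-m-00000.avro
root@EdgeNode:/usr/local/hadoop/bin# java -jar /usr/local/avro/avro-tools-1.8.1.jar tojson /tmp/part-m-00000.avro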
Hive Table
The next step is to load the data into our Hive database. Since we imported the data as Avro, we will create an Avro-backed table in Hive. For that, we first need to move the Avro schema file into HDFS.
root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -mkdir -p /user/root/schema
root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -copyFromLocal /usr/local/sqoop/sqoop_work/HR_JOBS.avsc /user/root/schema/jobs.avsc
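For reference, HR_JOBS.avsc is generated by Sqoop during the import, in the working directory (here /usr/local/sqoop/sqoop_work). Its contents will look roughly like the sketch below; the exact field names and Avro types depend on how Sqoop maps the source columns, so treat this as illustrative only:
{
  "type" : "record",
  "name" : "HR_JOBS",
  "fields" : [
    {"name" : "JOB_ID", "type" : ["null", "string"], "default" : null},
    {"name" : "JOB_TITLE", "type" : ["null", "string"], "default" : null},
    {"name" : "MIN_SALARY", "type" : ["null", "int"], "default" : null},
    {"name" : "MAX_SALARY", "type" : ["null", "int"], "default" : null}
  ]
}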
Now let us create and load the Hive table.
root@EdgeNode:/usr/local/hadoop/bin# cd $HIVE_HOME/hive_work
root@EdgeNode:/usr/local/hive/hive_work# hive
hive> SHOW DATABASES;
hive> USE default;
hive> CREATE TABLE jobs
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/root/schema/jobs.avsc');
hive> LOAD DATA INPATH '/user/root/jobs/part-*.avro' OVERWRITE INTO TABLE jobs;
hive> select count(1) from jobs;
hive> select * from jobs limit 10;
hive> quit;
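As a side note, instead of LOAD DATA we could also have pointed Hive directly at the import directory with an external table, along the lines of the sketch below (jobs_ext is a hypothetical table name; with an external table, dropping the table leaves the files under /user/root/jobs in place):
hive> CREATE EXTERNAL TABLE jobs_ext
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/root/jobs'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/root/schema/jobs.avsc');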
SQOOP Import to Hive
Let us now import another table using Sqoop and load it directly into our Hive warehouse.
root@EdgeNode:/usr/local/hive/hive_work# cd $SQOOP_HOME/sqoop_work
root@EdgeNode:/usr/local/sqoop/sqoop_work# export HADOOP_OPTS=-Djava.security.egd=file:/dev/../dev/urandom
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" \
--bindir ./ \
--connect "jdbc:oracle:thin:@10.0.100.4:1521/XE" \
--username "HR" \
--password "hrpass1234" \
--num-mappers 1 \
--as-textfile \
--target-dir /user/hive/warehouse/job_history \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--table HR.JOB_HISTORY \
--hive-import \
--hive-table job_history
This will create the job_history table in the default Hive database, as configured in hive-site.xml. Now let us check the Hive table.
root@EdgeNode:/usr/local/sqoop/sqoop_work# cd $HIVE_HOME/hive_work
root@EdgeNode:/usr/local/hive/hive_work# hive
hive> SHOW DATABASES;
hive> USE default;
hive> select count(1) from job_history;
hive> select * from job_history;
hive> quit;
Note that the delimited data files in HDFS backing the Hive table will be under the warehouse directory configured in hive-site.xml:
root@EdgeNode:/usr/local/hive/hive_work# cd $HADOOP_HOME/bin
root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -ls /user/hive/warehouse/job_history
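To quickly confirm the delimiter and layout, we can also peek at the first few rows of the underlying files (the part-* glob simply covers whatever part files Sqoop produced):
root@EdgeNode:/usr/local/hadoop/bin# hadoop fs -cat /user/hive/warehouse/job_history/part-* | head -5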
SQOOP Import Database
Finally, one more Sqoop utility we will check at this point is import-all-tables. When we want to bring an entire database schema into HDFS, we can use the import-all-tables utility.
root@EdgeNode:/usr/local/hadoop/bin# cd $SQOOP_HOME/sqoop_work
root@EdgeNode:/usr/local/sqoop/sqoop_work# export HADOOP_OPTS=-Djava.security.egd=file:/dev/../dev/urandom
root@EdgeNode:/usr/local/sqoop/sqoop_work# sqoop import-all-tables -D mapred.child.java.opts="\-Djava.security.egd=file:/dev/../dev/urandom" \
--bindir ./ \
--connect "jdbc:oracle:thin:@10.0.100.4:1521/XE" \
--username "HR" \
--password "hrpass1234" \
--num-mappers 1 \
--exclude-tables "HR.JOBS,HR.JOB_HISTORY" \
--as-textfile \
--direct \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--hive-import
Now let us check the Hive tables:
root@EdgeNode:/usr/local/sqoop/sqoop_work# cd $HIVE_HOME/hive_work
root@EdgeNode:/usr/local/hive/hive_work# hive
hive> SHOW DATABASES;
hive> USE default;
hive> show tables;
hive> select count(1) from inventory;
hive> select * from inventory limit 10;
hive> quit;
In our next article we will look at another Sqoop utility called merge. We will perform an incremental extraction from an Oracle source table, followed by a merge.