We would like to perform practical test of Apache SQOOP import/export utility between ORACLE relational database & Apache HADOOP file system, let us quickly setup an ORACLE server. For that we will be using cloud based services/servers as we did previously using Digital Ocean.

Till now, in our Digital Ocean Private Networking Cluster we have our nodes as below:

Name NodeNameNode10.0.0.1
Data NodeDataNode110.0.100.1
Data NodeDataNode210.0.100.2
Client NodeEdgeNode10.0.100.3

1 NameNode, 2 DataNodes & 1 EdgeNode/ClientNode/GatewayNode (Hadoop Binaries & Configuration files, SQOOP, HIVE)

Installing Oracle in Linux

Let us spawn a droplet in Digital Ocean:

  • Distributions: CentOS 7.2 x64
  • Size: 2GB /2 CPUs; 40 GB SSD Disk; 3TB Transfer
  • Data center region: New York3. ( The Data center should be same as that, where your hadoop cluster & edge node resides in order to enable private networking )
  • Additional Options: Check Private Networking on. (So that Sqoop in Edgenode and Datanodes in hadoop cluster can communicate with the oracle server)
  • Hostname: oraxe

Finally create the droplet.
Once the droplet is up ssh to the machine. Firstly setup the /etc/hosts file correctly. Remove all the lines and add the below two lines, save and quit. In my case the Private IP address is

[root@oraxe ~]# vi /etc/hosts oraxe localhost

Note: The IP mentioned above in the first line is the Private IP address of the droplet in Digital Ocean not the IPv4 Public address.

Update/Install the packages for libaio, bc, flex and unzip.

[root@oraxe ~]# yum install libaio bc flex unzip

Download Oracle Express Edition from the below link:
Once you accept and license agreement Copy the URL for the download Oracle Database Express Edition 11g Release 2 for Linux x64, so that we can directly download in the server. Next we rename the file as required.

[root@oraxe ~]# wget http://download.oracle.com/otn/linux/oracle11g/xe/oracle-xe-11.2.0-1.0.x86_64.rpm.zip?AuthParam=1472010283_e868ce0dc999fce3da22a522653a185b
[root@oraxe ~]# mv oracle-xe-11.2.0-1.0.x86_64.rpm.zip\?AuthParam\=1472010283_e868ce0dc999fce3da22a522653a185b oracle-xe-11.2.0-1.0.x86_64.rpm.zip
[root@oraxe ~]# unzip -q oracle-xe-11.2.0-1.0.x86_64.rpm.zip
[root@oraxe ~]# cd Disk1
[root@oraxe Disk1]# ls	
	oracle-xe-11.2.0-1.0.x86_64.rpm  response  upgrade

Installation of Oracle mostly encounter the below error:
“This system does not meet the minimum requirements for swap space. Based on the amount of physical memory available on the system, Oracle Database 11g Express Edition requires 2048 MB of swap space. This system has 0 MB of swap space. Configure more swap space on the system and retry the installation.” Lets fix the swap space issue.

[root@oraxe Disk1]# dd if=/dev/zero of=/swapfile bs=1024M count=2
[root@oraxe Disk1]# mkswap /swapfile
[root@oraxe Disk1]# free -m | grep Swap
[root@oraxe Disk1]# swapon /swapfile
[root@oraxe Disk1]# free -m | grep Swap

Now Let us start the Oracle Installer.

[root@oraxe Disk1]# rpm -ivh oracle-xe-11.2.0-1.0.x86_64.rpm

Once the Database Installation Completed successfully, we need to configure the database. Mostly for all the options we will choose the default value, simply by pressing Enter.

[root@oraxe Disk1]# /etc/init.d/oracle-xe configure
HTTP Port: 8088
Listener Port: 1521
SYS/SYSTEM Password: (e.g. hadoop 1234)
Start Oracle on boot: y

Once done you should see the success status as below
Starting Oracle Net Listener...Done  
Configuring database...Done  
Starting Oracle Database 11g Express Edition instance...Done  
Installation completed successfully.

The installation created the directory /u01 under which Oracle XE is installed. To set the required Oracle environment variables, use the oracle_env.sh the script included under cd /u01/app/oracle/product/11.2.0/xe/bin

[root@oraxe Disk1]# cd /u01/app/oracle/product/11.2.0/xe/bin
[root@oraxe /u01/app/oracle/product/11.2.0/xe/bin]# . ./oracle_env.sh

This sets the environment for our current session. But in order to set the environment permanently, append the following line to the .bashrc file append the below line, save and quit.

[root@oraxe /u01/app/oracle/product/11.2.0/xe/bin]# vi .bashrc

	. /u01/app/oracle/product/11.2.0/xe/bin/oracle_env.sh

Now let’s try to access Oracle SQL*Plus utility:

[root@oraxe /u01/app/oracle/product/11.2.0/xe/bin]# sqlplus /nolog
SQL*Plus: Release Production on Wed Sep 21 08:17:26 2011
Copyright (c) 1982, 2011, Oracle.  All rights reserved.

SQL> connect sys/Password as sysdba

Now we will unlock & reset the password of a default database called HR, so that we can use the same from sqoop.


Login to this HR account later & check the tables/data, as we will be using these tables in our sqoop import job.
To allow remote access to Oracle 11g XE GUI (as well as Application Express GUI) issue the following from SQL*Plus

SQL> exit

We will now be able to access the Oracle 11g XE Home Page GUI at http://localhost:8080/apex/f?p=4950:1
Replace localhost above with your IP as required and log in as SYSTEM and the password.

We are done with the Oracle Installation.
We need to map the Private IP address & Hostname of the Oracle Server in all the Nodes in our cluster. Once done, it’s time for us to login to our client node or edge node where we have our SQOOP installed. In the next article we will see SQOOP in action with Oracle RDBMS.

Have a question on this subject?

Ask questions to our expert community members and clear your doubts. Asking question or engaging in technical discussion is both easy and rewarding.

Are you on Twitter?

Start following us. This way we will always keep you updated with what's happening in Data Analytics community. We won't spam you. Promise.

  • Install SQOOP in Client Node of Hadoop Cluster

    Sqoop is an open source software product of the Apache Software Foundation in the hadoop ecosystem, designed to transfer data between Hadoop and relational databases or mainframes. Sqoop can be used to import data from a relational database...

  • Install PIG In Client Node of Hadoop Cluster

    Apache Pig is a platform for analyzing large data sets. Pig Latin is the high level programming language that, lets us specify a sequence of data transformations such as merging data sets, filtering them, grouping them, and applying functions to...

  • Hadoop DataLake Implementation

    In this multi-series article we will learn how to implement an Enterprise DataLake using Apache Hadoop, an open-source, java-based software framework for reliable, scalable & distributed computing. Apache Hadoop addresses the limitations of...

  • Install Hive in Client Node of Hadoop Cluster

    In the previous article, we have shown how to setup a client node. Once this is done, now let's put Hadoop to use for some big data analytics purpose. One way to do that is by using Hive which let's us run SQL queries against the big data. A...

  • Install FLUME In Client Node of Hadoop Cluster

    Apache Flume is a distributed, robust, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data or streaming event data from different sources to a centralized data store. Its main goal is to...

  • Stream Webserver Log into Hdfs using FLUME

    In this article we will use Apache Flume to gather stream access log data from our remote Web Server into Hadoop Distributed File System. We will be analyzing the access log in a real-time basis. So we have to setup Flume such that it collects the...

  • Set up Client Node (Gateway Node) in Hadoop Cluster

    Once we have our multi-node hadoop cluster up and running, let us create an EdgeNode or a GatewayNode. Gateway nodes are the interface between the Hadoop cluster and the outside network. Edge nodes are used to run client applications and cluster...

  • Hadoop DataLake Implementation Part 5

    In this article we will load the showroom master data from MySQL source system to HDFS using Sqoop as SCD Type 1.

  • How to Setup Hadoop Multi Node Cluster - Step By Step

    Setting up Hadoop in a single machine is easy, but no fun. Why? Because Hadoop is not meant for a single machine. Hadoop is meant to run on a computing cluster comprising of many machines. Running HDFS and MapReduce on a single machine is great for...

  • Hadoop DataLake Implementation Part 4

    Now that our dummy OLTP source system & Hadoop HDFS directory structure is ready, we will first load the ‘dates’ data file in HDFS and further to a hive table.