In this article we will create an Oozie workflow to orchestrate the daily loading of the Showroom dimension table from the MySQL source into HDFS using Sqoop, followed by loading the data from HDFS into the Hive warehouse using Hive, and finally housekeeping & archival.

Setup Oozie Workflow

Log in to the Ambari Web UI with user edw_user & password hadoop. Open the Workflow Manager View to develop our workflows to load the data lake on a daily basis.

1. Let us first create a daily load workflow for the Showroom dimension. Click on Create Workflow and name it Wf_Showroom.

  • Add a Sqoop Action node named sq_stg_showroom between the Start & End nodes. Then click on the Settings icon for the Sqoop node.
    On the General tab, set the Send As option to Args and add the following five Args:
    job
    --exec
    jb_stg_showroom
    --meta-connect
    jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop
    These arguments are equivalent to running sqoop job --exec jb_stg_showroom --meta-connect jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop, i.e. executing the pre-defined Sqoop job from the shared Sqoop metastore.
    

    Next click on Settings and enter the following Workflow Parameters configuration:
    oozie.action.sharelib.for.sqoop=sqoop
    

  • Next add a Hive2 Action node named hv_dim_showroom after the node sq_stg_showroom. Then click on the Settings icon for the Hive2 node. On the General tab, select the Hive Option as Script.
    Add the script file path as /user/edw_user/sales/scripts/load_showroom.hql
    Enter the jdbc-url as jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics, connecting as user edw_user
    Enter the password as hadoop

  • Next add a Shell Action node named sh_arc_showroom after the node hv_dim_showroom. Then click on the Settings icon for the Shell node.
    Under General, in the exec field enter the script file path /user/edw_user/sales/scripts/archive_showroom.sh
    Under Advanced Properties, add a File entry as /user/edw_user/sales/scripts/archive_showroom.sh

Finally click on Save with the Workflow Path as /user/edw_user/sales/workflows/showroom_workflow.xml and check the Overwrite option. Next click on Submit Job, check Overwrite & Run on submit, to test the workflow. A sketch of the workflow definition this produces is shown below.
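
For reference, the Workflow Manager view saves this design as an Oozie workflow.xml at the path given above. The sketch below shows roughly what the generated definition for Wf_Showroom looks like; it is an illustrative hand-written version assuming commonly used action schema versions (the view may emit different schema versions, transition names and a global section), so treat it as a reading aid rather than a file to copy verbatim.

<workflow-app name="Wf_Showroom" xmlns="uri:oozie:workflow:0.5">
    <start to="sq_stg_showroom"/>

    <!-- Sqoop action: executes the saved Sqoop job jb_stg_showroom from the shared metastore -->
    <action name="sq_stg_showroom">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <arg>job</arg>
            <arg>--exec</arg>
            <arg>jb_stg_showroom</arg>
            <arg>--meta-connect</arg>
            <arg>jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop</arg>
        </sqoop>
        <ok to="hv_dim_showroom"/>
        <error to="fail"/>
    </action>

    <!-- Hive2 action: loads the dimension table via load_showroom.hql, connecting as edw_user -->
    <action name="hv_dim_showroom">
        <hive2 xmlns="uri:oozie:hive2-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <jdbc-url>jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics</jdbc-url>
            <password>hadoop</password>
            <script>/user/edw_user/sales/scripts/load_showroom.hql</script>
        </hive2>
        <ok to="sh_arc_showroom"/>
        <error to="fail"/>
    </action>

    <!-- Shell action: archives the staged source files after the load completes -->
    <action name="sh_arc_showroom">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>/user/edw_user/sales/scripts/archive_showroom.sh</exec>
            <file>/user/edw_user/sales/scripts/archive_showroom.sh</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Workflow failed at [${wf:lastErrorNode()}], error message [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

Note that the oozie.action.sharelib.for.sqoop=sqoop property entered under Settings is not part of workflow.xml; it is passed as a job configuration property when the workflow is submitted.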

2. Now we will create a daily load workflow for the Sales fact. Click on Create Workflow and name it Wf_Sales.

  • Add a Sqoop Action node named sq_stg_sales between the Start & End nodes. Then click on the Settings icon for the Sqoop node.
    On the General tab, set the Send As option to Args and add the following five Args:
    job
    --exec
    jb_stg_sales
    --meta-connect
    jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop
    

    Next click on Settings and enter the following Workflow Parameters configuration:
    oozie.action.sharelib.for.sqoop=sqoop
    

  • Next add a Pig Action node named pg_stg_sales after the node sq_stg_sales. Then click on the Settings icon for the Pig node.
    On the General tab, under Script, browse to and enter the file path /user/edw_user/sales/scripts/transform_sales.pig
    Under Advanced Properties, add a File entry as /user/edw_user/sales/scripts/transform_sales.pig
    Under Configuration enter the following parameters:
    hive.metastore.uris=thrift://sandbox-hdp.hortonworks.com:9083
    mapreduce.map.output.compress=true
    

    Next click on Settings and enter the following Workflow Parameters configuration:
    oozie.action.sharelib.for.sqoop=sqoop
    oozie.action.sharelib.for.pig=pig,hive,hcatalog
    

  • Next add a Hive2 Action node named hv_fact_sales after the node pg_stg_sales. Then click on the Settings icon for the Hive2 node. On the General tab, select the Hive Option as Script.
    Add the script file path as /user/edw_user/sales/scripts/load_sales.hql
    Enter the jdbc-url as jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics, connecting as user edw_user
    Enter the password as hadoop

  • Next add a Shell Action node named sh_arc_sales after the node hv_fact_sales. Then click on the Settings icon for the Shell node.
    Under General, in the exec field enter the script file path /user/edw_user/sales/scripts/archive_sales.sh
    Under Advanced Properties, add a File entry as /user/edw_user/sales/scripts/archive_sales.sh

Finally click on Save with the Workflow Path as /user/edw_user/sales/workflows/sales_workflow.xml and check the Overwrite option. Next click on Submit Job, check Overwrite & Run on submit, to test the workflow. The saved workflow can also be submitted from the Oozie command line, as sketched below.
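
Besides the Submit Job option in the view, the saved workflow can also be run with the Oozie command-line client from the edge node. The job.properties below is a hypothetical example: the nameNode, jobTracker (ResourceManager) and Oozie server addresses shown are the usual HDP sandbox defaults and may need adjusting for your environment.

# job.properties (illustrative; values assume the HDP sandbox defaults)
nameNode=hdfs://sandbox-hdp.hortonworks.com:8020
jobTracker=sandbox-hdp.hortonworks.com:8050
oozie.use.system.libpath=true
# Path to the saved workflow definition in HDFS
oozie.wf.application.path=${nameNode}/user/edw_user/sales/workflows/sales_workflow.xml
# Sharelib overrides entered under Settings in the Workflow Manager view
oozie.action.sharelib.for.sqoop=sqoop
oozie.action.sharelib.for.pig=pig,hive,hcatalog

Submit and then monitor the job as edw_user:

oozie job -oozie http://sandbox-hdp.hortonworks.com:11000/oozie -config job.properties -run
oozie job -oozie http://sandbox-hdp.hortonworks.com:11000/oozie -info <job-id>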

