In this article we will create Oozie workflows to orchestrate the daily loading of the Showroom dimension and the Sales fact tables from the MySQL source into HDFS using Sqoop, followed by loading the data from HDFS into the Hive warehouse using Hive, and finally housekeeping & archival.

Setup Oozie Workflow

Log in to the Ambari Web UI with user edw_user & password hadoop. Open the Workflow Manager View to develop our workflows to load the data lake on a daily basis.

1. Let us first create a daily load workflow for the Showroom dimension. Click on Create Workflow and name it Wf_Showroom.

  • Add a Sqoop Action Node named sq_stg_showroom in between the Start & End nodes. Next click on the Settings icon for the Sqoop node.
    On the General tab, set the Send As option to Args and add the following five Args:
    job
    --exec
    jb_stg_showroom
    --meta-connect
    jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop


    Next click on Settings and enter the Workflow Parameters configuration as below:
    oozie.action.sharelib.for.sqoop=sqoop
    

  • Next add a Hive2 Action Node named hv_dim_showroom after the node sq_stg_showroom. Next click on the Settings icon for the Hive2 node. On the General tab, select the Hive Option as Script.
    Add the script file path as /user/edw_user/sales/scripts/load_showroom.hql
    Enter the jdbc-url as jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics with user edw_user
    Enter the password as hadoop

  • Next add a Shell Action Node named sh_arc_showroom after the node hv_dim_showroom. Next click on the Settings icon for the Shell node.
    On the General tab, under Exec, enter the script file path as /user/edw_user/sales/scripts/archive_showroom.sh
    Under Advanced Properties enter File as /user/edw_user/sales/scripts/archive_showroom.sh

Finally click on Save with the Workflow Path as /user/edw_user/sales/workflows/showroom_workflow.xml and check the Overwrite option. Next click on Submit Job, check Overwrite & Run on submit, to test the workflow. A sketch of the workflow definition this produces is shown below.
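
Under the hood the Workflow Manager View saves this definition as workflow.xml at the path above. The following is a minimal hand-written sketch of roughly what that definition contains, not the exact output of the view: it assumes ${resourceManager} and ${nameNode} are supplied as parameters at submission time, ships the HQL and shell scripts via <file> elements and references them by basename, and the schema versions and error-handling node names may differ from what the view generates. The sharelib setting oozie.action.sharelib.for.sqoop=sqoop is passed as a workflow/job parameter, not inside this XML.

<workflow-app xmlns="uri:oozie:workflow:0.5" name="Wf_Showroom">
    <start to="sq_stg_showroom"/>

    <!-- Sqoop action: executes the saved Sqoop job jb_stg_showroom from the Sqoop metastore -->
    <action name="sq_stg_showroom">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${resourceManager}</job-tracker>
            <name-node>${nameNode}</name-node>
            <arg>job</arg>
            <arg>--exec</arg>
            <arg>jb_stg_showroom</arg>
            <arg>--meta-connect</arg>
            <arg>jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop</arg>
        </sqoop>
        <ok to="hv_dim_showroom"/>
        <error to="kill"/>
    </action>

    <!-- Hive2 action: runs the HQL script that loads the showroom dimension via HiveServer2 -->
    <action name="hv_dim_showroom">
        <hive2 xmlns="uri:oozie:hive2-action:0.1">
            <job-tracker>${resourceManager}</job-tracker>
            <name-node>${nameNode}</name-node>
            <jdbc-url>jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics</jdbc-url>
            <password>hadoop</password>
            <script>load_showroom.hql</script>
            <file>/user/edw_user/sales/scripts/load_showroom.hql#load_showroom.hql</file>
        </hive2>
        <ok to="sh_arc_showroom"/>
        <error to="kill"/>
    </action>

    <!-- Shell action: housekeeping & archival; the script is shipped from HDFS and run by name -->
    <action name="sh_arc_showroom">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${resourceManager}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>archive_showroom.sh</exec>
            <file>/user/edw_user/sales/scripts/archive_showroom.sh#archive_showroom.sh</file>
        </shell>
        <ok to="end"/>
        <error to="kill"/>
    </action>

    <kill name="kill">
        <message>Workflow failed at ${wf:lastErrorNode()}: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>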

2. Now we will create a daily load workflow for the Sales fact. Click on Create Workflow and name it Wf_Sales.

  • Add a Sqoop Action Node named sq_stg_sales in between the Start & End nodes. Next click on the Settings icon for the Sqoop node.
    On the General tab, set the Send As option to Args and add the following five Args:
    job
    --exec
    jb_stg_sales
    --meta-connect
    jdbc:hsqldb:hsql://sandbox-hdp.hortonworks.com:16001/sqoop


    Next click on Settings and enter the Workflow Parameters configuration as below:
    oozie.action.sharelib.for.sqoop=sqoop
    

  • Next add a Pig Action Node named pg_stg_sales after the node sq_stg_sales. Next click on the Settings icon for the Pig node.
    On the General tab, under Script, browse and enter the file path as /user/edw_user/sales/scripts/transform_sales.pig
    Under Advanced Properties enter File as /user/edw_user/sales/scripts/transform_sales.pig
    Under Configuration enter the following parameters:
    hive.metastore.uris=thrift://sandbox-hdp.hortonworks.com:9083
    mapreduce.map.output.compress=true
    

    Next click on Settings and enter the Workflow Parameters configuration as below:
    oozie.action.sharelib.for.sqoop=sqoop
    oozie.action.sharelib.for.pig=pig,hive,hcatalog
    

  • Next add a Hive2 Action Node named hv_fact_sales after the node pg_stg_sales. Next click on the Settings icon for the Hive2 node. On the General tab, select the Hive Option as Script.
    Add the script file path as /user/edw_user/sales/scripts/load_sales.hql
    Enter the jdbc-url as jdbc:hive2://sandbox-hdp.hortonworks.com:10000/sales_analytics with user edw_user
    Enter the password as hadoop

  • Next add a Shell Action Node named sh_arc_sales after the node hv_fact_sales. Next click on the Settings icon for the Shell node.
    On the General tab, under Exec, enter the script file path as /user/edw_user/sales/scripts/archive_sales.sh
    Under Advanced Properties enter File as /user/edw_user/sales/scripts/archive_sales.sh

Finally click on Save with the Workflow Path as /user/edw_user/sales/workflows/sales_workflow.xml and check the Overwrite option. Next click on Submit Job, check Overwrite & Run on submit, to test the workflow. A sketch of the Pig action node for this workflow is shown below.
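
As with the showroom workflow, the view saves this definition as workflow.xml at the path above. The Sqoop, Hive2 and Shell actions follow the same pattern as in Wf_Showroom, so only the Pig action node is sketched here. The same caveats apply: this is a hand-written approximation rather than the view's exact output, ${resourceManager} and ${nameNode} are assumed to be submission-time parameters, and the Pig script is shipped via a <file> element and referenced by its basename. The sharelib settings (oozie.action.sharelib.for.sqoop=sqoop and oozie.action.sharelib.for.pig=pig,hive,hcatalog) are passed as workflow/job parameters, not in the XML.

    <!-- Pig action of Wf_Sales (fragment): transforms the staged sales data before the Hive load.
         It sits between sq_stg_sales and hv_fact_sales in the action chain. -->
    <action name="pg_stg_sales">
        <pig>
            <job-tracker>${resourceManager}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>hive.metastore.uris</name>
                    <value>thrift://sandbox-hdp.hortonworks.com:9083</value>
                </property>
                <property>
                    <name>mapreduce.map.output.compress</name>
                    <value>true</value>
                </property>
            </configuration>
            <script>transform_sales.pig</script>
            <file>/user/edw_user/sales/scripts/transform_sales.pig#transform_sales.pig</file>
        </pig>
        <ok to="hv_fact_sales"/>
        <error to="kill"/>
    </action>

The hive.metastore.uris property together with the hive and hcatalog sharelibs suggests the Pig script accesses the staged data through HCatalog, while the map-output compression setting simply reduces shuffle I/O for the transformation job.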

