Everything that you ever wanted to know about Oozie, but were afraid to ask

       B Lublinsky, A Yakubovich
Apache Oozie
• Oozie is a workflow/coordination system to
  manage Apache Hadoop jobs.
• A single Oozie server implements all four
  functional Oozie components:
  – Oozie workflow
  – Oozie coordinator
  – Oozie bundle
  – Oozie SLA.
Main components
[Architecture diagram] Clients (third-party applications over the WS API, or the Oozie command-line interface) talk to the Oozie Server. Inside the server, a bundle groups coordinators; coordinators perform time- and data-condition monitoring and trigger workflows; a workflow holds the wf logic and its actions, and handles job submission and monitoring. Workflow, coordinator, and bundle definitions, execution state, and the Oozie shared libraries are kept in HDFS; the jobs themselves run as MapReduce on the Hadoop cluster.
Oozie workflow
Workflow Language
Flow-control nodes:
- Decision (workflow:DECISION): expresses “switch-case” logic
- Fork (workflow:FORK): splits one path of execution into multiple concurrent paths
- Join (workflow:JOIN): waits until every concurrent execution path of a preceding fork node arrives at it
- Kill (workflow:KILL): forces a workflow job to kill (abort) itself

Action nodes:
- java (workflow:JAVA): invokes the main() method of the specified Java class
- fs (workflow:FS): manipulates files and directories in HDFS; supports the move, delete, and mkdir commands
- MapReduce (workflow:MAP-REDUCE): starts a Hadoop map/reduce job; can be a Java MR job, a streaming job, or a pipes job
- Pig (workflow:PIG): runs a Pig job
- Sub-workflow (workflow:SUB-WORKFLOW): runs a child workflow job
- Hive * (workflow:HIVE): runs a Hive job
- Shell * (workflow:SHELL): runs a shell command
- ssh * (workflow:SSH): starts a shell command on a remote machine as a remote secure shell
- Sqoop * (workflow:SQOOP): runs a Sqoop job
- Email * (workflow:EMAIL): sends emails from an Oozie workflow application
- DistCp: under development (Yahoo)
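As a sketch only (the schema version, node names, paths and ${...} parameters below are placeholders, not taken from this deck), a small hPDL fragment combining fork/join with a map-reduce and a pig action looks like this:

    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.2">
      <start to="forkNode"/>
      <fork name="forkNode">
        <path start="mrNode"/>
        <path start="pigNode"/>
      </fork>
      <action name="mrNode">
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <configuration>
            <property>
              <name>mapred.mapper.class</name>
              <value>com.example.DemoMapper</value> <!-- placeholder class -->
            </property>
          </configuration>
        </map-reduce>
        <ok to="joinNode"/>
        <error to="fail"/>
      </action>
      <action name="pigNode">
        <pig>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <script>demo.pig</script> <!-- placeholder script -->
        </pig>
        <ok to="joinNode"/>
        <error to="fail"/>
      </action>
      <join name="joinNode" to="end"/>
      <kill name="fail">
        <message>Workflow failed</message>
      </kill>
      <end name="end"/>
    </workflow-app>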
Workflow actions
• Oozie workflow supports two types of actions:
  – Synchronous, executed inside the Oozie runtime
  – Asynchronous, executed as a Map Reduce job.
[Sequence diagram: starting an asynchronous action] Participants: ActionStartCommand, WorkflowStore, Services, ActionExecutorContext, JavaActionExecutor, JobClient. Message sequence: 1 workflow := getWorkflow(), 2 action := getAction(), 3 context := init<>(), 4 executor := get(), 5 start(), 6 submitLauncher(), 7 jobClient := get(), 8 runningJob := submit(), 9 setStartData().
Workflow lifecycle

[State diagram] Workflow states: PREP, RUNNING, SUSPENDED, SUCCEEDED, KILLED, FAILED. A job is created in PREP, moves to RUNNING, and finishes in SUCCEEDED, KILLED, or FAILED; a RUNNING job can also be SUSPENDED and later resumed.
Oozie execution console
Extending Oozie workflow
• Oozie provides a “minimal” workflow language, which
  contains only a handful of control and action nodes.
• Oozie supports a very elegant extensibility mechanism –
  custom action nodes. Custom action nodes allow extending
  Oozie’s language with additional actions (verbs).
• Creating a custom action requires implementing the
  following:
   – A Java action implementation, which extends the ActionExecutor
     class.
   – The action’s XML schema, defining the action’s
     configuration parameters.
   – Packaging of the Java implementation and configuration schema
     into an action jar, which has to be added to the Oozie war.
   – Extending oozie-site.xml to register information about the custom
     executor with the Oozie runtime.
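As a rough sketch (not Oozie’s own code: the “echo” action type and its behavior are invented for illustration, and the exact ActionExecutor/Context signatures can differ between Oozie versions), a synchronous custom action executor might look like this:

    import java.util.Properties;
    import org.apache.oozie.action.ActionExecutor;
    import org.apache.oozie.action.ActionExecutorException;
    import org.apache.oozie.client.WorkflowAction;

    public class EchoActionExecutor extends ActionExecutor {

        public EchoActionExecutor() {
            super("echo");                      // element name used in the workflow XML
        }

        @Override
        public void start(Context context, WorkflowAction action) throws ActionExecutorException {
            // Synchronous action: do the work right here, then mark the action as executed.
            String conf = action.getConf();     // the action's XML fragment from the workflow
            System.out.println("echo action conf: " + conf);
            context.setExecutionData("OK", new Properties());
        }

        @Override
        public void end(Context context, WorkflowAction action) throws ActionExecutorException {
            context.setEndData(WorkflowAction.Status.OK, WorkflowAction.Status.OK.toString());
        }

        @Override
        public void check(Context context, WorkflowAction action) throws ActionExecutorException {
            // Nothing to poll for a synchronous action.
        }

        @Override
        public void kill(Context context, WorkflowAction action) throws ActionExecutorException {
            context.setEndData(WorkflowAction.Status.KILLED, "ERROR");
        }

        @Override
        public boolean isCompleted(String externalStatus) {
            return true;
        }
    }

The jar with this class plus the matching XSD is then added to the Oozie webapp, and the executor class is typically registered in oozie-site.xml (for example via the oozie.service.ActionService.executor.ext.classes property).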
Oozie Workflow Client
• Oozie provides an easy way to integrate with enterprise
  applications through the Oozie client APIs. It provides two
  types of APIs:
• REST HTTP API
   A number of HTTP requests:
   • Info requests (job status, job configuration)
   • Job management (submit, start, suspend, resume, kill)
   Example: job definition info request
       GET /oozie/v0/job/job-ID?show=definition
• Java API - package org.apache.oozie.client
   – OozieClient
       start(), submit(), run(), reRunXXX(), resume(), kill(), suspend()
   – WorkflowJob, WorkflowAction
   – CoordinatorJob, CoordinatorAction
   – SLAEvent
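For example, a minimal Java-client sketch (the server URL, application path and property values are placeholders) that runs a workflow and polls its status:

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class SubmitWorkflow {
        public static void main(String[] args) throws Exception {
            OozieClient client = new OozieClient("http://oozie-host:11000/oozie"); // placeholder URL

            Properties conf = client.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/demo/app"); // placeholder path
            conf.setProperty("queueName", "default");

            String jobId = client.run(conf);          // submit + start in one call
            System.out.println("Started job " + jobId);

            while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
                Thread.sleep(10 * 1000);              // poll every 10 seconds
            }
            System.out.println("Final status: " + client.getJobInfo(jobId).getStatus());
        }
    }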
Oozie workflow good, bad and ugly
• Good
   – Nice integration with the Hadoop ecosystem, making it easy to build
     processes encompassing synchronized execution of multiple MapReduce,
     Hive, Pig, etc. jobs.
   – Nice UI for tracking execution progress
   – Simple APIs for integration with other applications
   – Simple extensibility APIs
• Bad
   – The process has to be expressed directly in hPDL, with no visual support
   – No support for uber jars (but we added our own)
• Ugly
   – Static forking (but you can regenerate the workflow and invoke it on the fly)
   – No support for loops
Oozie Coordinator
Coordinator language
- coordinator-app: top-level element in a coordinator instance. Attributes/sub-elements: frequency, start, end.
- controls: specifies the execution policy for the coordinator and its elements (workflow actions). Sub-elements: timeout (actions), concurrency (actions), execution order (workflow instances).
- action: required singular element specifying the associated workflow. The jobs specified in the workflow consume and produce dataset instances. Sub-element: workflow name.
- datasets: collection of data referred to by a logical name. Datasets serve to specify data dependencies between workflow instances.
- input event: specifies the input conditions (in the form of present datasets) that are required in order to execute a coordinator action.
- output event: specifies the dataset that should be produced by a coordinator action.
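Putting these elements together, a coordinator sketch might look like the following (dates, frequencies, dataset URIs and paths are placeholders, and the schema version may differ):

    <coordinator-app name="daily-coord" frequency="${coord:days(1)}"
                     start="2013-01-01T00:00Z" end="2013-12-31T00:00Z" timezone="UTC"
                     xmlns="uri:oozie:coordinator:0.2">
      <controls>
        <timeout>60</timeout>
        <concurrency>1</concurrency>
        <execution>FIFO</execution>
      </controls>
      <datasets>
        <dataset name="probes" frequency="${coord:days(1)}"
                 initial-instance="2013-01-01T00:00Z" timezone="UTC">
          <uri-template>hdfs://namenode/data/probes/${YEAR}/${MONTH}/${DAY}</uri-template>
        </dataset>
      </datasets>
      <input-events>
        <data-in name="input" dataset="probes">
          <instance>${coord:current(0)}</instance>
        </data-in>
      </input-events>
      <action>
        <workflow>
          <app-path>hdfs://namenode/user/demo/app</app-path>
        </workflow>
      </action>
    </coordinator-app>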
Coordinator lifecycle
Oozie Bundle
Bundle lifecycle

[State diagram] Bundle states: PREP, PREPSUSPENDED, PREPPAUSED, RUNNING, SUSPENDED, PAUSED, SUCCEEDED, KILLED, FAILED. A bundle is created in PREP (from which it can be suspended or paused), moves to RUNNING (from which it can also be suspended or paused), and finishes in SUCCEEDED, KILLED, or FAILED.
Oozie SLA
SLA Navigation
[Entity diagram] SLA_EVENT (event_id, alert_contact, alert_frequency, ..., sla_id, ...) links through sla_id to the coordinator and workflow tables: COORD_JOBS (id, app_name, app_path, ...), COORD_ACTIONS (id, action_number, action_xml, external_id, ...), WF_JOBS (id, app_name, app_path, ...), and WF_ACTIONS (id, conf, console_url, ...).
Using Probes to analyze/monitor Places

• Select probe data for specified time/location
• Validate – Filter - Transform probe data
• Calculate statistics on available probe data
• Distribute data per geo-tiles
• Calculate place statistics (e.g. attendance index)
-------------------------------------------------------------
If exception condition happens, report failure
If all steps succeeded, report success
Workflow as acyclic graph
Workflow – fragment 1
Workflow – fragment 2
Oozie tips and tricks
Configuring workflow
• Oozie provides three overlapping mechanisms to configure a workflow:
  config-default.xml, the job properties file, and job arguments that can
  be passed to Oozie as part of the command-line invocation.
• Oozie processes these three sets of parameters as
  follows:
    – Use all of the parameters from the command-line invocation
    – For remaining unresolved parameters, the job properties file is used
    – Use config-default.xml for everything else
• Although the documentation does not describe clearly when to use
  which, the overall recommendation is as follows:
    – Use config-default.xml for parameters that never change for a
      given workflow
    – Use job properties for parameters that are common for a given
      deployment of a workflow
    – Use command-line arguments for parameters that are specific to
      a given workflow invocation.
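For instance (file contents, property names and the server URL below are illustrative only), a deployment-level job.properties can carry the cluster endpoints while a one-off value is overridden on the command line:

    # job.properties – common for a given deployment
    nameNode=hdfs://namenode:8020
    jobTracker=jobtracker:8021
    oozie.wf.application.path=${nameNode}/user/demo/app

    # a single invocation, overriding one parameter on the command line
    oozie job -oozie http://oozie-host:11000/oozie -config job.properties -DqueueName=adhoc -run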
Accessing and storing process variables
• Accessing
  – Through the arguments passed to the java action’s main() method
• Storing
     // requires java.io.File, java.io.FileOutputStream, java.io.OutputStream, java.util.Properties
     String ooziePropFileName =
            System.getProperty("oozie.action.output.properties");
     OutputStream os = new FileOutputStream(new File(ooziePropFileName));
     Properties props = new Properties();
     props.setProperty(key, value);
     props.store(os, "");
     os.close();
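For the properties to be captured, the java action has to declare <capture-output/>; a downstream node can then read the stored value with the wf:actionData EL function, for example (node and property names are placeholders):

    <decision name="check">
      <switch>
        <case to="fullRun">${wf:actionData('prepare-node')['mode'] eq 'full'}</case>
        <default to="partialRun"/>
      </switch>
    </decision>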
Validating data presence
• Oozie provides two possible approaches for validating that
  resource file(s) are present:
   – Using the Oozie coordinator’s input events based on a dataset –
     technically the simplest implementation approach, but it does
     not provide the more complex decision support that might be
     required. It just either runs the corresponding workflow or not.
   – A custom java node inside the Oozie workflow – allows extending the
     decision logic by sending notifications about data absence, running
     execution on partial data under certain timing conditions, etc.
• Additional configuration parameters for the Oozie coordinator,
  for example the ability to wait for files’ arrival, can expand
  the usage of the Oozie coordinator.
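A java node of the second kind might check presence directly against HDFS, roughly as sketched below (the path argument, property name and downstream handling are illustrative assumptions):

    import java.io.File;
    import java.io.FileOutputStream;
    import java.util.Properties;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DataPresenceCheck {
        public static void main(String[] args) throws Exception {
            // args[0] is the directory to check, passed from the workflow definition
            FileSystem fs = FileSystem.get(new Configuration());
            boolean present = fs.exists(new Path(args[0]));

            // Publish the result for a downstream decision node (requires <capture-output/>)
            Properties props = new Properties();
            props.setProperty("dataPresent", Boolean.toString(present));
            try (FileOutputStream os = new FileOutputStream(
                    new File(System.getProperty("oozie.action.output.properties")))) {
                props.store(os, "");
            }
        }
    }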
Invoking MapReduce jobs
• Oozie provides two different ways of invoking a MapReduce
  job – the MapReduce action and the java action.
• Invoking a MapReduce job with a java action is somewhat
  similar to invoking the job with the Hadoop command line
  from an edge node. You specify a driver as the class for the
  java action and Oozie invokes the driver. This approach
  has two main advantages:
   – The same driver class can be used both for running the MapReduce
     job from an edge node and as a java action in an Oozie
     process.
   – A driver provides a convenient place for executing additional
     code, for example clean-up required for MapReduce execution.
• The driver requires a proper shutdown hook to ensure that
  there are no lingering MapReduce jobs.
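A driver sketch along these lines, with a shutdown hook that kills the job if the launcher is terminated (job wiring and paths are placeholders; mapper/reducer setup is omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ProbeJobDriver {
        public static void main(String[] args) throws Exception {
            final Job job = Job.getInstance(new Configuration(), "probe-stats"); // placeholder name
            job.setJarByClass(ProbeJobDriver.class);
            // ... mapper/reducer classes and other job settings go here ...
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // If the launcher (Oozie java action or edge-node shell) is killed,
            // take the MapReduce job down with it so nothing lingers.
            Runtime.getRuntime().addShutdownHook(new Thread() {
                @Override
                public void run() {
                    try {
                        if (!job.isComplete()) {
                            job.killJob();
                        }
                    } catch (Exception e) {
                        // best effort during shutdown
                    }
                }
            });

            if (!job.waitForCompletion(true)) {
                throw new RuntimeException("MapReduce job failed");
            }
        }
    }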
Implementing predefined looping and forking
• hPDL is an XML document with a well-defined schema.
• This means that the actual workflow can be easily manipulated using
  JAXB objects, which can be generated from the hPDL schema using the
  xjc compiler.
• This means that we can create the complete workflow programmatically,
  based on the calculated number of fork branches, or implement loops as
  repeated actions.
• The other option is to create a template process and modify it based
  on calculated parameters.
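A rough sketch of the programmatic approach (the generated package name is an assumption about what xjc produces from your hPDL schema version; the file names and branch-generation step are placeholders):

    import java.io.File;
    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.JAXBElement;
    import javax.xml.bind.Marshaller;
    import javax.xml.bind.Unmarshaller;

    public class WorkflowGenerator {
        public static void main(String[] args) throws Exception {
            // Classes in org.apache.oozie.workflow.generated are assumed to be produced by
            // running xjc against the hPDL schema; actual names depend on the schema version.
            JAXBContext ctx = JAXBContext.newInstance("org.apache.oozie.workflow.generated");

            Unmarshaller u = ctx.createUnmarshaller();
            @SuppressWarnings("unchecked")
            JAXBElement<Object> root =
                    (JAXBElement<Object>) u.unmarshal(new File("workflow-template.xml"));

            Object workflowApp = root.getValue();
            // ... add one <action> and one fork <path> per calculated branch here,
            // or copy a template action N times to emulate a loop ...

            Marshaller m = ctx.createMarshaller();
            m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
            m.marshal(root, new File("workflow.xml"));   // then copy to HDFS and submit
        }
    }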
Oozie client security (or lack thereof)
• By default the Oozie client reads the client’s identity from the
  local machine OS and passes it to the Oozie server,
  which uses this identity for MR job invocation.
• Impersonation can be implemented by overriding the OozieClient
  class’s createConfiguration method, so that the client’s user name
  can be set through a new constructor:
         public Properties createConfiguration() {
             Properties conf = new Properties();
             if(user == null)
                conf.setProperty(USER_NAME, System.getProperty("user.name"));
             else
                conf.setProperty(USER_NAME, user);
             return conf;
          }
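In context, the override would sit in a small subclass along the lines below; the class name and the extra constructor are our own additions, not part of the Oozie client API:

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;

    public class ImpersonatingOozieClient extends OozieClient {
        private final String user;

        public ImpersonatingOozieClient(String oozieUrl, String user) {
            super(oozieUrl);
            this.user = user;    // identity to run MR jobs as, instead of the local OS user
        }

        @Override
        public Properties createConfiguration() {
            Properties conf = new Properties();
            if (user == null) {
                conf.setProperty(USER_NAME, System.getProperty("user.name"));
            } else {
                conf.setProperty(USER_NAME, user);
            }
            return conf;
        }
    }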
uber jars with Oozie
An uber jar contains resources: other jars, .so libraries, zip files.

[Diagram] The Oozie server starts a launcher java action that runs the uber-jar launcher class. The launcher unpacks the resources into the current uber jar directory, sets an inverse classloader, invokes the MR driver passing through the arguments, sets a shutdown hook and waits for the mappers to complete.

  <java>
     …
    <main-class>${wfUberLauncher}</main-class>
    <arg>-appStart=${wfAppMain}</arg>
     …
  </java>
