Oozie: Scheduling Workflows on the Grid

Mohammad K Islam
kamrul@yahoo-inc.com

Agenda

•  Oozie Overview
•  Oozie 3.x features:
   –  Bundle
   –  Scalability
   –  Usability
•  Challenges
•  Future Plan
•  Q&A


Overview: Workflow

•  Oozie executes a workflow defined as a DAG of jobs.
•  Job types include Map-Reduce, Pipes, Streaming, Pig, custom Java code, etc.
•  Introduced in Oozie 1.x.

[Figure: example workflow DAG. Nodes: start, M/R job, fork, M/R streaming job, Pig job, join, decision (branches MORE and ENOUGH), M/R job, Java job, FS job, end.]
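
To make the "DAG of jobs" idea concrete, here is a minimal sketch that computes one valid run order for a small workflow graph. The node names and edges are assumptions chosen for illustration (they cover only the fork/join part of a workflow), and this is not how Oozie itself schedules nodes.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each node maps to the set of nodes it waits for.
# Names and edges are illustrative assumptions, not the slide's exact diagram.
workflow = {
    "start": set(),
    "mr_job": {"start"},
    "fork": {"mr_job"},
    "streaming_job": {"fork"},
    "pig_job": {"fork"},
    "join": {"streaming_job", "pig_job"},
    "end": {"join"},
}


def execution_order(dag):
    """Return one order in which every node runs after its predecessors."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow)
print(order)
```

The two jobs between fork and join have no dependency on each other, so they may appear in either order; a real engine could run them in parallel.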

Overview: Coordinator

•  Oozie executes workflows based on:
   –  Time dependency (frequency)
   –  Data dependency
•  Introduced in Oozie 2.x.


[Figure: architecture. Oozie Client → WS API → Oozie Server (Oozie Coordinator, Oozie Workflow) → Hadoop, with a "Check Data Availability" link from the Oozie Coordinator.]
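
The two trigger types can be sketched as follows: a time dependency produces one nominal run time per frequency step, and a data dependency holds a run until its input directory exists. The function names, the directory pattern and the sample dates are assumptions for illustration only; they are not Oozie APIs.

```python
from datetime import datetime, timedelta


def nominal_times(start, end, frequency_minutes):
    """Yield the nominal time of each coordinator action in [start, end)."""
    current = start
    step = timedelta(minutes=frequency_minutes)
    while current < end:
        yield current
        current += step


def ready_actions(times, available_dirs, dir_pattern):
    """Return the nominal times whose input directory is already available."""
    return [t for t in times if t.strftime(dir_pattern) in available_dirs]


# Hypothetical hourly coordinator over a three-hour window.
times = list(nominal_times(datetime(2011, 6, 1, 0), datetime(2011, 6, 1, 3), 60))
available = {"/data/2011/06/01/00", "/data/2011/06/01/02"}
ready = ready_actions(times, available, "/data/%Y/%m/%d/%H")
print(ready)
```

Here the 01:00 action stays waiting because its input directory has not arrived yet.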

Oozie 3.x: Bundle

•  Users can define and execute a set of coordinator applications as one bundle.
•  Users can start, stop, suspend, resume and rerun at the bundle level.
•  Benefits: Large data pipeline applications become easy to maintain and control for the Service Engineering team.

[Figure: architecture. Oozie Client → WS API → Oozie Server (Bundle, Coordinator, Workflow) → Hadoop, with a "Check Data Availability" link.]
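
A bundle-level operation can be pictured as one call that fans out to every coordinator in the bundle. The sketch below is a toy model under that assumption; the class names, job names and status strings are hypothetical and do not mirror Oozie's internal classes.

```python
from dataclasses import dataclass, field


@dataclass
class Coordinator:
    name: str
    status: str = "RUNNING"


@dataclass
class Bundle:
    name: str
    coordinators: list = field(default_factory=list)

    def apply(self, new_status):
        """Apply one bundle-level operation to every coordinator in the bundle."""
        for coord in self.coordinators:
            coord.status = new_status


pipeline = Bundle("pipeline", [Coordinator("coord-job-1"), Coordinator("coord-job-2")])
pipeline.apply("SUSPENDED")
statuses = [c.status for c in pipeline.coordinators]
print(statuses)
```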

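Oozie Abstraction Layers

[Figure: the three Oozie abstraction layers.]

•  Layer 1: Bundle
•  Layer 2: Coordinator jobs (Coord Job 1, Coord Job 2), each with Coord Action 1 and Coord Action 2; every coordinator action runs a workflow job (WF Job 1 for Coord Job 1, WF Job 2 for Coord Job 2)
•  Layer 3: Workflow jobs made up of M/R, Pig and FS jobs
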
Enhanced Stability and Scalability

•  Issue:
   –  At very high load, Oozie becomes slow.
   –  This accounts for 90% of all Oozie support incidents.
•  Reason:
   –  Many active but non-progressing jobs.
   –  The Oozie internal queue is full.
•  Resolution:
   –  Throttle the number of active jobs per coordinator.
   –  Put the job into a timeout state.
   –  Enforce uniqueness of Oozie queue elements.
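
Two of the resolutions above, a bounded queue and unique queue elements, can be sketched with a small data structure. This is an illustration under assumed names (`UniqueQueue`, the key format, the size limit); it is not Oozie's actual queue implementation.

```python
from collections import deque


class UniqueQueue:
    """FIFO queue that rejects a command whose key is already queued."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._items = deque()
        self._keys = set()

    def offer(self, key, command):
        """Queue the command; return False if it is a duplicate or the queue is full."""
        if key in self._keys or len(self._items) >= self.max_size:
            return False
        self._keys.add(key)
        self._items.append((key, command))
        return True

    def poll(self):
        """Remove and return the oldest command, or None if the queue is empty."""
        if not self._items:
            return None
        key, command = self._items.popleft()
        self._keys.discard(key)
        return command


queue = UniqueQueue(max_size=2)
results = [
    queue.offer("coord-1:check", "check coord-1"),
    queue.offer("coord-1:check", "check coord-1"),  # duplicate key, rejected
    queue.offer("coord-2:check", "check coord-2"),
    queue.offer("coord-3:check", "check coord-3"),  # queue is full, rejected
]
print(results)
```

Rejecting duplicates keeps a non-progressing job from filling the queue with copies of the same command.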


Improved Usability

•  Issue:
   –  The coordinator job's status is not intuitive and confuses Oozie users.
•  Reason:
   –  Status SUCCEEDED doesn't mean the job was successful!
   –  Status PREMATER is for Oozie internal use only, but it was exposed to users.
•  Resolution:
   –  Redesign the coordinator status.

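Coordinator Status Redesign

[Figure: coordinator job state diagrams, before and after the redesign.]

•  Current states: PREP, PREMATER, RUNNING, SUCCEEDED, SUSPENDED, KILLED, FAILED
•  New states: PREP, RUNNING, PAUSED, SUSPENDED, SUCCEEDED, DONE_WITH_ERROR, KILLED, FAILED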
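
One way to see why a DONE_WITH_ERROR state helps is to derive a job-level status from the statuses of its finished actions. The rule below is an assumption made for illustration; the slides do not spell out the exact transition logic, and Oozie's real rules may differ.

```python
def terminal_status(action_statuses):
    """Derive a job-level status from finished action statuses (assumed rule)."""
    if all(s == "SUCCEEDED" for s in action_statuses):
        return "SUCCEEDED"
    if any(s == "SUCCEEDED" for s in action_statuses):
        # Some actions worked and some did not: report it instead of hiding it.
        return "DONE_WITH_ERROR"
    return "FAILED"


print(terminal_status(["SUCCEEDED", "FAILED", "SUCCEEDED"]))
```

Under this assumed rule, SUCCEEDED is reserved for jobs whose every action succeeded, which matches the usability complaint raised earlier.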

The Second Year ...

•  Number of releases:
   –  Feature releases: 3
   –  Patches: 9
•  Backward compatibility is strongly maintained.
•  No need to resubmit jobs if Oozie is restarted.
•  Code overhaul:
   –  Redesigned the command pattern to avoid DB connection leaks and to improve DB connection usage.

Oozie Usage

•  Y! internal usage:
   –  Total number of users: 377
   –  Total number of processed jobs: ≈ 600K/month
•  External downloads:
   –  1500+ from GitHub in the last 8 months
   –  A large number of downloads from packages maintained by third parties.



Oozie Usage (Cont.)

•  User community:
   –  Membership:
      •  Y! internal: 265
      •  External: 163
   –  Messages (approximate):
      •  Y! internal: 9/day
      •  External: 7/day

Challenge 1: Data Availability Check

•  Issue:
   –  Oozie currently checks each directory every minute (polling based).
   –  This increases NameNode (NN) overhead and does not scale well.
•  Reason: No metadata system with an appropriate notification mechanism.
•  Planned resolution: Integrate with the HCatalog metadata system.
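
A back-of-the-envelope calculation shows why per-minute polling scales poorly. The workload figures below (1000 waiting actions, 3 input directories each) are hypothetical numbers chosen for illustration, not measurements from the talk.

```python
def polls_per_day(waiting_actions, dirs_per_action, interval_minutes=1):
    """Estimate NameNode directory checks per day under polling."""
    checks_per_action = (24 * 60) // interval_minutes
    return waiting_actions * dirs_per_action * checks_per_action


# Hypothetical load: 1000 waiting actions, each watching 3 directories.
estimate = polls_per_day(waiting_actions=1000, dirs_per_action=3)
print(estimate)
```

The cost grows linearly with the number of waiting actions, whereas a notification-based system would only do work when data actually arrives.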


Challenge 2: Adaptability to Hadoop

•  Issue: If the Hadoop NameNode (NN) or JobTracker (JT) is down, Oozie still submits the job, which then fails. User intervention is required once the Hadoop server is back.
•  Impact: Inconvenient for Oozie users. For example, if Hadoop is restarted on Friday night, the job will not run until the following Monday.
•  Planned resolution: Graceful handling of Hadoop downtime:
   –  If Hadoop is down, block submission.
   –  When Hadoop becomes available:
      •  Submit the blocked jobs.
      •  Auto-resubmit the untraced jobs.
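
The planned behaviour of holding jobs while Hadoop is down and releasing them afterwards can be sketched as below. The `Submitter` class, its callbacks and the job names are hypothetical; the sketch covers only the blocked-job path, not the auto-resubmission of untraced jobs.

```python
class Submitter:
    """Holds jobs while Hadoop is unreachable and releases them later."""

    def __init__(self, is_hadoop_up, submit):
        self.is_hadoop_up = is_hadoop_up  # callable: True when Hadoop is reachable
        self.submit = submit              # callable that submits one job
        self.blocked = []

    def request(self, job):
        """Submit the job if Hadoop is up, otherwise hold it."""
        if self.is_hadoop_up():
            self.submit(job)
        else:
            self.blocked.append(job)

    def on_hadoop_available(self):
        """Submit every held job in arrival order."""
        pending, self.blocked = self.blocked, []
        for job in pending:
            self.submit(job)


state = {"up": False}  # stand-in for a real health check
submitted = []
submitter = Submitter(lambda: state["up"], submitted.append)
submitter.request("job-1")
submitter.request("job-2")
held = list(submitter.blocked)

state["up"] = True
submitter.on_hadoop_available()
print(held, submitted)
```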


Challenge 3: Horizontal Scalability

•  Issue: One Oozie instance cannot efficiently handle a very large number of jobs (say 100K/hour). In addition, Oozie doesn't support load balancing.
•  Reason: The Oozie internal task queue is not synchronized across multiple Oozie instances.
•  Planned resolution: Use ZooKeeper for coordination.
•  Benefit: As the load increases, add extra Oozie servers.
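
The coordination problem is that two servers must never process the same job at once. The sketch below shows the idea with an in-memory claim table standing in for a shared coordination service; it does not use the ZooKeeper API, and the class, job and server names are hypothetical.

```python
class ClaimTable:
    """In-memory stand-in for a shared coordination service (not ZooKeeper)."""

    def __init__(self):
        self._owners = {}

    def try_claim(self, job_id, server):
        """Record server as owner if the job is unclaimed; True when server owns it."""
        return self._owners.setdefault(job_id, server) == server

    def release(self, job_id, server):
        """Give up the claim, but only if this server holds it."""
        if self._owners.get(job_id) == server:
            del self._owners[job_id]


claims = ClaimTable()
first = claims.try_claim("job-42", "oozie-1")
second = claims.try_claim("job-42", "oozie-2")
print(first, second)
```

In a real deployment the table would have to live outside any single Oozie process, which is the role the slides assign to ZooKeeper.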

Future Plan

•  Automatic failover: Using ZooKeeper.
•  Monitoring: Rich WS API for application monitoring/alerting.
•  Improved usability:
   –  DistCp action
   –  Hive action
•  Asynchronous data processing.
•  Incremental data processing.
•  Apache migration: Work initiated.


Q&A

•  GitHub link: http://yahoo.github.com/oozie
•  Mailing list: Oozie-users@yahoogroups.com

Mohammad K Islam
kamrul@yahoo-inc.com

Backup Slides

Oozie Workflow Application

•  Contents
   –  A workflow.xml file
   –  Resource files, config files and Pig scripts
   –  All necessary JAR and native library files
•  Parameters
   –  The workflow.xml is parameterized; parameters can be propagated to map-reduce, pig and ssh jobs
•  Deployment
   –  In a directory in the HDFS of the Hadoop cluster where the Hadoop and Pig jobs will run
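
Parameterization means that placeholders in workflow.xml are replaced with values supplied at submission time. The sketch below imitates that with Python's `string.Template`, which happens to share the `${name}` placeholder syntax; the XML fragment is a hypothetical excerpt rather than a complete workflow.xml, and Oozie's own resolution mechanism is not shown here.

```python
from string import Template

# Hypothetical configuration fragment, not a complete workflow.xml.
fragment = Template(
    "<property><name>mapred.input.dir</name><value>${input}</value></property>\n"
    "<property><name>mapred.output.dir</name><value>${output}</value></property>"
)

# Values as they might be passed on the command line at submission time.
resolved = fragment.substitute(input="/data/2008/input", output="/data/2008/output")
print(resolved)
```

`substitute` raises `KeyError` when a placeholder has no value, which is a useful property for catching a missing job parameter early.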

Oozie: Running a Workflow Job (cmd)

Workflow Application Deployment

    $ hadoop fs -mkdir hdfs://bar.corp:9000/usr/tucu/wordcount-wf
    $ hadoop fs -mkdir hdfs://bar.corp:9000/usr/tucu/wordcount-wf/lib
    $ hadoop fs -copyFromLocal workflow.xml wordcount.xml hdfs://bar.corp:9000/usr/tucu/wordcount-wf
    $ hadoop fs -copyFromLocal hadoop-examples.jar hdfs://bar.corp:9000/usr/tucu/wordcount-wf/lib
    $

Workflow Job Execution

    $ oozie run -o http://foo.corp:8080/oozie \
                -a hdfs://bar.corp:9000/usr/tucu/wordcount-wf \
                input=/data/2008/input output=/data/2008/output
     Workflow job id [1234567890-wordcount-wf]
    $

Workflow Job Status

    $ oozie status -o http://foo.corp:8080/oozie -j 1234567890-wordcount-wf
     Workflow job status [RUNNING]
     ...
    $


Boost PC performance: How more available memory can improve productivity
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

Oozie Summit 2011

  • 1. Oozie: Scheduling Workflows on the Grid
 Mohammad K Islam
 kamrul@yahoo-inc.com

  • 2. Agenda
 •  Oozie Overview
 •  Oozie 3.x features:
 –  Bundle
 –  Scalability
 –  Usability
 •  Challenges
 •  Future Plan
 •  Q&A


  • 3. Overview: Workflow
 •  Oozie executes a workflow defined as a DAG of jobs.
 •  Job types include Map-Reduce, Pipes, Streaming, Pig, custom Java code, etc.
 •  Introduced in Oozie 1.x.
 [Figure: example workflow DAG with the nodes start, M/R job, fork, M/R streaming job, Pig job, join, decision (branches MORE and ENOUGH), M/R job, Java job, FS job, end]
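A workflow defined as a DAG can only run a job once everything it depends on has finished. The sketch below illustrates that idea with Python's standard `graphlib` module. It is not Oozie code: the node names echo the slide's diagram, but the edges and the `execution_order` helper are assumptions made up for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each node maps to the set of nodes that must finish first.
# Node names follow the slide's diagram; the edges are illustrative assumptions.
WORKFLOW = {
    "mr_job": {"start"},
    "fork": {"mr_job"},
    "streaming_job": {"fork"},
    "pig_job": {"fork"},
    "join": {"streaming_job", "pig_job"},
    "end": {"join"},
}


def execution_order(workflow):
    """Return one valid run order; raises graphlib.CycleError if the graph is not a DAG."""
    return list(TopologicalSorter(workflow).static_order())


order = execution_order(WORKFLOW)
print(order)
```

The two jobs between fork and join have no dependency on each other, so a scheduler is free to run them in parallel; the join node is what forces both to complete before the workflow continues.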

  • 4. Overview: Coordinator
 •  Oozie executes a workflow based on:
 –  Time Dependency (Frequency)
 –  Data Dependency
 •  Introduced in Oozie 2.x.
 [Figure: architecture diagram with the components Oozie Client, WS API, Oozie Server (containing Oozie Coordinator and Oozie Workflow), "Check Data Availability", and Hadoop]
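The two trigger conditions, time and data, can be sketched as follows. This is a minimal illustration and not the Oozie implementation: `nominal_times`, `ready_actions` and the `data_available` callback are hypothetical names, and the real coordinator expresses these rules in its XML application definition.

```python
from datetime import datetime, timedelta


def nominal_times(start, end, frequency):
    """Yield every scheduled time in the half-open interval [start, end)."""
    current = start
    while current < end:
        yield current
        current += frequency


def ready_actions(start, end, frequency, now, data_available):
    """Return the scheduled times that have arrived AND whose input data exists."""
    return [
        t
        for t in nominal_times(start, end, frequency)
        if t <= now and data_available(t)
    ]


# Hypothetical usage: an hourly job where only two hours of input have landed.
landed = {datetime(2011, 1, 1, 0), datetime(2011, 1, 1, 2)}
ready = ready_actions(
    start=datetime(2011, 1, 1, 0),
    end=datetime(2011, 1, 1, 4),
    frequency=timedelta(hours=1),
    now=datetime(2011, 1, 1, 2, 30),
    data_available=lambda t: t in landed,
)
```

An action whose time has come but whose data is missing simply stays out of the ready list and is re-evaluated on the next check.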

  • 5. Oozie 3.x: Bundle
 •  Users can define and execute a set of coordinator applications.
 •  Users can start/stop/suspend/resume/rerun at the bundle level.
 •  Benefits: Large data pipeline applications become easy to maintain and control for the Service Engineering team.
 [Figure: architecture diagram with the components Oozie Client, WS API, Oozie Server (containing Bundle, Coordinator and Workflow), "Check Data Availability", and Hadoop]

  • 6. Oozie Abstraction Layers
 [Figure: three abstraction layers]
 –  Layer 1: Bundle
 –  Layer 2: Coord Job 1 and Coord Job 2, each with Coord Action 1 and Coord Action 2
 –  Layer 3: WF jobs (WF Job 1, WF Job 2) made up of PIG, M/R and FS jobs
  • 7. Enhanced Stability and Scalability
 •  Issue:
 –  At very high load, Oozie becomes slow.
 –  This accounts for 90% of all Oozie support incidents.
 •  Reason:
 –  Many active but non-progressing jobs.
 –  The Oozie internal queue is full.
 •  Resolution:
 –  Throttle the number of active jobs per coordinator.
 –  Put the job into a timeout state.
 –  Enforce uniqueness of Oozie queue elements.
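Two of the resolutions above, per-coordinator throttling and unique queue elements, can be illustrated with a small in-memory queue. This is a sketch under stated assumptions, not Oozie's internal queue: the class `UniqueThrottledQueue`, its methods and the limit parameter are all hypothetical.

```python
from collections import deque


class UniqueThrottledQueue:
    """Illustrative queue that rejects duplicates and caps active work per coordinator."""

    def __init__(self, max_active_per_coordinator):
        self._limit = max_active_per_coordinator
        self._queue = deque()
        self._queued = set()  # keys currently waiting in the queue
        self._active = {}     # coordinator id -> queued + running count

    def offer(self, coordinator_id, action_id):
        """Try to enqueue; return False for a duplicate or a throttled coordinator."""
        key = (coordinator_id, action_id)
        if key in self._queued:
            return False  # uniqueness: the same element is never queued twice
        if self._active.get(coordinator_id, 0) >= self._limit:
            return False  # throttle: this coordinator already has enough active work
        self._queue.append(key)
        self._queued.add(key)
        self._active[coordinator_id] = self._active.get(coordinator_id, 0) + 1
        return True

    def poll(self):
        """Remove and return the oldest element, or None if the queue is empty."""
        if not self._queue:
            return None
        key = self._queue.popleft()
        self._queued.discard(key)
        return key

    def finish(self, coordinator_id):
        """Release one active slot once a polled element has completed."""
        if self._active.get(coordinator_id, 0) > 0:
            self._active[coordinator_id] -= 1
```

Counting an element as active from the moment it is queued until `finish` is called keeps a coordinator with many stalled jobs from filling the queue and starving the others.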


  • 8. Improved Usability
 •  Issue:
 –  The coordinator job's status is not intuitive and confuses Oozie users.
 •  Reason:
 –  Status SUCCEEDED doesn't mean the job was successful!
 –  Status PREMATER is for Oozie internal use only, but it was exposed to users.
 •  Resolution:
 –  Redesign the coordinator status.

  • 9. Coordinator Status Redesign
 •  Current: SUSPENDED, KILLED, PREP, PREMATER, RUNNING, SUCCEEDED, FAILED
 •  New: SUSPENDED, KILLED, SUCCEEDED, PREP, RUNNING, DONE_WITH_ERROR, PAUSED, FAILED
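The slides list the new status names without spelling out the transition rules. The sketch below shows one plausible reading of why DONE_WITH_ERROR helps: a finished coordinator job is reported as SUCCEEDED only if every action succeeded. The function `coordinator_status` and its rules are assumptions for illustration, not the documented Oozie state machine.

```python
# Action outcomes after which no further progress is expected (assumed set).
TERMINAL = {"SUCCEEDED", "FAILED", "KILLED"}


def coordinator_status(action_statuses):
    """Derive a job-level status from its action statuses (illustrative rules only)."""
    if not action_statuses:
        return "PREP"  # nothing has been materialized yet
    if any(status not in TERMINAL for status in action_statuses):
        return "RUNNING"  # at least one action can still make progress
    if all(status == "SUCCEEDED" for status in action_statuses):
        return "SUCCEEDED"
    return "DONE_WITH_ERROR"  # finished, but some action failed or was killed
```

Under the old scheme a job in the last case could still be shown as SUCCEEDED, which is the confusion the redesign addresses.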

  • 10. The Second Year ...
 •  Number of Releases
 –  Feature Releases: 3
 –  Patches: 9
 •  Backward compatibility is strongly maintained.
 •  No need to resubmit a job if Oozie is restarted.
 •  Code Overhaul:
 –  Re-designed the command pattern to avoid DB connection leaks and to improve DB connection usage.

  • 11. Oozie Usages
 •  Y! internal usage:
 –  Total number of users: 377
 –  Total number of processed jobs ≈ 600K/month
 •  External downloads:
 –  1500+ in the last 8 months from GitHub
 –  A large number of further downloads through packaging maintained by 3rd parties.



  • 12. Oozie Usages Cont.
 •  User Community:
 –  Membership
     •  Y! internal: 265
     •  External: 163
 –  Messages (approximate):
     •  Y! internal: 9/day
     •  External: 7/day

  • 13. Challenges 1: Data Availability Check
 •  Issue:
 –  Oozie currently checks the directory every minute (polling based).
 –  This increases NameNode (NN) overhead and does not scale well.
 •  Reason: No metadata system with an appropriate notification mechanism.
 •  Planned resolution: Integrate with the HCatalog metadata system.


  • 14. Challenges 2: Adaptability to Hadoop
 •  Issues: If the Hadoop NameNode (NN) or JobTracker (JT) is down, Oozie still submits the job, which then fails. User intervention is required once the Hadoop server is back.
 •  Impact: Inconvenient for Oozie users. For example, if Hadoop is restarted on Friday night, the job will not run until next Monday.
 •  Planned Resolution: Graceful handling of Hadoop downtime:
 –  If Hadoop is down, block submission.
 –  When Hadoop becomes available:
     •  Submit the blocked job.
     •  Auto-resubmit the untraced job.
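The planned behaviour, holding jobs back while Hadoop is down and releasing them when it returns, can be sketched as a small wrapper around the submit call. Everything here is hypothetical: `BlockingSubmitter`, the `is_hadoop_up` health check and the `submit` callback stand in for whatever Oozie would actually use, and the sketch does not cover re-submitting untraced jobs.

```python
class BlockingSubmitter:
    """Illustrative submitter that parks jobs while the cluster is unavailable."""

    def __init__(self, is_hadoop_up, submit):
        self._is_hadoop_up = is_hadoop_up  # callable returning True when Hadoop is reachable
        self._submit = submit              # callable that actually submits one job
        self._blocked = []                 # jobs waiting for Hadoop, in arrival order

    def request(self, job):
        """Submit immediately if Hadoop is up; otherwise block the job. Returns True if submitted."""
        if self._is_hadoop_up():
            self._submit(job)
            return True
        self._blocked.append(job)
        return False

    def on_hadoop_available(self):
        """Submit every blocked job in arrival order and return how many were released."""
        pending, self._blocked = self._blocked, []
        for job in pending:
            self._submit(job)
        return len(pending)


# Hypothetical usage: the cluster is down for the first two requests.
cluster = {"up": False}
submitted = []
submitter = BlockingSubmitter(lambda: cluster["up"], submitted.append)
submitter.request("job-1")
submitter.request("job-2")
cluster["up"] = True
released = submitter.on_hadoop_available()
```

With this shape, a restart on Friday night no longer needs a person to resubmit on Monday: the blocked jobs go out as soon as the availability callback fires.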


  • 15. Challenges 3: Horizontally Scalable
 •  Issues: One instance of Oozie cannot efficiently handle a very large number of jobs (say 100K/hour). In addition, Oozie doesn't support load balancing.
 •  Reason: The Oozie internal task queue is not synchronized across multiple Oozie instances.
 •  Planned Resolution: Use ZooKeeper for coordination.
 •  Benefits: As the load increases, add extra Oozie servers.

  • 16. Future Plan
 •  Automatic Failover: Using ZooKeeper.
 •  Monitoring: Rich WS API for application monitoring/alerting.
 •  Improved Usability:
 –  DistCp action
 –  Hive action
 •  Asynchronous data processing.
 •  Incremental data processing.
 •  Apache Migration: Work initiated.


  • 19. Oozie Workflow Application
 •  Contents
 –  A workflow.xml file
 –  Resource files, config files and Pig scripts
 –  All necessary JAR and native library files
 •  Parameters
 –  The workflow.xml is parameterized; parameters can be propagated to map-reduce, pig & ssh jobs
 •  Deployment
 –  In a directory in the HDFS of the Hadoop cluster where the Hadoop & Pig jobs will run

  • 20. Running a Workflow Job (Oozie cmd)
 Workflow Application Deployment
   $ hadoop fs -mkdir hdfs://usr/tucu/wordcount-wf
   $ hadoop fs -mkdir hdfs://usr/tucu/wordcount-wf/lib
   $ hadoop fs -copyFromLocal workflow.xml wordcount.xml hdfs://usr/tucu/wordcount-wf
   $ hadoop fs -copyFromLocal hadoop-examples.jar hdfs://usr/tucu/wordcount-wf/lib
 Workflow Job Execution
   $ oozie run -o http://foo.corp:8080/oozie -a hdfs://bar.corp:9000/usr/tucu/wordcount-wf \
       input=/data/2008/input output=/data/2008/output
   Workflow job id [1234567890-wordcount-wf]
 Workflow Job Status
   $ oozie status -o http://foo.corp:8080/oozie -j 1234567890-wordcount-wf
   Workflow job status [RUNNING]
   ...