Oozie: Towards a Scalable Workflow Management System for Hadoop

Mohammad Islam and Virag Kothari
Accepted Paper
•  Workshop in ACM/SIGMOD, May 2012.
•  It is a team effort!

Mohammad Islam      Angelo Huang
Mohamed Battisha    Michelle Chiang
Santhosh Srinivasan Craig Peters

Andreas Neumann     Alejandro Abdelnur
Presentation Workflow
Installing Oozie

Step 1: Download the Oozie tarball
curl -O http://mirrors.sonic.net/apache/incubator/oozie/oozie-3.1.3-incubating/oozie-3.1.3-incubating-distro.tar.gz

Step 2: Unpack the tarball
tar -xzvf <PATH_TO_OOZIE_TAR>

Step 3: Run the setup script
bin/oozie-setup.sh -hadoop 0.20.200 ${HADOOP_HOME} -extjs /tmp/ext-2.2.zip

Step 4: Start oozie
bin/oozie-start.sh

Step 5: Check status of oozie
bin/oozie admin -oozie http://localhost:11000/oozie -status
Running an Example

•  Standalone Map-Reduce job
$ hadoop jar /usr/joe/hadoop-examples.jar org.myorg.WordCount inputDir outputDir


•  Using Oozie


Example DAG: Start -> MapReduce wordcount -> End (on OK), or -> Kill (on ERROR)

Workflow.xml:
<workflow-app name=..>
    <start..>
    <action>
        <map-reduce>
        ……
        ……
</workflow-app>
Example Workflow
<action name='wordcount'>
    <map-reduce>
        <configuration>
            <property>
                <name>mapred.mapper.class</name>
                <value>org.myorg.WordCount.Map</value>
            </property>
            <property>
                <name>mapred.reducer.class</name>
                <value>org.myorg.WordCount.Reduce</value>
            </property>
            <property>
                <name>mapred.input.dir</name>
                <value>/usr/joe/inputDir</value>
            </property>
            <property>
                <name>mapred.output.dir</name>
                <value>/usr/joe/outputDir</value>
            </property>
        </configuration>
    </map-reduce>
</action>

Each property corresponds to a setting of the standalone job:
  mapred.mapper.class  = org.myorg.WordCount.Map
  mapred.reducer.class = org.myorg.WordCount.Reduce
  mapred.input.dir     = inputDir
  mapred.output.dir    = outputDir
A Workflow Application
A workflow application has three components:

1) Workflow.xml:
   Contains the job definition

2) Libraries:
   Optional 'lib/' directory containing .jar/.so files

3) Properties file:
   •  Parameterizes the workflow XML
   •  The one mandatory property is oozie.wf.application.path
Workflow Submission
Run Workflow Job

 $ oozie job -run -config job.properties -oozie http://localhost:11000/oozie/
 Workflow ID: 00123-123456-oozie-wrkf-W

Check Workflow Job Status

 $ oozie job -info 00123-123456-oozie-wrkf-W -oozie http://localhost:11000/oozie/
 -----------------------------------------------------------------------
 Workflow Name: test-wf
 App Path: hdfs://localhost:11000/user/your_id/oozie/
 Workflow job status [RUNNING]
  ...
 ------------------------------------------------------------------------
Key Features and Design Decisions
•  Multi-tenant
•  Security
  –  Authenticate every request
  –  Pass appropriate token to Hadoop job
•  Scalability
  –  Vertical: Add extra memory/disk
  –  Horizontal: Add machines
Oozie Job Processing

[Figure: an end user submits a job to the Oozie server (Oozie security); the Oozie server accesses secure Hadoop using Kerberos.]
Oozie-Hadoop Security

 •    Oozie is a multi-tenant system
 •    Jobs can be scheduled to run later
 •    Oozie submits and maintains the Hadoop jobs
 •    Hadoop needs a security token for each request

Question: Who should provide the security token to Hadoop, and how?
Oozie-Hadoop Security Contd.

•  Answer: Oozie
•  How?
  – Hadoop treats Oozie as a super-user
  – Hadoop does not check the end user's credentials
  – Hadoop checks only the credentials of the Oozie process

•  BUT the Hadoop job is executed as the end user.
•  Oozie uses the doAs() functionality of Hadoop.
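
The trust arrangement above can be modelled in a few lines. This is a toy Python sketch, not Hadoop code: the `ToyHadoop` class, the `SUPERUSERS` set, and the user names are all hypothetical. In Hadoop itself the pattern is exposed through the Java `UserGroupInformation` class (`createProxyUser` followed by `doAs`).

```python
class ToyHadoop:
    """Toy stand-in for a Hadoop cluster (hypothetical, for illustration only)."""

    # Assumption: services the cluster trusts to act on behalf of other users.
    SUPERUSERS = {"oozie"}

    def __init__(self):
        self.jobs = []  # list of (job_name, effective_user)

    def submit(self, job_name, caller, do_as=None):
        """Record job_name as run by `do_as` if the caller may impersonate."""
        if do_as is None or do_as == caller:
            effective_user = caller
        elif caller in self.SUPERUSERS:
            # Only the caller's identity is checked; the end user's is not.
            effective_user = do_as
        else:
            raise PermissionError(f"{caller} may not impersonate {do_as}")
        self.jobs.append((job_name, effective_user))
        return effective_user


cluster = ToyHadoop()
# Oozie identifies itself to the cluster, but the job is owned by the end user.
owner = cluster.submit("wordcount", caller="oozie", do_as="joe")
print(owner)
```

In this model a caller outside `SUPERUSERS` that tries to pass `do_as` is rejected, which is why Oozie must authenticate its own users first.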
User-Oozie Security

[Figure: the diagram from "Oozie Job Processing", repeated.]
Why Oozie Security?

•  One user should not be able to modify another user's job
•  Hadoop doesn't authenticate the end user
•  Oozie has to verify its user before passing the job to Hadoop
How does Oozie Support Security?

•  Built-in authentication
  –  Kerberos
  –  Non-secured (default)
•  Design Decision
  –  Pluggable authentication
  –  Easy to add new types of authentication
  –  Yahoo supports 3 types of authentication.
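
One way to picture pluggable authentication is a small handler interface with interchangeable implementations. The sketch below is illustrative Python, not Oozie's actual interface: `Authenticator`, `SimpleAuthenticator`, `TokenAuthenticator`, `handle`, and the request dictionaries are hypothetical names chosen for this example.

```python
from abc import ABC, abstractmethod


class Authenticator(ABC):
    """Hypothetical plug-in point: maps a request to a user name or rejects it."""

    @abstractmethod
    def authenticate(self, request: dict) -> str:
        ...


class SimpleAuthenticator(Authenticator):
    """Non-secured mode: trusts the user name supplied in the request."""

    def authenticate(self, request: dict) -> str:
        return request["user"]


class TokenAuthenticator(Authenticator):
    """Stand-in for a secure scheme: accepts only known tokens."""

    def __init__(self, tokens: dict):
        self._tokens = tokens  # token -> user name

    def authenticate(self, request: dict) -> str:
        token = request.get("token")
        if token not in self._tokens:
            raise PermissionError("authentication failed")
        return self._tokens[token]


def handle(request: dict, authenticator: Authenticator) -> str:
    """Authenticate every request before doing any work."""
    user = authenticator.authenticate(request)
    return f"accepted job from {user}"


print(handle({"user": "joe"}, SimpleAuthenticator()))
print(handle({"token": "t-123"}, TokenAuthenticator({"t-123": "joe"})))
```

Adding another scheme means writing one more `Authenticator` subclass; `handle` does not change.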
Job Submission to Hadoop

•  Oozie is designed to handle thousands of jobs at the same time

•  Question: Should the Oozie server
  –  Submit the Hadoop job directly?
  –  Wait for it to finish?

•  Answer: No
Job Submission Contd.
•  Reason
  –  Resource constraints: A single Oozie process cannot create thousands of threads, one for each Hadoop job. (Scaling limitation)
  –  Isolation: Running user code on the Oozie server might destabilize Oozie
•  Design Decision
  –  Create a launcher Hadoop job
  –  Execute the actual user job from the launcher.
  –  Wait asynchronously for the job to finish.
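
The design decision above can be sketched with a thread pool standing in for the cluster. Everything here is a hypothetical model in Python, not Oozie code: `ToyCluster`, `ToyServer`, and `launcher` are invented names, and the status strings simply reuse the OK and ERROR labels from the example DAG.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ToyCluster:
    """Hypothetical stand-in for a Hadoop cluster that runs submitted jobs."""

    def __init__(self, slots=4):
        self._pool = ThreadPoolExecutor(max_workers=slots)

    def submit(self, fn, *args):
        return self._pool.submit(fn, *args)

    def shutdown(self):
        self._pool.shutdown(wait=True)


def launcher(user_job):
    """Runs on the cluster, not on the server: executes the user's job."""
    return user_job()


class ToyServer:
    """Submits launchers and records results via callbacks; it never blocks."""

    def __init__(self, cluster):
        self._cluster = cluster
        self._lock = threading.Lock()
        self.status = {}

    def run_action(self, name, user_job):
        with self._lock:
            self.status[name] = "RUNNING"
        future = self._cluster.submit(launcher, user_job)
        # The callback fires when the launcher finishes; no server thread waits.
        future.add_done_callback(lambda f: self._on_done(name, f))

    def _on_done(self, name, future):
        with self._lock:
            self.status[name] = "ERROR" if future.exception() else "OK"


cluster = ToyCluster()
server = ToyServer(cluster)
server.run_action("wordcount", lambda: sum(range(10)))
server.run_action("broken", lambda: 1 / 0)
cluster.shutdown()  # demo only: wait so the final statuses can be printed
print(server.status)
```

User code (including the failing job) runs only inside the cluster's workers, so a bad job changes a status entry but cannot crash the server object.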
Job Submission to Hadoop


[Figure: the Oozie Server on one side and, inside the Hadoop cluster, the Job Tracker, the Launcher Mapper, and the Actual M/R Job, linked by arrows numbered 1 to 5.]
Job Submission Contd.

•  Advantages
  –  Horizontal scalability: If load increases, add machines to the Hadoop cluster
  –  Stability: Isolation of user code from the system process
•  Disadvantages
  –  Each job occupies an extra map slot.
Production Setup

•  Total number of nodes: 42K+
•  Total number of clusters: 25+
•  Total number of processed jobs ≈ 750K/month
•  Data presented from two clusters
•  Each of them has nearly 4K nodes
•  Total number of users per cluster = 50
Oozie Usage Pattern @ Y!
[Figure: Distribution of Job Types on Production Clusters. Bar chart of percentage (axis 0 to 50) by job type (fs, java, map-reduce, pig) for Cluster #1 and Cluster #2.]
Experimental Setup

•    Number of nodes: 7
•    Number of map slots: 28
•    4 cores, 16 GB RAM
•    64-bit RHEL
•    Oozie Server
     –  3 GB RAM
     –  Internal queue size = 10 K
     –  Number of worker threads = 300
Job Acceptance
[Figure: Workflow Acceptance Rate. Workflows accepted per minute (axis 0 to 1400) against the number of submission threads (2, 6, 10, 14, 20, 40, 52, 100, 120, 200, 320, 640).]

Observation: Oozie can accept a large number of jobs
Timeline of an Oozie Job

Events, in time order:
  1. User submits job
  2. Oozie submits to Hadoop
  3. Job completes at Hadoop
  4. Job completes at Oozie

Preparation Overhead: the time between events 1 and 2
Completion Overhead: the time between events 3 and 4

Total Oozie Overhead = Preparation + Completion
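
As a worked example of the formula, the sketch below computes both overheads from four timestamps. It assumes preparation overhead is the gap between the user's submission and Oozie's submission to Hadoop, and completion overhead is the gap between completion at Hadoop and completion at Oozie. The `JobTimeline` class and the timestamp values are made up for illustration; they are not measurements from the talk.

```python
from dataclasses import dataclass


@dataclass
class JobTimeline:
    """Timestamps in milliseconds for the four events on the timeline."""
    user_submits: int
    oozie_submits_to_hadoop: int
    completes_at_hadoop: int
    completes_at_oozie: int

    @property
    def preparation_overhead(self) -> int:
        return self.oozie_submits_to_hadoop - self.user_submits

    @property
    def completion_overhead(self) -> int:
        return self.completes_at_oozie - self.completes_at_hadoop

    @property
    def total_overhead(self) -> int:
        return self.preparation_overhead + self.completion_overhead


# Made-up timestamps, not measurements from the talk.
t = JobTimeline(0, 400, 60_400, 60_900)
print(t.preparation_overhead, t.completion_overhead, t.total_overhead)
```

The time the job spends running on Hadoop (here 60,000 ms) is excluded from the overhead by construction.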
Oozie Overhead
[Figure: Per Action Overhead. Overhead in milliseconds (axis 0 to 1800) for workflows with 1, 5, 10, and 50 actions.]

Observation: Per-action overhead is lower when multiple actions are in the same workflow.
Oozie Futures

•  Scalability
  –  Hot-Hot/Load balancing service
  –  Replace SQL DB with ZooKeeper
•  Improved Usability
•  Extend the benchmarking scope
•  Monitoring WS API
Takeaways

•  Oozie is
  –  Easier to use
  –  Scalable
  –  Secure and multi-tenant
Q&A

Mohammad K Islam: kamrul@yahoo-inc.com
Virag Kothari: virag@yahoo-inc.com
http://incubator.apache.org/oozie/
Vertical Scalability

•  Oozie processes user requests asynchronously.
•  Memory resident
  –  An internal queue to store sub-tasks.
  –  A thread pool to process the sub-tasks.
•  Both items
  –  Are fixed in size
  –  Don't change with load variations
  –  Have a configurable size; a larger size might need extra memory
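
A fixed-size queue drained by a fixed-size thread pool can be sketched as follows. This is a minimal Python model, not Oozie's implementation: `QUEUE_SIZE`, `NUM_WORKERS`, `accept`, and `worker` are hypothetical names, and refusing a request when the queue is full is an assumption, since the slides do not say what happens in that case.

```python
import queue
import threading

QUEUE_SIZE = 100   # fixed capacity (the experiment in this talk used 10 K)
NUM_WORKERS = 4    # fixed pool size (the experiment in this talk used 300)

tasks = queue.Queue(maxsize=QUEUE_SIZE)
results = []
results_lock = threading.Lock()


def worker():
    """Pull sub-tasks off the shared queue until a None sentinel arrives."""
    while True:
        task = tasks.get()
        try:
            if task is None:
                break
            value = task()
            with results_lock:
                results.append(value)
        finally:
            tasks.task_done()


workers = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()


def accept(task):
    """Queue a sub-task and return at once; refuse if the queue is full."""
    try:
        tasks.put_nowait(task)
        return True
    except queue.Full:
        return False


accepted = sum(accept(lambda i=i: i * i) for i in range(50))
tasks.join()          # wait until every queued sub-task is processed
for _ in workers:
    tasks.put(None)   # one sentinel per worker
for w in workers:
    w.join()
print(accepted, len(results))
```

Because neither the queue nor the pool grows, memory use stays bounded under load; raising either constant trades memory for capacity.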
Challenges in Scalability

•  Centralized persistent storage
  –  Currently supports any SQL DB (such as Oracle, MySQL, Derby, etc.)
  –  Each DBMS has its own limitations
  –  Oozie scalability is limited by the underlying DBMS
  –  Using ZooKeeper might be an option

Oozie HUG May12
