Azkaban - WorkFlow Scheduler/Automation Engine
1. Hadoop WorkFlow
Scheduler /
Automation Engine
Azkaban & Oozie
Praveen Thirukonda
Senior Associate
Data & Analytics
Orange County, CA
09/11/2014
2. © 2014 KPMG LLP, a Delaware limited liability partnership and the U.S. member firm of the KPMG network of independent member firms
affiliated with KPMG International Cooperative (“KPMG International”), a Swiss entity. All rights reserved. INTERNAL USE ONLY. Not for
distribution to clients unless the technical and policy review requirements of Tax Services Manual section 23.7 are satisfied.
What is a workflow?
- A workflow is a Directed Acyclic Graph
(DAG) of “jobs” where each job has one
or more inputs and outputs.
- A workflow scheduler helps us manage
the coordination among the various
jobs.
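As a rough illustration of the DAG idea above, here is a minimal Python sketch (assuming Python 3.9+ for the standard-library `graphlib` module). The job names are hypothetical and not taken from the slides; the point is only that a valid run order falls out of the dependency graph.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each key maps to the set of jobs it depends on.
workflow = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}


def execution_order(dag):
    """Return one valid run order for the jobs in `dag`.

    Raises graphlib.CycleError if the graph contains a cycle,
    i.e. if it is not actually a DAG.
    """
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow)
print(order)
```

A scheduler does essentially this before launching anything: it refuses cyclic definitions and runs each job only after everything it depends on has finished.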
3. When do we need a workflow scheduler?
- In a data pipeline, batch jobs need to be
scheduled to run periodically.
- They also typically have intricate
dependency chains, for example
dependencies on various data extraction
processes or on previous steps.
- Larger processes might have 50 or 60
steps, of which some might run in
parallel and others must wait for the
output of earlier steps.
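The point about some steps running in parallel while others wait can be sketched as follows. This is a minimal example assuming Python 3.9+ (`graphlib`); the step names are made up for illustration. It groups steps into "waves", where every step in a wave has all of its dependencies satisfied and could run concurrently.

```python
from graphlib import TopologicalSorter


def parallel_waves(dag):
    """Group jobs into waves; all jobs in one wave could run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # get_ready() returns every job whose dependencies are all done.
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


# Hypothetical pipeline: two independent extracts feed a join, then a load.
steps = {
    "extract_a": set(),
    "extract_b": set(),
    "join": {"extract_a", "extract_b"},
    "load": {"join"},
}
waves = parallel_waves(steps)
print(waves)
```

A real scheduler would start each job as soon as its own dependencies finish rather than waiting for a whole wave, but the grouping shows where parallelism is available.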
5. What is Azkaban?
- “cron on steroids”
- A workflow scheduler can be seen as
the cron and make Unix utilities
combined with a friendly UI.
6. What is Azkaban?
- Azkaban was implemented at LinkedIn to
solve the problem of Hadoop job
dependencies.
- Azkaban resolves the ordering through
job dependencies and provides an
easy-to-use web user interface to maintain
and track your workflows.
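To make the dependency mechanism concrete: in Azkaban's classic flow format, each job is a small `.job` file of `key=value` properties (for example `type`, `command`, `dependencies`), and the set of files is uploaded as a zip archive. The Python sketch below generates such files; the job names and `echo` commands are hypothetical, and the exact properties supported depend on the Azkaban version and job type in use.

```python
import io
import zipfile


def job_file(command, dependencies=()):
    """Render the body of one command-type .job file."""
    lines = ["type=command", f"command={command}"]
    if dependencies:
        # Azkaban orders jobs by the names listed here.
        lines.append("dependencies=" + ",".join(dependencies))
    return "\n".join(lines) + "\n"


def build_project_zip(jobs):
    """Pack {job_name: (command, dependencies)} into an in-memory zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, (command, deps) in jobs.items():
            archive.writestr(f"{name}.job", job_file(command, deps))
    return buffer.getvalue()


# Hypothetical three-step flow: extract -> clean -> report.
jobs = {
    "extract": ("echo extract", ()),
    "clean": ("echo clean", ("extract",)),
    "report": ("echo report", ("clean",)),
}
project_zip = build_project_zip(jobs)
print(job_file(*jobs["clean"]))
```

The resulting archive is the kind of artifact that would be uploaded through the web interface; here it is only built in memory.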
7. An image is worth a thousand words.
9. What is Apache Oozie?
- Similar to Azkaban.
- Whereas Azkaban uses a series of
Properties files, Oozie uses an XML file.
- Oozie supports workflow submission
through a Java API and the command line,
in addition to a browser interface and REST API.
- Oozie is part of the Hortonworks
environment on our cluster.
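For contrast with the properties-file approach, here is a rough Python sketch that builds a very small Oozie-style workflow definition as XML. The schema version (`uri:oozie:workflow:0.4`), the workflow and action names, and the HDFS path are all assumptions made for illustration, and the output has not been validated against the Oozie schema.

```python
import xml.etree.ElementTree as ET

WF_NS = "uri:oozie:workflow:0.4"  # assumed schema version


def build_workflow(name, action_name, path):
    """Build a one-action workflow: start -> fs mkdir -> end (or kill on error)."""
    app = ET.Element("workflow-app", {"name": name, "xmlns": WF_NS})
    ET.SubElement(app, "start", {"to": action_name})

    action = ET.SubElement(app, "action", {"name": action_name})
    fs = ET.SubElement(action, "fs")
    ET.SubElement(fs, "mkdir", {"path": path})
    # Transitions: where control goes on success and on failure.
    ET.SubElement(action, "ok", {"to": "end"})
    ET.SubElement(action, "error", {"to": "fail"})

    kill = ET.SubElement(app, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Workflow failed"
    ET.SubElement(app, "end", {"name": "end"})
    return ET.tostring(app, encoding="unicode")


# Hypothetical names and path.
workflow_xml = build_workflow("demo-wf", "make_dir", "${nameNode}/tmp/demo")
print(workflow_xml)
```

Where Azkaban infers ordering from each job's `dependencies` property, an Oozie workflow spells out the control flow explicitly through `start`, `ok`, and `error` transitions.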
10. Advantages of using a workflow scheduler
- Lets you easily manage dependencies among
the various tasks.
- Scheduling of workflows.
- Monitor the progress of your workflow with
a nice interface.
- Email alerts on failures and successes.
- Retrying of failed jobs.
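The retry and alerting behavior listed above can be approximated in a few lines of Python. This is a generic sketch, not how Azkaban or Oozie implement it: `run_with_retries`, its parameters, and the `notify` callback (a stand-in for sending an email) are all hypothetical.

```python
import time


def run_with_retries(job, retries=2, backoff_seconds=0.0, notify=print):
    """Run job(); retry up to `retries` times, then notify and re-raise."""
    attempt = 0
    while True:
        try:
            result = job()
        except Exception as exc:
            if attempt >= retries:
                # Out of retries: alert, then let the failure propagate.
                notify(f"FAILED after {attempt + 1} attempts: {exc}")
                raise
            attempt += 1
            time.sleep(backoff_seconds)
        else:
            notify(f"SUCCEEDED after {attempt + 1} attempts")
            return result


# Hypothetical flaky job: fails twice, then succeeds.
state = {"calls": 0}


def flaky_job():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient error")
    return "done"


messages = []
outcome = run_with_retries(flaky_job, retries=2, notify=messages.append)
```

In a scheduler these knobs are normally configuration on the job or flow rather than code, which is part of the appeal.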
11. Application of a workflow scheduler
- A real-life example of how and where you
might use a workflow scheduler in your
big data system architecture.
13. © 2014 KPMG LLP, a Delaware limited liability partnership and
the U.S. member firm of the KPMG network of independent
member firms affiliated with KPMG International Cooperative
(“KPMG International”), a Swiss entity. All rights reserved.
The KPMG name, logo and “cutting through complexity” are
registered trademarks or trademarks of KPMG International.
Editor's notes: Got raw data from the car, cleaned and preprocessed it on EC2 machines, spun up EMR instances based on the amount of data, copied the data to them, ran MapReduce jobs, ran Hive scripts (which were dynamically created), then used Sqoop to copy the final processed output to a PostgreSQL database, then shut down the EMR instances and ran cleanup operations.
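The pipeline in these notes can be outlined as an ordered list of steps followed by teardown. The Python sketch below uses placeholder functions in place of the real work, and every step name is invented for illustration. Running teardown in a `finally` block, so that the cluster is shut down even when a step fails, is a design choice of this sketch and is not stated in the notes.

```python
def run_pipeline(steps, teardown, log=print):
    """Run steps in order; always run teardown steps, even if a step fails."""
    try:
        for name, step in steps:
            log(f"running {name}")
            step()
    finally:
        for name, step in teardown:
            log(f"teardown {name}")
            step()


def placeholder():
    """Stand-in for real work (EC2 preprocessing, MapReduce, Hive, Sqoop)."""


# Hypothetical step names mirroring the order described in the notes.
steps = [
    ("clean_and_preprocess_on_ec2", placeholder),
    ("start_emr_cluster", placeholder),
    ("copy_data_to_emr", placeholder),
    ("run_mapreduce_jobs", placeholder),
    ("run_generated_hive_scripts", placeholder),
    ("sqoop_export_to_postgresql", placeholder),
]
teardown = [
    ("shut_down_emr_cluster", placeholder),
    ("cleanup", placeholder),
]

run_pipeline(steps, teardown)
```

A workflow scheduler replaces this hand-written loop with declared dependencies, and adds the scheduling, monitoring, and retry features on top.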