Apache Oozie has come a long way and now accounts for over 2.8 million jobs per month on Yahoo's grid infrastructure. If you run Hadoop jobs repeatedly and are looking for a smarter way of doing it, Apache Oozie is the answer. Whether you are running complex data transformation jobs chained one after another or a simple daily data copy, Oozie workflows will help you manage these tasks efficiently. Mona will cover the new features introduced in Apache Oozie 4.x, in particular Apache HCatalog Integration, Job Notifications and SLA Monitoring, for building large-scale and efficient data processing pipelines.
4. Overview
Why Oozie?
The Problem
§ Doing something on the grid often required multiple steps
  § MapReduce job
  § Pig job
  § Streaming job
  § HDFS operation (mkdir, chmod, etc.)…
§ Multiple ad-hoc solutions existed
  § custom job control
  § shell scripts
  § cron…
§ The cost of building and running apps was high
  § development and applications engineering
  § support, operations, and hardware
The Need
§ Workflow scheduler with better support for grid jobs (native integration with Hadoop)
  § orchestrate dependencies between jobs
  § execute at a specific time or on data availability
  § retry jobs in the event of failures (reliable)
  § sync (clocked dataset) awareness
  § async (unspecified freq) data awareness
§ Common framework for communication and execution of production processes
A server-based workflow scheduling system to manage Hadoop jobs
§ Horizontally scalable and extensible system
§ Open-source
§ Workflows to couple resources instead of having a monolithic code base
Yahoo Confidential & Proprietary
5. Overview
Oozie – A Workflow Engine
§ Oozie executes workflows defined as a DAG of jobs
§ Job types include MapReduce, Pig, Hive, shell script, custom Java code, etc.
§ Introduced in Oozie 1.x
[Diagram: example workflow DAG with M/R, Pig, FS and Java jobs connected through start, fork, join, decision (MORE / ENOUGH), kill (on ERROR) and end nodes.]
Control-flow nodes (start, kill, end | fork, join, decision)
Action nodes (map reduce, pig, hive, distcp, java, fs, sub-workflow, shell, ssh, email)
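The DAG model on this slide can be illustrated with a minimal Python sketch. It is not how Oozie is implemented: the node names and edges are hypothetical, loosely modelled on the diagram, and the standard-library graphlib.TopologicalSorter (Python 3.9+) stands in for the engine's own scheduling.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key maps a node to the nodes it depends on.
# The names are illustrative only; they do not come from a real workflow.xml.
WORKFLOW = {
    "start": set(),
    "mr_job_1": {"start"},
    "fork": {"mr_job_1"},
    "mr_job_2": {"fork"},   # the two branches after the fork
    "pig_job": {"fork"},    # have no dependency on each other
    "join": {"mr_job_2", "pig_job"},
    "fs_job": {"join"},
    "end": {"fs_job"},
}


def execution_waves(dag):
    """Group nodes into waves; nodes in one wave could run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(WORKFLOW), start=1):
        print(number, wave)
```

The fork shows up as the one wave holding two nodes, and the join as the node that becomes ready only after both branches are done.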
6. Overview
Example M/R Action
[Figure: map-reduce action XML annotated with callouts for the JobTracker (JT) and NameNode (NN), Mapper, Reducer, Input Directory, Output Directory and Queue Name.]
8. Overview
Oozie (Coordinator) – A Scheduler
§ Oozie executes workflows based on
  § time dependency (frequency)
  § data dependency
§ Introduced in Oozie 2.x
[Diagram: the Oozie Client talks to the Oozie Server through the WS API. Inside the server, the Oozie Coordinator checks data availability in HDFS/HCat and starts the Oozie Workflow.]
9. Overview
Oozie (Bundle) – A Pipeline Framework
§ Users can define and execute a "bundle" of coordinator apps
  § large-scale data processing (inter-related coordinators)
  § operability and manageability of pipelines
§ Users can start/stop/suspend/resume/rerun at the bundle level
§ Introduced in Oozie 3.x; bundles are optional
[Diagram: the Oozie Client talks to the Oozie Server through the WS API. Inside the server, a Bundle groups Oozie Coordinators, which check data availability in HDFS/HCat and start Oozie Workflows.]
14. Use Cases and Common Patterns
Use Case 1: Time Triggers
Execute your workflow every 15 minutes
[Timeline: the workflow runs at 00:15, 00:30, 00:45 and 01:00.]
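As a rough illustration of a pure time trigger, the Python sketch below lists the nominal times at which a 15-minute schedule would fire. The function name and the choice to treat the end of the range as exclusive are assumptions made for this example; they are not taken from Oozie.

```python
from datetime import datetime, timedelta, timezone


def nominal_times(start, end, frequency_minutes):
    """Yield every nominal time in [start, end), frequency_minutes apart."""
    step = timedelta(minutes=frequency_minutes)
    current = start
    while current < end:
        yield current
        current += step


if __name__ == "__main__":
    start = datetime(2013, 3, 12, 0, 15, tzinfo=timezone.utc)
    end = datetime(2013, 3, 12, 1, 15, tzinfo=timezone.utc)
    for moment in nominal_times(start, end, 15):
        print(moment.strftime("%H:%M"))
```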
15. Use Cases and Common Patterns
Use Case 2: Time and Data Triggers
Materialize your workflow every hour, but run it only when the input data (which is loaded to the grid every hour) is ready
[Timeline: at 01:00, 02:00, 03:00 and 04:00 the coordinator checks whether the input data exists on Hadoop before running the workflow.]
16. Use Cases and Common Patterns
Use Case 2: Time and Data Triggers
<coordinator-app name="coord1" frequency="${coord:hours(1)}" …>
  <!-- Dataset definition -->
  <datasets>
    <dataset name="logs" frequency="${coord:hours(1)}" initial-instance="2009-01-01T23:59Z">
      <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <!-- Input events definition: current(0) is resolved with the time at which
       the coordinator action is materialized (created) -->
  <input-events>
    <data-in name="inputLogs" dataset="logs">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <!-- Action definition -->
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property>
          <name>inputData</name>
          <value>${coord:dataIn('inputLogs')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
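To make the dataset definition more concrete, here is a small Python sketch of how an instance such as current(0) could be mapped to a path. It is a simplified model, not Oozie's code: it assumes instances are counted in whole steps from the dataset's initial-instance, ignores time zones other than UTC, and does not handle nominal times before the initial instance. The function names are made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the dataset definition on this slide.
URI_TEMPLATE = "hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}"
INITIAL_INSTANCE = datetime(2009, 1, 1, 23, 59, tzinfo=timezone.utc)
DATASET_FREQUENCY = timedelta(hours=1)


def current_instance(nominal_time, offset=0):
    """Latest dataset instance at or before nominal_time, shifted by offset steps."""
    elapsed_steps = (nominal_time - INITIAL_INSTANCE) // DATASET_FREQUENCY
    return INITIAL_INSTANCE + (elapsed_steps + offset) * DATASET_FREQUENCY


def resolve_uri(instance):
    """Fill the ${YEAR}, ${MONTH}, ${DAY} and ${HOUR} placeholders."""
    values = {
        "${YEAR}": f"{instance.year:04d}",
        "${MONTH}": f"{instance.month:02d}",
        "${DAY}": f"{instance.day:02d}",
        "${HOUR}": f"{instance.hour:02d}",
    }
    uri = URI_TEMPLATE
    for placeholder, value in values.items():
        uri = uri.replace(placeholder, value)
    return uri


if __name__ == "__main__":
    nominal = datetime(2009, 1, 2, 1, 0, tzinfo=timezone.utc)
    print(resolve_uri(current_instance(nominal)))
```

Under these assumptions an action materialized at 2009-01-02T01:00Z maps to the 00:59Z instance, because the dataset's initial-instance sits at minute 59.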
17. Use Cases and Common Patterns
Use Case 3: Rolling Window
Access 15-minute datasets and roll them up into hourly datasets
[Timeline: the 00:15, 00:30, 00:45 and 01:00 instances roll up into the 01:00 hourly dataset; the 01:15, 01:30, 01:45 and 02:00 instances roll up into the 02:00 hourly dataset.]
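The rolling window can be sketched in Python as follows. This only models which 15-minute instances feed each hourly run; the function name and defaults are assumptions for illustration.

```python
from datetime import datetime, timedelta


def rolling_window(nominal_time, instance_minutes=15, instances_per_run=4):
    """Return the instances rolled up by the run at nominal_time, oldest first."""
    step = timedelta(minutes=instance_minutes)
    return [nominal_time - k * step for k in range(instances_per_run - 1, -1, -1)]


if __name__ == "__main__":
    for run in (datetime(2013, 3, 12, 1, 0), datetime(2013, 3, 12, 2, 0)):
        inputs = [t.strftime("%H:%M") for t in rolling_window(run)]
        print(run.strftime("%H:%M"), "<-", inputs)
```

Consecutive hourly runs share no input instance, which is what distinguishes a rolling window from a sliding one.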
18. Use Cases and Common Patterns
Use Case 4: Sliding Window
Access the last 24 hours of data and roll it up every hour
[Timeline: each hourly run consumes the previous 24 hourly instances. 01:00 through 24:00 feed the 24:00 run; 02:00 through +1 day 01:00 feed the +1 day 01:00 run; 03:00 through +1 day 02:00 feed the +1 day 02:00 run.]
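A sliding window differs from a rolling one in that successive runs reuse most of their inputs. The Python sketch below, with a hypothetical function name, lists the 24 hourly instances for one run and counts how many the next run shares with it.

```python
from datetime import datetime, timedelta


def sliding_window(nominal_time, hours=24):
    """Return the last `hours` hourly instances ending at nominal_time, oldest first."""
    return [nominal_time - timedelta(hours=k) for k in range(hours - 1, -1, -1)]


if __name__ == "__main__":
    first_run = sliding_window(datetime(2013, 3, 13, 0, 0))
    next_run = sliding_window(datetime(2013, 3, 13, 1, 0))
    print(first_run[0], "to", first_run[-1])
    print("shared instances:", len(set(first_run) & set(next_run)))
```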
19. Where are We Today
Proven Scale and Multi-tenancy
§ 2.8 M jobs/month
§ 13,000 jobs/server/day
§ 16% of all Hadoop jobs
§ 75 products
§ 255 monthly users
§ 2,000+ projects
§ 5.4 M compute hrs/month
§ 770,000 workflows
§ Between 1 and 8 actions per workflow (avg. 4 actions/workflow)
§ 250 coordinator jobs/day
§ 17 clusters
§ 67% of Oozie jobs kicked off through the coordinator
20. Where are We Today
Mix Of Job Types For Workflows
[Chart: stacked bar of workflow job types (Pig, MapReduce, Java, Other) as a percentage of jobs; the segments are 39%, 29%, 28% and 4%.]
SAMPLE USE OF JOB TYPES (Pig, MapReduce, Java, Others)
§ Data processing/filtering
§ Aggregation
§ Publishing data (HDFS/HCat)
§ Legacy code and logic
§ DistCp and shell
§ Data copy/transfer
22. What’s New in Oozie
Existing Features (Oozie 3.x)
§ HBase access through Oozie, via credentials
§ HCatalog access through Oozie, via credentials
§ Email action
§ DistCp action (intra as well as inter-cluster copy)
§ Shell action (run any script e.g. perl, python, hadoop CLI)
§ Workflow dry-run & Fork-Join validation
§ Bulk monitoring (REST API)
§ Coordinator EL functions for parameterized workflows
§ Job DAG
23. What’s New in Oozie
HBase Credentials
§ Add in workflow.xml
  § Add a "credentials" section. The type is "hbase".
  § Specify the action (for example, a java action) that uses the credentials.
  § Put hbase-site.xml in the Oozie application path, and use <file> in workflow.xml to put hbase-site.xml in the distributed cache. A copy of hbase-site.xml can be found at gateway:/home/gs/conf/hbase/hbase-site.xml.
  § Put the JARs guava-*.jar, zookeeper-*.jar, hbase-*.jar and protobuf-java-*.jar in the workflow "lib" dir.
§ Make sure you are using Oozie XSD version 0.3 or above for the tag.
<workflow-app name="foo-wf" xmlns="uri:oozie:workflow:0.3">
  <credentials>
    <credential name="hbase.cert" type="hbase">
      <!-- optional properties: zookeeper.znode.parent, hbase.zookeeper.quorum -->
    </credential>
  </credentials>
  <start to="map-reduce-action"/>
  <action name="map-reduce-action" cred="hbase.cert">
    <map-reduce>
      <configuration>
        <property>
          <name>mapred.mapper.class</name>
          <value>SampleMapperHBase</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>org.apache.oozie.example.DemoReducer</value>
        </property>
      </configuration>
      <file>hbase-site.xml#hbase-site.xml</file>
    </map-reduce>
    ...
  </action>
  ...
</workflow-app>
§ Refer to http://twiki.corp.yahoo.com/view/CCDI/UseHbaseCred
25. What’s New in Oozie
Feature 1: HCatalog Integration
§ Oozie now supports HCatalog datasets, in addition to HDFS
  § Query HCat server directly -OR-
  § Receive 'partition created' notifications
§ With HDFS datasets, poll NameNode to check data availability
  § Delay
  § Single source
[Diagram: Oozie repeatedly asks the NameNode "data exists?" for HDFS paths such as /data/click/2013/03/10, /data/click/2013/03/11 and /data/click/2013/03/12.]
26. What’s New in Oozie
Latest Oozie 4.0 Features
HCatalog Integration
› The HCat metastore has info about HDFS datasets, locations and file formats.
› Using the HCat loader and storer, datasets can be consumed uniformly using Pig, Hive and Map/Reduce in Oozie, using the "database, table, partition" abstraction.
› Oozie is notified of partition availability via JMS messages, to trigger workflows immediately.
› Use the JARs hcatalog-core.jar, webhcat-javaclient.jar, hive-common.jar, hive-exec.jar, hive-metastore.jar, hive-serde.jar and libfb303.jar in the workflow 'lib' dir.
§ Docs http://oozie.apache.org/docs/4.0.0/DG_HCatalogIntegration.html

<coordinator-app name="hcat-coord" … >
  <datasets>
    <dataset name="inp-logs" frequency="${coord:hours(1)}">
      <uri-template>${hcatNode}/${db}/${table}/ds=${YEAR}-${MONTH}-${DAY};region=${region}</uri-template>
      <done-flag></done-flag>
    </dataset>
    <dataset name="out-logs" frequency="${coord:days(1)}">
      <uri-template>${hcatNode}/${db}/${outputtable}/ds=${dataOut};region=${region}</uri-template>
      <done-flag></done-flag>
    </dataset>
  ...
  <property>
    <name>FILTER</name>
    <value>${coord:dataInPartitionFilter('input', 'pig')}</value>
  </property>

Pig action script:
A = load '$DB.$TABLE' using org.apache.hcatalog.pig.HCatLoader();
B = FILTER A BY $FILTER;
C = foreach B generate foo, bar;
store C into '$OUTPUT_DB.$OUTPUT_TABLE' USING org.apache.hcatalog.pig.HCatStorer('$OUTPUT_PARTITION');
27. What’s New in Oozie
With HCatalog + Notifications
High-level Diagram
[Diagram: the Data Producer produces data (distcp, pig, M/R..) into HDFS at /data/click/2013/03/12, then updates the metadata in HCatalog with the statement below.]
ALTER TABLE click ADD PARTITION(data='2013/03/12') location 'hdfs://data/click/2013/03/12'
28. What’s New in Oozie
With HCatalog + Notifications
High-level Diagram
[Diagram: Data Producer, HDFS, HCatalog, Oozie and a Message Bus (e.g., ActiveMQ). Oozie performs two steps:]
1. Query/Poll Partition (against HCatalog)
2. Register Topic (on the message bus)
29. What’s New in Oozie
With HCatalog + Notifications
High-level Diagram
[Diagram: Data Producer, HDFS, HCatalog, Oozie and a Message Bus (e.g., ActiveMQ), with the full sequence of steps:]
Oozie
1. Query/Poll Partition (against HCatalog)
2. Register Topic (on the message bus)
Data Producer
§ Produce data (distcp, pig, M/R..) into HDFS at /data/click/2013/03/12
§ Update metadata in HCatalog:
  ALTER TABLE click ADD PARTITION(data='2013/03/12') location 'hdfs://data/click/2013/03/12'
HCatalog
3. Push notification <New Partition> to the message bus
Message Bus
4. Notify New Partition to Oozie
Oozie
§ Start workflow
30. What’s New in Oozie
Latest Oozie 4.0 Features
Feature 2: Job Notifications
§ Notification event sent on jobs' status change
§ Messages sent on the configured JMS-compliant message broker
§ Users should write message listeners to listen on select topics (e.g. username)
§ To filter more, apply JMS selectors on messages.
  § E.g. user, jobid, app-type, status, msg-type (JOB or SLA).
§ Docs http://oozie.apache.org/docs/4.0.0/DG_JMSNotifications.html

Filter desired app-types for notification:
<property>
  <name>oozie.service.EventHandlerService.filter.app.types</name>
  <value>workflow_job, workflow_action, coordinator_job, coordinator_action</value>
</property>

Notification Msg Example: Coordinator Action Failure Event
› Header (Selectors)
  • AppType - Coordinator_Action
  • Status - FAILURE
  • User
  • App-Name
› Message Body (JSON)
  • ID (coord action id)
  • Parent ID (coord Job ID)
  • NominalTime
  • StartTime
  • EndTime
  • Status - FAILED, KILLED, SUSPENDED, TIMEDOUT
  • Error-Code, Error-Message (if KILLED or FAILED)
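The split between selector headers and a JSON body can be mimicked with a small in-memory Python sketch. It does not use JMS or any Oozie client API: the messages are plain dictionaries, and every header key, JSON field name and ID below is a hypothetical placeholder chosen to resemble the slide, so the real names should be taken from the JMS notifications docs.

```python
import json

# Hypothetical stand-ins for JMS messages; keys, IDs and codes are illustrative only.
MESSAGES = [
    {
        "headers": {"user": "joe", "appType": "COORDINATOR_ACTION", "eventStatus": "FAILURE"},
        "body": json.dumps({
            "id": "example-coord-job@1",
            "parentId": "example-coord-job",
            "status": "KILLED",
            "errorCode": "EXAMPLE_CODE",
        }),
    },
    {
        "headers": {"user": "joe", "appType": "WORKFLOW_JOB", "eventStatus": "SUCCESS"},
        "body": json.dumps({"id": "example-wf-job", "status": "SUCCEEDED"}),
    },
]


def select(messages, **wanted_headers):
    """Return the decoded bodies of messages whose headers match every wanted value."""
    selected = []
    for message in messages:
        headers = message["headers"]
        if all(headers.get(key) == value for key, value in wanted_headers.items()):
            selected.append(json.loads(message["body"]))
    return selected


if __name__ == "__main__":
    for body in select(MESSAGES, user="joe", eventStatus="FAILURE"):
        print(body["id"], body["status"])
```

Filtering on headers first means a listener only decodes the bodies it actually cares about, which is the point of putting selectors in the header.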
31. What’s New in Oozie
Latest Oozie 4.0 Features
Feature 3: SLA Monitoring
§ Oozie can actively track SLAs on jobs'
  § Start-time, End-time, Duration
§ Event Status
  § START_MET, START_MISS
  § END_MET, END_MISS
  § DURATION_MET, DURATION_MISS
§ At any time, the SLA processing stage will reflect:
  § Not_Started <-- Job not yet begun
  § In_Process <-- Job started and is running, and SLAs are being tracked
  § Met <-- caused by an END_MET
  § Miss <-- caused by an END_MISS
§ Access/Filter SLA info via
  § Web-console dashboard
  § REST API
  § JMS Messages
  § Email alert
§ Docs http://oozie.apache.org/docs/4.0.0/DG_SLAMonitoring.html

<workflow-app xmlns="uri:oozie:workflow:0.5" xmlns:sla="uri:oozie:sla:0.2" name="sla-wf">
  ...
  <end name="end"/>
  <sla:info>
    <sla:nominal-time>${nominalTime}</sla:nominal-time>
    <sla:should-start>${shouldStart}</sla:should-start>
    <sla:should-end>${shouldEnd}</sla:should-end>
    <sla:max-duration>${duration}</sla:max-duration>
    <sla:alert-events>start_miss,end_miss</sla:alert-events>
    <sla:alert-contact>joe@yahoo</sla:alert-contact>
  </sla:info>
</workflow-app>
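The six event statuses can be read as three independent comparisons. The Python sketch below is a simplified model under stated assumptions: it evaluates a job that has already finished, treats "on time" as less than or equal to the expected value, and uses a made-up function name. Oozie itself also evaluates SLAs while a job is still running, which this sketch does not attempt.

```python
from datetime import datetime, timedelta


def sla_events(expected_start, expected_end, max_duration, actual_start, actual_end):
    """Return the start, end and duration event statuses for a finished job."""
    duration = actual_end - actual_start
    return [
        "START_MET" if actual_start <= expected_start else "START_MISS",
        "END_MET" if actual_end <= expected_end else "END_MISS",
        "DURATION_MET" if duration <= max_duration else "DURATION_MISS",
    ]


if __name__ == "__main__":
    nominal = datetime(2013, 3, 12, 1, 0)
    print(sla_events(
        expected_start=nominal + timedelta(minutes=10),
        expected_end=nominal + timedelta(minutes=60),
        max_duration=timedelta(minutes=45),
        actual_start=nominal + timedelta(minutes=15),  # started 5 minutes late
        actual_end=nominal + timedelta(minutes=50),
    ))
```

In this example the job starts late but still finishes within both the end time and the maximum duration, so only the start SLA is missed.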
32. What’s New in Oozie
SLA Monitoring Dashboard
33. Demo
Checking Oozie Job
1. CLI (yoozie_client)
$ oozie job -oozie http://localhost:11000/oozie -info 14-20090525161321-oozie-joe
------------------------------------------------------------------------------
Workflow Name : map-reduce-wf
App Path      : hdfs://localhost:8020/user/joe/workflows/map-reduce
Status        : SUCCEEDED
Run           : 0
User          : joe
Group         : users
Created       : 2009-05-26 05:01
Started       : 2009-05-26 05:01
Ended         : 2009-05-26 05:01
Actions
------------------------------------------------------------------------------
Action Name  Type        Status  Transition  External Id            External Status  Error Code  Start             End
------------------------------------------------------------------------------
hadoop1      map-reduce  OK      end         job_200904281535_0254  SUCCEEDED        -           2009-05-26 05:01  2009-05-26 05:01
------------------------------------------------------------------------------
34. Demo
Checking / Debugging Oozie Jobs
2. Web-Console
e.g. http://my-oozie-server:4080/oozie
Docs - https://cwiki.apache.org/confluence/display/OOZIE/Map+Reduce+Cookbook
36. Oozie at ASF
Oozie vs. Other Workflow Systems
Criterion | Oozie | Azkaban | Luigi
Champion | Yahoo! (now ASF) | LinkedIn | Spotify
Apache Affiliation | TLP | License only | License only
Language | Java | Java | Python
Adoption | High, part of all standard Hadoop distributions | Low | Low
Code Complexity | High (>100K lines) | Medium (<50K lines) | Low (<10K lines)
Hadoop Job Support | Extensive built-in support | Limited job types | Limited job types
Docs & Support | Excellent | Limited | Limited
Auth. | Kerberos, custom | xml-based, custom | Linux-based
Reruns | Yes (recovery, retries at all levels) | Partial | After removing output, idempotent
UI | Average | Good | -
37. Roadmap
The Next Release
§ Scalability and performance improvements to handle higher loads
  § More 1 and 5 min frequency jobs
§ High Availability with Load Balancing
§ Flexible Cron-Based Scheduling
§ Handling cluster rolling upgrades for Hadoop 2.0