PhD Thesis presented on November 29th 2013 at INSA-Lyon
Abstract - Science gateways, such as the Virtual Imaging Platform (VIP), enable transparent access to distributed computing and storage resources for scientific computations. However, their large scale and the number of middleware systems involved lead to many errors and faults. In practice, science gateways are often backed by substantial support staff who monitor running experiments and perform simple yet crucial actions such as rescheduling tasks, restarting services, killing misbehaving runs, or replicating data files to reliable storage facilities. Fair quality of service (QoS) can then be delivered, albeit with significant human intervention. Automating such operations is challenging for two reasons. First, the problem is online by nature: no reliable prediction of user activity can be assumed, and new workloads may arrive at any time. The considered metrics, decisions and actions therefore have to remain simple and yield results while the application is still executing. Second, it is non-clairvoyant, owing to the lack of information about applications and resources in production conditions. Computing resources are usually provisioned dynamically from heterogeneous clusters, clouds or desktop grids without any reliable estimate of their availability and characteristics. Models of application execution times are hardly available either, in particular on heterogeneous computing resources. In this thesis, we propose a general healing process for autonomous detection and handling of operational incidents in workflow executions. Instances are modeled as Fuzzy Finite State Machines (FuSM) whose state degrees of membership are determined by an external healing process. Degrees of membership are computed from metrics assuming that incidents have outlier performance, e.g. a site or a particular invocation behaves differently from the others.
Based on incident degrees, the healing process identifies incident levels using thresholds determined from the platform history. A specific set of actions is then selected from association rules among incident levels.
For more information visit http://www.rafaelsilva.com
1. A science-gateway for workflow executions:
online and non-clairvoyant self-healing
of workflow executions on grids
Rafael FERREIRA DA SILVA
University of Lyon, CNRS, INSERM, CREATIS
Villeurbanne, France
Supervisors:
Frédéric DESPREZ and Tristan GLATARD
This work was funded by the French National Agency for Research
under grant ANR-09-COSI-03 "VIP"
2. Outline
— Technical context and challenges
— Contributions
  — Self-healing of workflow executions on grids
  — Treatment of blocked activities
  — Optimization of task granularity
  — Fairness control among workflow executions
— Conclusions
4. Heavy Medical Simulations
— Treatment planning for prostate protontherapy [L. Grevillot, D. Sarrut]: CPU time, 2 months
— Simulated diffusion-weighted images [L. Wang, Y. Zhu, I. Magnin]: CPU time, 8 years
— Echography simulation [O. Bernard, M. Alessandrini]: CPU time, 42 hours
— Virtual Imaging Platform: a medical-imaging execution platform with 491 users from 52 countries
— Public computing infrastructure: 150 computing sites world-wide
— Goal: self-healing of workflow executions on grids to handle operational issues
5. Virtual Imaging Platform (VIP): Workflow Execution
— High-level interface, Software-as-a-Service
— Execution steps:
  1. Input data upload
  2. User launches a simulation (application workflow)
  3. Workflow engine generates invocations
  4. Invocations are wrapped into grid jobs
  5. Jobs are submitted to a pilot engine
  6. Pilot jobs are submitted to the distributed infrastructure
  7. Pilot jobs fetch grid jobs
  8. Inputs download
  9. Execution
  10. Results upload
  11. Download of results
6. Workflow Execution: Workflow Management System
— Applications are described as workflows
— Parallel language
— Grid-aware enactor
7. Workflow Execution: Workload Management System
— Pilot jobs run special agents that fetch user tasks from the task queue, set up their environment, and steer their execution
8. Workflow Execution: European Grid Infrastructure (EGI)
— More than 100 computing sites
— More than 25,000 job slots
— About 4 PB of storage
9. Challenges
— Many workflow execution errors
  — The average workflow completion rate is about 60% (numbers of launched and completed workflows in VIP, Jan to Dec 2012)
— Many dysfunction and performance problems
— Requires manual interventions
— Problem: costly manual operations
  — e.g. rescheduling tasks, restarting services, killing misbehaving experiments, or replicating data files
10. Objectives
— Objective: automated platform administration
  — Autonomous detection of operational incidents
  — Perform appropriate sets of actions
— Assumptions: online and non-clairvoyant
  — Decisions must be fast
  — No information about tasks (duration, data transfer time, etc.)
  — No information about resources (availability, performance, etc.)
  — No prediction of user activity or workloads
11. Outline
— Technical context and challenges
— Contributions
  — Self-healing of workflow executions on grids
  — Treatment of blocked activities
  — Optimization of task granularity
  — Fairness control among workflow executions
— Conclusions
12. State of the Art
— Self-healing of workflow executions
  — Most works from the literature are offline and/or clairvoyant
— Common techniques to address operational incidents
  — Task resubmission [Kandaswamy et al., 2008], [Zhang et al., 2009], [Montagnat et al., 2010]
  — Task and file replication [Cirne et al., 2007], [Ben-Yehuda et al., 2012], [Ma et al., 2013]
  — Task grouping [Muthuvelu et al., 2005-2013], [Liu and Liao, 2009], [Chen et al., 2013]
  — Heuristics to fairly schedule workflow tasks [Zhao and Sakellariou, 2006], [N'Takpe and Suter, 2009], [Casanova et al., 2010]
13. Fuzzy Finite State Machine
— The healing process sets the degrees of the FuSM states from incident detection metrics
— Crisp states: possible values are 0 or 1
— Fuzzy states: values between 0 and 1
14. General MAPE-K loop
— Monitoring: triggered by an event (job completion or failure) or by a timeout; monitoring data feed the analysis
— Analysis: a degree η is computed for each incident (e.g. incident 1: η = 0.8; incident 2: η = 0.4; incident 3: η = 0.1) and quantified into discrete incident levels (level 1, 2, 3)
— Planning: roulette-wheel selection based on association rules among incident levels; a rule x ⇒ i is weighted by ρ × η_x, where ρ is the rule confidence and η_x the degree of its source incident

  Association rules for incident 1 (example):
  Rule    Confidence (ρ)   ρ × η   Selected
  1 ⇒ 1   1.0              0.80    incident 1
  2 ⇒ 1   0.8              0.32    incident 2
  3 ⇒ 1   0.2              0.02

— Execution: the selected set of actions is applied
— Knowledge: platform history (e.g. histograms of incident degrees, used to determine thresholds)

R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workflow activity incidents on distributed computing infrastructures, Future Generation Computer Systems (FGCS), 2013.
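The roulette-wheel selection on this slide can be illustrated in a few lines of Python. The weights are the hypothetical ρ × η products from the example table; the function is a generic fitness-proportionate selection sketch, not VIP's actual implementation.

```python
import random

def roulette_wheel(weights):
    """Fitness-proportionate selection: return a key with probability
    proportional to its weight (here, rho * eta for each incident)."""
    total = sum(weights.values())
    pick = random.uniform(0, total)
    cumulative = 0.0
    for incident, w in weights.items():
        cumulative += w
        if pick <= cumulative:
            return incident
    return incident  # guard against floating-point rounding at the upper edge

# Example weights from the slide: rule confidence rho times incident degree eta.
weights = {"incident 1": 1.0 * 0.8,   # rule 1 => 1
           "incident 2": 0.8 * 0.4,   # rule 2 => 1
           "incident 3": 0.2 * 0.1}   # rule 3 => 1
```

Over many draws, incident 1 is selected roughly 70% of the time (0.80 / 1.14), so frequent, high-confidence incidents dominate without completely starving the others.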
15. Incident Levels and Actions
— Incident degrees are quantified into discrete incident levels
— Thresholds τ are determined from mode clustering: they cluster platform configurations into groups
— Below the threshold, no actions are triggered; above it, a specific set of actions is triggered
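As a rough sketch of how a threshold τ could be derived from a history of incident degrees, the snippet below places τ in the valley between the two main modes of the degree histogram. This is a simplified stand-in for the mode-clustering procedure, with made-up sample data; the thesis derives its thresholds from the actual platform history.

```python
def threshold_between_modes(samples, bins=20):
    """Histogram the samples, find the two tallest bins (the modes),
    and place the threshold in the emptiest bin between them."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for x in samples:
        counts[min(int((x - lo) / width), bins - 1)] += 1
    # indices of the two tallest bins, in ascending order
    first, second = sorted(sorted(range(bins), key=counts.__getitem__)[-2:])
    if second - first < 2:            # modes are adjacent: split at the boundary
        return lo + second * width
    # emptiest bin strictly between the modes (ties broken toward the middle)
    mid = (first + second) / 2
    valley = min(range(first + 1, second),
                 key=lambda i: (counts[i], abs(i - mid)))
    return lo + (valley + 0.5) * width

# Hypothetical degree history: one mode of harmless runs, one of incidents.
history = [0.1] * 40 + [0.15] * 10 + [0.5] * 2 + [0.75] * 8 + [0.8] * 30
tau = threshold_between_modes(history)
```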
16. A-priori knowledge
— Based on the workload of VIP, January 2011 to April 2012:
  — 112 users, 2,941 workflow executions, 339,545 pilot jobs
  — 680,988 tasks: 338,989 completed, 138,480 error, 105,488 aborted, 15,576 aborted replicas, 48,293 stalled, 34,162 queued

R. Ferreira da Silva, T. Glatard, A science-gateway workload archive to study pilot jobs, user activity, bag of tasks, task sub-steps, and workflow executions, CoreGRID/ERCIM Workshop on Grids, Clouds and P2P Computing (CGWS), Rhodes Island, Greece, 2012.
17. Outline
— Technical context and challenges
— Contributions
  — Self-healing of workflow executions on grids
  — Treatment of blocked activities
  — Optimization of task granularity
  — Fairness control among workflow executions
— Conclusions
18. Incident: Activity Blocked
— A task is late compared to the others (long-tail effect)
[Figure: task completion rate and job flow of a real simulation (FIELD-II/pasa, workflow-9SIeNv)]
— Possible causes
  — Longer waiting times
  — Lost tasks (e.g. killed by a site due to quota violation)
  — Resources with poor performance
19. Activity Blocked: State of the Art
— Task replication
  — Is commonly used to address non-clairvoyant problems
  — Drawback: may overload the system and degrade fairness
— Task replication in the literature
  — Used to increase the probability of completing a task [Ramakrishnan et al., 2009]
  — Use of the Weibull distribution to estimate the number of replicas [Litke et al., 2007]
  — Tasks are replicated only in the tail phase [Ben-Yehuda et al., 2012]
  — Evaluation of the resources wasted by replication [Cirne et al., 2007]
— All these approaches make strong assumptions on task or resource characteristics
20. Activity Blocked: Degree
— Degree computed from all completed tasks of the activity
— Task phases: setup ⇒ inputs download ⇒ execution ⇒ outputs upload
— Assumption: bag of tasks (all tasks have equal durations)
— Median-based estimation (example for a task currently in its execution phase):

  Phase            Median duration   Estimated duration   Real duration
  Setup            50 s              42 s                 42 s (completed)
  Inputs download  250 s             300 s                300 s (completed)
  Execution        400 s             400 s*               ? (current)
  Outputs upload   20 s              15 s                 15 s
                   M_i = 715 s       E_i = 757 s
  *: max(400 s, 20 s) = 400 s

— Incident degree: task performance w.r.t. the median
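The median-based estimation can be sketched as follows: completed phases contribute their real durations, the current phase contributes at least its median (hence the max on the slide), and future phases contribute their medians. The phase medians below reuse the slide's example values; the final degree formula is only an illustrative stand-in for the one defined in the FGCS paper.

```python
PHASES = ["setup", "inputs_download", "execution", "outputs_upload"]
MEDIANS = {"setup": 50.0, "inputs_download": 250.0,
           "execution": 400.0, "outputs_upload": 20.0}

def estimate_duration(medians, completed, current=None, elapsed=0.0):
    """Estimated task duration E_i: real durations for completed phases,
    max(median, elapsed) for the current phase, medians for future phases."""
    total = 0.0
    for phase in PHASES:
        if phase in completed:
            total += completed[phase]        # phase finished: use the real time
        elif phase == current:
            total += max(medians[phase], elapsed)  # running: at least the median
        else:
            total += medians[phase]          # not started: assume the median
    return total

def blocked_degree(estimated, median_total):
    """Illustrative degree: 0 when the task tracks the median profile,
    approaching 1 as its estimated duration lags far behind it."""
    return max(0.0, 1.0 - median_total / estimated)

# Task of the example: setup and download done, execution running for 20 s.
e = estimate_duration(MEDIANS, {"setup": 42.0, "inputs_download": 300.0},
                      current="execution", elapsed=20.0)
```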
21. Activity Blocked: Levels and Actions
— Levels: identified from the platform logs extracted from VIP on EGI
[Figure: histogram of the blocked degree η_b; a threshold τ_b separates level 1 (no actions) from level 2 (action: replicate tasks)]
— Actions
  — Task replication
  — Cancel replicas with bad performance
  — Replicate only if all active replicas are running
22. Activity Blocked: Results
— Goal: Self-Healing vs No-Healing; cope with recoverable errors
[Figure: makespans of No-Healing and Self-Healing runs over 5 repetitions, for Mean-Shift/hs3 and FIELD-II/pasa]
— Average execution speed-ups: 3.4 and 2.9 for the two applications
— Resource waste:
  w = (CPU + data)_self-healing / (CPU + data)_no-healing − 1
— The Self-Healing process reduced resource consumption by up to 35% compared to the No-Healing execution
23. Number of Completed Tasks
[Figure: CDFs of the number of completed tasks over time (min), No-Healing vs Self-Healing, over 5 repetitions]
— Curve similarities up to 95% indicate similar grid conditions
24. Activity Blocked: Conclusions
— First results in controlling blocked activities in these conditions
  — Conditions: production system, non-clairvoyant, online
— Limitations
  — The method only works for bags of tasks
  — The waste metric does not consider resource performance
— Currently used in production by VIP
  — From Aug 2012 to Oct 2013, more than 6,000 workflow executions benefited from it
— Publications
  R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of operational workflow incidents on distributed computing infrastructures, IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Ottawa, Canada, 2012.
  R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workflow activity incidents on distributed computing infrastructures, Future Generation Computer Systems (FGCS), 2013.
25. Outline
— Technical context and challenges
— Contributions
  — Self-healing of workflow executions on grids
  — Treatment of blocked activities
  — Optimization of task granularity
  — Fairness control among workflow executions
— Conclusions
26. Incident: Fineness Control
— Low performance of lightweight (a.k.a. fine-grained) tasks:
  — High queuing times
  — Communication overhead
— Lightweight task executions are delayed
— Grouping into coarse-grained tasks reduces the cost of data transfers when grouped tasks share input data, and saves queuing time
[Figure: five lightweight tasks t1-t5 scheduled on resources R1-R3, before and after grouping]
27. Fineness Control: State of the Art
— Task grouping in the literature
  — Groups tasks based on their granularity size (processing time) [Muthuvelu et al., 2005]
  — Adds bandwidth to the definition of the granularity size [Ng et al., 2006], [Ang et al., 2009]
  — Defines the granularity size based on QoS requirements: task file size, CPU time, resource constraints [Muthuvelu et al., 2008]
  — Drawback: these only work under stationary load
— Adaptive algorithms (non-stationary load)
  — Monitor information about the current availability and capability of resources [Liu and Liao, 2009], [Muthuvelu et al., 2013]
— All these approaches make strong assumptions on task or resource characteristics
28. Fineness Control: Degree
— Task execution: queued time q_j, then execution time t̃, which includes the transfer of shared input data (t̃_shared), of other input data, and the application execution itself
— Incident degree:
  η_f = max_{i∈[1,n]} { f_i = d_i ⋅ r_i }
  — i = waiting task, n = number of waiting tasks
  — d_i and r_i are computed from median task phase durations
29. Fineness Control: Levels and Actions
— Levels: identified from the platform logs extracted from VIP on EGI
[Figure: histogram of the fineness degree η_f; a threshold τ_f separates level 1 (no actions) from level 2 (action: task grouping)]
— Actions
  — Task grouping
  — Tasks are grouped pairwise until η_f ≤ τ_f or until Q ≤ R
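The grouping action can be sketched as a loop that merges queued tasks two by two until the fineness degree falls below τ_f or queued tasks no longer outnumber running ones (Q ≤ R). The `eta_f` callback stands in for the degree defined on the previous slide; names and structure here are illustrative, not VIP's code.

```python
def group_pairwise(queued, running, eta_f, tau_f):
    """Merge queued tasks two by two until eta_f <= tau_f or Q <= R.
    `queued` is a list of groups (each a list of task ids); `eta_f` is a
    placeholder callback returning the fineness degree of the queue."""
    while eta_f(queued) > tau_f and len(queued) > running:
        merged, it = [], iter(queued)
        for group in it:
            partner = next(it, None)          # take groups two at a time
            merged.append(group + partner if partner else group)
        if len(merged) == len(queued):        # single group left, nothing to merge
            break
        queued = merged
    return queued

# Toy run: 8 fine-grained tasks, 3 running, degree always above the threshold.
groups = group_pairwise([[i] for i in range(8)], running=3,
                        eta_f=lambda q: 1.0, tau_f=0.5)
```

Each pass halves the number of grid jobs, so the queue shrinks quickly while the Q ≤ R guard keeps enough groups to feed the available resources.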
30. Coarseness Control
— Non-stationary load: when resources appear, grouped tasks cause a loss of parallelism
— Action: task de-grouping
[Figure: tasks t1-t5 queued at time t1, grouped pairwise (t2+t3, t4+t5) at time t2; executing the groups on resources R1-R3 loses parallelism]
— Incident degree:
  η_c = R / (Q + R)
  where Q is the number of queued tasks and R the number of running tasks
— Levels: threshold τ_c = 0.5, i.e. tasks are de-grouped when R > Q
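The coarseness side reduces to two small formulas, shown here as a sketch using the definitions from this slide (Q queued tasks, R running tasks, τ_c = 0.5):

```python
def coarseness_degree(queued, running):
    """eta_c = R / (Q + R): the share of tasks already running."""
    total = queued + running
    return running / total if total else 0.0

def should_degroup(queued, running, tau_c=0.5):
    """De-group when eta_c > tau_c, i.e. when running tasks outnumber
    queued ones and grouped tasks would waste free parallelism."""
    return coarseness_degree(queued, running) > tau_c
```

With τ_c = 0.5 the test degenerates to R > Q, which matches the de-grouping condition on the slide.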
31. Results: Non-Stationary Load
— Experiment: evaluate the de-grouping control process under non-stationary load (resources appear progressively or suddenly)
[Figure: makespans of Fineness, Fineness-Coarseness and No-Granularity over 5 runs]
— Speeds up executions by up to a factor of 1.5 for Fineness, and 2.1 for Fineness-Coarseness
— Fineness is penalized by its lack of adaptation: slowdown of 20%
— The linear correlation coefficient between the makespan and the average queuing time is 0.91, indicating they are correlated
32. Task Granularity: Conclusions
— First results in controlling task granularity in these conditions
  — Conditions: production system, non-clairvoyant, online
— Limitation
  — The method only works for data-intensive workloads
— Future Work
  — Task pre-emption to handle the scenario where resources suddenly appear and all tasks are running
— Publications
  R. Ferreira da Silva, T. Glatard, F. Desprez, On-line, non-clairvoyant optimization of workflow activity granularity on grids, Euro-Par, Aachen, 2013.
  R. Ferreira da Silva, T. Glatard, F. Desprez, Controlling fairness and task granularity in distributed, online, non-clairvoyant workflow executions, Concurrency and Computation: Practice and Experience (CCPE), Submitted, 2014.
33. Outline
— Technical context and challenges
— Contributions
  — Self-healing of workflow executions on grids
  — Treatment of blocked activities
  — Optimization of task granularity
  — Fairness control among workflow executions
— Conclusions
34. Incident: Unfairness Among Workflow Executions
— Under resource contention, workflows are unequally slowed down by concurrent executions
— Example: 3 identical workflows submitted sequentially (all tasks t_i,j = 10 s) on resources R1-R3
— slowdown(s) = M_multi / M_own
  — M_multi: makespan with concurrent executions
  — M_own: makespan without concurrent executions
— In the example: s_1 = 20/20 = 1.0, s_2 = 40/20 = 2.0, s_3 = 50/20 = 2.5
— Identical workflow executions do not experience the same slowdown
35. Fairness: State of the Art
— Workflow execution fairness in the literature
  — Addresses fairness based on the slowdown of DAGs, computed from execution and data transfer times [Zhao and Sakellariou, 2006], [Casanova et al., 2010]
  — Proposes a mapping procedure to increase fairness based on the critical path length [N'Takpe and Suter, 2009]
  — Online, but clairvoyant, HEFT-like algorithms [Hsu et al., 2011], [Sommerfield and Richter, 2011], [Arabnejad and Barbosa, 2012]
  — Non-clairvoyant, but offline, scheduling strategy based on task labeling and adaptive allocation [Hirales-Carbajal et al., 2012]
— No algorithm had been proposed for the non-clairvoyant and online case
36. Fairness Control: Degree
— Unfairness degree:
  η_u = W_max − W_min
  (maximum difference between the fractions of pending work)
— where
  W_i = max_{j∈[1,n_i]} { Q_i,j / (Q_i,j + R_i,j ⋅ P_i,j) ⋅ T_i,j }
  — j = activity, n_i = number of active activities of workflow i
  — Q_i,j = number of waiting tasks; R_i,j = number of running tasks
  — P_i,j = performance: a low P_i,j indicates that the resources allocated to the activity have bad performance for it
  — T_i,j = relative observed duration, computed from median task phase durations
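A sketch of the degree computation, with one (Q, R, P, T) tuple per active activity and one list of tuples per workflow. The tuple layout and the sample numbers are illustrative; the definitions of P and T follow the slide.

```python
def pending_work_fraction(activities):
    """W_i = max over active activities j of  Q / (Q + R * P) * T,
    where each activity is a (Q, R, P, T) tuple: waiting tasks, running
    tasks, resource performance, relative observed duration (Q + R > 0)."""
    return max(q / (q + r * p) * t for (q, r, p, t) in activities)

def unfairness_degree(workflows):
    """eta_u = W_max - W_min over the running workflows."""
    fractions = [pending_work_fraction(acts) for acts in workflows]
    return max(fractions) - min(fractions)

# Two hypothetical workflows: one mostly waiting, one mostly running.
eta_u = unfairness_degree([
    [(10, 2, 1.0, 1.0)],   # W = 10/12
    [(1, 9, 1.0, 1.0)],    # W = 1/10
])
```

Note how a low performance P inflates the pending-work fraction: running tasks on slow resources count for less than running tasks on fast ones.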
37. Fairness Control: Levels and Actions
— Levels: identified from the platform logs extracted from VIP on EGI
[Figure: histogram of the unfairness degree η_u; a threshold τ_u separates level 1 (no actions) from level 2 (action: task prioritization)]
— Actions
  — Task prioritization
  — Task priority is an integer initialized to 1
  — The priority of Δ_i,j tasks is increased
38. Fairness Control: Metrics
— Unfairness: the area under the curve η_u during the execution,
  μ = Σ_{i=2}^{M} η_u(t_i) ⋅ (t_i − t_{i−1})
  — This metric measures whether the fairness process can indeed minimize its own criterion η_u
— Slowdown:
  s = M_multi / M_own, with M_own = max_{p∈Ω} Σ_{u∈p} t_u
  — Ω is the set of task paths of the workflow and t_u the measured duration of task u, so M_own approximates the makespan without concurrent executions
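The unfairness metric is a left Riemann sum over the sampled η_u values; a minimal sketch (the sample times and degrees below are made up):

```python
def unfairness_metric(times, degrees):
    """mu = sum_{i=2..M} eta_u(t_i) * (t_i - t_{i-1}):
    area under the sampled eta_u curve during the execution."""
    return sum(eta * (t - t_prev)
               for t_prev, t, eta in zip(times, times[1:], degrees[1:]))

# eta_u sampled at t = 0, 10 and 30 s.
mu = unfairness_metric([0.0, 10.0, 30.0], [0.5, 0.2, 0.4])
```

A lower μ means the healing loop kept η_u small for most of the run.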
40. Results: Different Workflows
— Tests whether unfairness among different workflows is detected and properly handled
[Figure: slowdown of FIELD-II, Gate, PET-Sorteo and SimuBloch executions, and unfairness degree over time, Fairness vs No-Fairness, over 4 repetitions]
— Reduced the slowdown standard deviation by up to a factor of 3.8, and the unfairness value by up to a factor of 1.9
41. Fairness Control: Conclusions
— First results in controlling fairness among workflow executions in these conditions
  — Conditions: production system, non-clairvoyant, online
— Limitations
  — Fairness optimization is delayed by the acquisition of information about the applications
  — The method works best for applications with many short tasks
— Future Work
  — Evaluation of the influence of the metrics' parameters
— Publications
  R. Ferreira da Silva, T. Glatard, F. Desprez, Workflow fairness control on online and non-clairvoyant distributed computing platforms, Euro-Par, Aachen, 2013.
  R. Ferreira da Silva, T. Glatard, F. Desprez, Controlling fairness and task granularity in distributed, online, non-clairvoyant workflow executions, Concurrency and Computation: Practice and Experience (CCPE), Submitted, 2014.
42. Outline
— Technical context and challenges
— Contributions
  — Self-healing of workflow executions on grids
  — Treatment of blocked activities
  — Optimization of task granularity
  — Fairness control among workflow executions
— Conclusions
43. Contributions Summary
— Self-healing of workflow incidents: generic MAPE-K loop; non-clairvoyant and online [Ferreira da Silva et al., CCGRID'12, FGCS'13]
— Treatment of blocked activities: properly detects and handles blocked activities
— Optimization of task granularity: properly detects and handles lightweight tasks under stationary and non-stationary loads [Ferreira da Silva et al., EuroPar'13a]
— Fairness control among workflow executions: properly detects and handles unfairness among workflow executions [Ferreira da Silva et al., EuroPar'13b, CPE'14]
— Science-gateway model for workload archive: illustrated using VIP traces from 2011/2012 [Ferreira da Silva and Glatard, CGWS'12]
— All methods were evaluated on VIP, a production platform with about 500 users [Ferreira da Silva et al., HealthGrid'11; Glatard et al., TMI'13]
44. Perspectives
— Mode detection automation
  — Automatically detect variations in threshold values
— Time-windowed historical information
  — Users' behavior may change
  — Errors may be restricted to a specific time span
— Optimization of the incident selection method
  — There is no mechanism to prevent an incident from being selected repeatedly
— Sensitivity analysis of parameters
  — Evaluate the influence of parameters on the metrics
— Workflow workload archive
  — The science-gateway workload archive model does not capture all characteristics inherent to a workflow execution
45. A science-gateway for workflow executions:
online and non-clairvoyant self-healing
of workflow executions on grids
Thank you for your attention.
Questions?
http://vip.creatis.insa-lyon.fr
Rafael FERREIRA DA SILVA
University of Lyon, CNRS, INSERM, CREATIS
Villeurbanne, France
Supervisors:
Frédéric DESPREZ and Tristan GLATARD