A science-gateway for workflow executions: online and non-clairvoyant self-healingof workflow executions on grids

A science-gateway for workflow executions:
online and non-clairvoyant self-healing
of workflow executions on grids
Rafael FERREIRA DA SILVA

University of Lyon, CNRS, INSERM, CREATIS
Villeurbanne, France

Supervisors:
Frédéric DESPREZ and Tristan GLATARD

This work was funded by the French National Agency for Research
1under grant ANR-09-COSI-03 "VIP”

Outline
—  Technical context and challenges
—  Contributions
— 
— 
— 
— 

Self-healing of workflow executions on grids
Treatment of blocked activities
Optimization of task granularity
Fairness control among workflow executions

—  Conclusions

2

Outline
— 
— 
— 
— 



3

Heavy Medical Simulations
Treatement planning for
prostate protontherapy
[L. Grevillot, D. Sarrut]

Medical-Imaging Execution Platform
491 users from 52 countries

CPU Time: 2 months
Virtual
Imaging
Platform
Simulated diffusion
weighted images
[L. Wang, Y. Zhu, I. Magnin]

CPU Time: 8 years

Echography simulation
[O. Bernard, M. Alessandrini]

CPU Time: 42 hours

4

Public Computing Infrastructure
150 computing sites world-wide
Goal: Self-healing of workflow executions on grids
to handle operational issues

2. User launches
a simulation
(application workflow)

1. Input data
upload

11. Download results

Science-Gateway
8. Inputs download

9. Execution

10. Results upload

5

Virtual Imaging Platform (VIP)

Workflow Execution
3. Workflow engine
generates invocations

4. Invocations are
wrapped into grid jobs

High-level interface
Software-as-a-Service

5. Jobs are submitted
to a Pilot Engine
6. Pilot jobs are
submitted to the
distributed infrastructure

7. Pilot jobs
fetch grid jobs

Workflow Execution
2. User launches
a simulation

3. Workflow engine

4. Invocations are
1. Input data
upload


Workflow Management
8. Inputs download

System
Applications described as workflows
Parallel language
Grid-aware enactor
9. Execution

to a Pilot Engine

6. Pilot jobs are
submitted to the

10. Results upload

6

7. Pilot jobs
fetch grid jobs

Workflow Execution
2. User launches
a simulation

3. Workflow engine

4. Invocations are
1. Input data
upload


8. Inputs download

Workload Management System
9. Execution
Pilot jobs run special agents that fetch user tasks
from the task queue, set up their environment and
steer their execution
10. Results upload

7

to a Pilot Engine
6. Pilot jobs are
submitted to the

7. Pilot jobs
fetch grid jobs

Workflow Execution
2. User launches
a simulation

3. Workflow engine

4. Invocations are
1. Input data
upload


European Grid Infrastructure (EGI)
+100 computing sites
+25,000 job slots
~4PB of Storage

8. Inputs download

9. Execution

10. Results upload

8

to a Pilot Engine
6. Pilot jobs are
submitted to the

7. Pilot jobs
fetch grid jobs

Challenges
—  Several workflow execution errors
Average workflow completion
rate is about 60%

Number of launched and completed workflow in VIP from Jan to Dec 2012

—  Several dysfunctional and performance problems
—  Requires manual interventions

—  Problem: costly manual operations
—  e.g.: rescheduling tasks, restarting services, killing misbehaving
experiments, or replicating data files

9

Objectives
—  Objective: Automated platform administration
—  Autonomous detection of operational incidents
—  Perform appropriate set of actions

—  Assumptions: Online and non-clairvoyant
— 
— 
— 
— 

10

Decisions must be fast
No information about tasks (duration, data transfer time, etc.)
No information about resources (availability, performance, etc.)
No user activity and workloads prediction

Outline
— 
— 
— 
— 



11

State of the Art
—  Self-healing of workflow executions
—  Most works from the literature are offline and/or clairvoyant

—  Common techniques to address operational incidents
—  Task resubmission
—  [Kandaswamy et al., 2008], [Zhang et al., 2009], [Montagnat et al., 2010]

—  Task and file replication
—  [Cirne et al., 2007], [Ben-Yehuda et al., 2012], [Ma et al., 2013]

—  Task grouping
—  [Muthuvelu et al., 2005-2013], [Lie and Liao, 2009], [Chen et al., 2013]

—  Heuristics to fairly schedule workflow tasks
—  [Zhao and Sakellariou, 2006], [N’Takpe and Suter, 2009], [Casanova et al., 2010]

12

Fuzzy Finite State Machine
—  The healing process sets the degree of FuSM states from incident

Crisp states

detection metrics

Possible values: 0 or 1
Fuzzy states

Values between 0 and 1

13

General MAPE-K loop
event
(job completion and failures)
or
timeout

Incident 1
degree η = 0.8
level
1

level
2

Incident 2
degree η = 0.4

Incident 3
degree η = 0.1

level
1

level
1

level
3

level
2

level
3

level
2

level
3

Monitoring

Analysis

Set of Actions

6e+04

=

0e+00

x2

Frequency

Monitoring data

0.0

Execution

0.2

0.4

0.6

0.8

∑

n

ηj

j =1

1.0

ηu

Knowledge

Roulette wheel selection

€

Planning
Rule

Confidence (ρ)

ρxη

Selected

2è 1

0.8

0.32

Selected

Incident 2

3 è 1

0.2

0.02

Incident 1

1 è 1

1.0

0.80

Roulette wheel selection
based on association rules

14

ηi

Association rules
for incident 1

R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workﬂow activity incidents on
distributed computing infrastructures, Future Generation Computer Systems (FGCS), 2013.

Incident Levels and Actions
—  Incident degrees are quantified in discrete incident levels
—  Thresholds are determined from mode clustering
Thresholds τ cluster platform
configurations into groups

No actions are triggered

15

Triggers a set of actions

A-priori knowledge
—  Based on the workload of VIP
—  January 2011 to April 2012
338,989 completed
138,480 error
105,488 aborted
15,576 aborted replicas
48,293 stalled
112 users

2,941 workflow executions

680,988 tasks

34,162 queued

339,545 pilot jobs

16

R. Ferreira da Silva, T. Glatard, A science-gateway workload archive to study pilot jobs,

user activity, bag of tasks, task sub-steps, and workﬂow executionss, CoreGRID/ERCIM

Workshop on Grids, Clouds and P2P Computing (CGWS), Rhodes Island, Greece, 2012.

Outline
— 
— 
— 
— 



17

Incident: Activity Blocked
—  A task is late compared to the others
Long-tail effect
80 100
60
40
0

20

Completed Jobs

FIELD-II/pasa - workflow-9SIeNv

0.0e+00

4.0e+06

8.0e+06

1.2e+07

Time (s)

Task completion rate of a real simulation

Job flow of a real simulation

—  Possible causes
—  Longer waiting times
—  Lost tasks (e.g. killed by site due to quota violation)
—  Resources with poor performance
18

Activity Blocked: State of the Art
—  Task replication
—  Is commonly used to address non-clairvoyant problems
—  Drawback: may overload the system and degrade fairness

—  Task replication in the literature
—  Is used to increase the probability to complete a task [Ramakrishnan et
al., 2009]

—  Use of the Weibull distribution to estimate the number of replicas [Litke
et al., 2007]

—  Tasks are replicated only in the tail phase [Ben-Yehuda et al., 2012]
—  Evaluation of the waste of resources by using replication [Cirne et al.,
2007]

All approaches make strong assumptions on task
or resource characteristics
19

Activity Blocked: Degree
—  Degree computed from all completed tasks of the activity
—  Task phases: setup è inputs download è execution è outputs upload
—  Assumption: bag of tasks (all tasks have equal durations)
—  Median-based estimation:
Median duration
of task phases

Estimated task
duration

Real task
duration

50s

42s

42s

250s

300s

300s

400s

400s*

20s

15s

15s

Mi = 715s

Ei = 757s

completed
current

?

*: max(400s, 20s) = 400s

—  Incident degree: task performance w.r.t median

20

Activity blocked: levels and actions
—  Levels: identified from the platform logs extracted from VIP on EGI
Activity Blocked
degree ηb

τb

Frequency

150

Level 1
(no€
actions)

100

Level 2

action: replicate tasks

50

0
0.00

0.25

0.50
ηb

d

0.75

—  Actions
—  Task replication
—  Cancel replicas with
bad performance

—  Replicate only if all
21

Level 1

active replicas are running

1.00

Replication process for one task

Level 2

Activity Blocked: Results
—  Goal: Self-Healing vs No-Healing
—  Cope with recoverable errors
Mean-Shift/hs3

FIELD-II/pasa
12000

8000
No−Healing
Self−Healing
4000

0

8000
No−Healing
Self−Healing
4000

0
1

2

3
Repetitions

4

5

Average execution speed up: 3.4

Resource waste:

w=

22
€

Makespan (s)

Makespan (s)

12000

(CPU + data) self −healing
−1
(CPU + data) no−healing

1

2

3
Repetitions

4

5

Average execution speed up: 2.9

Self-Healing process reduced resource
consumption up to 35% when compared to
the No-Healing execution

Number of Completed Tasks
Repetition 1

Repetition 2

Repetition 3

1.0
0.8
0.6
0.4

CDF

0.2

0

50
100
Repetition 4

150

20

60

0

50

100
150
Repetition 5

200

0

50

100

150

1.0
0.8
0.6
0.4
0.2

0

40

0

50

100

Time (min)

Curve similarities up to 95% indicate similar grid conditions

23

No−Healing
Self−Healing

Activity Blocked: Conclusions
—  First results in controlling blocked activities in these conditions
—  Conditions: production system, non-clairvoyant, online

—  Limitation
—  The method only works for bag-of-tasks
—  The waste metric does not consider resource performance

—  Currently used in production by VIP
—  From Aug 2012 to Oct 2013 more than 6000 workflow executions benefited

—  Publications
R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of operational workﬂow incidents on
distributed computing infrastructures, IEEE/ACM International Symposium on Cluster, Cloud and
Grid Computing (CCGrid), Ottawa, Canada, 2012.

R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workﬂow activity incidents on
distributed computing infrastructures, Future Generation Computer Systems (FGCS), 2013.

24

Outline
— 
— 
— 
— 



25

Incident: Fineness Control
—  Low performance of lightweight (a.k.a. fine-grained) tasks:
—  High queuing times
—  Communication overhead
t5
t4
lightweight tasks

Lightweight task
executions are delayed

t3
t2

Group into coarse-grained tasks
reduces the cost of data transfers
when grouped tasks share input data,
and saves queuing time

Resources

t1

R3

t4

R2
R1

t3
t1

t2

t5
time

26

Fineness Control: State of the Art
—  Task grouping in the literature
—  Groups tasks based on the granularity size (processing time) [Muthuvelu et
al., 2005]

—  Adds bandwidth to the definition of the granularity size

[Ng et al., 2006],

[And et al., 2009]

—  Defines the granularity size based on QoS requirements
—  Task file size, CPU time, resource constraints [Muthuvelu et al., 2008]
—  Drawback: only works under stationary load

—  Adaptive algorithms (non-stationary load)
—  Monitors information about the current availability and capability of
resources

[Liu and Liao, 2009], [Muthuvelu et al., 2013]

All approaches make strong assumptions on task or resource characteristics
27

Fineness Control: Degree
—  Task execution

Queued Time

qj

t
Shared Input Data

€

Other Input
Data

Application Execution

~

t _ shared

—  Incident degree
€

€
η f = max i∈[1,m ]{ f i = di ⋅ ri }

€

Median task phase durations
28

i = waiting task
n = number of waiting tasks

Fineness control: levels and actions
Fineness Control
degree ηf

6e+04

Level 1
(no actions)
€

Level 1

Level 2

0e+00

Frequency

τf

action: task grouping
0.0

0.2

0.4

0.6

0.8

ηf

—  Actions
—  Task grouping
—  Grouped pairwise until η f ≤ τ f
or until Q ≤ R

29

€

1.0

Level 2

Coarseness control
t5
t4+t5

Tasks at t1

Grouped tasks
at t2

t3
t2

—  Non-stationary load

t2+t3

t4

—  Loss of parallelism
—  Task-degrouping

t1

Resources

R3
Loss of parallelism

R2

t4+t5

R1

t1
t1

t2+t3
time

t2

—  Incident degree

—  Levels

R
ηc =
Q+ R
30

€

τ c = 0.5

€

De-group tasks
when R Q

Results: Non-Stationary Load
—  Experiment
—  Evaluate the de-grouping control process under non-stationary load
Resources appear progressively

Resources appear suddenly

Makespan (s)

6000

4000

Fineness
Fineness−Coarseness
No−Granularity

2000

0
Run 1

Run 2

Run 3

Speeds up executions up to a factor of 1.5 for
Fineness, and 2.1 for Fineness-Coarseness

Run 4

Run 5

Fineness is penalized by its lack of
adaptation: slowdown of 20%

Linear correlation coefficient between the makespan and the
average queuing time is 0.91, which indicates they are correlated
31

31

Task Granularity: Conclusions
—  First results in controlling task granularity in these conditions

—  The method only works for data-intensive workloads

—  Future Work
—  Task pre-emption to handle the scenario where resources suddenly appear
and all tasks are running

R. Ferreira da Silva, T. Glatard, F. Desprez, On-line, non-clairvoyant optimization of

workﬂow activity granularity task on grids, Euro-Par, Aachen, 2013.

R. Ferreira da Silva, T. Glatard, F. Desprez, Controlling fairness and task granularity in

distributed, online, non-clairvoyant workﬂow executions, Concurrency and Computation:

Practice and Experience (CCPE), Submited, 2014.

32

Outline
— 
— 
— 
— 



33

Incident: Unfairness Among Workflow Executions
—  Under resource contention workflows are unequally slowed
down by concurrent executions
t1,1

t3,1

t1,2

t2,2

t3,2

t1,3

t2,3

t3,3

t1,4

t2,4

t3,4

t1,5

3 identical workflows
submitted sequentially
(ti,j = 10s)

t2,1

t2,5

Makespan with
concurrent executions

t3,5

slowdown(s) =

M multi
M own
Makespan without
concurrent executions

€

Resources

s1 =
R3

t1,3

t2,1

t2,4

t3,2

t1,2

t1,5

t2,3

t3,1

t3,4

R1

t1,1

t1,4

t2,2

t2,5

s2 =

40
= 2.0
20

s3 =

50
= 2.5
20

t3,5

R2

20
= 1.0
20

t3,3

0

10

20

30

40

€

time

€

Identical workflow executions do not
experience the same slowdown
34

€

Fairness: State of the Art
—  Workflow execution fairness in the literature
—  Addresses fairness based on the slowdown of DAGs based on execution
and data transfer times

[Zhao and Sakellariou, 2006], [Casanova et al., 2010]

—  Proposes a mapping procedure to increase fairness based on the critical
path length

[N’Takpe and Suter, 2009]

—  Online, but clairvoyant, HEFT-like algorithms [Hsu et al., 2011], [Sommerfield
and Richter, 2011], [Arabnejad and Barbosa, 2012]

—  Non-clairvoyant, but offline, scheduling strategy based on task labeling
and adaptive allocation

[Hirales-Carbajal et al., 2012]

No algorithm was proposed in a non-clairvoyant and online case

35

Fairness Control: Degree
—  Unfairness degree
ηu = W max − W min

Max difference between the
fractions of pending work

where:

$
'
Qi, j
W i = max j∈[1,n i ]%
⋅ Ti, j (
Qi, j + Ri, j ⋅ Pi, j
)

€

€

Performance

A low Pi,j indicates that resources
allocated to the activity have bad
performance for the activity
36

i = activity, ni = active activities
Qi,j = number of waiting tasks
Ri,j = number of running tasks

Relative observed duration

Median task phase durations

Fairness Control: Levels and Actions
Fairness Control
degree ηu

τu

Level 1
(no actions)

Level 1

6e+04

€

action: task prioritization

0e+00

Frequency

Level 2

0.0

0.2

0.4

0.6

0.8

1.0

ηu

—  Actions
—  Task prioritization
—  Task priority is an integer initialized to 1
—  Increase priority of Δi,j tasks
37

Level 2

Fairness Control: Metrics
—  Unfairness
—  Is the area under the curve ηu during the execution:
M

µ = ∑ηu (t i )⋅ (t i − t i−1 )
i=2

€

—  Slowdown
s=

M multi
M own

where:

€
38

€

M own = max p∈Ω ∑ t u
u∈p

This metric measures if the fairness process
can indeed minimize its own criterion ηu

Results: identical workflows
—  Tests whether unfairness among identical workflows is properly
addressed

Repetition 1

Repetition 2

Repetition 3

Repetition 4

Makespan (s)

30000
20000

Gate 1
Gate 2

10000

Gate 3

0
Fairness

No−Fairness

Repetition 1

Fairness

No−Fairness

Repetition 2

Fairness

No−Fairness

Repetition 3

Fairness

No−Fairness

Repetition 4

1.00

ηf

0.75
Fairness

0.50

No−Fairness

0.25
0.00
0

10000

20000

30000 0

5000 10000 15000 20000
0
Time (s)

10000 20000 30000

0 500010000
15000
20000
25000

Makespans and unfairness degree values are significantly reduced

39

Results: different workflows
—  Tests whether unfairness among different workflows is detected and
properly handled

Slowdown

Repetition 1

Repetition 2

Repetition 3

Repetition 4

100

FIELD−II
Gate

10

PET−Sorteo
SimuBloch

1
Fairness

No−Fairness

Repetition 1

Fairness

No−Fairness

Fairness

Repetition 2

No−Fairness

Fairness

Repetition 3

No−Fairness

Repetition 4

1.00

ηf

0.75
Fairness

0.50

No−Fairness

0.25
0.00
0

5000 100001500020000 0

10000

20000

0
Time (s)

20000

40000

0

500010000
15000
20000

Reduced slowdown stand. dev. up to a factor of 3.8,
and unfairness value up to a factor 1.9
40

Fairness Control: Conclusions
—  First results in controlling fairness among workflow executions in
these conditions


—  Fairness optimization is delayed due to the acquisition of information
about the applications
—  The method works best for applications with a lot of short tasks

—  Future Work
—  Evaluation of the influence of the metrics’ parameters

R. Ferreira da Silva, T. Glatard, F. Desprez, Workﬂow fairness control on online and

non-clairvoyant distributed computing platforms, Euro-Par, Aachen, 2013.

R. Ferreira da Silva, T. Glatard, F. Desprez, Controlling fairness and task granularity in

distributed, online, non-clairvoyant workﬂow executions, Concurrency and Computation:

Practice and Experience (CCPE), Submited, 2014.

41

Outline
— 
— 
— 
— 



42

Contributions Summary
Self-healing of workflow incidents
- Generic MAPE-K loop
- Non-clairvoyance and online

[Ferreira da Silva et al.,
CCGRID’12, FGCS’13]


- Properly detects and handles blocked activities


[Ferreira da Silva et al., EuroPar’13a]


[Ferreira da Silva et al., EuroPar’13b, CPE’14]

Science-gateway model for workload archive

[Ferreira da Silva and Glatard,
CGWS’12]

All methods were evaluated on VIP

[Ferreira da Silva et al.,
HealthGrid’11; Glatard et al.,
TMI’13]

- Properly detects and handles lightweight tasks under
stationary and non-stationary loads
- Properly detects and handles unfairness among
workflow executions

- Illustration by using traces of the VIP from 2011/2012
- Production platform with about 500 users

43

Perspectives
—  Mode detection automation
—  Automatically detect variation on threshold values

—  Time-windowed historical information
—  User’s behavior may change
—  Errors may be restricted to a specific time span

—  Optimization of the incident selection method
—  There is no mechanism to prevent an incident to be successively selected

—  Sensitivity analysis of parameters
—  Evaluate the influence of parameters on the metrics

—  Workflow workload archive
—  The science gateway workload archive model does not embrace all
characteristics inherent to a workflow execution

44

A science-gateway for workflow executions:
online and non-clairvoyant self-healing
of workflow executions on grids
Thank you for your attention.
Questions?

http://vip.creatis.insa-lyon.fr!
Rafael FERREIRA DA SILVA

University of Lyon, CNRS, INSERM, CREATIS
Villeurbanne, France

Supervisors:
Frédéric DESPREZ and Tristan GLATARD

A science-gateway for workflow executions: online and non-clairvoyant self-healingof workflow executions on grids

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a A science-gateway for workflow executions: online and non-clairvoyant self-healingof workflow executions on grids

Similar a A science-gateway for workflow executions: online and non-clairvoyant self-healingof workflow executions on grids (20)

Más de Rafael Ferreira da Silva

Más de Rafael Ferreira da Silva (20)

Último

Último (20)