A science-gateway for workflow executions:
online and non-clairvoyant self-healing
of workflow executions on grids
Rafael FERREIRA DA SILVA

University of Lyon, CNRS, INSERM, CREATIS
Villeurbanne, France

Supervisors:
Frédéric DESPREZ and Tristan GLATARD

This work was funded by the French National Agency for Research
under grant ANR-09-COSI-03 "VIP"
Outline
—  Technical context and challenges
—  Contributions
    —  Self-healing of workflow executions on grids
    —  Treatment of blocked activities
    —  Optimization of task granularity
    —  Fairness control among workflow executions
—  Conclusions
Heavy Medical Simulations
—  Treatment planning for prostate proton therapy [L. Grevillot, D. Sarrut]: CPU time 2 months
—  Simulated diffusion-weighted images [L. Wang, Y. Zhu, I. Magnin]: CPU time 8 years
—  Echography simulation [O. Bernard, M. Alessandrini]: CPU time 42 hours

Virtual Imaging Platform
—  Medical-imaging execution platform: 491 users from 52 countries
—  Public computing infrastructure: 150 computing sites worldwide

Goal: self-healing of workflow executions on grids to handle operational issues
Workflow Execution
Science gateway: Virtual Imaging Platform (VIP), a high-level interface offered as Software-as-a-Service
1. Input data upload
2. User launches a simulation (application workflow)
3. Workflow engine generates invocations
4. Invocations are wrapped into grid jobs
5. Jobs are submitted to a Pilot Engine
6. Pilot jobs are submitted to the distributed infrastructure
7. Pilot jobs fetch grid jobs
8. Inputs download
9. Execution
10. Results upload
11. Download results
Workflow Execution
Workflow Management System
—  Applications described as workflows
—  Parallel language
—  Grid-aware enactor
Workflow Execution
Workload Management System
—  Pilot jobs run special agents that fetch user tasks from the task queue, set up their environment, and steer their execution
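The pull model used by pilot jobs can be sketched as follows. This is a minimal illustration under stated assumptions, not the VIP pilot engine: the names `run_pilot`, `setup_environment`, and `execute` are hypothetical, and an in-process `queue.Queue` stands in for the remote task queue.

```python
import queue


def setup_environment(task):
    # Hypothetical placeholder: a real agent would stage software and inputs.
    return {"workdir": f"/tmp/{task}"}


def execute(task, env):
    # Hypothetical placeholder for steering the user task.
    return f"{task} done in {env['workdir']}"


def run_pilot(task_queue):
    """Fetch tasks until the queue is empty, running each one in turn."""
    results = []
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            return results
        env = setup_environment(task)
        results.append(execute(task, env))


tasks = queue.Queue()
for name in ("job-1", "job-2"):
    tasks.put(name)
pilot_results = run_pilot(tasks)
```

The agent pulls work instead of having work pushed to it, so a slow or lost pilot only delays the tasks it has already fetched.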
Workflow Execution
European Grid Infrastructure (EGI)
—  More than 100 computing sites
—  More than 25,000 job slots
—  About 4 PB of storage
Challenges
—  Several workflow execution errors
    —  Average workflow completion rate is about 60%
    —  Figure: number of launched and completed workflows in VIP from January to December 2012
—  Several dysfunctions and performance problems
    —  Require manual interventions
—  Problem: costly manual operations
    —  e.g., rescheduling tasks, restarting services, killing misbehaving experiments, or replicating data files
Objectives
—  Objective: automated platform administration
    —  Autonomous detection of operational incidents
    —  Perform an appropriate set of actions
—  Assumptions: online and non-clairvoyant
    —  Decisions must be fast
    —  No information about tasks (duration, data transfer time, etc.)
    —  No information about resources (availability, performance, etc.)
    —  No prediction of user activity or workloads
State of the Art
—  Self-healing of workflow executions
    —  Most works from the literature are offline and/or clairvoyant
—  Common techniques to address operational incidents
    —  Task resubmission: [Kandaswamy et al., 2008], [Zhang et al., 2009], [Montagnat et al., 2010]
    —  Task and file replication: [Cirne et al., 2007], [Ben-Yehuda et al., 2012], [Ma et al., 2013]
    —  Task grouping: [Muthuvelu et al., 2005-2013], [Liu and Liao, 2009], [Chen et al., 2013]
    —  Heuristics to fairly schedule workflow tasks: [Zhao and Sakellariou, 2006], [N’Takpe and Suter, 2009], [Casanova et al., 2010]
Fuzzy Finite State Machine
—  The healing process sets the degree of FuSM states from incident detection metrics
—  Crisp states: possible values are 0 or 1
—  Fuzzy states: values between 0 and 1
General MAPE-K loop
—  Monitoring: triggered by an event (job completion or failure) or by a timeout; collects monitoring data
—  Analysis: computes a degree η for each incident and maps it to a level (1, 2, or 3)
    —  Example: incident 1 has degree η = 0.8, incident 2 has η = 0.4, incident 3 has η = 0.1
—  Planning
    —  Roulette wheel selection, where incident i is weighted by $\eta_i / \sum_{j=1}^{n} \eta_j$
    —  Roulette wheel selection based on association rules, where each rule is weighted by ρ × η
    —  Association rules for incident 1:

        Rule     Confidence (ρ)   ρ × η
        1 → 1    1.0              0.80
        2 → 1    0.8              0.32
        3 → 1    0.2              0.02

—  Execution: applies the set of actions
—  Knowledge: information shared by the four phases

R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workflow activity incidents on 
distributed computing infrastructures, Future Generation Computer Systems (FGCS), 2013.
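Roulette wheel selection picks an item with probability proportional to its weight. The sketch below is an illustration only, not the authors' implementation: the function name `roulette_wheel` and the dictionary layout are assumptions, while the degrees (0.8, 0.4, 0.1) and confidences (1.0, 0.8, 0.2) are the example values from the slide.

```python
import random


def roulette_wheel(weights, rng=random):
    """Return an index chosen with probability weights[i] / sum(weights)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    pick = rng.random() * total  # uniform in [0, total)
    cumulative = 0.0
    chosen = None
    for index, weight in enumerate(weights):
        if weight <= 0:
            continue  # zero-weight entries can never be selected
        cumulative += weight
        chosen = index
        if pick < cumulative:
            return index
    return chosen  # only reached through floating-point round-off


# Example values from the slide.
degrees = {"incident 1": 0.8, "incident 2": 0.4, "incident 3": 0.1}
confidence_to_incident_1 = {"incident 1": 1.0, "incident 2": 0.8, "incident 3": 0.2}

# Weight of each association rule "x -> incident 1": rho * eta.
rule_weights = {
    name: confidence_to_incident_1[name] * degrees[name] for name in degrees
}

names = list(degrees)
selected_incident = names[roulette_wheel([degrees[n] for n in names])]
selected_rule = names[roulette_wheel([rule_weights[n] for n in names])]
```

Because selection is random, incidents with a low degree are still chosen occasionally instead of being starved by the dominant one.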
Incident Levels and Actions
—  Incident degrees are quantized into discrete incident levels
—  Thresholds are determined from mode clustering
    —  Thresholds τ cluster platform configurations into groups
    —  Lowest level: no actions are triggered
    —  Higher levels: a set of actions is triggered
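Mapping a degree to a discrete level amounts to locating it among sorted thresholds. The sketch below assumes that a degree equal to a threshold stays in the lower level; the function name `incident_level` and the threshold value 0.5 are hypothetical, since the slides do not list the thresholds obtained from mode clustering.

```python
from bisect import bisect_left


def incident_level(degree, thresholds):
    """Return the level (starting at 1) of a degree given threshold values.

    A degree less than or equal to the first threshold is level 1, a degree
    above it and up to the second threshold is level 2, and so on.
    """
    return bisect_left(sorted(thresholds), degree) + 1


# Hypothetical single threshold separating "no action" from "act".
tau = [0.5]
levels = [incident_level(d, tau) for d in (0.1, 0.5, 0.8)]
```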
A-priori knowledge
—  Based on the workload of VIP
    —  January 2011 to April 2012
    —  112 users
    —  2,941 workflow executions
    —  339,545 pilot jobs
    —  680,988 tasks:

        Task status         Count
        completed           338,989
        error               138,480
        aborted             105,488
        aborted replicas    15,576
        stalled             48,293
        queued              34,162

R. Ferreira da Silva, T. Glatard, A science-gateway workload archive to study pilot jobs, user activity, bag of tasks, task sub-steps, and workflow executions, CoreGRID/ERCIM Workshop on Grids, Clouds and P2P Computing (CGWS), Rhodes Island, Greece, 2012.
Incident: Activity Blocked
—  A task is late compared to the others (long-tail effect)
—  Figure: task completion rate of a real simulation (FIELD-II/pasa, workflow-9SIeNv), completed jobs versus time (s)
—  Figure: job flow of a real simulation
—  Possible causes
    —  Longer waiting times
    —  Lost tasks (e.g., killed by the site due to quota violation)
    —  Resources with poor performance
Activity Blocked: State of the Art
—  Task replication
    —  Is commonly used to address non-clairvoyant problems
    —  Drawback: may overload the system and degrade fairness
—  Task replication in the literature
    —  Is used to increase the probability of completing a task [Ramakrishnan et al., 2009]
    —  Use of the Weibull distribution to estimate the number of replicas [Litke et al., 2007]
    —  Tasks are replicated only in the tail phase [Ben-Yehuda et al., 2012]
    —  Evaluation of the resources wasted by replication [Cirne et al., 2007]
—  All approaches make strong assumptions on task or resource characteristics
Activity Blocked: Degree
—  Degree computed from all completed tasks of the activity
    —  Task phases: setup → inputs download → execution → outputs upload
    —  Assumption: bag of tasks (all tasks have equal durations)
    —  Median-based estimation:

        Phase             Median duration   Estimated task duration   Real task duration
        setup             50s               42s                       42s (completed)
        inputs download   250s              300s                      300s (completed)
        execution         400s              400s*                     20s (current)
        outputs upload    15s               15s                       ?
        total             Mi = 715s         Ei = 757s

        *: max(400s, 20s) = 400s

—  Incident degree: task performance w.r.t. the median
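The median-based estimation can be checked with a few lines of code. This sketch assumes the rule suggested by the slide's footnote: completed phases contribute their real duration, the current phase contributes the maximum of its median and its elapsed time, and remaining phases contribute their median. The function name `estimate_task_duration` is hypothetical; the numbers are those of the slide (medians 50s, 250s, 400s, 15s; real durations 42s and 300s; 20s elapsed in the execution phase).

```python
def estimate_task_duration(medians, completed, current_elapsed):
    """Estimate the total duration of a task that is still running.

    medians: median duration of each phase, in phase order.
    completed: real durations of the phases already finished, in order.
    current_elapsed: time spent so far in the current phase.
    """
    estimate = sum(completed)
    current = len(completed)
    if current < len(medians):
        # The current phase lasts at least as long as it has already run.
        estimate += max(medians[current], current_elapsed)
        # Phases not started yet are assumed to take their median duration.
        estimate += sum(medians[current + 1:])
    return estimate


medians = [50, 250, 400, 15]  # setup, inputs download, execution, outputs upload
median_total = sum(medians)   # Mi on the slide
estimated_total = estimate_task_duration(medians, [42, 300], 20)  # Ei on the slide
print(median_total, estimated_total)  # prints "715 757"
```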
Activity blocked: levels and actions
—  Levels: identified from the platform logs extracted from VIP on EGI
    —  Figure: histogram (frequency) of the activity-blocked degree ηb, split at threshold τb
    —  Level 1: no actions
    —  Level 2: action is to replicate tasks
—  Actions
    —  Task replication
    —  Cancel replicas with bad performance
    —  Replicate only if all active replicas are running
—  Figure: replication process for one task
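The replication guard ("replicate only if all active replicas are running") can be expressed as a small predicate. This is a sketch under assumptions: the `Replica` class, its state names, and the function `should_replicate` are hypothetical, and the rule for cancelling badly performing replicas is not modelled because the slides do not define it.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    state: str  # hypothetical states: "queued", "running", "completed", "cancelled"


def should_replicate(level, replicas):
    """Return True when a new replica of the task may be submitted.

    A replica is created only at level 2 or above, and only when no active
    replica is still waiting in a queue (all active replicas are running).
    A task with no active replica at all is treated as replicable, which is
    an assumption made for this illustration.
    """
    if level < 2:
        return False
    active = [r for r in replicas if r.state in ("queued", "running")]
    return all(r.state == "running" for r in active)


decision_all_running = should_replicate(2, [Replica("running"), Replica("running")])
decision_one_queued = should_replicate(2, [Replica("running"), Replica("queued")])
```

Waiting for queued replicas to start avoids flooding the infrastructure with copies that have not yet had a chance to run.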
Activity Blocked: Results
—  Goal: Self-Healing vs No-Healing
    —  Cope with recoverable errors
—  Figure: makespan (s) of No-Healing and Self-Healing executions over 5 repetitions, one panel per application (FIELD-II/pasa and Mean-Shift/hs3)
—  Average execution speed-up: 3.4 in one panel and 2.9 in the other
—  Resource waste:
    $w = \frac{(\mathrm{CPU} + \mathrm{data})_{\text{self-healing}}}{(\mathrm{CPU} + \mathrm{data})_{\text{no-healing}}} - 1$
—  The Self-Healing process reduced resource consumption by up to 35% compared to the No-Healing execution
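The waste metric compares the resources consumed with and without healing; a negative value means the self-healing run consumed less. The sketch below only illustrates the formula: the function name `resource_waste` and the consumption figures are hypothetical, chosen so that the result is close to the 35% reduction quoted on the slide.

```python
def resource_waste(cpu_healing, data_healing, cpu_baseline, data_baseline):
    """Relative extra consumption of the self-healing run (negative = saving)."""
    return (cpu_healing + data_healing) / (cpu_baseline + data_baseline) - 1.0


# Hypothetical consumption, in seconds of CPU time and data transfer time.
waste = resource_waste(cpu_healing=500.0, data_healing=150.0,
                       cpu_baseline=800.0, data_baseline=200.0)
```

With these made-up numbers `waste` is about -0.35, i.e. a 35% saving.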
Number of Completed Tasks
—  Figure: CDF of the number of completed tasks over time (min) for repetitions 1 to 5, No-Healing vs Self-Healing
—  Curve similarities of up to 95% indicate similar grid conditions
Activity Blocked: Conclusions
—  First results in controlling blocked activities in these conditions
    —  Conditions: production system, non-clairvoyant, online
—  Limitations
    —  The method only works for bags of tasks
    —  The waste metric does not consider resource performance
—  Currently used in production by VIP
    —  From Aug 2012 to Oct 2013, more than 6,000 workflow executions benefited from it
—  Publications
    —  R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of operational workflow incidents on distributed computing infrastructures, IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Ottawa, Canada, 2012.
    —  R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workflow activity incidents on distributed computing infrastructures, Future Generation Computer Systems (FGCS), 2013.
Incident: Fineness Control
—  Low performance of lightweight (a.k.a. fine-grained) tasks:
    —  High queuing times
    —  Communication overhead
—  Lightweight task executions are delayed
—  Grouping tasks into coarse-grained tasks reduces the cost of data transfers when grouped tasks share input data, and saves queuing time
—  Figure: schedule of lightweight tasks t1 to t5 on resources R1 to R3 over time
Fineness Control: State of the Art
—  Task grouping in the literature
    —  Groups tasks based on the granularity size (processing time) [Muthuvelu et al., 2005]
    —  Adds bandwidth to the definition of the granularity size [Ng et al., 2006], [Ang et al., 2009]
    —  Defines the granularity size based on QoS requirements
        —  Task file size, CPU time, resource constraints [Muthuvelu et al., 2008]
    —  Drawback: only works under stationary load
—  Adaptive algorithms (non-stationary load)
    —  Monitor information about the current availability and capability of resources [Liu and Liao, 2009], [Muthuvelu et al., 2013]
—  All approaches make strong assumptions on task or resource characteristics
Fineness Control: Degree
—  Task execution
    —  Figure: timeline of a task (queued time, shared input data, other input data, application execution), annotated with $q_j$ and $\tilde{t}_{shared}$
    —  Based on median task phase durations
—  Incident degree
    $\eta_f = \max_{i \in [1,n]} \{ f_i = d_i \cdot r_i \}$
    where i is a waiting task and n is the number of waiting tasks
Fineness control: levels and actions
—  Levels: identified from the platform logs extracted from VIP on EGI
    —  Figure: histogram (frequency) of the fineness control degree ηf, split at threshold τf
    —  Level 1: no actions
    —  Level 2: action is task grouping
—  Actions
    —  Task grouping
    —  Tasks are grouped pairwise until $\eta_f \le \tau_f$ or until $Q \le R$
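The stopping rule of the grouping action can be illustrated with a short loop. This is a sketch under assumptions, not the published algorithm: the slides do not say which tasks are paired first, so the first two queued groups are merged at each step, and the degree is supplied by the caller as a hypothetical `degree` function because the terms of ηf are not defined here.

```python
def group_pairwise(queued, running_count, degree, tau):
    """Merge queued tasks two at a time until degree <= tau or Q <= R.

    queued: list of groups, each group being a list of task identifiers.
    running_count: number of running tasks (R).
    degree: hypothetical callable returning the fineness degree of the queue.
    tau: threshold on the degree.
    """
    groups = [list(group) for group in queued]
    while len(groups) > running_count and len(groups) > 1 and degree(groups) > tau:
        first, second = groups[0], groups[1]
        # The merged group goes to the back of the queue.
        groups = groups[2:] + [first + second]
    return groups


# Toy run: the degree is held above the threshold, so grouping stops on Q <= R.
grouped = group_pairwise([[1], [2], [3], [4], [5]], running_count=2,
                         degree=lambda groups: 1.0, tau=0.5)
```

In the toy run the five single tasks end up in two groups, which matches the two running slots.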
Coarseness control
—  Non-stationary load
    —  Loss of parallelism
    —  Task de-grouping
—  Figure: single tasks and grouped tasks (t2+t3, t4+t5) scheduled on resources R1 to R3; grouping can leave a resource idle (loss of parallelism)
—  Incident degree
    $\eta_c = \frac{R}{Q + R}$
—  Levels
    $\tau_c = 0.5$
    —  De-group tasks when $R > Q$
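With the threshold at 0.5, the condition ηc > τc is equivalent to R > Q. The sketch below illustrates that equivalence; the function names are hypothetical, and treating an empty activity (Q = R = 0) as degree 0 is an assumption.

```python
def coarseness_degree(queued, running):
    """eta_c = R / (Q + R), defined as 0.0 when there are no tasks at all."""
    total = queued + running
    return running / total if total else 0.0


def should_degroup(queued, running, tau_c=0.5):
    """De-group when the degree exceeds the threshold (R > Q for tau_c = 0.5)."""
    return coarseness_degree(queued, running) > tau_c


decisions = [should_degroup(q, r) for q, r in ((2, 6), (6, 2), (3, 3))]
```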
Results: Non-Stationary Load
—  Experiment
    —  Evaluate the de-grouping control process under non-stationary load
—  Figure: makespan (s) of Fineness, Fineness-Coarseness, and No-Granularity over runs 1 to 5; resources appear progressively in some runs and suddenly in others
—  Executions are sped up by up to a factor of 1.5 for Fineness and 2.1 for Fineness-Coarseness
—  Fineness is penalized by its lack of adaptation: slowdown of 20%
—  The linear correlation coefficient between the makespan and the average queuing time is 0.91, which indicates they are correlated
Task Granularity: Conclusions
—  First results in controlling task granularity in these conditions
    —  Conditions: production system, non-clairvoyant, online
—  Limitation
    —  The method only works for data-intensive workloads
—  Future Work
    —  Task pre-emption to handle the scenario where resources suddenly appear and all tasks are running
—  Publications
    —  R. Ferreira da Silva, T. Glatard, F. Desprez, On-line, non-clairvoyant optimization of workflow activity granularity task on grids, Euro-Par, Aachen, 2013.
    —  R. Ferreira da Silva, T. Glatard, F. Desprez, Controlling fairness and task granularity in distributed, online, non-clairvoyant workflow executions, Concurrency and Computation: Practice and Experience (CCPE), Submitted, 2014.
Incident: Unfairness Among Workflow Executions
—  Under resource contention, workflows are unequally slowed down by concurrent executions
—  Example: 3 identical workflows submitted sequentially, each with 5 tasks ($t_{i,j}$ = 10s), scheduled on resources R1 to R3
—  Slowdown:
    $s = \frac{M_{multi}}{M_{own}}$
    where $M_{multi}$ is the makespan with concurrent executions and $M_{own}$ is the makespan without concurrent executions
—  Slowdowns in the example:
    $s_1 = \frac{20}{20} = 1.0$, $s_2 = \frac{40}{20} = 2.0$, $s_3 = \frac{50}{20} = 2.5$
—  Identical workflow executions do not experience the same slowdown
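The slowdowns of the example can be reproduced with a small first-come, first-served simulation. This is an illustrative sketch: the scheduler model (each task goes to the resource that frees up first) and the function name `fifo_finish_times` are assumptions, while the workload (3 workflows of 5 tasks of 10s on 3 resources) is the one on the slide.

```python
import heapq


def fifo_finish_times(durations, n_resources):
    """Finish time of each task when tasks start in order on the first free resource."""
    free_at = [0.0] * n_resources
    heapq.heapify(free_at)
    finish = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available resource
        end = start + duration
        finish.append(end)
        heapq.heappush(free_at, end)
    return finish


TASKS_PER_WORKFLOW, N_WORKFLOWS, DURATION, N_RESOURCES = 5, 3, 10.0, 3

# Makespan of one workflow running alone (M_own).
m_own = max(fifo_finish_times([DURATION] * TASKS_PER_WORKFLOW, N_RESOURCES))

# Makespan of each workflow when all three share the resources (M_multi).
finish = fifo_finish_times([DURATION] * (TASKS_PER_WORKFLOW * N_WORKFLOWS), N_RESOURCES)
slowdowns = []
for w in range(N_WORKFLOWS):
    tasks_of_w = finish[w * TASKS_PER_WORKFLOW:(w + 1) * TASKS_PER_WORKFLOW]
    slowdowns.append(max(tasks_of_w) / m_own)

print(slowdowns)  # prints "[1.0, 2.0, 2.5]"
```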
Fairness: State of the Art
—  Workflow execution fairness in the literature
    —  Addresses fairness based on the slowdown of DAGs, computed from execution and data transfer times [Zhao and Sakellariou, 2006], [Casanova et al., 2010]
    —  Proposes a mapping procedure to increase fairness based on the critical path length [N’Takpe and Suter, 2009]
    —  Online, but clairvoyant, HEFT-like algorithms [Hsu et al., 2011], [Sommerfield and Richter, 2011], [Arabnejad and Barbosa, 2012]
    —  Non-clairvoyant, but offline, scheduling strategy based on task labeling and adaptive allocation [Hirales-Carbajal et al., 2012]
—  No algorithm was proposed for the non-clairvoyant and online case
Fairness Control: Degree
—  Unfairness degree: maximum difference between the fractions of pending work
    $\eta_u = W_{max} - W_{min}$
    where:
    $W_i = \max_{j \in [1, n_i]} \left( \frac{Q_{i,j}}{Q_{i,j} + R_{i,j} \cdot P_{i,j}} \cdot T_{i,j} \right)$
    —  j = activity, $n_i$ = number of active activities of workflow execution i
    —  $Q_{i,j}$ = number of waiting tasks
    —  $R_{i,j}$ = number of running tasks
    —  $P_{i,j}$ = performance; a low $P_{i,j}$ indicates that resources allocated to the activity have bad performance for the activity
    —  $T_{i,j}$ = relative observed duration
    —  Based on median task phase durations
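The unfairness degree can be computed directly from per-activity counters. In this sketch each activity is a tuple (Q, R, P, T) and each workflow is a list of such tuples; that layout, the function names, and the sample numbers are hypothetical, and activities with no waiting and no running work are skipped as an assumption.

```python
def pending_work_fraction(activities):
    """W_i: largest value of Q / (Q + R * P) * T over the active activities."""
    return max(
        (q / (q + r * p) * t for q, r, p, t in activities if q + r * p > 0),
        default=0.0,
    )


def unfairness_degree(workflows):
    """eta_u = W_max - W_min over all running workflow executions."""
    fractions = [pending_work_fraction(activities) for activities in workflows]
    return max(fractions) - min(fractions)


# Hypothetical snapshot: workflow A is mostly waiting, workflow B mostly running.
workflow_a = [(8, 2, 1.0, 1.0)]
workflow_b = [(1, 4, 1.0, 1.0), (0, 3, 0.5, 1.0)]
eta_u = unfairness_degree([workflow_a, workflow_b])
```

With these made-up counters `eta_u` is about 0.6, since workflow A has 0.8 of its work pending and workflow B only 0.2.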
Fairness Control: Levels and Actions
—  Levels: identified from the platform logs extracted from VIP on EGI
    —  Figure: histogram (frequency) of the fairness control degree ηu, split at threshold τu
    —  Level 1: no actions
    —  Level 2: action is task prioritization
—  Actions
    —  Task prioritization
    —  Task priority is an integer initialized to 1
    —  Increase the priority of $\Delta_{i,j}$ tasks
Fairness Control: Metrics
—  Unfairness
    —  Is the area under the curve ηu during the execution:
        $\mu = \sum_{i=2}^{M} \eta_u(t_i) \cdot (t_i - t_{i-1})$
    —  This metric measures whether the fairness process can indeed minimize its own criterion ηu
—  Slowdown
    $s = \frac{M_{multi}}{M_{own}}$
    where:
    $M_{own} = \max_{p \in \Omega} \sum_{u \in p} t_u$
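The unfairness metric μ is a right-endpoint sum over the sampled values of ηu. The sketch below follows the summation term by term; the function name `unfairness_area` and the sample series are hypothetical.

```python
def unfairness_area(times, degrees):
    """mu = sum over i >= 2 of eta_u(t_i) * (t_i - t_{i-1}).

    times and degrees are parallel lists of sample instants and eta_u values.
    """
    if len(times) != len(degrees):
        raise ValueError("times and degrees must have the same length")
    return sum(
        degrees[i] * (times[i] - times[i - 1]) for i in range(1, len(times))
    )


# Hypothetical samples of eta_u during one execution.
sample_times = [0.0, 10.0, 30.0, 60.0]
sample_degrees = [0.0, 0.5, 0.2, 0.1]
mu = unfairness_area(sample_times, sample_degrees)
```

For this series the three terms are 0.5 × 10, 0.2 × 20, and 0.1 × 30, so `mu` is about 12.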
Results: identical workflows
—  Tests whether unfairness among identical workflows is properly addressed
—  Figure: makespan (s) of three Gate workflow executions (Gate 1, Gate 2, Gate 3), Fairness vs No-Fairness, repetitions 1 to 4
—  Figure: unfairness degree over time (s), Fairness vs No-Fairness, repetitions 1 to 4
—  Makespans and unfairness degree values are significantly reduced
Results: different workflows
—  Tests whether unfairness among different workflows is detected and properly handled
—  Figure: slowdown of FIELD-II, Gate, PET-Sorteo, and SimuBloch, Fairness vs No-Fairness, repetitions 1 to 4
—  Figure: unfairness degree over time (s), Fairness vs No-Fairness, repetitions 1 to 4
—  The standard deviation of the slowdown is reduced by up to a factor of 3.8, and the unfairness value by up to a factor of 1.9
Fairness Control: Conclusions
—  First results in controlling fairness among workflow executions in these conditions
    —  Conditions: production system, non-clairvoyant, online
—  Limitations
    —  Fairness optimization is delayed by the time needed to acquire information about the applications
    —  The method works best for applications with many short tasks
—  Future Work
    —  Evaluation of the influence of the metrics’ parameters
—  Publications
    —  R. Ferreira da Silva, T. Glatard, F. Desprez, Workflow fairness control on online and non-clairvoyant distributed computing platforms, Euro-Par, Aachen, 2013.
    —  R. Ferreira da Silva, T. Glatard, F. Desprez, Controlling fairness and task granularity in distributed, online, non-clairvoyant workflow executions, Concurrency and Computation: Practice and Experience (CCPE), Submitted, 2014.
Contributions Summary
—  Self-healing of workflow incidents [Ferreira da Silva et al., CCGRID’12, FGCS’13]
    —  Generic MAPE-K loop
    —  Non-clairvoyant and online
—  Treatment of blocked activities
    —  Properly detects and handles blocked activities
—  Optimization of task granularity [Ferreira da Silva et al., EuroPar’13a]
    —  Properly detects and handles lightweight tasks under stationary and non-stationary loads
—  Fairness control among workflow executions [Ferreira da Silva et al., EuroPar’13b, CPE’14]
    —  Properly detects and handles unfairness among workflow executions
—  Science-gateway model for workload archive [Ferreira da Silva and Glatard, CGWS’12]
    —  Illustration using traces of VIP from 2011/2012
—  All methods were evaluated on VIP [Ferreira da Silva et al., HealthGrid’11; Glatard et al., TMI’13]
    —  Production platform with about 500 users
Perspectives
—  Mode detection automation
    —  Automatically detect variations in threshold values
—  Time-windowed historical information
    —  User behavior may change
    —  Errors may be restricted to a specific time span
—  Optimization of the incident selection method
    —  There is no mechanism to prevent an incident from being selected repeatedly
—  Sensitivity analysis of parameters
    —  Evaluate the influence of parameters on the metrics
—  Workflow workload archive
    —  The science-gateway workload archive model does not cover all characteristics inherent to a workflow execution
Thank you for your attention.
Questions?

http://vip.creatis.insa-lyon.fr
Rafael FERREIRA DA SILVA

University of Lyon, CNRS, INSERM, CREATIS
Villeurbanne, France

Supervisors:
Frédéric DESPREZ and Tristan GLATARD

Más contenido relacionado

La actualidad más candente

The Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource ProvisioningThe Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource ProvisioningRafael Ferreira da Silva
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesDataWorks Summit/Hadoop Summit
 
Task scheduling Survey in Cloud Computing
Task scheduling Survey in Cloud ComputingTask scheduling Survey in Cloud Computing
Task scheduling Survey in Cloud ComputingRamandeep Kaur
 
Auto-scaling Techniques for Elastic Data Stream Processing
Auto-scaling Techniques for Elastic Data Stream ProcessingAuto-scaling Techniques for Elastic Data Stream Processing
Auto-scaling Techniques for Elastic Data Stream ProcessingZbigniew Jerzak
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsLeila panahi
 
Adaptive Replication for Elastic Data Stream Processing
Adaptive Replication for Elastic Data Stream ProcessingAdaptive Replication for Elastic Data Stream Processing
Adaptive Replication for Elastic Data Stream ProcessingZbigniew Jerzak
 
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...IJSRD
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceinside-BigData.com
 
Noha danms13 talk_final
Noha danms13 talk_finalNoha danms13 talk_final
Noha danms13 talk_finalNoha Elprince
 
4838281 operating-system-scheduling-on-multicore-architectures
4838281 operating-system-scheduling-on-multicore-architectures4838281 operating-system-scheduling-on-multicore-architectures
4838281 operating-system-scheduling-on-multicore-architecturesIslam Samir
 
Scientific Applications of The Data Distribution Service
Scientific Applications of The Data Distribution ServiceScientific Applications of The Data Distribution Service
Scientific Applications of The Data Distribution ServiceAngelo Corsaro
 
TASK SCHEDULING USING AMALGAMATION OF MET HEURISTICS SWARM OPTIMIZATION ALGOR...
TASK SCHEDULING USING AMALGAMATION OF MET HEURISTICS SWARM OPTIMIZATION ALGOR...TASK SCHEDULING USING AMALGAMATION OF MET HEURISTICS SWARM OPTIMIZATION ALGOR...
TASK SCHEDULING USING AMALGAMATION OF MET HEURISTICS SWARM OPTIMIZATION ALGOR...Journal For Research
 
BDSE 2015 Evaluation of Big Data Platforms with HiBench
BDSE 2015 Evaluation of Big Data Platforms with HiBenchBDSE 2015 Evaluation of Big Data Platforms with HiBench
BDSE 2015 Evaluation of Big Data Platforms with HiBencht_ivanov
 
A Queue Simulation Tool for a High Performance Scientific Computing Center
A Queue Simulation Tool for a High Performance Scientific Computing CenterA Queue Simulation Tool for a High Performance Scientific Computing Center
A Queue Simulation Tool for a High Performance Scientific Computing CenterJames McGalliard
 
GRIMES_Visualizing_Telemetry
GRIMES_Visualizing_TelemetryGRIMES_Visualizing_Telemetry
GRIMES_Visualizing_TelemetryKevin Grimes
 
Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Anubhav Jain
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...Ian Foster
 
Task Scheduling in Grid Computing.
Task Scheduling in Grid Computing.Task Scheduling in Grid Computing.
Task Scheduling in Grid Computing.Ramandeep Kaur
 
Genetic Algorithm for Process Scheduling
Genetic Algorithm for Process SchedulingGenetic Algorithm for Process Scheduling
Genetic Algorithm for Process SchedulingLogin Technoligies
 
How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?Anubhav Jain
 

La actualidad más candente (20)

The Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource ProvisioningThe Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource Provisioning
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with Dependencies
 
Task scheduling Survey in Cloud Computing
Task scheduling Survey in Cloud ComputingTask scheduling Survey in Cloud Computing
Task scheduling Survey in Cloud Computing
 
Auto-scaling Techniques for Elastic Data Stream Processing
Auto-scaling Techniques for Elastic Data Stream ProcessingAuto-scaling Techniques for Elastic Data Stream Processing
Auto-scaling Techniques for Elastic Data Stream Processing
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling Algorithms
 
Adaptive Replication for Elastic Data Stream Processing
Adaptive Replication for Elastic Data Stream ProcessingAdaptive Replication for Elastic Data Stream Processing
Adaptive Replication for Elastic Data Stream Processing
 
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
Fault Tolerance in Big Data Processing Using Heartbeat Messages and Data Repl...
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental science
 
Noha danms13 talk_final
Noha danms13 talk_finalNoha danms13 talk_final
Noha danms13 talk_final
 
4838281 operating-system-scheduling-on-multicore-architectures
4838281 operating-system-scheduling-on-multicore-architectures4838281 operating-system-scheduling-on-multicore-architectures
4838281 operating-system-scheduling-on-multicore-architectures
 
Scientific Applications of The Data Distribution Service
Scientific Applications of The Data Distribution ServiceScientific Applications of The Data Distribution Service
Scientific Applications of The Data Distribution Service
 
TASK SCHEDULING USING AMALGAMATION OF MET HEURISTICS SWARM OPTIMIZATION ALGOR...
TASK SCHEDULING USING AMALGAMATION OF MET HEURISTICS SWARM OPTIMIZATION ALGOR...TASK SCHEDULING USING AMALGAMATION OF MET HEURISTICS SWARM OPTIMIZATION ALGOR...
TASK SCHEDULING USING AMALGAMATION OF MET HEURISTICS SWARM OPTIMIZATION ALGOR...
 
BDSE 2015 Evaluation of Big Data Platforms with HiBench
BDSE 2015 Evaluation of Big Data Platforms with HiBenchBDSE 2015 Evaluation of Big Data Platforms with HiBench
BDSE 2015 Evaluation of Big Data Platforms with HiBench
 
A Queue Simulation Tool for a High Performance Scientific Computing Center
A Queue Simulation Tool for a High Performance Scientific Computing CenterA Queue Simulation Tool for a High Performance Scientific Computing Center
A Queue Simulation Tool for a High Performance Scientific Computing Center
 
GRIMES_Visualizing_Telemetry
GRIMES_Visualizing_TelemetryGRIMES_Visualizing_Telemetry
GRIMES_Visualizing_Telemetry
 
Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
Task Scheduling in Grid Computing.
Task Scheduling in Grid Computing.Task Scheduling in Grid Computing.
Task Scheduling in Grid Computing.
 
Genetic Algorithm for Process Scheduling
Genetic Algorithm for Process SchedulingGenetic Algorithm for Process Scheduling
Genetic Algorithm for Process Scheduling
 
How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?
 

A science-gateway for workflow executions: online and non-clairvoyant self-healing of workflow executions on grids

  • 1. A science-gateway for workflow executions: online and non-clairvoyant self-healing of workflow executions on grids. Rafael FERREIRA DA SILVA, University of Lyon, CNRS, INSERM, CREATIS, Villeurbanne, France. Supervisors: Frédéric DESPREZ and Tristan GLATARD. This work was funded by the French National Agency for Research under grant ANR-09-COSI-03 "VIP".
  • 2. Outline
      • Technical context and challenges
      • Contributions
          • Self-healing of workflow executions on grids
          • Treatment of blocked activities
          • Optimization of task granularity
          • Fairness control among workflow executions
      • Conclusions
  • 4. Heavy Medical Simulations
      • Virtual Imaging Platform: a medical-imaging execution platform with 491 users from 52 countries
      • Example simulations
          • Treatment planning for prostate proton therapy [L. Grevillot, D. Sarrut]; CPU time: 2 months
          • Simulated diffusion-weighted images [L. Wang, Y. Zhu, I. Magnin]; CPU time: 8 years
          • Echography simulation [O. Bernard, M. Alessandrini]; CPU time: 42 hours
      • Public computing infrastructure: 150 computing sites worldwide
      • Goal: self-healing of workflow executions on grids to handle operational issues
  • 5. Workflow Execution: Science-Gateway
      • Virtual Imaging Platform (VIP): high-level interface, Software-as-a-Service
      • Execution steps
          1. Input data upload
          2. User launches a simulation (application workflow)
          3. Workflow engine generates invocations
          4. Invocations are wrapped into grid jobs
          5. Jobs are submitted to a Pilot Engine
          6. Pilot jobs are submitted to the distributed infrastructure
          7. Pilot jobs fetch grid jobs
          8. Inputs download
          9. Execution
          10. Results upload
          11. Download results
  • 6. Workflow Execution: Workflow Management System (same 11 execution steps as slide 5)
      • Applications described as workflows
      • Parallel language
      • Grid-aware enactor
  • 7. Workflow Execution: Workload Management System (same 11 execution steps as slide 5)
      • Pilot jobs run special agents that fetch user tasks from the task queue, set up their environment, and steer their execution
  • 8. Workflow Execution: European Grid Infrastructure (EGI) (same 11 execution steps as slide 5)
      • More than 100 computing sites
      • More than 25,000 job slots
      • About 4 PB of storage
  • 9. Challenges
      • Several workflow execution errors: the average workflow completion rate is about 60%
      • [Figure: number of launched and completed workflows in VIP from January to December 2012]
      • Several dysfunctions and performance problems that require manual intervention
      • Problem: costly manual operations, e.g. rescheduling tasks, restarting services, killing misbehaving experiments, or replicating data files
  • 10. Objectives
      • Objective: automated platform administration
          • Autonomous detection of operational incidents
          • Perform an appropriate set of actions
      • Assumptions: online and non-clairvoyant
          • Decisions must be fast
          • No information about tasks (duration, data transfer time, etc.)
          • No information about resources (availability, performance, etc.)
          • No prediction of user activity or workloads
  • 12. State of the Art
      • Self-healing of workflow executions
          • Most works in the literature are offline and/or clairvoyant
      • Common techniques to address operational incidents
          • Task resubmission: [Kandaswamy et al., 2008], [Zhang et al., 2009], [Montagnat et al., 2010]
          • Task and file replication: [Cirne et al., 2007], [Ben-Yehuda et al., 2012], [Ma et al., 2013]
          • Task grouping: [Muthuvelu et al., 2005-2013], [Liu and Liao, 2009], [Chen et al., 2013]
          • Heuristics to fairly schedule workflow tasks: [Zhao and Sakellariou, 2006], [N’Takpe and Suter, 2009], [Casanova et al., 2010]
  • 13. Fuzzy Finite State Machine
      • The healing process sets the degree of FuSM states from incident detection metrics
      • Crisp states: possible values are 0 or 1
      • Fuzzy states: values between 0 and 1
  • 14. General MAPE-K loop
      • Loop components: Monitoring, Analysis, Planning, Execution, Knowledge
      • Trigger: an event (job completion or failure) or a timeout
      • Monitoring: collects monitoring data
      • Analysis: computes a degree η for each incident and maps it to a discrete level (level 1, 2, or 3); in the example, incident 1 has η = 0.8, incident 2 has η = 0.4, and incident 3 has η = 0.1
      • Planning: roulette wheel selection based on association rules, with incident $i$ weighted by $\eta_i / \sum_{j=1}^{n} \eta_j$
      • Association rules for incident 1:

            Rule     Confidence (ρ)   ρ × η
            1 → 1    1.0              0.80
            2 → 1    0.8              0.32
            3 → 1    0.2              0.02

      • Execution: applies the set of actions of the selected incident
      • [Figure: histogram of incident degrees (0 to 1) vs. frequency]
      • Reference: R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workflow activity incidents on distributed computing infrastructures, Future Generation Computer Systems (FGCS), 2013.
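The roulette wheel selection on slide 14 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that each association rule is weighted by the product ρ × η from the slide's table and that weights are normalized into selection probabilities. The names `RULES`, `selection_probabilities`, and `roulette_wheel` are hypothetical.

```python
import random

# Association rules for incident 1, as listed on slide 14:
# (rule label, confidence rho, degree eta of the rule's source incident).
RULES = [
    ("1 -> 1", 1.0, 0.8),
    ("2 -> 1", 0.8, 0.4),
    ("3 -> 1", 0.2, 0.1),
]


def selection_probabilities(rules):
    """Normalize the rho * eta weights so that they sum to 1 (assumption)."""
    weights = [rho * eta for _, rho, eta in rules]
    total = sum(weights)
    return [weight / total for weight in weights]


def roulette_wheel(rules, rng=random):
    """Pick one rule with probability proportional to rho * eta."""
    weights = [rho * eta for _, rho, eta in rules]
    return rng.choices(rules, weights=weights, k=1)[0]


if __name__ == "__main__":
    for (label, _, _), prob in zip(RULES, selection_probabilities(RULES)):
        print(f"rule {label}: selection probability {prob:.3f}")
    print("selected:", roulette_wheel(RULES)[0])
```

With the slide's values, rule 1 → 1 dominates the wheel, but the lower-weighted rules can still be drawn, which is the point of a roulette wheel over a plain argmax.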
  • 15. Incident Levels and Actions
      • Incident degrees are quantified into discrete incident levels
      • Thresholds τ are determined from mode clustering: they cluster platform configurations into groups
      • Depending on the level, either no actions are triggered or a set of actions is triggered
  • 16. A-priori knowledge
      • Based on the workload of VIP from January 2011 to April 2012
      • 112 users, 2,941 workflow executions, 339,545 pilot jobs
      • 680,988 tasks: 338,989 completed, 138,480 error, 105,488 aborted, 15,576 aborted replicas, 48,293 stalled, 34,162 queued
      • Reference: R. Ferreira da Silva, T. Glatard, A science-gateway workload archive to study pilot jobs, user activity, bag of tasks, task sub-steps, and workflow executions, CoreGRID/ERCIM Workshop on Grids, Clouds and P2P Computing (CGWS), Rhodes Island, Greece, 2012.
  • 18. Incident: Activity Blocked
      • A task is late compared to the others (long-tail effect)
      • [Figures: job flow and task completion rate of a real simulation (FIELD-II/pasa, workflow-9SIeNv); axes: time (s) and completed jobs]
      • Possible causes
          • Longer waiting times
          • Lost tasks (e.g. killed by the site due to quota violation)
          • Resources with poor performance
  • 19. Activity Blocked: State of the Art
      • Task replication
          • Commonly used to address non-clairvoyant problems
          • Drawback: may overload the system and degrade fairness
      • Task replication in the literature
          • Used to increase the probability of completing a task [Ramakrishnan et al., 2009]
          • Weibull distribution used to estimate the number of replicas [Litke et al., 2007]
          • Tasks replicated only in the tail phase [Ben-Yehuda et al., 2012]
          • Evaluation of the resources wasted by replication [Cirne et al., 2007]
      • All approaches make strong assumptions on task or resource characteristics
  • 20. Activity Blocked: Degree
      • The degree is computed from all completed tasks of the activity
      • Task phases: setup → inputs download → execution → outputs upload
      • Assumption: bag of tasks (all tasks have equal durations)
      • Median-based estimation:

            Phase             Median duration   Estimated task duration   Real task duration
            Setup             50 s              42 s                      42 s (completed)
            Inputs download   250 s             300 s                     300 s (completed)
            Execution         400 s             400 s*                    20 s (current)
            Outputs upload    15 s              15 s                      ?
            Total             M_i = 715 s       E_i = 757 s

            *: max(400 s, 20 s) = 400 s

      • Incident degree: task performance with respect to the median
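The median-based estimation on slide 20 can be reproduced with a few lines of Python. This is a sketch under one reading of the slide's example: completed phases contribute their real duration, the current phase contributes the maximum of its median and its elapsed time, and phases not yet started contribute their median. The function and variable names are hypothetical, and the incident degree itself is not computed because its formula does not appear in the transcript.

```python
# Task phases in execution order, as listed on slide 20.
PHASES = ["setup", "inputs_download", "execution", "outputs_upload"]


def estimate_duration(medians, observed, current_phase):
    """Estimate the total duration E_i of a running task.

    medians:       median duration (s) of each phase over completed tasks
    observed:      real durations (s) of completed phases, plus the time
                   elapsed so far in the current phase
    current_phase: name of the phase the task is currently in
    """
    current_index = PHASES.index(current_phase)
    total = 0
    for index, phase in enumerate(PHASES):
        if index < current_index:
            total += observed[phase]  # completed phase: real duration
        elif index == current_index:
            # current phase: at least the median, more if already exceeded
            total += max(medians[phase], observed[phase])
        else:
            total += medians[phase]  # future phase: median
    return total


# Values read from the slide's example.
medians = {"setup": 50, "inputs_download": 250,
           "execution": 400, "outputs_upload": 15}
observed = {"setup": 42, "inputs_download": 300, "execution": 20}

M_i = sum(medians.values())
E_i = estimate_duration(medians, observed, "execution")

if __name__ == "__main__":
    print(f"M_i = {M_i} s, E_i = {E_i} s")
```

Because the estimate never drops below the median for the current phase, a task only starts to look late once it actually exceeds what completed tasks needed.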
  • 21. Activity Blocked: Levels and Actions
      • Levels: identified from the platform logs extracted from VIP on EGI
          • [Figure: histogram of the activity blocked degree $\eta_b$ (0 to 1) vs. frequency, with threshold $\tau_b$ separating level 1 from level 2]
          • Level 1: no actions
          • Level 2: action is to replicate tasks
      • Actions
          • Task replication
          • Cancel replicas with bad performance
          • Replicate only if all active replicas are running
      • [Figure: replication process for one task]
  • 22. Activity Blocked: Results
      • Goal: compare Self-Healing with No-Healing; cope with recoverable errors
      • [Figure: makespan (s) over 5 repetitions, Self-Healing vs. No-Healing, for Mean-Shift/hs3 and FIELD-II/pasa]
      • Average execution speed-ups: 3.4 and 2.9 (one value per application)
      • Resource waste: $w = \frac{(\mathrm{CPU} + \mathrm{data})_{\text{self-healing}}}{(\mathrm{CPU} + \mathrm{data})_{\text{no-healing}}} - 1$
      • The Self-Healing process reduced resource consumption by up to 35% compared to the No-Healing execution
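The resource waste metric on slide 22 compares the CPU and data-transfer time consumed with and without healing. The sketch below shows how the sign of w reads; the function name and the input numbers are hypothetical and were chosen only so that the result matches the slide's "up to 35%" reduction.

```python
def resource_waste(cpu_healing, data_healing, cpu_no_healing, data_no_healing):
    """w = (CPU + data)_self-healing / (CPU + data)_no-healing - 1.

    w < 0: the self-healing run consumed fewer resources.
    w > 0: the self-healing run wasted resources (e.g. through replicas).
    """
    return (cpu_healing + data_healing) / (cpu_no_healing + data_no_healing) - 1.0


if __name__ == "__main__":
    # Hypothetical consumption figures in seconds, not taken from the slides.
    w = resource_waste(500.0, 150.0, 800.0, 200.0)
    print(f"w = {w:.2f}")  # negative: self-healing consumed less
```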
  • 23. Number of Completed Tasks
      • [Figure: CDF of the number of completed tasks over time (min) for repetitions 1 to 5, Self-Healing vs. No-Healing]
      • Curve similarities of up to 95% indicate similar grid conditions
  • 24. Activity Blocked: Conclusions
      • First results in controlling blocked activities under these conditions: production system, non-clairvoyant, online
      • Limitations
          • The method only works for bags of tasks
          • The waste metric does not consider resource performance
      • Currently used in production by VIP: from August 2012 to October 2013, more than 6,000 workflow executions benefited
      • Publications
          • R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of operational workflow incidents on distributed computing infrastructures, IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Ottawa, Canada, 2012.
          • R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workflow activity incidents on distributed computing infrastructures, Future Generation Computer Systems (FGCS), 2013.
  • 26. Incident: Fineness Control
      • Low performance of lightweight (a.k.a. fine-grained) tasks
          • High queuing times
          • Communication overhead
      • Lightweight task executions are delayed
      • Grouping them into coarse-grained tasks reduces the cost of data transfers when grouped tasks share input data, and saves queuing time
      • [Figure: lightweight tasks t1 to t5 scheduled on resources R1 to R3 over time]
  • 27. Fineness Control: State of the Art
      • Task grouping in the literature
          • Groups tasks based on the granularity size (processing time) [Muthuvelu et al., 2005]
          • Adds bandwidth to the definition of the granularity size [Ng et al., 2006], [Ang et al., 2009]
          • Defines the granularity size based on QoS requirements: task file size, CPU time, resource constraints [Muthuvelu et al., 2008]
          • Drawback: only works under stationary load
      • Adaptive algorithms (non-stationary load)
          • Monitor information about the current availability and capability of resources [Liu and Liao, 2009], [Muthuvelu et al., 2013]
      • All approaches make strong assumptions on task or resource characteristics
  • 28. Fineness Control: Degree
      • Task execution phases: queued time $q_j$, shared input data transfer (median duration $\tilde{t}_{shared}$), other input data transfer, application execution
      • Incident degree: $\eta_f = \max_{i \in [1, m]} \{ f_i = d_i \cdot r_i \}$
      • Based on median task phase durations
      • i = waiting task, n = number of waiting tasks
  • 29. Fineness Control: Levels and Actions
      • Levels: identified from the platform logs extracted from VIP on EGI
          • [Figure: histogram of the fineness control degree $\eta_f$ (0 to 1) vs. frequency, with threshold $\tau_f$ separating level 1 from level 2]
          • Level 1: no actions
          • Level 2: action is task grouping
      • Actions
          • Task grouping: tasks are grouped pairwise until $\eta_f \le \tau_f$ or until $Q \le R$
  • 30. Coarseness Control
      • Non-stationary load
          • Loss of parallelism
          • Task de-grouping
      • [Figure: tasks t1 to t5 at time t1 and grouped tasks t2+t3 and t4+t5 at time t2, scheduled on resources R1 to R3, illustrating the loss of parallelism]
      • Incident degree: $\eta_c = \frac{R}{Q + R}$
      • Levels: $\tau_c = 0.5$, i.e. de-group tasks when $R > Q$
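The coarseness rule on slide 30 reduces to a single comparison, since $\eta_c = R / (Q + R) > 0.5$ holds exactly when $R > Q$. The sketch below assumes that R is the number of running tasks and Q the number of waiting (queued) tasks, as in the definitions given on the fairness slide; the function names and the handling of the empty case are hypothetical.

```python
TAU_C = 0.5  # threshold given on slide 30


def coarseness_degree(running, queued):
    """eta_c = R / (Q + R); defined as 0.0 when there are no tasks (assumption)."""
    if running + queued == 0:
        return 0.0
    return running / (queued + running)


def should_degroup(running, queued, tau_c=TAU_C):
    """De-group tasks when eta_c exceeds tau_c, i.e. when R > Q for tau_c = 0.5."""
    return coarseness_degree(running, queued) > tau_c


if __name__ == "__main__":
    for running, queued in [(2, 6), (5, 5), (6, 2)]:
        degree = coarseness_degree(running, queued)
        print(f"R={running} Q={queued} eta_c={degree:.2f} "
              f"de-group={should_degroup(running, queued)}")
```

Together with the grouping stop condition $Q \le R$ on the previous slide, this gives a simple hysteresis: grouping stops once running tasks catch up with queued ones, and de-grouping starts only when running tasks outnumber them.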
  • 31. Results: Non-Stationary Load
      • Experiment: evaluate the de-grouping control process under non-stationary load
      • [Figure: makespan (s) for runs 1 to 5, comparing Fineness, Fineness-Coarseness, and No-Granularity; resources appear progressively in some runs and suddenly in others]
      • Executions are sped up by up to a factor of 1.5 for Fineness and 2.1 for Fineness-Coarseness
      • Fineness is penalized by its lack of adaptation: slowdown of 20%
      • The linear correlation coefficient between the makespan and the average queuing time is 0.91, which indicates that they are correlated
  • 32. Task Granularity: Conclusions
      • First results in controlling task granularity under these conditions: production system, non-clairvoyant, online
      • Limitation
          • The method only works for data-intensive workloads
      • Future work
          • Task pre-emption to handle the scenario where resources suddenly appear and all tasks are running
      • Publications
          • R. Ferreira da Silva, T. Glatard, F. Desprez, On-line, non-clairvoyant optimization of workflow activity granularity on grids, Euro-Par, Aachen, 2013.
          • R. Ferreira da Silva, T. Glatard, F. Desprez, Controlling fairness and task granularity in distributed, online, non-clairvoyant workflow executions, Concurrency and Computation: Practice and Experience (CCPE), submitted, 2014.
  • 34. Incident: Unfairness Among Workflow Executions
      • Under resource contention, workflows are unequally slowed down by concurrent executions
      • Slowdown: $s = \frac{M_{multi}}{M_{own}}$, where $M_{multi}$ is the makespan with concurrent executions and $M_{own}$ is the makespan without concurrent executions
      • Example: 3 identical workflows submitted sequentially ($t_{i,j} = 10$ s), scheduled on resources R1 to R3
          • $s_1 = 20 / 20 = 1.0$
          • $s_2 = 40 / 20 = 2.0$
          • $s_3 = 50 / 20 = 2.5$
      • Identical workflow executions do not experience the same slowdown
  • 35. Fairness: State of the Art
      • Workflow execution fairness in the literature
          • Addresses fairness through the slowdown of DAGs, based on execution and data transfer times [Zhao and Sakellariou, 2006], [Casanova et al., 2010]
          • Proposes a mapping procedure to increase fairness based on the critical path length [N’Takpe and Suter, 2009]
          • Online, but clairvoyant, HEFT-like algorithms [Hsu et al., 2011], [Sommerfield and Richter, 2011], [Arabnejad and Barbosa, 2012]
          • Non-clairvoyant, but offline, scheduling strategy based on task labeling and adaptive allocation [Hirales-Carbajal et al., 2012]
      • No algorithm has been proposed for the non-clairvoyant and online case
  • 36. Fairness Control: Degree
      • Unfairness degree: $\eta_u = W_{max} - W_{min}$, the maximum difference between the fractions of pending work
      • where $W_i = \max_{j \in [1, n_i]} \left( \frac{Q_{i,j}}{Q_{i,j} + R_{i,j} \cdot P_{i,j}} \cdot T_{i,j} \right)$
          • i = activity, $n_i$ = active activities
          • $Q_{i,j}$ = number of waiting tasks
          • $R_{i,j}$ = number of running tasks
          • $P_{i,j}$ = performance; a low $P_{i,j}$ indicates that the resources allocated to the activity perform badly for that activity
          • $T_{i,j}$ = relative observed duration
          • Based on median task phase durations
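The unfairness degree on slide 36 can be illustrated with a short computation. This sketch assumes that $W_i$ is computed per workflow execution as the maximum over its active activities j, and that $\eta_u$ is the spread of $W_i$ across concurrent workflow executions. The dictionary layout, the function names, the zero-denominator handling, and the sample values are all hypothetical.

```python
def pending_work_fraction(activities):
    """W_i = max over activities j of Q / (Q + R * P) * T.

    Each activity is a dict with keys:
      Q: number of waiting tasks     R: number of running tasks
      P: performance of resources    T: relative observed duration
    """
    fractions = []
    for activity in activities:
        denominator = activity["Q"] + activity["R"] * activity["P"]
        if denominator == 0:
            continue  # assumption: skip activities with no pending or running work
        fractions.append(activity["Q"] / denominator * activity["T"])
    return max(fractions, default=0.0)


def unfairness_degree(workflows):
    """eta_u = W_max - W_min over all concurrent workflow executions."""
    fractions = [pending_work_fraction(activities) for activities in workflows]
    return max(fractions) - min(fractions)


# Hypothetical platform state: workflow A is mostly waiting, B is mostly running.
workflow_a = [{"Q": 8, "R": 2, "P": 1.0, "T": 1.0}]
workflow_b = [{"Q": 2, "R": 8, "P": 1.0, "T": 1.0}]
eta_u = unfairness_degree([workflow_a, workflow_b])

if __name__ == "__main__":
    print(f"eta_u = {eta_u:.2f}")
```

Lowering P for an activity shrinks the denominator, so a workflow whose tasks run on badly performing resources counts as having more pending work.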
  • 37. Fairness Control: Levels and Actions
      • Levels: identified from the platform logs extracted from VIP on EGI
          • [Figure: histogram of the fairness control degree $\eta_u$ (0 to 1) vs. frequency, with threshold $\tau_u$ separating level 1 from level 2]
          • Level 1: no actions
          • Level 2: action is task prioritization
      • Actions
          • Task prioritization
          • Task priority is an integer initialized to 1
          • Increase the priority of $\Delta_{i,j}$ tasks
  • 38. Fairness Control: Metrics
      • Unfairness: the area under the curve $\eta_u$ during the execution, $\mu = \sum_{i=2}^{M} \eta_u(t_i) \cdot (t_i - t_{i-1})$. This metric measures whether the fairness process can indeed minimize its own criterion $\eta_u$
      • Slowdown: $s = \frac{M_{multi}}{M_{own}}$, where $M_{own} = \max_{p \in \Omega} \sum_{u \in p} t_u$
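Both metrics on slide 38 are straightforward to compute from logged samples. The sketch below assumes that $\eta_u$ is sampled at times $t_1, \dots, t_M$ and that Ω is the set of paths of the workflow, each path given as a list of task durations $t_u$. The function names and the sample data are hypothetical.

```python
def unfairness_area(times, degrees):
    """mu = sum for i = 2..M of eta_u(t_i) * (t_i - t_{i-1})."""
    return sum(
        degrees[i] * (times[i] - times[i - 1])
        for i in range(1, len(times))
    )


def own_makespan(paths):
    """M_own = max over paths p of the sum of task durations t_u on p."""
    return max(sum(path) for path in paths)


def slowdown(makespan_multi, makespan_own):
    """s = M_multi / M_own."""
    return makespan_multi / makespan_own


if __name__ == "__main__":
    # Hypothetical samples of the unfairness degree over time (seconds).
    times = [0, 10, 20, 40]
    degrees = [0.0, 0.5, 0.25, 0.0]
    print("mu =", unfairness_area(times, degrees))

    # Hypothetical workflow with two paths of task durations (seconds).
    paths = [[10, 10], [10, 5, 3]]
    m_own = own_makespan(paths)
    print("M_own =", m_own, "slowdown =", slowdown(50, m_own))
```

Note that the sum uses the degree at the right end of each interval, exactly as the formula is written on the slide.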
  • 39. Results: Identical Workflows
      • Tests whether unfairness among identical workflows is properly addressed
      • [Figure: makespan (s) of three Gate workflows (Gate 1, Gate 2, Gate 3) and unfairness degree over time (s), Fairness vs. No-Fairness, for repetitions 1 to 4]
      • Makespans and unfairness degree values are significantly reduced
  • 40. Results: Different Workflows
      • Tests whether unfairness among different workflows is detected and properly handled
      • [Figure: slowdown of FIELD-II, Gate, PET-Sorteo, and SimuBloch and unfairness degree over time (s), Fairness vs. No-Fairness, for repetitions 1 to 4]
      • The standard deviation of the slowdown is reduced by up to a factor of 3.8, and the unfairness value by up to a factor of 1.9
  • 41. Fairness Control: Conclusions
      • First results in controlling fairness among workflow executions under these conditions: production system, non-clairvoyant, online
      • Limitations
          • Fairness optimization is delayed by the time needed to acquire information about the applications
          • The method works best for applications with many short tasks
      • Future work
          • Evaluation of the influence of the metrics’ parameters
      • Publications
          • R. Ferreira da Silva, T. Glatard, F. Desprez, Workflow fairness control on online and non-clairvoyant distributed computing platforms, Euro-Par, Aachen, 2013.
          • R. Ferreira da Silva, T. Glatard, F. Desprez, Controlling fairness and task granularity in distributed, online, non-clairvoyant workflow executions, Concurrency and Computation: Practice and Experience (CCPE), submitted, 2014.
  • 43. Contributions Summary
      • Self-healing of workflow incidents [Ferreira da Silva et al., CCGRID’12, FGCS’13]
          • Generic MAPE-K loop
          • Non-clairvoyance and online
      • Treatment of blocked activities
          • Properly detects and handles blocked activities
      • Optimization of task granularity [Ferreira da Silva et al., EuroPar’13a]
          • Properly detects and handles lightweight tasks under stationary and non-stationary loads
      • Fairness control among workflow executions [Ferreira da Silva et al., EuroPar’13b, CPE’14]
          • Properly detects and handles unfairness among workflow executions
      • Science-gateway model for workload archive [Ferreira da Silva and Glatard, CGWS’12]
          • Illustration by using traces of the VIP from 2011/2012
      • All methods were evaluated on VIP [Ferreira da Silva et al., HealthGrid’11; Glatard et al., TMI’13]
          • Production platform with about 500 users
  • 44. Perspectives
      • Mode detection automation
          • Automatically detect variations in threshold values
      • Time-windowed historical information
          • Users’ behavior may change
          • Errors may be restricted to a specific time span
      • Optimization of the incident selection method
          • There is no mechanism to prevent an incident from being selected successively
      • Sensitivity analysis of parameters
          • Evaluate the influence of parameters on the metrics
      • Workflow workload archive
          • The science-gateway workload archive model does not cover all characteristics inherent to a workflow execution
  • 45. A science-gateway for workflow executions: online and non-clairvoyant self-healing of workflow executions on grids. Thank you for your attention. Questions? http://vip.creatis.insa-lyon.fr Rafael FERREIRA DA SILVA, University of Lyon, CNRS, INSERM, CREATIS, Villeurbanne, France. Supervisors: Frédéric DESPREZ and Tristan GLATARD.