A science-gateway workload archive
                application to the self-healing
                    of workflow incidents

    Rafael FERREIRA DA SILVA, Tristan GLATARD
       University of Lyon, CNRS, INSERM, CREATIS, Villeurbanne, France

    Frédéric DESPREZ
       INRIA, University of Lyon, LIP ENS Lyon, Lyon, France




                              Journées Scientifiques Mésocentres et France Grilles
                                            October 1st-3rd 2012



1
                                                                    Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
Context: Workload Archives

  Information produced by grid workflow executions
       exit_code, task_status, submit_time, execution_time, site_name,
       input_file, workflow_id, activity_name

  Useful for
       Assumptions validation
       Computational activity modeling
       Methods evaluation (simulation or experimental)




Science-gateway architecture

  Components: User, Web Portal, Storage Element, Workflow Engine,
  Pilot Manager, Meta-Scheduler, Computing site

  0. Login
  1. Send input data
  2. Transfer input files
  3. Launch workflow
  4. Generate and submit task
  5. Submit pilot jobs
  6. Schedule pilot jobs
  7. Get task
  8. Get files
  9. Execute
  10. Upload results

State of the Art

  Grid Workload Archives
       Information about tasks gathered at infrastructure-level
       (exit_code, task_status, submit_time, execution_time, site_name,
       input_file, workflow_id, activity_name)
       •  Parallel Workloads Archive (http://www.cs.huji.ac.il/labs/parallel/workload/)
       •  Grid Workloads Archive (http://gwa.ewi.tudelft.nl/pmwiki/)

  Lack of critical information:
       •  Dependencies among tasks
       •  Task sub-steps
       •  Application-level scheduling artifacts
       •  User




At infrastructure-level

  [Figure: the science-gateway architecture diagram (steps 0 to 10), repeated to locate the infrastructure level]

Outline
  A science-gateway workload archive
  Case studies
        Pilot Jobs
        Accounting
        Task analysis
        Bag of tasks

  Workflow Self-Healing
  Conclusions



Our approach

  Science-Gateway Workload Archive
       Information about workflow executions gathered at science-gateway level
       (exit_code, task_status, submit_time, execution_time, site_name,
       input_file, workflow_id, activity_name)

  Advantages:
       •  Fine-grained information about tasks
       •  Dependencies among tasks
       •  Workflow characterization
       •  Accounting




At science-gateway level

  [Figure: the science-gateway architecture diagram (steps 0 to 10), repeated to locate the science-gateway level]

Virtual Imaging Platform
  Virtual Imaging Platform (VIP)
       Medical imaging science-gateway
       Grid of 129 sites (EGI – http://www.egi.eu)

  Significant usage
       Registered users: 244 from 26 countries
       Applications: 18
       Consumed 32 CPU years in 2011

  [Figure: VIP portal, showing Applications and File transfer. VIP – http://vip.creatis.insa-lyon.fr]
  [Figure: VIP usage in 2011: CPU consumption of VIP and related platforms on EGI.]

SGWA
  Science Gateway Workload Archive (SGWA)
       Archive is extracted from VIP
       Task, Site and Workflow Execution acquired from databases populated
       by the workflow engine at runtime
       File and Pilot Job extracted from the parsing of task standard
       output and error files

  [Figure: Science-gateway archive model]
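The archive entities named on this slide could be represented along the following lines. This is only a minimal sketch: the field names exit_code, task_status, submit_time, execution_time, site_name, input_file, workflow_id and activity_name appear on earlier slides, while the class layout, the units, and the `pilot_id` and `user` attributes are assumptions made for illustration, not the actual SGWA schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Task:
    # Field names taken from the slides.
    workflow_id: str
    activity_name: str
    site_name: str
    task_status: str
    exit_code: Optional[int]
    submit_time: float        # seconds since epoch (assumed unit)
    execution_time: float     # seconds (assumed unit)
    # Hypothetical: identifier of the pilot job that ran the task.
    pilot_id: Optional[str] = None
    # input_file entries, parsed from the task standard output and error files.
    input_files: List[str] = field(default_factory=list)


@dataclass
class WorkflowExecution:
    workflow_id: str
    user: str                 # hypothetical: gateway user who launched it
    tasks: List[Task] = field(default_factory=list)

    def tasks_by_status(self) -> Dict[str, int]:
        """Count the tasks of this execution per task_status."""
        counts: Dict[str, int] = {}
        for task in self.tasks:
            counts[task.task_status] = counts.get(task.task_status, 0) + 1
        return counts
```

Keeping the user and the workflow identifier next to each task is what makes the accounting and dependency analyses of the later slides possible at science-gateway level.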


Workload for Case Studies
  Based on the workload of VIP
        January 2011 to April 2012
        112 users
        2,941 workflow executions
        339,545 pilot jobs
        680,988 tasks
             338,989 completed
             138,480 error
             105,488 aborted
              15,576 aborted replicas
              48,293 stalled
              34,162 queued
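As a quick consistency check on the figures of this slide, the per-status task counts can be summed and turned into shares. The sketch below only uses the numbers printed on the slide; the variable names are illustrative.

```python
# Task counts per final status, copied from the slide.
status_counts = {
    "completed": 338_989,
    "error": 138_480,
    "aborted": 105_488,
    "aborted replicas": 15_576,
    "stalled": 48_293,
    "queued": 34_162,
}
total_tasks = 680_988

# The six statuses account for every task in the archive.
assert sum(status_counts.values()) == total_tasks

# Share of each status in the whole workload.
shares = {status: n / total_tasks for status, n in status_counts.items()}
for status, share in shares.items():
    print(f"{status:>17}: {share:6.1%}")
```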




Pilot Jobs
  A single pilot can wrap several tasks and users

  At infrastructure-level
       Assimilates pilot jobs to tasks and users
       Valid for only 62% of the tasks
       Valid for 95% of user-task associations

  At science-gateway level
       Users and tasks are correctly associated to pilots

  [Figure: histogram of tasks per pilot (frequency). 1: 282331, 2: 28121, 3: 11885, 4: 6721, 5: 10487]
  [Figure: histogram of users per pilot (frequency). 1: 323214, 2: 15178, 3: 1079, 4: 70, 5: 4]
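The 62% and 95% figures can be recovered from the two histograms on this slide. The sketch below is one plausible reading, under two assumptions that the slide does not state: the last bin is treated as exactly 5 items per pilot (it may mean "5 or more"), and the 95% is taken as the share of pilots that served a single user.

```python
# Histogram counts copied from the slide: {items per pilot: number of pilots}.
tasks_per_pilot = {1: 282_331, 2: 28_121, 3: 11_885, 4: 6_721, 5: 10_487}
users_per_pilot = {1: 323_214, 2: 15_178, 3: 1_079, 4: 70, 5: 4}


def share_of_items_alone(hist):
    """Share of items (e.g. tasks) that are the only one in their pilot."""
    total_items = sum(per_pilot * pilots for per_pilot, pilots in hist.items())
    return hist[1] / total_items


def share_of_single_item_pilots(hist):
    """Share of pilots that wrapped exactly one item (e.g. one user)."""
    return hist[1] / sum(hist.values())


# Equating a pilot with a task is only right for tasks that ran alone.
task_share = share_of_items_alone(tasks_per_pilot)
# Equating a pilot with a user is only right for single-user pilots.
user_share = share_of_single_item_pilots(users_per_pilot)

print(f"tasks: {task_share:.0%}, users: {user_share:.0%}")
```

Both histograms add up to the same number of pilots, which matches the 339,545 pilot jobs reported for the workload.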




Accounting: Users
  Authentications based on login and password are mapped to
     X.509 robot certificates

  At infrastructure-level
       All VIP users are reported as a single user

  At science-gateway level
       Maps task executions to VIP users


  [Figure: Number of reported EGI and VIP users per month (x-axis: months 1 to 16; y-axis: users, 0 to 40; series: EGI, VIP)]
Accounting: CPU and Wall-clock Time
  Huge discrepancy of values
       Pilot jobs do not register to the pilot system
       Absence of workload
       Outputs unretrievable
       Pilot setup time
       Lost tasks (a.k.a. stalled)

  Undetectable at infrastructure-level

  [Figure: Number of submitted pilot jobs by EGI and VIP (x-axis: month; y-axis: number of jobs; series: VIP jobs, EGI jobs)]
  [Figure: Consumed CPU and wall-clock time by EGI and VIP (x-axis: month; y-axis: years; series: VIP CPU time, VIP wall-clock time, EGI CPU time, EGI wall-clock time)]

Task Analysis
  At infrastructure-level
       Limited to task exit codes

  At science-gateway level
       Fine-grained information
       Steps in task life
       Error causes
       Replicas per task

  [Figure: Number of tasks per error cause. application: 55165, input: 50925, stalled: 48293, output: 19463, folder: 1123]
  [Figure: Different steps in task life. CDF of time (s, from 1 to 10000) for download, execution and upload]



Bag of Tasks: at Infrastructure level
  Evaluation of the accuracy of the method of Iosup et al. [8] to detect
     bags of tasks (BoT)

  Two successively submitted tasks are in the same BoT if the time
     interval between their submission times is less than or equal to Δ.

  [Figure: timeline of Task 1, Task 2 and Task 3, submitted at t1, t2 and t3]
       BoT 1: Task 1 and Task 2, since Δ1,2 = |t1 – t2| ≤ Δ
       BoT 2: Task 3, since Δ2,3 = |t2 – t3| > Δ



     [8] Iosup, A., Jan, M., Sonmez, O., Epema, D.: The Characteristics and
     performance of groups of jobs in grids. In: Euro-Par. (2007) 382-393
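The interval rule stated on this slide can be sketched as a single pass over sorted submission times. This is an illustration of that rule only: the published method of Iosup et al. may apply further criteria, and the function name and the Δ value of 120 s used below are hypothetical.

```python
from typing import List


def group_into_bots(submit_times: List[float], delta: float) -> List[List[float]]:
    """Group submission times into bags of tasks (BoT).

    Two successively submitted tasks fall in the same BoT when the gap
    between their submission times is less than or equal to delta.
    """
    bots: List[List[float]] = []
    for t in sorted(submit_times):
        if bots and t - bots[-1][-1] <= delta:
            bots[-1].append(t)   # close enough to the previous task
        else:
            bots.append([t])     # gap too large: start a new BoT
    return bots


# Three tasks as in the slide's timeline: the first two are close, the third is late.
bots = group_into_bots([0.0, 50.0, 400.0], delta=120.0)
print(bots)
```

Because the rule only looks at submission times, tasks of unrelated workflows submitted close together end up in the same bag, which is the kind of error the comparison with ground-truth BoTs measures.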
Bag of Tasks: Size and Duration
              Infrastructure vs science-gateway
  The size of 90% of Batch BoTs ranges from 2 to 10, whereas this range
     covers 50% of Real Batch BoTs

  Non-Batch duration is overestimated by up to 400%

  [Figure: CDF of BoT size (number of tasks, up to 1000) for Real Batch and Batch]
  [Figure: CDF of duration (s, up to 50000) for Real Batch, Real Non-Batch, Batch and Non-Batch]

       Real Batch = ground-truth BoT
       Real Non-Batch = ground-truth non-BoT
       Batch = Iosup et al. BoT
       Non-Batch = Iosup et al. non-BoT

Bag of Tasks: Inter-arrival Time and Consumed CPU Time
  Batch and Non-Batch inter-arrival times are underestimated by
     about 30%

  CPU times are underestimated by 25% for Non-Batch and by about
     20% for Batch

  [Figure: CDF of inter-arrival time (s, up to 10000) for Real Batch, Real Non-Batch, Batch and Non-Batch]
  [Figure: CDF of consumed CPU time (KCPUs, up to 30000) for Real Batch, Real Non-Batch, Batch and Non-Batch]

       Real Batch = ground-truth BoT
       Real Non-Batch = ground-truth non-BoT
       Batch = Iosup et al. BoT
       Non-Batch = Iosup et al. non-BoT

Outline
  A science-gateway workload archive
  Case studies
         Pilot Jobs
         Accounting
         Task analysis
         Bag of tasks

  Workflow Self-Healing
  Conclusions



Workflow Self-Healing
  Problem: costly manual operations
       Rescheduling tasks, restarting services, killing misbehaving
       experiments or replicating data files


  Objective: automated platform administration
       Autonomous detection of operational incidents
       Perform appropriate set of actions


  Assumptions: online and non-clairvoyant
       Only partial information available
       Decisions must be fast
       Production conditions, no prediction of user activity or workloads

General MAPE-K loop
  Monitoring
       Triggered by an event (job completion and failures) or a timeout
       Produces monitoring data

  Analysis
       Each incident has a degree η and levels 1 to 3
            Incident 1: degree η = 0.8
            Incident 2: degree η = 0.4
            Incident 3: degree η = 0.1
       Roulette wheel selection, each incident i being weighted by
            $\frac{\eta_i}{\sum_{j=1}^{n} \eta_j}$   (0.61, 0.30 and 0.07 here)
       Selected: Incident 1

  Knowledge
       [Figure: histogram of frequency vs. incident degree (estimation by median)]
       Association rules for incident 1

            Rule      Confidence (ρ)     ρ x η
            2 → 1          0.8            0.32
            3 → 1          0.2            0.02
            1 → 1          1.0            0.80

  Planning
       Roulette wheel selection based on association rules
       (values shown on the wheel: 0.37, 0.66, 0.16)
       Selected: Incident 2

  Execution
       Set of Actions (x2)
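The two roulette wheel selections of this loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: the degrees and confidences are the ones shown on the slide, the reading of each rule as "incident x implies incident 1" is an assumption, and the function names are hypothetical.

```python
import random
from typing import Dict, Optional


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Turn non-negative weights into selection probabilities."""
    total = sum(weights.values())
    return {key: w / total for key, w in weights.items()}


def roulette_select(weights: Dict[str, float],
                    rng: Optional[random.Random] = None) -> str:
    """Pick one key with probability proportional to its weight."""
    rng = rng or random.Random()
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]


# Analysis: incidents weighted by their degree.
degrees = {"incident 1": 0.8, "incident 2": 0.4, "incident 3": 0.1}
probabilities = normalize(degrees)
selected = roulette_select(degrees, random.Random(0))

# Planning: rules pointing to incident 1, weighted by confidence x degree
# of the incident on the left-hand side of the rule (assumed reading).
confidence = {"incident 1": 1.0, "incident 2": 0.8, "incident 3": 0.2}
rule_weights = {k: confidence[k] * degrees[k] for k in degrees}
planned = roulette_select(rule_weights, random.Random(0))

print(probabilities)
print(rule_weights)
```

The normalized degrees come out near 0.615, 0.308 and 0.077, consistent with the 0.61, 0.30 and 0.07 printed on the slide if those were truncated to two decimals. A random draw keeps low-degree incidents selectable instead of always handling the highest one.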

Incident: Activity Blocked
  An invocation is late compared to the others
  [Figure: Invocations completion rate for a simulation (FIELD-II/pasa - workflow-9SIeNv): completed jobs vs. time (s)]
  [Figure: Job flow for a simulation]



  Possible causes
       Longer waiting times
       Lost tasks (e.g. killed by site due to quota violation)
       Resources with poor performance



Activity blocked: degree
      Degree computed from all completed jobs of the activity
        Job phases: setup → inputs download → execution → outputs upload
        Assumption: bag-of-tasks (all jobs have equal durations)
        Median-based estimation:

            Job phase          Median duration    Estimated job    Real job
                               of jobs phases     duration         duration
            setup                    50s               42s            42s     (completed)
            inputs download         250s              300s           300s     (completed)
            execution               400s              400s*           20s     (current)
            outputs upload           15s               15s             ?
            Total               Mi = 715s         Ei = 757s

            *: max(400s, 20s) = 400s

      Incident degree: job performance w.r.t. median
         $d = \frac{E_i}{M_i + E_i} \in [0, 1]$
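The median-based estimation and the degree formula of this slide can be worked through in a few lines. The sketch below follows one reading of the table: completed phases count with their real duration, the running phase with the larger of its median and its elapsed time, and phases not yet started with their median. The function names and the list-based layout are hypothetical.

```python
from typing import List, Optional


def estimate_duration(medians: List[float],
                      observed: List[Optional[float]],
                      current: int) -> float:
    """Estimated job duration E_i.

    medians[k]  : median duration of phase k over completed jobs
    observed[k] : duration of phase k for this job (elapsed time for the
                  running phase, None for phases not started yet)
    current     : index of the phase the job is currently in
    """
    estimate = 0.0
    for k, median in enumerate(medians):
        if k < current:
            estimate += observed[k]               # completed: real duration
        elif k == current:
            estimate += max(median, observed[k])  # running: at least the median
        else:
            estimate += median                    # not started: median
    return estimate


def incident_degree(medians: List[float],
                    observed: List[Optional[float]],
                    current: int) -> float:
    """Degree d = E_i / (M_i + E_i), with M_i the sum of phase medians."""
    m_i = sum(medians)
    e_i = estimate_duration(medians, observed, current)
    return e_i / (m_i + e_i)


# Values from the slide: setup, inputs download, execution, outputs upload.
medians = [50, 250, 400, 15]
observed = [42, 300, 20, None]   # the job is 20 s into its execution phase
e_i = estimate_duration(medians, observed, current=2)
d = incident_degree(medians, observed, current=2)
print(e_i, round(d, 3))
```

With these values M_i is 715 s and E_i is 757 s, as on the slide. A job that tracks the median exactly gets d = 0.5, and d moves toward 1 as the job falls behind.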

Activity blocked: levels and actions
  Levels: identified from the platform logs
       Threshold τ1 on d separates Level 1 (no actions) from Level 2 (action: replicate jobs)
  [Figure: histogram of the incident degree d, estimation by median (x-axis: d from 0.0 to 1.0; y-axis: Frequency)]
  [Figure: Replication process for one task]



  Actions
       Job replication
         Cancel replicas with bad performance
         Replicate only if all active replicas are running
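
The level and replication rules can be illustrated with a short Python sketch. Everything named here is an assumption made for illustration: the `Action` enum, the state strings `"running"` and `"queued"`, the choice of `>=` at the threshold, and the example value 0.6 for τ1 (the slides only say the threshold is identified from the platform logs). The default limit of 5 replicas comes from the experiment parameters slide. How "bad performance" is measured when cancelling replicas is not specified on the slide, so that step is left out.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action"            # Level 1
    REPLICATE = "replicate jobs"  # Level 2


def select_action(d, tau1):
    """Map an incident degree d to an action using threshold tau1.

    Assumption: d equal to tau1 already counts as Level 2.
    """
    return Action.REPLICATE if d >= tau1 else Action.NONE


def may_replicate(active_states, max_replicas=5):
    """Allow a new replica only if every active replica is running
    and the replication limit has not been reached."""
    below_limit = len(active_states) < max_replicas
    all_running = all(state == "running" for state in active_states)
    return below_limit and all_running


TAU1 = 0.6  # hypothetical threshold, for illustration only
decision = select_action(0.8, TAU1)
allowed = may_replicate(["running", "running"])
```

Requiring all active replicas to be running avoids piling new replicas on top of ones that are still waiting in a queue.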

Experiments
  Goal: Self-Healing vs No-Healing
       Cope with recoverable errors

  Metrics
       Makespan of the activity execution
       Resource waste

         $w = \frac{(\mathrm{CPU} + \mathrm{data})_{\text{self-healing}}}{(\mathrm{CPU} + \mathrm{data})_{\text{no-healing}}} - 1$

         For w < 0: self-healing consumed fewer resources
         For w > 0: self-healing wasted resources
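
As a worked illustration of the resource waste metric, here is a minimal Python sketch. The function name and the sample numbers are made up for the example, and it assumes that CPU and data consumption are expressed in the same unit (for instance seconds) so that they can be added.

```python
def resource_waste(cpu_sh, data_sh, cpu_nh, data_nh):
    """w = (CPU + data)_self-healing / (CPU + data)_no-healing - 1.

    All four arguments are assumed to share one unit (e.g. seconds).
    """
    return (cpu_sh + data_sh) / (cpu_nh + data_nh) - 1.0


# Illustrative numbers only, not measurements from the experiments.
w = resource_waste(cpu_sh=700.0, data_sh=200.0, cpu_nh=800.0, data_nh=200.0)
saved = w < 0  # negative w: self-healing consumed fewer resources
```

Here self-healing uses 900 units against 1000 for no-healing, so w is about -0.10, that is, roughly 10% fewer resources.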
        




Experiment Conditions
  Software
       Virtual Imaging Platform
       MOTEUR workflow engine
       DIRAC pilot job system

  Infrastructure
       European Grid Infrastructure (EGI): production, shared
       Self-Healing and No-Healing launched simultaneously

  Experiment parameters
       Task and file replication limited to 5
       Failed task resubmission limited to 5



Applications
  FIELD-II/pasa
       •  Ultrasound imaging simulation
       •  122 invocations
       •  CPU Time: 15 min
       •  ~210 MB
       •  Data-intensive
       Image courtesy of ANR project US-Tagging (O. Bernard, M. Alessandrini)
       http://www.creatis.insa-lyon.fr/us-tagging/news

  Mean-Shift/hs3
       •  Image denoising
       •  250 invocations
       •  CPU Time: 1 hour
       •  ~182 MB
       •  CPU-intensive
       Image courtesy of Ting Li
       http://www.creatis.insa-lyon.fr


Results
  Experiment: tests if recoverable errors are detected
  [Figures: Makespan (s) per repetition (1 to 5), No-Healing vs Self-Healing, for FIELD-II/pasa and Mean-Shift/hs3]

  FIELD-II/pasa: speeds up execution by up to a factor of 4
  Mean-Shift/hs3: speeds up execution by up to a factor of 2.6

  Resource waste w per repetition
       Repetition    FIELD-II/pasa    Mean-Shift/hs3
           1            -0.10            -0.02
           2            -0.15            -0.20
           3            -0.09            -0.02
           4             0.05            -0.02
           5            -0.26            -0.01

  The Self-Healing process reduced resource consumption by up to 26% compared to the No-Healing execution



Conclusions
  Science-gateway model of workload archive
       Illustration by using traces of the VIP from 2011/2012

  Added value when compared to infrastructure-level traces
         Exactly identifies tasks and users
         Distinguishes additional workload artifacts from real workload
         Fine-grained information about tasks
         Ground-truth of bag of tasks

  Self-healing of workflow incidents
         Implements a generic MAPE-K loop
         Incident degrees computed online
         Speeds up execution by up to a factor of 4
         Reduces resource consumption by up to 26%
         Successful example of a self-healing loop deployed in production

  VIP is openly available at http://vip.creatis.insa-lyon.fr
  Traces are available to the community in the
     Grid Observatory: http://www.grid-observatory.org
A science-gateway workload archive
                application to the self-healing
                    of workflow incidents
                         Thank you for your attention.
                                 Questions?

                                          ACKNOWLEDGMENTS
                                   VIP users and project members
                        French National Agency for Research (ANR-09-COSI-03)
                                    European Grid Initiative (EGI)
                                            France-Grilles



A science-gateway workload archive application to the self-healing of workflow incidents

  • 1. A science-gateway workload archive application to the self-healing of workflow incidents Rafael FERREIRA DA SILVA, Tristan GLATARD Frédéric DESPREZ University of Lyon, CNRS, INSERM, CREATIS INRIA, University of Lyon, LIP ENS Lyon , Villeurbanne, France Lyon, France Journées Scientifiques Mésocentres et France Grilles October 1st-3rd 2012 1 Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
  • 2. Context: Workload Archives Assumptions validation exit_code task_status useful for submit_time ime t ion_t Computational activity site_name execu modeling inpu t _file id workflow_ activity_name Methods evaluation (simulation or experimental) Information produced by grid workflow executions 2 Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
  • 3. Science-gateway architecture 0. Login 3. Launch workflow 1. Send input data User Workflow Engine Web Portal 2. Transfer 4. Generate and input files submit task Storage Element 8. Get files 7. Get task 9. Execute 10. Upload results Pilot Manager Computing site 6. Schedule 5. Submit pilot jobs pilot jobs Meta-Scheduler 3 Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
  • 4. State of the Art Grid Workload Archives exit_code task_status submit_time time tion_ execu site_name inpu t _file d workflow_i Information gathered activity_name at infrastructure-level tasks Lack of critical information: •  Dependencies among tasks •  Parallel Workloads Archive (http://www.cs.huji.ac.il/labs/parallel/workload/) •  Task sub-steps •  Grid Workloads Archive •  Application-level scheduling artifacts (http://gwa.ewi.tudelft.nl/pmwiki/) •  User 4 Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
  • 5. At infrastructure-level 0. Login 3. Launch workflow 1. Send input data User Workflow Engine Web Portal 2. Transfer 4. Generate and input files submit task Storage Element 8. Get files 7. Get task 9. Execute 10. Upload results Pilot Manager Computing site 6. Schedule 5. Submit pilot jobs pilot jobs Meta-Scheduler 5 Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
  • 6. Outline   A science-gateway workload archive   Case studies   Pilot Jobs   Accounting   Task analysis   Bag of tasks   Workflow Self-Healing   Conclusions 6 Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
  • 7. Our approach Science-Gateway Workload Archive exit_code task_status submit_time time tion_ execu site_name inpu t _file d Information gathered workflow_i activity_name at science-gateway level Advantages: workflow executions •  Fine-grained information about tasks •  Dependencies among tasks •  Workflow characterization •  Accounting 7 Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
  • 8. At science-gateway level 0. Login 3. Launch workflow 1. Send input data User Workflow Engine Web Portal 2. Transfer 4. Generate and input files submit task Storage Element 8. Get files 7. Get task 9. Execute 10. Upload results Pilot Manager Computing site 6. Schedule 5. Submit pilot jobs pilot jobs Meta-Scheduler 8 Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
  • 9. Virtual Imaging Platform   Virtual Imaging Platform (VIP)   Medical imaging science-gateway   Grid of 129 sites (EGI – http://www.egi.eu) Applications   Significant usage   Registered users: 244 from 26 countries   Applications: 18 File transfer   Consumed 32 CPU years in 2011 VIP – http://vip.creatis.insa-lyon.fr VIP usage in 2011: CPU consumption of VIP and related platforms on EGI. 9 Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
  • 10. SGWA   Science Gateway Workload Archive (SGWA)   Archive is extracted from VIP Science-gateway archive model Task, Site and Workflow Execution File and Pilot Job extracted from acquired from databases populated the parsing of task standard by the workflow engine at runtime output and error files 10 Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
  • 11. Workload for Case Studies   Based on the workload of VIP   January 2011 to April 2012 338,989 completed 138,480 error 105,488 aborted 15,576 aborted replicas 48,293 stalled 34,162 queued 112 users 2,941 workflow executions 680,988 tasks 339,545 pilot jobs 11 Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
  • 12. Pilot Jobs   A single pilot can wrap several tasks and users 282331 250000 200000   At infrastructure-level 150000 Frequency 100000   Assimilates pilot jobs to tasks and 50000 28121 users 11885 6721 10487   Valid for only 62% of the tasks 0 1 2 3 4 5 Tasks per pilot   Valid for 95% of user-task associations 323214 300000 250000 200000 150000 Frequency   At science-gateway level 100000 50000   Users and tasks are correctly 15178 associated to pilots 1079 70 4 0 1 2 3 4 5 Users per pilot 12 Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
  • 13. Accounting: Users   Authentications based on login and password are mapped to X.509 robot certificates   At infrastructure-level   All VIP users are reported as a single user   At science-gateway level   Maps task executions to VIP users 40 30 Users EGI 20 VIP 10 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Months Number of reported EGI and VIP users 13 Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
  • 14. Accounting: CPU and Wall-clock Time   Huge discrepancy of values 6e+05 VIP jobs   Pilot jobs do not register to Number of jobs 5e+05 EGI jobs the pilot system 4e+05 3e+05   Absence of workload 2e+05 1e+05   Outputs unretrievable 5 10 15 Month   Pilot setup time Number of submitted pilot jobs by EGI and VIP   Lost tasks (a.k.a. stalled) 150 VIP CPU time VIP Wall−clock time 100   Undetectable at infrastructure-level EGI CPU time Years EGI Wall−clock time 50 5 10 15 Month Consumed CPU and wall-clock time by EGI and VIP 14 Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
  • 15. Task Analysis   At infrastructure-level   Limited to task exit codes 55165 50925 50000 48293 Number of tasks 40000 30000   At science-gateway level 20000 19463   Fine-grained information 10000 1123 0   Steps in task life application input stalled Error causes output folder   Error causes   Replicas per task 1.0 0.8 download execution 0.6 upload CDF 0.4 0.2 1 100 10000 Time(s) Different steps in task life 15 Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
  • 16. Bag of Tasks: at Infrastructure level   Evaluation of the accuracy of Iosup et al.[8] method to detect bag of tasks (BoT) Task 1 Task 2   Two successively submitted tasks are in the same BoT if Δ1,2 Δ2,3 Task 3 the time interval between submission times is lower t1 t2 t3 time or equal to Δ. Δ Δ BoT 1 BoT 2 Task 1 Δ1,2 ≤Δ Task 3 Δ2,3 >Δ |t1 – t2|≤Δ |t2 – t3|>Δ Task 2 16 [8] Iosup, A., Jan, M., Sonmez, O., Epema, D.: The Characteristics and performance of groups of jobs in grids. In: Euro-Par. (2007) 382-393 Rafael Ferreira da Silva – rafael.silva@creatis.insa-lyon.fr
  • 17. Bag of Tasks: Size and Duration, Infrastructure vs Science-Gateway. 90% of Batch BoTs have a size between 2 and 10 tasks, whereas this range represents 50% of Real Batch BoTs. Non-Batch duration is overestimated by up to 400%. [Figure: CDF of BoT size (number of tasks) for Real Batch and Batch] [Figure: CDF of duration (s) for Real Batch, Real Non-Batch, Batch and Non-Batch] Legend: Real Batch = ground-truth BoT; Real Non-Batch = ground-truth non-BoT; Batch = Iosup et al. BoT; Non-Batch = Iosup et al. non-BoT
  • 18. Bag of Tasks: Inter-arrival Time and Consumed CPU Time. Batch and Non-Batch inter-arrival times are underestimated by about 30%. CPU times are underestimated by 25% for Non-Batch and by about 20% for Batch. [Figure: CDF of inter-arrival time (s) for Real Batch, Real Non-Batch, Batch and Non-Batch] [Figure: CDF of consumed CPU time (KCPUs) for Real Batch, Real Non-Batch, Batch and Non-Batch] Legend: Real Batch = ground-truth BoT; Real Non-Batch = ground-truth non-BoT; Batch = Iosup et al. BoT; Non-Batch = Iosup et al. non-BoT
  • 19. Outline. A science-gateway workload archive. Case studies: pilot jobs, accounting, task analysis, bag of tasks. Workflow self-healing. Conclusions.
  • 20. Workflow Self-Healing. Problem: costly manual operations (rescheduling tasks, restarting services, killing misbehaving experiments or replicating data files). Objective: automated platform administration (autonomous detection of operational incidents; perform an appropriate set of actions). Assumptions: online and non-clairvoyant (only partial information is available; decisions must be fast; production conditions, with no prediction of user activity or workloads).
  • 21. General MAPE-K Loop. The loop is triggered by an event (job completion and failures) or a timeout and runs through Monitoring, Analysis, Planning and Execution around a shared Knowledge base. Monitoring data yields a degree η for each incident, e.g. Incident 1 with η = 0.8, Incident 2 with η = 0.4 and Incident 3 with η = 0.1; each incident has levels 1, 2 and 3. Roulette wheel selection: incident i is selected with probability $\eta_i / \sum_{j=1}^{n} \eta_j$ (0.61, 0.30 and 0.07 in the example). Roulette wheel selection based on association rules for incident 1 (rule: confidence ρ, ρ × η): 2 → 1: 0.8, 0.32; 3 → 1: 0.2, 0.02; 1 → 1: 1.0, 0.80. The set of actions of the selected incident is then executed. [Figure: histogram of incident degrees (frequency vs. degree from 0 to 1), titled "Estimation by Median"]
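The first roulette wheel on this slide, where incident i is drawn with probability η_i divided by the sum of all degrees, can be sketched as below. This is an illustrative sketch under assumptions: the function names and the dictionary representation are hypothetical, and the association-rule stage is not modeled.

```python
import random
from typing import Dict, Optional


def selection_probabilities(degrees: Dict[str, float]) -> Dict[str, float]:
    """Normalize incident degrees into roulette-wheel probabilities."""
    total = sum(degrees.values())
    if total <= 0:
        raise ValueError("at least one incident degree must be positive")
    return {name: eta / total for name, eta in degrees.items()}


def roulette_select(degrees: Dict[str, float],
                    rng: Optional[random.Random] = None) -> str:
    """Draw one incident with probability proportional to its degree."""
    rng = rng or random.Random()
    threshold = rng.random()  # uniform draw in [0, 1)
    cumulative = 0.0
    selected = ""
    for name, p in selection_probabilities(degrees).items():
        selected = name
        cumulative += p
        if threshold < cumulative:
            break
    # If rounding left the cumulative sum slightly below 1, the last
    # incident is returned.
    return selected


# Degrees taken from the slide example.
degrees = {"incident 1": 0.8, "incident 2": 0.4, "incident 3": 0.1}
print(selection_probabilities(degrees))
print(roulette_select(degrees, random.Random(1)))
```

For the slide's degrees the normalized probabilities are 0.8/1.3, 0.4/1.3 and 0.1/1.3, which agree with the 0.61, 0.30 and 0.07 shown on the slide up to rounding.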
  • 22. Incident: Activity Blocked. An invocation is late compared to the others. [Figure: Invocations completion rate for a simulation (FIELD-II/pasa, workflow-9SIeNv; x-axis: time in seconds; y-axis: completed jobs)] [Figure: Job flow for a simulation] Possible causes: longer waiting times; lost tasks (e.g. killed by the site due to quota violation); resources with poor performance.
  • 23. Activity Blocked: Degree. The degree is computed from all completed jobs of the activity. Job phases: setup → inputs download → execution → outputs upload. Assumption: bag of tasks (all jobs have equal durations). Median-based estimation, example (phase: median duration of jobs phases / estimated job duration / real job duration): setup: 50 s / 42 s / 42 s (completed); inputs download: 250 s / 300 s / 300 s (completed); execution: 400 s / 400 s* / 20 s (current); outputs upload: 15 s / 15 s / ?. Totals: $M_i = 715$ s, $E_i = 757$ s. *: max(400 s, 20 s) = 400 s. Incident degree (job performance w.r.t. the median): $d = \frac{E_i}{M_i + E_i} \in [0, 1]$
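The median-based estimation and the degree formula can be reproduced on the slide's example as follows. This is a sketch under assumptions: the function names and the list-based representation of phases are hypothetical, and the per-phase rule (real duration for completed phases, the maximum of median and elapsed time for the current phase, the median for phases not yet started) is inferred from the example values.

```python
from typing import Optional, Sequence


def estimate_duration(medians: Sequence[float],
                      real: Sequence[Optional[float]],
                      current: int) -> float:
    """Estimate the total job duration E_i from per-phase durations.

    medians: median duration of each phase over completed jobs
    real:    durations observed so far for this job (None if not started)
    current: index of the phase the job is currently in
    """
    estimate = 0.0
    for k, median in enumerate(medians):
        if k < current:
            estimate += real[k]               # completed phase: real duration
        elif k == current:
            estimate += max(median, real[k])  # running phase: at least the median
        else:
            estimate += median                # future phase: median
    return estimate


def incident_degree(medians: Sequence[float],
                    real: Sequence[Optional[float]],
                    current: int) -> float:
    """Degree d = E_i / (M_i + E_i); assumes M_i + E_i > 0."""
    m = sum(medians)
    e = estimate_duration(medians, real, current)
    return e / (m + e)


# Values from the slide: setup, inputs download, execution, outputs upload.
medians = [50.0, 250.0, 400.0, 15.0]
real = [42.0, 300.0, 20.0, None]
print(estimate_duration(medians, real, current=2))
print(incident_degree(medians, real, current=2))
```

On the slide's values this gives M_i = 715 s and E_i = 757 s, so d = 757 / 1472, slightly above 0.5 because the job is running a little slower than the median.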
  • 24. Activity Blocked: Levels and Actions. Levels are identified from the platform logs: Level 1 (no actions) and Level 2 (action: replicate jobs), separated by the threshold τ1. [Figure: histogram of d (frequency vs. d from 0 to 1), titled "Estimation by Median", split at τ1 into Level 1 and Level 2] [Figure: Replication process for one task] Actions: job replication; cancel replicas with bad performance; replicate only if all active replicas are running.
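The level and replication rules on this slide can be sketched as a small decision function. Everything named here is hypothetical: the value of τ1 is not given in the transcript, the strict comparison at the threshold is an assumption, and the state string "running" and the default limit of five replicas (taken from the experiment parameters stated later in the deck) are illustrative choices.

```python
from typing import Sequence


def incident_level(d: float, tau1: float) -> int:
    """Map a degree d to level 1 (no action) or level 2 (replicate jobs).

    Assumption: degrees strictly above tau1 are level 2.
    """
    return 2 if d > tau1 else 1


def should_replicate(d: float,
                     tau1: float,
                     replica_states: Sequence[str],
                     max_replicas: int = 5) -> bool:
    """Decide whether to submit one more replica of a task."""
    if incident_level(d, tau1) < 2:
        return False  # level 1: no action
    if len(replica_states) >= max_replicas:
        return False  # replication limit reached
    # Replicate only if all active replicas are running (none still queued).
    return all(state == "running" for state in replica_states)


# Hypothetical threshold and replica states.
print(should_replicate(0.7, tau1=0.6, replica_states=["running"]))
print(should_replicate(0.7, tau1=0.6, replica_states=["running", "queued"]))
```

Requiring every active replica to be running before adding another keeps replicas from piling up in site queues, which is one plausible reading of the last rule on the slide.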
  • 25. Experiments. Goal: Self-Healing vs No-Healing; cope with recoverable errors. Metrics: makespan of the activity execution; resource waste $w = \frac{(\mathrm{CPU} + \mathrm{data})_{\text{self-healing}}}{(\mathrm{CPU} + \mathrm{data})_{\text{no-healing}}} - 1$. For w < 0, self-healing consumed fewer resources; for w > 0, self-healing wasted resources.
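The resource waste metric is a simple ratio and can be computed as below. This is a minimal sketch: the function name is hypothetical, the inputs are assumed to be already aggregated (CPU + data) totals in a common unit (the transcript does not say how the two are combined), and the sample numbers are made up for illustration.

```python
def resource_waste(self_healing_total: float, no_healing_total: float) -> float:
    """Return w = (CPU + data)_self-healing / (CPU + data)_no-healing - 1.

    Negative values mean self-healing consumed fewer resources than the
    no-healing run; positive values mean it consumed more.
    """
    if no_healing_total <= 0:
        raise ValueError("no-healing consumption must be positive")
    return self_healing_total / no_healing_total - 1.0


# Hypothetical totals: self-healing used 74 units where no-healing used 100.
print(resource_waste(74.0, 100.0))
```

A self-healing run that uses 74 units where the no-healing run uses 100 yields w close to -0.26, i.e. a 26% reduction.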
  • 26. Experiment Conditions. Software: Virtual Imaging Platform; MOTEUR workflow engine; DIRAC pilot job system. Infrastructure: European Grid Infrastructure (EGI), production and shared; Self-Healing and No-Healing launched simultaneously. Experiment parameters: task and file replication limited to 5; failed task resubmission limited to 5.
  • 27. Applications. FIELD-II/pasa: ultrasound imaging simulation; 122 invocations; CPU time: 15 min; ~210 MB; data-intensive. Image courtesy of ANR project US-Tagging (O. Bernard, M. Alessandrini), http://www.creatis.insa-lyon.fr/us-tagging/news Mean-Shift/hs3: image denoising; 250 invocations; CPU time: 1 hour; ~182 MB; CPU-intensive. Image courtesy of Ting Li, http://www.creatis.insa-lyon.fr
  • 28. Results. Experiment: tests if recoverable errors are detected. [Figure: makespan (s) per repetition (1 to 5) for No-Healing and Self-Healing, one panel per application] FIELD-II/pasa: self-healing speeds up execution by up to a factor of 4; w per repetition: 1: -0.10, 2: -0.15, 3: -0.09, 4: 0.05, 5: -0.26. Mean-Shift/hs3: self-healing speeds up execution by up to a factor of 2.6; w per repetition: 1: -0.02, 2: -0.20, 3: -0.02, 4: -0.02, 5: -0.01. The Self-Healing process reduced resource consumption by up to 26% compared to the No-Healing execution.
  • 29. Conclusions. Science-gateway model of workload archive: illustrated with traces of the VIP from 2011/2012. Added value compared to infrastructure-level traces: exactly identifies tasks and users; distinguishes additional workload artifacts from real workload; provides fine-grained information about tasks; provides the ground truth of bags of tasks. Self-healing of workflow incidents: implements a generic MAPE-K loop; incident degrees are computed online; speeds up execution by up to a factor of 4; reduces resource consumption by up to 26%; successful example of a self-healing loop deployed in production. VIP is openly available at http://vip.creatis.insa-lyon.fr and traces are available to the community in the Grid Observatory: http://www.grid-observatory.org
  • 30. A science-gateway workload archive application to the self-healing of workflow incidents. Thank you for your attention. Questions? Acknowledgments: VIP users and project members; French National Agency for Research (ANR-09-COSI-03); European Grid Initiative (EGI); France-Grilles