Persistent Systems
                                  January 5, 2013


                        Software Faults, Failures and
                             Their Mitigations



                  Kishor Trivedi
Duke High Availability Assurance Lab (DHAAL)
     Dept. of Electrical & Computer Engineering
                   Duke University
                 Durham, NC 27708
                  kst@ee.duke.edu
             www.ee.duke.edu/~kst
             Copyright © 2013 by K.S. Trivedi
Duke University
Research Triangle Park (RTP), North Carolina, USA
[Map: Duke, UNC-CH and NC State]
Duke University
• U.S. News & World Report in its 2012 edition,

   – ranked the university's undergraduate program
     8th among all universities in USA




Duke University




NCAA Men’s Basketball Champions 2010 (also 1991, 1992 & 2001)
Trivedi’s research triangle
• Theory: stochastic modeling methods & numerical solution methods
   – Large fault trees, stochastic Petri nets
   – Large/stiff Markov & non-Markov models
   – Fluid stochastic models
   – Performability & Markov reward models
   – Software aging and rejuvenation
   – Security, survivability, resilience quantification
   – Machine learning
• Software packages
   – HARP (NASA), SAVE (IBM), IRAP (Boeing)
   – SHARPE, SPNP, SREPT
• Applications: reliability/availability/performance
   – Avionics (Boeing), Space (NASA/JPL)
   – Power systems (GE)
   – Automobile systems (GM)
   – Computer systems (EMC, SUN, HP, TCS)
   – Telco systems (AT&T, Lucent, Avaya)
   – Computer networks (Motorola)
   – Virtualized data center (NEC)
   – Cloud computing (IBM, NEC, Cisco)
   – Software aging & rejuvenation (Huawei)
• Books: Blue, Red, White
Textbooks
• Probability and Statistics with Reliability, Queuing, and Computer Science Applications, second edition, John Wiley, 2001 (Blue book) [first edition by Prentice-Hall, 1982]
• Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the SHARPE Software Package, Kluwer (now Springer), 1996 (Red book)
• Queuing Networks and Markov Chains, John Wiley, second edition, 2006 (White book)
DHAAL & Industry
                            A Success Story
•   Reliability Prediction of Boeing 787 Current Return Network for FAA
    Certification
•   Security Quantification (DARPA SITAR)/NSF
•   Survivability Quantification for Lucent POTS and Siemens Smartgrid
•   Cloud performance, availability, power with IBM and Cisco
•   Reliability/Availability Prediction of SIP protocol on IBM WebSphere
•   Software Aging and Rejuvenation: state of the art, theory, measurements,
    and implementation (IBM x-series); Huawei
•   Cloud computing security (Measurements, quantification, and
    implementation) - NATO Science for Peace and Security
•   NASA-JPL Failures data analytics
•   NEC collaboration for Performability Management (VMs allocation, VMs
    and VMM rejuvenation, etc) in Virtualized Data Center
•   Data analytics (statistical and machine learning techniques used to
    interpret huge volumes of data) - Wipro Technologies
•   Software Reliability/Availability/Performability analysis – Short courses,
    seminars, and consulting - Tata Consultancy Services
Outline
•   Motivation
•   A Real System
•   Software Fault Classification
•   Environmental Diversity
•   Methods of Mitigation
•   Software Aging and Rejuvenation
•   Conclusions




Pervasive Dependence on Computer Systems
               Need for High Reliability/Availability

                    Communication



Health & Medicine                                  Avionics



                                                         Banking
 Entertainment



Basic Definitions
• Steady-state availability (Ass), or just availability
    Long-term probability that the system is available when
     requested:

               Ass = MTTF / (MTTF + MTTR)

    MTTF is the system mean time to failure, a complex
     combination of component MTTFs

    MTTR is the system mean time to recovery
     - may consist of many phases
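A minimal sketch of the availability formula above. The function name and the example figures (MTTF of 1000 hours, MTTR of 1 hour) are illustrative assumptions, not values from the talk.

```python
def steady_state_availability(mttf: float, mttr: float) -> float:
    """Ass = MTTF / (MTTF + MTTR); both arguments must use the same time unit."""
    if mttf <= 0 or mttr < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf / (mttf + mttr)


# Hypothetical system: fails on average every 1000 hours, recovers in 1 hour.
ass = steady_state_availability(mttf=1000.0, mttr=1.0)
print(f"Ass = {ass:.6f}")
```

Halving MTTR improves Ass by roughly as much as doubling MTTF, which is why fast recovery gets so much attention later in the talk.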
Basic Definitions

• Downtime in minutes per year
  (un)availability is usually presented in terms of annual downtime.


   – Downtime = 8760×60 ×(1- Ass) minutes.


   – 5 NINES (Ass = 0.99999) → 5.26 minutes annual downtime
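The downtime formula can be tabulated for other availability levels. This is a small sketch; the function name and the choice of 2 to 6 nines are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 8760 * 60  # 8760 hours in a non-leap year


def annual_downtime_minutes(availability: float) -> float:
    """Downtime = 8760 x 60 x (1 - Ass) minutes per year."""
    return MINUTES_PER_YEAR * (1.0 - availability)


for nines in range(2, 7):
    ass = 1.0 - 10.0 ** (-nines)
    print(f"{nines} nines: {annual_downtime_minutes(ass):10.2f} minutes/year")
```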




Number of Nines– Reality Check

• 49% of Fortune 500 companies experience at least 1.6 hours of
  downtime per week

   – Approx. 80 hours/year = 4800 minutes/year

   – Ass = (8760 - 80)/8760 ≈ 0.9909

   – That is, between 2 NINES and 3 NINES!


• This study counts planned and unplanned downtime
  together
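The reality check can be reproduced in a few lines. This sketch assumes 52 weeks per year and defines the "number of nines" as -log10(1 - Ass); both function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760


def availability_from_weekly_downtime(hours_per_week: float) -> float:
    """Convert average weekly downtime into steady-state availability."""
    downtime_per_year = hours_per_week * 52
    return (HOURS_PER_YEAR - downtime_per_year) / HOURS_PER_YEAR


def number_of_nines(availability: float) -> float:
    """0.99 gives 2 nines, 0.999 gives 3 nines, and so on."""
    return -math.log10(1.0 - availability)


ass = availability_from_weekly_downtime(1.6)
print(f"Ass = {ass:.4f}, nines = {number_of_nines(ass):.2f}")
```

Using the unrounded 83.2 hours/year instead of 80 changes the digits slightly, but the result still falls between 2 and 3 nines.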


Achieving High Availability
              is a Challenge
• Black September 2011, in the same week:
  – Microsoft Cloud service outage (2.5 hours)
  – Google Docs service outage (1 hour)
     • A memory leak due to a software update


• Sept. 2012 GoDaddy (4 hours)
  – 5 million websites affected
• Oct. 2012 Amazon
  – 10/15/2012 Webservices – 6 hours (Memory leak)
  – 10/27/2012 EC2 – > 2 hours

Downtime Costs per Hour
   •   Brokerage operations                                                              $6,450,000
   •   Credit card authorization                                                         $2,600,000
   •   eBay (1 outage 22 hours)                                                           $225,000
   •   Amazon.com                                                                         $180,000
   •   Package shipping services                                                          $150,000
   •   Home shopping channel                                                              $113,000
   •   Catalog sales center                                                                 $90,000
   •   Airline reservation center                                                           $89,000
   •   Cellular service activation                                                          $41,000
   •   On-line network fees                                                                 $25,000
   •   ATM service fees                                                                     $14,000
Sources: InternetWeek 4/3/2000; Fibre Channel: A Comprehensive Introduction, R. Kembel
          2000, p.8. ”...based on a survey done by Contingency Planning Research."
High Reliability/Availability



• Hardware fault tolerance, fault management,
  reliability/availability modeling/assurance relatively well
  developed

• System outages are more often due to software faults


• Key Challenge: Software reliability is one of the
  weakest links in system reliability/availability
Software is the problem
• Jim Gray’s paper titled “Why do computers stop and what can be done about it?”
  started to point out this trend in 1985, followed by his paper
  “A census of Tandem system availability between 1985 and 1990”
[Figure: 1985 vs. 2005, across different industries]
Increasing SW Failure Rate?
Planetary Missions Flight Software: A. Nikora of JPL
[Figure: by mission name, in launch order: Mars Global Surveyor, Pathfinder, CASSINI, Mars Climate Orbiter, Mars Polar Lander, Stardust, Mars Odyssey, Genesis, Mars Exploration Rover, Deep Impact, Mars Reconnaissance Orbiter]
• The interval between the first and last launch: 8.76 years.
• The interval between successive launches ranges from 23 to 790 days.
• Similar results for ground software
High Reliability/Availability:
                               Software is the problem



• Fault avoidance
   – good software engineering practices
   – difficult for large/complex software systems

   – Impossible to fully test and verify if software is fault-free
     “Testing shows the presence, not the absence, of bugs”
            - E. W. Dijkstra

• Yet there are stringent requirements for failure-free
  operation



High Reliability/Availability:
           Software is the problem (2)




Software fault tolerance is a potential
solution to improve software reliability, given that
fault-free software is virtually impossible




Software Fault Tolerance
                        Classical Techniques


Design diversity
   – N-version programming
   – Recovery block


Expensive → not used much in practice!

Yet there are stringent requirements for failure-free
  operation

Challenge: Affordable Software Fault Tolerance

Outline
•   Motivation
•   A Real System
•   Software Fault Classification
•   Environmental Diversity
•   Methods of Mitigation
•   Software Aging and Rejuvenation
•   Conclusions




High availability SIP Application Server
                   Configuration on IBM WebSphere


PRDC 2008 and ISSRE 2010 papers




High availability SIP Application Server
            configuration on WebSphere

Hardware configuration:
   – Two BladeCenter chassis; 4 blades (nodes) on each chassis (1 chassis
     sufficient for performance)

Software configuration:
   – 2 copies of SIP/Proxy servers (1 sufficient for performance)

   – 12 copies of WAS (6 sufficient for performance)

   – Each WAS instance forms a redundancy pair (replication domain) with
     WAS installed on another node on a different chassis

• The system has hardware redundancy and software
  redundancy

High availability SIP Application Server
           configuration on WebSphere


Software Fault Tolerance
  – Identical copies of SIP proxy used as backups (hot spares)
  – Identical copies of WebSphere Applications Server (WAS) used
    as backups (hot spares)
  – Type of software redundancy – (not design diversity) but
    replication of identical software copies
  – Normal recovery
     • restart software, reboot node or fail-over to a software replica; only
       when all else fails, a “software repair” is invoked




Escalated levels of Recovery
A real example: Avaya Servers and Media Gateways
The flowchart depicts the actions taken for recovery after a failure is detected.
Try the simplest recovery method first, then a more complex one, and so on.
   1. Single Process Restart (SPR)
   2. If 3 SPRs within 60 seconds: System Warm Restart (SWR)
   3. If 3 SWRs within 15 min: System Cold Restart (SCR)
   4. If 3 SCRs within 15 min: Avaya Communication Manager Software reloads (ACMSR)
   5. If 3 ACMSRs within 15 min: OS Reboot
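The escalation rule (try the cheapest action first, and move up one level once that action has been used three times within its time window) can be sketched as below. This is an illustrative policy object, not Avaya's implementation: the class name, the sliding-window bookkeeping, and the choice not to reset lower-level counters after an escalation are all assumptions.

```python
from collections import deque

# (action, max uses within window, window in seconds); thresholds as on the slide.
LEVELS = [
    ("SPR", 3, 60),
    ("SWR", 3, 15 * 60),
    ("SCR", 3, 15 * 60),
    ("ACMSR", 3, 15 * 60),
    ("OS Reboot", None, None),  # last resort, no further escalation
]


class EscalatingRecovery:
    """Chooses a recovery action for each detected failure."""

    def __init__(self, levels=LEVELS):
        self.levels = levels
        self.history = [deque() for _ in levels]  # timestamps of past actions

    def on_failure(self, now: float) -> str:
        for i, (name, limit, window) in enumerate(self.levels):
            recent = self.history[i]
            if limit is not None:
                # Forget uses that fell out of the time window.
                while recent and now - recent[0] > window:
                    recent.popleft()
                if len(recent) >= limit:
                    continue  # this level was tried too often: escalate
            recent.append(now)
            return name
        raise AssertionError("unreachable: the last level has no limit")


policy = EscalatingRecovery()
for t in (0, 10, 20, 30, 200):
    print(t, policy.on_failure(t))
```

In this run the first three failures trigger process restarts, the fourth (still inside the 60-second window) escalates to a warm restart, and the failure at t = 200 starts again from a process restart.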
Software Fault Tolerance: New
                 Thinking

Retry, restart, reboot!

  – Known to help in dealing with hardware
    transients


  – Do they help in dealing with failures caused by
    software bugs?

  – If yes, why?
A Cartoon




Why is this true………at least for computers?
Software Fault Tolerance: New
                   Thinking

Failover to an identical software replica (that is not a
  diverse version)

   – Does it help?

   – If yes, why?



Twenty years ago this would be considered crazy!

Outline
• Motivation
• A Real System
• Software Fault Classification
   – Fighting Bugs: Remove, Retry, Replicate and Rejuvenate, M.
     Grottke and K. Trivedi, IEEE Computer Magazine, Feb. 2007
• Environmental Diversity
• Methods of Mitigation
• Software Aging and Rejuvenation
• Conclusions




Software Faults
main threats to high reliability,
     availability & safety



IFIP Working Group 10.4 (Laprie)

 • Failure occurs when the delivered service no
   longer complies with the desired output.
 • Error is that part of the system state which is
   liable to lead to subsequent failure.
 • Fault is the adjudged or hypothesized cause of an
   error.
Faults are the cause of errors that may lead to failures:
          Fault → Error → Failure


Need to Classify bug types
• We submit that a software fault tolerance
  approach based on retry, restart, reboot or fail-over
  to an identical software replica (not a
  diverse version) works because a significant
  number of software failures are caused by
  Mandelbugs, as opposed to the traditional
  software bugs now called Bohrbugs



Need to Classify bug types

• In recent years, researchers have reported the
  phenomenon of “software aging” (i.e.,
  degraded performance and/or increased
  failure rate of long-running software systems).
• Puzzle: How can performance and failure rate
  change if the software code is not modified?!
⇒ Study software fault types and their
  relationships


Jim Gray’s Definitions

• The terms “Bohrbug” and “Heisenbug” were
  first used in print by Jim Gray in 1985.
• “Bohrbugs, like the Bohr atom, are solid, easily
  detected by standard techniques, and hence
  boring.”
• “Most production software faults are soft. If
  the program state is reinitialized
  and the failed operation is retried,
  the operation will not fail a second time. … The
  assertion that most production software bugs are soft
  – Heisenbugs that go away when you look at them –
  is well known to systems programmers.” (Gray, 1985)
[Photo: J. Gray]

Bruce Lindsay’s Definition
• Based on Gray’s paper, researchers
  have often equated Heisenbugs with
  soft faults.
• However, when Bruce Lindsay
  originally coined the term in the 1960s
  (while working with Jim Gray), he had
  a narrower definition in mind.
[Photo: B. Lindsay, photo by T. Upton]

• “Heisenbugs as originally defined …
  are bugs in which clearly the system behavior is incorrect,
  and when you try to look to see why it’s incorrect, the
  problem goes away.” (Lindsay, 2004)
• The term alludes to the physicist Werner Heisenberg and his
  Uncertainty Principle.
Heisenbug – Our Definition
• Heisenbug := A fault that stops
  causing a failure or that manifests
  differently when one attempts to
  probe or isolate it.
• How can probing affect the bug?
   1. Some debuggers initialize unused
      memory to default values, thus
      preventing failures due to
      improper initialization.
   2. Trying to investigate a failure can
      influence process scheduling in
      such a way that a scheduling-
      related failure does not occur
      again.

A Classification of Software Faults
• Bohrbug := A fault that is easily
  isolated and that manifests
  consistently under a well-defined set
  of conditions, because its activation
  and error propagation lack
  complexity.
   Example: A bug causing a failure
    whenever the user enters a negative
    date of birth
   Since they are easily found, Bohrbugs tend to be
    detected and fixed during the software testing phase.
   The term alludes to the physicist Niels Bohr and his
    rather simple atomic model.
Mandelbug – Definition

• Mandelbug := A fault whose
  activation and/or error
  propagation are complex.
  Typically, a Mandelbug is
  difficult to isolate, and/or the
  failures caused by it are not
  systematically reproducible.
• Example: A bug whose
  activation is scheduling-dependent
• The residual faults in a thoroughly-tested piece of
  software are mainly Mandelbugs.
• The term alludes to the mathematician Benoît
  Mandelbrot and his research in fractal geometry.

Mandelbug: “Complexity” (1)




• The explanation of the possible sources of complexity is based on the
  “chain of threats” linking faults with errors and failures:




• First source of complexity: Time lag between fault activation and failure
  occurrence, e.g., because several different error states have to be
  traversed in the error propagation.
• Example: The result of an erroneous calculation may at first be kept in the
  system memory and cause a failure only later, when it is being accessed
  and used.



Mandelbug: “Complexity” (2)




• Second source of complexity: Fault activation and/or error
  propagation depend on interactions between conditions occurring
  inside the application and conditions that accrue within the system-
  internal environment of the application.




• Example: A fault causing failures due to side-effects of other
  applications
Mandelbugs: Consequences



• Mandelbugs are difficult to detect and remove
  during the software testing phase.
• An operation that failed due to a Mandelbug may
  execute correctly upon retry even if the fault has
  not been removed; changing the environment may
  suffice.
• Potential recovery techniques:
   –   “Microreboot” of individual components
   –   Application restart
   –   System reboot
   –   Failover to a standby component (replicate)
   –   Manual recovery
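Why a retry can succeed without removing the fault can be shown with a toy example. In this sketch the operation fails only when an environment condition holds (no free slots), and the recovery action changes the environment, not the code. All names (`TransientFailure`, `retry`, the `env` dictionary) are hypothetical.

```python
class TransientFailure(Exception):
    """Failure that depends on the environment, not on the input."""


def retry(operation, attempts=3, change_environment=None):
    """Re-run `operation`, optionally changing the environment between tries."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except TransientFailure as err:
            last_error = err
            if change_environment is not None:
                change_environment()  # e.g. restart, failover, free resources
    raise last_error


# Toy environment: the fault is activated only when no slots are free.
env = {"free_slots": 0}


def operation():
    if env["free_slots"] == 0:
        raise TransientFailure("resource exhausted")
    return "ok"


def release_resources():
    env["free_slots"] = 4  # stands in for a restart that frees resources


result = retry(operation, change_environment=release_resources)
print(result)
```

The code of `operation` is identical on both attempts; only the environment differs. A Bohrbug, by contrast, would fail on every attempt regardless of `change_environment`.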
Examples of Types of Bugs in
                   IT System
• Mandelbugs in IT Systems: Trivedi, Mansharamani, Kim,
  Grottke, and Nambiar. “Recovery from failures due to
  Mandelbugs in IT systems”. PRDC 2011.

• The projects ranged across a number of business systems in
  the banking, financial, government, IT, pharmacy, and
  telecom sector.




Examples of Types of Bugs
                in IT System (cont.)
• Example of a Mandelbug in a large telecom system
     – Slow response times of the front end screens.
     – The problem was hard to analyze since the screens would freeze at
       random points in the day.
         • As the days went by the frequency of incidents of these freezes kept increasing.

• Class of Mandelbug encountered
     – The users would wait for some time and their operations would
       resume.
     – The IT operations team rebooted the servers and the operations could
       resume for a few hours.




Examples of Types of Bugs
                in IT System (cont.)
• Cause of the problem
     – Whenever a front end screen was invoked, a temporary file was
       created at the centralized server.
         • This file was never cleaned up even after a screen was closed.
     – As a result, tens of thousands of small files kept accumulating on the
       disk causing sluggish behavior.
• Solution of the problem
     – A cleanup utility was written to move these files periodically to
       another file system and later delete them.
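A cleanup utility of the kind described might look like the sketch below: stale files are first moved to an archive directory and deleted from there later. The function names, the age thresholds and the directory layout are assumptions; the original utility is not shown in the talk.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def archive_old_files(src: Path, archive: Path, max_age_seconds: float,
                      now: Optional[float] = None) -> int:
    """Move files older than max_age_seconds from src to archive."""
    now = time.time() if now is None else now
    archive.mkdir(parents=True, exist_ok=True)
    moved = 0
    for path in list(src.iterdir()):  # snapshot: the directory changes as we move
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            shutil.move(str(path), str(archive / path.name))
            moved += 1
    return moved


def purge_archive(archive: Path, max_age_seconds: float,
                  now: Optional[float] = None) -> int:
    """Delete archived files older than max_age_seconds."""
    now = time.time() if now is None else now
    deleted = 0
    for path in list(archive.iterdir()):
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            deleted += 1
    return deleted
```

Run periodically (for example from a scheduler), `archive_old_files` keeps the number of files in the hot directory bounded, which treats the symptom. Removing the temporary file when the screen closes would remove the fault itself.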




Examples of Types of Bugs
                in IT System (cont.)
• Example of a Mandelbug in a government tax information
  system
     – All organizations must submit the income tax deducted at source
       (TDS) records for all of their employees.
     – Sporadically, when a large corporation uploaded its file,
       the application server would crash.
• Class of Mandelbug encountered
     – The IT operations staff would then increase the JVM heap size and
       restart the JVM, which would allow the file to be uploaded without
       any problem.




Examples of Types of Bugs
               in IT System (cont.)
• Cause of the problem
     – The probability of a failure occurrence increased after each JVM
       restart, as the heap got consumed more and more.
• Solution of the problem
     – Reconfigure system parameters to resume operations successfully.




Aging-related Bug – Definition
    • Aging-related bug := A fault that
      leads to the accumulation of errors
      either inside the running application
      or in its system-context
      environment, resulting in an
      increased failure rate and/or
      degraded performance.
    Example:
       A bug causing memory leaks in the application

    Note that the aging phenomenon requires a delay
     between fault activation and failure occurrence.
    Note also that the software appears to age due to such
     a bug; there is no physical deterioration
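The delay between fault activation and failure can be illustrated with a toy leak model: every request leaks a little memory, and the process only crashes once the leak exceeds capacity. The function, its parameters and the numbers are assumptions made for illustration, not measurements.

```python
from typing import Optional


def count_crashes(requests: int, leak_per_request: int, capacity: int,
                  rejuvenate_every: Optional[int] = None) -> int:
    """Simulate a leaking process and return the number of unplanned crashes."""
    leaked = 0
    crashes = 0
    for i in range(1, requests + 1):
        leaked += leak_per_request      # fault is activated on every request
        if leaked > capacity:           # failure occurs only much later
            crashes += 1
            leaked = 0                  # crash and restart clear the leak
        elif rejuvenate_every and i % rejuvenate_every == 0:
            leaked = 0                  # planned restart before exhaustion
    return crashes


print("no rejuvenation:  ", count_crashes(1000, 1, 100))
print("restart every 50: ", count_crashes(1000, 1, 100, rejuvenate_every=50))
```

The code never changes between requests, yet the chance of failure grows with uptime because the error state accumulates. A planned restart resets that state before it causes a failure.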
Relationships
• Bohrbug and Mandelbug are complementary
  antonyms.
• Aging-related bugs are a subtype of Mandelbugs
[Diagram: software faults split into Bohrbugs and Mandelbugs, with Aging-Related Bugs inside Mandelbugs]
Important Questions about these Bugs

• What fraction of bugs are Bohrbugs, Mandelbugs and aging-related
  bugs?
    – How do these fractions vary
        • over time
        • over projects, languages, application types, …
    – Need measurements
    – Current NASA/JPL project with Allen Nikora & Michael Grottke; preliminary
      results from one NASA software project:
        • 52% Bohrbugs
        • 35% Mandelbugs (non-aging-related)
        • 4% Aging-related bugs
        • 7% Operator related
        • 2% Unclassified
    – Very similar results for Linux, MySQL, Apache AXIS, httpd
• What are the methods of mitigation for the different fault types?
Trends in SW Fault Type Proportions
                              Planetary Missions Flight Software




• Fault Type Proportions vs. Runtime for Four Earlier Missions (of 8 missions
  analyzed)
• Result: The proportion of Bohrbugs seems to settle at around the same value. Such
  a convergence to similar values is less obvious for the other fault types.

Outline
•   Motivation
•   A Real System
•   Software Fault Classification
•   Environmental Diversity
•   Methods of Mitigation
•   Software Aging and Rejuvenation
•   Conclusions




Environmental diversity
A new thinking to deal with software faults and failures




Software Fault Tolerance: New
                  Thinking
New thinking: Environmental Diversity as opposed to
  Design Diversity


Our claim is that this works: since failures due to
  Mandelbugs are not negligible, we have an
  affordable software fault tolerance technique that
  we call Environmental Diversity


What is environmental diversity?
• The underlying idea of environmental diversity
  – Retry a previously failed operation and it works
  – Why?
  – Because the environment in which the operation is
    executed has changed enough to avoid the fault
    activation.
• The environment is understood as
  – OS resources, other applications running concurrently
    and sharing the same resources, interleaving of
    operations, concurrency, or synchronization.
What is environmental diversity?

         • The execution of an application depends on
           the environment
                                  Restart the application


Ap1     Ap2     Ap3      Ap4                          Ap4         Ap6   Ap3     Ap5

      Operating System                                       Operating System


         Hardware                                                  Hardware


Environment at time t1                           Environment at time t1+n
Environmental Diversity
• Restarting an application, rebooting a node or failing over to an
  identical standby replica works because of the environmental
  diversity underlying these actions;
   – By environment here we mean the resources of the OS, other
     applications running concurrently and sharing system resources,
     interleaving of operations, concurrency, synchronization etc.
• Environmental Diversity uses time redundancy over
  expensive design diversity
       •   [Adams] Restart
       •   [Jalote et al.] Rollback, rollforward
       •   [Patriot] Occasional reboot, “switch off and on”
       •   [Avaya Swift] restart process; failover to a replica
       •   [IBM SIP] escalated levels: restart, reboot, failover…
       •   [IBM Director-X-series] Rejuvenation



Outline
•   Motivation
•   A Real System
•   Software Fault Classification
•   Environmental Diversity
•   Methods of Mitigation
•   Software Aging and Rejuvenation
•   Conclusions




Methods of Mitigation



Mitigation




Bohrbugs: Remove


• Find and fix the bugs during testing
• Failure data collected during testing
• Calibrate a software reliability growth model (SRGM) using failure data;
  this model is then used for prediction
• Many SRGMs exist (JM, NHPP, HGRGM, etc.)
     – Books by Lyu, Musa, Cai
     – Gokhale & Trivedi, A Time/Structure Based Software Reliability Model, Annals
       of Software Engineering, 1999
• Measurements → Empirical (statistical) models
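As one possible illustration of calibrating an SRGM, the sketch below fits the Goel-Okumoto model, a well-known NHPP model with mean value function m(t) = a(1 - e^(-bt)), by maximum likelihood. The choice of this particular model, the bisection solver and the failure times are assumptions; the data are synthetic and not taken from any project in the talk.

```python
import math


def fit_goel_okumoto(failure_times, observation_end):
    """Return MLE estimates (a, b) for m(t) = a * (1 - exp(-b * t))."""
    n = len(failure_times)
    total = sum(failure_times)
    T = observation_end
    if total >= n * T / 2:
        raise ValueError("data show no reliability growth; MLE does not exist")

    def score(b):
        # Derivative of the log-likelihood in b, with a already eliminated.
        return n / b - total + n * T * math.exp(-b * T) / math.expm1(-b * T)

    lo, hi = 1e-9, 1.0
    while score(hi) > 0:          # widen the bracket until the sign changes
        hi *= 2.0
    for _ in range(200):          # bisection on the decreasing score function
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    b = (lo + hi) / 2.0
    a = n / -math.expm1(-b * T)   # expected total number of faults
    return a, b


# Synthetic cumulative failure times (hours) observed over 400 hours of testing.
times = [5, 12, 21, 33, 48, 68, 95, 135, 200, 320]
a_hat, b_hat = fit_goel_okumoto(times, observation_end=400.0)
print(f"a = {a_hat:.2f} expected faults in total, b = {b_hat:.5f} per hour")
print(f"expected faults remaining: {a_hat - len(times):.2f}")
```

The fitted `a` predicts how many faults testing would eventually expose. The gap between `a` and the number already found gives a rough estimate of what is left, for the faults this kind of model can see.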


Mitigation




OS Availability Model (IBM BladeCenter)


                                Fix (Failed due to a Bohrbug)




        Reboot (Failure due to a Mandelbug)
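The two recovery paths named here (reboot after a Mandelbug failure, fix after a Bohrbug failure) can be captured in a deliberately simplified three-state Markov sketch: Up, Down awaiting reboot, Down awaiting fix. The states, rates and numbers below are assumptions for illustration and not the BladeCenter model itself.

```python
def os_availability(lam_mandel: float, lam_bohr: float,
                    mu_reboot: float, mu_fix: float) -> float:
    """Steady-state probability of the Up state; all rates are per hour.

    Balance equations: pi_reboot = pi_up * lam_mandel / mu_reboot and
    pi_fix = pi_up * lam_bohr / mu_fix, with the three probabilities summing to 1.
    """
    return 1.0 / (1.0 + lam_mandel / mu_reboot + lam_bohr / mu_fix)


# Hypothetical rates: a Mandelbug failure every 1000 h, cleared by a 10-minute
# reboot; a Bohrbug failure every 10000 h, needing a 4-hour fix.
ass = os_availability(lam_mandel=1 / 1000, lam_bohr=1 / 10000,
                      mu_reboot=6.0, mu_fix=0.25)
print(f"Ass = {ass:.6f}")
```

With these made-up rates the rarer Bohrbug failures still contribute more downtime than the Mandelbug failures, because a fix takes far longer than a reboot.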




Availability model of a Proxy or a WAS (IBM SIP on WebSphere)



                                                          •   Failure detection
                                                               – By WLM
                                                               –    By Node Agent
                                                               –    Manual detection
                                                          •   Recovery
                                                               – Node Agent
                                                                      • Auto process restart
                                                               –    Manual recovery
                                                                      • Process restart
                                                                      • Node reboot
                                                                      • Repair




                    Application server and proxy server




Outline
•   Motivation
•   A Real System
•   Software Fault Classification
•   Environmental Diversity
•   Methods of Mitigation
•   Software Aging and Rejuvenation
•   Conclusions




Aging Related Bugs: Replicate, Restart,
         Reboot, Rejuvenate


Software Aging

Aging phenomenon
    Error conditions accumulating over time


                  Performance degradation, system failure

Main causes of Software Aging
Memory leaks, fragmentation, unterminated threads, data corruption, round-off
errors, unreleased file locks, etc.
Observed systems
    OS, middleware, Netscape, Internet Explorer, etc.
Software Aging - Definition

“Software Aging” phenomenon

  Long-running software tends to show an increasing
  failure rate.

  Not related to application program becoming
  obsolete due to changing
  requirements/maintenance.

  Software appears to age; no real deterioration
Software Aging - Examples

• Cisco Catalyst Switch [Matias Jr.]
• File system aging [Smith & Seltzer]
• Gradual service degradation in the AT&T transaction processing
  system [Avritzer et al.]
• Error accumulation in Patriot missile system’s software [Marshall]
• Resource exhaustion in Apache [Li et al., Grottke et al.]
• Physical memory degradation in a SOAP-based Server [Silva et al.]
• Software aging in Linux [Cotroneo et al.]
• Crash/hang failures in general purpose applications after a long
  runtime


Measurements Showing Resource
               Exhaustion or Depletion

[Figure: two measured time series, panels "Real Memory Free" and "File Table Size"]

A Methodology for Detection and Estimation of Software Aging,
S. Garg, A. van Moorsel, K. Vaidyanathan and K. Trivedi.
Proc. of IEEE Intl. Symp. on Software Reliability Engineering, Nov. 1998.
Software Fault Types & Their Mitigation




Software rejuvenation

Software rejuvenation is a cost effective solution for
improving software reliability by avoiding/postponing
unanticipated software failures/crashes.


It allows proactive recovery to be carried out either
automatically or at the discretion of the
user/administrator


Rejuvenation of the environment, not of software
Software Rejuvenation


Counteracts the software aging phenomenon
   Frees up OS resources; Removes error accumulation


Common techniques for cleaning
   Garbage collection, defragmentation, flushing kernel and file
   server tables etc.


Challenge: Rejuvenation scheduling/granularity


SW Rejuvenation: The Genesis

“Software Rejuvenation: Analysis, Module and
  Applications”, Y. Huang, C. Kintala, N. Kolettis, N.
  Fulton, in FTCS 1995
   An insight into operational software, that no-one had before (at least,
     formally). It changed
       • How practitioners looked at making software more dependable
       • Windfall of performance and dependability modeling problems for academicians
       • Ideas to build better, real-world systems as Internet evolved
       • Led to recognition of “software aging” phenomenon
       • Brought about PhDs, tenure, publications, patents, awards, tools, systems,
         and funding for many people around the world.



Software Rejuvenation
                                        Examples
AT&T billing applications [Huang et al.]
Patriot missile system software - switch off/on every 8 hours [Marshall]
On-board preventive maintenance for long-life deep space missions
   (NASA’s X2000 Advanced Flight Systems Program) [Tai et al.]
IBM Director Software Rejuvenation (x-series) [IBM & Duke Researchers]
Microsoft IIS 5.0 process recycling tool
Process restart in Apache [Li et al.]
ISS FS SSC (ISS File system) - switch off and on every 2 months [NASA ISS
    reports]
For more examples:
    "Software rejuvenation - Do IT & Telco industries use it?". Javier Alonso, Antonio Bovenzi,
    Jinghui Li, Yakun Wang, Stefano Russo, and Kishor Trivedi. The 4th International Workshop
    on Software Aging and Rejuvenation (WoSAR 2012). Held in conjunction with the 23rd
    annual International Symposium on Software Reliability Engineering (ISSRE 2012), Dallas,
    USA, 2012.

Software Rejuvenation – Trade-off

• Advantages
  – Reduces costs of sudden aging-related failures
  – Can be applied at the discretion of the user/administrator


• Disadvantages
  – Direct costs of carrying out rejuvenation
  – Opportunity costs of rejuvenation (downtime, decreased
    performance, lost transactions etc)

      Important research issue:
      Find optimal times to perform rejuvenation!
Software rejuvenation - Approaches



Two approaches based on WHEN:

  Time-Based rejuvenation approaches
     Rejuvenation applied regularly and at predetermined time intervals.

     Widely used in real environments


          – Web servers (Apache)
          – ISS two-months reboot
          – Telecommunication systems

Software rejuvenation - Approaches

Two approaches based on WHEN:

  Measurement (Inspection)-based rejuvenation approaches

     • Threshold based or predictive

     • System metrics continually monitored

     • Rejuvenation triggered when the crash is imminent based on the
       observation/prediction

     • Reduce potentially useless rejuvenation actions and downtime in the process
       [Silva et al.]
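The two measurement-based triggers can be sketched as follows. This is a hedged illustration only: the function names, the use of a least-squares trend for the predictive case, and the sample values are assumptions, not the methods of the cited work.

```python
def least_squares_slope(times, values):
    """Slope of the ordinary least-squares line through (times, values)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    den = sum((t - mean_t) ** 2 for t in times)
    return num / den


def should_rejuvenate(times, free_resource, threshold, horizon):
    """Return 'threshold', 'predicted' or None for a monitored free-resource metric.

    threshold: rejuvenate once the latest sample is at or below this level.
    horizon:   rejuvenate if the extrapolated trend reaches zero within this time.
    """
    latest = free_resource[-1]
    if latest <= threshold:
        return "threshold"          # threshold-based trigger
    slope = least_squares_slope(times, free_resource)
    if slope < 0 and latest / -slope <= horizon:
        return "predicted"          # predictive trigger
    return None


if __name__ == "__main__":
    hours = [0, 1, 2, 3]
    free_mb = [100, 90, 80, 70]     # hypothetical monitored samples
    print(should_rejuvenate(hours, free_mb, threshold=20, horizon=10))
```

Returning `None` while the resource is healthy is what avoids the needless rejuvenations that a purely time-based schedule would perform.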


Software rejuvenation




Software Rejuvenation – Approaches


Two approaches based on HOW:

  Use analytical model to optimize rejuvenation schedule
     • Lucent Bell Labs [Huang et al., ‘95]

     • Duke [IEEE-TC’98, SIGMETRICS’96, ISSRE’95, PRDC’00, SIGMETRICS’01,
       Comp J.’01, SRDS’02, DSN’02, ISSRE’02, DSN’03, IEEE-TR’05]


     • Others [IPDS’98, PNPM’99]


Software Rejuvenation – Approaches


Two approaches based on HOW:

  Use measurements of resource degradation to
    determine/predict optimal rejuvenation schedule

     • Duke [ISSRE’98, ISSRE’99, IBMJRD’01, ISESE’02, IEEE-TPDS’05]

     • Duke (formerly UPC) [GRID’07, IEEE-TC’09, DSN’10]



Failure rate

Preventive maintenance is useful only if failure rate is
  increasing

If the time to failure distribution is exponential then
    failure rate is Constant

Need to assume (and establish) that TTF is IFR
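A short worked derivation of the exponential case, added as an aid. Here F(t) is the time-to-failure CDF, f(t) its density, R(t) = 1 - F(t) the reliability and h(t) the failure (hazard) rate; the symbols are standard but are introduced here, not on the slide.

```latex
h(t) = \frac{f(t)}{R(t)}, \qquad
F(t) = 1 - e^{-\lambda t}
\;\Rightarrow\;
h(t) = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda .
```

Because h(t) does not grow with age, a freshly restarted system is statistically no better than one that has been running, so preventive maintenance would only add its own cost.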


Analytic Models

Single node models
   –   CTMC model
   –   SMP model




Cluster systems
   –   IBM Cluster model (Time-based, condition-based)


Analytic Models
                Software Aging and Rejuvenation


A simple and useful model of increasing failure rate:

         Robust state  →  Failure probable state  →  Failed state

      Time to failure: Hypo-exponential distribution
                   Increasing failure rate  →  aging
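To see the increasing failure rate numerically, the sketch below evaluates the hazard rate of a two-stage hypo-exponential time to failure. The function name and the two rates (mean sojourn times of 240 and 720 hours) are illustrative assumptions.

```python
import math


def hypoexp_hazard(t, rate1, rate2):
    """Hazard rate h(t) = f(t) / R(t) of a two-stage hypo-exponential distribution.

    rate1: rate of leaving the robust state.
    rate2: rate of failing from the failure probable state (must differ from rate1).
    The common factor 1 / (rate2 - rate1) of f(t) and R(t) cancels in the ratio.
    """
    e1 = math.exp(-rate1 * t)
    e2 = math.exp(-rate2 * t)
    density_part = rate1 * rate2 * (e1 - e2)
    survival_part = rate2 * e1 - rate1 * e2
    return density_part / survival_part


if __name__ == "__main__":
    for hours in (0, 100, 500, 1000, 3000):
        print(hours, hypoexp_hazard(hours, 1 / 240, 1 / 720))
```

The hazard starts at zero and rises toward the smaller of the two rates, which is the aging behaviour the three-state model is meant to capture.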




Analytic Models
                    CTMC model [Huang95]

[Figure: two continuous-time Markov chain diagrams]
   Model w/o rejuvenation:  states S0 (robust state), Sp (failure probable state),
                            Sf (failed state); rate labels r1, r2, λ
   Model with rejuvenation: states S0, Sp, Sf plus Sr (rejuvenation state);
                            rate labels r1, r2, r3, r4, λ

From this continuous-time Markov chain model, a closed-form expression can be found
   for the optimal rejuvenation trigger rate (r4)
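A minimal numerical sketch of the model with rejuvenation follows. It assumes the transition structure S0 → Sp at rate r2, Sp → Sf at rate λ, Sf → S0 at rate r1, Sp → Sr at rate r4 and Sr → S0 at rate r3; this arrangement, the function names and all numeric values are assumptions made for illustration. The closed-form optimum mentioned on the slide is not reproduced here.

```python
def steady_state(r1, r2, r3, r4, lam):
    """Steady-state probabilities of the assumed four-state chain.

    Balance equations, written relative to pi_p (probability of Sp):
        pi_0 * r2 = pi_p * (lam + r4)
        pi_f * r1 = pi_p * lam
        pi_r * r3 = pi_p * r4
    """
    rel_0 = (lam + r4) / r2
    rel_f = lam / r1
    rel_r = r4 / r3
    pi_p = 1.0 / (1.0 + rel_0 + rel_f + rel_r)
    return {"S0": rel_0 * pi_p, "Sp": pi_p, "Sf": rel_f * pi_p, "Sr": rel_r * pi_p}


def downtime_cost_rate(pi, cost_fail, cost_rejuv):
    """Expected cost per unit time spent in the failed and rejuvenation states."""
    return pi["Sf"] * cost_fail + pi["Sr"] * cost_rejuv


if __name__ == "__main__":
    # Hypothetical rates per hour: repair takes 30 min, rejuvenation 10 min on average.
    for r4 in (0.0, 0.01, 0.1, 1.0):
        pi = steady_state(r1=2.0, r2=1 / 240, r3=6.0, r4=r4, lam=1 / 720)
        print(r4, downtime_cost_rate(pi, cost_fail=5000.0, cost_rejuv=250.0))
```

Sweeping r4 this way shows how the expected downtime cost responds to the rejuvenation trigger rate for a given set of costs.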
Analytic Models
                    Semi-Markov model [Dohi00]

Relax the assumption of exponentially distributed sojourn times
   (time-independent transition rates)
Hence have a semi-Markov model

[Figure: semi-Markov state diagram with states 0, 1, 2 and 3; labels: State change,
   System Failure, Rejuvenation, Completion of Repair, Completion of Rejuvenation]

Can find closed-form expression for the optimal (deterministic) time to
   rejuvenation trigger
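The idea of an optimal deterministic rejuvenation time can be illustrated with a simplified sketch. It is not the [Dohi00] model: it assumes a plain age-based policy in which the system is rejuvenated at age t0 unless it fails first, a Weibull time to failure with shape 2 (an increasing failure rate), and a grid search instead of a closed form. All names and numbers are hypothetical.

```python
import math


def availability(t0, shape, scale, mean_repair, mean_rejuv, steps=400):
    """Steady-state availability when rejuvenation is triggered at age t0.

    One renewal cycle: expected uptime is the integral of R(t) over [0, t0];
    downtime is mean_repair if a failure occurs before t0, else mean_rejuv.
    """
    def survival(t):
        return math.exp(-((t / scale) ** shape))

    h = t0 / steps
    # Trapezoidal rule for the expected uptime per cycle.
    inner = sum(survival(i * h) for i in range(1, steps))
    uptime = h * (0.5 * survival(0.0) + inner + 0.5 * survival(t0))
    p_fail = 1.0 - survival(t0)
    downtime = mean_repair * p_fail + mean_rejuv * (1.0 - p_fail)
    return uptime / (uptime + downtime)


def best_interval(candidates, **params):
    """Candidate rejuvenation age with the highest availability."""
    return max(candidates, key=lambda t0: availability(t0, **params))


if __name__ == "__main__":
    params = dict(shape=2.0, scale=1000.0, mean_repair=4.5, mean_rejuv=1 / 6)
    t_star = best_interval(range(10, 5001, 10), **params)
    print("best rejuvenation age (hours):", t_star, availability(t_star, **params))
```

Rejuvenating too early wastes time on rejuvenation itself, while rejuvenating too late lets costly failures happen, so the maximum lies at an interior value of t0.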
Rejuvenation in Cluster Systems



                                             Cluster System




[Pfister] Collection of independent, self-contained computer
   systems working together to provide a more reliable and
   powerful system than a single node by itself
Easier scaling to larger systems, high levels of
   availability/performance and low management costs
Rejuvenation for Cluster Systems
                      Motivation




Rejuvenation using the fail-over mechanisms

Long-term benefits in terms of
  availability/performance

Continuous operation (possibly at a degraded level)

   Practically zero downtime

Rejuvenation for Cluster Systems
                           Motivation


Less disruptive and lower overhead than unplanned
  outages

Transparent to user/application

Most current industry initiatives are reactive

Two approaches
   Simple time-based (periodic)
   Condition-based (only from the “failure-impending” state)

Rejuvenation for Cluster Systems
                                    SRN Models


Rejuvenation using the fail-over mechanisms in a rolling fashion

Modeling using SRNs (Stochastic Reward Nets)

Analysis for 2 rejuvenation policies
   Simple time-based policy
       • All nodes rejuvenated successively at the end of each rejuvenation interval
   Condition-based policy
       • Nodes rejuvenated only from the “failure-probable” state




SRN Model
Basic Cluster Model




SRN Model
Simple Time-Based Rejuvenation




Model Parameters

Transition                        Mean time
Tfprob                            240 hours
Tnodefail                         720 hours
Tnoderepair                       30 mins
Tsysrepair                        4 hours
Trejuv                            10 mins

Cost                              Rate
costnodefail                      $5000/hour
costnoderejuv                     $250/hour

Model Measures

  Measures Computed



Unavailability   (#Psysfail == 1) ? 1 : 0
Cost             #Prejuv*costrejuv + #Pnodefail*costnodefail + #Psysfail*costsysfail
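The two measures are reward functions over SRN markings, and the expected measure is the probability-weighted sum of the reward over all markings. The sketch below shows that evaluation. The marking distribution and the value used for `cost_sys_fail` are hypothetical placeholders, since neither appears in the deck; a real distribution would come from solving the SRN.

```python
def unavailability(marking):
    """Reward 1 when the place Psysfail holds a token, else 0."""
    return 1 if marking["Psysfail"] == 1 else 0


def cost(marking, cost_rejuv, cost_node_fail, cost_sys_fail):
    """Cost rate earned in a marking: token counts weighted by per-hour costs."""
    return (marking["Prejuv"] * cost_rejuv
            + marking["Pnodefail"] * cost_node_fail
            + marking["Psysfail"] * cost_sys_fail)


def expected_reward(distribution, reward):
    """Expected steady-state reward: sum of probability * reward over markings."""
    return sum(prob * reward(marking) for marking, prob in distribution)


# Hypothetical steady-state distribution over four markings (not solved from the SRN).
DISTRIBUTION = [
    ({"Prejuv": 0, "Pnodefail": 0, "Psysfail": 0}, 0.970),
    ({"Prejuv": 1, "Pnodefail": 0, "Psysfail": 0}, 0.020),
    ({"Prejuv": 0, "Pnodefail": 1, "Psysfail": 0}, 0.009),
    ({"Prejuv": 0, "Pnodefail": 0, "Psysfail": 1}, 0.001),
]

if __name__ == "__main__":
    print(expected_reward(DISTRIBUTION, unavailability))
    # cost_sys_fail=10000.0 is a placeholder value chosen only for this example.
    print(expected_reward(DISTRIBUTION, lambda m: cost(m, 250.0, 5000.0, 10000.0)))
```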




Results
                               Simple Time-Based Rejuvenation

[Figure: two result plots, panels "8/1 configuration" and "8/2 configuration"]




As the rejuvenation interval increases, rejuvenation is performed less frequently
When the rejuvenation interval is close to zero, the system rejuvenates very frequently, resulting in high cost/downtime
When the rejuvenation interval goes beyond the optimal value, system failures become frequent,
resulting in high cost/downtime
Measuring Performance Variables

Objective
   Detection and validation of aging


Periodically monitor and collect data on the attributes
  responsible for the “health” of the system

Quantify the effect of aging on system resources
   Proposed metric – Estimated time to exhaustion
   Proposed metric – Evaluation function using PCA approach
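A minimal sketch of the "estimated time to exhaustion" metric follows. It assumes a nonparametric trend estimate (Sen's slope, the median of all pairwise slopes) and a linear extrapolation of a free-resource measurement to zero. The function names and the sample data are illustrative assumptions.

```python
from statistics import median


def sen_slope(times, values):
    """Median of the slopes between every pair of observations."""
    slopes = [
        (values[j] - values[i]) / (times[j] - times[i])
        for i in range(len(times))
        for j in range(i + 1, len(times))
        if times[j] != times[i]
    ]
    return median(slopes)


def time_to_exhaustion(times, free_amount):
    """Time until the free resource reaches zero at the estimated trend.

    Returns None when the trend is flat or increasing (no exhaustion predicted).
    """
    slope = sen_slope(times, free_amount)
    if slope >= 0:
        return None
    return free_amount[-1] / -slope


if __name__ == "__main__":
    hours = list(range(10))
    free_kb = [1000 - 2 * t for t in hours]   # hypothetical steady leak of 2 KB/hour
    print(time_to_exhaustion(hours, free_kb))
```

Taking the median makes the estimate insensitive to isolated spikes in the monitored data, which matters for noisy operating-system counters.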




Measuring Performance Variables
Approaches
   Time-based (workload-independent) estimation [Garg98]
   Workload-based estimation [Vaidyanathan99]
   ARMA/ARX models [Li02]
   ALT and ADT techniques [Matias06]
   Non-parametric Algorithms [Dohi00]
   Non-linear models [Hoffman07]
   Principal component Analysis (PCA) and System identification [Jia]
   Pattern recognition [Vaidyanathan & Gross]
   Threshold-based approaches [Silva09]
   Machine Learning Approaches [Alonso10]


Data Collection
  Experimental Setup




SNMP-based resource monitoring tool:
   Data related to OS resource usage (memory, process table, file table etc.)
   and system activity (CPU usage, paging, swapping, NFS, interrupts etc.)
   collected for over 3 months at 10 min intervals




Time Plots
       Non-parametric Regression Smoothing




Real Memory Free                                      File Table Size




 Trend detection: Seasonal Kendall test for trend
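As a hedged illustration of trend detection, the sketch below implements the plain (non-seasonal) Mann-Kendall test. The seasonal Kendall test named above applies the same statistic within each season and combines the results, which is omitted here to keep the example short. The function name and data are assumptions, and the variance formula assumes no tied values.

```python
import math


def mann_kendall(values):
    """Return (S, Z) for the Mann-Kendall trend test.

    S counts concordant minus discordant pairs; Z is its normal approximation
    with continuity correction. |Z| > 1.96 suggests a trend at the 5% level.
    """
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)
    variance = n * (n - 1) * (2 * n + 5) / 18   # valid when there are no ties
    if s > 0:
        z = (s - 1) / math.sqrt(variance)
    elif s < 0:
        z = (s + 1) / math.sqrt(variance)
    else:
        z = 0.0
    return s, z


if __name__ == "__main__":
    free_memory = [100 - 3 * i for i in range(10)]   # hypothetical decreasing series
    print(mann_kendall(free_memory))
```

A significantly negative Z for a free-resource metric is the kind of evidence that supports the presence of aging before any slope is estimated.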

IBM xSeries
         Software Rejuvenation Agent (SRA)


Implemented in a high-availability clustered
  environment

Monitors consumable resources, estimates time to
 exhaustion and generates alerts if exhaustion falls within
 the user notification horizon




IBM xSeries
      Software Rejuvenation Agent (SRA)


IBM Director system management tool
  – Provides GUI to configure SRA
  – Acts upon alerts

Two versions
  – Periodic rejuvenation
  – Prediction-based rejuvenation


Summary



It is possible to enhance software availability during
    operation by exploiting environmental diversity


Multiple types of recovery after a software failure can be
 judiciously employed: restart, failover to a replica, reboot
 and, if all else fails, repair (patch)


Summary


Software aging is not anecdotal; it is a real-life scientific
  phenomenon


Rejuvenation implemented in several special purpose
  applications and many general purpose cluster systems




Key References
•   Software Rejuvenation: Analysis, Module and Applications, Y. Huang, C. Kintala, N. Kolettis
    and N. Fulton, In Proc. FTCS-25, June 1995.
•   A Methodology for Detection and Estimation of Software Aging, S. Garg, A. van Moorsel, K.
    Vaidyanathan and K. S. Trivedi. Proc. ISSRE 1998.
•   Performance and Reliability Evaluation of Passive Replication Schemes in Application Level
    Fault Tolerance, S. Garg, Y. Huang, C. M. R. Kintala, K. S. Trivedi, and S. Yajnik. In Proc. FTCS
    1999.
•   Statistical Non-Parametric Algorithms to Estimate the Optimal Software Rejuvenation
    Schedule, T. Dohi, K. Goseva-Popstojanova and K. S. Trivedi, Proc. PRDC 2000.
•   Proactive Management of Software Aging, V. Castelli, R. E. Harper, P. Heidelberger, S. W.
    Hunter, K. S. Trivedi, K. Vaidyanathan and W. Zeggert, IBM Journal of Research &
    Development, March 2001.
•   A Comprehensive Model for Software Rejuvenation, K. Vaidyanathan and K. S. Trivedi. IEEE-
    TDSC, April-June 2005.
•   Analysis of software aging in a web server, M. Grottke, L. Li, K. Vaidyanathan and K. S.
    Trivedi, IEEE Trans. Reliability, Sept. 2006.


Key References
•   Fighting Bugs: Remove, Retry, Replicate and Rejuvenate, M. Grottke and K. Trivedi, IEEE Computer,
    Feb. 2007.
•   Availability Modeling of SIP Protocol on IBM WebSphere, K. S. Trivedi, D. Wang, D. J. Hunt, A.
    Rindos, W. E. Smith, B. Vashaw, Proc. PRDC 2008.
•   Using Accelerated Life Tests to Estimate Time to Software Aging Failure, R. Matias Jr., K. S.
    Trivedi and P. Maciel, Proc. ISSRE 2010.
•   Accelerated Degradation Tests Applied to Software Aging Experiments, Rivalino Matias, Jr., K. S.
    Trivedi and Paulo J. F. Filho and Pedro A. Barbetta, IEEE Transactions on Reliability, March 2010.
•   An Empirical Investigation of Fault Types in Space Mission System Software, M.Grottke, A. P.
    Nikora and K. S. Trivedi, Proc. DSN, 2010.
•   Software fault mitigation and availability assurance techniques, K. S. Trivedi, M. Grottke, and E.
    Andrade. International Journal of System Assurance Engineering and Management, 2011.
•   Recovery from Failures due to Mandelbugs in IT Systems, K. Trivedi, R. Mansharamani, D.S. Kim,
    M. Grottke, M. Nambiar , Proc. PRDC 2011
•   O. Kyas. (2001). Network Troubleshooting, Palo Alto California, Agilent Technologies (book)
•   M. Kaaniche and K. Kanoun (1996). Reliability of a Commercial Telecommunications System, ISSRE
    1996
•   R. Cramp, M. A. Vouk, and W. Jones (1992). On Operational Availability of a Large Software-Based
    Telecommunications System, ISSRE 1992





 
What is wrong with the Internet? [On the foundations of internet security, fu...
What is wrong with the Internet? [On the foundations of internet security, fu...What is wrong with the Internet? [On the foundations of internet security, fu...
What is wrong with the Internet? [On the foundations of internet security, fu...Persistent Systems Ltd.
 
Life and Work of Ronald L. Rivest, Adi Shamir & Leonard M. Adleman | Turing10...
Life and Work of Ronald L. Rivest, Adi Shamir & Leonard M. Adleman | Turing10...Life and Work of Ronald L. Rivest, Adi Shamir & Leonard M. Adleman | Turing10...
Life and Work of Ronald L. Rivest, Adi Shamir & Leonard M. Adleman | Turing10...Persistent Systems Ltd.
 
Life and Work of Judea Perl | Turing100@Persistent
Life and Work of Judea Perl | Turing100@PersistentLife and Work of Judea Perl | Turing100@Persistent
Life and Work of Judea Perl | Turing100@PersistentPersistent Systems Ltd.
 
Early History of Fortran: The Making of a Wonder | Turing100@Persistent
Early History of Fortran: The Making of a Wonder | Turing100@PersistentEarly History of Fortran: The Making of a Wonder | Turing100@Persistent
Early History of Fortran: The Making of a Wonder | Turing100@PersistentPersistent Systems Ltd.
 
Life and Work of Dr. John Backus | Turing100@Persistent
Life and Work of Dr. John Backus | Turing100@PersistentLife and Work of Dr. John Backus | Turing100@Persistent
Life and Work of Dr. John Backus | Turing100@PersistentPersistent Systems Ltd.
 
Life and Work of Jim Gray | Turing100@Persistent
Life and Work of Jim Gray | Turing100@PersistentLife and Work of Jim Gray | Turing100@Persistent
Life and Work of Jim Gray | Turing100@PersistentPersistent Systems Ltd.
 
Systems Design Experiences or Just Some War Stories…
Systems Design Experiences or Just Some War Stories…Systems Design Experiences or Just Some War Stories…
Systems Design Experiences or Just Some War Stories…Persistent Systems Ltd.
 
Life & Work of Butler Lampson | Turing100@Persistent
Life & Work of Butler Lampson | Turing100@PersistentLife & Work of Butler Lampson | Turing100@Persistent
Life & Work of Butler Lampson | Turing100@PersistentPersistent Systems Ltd.
 
Life & Work of Robin Milner | Turing100@Persistent
Life & Work of Robin Milner | Turing100@PersistentLife & Work of Robin Milner | Turing100@Persistent
Life & Work of Robin Milner | Turing100@PersistentPersistent Systems Ltd.
 
Life & Work of Dr. Vinton Cerf and Dr. Robert Kahn | Turing100@Persistent
Life & Work of Dr. Vinton Cerf and Dr. Robert Kahn | Turing100@PersistentLife & Work of Dr. Vinton Cerf and Dr. Robert Kahn | Turing100@Persistent
Life & Work of Dr. Vinton Cerf and Dr. Robert Kahn | Turing100@PersistentPersistent Systems Ltd.
 
Net Neutrality | Turing100@Persistent Systems
Net Neutrality | Turing100@Persistent SystemsNet Neutrality | Turing100@Persistent Systems
Net Neutrality | Turing100@Persistent SystemsPersistent Systems Ltd.
 
Alan Turing Scientist Unlimited | Turing100@Persistent Systems
Alan Turing Scientist Unlimited | Turing100@Persistent SystemsAlan Turing Scientist Unlimited | Turing100@Persistent Systems
Alan Turing Scientist Unlimited | Turing100@Persistent SystemsPersistent Systems Ltd.
 
Life and work of E.F. (Ted) Codd | Turing100@Persistent
Life and work of E.F. (Ted) Codd | Turing100@PersistentLife and work of E.F. (Ted) Codd | Turing100@Persistent
Life and work of E.F. (Ted) Codd | Turing100@PersistentPersistent Systems Ltd.
 
Alan Turing Centenary @ Persistent Systems
Alan Turing Centenary @ Persistent SystemsAlan Turing Centenary @ Persistent Systems
Alan Turing Centenary @ Persistent SystemsPersistent Systems Ltd.
 

Más de Persistent Systems Ltd. (20)

Skilling for SMAC by Anand Deshpande, Founder, Chairman and Managing Director...
Skilling for SMAC by Anand Deshpande, Founder, Chairman and Managing Director...Skilling for SMAC by Anand Deshpande, Founder, Chairman and Managing Director...
Skilling for SMAC by Anand Deshpande, Founder, Chairman and Managing Director...
 
Embedded Linux Evolution | Turing Techtalk
Embedded Linux Evolution | Turing TechtalkEmbedded Linux Evolution | Turing Techtalk
Embedded Linux Evolution | Turing Techtalk
 
Life and Work of Ken Thompson and Dennis Ritchie | Turing Techtalk
Life and Work of Ken Thompson and Dennis Ritchie | Turing TechtalkLife and Work of Ken Thompson and Dennis Ritchie | Turing Techtalk
Life and Work of Ken Thompson and Dennis Ritchie | Turing Techtalk
 
Life and Work of Ivan Sutherland | Turing100@Persistent
Life and Work of Ivan Sutherland | Turing100@PersistentLife and Work of Ivan Sutherland | Turing100@Persistent
Life and Work of Ivan Sutherland | Turing100@Persistent
 
Evolution of the modern graphics architectures with a focus on GPUs | Turing1...
Evolution of the modern graphics architectures with a focus on GPUs | Turing1...Evolution of the modern graphics architectures with a focus on GPUs | Turing1...
Evolution of the modern graphics architectures with a focus on GPUs | Turing1...
 
What is wrong with the Internet? [On the foundations of internet security, fu...
What is wrong with the Internet? [On the foundations of internet security, fu...What is wrong with the Internet? [On the foundations of internet security, fu...
What is wrong with the Internet? [On the foundations of internet security, fu...
 
Life and Work of Ronald L. Rivest, Adi Shamir & Leonard M. Adleman | Turing10...
Life and Work of Ronald L. Rivest, Adi Shamir & Leonard M. Adleman | Turing10...Life and Work of Ronald L. Rivest, Adi Shamir & Leonard M. Adleman | Turing10...
Life and Work of Ronald L. Rivest, Adi Shamir & Leonard M. Adleman | Turing10...
 
Life and Work of Judea Perl | Turing100@Persistent
Life and Work of Judea Perl | Turing100@PersistentLife and Work of Judea Perl | Turing100@Persistent
Life and Work of Judea Perl | Turing100@Persistent
 
Early History of Fortran: The Making of a Wonder | Turing100@Persistent
Early History of Fortran: The Making of a Wonder | Turing100@PersistentEarly History of Fortran: The Making of a Wonder | Turing100@Persistent
Early History of Fortran: The Making of a Wonder | Turing100@Persistent
 
Life and Work of Dr. John Backus | Turing100@Persistent
Life and Work of Dr. John Backus | Turing100@PersistentLife and Work of Dr. John Backus | Turing100@Persistent
Life and Work of Dr. John Backus | Turing100@Persistent
 
Life and Work of Jim Gray | Turing100@Persistent
Life and Work of Jim Gray | Turing100@PersistentLife and Work of Jim Gray | Turing100@Persistent
Life and Work of Jim Gray | Turing100@Persistent
 
System Anecdotes | Turing100@Persistent
System Anecdotes | Turing100@PersistentSystem Anecdotes | Turing100@Persistent
System Anecdotes | Turing100@Persistent
 
Systems Design Experiences or Just Some War Stories…
Systems Design Experiences or Just Some War Stories…Systems Design Experiences or Just Some War Stories…
Systems Design Experiences or Just Some War Stories…
 
Life & Work of Butler Lampson | Turing100@Persistent
Life & Work of Butler Lampson | Turing100@PersistentLife & Work of Butler Lampson | Turing100@Persistent
Life & Work of Butler Lampson | Turing100@Persistent
 
Life & Work of Robin Milner | Turing100@Persistent
Life & Work of Robin Milner | Turing100@PersistentLife & Work of Robin Milner | Turing100@Persistent
Life & Work of Robin Milner | Turing100@Persistent
 
Life & Work of Dr. Vinton Cerf and Dr. Robert Kahn | Turing100@Persistent
Life & Work of Dr. Vinton Cerf and Dr. Robert Kahn | Turing100@PersistentLife & Work of Dr. Vinton Cerf and Dr. Robert Kahn | Turing100@Persistent
Life & Work of Dr. Vinton Cerf and Dr. Robert Kahn | Turing100@Persistent
 
Net Neutrality | Turing100@Persistent Systems
Net Neutrality | Turing100@Persistent SystemsNet Neutrality | Turing100@Persistent Systems
Net Neutrality | Turing100@Persistent Systems
 
Alan Turing Scientist Unlimited | Turing100@Persistent Systems
Alan Turing Scientist Unlimited | Turing100@Persistent SystemsAlan Turing Scientist Unlimited | Turing100@Persistent Systems
Alan Turing Scientist Unlimited | Turing100@Persistent Systems
 
Life and work of E.F. (Ted) Codd | Turing100@Persistent
Life and work of E.F. (Ted) Codd | Turing100@PersistentLife and work of E.F. (Ted) Codd | Turing100@Persistent
Life and work of E.F. (Ted) Codd | Turing100@Persistent
 
Alan Turing Centenary @ Persistent Systems
Alan Turing Centenary @ Persistent SystemsAlan Turing Centenary @ Persistent Systems
Alan Turing Centenary @ Persistent Systems
 

Último

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 

Último (20)

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 

Software Faults, Failures and Their Mitigations | Turing100@Persistent

  • 1. Persistent Systems January 5, 2013 Software Faults, Failures and Their Mitigations Kishor Trivedi Duke High Availability Assurance Lab (DHAAL) Dept. of Electrical & Computer Engineering Duke University Durham, NC 27708 kst@ee.duke.edu www.ee.duke.edu/~kst Copyright © 2013 by K.S. Trivedi
  • 2. Duke University: Research Triangle Park (RTP), North Carolina, USA [map showing Duke, UNC-CH, and NC State]
  • 3. Duke University: U.S. News & World Report, in its 2012 edition, ranked the university's undergraduate program 8th among all universities in the USA
  • 4. Duke University: NCAA Men's Basketball Champions 2010 (also 1991, 1992 & 2001)
  • 5. Trivedi's research triangle
      – Theory: stochastic modeling methods & numerical solution methods: large fault trees, stochastic Petri nets, large/stiff Markov & non-Markov models, fluid stochastic models, performability & Markov reward models, software aging and rejuvenation, security/survivability/resilience quantification, machine learning
      – Books: Blue, Red, White
      – Software packages: HARP (NASA), SAVE (IBM), IRAP (Boeing), SHARPE, SPNP, SREPT
      – Applications (reliability/availability/performance): avionics (Boeing), space (NASA/JPL), power systems (GE), automobile systems (GM), computer systems (EMC, SUN, HP, TCS), telco systems (AT&T, Lucent, Avaya), computer networks (Motorola), virtualized data center (NEC), cloud computing (IBM, NEC, Cisco), software aging & rejuvenation (Huawei)
  • 6. Textbooks
      – Probability and Statistics with Reliability, Queuing, and Computer Science Applications, second edition, John Wiley, 2001 (Bluebook) [first edition by Prentice-Hall, 1982]
      – Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the SHARPE Software Package, Kluwer (now Springer), 1996 (Redbook)
      – Queuing Networks and Markov Chains, second edition, John Wiley, 2006 (White book)
  • 7. DHAAL & Industry: A Success Story
      – Reliability prediction of the Boeing 787 Current Return Network for FAA certification
      – Security quantification (DARPA SITAR)/NSF
      – Survivability quantification for Lucent POTS and Siemens Smartgrid
      – Cloud performance, availability, and power with IBM and Cisco
      – Reliability/availability prediction of the SIP protocol on IBM WebSphere
      – Software aging and rejuvenation: state of the art, theory, measurements, and implementation (IBM x-series); Huawei
      – Cloud computing security (measurements, quantification, and implementation): NATO Science for Peace and Security
      – NASA-JPL failure data analytics
      – NEC collaboration for performability management (VM allocation, VM and VMM rejuvenation, etc.) in a virtualized data center
      – Data analytics (statistical and machine learning techniques used to interpret huge volumes of data): Wipro Technologies
      – Software reliability/availability/performability analysis (short courses, seminars, and consulting): Tata Consultancy Services
  • 8. Outline
      – Motivation
      – A Real System
      – Software Fault Classification
      – Environmental Diversity
      – Methods of Mitigation
      – Software Aging and Rejuvenation
      – Conclusions
  • 9. Pervasive Dependence on Computer Systems: Need for High Reliability/Availability (communication, health & medicine, avionics, banking, entertainment)
  • 10. Basic Definitions
      – Steady-state availability (Ass), or just availability, is the long-term probability that the system is available when requested: Ass = MTTF / (MTTF + MTTR)
      – MTTF is the system mean time to failure, a complex combination of component MTTFs
      – MTTR is the system mean time to recovery; it may consist of many phases
  • 11. Basic Definitions
      – Downtime in minutes per year: (un)availability is usually presented in terms of annual downtime
      – Downtime = 8760 × 60 × (1 − Ass) minutes
      – 5 NINES (Ass = 0.99999) → 5.26 minutes annual downtime
  • 12. Number of Nines: Reality Check
      – 49% of Fortune 500 companies experience at least 1.6 hours of downtime per week
      – Approx. 80 hours/year = 4800 minutes/year
      – Ass = (8760 − 80)/8760 = 0.9908
      – That is, between 2 NINES and 3 NINES!
      – This study counts planned and unplanned downtime together
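The arithmetic on slides 10 to 12 can be checked with a few lines of Python. This is a minimal sketch, not part of the original deck: the function names are illustrative assumptions, and the 8760 hours per year figure is the one used on the slides.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days x 24 hours, as used on the slides


def steady_state_availability(mttf, mttr):
    """Ass = MTTF / (MTTF + MTTR); both arguments in the same time unit."""
    return mttf / (mttf + mttr)


def annual_downtime_minutes(a_ss):
    """Downtime = 8760 x 60 x (1 - Ass) minutes per year."""
    return HOURS_PER_YEAR * 60 * (1.0 - a_ss)


def availability_from_downtime_hours(downtime_hours_per_year):
    """Invert the downtime formula, starting from hours of downtime per year."""
    return (HOURS_PER_YEAR - downtime_hours_per_year) / HOURS_PER_YEAR


def number_of_nines(a_ss):
    """Express availability as a (possibly fractional) number of nines."""
    return -math.log10(1.0 - a_ss)


if __name__ == "__main__":
    print(annual_downtime_minutes(0.99999))       # five nines
    a = availability_from_downtime_hours(80)      # the Fortune 500 figure
    print(a, number_of_nines(a))
```

With 80 hours of downtime per year, `number_of_nines` lands between 2 and 3, which matches the "between 2 NINES and 3 NINES" remark on slide 12.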
  • 13. Achieving High Availability is a Challenge
      – Black Sept. 2011, in the same week: Microsoft Cloud service outage (2.5 hours); Google Docs service outage (1 hour), a memory leak due to a software update
      – Sept. 2012: GoDaddy (4 hours), 5 million websites affected
      – Oct. 2012: Amazon: 10/15/2012 Webservices, 6 hours (memory leak); 10/27/2012 EC2, > 2 hours
  • 14. Downtime Costs per Hour
      – Brokerage operations: $6,450,000
      – Credit card authorization: $2,600,000
      – eBay (1 outage, 22 hours): $225,000
      – Amazon.com: $180,000
      – Package shipping services: $150,000
      – Home shopping channel: $113,000
      – Catalog sales center: $90,000
      – Airline reservation center: $89,000
      – Cellular service activation: $41,000
      – On-line network fees: $25,000
      – ATM service fees: $14,000
      Sources: InternetWeek 4/3/2000; Fibre Channel: A Comprehensive Introduction, R. Kembel 2000, p. 8, "...based on a survey done by Contingency Planning Research."
  • 15. High Reliability/Availability
      – Hardware fault tolerance, fault management, and reliability/availability modeling/assurance are relatively well developed
      – System outages are more often due to software faults
      – Key challenge: software reliability is one of the weakest links in system reliability/availability
  • 16. Software is the problem: Jim Gray's paper titled "Why do computers stop and what can be done about it?" started to point out this trend in 1985, followed by his paper "A census of Tandem system availability between 1985 and 1990". [Figure: 1985 vs. 2005, across different industries]
  • 17. Increasing SW Failure Rate? Planetary Missions Flight Software (A. Nikora of JPL)
      – The interval between the first and last launch: 8.76 years
      – The interval between successive launches ranges from 23 to 790 days
      – Similar results for ground software
      [Figure: x-axis "Mission Name (in launch order)": Mars Global Surveyor, Mars Pathfinder, CASSINI, Mars Climate Orbiter, Mars Polar Lander, Stardust, Mars Odyssey, Genesis, Mars Exploration Rover, Deep Impact, Mars Reconnaissance Orbiter]
  • 18. High Reliability/Availability: Software is the problem
      – Fault avoidance through good software engineering practices is difficult for large/complex software systems
      – It is impossible to fully test and verify that software is fault-free: "Testing shows the presence, not the absence, of bugs" (E. W. Dijkstra)
      – Yet there are stringent requirements for failure-free operation
  • 19. High Reliability/Availability: Software is the problem (2): software fault tolerance is a potential solution to improve software reliability, in lieu of virtually impossible fault-free software
  • 20. Software Fault Tolerance: Classical Techniques
      – Design diversity: N-version programming, recovery block
      – Expensive → not used much in practice!
      – Yet there are stringent requirements for failure-free operation
      – Challenge: affordable software fault tolerance
  • 21. Outline
      – Motivation
      – A Real System
      – Software Fault Classification
      – Environmental Diversity
      – Methods of Mitigation
      – Software Aging and Rejuvenation
      – Conclusions
  • 22. High availability SIP Application Server Configuration on IBM WebSphere (PRDC 2008 and ISSRE 2010 papers)
  • 23. High availability SIP Application Server configuration on WebSphere
      – Hardware configuration: two BladeCenter chassis; 4 blades (nodes) on each chassis (1 chassis sufficient for performance)
      – Software configuration: 2 copies of SIP/Proxy servers (1 sufficient for performance); 12 copies of WAS (6 sufficient for performance); each WAS instance forms a redundancy pair (replication domain) with a WAS installed on another node on a different chassis
      – The system has hardware redundancy and software redundancy
  • 24. High availability SIP Application Server configuration on WebSphere: Software Fault Tolerance
      – Identical copies of the SIP proxy used as backups (hot spares)
      – Identical copies of WebSphere Application Server (WAS) used as backups (hot spares)
      – Type of software redundancy: not design diversity, but replication of identical software copies
      – Normal recovery: restart software, reboot node, or fail over to a software replica; only when all else fails is a "software repair" invoked
  • 25. Escalated levels of Recovery: a real example (Avaya Servers and Media Gateways)
      The flowchart depicts the actions taken for recovery after a failure is detected. Try the simplest recovery method first, then a more complex one, etc.
      – Single Process Restart (SPR); if 3 SPR within 60 seconds, escalate
      – System Warm Restart (SWR); if 3 SWR within 15 min, escalate
      – System Cold Restart (SCR); if 3 SCR within 15 min, escalate
      – Avaya Communication Manager Software reloads (ACMSR); if 3 ACMSR within 15 min, escalate
      – OS Reboot
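The escalation policy on slide 25 can be sketched as a small state machine. This is an illustrative sketch, not Avaya's implementation: the class name, the `on_failure` method, and two behaviours are assumptions, namely that the escalated action is chosen on the next failure after the limit is reached, and that performing a higher-level action clears the history of the lower levels. The thresholds (3 occurrences within 60 seconds or 15 minutes) come from the flowchart.

```python
from collections import deque

# (action name, escalation count, window in seconds), simplest action first.
# Thresholds follow the slide; OS_REBOOT is the last resort and never escalates.
LEVELS = [
    ("SPR", 3, 60),
    ("SWR", 3, 15 * 60),
    ("SCR", 3, 15 * 60),
    ("ACMSR", 3, 15 * 60),
    ("OS_REBOOT", None, None),
]


class EscalatingRecovery:
    """Pick the simplest recovery action that has not been over-used recently."""

    def __init__(self, levels=LEVELS):
        self.levels = levels
        self.history = [deque() for _ in levels]  # timestamps per level

    def on_failure(self, now):
        """Return the recovery action to take for a failure detected at time `now`."""
        for i, (name, limit, window) in enumerate(self.levels):
            recent = self.history[i]
            if limit is not None:
                # Forget actions that fall outside this level's time window.
                while recent and now - recent[0] > window:
                    recent.popleft()
                if len(recent) >= limit:
                    continue  # this level was tried too often: escalate
            recent.append(now)
            # Assumption: a higher-level restart resets the lower-level counters.
            for lower in self.history[:i]:
                lower.clear()
            return name


if __name__ == "__main__":
    policy = EscalatingRecovery()
    print([policy.on_failure(t) for t in (0, 10, 20, 30)])
```

In this sketch, three failures inside one minute are each handled by a single process restart, and the fourth triggers a system warm restart; failures spaced further apart than the window never escalate.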
  • 26. Software Fault Tolerance: New Thinking Retry, restart, reboot! – Known to help in dealing with hardware transients – Do they help in dealing with failures caused by software bugs? – If yes, why? Copyright © 2013 by K.S. Trivedi
  • 27. A Cartoon Why is this true………at least for computers? Copyright © 2013 by K.S. Trivedi
  • 28. Software Fault Tolerance: New Thinking Failover to an identical software replica (that is not a diverse version) – Does it help? – If yes, why? Twenty years ago this would be considered crazy! Copyright © 2013 by K.S. Trivedi
  • 29. Outline • Motivation • A Real System • Software Fault Classification – Fighting Bugs: Remove, Retry, Replicate and Rejuvenate, M. Grottke and K. Trivedi, IEEE Computer Magazine, Feb. 2007 • Environmental Diversity • Methods of Mitigation • Software Aging and Rejuvenation • Conclusions Copyright © 2013 by K.S. Trivedi
  • 30. Software Faults main threats to high reliability, availability & safety Copyright © 2012 byby K.S. Trivedi Copyright © 2013 K.S. Trivedi
  • 31. IFIP Working Group 10.4 (Laprie) • Failure occurs when the delivered service no longer complies with the desired output. • Error is that part of the system state which is liable to lead to subsequent failure. • Fault is adjudged or hypothesized cause of an error. Faults are the cause of errors that may lead to failures Fault Error Failure 31 Copyright © 2013 by K.S. Trivedi
  • 32. Need to Classify bug types • We submit that a software fault tolerance approach based on retry, restart, reboot or fail- over to an identical software replica (not a diverse version) work because of a significant number of software failures are caused by Mandelbugs as opposed to the traditional software bugs now called Bohrbugs Copyright © 2013 by K.S. Trivedi
  • 33. Need to Classify bug types • In recent years, researchers have reported the phenomenon of “software aging” (i.e., degraded performance and/or increased failure rate of long-running software systems). • Puzzle: How can performance and failure rate change if the software code is not modified?! ⇒ Study software fault types and their relationships Copyright © 2013 by K.S. Trivedi
  • 34. Jim Gray’s Definitions • The terms “Bohrbug” and “Heisenbug” were first used in print by Jim Gray in 1985. • “Bohrbugs, like the Bohr atom, are solid, easily detected by standard techniques, and hence boring.” • “Most production software faults are soft. If the program state is reinitialized and the failed operation is retried, the operation will not fail a second time. … The assertion that most production software bugs are soft – Heisenbugs that go away when you look at them – is well known to systems programmers.” (Gray, 1985)
  • 35. Bruce Lindsay’s Definition • Based on Gray’s paper, researchers have often equated Heisenbugs with soft faults. • However, when Bruce Lindsay originally coined the term in the 1960s (while working with Jim Gray), he had a narrower definition in mind. • “Heisenbugs as originally defined … are bugs in which clearly the system behavior is incorrect, and when you try to look to see why it’s incorrect, the problem goes away.” (Lindsay, 2004) • The term alludes to the physicist Werner Heisenberg and his Uncertainty Principle.
  • 36. Heisenbug – Our Definition • Heisenbug := A fault that stops causing a failure or that manifests differently when one attempts to probe or isolate it. • How can probing affect the bug? 1. Some debuggers initialize unused memory to default values, thus preventing failures due to improper initialization. 2. Trying to investigate a failure can influence process scheduling in such a way that a scheduling-related failure does not occur again.
  • 37. A Classification of Software Faults • Bohrbug := A fault that is easily isolated and that manifests consistently under a well-defined set of conditions, because its activation and error propagation lack complexity. • Example: A bug causing a failure whenever the user enters a negative date of birth. • Since they are easily found, Bohrbugs tend to be detected and fixed during the software testing phase. • The term alludes to the physicist Niels Bohr and his rather simple atomic model.
  • 38. Mandelbug – Definition • Mandelbug := A fault whose activation and/or error propagation are complex. Typically, a Mandelbug is difficult to isolate, and/or the failures caused by it are not systematically reproducible. • Example: A bug whose activation is scheduling-dependent. • The residual faults in a thoroughly tested piece of software are mainly Mandelbugs. • The term alludes to the mathematician Benoît Mandelbrot and his research in fractal geometry.
  • 39. Mandelbug: “Complexity” (1) • The explanation of the possible sources of complexity is based on the “chain of threats” linking faults with errors and failures: • First source of complexity: Time lag between fault activation and failure occurrence, e.g., because several different error states have to be traversed in the error propagation. • Example: The result of an erroneous calculation may at first be kept in the system memory and cause a failure only later, when it is being accessed and used.
  • 40. Mandelbug: “Complexity” (2) • Second source of complexity: Fault activation and/or error propagation depend on interactions between conditions occurring inside the application and conditions that accrue within the system-internal environment of the application. • Example: A fault causing failures due to side-effects of other applications
  • 41. Mandelbugs: Consequences • Mandelbugs are difficult to detect and remove during the software testing phase. • An operation that failed due to a Mandelbug may execute correctly upon retry even if the fault has not been removed; changing the environment may suffice. • Potential recovery techniques: – “Microreboot” of individual components – Application restart – System reboot – Failover to a standby component (replicate) – Manual recovery
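The escalation idea on this slide (retry first, then progressively heavier recovery actions) can be sketched as follows. This is a minimal illustration, not the procedure of any system named in the deck: the function `run_with_escalation`, the toy `env` dictionary and the stale-lock failure are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List, Tuple

def run_with_escalation(operation: Callable[[], str],
                        recovery_levels: List[Tuple[str, Callable[[], None]]]
                        ) -> Tuple[str, str]:
    """Run `operation`; on failure apply recovery actions in order, retrying after each.

    Returns (result, name of the level that made the operation succeed).
    """
    try:
        return operation(), "no recovery needed"
    except Exception:
        pass  # fall through to the escalation ladder
    for name, recover in recovery_levels:
        recover()              # change the environment (may be a no-op for plain retry)
        try:
            return operation(), name
        except Exception:
            continue           # this level did not help; escalate further
    raise RuntimeError("all recovery levels exhausted; manual recovery required")

# Hypothetical environment: a stale lock left behind by some other activity.
env: Dict[str, bool] = {"stale_lock": True}

def operation() -> str:
    # The fault is still in the code; it is only activated while the lock exists.
    if env.get("stale_lock"):
        raise RuntimeError("resource busy")
    return "ok"

levels = [
    ("retry", lambda: None),                                       # same environment
    ("application restart", lambda: env.update(stale_lock=False)),  # restart drops the lock
    ("system reboot", lambda: env.clear()),                         # heaviest action
]

result, level = run_with_escalation(operation, levels)
print(result, "after", level)
```

In this toy run the plain retry fails because nothing in the environment changed, while the restart succeeds without the fault being removed, which is the point the slide makes about Mandelbugs.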
  • 42. Examples of Types of Bugs in IT System • Mandelbugs in IT Systems: Trivedi, Mansharamani, Kim, Grottke, and Nambiar. “Recovery from failures due to Mandelbugs in IT systems”. PRDC 2011. • The projects ranged across a number of business systems in the banking, financial, government, IT, pharmacy, and telecom sectors.
  • 43. Examples of Types of Bugs in IT System (cont.) • Example of a Mandelbug in a large telecom system – Slow response times of the front end screens. – The problem was hard to analyze since the screens would freeze at random points in the day. • As the days went by, the frequency of these freezes kept increasing. • Class of Mandelbug encountered – The users would wait for some time and their operations would resume. – The IT operations team rebooted the servers and the operations could resume for a few hours.
  • 44. Examples of Types of Bugs in IT System (cont.) • Cause of the problem – Whenever a front end screen was invoked, a temporary file was created at the centralized server. • This file was never cleaned up, even after the screen was closed. – As a result, tens of thousands of small files kept accumulating on the disk, causing sluggish behavior. • Solution of the problem – A cleanup utility was written to move these files periodically to another file system and later delete them.
  • 45. Examples of Types of Bugs in IT System (cont.) • Example of a Mandelbug in a government tax information system – All organizations should submit the income tax deducted at source (TDS) records for all of their employees. – Sporadically, when a large corporation uploaded its file, the application server would crash. • Class of Mandelbug encountered – The IT operations staff would then increase the JVM heap size and restart the JVM, which would allow the file to be uploaded without any problem.
  • 46. Examples of Types of Bugs in IT System (cont.) • Cause of the problem – The probability of a failure occurrence increased after each JVM restart, as the heap got consumed more and more. • Solution of the problem – Reconfigure system parameters to resume operations successfully.
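The cleanup utility described on this slide is not shown in the deck. The sketch below only illustrates the first half of the stated approach (periodically moving stale temporary files to another location, from which they can be deleted later). The function name `archive_stale_files`, the directory names and the one-hour age threshold are assumptions made for the example.

```python
import os
import shutil
import tempfile
import time
from pathlib import Path
from typing import List

def archive_stale_files(src_dir: Path, archive_dir: Path,
                        max_age_seconds: float, now: float) -> List[Path]:
    """Move regular files older than `max_age_seconds` from src_dir to archive_dir."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    # Snapshot the listing first so that moving files does not disturb iteration.
    for path in list(src_dir.iterdir()):
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            target = archive_dir / path.name
            shutil.move(str(path), str(target))
            moved.append(target)
    return moved

# Small self-contained demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    screens = Path(root) / "screens"    # hypothetical temp-file directory
    archive = Path(root) / "archive"    # hypothetical second location
    screens.mkdir()
    old_file = screens / "old.tmp"
    new_file = screens / "new.tmp"
    old_file.write_text("left behind by a closed screen")
    new_file.write_text("still in use")
    now = time.time()
    os.utime(old_file, (now - 7200, now - 7200))   # pretend it is two hours old
    moved_names = [p.name for p in archive_stale_files(screens, archive, 3600, now)]
    remaining_names = sorted(p.name for p in screens.iterdir())

print("moved:", moved_names, "remaining:", remaining_names)
```

A real deployment would run such a job on a schedule and add a second pass that deletes archived files after a retention period; both details are left out here.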
  • 47. Aging-related Bug – Definition • Aging-related bug := A fault that leads to the accumulation of errors either inside the running application or in its system-context environment, resulting in an increased failure rate and/or degraded performance. • Example: A bug causing memory leaks in the application. • Note that the aging phenomenon requires a delay between fault activation and failure occurrence. • Note also that the software appears to age due to such a bug; there is no physical deterioration.
  • 48. Relationships • Bohrbug and Mandelbug are complementary antonyms. • Aging-related bugs are a subtype of Mandelbugs. [Figure: diagram of Bohrbugs, Mandelbugs, and aging-related bugs as a subset of Mandelbugs]
  • 49. Important Questions about these Bugs • What fraction of bugs are Bohrbugs, Mandelbugs and aging-related bugs? – How do these fractions vary • over time • over projects, languages, application types, …? – Need measurements – Current NASA/JPL project with Allen Nikora & Michael Grottke; preliminary results from one NASA software project: • 52% Bohrbugs • 35% Mandelbugs (non-aging-related) • 4% Aging-related bugs • 7% Operator related • 2% Unclassified – Very similar results for Linux, MySQL, Apache AXIS, httpd • What are the methods of mitigation for the different fault types?
  • 50. Trends in SW Fault Type Proportions: Planetary Missions Flight Software • Fault type proportions vs. runtime for four earlier missions (of 8 missions analyzed) • Result: The proportion of Bohrbugs seems to settle at around the same value. Such a convergence to similar values is less obvious for the other fault types.
  • 51. Outline • Motivation • A Real System • Software Fault Classification • Environmental Diversity • Methods of Mitigation • Software Aging and Rejuvenation • Conclusions Copyright © 2013 by K.S. Trivedi
  • 52. Environmental diversity: a new way of thinking about software faults and failures
  • 53. Software Fault Tolerance: New Thinking. New thinking: Environmental Diversity as opposed to Design Diversity. Our claim is that this works because failures due to Mandelbugs are not negligible; this gives us an affordable software fault tolerance technique that we call Environmental Diversity.
  • 54. What is environmental diversity? • The underlying idea of environmental diversity – Retry a previously failed operation and it works. – Why? Because the environment in which the operation was executed has changed enough to avoid the fault activation. • The environment is understood as – OS resources, other applications running concurrently and sharing the same resources, interleaving of operations, concurrency, or synchronization.
  • 55. What is environmental diversity? • The execution of an application depends on the environment. [Figure: applications (Ap1 to Ap6) running over the operating system and hardware, showing the environment at time t1 and, after restarting the application, the environment at time t1+n]
  • 56. Environmental Diversity • Restarting an application, rebooting a node, or failing over to an identical standby replica works because of the environmental diversity underlying these actions. – By environment here we mean the resources of the OS, other applications running concurrently and sharing system resources, interleaving of operations, concurrency, synchronization, etc. • Environmental diversity uses time redundancy instead of expensive design diversity • [Adams] Restart • [Jalote et al.] Rollback, rollforward • [Patriot] Occasional reboot, “switch off and on” • [Avaya Swift] Restart process; failover to a replica • [IBM SIP] Escalated levels: restart, reboot, failover… • [IBM Director-X-series] Rejuvenation
  • 57. Outline • Motivation • A Real System • Software Fault Classification • Environmental Diversity • Methods of Mitigation • Software Aging and Rejuvenation • Conclusions Copyright © 2013 by K.S. Trivedi
  • 58. Methods of Mitigation
  • 59. Mitigation Copyright © 2013 by K.S. Trivedi
  • 60. Bohrbugs: Remove • Find and fix the bugs during testing • Failure data collected during testing • Calibrate a software reliability growth model (SRGM) using failure data; this model is then used for prediction • Many SRGMs exist (JM, NHPP, HGRGM, etc.) • Books by Lyu, Musa, Cai • Gokhale & Trivedi, A Time/Structure Based Software Reliability Model, Annals of Software Engineering, 1999 • Measurements • Empirical (statistical) models
  • 61. Mitigation Copyright © 2013 by K.S. Trivedi
  • 62. OS Availability Model (IBM BladeCenter) [Figure: availability model with two recovery paths: fix (failure due to a Bohrbug) and reboot (failure due to a Mandelbug)]
  • 63. Availability model of a Proxy or a WAS (IBM SIP on WebSphere): application server and proxy server • Failure detection – By WLM – By Node Agent – Manual detection • Recovery – Node Agent • Auto process restart – Manual recovery • Process restart • Node reboot • Repair
  • 64. Outline • Motivation • A Real System • Software Fault Classification • Environmental Diversity • Methods of Mitigation • Software Aging and Rejuvenation • Conclusions Copyright © 2013 by K.S. Trivedi
  • 65. Aging Related Bugs: Replicate, Restart, Reboot, Rejuvenate
  • 66. Software Aging. Aging phenomenon: error conditions accumulating over time, leading to performance degradation and system failure. Main causes of software aging: memory leaks, fragmentation, unterminated threads, data corruption, round-off errors, unreleased file locks, etc. Observed systems: OS, middleware, Netscape, Internet Explorer, etc.
  • 67. Software Aging - Definition “Software Aging” phenomenon Long-running software tends to show an increasing failure rate. Not related to application program becoming obsolete due to changing requirements/maintenance. Software appears to age; no real deterioration Copyright © 2013 by K.S. Trivedi
  • 68. Software Aging - Examples • Cisco Catalyst Switch [Matias Jr.] • File system aging [Smith & Seltzer] • Gradual service degradation in the AT&T transaction processing system [Avritzer et al.] • Error accumulation in Patriot missile system’s software [Marshall] • Resources exhaustion in Apache [Li et al., Grottke et al.] • Physical memory degradation in a SOAP-based Server [Silva et al.] • Software aging in Linux [Cotroneo et al.] • Crash/hang failures in general purpose applications after a long runtime Copyright © 2013 by K.S. Trivedi
  • 69. Measurements Showing Resource Exhaustion or Depletion [Plots: Real Memory Free; File Table Size] A Methodology for Detection and Estimation of Software Aging, S. Garg, A. van Moorsel, K. Vaidyanathan and K. Trivedi. Proc. of IEEE Intl. Symp. on Software Reliability Engineering, Nov. 1998.
  • 70. Software Fault Types & Their Mitigation Copyright © 2013 by K.S. Trivedi
  • 71. Software rejuvenation. Software rejuvenation is a cost-effective solution for improving software reliability by avoiding or postponing unanticipated software failures and crashes. It allows proactive recovery to be carried out either automatically or at the discretion of the user/administrator. It is a rejuvenation of the environment, not of the software.
  • 72. Software Rejuvenation Counteracts the software aging phenomenon Frees up OS resources; Removes error accumulation Common techniques for cleaning Garbage collection, defragmentation, flushing kernel and file server tables etc. Challenge: Rejuvenation scheduling/granularity Copyright © 2013 by K.S. Trivedi
  • 73. SW Rejuvenation: The Genesis. “Software Rejuvenation: Analysis, Module and Applications”, Y. Huang, C. Kintala, N. Kolettis, N. Fulton, in FTCS 1995. An insight into operational software that no one had had before (at least, formally). It changed • How practitioners looked at making software more dependable • Windfall of performance and dependability modeling problems for academicians • Ideas to build better, real-world systems as the Internet evolved • Led to recognition of the “software aging” phenomenon • Brought about PhDs, tenure, publications, patents, awards, tools, systems, and funding for many people around the world.
  • 74. Software Rejuvenation Examples • AT&T billing applications [Huang et al.] • Patriot missile system software: switch off/on every 8 hours [Marshall] • On-board preventive maintenance for long-life deep space missions (NASA’s X2000 Advanced Flight Systems Program) [Tai et al.] • IBM Director Software Rejuvenation (x-series) [IBM & Duke Researchers] • Microsoft IIS 5.0 process recycling tool • Process restart in Apache [Li et al.] • ISS FS SSC (ISS File system): switch off and on every 2 months [NASA ISS reports] • For more examples: “Software rejuvenation - Do IT & Telco industries use it?”, Javier Alonso, Antonio Bovenzi, Jinghui Li, Yakun Wang, Stefano Russo, and Kishor Trivedi. The 4th International Workshop on Software Aging and Rejuvenation (WoSAR 2012), held in conjunction with the 23rd annual International Symposium on Software Reliability Engineering (ISSRE 2012), Dallas, USA, 2012.
  • 75. Software Rejuvenation –Trade-off • Advantages – Reduces costs of sudden aging-related failures – Can be applied at the discretion of the user/administrator • Disadvantages – Direct costs of carrying out rejuvenation – Opportunity costs of rejuvenation (downtime, decreased performance, lost transactions etc) Important research issue: Find optimal times to perform rejuvenation! Copyright © 2013 by K.S. Trivedi
  • 76. Software rejuvenation - Approaches Two approaches based on WHEN: Time-Based rejuvenation approaches Rejuvenation applied regularly and at predetermined time intervals. Widely used in real environments – Web servers (Apache) – ISS two-months reboot – Telecommunication systems Copyright © 2013 by K.S. Trivedi
  • 77. Software rejuvenation - Approaches Two approaches based on WHEN: Measurement (Inspection)-based rejuvenation approaches • Threshold based or predictive • System metrics continually monitored • Rejuvenation triggered when the crash is imminent based on the observation/prediction • Reduce potentially useless rejuvenation actions and downtime in the process [Silva et al.] Copyright © 2013 by K.S. Trivedi
  • 78. Software rejuvenation Copyright © 2013 by K.S. Trivedi
  • 79. Software Rejuvenation – Approaches Two approaches based on HOW: Use analytical model to optimize rejuvenation schedule • Lucent Bell Labs [Huang et al., ‘95] • Duke [IEEE-TC’98, SIGMETRICS’96, ISSRE’95, PRDC’00, SIGMETRICS’01, Comp J.’01, SRDS’02, DSN’02, ISSRE’02, DSN’03, IEEE-TR’05] • Others [IPDS’98, PNPM’99] Copyright © 2013 by K.S. Trivedi
  • 80. Software Rejuvenation – Approaches Two approaches based on HOW: Use measurements of resource degradation to determine/predict optimal rejuvenation schedule • Duke [ISSRE’98, ISSRE’99, IBMJRD’01, ISESE’02, IEEE-TPDS’05] • Duke (formerly UPC) [GRID’07, IEEE-TC’09, DSN’10] Copyright © 2013 by K.S. Trivedi
  • 81. Failure rate. Preventive maintenance is useful only if the failure rate is increasing. If the time-to-failure distribution is exponential, then the failure rate is constant. Need to assume (and establish) that the time to failure (TTF) has an increasing failure rate (IFR).
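The claim that an exponential time to failure gives a constant failure rate follows directly from the standard definition of the failure (hazard) rate. The symbols $F$, $f$, $h$ and the generic rate $\lambda$ below are notation introduced for this worked step, not taken from the slide.

```latex
% Failure rate (hazard rate) of a time-to-failure distribution with CDF F and density f
h(t) = \frac{f(t)}{1 - F(t)}

% Exponential case: F(t) = 1 - e^{-\lambda t}, \quad f(t) = \lambda e^{-\lambda t}
h(t) = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda
```

Since $h(t)$ does not grow with $t$ in the exponential case, a component that has survived to time $t$ is statistically as good as new, and preventive maintenance cannot improve on that. Only when $h(t)$ is increasing (IFR) does restoring the system to an earlier age reduce the chance of failure.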
  • 82. Analytic Models Single node models – CTMC model – SMP model Cluster systems – IBM Cluster model (Time-based, condition-based) Copyright © 2013 by K.S. Trivedi
  • 83. Analytic Models: Software Aging and Rejuvenation. A simple and useful model of increasing failure rate: Robust state → Failure-probable state → Failed state. Time to failure: hypo-exponential distribution. Increasing failure rate → aging
  • 84. Analytic Models: CTMC model [Huang95] [Figure: two state diagrams. Model without rejuvenation: robust state S0, failure-probable state Sp, failed state Sf. Model with rejuvenation: the same states plus rejuvenation state Sr. Transition labels: r1, r2, r3, r4, λ] From this continuous-time Markov chain model, a closed-form expression can be found for the optimal rejuvenation trigger rate (r4).
  • 85. Analytic Models: Semi-Markov model [Dohi00] Relax the assumption of exponentially distributed sojourn times (time-independent transition rates); hence a semi-Markov model. [Figure: state diagram with states 0 to 3 and transitions labeled state change, system failure, rejuvenation trigger, completion of repair, completion of rejuvenation] A closed-form expression can be found for the optimal (deterministic) time to rejuvenation trigger.
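To see why the three-state model on this slide represents aging, one can evaluate the failure rate of a two-stage hypo-exponential time to failure numerically. The sketch below uses the standard closed forms for the density and survival function of two exponential stages with distinct rates. The function name and the sample rates (1/240 and 1/720 per hour, borrowed from the mean times listed on slide 92) are illustrative assumptions only.

```python
import math

def hypoexp_hazard(t: float, rate_to_probable: float, rate_to_failed: float) -> float:
    """Failure rate h(t) = f(t) / R(t) for two exponential stages in series.

    Stage 1: robust -> failure-probable, stage 2: failure-probable -> failed.
    The closed forms below require the two rates to be different.
    """
    a, b = rate_to_probable, rate_to_failed
    if a == b:
        raise ValueError("rates must differ for this closed form")
    density = a * b / (b - a) * (math.exp(-a * t) - math.exp(-b * t))
    survival = (b * math.exp(-a * t) - a * math.exp(-b * t)) / (b - a)
    return density / survival

# Illustrative rates per hour (assumed, see lead-in).
a, b = 1 / 240, 1 / 720
times = [0, 100, 500, 2000, 10000]
hazards = [hypoexp_hazard(t, a, b) for t in times]
for t, h in zip(times, hazards):
    print(f"t = {t:>6} h   failure rate = {h:.6f} per hour")
```

The failure rate starts at zero (a fresh system must first reach the failure-probable state) and rises toward the smaller of the two stage rates, so the time to failure is IFR even though each individual stage is memoryless.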
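A steady-state evaluation of the four-state model with rejuvenation can be sketched as below. The flattened diagram does not show unambiguously which rate labels which arc, so the mapping used here is an assumption: S0 → Sp at r2, Sp → Sf at λ, Sf → S0 at r1 (repair), Sp → Sr at r4 (rejuvenation trigger), Sr → S0 at r3 (rejuvenation completes). The numeric rates and hourly costs are borrowed from the cluster parameters on slide 92 purely for illustration.

```python
from typing import Dict

def rejuvenation_steady_state(r1: float, r2: float, r3: float,
                              r4: float, lam: float) -> Dict[str, float]:
    """Steady-state probabilities of the 4-state CTMC under the assumed arc mapping.

    Balance equations: pi_0*r2 = pi_p*(lam + r4), pi_f*r1 = pi_p*lam, pi_r*r3 = pi_p*r4.
    """
    ratio_p = r2 / (lam + r4)      # pi_p / pi_0
    ratio_f = ratio_p * lam / r1   # pi_f / pi_0
    ratio_r = ratio_p * r4 / r3    # pi_r / pi_0
    pi_0 = 1.0 / (1.0 + ratio_p + ratio_f + ratio_r)
    return {"S0": pi_0, "Sp": pi_0 * ratio_p, "Sf": pi_0 * ratio_f, "Sr": pi_0 * ratio_r}

def cost_rate(pi: Dict[str, float], cost_fail: float, cost_rejuv: float) -> float:
    """Expected downtime cost per hour: time in Sf and Sr weighted by hourly cost."""
    return pi["Sf"] * cost_fail + pi["Sr"] * cost_rejuv

# Illustrative rates per hour (assumed): 240 h to failure-probable, 720 h to failure,
# 30 min repair, 10 min rejuvenation.
r2, lam, r1, r3 = 1 / 240, 1 / 720, 2.0, 6.0

without = rejuvenation_steady_state(r1, r2, r3, r4=0.0, lam=lam)
with_rejuv = rejuvenation_steady_state(r1, r2, r3, r4=1 / 24, lam=lam)

cost_without = cost_rate(without, cost_fail=5000, cost_rejuv=250)
cost_with = cost_rate(with_rejuv, cost_fail=5000, cost_rejuv=250)
print(f"cost/hour without rejuvenation: {cost_without:.3f}")
print(f"cost/hour with r4 = 1/24:       {cost_with:.3f}")
```

Setting r4 to zero recovers the model without rejuvenation. Sweeping r4 over a range and comparing `cost_rate` values is a simple numerical stand-in for the closed-form optimization the slide refers to.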
  • 86. Rejuvenation in Cluster Systems Cluster System [Pfister] Collection of independent, self-contained computer systems working together to provide a more reliable and powerful system than a single node by itself Easier scaling to larger systems, high levels of availability/performance and low management costs Copyright © 2013 by K.S. Trivedi
  • 87. Rejuvenation for Cluster Systems: Motivation • Rejuvenation using the fail-over mechanisms • Long-term benefits in terms of availability/performance • Continuous operation (possibly at a degraded level) • Practically zero downtime
  • 88. Rejuvenation for Cluster Systems: Motivation • Less disruptive and lower overhead than unplanned outages • Transparent to user/application • Most current industry initiatives are reactive • Two approaches: simple time-based (periodic); condition-based (only from the “failure-impending” state)
  • 89. Rejuvenation for Cluster Systems SRN Models Rejuvenation using the fail-over mechanisms in a rolling fashion Modeling using SRNs (Stochastic Reward Nets) Analysis for 2 rejuvenation policies Simple time-based policy • All nodes rejuvenated successively at the end of each rejuvenation interval Condition-based policy • Nodes rejuvenated only from the “failure-probable” state Copyright © 2013 by K.S. Trivedi
  • 90. SRN Model Basic Cluster Model Copyright © 2013 by K.S. Trivedi
  • 91. SRN Model Simple Time-Based Rejuvenation Copyright © 2013 by K.S. Trivedi
  • 92. Model Parameters (transition: mean time) • Tfprob: 240 hours • Tnodefail: 720 hours • Tnoderepair: 30 mins • Tsysrepair: 4 hours • Trejuv: 10 mins. Costs • costnodefail: $5000/hour • costnoderejuv: $250/hour
  • 93. Model Measures. Measures computed • Unavailability: (#Psysfail == 1) ? 1 : 0 • Cost: #Prejuv*costrejuv + #Pnodefail*costnodefail + #Psysfail*costsysfail
  • 94. Results: Simple Time-Based Rejuvenation [Plots: 8/1 configuration; 8/2 configuration] • As the rejuvenation interval increases, rejuvenation is performed less frequently • When the rejuvenation interval is close to zero, the system is rejuvenating very frequently, resulting in high cost/downtime • When the rejuvenation interval goes beyond the optimal value, system failures become frequent, resulting in high cost/downtime
  • 95. Measuring Performance Variables Objective Detection and validation of aging Periodically monitor and collect data on the attributes responsible for the “health” of the system Quantify the effect of aging on system resources Proposed metric – Estimated time to exhaustion Proposed metric – Evaluation function using PCA approach Copyright © 2013 by K.S. Trivedi
  • 96. Measuring Performance Variables Approaches Time-based (workload-independent) estimation [Garg98] Workload-based estimation [Vaidyanathan99] ARMA/ARX models [Li02] ALT and ADT techniques [Matias06] Non-parametric Algorithms [Dohi00] Non-linear models [Hoffman07] Principal component Analysis (PCA) and System identification [Jia] Pattern recognition [Vaidyanathan & Gross] Threshold-based approaches [Silva09] Machine Learning Approaches [Alonso10] Copyright © 2013 by K.S. Trivedi
  • 97. Data Collection: Experimental Setup. SNMP-based resource monitoring tool: data related to OS resource usage (memory, process table, file table, etc.) and system activity (CPU usage, paging, swapping, NFS, interrupts, etc.) collected for over 3 months at 10 min intervals
  • 98. Time Plots: Non-parametric Regression Smoothing [Plots: Real Memory Free; File Table Size] Trend detection: Seasonal Kendall test for trend
  • 99. IBM xSeries Software Rejuvenation Agent (SRA). Implemented in a high-availability clustered environment. Monitors consumable resources, estimates time to exhaustion, and generates alerts if within user notification horizon
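The slide names the seasonal Kendall test for trend detection. The sketch below implements the simpler, non-seasonal Mann-Kendall test together with Sen's slope estimate, as a way to show the mechanics; it is a substitute chosen for brevity, not the seasonal procedure itself, and it omits the tie correction in the variance. The sample series of free memory values is made up.

```python
import math
from statistics import median
from typing import Sequence, Tuple

def mann_kendall(series: Sequence[float]) -> Tuple[int, float]:
    """Return the Mann-Kendall S statistic and its normal-approximation Z score.

    Simplification: the variance formula assumes no tied values.
    """
    n = len(series)
    s = sum((series[j] > series[i]) - (series[j] < series[i])
            for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return s, z

def sen_slope(series: Sequence[float]) -> float:
    """Median of all pairwise slopes: a robust estimate of change per sample."""
    n = len(series)
    return median((series[j] - series[i]) / (j - i)
                  for i in range(n - 1) for j in range(i + 1, n))

# Made-up samples of free real memory (MB), one value per monitoring interval.
free_memory_mb = [512, 508, 509, 501, 497, 498, 490, 486, 480, 479]
s, z = mann_kendall(free_memory_mb)
slope = sen_slope(free_memory_mb)
print(f"S = {s}, Z = {z:.2f}, Sen slope = {slope:.2f} MB per interval")
```

A Z score below about -1.96 rejects the no-trend hypothesis at the 5% level in favor of a downward trend, which for a "free resource" metric is the signature of aging. The seasonal variant applies the same statistic within each season (for example, each hour of the day) and combines the results, so that periodic workload patterns are not mistaken for a trend.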
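The deck does not describe how the agent estimates time to exhaustion, so the following is only a sketch of the general idea under an assumed method: fit a least-squares line to recent samples of a consumable resource, extrapolate to the exhaustion level, and raise an alert when the estimate falls within the notification horizon. The function names, the sample data and the one-week horizon are hypothetical.

```python
from typing import Optional, Sequence

def estimate_time_to_exhaustion(samples: Sequence[float], interval_hours: float,
                                floor: float = 0.0) -> Optional[float]:
    """Hours from the last sample until the fitted line reaches `floor`.

    Returns None when the fitted trend is flat or rising (no exhaustion predicted).
    """
    n = len(samples)
    xs = [i * interval_hours for i in range(n)]
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope >= 0:
        return None
    intercept = mean_y - slope * mean_x
    time_at_floor = (floor - intercept) / slope
    return max(0.0, time_at_floor - xs[-1])

def should_alert(tte_hours: Optional[float], horizon_hours: float) -> bool:
    """Alert only when exhaustion is predicted within the notification horizon."""
    return tte_hours is not None and tte_hours <= horizon_hours

# Made-up hourly samples of a free resource that shrinks by 10 units per hour.
free_units = [1000, 990, 980, 970, 960]
tte = estimate_time_to_exhaustion(free_units, interval_hours=1.0)
print("estimated hours to exhaustion:", tte)
print("alert with one-week horizon:", should_alert(tte, horizon_hours=168))
print("alert with one-day horizon:", should_alert(tte, horizon_hours=24))
```

A least-squares fit is sensitive to outliers and periodic load; a production estimator would more likely use a robust slope and a confidence bound, which is a design choice this sketch leaves open.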
  • 100. IBM xSeries Software Rejuvenation Agent (SRA) IBM Director system management tool – Provides GUI to configure SRA – Acts upon alerts Two versions – Periodic rejuvenation – Prediction-based rejuvenation Copyright © 2013 by K.S. Trivedi
  • 101. Summary • It is possible to enhance software availability during operation by exploiting environmental diversity • Multiple types of recovery after a software failure can be judiciously employed: restart, failover to a replica, reboot and, if all else fails, repair (patch)
  • 102. Summary • Software aging is not anecdotal; it is a real-life scientific phenomenon • Rejuvenation has been implemented in several special-purpose applications and many general-purpose cluster systems
  • 103. Key References • Software Rejuvenation: Analysis, Module and Applications, Y. Huang, C. Kintala, N. Kolettis and N. Fulton, Proc. FTCS-25, June 1995. • A Methodology for Detection and Estimation of Software Aging, S. Garg, A. van Moorsel, K. Vaidyanathan and K. S. Trivedi, Proc. ISSRE 1998. • Performance and Reliability Evaluation of Passive Replication Schemes in Application Level Fault Tolerance, S. Garg, Y. Huang, C. M. R. Kintala, K. S. Trivedi and S. Yajnik, Proc. FTCS 1999. • Statistical Non-Parametric Algorithms to Estimate the Optimal Software Rejuvenation Schedule, T. Dohi, K. Goseva-Popstojanova and K. S. Trivedi, Proc. PRDC 2000. • Proactive Management of Software Aging, V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan and W. Zeggert, IBM Journal of Research & Development, March 2001. • A Comprehensive Model for Software Rejuvenation, K. Vaidyanathan and K. S. Trivedi, IEEE-TDSC, April-June 2005. • Analysis of Software Aging in a Web Server, M. Grottke, L. Li, K. Vaidyanathan and K. S. Trivedi, IEEE Trans. Reliability, Sept. 2006.
  • 104. Key References • Fighting Bugs: Remove, Retry, Replicate and Rejuvenate, M. Grottke and K. Trivedi, IEEE Computer, Feb. 2007. • Availability Modeling of SIP Protocol on IBM WebSphere, K. S. Trivedi, D. Wang, D. J. Hunt, A. Rindos, W. E. Smith and B. Vashaw, Proc. PRDC 2008. • Using Accelerated Life Tests to Estimate Time to Software Aging Failure, R. Matias Jr., K. Trivedi and P. Maciel, ISSRE 2010. • Accelerated Degradation Tests Applied to Software Aging Experiments, Rivalino Matias Jr., K. S. Trivedi, Paulo J. F. Filho and Pedro A. Barbetta, IEEE Transactions on Reliability, March 2010. • An Empirical Investigation of Fault Types in Space Mission System Software, M. Grottke, A. P. Nikora and K. S. Trivedi, Proc. DSN 2010. • Software Fault Mitigation and Availability Assurance Techniques, K. S. Trivedi, M. Grottke and E. Andrade, International Journal of System Assurance Engineering and Management, 2011. • Recovery from Failures due to Mandelbugs in IT Systems, K. Trivedi, R. Mansharamani, D. S. Kim, M. Grottke and M. Nambiar, Proc. PRDC 2011. • Network Troubleshooting, O. Kyas, Agilent Technologies, Palo Alto, California, 2001 (book). • Reliability of a Commercial Telecommunications System, M. Kaaniche and K. Kanoun, ISSRE 1996. • On Operational Availability of a Large Software-Based Telecommunications System, R. Cramp, M. A. Vouk and W. Jones, ISSRE 1992.