Software Faults, Failures and Their Mitigations | Turing100@Persistent

Persistent Systems
January 5, 2013

Software Faults, Failures and
Their Mitigations

Kishor Trivedi
Duke High Availability Assurance Lab (DHAAL)
Dept. of Electrical & Computer Engineering
Duke University
Durham, NC 27708
kst@ee.duke.edu
www.ee.duke.edu/~kst
Copyright © 2013 by K.S. Trivedi

Duke University
Research Triangle Park (RTP), North Carolina, USA
[Map: Duke, UNC-CH and NC State]

Duke University
• U.S. News & World Report, in its 2012 edition, ranked the university's
undergraduate program 8th among all universities in the USA

Duke University

NCAA Men’s Basketball Champions 2010 (also 1991, 1992 & 2001)

Trivedi’s research triangle

Theory: Stochastic modeling methods & numerical solution methods
– Large Fault trees, Stochastic Petri Nets
– Large/stiff Markov & non-Markov models
– Fluid stochastic models
– Performability & Markov reward models
– Software aging and rejuvenation
– Security, Survivability, Resilience quantification
– Machine Learning

Books: Blue, Red, White

Software Packages: HARP (NASA), SAVE (IBM), IRAP (Boeing), SHARPE, SPNP, SREPT

Applications: Reliability/availability/performance
– Avionics (Boeing), Space (NASA/JPL)
– Power systems (GE)
– Automobile systems (GM)
– Computer systems (EMC, SUN, HP, TCS)
– Telco systems (AT&T, Lucent, Avaya)
– Computer Networks (Motorola)
– Virtualized Data center (NEC)
– Cloud computing (IBM, NEC, Cisco)
– Software Aging & Rejuvenation (Huawei)

Textbooks

Probability and Statistics with Reliability, Queuing,
and Computer Science Applications, Second edition,
John Wiley, 2001 (Bluebook) [First edition by
Prentice-Hall, 1982]

Performance and Reliability Analysis of Computer
Systems: An Example-Based Approach Using the
SHARPE Software Package, Kluwer (now Springer),
1996 (Redbook)

Queueing Networks and Markov Chains,
John Wiley, second edition, 2006 (White book)


DHAAL & Industry
A Success Story
• Reliability Prediction of Boeing 787 Current Return Network for FAA
Certification
• Security Quantification (DARPA SITAR)/NSF
• Survivability Quantification for Lucent POTS and Siemens Smartgrid
• Cloud performance, availability, power with IBM and Cisco
• Reliability/Availability Prediction of SIP protocol on IBM WebSphere
• Software Aging and Rejuvenation: state of the art, theory, measurements,
and implementation (IBM x-series); Huawei
• Cloud computing security (Measurements, quantification, and
implementation) - NATO Science for Peace and Security
• NASA-JPL Failures data analytics
• NEC collaboration for Performability Management (VMs allocation, VMs
and VMM rejuvenation, etc) in Virtualized Data Center
• Data analytics (statistical and machine learning techniques used to
interpret huge volume of data) - WiPro Technologies
• Software Reliability/Availability/Performability analysis – Short courses,
seminars, and consulting - Tata consulting services


Outline
• Motivation
• A Real System
• Software Fault Classification
• Environmental Diversity
• Methods of Mitigation
• Software Aging and Rejuvenation
• Conclusions


Pervasive Dependence on Computer Systems
Need for High Reliability/Availability

Communication

Health & Medicine Avionics

Banking
Entertainment


Basic Definitions
• Steady-state availability (Ass) or just availability
– Long-term probability that the system is available when requested:

      Ass = MTTF / (MTTF + MTTR)

– MTTF is the system mean time to failure, a complex
combination of component MTTFs
– MTTR is the system mean time to recovery; it may consist of many phases

Basic Definitions

• Downtime in minutes per year
(un)availability is usually presented in terms of annual downtime.

– Downtime = 8760×60 ×(1- Ass) minutes.

– 5 NINES (Ass = 0.99999) → 5.26 minutes annual downtime
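
The two formulas above can be checked with a few lines of Python. This is a minimal sketch: the function names and the example MTTF/MTTR values are illustrative assumptions, not taken from the slides.

```python
HOURS_PER_YEAR = 8760  # non-leap year, as used in the downtime formula


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Ass = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def annual_downtime_minutes(availability: float) -> float:
    """Downtime = 8760 x 60 x (1 - Ass) minutes per year."""
    return HOURS_PER_YEAR * 60 * (1.0 - availability)


if __name__ == "__main__":
    # Hypothetical system: fails every 1000 hours on average, recovers in 1 hour.
    a = steady_state_availability(mttf_hours=1000.0, mttr_hours=1.0)
    print(f"Ass = {a:.6f}, downtime = {annual_downtime_minutes(a):.1f} min/year")

    # Annual downtime for 2 to 5 nines.
    for nines in range(2, 6):
        a = 1.0 - 10.0 ** -nines
        print(f"{nines} nines: {annual_downtime_minutes(a):.2f} min/year")
```

Working in hours for MTTF and MTTR and converting to minutes only at the end keeps the units consistent with the 8760×60 factor.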


Number of Nines– Reality Check

• 49% of Fortune 500 companies experience at least 1.6 hours of
downtime per week

– Approx. 80 hours/year=4800 minutes/year

– Ass=(8760-80)/8760=0.9908

– That is, between 2 NINES and 3 NINES!

• This study counts planned and unplanned downtime together


Achieving High Availability
is a Challenge
• Black Sept. 2011, In the same week!!!!:
– Microsoft Cloud service outage (2.5 hours)
– Google Docs service outage (1 hour)
• A memory leak due to a software update

• Sept. 2012 GoDaddy (4 hours)
– 5 million websites affected
• Oct. 2012 Amazon
– 10/15/2012 Webservices – 6 hours (Memory leak)
– 10/27/2012 EC2 – > 2 hours


Downtime Costs per Hour
• Brokerage operations $6,450,000
• Credit card authorization $2,600,000
• eBay (1 outage 22 hours) $225,000
• Amazon.com $180,000
• Package shipping services $150,000
• Home shopping channel $113,000
• Catalog sales center $90,000
• Airline reservation center $89,000
• Cellular service activation $41,000
• On-line network fees $25,000
• ATM service fees $14,000
Sources: InternetWeek 4/3/2000; Fibre Channel: A Comprehensive Introduction, R. Kembel
2000, p.8. “...based on a survey done by Contingency Planning Research.”

High Reliability/Availability

• Hardware fault tolerance, fault management,
reliability/availability modeling/assurance relatively well
developed

• System outages more due to software faults

Key Challenge:
• Software reliability is one of the weakest links in system
reliability/availability

Software is the problem
Jim Gray’s paper titled “Why do computers stop and what can be done about it?”
started to point out this trend in 1985, followed by his paper
“A census of Tandem system availability between 1985 and 1990”

[Figure: the trend across different industries, 1985 to 2005]

Increasing SW Failure Rate?
Planetary Missions Flight Software: A. Nikora of JPL

The interval between the first and last launch: 8.76 years.

The interval between successive launches ranges from 23 to 790 days.

Similar results for ground software

[Figure; x-axis: Mission Name (in launch order). Missions shown: Mars Pathfinder,
Cassini, Mars Global Surveyor, Mars Climate Orbiter, Mars Polar Lander, Stardust,
Mars Odyssey, Genesis, Mars Exploration Rover, Deep Impact, Mars Reconnaissance
Orbiter]

High Reliability/Availability:
Software is the problem

• Fault avoidance
– good software engineering practices
– difficult for large/complex software systems

– Impossible to fully test and verify if software is fault-free
“Testing shows the presence, not the absence, of bugs”
- E. W. Dijkstra

• Yet there are stringent requirements for failure-free
operation


High Reliability/Availability:
Software is the problem (2)

Software fault tolerance is a potential
solution to improve software reliability in lieu of
virtually impossible fault-free software


Software Fault Tolerance
Classical Techniques

Design diversity
– N-version programming
– Recovery block

Expensive → not used much in practice!

Yet there are stringent requirements for failure-free
operation

Challenge: Affordable Software Fault Tolerance


High availability SIP Application Server
Configuration on IBM WebSphere

PRDC 2008 and ISSRE 2010 papers


configuration on WebSphere

Hardware configuration:
– Two BladeCenter chassis; 4 blades (nodes) on each chassis (1 chassis
sufficient for performance)

Software configuration:
– 2 copies of SIP/Proxy servers (1 sufficient for performance)

– 12 copies of WAS (6 sufficient for performance)

– Each WAS instance forms a redundancy pair (replication domain) with
WAS installed on another node on a different chassis

• The system has hardware redundancy and software
redundancy


configuration on WebSphere

Software Fault Tolerance
– Identical copies of SIP proxy used as backups (hot spares)
– Identical copies of WebSphere Applications Server (WAS) used
as backups (hot spares)
– Type of software redundancy – (not design diversity) but
replication of identical software copies
– Normal recovery
• restart software, reboot node or fail-over to a software replica; only
when all else fails, a “software repair” is invoked


Escalated levels of Recovery
A real example: Avaya Servers and Media Gateways

The flowchart depicts the actions taken for recovery after a failure is detected.
Try the simplest recovery method first, then a more complex one, and so on.

1. Single Process Restart (SPR)
   – IF 3 SPR within 60 seconds: YES → go to step 2 (NO: no escalation)
2. System Warm Restart (SWR)
   – IF 3 SWR within 15 min: YES → go to step 3 (NO: no escalation)
3. System Cold Restart (SCR)
   – IF 3 SCR within 15 min: YES → go to step 4 (NO: no escalation)
4. Avaya Communication Manager Software reloads (ACMSR)
   – IF 3 ACMSR within 15 min: YES → go to step 5 (NO: no escalation)
5. OS Reboot
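
The escalation logic in the flowchart (three SPRs within 60 seconds escalate to SWR; three SWRs, SCRs or ACMSRs within 15 minutes escalate to the next level, ending at OS reboot) can be sketched as a small state machine. This is an illustrative sketch only, not Avaya's implementation: the class name, the injected clock, and the choice to reset lower-level counters after an escalation are assumptions.

```python
import time
from collections import deque

# (recovery level, attempts allowed, window in seconds), thresholds as read
# from the flowchart. The last level has no threshold.
LADDER = [
    ("SPR", 3, 60),
    ("SWR", 3, 15 * 60),
    ("SCR", 3, 15 * 60),
    ("ACMSR", 3, 15 * 60),
    ("OS_REBOOT", None, None),
]


class EscalatingRecovery:
    """Pick the cheapest recovery level that has not been exhausted recently."""

    def __init__(self, ladder=LADDER, clock=time.monotonic):
        self.ladder = ladder
        self.clock = clock  # injectable so the policy can be tested
        self.history = [deque() for _ in ladder]

    def next_action(self):
        """Return the recovery level to apply for a newly detected failure."""
        now = self.clock()
        for i, (name, limit, window) in enumerate(self.ladder):
            if limit is None:
                return name  # top of the ladder: nothing left to escalate to
            recent = self.history[i]
            while recent and now - recent[0] > window:
                recent.popleft()  # forget attempts outside the window
            if len(recent) < limit:
                recent.append(now)
                # Assumption: a higher-level recovery resets lower-level counts.
                for lower in self.history[:i]:
                    lower.clear()
                return name
        return self.ladder[-1][0]


if __name__ == "__main__":
    fake_now = [0.0]
    policy = EscalatingRecovery(clock=lambda: fake_now[0])
    for _ in range(5):
        print(fake_now[0], policy.next_action())
        fake_now[0] += 1.0  # one failure per second
```

Keeping the thresholds in a table rather than in code makes the policy easy to audit against the flowchart.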

Software Fault Tolerance: New
Thinking

Retry, restart, reboot!

– Known to help in dealing with hardware
transients

– Do they help in dealing with failures caused by
software bugs?

– If yes, why?

A Cartoon

Why is this true………at least for computers?

Thinking

Failover to an identical software replica (that is not a
diverse version)

– Does it help?

– If yes, why?

Twenty years ago this would be considered crazy!


Outline
• Motivation
• A Real System
• Software Fault Classification
– Fighting Bugs: Remove, Retry, Replicate and Rejuvenate, M.
Grottke and K. Trivedi, IEEE Computer Magazine, Feb. 2007
• Environmental Diversity
• Methods of Mitigation
• Software Aging and Rejuvenation
• Conclusions


Software Faults
main threats to high reliability,
availability & safety


IFIP Working Group 10.4 (Laprie)

• Failure occurs when the delivered service no
longer complies with the desired output.
• Error is that part of the system state which is
liable to lead to subsequent failure.
• Fault is adjudged or hypothesized cause of an
error.
Faults are the cause of errors that may lead to failures
Fault → Error → Failure


Need to Classify bug types
• We submit that a software fault tolerance
approach based on retry, restart, reboot or fail-over
to an identical software replica (not a
diverse version) works because a significant
number of software failures are caused by
Mandelbugs, as opposed to the traditional
software bugs now called Bohrbugs


Need to Classify bug types

• In recent years, researchers have reported the
phenomenon of “software aging” (i.e.,
degraded performance and/or increased
failure rate of long-running software systems).
• Puzzle: How can performance and failure rate
change if the software code is not modified?!
⇒ Study software fault types and their
relationships


Jim Gray’s Definitions

• The terms “Bohrbug” and “Heisenbug” were
first used in print by Jim Gray in 1985.
• “Bohrbugs, like the Bohr atom, are solid, easily
detected by standard techniques, and hence
boring.”
• “Most production software faults are soft. If
the program state is reinitialized
and the failed operation is retried,
the operation will not fail a second time. … The
assertion that most production software bugs are soft
– Heisenbugs that go away when you look at them –
is well known to systems programmers.” (Gray, 1985)
[Photo: J. Gray]


Bruce Lindsay’s Definition
• Based on Gray’s paper, researchers
have often equated Heisenbugs with
soft faults.
• However, when Bruce Lindsay
originally coined the term in the 1960s
(while working with Jim Gray), he had
a narrower definition in mind.
• “Heisenbugs as originally defined …
are bugs in which clearly the system behavior is incorrect,
and when you try to look to see why it’s incorrect, the
problem goes away.” (Lindsay, 2004)
• The term alludes to the physicist Werner Heisenberg and his
Uncertainty Principle.
[Photo: B. Lindsay, photo by T. Upton]

Heisenbug – Our Definition
• Heisenbug := A fault that stops
causing a failure or that manifests
differently when one attempts to
probe or isolate it.
• How can probing affect the bug?
1. Some debuggers initialize unused
memory to default values, thus
preventing failures due to
improper initialization.
2. Trying to investigate a failure can
influence process scheduling in
such a way that a scheduling-
related failure does not occur
again.


A Classification of Software Faults
• Bohrbug := A fault that is easily
isolated and that manifests
consistently under a well-defined set
of conditions, because its activation
and error propagation lack
complexity.
– Example: A bug causing a failure
whenever the user enters a negative
date of birth
– Since they are easily found, Bohrbugs tend to be
detected and fixed during the software testing phase.
– The term alludes to the physicist Niels Bohr and his
rather simple atomic model.

Mandelbug – Definition

• Mandelbug := A fault whose
activation and/or error
propagation are complex.
Typically, a Mandelbug is
difficult to isolate, and/or the
failures caused by it are not
systematically reproducible.
• Example: A bug whose
activation is scheduling-dependent
– The residual faults in a thoroughly-tested piece of
software are mainly Mandelbugs.
– The term alludes to the mathematician Benoît
Mandelbrot and his research in fractal geometry.


Mandelbug: “Complexity” (1)

• The explanation of the possible sources of complexity is based on the
“chain of threats” linking faults with errors and failures:

• First source of complexity: Time lag between fault activation and failure
occurrence, e.g., because several different error states have to be
traversed in the error propagation.
• Example: The result of an erroneous calculation may at first be kept in the
system memory and cause a failure only later, when it is being accessed
and used.


Mandelbug: “Complexity” (2)

• Second source of complexity: Fault activation and/or error
propagation depend on interactions between conditions occurring
inside the application and conditions that accrue within the system-
internal environment of the application.

• Example: A fault causing failures due to side-effects of other
applications

Mandelbugs: Consequences

• Mandelbugs are difficult to detect and remove
during the software testing phase.
• An operation that failed due to a Mandelbug may
execute correctly upon retry even if the fault has
not been removed; changing the environment may
suffice.
• Potential recovery techniques:
– “Microreboot” of individual components
– Application restart
– System reboot
– Failover to a standby component (replicate)
– Manual recovery

Examples of Types of Bugs in
IT System
• Mandelbugs in IT Systems: Trivedi, Mansharamani, Kim,
Grottke, and Nambiar. “Recovery from failures due to
Mandelbugs in IT systems”. PRDC 2011.

• The projects ranged across a number of business systems in
the banking, financial, government, IT, pharmacy, and
telecom sector.


Examples of Types of Bugs
in IT System (cont.)
• Example of a Mandelbug in a large telecom system
– Slow response times of the front end screens.
– The problem was hard to analyze since the screens would freeze at
random points in the day.
• As the days went by the frequency of incidents of these freezes kept increasing.

• Class of MandelBug encountered
– The users would wait for some time and their operations would
resume.
– The IT operations team rebooted the servers and the operations could
resume for a few hours.


• Cause of the problem
– Whenever a front end screen was invoked, a temporary file was
created at the centralized server.
• This file was never cleaned up even after a screen was closed.
– As a result, tens of thousands of small files kept accumulating on the
disk causing sluggish behavior.
• Solution of the problem
– A cleanup utility was written to move these files periodically to
another file system and later delete them.


• Example of a Mandelbug in a government tax information
system.
– All organizations must submit the income tax deducted at source
(TDS) records for all of their employees.
– Sporadically, when a large corporation uploaded its file,
the application server would crash.
• Class of MandelBug encountered
– The IT operations staff would then increase the JVM heap size and
restart the JVM, which would allow the file to be uploaded without
any problem.


• Cause of the problem
– The probability of a failure occurrence increased after each JVM
restart, as the heap got consumed more and more.
• Solution of the problem
– Reconfigure system parameters to resume operations successfully.


Aging-related Bug – Definition
• Aging-related bug := A fault that
leads to the accumulation of errors
either inside the running application
or in its system-context
environment, resulting in an
increased failure rate and/or
degraded performance.
– Example: A bug causing memory leaks in the application
– Note that the aging phenomenon requires a delay
between fault activation and failure occurrence.
– Note also that the software appears to age due to such
a bug; there is no physical deterioration

Relationships
• Bohrbug and Mandelbug are complementary
antonyms.
• Aging-related bugs are a subtype of Mandelbugs

[Figure: Venn diagram. Bohrbugs and Mandelbugs are disjoint regions;
Aging-Related Bugs is a region inside Mandelbugs]


Important Questions about these Bugs

• What fraction of bugs are Bohrbugs, Mandelbugs and aging-related
bugs
– How do these fractions vary
• over time
• over projects, languages, application types,…
– Need Measurements
– Current NASA/JPL Project with Allen Nikora & Michael Grottke; preliminary
results from one NASA software project:
• 52% Bohrbugs
• 35% Mandelbugs (non-aging-related)
• 4% Aging-related bugs
• 7% Operator related
• 2% Unclassified
– Very similar results for Linux, MySQL, Apache AXIS, httpd
• What are the methods of mitigation for the different fault types


Trends in SW Fault Type Proportions
Planetary Missions Flight Software

• Fault Type Proportions vs. Runtime for Four Earlier Missions (of 8 missions
analyzed)
• Result: The proportion of Bohrbugs seems to settle at around the same value. Such
a convergence to similar values is less obvious for the other fault types.


Environmental diversity
A new thinking to deal with software faults and failures


Thinking
New thinking: Environmental Diversity as opposed to
Design Diversity

Our claim is that this works: since failures due to
Mandelbugs are not negligible, we have an
affordable software fault tolerance technique that
we call
Environmental Diversity


What is environmental diversity?
• The underlying idea of Environmental diversity
– Retry a previously failed operation and it works
– Why?
– Because the environment in which the operation was
executed has changed enough to avoid the fault
activation.
• The environment is understood as
– OS resources, other applications running concurrently
and sharing the same resources, interleaving of
operations, concurrency, or synchronization.

What is environmental diversity?

• The execution of an application depends on
the environment
[Figure: the same hardware and operating system at two points in time.
Environment at time t1 runs Ap1, Ap2, Ap3, Ap4; after “Restart the application”,
the environment at time t1+n runs Ap4, Ap6, Ap3, Ap5]

Environmental Diversity
• Restarting an application, rebooting a node or failing over to an
identical standby replica works because of the environmental
diversity underlying these actions;
– By environment here we mean the resources of the OS, other
applications running concurrently and sharing system resources,
interleaving of operations, concurrency, synchronization etc.
• Environmental Diversity uses time redundancy over
expensive design diversity
• [Adams] Restart
• [Jalote et al.] Rollback, rollforward
• [Patriot] Occasional reboot, “switch off and on”
• [Avaya Swift] restart process; failover to a replica
• [IBM SIP] escalated levels: restart, reboot, failover…
• [IBM Director-X-series] Rejuvenation
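
The idea that a retry succeeds because the environment has changed, while the code has not, can be illustrated with a toy example. This is a hedged sketch: `run_with_retry`, the simulated `open_handles` resource and its threshold are hypothetical and stand in for whatever OS resource or interleaving triggers a real Mandelbug.

```python
def run_with_retry(operation, max_retries=2, refresh_environment=None):
    """Retry an operation, optionally changing the environment between tries."""
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
            if attempt < max_retries and refresh_environment is not None:
                refresh_environment()  # e.g. restart, reboot, or fail over
    raise last_error


# Simulated environment: a shared resource that has been depleted over time.
env = {"open_handles": 1000}


def mandelbug_operation():
    """Fails only when the environment is in a bad state (Mandelbug-like)."""
    if env["open_handles"] > 500:
        raise RuntimeError("resource exhausted")
    return "ok"


def bohrbug_operation():
    """Fails on every attempt for the same input (Bohrbug-like)."""
    raise ValueError("negative date of birth")


def restart():
    """Stand-in for a restart that releases accumulated resources."""
    env["open_handles"] = 0


if __name__ == "__main__":
    print(run_with_retry(mandelbug_operation, refresh_environment=restart))
    try:
        run_with_retry(bohrbug_operation, refresh_environment=restart)
    except ValueError as exc:
        print("retry did not help:", exc)
```

The Bohrbug-like operation shows the limit of the technique: retrying in a fresh environment does nothing for a fault that is activated deterministically by the input.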


Methods of Mitigation


Mitigation


Bohrbugs: Remove

• Find and fix the bugs during testing
• Failure data collected during testing
• Calibrate a software reliability growth model (SRGM) using failure data;
this model is then used for prediction
• Many SRGMs exist (JM, NHPP, HGRGM, etc.)
– Books by Lyu, Musa, Cai
– Gokhale & Trivedi, A Time/Structure Based Software Reliability Model, Annals
of Software Engineering, 1999
• Measurements → Empirical (statistical) models


OS Availability Model (IBM BladeCenter)

Fix (Failed due to a Bohrbug)

Reboot (Failure due to a Mandelbug)


Availability model of a Proxy or a WAS (IBM SIP on websphere)

• Failure detection
– By WLM
– By Node Agent
– Manual detection
• Recovery
– Node Agent
• Auto process restart
– Manual recovery
• Process restart
• Node reboot
• Repair

Application server and proxy server


Aging Related Bugs: Replicate, Restart,
Reboot, Rejuvenate


Software Aging

Aging phenomenon
Error conditions accumulating over time

Performance degradation, system failure

Main causes of Software Aging
Memory leak, fragmentation, Unterminated threads, Data corruption, Round-
off errors, Unreleased file-locks, etc
Observed system
OS, Middle-ware, Netscape, Internet Explorer etc

Software Aging - Definition

“Software Aging” phenomenon

Long-running software tends to show an increasing
failure rate.

Not related to application program becoming
obsolete due to changing
requirements/maintenance.

Software appears to age; no real deterioration

Software Aging - Examples

• Cisco Catalyst Switch [Matias Jr.]
• File system aging [Smith & Seltzer]
• Gradual service degradation in the AT&T transaction processing
system [Avritzer et al.]
• Error accumulation in Patriot missile system’s software [Marshall]
• Resources exhaustion in Apache [Li et al., Grottke et al.]
• Physical memory degradation in a SOAP-based Server [Silva et al.]
• Software aging in Linux [Cotroneo et al.]
• Crash/hang failures in general purpose applications after a long
runtime


Measurements Showing Resource
Exhaustion or Depletion

Real Memory Free File Table Size

A Methodology for Detection and Estimation of Software Aging,
S. Garg, A. van Moorsel, K. Vaidyanathan and K. Trivedi.
Proc. of IEEE Intl. Symp. on Software Reliability Engineering, Nov. 1998.

Software Fault Types & Their Mitigation


Software rejuvenation

Software rejuvenation is a cost effective solution for
improving software reliability by avoiding/postponing
unanticipated software failures/crashes.

It allows proactive recovery to be carried out either
automatically or at the discretion of the
user/administrator

Rejuvenation of the environment, not of software

Software Rejuvenation

Counteracts the software aging phenomenon
Frees up OS resources; Removes error accumulation

Common techniques for cleaning
Garbage collection, defragmentation, flushing kernel and file
server tables etc.

Challenge: Rejuvenation scheduling/granularity


SW Rejuvenation: The Genesis

“Software Rejuvenation: Analysis, Module and
Applications”, Y. Huang, C. Kintala, N. Kolettis, N.
Fulton, in FTCS 1995
An insight into operational software that no one had before (at least,
formally). It
• Changed how practitioners looked at making software more dependable
• Produced a windfall of performance and dependability modeling problems for academicians
• Provided ideas to build better, real-world systems as the Internet evolved
• Led to recognition of the “software aging” phenomenon
• Brought about PhDs, tenure, publications, patents, awards, tools, systems, and
funding for many people around the world.


Software Rejuvenation
Examples
AT&T billing applications [Huang et al.]
Patriot missile system software - switch off/on every 8 hours [Marshall]
On-board preventive maintenance for long-life deep space missions
(NASA’s X2000 Advanced Flight Systems Program) [Tai et al.]
IBM Director Software Rejuvenation (x-series) [IBM & Duke Researchers]
Microsoft IIS 5.0 process recycling tool
Process restart in Apache [Li et al.]
ISS FS SSC (ISS File system) - switch off and on every 2 months [NASA ISS
reports]
For more examples:
"Software rejuvenation - Do IT & Telco industries use it?". Javier Alonso, Antonio Bovenzi,
Jinghui Li, Yakun Wang, Stefano Russo, and Kishor Trivedi. The 4th International Workshop
on Software Aging and Rejuvenation (WoSAR 2012). Held in conjunction with the 23rd
annual International Symposium on Software Reliability Engineering (ISSRE 2012), Dallas,
USA, 2012.


Software Rejuvenation –Trade-off

• Advantages
– Reduces costs of sudden aging-related failures
– Can be applied at the discretion of the user/administrator

• Disadvantages
– Direct costs of carrying out rejuvenation
– Opportunity costs of rejuvenation (downtime, decreased
performance, lost transactions etc)

Important research issue:
Find optimal times to perform rejuvenation!

Software rejuvenation - Approaches

Two approaches based on WHEN:

Time-Based rejuvenation approaches
Rejuvenation applied regularly and at predetermined time intervals.

Widely used in real environments

– Web servers (Apache)
– ISS two-months reboot
– Telecommunication systems


Software rejuvenation - Approaches

Two approaches based on WHEN:

Measurement (Inspection)-based rejuvenation approaches

• Threshold based or predictive

• System metrics continually monitored

• Rejuvenation triggered when the crash is imminent based on the
observation/prediction

• Reduce potentially useless rejuvenation actions and downtime in the process
[Silva et al.]


Software rejuvenation


Software Rejuvenation – Approaches

Two approaches based on HOW:

Use analytical model to optimize rejuvenation schedule
• Lucent Bell Labs [Huang et al., ‘95]

• Duke [IEEE-TC’98, SIGMETRICS’96, ISSRE’95, PRDC’00, SIGMETRICS’01,
Comp J.’01, SRDS’02, DSN’02, ISSRE’02, DSN’03, IEEE-TR’05]

• Others [IPDS’98, PNPM’99]


Software Rejuvenation – Approaches

Two approaches based on HOW:

Use measurements of resource degradation to
determine/predict optimal rejuvenation schedule

• Duke [ISSRE’98, ISSRE’99, IBMJRD’01, ISESE’02, IEEE-TPDS’05]

• Duke (formerly UPC) [GRID’07, IEEE-TC’09, DSN’10]


Failure rate

Preventive maintenance is useful only if the failure rate is
increasing

If the time-to-failure (TTF) distribution is exponential, then the
failure rate is constant

Need to assume (and establish) that the TTF has an increasing failure rate (IFR)


Analytic Models

Single node models
– CTMC model
– SMP model

Cluster systems
– IBM Cluster model (Time-based, condition-based)


Analytic Models
Software Aging and Rejuvenation

A simple and useful model of increasing failure rate:
Robust state → Failure-probable state → Failed state

Time to failure: Hypo-exponential distribution
Increasing failure rate → aging
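
That a two-stage model (robust, then failure-probable, then failed) has an increasing failure rate can be checked numerically from the hypo-exponential density f(t) and survival function R(t), using h(t) = f(t)/R(t). This is a sketch under stated assumptions: the two rates below (mean 240 hours and 720 hours) are illustrative values, and the function name is hypothetical.

```python
import math


def hypoexp_hazard(t: float, a: float, b: float) -> float:
    """Hazard rate h(t) = f(t) / R(t) of a two-stage hypo-exponential.

    a is the rate of leaving the robust state, b the rate of leaving the
    failure-probable state; the closed forms below require a != b.
    """
    density = a * b / (b - a) * (math.exp(-a * t) - math.exp(-b * t))
    survival = (b * math.exp(-a * t) - a * math.exp(-b * t)) / (b - a)
    return density / survival


if __name__ == "__main__":
    a = 1.0 / 240.0  # assumed: mean 240 hours in the robust state
    b = 1.0 / 720.0  # assumed: mean 720 hours in the failure-probable state
    for t in (0, 100, 500, 1000, 3000):
        print(f"t = {t:5d} h  hazard = {hypoexp_hazard(t, a, b):.6f} per hour")
    # For comparison, a single exponential stage has constant hazard b at all t.
    print(f"exponential with rate b: hazard = {b:.6f} per hour")
```

The hazard starts at zero, since a fresh system must first age into the failure-probable state, and rises toward the smaller of the two rates. That growth is what makes preventive maintenance worthwhile in this model.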


Analytic Models
CTMC model [Huang95]
[Figure: two CTMCs side by side.
Model w/o rejuvenation: Robust state (S0), Failure probable state (Sp),
Failed state (Sf); transition rates r2, λ, r1.
Model with rejuvenation: adds a Rejuvenation state (Sr); transition rates
r1, r2, r3, r4, λ]

From this continuous-time Markov chain model, a closed-form expression
can be found for the optimal rejuvenation trigger rate (r4)
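
A four-state chain of this kind can be solved by hand from its balance equations. The sketch below assumes the following reading of the rate labels, which is not confirmed by the slide text: S0 → Sp at rate r2, Sp → Sf at rate λ, Sf → S0 at rate r1, Sp → Sr at rate r4, and Sr → S0 at rate r3. The numeric rates are illustrative assumptions as well.

```python
def steady_state(lam, r1, r2, r3, r4):
    """Steady-state probabilities of S0, Sp, Sf, Sr for the assumed chain.

    Balance equations (flow out = flow in):
        pi_p * (lam + r4) = pi_0 * r2
        pi_f * r1         = pi_p * lam
        pi_r * r3         = pi_p * r4
    Each probability is written as a multiple of pi_p and then normalised.
    """
    w0 = (lam + r4) / r2
    wf = lam / r1
    wr = r4 / r3
    pi_p = 1.0 / (1.0 + w0 + wf + wr)
    return {"S0": w0 * pi_p, "Sp": pi_p, "Sf": wf * pi_p, "Sr": wr * pi_p}


def expected_downtime_hours(probs, horizon_hours=8760.0):
    """Expected time spent in the failed or rejuvenation state over a horizon."""
    return horizon_hours * (probs["Sf"] + probs["Sr"])


if __name__ == "__main__":
    # Illustrative rates per hour (assumptions, not values from [Huang95]).
    lam, r1, r2, r3 = 1.0 / 720.0, 2.0, 1.0 / 240.0, 6.0
    for r4 in (0.0, 0.001, 0.01, 0.1):
        p = steady_state(lam, r1, r2, r3, r4)
        print(f"r4 = {r4:5.3f}  downtime = {expected_downtime_hours(p):6.2f} h/year")
```

Sweeping r4 for a given set of rates, and weighting the two down states by their respective costs, is one way to explore numerically whether and how often rejuvenation pays off.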

Analytic Models
Semi-Markov model [Dohi00]

Relax the assumption of exponentially distributed sojourn times (time-
independent transition rates)
Hence have a semi-Markov model

[Figure: semi-Markov model with states 0, 1, 2 and 3; transition labels:
State change, System Failure, Rejuvenation, Completion of Repair,
Completion of Rejuvenation]

Can find closed-form expression for the optimal (deterministic) time to
rejuvenation trigger


Rejuvenation in Cluster Systems

Cluster System

[Pfister] Collection of independent, self-contained computer
systems working together to provide a more reliable and
powerful system than a single node by itself
Easier scaling to larger systems, high levels of
availability/performance and low management costs

Rejuvenation for Cluster Systems
Motivation

Rejuvenation using the fail-over mechanisms

Long-term benefits in terms of
availability/performance

Continuous operation (possibly at a degraded level)

Practically zero downtime


Motivation

Less disruptive and lower overhead than unplanned
outages

Transparent to user/application

Most current industry initiatives are reactive

Two approaches
Simple time-based (periodic)
Condition-based (only from the “failure-impending” state)


SRN Models

Rejuvenation using the fail-over mechanisms in a rolling fashion

Modeling using SRNs (Stochastic Reward Nets)

Analysis for 2 rejuvenation policies
Simple time-based policy
• All nodes rejuvenated successively at the end of each rejuvenation interval
Condition-based policy
• Nodes rejuvenated only from the “failure-probable” state


SRN Model
Basic Cluster Model


SRN Model
Simple Time-Based Rejuvenation


Model Parameters

Transition Mean time

Tfprob 240 hours
Tnodefail 720 hours
Tnoderepair 30 mins
Tsysrepair 4 hours
Trejuv 10 mins

costnodefail $5000/hour

costnoderejuv $250/hour


Model Measures

Measures Computed

Measure          Reward rate
Unavailability   (#Psysfail == 1) ? 1 : 0
Cost             #Prejuv*costrejuv + #Pnodefail*costnodefail + #Psysfail*costsysfail


Results
Simple Time-Based Rejuvenation

[Figures: results for the 8/1 configuration and the 8/2 configuration]

As the rejuvenation interval increases, rejuvenation is performed less frequently
When the rejuvenation interval is close to zero, the system rejuvenates very frequently, resulting in high cost/downtime
When the rejuvenation interval goes beyond the optimal value, system failures become frequent,
resulting in high cost/downtime

Measuring Performance Variables

Objective
Detection and validation of aging

Periodically monitor and collect data on the attributes
responsible for the “health” of the system

Quantify the effect of aging on system resources
Proposed metric – Estimated time to exhaustion
Proposed metric – Evaluation function using PCA approach
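
An estimated time to exhaustion can be obtained from periodic samples of a free resource by fitting a trend and extrapolating to zero. The sketch below uses a median-of-pairwise-slopes (Sen's slope) estimate, chosen here because it is robust to outliers; whether it matches the cited methodology in every detail is an assumption, as are the function names and the sample data.

```python
import math
from itertools import combinations
from statistics import median


def sen_slope(times, values):
    """Median of all pairwise slopes between samples (robust trend estimate)."""
    slopes = [
        (v2 - v1) / (t2 - t1)
        for (t1, v1), (t2, v2) in combinations(zip(times, values), 2)
        if t2 != t1
    ]
    return median(slopes)


def time_to_exhaustion(times, free_amounts):
    """Time until the free resource reaches zero, from the last sample.

    Returns infinity when the trend is flat or increasing (no depletion).
    """
    slope = sen_slope(times, free_amounts)
    if slope >= 0:
        return math.inf
    return free_amounts[-1] / -slope


if __name__ == "__main__":
    # Hypothetical samples: hours since start, free real memory in MB.
    hours = [0, 1, 2, 3, 4, 5]
    free_mb = [512, 500, 495, 470, 466, 450]
    print(f"trend = {sen_slope(hours, free_mb):.2f} MB/hour")
    print(f"estimated time to exhaustion = {time_to_exhaustion(hours, free_mb):.1f} hours")
```

The pairwise computation is quadratic in the number of samples, which is fine for a sketch but would need windowing or subsampling for months of 10-minute data.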


Measuring Performance Variables
Approaches
Time-based (workload-independent) estimation [Garg98]
Workload-based estimation [Vaidyanathan99]
ARMA/ARX models [Li02]
ALT and ADT techniques [Matias06]
Non-parametric Algorithms [Dohi00]
Non-linear models [Hoffman07]
Principal component Analysis (PCA) and System identification [Jia]
Pattern recognition [Vaidyanathan & Gross]
Threshold-based approaches [Silva09]
Machine Learning Approaches [Alonso10]


Data Collection
Experimental Setup

SNMP-based resource monitoring tool:
Data related to OS resource usage (memory, process table, file table etc.)
and system activity (CPU usage, paging, swapping, NFS, interrupts etc.)
collected for over 3 months at 10 min intervals


Time Plots
Non-parametric Regression Smoothing

Real Memory Free File Table Size

Trend detection: Seasonal Kendall test for trend


IBM xSeries
Software Rejuvenation Agent (SRA)

Implemented in a high-availability clustered
environment

Monitors consumable resources, estimates time to
exhaustion and generates alerts if it falls within the user
notification horizon


IBM xSeries
Software Rejuvenation Agent (SRA)

IBM Director system management tool
– Provides GUI to configure SRA
– Acts upon alerts

Two versions
– Periodic rejuvenation
– Prediction-based rejuvenation


Summary

It is possible to enhance software availability during
operation exploiting environmental diversity

Multiple types of recovery after a software failure can be
judiciously employed: restart, failover to a replica, reboot
and if all else fails repair (patch)


Summary

Software aging is not anecdotal; it is a real-life scientific
phenomenon

Rejuvenation implemented in several special purpose
applications and many general purpose cluster systems


Key References
• Software Rejuvenation: Analysis, Module and Applications, Y. Huang, C. Kintala, N. Kolettis
and N. Fulton, In Proc. FTCS-25, June 1995.
• A Methodology for Detection and Estimation of Software Aging, S. Garg, A. van Moorsel, K.
Vaidyanathan and K. S. Trivedi. Proc. ISSRE 1998.
• Performance and Reliability Evaluation of Passive Replication Schemes in Application Level
Fault Tolerance, S. Garg, Y. Huang, C. M. R. Kintala, K. S. Trivedi, and S. Yajnik. In Proc. FTCS
1999.
• Statistical Non-Parametric Algorithms to Estimate the Optimal Software Rejuvenation
Schedule, T. Dohi, K. Goseva-Popstojanova and K. S. Trivedi, Proc. PRDC 2000.
• Proactive Management of Software Aging, V. Castelli, R. E. Harper, P. Heidelberger, S. W.
Hunter, K. S. Trivedi, K. Vaidyanathan and W. Zeggert, IBM Journal of Research &
Development, March 2001.
• A Comprehensive Model for Software Rejuvenation, K. Vaidyanathan and K. S. Trivedi. IEEE-
TDSC, April-June 2005.
• Analysis of software aging in a web server, M. Grottke, L. Li, K. Vaidyanathan and K. S.
Trivedi, IEEE Trans. Reliability, Sept. 2006.


Key References
• Fighting Bugs: Remove, Retry, Replicate and Rejuvenate, M. Grottke and K. Trivedi, IEEE Computer,
Feb. 2007.
• Availability Modeling of SIP Protocol on IBM WebSphere, K. S. Trivedi, D. Wang, D. J. Hunt, A.
Rindos, W. E. Smith, B. Vashaw, Proc. PRDC 2008.
• Using Accelerated Life Tests to Estimate Time to Software Aging Failure, R. Matias Jr.,
K. S. Trivedi and P. Maciel, Proc. ISSRE 2010.
• Accelerated Degradation Tests Applied to Software Aging Experiments, Rivalino Matias, Jr., K. S.
Trivedi and Paulo J. F. Filho and Pedro A. Barbetta, IEEE Transactions on Reliability, March 2010.
• An Empirical Investigation of Fault Types in Space Mission System Software, M.Grottke, A. P.
Nikora and K. S. Trivedi, Proc. DSN, 2010.
• Software fault mitigation and availability assurance techniques, K. S. Trivedi, M. Grottke, and E.
Andrade. International Journal of System Assurance Engineering and Management, 2011.
• Recovery from Failures due to Mandelbugs in IT Systems, K. Trivedi, R. Mansharamani, D.S. Kim,
M. Grottke, M. Nambiar , Proc. PRDC 2011
• O. Kyas. (2001). Network Troubleshooting, Palo Alto California, Agilent Technologies (book)
• M. Kaaniche and K. Kanoun (1996). Reliability of a Commercial Telecommunications System, ISSRE
1996
• R. Cramp, M. A. Vouk, and W. Jones (1992). On Operational Availability of a Large Software-Based
Telecommunications System, ISSRE 1992

