How to apply software reliability
engineering
Ann Marie Neufelder
SoftRel, LLC
amneufelder@softrel.com
http://www.softrel.com
© SoftRel, LLC 2016. This presentation may not be reprinted in whole or part
without written permission from amneufelder@softrel.com
1
Software is increasing in size, hence its effect
on system reliability is increasing
 The increase in size from the F-16A to the F-35 is just one example [1]
 With increased size comes increased complexity and increased failures due to software, as shown next
[Chart: size in SLOC (source lines of code) of fighter aircraft since 1974]
[1] Delivering Military Software Affordably, Christian Hagen and Jeff Sorenson, Defense AT&L, March-April 2012.
2
These are just a few failure events due to software
Failure event: Several patients suffered radiation overdoses from the Therac-25 equipment in the mid-1980s. [THERAC]
Associated software fault: A race condition combined with ambiguous error messages and missing hardware overrides.
Failure event: AT&T long distance service was down for 9 hours in January 1991. [AT&T]
Associated software fault: An improperly placed "break" statement was introduced into the code while making another change.
Failure event: Ariane 5 explosion in 1996. [ARIAN5]
Associated software fault: An unhandled mismatch between 64-bit and 16-bit formats.
Failure event: NASA Mars Climate Orbiter crash in 1999. [MARS]
Associated software fault: Metric/English unit mismatch. Mars Climate Orbiter was written to take thrust instructions using the metric unit newton (N), while the software on the ground that generated those instructions used the Imperial measure pound-force (lbf).
Failure event: 28 cancer patients were over-radiated in Panama City in 2000. [PANAMA]
Associated software fault: The software was reconfigured in a manner that had not been tested by the manufacturer.
Failure event: On October 8th, 2005, the European Space Agency's CryoSat-1 satellite was lost shortly after launching. [CRYOSAT]
Associated software fault: Flight control system code was missing a required command from the on-board flight control system to the main engine.
Failure event: A rail car fire in a major underground metro system in April 2007. [RAILCAR]
Associated software fault: Missing error detection and recovery by the software.
3
Software reliability timeline
 1962: First recorded system failure due to software.
 1968: The term "software reliability" is invented.
 1970s onward: Many software reliability estimation models are developed. Main obstacle: they can't be used until late in the life cycle. A few proprietary models are also developed.
 The first publicly available model to predict software reliability early in the lifecycle is developed by the USAF Rome Air Development Center with SAIC and Research Triangle Park. Main obstacles: the model is only useful for aircraft and was never updated after 1992.
 SoftRel, LLC develops models based on the RL model but usable on all applications.
 2000s: IEEE 1633 rewritten to be practical.
IEEE 1633 Recommended Practices for Software Reliability
 Chaired by Ann Marie Neufelder, SoftRel, LLC
 Vice-chaired by Martha Wetherholt, NASA WHQ
 Every branch of the DoD, NASA, the NRC, major defense contractors, and the medical device industry participated in development/approval of the document
 Revises the 2008 edition, which was poorly received because it was written for an academic audience
 Document received 100% approval on the first IEEE ballot on 5/24/16
 Document will be formally approved by IEEE on 9/16/16 and released by the end of the year
5
Mapping of IEEE 1633 to available software
reliability tools
Section Contents Tools Available
1,2,3, 4 Overview, definitions and acronyms, Tailoring
guidance
5.1 Planning for software reliability
5.2 Develop a failure modes model – SFMEA,
Software Fault Tree Analysis
Frestimate System
Software Analysis Module,
Software FMEA Toolkit
6.1 Overview of SRE models
5.3, 6.2 Apply SRE during development Frestimate, Software
Reliability Toolkit
5.4, 6.3 Apply SRE during testing Frestimate Estimation
Module
5.5 Support Release decision Frestimate
5.6 Apply SRE in operation Frestimate Estimation
Module
6
Table of contents for this presentation 7
Section Contents
1 Planning for software reliability
2 Develop a failure modes model – SFMEA, Software Fault Tree
Analysis, Root Cause Analysis
3 Overview of SRE models
4 Apply software reliability during development
5 Apply software reliability during testing
6 Support Release decision
7 Apply software reliability in operation
Planning for Software
Reliability
SECTION 1
8
Before using any models it's prudent to do some planning
1. What are the software Line Replaceable Units (LRUs) in your system?
 Today's systems have many software LRUs – not just one
 SRE can be applied to in-house developed software, COTS, FOSS, GFS, and firmware
2. System-specific failure definitions and scoring criteria are an essential first step. The more specific the definitions, the better.
3. Perform an initial risk assessment
 Can the software affect safety?
 How mature is the product and target hardware?
 Is the actual size of the software always bigger than expected or planned?
 Is the actual reliability growth always smaller than planned?
 Are the releases spaced so close together that defects are piling up from one release to the next?
 Is this the very first deployed version of this software for this product?
 Do we have the right people developing the software throughout the development process?
 Is there a key technology change during software development?
9
Establish an initial risk level for the software with regards to reliability
Risks present | Successful release | Mediocre release | Distressed release
No identified risks | 78% | 27% | 0%
Exactly one of these risks | 11% | 64% | 50%
Exactly two of these risks | 11% | 6% | 30%
Exactly three of these risks | 0% | 0% | 10%
Four or more of these risks | 0% | 3% | 10%
10
Distressed – Seriously late, increasing failure rate upon deployment, less than
40% of inherent defects are removed upon release, results in recall or
unplanned maintenance release to fix the defects deployed
Successful – Schedule isn't seriously stalled, 75% of inherent defects are
removed upon release, failure rate is decreasing upon delivery, doesn't result
in unplanned maintenance release
Mediocre – Deployed with 40-75% of the inherent defects removed, causes
schedule delays, eventually the many defects are corrected
Determine SRPP based on risk level
 The "Software Reliability Program Plan" is tailored based on the risk level of the particular software release.
 Defines which Software Reliability Engineering (SRE) tasks are implemented for this program
 e.g. failure mode analysis, predictions, sensitivity analysis, etc.
 The SRPP can be part of the Reliability Plan, part of the Software Development Plan, or a self-standing document
11
Develop a failure modes model –
SFMEA, Software Fault Tree
Analysis
SECTION 2
12
Software FMEA and Software Fault Tree Analysis
[Diagram] The SFMEA works bottom-up: from the development artifacts (requirements, interfaces, design, code, user manuals, installation scripts, and changes to the design and code) to failure modes, which are visible to software engineers, and then up to the resulting events, which are visible to end users. The FTA works in the opposite direction, from events down to failure modes and the artifacts that cause them.
13
General guidance for when to use a
SFMEA versus a SFTA versus both
Selection characteristic SFTA SFMEA Both
Small number of clearly defined top level hazards 
Interest in identifying failures that are due to a combination of
events, including events caused by both software and hardware

Very large or complex system with a lot of code 
The detailed design/code have not been started yet 
The SRS does not describe very well how the software should
handle negative behavior or hazardous events

A symptom is known but not the failure modes or top level
effects

Brand new technology or product. System level hazards not
completely understood

Interest in identifying failure modes and/or single point failures 
The product is mature but the code is suspect 
The personnel available for the analyses have more experience
with the software than with the system

14
Key benefits of Software FMEAs
 Many software systems fail when deployed because the engineers did not consider what the software should NOT do
 SFMEA is one of 2 analyses for identifying the failure space that is so often overlooked
 Useful for early identification of
 Defects that are easier to see when looking at the design or code but difficult to see during testing
 i.e. can be used to improve the efficiency of design or code reviews
 Single point failures due to software
 Defects that cannot be addressed by redundancy or other hardware controls
 Abnormal behavior that might be missing from the requirements or design specifications
 Unwritten assumptions
 Features that need fault handling design
 Addressing one failure mode could mean eliminating several failures
15
Existing SFMEA guidance
Guidance Comments
Mil-Std 1629A Procedures for
Performing a Failure Mode, Effects and
Criticality Analysis, November 24, 1980.
Defines how FMEAs are performed but it
doesn’t discuss software components
MIL-HDBK-338B, Military Handbook:
Electronic Reliability Design Handbook,
October 1, 1998.
Adapted in 1988 to apply to software.
However, the guidance provides only a
few failure modes and a limited example.
There is no discussion of the software
related viewpoints.
“SAE ARP 5580 Recommended Failure
Modes and Effects Analysis (FMEA)
Practices for Non-Automobile
Applications”, July, 2001, Society of
Automotive Engineers.
Introduced the concepts of the various
software viewpoints. Introduced a few
failure modes but examples and
guidance is limited.
“Effective Application of Software Failure
Modes Effects Analysis”, November,
2014, AM Neufelder, produced for
Quanterion, Inc.
Identifies hundreds of software specific
failure modes and root causes, 8 possible
viewpoints and dozens of real world
examples.
16
The process for performing a Software Failure Modes Effects Analysis
[Flowchart] The SFMEA proceeds in four major phases:
1. Prepare the Software FMEA – define the scope, identify resources, and tailor the SFMEA: set ground rules, decide the selection scheme, select viewpoints, identify the riskiest software, gather artifacts, define likelihood and severity, and select the template and tools.
2. Analyze failure modes and root causes – brainstorm/research failure modes, identify applicability and equivalent failure modes, analyze the applicable failure modes, identify root cause(s) for each failure mode, identify consequences, identify local/subsystem/system failure effects, and identify severity and likelihood.
3. Mitigate – identify corrective actions, preventive measures, and compensating provisions, and revise the RPN.
4. Generate a Critical Items List (CIL).
Software has different viewpoints and failure modes than hardware
17
SFMEA viewpoints
Software
viewpoint
Level of architecture applicable
for viewpoint
Failure Modes
Functional The system and software
requirements
The system does not do its required
function or performs a function that it
should not
Interface The interface design The system components aren’t
synchronized or compatible
Detailed The detailed design or code The design and/or code isn’t
implemented to the requirements or
design
Maintenance A change to the design or code The change to the design or code will
cause a new fault in the software
Usability The ability for the software to be
consistent and user friendly
The end user causes a system failure
because of the software interface
Serviceability The ability for the software to be
installed or updated without a
software engineer
The software doesn’t operate because it
isn’t installed or updated properly
Vulnerability The ability for the software to
protect the system from hackers
The software is performing the wrong
functions because it is being controlled
externally. Or sensitive information has
been leaked to the wrong people.
18
Applicability of each of the viewpoints
FMEA When this viewpoint is relevant
Functional Any new system or any time there is a new or updated set of
requirements.
Interface Anytime there are complex hardware-to-software interfaces or
software-to-software interfaces.
Detailed Almost any type of system is applicable. Most useful for
mathematically intensive functions.
Maintenance An older legacy system which is prone to errors whenever
changes are made.
Usability Anytime user misuse can impact the overall system reliability.
Serviceability Any software that is mass distributed or installed in difficult to
service locations.
Vulnerability The software is at risk from hacking or intentional abuse.
19
Failure modes associated with each viewpoint
Failure mode
categories
Description
Functional
Interface
Detailed
Maintenance
Usability
Vulnerability
Serviceability
Faulty functionality The software provides the incorrect functionality or
fails to provide required functionality
X X X
Faulty timing The software or parts of it execute too early or too late
or the software responds too quickly or too sluggishly
X X X
Faulty sequence/
order
A particular event is initiated in the incorrect order or
not at all.
X X X X X
Faulty data Data is corrupted, incorrect, in the incorrect units, etc. X X X X X
Faulty error
detection and/or
recovery
Software fails to detect or recover from a failure in the
system
X X X X X
False alarm Software detects a failure when there is none X X X X X
Faulty
synchronization
The parts of the system aren’t synchronized or
communicating.
X X
Faulty Logic There is complex logic and the software executes the
incorrect response for a certain set of conditions
X X X X
Faulty Algorithms/
Computations
A formula or set of formulas does not work for all
possible inputs
X X X X
20
Failure modes associated with each viewpoint
Failure mode
categories
Description
Functional
Interface
Detailed
Maintenance
Usability
Vulnerability
Serviceability
Memory
management
The software runs out of memory or runs too
slowly
X X X
User makes
mistake
The software fails to prohibit incorrect actions
or inputs
X
User can’t
recover from
mistake
The software fails to recover from incorrect
inputs or actions
X
Faulty user
instructions
The user manual has the incorrect instructions
or is missing instructions needed to operate
the software
X
User misuses or
abuses
An illegal user is abusing system or a legal
user is misusing system
X X
Faulty
Installation
The software installation package installs or
reinstalls the software improperly requiring
either a reinstall or a downgrade
X X
21
Software Fault Tree Analysis
 Why are they used on software?
 When there is an intermittent problem in operation and the root cause cannot be determined
 To identify what the software should NOT be doing, which helps to define the exception handling requirements
 To identify events that are caused by combinations of defects/root causes, such as interactions between HW and SW
 What's different between HW and SW fault trees?
 Mechanically, software fault trees work the same as hardware fault trees.
 The major difference is the types of events and modes that appear on the tree.
 The software FTA should be integrated into the system FTA. Otherwise, interactions between software and hardware won't be analyzed.
22
This is the overview of how to include software in the system FTA
[Flowchart] Plan the SFTA; gather applicable product documents such as requirements and design; brainstorm system failure events; place each event at the top of a tree and describe it in past tense; brainstorm sub-events due to software (see next page); place each sub-event on the tree and describe it in past tense; use the risk/severity to rank mitigation effort or determine the probability of each top level event; revise the applicable requirements or design.
23
The software failure modes and root causes
are the sub-events on the tree
Generic failure
mode Specific software root cause
Faulty
functionality This LRU performed an extraneous function
This LRU failed to execute when required
This LRU is missing a function
This LRU performed a function but not as required
Faulty
sequencing This LRU executed while in the wrong state
This LRU executed out of order
This LRU failed to terminate when required
This LRU terminated prematurely
Faulty timing This LRU executed too early
This LRU executed too late
Faulty data This LRU manipulated data in the wrong unit of measure or scale
This LRU can't handle blank or missing data
This LRU can't handle corrupt data
This LRU data/results are too big
This LRU data or results are too small
24
The software failure modes are
the sub-events on the tree
Generic failure
mode Specific root cause
Faulty error
handling This LRU generated a false alarm
This LRU failed to detect that a failure in the hardware,
system or software has occurred
This LRU detected a system failure but provided an
incorrect recovery
This LRU failed to detect errors in the incoming data,
hardware, software, user or system
Faulty processing This LRU consumed too many resources while executing
This LRU was unable to communicate/interface with the rest of the system
Faulty usability This LRU caused the user to make a mistake
The user made a mistake because of this LRU's user manual
This LRU failed to prevent common human mistakes
This LRU allowed the user to perform functions that they should not perform
This LRU prevented the user from performing functions that they should be
allowed to perform
Faulty
serviceability This LRU installed improperly
This LRU updated improperly
This LRU is the wrong version or is outdated
25
Example of these failure modes on the system fault tree
26
Overview of SRE models
SECTION 3
27
Overview of SRE Models
 Software reliability can be predicted before the code is written, estimated during testing and calculated once the software is fielded
28
Prediction/Assessment models (section 5.3 of IEEE 1633 Recommended Practices for Software Reliability, 2016) – used before code is written:
•Predictions can be incorporated into the system RBD
•Supports planning
•Supports sensitivity analysis
•A few models have been available since 1987 due to expense
Reliability growth models (section 5.4) – used during system level testing or operation:
•Determines when to stop testing
•Validates the prediction
•Less useful than prediction for planning and avoiding problematic releases
•Many models have been developed since the 1970s, of which only a few are useful.
Limitations of each type of modeling
Prediction/Assessment models:
 All are based on historical actual data
 All generate a prediction by calibrating the current project against historical project(s)
 Accuracy depends on
 How similar the historical data is to the current project
 Application type
 Product stability (version 1 versus version 50)
 Capabilities of the development team
 How current the historical data is
 How much historical data exists
Reliability growth models:
 All are based on extrapolating an existing trend into the future
 Accuracy depends on
 Test coverage
 Low test coverage usually results in optimistic results
 How closely the actual trend matches the assumed trend
 i.e. if the model assumes a logarithmic trend, is that the actual trend?
 How closely the model assumptions match actuals
 Defect removal
 Defect independence
29
Apply Software Reliability
during development
SECTION 4
30
Software reliability prediction/assessment
goals
 Allows reliability engineering practitioners to
 Predict any number of SRE metrics for each software LRU well
before the software is developed
 Merge software reliability predictions into the system fault tree
 Merge into the system Reliability Block Diagram (RBD)
 Predict reliability growth needed to reach the system allocation
 Determine, prior to the software being developed, whether the
system allocation will be met
31
Software reliability prediction/assessment
goals
 Allows software and engineering management to
 Benchmark SRE to others in same industry
 Predict probability of late delivery
 Predict improvement scenarios
 Analyze sensitivity between development practices and reliability and
perform tradeoffs
 Identify practices that are effective for improving SRE
 Identify practices that aren’t effective for improving SRE (every moment
spent on an ineffective practice is a moment that’s not spent on an
effective practice)
 Predict optimal spacing between releases so as to
 avoid defect pileup, which directly affects software reliability
 ensure that there is adequate SRE growth across software releases
 Determine how many people are needed to support the software once
deployed
32
Industry approved framework for
early software reliability predictions
33
1. Predict effective size
2. Predict testing or fielded defect density
3. Predict testing or fielded defects
4. Identify defect profile over time
5. Predict failure rate/MTTF during test or operation
6. MTSWR and availability
7. Predict mission duration and reliability
Sensitivity Analysis
This framework has been used for decades. What has changed over the years
are the models available for steps 1, 2 and 4. These models evolve because
software languages, development methods and deployment life cycles have
evolved.
Available Methods for predicting defect density
 Ideally prediction models optimize simplicity and accuracy and are updated
regularly for changes in SW technology
Method Number of
inputs
Comments
SEI CMMi lookup chart
or industry lookup chart*
1 Usually least accurate since there is only 1 input.
Useful for COTS or quick estimate.
Shortcut model* 22 •More accurate than lookup charts
•Questions can be answered by almost anyone
familiar with the project
Rome Laboratory TR-92-
52**
45-212 •Hasn’t been updated in 23 years which in software
world is akin to a millennium
Full-scale models** 98-300 •More accurate than the shortcut model
•Questions require input from software leads,
software testing, software designers
•Fully supports sensitivity analysis
Neufelder model** 149 •Based on Process Grade Factors
8
* These models are recommended in the normative section of the IEEE 1633 Recommended Practices for
Software Reliability, 2016. ** These models are recommended in Annexes of IEEE 1633 Recommended
Practices for Software Reliability, 2016.
Predict any number of SRE metrics for each software LRU
well before the software is developed
Predict availability,
reliability, failure rate,
MTTF, MTBI, MTBCF
for each software LRU
as well as all software
LRUs combined
35
Merge software predictions into
system RBD
36
A particular software LRU is in series
with the particular hardware LRU that it
supports. Several software LRUs such
as COTS, Operating System, firmware,
etc. may be in series with each other
and the hardware
Merge software predictions into
system fault tree
37
Once the predictions
for each software
LRU are complete,
they can be merged
into the system FTA
Predict reliability growth needed to
reach the system allocation
 The predictions are performed over a period of operational time to
allow the practitioner to determine how much growth is needed to
reach a specific objective
 Adding new features in subsequent releases can affect the objective
38
If the allocation for the
software LRUs combined is
3500 hours MTTCF then
the allocation will be met
after about 9 months into
the first release and then
about 4 months into the
next release.
Determine, prior to the software being developed,
whether the system allocation will be met
 The system allocation for the software is met when
 The reliability growth needed to achieve the objective is feasible
with the given plans for
 Features to be implemented and their predicted size
 Practices to be deployed during development and testing
 If the allocation cannot be met, tradeoffs can be performed
such as
 Avoid reinventing the wheel (writing new code when you can
purchase the same functions commercially) so as to decrease the
size of the software to be developed
 Postponing some features to a later version
 Deploying more, smaller, less risky releases instead of bigger, riskier
releases
 Implementing development practices that reduce defect density
39
Benchmark SRE to others in industry, predict probability of late delivery, predict improvement scenarios
40
Complete the assessment and calculate the score. The score maps to a predicted percentile group, a predicted normalized fielded defect density, and a predicted probability of late delivery:
Percentile group | Score | Predicted normalized fielded defect density | Predicted probability of late delivery
World class | 1% | .011 | 10%
Very good | 10% | .060 | 20%
Good | 25% | .112 | 25%
Average | 50% | .205 | 36%
Fair | 75% | .608 | 85%
Poor | 90% | 1.111 | 100%
Distressed | 99% | 2.069 | 100%
If the software development organization could transition to the next percentile group:
•Average defect reduction is about 55%
•Average probability (late) reduction is about 25%
Assessment models provide a means to identify how to transition to the next percentile group.
Predict both defect density and probability of late delivery.
Identify practices that are effective for improving
software reliability
Top ten factors quantitatively associated with better software reliability
1. Software engineers have product/industry domain expertise
2. Software engineers conduct formal white/clear box unit testing
3. Testers start writing test plans before any code is written
4. Management outsources features that aren’t in the organization’s line of business
5. Management avoids outsourcing features that are in organization’s line of business
6. No one skips the requirements, design, unit test or system testing even for small releases
7. Management plans ahead – even for small releases. Most projects are late because of
unscheduled defect fixes from the previous release that weren't planned for
8. Everyone avoids “Big Blobs” - big teams, long milestones - especially when there is a large
project
9. Pictures are used in the requirements and detailed design whenever possible
10. It is defined in writing what the software should NOT do
23
Identify practices that aren’t effective for improving
software reliability
Top ten factors that aren’t quantitatively associated with improved reliability
1. Requirements, design and code reviews that don’t have a defined agenda and criteria
2. Waiting until the code is done to decide how to test it
3. Focusing on style instead of function when doing product reviews
4. Using automated tools before you know how to perform the task manually
5. Too much emphasis on independent Software Quality Assurance organization
6. Too much emphasis on independent software test organization
7. Too much emphasis on the processes and not enough on the techniques and the people
8. Misusing complexity metrics
9. Hiring software engineers based on their experience with a particular language instead of with a
particular industry
10. Using agile development as an excuse not to do the things you don’t want to do
23
Predict optimal spacing between releases to avoid defect pileup
and ensure adequate SRE growth across software releases
 This is an example of the predicted defect discovery profile predicted over the
next 4 releases.
 One can visibly see that defects are predicted to “pile up”. Failure rate is
proportional to defect discovery so when one is ramping up so is the other. The
SRE for version 1 may be within the required range but future releases may not
be.
 Predictions can be used to identify optimal spacing between releases so as to
avoid “defect pileup”.
43
Predict optimal spacing between releases to avoid defect pileup
and ensure adequate SRE growth across software releases
 This is an example of a predicted defect discovery profile that
has been optimized
 There is no pileup across releases
 It supports a defined and repeatable maintenance staff as
discussed next
44
Determine how many people are needed to
support the software once deployed
 If you can predict the defect discovery rate and you know how
many defects can be fixed per month, you can predict how many
people are needed to support the software once deployed
45
Tip: The number one reason why a
software release is late is the previous
project. Neglecting to plan out the
maintenance staffing does not make
the defects go away. But it could
make the next
release late.
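A minimal sketch of this staffing arithmetic, assuming a hypothetical monthly defect discovery forecast and a hypothetical per-person fix capacity (neither value comes from the presentation):

```python
# Illustrative staffing estimate: defects predicted per month divided by the
# number of defects one maintainer can fix per month, rounded up.
import math

def maintenance_staff_needed(defects_per_month, fixes_per_person_month):
    """Whole number of maintainers needed each month to keep up with discoveries."""
    return [math.ceil(d / fixes_per_person_month) for d in defects_per_month]

# Example: predicted defect discoveries for the first six months after release,
# assuming each maintainer can correct about 6 defects per month.
forecast = [18, 15, 12, 9, 6, 4]
print(maintenance_staff_needed(forecast, fixes_per_person_month=6))  # [3, 3, 2, 2, 1, 1]
```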
Apply SRE during testing –
Reliability Growth Models
SECTION 5
46
Overview
 Reliability growth models have been in use since the 1970s for software reliability
 Thanks to the academic community, hundreds of models have been developed, but they
 Have no real roadmap on how or when to use them
 Require data that isn't feasible to collect in a non-academic environment
 Assume that the failure rate is decreasing
 Yield the same or similar results
 Don't have methods to solve for parameters or compute confidence
 This was resolved in the 2016 edition of the IEEE Recommended Practices for Software Reliability, which provides
 An overview of the models
 How to select the model(s)
 When to use them and when not to
 How to use them with an incremental development life cycle
47
Reliability Growth Model framework 48
1. Collect the date of each software failure, its severity, and the accumulated operational hours between failures.
2. Plot the data. Determine if the failure rate is increasing or decreasing. Observe trends.
3. Select the model(s) that best fit the current trend.
4. Compute failure rate, MTBF, MTBCF, reliability and availability.
5. Verify the accuracy against the next actual time to failure as new defects are discovered in testing. Compute the confidence.
6. Estimate the remaining defects and the test hours required to reach an objective, and support the release decision.
Collect data during software system
testing
49
For each day during software
system testing collect:
1. # hours the software was in
operation (by all computers) on
that day (x)
2. # defects were discovered on
that day (f)
n = cumulative defects
t = cumulative hours
Plot the data
 Fault rate (n/t) plotted on the x axis
 Cumulative defects (n) plotted on the y axis
 If the plot has a negative slope then the fault rate is decreasing
These parameters are used by the models:
 Y intercept = estimated inherent defects N0
 X intercept = estimated initial failure rate λ0
 k = 1/|slope|
50
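A minimal sketch of this graphical estimation, assuming hypothetical test data and a simple least-squares line fit (the sample numbers are illustrative, not from the presentation):

```python
# Fit a straight line to cumulative defects (n) versus fault rate (n/t).
# The y-intercept estimates N0, the x-intercept estimates lambda0, and
# k = 1/|slope|.
import numpy as np

n = np.array([5, 12, 22, 35, 50, 63, 72, 80])             # cumulative defects
t = np.array([40, 100, 200, 350, 550, 800, 1100, 1450])   # cumulative test hours
fault_rate = n / t

slope, intercept = np.polyfit(fault_rate, n, 1)  # n = slope*(n/t) + intercept
N0 = intercept                  # estimated inherent defects (y-intercept)
lambda0 = -intercept / slope    # x-intercept, where n = 0
k = 1.0 / abs(slope)

print(f"N0={N0:.1f}, lambda0={lambda0:.4f}/hr, k={k:.6f}")
```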
Example of increasing fault rate
 In this example, the fault rate is increasing. This means
that most of the models can’t be used.
 This is a common situation during the early part of
software testing
51
Example fault rate that’s increasing and
then decreasing
 In this example, the fault rate increased initially
and then decreased steadily. In this case the most
recent data can be used to extrapolate the future
trend.
52
Selecting the reliability growth model(s)
Model name Inherent
defect count
Effort
required (1
low, 3 high)
Can be used when
exact time of failure
unknown
Increasing fault rate
Weibull Finite/not
fixed
3 Yes
Peaked fault rate
Shooman Constant Defect
Removal Rate Model
Finite/fixed 1 Yes
Decreasing fault rate
Shooman Constant Defect
Removal Rate Model
Finite/fixed 1 Yes
Linearly Decreasing
General exponential models
including:
 Time based (Goel-Okumoto)
 Defect based (Musa Basic)
Finite/fixed 2 Yes
Shooman Linearly Decreasing
Model
Finite/fixed 1 Yes
Non-Linearly Decreasing
Logarithmic time and defect
based models (Musa)
Infinite 1 Yes
Shooman Exponentially
Decreasing Model
Finite/fixed 3 Yes
Log-logistic Finite/fixed 3 Yes
Geometric Infinite 3 No
Increasing and then decreasing
Yamada (Delayed)
S-shaped
Infinite 3 Yes
Weibull Finite/not
fixed
3 Yes
53
1.Eliminate
models that
don’t fit the
observed trend.
2. Use all
applicable
models or select
the one with
least effort.
3. Some models
expect exact
time of failure
which might not
be easy to
collect in
testing.
Bolded models are in normative section of IEEE 1633 Recommended Practices for Software Reliability, 2016
Compute failure rate, MTTF with the 2 simplest models
Model | Estimated remaining defects | Estimated current failure rate | Estimated current MTBF | Estimated current reliability
Defect based general exponential | N0 - n | λ(n) = λ0 (1 - n/N0) | The inverse of the estimated failure rate | e^(-λ(n) * mission time)
Time based general exponential | | λ(t) = N0 k e^(-kt) | | e^(-λ(t) * mission time)
54
N0, λ0, k estimated graphically as shown earlier
n – cumulative defects discovered in testing to date
t – cumulative hours of operation in testing to date
Mission time – how long the software must operate to complete one mission or cycle
Both models are in the normative section of the IEEE 1633 Recommended Practices
for Software Reliability
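A minimal sketch of the two models in the table, written as plain functions; the function names are illustrative and not part of the standard:

```python
# The two general exponential models from the table above.
import math

def defect_based_failure_rate(lambda0, N0, n):
    """Defect-based general exponential: lambda(n) = lambda0 * (1 - n/N0)."""
    return lambda0 * (1.0 - n / N0)

def time_based_failure_rate(N0, k, t):
    """Time-based general exponential: lambda(t) = N0 * k * exp(-k*t)."""
    return N0 * k * math.exp(-k * t)

def mission_reliability(failure_rate, mission_time):
    """R = exp(-lambda * mission time)."""
    return math.exp(-failure_rate * mission_time)
```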
Example with real data 55
n = 84 defects discovered to date
t = 1628 operational test hours to date
A plot of cumulative faults (n) versus fault rate (n/t) yields the least-squares fit y = -857.97x + 117.77, giving:
Y intercept: N0 = 117.77
X intercept: λ0 = 0.137226
k = λ0/N0 = 0.137226/117.77 = 0.001165
Example 56
The two models have different results because the first model assumes that the
failure rate only changes when a fault occurs. The second model accounts for time
spent without a fault. If the software is operating for extended periods of time
without failure, the second model will take that into account.
Model | Estimated remaining defects | Estimated current failure rate in failures per hour | Estimated current reliability
Defect based general exponential | 118 - 84 = 34 (71% of defects are estimated to be removed) | λ(84) = .137226*(1 - 84/117.77) = .03935 | e^(-.03935 * 8) = .730
Time based general exponential | | λ(1628) = 117.77*.001165*e^(-.001165*1628) = .02059 | e^(-.02059 * 8) = .84813
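A minimal sketch reproducing the worked example above with the parameters already estimated (N0 = 117.77, λ0 = 0.137226, k = 0.001165, n = 84, t = 1628, 8-hour mission):

```python
# Reproduce the example: both general exponential models with the same data.
import math

N0, lambda0, k = 117.77, 0.137226, 0.001165
n, t, mission = 84, 1628, 8

lam_defect = lambda0 * (1 - n / N0)        # ~0.0394 failures/hour
lam_time = N0 * k * math.exp(-k * t)       # ~0.0206 failures/hour

print(f"remaining defects ~ {N0 - n:.0f}")
print(f"defect-based: lambda={lam_defect:.5f}, R(8h)={math.exp(-lam_defect*mission):.3f}")
print(f"time-based:   lambda={lam_time:.5f}, R(8h)={math.exp(-lam_time*mission):.3f}")
```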
Determine the relative accuracy of the
model
When the next fault is encountered, the relative accuracy of the last estimation can be
computed. During testing, the trend might change. Hence, it’s prudent to determine
which model currently has the lowest relative error as well as the model with the lowest
relative error of all data points.
In the above example, the time based general exponential model has the lowest relative
error overall and with the most recent data.
The logarithmic models have the highest relative error for this dataset. This is expected as
the fault rate plot doesn’t indicate a logarithmic fault rate trend.
57
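A minimal sketch of the relative-error check described above; the predicted and actual times to next failure below are hypothetical:

```python
# Compare each model's predicted time to next failure against the actual value.
def relative_error(predicted_hours, actual_hours):
    return abs(predicted_hours - actual_hours) / actual_hours

predictions = {"defect-based exponential": 25.4,
               "time-based exponential": 31.0,
               "logarithmic": 55.0}
actual = 29.0  # hours until the next observed failure (hypothetical)

for model, pred in predictions.items():
    print(f"{model}: relative error = {relative_error(pred, actual):.1%}")
```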
Compute the confidence of the
estimates
 The confidence of the failure rate estimates is determined by the
confidence in the estimates of the parameters. The more data points,
the better the confidence in the estimated values of N0 and λ0.
58
Forecast
 Any of the reliability growth models can forecast into the future a specific
number of test hours.
 This forecast is useful if you want to know what the failure rate will be on a
specific milestone date based on the expected number of test hours per
day between now and then.
59
Estimate remaining defects or test
hours to reach an objective
 You can determine how many more defects need to be found to reach a specific
MTBF objective
 You can also determine how many more test hours are needed to reach that
objective (assuming that the discovered defects are corrected)
60
In this example,
between 6 and 7
defects need to be
found and removed
to meet the objective.
Based on the current
trend it will take
about 767 hours to
find those defects.
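A minimal sketch of working backwards from an objective, using the two general exponential models and the example parameters above; the 60-hour MTBF objective is hypothetical and does not correspond to the figures quoted on the slide:

```python
# Solve each model for the point at which a failure-rate objective is reached.
import math

N0, lambda0, k = 117.77, 0.137226, 0.001165
n, t = 84, 1628
lambda_obj = 1 / 60.0   # objective MTBF of 60 hours (assumed)

# Defect-based model: lambda(n) = lambda0*(1 - n/N0)  ->  solve for n
n_obj = N0 * (1 - lambda_obj / lambda0)
print(f"additional defects to find/remove: {max(0.0, n_obj - n):.1f}")

# Time-based model: lambda(t) = N0*k*exp(-k*t)  ->  solve for t
t_obj = -math.log(lambda_obj / (N0 * k)) / k
print(f"additional test hours: {max(0.0, t_obj - t):.0f}")
```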
Support release decision
SECTION 6
61
Support release decision
The reliability growth models are only one part of the decision
process. The degree of test coverage is the other key part.
 If the requirements, design, stress cases and features have not been covered in testing, then the software should not be deployed regardless of the results of the models.
 Otherwise, if the fault rate is increasing, the software should not be deployed.
 Otherwise, if the residual (remaining) defects are more than 25% of the total inherent defects, the software should not be deployed.
 Otherwise, if the residual or remaining defects are more than the support staff can handle, the software should not be deployed.
 Otherwise, if the objective failure rate or MTBF has been met, the software may be deployed if all other metrics required for deployment are met.
62
Apply software reliability
during operation
SECTION 7
63
Apply SRE in operation
 Once the software is deployed the actual failure rate
is computed directly from
 Actual failures reported during some period of time (such
as a month)
 Actual operational hours the software was used during that
period of time across all users and installed systems
 The reliability growth models can be used with
operational data as well as testing data
64
Conclusions
 Software reliability can be predicted before the code is written using
prediction/assessment models
 It can be applied to COTS software as well as custom software
 A variety of metrics can be predicted
 The predictions can be used for sensitivity analysis and defect reduction
 Software reliability can be estimated during testing using the
reliability growth models
 Used to determine when to stop testing
 Used to quantify effort required to reach an objective
 Used to quantify staffing required to support the software once
deployed
65
Frequently Asked Questions
 Can I predict the software reliability when there is an agile or
incremental software development lifecycle?
 Yes, your options are
 You can use the models for each internal increment and then combine the results
of each internal increment to yield a prediction for each field release
 You can add up the code size predicted for each increment and do a prediction
for the field release based on sum of all increment sizes
 How often are the predictions updated during development?
 Whenever the size estimates have a major change or whenever
there is a major review
 The surveys are not updated once complete unless it is known that
something on the survey has changed
 i.e. there is a major change in staffing, tools or other resource during
development, etc.
66
Frequently Asked Questions
 Which prediction models are preferred?
 The ones that you can complete accurately and the ones that
reflect your application type
 If you can’t answer most of the questions in a particular mode
survey then you shouldn’t use that model
 If the application lookup charts don’t have your application type
you shouldn’t use them
67
Frequently Asked Questions
 What are the tools available for SRE?
68
Capability Tools Available Link
Software FMEA Software FMEA Toolkit http://www.softrel.com/5SFMEAToolkit.html
Apply SRE
during
development
Frestimate, Software
Reliability Toolkit
http://www.softrel.com/1About_Frestimate.html
http://www.softrel.com/4SWReliabilityToolkit.html
http://www.softrel.com/4About_SW_Predictions.html
http://www.softrel.com/2About_Assessment.html
Merge
predictions into
an RBD or fault
tree
Frestimate System
Software Analysis
Module
http://www.softrel.com/1Frestimate_Components.html
Sensitivity
analysis
Basic capabilities in
Frestimate standard
edition, advanced
capabilities in
Frestimate Manager’s
edition
http://www.softrel.com/3About_Sensitivity_Analysis.html
http://www.softrel.com/1CostModule.html
Apply SRE
during testing or
operation
Frestimate Estimation
Module (WhenToStop)
http://www.softrel.com/1WhenToStop_Module.html
Support Release
decision
Frestimate Standard or
Manager’s edition
http://www.softrel.com/1Frestimate_Components.html
References
 [1] “The Cold Hard Truth About Reliable Software”, A. Neufelder,
SoftRel, LLC, 2014
 [2] Four references:
a) J. McCall, W. Randell, J. Dunham, L. Lauterbach, Software Reliability,
Measurement, and Testing Software Reliability and Test Integration RL-
TR-92-52, Rome Laboratory, Rome, NY, 1992
b) "System and Software Reliability Assurance Notebook", P. Lakey, Boeing
Corp., A. Neufelder, produced for Rome Laboratory, 1997.
c) Section 8 of MIL-HDBK-338B, 1 October 1998
d) Keene, Dr. Samuel, Cole, G.F. “Gerry”, “Reliability Growth of Fielded
Software”, Reliability Review, Vol 14, March 1994.
69
Related Terms
 Error
 Related to human mistakes made while developing the software
 Ex: Human forgets that b may approach 0 in algorithm c = a/b
 Fault or defect
 Related to the design or code
 Ex: This code is implemented without exception handling “c = a/b;”
 Defect rate is from developer’s perspective
 Defects measured/predicted during testing or operation
 Defect density = defects/normalized size
 Failure
 An event
 Ex: During execution the conditions are so that the value of b approaches 0
and the software crashes or hangs
 Failure rate is from system or end user’s perspective
 KSLOC
 1000 source lines of code – common measure of software size
70
Backup Slides
ADDITIONAL DETAILS ABOUT MODELS
71
Prediction/Assessment steps
72
Industry approved framework for
early software reliability predictions
73
1. Predict effective size
2. Predict testing or fielded defect density
3. Predict testing or fielded defects
4. Identify defect profile over time
5. Predict failure rate/MTTF during test or operation
6. MTSWR and availability
7. Predict mission duration and reliability
Sensitivity Analysis
This framework has been used for decades. What has changed over the years
are the models available for steps 1, 2 and 4. These models evolve because
software languages, development methods and deployment life cycles have
evolved.
1. Predict size
If everything else is equal, more code means more defects
 For in house software
 Predict effective size of new, modified and reused code using best
available industry method
 For COTS software (assuming vendor can’t provide
effective size estimates)
 Determine installed application size in KB (only EXEs and DLLs)
 Convert application size to KSLOC using industry conversion
 Assess reuse effectiveness by using default multiplier of 1%
 Accounts for fact that COTS has been fielded to multiple sites
7
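A minimal sketch of the COTS sizing step above. The KB-to-KSLOC conversion factor is industry- and language-dependent and is not given in the presentation; the value below is a placeholder the user must replace with their own industry conversion:

```python
# Effective KSLOC for a COTS component: convert installed EXE/DLL size to
# KSLOC, then apply the default 1% reuse-effectiveness multiplier.
def cots_effective_ksloc(installed_kb, ksloc_per_kb, reuse_multiplier=0.01):
    return installed_kb * ksloc_per_kb * reuse_multiplier

# Example: 5,000 KB of EXEs/DLLs with an assumed conversion of 0.05 KSLOC/KB.
print(f"{cots_effective_ksloc(5000, ksloc_per_kb=0.05):.1f} effective KSLOC")
```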
2. Predict defect density
 Ideally prediction models optimize simplicity and accuracy and are updated
regularly for changes in SW technology
Method Number of
inputs
Comments
SEI CMMi lookup chart
or industry lookup chart*
1 Usually least accurate since there is only 1 input.
Useful for COTS or quick estimate.
Shortcut model* 22 •More accurate than lookup charts
•Questions can be answered by almost anyone
familiar with the project
Rome Laboratory TR-92-
52**
45-212 •Hasn’t been updated in 23 years which in software
world is akin to a millennium
Full-scale models** 98-300 •More accurate than the shortcut model
•Questions require input from software leads,
software testing, software designers
•Fully supports sensitivity analysis
Neufelder model** 149 •Based on Process Grade Factors
8
* These models are recommended in the normative section of the IEEE 1633 Recommended Practices for
Software Reliability, 2016. ** These models are recommended in Annexes of IEEE 1633 Recommended
Practices for Software Reliability, 2016.
3. Predict testing or fielded defects
 Defects can be predicted as follows
 Testing defect density * Effective size = Defects predicted to be found during testing
 Fielded defect density * Effective size = Defects predicted to be found in operation
[Chart: defects over the life of the version, split into defects predicted during system testing and defects predicted after system testing]
12
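A minimal sketch of this step; the defect densities and effective size below are hypothetical inputs, not values from the presentation:

```python
# Predicted defects = defect density * effective size.
effective_ksloc = 120.0
testing_defect_density = 0.9   # defects per KSLOC found during system testing (assumed)
fielded_defect_density = 0.3   # defects per KSLOC escaping to the field (assumed)

testing_defects = testing_defect_density * effective_ksloc
fielded_defects = fielded_defect_density * effective_ksloc
print(f"testing defects: {testing_defects:.0f}, fielded defects: {fielded_defects:.0f}")
```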
4. Predict shape of defect discovery profile
[Chart: defect discovery profile over calendar time, spanning development, test and operation, with the typical start of system testing and the delivery milestone marked]
 Growth rate (Q) is derived from the slope. Default = 4.5. Ranges from 3 to 10.
 The growth period TF is the time until no more residual defects occur, usually 3 * the average time between releases. Default = 48.
 An exponential formula is solved as an array to yield the area under the profile, which equals the N predicted defects:
Defects(month i) = N * (exp(-Q*(i-1)/TF) - exp(-Q*i/TF))
13
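A minimal sketch of this profile formula using the document's defaults (Q = 4.5, TF = 48 months); the total defect count N below is hypothetical:

```python
# Defects(month i) = N * (exp(-Q*(i-1)/TF) - exp(-Q*i/TF))
import math

def defect_profile(N, months, Q=4.5, TF=48):
    """Predicted defects discovered in each month i = 1..months."""
    return [N * (math.exp(-Q * (i - 1) / TF) - math.exp(-Q * i / TF))
            for i in range(1, months + 1)]

profile = defect_profile(N=100, months=12)   # N = 100 predicted fielded defects (assumed)
print([round(d, 1) for d in profile])
# With these defaults, roughly two-thirds of the defects surface in the first year.
print(f"discovered in first year: {sum(profile):.1f} of 100")
```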
Rate at which defects result in observed
failures (growth rate)
Faster growth rate and shorter growth period – Example:
Software is shipped to millions of end users at the same time
and each of them uses the software differently.
Slower growth rate and longer growth
period – Example: Software deliveries
are staged such that the possible
inputs/operational profile is constrained
and predictable
By default, the growth rate will be in this range
14
5. Use defect discovery profile to predict failure rate/MTTF
 Dividing the defect profile by the duty cycle profile yields a prediction of failure rate as shown next
 Ti = duty cycle for month i - how much the software is operated during some period of calendar time. Ex:
 If software is operating 24/7 -> duty cycle is 730 hours per month
 If software operates during normal working hours -> duty cycle is 176 hours per month
 MTTF_i = Ti / Defectprofile_i
 MTTCF_i = Ti / (%severe * Defectprofile_i)
 % severe = % of all fielded defects that are predicted to impact availability
15
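A minimal sketch of this step; the monthly defect profile, the 24/7 duty cycle, and the %severe value are illustrative assumptions:

```python
# MTTF_i = Ti / Defectprofile_i ; MTTCF_i = Ti / (%severe * Defectprofile_i)
defect_profile = [9.0, 8.1, 7.4, 6.7, 6.0]   # predicted defects per month (assumed)
duty_cycle_hours = 730                        # 24/7 operation
pct_severe = 0.10                             # 10% of fielded defects impact availability (assumed)

mttf = [duty_cycle_hours / d for d in defect_profile]
mttcf = [duty_cycle_hours / (pct_severe * d) for d in defect_profile]
print([round(x) for x in mttf])    # e.g. ~81 hours in month 1
print([round(x) for x in mttcf])   # e.g. ~811 hours in month 1
```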
6. Predict MTSWR (Mean Time To Software Restore) and Availability
 Needed to predict availability
 For hardware, MTTR is used. For software, MTSWR is used.
 MTSWR = weighted average of the time for the applicable restore actions, weighted by the expected number of defects that are associated with each restore action
 Availability profile over the growth period: Availability_i = MTTCF_i / (MTTCF_i + MTSWR)
 In the below example, MTSWR is a weighted average of the two rows
Operational restore action | Average restore time | Percentage weight
Correct the software | 40 hours | .01
Restart or reboot | 15 minutes | .99
16
7. Predict mission time and reliability
 Reliability profile over the growth period: R_i = exp(-mission time / MTTCF_i)
 Mission time = how long the software will take to perform a specific operation or mission
 Not to be confused with duty cycle or testing time
 Example: A typical dishwasher cycle is 45 minutes. The software is not executing outside of this time, so reliability is computed for the 45 minute cycle.
81
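A minimal sketch of steps 6 and 7 together, using the restore-action table above; the MTTCF value and the 45-minute mission are illustrative:

```python
# MTSWR as a weighted average of restore times, availability = MTTCF/(MTTCF+MTSWR),
# and mission reliability R = exp(-mission_time / MTTCF).
import math

restore_actions = [(40.0, 0.01),   # correct the software: 40 hours, 1% of defects
                   (0.25, 0.99)]   # restart or reboot: 15 minutes, 99% of defects
mtswr = sum(time * weight for time, weight in restore_actions)   # ~0.65 hours

mttcf = 811.0           # hours, carried over from the previous step (illustrative)
availability = mttcf / (mttcf + mtswr)
mission_time = 0.75     # 45-minute dishwasher cycle from the example above
reliability = math.exp(-mission_time / mttcf)

print(f"MTSWR={mtswr:.2f} h, availability={availability:.4f}, R(45 min)={reliability:.4f}")
```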
Confidence Bounds and prediction error
 Software prediction confidence bounds are a function of the error in the parameters below
[Chart: nominal, lower bound and upper bound MTTF versus months after delivery]
Parameter | Contribution to prediction error
Size prediction error due to scope change | Until code is complete, this will usually have the largest relative error
Size prediction error due to error in sizing estimate (scope unchanged) | Minimized with use of tools, historical data
Defect density prediction error | Minimized by validating model inputs
Growth rate error | Not usually a large source of error
18
Technology Features of Apollo HDD Machine, Its Technical Specification with C...Apollo Techno Industries Pvt Ltd
 
ASME BPVC 2023 Section I para leer y entender
ASME BPVC 2023 Section I para leer y entenderASME BPVC 2023 Section I para leer y entender
ASME BPVC 2023 Section I para leer y entenderjuancarlos286641
 
sdfsadopkjpiosufoiasdoifjasldkjfl a asldkjflaskdjflkjsdsdf
sdfsadopkjpiosufoiasdoifjasldkjfl a asldkjflaskdjflkjsdsdfsdfsadopkjpiosufoiasdoifjasldkjfl a asldkjflaskdjflkjsdsdf
sdfsadopkjpiosufoiasdoifjasldkjfl a asldkjflaskdjflkjsdsdfJulia Kaye
 
Design of Clutches and Brakes in Design of Machine Elements.pptx
Design of Clutches and Brakes in Design of Machine Elements.pptxDesign of Clutches and Brakes in Design of Machine Elements.pptx
Design of Clutches and Brakes in Design of Machine Elements.pptxYogeshKumarKJMIT
 
Guardians and Glitches: Navigating the Duality of Gen AI in AppSec
Guardians and Glitches: Navigating the Duality of Gen AI in AppSecGuardians and Glitches: Navigating the Duality of Gen AI in AppSec
Guardians and Glitches: Navigating the Duality of Gen AI in AppSecTrupti Shiralkar, CISSP
 
GENERAL CONDITIONS FOR CONTRACTS OF CIVIL ENGINEERING WORKS
GENERAL CONDITIONS  FOR  CONTRACTS OF CIVIL ENGINEERING WORKS GENERAL CONDITIONS  FOR  CONTRACTS OF CIVIL ENGINEERING WORKS
GENERAL CONDITIONS FOR CONTRACTS OF CIVIL ENGINEERING WORKS Bahzad5
 
Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...
Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...
Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...Amil baba
 
Mohs Scale of Hardness, Hardness Scale.pptx
Mohs Scale of Hardness, Hardness Scale.pptxMohs Scale of Hardness, Hardness Scale.pptx
Mohs Scale of Hardness, Hardness Scale.pptxKISHAN KUMAR
 
Phase noise transfer functions.pptx
Phase noise transfer      functions.pptxPhase noise transfer      functions.pptx
Phase noise transfer functions.pptxSaiGouthamSunkara
 

Último (20)

Vertical- Machining - Center - VMC -LMW-Machine-Tool-Division.pptx
Vertical- Machining - Center - VMC -LMW-Machine-Tool-Division.pptxVertical- Machining - Center - VMC -LMW-Machine-Tool-Division.pptx
Vertical- Machining - Center - VMC -LMW-Machine-Tool-Division.pptx
 
Lecture 4 .pdf
Lecture 4                              .pdfLecture 4                              .pdf
Lecture 4 .pdf
 
How to Write a Good Scientific Paper.pdf
How to Write a Good Scientific Paper.pdfHow to Write a Good Scientific Paper.pdf
How to Write a Good Scientific Paper.pdf
 
Nodal seismic construction requirements.pptx
Nodal seismic construction requirements.pptxNodal seismic construction requirements.pptx
Nodal seismic construction requirements.pptx
 
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...
Strategies of Urban Morphologyfor Improving Outdoor Thermal Comfort and Susta...
 
Transforming Process Safety Management: Challenges, Benefits, and Transition ...
Transforming Process Safety Management: Challenges, Benefits, and Transition ...Transforming Process Safety Management: Challenges, Benefits, and Transition ...
Transforming Process Safety Management: Challenges, Benefits, and Transition ...
 
Présentation IIRB 2024 Marine Cordonnier.pdf
Présentation IIRB 2024 Marine Cordonnier.pdfPrésentation IIRB 2024 Marine Cordonnier.pdf
Présentation IIRB 2024 Marine Cordonnier.pdf
 
cloud computing notes for anna university syllabus
cloud computing notes for anna university syllabuscloud computing notes for anna university syllabus
cloud computing notes for anna university syllabus
 
Best-NO1 Best Rohani Amil In Lahore Kala Ilam In Lahore Kala Jadu Amil In Lah...
Best-NO1 Best Rohani Amil In Lahore Kala Ilam In Lahore Kala Jadu Amil In Lah...Best-NO1 Best Rohani Amil In Lahore Kala Ilam In Lahore Kala Jadu Amil In Lah...
Best-NO1 Best Rohani Amil In Lahore Kala Ilam In Lahore Kala Jadu Amil In Lah...
 
Technology Features of Apollo HDD Machine, Its Technical Specification with C...
Technology Features of Apollo HDD Machine, Its Technical Specification with C...Technology Features of Apollo HDD Machine, Its Technical Specification with C...
Technology Features of Apollo HDD Machine, Its Technical Specification with C...
 
Présentation IIRB 2024 Chloe Dufrane.pdf
Présentation IIRB 2024 Chloe Dufrane.pdfPrésentation IIRB 2024 Chloe Dufrane.pdf
Présentation IIRB 2024 Chloe Dufrane.pdf
 
ASME BPVC 2023 Section I para leer y entender
ASME BPVC 2023 Section I para leer y entenderASME BPVC 2023 Section I para leer y entender
ASME BPVC 2023 Section I para leer y entender
 
sdfsadopkjpiosufoiasdoifjasldkjfl a asldkjflaskdjflkjsdsdf
sdfsadopkjpiosufoiasdoifjasldkjfl a asldkjflaskdjflkjsdsdfsdfsadopkjpiosufoiasdoifjasldkjfl a asldkjflaskdjflkjsdsdf
sdfsadopkjpiosufoiasdoifjasldkjfl a asldkjflaskdjflkjsdsdf
 
Design of Clutches and Brakes in Design of Machine Elements.pptx
Design of Clutches and Brakes in Design of Machine Elements.pptxDesign of Clutches and Brakes in Design of Machine Elements.pptx
Design of Clutches and Brakes in Design of Machine Elements.pptx
 
Guardians and Glitches: Navigating the Duality of Gen AI in AppSec
Guardians and Glitches: Navigating the Duality of Gen AI in AppSecGuardians and Glitches: Navigating the Duality of Gen AI in AppSec
Guardians and Glitches: Navigating the Duality of Gen AI in AppSec
 
GENERAL CONDITIONS FOR CONTRACTS OF CIVIL ENGINEERING WORKS
GENERAL CONDITIONS  FOR  CONTRACTS OF CIVIL ENGINEERING WORKS GENERAL CONDITIONS  FOR  CONTRACTS OF CIVIL ENGINEERING WORKS
GENERAL CONDITIONS FOR CONTRACTS OF CIVIL ENGINEERING WORKS
 
計劃趕得上變化
計劃趕得上變化計劃趕得上變化
計劃趕得上變化
 
Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...
Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...
Popular-NO1 Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialis...
 
Mohs Scale of Hardness, Hardness Scale.pptx
Mohs Scale of Hardness, Hardness Scale.pptxMohs Scale of Hardness, Hardness Scale.pptx
Mohs Scale of Hardness, Hardness Scale.pptx
 
Phase noise transfer functions.pptx
Phase noise transfer      functions.pptxPhase noise transfer      functions.pptx
Phase noise transfer functions.pptx
 

Overview of software reliability engineering

  • 6. Mapping of IEEE 1633 to available software reliability tools
  Section | Contents | Tools Available
  1, 2, 3, 4 | Overview, definitions and acronyms, tailoring guidance |
  5.1 | Planning for software reliability |
  5.2 | Develop a failure modes model – SFMEA, Software Fault Tree Analysis | Frestimate System Software Analysis Module, Software FMEA Toolkit
  6.1 | Overview of SRE models |
  5.3, 6.2 | Apply SRE during development | Frestimate, Software Reliability Toolkit
  5.4, 6.3 | Apply SRE during testing | Frestimate Estimation Module
  5.5 | Support release decision | Frestimate
  5.6 | Apply SRE in operation | Frestimate Estimation Module
  • 7. Table of contents for this presentation 7 Section Contents 1 Planning for software reliability 2 Develop a failure modes model – SFMEA, Software Fault Tree Analysis, Root Cause Analysis 3 Overview of SRE models 4 Apply software reliability during development 5 Apply software reliability during testing 6 Support Release decision 7 Apply software reliability in operation
  • 9. Before using any models it’s prudent to do some planning 1. What are the software Line Replaceable Units in your system?  Today’s systems have many software LRUs – not just one  SRE can be applied to in-house developed software, COTS, FOSS, GFS, and firmware 2. System specific failure definition and scoring criteria is an essential first step. The more specific the definitions, the better. 3. Perform an initial risk assessment  Can the software effect safety?  How mature is the product and target hardware?  Is the actual size of the software always bigger than expected or planned?  Is the actual reliability growth always smaller than planned?  Are the releases spaced so close together that defects are piling up from one release to the next?  Is this the very first deployed version of this software for this product?  Do we have the right people developing the software throughout the development process?  Is there a key technology change during software development? 9
  • 10. Establish an initial risk level for the software with regards to reliability
  Identified risks | Successful release | Mediocre release | Distressed release
  No identified risks | 78% | 27% | 0%
  Exactly one of these risks | 11% | 64% | 50%
  Exactly two of these risks | 11% | 6% | 30%
  Exactly three of these risks | 0% | 0% | 10%
  Four or more of these risks | 0% | 3% | 10%
  Distressed – Seriously late, increasing failure rate upon deployment, less than 40% of inherent defects are removed upon release; results in a recall or unplanned maintenance release to fix the deployed defects.
  Successful – Schedule isn't seriously stalled, 75% of inherent defects are removed upon release, failure rate is decreasing upon delivery; doesn't result in an unplanned maintenance release.
  Mediocre – Deployed with 40-75% of the inherent defects removed, causes schedule delays; eventually the many defects are corrected.
  • 11. Determine SRPP based on risk level  “Software Reliability Program Plan” tailored based on the risk level of the particular software release.  Defines which Software Reliability Engineering (SRE) tasks are implemented for this program  i.e. failure mode analysis, predictions, sensitivity analysis, etc.  SRPP can be part of the Reliability Plan or part of the Software Development Plan or a self standing document 11
  • 12. Develop a failure modes model – SFMEA, Software Fault Tree Analysis SECTION 2 12
  • 13. Software FMEA and Software Fault Tree Analysis
  The SFMEA works from the bottom up: it starts with the requirements, interfaces, design, code, user manuals, installation scripts, and changes to the design and code (which are visible to software engineers) and works upward to the resulting failure modes and events (which are visible to end users).
  The FTA works in the opposite direction, from the failure events down toward the underlying causes.
  • 14. General guidance for when to use a SFMEA versus a SFTA versus both Selection characteristic SFTA SFMEA Both Small number of clearly defined top level hazards  Interest in identifying failures that are due to a combination of events, including events caused by both software and hardware  Very large or complex system with a lot of code  The detailed design/code have not been started yet  The SRS does not describe very well how the software should handle negative behavior or hazardous events  A symptom is known but not the failure modes or top level effects  Brand new technology or product. System level hazards not completely understood  Interest in identifying failure modes and/or single point failures  The product is mature but the code is suspect  The personnel available for the analyses have more experience with the software than with the system  14
  • 15. Key benefits of Software FMEAs  Many software systems fail when deployed because the engineers did not consider what the software should “Not” do  SFMEA is one of 2 analyses for identifying the failure space so often overlooked  Useful for early identification of  Defects that are easier to see when looking at the design or code but difficult to see during testing  i.e. can be used to improve the efficiency of design or code reviews  Single point failures due to software  Defects that cannot be addressed by redundancy or other hardware controls  Abnormal behavior that might be missing from the requirements or design specifications  Unwritten assumptions  Features that need fault handling design  Addressing one failure mode could mean eliminating several failures 15
  • 16. Existing SFMEA guidance Guidance Comments Mil-Std 1629A Procedures for Performing a Failure Mode, Effects and Criticality Analysis, November 24, 1980. Defines how FMEAs are performed but it doesn’t discuss software components MIL-HDBK-338B, Military Handbook: Electronic Reliability Design Handbook, October 1, 1998. Adapted in 1988 to apply to software. However, the guidance provides only a few failure modes and a limited example. There is no discussion of the software related viewpoints. “SAE ARP 5580 Recommended Failure Modes and Effects Analysis (FMEA) Practices for Non-Automobile Applications”, July, 2001, Society of Automotive Engineers. Introduced the concepts of the various software viewpoints. Introduced a few failure modes but examples and guidance is limited. “Effective Application of Software Failure Modes Effects Analysis”, November, 2014, AM Neufelder, produced for Quanterion, Inc. Identifies hundreds of software specific failure modes and root causes, 8 possible viewpoints and dozens of real world examples. 16
  • 17. The process for performing a Software Failure Modes Effects Analysis Generate CIL Mitigate Analyze failure modes and root causes Prepare the Software FMEA Identify resources Brainstorm/ research failure modes Identify equivalent failure modes Identify consequences Identify local/ subsystem/ system failure effects Identify severity and likelihood Identify corrective actions Identify preventive measures Identify compensating provisions Analyze applicable failure modes Identify root cause(s) for each failure mode Generate a Critical Items List (CIL) Identify applicability Set ground rules Select viewpoints Identify riskiest software Gather artifacts Define likelihood and severity Select template and tools Revise RPN Decide selection scheme Define scope Identify resources Tailor the SFMEA Software has different viewpoints and failure modes than hardware 17
  • 18. SFMEA viewpoints
  Software viewpoint | Level of architecture applicable for viewpoint | Failure modes
  Functional | The system and software requirements | The system does not do its required function or performs a function that it should not
  Interface | The interface design | The system components aren’t synchronized or compatible
  Detailed | The detailed design or code | The design and/or code isn’t implemented to the requirements or design
  Maintenance | A change to the design or code | The change to the design or code will cause a new fault in the software
  Usability | The ability for the software to be consistent and user friendly | The end user causes a system failure because of the software interface
  Serviceability | The ability for the software to be installed or updated without a software engineer | The software doesn’t operate because it isn’t installed or updated properly
  Vulnerability | The ability for the software to protect the system from hackers | The software is performing the wrong functions because it is being controlled externally, or sensitive information has been leaked to the wrong people
  • 19. Applicability of each of the viewpoints FMEA When this viewpoint is relevant Functional Any new system or any time there is a new or updated set of requirements. Interface Anytime there is complex hardware and software interfaces or software to software interfaces. Detailed Almost any type of system is applicable. Most useful for mathematically intensive functions. Maintenance An older legacy system which is prone to errors whenever changes are made. Usability Anytime user misuse can impact the overall system reliability. Serviceability Any software that is mass distributed or installed in difficult to service locations. Vulnerability The software is at risk from hacking or intentional abuse. 19
  • 20. Failure modes associated with each viewpoint Failure mode categories Description Functional Interface Detailed Maintenance Usability Vulnerability Serviceability Faulty functionality The software provides the incorrect functionality or fails to provide required functionality X X X Faulty timing The software or parts of it execute too early or too late or the software responds too quickly or too sluggishly X X X Faulty sequence/ order A particular event is initiated in the incorrect order or not at all. X X X X X Faulty data Data is corrupted, incorrect, in the incorrect units, etc. X X X X X Faulty error detection and/or recovery Software fails to detect or recover from a failure in the system X X X X X False alarm Software detects a failure when there is none X X X X X Faulty synchronization The parts of the system aren’t synchronized or communicating. X X Faulty Logic There is complex logic and the software executes the incorrect response for a certain set of conditions X X X X Faulty Algorithms/ Computations A formula or set of formulas does not work for all possible inputs X X X X 20
  • 21. Failure modes associated with each viewpoint Failure mode categories Description Functional Interface Detailed Maintenance Usability Vulnerability Serviceability Memory management The software runs out of memory or runs too slowly X X X User makes mistake The software fails to prohibit incorrect actions or inputs X User can’t recover from mistake The software fails to recover from incorrect inputs or actions X Faulty user instructions The user manual has the incorrect instructions or is missing instructions needed to operate the software X User misuses or abuses An illegal user is abusing system or a legal user is misusing system X X Faulty Installation The software installation package installs or reinstalls the software improperly requiring either a reinstall or a downgrade X X 21
  • 22. Software Fault Tree Analysis  Why are they used on software?  When there is an intermittent problem in operation and the root cause cannot be determined  To identify what the software should NOT be doing which helps to define the exception handling requirements  To identify events that are caused by combinations of defects/root causes such as interactions between HW and SW  What’s different between HW and SW fault trees?  Mechanically, software fault trees work the same as hardware fault trees.  The major difference is the types of events and modes that appear on the tree.  The software FTA should be integrated into the system FTA. Otherwise, interactions between software and hardware won’t be analyzed. 22
  • 23. This is the overview of how to include software in the system FTA Plan the SFTA Brainstorm System Failure Events Place each event at the top of a tree and describe in past tense Brainstorm sub-events due to software (see next page) Place event on tree and describe in past tense Use the risk/severity to rank mitigation effort or Determine probability of each top level event Revise the applicable Requirements or design Gather Applicable Product Documents such as requirements and design 23
  • 24. The software failure modes and root causes are the sub-events on the tree
  Generic failure mode | Specific software root cause
  Faulty functionality | This LRU performed an extraneous function; This LRU failed to execute when required; This LRU is missing a function; This LRU performed a function but not as required
  Faulty sequencing | This LRU executed while in the wrong state; This LRU executed out of order; This LRU failed to terminate when required; This LRU terminated prematurely
  Faulty timing | This LRU executed too early; This LRU executed too late
  Faulty data | This LRU manipulates data in the wrong unit of measure or scale; This LRU can’t handle blank or missing data; This LRU can’t handle corrupt data; This LRU data/results are too big; This LRU data or results are too small
  • 25. The software failure modes are the sub-events on the tree
  Generic failure mode | Specific root cause
  Faulty error handling | This LRU generated a false alarm; This LRU failed to detect that a failure in the hardware, system or software has occurred; This LRU detected a system failure but provided an incorrect recovery; This LRU failed to detect errors in the incoming data, hardware, software, user or system
  Faulty processing | This LRU consumed too many resources while executing; This LRU was unable to communicate/interface with the rest of the system
  Faulty usability | This LRU caused the user to make a mistake; The user made a mistake because of this LRU's user manual; This LRU failed to prevent common human mistakes; This LRU allowed the user to perform functions that they should not perform; This LRU prevented the user from performing functions that they should be allowed to perform
  Faulty serviceability | This LRU installed improperly; This LRU updated improperly; This LRU is the wrong version or is outdated
  • 26. Example of these failure modes on the system fault tree 26
  • 27. Overview of SRE models SECTION 3 27
  • 28. Overview of SRE Models  Software reliability can be predicted before the code is written, estimated during testing and calculated once the software is fielded 28 Prediction/ Assessment Reliability Growth Models Used before code is written •Predictions can be incorporated into the system RBD •Supports planning •Supports sensitivity analysis •A few models have been available since 1987 due to expense Used during system level testing or operation •Determines when to stop testing •Validates prediction •Less useful than prediction for planning and avoiding problematic releases •Many models have been developed since 1970s of which only a few are useful. Section of IEEE 1633 Recommended Practices for Software Reliability, 2016 5.3 5.4
  • 29. Limitations of each type of modeling  All are based on historical actual data  All generate a prediction by calibrating current project against historical project(s)  Accuracy depends on  How similar historical data is to current project  Application type  Product stability (version 1 versus version 50)  Capabilities of the development team  How current the historical data is  How much historical data exists  All are based on extrapolating an existing trend into the future  Accuracy depends on  Test coverage  Low test coverage usually results in optimistic results  How closely actual trend matches assumed trend  i.e. if model assumes a logarithmic trend is that the actual trend?  How closely the model assumptions match actual  Defect removal  Defect independence 29 PREDICTION/ASSESSMENT MODELS RELIABILITY GROWTH MODELS
  • 30. Apply Software Reliability during development SECTION 4 30
  • 31. Software reliability prediction/assessment goals  Allows reliability engineering practitioners to  Predict any number of SRE metrics for each software LRU well before the software is developed  Merge software reliability predictions into the system fault tree  Merge into the system Reliability Block Diagram (RBD)  Predict reliability growth needed to reach the system allocation  Determine, prior to the software being developed, whether the system allocation will be met 31
  • 32. Software reliability prediction/assessment goals  Allows software and engineering management to  Benchmark SRE to others in same industry  Predict probability of late delivery  Predict improvement scenarios  Analyze sensitivity between development practices and reliability and perform tradeoffs  Identify practices that are effective for improving SRE  Identify practices that aren’t effective for improving SRE (every moment spent on an ineffective practice is a moment that’s not spent on an effective practice)  Predict optimal spacing between releases so as to  avoid defect pileup which directly affects software reliability  ensure that there is adequate SRE growth across software releases  Determine how many people are needed to support the software once deployed 32
  • 33. Industry approved framework for early software reliability predictions
  1. Predict effective size
  2. Predict testing or fielded defect density
  3. Predict testing or fielded defects
  4. Identify defect profile over time
  5. Predict failure rate/MTTF during test or operation
  6. MTSWR and availability
  7. Predict mission duration and reliability
  Sensitivity analysis
  This framework has been used for decades. What has changed over the years are the models available for steps 1, 2 and 4. These models evolve because software languages, development methods and deployment life cycles have evolved.
  • 34. Available Methods for predicting defect density  Ideally prediction models optimize simplicity and accuracy and are updated regularly for changes in SW technology
  Method | Number of inputs | Comments
  SEI CMMi lookup chart or industry lookup chart* | 1 | Usually least accurate since there is only 1 input. Useful for COTS or quick estimate.
  Shortcut model* | 22 | More accurate than lookup charts. Questions can be answered by almost anyone familiar with the project.
  Rome Laboratory TR-92-52** | 45-212 | Hasn’t been updated in 23 years which in software world is akin to a millennium
  Full-scale models** | 98-300 | More accurate than the shortcut model. Questions require input from software leads, software testing, software designers. Fully supports sensitivity analysis.
  Neufelder model** | 149 | Based on Process Grade Factors
  * These models are recommended in the normative section of the IEEE 1633 Recommended Practices for Software Reliability, 2016.
  ** These models are recommended in Annexes of IEEE 1633 Recommended Practices for Software Reliability, 2016.
  • 35. Predict any number of SRE metrics for each software LRU well before the software is developed Predict availability, reliability, failure rate, MTTF, MTBI, MTBCF for each software LRU as well as all software LRUs combined 35
  • 36. Merge software predictions into system RBD 36 A particular software LRU is in series with the particular hardware LRU that it supports. Several software LRUs such as COTS, Operating System, firmware, etc. may be in series with each other and the hardware
  • 37. Merge software predictions into system fault tree 37 Once the predictions for each software LRU are complete, they can be merged into the system FTA
  • 38. Predict reliability growth needed to reach the system allocation  The predictions are performed over a period of operational time to allow the practitioner to determine how much growth is needed to reach a specific objective  Adding new features in subsequent releases can affect the objective. If the allocation for the software LRUs combined is 3500 hours MTTCF then the allocation will be met after about 9 months into the first release and then about 4 months into the next release.
  • 39. Determine, prior to the software being developed, whether the system allocation will be met  The system allocation for the software is met when  The reliability growth needed to achieve the objective is feasible with the given plans for  Features to be implemented and their predicted size  Practices to be deployed during development and testing  If the allocation can not be met tradeoffs can be performed such as  Avoid reinventing the wheel (writing new code when you can purchase the same functions commercially) so as to decrease the size of the software to be developed  Postponing some features to a later version  Deploying more smaller less risky releases instead of bigger more risky releases  Implementing development practices that reduce defect density 39
  • 40. Benchmark SRE to others in industry, Predict probability of late delivery, Predict improvement scenarios
  Complete assessment and calculate score. Predict both defect density and probability of late delivery.
  Predicted Percentile Group | Score | Predicted Normalized Fielded Defect Density | Predicted Probability late delivery
  World class | 1% | .011 | 10%
  Very good | 10% | .060 | 20%
  Good | 25% | .112 | 25%
  Average | 50% | .205 | 36%
  Fair | 75% | .608 | 85%
  Poor | 90% | 1.111 | 100%
  Distressed | 99% | 2.069 | 100%
  If the software development organization could transition to the next percentile group: average defect reduction is about 55%; average probability (late) reduction is about 25%. Assessment models provide a means to identify how to transition to the next percentile group.
  • 41. Identify practices that are effective for improving software reliability Top ten factors quantitatively associated with better software reliability 1. Software engineers have product/industry domain expertise 2. Software engineers conduct formal white/clear box unit testing 3. Testers start writing test plans before any code is written 4. Management outsources features that aren’t in the organization’s line of business 5. Management avoids outsourcing features that are in organization’s line of business 6. No one skips the requirements, design, unit test or system testing even for small releases 7. Management plans ahead – even for small releases. Most projects are late because of unscheduled defect fixes from the previous release (and didn’t plan on it) 8. Everyone avoids “Big Blobs” - big teams, long milestones - especially when there is a large project 9. Pictures are used in the requirements and detailed design whenever possible 10. It is defined in writing what the software should NOT do 23
  • 42. Identify practices that aren’t effective for improving software reliability Top ten factors that aren’t quantitatively associated with improved reliability 1. Requirements, design and code reviews that don’t have a defined agenda and criteria 2. Waiting until the code is done to decide how to test it 3. Focusing on style instead of function when doing product reviews 4. Using automated tools before you know how to perform the task manually 5. Too much emphasis on independent Software Quality Assurance organization 6. Too much emphasis on independent software test organization 7. Too much emphasis on the processes and not enough on the techniques and the people 8. Misusing complexity metrics 9. Hiring software engineers based on their experience with a particular language instead of with a particular industry 10. Using agile development as an excuse not to do the things you don’t want to do 23
  • 43. Predict optimal spacing between releases to avoid defect pileup and ensure adequate SRE growth across software releases  This is an example of the predicted defect discovery profile predicted over the next 4 releases.  One can visibly see that defects are predicted to “pile up”. Failure rate is proportional to defect discovery so when one is ramping up so is the other. The SRE for version 1 may be within the required range but future releases may not be.  Predictions can be used to identify optimal spacing between releases so as to avoid “defect pileup”. 43
  • 44. Predict optimal spacing between releases to avoid defect pileup and ensure adequate SRE growth across software releases  This is an example of a predicted defect discovery profile that has been optimized  There is no pileup across releases  It supports a defined and repeatable maintenance staff as discussed next 44
  • 45. Determine how many people are needed to support the software once deployed  If you can predict the defect discovery rate and you know how many defects can be fixed per month, you can predict how many people are needed to support the software once deployed 45 Tip: The number one reason why a software release is late is the previous project. Neglecting to plan out the maintenance staffing does not make the defects go away. But it could make the next release late.
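  As a rough sketch of this staffing arithmetic (all numbers below are hypothetical, not taken from the presentation):

```python
import math

# Predicted defect discoveries per month divided by the number of defects one
# maintainer can correct per month gives the staff needed (both values assumed).
predicted_discoveries = [18, 15, 12, 9, 7, 5, 4, 3]   # defects found each month (hypothetical)
fixes_per_person_month = 6                            # assumed fix capacity per maintainer

staff_needed = [math.ceil(d / fixes_per_person_month) for d in predicted_discoveries]
print(staff_needed)   # [3, 3, 2, 2, 2, 1, 1, 1]
```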
  • 46. Apply SRE during testing – Reliability Growth Models SECTION 5 46
  • 47. Overview  Reliability growth models have been in use since the 1970s for software reliability  Thanks to the academic community, hundreds of models have been developed; however, most of them  Have no real roadmap on how or when to use them  Require data that isn’t feasible to collect in a non-academic environment  Assume that the failure rate is decreasing  Yield the same or similar results  Don’t have methods to solve for parameters or compute confidence  This was resolved in the 2016 edition of the IEEE Recommended Practices for Software Reliability:  Overview of the models  How to select the model(s)  When to use them and when not to  How to use with incremental development life cycle 47
  • 48. Reliability Growth Model framework 48 1. Collect date of software failure, severity and accumulated operational hours between failures 2. Plot the data. Determine if failure rate is increasing or decreasing. Observe trends. 3. Select the model(s) that best fits the current trend 4. Compute failure rate, MTBF, MTBCF, reliability and availability 5. Verify the accuracy against the next actual time to failure. Compute the confidence. Support release decision New defects discovered in testing 6. Estimate remaining defects and test hours required to reach an objective
  • 49. Collect data during software system testing 49 For each day during software system testing collect: 1. # hours the software was in operation (by all computers) on that day (x) 2. # defects were discovered on that day (f) n = cumulative defects t = cumulative hours
  • 50. Plot the data  Fault rate (n/t) plotted on x axis  Cumulative defects (n) plotted on y axis  If plot has negative slope then fault rate is decreasing These parameters are used by the models:  Y intercept = estimated inherent defects N0  X intercept = estimated initial failure rate l0  K = 1/slope 50
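  A small sketch of this graphical estimation, assuming daily test logs of operating hours and defects (the data values below are illustrative, not from the presentation):

```python
import numpy as np

# Daily test data (hypothetical): operating hours and defects found each day.
hours_per_day   = [8, 8, 16, 16, 24, 24, 24]
defects_per_day = [3, 4, 5, 3, 2, 2, 1]

t = np.cumsum(hours_per_day)      # cumulative operating hours
n = np.cumsum(defects_per_day)    # cumulative defects
fault_rate = n / t                # n/t, plotted on the x axis

# Fit the straight line of cumulative defects versus fault rate; the slope is
# negative when the fault rate is decreasing.
slope, intercept = np.polyfit(fault_rate, n, 1)
N0      = intercept               # y intercept: estimated inherent defects
lambda0 = -intercept / slope      # x intercept: estimated initial failure rate
k       = 1.0 / abs(slope)

print(N0, lambda0, k)
```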
  • 51. Example of increasing fault rate  In this example, the fault rate is increasing. This means that most of the models can’t be used.  This is a common situation during the early part of software testing 51
  • 52. Example fault rate that’s increasing and then decreasing  In this example, the fault rate increased initially and then decreased steadily. In this case the most recent data can be used to extrapolate the future trend. 52
  • 53. Selecting the reliability growth model(s)
  Fault rate trend | Model name | Inherent defect count | Effort required (1 low, 3 high) | Can be used when exact time of failure unknown
  Increasing fault rate | Weibull | Finite/not fixed | 3 | Yes
  Peaked fault rate | Shooman Constant Defect Removal Rate Model | Finite/fixed | 1 | Yes
  Decreasing fault rate | Shooman Constant Defect Removal Rate Model | Finite/fixed | 1 | Yes
  Linearly decreasing | General exponential models including time based (Goel-Okumoto) and defect based (Musa Basic) | Finite/fixed | 2 | Yes
  Linearly decreasing | Shooman Linearly Decreasing Model | Finite/fixed | 1 | Yes
  Non-linearly decreasing | Logarithmic time and defect based models (Musa) | Infinite | 1 | Yes
  Non-linearly decreasing | Shooman Exponentially Decreasing Model | Finite/fixed | 3 | Yes
  Non-linearly decreasing | Log-logistic | Finite/fixed | 3 | Yes
  Non-linearly decreasing | Geometric | Infinite | 3 | No
  Increasing and then decreasing | Yamada (Delayed) S-shaped | Infinite | 3 | Yes
  Increasing and then decreasing | Weibull | Finite/not fixed | 3 | Yes
  1. Eliminate models that don’t fit the observed trend. 2. Use all applicable models or select the one with least effort. 3. Some models expect exact time of failure which might not be easy to collect in testing.
  Bolded models are in the normative section of IEEE 1633 Recommended Practices for Software Reliability, 2016.
  • 54. Compute failure rate, MTTF with the 2 simplest models
  Model | Estimated remaining defects | Estimated current failure rate | Estimated current MTBF | Estimated current reliability
  Defect based general exponential | N0 - n | λ(n) = λ0 (1 - n/N0) | The inverse of the estimated failure rate | e^(-λ(n) * mission time)
  Time based general exponential | | λ(t) = N0 k e^(-k t) | | e^(-λ(t) * mission time)
  N0, λ0, k estimated graphically as shown earlier. n – cumulative defects discovered in testing to date. t – cumulative hours of operation in testing to date. Mission time – how long the software must operate to complete one mission or cycle.
  Both models are in the normative section of the IEEE 1633 Recommended Practices for Software Reliability.
  • 55. Example with real data
  [Figure: cumulative faults (n) plotted against fault rate (n/t), with fitted line y = -857.97x + 117.77]
  Y intercept = 117.77, so N0 = 117.77
  X intercept = .137226, so λ0 = .137226
  Slope = 117.77/.137226, so k = .137226/117.77 = .001165
  n = 84 defects discovered to date; t = 1628 operational test hours to date
  • 56. Example
  The two models have different results because the first model assumes that the failure rate only changes when a fault occurs. The second model accounts for time spent without a fault. If the software is operating for extended periods of time without failure, the second model will take that into account.
  Model | Estimated remaining defects | Estimated current failure rate in failures per hour | Estimated current reliability (8 hour mission)
  Defect based general exponential | 118 - 84 = 34 (71% of defects are estimated to be removed) | λ(84) = .137226*(1 - 84/117.77) = .03935 | e^(-.03935 * 8) = .730
  Time based general exponential | | λ(1628) = 117.77*.001165*e^(-.001165*1628) = .02059 | e^(-.02059 * 8) = .84813
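  A quick check of these figures, using the fitted parameters from the previous slide and the eight-hour mission time assumed above:

```python
import math

N0, lam0, k = 117.77, 0.137226, 0.001165   # fitted parameters from the previous slide
n, t, mission = 84, 1628, 8                # defects found, test hours, mission hours

lam_n = lam0 * (1 - n / N0)                # defect-based general exponential
lam_t = N0 * k * math.exp(-k * t)          # time-based general exponential

print(round(lam_n, 5), round(math.exp(-lam_n * mission), 3))   # ~0.03935, ~0.73
print(round(lam_t, 5), round(math.exp(-lam_t * mission), 3))   # ~0.02059, ~0.848
```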
  • 57. Determine the relative accuracy of the model When the next fault is encountered, the relative accuracy of the last estimation can be computed. During testing, the trend might change. Hence, it’s prudent to determine which model currently has the lowest relative error as well as the model with the lowest relative error of all data points. In the above example, the time based general exponential model has the lowest relative error overall and with the most recent data. The logarithmic models have the highest relative error for this dataset. This is expected as the fault rate plot doesn’t indicate a logarithmic fault rate trend. 57
  • 58. Compute the confidence of the estimates  The confidence of the failure rate estimates is determined by the confidence in the estimates of the parameters. The more data points the better the confidence in the estimated values of N0, l0. 58
  • 59. Forecast  Any of the reliability growth models can forecast into the future a specific number of test hours.  This forecast is useful if you want to know what the failure rate will be on a specific milestone date based on the expected number of test hours per day between now and then. 59
  • 60. Estimate remaining defects or test hours to reach an objective  You can determine how many more defects need to be found to reach a specific MTBF objective  You can also determine how many more test hours are needed to reach that objective (assuming that the discovered defects are corrected) 60 In this example, between 6 and 7 defects need to be found and removed to meet the objective. Based on the current trend it will take about 767 hours to find those defects.
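  A sketch of that calculation under the two general exponential models (the objective MTBF below is hypothetical; the fitted parameters are carried over from the earlier example):

```python
import math

N0, lam0, k = 117.77, 0.137226, 0.001165   # fitted parameters (from the earlier example)
n, t = 84, 1628                            # defects and test hours accumulated so far
mtbf_objective = 100.0                     # hypothetical objective, hours per failure
lam_obj = 1.0 / mtbf_objective

# Defect-based model: how many more defects must be removed to reach the objective.
n_obj = N0 * (1 - lam_obj / lam0)          # invert lam(n) = lam0 * (1 - n/N0)
defects_to_remove = n_obj - n

# Time-based model: how many more test hours until lam(t) = N0*k*exp(-k*t) drops to the objective.
t_obj = -math.log(lam_obj / (N0 * k)) / k
hours_remaining = t_obj - t

print(round(defects_to_remove, 1), round(hours_remaining))   # ~25 more defects, ~620 more test hours
```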
  • 62. Support release decision The reliability growth models are only one part of the decision process. The degree of test coverage is the other key part.  If the requirements, design, stress cases and features have not been covered in testing then the software should not be deployed regardless of the results of the models.  Otherwise, if the fault rate is increasing the software should not be deployed.  Otherwise, if the residual or remaining defects are more than 25% of the total predicted (inherent) defects the software should not be deployed.  Otherwise, if the residual or remaining defects are more than the support staff can handle, the software should not be deployed.  Otherwise, if the objective failure rate or MTBF has been met the software may be deployed if all other metrics required for deployment are met. A simple codification of these criteria is sketched below.
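  The sketch below codifies the release checklist above; the thresholds mirror the slide, but the function and argument names are illustrative, not a normative implementation:

```python
def ok_to_release(coverage_complete, fault_rate_increasing, remaining_defects,
                  predicted_total_defects, support_capacity, objective_met):
    """Rough codification of the slide's release criteria (illustrative only)."""
    if not coverage_complete:                               # requirements/design/stress/features not yet tested
        return False
    if fault_rate_increasing:                               # fault rate still climbing
        return False
    if remaining_defects > 0.25 * predicted_total_defects:  # >25% of predicted inherent defects remain
        return False
    if remaining_defects > support_capacity:                # more residual defects than staff can handle
        return False
    return objective_met                                     # objective failure rate/MTBF met

print(ok_to_release(True, False, 20, 118, 40, True))   # True
```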
  • 63. Apply software reliability during operation SECTION 7 63
  • 64. Apply SRE in operation  Once the software is deployed the actual failure rate is computed directly from  Actual failures reported during some period of time (such as a month)  Actual operational hours the software was used during that period of time across all users and installed systems  The reliability growth models can be used with operational data as well as testing data 64
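  For example (the monthly report counts and fleet hours below are hypothetical):

```python
# Observed failure rate from field data for one month (all values hypothetical).
failures_reported = 4            # failures reported by all users in the month
installed_systems = 25           # systems running this software release
hours_per_system  = 360          # operating hours per system in the month

fleet_hours = installed_systems * hours_per_system
failure_rate = failures_reported / fleet_hours     # failures per operating hour
print(failure_rate, 1 / failure_rate)              # ~0.00044 per hour, MTBF ~2250 hours
```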
  • 65. Conclusions  Software reliability can be predicted before the code is written using prediction/assessment models  It can be applied to COTS software as well as custom software  A variety of metrics can be predicted  The predictions can be used for sensitivity analysis and defect reduction  Software reliability can be estimated during testing using the reliability growth models  Used to determine when to stop testing  Used to quantify effort required to reach an objective  Used to quantify staffing required to support the software once deployed 65
  • 66. Frequently Asked Questions  Can I predict the software reliability when there is an agile or incremental software development lifecycle?  Yes, your options are  You can use the models for each internal increment and then combine the results of each internal increment to yield a prediction for each field release  You can add up the code size predicted for each increment and do a prediction for the field release based on sum of all increment sizes  How often are the predictions updated during development?  Whenever the size estimates have a major change or whenever there is a major review  The surveys are not updated once complete unless it is known that something on the survey has changed  i.e. there is a major change in staffing, tools or other resource during development, etc. 66
  • 67. Frequently Asked Questions  Which prediction models are preferred?  The ones that you can complete accurately and the ones that reflect your application type  If you can’t answer most of the questions in a particular model’s survey then you shouldn’t use that model  If the application lookup charts don’t have your application type you shouldn’t use them 67
  • 68. Frequently Asked Questions  What are the tools available for SRE? 68 Capability Tools Available Link Software FMEA Software FMEA Toolkit http://www.softrel.com/5SFMEAToolkit.html Apply SRE during development Frestimate, Software Reliability Toolkit http://www.softrel.com/1About_Frestimate.html http://www.softrel.com/4SWReliabilityToolkit.html http://www.softrel.com/4About_SW_Predictions.html http://www.softrel.com/2About_Assessment.html Merge predictions into an RBD or fault tree Frestimate System Software Analysis Module http://www.softrel.com/1Frestimate_Components.html Sensitivity analysis Basic capabilities in Frestimate standard edition, advanced capabilities in Frestimate Manager’s edition http://www.softrel.com/3About_Sensitivity_Analysis.html http://www.softrel.com/1CostModule.html Apply SRE during testing or operation Frestimate Estimation Module (WhenToStop) http://www.softrel.com/1WhenToStop_Module.html Support Release decision Frestimate Standard or Manager’s edition http://www.softrel.com/1Frestimate_Components.html
  • 69. References  [1] “The Cold Hard Truth About Reliable Software”, A. Neufelder, SoftRel, LLC, 2014  [2]Four references are a) J. McCall, W. Randell, J. Dunham, L. Lauterbach, Software Reliability, Measurement, and Testing Software Reliability and Test Integration RL- TR-92-52, Rome Laboratory, Rome, NY, 1992 b) "System and Software Reliability Assurance Notebook", P. Lakey, Boeing Corp., A. Neufelder, produced for Rome Laboratory, 1997. c) Section 8 of MIL-HDBK-338B, 1 October 1998 d) Keene, Dr. Samuel, Cole, G.F. “Gerry”, “Reliability Growth of Fielded Software”, Reliability Review, Vol 14, March 1994. 69
  • 70. Related Terms  Error  Related to human mistakes made while developing the software  Ex: Human forgets that b may approach 0 in algorithm c = a/b  Fault or defect  Related to the design or code  Ex: This code is implemented without exception handling “c = a/b;”  Defect rate is from developer’s perspective  Defects measured/predicted during testing or operation  Defect density = defects/normalized size  Failure  An event  Ex: During execution the conditions are so that the value of b approaches 0 and the software crashes or hangs  Failure rate is from system or end user’s perspective  KSLOC  1000 source lines of code – common measure of software size 70
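  As a minimal illustration of the error/fault/failure distinction, here is the slide's c = a/b example written out (Python is used for illustration; the guard shown is just one possible handling choice):

```python
# Error (human mistake): the developer forgets that b may approach 0.
# Fault/defect (in the code): no exception handling around the division.
def compute_faulty(a, b):
    return a / b                  # raises ZeroDivisionError when b == 0 -> failure event at runtime

# Defect removed: the exceptional input is detected and handled.
def compute_safe(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        return None               # one possible handling choice; report or log as appropriate
```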
  • 73. Industry approved framework for early software reliability predictions
  1. Predict effective size
  2. Predict testing or fielded defect density
  3. Predict testing or fielded defects
  4. Identify defect profile over time
  5. Predict failure rate/MTTF during test or operation
  6. MTSWR and availability
  7. Predict mission duration and reliability
  Sensitivity analysis
  This framework has been used for decades. What has changed over the years are the models available for steps 1, 2 and 4. These models evolve because software languages, development methods and deployment life cycles have evolved.
  • 74. 1. Predict size If everything else is equal, more code means more defects  For in house software  Predict effective size of new, modified and reused code using best available industry method  For COTS software (assuming vendor can’t provide effective size estimates)  Determine installed application size in KB (only EXEs and DLLs)  Convert application size to KSLOC using industry conversion  Assess reuse effectiveness by using default multiplier of 1%  Accounts for fact that COTS has been fielded to multiple sites 7
  • 75. 2. Predict defect density  Ideally prediction models optimize simplicity and accuracy and are updated regularly for changes in SW technology
  Method | Number of inputs | Comments
  SEI CMMi lookup chart or industry lookup chart* | 1 | Usually least accurate since there is only 1 input. Useful for COTS or quick estimate.
  Shortcut model* | 22 | More accurate than lookup charts. Questions can be answered by almost anyone familiar with the project.
  Rome Laboratory TR-92-52** | 45-212 | Hasn’t been updated in 23 years which in software world is akin to a millennium
  Full-scale models** | 98-300 | More accurate than the shortcut model. Questions require input from software leads, software testing, software designers. Fully supports sensitivity analysis.
  Neufelder model** | 149 | Based on Process Grade Factors
  * These models are recommended in the normative section of the IEEE 1633 Recommended Practices for Software Reliability, 2016.
  ** These models are recommended in Annexes of IEEE 1633 Recommended Practices for Software Reliability, 2016.
  • 76. 3. Predict testing or fielded defects  Defects can be predicted as follows  Testing defect density * Effective size = Defects predicted to be found during testing (Entire yellow area)  Fielded defect density * Effective size = Defects predicted to be found in operation (Entire red area) Defects predicted after system testing Defects predicted during system testing 0 2 4 6 8 10 12 Defects over life of version 12
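  For instance (the defect densities and size below are hypothetical):

```python
# Predicted defects = predicted defect density * predicted effective size (values assumed).
effective_size_ksloc   = 50      # effective KSLOC of new/modified code
testing_defect_density = 1.2     # defects per KSLOC expected to surface in system test
fielded_defect_density = 0.4     # defects per KSLOC expected to surface in operation

print(effective_size_ksloc * testing_defect_density)   # 60 defects predicted during testing
print(effective_size_ksloc * fielded_defect_density)   # 20 defects predicted in the field
```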
  • 77. 4. Predict shape of defect discovery profile
  Growth rate (Q) is derived from the slope. Default = 4.5; ranges from 3 to 10.
  Growth period TF (time until no more residual defects occur) = usually 3 * average time between releases. Default = 48.
  [Figure: defects versus calendar time across development, test, and operation, showing the typical start of system testing and the delivery milestone; total fielded defects (N) = the area under the curve]
  An exponential formula is solved as an array to yield this area:
  Defects(month i) = N * (exp(-Q*(i-1)/TF) - exp(-Q*i/TF))
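  Assuming the profile formula as reconstructed above, a short sketch of the month-by-month profile (Q and TF use the slide's defaults; N is a hypothetical value from step 3):

```python
import math

N  = 20      # predicted fielded defects (hypothetical, e.g. from step 3)
Q  = 4.5     # growth rate (slide default)
TF = 48      # growth period in months (slide default)

profile = [N * (math.exp(-Q * (i - 1) / TF) - math.exp(-Q * i / TF)) for i in range(1, TF + 1)]
print(round(sum(profile), 1))            # the profile sums toward N over the growth period (~19.8 here)
print([round(d, 2) for d in profile[:6]])  # predicted discoveries in the first six months
```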
  • 78. Rate at which defects result in observed failures (growth rate) Faster growth rate and shorter growth period – Example: Software is shipped to millions of end users at the same time and each of them uses the software differently. Slower growth rate and longer growth period – Example: Software deliveries are staged such that the possible inputs/operational profile is constrained and predictable By default, the growth rate will be in this range 14
  • 79. 5. Use defect discovery profile to predict failure rate/MTTF
   Dividing the defect profile by the duty cycle profile yields a prediction of failure rate as shown next
   Ti = duty cycle for month i - how much the software is operated during some period of calendar time. Ex:
   If software is operating 24/7 -> duty cycle is 730 hours per month
   If software operates during normal working hours -> duty cycle is 176 hours per month
  MTTF_i = Ti / DefectProfile_i
  MTTCF_i = Ti / (%severe * DefectProfile_i)
  % severe = % of all fielded defects that are predicted to impact availability
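  Continuing the sketch (the monthly defect value and severity fraction below are assumed, and the formulas are as reconstructed above):

```python
# MTTF_i = T_i / DefectProfile_i ; MTTCF_i = T_i / (%severe * DefectProfile_i)
T_i             = 730     # duty cycle: operating hours in the month (24/7 operation)
defects_month_i = 1.5     # predicted defect discoveries that month (e.g. from the profile sketch)
pct_severe      = 0.10    # assumed fraction of fielded defects that impact availability

mttf_i  = T_i / defects_month_i
mttcf_i = T_i / (pct_severe * defects_month_i)
print(round(mttf_i), round(mttcf_i))    # ~487 hours MTTF, ~4867 hours MTTCF
```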
  • 80. 6. Predict MTSWR (Mean Time To Software Restore) and Availability
   Needed to predict availability
   For hardware, MTTR is used. For software, MTSWR is used.
   MTSWR = average time for the applicable restore actions, weighted by the expected number of defects associated with each restore action
   Availability profile over the growth period: Availability_i = MTTCF_i / (MTTCF_i + MTSWR)
   In the below example, MTSWR is a weighted average of the two rows:
  Operational restore action | Average restore time | Percentage weight
  Correct the software | 40 hours | .01
  Restart or reboot | 15 minutes | .99
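  A short sketch using the restore-action table from this slide; the MTTCF value is the assumed figure from the previous sketch, and the availability formula is as reconstructed above:

```python
# MTSWR = weighted average restore time (from the slide's example table).
restore_actions = [
    (40.0,      0.01),   # correct the software: 40 hours, weight .01
    (15.0 / 60, 0.99),   # restart or reboot: 15 minutes, weight .99
]
mtswr = sum(hours * weight for hours, weight in restore_actions)   # 0.6475 hours

mttcf_i = 4867                                     # hours, from the MTTF/MTTCF sketch above
availability_i = mttcf_i / (mttcf_i + mtswr)
print(round(mtswr, 4), round(availability_i, 6))   # 0.6475, ~0.999867
```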
  • 81. 7. Predict mission time and reliability  Reliability profile over growth period =  Ri= exp(-mission time/ MTTCF i)  Mission time = how long the software will take to perform a specific operation or mission  Not to be confused with duty cycle or testing time  Example: A typical dishwasher cycle is 45 minutes. The software is not executing outside of this time, so reliability is computed for the 45 minute cycle. 81
  • 82. Confidence Bounds and prediction error
  [Figure: nominal, lower bound, and upper bound MTTF plotted against months after delivery]
  Software prediction confidence bounds are a function of:
  Parameter | Contribution to prediction error
  Size prediction error due to scope change | Until code is complete, this will usually have the largest relative error
  Size prediction error due to error in sizing estimate (scope unchanged) | Minimized with use of tools, historical data
  Defect density prediction error | Minimized by validating model inputs
  Growth rate error | Not usually a large source of error

Editor's notes

  1. Over the decades, software has grown exponentially in size as shown in this figure. The size of an average software system makes it very difficult to test thoroughly and completely. Even medium sized software systems have an almost infinite number of possibilities with regards to test paths. Additionally many software failures are related to what the software does NOT do and SHOULD do. These are things that are often not in the test plan because they are not in the software requirements or design documents. The need for software reliability and failure mode analysis only increases as the size and complexity of the software system increases.
  2. Over the last 5 decades there have been many system failures due to software. This page shows just a few of them. For every software-related event that is in the public domain, it is suspected that several more are not in the public domain due to security and confidentiality. The events on the left side are fairly big events, yet the failure modes that caused them are quite often a very small but critical defect in the code.
  3. http://www.softrel.com/1About_Frestimate.html http://www.softrel.com/2About_Assessment.html http://www.softrel.com/3About_Sensitivity_Analysis.html http://www.softrel.com/4About_SW_Predictions.html http://www.softrel.com/5About_SFMEA.html
  4. SRE – Software Reliability Engineering.
  5. An initial SRE risk assessment is useful for ballparking the SRE effort. The initial risk assessment is usually in line with the results of the SRE models and hence can be used as a sanity check for the models.
  6. It helps to know the difference between a software FMEA and a software fault tree analysis. The software FMEA is a bottom up analysis. It starts with the requirements, interfaces, design and code which are very visible to the software engineers and works upwards to the events which will ultimately be visible to the end users or system. The fault tree analysis works in the other direction. It starts with an event that’s visible to the system or end users and works its way down to the requirements, interfaces, design or code that may cause that event. The two analyses ideally meet in the middle.
  7. Ideally the SFMEA and SFTA meet in the middle. The SFMEA uncovers failure events from the bottom up while the SFTA uncovers failure events from the top down. Ideally the 2 approaches meet in the middle so as to minimize overlap.
  8. Simply stated, people often overestimate how many defects, and which types of defects, they can find during software and systems testing. The purpose of the software FMEA is to identify what the software should not do so that the requirements, design, code and test plans can reflect that. It’s normal for human beings to define requirements in positive terms. However, it is often the unexpected events that cause the software and hence the system to fail. This analysis provides a way to identify the negative requirements that will ultimately require fault handling.
  9. The Military Standard on FMEA doesn’t discuss software at all. The military handbook discusses it but doesn’t provide the level of detail required to fully apply the FMEA to software. The SAE guidebook provides more detail but still shows very few software specific failure modes and guidance. This presentation is based on the latest guidebook published by Quanterion, Inc which is dedicated to providing the failure modes, viewpoints and examples needed for any organization developing or acquiring software systems to perform the software FMEA.
  10. The process for doing a software FMEA is similar to that for doing a hardware FMEA. The first step is to prepare the software FMEA. This step includes defining the scope of the software FMEA, identifying the resources needed for the software FMEA and tailoring the software FMEA to the particular needs of the project. The next major step is to analyze the failure modes and root causes. This is where most of the effort is typically spent. Once the applicable software specific failure modes and root causes are identified the consequences on the software and the system are identified for each failure mode and root cause. Then, the corrective action and mitigation for the failure modes is identified. The risk probability number is updated if the failure mode is mitigated or is planned to be mitigated. Finally the failure modes and root causes that are equivalent (if any) are consolidated and a list of Critical Items is generated. At this point the software and hardware critical items are typically merged so as to produce a system wide list of critical items. The CIL will often be used to enrich the existing test plans as well as the existing requirements and design documents. The CIL can also be used as inputs for any existing health monitoring software.
  11. The most common viewpoints are the first five shown on this list. Not all of the viewpoints apply to a particular software product or release. Some apply more than others as shown on the next page.
  12. These are the 8 viewpoints and when they are most applicable. Any time you have a brand new software system, the functional viewpoint will be applicable. The only time the functional viewpoint is not applicable is when the code is being changed but the requirements are not changing. An example of this would be if you have a product that runs on a particular Operating System and you rewrite the code for the product to work exactly the same but on another Operating System. The code will change but not the software requirements. The interface software FMEA is applicable almost all of the time as it focuses on the interfaces between 2 or more software LRUs or a software LRU and a hardware LRU. The only time an interface software FMEA is less applicable is if the software is very small and it has simple interfaces to very stable hardware. The detailed software FMEA is always applicable. If your system is mathematically intensive this viewpoint may be the most productive at identifying failure modes. However, as we will see later, the detailed viewpoint can also be the most time-consuming, so some sampling is almost always required. The maintenance software FMEA is applicable only when the software is in a maintenance phase of its life or if the software is so fragile that any time a change is made to it, a new defect is likely to be introduced. The usability FMEA is most applicable if the user can contribute to a system failure because of the software. The serviceability FMEA applies mostly to software applications that are mass deployed or software applications that are deployed to difficult-to-reach geography. If the installation package doesn't work, that could mean that many end users, or one difficult-to-reach end user, can't operate the software. Vulnerability is applicable to most systems. It is recommended that your organization seek an expert to help with vulnerability. This presentation covers failure modes that affect both reliability and vulnerability. However, this presentation does not cover failure modes related to encryption, etc. The production viewpoint is applicable when there are chronic problems with multiple software releases. The goal is to find out why the organization is not developing reliable software, as opposed to identifying specific requirements, design, code, install scripts, user manuals, or user instructions that can cause the system to fail.
  13. The failure modes for software are different from those for hardware. The next few pages summarize the key failure modes. These failure modes are applicable for several viewpoints. For example, faulty functionality can affect the requirements, detailed design and maintenance actions. Before doing the SFMEA it’s prudent to identify which of the failure modes have been historically frequent for this particular type of system. Note that faulty functionality, faulty timing, faulty sequence/order, faulty error detection and/or recovery, false alarms, faulty synchronization and faulty logic apply to all software systems of any size or industry.
  14. The above shows some more failure modes. Memory management failure modes are typically the most visible when looking at the detailed design or code. Memory failure modes can also result in vulnerability issues. If there have been a considerable number of system failures caused by human beings who are attempting to use the software without malice, then the usability FMEA may be applicable, while the vulnerability FMEA is applicable for malicious users.
  15. The process for the software FTA is no different than for hardware. While doing the system FTA, one merges in the events that can be caused by software. For each event caused by software, there are several possible failure modes that may or may not be applicable for that event.
  16. These are the most common failure modes and root causes for software.
  17. These are also very common. Note that faulty error handling is often an “AND” gate between the software and the hardware failures. The hardware can fail and the software fails to detect that failure. The software can also detect a failure in the hardware but execute the wrong recovery for it. The software can also generate a false alarm. This means that the system, hardware or software is not in a failed state but the software incorrectly determined that it is.
  18. This is the Isograph Reliability Workbench. It can import software fault trees from the Softrel Frestimate System Reliability Module.
  19. The models for predicting defect density range from simple to complex depending on how much data is available. Typically the lookup chart type models are used for COTS software while any of the models are applicable for in house projects.
  20. Note that USAF Rome Laboratory TR92-15 also determined that application domain expertise is the most sensitive factor. See the complete list in “The Cold Hard Facts about Reliable Software”, A. Neufelder, SoftRel, LLC, 2015.
  21. This image was generated with the Frestimate Standard Edition
  22. The models that are in bold are in the normative section of IEEE 1633 Recommended Practices for Software Reliability, 2016.
  23. http://www.softrel.com/3About_Sensitivity_Analysis.html
  24. This chart simplifies the differences between errors, faults and failures. During this class we will be counting either faults or failures depending on what phase of the lifecycle we are measuring and what models we are using to measure reliability.
  25. Estimating source lines of code from object code: Windows and Embedded Control Systems, Les Hatton, CISM, University of Kingston, August 3, 2005. http://www.leshatton.org/Documents/LOC2005.pdf
  26. The models for predicting defect density range from simple to complex depending on how much data is available. Typically the lookup chart type models are used for COTS software while any of the models are applicable for in house projects.
  27. The above illustrates how the N, TF, Q inputs work. Once the testing or deployed defects are predicted, the next step is to extrapolate those defects over time to yield a defect distribution that can be used to predict failure rate and MTTF. This is done via the use of a common exponential model: Defects(month i) = N ( exp(-Q*(i-1)/TF) - exp(-Q*i/TF) ), where N = total predicted deployed defects, Q = growth rate = 6, TF = growth period = 48, and i = month of interest. Notice that the above formula is an array of values starting from i=1, which is the first month after deployment, extending to i=n, which is the last month of growth. Defects are presumed to trend downward because 1) they will be circumvented and therefore not repeated, or 2) they will be corrected in a maintenance release or patch prior to the next major release. Since this is an exponential model, the MTTF(i) is simply the inverse of the failure rate(i).
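As a quick illustration of the array (with an assumed N of 100 deployed defects and the example values Q = 6, TF = 48, which are not from the presentation): Defects(month 1) = 100 × (exp(0) − exp(−6/48)) = 100 × (1 − 0.8825) ≈ 11.7 defects, and Defects(month 2) = 100 × (exp(−6/48) − exp(−12/48)) = 100 × (0.8825 − 0.7788) ≈ 10.4 defects, confirming the downward trend.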
  28. The growth rate and growth period vary as a function of each other. The bigger the growth rate, the faster the software reliability grows or stabilizes. So, the bigger the growth rate, the shorter the growth period and vice versa. For experimental systems in which the hardware is often not available until deployment, the growth rate of the software may be very high. For systems which have a staged delivery process (delivery to only a few customers or end sites at a time or only a few features at a time) the growth rate can be smaller. Note that the growth rate is not a function of the total volume of defects, only how fast those defects are found during operation.
  29. The MTTF predictions take into account defects of every severity level, since the defect density prediction models take into consideration defects of every severity level. However, not every software defect impacts availability. In fact, in industry studies, including one conducted by SoftRel, LLC, only about 1-2% of all software defects will impact availability. Impacting availability means that the system is unavailable or partly unavailable and that there is no immediate circumvention or fix that can be executed during operation. So, the MTTCF (Mean Time To Critical Failure) is computed similarly to the MTTF except that the number of fielded defects (N) is multiplied by the percentage predicted to impact availability. For this prediction, a conservative 2% is being used.
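For example (with illustrative numbers only, not from the presentation): if the duty cycle is 730 hours per month and 10 defects are predicted to surface in a given month, then MTTF = 730 / 10 = 73 hours, while with 2% of defects assumed to impact availability, MTTCF = 730 / (0.02 × 10) = 3,650 hours for that month.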