This is a text version of my presentation at Rational Software Conference 2009. An accompanying video (http://www.youtube.com/watch?v=0ZU28Dma6zw&feature=channel_page) demonstrates one method for generating these values with IBM Rational ClearQuest.
2. IBM Rational Software Conference 2009
Asking Quality Questions
How good was our testing?
How good is our software?
3. Cumulative Defect Removal Efficiency (Simple Method)
“Cumulative defect removal efficiency =

    defects found before release
    ---------------------------------------
    defects found before and after release

By this formula, if 100 defects are found in a program during its entire life -- in both development and in production -- and 90 of the defects are found before release, then the cumulative defect removal efficiency is considered to be 90 percent.”
-- T.C. Jones, IBM Systems Journal, 1978
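Jones' simple method is easy to compute directly. A minimal sketch (Python chosen for illustration; the function name is mine, not from the deck):

```python
def cdre(found_before_release, found_after_release):
    """Cumulative defect removal efficiency, as a percentage."""
    total = found_before_release + found_after_release
    return 100.0 * found_before_release / total

# Jones' example: 90 of 100 lifetime defects found before release.
print(cdre(90, 10))  # 90.0
```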
4. Work-In-Process Defect Removal Efficiency
Work-in-process defect removal efficiency =

    defects found prior to the current test
    -----------------------------------------------
    defects found prior to and in the current test

How good was my testing?
WIP DRE is retrospective.
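The same arithmetic applies stage by stage. A small sketch (Python for illustration; names are mine), using hypothetical counts of a stage that found 2 defects followed by a stage that found 1 more:

```python
def wip_dre(found_prior, found_current):
    """Work-in-process DRE: the share of defects removed by the prior
    test stage, knowable only once the current stage reports its finds."""
    return 100.0 * found_prior / (found_prior + found_current)

# Prior stage found 2, current stage found 1 more: 2/3, about 67%.
print(round(wip_dre(2, 1)))  # 67
```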
5. A DRE by Any Other Name
Defect Removal Effectiveness
Defect Fix Percentage
Defect Detection Effectiveness
Defect Detection Percentage
Defect Detection Rate
6. What are Actual DREs?
-- data from table by Capers Jones, CrossTalk, 2008
7. What are Actual CDREs?
(Distribution of project CDREs across the ranges <80%, 80-85%, 85-90%, 90-95%, 95-99%, and >99%)
-- based on Capers Jones data published 2008 by ITMPI
8. Jones’ Simplifying Assumptions
All detection methods are equivalent
All fixes are good and singular
All defect causes are equivalent *
All defects are equivalent
-- T.C. Jones, IBM Systems Journal, 1978
9. Defect Detection
Team Swan -- Development: Critical X, Major X
Team Dolphin -- Development: Minor X, Cosmetic X
(Minor, Cosmetic, and Inconsequential remain undetected for the Swans; Critical, Major, and Inconsequential for the Dolphins.)
10. Work In Process Calculations
Team Swan -- Development: Critical X, Major X; Acceptance: Minor X (Development WIP DRE 67%)
Team Dolphin -- Development: Minor X, Cosmetic X; Acceptance: Major X (Development WIP DRE 67%)
11. WIP DRE becomes DRE
Team Swan -- Development: Critical X, Major X; Acceptance: Minor X; Production: Cosmetic X (Development WIP DRE 50%, Acceptance WIP DRE 50%)
Team Dolphin -- Development: Minor X, Cosmetic X; Acceptance: Major X; Production: Critical X (Development WIP DRE 50%, Acceptance WIP DRE 50%)
With Production counts, WIP DRE becomes DRE.
12. Cumulative Defect Removal Efficiency (CDRE)
Team Swan -- Development: Critical X, Major X (DRE 50%); Acceptance: Minor X (DRE 50%); Production: Cosmetic X -> CDRE 75%
Team Dolphin -- Development: Minor X, Cosmetic X (DRE 50%); Acceptance: Major X (DRE 50%); Production: Critical X -> CDRE 75%
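The per-stage DREs and the CDRE can be reproduced from the stage counts alone. A sketch (Python for illustration; names are mine), using Team Swan's counts of 2, 1, and 1 defects found in Development, Acceptance, and Production:

```python
# Defects found per stage, in order (Team Swan's counts from the example).
found = {"development": 2, "acceptance": 1, "production": 1}

def stage_dre(stages, stage):
    """DRE of one stage: its finds over finds in that stage and all later ones."""
    names = list(stages)
    i = names.index(stage)
    remaining = sum(stages[n] for n in names[i:])
    return 100.0 * stages[stage] / remaining

def cdre(stages):
    """Cumulative DRE: everything found before production over all finds."""
    total = sum(stages.values())
    return 100.0 * (total - stages["production"]) / total

print(stage_dre(found, "development"))  # 50.0
print(stage_dre(found, "acceptance"))   # 50.0
print(cdre(found))                      # 75.0
```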
13. Are these test results equivalent?
Team Swan -- Development: Critical X, Major X; Acceptance: Minor X; Production: Cosmetic X
Team Dolphin -- Development: Minor X, Cosmetic X; Acceptance: Major X; Production: Critical X
14. Severity Weighting
“Obviously, it is important to measure defect severity levels as well as recording numbers of defects.” -- T. Capers Jones, 2008
15. Weighted Defect Removal Effectiveness (DREw)
Critical x 5
Major x 4
Minor x 3
Cosmetic x 2
Inconsequential x 1
Keep It Simple!
(or use quantified potential business impact)
16. Weighted Defect Removal Effectiveness (DREw)
Team Swan -- Development found: Critical (5), Major (4); found later: Minor (3) -> 9/12 = DREw 75%
Team Dolphin -- Development found: Minor (3), Cosmetic (2); found later: Major (4) -> 5/9 = DREw 56%
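These weighted figures come from the same formula with the 1-through-5 severity weights applied. A sketch (Python for illustration; the function name and dictionary layout are mine):

```python
# Severity weights from the slide: Critical 5 down to Inconsequential 1.
WEIGHTS = {"critical": 5, "major": 4, "minor": 3, "cosmetic": 2, "inconsequential": 1}

def drew(found_this_stage, found_later):
    """Weighted DRE for one stage: weight found here over weight found
    here plus weight found in any later stage."""
    here = sum(WEIGHTS[s] for s in found_this_stage)
    later = sum(WEIGHTS[s] for s in found_later)
    return 100.0 * here / (here + later)

# Team Swan's development test: 9/12 -> 75
print(round(drew(["critical", "major"], ["minor"])))
# Team Dolphin's development test: 5/9 -> 56
print(round(drew(["minor", "cosmetic"], ["major"])))
```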
17. Weighted Defect Removal Effectiveness (DREw)
Team Swan -- Acceptance found: Minor (3); found later: Cosmetic (2) -> 3/5 = DREw 60%
Team Dolphin -- Acceptance found: Major (4); found later: Critical (5) -> 4/9 = DREw 44%
18. Cumulative DREw (CDREw)
Team Swan -- found before release: Critical (5), Major (4), Minor (3); escaped to Production: Cosmetic (2) -> 12/14 = CDREw 86%
Team Dolphin -- found before release: Major (4), Minor (3), Cosmetic (2); escaped to Production: Critical (5) -> 9/14 = CDREw 64%
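The cumulative weighted figures work the same way over the whole release. A sketch (Python for illustration; names are mine):

```python
# Severity weights from the earlier slide.
WEIGHTS = {"critical": 5, "major": 4, "minor": 3, "cosmetic": 2, "inconsequential": 1}

def cdrew(found_before_release, found_in_production):
    """Cumulative weighted DRE: weight removed before release over the
    weight of every defect eventually reported."""
    pre = sum(WEIGHTS[s] for s in found_before_release)
    post = sum(WEIGHTS[s] for s in found_in_production)
    return 100.0 * pre / (pre + post)

# Team Swan let only a cosmetic defect escape: 12/14 -> 86
print(round(cdrew(["critical", "major", "minor"], ["cosmetic"])))
# Team Dolphin let a critical defect escape: 9/14 -> 64
print(round(cdrew(["major", "minor", "cosmetic"], ["critical"])))
```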
19. Why Measure Work-In-Process Testing?
Consistent WIP DRE lends predictive value for product reliability from a stable process.
Consistent (WIP) DREw lends predictive value for product releasability from a stable process.
20. Answering Quality Questions
How good was our testing? How good is our software?
(Matrix of defect counts by severity -- Critical, Major, Minor, Cosmetic, Weighted, Total -- across the stages Dev, Int, QA, Alpha, Beta, Prod)
Editor’s notes
This presentation is focused on only one metric, a modified form of T.C. Jones' Defect Removal Efficiency (see http://www.research.ibm.com/journal/sj/171/ibmsj1701E.pdf).

Outline:
- The quality questions: How good is our software? How good is our testing?
- Jones' original and simplified formulae for DRE and Cumulative DRE.
- How DRE answers the quality questions. Published benchmarks from industry.
- Problems with DRE's simplifying assumption of defect equivalence.
- A simple method for applying defect valence (DRE-W).
- Does weighting make a difference? Examples from actual projects: DRE vs. DRE-W.
- Demo: how to calculate DRE and DRE-W with ClearQuest.
- What _not_ to do with these numbers.

Benefits: the desired learning outcome is for each attendee to:
- adopt an attitude that testing effectiveness is measurable
- understand the method and limitations of DRE
- be able to calculate DRE and weighted DRE (DRE-W)
- appreciate how DRE and DRE-W differ in results
- see how easy it is to generate these metrics from ClearQuest (CQ)
- take away a set of instructions, and a hyperlink, to calculate measures from their own CQ data

Background: I've championed DRE in two companies. What managers and teams do with the results is far more important than the numbers. I developed DRE-W to help counter measurement errors caused by QA over-reporting of cosmetic defects.
How do we know how good we are at designing, executing, and interpreting tests? Can you give me a simple, easily understandable measure to answer that question? How good is our product when it’s still under development? Is it good enough for release? How many found and undiscovered defects still exist? We can’t test quality into a product, so are these two questions at all related? If the questions aren’t related, how could a measure for one tell us anything about the other? And yet, it does! Because these are quality questions. Transition: Ultimately, quality is in the eyes of the customer!
So we count how many defects we found during testing. Then we count how many defects the customer reported after our testing. Apply this simple formula, and that’s how good our overall testing was!

This all began in 1976, when Michael Fagan’s team of hardware engineers burned up some hardware during a test. I’ll take some liberties with the story, but where do you think the name “smoke test” comes from? It wasn’t exactly possible to retest the burning rubble, so the team went back and looked at the blueprints. They found out why it toasted, but also a number of other problems with the design, including problems which they would not have found in the planned tests. So Fagan became an advocate for inspection before testing (it saves hardware) and applied a statistic called Error Detection Efficiency. The idea of inspection before testing was further developed by Capers Jones and became the Cumulative Defect Removal Efficiency formula you see on this slide. This is the simple method; we’ll talk about the simplifications later. But software which fails a “smoke test” isn’t a heap of smoldering rubble, so it can be fixed and retested, again and again. This means that we no longer have to wait on our customers’ bug reports to learn how good our testing has been. Transition: We don’t have to wait on the customer to measure quality.

Once upon a time, hardware cost more than software. So when hardware had mistakes in it, a team sat down to figure out why, and they found more mistakes (paraphrasing Michael Fagan). “Error detection efficiency = errors found by an inspection x 100 / total errors in the product before inspection” -- Michael Fagan, IBM Systems Journal, 1976
We just have to wait on the next test. For any level of defect detection (which may be inspection or test, by developers, QA, or users), we can apply the same basic formula. We need to remember that the value of the present test will not be known until the next test. Some of you may be saying, “Hey, we do that, but we don’t call it DRE.” Transition: So “Defect Removal Efficiency” is a specialized term, but others have rephrased it to meet their needs.
Capers Jones uses the term Defect Removal Efficiency when considering standardized units of software, specifically function point size categorizations. If we measure without reference to the size of the project... If we want to talk in terms business people understand... If we are really focused on what we find, not what we fix... If we measure relative to some other measure... Transition: Regardless of what we call it, how good are we, as an industry, at stopping defects?
Some forms of testing are better at finding defects than others. This is a sampling from a table published in CrossTalk. The light blue centers and large rounded rectangles represent the “normal” DRE for a given test type, and the bar below the User’s normal range shows even more of the variability in the efficiency of User Acceptance testing. But these are based on normalized projects. How many of us have ever worked on a “normal” software project? Are we all at the same level of capability maturity? Are all of our teams the same size? Have we always applied the same series of test types? When we look at the overall effectiveness, we get a better idea of the variability involved. (http://www.stsc.hill.af.mil/crosstalk/2005/04/0504Jones.html) DRE varies significantly by capability maturity and size of project. Transition: How much does it vary?
This is a simple slide, but take a moment to let it speak to you as I try to interpret. Most projects sent software to production with 10-20% of the bugs still undetected. These are results published in April of this year! Remember the question: How good was our testing? Not perfect! We are going to send bugs to production! How many defects did you log before your last delivery? What if your Cumulative Defect Removal Efficiency was 90%? How many defects did you miss? Transition: Before getting too worked up about this, let’s look at some of the assumptions behind this very useful number. After three decades of experience, could we possibly provide better answers to the quality questions? Remember that CDRE is calculated using the simple method. What makes it simple? Curved linear representation of data from T. Capers Jones’ table from the April 2009 ITMPI webinar, “Software Defect Removal.”
Jones’ original paper on DRE was not focused on the statistic; it wasn’t at all concerned with QA. He was interested in ways to improve programmer productivity! So CDRE was just a means to a very different end than we have been considering. Jones documented his assumptions for us, which you see on this slide. The industry DRE numbers you just examined are an empirical refutation of the first assumption. This is a good reason to measure after each set of tests rather than just cumulatively. Bug injection during fixes is also measured by Jones and others; it varies, but sometimes one bug is created for every three destroyed (--need source!!). Remember that our ... Or use the detailed method. “All defects, regardless of source or of origin (whether design problems, coding problems, or some other) are lumped together and counted as the single variable, defects.” “The defect removal efficiencies of all reviews, inspections, tests, and other defect removal operations are lumped together and counted as the single variable, cumulative defect removal efficiency.” Transition: Let’s discuss this last assumption, which we want to play with...
Spend just a few seconds getting comfortable with this bar chart. Team Swan is above the waterline and Team Dolphin is below the line. Each stage of testing is indicated in sequence. Defects are shown with an X where detected. Colors indicate severity levels; the end of the color bar indicates the defect is removed. The blue, inconsequential defect was never found or removed. Transition: Now that we understand the symbols, let’s read from left to right.
Q: Can we determine anything about the effectiveness of my testing or the quality of our product? The tests were useful; they caught some bugs. How good is my software? It looks pretty good for the Dolphins, not so good for the Swans. Remember, DRE is always retrospective. Transition: So let’s look at the next stage of testing.
At the end of Acceptance testing, how good WAS my development testing? How good is my software? The Swans are feeling better about their software; the Dolphins are getting nervous. Transition: And what did our customers tell us after we released the product?
How good was Acceptance testing? The user-reported defects indicate that Acceptance testing eliminated half of the remaining defects, so the DRE is 50%. Remember: each DRE is relative only to the defects remaining. How good is my software? Well, you see. Transition: So what was our Cumulative DRE?
Three of a total of four reported defects were detected and removed before production release, so... How good was my testing? 75% effective (well below industry norms). Transition: Look closely at what the CDRE seems to tell us: our test methods are equally effective? Do you believe that?
The Swans removed all but one cosmetic defect before release. The Dolphins allowed only one defect to slip, but the customer was not happy! The original purpose of Fagan and Jones was to look at practices, not measures. Transition: Capers Jones himself was aware of the effect of simplifying DRE by considering all defects equivalent.
Recall that quality is determined by the customer, and the customer cares about severity. Okay, we care about severity, but how do we factor that into a nice, simple statistic? Transition: So what if all defects are not treated equally? How might we account for severity?
Assign a weight to each of the severity levels in use. Severity levels can be inverted. Severity for test and severity for use are not the same: a defect in production has business impact, while a defect in test has no business impact, so we would have to guess. So I removed one of Jones’ simplifications to substitute one of my own: quantified potential business impact. Orthogonal Defect Classification -- business impact, not ODC impact. The use of triggers to answer the quality questions has been published by Chaar et al. Transition: Assuming just this simple 1-through-5 weighting, what difference would we see in our game tests?
The Swans and Dolphins were not equally effective at discovering weighted defects. (Step through the calculation.) Transition: And our Acceptance test results would also differ.
Transition: What if we were to apply weighting to the Cumulative DRE?
Knowing that one team sent a cosmetic bug to your customers and the other sent a critical bug, are you comfortable with the idea that the Swans were better at testing than the Dolphins?Transition: OK, so how can we calculate DRE and Weighted DRE with ClearQuest?
A study at the Software Engineering Laboratory found that code reading detected about 80 percent more faults per hour than testing (Basili and Selby 1987). Another organization found that it cost six times as much to detect design defects by using testing as by using inspections (Ackerman, Buchwald, and Lewski 1989). A later study at IBM found that only 3.5 staff hours were needed to find each error when using code inspections, whereas 15-25 hours were needed to find each error through testing (Kaplan 1995). -- Steve McConnell, Code Complete, 2nd ed., 2004. Transition: Let’s look at how these numbers answer the quality questions.