Modeling the Effect of Size on Defect Proneness for Open-Source Software
1. PROMISE at ICSE’07
MODELING the EFFECT of SIZE on
DEFECT PRONENESS for OPEN-SOURCE
SOFTWARE
A. Güneş Koru1, Donsong Zhang1, and Hongfang Liu2
1Department of Information Systems
UMBC
Baltimore, MD, USA
2Georgetown Medical Center
Department of Bioinformatics, Biostatistics, and Biomathematics
Georgetown University, Washington, D.C., USA
E-mails: gkoru@umbc.edu, zhangd@umbc.edu, hl224@georgetown.edu
2. UMBC
• UMBC, University of Maryland, Baltimore County (http://umbc.edu/~gkoru)
• Public research university with a focus on graduate education.
• In theory, all campuses belong to the University of Maryland, but in
practice they operate like different universities.
• UMBC is located in Baltimore, in a small suburban neighborhood called
Catonsville. UMBC is not any of the following:
• University of Maryland, College Park
• University of Baltimore (Business school)
• University of Maryland Baltimore (Medical School)
• Hongfang Liu is with Georgetown University, located in Washington,
D.C., and is interested in bioinformatics and health care.
3. Size--Defect Relationship
• Size is perhaps the oldest measure. It is mostly measured in lines of code
(sometimes in function points).
• Several studies found size to be associated with defect count. Earliest:
A linear model in [Akiyama 71].
• Many other measures (e.g. cyclomatic complexity [McCabe 76],
software science measures [Halstead 77]) are also correlated with size.
There is some consensus that these are also size measures [Fenton and
Pfleeger 96].
“Maybe size does not explain everything, but it explains a lot.”
Bojan Cukic, PROMISE 2007
• The functional form of this relationship is still not well understood.
• Commonly, practitioners assume a linear relationship [El Emam 05].
• The only general conclusion is that there is a continuously increasing
relationship between the two [Fenton and Ohlsson 00, El Emam et al.
01].
4. Size--Defect Relationship: Alternative Forms
[Figure: three panels, (a), (b), and (c), each plotting defects against size]
“Things are linear is open to questions”
Tim Menzies, PROMISE 2007
• Implications:
  • (a) Linear: Smaller and larger modules are proportionally equally problematic
  • (b) Quadratic: Larger modules are proportionally more problematic
  • (c) Logarithmic: Smaller modules are proportionally more problematic
• Theoretical and Practical Importance
  • Decomposition
  • Focused quality assurance
  • Functional Enhancements
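To make "proportionally more problematic" concrete, the following minimal sketch compares expected defects per line of code for a small and a large module under each of the three candidate forms. The coefficients and function names are illustrative assumptions and are not taken from the study.

```python
import math

# Hypothetical coefficients, chosen only for illustration (not fitted to data).
def linear(size, a=0.01):
    return a * size

def quadratic(size, a=0.00001):
    return a * size ** 2

def logarithmic(size, a=1.0):
    return a * math.log(size)

def defects_per_loc(form, size):
    """Expected defects divided by size: the proportional defect burden."""
    return form(size) / size

small, large = 100, 1000
for name, form in [("linear", linear),
                   ("quadratic", quadratic),
                   ("logarithmic", logarithmic)]:
    # Print the per-LOC burden of a small module next to that of a large one.
    print(name, defects_per_loc(form, small), defects_per_loc(form, large))
```

Under the linear form the two per-LOC values coincide, under the quadratic form the larger module carries the higher value, and under the logarithmic form the smaller module does.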
5. Why the relationship is still unclear...
• Many earlier studies did not fully explore alternative functional forms or test
whether the deviation from linearity was statistically significant.
• Linear models [Akiyama 71] or correlations [Andersson and Runeson 07] were
found sufficient.
• A study stated that linear models could be good as first approximations and
there was better tool support [Shen 85]
• The number of data points was very limited in the earlier studies (e.g. [Akiyama 71]).
• Deriving models analytically and then fitting data to validate those models [Lipow
82].
• Accepting correlations as a sign of a linear relationship [Schneidewind and
Hoffman 78]. Correlations do not imply proportionality.
• Focus shifted to defect density, with observations of an optimal module size that
minimizes defect density: the U-shaped curve (Goldilocks conjecture) [Withrow 90,
Hatton 97, Hatton 98, etc.]. See [El Emam 02] for a detailed review.
• This approach can mask the plain size--defect relationship and mislead us [El
Emam 02, Fenton and Neil 00, Rosenberg 97].
• The relationship gets more difficult to understand from multivariate and
sophisticated machine learning models (e.g. from neural networks in [Khoshgoftaar 97]).
6. Conventional Approach to
Investigate Size--Defect Relationship
• All these studies share a common characteristic
• A software system is measured at a snapshot time, then the
obtained measurements are associated with the future defect count
(note that this might be pre-release or post-release). For example: [Koru and
Tian 03], [Khoshgoftaar 96].
• Usually, measurement and analysis performed at module level.
• A common problem is the availability of data [Fenton and Ohlsson
02].
• Publicly available Open Source Software (OSS) repositories: Source
code, change data, and defect data [Koru and Tian 04].
7. Challenges with Using Conventional Method in
OSS Context
• Evolutionary aspects of OSS. Continuous and concurrent functional
enhancements, defect fixes, all other changes (perfective, adaptive, etc.) Bazaar
model rather than cathedral model [Raymond 99].
• OSS is usually developed by volunteers, with little planning and no requirements
or design documents; source code is the main artifact [Mockus et al. 00, Mockus
et al. 02].
• Quality assurance activities are not systematic in OSS (see [Zhao and Elbaum 03,
Koru et al. 07]).
• So far, research using conventional approach focused on relatively better
planned, analyzed, designed, and tested closed source products.
• Internal validity problems caused by the dynamic OSS context:
• Deleted classes
• Size changes
• There might be closed source products developed in an evolutionary manner and
vice versa. Such comparisons are outside the scope here (see [Paulson et al.
04]).
8. In this study...
“If developers play with a file, it can change its defect proneness”
Elaine Weyuker, PROMISE 2007
• To gain a better understanding of the size--defect relationship, we
used both
• A novel approach that adopts Cox Proportional Hazards Modeling
with Recurrent Events (Cox Modeling) [Cox 72].
• The data comes from a large-scale long-lived OSS product Mozilla
(http://www.mozilla.org).
• The evolutionary aspects of the Mozilla project were shown in other
studies:
• Gyimóthy et al. [04] found that the size of Mozilla increased
significantly during successive releases.
• Mockus et al. [02] found that there was no particular development
process in Mozilla.
9. In the rest of this presentation...
• Methodology
• Demonstrating the evolutionary aspects of Mozilla
• Cox Modeling
• Data Collection
• Modeling and Results
• Future Work
• Conclusion
11. Cox Modeling
(A non-parametric approach)
• The instantaneous relative risk (hazard) of a defect fix, also called an event,
becomes the response variable. Note that an event can recur for the same class.
• A complete size history is obtained for each class by measuring size at each
change, and corrective changes are marked.
• The time of each change is also noted. At each unique time, the hazard is calculated by
dividing the number of events at that time by the number of classes at risk at that time.
• Hazard function:
$\lambda_i(t) = \lambda_0(t)\, e^{\beta x_i(t)}$ (1)
$\beta$ is the regression coefficient for $x_i(t)$, and $\lambda_0(t)$ is an unspecified non-negative
function of time called the baseline hazard function. It is the instantaneous
hazard of having an event without any covariate effect (i.e., when $\beta = 0$).
• Relative hazard: $e^{\beta (x_j(t) - x_k(t))}$
• Note that the relative hazard depends only on the difference in covariate
values, not on the baseline hazard. This is called the proportional hazards
assumption and needs to be checked.
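As a minimal sketch of the two formulas on this slide, the code below evaluates the hazard for two covariate values and confirms that their ratio does not change with the baseline hazard. The coefficient, covariate values, and baseline values are hypothetical numbers used only for illustration.

```python
import math

def hazard(baseline, beta, x):
    """lambda_i(t) = lambda_0(t) * exp(beta * x_i(t)) at a single time point."""
    return baseline * math.exp(beta * x)

def relative_hazard(beta, x_j, x_k):
    """Hazard of j relative to k: exp(beta * (x_j - x_k)).

    The baseline hazard cancels in the ratio, so it is not an argument.
    """
    return math.exp(beta * (x_j - x_k))

beta = 0.5  # hypothetical regression coefficient
for baseline in (0.01, 0.2):  # two hypothetical baseline hazard values
    ratio = hazard(baseline, beta, 3.0) / hazard(baseline, beta, 1.0)
    print(baseline, ratio, relative_hazard(beta, 3.0, 1.0))
```

The ratio printed for each baseline value agrees with `relative_hazard`, which is what the proportional hazards assumption requires at every time point.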
12. Methodology
• Relative log risk is denoted by f(size) (for the median size, it is set to zero).
• Examine the functional form with cubic spline functions using four knots
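$f(\mathrm{size}) = \beta_0 + \beta_1\,\mathrm{size} + \beta_2 (\mathrm{size} - k_1)_+^3 + \beta_3 (\mathrm{size} - k_2)_+^3 + \beta_4 (\mathrm{size} - k_3)_+^3 + \beta_5 (\mathrm{size} - k_4)_+^3$ (1)
where,
$(\mathrm{size} - k_n)_+ = \begin{cases} \mathrm{size} - k_n, & \text{if } (\mathrm{size} - k_n) > 0 \\ 0, & \text{otherwise} \end{cases}$ (2)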
• Examined the alternative model visually
• Tested whether the deviation from linearity was statistically significant
$H_0 : \beta_2 = \beta_3 = \beta_4 = 0$
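The spline function above can be sketched as follows. The knot positions and coefficient values are hypothetical; the slides do not report the ones used in the study.

```python
def positive_part(value):
    """(v)_+ : v if v > 0, otherwise 0."""
    return value if value > 0 else 0.0

def spline_log_risk(size, betas, knots):
    """f(size) = b0 + b1*size + sum over knots of b * (size - k)_+^3.

    betas = [b0, b1, one coefficient per knot]; knots = [k1, k2, ...].
    """
    b0, b1, *cubic_coefs = betas
    if len(cubic_coefs) != len(knots):
        raise ValueError("need exactly one cubic coefficient per knot")
    total = b0 + b1 * size
    for coef, knot in zip(cubic_coefs, knots):
        total += coef * positive_part(size - knot) ** 3
    return total

# Hypothetical knots (in LOC) and coefficients, for illustration only.
knots = [50, 200, 500, 1500]
linear_only = [0.0, 0.002, 0.0, 0.0, 0.0, 0.0]
print(spline_log_risk(100, linear_only, knots))
```

When every cubic coefficient is zero, the function reduces to a straight line in size, which is the linear case that the significance test compares against.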
13. Methodology - Data Layout and Collection
• We developed PERL scripts to extract source code, analyze CVS changes, and
find whether a class is affected or not.
• (a) What the data would look like if the conventional approach was used.
• (b) Novel approach: Classes added to the system after the Mozilla 1.0
release date were measured until Feb 22, 2006.
• Each change resulted in an observation
• 15,545 observations
• Events were identified by searching the CVS logs for the words ‘bug’, ‘defect’, and
‘fix’. When we sampled 100 logs randomly, we saw that this automated
approach was correct for 98 of them.

(A)
class name   size   defect count
A              75   0
B             250   2
C             300   2
D             600   2
E             800   3
F             220   0
G             300   0
...

(B)
class name   start   end    event   size   state
Y                0     50   0         75   0
Y               50    100   1        200   1
Y              100    200   0        300   1
Z                0    200   1        250   0
Z              200    800   0        180   1
Z              800   1400   1        400   1
Z             1400   1800   0        300   1
...
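The keyword search used to identify events can be sketched as below. This is written in Python rather than the PERL the authors used, and the matching rules (case-insensitive, keyword at the start of a word so that "fixed" and "bugs" also match) as well as the sample log messages are assumptions made for illustration.

```python
import re

# Keywords named on the slide. Case-insensitive matching at the start of a
# word is an assumption; the slides do not say how matching was done.
EVENT_PATTERN = re.compile(r"\b(bug|defect|fix)", re.IGNORECASE)

def is_defect_fix(log_message):
    """Return True if a change log message looks like a corrective change."""
    return EVENT_PATTERN.search(log_message) is not None

# Made-up log messages, not taken from the Mozilla CVS repository.
sample_logs = [
    "Fixing crash when the window is closed",
    "Add new toolbar icon",
    "Update prefix handling",
]
for message in sample_logs:
    # The event flag is 1 for a corrective change and 0 otherwise.
    print(int(is_defect_fix(message)), message)
```

A heuristic like this can misclassify some messages, which is why a manual check on a random sample, as described above, is a useful validation step.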
14. Results - Functional Form
[Figure: instantaneous relative risk of defect fix (y-axis, −1.0 to 2.0) plotted against size in LOC (x-axis, 0 to 12000)]
• With cubic spline functions, the logarithmic form of the relationship is evident. The
downward curve at the right end is supported by less than 0.3% of the data points. We can
therefore use log(size) directly in the Cox model.
15. Results -- Modeling results
MANUSCRIPT SUBMITTED TO TSE
             coef   exp(coef)   se(coef)   robust se      z   p
log(size)   0.368        1.44    0.00732       0.018   20.4   0
Rsquare = 0.152 (max possible = 1)
Likelihood ratio test = 2565 on 1 df, p=0
Wald test             = 416 on 1 df, p=0
Score (logrank) test  = 2565 on 1 df, p=0
Robust Score          = 142, p=0
Fig. 5. Modeling results using logarithmic transform of size
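One way to read the coefficient reported above is as a hazard ratio between two class sizes. The sketch below assumes that log denotes the natural logarithm (the slides do not state the base); the function name and the example sizes are illustrative.

```python
import math

COEF_LOG_SIZE = 0.368  # coefficient for log(size) from the results above

def hazard_ratio(size_a, size_b, coef=COEF_LOG_SIZE):
    """Relative hazard of a class of size_a versus one of size_b.

    exp(coef * (ln(size_a) - ln(size_b))) equals (size_a / size_b) ** coef.
    """
    return math.exp(coef * (math.log(size_a) - math.log(size_b)))

print(round(math.exp(COEF_LOG_SIZE), 2))  # agrees with the reported exp(coef)
print(round(hazard_ratio(200, 100), 2))   # effect of doubling the size
print(round(hazard_ratio(1000, 100), 2))  # effect of a tenfold size increase
```

Under this assumption, doubling the size multiplies the hazard by about 1.29, well below 2, which is consistent with smaller classes being proportionally more defect prone.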
17. Test of Proportional Hazards
• Commonly, interaction with time is tested
• Example: A drug only effective in the first hour.
• Note: This test can also become significant when a wrong functional form is
used.
• Result: p = 0.835, not statistically significant.
• A smooth plot of the Schoenfeld residuals shows an almost perfectly straight line.
[Figure: Beta(t) for log(size) (y-axis, −20 to 20) plotted against Time (x-axis, 0 to 2000000)]