Should Experiments for a Systematic Review be
discarded based on their Quality?
Oscar Dieste
Facultad de Informática
Universidad Politécnica Madrid
Boadilla del Monte,
Madrid 28660, Spain
+34 91336 5011
odieste@fi.upm.es
Natalia Juristo
Facultad de Informática
Universidad Politécnica Madrid
Boadilla del Monte,
Madrid 28660, Spain
+34 91336 6922
natalia@fi.upm.es
Himanshu Saxena
Facultad de Informática
Universidad Politécnica Madrid
Boadilla del Monte,
Madrid 28660, Spain
himanshusaxena22@gmail.com
ABSTRACT
Discarding irrelevant experiments based on their quality at the
beginning of a systematic review process could make the process
more efficient and the aggregated result more precise and reliable.
The systematic review process becomes cumbersome when it
involves the analysis of a large number of experiments, and there is a
potential risk that some of these experiments either do not
contribute to the aggregations or, if they do, induce bias.
The quality of an experiment is believed to depend on how it
is planned, executed and reported. We evaluated the quality of
the experiments participating in 3 different systematic reviews. For
quality evaluation we used both a checklist and alternative
procedures based on practices in other experimental disciplines.
Further, we analyzed correlations between the checklist scores and
the alternative measures of quality. We did not find any relationships
among the measures. Contemporary quality assessment
instruments seem to fail to detect any tangible evidence of quality
in experiments. Therefore, we recommend that quality assessment
for discarding experiments from a systematic review be
performed with extreme caution, or not performed at all, until more
research is carried out.
Categories and Subject Descriptors
D.2.0 [Software Engineering]: General
General Terms
Experimentation, Measurement
Keywords
Systematic Literature Review (SR), Quality Assessment (QA) of
experiments, Checklists.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise,
or republish, to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee.
ESEM’2010, September 13–17, 2010, Bolzano-Bozen, Italy.
Copyright 2004 ACM 1-58113-000-0/00/0004…$5.00.
1. INTRODUCTION
Systematic Review (SR), according to Kitchenham [1], “is a means
of identifying, evaluating and interpreting all available research
relevant to a particular research question, or topic area, or
phenomenon of interest”. The SR process involves (i) identifying
experiments about a particular research topic, (ii) selecting the
studies relevant to the research, (iii) including/excluding
studies based on their quality assessment, (iv) extracting the data
from the selected studies and (v) eventually, aggregating the data
to generate unbiased evidence. Here, we focus on SR that involve
aggregations.
The SR process can be quite challenging, especially if the
number of experiments considered is high. However, not all
experiments contribute to the final aggregation, and some
experiments could even create bias in the aggregated result.
Kitchenham et al. [1] and Tore Dybå et al. [2], [3] recommend a
detailed quality assessment of the studies for exclusion. The quality of
an experiment is related to how the experiment is executed and how
it is reported [1], [4]. Quality assessment can weigh the importance
of individual studies based on these aspects and further
remove low quality studies from a SR, reducing the number of
studies to be analysed and making the SR process more efficient
and less error prone.
Based on different procedures we have assessed the
quality of a set of 42 experiments obtained from 3 different SR.
We developed a checklist for quality assessment of experiments in
SR using literature available in SE [1], [4], [22] and other
experimental disciplines [5], [6], [7].
If checklists can discriminate experiments’ quality, their
results should be consistent with our understanding of quality.
Our conclusions point out that quality is an elusive concept that
cannot be easily assessed by checklists/questionnaires that are
primarily based on the experimental report. Discarding
experiments from a SR based on the present quality assessment
instruments can mislead the aggregations, and wrong evidence
could be generated. Therefore, quality assessment should either be
performed with extreme caution or not be performed at all.
This recommendation matches current practice in other
experimental fields [8], [9], [10], [11].
The paper is divided into 6 main sections. Section 2
reviews the different instruments that have been used to assess
the quality of experiments. Section 3 presents the research
methodology, Section 4 presents the research results,
Section 5 discusses the results, and Section 6 presents the
conclusions from our research.
2. INSTRUMENTS FOR ASSESSING
EXPERIMENTS QUALITY
Since the 1940s, when the first Randomized Controlled Trial (RCT)
was run in medicine, RCTs have been considered the ‘gold
standard’ for experimental research, exceeding the value of
other types of studies like non-randomized trials, longitudinal
studies, case studies, etc. [12]. Sacks H et al. [13] performed the
first corroboration of differences between RCTs and other types of
studies. One of the interesting findings was that RCTs in general
were less prone to bias than other types of studies like
observational studies, historical controls, etc. Bias produces some
kind of deviation from the true value. Kitchenham [1] explains
bias as “a tendency to produce results that depart systematically
from the true results”.
This early finding was highly influential and gave birth
to the development of hierarchies of evidence [14]. The hierarchy of
evidence was first proposed and popularized by the Canadian Task
Force on the Periodic Health Examination [15] in late 1979. A
hierarchy of evidence determines the best available evidence and
can be used to grade experiments according to their design. It also
reflects the degree of bias different types of experiments might
possess. The first hierarchy of evidence evolved into more refined
ones like the CRD guidelines [16], the Australian National Health
and Medical Research Council guidelines [5] and the Cochrane
review handbook [8].
To understand how a hierarchy of evidence works, let us
assume we have two studies comparing testing technique A with
testing technique B in a specific context: one RCT and one case
study. Consider a situation where the RCT says
technique A is better than technique B and the case study states
the opposite. The result from the RCT would be given priority
over the case study result, because RCTs are situated above case
studies in the hierarchy and thus there is more confidence in the
results of RCTs than of case studies.
A hierarchy of evidence can be used during a SR to
determine the study designs contributing to the evidence. Hierarchies
have been considered an uncontested method to grade the
strength of evidence, and their predictive power has been thought to
be demonstrated on multiple occasions [17].
Nevertheless, some voices have been raised against
hierarchies. It has been observed that lack of bias is not
necessarily related to the type of design, but to the provisions
made in the empirical study to control confounding variables [18].
In other words, quality (the risk of bias [19]) is related to the
internal validity of the experiment. Threats to internal validity
have been identified as selection, performance, measurement and
attrition (exclusion) bias [5], [8], [16]. Based on these
observations that lack of bias (quality) is related to internal
validity, many researchers in medicine studied the influence of
selection, performance, measurement and execution aspects on
bias. Those aspects were, for example, the presence in the
experiment of randomization, blinding, etc. Eliminating all
kinds of bias from an experiment would lead to an internally valid
experiment. Biases can be avoided by using mechanisms like
randomization and blinding. Checklists were constructed keeping
in mind all the different kinds of bias, asking whether
all relevant steps were performed to avoid bias [20].
In experimental SE, SR and quality assessment are
recent topics; they got attention after the seminal work by
Kitchenham in 2004 [1]. Since the publication of that report,
multiple SR have been published in SE [2], [21]. Most of the SRs
in SE do not have an explicit quality assessment process. In the
new version of the report by Kitchenham published in 2007 [22],
she proposes a questionnaire (checklist) considering the issues
related to bias and validity problems that might arise in the
different phases of experiments. Tore Dybå et al. [2] also
developed and used a checklist in their SR on agile software
development.
Lately, extensive criticism has been raised in medicine
about the use of scales for measuring quality [8], [9], [10], [23].
No relation was found between the score obtained by an
experiment on a given quality assessment checklist and the amount
of bias suffered by the experiment [9]. The Cochrane reviewers’
handbook goes to the extent of discouraging the use of quality
assessment scales for the exclusion of experiments from SR [8].
3. RESEARCH METHODOLOGY
The research methodology tried to find out whether low quality
experiments cause bias when they are aggregated in a SR. For this
we need a quality assessment instrument to determine low quality
experiments, ways to determine bias, and finally a way to relate
experiments’ quality with bias. Therefore, our research
methodology has three elementary components: (i) an assessment
checklist, (ii) one or more ways to determine bias in experiments
and (iii) a way to relate experiments’ quality with bias in the
experiment.
3.1 Quality Assessment Checklist
We started by studying the QA scales available in the empirical SE
literature. Kitchenham [1], based on different works from
medicine, proposes a questionnaire (checklist) for experiment
quality assessment. In Kitchenham’s checklist the average number
of questions for evaluating an experiment is close to 45.
Pragmatically speaking, this is an enormous number, since the
questions need to be answered for every paper of a SR (very often
more than 40 papers). Using such checklists would increase the time
required to conduct a SR, making the process highly inefficient.
Kitchenham [26] suggests not using all the questions provided in the
checklist but selecting the questions that are most appropriate for the
specific research problem. It is quite difficult to apply such a
strategy, because researchers lack the information needed to
identify which questions are relevant to the particular SR at hand.
Tore Dybå et al. [2] also developed a questionnaire for the
quality assessment of experiments in their SR on agile software
development. These authors do not provide a justification for why
and from where the items included in the checklist have been
selected.
None of the existing ESE QA questionnaires has been tested for
suitability (to ensure that they measure what they are intended to
measure) or reliability (to ensure that they measure consistently).
We decided to build a much shorter checklist taking the basic issues
that influence quality according to Kitchenham’s reports [1], [22],
Kitchenham et al.’s guidelines [4] and the practices of experiment
QA in medicine [5], [8], [16]. Our checklist is in essence
similar to Kitchenham’s checklist, as both use the same sources
and are built on the basis of the same criteria (i.e. the importance of
randomization, blinding, threats to validity, etc.). It is shown in
Table 1. The questions have been classified into two groups:
methodological and reporting aspects. Methodological-related
questions deal with controlling extraneous variables.
Reporting-related questions check whether important details are
disclosed in the written report or not.
Creating a checklist each time a SR is performed would
be quite time consuming. Therefore, we wanted to evaluate whether
it is possible to develop a checklist valid for a diversity of SR. We
used experiments from several domains to ensure the
generalizability of the checklist.
Table 1 Construct, attributes and questions of our checklist

Construct: Quality of Experiments

Attribute: Quality of Reporting
- Does the introduction contain the industrial context and a description of the techniques to be reviewed?
- Does the report summarize previous similar experiments that have been conducted and discuss them?
- Does the researcher define the population from which objects and subjects are drawn?
- Is the statistical significance mentioned with the results?
- Are the threats to validity mentioned explicitly, along with how these threats affect the results and conclusions?

Attribute: Methodological Quality
- Does the researcher define the process by which the treatment is applied to subjects or objects?
- Was randomization used for selecting the population and applying the treatment?
- Does the researcher define the process by which the subjects and objects are selected?
- Are the hypotheses laid out, and are they consistent with the goal discussed in the introduction?
- Is the blinding procedure used convincing enough?
3.2 Determining Bias in Experiments
Kitchenham [1] explains bias as “a tendency to produce results
that depart systematically from the ‘true’ results”. For example, if
we find in a specific experiment that testing technique A is 50%
more effective than testing technique B, but we know that the
difference actually is 30%, the remaining 20% would be
considered as bias.
Therefore, for our research we need to know true values
to be able to determine experiments’ bias. However, given the lack
of empirical data in SE, true values are seldom available. We only
have the results of SR, whose values we might assume are
close to the true values. We have found only two published SR that
detail the aggregations, showing the true value and how each
experiment contributes to it (Tore Dybå et al. [25] and
Ciolkowski [26]). At UPM, we had two more SRs that detailed
the aggregations [27], [28].
Using SR results, depending on how they are
performed, we can identify three measures of bias. The first
measure is known as proximity to mean value (PTMV) [27], [29]
of all experiments in a SR. Figure 1 is a forest plot describing the
aggregation of three experiments.
The vertical line with the diamond in its tail gives the mean value of
all three experiments. PTMV says that the farther the
experiment is from the mean value, the more biased the results of
the experiment are. In the figure, from top to bottom, we have
E1, E2 and E3. PTMV tells us that E3 possesses the highest bias, as
it is farthest from the mean value, and E2 the least. Table 2 provides
the distances from the mean value and the corresponding bias order
of the three experiments.
Table 2 PTMV scores for experiments in the forest plot example.

Experiment | Distance ID | PTMV | Bias Order
E1 | d1 | 0.47 | 2
E2 | d2 | 0.17 | 3
E3 | d3 | 0.48 | 1
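The PTMV ranking described above can be sketched in a few lines of code (a minimal illustration only; the distances are the example values from Table 2, and the variable names are ours, not from the original SR data):

```python
# Proximity to mean value (PTMV): the farther an experiment's effect
# estimate lies from the aggregated mean, the more biased it is assumed
# to be. Distances are the example values d1..d3 from Table 2.
distances = {"E1": 0.47, "E2": 0.17, "E3": 0.48}

# Rank experiments from most to least biased (largest distance first).
# Bias order 1 = most biased, matching the ordering shown in Table 2.
ranked = sorted(distances, key=distances.get, reverse=True)
bias_order = {exp: rank for rank, exp in enumerate(ranked, start=1)}

print(bias_order)  # E3 is most biased (order 1), E2 least biased (order 3)
```

This reproduces the bias order column of Table 2: E3 (distance 0.48) is ranked most biased, E2 (distance 0.17) least.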
The second measure is the conflicting and non-conflicting
contribution to the aggregation (CNC) of all
experiments in a SR. Table 3 describes three aggregations
produced using comparative analysis [30]. Each aggregation
has a response variable and a generalized result associated with it.
Table 3 Experiments’ contribution to the aggregation

Aggregation | Response variable | Generalized Result | Positive effects | Negative effects | No effects
AGG01 | R1 | T{a} > T{b} | E1, E3 | E2 |
AGG02 | R2 | T{c} < T{d} | E1, E5 | E4 |
AGG03 | R3 | T{e} > T{f} | E3, E2 | |
Figure 1 Aggregation forest plot of three experiments.
The experiments have either a positive effect on the aggregated
result, a negative effect or no effect. A positive effect means that
the experimental result agrees with the generalized result, and a
negative effect means the opposite. No effect means that the
experiment produced insignificant results. The CNC contribution can
be calculated using the formula given in Equation 1.
Equation 1 CNC quality score equation: CNC = NC / (NC + C), where NC is the number of aggregations the experiment supports (non-conflicting contributions) and C is the number of aggregations it conflicts with.
Let us take an example to calculate the CNC score. Experiment E2
conflicts with aggregation AGG01 and supports (does not conflict
with) aggregation AGG03. Therefore, the CNC score for
experiment E2 would be 0.5. The higher the CNC score of an
experiment, the less bias it possesses, because a higher CNC
score means the experiment is more coherent with the aggregated
result. Now let us take E1, which supports aggregations
AGG01 and AGG02, yielding a CNC score of 1.
CNC scores are handy for SR that do not use meta-analysis for
data aggregation. However, CNC scores are less
precise than PTMV scores.
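The CNC computation, under our reading of Equation 1 and using the contributions from Table 3, can be sketched as follows (the function name is ours):

```python
# CNC score = non-conflicting contributions / (non-conflicting + conflicting).
# "No effect" contributions are ignored, since insignificant results
# neither support nor contradict the generalized result.
def cnc_score(supports, conflicts):
    if supports + conflicts == 0:
        return None  # experiment produced only insignificant results
    return supports / (supports + conflicts)

# E2 supports AGG03 and conflicts with AGG01 (see Table 3).
print(cnc_score(1, 1))  # 0.5, as in the worked example
# E1 supports both AGG01 and AGG02 and conflicts with neither.
print(cnc_score(2, 0))  # 1.0
```

Both values match the worked example in the text: 0.5 for E2 and 1 for E1.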
The third measure we used is expert opinion about the
experiment based on the experiment’s report [1], [16]. Based on
experience, an expert can differentiate between biased and
unbiased experiments in a SR. We adapted the expert opinion
measure to make it simpler: a value of -1, 0 or 1 was given for a
poor, average or good experiment, respectively. The higher the
score, the less bias the experiment is assumed to contain.
3.3 Relating Quality with Bias
The third component of our research methodology was to find a
way to relate checklist results with bias in the experiment. If
internal validity and lack of bias are related, we expect high
values of internal validity to be associated with low levels of bias,
and vice versa. Correlations are the standard way to compute such a
relationship [31]. Therefore, we correlated the checklist scores with
the PTMV scores, the CNC scores and the expert opinion scores.
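The correlation step can be sketched with a plain Pearson coefficient; the scores below are illustrative values in the same ranges as Table 5, not the actual SR data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative only: checklist scores vs. a bias measure (e.g. CNC)
# for five hypothetical experiments.
checklist = [0.6, 0.7, 0.8, 0.8, 0.9]
cnc = [1.0, 0.0, 0.5, 1.0, 0.33]
print(round(pearson(checklist, cnc), 3))
```

A coefficient near zero with a high p-value, as obtained in Section 4, indicates no usable relationship between checklist quality and bias.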
4. RESEARCH RESULTS
Table 4 The three SRs used in our research

Name | No. of Experiments | Aggregation mechanism | Authors
SR on elicitation techniques [27] | 14 | Comparative Analysis | Oscar Dieste, Natalia Juristo
SR on inspection techniques [28] | 13 | Meta-Analysis | Anna Griman Padua, Oscar Dieste, Natalia Juristo
SR on pair programming [25] | 15 | Meta-Analysis | Tore Dybå, Erik Arisholm, Dag I.K. Sjoberg, Jo E. Hannay and Forrest Shull
The three SR described in Table 4 were used for obtaining the
scores we have described [32]. Checklist, PTMV, CNC and expert
opinion scores were obtained for all the experiments in the three
SRs. Table 5 provides all the scores obtained.
Table 5 All the scores

Elicitation SR:
Study ID | CNC Score | PTMV Score | Checklist Score | Expert Opinion Score
E2 | 1.00 | N.A. | 0.70 | 0
E3 | 0.00 | N.A. | 0.72 | 1
E4 | 0.00 | N.A. | 0.63 | 1
E5 | 1.00 | N.A. | 0.90 | 1
E6 | 0.50 | N.A. | 0.45 | 1
E8 | 0.58 | N.A. | 0.60 | 1
E17 | 0.33 | N.A. | 0.60 | 1
E18 | 0.50 | N.A. | 0.60 | 1
E19 | 1.00 | N.A. | 0.80 | 1
E21 | 0.33 | N.A. | 0.80 | -1
E25 | 1.00 | N.A. | 0.90 | 0
E28 | 0.50 | N.A. | 0.30 | -1
E31 | 0.50 | N.A. | 0.60 | -1
E36 | 0.83 | N.A. | 0.90 | 1
E43 | 1.00 | N.A. | 0.70 | 1

Inspection SR:
Study ID | CNC Score | PTMV Score | Checklist Score | Expert Opinion Score
E05 | 1.00 | 0.34 | 0.80 | 1
E01 | 1.00 | 0.13 | 0.80 | 0
E02 | 1.00 | 0.36 | 0.70 | 0
E06 | 1.00 | 0.23 | 0.80 | 1
E03 | 1.00 | 0.21 | 0.70 | 0
E04 - R1 | 1.00 | 0.09 | 0.80 | 1
E04 - R2 | 1.00 | 0.63 | 0.80 | 1
E13 | 1.00 | 0.32 | 0.80 | 1
E14 | 0.50 | 0.26 | 0.80 | 1
E15 | 0.50 | 0.37 | 0.80 | 1
E19 | 1.00 | 0.33 | 0.80 | 0
E20 | 1.00 | 0.14 | 0.80 | 0

Pair Programming SR:
Study ID | CNC Score | PTMV Score | Checklist Score | Expert Opinion Score
P98 | 0.50 | 2.11 | 0.7 | 0
So0 | 1.00 | 0.66 | 0.6 | 0
So1 | 0.00 | 0.24 | 0.5 | 0
So2 | 0.00 | 0.265 | 0.6 | 0
Po2 | 0.00 | 0.34 | 0.6 | 1
So3 | 0.00 | 0.215 | 0.9 | 1
So5a | 1.00 | 0.335 | 0.7 | 0
So5b | 0.33 | 0.44 | 0.8 | 1
So5c | 0.00 | 0.08 | 0.8 | 0
So6a | 0.00 | 0.3 | 0.8 | 1
So6b | 0.00 | 0.26 | 0.8 | 1
So6c | 0.67 | 1.575 | 0.8 | -1
So6d | 0.67 | 1.2433 | 0.6 | 0
So7b | 0.00 | 0.65 | 0.9 | 1
So7a | 0.33 | 0.19 | 0.8 | 1
Three different experts were used. Two of them were PhD
students writing their theses on SR and aggregation, and one
was a senior researcher in experimental SE.
Correlations were computed between CNC scores and
checklist scores, checklist scores and PTMV scores, and checklist
scores and expert opinion. Figure 2 shows all the correlations.
Table 6 provides the values of the correlation coefficients and the
corresponding p-values for the correlations described in Figure 2.
Considering the first row (CNC vs. checklist) of Figure
2, the x-axis gives the checklist scores and the y-axis the CNC
scores. From left to right in the first plot (elicitation SR) we have
fluctuating CNC scores for checklist scores from 0.60 to 0.90,
indicating a poor correlation. The correlation coefficient
and p-value given in Table 6 (correlation coefficient = 0.453; p-
value = 0.242) substantiate this observation. Similarly, the second
and third graphs in the first row exhibit poor correlations.
In the second row of Figure 2, the rightmost graph
shows the correlation between checklist and PTMV scores for the
pair programming SR. As we can see, for a PTMV score (x-axis) of
1 we have different checklist scores (y-axis), again suggesting a poor
correlation. The correlation coefficient (-0.078) substantiates
this observation, and the p-value (0.526) makes the result
statistically insignificant. The other graph (inspection SR column)
in the same row also shows a poor correlation.
In the third row of Figure 2, the second
graph from the left shows the correlation between checklist
scores and expert opinion for the inspection SR. The graph shows a
positive trend: as the expert opinion scores (x-axis) increase, the
checklist scores also tend to increase. The correlation coefficient
(0.677) suggests a positive correlation. However, the p-value
(0.129) is higher than the statistical significance threshold (p-
value = 0.05). The most probable reason is that the sample size is
quite small (11 studies), making it difficult to reach statistical
significance. However, the plots and correlation coefficients for the
other two SR (elicitation and pair programming) suggest no
correlation between expert opinion and checklist scores.
All the correlations were poor and statistically
insignificant, suggesting that no relationship could be identified
between quality measured through the checklist and bias measured
through CNC, PTMV and expert opinion.
Table 6 Correlation coefficients between checklist scores and each bias measure, with p-values in parentheses.

SR | CNC | PTMV | Expert Opinion
Elicitation SR | 0.453 (0.242) | N.A. | 0.222 (0.367)
Inspection SR | -0.182 (0.600) | 0.085 (0.453) | 0.677 (0.129)
Pair Programming SR | -0.236 (0.630) | -0.078 (0.526) | 0.406 (0.277)
Figure 2 Correlations from top to bottom (i) CNC vs.
Checklist (ii) Checklist vs. PTMV (iii) Checklist vs. Expert
Opinion. Correlations from left to right (i) Elicitation SR (ii)
Inspection SR (iii) Pair Programming SR.
5. DISCUSSION
The results stimulate discussion. Getting back to the question:
should experiments in a SR be discarded based on their quality?
Discarding an experiment in a SR based on its quality
(methodological or reporting) seems to be inappropriate. No
correlations were obtained between the QA checklist scores and
the different measures of bias. Our checklist is similar to the
contemporary QA instruments prevalent in SE. Therefore,
contemporary QA instruments should be used carefully. Exclusion
of valid studies from a SR would reduce the precision of the
aggregated result, while the inclusion of invalid studies might lead
to a biased aggregated result.
The Cochrane Handbook [8] shares our point of view
on the QA of experiments in medicine. The Cochrane Handbook
[8], along with other researchers from medicine [9], [10], [11],
discourages the use of QA scales, as they cannot prove their
validity. Jüni [10] states: “Scales have been shown to be unreliable
assessment of validity and they are less likely to be transparent to
the users of the review.” None of the scales present in SE has
proved its validity, including ours. Moreover, QA scales can
discard useful experiments from a SR, making the aggregated
result biased, imprecise and pragmatically worthless. Therefore, it
might be advisable to discourage the use of QA scales in SE.
The results we have obtained are statistically
insignificant. If the number of experiments increases, we might
attain statistical significance; we plan to move ahead when more
SR that detail their aggregations become available.
Researchers [14], [33] in medicine have identified a
potential aspect in measuring quality: the relation between the
object being experimented on and the experimenter seems to bias
the results. This object-experimenter relationship has been the only
one empirically identified in medicine as having a relationship
with bias in experimental results. It has been observed in medicine
that multi-centre experiments avoid this bias, and evidence from
multi-centre SRs is considered to have the maximum confidence
[14]. However, further research is imperative to substantiate this
postulate in SE.
6. CONCLUSIONS
We have analysed whether QA instruments such as checklists have a
relationship with experimental bias. Our pilot approach to this
research has been with our own QA checklist. The most widespread
QA checklist in SE [22] was too long (45 questions) to be
answered for the 42 experiments we were analysing. Therefore,
we generated a short version of that checklist.
For three different SR [25], [27], [28] we have analysed the
correlations between QA checklist scores and three bias measurements
(proximity to mean value, conflicting and non-conflicting
contribution, and expert opinion). All the correlations were poor
and insignificant. These results suggest that discarding
experiments from a SR based on their quality should not be
practiced. These results match similar results and
recommendations in medicine [8], [9], [10], [11].
The next step in our research is to run this same analysis
with similar QA instruments ([2], [34] for instance).
7. ACKNOWLEDGMENTS
This work was part of a master’s thesis done at UPM Madrid, Spain
and TU Kaiserslautern, Germany under Dieter Rombach, Natalia
Juristo and Oscar Dieste, within the European Masters in Software
Engineering course. This work was partially supported by the
projects TIN 2008 – 00555 and HD 2008-0048.
8. REFERENCES
[1] B.A. Kitchenham, Procedures for performing systematic
reviews, Technical Report TR/SE-0401, 2004.
[2] Dybå T, Dingsøyr T. Empirical studies of agile software
development: A systematic review. Inf. Softw. Technol.
2008;50(9-10):833-859. Available at:
http://portal.acm.org.miman.bib.bth.se/citation.cfm?id=1379
905.1379989&coll=GUIDE&dl=GUIDE&CFID=81486639
&CFTOKEN=51504098 [Accessed March 11, 2010].
[3] Dyba T, Dingsoyr T, Hanssen G. Applying Systematic
Reviews to Diverse Study Types: An Experience Report. 20-
21 Sept. 2007 . 2007: 225 - 234 .
[4] B.A. Kitchenham, S.L. Pfleeger, L.M. Pickard, P.W. Jones,
D.C. Hoaglin, K.E. Emam, and J. Rosenberg, Preliminary
Guidelines for Empirical Research in Software Engineering,
vol. 28 no. 8, Aug. 2002, pp. pp. 721-734.
[5] National Health and Medical Research Council (NHMRC),
“NHMRC handbook series on preparing clinical practice
guidelines.” ISBN 0642432952, Feb. 2000.
[6] D. Moher, B. Pham, A. Jones, D.J. Cook, A.R. Jadad, M.
Moher, P. Tugwell, and T.P. Klassen, Does quality of reports
of randomised trials affect estimates of intervention efficacy
reported in meta-analyses? Lancet, vol. 352, Aug. 1998, pp.
609-613.
[7] Blobaum P. Physiotherapy Evidence Database (PEDro). J
Med Libr Assoc. 2006; 94 (4):477-478.
[8] J.P.T. Higgins and S. Green, Cochrane Handbook for
Systematic Reviews of Interventions, John Wiley and Sons,
2008.
[9] J.D. Emerson, E. Burdick, D.C. Hoaglin, F. Mosteller, and
T.C. Chalmers, An empirical study of the possible relation of
treatment differences to quality scores in controlled
randomized clinical trials, Controlled Clinical Trials, vol. 11,
Oct. 1990, pp. 339-352.
[10] P. Jüni, A. Witschi, R. Bloch, and M. Egger, The hazards of
scoring the quality of clinical trials for meta-analysis, JAMA:
The Journal of the American Medical Association, vol. 282,
Sep. 1999, pp. 1054-1060.
[11] Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical
evidence of bias. Dimensions of methodological quality
associated with estimates of treatment effects in controlled
trials. JAMA 1995; 273: 408-412.
[12] Stuart Barton, Which clinical studies provide the best
evidence? BMA House, Tavistock Square, London WC1H
9JR, sbarton@bmjgroup.com
[13] Sacks H, Chalmers TC, Smith H Jr., Randomized versus
historical controls for clinical trials, Am J Med 1982; 72:
233-40.
[14] David Evans BN, Hierarchy of evidence: a framework for
ranking evidence evaluating healthcare interventions,
Department of Clinical Nursing, University of Adelaide,
South Australia 5005, 2002.
[15] Canadian Task Force on the Periodic Health Examination.
(1979) The periodic health examination. Canadian Medical
Association Journal 121, 1193-1254.
[16] U.O.Y. Centre for Reviews and Dissemination, Undertaking
systematic reviews of research on effectiveness: CRD's
guidance for carrying out or commissioning reviews, NHS
Centre for Reviews and Dissemination, Mar. 2001.
[17] Louis PCA. Research into the effects of bloodletting in some
inflammatory diseases and on the influence of tartarized
antimony and vesication in pneumonitis. Am J Med Sci 1836;
18:102-111.
[18] R.R. MacLehose, B.C. Reeves, I.M. Harvey, T.A. Sheldon,
I.T. Russell, A.M.S. Black, A systematic review of comparisons
of effect sizes derived from randomised and non-randomised
studies, Health Technology Assessment 2000; Vol. 4: No. 34.
[19] Verhagen A.P., de Vet H.C.W., de Bie R.A., Boers M., A
van den Brandt P. The art of quality assessment of RCTs
included in systematic reviews. Journal of Clinical
Epidemiology. 2001; 54:651-654. Available at:
http://www.ingentaconnect.com/content/els/08954356/2001/
00000054/00000007/art00360 [Accessed March 3, 2010].
[20] A.P. Verhagen, H.C. de Vet, R.A. de Bie, A.G. Kessels, M.
Boers, L.M. Bouter, and P.G. Knipschild, The Delphi list: a
criteria list for quality assessment of randomized clinical
trials for conducting systematic reviews developed by Delphi
consensus, Journal of Clinical Epidemiology, vol. 51, Dec.
1998, pp. 1235-1241.
[21] Finn Olav Bjørnson, Torgeir Dingsøyr, Knowledge
management in software engineering: A systematic review of
studied concepts, findings and research methods used, a
Norwegian University of Science and Technology,
Department of Computer and Information Science, Sem
S_landsvei 7-9, 7491 Trondheim, Norway. SINTEF
Information and Communication Technology, SP Andersens
vei 15b, 7465 Trondheim, Norway, 2008
[22] B. Kitchenham and S. Charters, Guidelines for performing
Systematic Literature Reviews in Software Engineering,
EBSE Technical Report EBSE-2007-01, 2007.
[23] E.M. Balk, P.A.L. Bonis, H. Moskowitz, C.H. Schmid,
J.P.A. Ioannidis, C. Wang, and J. Lau, Correlation of quality
measures with estimates of treatment effect in meta-analyses
of randomized controlled trials, JAMA: The Journal of the
American Medical Association, vol. 287, Jun. 2002, pp.
2973-2982.
[24] Fink, A. Conducting Research Literature Reviews. From the
Internet to Paper, Sage Publication, Inc., 2005.
[25] Hannay JE, Dybå T, Arisholm E, Sjøberg DIK. The
effectiveness of pair programming: A meta-analysis. Inf.
Softw. Technol. 2009; 51 (7):1110-1122. Available at:
http://portal.acm.org.miman.bib.bth.se/citation.cfm?id=1539
052.1539606 [Accessed March 11, 2010].
[26] Ciolkowski M. What do we know about perspective-based
reading? An approach for quantitative aggregation in
software engineering. In: Proceedings of the 2009 3rd
International Symposium on Empirical Software Engineering
and Measurement. IEEE Computer Society; 2009:133-144.
Available at http://portal.acm.org/citation.cfm?doid=
1671248.1671262 [Accessed March 11, 2010].
[27] Oscar Dieste, Natalia Juristo Systematic Review and
Aggregation of Empirical Studies on Elicitation Techniques.
IEEE Transactions on Software Engineering, 2008.
[28] Anna Griman Padua, Oscar Dieste and Natalia Juristo,
Systematic review on inspection techniques, UPM, Spain,
(Under Construction).
[29] Schulz, Kenneth F.; Chalmers, Iain; Hayes, Richard J.;
Altman, Douglas G, Empirical evidence of bias: Dimensions
of methodological quality associated with estimates of
treatment effects in controlled trials, JAMA: Journal of the
American Medical Association. Vol 273(5), Feb 1995, 408-
412.
[30] C.C. Ragin, The comparative method, University of
California Press, 1987.
[31] J. DeCoster, Scale Construction Notes, Jun. 2005,
http://www.start-help.com/notes.html.
[32] H. Saxena, Prospective Study for the Quality Assessment of
Experiments Included in Systematic Reviews, Master's
thesis, UPM, Madrid, 2009.
[33] P. Jüni, A. Witschi, R. Bloch, and M. Egger, The hazards of
scoring the quality of clinical trials for meta-analysis, JAMA:
The Journal of the American Medical Association, vol. 282,
Sep. 1999, pp. 1054-1060.
[34] B. Kitchenham, E. Mendes, and G.H. Travassos, A
Systematic Review of Cross- vs. Within-Company Cost
Estimation Studies, IEEE Transactions on Software
Engineering, vol. 33, no. 5, 2007, pp. 316-329.
9. LIST OF EXPERIMENTS USED IN OUR
STUDY
Set 1 - The following studies were used in the systematic review
on elicitation techniques. The alphanumeric code preceding each
study is the identifier assigned while performing the systematic
review; the aggregation tables use these codes to refer to the
experiments.
[E2] Agarwal, R. and Tanniru, M. R., "Knowledge Acquisition
Using Structured Interviewing: An Empirical Investigation,"
Journal of Management Information Systems, vol. 7, pp.
123-141, Summer, 1990.
[E3] Bech-Larsen, T. and Nielsen, N. A., "A comparison of five
elicitation techniques for elicitation of attributes of low
involvement products," Journal of Economic Psychology,
vol. 20, pp. 315-341, 1999.
[E4] Breivik, E. and Supphellen, M., "Elicitation of product
attributes in an evaluation context: A comparison of three
elicitation techniques," Journal of Economic Psychology,
vol. 24, pp. 77-98, 2003.
[E5] Browne, G. J. and Rogich, M. B., "An Empirical
Investigation of User Requirements Elicitation: Comparing
the Effectiveness of Prompting Techniques," Journal of
Management Information Systems, vol. 17, pp. 223-249,
Spring, 2001.
[E6] Burton, A. M., Shadbolt, N. R., Hedgecock, A. P., and Rugg,
G. A formal evaluation of knowledge elicitation techniques
for expert systems: Domain 1. In: Research and
development in expert systems IV: proceedings of Expert
Systems '87, the seventh annual Technical Conference of
the British Computer Society Specialist Group on Expert
Systems, ed. Moralee, D. S. Cambridge, UK: Cambridge
University Press, 1987.
[E8] Corbridge, B., Rugg, G., Major, N. P., Shadbolt, N. R., and
Burton, A. M., "Laddering - technique and tool use in
knowledge acquisition," Knowledge Acquisition, vol. 6, pp.
315-341, 1994.
[E17, E18] Burton, A. M., Shadbolt, N. R., Rugg, G., and
Hedgecock, A. P., "The efficacy of knowledge acquisition
techniques: A comparison across domains and levels of
expertise," Knowledge Acquisition, vol. 2, pp. 167-178,
1990.
[E19] Moody, J. W., Will, R. P., and Blanton, J. E., "Enhancing
knowledge elicitation using the cognitive interview," Expert
Systems with Applications, vol. 10, pp. 127-133, 1996.
[E21] Zmud, R. W. , Anthony, W. P., and Stair, R. M., "The use
of mental imagery to facilitate information identification in
requirements analysis," Journal of Management Information
Systems, vol. 9, pp. 175-191, 1993.
[E25] Pitts, M. G. and Browne, G. J., "Stopping Behavior of
Systems Analysts During Information Requirements
Elicitation," Journal of Management Information Systems,
vol. 21, pp. 203-226, 2004.
Set 2 - The following studies were used in the systematic review
on inspection techniques. The alphanumeric code preceding each
study is the identifier assigned while performing the systematic
review; the aggregation tables use these codes to refer to the
experiments.
[E03] J. Miller, M. Wood, and M. Roper, “Further Experiences
with Scenarios and Checklists,” Empirical Softw. Engg.,
vol. 3, 1998, pp. 37-64.
[E13], [E14], [E15] O. Laitenberger, K.E. Emam, and T.G.
Harbich, “An Internally Replicated Quasi-Experimental
Comparison of Checklist and Perspective-Based Reading of
Code Documents,” IEEE Trans. Softw. Eng., vol. 27, 2001,
pp. 387-421.
[E04] K. Sandahl, O. Blomkvist, J. Karlsson, C. Krysander, M.
Lindvall, and N. Ohlsson, “An Extended Replication of an
Experiment for Assessing Methods for Software
Requirements Inspections,” Empirical Softw. Engg., vol. 3,
no. 4, Dec. 1998.
[E19] T. Thelin, P. Runeson, and C. Wohlin, “An Experimental
Comparison of Usage-Based and Checklist-Based Reading,”
IEEE Trans. Softw. Eng., vol. 29, no. 8, Aug. 2003.
[E06] A.A. Porter and L.G. Votta, “An experiment to assess
different defect detection methods for software requirements
inspections,” Proceedings of the 16th international
conference on Software engineering, Sorrento, Italy: IEEE
Computer Society Press, 1994, pp. 103-112.
[E20] T. Thelin, C. Andersson, P. Runeson, and N. Dzamashvili-
Fogelstrom, “A Replicated Experiment of Usage-Based and
Checklist-Based Reading,” Proceedings of the Software
Metrics, 10th International Symposium, IEEE Computer
Society, 2004, pp. 246-256.
[E05] A. Porter and L. Votta, “Comparing Detection Methods
For Software Requirements Inspections: A Replication
Using Professional Subjects,” Empirical Softw. Engg., vol.
3, 1998, pp. 355-379.
[E01] A.A. Porter, L.G. Votta, Jr., and V.R. Basili, “Comparing
Detection Methods for Software Requirements Inspections:
A Replicated Experiment,” IEEE Trans. Softw. Eng., vol.
21, no. 6, 1995, pp. 563-575.
Set 3 - The following studies were used in the systematic review
on pair programming. The alphanumeric code preceding each
study is the identifier assigned while performing the systematic
review; the aggregation tables use these codes to refer to the
experiments.
[P98] J.T. Nosek, “The Case for Collaborative Programming,”
Comm. ACM, vol. 41, no. 3, 1998, pp. 105–108.
[Soo] L. Williams, R.R. Kessler, W. Cunningham, and R.
Jeffries, “ Strengthening the Case for Pair
Programming,” IEEE Software, vol. 17, no. 4, 2000,
pp. 19–25.
[So1] J. Nawrocki and A. Wojciechowski, “Experimental
Evaluation of Pair Programming,” Proc. European
Software Control and Metrics Conference (ESCOM 01),
2001, pp. 269–276.
[So2] P. Baheti, E. Gehringer, and D. Stotts, “Exploring the
Efficacy of Distributed Pair Programming,” Extreme
Programming and Agile Methods—XP/Agile Universe
2002, LNCS 2418, Springer, 2002, pp. 208–220.
[Po2] M. Rostaher and M. Hericko, “Tracking Test First Pair
Programming—An Experiment,” Proc. XP/Agile
Universe 2002, LNCS 2418, Springer, 2002, pp. 174–
184.
[So3] S. Heiberg, U. Puus, P. Salumaa, and A. Seeba, “Pair-
Programming Effect on Developers Productivity,”
Extreme Programming and Agile Processes in Software
Eng.—Proc 4th Int’l Conf. XP 2003, LNCS 2675,
Springer, 2003, pp. 215–224.
[So5a] G. Canfora, A. Cimitile, and C.A. Visaggio, “Empirical
Study on the Productivity of the Pair Programming,”
Extreme Programming and Agile Processes in Software
Eng.—Proc 6th Int’l Conf. XP 2005, LNCS 3556,
Springer, 2005, pp. 92–99.
[So5b] M.M. Müller, “Two Controlled Experiments
Concerning the Comparison of Pair Programming to
Peer Review,” J. Systems and Software, vol. 78, no. 2,
2005, pp. 169–179.
[So5c] J. Vanhanen and C. Lassenius, “Effects of Pair
Programming at the Development Team
Level: An Experiment,” Proc. Int’l Symp.
Empirical Software Eng. (ISESE 05), IEEE CS Press,
2005, pp. 336–345.
[So6a] L. Madeyski, “The Impact of Pair Programming and
Test-Driven Development on Package Dependencies in
Object-Oriented Design—An Experiment,” Product-
Focused Software Process Improvement—Proc. 7th
Int’l Conf. (Profes 06), LNCS 4034, Springer, 2006, pp.
278–289.
[So6b] M.M. Müller, “A Preliminary Study on the Impact of a
Pair Design Phase on Pair Programming and Solo
Programming,” Information and Software Technology,
vol. 48, no. 5, 2006, pp. 335–344.
[So6c] M. Phongpaibul and B. Boehm, “An Empirical
Comparison between Pair Development and Software
Inspection in Thailand,” Proc. Int’l Symp. Empirical
Software Eng. (ISESE 06), ACM Press, 2006, pp. 85–
94.
[So6d] S. Xu and V. Rajlich, “Empirical Validation of
Test-Driven Pair Programming in Game
Development,” Proc. Int’l Conf. Computer and
Information Science and Int’l Workshop Component-
Based Software Eng., Software Architecture and Reuse
(ICIS-COMSAR 06), IEEE CS Press, 2006, pp. 500–
505.
[Po7a] G. Canfora, A. Cimitile, F. Garcia, M. Piattini, and
C.A. Visaggio, “Evaluating Performances of Pair
Designing in Industry,” J. Systems and Software, vol.
80, no. 8, 2007, pp. 1317–1327.
[Po7b] E. Arisholm, H. Gallis, T. Dybå, and D.I.K. Sjøberg,
“Evaluating Pair Programming with Respect
to System Complexity and Programmer
Expertise,” IEEE Trans. Software Eng., vol. 33, no. 2,
2007, pp. 65–86.