Should Experiments for a Systematic Review be
discarded based on their Quality?
Oscar Dieste
Facultad de Informática
Universidad Politécnica Madrid
Boadilla del Monte,
Madrid 28660, Spain
+34 91336 5011
odieste@fi.upm.es
Natalia Juristo
Facultad de Informática
Universidad Politécnica Madrid
Boadilla del Monte,
Madrid 28660, Spain
+34 91336 6922
natalia@fi.upm.es
Himanshu Saxena
Facultad de Informática
Universidad Politécnica Madrid
Boadilla del Monte,
Madrid 28660, Spain
himanshusaxena22@gmail.com
ABSTRACT
Discarding irrelevant experiments based on their quality at the beginning of a systematic review could make the process more efficient and the aggregated result more precise and reliable. The systematic review process becomes cumbersome when it involves the analysis of a large number of experiments, and there is a potential risk that some of these experiments either do not contribute to the aggregations or, if they do, induce bias. The quality of an experiment is believed to depend on how the experiment is planned, executed and reported. We evaluated the quality of the experiments included in 3 different systematic reviews. For the quality evaluation we used both a checklist and alternative procedures based on practices in other experimental disciplines. We then analyzed the correlations between the checklist scores and the alternative measures of quality. We did not find any relationship among the measures. Contemporary quality assessment instruments seem to fail to detect any tangible evidence of quality in experiments. Therefore, we recommend that quality assessment for discarding experiments from a systematic review be performed with extreme caution, or not performed at all, until more research is carried out.
Categories and Subject Descriptors
D.2.0 [Software Engineering]: General
General Terms
Experimentation, Measurement
Keywords
Systematic Literature Review (SR), Quality Assessment (QA) of
experiments, Checklists.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise,
or republish, to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee.
ESEM’2010, September 13–17, 2010, Bolzano-Bozen, Italy.
Copyright 2004 ACM 1-58113-000-0/00/0004…$5.00.
1. INTRODUCTION
A Systematic Review (SR), according to Kitchenham [1], "is a means of identifying, evaluating and interpreting all available research relevant to a particular research question, or topic area, or phenomenon of interest". The SR process involves (i) identifying experiments about a particular research topic, (ii) selecting the studies relevant to the research, (iii) including or excluding studies based on a quality assessment, (iv) extracting the data from the selected studies and (v) eventually aggregating the data to generate unbiased evidence. Here, we focus on SRs that involve aggregations.
The SR process can be quite challenging, especially if the number of experiments considered is high. Not all experiments contribute to the final aggregation, and some experiments could even bias the aggregated result. Kitchenham et al. [1] and Dybå et al. [2], [3] recommend a detailed quality assessment of the studies for exclusion purposes. The quality of an experiment is related to how the experiment is executed and how it is reported [1], [4]. Quality assessment can weigh the importance of individual studies based on these aspects and can be used to remove low quality studies from an SR, reducing the number of studies to be analysed and making the SR process more efficient and less error prone.
Based on different procedures, we have assessed the quality of a set of 42 experiments obtained from 3 different SRs. We developed a checklist for the quality assessment of experiments in SRs using the literature available in SE [1], [4], [22] and in other experimental disciplines [5], [6], [7].
If checklists can discriminate experiments by quality, their results should be consistent with our understanding of quality. Our conclusions point out that quality is an elusive concept that cannot be easily assessed by checklists or questionnaires that are primarily based on the experimental report. Discarding experiments in an SR based on the present quality assessment instruments can mislead the aggregations, and incorrect evidence could be generated. Therefore, quality assessment should either be performed with extreme caution or not be performed at all. This recommendation matches current practice in other experimental fields [8], [9], [10], [11].
The paper is divided into 6 main sections. Section 2 reviews the different instruments that have been used to assess the quality of experiments. Section 3 presents the research methodology and Section 4 presents the research results. Section 5 discusses the results and Section 6 presents the conclusions of our research.
2. INSTRUMENTS FOR ASSESSING EXPERIMENT QUALITY
Since the 1940s, when the first Randomized Controlled Trial (RCT) was run in medicine, RCTs have been considered the 'gold standard' for experimental research, valued above other types of studies such as non-randomized trials, longitudinal studies, case studies, etc. [12]. Sacks et al. [13] performed the first corroboration of the differences between RCTs and other types of studies. One of the interesting findings was that RCTs were in general less prone to bias than other types of studies, such as observational studies, studies with historical controls, etc. Bias produces some kind of deviation from the true value. Kitchenham [1] explains bias as "a tendency to produce results that depart systematically from the true results".
This early finding was highly influential and gave birth to the development of hierarchies of evidence [14]. The hierarchy of evidence was first proposed and popularized by the Canadian Task Force on the Periodic Health Examination [15] in 1979. A hierarchy of evidence grades studies according to their design in order to determine the best available evidence, and it reflects the degree of bias that different types of experiments might possess. The first hierarchy of evidence evolved into more refined ones, such as the CRD guidelines [16], the Australian National Health and Medical Research Council guidelines [5] and the Cochrane review handbook [8].
To understand how a hierarchy of evidence works, let us assume we have two studies, one RCT and one case study, comparing testing technique A with testing technique B in a specific context. Consider a situation where the RCT says technique A is better than technique B and the case study states the opposite. The result of the RCT would be given priority over the case study result because RCTs are situated above case studies in the hierarchy, and thus there is more confidence in the results of RCTs than in those of case studies.
A hierarchy of evidence can be used during an SR to determine which study designs contribute to the evidence. Hierarchies have been considered an uncontested method for grading the strength of evidence, and their predictive power has been thought to be demonstrated on multiple occasions [17].
Nevertheless, some voices have been raised against hierarchies. It has been observed that lack of bias is not necessarily related to the type of design, but to the provisions made in the empirical study to control confounding variables [18]. In other words, quality (the risk of bias [19]) is related to the internal validity of the experiment. Threats to internal validity have been identified as selection, performance, measurement and attrition (exclusion) bias [5], [8], [16]. Based on the observation that lack of bias (quality) is related to internal validity, many researchers in medicine studied the influence of selection, performance, measurement and execution aspects on bias, for example the presence of randomization, blinding, etc. in the experiment. Eliminating all kinds of bias from an experiment would lead to an internally valid experiment, and biases can be avoided by using mechanisms like randomization and blinding. Checklists were constructed with all the different kinds of bias in mind, asking whether all the relevant steps were performed to avoid bias [20].
In experimental SE, SRs and quality assessment are recent topics that gained attention after the seminal work by Kitchenham in 2004 [1]. Since the publication of that report, multiple SRs have been published in SE [2], [21]. Most of the SRs in SE do not have an explicit quality assessment process. In the new version of the report published in 2007 [22], Kitchenham proposes a questionnaire (checklist) covering the bias and validity problems that might arise in the different phases of an experiment. Dybå et al. [2] also developed and used a checklist in their SR on agile software development.
Lately, extensive criticism has been raised in medicine about the use of scales for measuring quality [8], [9], [10], [23]. No relationship was found between the scores obtained by an experiment on a given quality assessment checklist and the amount of bias suffered by the experiment [9]. The Cochrane reviewers' handbook goes to the extent of discouraging the use of quality assessment scales for the exclusion of experiments in SRs [8].
3. RESEARCH METHODOLOGY
Our research methodology tried to find out whether low quality experiments cause bias when they are aggregated in an SR. For this we need a quality assessment instrument to identify low quality experiments, ways to determine bias, and finally a way to relate experiment quality with bias. Therefore, our research methodology has three elementary components: (i) an assessment checklist, (ii) one or more ways to determine bias in experiments and (iii) a way to relate experiment quality with the bias in the experiment.
3.1 Quality Assessment Checklist
We started by studying the QA scales available in the empirical SE literature. Kitchenham [1], based on different works from medicine, proposes a questionnaire (checklist) for experiment quality assessment. In Kitchenham's checklist the average number of questions for evaluating an experiment is close to 45. Pragmatically speaking, this is an enormous number, since the questions need to be answered for every paper of an SR (very often more than 40 papers). Using such checklists would increase the time required to conduct an SR, making the process highly inefficient. Kitchenham [26] suggests not using all the questions provided in the checklist but selecting the questions that are most appropriate for the specific research problem. It is quite difficult to apply this strategy because researchers lack the information needed to identify which questions are relevant to the particular SR at hand.
Dybå et al. [2] also developed a questionnaire for the quality assessment of experiments in their SR on agile software development. These authors do not provide a justification for why and from where the items included in the checklist were selected.
None of the existing ESE QA questionnaires has been tested for suitability (to ensure that they measure what they are intended to measure) or reliability (to ensure that they measure consistently).
We decided to build a much shorter checklist covering the basic issues that influence quality according to Kitchenham's reports [1], [22], the Kitchenham et al. guidelines [4] and the practices of experiment QA in medicine [5], [8], [16], [22]. Our checklist is in essence similar to Kitchenham's checklist, as both use the same sources and are built on the basis of the same criteria (i.e. the importance of randomization, blinding, threats to validity, etc.). It is shown in Table 1, and a small scoring sketch follows the table. The questions have been classified into two groups: methodological and reporting aspects. The methodological questions deal with controlling extraneous variables. The reporting questions check whether important details are disclosed in the written report.
Creating a checklist each time an SR is performed would be quite time consuming. Therefore, we wanted to evaluate whether it is possible to develop a checklist valid for a diversity of SRs. We used experiments from several domains to ensure the generalizability of the checklist.
Table 1 Construct, attributes and questions of our checklist

Construct: Quality of Experiments

Attribute: Quality of Reporting
- Does the introduction contain the industrial context and a description of the techniques to be reviewed?
- Does the report summarize previous similar experiments that have been conducted and discuss them?
- Does the researcher define the population from which objects and subjects are drawn?
- Is the statistical significance mentioned with the results?
- Are the threats to validity mentioned explicitly, together with how these threats affect the results and conclusions?

Attribute: Methodological Quality
- Does the researcher define the process by which the treatment is applied to subjects or objects?
- Was randomization used for selecting the population and applying the treatment?
- Does the researcher define the process by which the subjects and objects are selected?
- Are hypotheses laid out, and are they consistent with the goal discussed in the introduction?
- Is the blinding procedure used convincingly enough?
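The report does not state how the yes/no answers to the Table 1 questions are turned into the single checklist score per experiment reported later in Table 5. The sketch below assumes the simplest scheme, namely the fraction of questions answered "yes", which produces a value between 0 and 1 like the reported scores; both the shortened item labels and the scoring rule are illustrative assumptions, not the authors' exact procedure.

# Illustrative sketch only: scoring an experiment against the Table 1 checklist,
# assuming each question is answered yes/no and the checklist score is the
# fraction of "yes" answers (a value between 0 and 1).

REPORTING_ITEMS = [
    "industrial context and reviewed techniques described",
    "previous similar experiments summarized and discussed",
    "population of subjects and objects defined",
    "statistical significance reported with the results",
    "threats to validity and their impact discussed",
]

METHODOLOGICAL_ITEMS = [
    "treatment application process defined",
    "randomization used for selection and treatment assignment",
    "subject and object selection process defined",
    "hypotheses stated and consistent with the stated goal",
    "blinding procedure convincing",
]

def checklist_score(answers):
    """answers: dict mapping checklist item -> bool (True means 'yes')."""
    items = REPORTING_ITEMS + METHODOLOGICAL_ITEMS
    return sum(1 for item in items if answers.get(item, False)) / len(items)

# An experiment satisfying 8 of the 10 items would score 0.8.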
3.2 Determining Bias in Experiments
Kitchenham [1] explains bias as "a tendency to produce results that depart systematically from the 'true' results". For example, if we find in a specific experiment that testing technique A is 50% more effective than testing technique B, but we know that the difference actually is 30%, the remaining 20% would be considered bias.
Therefore, for our research we need to know true values to be able to determine experiment bias. However, given the lack of empirical data in SE, true values are seldom available. We only have the results of SRs, whose values we might assume are close to the true values. We have found only two published SRs that detail the aggregations, showing the true value and how each experiment contributes to it (Dybå et al. [25] and Ciolkowski [26]). At UPM, we had two more SRs that detailed the aggregations [27], [28].
Using SR results, and depending on how the SRs were performed, we can identify three measures of bias. The first measure is known as the proximity to the mean value (PTMV) [27], [29] of all experiments in an SR. Figure 1 is a forest plot describing the aggregation of three experiments. The vertical line ending in a diamond gives the mean value of all three experiments. PTMV states that the farther an experiment is from the mean value, the more biased its results are. In Figure 1, from top to bottom, we have E1, E2 and E3. PTMV tells us that E3 possesses the highest bias, as it is farthest from the mean value, and E2 has the least bias. Table 2 provides the distances from the mean value and the corresponding bias order of the three experiments; a small computational sketch of this measure follows Table 2.
Table 2 PTMV scores for experiments in forest plot example.
Experiment Dis. ID PTMV Bias Order
E1 d1 0.47 2
E2 d2 0.17 3
E3 d3 0.48 1
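As an illustration of how PTMV could be computed, the sketch below takes assumed effect estimates for the three experiments of Figure 1, measures each experiment's absolute distance from the pooled mean and ranks the experiments by that distance (rank 1 = most biased). The effect values are invented for the example and are not the actual data behind Figure 1 or Table 2.

# Illustrative sketch of the PTMV (proximity to mean value) bias measure:
# the distance of each experiment's effect estimate from the mean of all
# effect estimates, ranked so that rank 1 is the most biased experiment.

def ptmv_ranking(effects):
    """effects: dict mapping experiment id -> effect estimate."""
    mean = sum(effects.values()) / len(effects)
    distances = {exp: abs(value - mean) for exp, value in effects.items()}
    order = sorted(distances, key=distances.get, reverse=True)
    return distances, {exp: rank + 1 for rank, exp in enumerate(order)}

# Hypothetical effect estimates for E1, E2 and E3.
distances, bias_order = ptmv_ranking({"E1": 0.10, "E2": 0.40, "E3": 1.05})
print(distances)    # distance of each experiment from the mean value
print(bias_order)   # {'E3': 1, 'E1': 2, 'E2': 3} -> E3 is the most biased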
The second measure is the conflicting and non-conflicting contribution to the aggregation (CNC) of all experiments in an SR. Table 3 describes three aggregations produced by using comparative analysis [30]. Each aggregation has a response variable and a generalized result associated with it.
Table 3 Experiments contribution to the aggregation

Aggregation  Response variable  Generalized result  Positive effects  Negative effects  No effects
AGG01        R1                 T{a} > T{b}         E1, E3            E2                -
AGG02        R2                 T{c} < T{d}         E1, E5            -                 E4
AGG03        R3                 T{e} > T{f}         E3, E2            -                 -
Figure 1 Aggregation forest plot of three experiments.
An experiment either has a positive effect on the aggregated result, a negative effect, or no effect. A positive effect means that the experimental result agrees with the generalized result, and a negative effect means the opposite. No effect means that the experiment produced insignificant results. The CNC contribution can be calculated using the formula given in Equation 1.

Equation 1 CNC quality score equation:
CNC score = (number of non-conflicting contributions of the experiment) / (total number of contributions of the experiment)
Let us take an example to calculate the CNC score. Experiment E2 conflicts with aggregation AGG01 and supports (does not conflict with) aggregation AGG03; therefore, the CNC score for experiment E2 is 0.5. The higher the CNC score of an experiment, the less bias it possesses, because a higher CNC score means the experiment is more coherent with the aggregated results. Now take E1, which supports both AGG01 and AGG02 and therefore has a CNC score of 1. CNC scores are handy for SRs that do not use meta-analysis for data aggregation. However, CNC scores are less precise than PTMV scores.
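To make the CNC computation concrete, here is a small sketch based on the contributions of the illustrative Table 3, under the assumption stated in Equation 1 that the score is the number of supporting (non-conflicting) contributions divided by the experiment's total number of contributions; how "no effect" contributions are counted is our assumption, not something the original report pins down.

# Illustrative sketch of the CNC (conflicting / non-conflicting contribution) score.
# Contributions are labelled "positive" (supports the generalized result),
# "negative" (conflicts with it) or "none" (insignificant result).

from collections import defaultdict

# Contributions per aggregation, following the illustrative Table 3.
contributions = {
    "AGG01": {"positive": ["E1", "E3"], "negative": ["E2"], "none": []},
    "AGG02": {"positive": ["E1", "E5"], "negative": [], "none": ["E4"]},
    "AGG03": {"positive": ["E3", "E2"], "negative": [], "none": []},
}

def cnc_scores(contributions):
    supporting, total = defaultdict(int), defaultdict(int)
    for cells in contributions.values():
        for effect, experiments in cells.items():
            for exp in experiments:
                total[exp] += 1
                if effect == "positive":
                    supporting[exp] += 1
    return {exp: supporting[exp] / total[exp] for exp in total}

print(cnc_scores(contributions))
# E2 supports AGG03 but conflicts with AGG01 -> 0.5; E1 supports both -> 1.0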
The third measure we used is expert opinion about the experiment, based on the experiment's report [1], [16]. Based on experience, an expert can differentiate between biased and unbiased experiments in an SR. We adapted the expert opinion measure to make it simpler: a value of -1, 0 or 1 was given for a poor, average or good experiment, respectively. The higher the score, the smaller the amount of bias in the experiment.
3.3 Relating Quality with Bias
The third component of our research methodology is a way to relate the checklist results with bias in the experiments. If internal validity and lack of bias are related, we expect high values of internal validity to be associated with low levels of bias, and vice versa. Correlations are the standard way to quantify such a relationship [31]. Therefore, we correlated the checklist scores with the PTMV scores, the CNC scores and the expert opinion scores.
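The report does not say which correlation coefficient was computed. As a sketch of this step, the snippet below uses Pearson's r from SciPy on the paired checklist and CNC scores of the first six elicitation SR experiments listed in Table 5; any other coefficient (e.g. Spearman's rho) could be substituted, and the subset used here is for illustration only, so the resulting value will not match Table 6.

# Illustrative sketch of relating checklist scores to a bias measure through
# correlation. Pearson's r is used here; the paper only reports coefficients
# and p-values (Table 6) without naming the statistic.

from scipy.stats import pearsonr

# Paired scores of six elicitation SR experiments taken from Table 5.
checklist_scores = [0.70, 0.72, 0.63, 0.90, 0.45, 0.60]  # E2, E3, E4, E5, E6, E8
cnc_scores = [1.00, 0.00, 0.00, 1.00, 0.50, 0.58]

r, p_value = pearsonr(checklist_scores, cnc_scores)
print(f"r = {r:.3f}, p = {p_value:.3f}")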
4. RESEARCH RESULTS

Table 4 The three SRs used in our research

Name                                No. of experiments   Aggregation mechanism   Authors
SR on elicitation techniques [27]   14                   Comparative analysis    Oscar Dieste, Natalia Juristo
SR on inspection techniques [28]    13                   Meta-analysis           Anna Griman Padua, Oscar Dieste, Natalia Juristo
SR on pair programming [25]         15                   Meta-analysis           Tore Dybå, Erik Arisholm, Dag I.K. Sjoberg, Jo E. Hannay and Forrest Shull
The three SRs described in Table 4 were used to obtain the scores we have described [32]. Checklist, PTMV, CNC and expert opinion scores were obtained for all the experiments in the three SRs. Table 5 provides all the scores obtained.
Table 5 All the scores

Set                  Study ID   CNC score   PTMV score   Checklist score   Expert opinion score
Elicitation SR       E2         1.00        N.A.         0.70              0
Elicitation SR       E3         0.00        N.A.         0.72              1
Elicitation SR       E4         0.00        N.A.         0.63              1
Elicitation SR       E5         1.00        N.A.         0.90              1
Elicitation SR       E6         0.50        N.A.         0.45              1
Elicitation SR       E8         0.58        N.A.         0.60              1
Elicitation SR       E17        0.33        N.A.         0.60              1
Elicitation SR       E18        0.50        N.A.         0.60              1
Elicitation SR       E19        1.00        N.A.         0.80              1
Elicitation SR       E21        0.33        N.A.         0.80              -1
Elicitation SR       E25        1.00        N.A.         0.90              0
Elicitation SR       E28        0.50        N.A.         0.30              -1
Elicitation SR       E31        0.50        N.A.         0.60              -1
Elicitation SR       E36        0.83        N.A.         0.90              1
Elicitation SR       E43        1.00        N.A.         0.70              1
Inspection SR        E05        1.00        0.34         0.80              1
Inspection SR        E01        1.00        0.13         0.80              0
Inspection SR        E02        1.00        0.36         0.70              0
Inspection SR        E06        1.00        0.23         0.80              1
Inspection SR        E03        1.00        0.21         0.70              0
Inspection SR        E04 - R1   1.00        0.09         0.80              1
Inspection SR        E04 - R2   1.00        0.63         0.80              1
Inspection SR        E13        1.00        0.32         0.80              1
Inspection SR        E14        0.50        0.26         0.80              1
Inspection SR        E15        0.50        0.37         0.80              1
Inspection SR        E19        1.00        0.33         0.80              0
Inspection SR        E20        1.00        0.14         0.80              0
Pair Programming SR  P98        0.50        2.11         0.7               0
Pair Programming SR  So0        1.00        0.66         0.6               0
Pair Programming SR  So1        0.00        0.24         0.5               0
Pair Programming SR  So2        0.00        0.265        0.6               0
Pair Programming SR  Po2        0.00        0.34         0.6               1
Pair Programming SR  So3        0.00        0.215        0.9               1
Pair Programming SR  So5a       1.00        0.335        0.7               0
Pair Programming SR  So5b       0.33        0.44         0.8               1
Pair Programming SR  So5c       0.00        0.08         0.8               0
Pair Programming SR  So6a       0.00        0.3          0.8               1
Pair Programming SR  So6b       0.00        0.26         0.8               1
Pair Programming SR  So6c       0.67        1.575        0.8               -1
Pair Programming SR  So6d       0.67        1.2433       0.6               0
Pair Programming SR  So7b       0.00        0.65         0.9               1
Pair Programming SR  So7a       0.33        0.19         0.8               1
Three different experts were used. Two of them were PhD students writing their theses on SRs and aggregation, and the third was a senior researcher in experimental SE.
Correlations were computed between the CNC scores and the checklist scores, the checklist scores and the PTMV scores, and the checklist scores and the expert opinion scores. Figure 2 shows all the correlations, and Table 6 provides the correlation coefficients and corresponding p-values for the correlations described in Figure 2.
Considering the first row of Figure 2 (CNC vs. checklist), the x-axis gives the checklist scores and the y-axis gives the CNC scores. From left to right, the first plot (elicitation SR) shows fluctuating CNC scores for checklist scores from 0.60 to 0.90, evidently indicating a poor correlation. The correlation coefficient and p-value given in Table 6 (correlation coefficient = 0.453; p-value = 0.242) substantiate this observation. Similarly, the second and third plots in the first row exhibit poor correlations.
In the second row of Figure 2, the rightmost plot shows the correlation between the checklist and PTMV scores for the pair programming SR. As we can see, for a PTMV score (x-axis) of 1 we have different checklist scores (y-axis), again suggesting a poor correlation. The correlation coefficient (-0.078) substantiates this observation, and the p-value (0.526) makes the result statistically insignificant. The other plot in the same row (inspection SR column) also shows a poor correlation.
In the third row of Figure 2, the second plot from the left shows the correlation between the checklist and expert opinion scores for the inspection SR. The plot shows a positive trend: as the expert opinion scores (x-axis) increase, the checklist scores also increase. The correlation coefficient (0.677) suggests a positive correlation. However, the p-value (0.129) is higher than the statistical significance level (0.05). The most probable reason is that, with such a small sample size (11 studies), it is difficult to reach statistical significance. The plots and correlation coefficients for the other two SRs (elicitation and pair programming) suggest no correlation between expert opinion and checklist scores.
All the correlations were poor and statistically insignificant, suggesting that no relationship could be identified between quality as measured by the checklist and bias as measured by CNC, PTMV and expert opinion.
Table 6 Correlation coefficients between the checklist scores and each bias measure, with the corresponding p-values in parentheses.

                     CNC                  PTMV                 Expert Opinion
Elicitation SR        0.453 (p = 0.242)   N.A.                  0.222 (p = 0.367)
Inspection SR        -0.182 (p = 0.600)    0.085 (p = 0.453)    0.677 (p = 0.129)
Pair Programming SR  -0.236 (p = 0.630)   -0.078 (p = 0.526)    0.406 (p = 0.277)
Figure 2 Correlations from top to bottom (i) CNC vs.
Checklist (ii) Checklist vs. PTMV (iii) Checklist vs. Expert
Opinion. Correlations from left to right (i) Elicitation SR (ii)
Inspection SR (iii) Pair Programming SR.
5. DISCUSSION
The results stimulate discussion. Getting back to the question: should experiments in an SR be discarded based on their quality? Discarding an experiment in an SR based on its quality (methodological or reporting) seems to be inappropriate. No correlations were obtained between the QA checklist scores and the different measures of bias. Our checklist is similar to the contemporary QA instruments prevalent in SE; therefore, contemporary QA instruments should be used carefully. Excluding valid studies from an SR would reduce the precision of the aggregated result, while including invalid studies might bias the aggregated result.
The Cochrane Handbook [8] shares our point of view on the QA of experiments in medicine. The Cochrane Handbook [8], along with other researchers from medicine [9], [10], [11], discourages the use of QA scales, as their validity has not been proven. Jüni [10] states, "Scales have been shown to be unreliable assessments of validity and they are less likely to be transparent to the users of the review." None of the scales present in SE has proved its validity, including ours. Moreover, QA scales can discard useful experiments from an SR, making the aggregated result biased, imprecise and pragmatically worthless. Therefore, it might be advisable to discourage the use of QA scales in SE.
The results we have obtained are statistically insignificant. If the number of experiments increases, we might attain statistical significance. We will move ahead when more SRs that detail their aggregations are available.
Researchers in medicine [14], [33] have identified a potential aspect in measuring quality: the relation between the object being experimented on and the experimenters seems to bias the results. This object-experimenter relationship is the only aspect empirically identified in medicine as being related to bias in experimental results. It has been observed in medicine that multi-centre experiments avoid this bias, and evidence from SRs of multi-centre studies is considered to have the maximum confidence [14]. However, further research is needed to substantiate this postulate in SE.
6. CONCLUSIONS
We have analysed whether QA instruments such as checklists have a relationship with experimental bias. Our pilot approach to this research used our own QA checklist. The most widespread QA checklist in SE [22] was too long (45 questions) to be answered for the 45 experiments we were analysing. Therefore, we generated a short version of that checklist.
For three different SRs [25], [27], [28] we analysed the correlations between the QA checklist scores and three bias measurements (proximity to mean value, conflicting and non-conflicting contribution, and expert opinion). All the correlations were poor and insignificant. These results suggest that discarding experiments from an SR based on their quality should not be practiced. They match similar results and recommendations in medicine [8], [9], [10], [11].
The next step in our research is to run this same analysis with similar QA instruments ([2], [34] for instance).
7. ACKNOWLEDGMENTS
This work was part of a master's thesis carried out at UPM, Madrid, Spain and TU Kaiserslautern, Germany, under Dieter Rombach, Natalia Juristo and Oscar Dieste, within the European Masters in Software Engineering programme. This work was partially supported by the projects TIN 2008-00555 and HD 2008-0048.
8. REFERENCES
[1] B.A. Kitchenham, Procedures for performing systematic
reviews, Technical Report TR/SE-0401, 2004.
[2] Dybå T, Dingsøyr T. Empirical studies of agile software development: A systematic review. Inf. Softw. Technol. 2008;50(9-10):833-859. Available at: http://portal.acm.org.miman.bib.bth.se/citation.cfm?id=1379905.1379989&coll=GUIDE&dl=GUIDE&CFID=81486639&CFTOKEN=51504098 [Accessed March 11, 2010].
[3] Dybå T, Dingsøyr T, Hanssen G. Applying Systematic Reviews to Diverse Study Types: An Experience Report. 20-21 Sept. 2007, pp. 225-234.
[4] B.A. Kitchenham, S.L. Pfleeger, L.M. Pickard, P.W. Jones, D.C. Hoaglin, K.E. Emam, and J. Rosenberg, Preliminary Guidelines for Empirical Research in Software Engineering, IEEE Transactions on Software Engineering, vol. 28, no. 8, Aug. 2002, pp. 721-734.
[5] National Health and Medical Research Council (NHMRC),
“NHMRC handbook series on preparing clinical practice
guidelines.” ISBN 0642432952, Feb. 2000.
[6] D. Moher, B. Pham, A. Jones, D.J. Cook, A.R. Jadad, M. Moher, P. Tugwell, and T.P. Klassen, Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet, vol. 352, Aug. 1998, pp. 609-613.
[7] Blobaum P. Physiotherapy Evidence Database (PEDro). J
Med Libr Assoc. 2006; 94 (4):477-478.
[8] J.P.T. Higgins and S. Green, Cochrane Handbook for
Systematic Reviews of Interventions, John Wiley and Sons,
2008.
[9] J.D. Emerson, E. Burdick, D.C. Hoaglin, F. Mosteller, and
T.C. Chalmers, An empirical study of the possible relation of
treatment differences to quality scores in controlled
randomized clinical trials, Controlled Clinical Trials, vol. 11,
Oct. 1990, pp. 339-352.
[10] P. Jüni, A. Witschi, R. Bloch, and M. Egger, The hazards of
scoring the quality of clinical trials for meta-analysis, JAMA:
The Journal of the American Medical Association, vol. 282,
Sep. 1999, pp. 1054-1060.
[11] Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical
evidence of bias. Dimensions of methodological quality
associated with estimates of treatment effects in controlled
trials. JAMA 1995; 273: 408-412.
[12] Stuart Barton, Which clinical studies provide the best
evidence? BMA House, Tavistock Square, London WC1H
9JR, sbarton@bmjgroup.com
[13] Sacks H, Chalmers TC, Smith H Jr., Randomized versus
historical controls for clinical trials, Am J Med 1982; 72:
233-40.
[14] David Evans BN, Hierarchy of evidence: a framework for
ranking evidence evaluating healthcare interventions,
Department of Clinical Nursing, University of Adelaide,
South Australia 5005, 2002.
[15] Canadian Task Force on the Periodic Health Examination. (1979) The periodic health examination. Canadian Medical Association Journal 121, 1193-1254.
[16] U.O.Y. Centre for Reviews and Dissemination, Undertaking
systematic reviews of research on effectiveness: CRD's
guidance for carrying out or commissioning reviews, NHS
Centre for Reviews and Dissemination, Mar. 2001.
[17] Louis PCA. Research into the effects of bloodletting in some inflammatory diseases and on the influence of tartarized antimony and vesication in pneumonitis. Am J Med Sci 1836; 18:102-111.
[18] R.R. MacLehose, B.C. Reeves, I.M. Harvey, T.A. Sheldon, I.T. Russell, A.M.S. Black, A systematic review of comparisons of effect sizes derived from randomised and non-randomised studies, Health Technology Assessment 2000; Vol. 4: No. 34.
[19] Verhagen A.P., de Vet H.C.W., de Bie R.A., Boers M., A
van den Brandt P. The art of quality assessment of RCTs
included in systematic reviews. Journal of Clinical
Epidemiology. 2001; 54:651-654. Available at:
http://www.ingentaconnect.com/content/els/08954356/2001/
00000054/00000007/art00360 [Accessed March 3, 2010].
[20] A.P. Verhagen, H.C. de Vet, R.A. de Bie, A.G. Kessels, M.
Boers, L.M. Bouter, and P.G. Knipschild, The Delphi list: a
criteria list for quality assessment of randomized clinical
trials for conducting systematic reviews developed by Delphi
consensus, Journal of Clinical Epidemiology, vol. 51, Dec.
1998, pp. 1235-1241.
[21] Finn Olav Bjørnson, Torgeir Dingsøyr, Knowledge management in software engineering: A systematic review of studied concepts, findings and research methods used, Norwegian University of Science and Technology, Department of Computer and Information Science, Trondheim, Norway; SINTEF Information and Communication Technology, SP Andersens vei 15b, 7465 Trondheim, Norway, 2008.
[22] B. Kitchenham and S. Charters, Guidelines for performing Systematic Literature Reviews in Software Engineering, EBSE Technical Report EBSE-2007-01, 2007.
[23] E.M. Balk, P.A.L. Bonis, H. Moskowitz, C.H. Schmid,
J.P.A. Ioannidis, C. Wang, and J. Lau, Correlation of quality
measures with estimates of treatment effect in meta-analyses
of randomized controlled trials, JAMA: The Journal of the
American Medical Association, vol. 287, Jun. 2002, pp.
2973-2982.
[24] Fink, A. Conducting Research Literature Reviews. From the
Internet to Paper, Sage Publication, Inc., 2005.
[25] Hannay JE, Dybå T, Arisholm E, Sjøberg DIK. The
effectiveness of pair programming: A meta-analysis. Inf.
Softw. Technol. 2009; 51 (7):1110-1122. Available at:
http://portal.acm.org.miman.bib.bth.se/citation.cfm?id=1539
052.1539606 [Accessed March 11, 2010].
[26] Ciolkowski M. What do we know about perspective-based
reading? An approach for quantitative aggregation in
software engineering. In: Proceedings of the 2009 3rd
International Symposium on Empirical Software Engineering
and Measurement. IEEE Computer Society; 2009:133-144.
Available at http://portal.acm.org/citation.cfm?doid=
1671248.1671262 [Accessed March 11, 2010].
[27] Oscar Dieste, Natalia Juristo Systematic Review and
Aggregation of Empirical Studies on Elicitation Techniques.
IEEE Transactions on Software Engineering, 2008.
[28] Anna Griman Padua, Oscar Dieste and Natalia Juristo,
Systematic review on inspection techniques, UPM, Spain,
(Under Construction).
[29] Schulz, Kenneth F.; Chalmers, Iain; Hayes, Richard J.;
Altman, Douglas G, Empirical evidence of bias: Dimensions
of methodological quality associated with estimates of
treatment effects in controlled trials, JAMA: Journal of the
American Medical Association. Vol 273(5), Feb 1995, 408-
412.
[30] C.C. Ragin, The comparative method, University of
California Press, 1987.
[31] DeCoster, J. (2005). Scale Construction Notes, Jun. 2005,
http://www.start-help.com/notes.html.
[32] Himanshu Saxena, Prospective Study For The Quality
Assessment Of Experiments Included In Systematic
Reviews, Master thesis, UPM, Madrid, 2009.
[33] P. Jüni, A. Witschi, R. Bloch, and M. Egger, The hazards of
scoring the quality of clinical trials for meta-analysis, JAMA:
The Journal of the American Medical Association, vol. 282,
Sep. 1999, pp. 1054-1060.
[34] Kitchenham, B., Mendes, E., Travassos, G.H. (2007) A
Systematic Review of Cross- vs. Within-Company Cost
Estimation Studies, IEEE Trans on SE, 33 (5), pp 316-329.
9. LIST OF EXPERIMENTS USED IN OUR STUDY
Set 1 - The following is the list of studies used in the systematic review on elicitation techniques. The unique alphanumeric code labelling each study is the code used while performing the systematic review; hence, the aggregation tables use these codes to represent the experiments.
[E2] Agarwal, R. and Tanniru, M. R., "Knowledge Acquisition
Using Structured Interviewing: An Empirical Investigation,"
Journal of Management Information Systems, vol. 7, pp.
123-141, Summer, 1990.
[E3] Bech-Larsen, T. and Nielsen, N. A., "A comparison of five
elicitation techniques for elicitation of attributes of low
involvement products," Journal of Economic Psychology,
vol. 20, pp. 315-341, 1999.
[E4] Breivik, E. and Supphellen, M., "Elicitation of product
attributes in an evaluation context: A comparison of three
elicitation techniques," Journal of Economic Psychology,
vol. 24, pp. 77-98, 2003.
[E5] Browne, G. J. and Rogich, M. B., "An Empirical
Investigation of User Requirements Elicitation: Comparing
the Effectiveness of Prompting Techniques," Journal of
Management Information Systems , vol. 17, pp. 223-249,
Spring, 2001.
[E6] Burton, A. M., Shadbolt, N. R., Hedgecock, A. P., and Rugg,
G. A formal evaluation of knowledge elicitation techniques
for expert systems: Domain 1. In: Research and
development in expert systems IV: proceedings of Expert
Systems '87, the seventh annual Technical Conference of
the British Computer Society Specialist Group on Expert
Systems, ed. Moralee, D. S. Cambridge, UK: Cambridge
University Press, 1987.
[E8] Corbridge, B., Rugg, G., Major, N. P., Shadbolt, N. R. , and
Burton, A. M., "Laddering - technique and tool use in
knowledge acquisition," Knowledge Acquisition, vol. 6, pp.
315-341, 1994.
[E17, E18] Burton, A. M., Shadbolt, N. R., Rugg, G., and
Hedgecock, A. P., "The efficacy of knowledge acquisition
techniques: A comparison across domains and levels of
expertise," Knowledge Acquisition, vol. 2, pp. 167-178,
1990.
[E19] Moody, J. W., Will, R. P., and Blanton, J. E., "Enhancing
knowledge elicitation using the cognitive interview," Expert
Systems with Applications, vol. 10, pp. 127-133, 1996.
[E21] Zmud, R. W. , Anthony, W. P., and Stair, R. M., "The use
of mental imagery to facilitate information identification in
requirements analysis," Journal of Management Information
Systems, vol. 9, pp. 175-191, 1993.
[E25] Pitts, M. G. and Browne, G. J., "Stopping Behavior of
Systems Analysts During Information Requirements
Elicitation," Journal of Management Information Systems,
vol. 21, pp. 203-226, 2004.
Set 2 - The following is the list of studies used in the systematic review on inspection techniques. The unique alphanumeric code labelling each study is the code used while performing the systematic review; hence, the aggregation tables use these codes to represent the experiments.
[E03] J. Miller, M. Wood, and M. Roper, “Further Experiences
with Scenarios and Checklists,” Empirical Softw. Engg.,
vol. 3, 1998, pp. 37-64.
[E13], [E14], [E15] O. Laitenberger, K.E. Emam, and T.G.
Harbich, “An Internally Replicated Quasi-Experimental
Comparison of Checklist and Perspective-Based Reading of
Code Documents,” IEEE Trans. Softw. Eng., vol. 27, 2001,
pp. 387-421.
[E04] K. Sandahl, O. Blomkvist, J. Karlsson, C. Krysander, M. Lindvall, and N. Ohlsson, “An Extended Replication of an Experiment for Assessing Methods for Software Requirements Inspections,” Empirical Softw. Engg., vol. 3, no. 4, Dec. 1998.
[E19] T. Thelin and P. Runeson, “An Experimental Comparison of Usage-Based and Checklist-Based Reading,” Aug. 2003.
[E06] A.A. Porter and L.G. Votta, “An experiment to assess
different defect detection methods for software requirements
inspections,” Proceedings of the 16th international
conference on Software engineering, Sorrento, Italy: IEEE
Computer Society Press, 1994, pp. 103-112.
[E20] T. Thelin, C. Andersson, P. Runeson, and N. Dzamashvili-
Fogelstrom, “A Replicated Experiment of Usage-Based and
Checklist-Based Reading,” Proceedings of the Software
Metrics, 10th International Symposium, IEEE Computer
Society, 2004, pp. 246-256.
[E05] A. Porter and L. Votta, “Comparing Detection Methods
For Software Requirements Inspections: A Replication
Using Professional Subjects,” Empirical Softw. Engg., vol.
3, 1998, pp. 355-379.
[E01] A.A. Porter, L.G. Votta, Jr., and V.R. Basili, “Comparing Detection Methods For Software Requirements Inspections: A Replicated Experiment,” 1995.
Set 3 - The following is the list of studies used in the systematic review on pair programming. The unique alphanumeric code labelling each study is the code used while performing the systematic review; hence, the aggregation tables use these codes to represent the experiments.
[P98] J.T. Nosek, “The Case for Collaborative Programming,”
Comm. ACM, vol. 41, no. 3, 1998, pp. 105–108.
[So0] L. Williams, R.R. Kessler, W. Cunningham, and R.
Jeffries, “ Strengthening the Case for Pair
Programming,” IEEE Software, vol. 17, no. 4, 2000,
pp. 19–25.
[So1] J. Nawrocki and A. Wojciechowski, “Experimental
Evaluation of Pair Programming,” Proc. European
Software Control and Metrics Conference (ESCOM 01),
2001, pp. 269–276.
[So2] P. Baheti, E. Gehringer, and D. Stotts, “Exploring the
Efficacy of Distributed Pair Programming,” Extreme
Programming and Agile Methods—XP/Agile Universe
2002, LNCS 2418, Springer, 2002, pp. 208–220.
[Po2] M. Rostaher and M. Hericko, “Tracking Test First Pair
Programming—An Experiment,” Proc. XP/Agile
Universe 2002, LNCS 2418, Springer, 2002, pp. 174–
184.
[So3] S. Heiberg, U. Puus, P. Salumaa, and A. Seeba, “Pair-
Programming Effect on Developers Productivity,”
Extreme Programming and Agile Processes in Software
Eng.—Proc 4th Int’l Conf. XP 2003, LNCS 2675,
Springer, 2003, pp. 215–224.
[So5a] G. Canfora, A. Cimitile, and C.A. Visaggio, “Empirical
Study on the Productivity of the Pair Programming,”
Extreme Programming and Agile Processes in Software
Eng.—Proc 6th Int’l Conf. XP 2005, LNCS 3556,
Springer, 2005, pp. 92–99.
[So5b] M.M. Müller, “Two Controlled Experiments
Concerning the Comparison of Pair Programming to
Peer Review,” J. Systems and Software, vol. 78, no. 2,
2005, pp. 169–179.
[So5c] J. Vanhanen and C. Lassenius, “Effects of Pair
Programming at the Development Team
Level: An Experiment,” Proc. Int’l Symp.
Empirical Software Eng. (ISESE 05), IEEE CS Press,
2005, pp. 336–345.
[So6a] L. Madeyski, “The Impact of Pair Programming and
Test-Driven Development on Package Dependencies in
Object-Oriented Design—An Experiment,” Product-
Focused Software Process Improvement—Proc. 7th
Int’l Conf. (Profes 06), LNCS 4034, Springer, 2006, pp.
278–289.
[So6b] M.M. Müller, “A Preliminary Study on the Impact of a
Pair Design Phase on Pair Programming and Solo
Programming,” Information and Software Technology,
vol. 48, no. 5, 2006, pp. 335–344.
[So6c] M. Phongpaibul and B. Boehm, “An Empirical
Comparison between Pair Development and Software
Inspection in Thailand,” Proc. Int’l Symp. Empirical
Software Eng. (ISESE 06), ACM Press, 2006, pp. 85–
94.
[So6d] S. Xu and V. Rajlich, “Empirical Validation of
Test-Driven Pair Programming in Game
Development,” Proc. Int’l Conf. Computer and
Information Science and Int’l Workshop Component-
Based Software Eng., Software Architecture and Reuse
(ICIS-COMSAR 06), IEEE CS Press, 2006, pp. 500–
505.
[Po7b] G. Canfora, A. Cimitile, F. Garcia, M. Piattini, and
C.A. Visaggio, “Evaluating Performances of Pair
Designing in Industry,” J. Systems and Software, vol.
80, no. 8, 2007, pp. 1317–1327.
[Po7b] E. Arisholm, H. Gallis, T. Dybå, and D.I.K. Sjøberg,
“Evaluating Pair Programming with Respect
to System Complexity and Programmer
Expertise,” IEEE Trans. Software Eng., vol. 33, no. 2,
2007, pp. 65–86.

Más contenido relacionado

La actualidad más candente

Bias,confounding, causation and experimental designs
Bias,confounding, causation and experimental designsBias,confounding, causation and experimental designs
Bias,confounding, causation and experimental designsTarek Tawfik Amin
 
QUASI EXPERIMENTAL DESIGN . Type of Research Design. Research nursing
QUASI EXPERIMENTAL DESIGN . Type of Research Design. Research nursing QUASI EXPERIMENTAL DESIGN . Type of Research Design. Research nursing
QUASI EXPERIMENTAL DESIGN . Type of Research Design. Research nursing selvaraj227
 
Experimental method In Research Methodology
Experimental method In Research MethodologyExperimental method In Research Methodology
Experimental method In Research MethodologyRamla Sheikh
 
Ppt. types of quantitative research
Ppt.  types of quantitative researchPpt.  types of quantitative research
Ppt. types of quantitative researchNursing Path
 
Understanding The Experimental Research Design(Part I)
Understanding The Experimental Research Design(Part I)Understanding The Experimental Research Design(Part I)
Understanding The Experimental Research Design(Part I)DrShalooSaini
 
The Use of Historical Controls in Post-Test only Non-Equivalent Control Group
The Use of Historical Controls in Post-Test only Non-Equivalent Control GroupThe Use of Historical Controls in Post-Test only Non-Equivalent Control Group
The Use of Historical Controls in Post-Test only Non-Equivalent Control Groupijtsrd
 
Experimental Method of Data Collection
Experimental Method of Data CollectionExperimental Method of Data Collection
Experimental Method of Data CollectionDhanusha Dissanayake
 
Cluster randomization trial presentation
Cluster randomization trial presentationCluster randomization trial presentation
Cluster randomization trial presentationRanadip Chowdhury
 
Internal and external validity factors
Internal and external validity factorsInternal and external validity factors
Internal and external validity factorsAmir Mahmoud
 
Application of an evidence
Application of an evidenceApplication of an evidence
Application of an evidenceSyedEsamMahmood
 
Study design-guide
Study design-guideStudy design-guide
Study design-guideAmareBelete
 
Evidence based decision making in periodontics
Evidence based decision making in periodonticsEvidence based decision making in periodontics
Evidence based decision making in periodonticsHardi Gandhi
 
4. level of evidence
4. level of evidence4. level of evidence
4. level of evidenceSaurab Sharma
 

La actualidad más candente (20)

Experimental method of Research
Experimental method of ResearchExperimental method of Research
Experimental method of Research
 
Bias,confounding, causation and experimental designs
Bias,confounding, causation and experimental designsBias,confounding, causation and experimental designs
Bias,confounding, causation and experimental designs
 
QUASI EXPERIMENTAL DESIGN . Type of Research Design. Research nursing
QUASI EXPERIMENTAL DESIGN . Type of Research Design. Research nursing QUASI EXPERIMENTAL DESIGN . Type of Research Design. Research nursing
QUASI EXPERIMENTAL DESIGN . Type of Research Design. Research nursing
 
Experimental method In Research Methodology
Experimental method In Research MethodologyExperimental method In Research Methodology
Experimental method In Research Methodology
 
Ppt. types of quantitative research
Ppt.  types of quantitative researchPpt.  types of quantitative research
Ppt. types of quantitative research
 
Clinical trials
Clinical trialsClinical trials
Clinical trials
 
Understanding The Experimental Research Design(Part I)
Understanding The Experimental Research Design(Part I)Understanding The Experimental Research Design(Part I)
Understanding The Experimental Research Design(Part I)
 
Pre- Experimental Research
Pre- Experimental Research Pre- Experimental Research
Pre- Experimental Research
 
Grading Strength of Evidence
Grading Strength of EvidenceGrading Strength of Evidence
Grading Strength of Evidence
 
The Use of Historical Controls in Post-Test only Non-Equivalent Control Group
The Use of Historical Controls in Post-Test only Non-Equivalent Control GroupThe Use of Historical Controls in Post-Test only Non-Equivalent Control Group
The Use of Historical Controls in Post-Test only Non-Equivalent Control Group
 
Clinical trials
Clinical trialsClinical trials
Clinical trials
 
Experimental Method of Data Collection
Experimental Method of Data CollectionExperimental Method of Data Collection
Experimental Method of Data Collection
 
Cluster randomization trial presentation
Cluster randomization trial presentationCluster randomization trial presentation
Cluster randomization trial presentation
 
Retrospective vs Prospective Study
Retrospective vs Prospective StudyRetrospective vs Prospective Study
Retrospective vs Prospective Study
 
Internal and external validity factors
Internal and external validity factorsInternal and external validity factors
Internal and external validity factors
 
Research Design
Research DesignResearch Design
Research Design
 
Application of an evidence
Application of an evidenceApplication of an evidence
Application of an evidence
 
Study design-guide
Study design-guideStudy design-guide
Study design-guide
 
Evidence based decision making in periodontics
Evidence based decision making in periodonticsEvidence based decision making in periodontics
Evidence based decision making in periodontics
 
4. level of evidence
4. level of evidence4. level of evidence
4. level of evidence
 

Destacado

οικουμενικη διακηρυξη των δικαιωματων του ανθρωπου
οικουμενικη διακηρυξη των δικαιωματων του ανθρωπουοικουμενικη διακηρυξη των δικαιωματων του ανθρωπου
οικουμενικη διακηρυξη των δικαιωματων του ανθρωπουroulagkika
 
Sdp ja työurat
Sdp ja työuratSdp ja työurat
Sdp ja työuratSDP
 
Preliminary task – school magazine
Preliminary task – school magazinePreliminary task – school magazine
Preliminary task – school magazinebarrettlasharn
 
Šovinizam i rasizam Dinka Gruhonjića
Šovinizam i rasizam Dinka GruhonjićaŠovinizam i rasizam Dinka Gruhonjića
Šovinizam i rasizam Dinka Gruhonjićagosteljski
 
Sosiaali- ja terveysministeriön asettaman työryhmän raportti
Sosiaali- ja terveysministeriön asettaman työryhmän raporttiSosiaali- ja terveysministeriön asettaman työryhmän raportti
Sosiaali- ja terveysministeriön asettaman työryhmän raporttiSDP
 
Residential Flats in Omaxe City Jaipur...8459137252
Residential Flats in Omaxe City Jaipur...8459137252Residential Flats in Omaxe City Jaipur...8459137252
Residential Flats in Omaxe City Jaipur...8459137252sahilkharkara
 
Kisi kisi prediksi soal un MTK
Kisi kisi prediksi soal un MTKKisi kisi prediksi soal un MTK
Kisi kisi prediksi soal un MTKDeni Riansyah
 
Automatisering af tværautomatiske processer af Mia Cordes Hansen, iHedge A/S
Automatisering af tværautomatiske processer af Mia Cordes Hansen, iHedge A/SAutomatisering af tværautomatiske processer af Mia Cordes Hansen, iHedge A/S
Automatisering af tværautomatiske processer af Mia Cordes Hansen, iHedge A/SInfinIT - Innovationsnetværket for it
 
CrowdSearch2012 - Discovering User Perceptions of Semantic Similarity in Near...
CrowdSearch2012 - Discovering User Perceptions of Semantic Similarity in Near...CrowdSearch2012 - Discovering User Perceptions of Semantic Similarity in Near...
CrowdSearch2012 - Discovering User Perceptions of Semantic Similarity in Near...Raynor Vliegendhart
 
визуальная публицистика. фотография как медиа
визуальная публицистика. фотография как медиавизуальная публицистика. фотография как медиа
визуальная публицистика. фотография как медиаKateryna Iakovlenko
 

Destacado (20)

пустыни
пустынипустыни
пустыни
 
My power point
My power pointMy power point
My power point
 
οικουμενικη διακηρυξη των δικαιωματων του ανθρωπου
οικουμενικη διακηρυξη των δικαιωματων του ανθρωπουοικουμενικη διακηρυξη των δικαιωματων του ανθρωπου
οικουμενικη διακηρυξη των δικαιωματων του ανθρωπου
 
Numbers 1-12
Numbers 1-12Numbers 1-12
Numbers 1-12
 
2 partecipazione romantica
2 partecipazione romantica2 partecipazione romantica
2 partecipazione romantica
 
Sdp ja työurat
Sdp ja työuratSdp ja työurat
Sdp ja työurat
 
Roskilde Festival 2011 af Esben Danielsen
Roskilde Festival 2011 af Esben DanielsenRoskilde Festival 2011 af Esben Danielsen
Roskilde Festival 2011 af Esben Danielsen
 
Preliminary task – school magazine
Preliminary task – school magazinePreliminary task – school magazine
Preliminary task – school magazine
 
Alok ppt
Alok pptAlok ppt
Alok ppt
 
Šovinizam i rasizam Dinka Gruhonjića
Šovinizam i rasizam Dinka GruhonjićaŠovinizam i rasizam Dinka Gruhonjića
Šovinizam i rasizam Dinka Gruhonjića
 
Patas pa arriva
Patas pa arrivaPatas pa arriva
Patas pa arriva
 
Rumania
RumaniaRumania
Rumania
 
Sosiaali- ja terveysministeriön asettaman työryhmän raportti
Sosiaali- ja terveysministeriön asettaman työryhmän raporttiSosiaali- ja terveysministeriön asettaman työryhmän raportti
Sosiaali- ja terveysministeriön asettaman työryhmän raportti
 
Residential Flats in Omaxe City Jaipur...8459137252
Residential Flats in Omaxe City Jaipur...8459137252Residential Flats in Omaxe City Jaipur...8459137252
Residential Flats in Omaxe City Jaipur...8459137252
 
Taler vi om det samme? af Jacob Boye Hansen, CareCom A/S
Taler vi om det samme? af Jacob Boye Hansen, CareCom A/STaler vi om det samme? af Jacob Boye Hansen, CareCom A/S
Taler vi om det samme? af Jacob Boye Hansen, CareCom A/S
 
Kisi kisi prediksi soal un MTK
Kisi kisi prediksi soal un MTKKisi kisi prediksi soal un MTK
Kisi kisi prediksi soal un MTK
 
Q jg
Q jgQ jg
Q jg
 
Automatisering af tværautomatiske processer af Mia Cordes Hansen, iHedge A/S
Automatisering af tværautomatiske processer af Mia Cordes Hansen, iHedge A/SAutomatisering af tværautomatiske processer af Mia Cordes Hansen, iHedge A/S
Automatisering af tværautomatiske processer af Mia Cordes Hansen, iHedge A/S
 
CrowdSearch2012 - Discovering User Perceptions of Semantic Similarity in Near...
CrowdSearch2012 - Discovering User Perceptions of Semantic Similarity in Near...CrowdSearch2012 - Discovering User Perceptions of Semantic Similarity in Near...
CrowdSearch2012 - Discovering User Perceptions of Semantic Similarity in Near...
 
визуальная публицистика. фотография как медиа
визуальная публицистика. фотография как медиавизуальная публицистика. фотография как медиа
визуальная публицистика. фотография как медиа
 

Similar a research_paper_acm_1.6

9-Meta Analysis/ Systematic Review
9-Meta Analysis/ Systematic Review9-Meta Analysis/ Systematic Review
9-Meta Analysis/ Systematic ReviewResearchGuru
 
Systematic Review & Meta Analysis.pptx
Systematic Review & Meta Analysis.pptxSystematic Review & Meta Analysis.pptx
Systematic Review & Meta Analysis.pptxDr. Anik Chakraborty
 
A research study Writing a Systematic Review in Clinical Research – Pubrica
A research study Writing a Systematic Review in Clinical Research – PubricaA research study Writing a Systematic Review in Clinical Research – Pubrica
A research study Writing a Systematic Review in Clinical Research – PubricaPubrica
 
A research study writing a systematic review in clinical research – pubrica
A research study writing a systematic review in clinical research – pubricaA research study writing a systematic review in clinical research – pubrica
A research study writing a systematic review in clinical research – pubricaPubrica
 
IJPCR,Vol15,Issue6,Article137.pdf
IJPCR,Vol15,Issue6,Article137.pdfIJPCR,Vol15,Issue6,Article137.pdf
IJPCR,Vol15,Issue6,Article137.pdfdrdivyeshgoswami1
 
evidence based periodontics
 evidence based periodontics    evidence based periodontics
evidence based periodontics neeti shinde
 
Systematic Reviews Class 4c
Systematic Reviews Class 4cSystematic Reviews Class 4c
Systematic Reviews Class 4cguestf5d7ac
 
When to Select Observational Studies as Evidence for Comparative Effectivenes...
When to Select Observational Studies as Evidence for Comparative Effectivenes...When to Select Observational Studies as Evidence for Comparative Effectivenes...
When to Select Observational Studies as Evidence for Comparative Effectivenes...Effective Health Care Program
 
Guide for conducting meta analysis in health research
Guide for conducting meta analysis in health researchGuide for conducting meta analysis in health research
Guide for conducting meta analysis in health researchYogitha P
 
How to evaluate clinical practice
How to evaluate clinical practice How to evaluate clinical practice
How to evaluate clinical practice raveen mayi
 
Introa-Morton-slides.pdf
Introa-Morton-slides.pdfIntroa-Morton-slides.pdf
Introa-Morton-slides.pdfShahriarHabib4
 
Ich guidelines e9 to e12
it is reported [1], [4].
Quality assessment can weight the importance of individual studies based on these aspects and can be used to remove low-quality studies from a SR, reducing the number of studies to be analysed and making the SR process more efficient and less error prone. Using different procedures, we assessed the quality of a set of 42 experiments drawn from 3 different SRs. We developed a checklist for the quality assessment of experiments in SRs using the literature available in SE [1], [4], [22] and in other experimental disciplines [5], [6], [7]. If checklists can discriminate experiment quality, their results should be consistent with our understanding of quality. Our conclusions indicate that quality is an elusive concept that cannot easily be assessed by checklists or questionnaires based primarily on the experimental report. Discarding experiments from a SR on the basis of the present quality assessment instruments can mislead the aggregations, and wrong evidence could be generated. Therefore, quality assessment should either be performed with extreme caution or not be performed at all. This recommendation matches current practice in other experimental fields [8], [9], [10], [11].

The paper is divided into 6 main sections. Section 2 reviews the different instruments that have been used to assess the quality of experiments. Section 3 presents the research methodology, Section 4 reports the research results, Section 5 discusses them, and Section 6 presents the conclusions of our research.
2. INSTRUMENTS FOR ASSESSING EXPERIMENT QUALITY

Since the 1940s, when the first Randomized Controlled Trial (RCT) was run in medicine, RCTs have been considered the "gold standard" for experimental research, valued above other types of studies such as non-randomized trials, longitudinal studies and case studies [12]. Sacks et al. [13] performed the first corroboration of the differences between RCTs and other types of studies. One of the interesting findings was that RCTs were in general less prone to bias than other types of studies, such as observational studies or studies with historical controls. Bias produces some kind of deviation from the true value; Kitchenham [1] explains bias as "a tendency to produce results that depart systematically from the true results". This early finding was highly influential and gave birth to the development of hierarchies of evidence [14]. A hierarchy of evidence was first proposed and popularized by the Canadian Task Force on the Periodic Health Examination [15] in 1979. A hierarchy of evidence determines the best available evidence and can be used to grade experiments according to their design; it also reflects the degree of bias that different types of experiments might carry. The first hierarchy of evidence evolved into more refined ones, such as the CRD guidelines [16], the Australian National Health and Medical Research Council guidelines [5] and the Cochrane review handbook [8].

To understand how a hierarchy of evidence works, let us assume we have two studies comparing testing technique A with testing technique B in a specific context: one RCT and one case study. Suppose the RCT says that technique A is better than technique B and the case study states the opposite. The result of the RCT would be given priority over the case study result, because RCTs are situated above case studies in the hierarchy, and there is therefore more confidence in the result of an RCT than in that of a case study. A hierarchy of evidence can be used during a SR to determine which study designs contribute to the evidence.

Hierarchies have been considered an uncontested method for grading the strength of evidence, and their predictive power is thought to have been demonstrated on multiple occasions [17]. Nevertheless, some voices have been raised against hierarchies. It has been observed that lack of bias is not necessarily related to the type of design, but to the provisions made in the empirical study to control confounding variables [18]. In other words, quality (the risk of bias [19]) is related to the internal validity of the experiment. Threats to internal validity have been identified as selection, performance, measurement and attrition (exclusion) bias [5], [8], [16]. Based on the observation that lack of bias (quality) is related to internal validity, many researchers in medicine studied the influence on bias of selection, performance, measurement and execution aspects, for example the presence of randomization or blinding in the experiment. Removing all kinds of bias from an experiment would lead to an internally valid experiment, and biases can be avoided by using mechanisms such as randomization and blinding. Checklists were therefore constructed with the different kinds of bias in mind, asking whether all the relevant steps were taken to avoid bias [20].
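The prioritization implied by a hierarchy of evidence, as in the RCT versus case study example above, can be made concrete with a short sketch. The ranking below is an illustrative, simplified ordering chosen only for the example; it is not the exact hierarchy defined in [15] or [16].

```python
# Illustrative sketch: resolving conflicting results with a simple
# hierarchy of evidence. The ranking is a toy ordering, not the exact
# hierarchy of the guidelines cited in the text.
EVIDENCE_RANK = {
    "randomized controlled trial": 1,  # strongest design, least prone to bias
    "quasi-experiment": 2,
    "observational study": 3,
    "case study": 4,                   # weakest design in this toy ranking
}

def preferred_result(studies):
    """Return the conclusion of the highest-ranked (lowest number) design."""
    return min(studies, key=lambda s: EVIDENCE_RANK[s["design"]])["conclusion"]

studies = [
    {"design": "randomized controlled trial", "conclusion": "A > B"},
    {"design": "case study", "conclusion": "B > A"},
]
print(preferred_result(studies))  # -> "A > B": the RCT outranks the case study
```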
In experimental SE, SRs and quality assessment are recent topics that gained attention after the seminal work by Kitchenham in 2004 [1]. Since the publication of that report, multiple SRs have been published in SE [2], [21], although most SRs in SE do not include an explicit quality assessment process. In the new version of the report published in 2007 [22], Kitchenham proposes a questionnaire (checklist) covering the bias and validity problems that might arise in the different phases of an experiment. Dybå et al. [2] also developed and used a checklist in their SR on agile software development. Lately, extensive criticism has been raised in medicine about the use of scales for measuring quality [8], [9], [10], [23]. No relationship was found between the score obtained by an experiment on a given quality assessment checklist and the amount of bias suffered by the experiment [9]. The Cochrane reviewers' handbook goes as far as discouraging the use of quality assessment scales for excluding experiments from a SR [8].

3. RESEARCH METHODOLOGY

Our research methodology aims to find out whether low-quality experiments cause bias when they are aggregated in a SR. For this we need a quality assessment instrument to identify low-quality experiments, ways to determine bias and, finally, a way to relate experiment quality to bias. Our research methodology therefore has three elementary components: (i) an assessment checklist, (ii) one or more ways to determine bias in experiments, and (iii) a way to relate experiment quality to bias in the experiment.

3.1 Quality Assessment Checklist

We started by studying the QA scales available in the empirical SE literature. Kitchenham [1], drawing on different works from medicine, proposes a questionnaire (checklist) for experiment quality assessment. In Kitchenham's checklist the average number of questions for evaluating an experiment is close to 45. Pragmatically speaking this is an enormous number, since the questions need to be answered for every paper in a SR (very often more than 40 papers). Using such checklists would increase the time required to conduct a SR, making the process highly inefficient. Kitchenham [26] suggests not using all the questions provided in the checklist but selecting the questions that are most appropriate for the specific research problem. Such a strategy is difficult to apply, because researchers lack the information needed to identify which questions are relevant to the particular SR at hand. Dybå et al. [2] also developed a questionnaire for the quality assessment of experiments in their SR on agile software development, but they do not justify why and from where the items included in the checklist were selected. None of the existing ESE QA questionnaires have been tested for suitability (to ensure that they measure what they are intended to measure) or reliability (to ensure that they measure consistently).

We decided to build a much shorter checklist covering the basic issues that influence quality according to Kitchenham's reports [1], [22], Kitchenham et al.'s guidelines [4] and the practices of experiment QA in medicine [5], [8], [16], [22]. Our checklist is in essence similar to Kitchenham's checklist, as both use the same sources and are built on the same criteria (importance of randomization, blinding, threats to validity, etc.). It is shown in Table 1. The questions are classified into two groups: methodological and reporting aspects.
Methodology-related questions deal with controlling extraneous variables, while reporting-related questions check whether important details are disclosed in the written report. Creating a new checklist each time a SR is performed would be quite time consuming. Therefore, we set out to evaluate whether it is possible to develop a checklist valid for a diversity of SRs, and we used experiments from several domains to assess the generalizability of the checklist.

Table 1. Construct, attributes and questions of our checklist

Construct: Quality of experiments

Attribute: Quality of reporting
- Does the introduction contain the industrial context and a description of the techniques under review?
- Does the report summarize previous similar experiments and discuss them?
- Does the researcher define the population from which objects and subjects are drawn?
- Is statistical significance reported together with the results?
- Are threats to validity stated explicitly, together with how they affect the results and conclusions?

Attribute: Methodological quality
- Does the researcher define the process by which the treatment is applied to subjects or objects?
- Was randomization used for selecting the population and applying the treatment?
- Does the researcher define the process by which subjects and objects are selected?
- Are hypotheses stated, and are they consistent with the goal discussed in the introduction?
- Is the blinding procedure used convincingly enough?
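The checklist scores reported later (Table 5) lie between 0 and 1, but the paper does not spell out how the ten answers are combined into a score. The sketch below assumes the simplest scheme, the fraction of questions answered affirmatively; the actual scoring procedure may differ.

```python
# Hypothetical scoring sketch: the paper does not specify how the ten yes/no
# answers are combined into the 0-1 checklist scores of Table 5; a simple
# assumption is the fraction of questions answered "yes".
CHECKLIST = [
    "industrial context and techniques described",
    "previous experiments summarized",
    "population defined",
    "statistical significance reported",
    "threats to validity discussed",
    "treatment application process defined",
    "randomization used",
    "subject/object selection process defined",
    "hypotheses stated and aligned with goal",
    "blinding used convincingly",
]

def checklist_score(answers):
    """answers: dict mapping each checklist item to True/False."""
    return sum(bool(answers[item]) for item in CHECKLIST) / len(CHECKLIST)

# Example: an experiment satisfying 7 of the 10 items scores 0.70.
answers = {item: i < 7 for i, item in enumerate(CHECKLIST)}
print(checklist_score(answers))  # -> 0.7
```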
3.2 Determining Bias in Experiments

Kitchenham [1] explains bias as "a tendency to produce results that depart systematically from the 'true' results". For example, if a specific experiment finds that testing technique A is 50% more effective than testing technique B, but we know that the difference is actually 30%, the remaining 20% is bias. Therefore, for our research we need to know true values in order to determine the bias of an experiment. However, given the lack of empirical data in SE, true values are seldom available. We only have the results of SRs, whose values we might assume are close to the true values. We have found only two published SRs that detail the aggregations, showing the true value and how each experiment contributes to it (Dybå et al. [25] and Ciolkowski [26]). At UPM, we had two more SRs that detailed the aggregations [14], [15]. Using SR results, and depending on how they were performed, we can identify three measures of bias.

The first measure is the proximity to mean value (PTMV) [27], [29] of all experiments in a SR. Figure 1 is a forest plot describing the aggregation of three experiments; the vertical line with the diamond at its tail marks the mean value of the three experiments. PTMV says that the farther an experiment is from the mean value, the more biased its results are. In Figure 1, from top to bottom, we have E1, E2 and E3. PTMV tells us that E3 has the highest bias, as it is farthest from the mean value, and E2 has the least bias. Table 2 provides the distances from the mean value and the corresponding bias order of the three experiments.

Figure 1. Aggregation forest plot of three experiments.

Table 2. PTMV scores for the experiments in the forest plot example
Experiment ID | Distance | PTMV | Bias order
E1 | d1 | 0.47 | 2
E2 | d2 | 0.17 | 3
E3 | d3 | 0.48 | 1
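A minimal sketch of the PTMV idea is given below, assuming that PTMV is simply the absolute distance between an experiment's effect estimate and the pooled mean of the aggregation; the paper does not give an explicit formula, and the effect values used here are made up only to reproduce the ordering of Table 2.

```python
# Minimal PTMV sketch (assumed reading: PTMV = absolute distance between an
# experiment's effect estimate and the pooled mean of the aggregation).
def ptmv_scores(effects):
    """effects: dict mapping experiment ID to its effect estimate."""
    pooled_mean = sum(effects.values()) / len(effects)
    return {exp: abs(effect - pooled_mean) for exp, effect in effects.items()}

def bias_order(scores):
    """Rank experiments from most biased (1, farthest from the mean) down."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {exp: rank for rank, exp in enumerate(ranked, start=1)}

# Toy effect estimates, not the paper's data; chosen so that E3 is farthest
# from the mean and E2 is closest, as in Table 2.
effects = {"E1": 0.2, "E2": 0.5, "E3": 1.1}
scores = ptmv_scores(effects)
print(scores)              # distances from the pooled mean (0.6)
print(bias_order(scores))  # -> {'E3': 1, 'E1': 2, 'E2': 3}
```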
The second measure is the conflicting and non-conflicting contribution to the aggregation (CNC) of all experiments in a SR. Table 3 describes three aggregations produced using comparative analysis [30]. Each aggregation has a response variable and a generalized result associated with it.

Table 3. Experiments' contribution to the aggregation
Aggregation | Response variable | Generalized result | Positive effects | Negative effects | No effects
AGG01 | R1 | T{a} > T{b} | E1, E3 | E2 | -
AGG02 | R2 | T{c} < T{d} | E1, E5 | E4 | -
AGG03 | R3 | T{e} > T{f} | E3, E2 | - | -

Each experiment either has a positive effect on the aggregated result, a negative effect, or no effect. A positive effect means that the experimental result agrees with the generalized result, and a negative effect means the opposite; no effect means that the experiment produced insignificant results. The CNC contribution can be calculated with the formula given in Equation 1:

Equation 1. CNC quality score: CNC = (number of non-conflicting contributions) / (number of non-conflicting contributions + number of conflicting contributions)

Let us take an example to calculate the CNC score. Experiment E2 conflicts with aggregation AGG01 and supports (does not conflict with) aggregation AGG03, so the CNC score for E2 is 0.5. The higher the CNC score of an experiment, the less bias it carries, because a higher CNC score means the experiment is more coherent with the aggregated results. E1, which supports both AGG01 and AGG02, has a CNC score of 1.
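The CNC computation can be sketched directly from Table 3 and the worked example (E2 scores 0.5, E1 scores 1.0). How "no effect" entries should be counted is not settled by the example, so the sketch below simply ignores them; only supporting and conflicting contributions enter the score.

```python
# CNC sketch following the worked example. "No effect" entries are ignored
# here, which is an assumption; only supporting and conflicting counts enter.
aggregations = {
    "AGG01": {"positive": {"E1", "E3"}, "negative": {"E2"}},
    "AGG02": {"positive": {"E1", "E5"}, "negative": {"E4"}},
    "AGG03": {"positive": {"E3", "E2"}, "negative": set()},
}

def cnc_score(experiment, aggregations):
    supporting = sum(experiment in agg["positive"] for agg in aggregations.values())
    conflicting = sum(experiment in agg["negative"] for agg in aggregations.values())
    total = supporting + conflicting
    return supporting / total if total else None

print(cnc_score("E2", aggregations))  # -> 0.5 (conflicts AGG01, supports AGG03)
print(cnc_score("E1", aggregations))  # -> 1.0 (supports AGG01 and AGG02)
```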
CNC scores are handy for SRs that do not use meta-analysis for data aggregation, although they are less precise than PTMV scores.

The third measure we used is expert opinion about the experiment, based on the experiment's report [1], [16]. Drawing on experience, an expert can differentiate between biased and unbiased experiments in a SR. We adapted the expert opinion measure to make it simpler: a value of -1, 0 or 1 was given for a poor, average or good experiment respectively. The higher the score, the lower the expected amount of bias in the experiment.

3.3 Relating Quality with Bias

The third component of our research methodology is a way to relate checklist results to bias in the experiment. If internal validity and lack of bias are related, we expect high values of internal validity to be associated with low levels of bias, and vice versa. Correlations are the best way to quantify such a relationship [31]. Therefore, we correlated the checklist scores with the PTMV scores, the CNC scores and the expert opinion scores.

4. RESEARCH RESULTS

The three SRs described in Table 4 were used to obtain the scores described above [32]. Checklist, PTMV, CNC and expert opinion scores were obtained for all the experiments in the three SRs; Table 5 provides all the scores.

Table 4. The three SRs used in our research
Name | No. of experiments | Aggregation mechanism | Authors
SR on elicitation techniques [27] | 14 | Comparative analysis | Oscar Dieste, Natalia Juristo
SR on inspection techniques [28] | 13 | Meta-analysis | Anna Griman Padua, Oscar Dieste, Natalia Juristo
SR on pair programming [25] | 15 | Meta-analysis | Tore Dybå, Erik Arisholm, Dag I.K. Sjøberg, Jo E. Hannay and Forrest Shull
Table 5. All the scores
Study ID | CNC score | PTMV score | Checklist score | Expert opinion score

Elicitation SR:
E2 | 1.00 | N.A. | 0.70 | 0
E3 | 0.00 | N.A. | 0.72 | 1
E4 | 0.00 | N.A. | 0.63 | 1
E5 | 1.00 | N.A. | 0.90 | 1
E6 | 0.50 | N.A. | 0.45 | 1
E8 | 0.58 | N.A. | 0.60 | 1
E17 | 0.33 | N.A. | 0.60 | 1
E18 | 0.50 | N.A. | 0.60 | 1
E19 | 1.00 | N.A. | 0.80 | 1
E21 | 0.33 | N.A. | 0.80 | -1
E25 | 1.00 | N.A. | 0.90 | 0
E28 | 0.50 | N.A. | 0.30 | -1
E31 | 0.50 | N.A. | 0.60 | -1
E36 | 0.83 | N.A. | 0.90 | 1
E43 | 1.00 | N.A. | 0.70 | 1

Inspection SR:
E05 | 1.00 | 0.34 | 0.80 | 1
E01 | 1.00 | 0.13 | 0.80 | 0
E02 | 1.00 | 0.36 | 0.70 | 0
E06 | 1.00 | 0.23 | 0.80 | 1
E03 | 1.00 | 0.21 | 0.70 | 0
E04-R1 | 1.00 | 0.09 | 0.80 | 1
E04-R2 | 1.00 | 0.63 | 0.80 | 1
E13 | 1.00 | 0.32 | 0.80 | 1
E14 | 0.50 | 0.26 | 0.80 | 1
E15 | 0.50 | 0.37 | 0.80 | 1
E19 | 1.00 | 0.33 | 0.80 | 0
E20 | 1.00 | 0.14 | 0.80 | 0

Pair programming SR:
P98 | 0.50 | 2.11 | 0.7 | 0
So0 | 1.00 | 0.66 | 0.6 | 0
So1 | 0.00 | 0.24 | 0.5 | 0
So2 | 0.00 | 0.265 | 0.6 | 0
Po2 | 0.00 | 0.34 | 0.6 | 1
So3 | 0.00 | 0.215 | 0.9 | 1
So5a | 1.00 | 0.335 | 0.7 | 0
So5b | 0.33 | 0.44 | 0.8 | 1
So5c | 0.00 | 0.08 | 0.8 | 0
So6a | 0.00 | 0.3 | 0.8 | 1
So6b | 0.00 | 0.26 | 0.8 | 1
So6c | 0.67 | 1.575 | 0.8 | -1
So6d | 0.67 | 1.2433 | 0.6 | 0
So7b | 0.00 | 0.65 | 0.9 | 1
So7a | 0.33 | 0.19 | 0.8 | 1

Three different experts were used: two were PhD students writing their theses on SR and aggregation, and one was a senior researcher in experimental SE.

Correlations were computed between CNC scores and checklist scores, between checklist scores and PTMV scores, and between checklist scores and expert opinion. Figure 2 shows all the correlations, and Table 6 provides the correlation coefficients and the corresponding p-values. In the first row of Figure 2 (CNC vs. checklist) the x-axis gives the checklist scores and the y-axis the CNC scores. In the first plot (elicitation SR), from left to right, the CNC scores fluctuate as the checklist scores range from 0.60 to 0.90, clearly hinting at a poor correlation; the correlation coefficient and p-value given in Table 6 (r = 0.453, p = 0.242) substantiate this observation. The second and third plots in the first row likewise exhibit poor correlations. In the second row of Figure 2, the rightmost plot shows the correlation between checklist and PTMV scores for the pair programming SR: experiments with similar PTMV scores have quite different checklist scores, again suggesting a poor correlation. The correlation coefficient (-0.078) substantiates this observation, and the p-value (0.526) makes the result statistically insignificant. The other plot in the same row (inspection SR) also shows a poor correlation. In the third row of Figure 2, the second plot from the left shows the correlation between checklist scores and expert opinion for the inspection SR. This plot shows a positive trend: as the expert opinion scores (x-axis) increase, the checklist scores also tend to increase. The correlation coefficient (0.677) suggests a positive correlation, but the p-value (0.129) is above the usual significance threshold (0.05); the most probable reason is that with such a small sample (11 studies) it is difficult to reach statistical significance. The plots and correlation coefficients for the other two SRs (elicitation and pair programming) suggest no correlation between expert opinion and checklist scores. Overall, all the correlations were poor and statistically insignificant, suggesting that no relationship could be identified between quality measured through the checklist and bias measured through CNC, PTMV and expert opinion.

Table 6. Correlation coefficients (r) between the checklist scores and each bias measure, with p-values in parentheses
SR | CNC | PTMV | Expert opinion
Elicitation SR | 0.453 (0.242) | N.A. | 0.222 (0.367)
Inspection SR | -0.182 (0.600) | 0.085 (0.453) | 0.677 (0.129)
Pair programming SR | -0.236 (0.630) | -0.078 (0.526) | 0.406 (0.277)

Figure 2. Correlations, from top to bottom: (i) CNC vs. checklist, (ii) checklist vs. PTMV, (iii) checklist vs. expert opinion; from left to right: (i) elicitation SR, (ii) inspection SR, (iii) pair programming SR.
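The correlation analysis itself is straightforward to reproduce. The paper does not state which correlation coefficient was used; the sketch below assumes Pearson's r and uses a small subset of the inspection SR rows from Table 5 purely to illustrate the computation.

```python
# Sketch of the correlation between checklist scores and a bias measure.
# Pearson's r is assumed (the paper does not say which coefficient was used);
# the values are a subset of Table 5 (inspection SR), for illustration only.
from scipy.stats import pearsonr

checklist = [0.80, 0.80, 0.70, 0.80, 0.70, 0.80]  # checklist scores
ptmv = [0.34, 0.13, 0.36, 0.23, 0.21, 0.09]       # PTMV bias scores

r, p_value = pearsonr(checklist, ptmv)
print(f"r = {r:.3f}, p = {p_value:.3f}")
# A coefficient close to zero with a high p-value would mean, as reported in
# Table 6, that the checklist score tells us little about the amount of bias.
```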
5. DISCUSSION

The results invite discussion. Returning to the question: should experiments in a SR be discarded based on their quality? Discarding an experiment from a SR based on its quality (methodological or reporting) seems to be inappropriate: no correlations were obtained between the QA checklist scores and the different measures of bias, and our checklist is similar to the contemporary QA instruments prevalent in SE. Contemporary QA instruments should therefore be used carefully.

The exclusion of valid studies from a SR reduces the precision of the aggregated result, whereas the inclusion of invalid studies might bias it. The Cochrane Handbook [8] shares our point of view on the QA of experiments in medicine and, along with other researchers from medicine [9], [10], [11], discourages the use of QA scales, as they cannot prove their validity. Jüni [10] states: "Scales have been shown to be unreliable assessment of validity and they are less likely to be transparent to the users of the review." None of the scales present in SE have proved their validity, including ours. Moreover, QA scales can discard useful experiments from a SR, making the aggregated result biased, imprecise and pragmatically worthless. Therefore it might be advisable to discourage the use of QA scales in SE.

The results we have obtained are statistically insignificant; if the number of experiments increases, we might attain statistical significance once more SRs that detail their aggregations become available. Researchers in medicine [14], [33] have identified a further aspect relevant to measuring quality: the relationship between the object being experimented on and the experimenters seems to bias the results. This object-experimenter relationship is the only factor empirically identified in medicine as related to bias in experimental results. It has been observed in medicine that multi-centre experiments avoid this bias, and evidence from multi-centre SRs is considered to have the maximum confidence [14]. However, further research is needed to substantiate this postulate in SE.
6. CONCLUSIONS

We have analysed whether QA instruments such as checklists have a relationship with experimental bias. Our pilot approach to this research used our own QA checklist: the most widely used QA checklist in SE [22] was too long (about 45 questions) to be answered for the 45 experiments we were analysing, so we generated a short version of that checklist. For three different SRs [25], [27], [28] we analysed the correlations between the QA checklist and three bias measurements (proximity to mean value, conflicting and non-conflicting contribution, and expert opinion). All the correlations were poor and insignificant. These results suggest that discarding experiments from a SR based on their quality should not be practised, and they match similar results and recommendations in medicine [8], [9], [10], [11]. The next step in our research is to run the same analysis with similar QA instruments (e.g. [2], [34]).

7. ACKNOWLEDGMENTS

This work was part of a master's thesis carried out at UPM, Madrid, Spain and TU Kaiserslautern, Germany under Dieter Rombach, Natalia Juristo and Oscar Dieste, within the European Masters in Software Engineering course. This work was partially supported by the projects TIN 2008-00555 and HD 2008-0048.

8. REFERENCES

[1] B.A. Kitchenham, Procedures for Performing Systematic Reviews, Technical Report TR/SE-0401, 2004.
[2] Dybå T, Dingsøyr T. Empirical studies of agile software development: A systematic review. Inf. Softw. Technol. 2008; 50(9-10): 833-859. Available at: http://portal.acm.org.miman.bib.bth.se/citation.cfm?id=1379905.1379989&coll=GUIDE&dl=GUIDE&CFID=81486639&CFTOKEN=51504098 [Accessed March 11, 2010].
[3] Dyba T, Dingsoyr T, Hanssen G. Applying Systematic Reviews to Diverse Study Types: An Experience Report. 20-21 Sept. 2007, pp. 225-234.
[4] B.A. Kitchenham, S.L. Pfleeger, L.M. Pickard, P.W. Jones, D.C. Hoaglin, K.E. Emam, and J. Rosenberg, Preliminary Guidelines for Empirical Research in Software Engineering, vol. 28, no. 8, Aug. 2002, pp. 721-734.
[5] National Health and Medical Research Council (NHMRC), "NHMRC handbook series on preparing clinical practice guidelines." ISBN 0642432952, Feb. 2000.
[6] D. Moher, B. Pham, A. Jones, D.J. Cook, A.R. Jadad, M. Moher, P. Tugwell, and T.P. Klassen, Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet, vol. 352, Aug. 1998, pp. 609-613.
[7] Blobaum P. Physiotherapy Evidence Database (PEDro). J Med Libr Assoc. 2006; 94(4): 477-478.
[8] J.P.T. Higgins and S. Green, Cochrane Handbook for Systematic Reviews of Interventions, John Wiley and Sons, 2008.
[9] J.D. Emerson, E. Burdick, D.C. Hoaglin, F. Mosteller, and T.C. Chalmers, An empirical study of the possible relation of treatment differences to quality scores in controlled randomized clinical trials, Controlled Clinical Trials, vol. 11, Oct. 1990, pp. 339-352.
[10] P. Jüni, A. Witschi, R. Bloch, and M. Egger, The hazards of scoring the quality of clinical trials for meta-analysis, JAMA: The Journal of the American Medical Association, vol. 282, Sep. 1999, pp. 1054-1060.
[11] Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias. Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA 1995; 273: 408-412.
[12] Stuart Barton, Which clinical studies provide the best evidence? BMA House, Tavistock Square, London WC1H 9JR, sbarton@bmjgroup.com.
[13] Sacks H, Chalmers TC, Smith H Jr., Randomized versus historical controls for clinical trials, Am J Med 1982; 72: 233-240.
[14] David Evans, Hierarchy of evidence: a framework for ranking evidence evaluating healthcare interventions, Department of Clinical Nursing, University of Adelaide, South Australia 5005, 2002.
[15] Canadian Task Force on the Periodic Health Examination (1979). The periodic health examination. Canadian Medical Association Journal 121, 1193-1254.
[16] Centre for Reviews and Dissemination, University of York, Undertaking systematic reviews of research on effectiveness: CRD's guidance for carrying out or commissioning reviews, NHS Centre for Reviews and Dissemination, Mar. 2001.
[17] Louis PCA. Research into the effects of bloodletting in some inflammatory diseases and on the influence of tartarized antimony and vesication in pneumonitis. Am J Med Sci 1836; 18: 102-111.
[18] R.R. MacLehose, B.C. Reeves, I.M. Harvey, T.A. Sheldon, I.T. Russell, A.M.S. Black, A systematic review of comparisons of effect sizes derived from randomised and non-randomised studies, Health Technology Assessment 2000; Vol. 4: No. 34.
[19] Verhagen A.P., de Vet H.C.W., de Bie R.A., Boers M., van den Brandt P.A. The art of quality assessment of RCTs included in systematic reviews. Journal of Clinical Epidemiology 2001; 54: 651-654. Available at: http://www.ingentaconnect.com/content/els/08954356/2001/00000054/00000007/art00360 [Accessed March 3, 2010].
[20] A.P. Verhagen, H.C. de Vet, R.A. de Bie, A.G. Kessels, M. Boers, L.M. Bouter, and P.G. Knipschild, The Delphi list: a criteria list for quality assessment of randomized clinical trials for conducting systematic reviews developed by Delphi consensus, Journal of Clinical Epidemiology, vol. 51, Dec. 1998, pp. 1235-1241.
[21] Finn Olav Bjørnson, Torgeir Dingsøyr, Knowledge management in software engineering: A systematic review of studied concepts, findings and research methods used, Norwegian University of Science and Technology, Department of Computer and Information Science, Trondheim, Norway / SINTEF Information and Communication Technology, Trondheim, Norway, 2008.
[22] B. Kitchenham and S. Charters, Guidelines for Performing Systematic Literature Reviews in Software Engineering, EBSE Technical Report EBSE-2007-01, 2007.
[23] E.M. Balk, P.A.L. Bonis, H. Moskowitz, C.H. Schmid, J.P.A. Ioannidis, C. Wang, and J. Lau, Correlation of quality measures with estimates of treatment effect in meta-analyses of randomized controlled trials, JAMA: The Journal of the American Medical Association, vol. 287, Jun. 2002, pp. 2973-2982.
[24] Fink, A. Conducting Research Literature Reviews: From the Internet to Paper, Sage Publications, Inc., 2005.
[25] Hannay JE, Dybå T, Arisholm E, Sjøberg DIK. The effectiveness of pair programming: A meta-analysis. Inf. Softw. Technol. 2009; 51(7): 1110-1122. Available at: http://portal.acm.org.miman.bib.bth.se/citation.cfm?id=1539052.1539606 [Accessed March 11, 2010].
[26] Ciolkowski M. What do we know about perspective-based reading? An approach for quantitative aggregation in software engineering. In: Proceedings of the 2009 3rd International Symposium on Empirical Software Engineering and Measurement. IEEE Computer Society; 2009: 133-144. Available at: http://portal.acm.org/citation.cfm?doid=1671248.1671262 [Accessed March 11, 2010].
[27] Oscar Dieste, Natalia Juristo, Systematic Review and Aggregation of Empirical Studies on Elicitation Techniques. IEEE Transactions on Software Engineering, 2008.
[28] Anna Griman Padua, Oscar Dieste and Natalia Juristo, Systematic review on inspection techniques, UPM, Spain (under construction).
[29] Schulz, Kenneth F.; Chalmers, Iain; Hayes, Richard J.; Altman, Douglas G., Empirical evidence of bias: Dimensions of methodological quality associated with estimates of treatment effects in controlled trials, JAMA: Journal of the American Medical Association, vol. 273(5), Feb 1995, pp. 408-412.
[30] C.C. Ragin, The Comparative Method, University of California Press, 1987.
[31] DeCoster, J. (2005). Scale Construction Notes, Jun. 2005, http://www.start-help.com/notes.html.
[32] Himanshu Saxena, Prospective Study for the Quality Assessment of Experiments Included in Systematic Reviews, Master thesis, UPM, Madrid, 2009.
[33] P. Jüni, A. Witschi, R. Bloch, and M. Egger, The hazards of scoring the quality of clinical trials for meta-analysis, JAMA: The Journal of the American Medical Association, vol. 282, Sep. 1999, pp. 1054-1060.
[34] Kitchenham, B., Mendes, E., Travassos, G.H. (2007) A Systematic Review of Cross- vs. Within-Company Cost Estimation Studies, IEEE Trans. on SE, 33(5), pp. 316-329.

9. LIST OF EXPERIMENTS USED IN OUR STUDY

Set 1 - The following studies were used in the systematic review on elicitation techniques. The unique alphanumeric code identifying each study is the code used while performing the systematic review; the aggregation tables therefore use these codes to represent the experiments.

[E2] Agarwal, R. and Tanniru, M. R., "Knowledge Acquisition Using Structured Interviewing: An Empirical Investigation," Journal of Management Information Systems, vol. 7, pp. 123-141, Summer 1990.
[E3] Bech-Larsen, T. and Nielsen, N. A., "A comparison of five elicitation techniques for elicitation of attributes of low involvement products," Journal of Economic Psychology, vol. 20, pp. 315-341, 1999.
[E4] Breivik, E. and Supphellen, M., "Elicitation of product attributes in an evaluation context: A comparison of three elicitation techniques," Journal of Economic Psychology, vol. 24, pp. 77-98, 2003.
B., "An Empirical Investigation of User Requirements Elicitation: Comparing the Effectiveness of Prompting Techniques," Journal of Management Information Systems , vol. 17, pp. 223-249, Spring, 2001. [E6] Burton, A. M., Shadbolt, N. R., Hedgecock, A. P., and Rugg, G. A formal evaluation of knowledge elicittion techniques for expert systems: Domain 1. In: Research and development in expert systems IV: proceedings of Expert Systems '87, the seventh annual Technical Conference of the British Computer Society Specialist Group on Expert Systems, ed. Moralee, D. S. Cambridge, UK: Cambridge University Press, 1987. [E8] Corbridge, B., Rugg, G., Major, N. P., Shadbolt, N. R. , and Burton, A. M., "Laddering - technique and tool use in knowledge acquisition," Knowledge Acquisition, vol. 6, pp. 315-341, 1994. [E17, E18] Burton, A. M., Shadbolt, N. R., Rugg, G., and Hedgecock, A. P., "The efficacy of knowledge acquisition techniques: A comparison across domains and levels of expertise," Knowledge Acquisition, vol. 2, pp. 167-178, 1990. [E19] Moody, J. W., Will, R. P., and Blanton, J. E., "Enhancing knowledge elicitation using the cognitive interview," Expert Systems with Applications, vol. 10, pp. 127-133, 1996. [E21] Zmud, R. W. , Anthony, W. P., and Stair, R. M., "The use of mental imagery to facilitate information identification in requirements analysis," Journal of Management Information Systems, vol. 9, pp. 175-191, 1993. [E25] Pitts, M. G. and Browne, G. J., "Stopping Behavior of Systems Analysts During Information Requirements Elicitation," Journal of Management Information Systems, vol. 21, pp. 203-226, 2004. Set 2 - Following are the list of studies used in the systematic review on inspection techniques. The unique alphanumeric number enlisting the studies represents the code used while performing the systematic review. Hence, the aggregation tables would have these numbers to represent the experiments. [E03] J. Miller, M. Wood, and M. Roper, “Further Experiences with Scenarios and Checklists,” Empirical Softw. Engg., vol. 3, 1998, pp. 37-64. [E13], [E14], [E15] O. Laitenberger, K.E. Emam, and T.G. Harbich, “An Internally Replicated Quasi-Experimental Comparison of Checklist and Perspective-Based Reading of
[E13], [E14], [E15] O. Laitenberger, K.E. Emam, and T.G. Harbich, "An Internally Replicated Quasi-Experimental Comparison of Checklist and Perspective-Based Reading of Code Documents," IEEE Trans. Softw. Eng., vol. 27, 2001, pp. 387-421.
[E04] K. Sandahl, O. Blomkvist, J. Karlsson, C. Krysander, M. Lindvall, and N. Ohlsson, "An Extended Replication of an Experiment for Assessing Methods for Software Requirements Inspections," vol. 3, no. 4, December 1998.
[E19] T. Thelin and P. Runeson, "An Experimental Comparison of Usage-Based and Checklist-Based Reading," Aug. 2003.
[E06] A.A. Porter and L.G. Votta, "An experiment to assess different defect detection methods for software requirements inspections," Proceedings of the 16th International Conference on Software Engineering, Sorrento, Italy: IEEE Computer Society Press, 1994, pp. 103-112.
[E20] T. Thelin, C. Andersson, P. Runeson, and N. Dzamashvili-Fogelstrom, "A Replicated Experiment of Usage-Based and Checklist-Based Reading," Proceedings of the Software Metrics, 10th International Symposium, IEEE Computer Society, 2004, pp. 246-256.
[E05] A. Porter and L. Votta, "Comparing Detection Methods For Software Requirements Inspections: A Replication Using Professional Subjects," Empirical Softw. Engg., vol. 3, 1998, pp. 355-379.
[E01] A.A. Porter, L.G. Votta, Jr., and V.R. Basili, "Comparing Detection Methods For Software Requirements Inspections: A Replicated Experiment," 1995.

Set 3 - The following studies were used in the systematic review on pair programming. The unique alphanumeric code identifying each study is the code used while performing the systematic review; the aggregation tables therefore use these codes to represent the experiments.

[P98] J.T. Nosek, "The Case for Collaborative Programming," Comm. ACM, vol. 41, no. 3, 1998, pp. 105-108.
[So0] L. Williams, R.R. Kessler, W. Cunningham, and R. Jeffries, "Strengthening the Case for Pair Programming," IEEE Software, vol. 17, no. 4, 2000, pp. 19-25.
[So1] J. Nawrocki and A. Wojciechowski, "Experimental Evaluation of Pair Programming," Proc. European Software Control and Metrics Conference (ESCOM 01), 2001, pp. 269-276.
[So2] P. Baheti, E. Gehringer, and D. Stotts, "Exploring the Efficacy of Distributed Pair Programming," Extreme Programming and Agile Methods - XP/Agile Universe 2002, LNCS 2418, Springer, 2002, pp. 208-220.
[Po2] M. Rostaher and M. Hericko, "Tracking Test First Pair Programming - An Experiment," Proc. XP/Agile Universe 2002, LNCS 2418, Springer, 2002, pp. 174-184.
[So3] S. Heiberg, U. Puus, P. Salumaa, and A. Seeba, "Pair-Programming Effect on Developers Productivity," Extreme Programming and Agile Processes in Software Eng. - Proc. 4th Int'l Conf. XP 2003, LNCS 2675, Springer, 2003, pp. 215-224.
[So5a] G. Canfora, A. Cimitlie, and C.A. Visaggio, "Empirical Study on the Productivity of the Pair Programming," Extreme Programming and Agile Processes in Software Eng. - Proc. 6th Int'l Conf. XP 2005, LNCS 3556, Springer, 2005, pp. 92-99.
[So5b] M.M. Müller, "Two Controlled Experiments Concerning the Comparison of Pair Programming to Peer Review," J. Systems and Software, vol. 78, no. 2, 2005, pp. 169-179.
[So5c] J. Vanhanen and C. Lassenius, "Effects of Pair Programming at the Development Team Level: An Experiment," Proc. Int'l Symp. Empirical Software Eng. (ISESE 05), IEEE CS Press, 2005, pp. 336-345.
[So6a] L. Madeyski, "The Impact of Pair Programming and Test-Driven Development on Package Dependencies in Object-Oriented Design - An Experiment," Product-Focused Software Process Improvement - Proc. 7th Int'l Conf. (Profes 06), LNCS 4034, Springer, 2006, pp. 278-289.
[So6b] M.M. Müller, "A Preliminary Study on the Impact of a Pair Design Phase on Pair Programming and Solo Programming," Information and Software Technology, vol. 48, no. 5, 2006, pp. 335-344.
[So6c] M. Phongpaibul and B. Boehm, "An Empirical Comparison between Pair Development and Software Inspection in Thailand," Proc. Int'l Symp. Empirical Software Eng. (ISESE 06), ACM Press, 2006, pp. 85-94.
[So6d] S. Xu and V. Rajlich, "Empirical Validation of Test-Driven Pair Programming in Game Development," Proc. Int'l Conf. Computer and Information Science and Int'l Workshop Component-Based Software Eng., Software Architecture and Reuse (ICIS-COMSAR 06), IEEE CS Press, 2006, pp. 500-505.
[Po7b] G. Canfora, A. Cimitile, F. Garcia, M. Piattini, and C.A. Visaggio, "Evaluating Performances of Pair Designing in Industry," J. Systems and Software, vol. 80, no. 8, 2007, pp. 1317-1327.
[Po7b] E. Arisholm, H. Gallis, T. Dybå, and D.I.K. Sjøberg, "Evaluating Pair Programming with Respect to System Complexity and Programmer Expertise," IEEE Trans. Software Eng., vol. 33, no. 2, 2007, pp. 65-86.