Reliability: What is it, and how is it measured?
by Anne Bruton, Joy H Conway, Stephen T Holgate

Key Words
Reliability, measurement, quantitative measures, statistical method.
Summary
Therapists regularly perform various measurements. How reliable these measurements are in themselves, and how reliable therapists are in using them, is clearly essential knowledge to help clinicians decide whether or not a particular measurement is of any value. The aim of this paper is to explain the nature of reliability, and to describe some of the commonly used estimates that attempt to quantify it. An understanding of reliability, and how it is estimated, will help therapists to make sense of their own clinical findings, and to interpret published studies.

Although reliability is generally perceived as desirable, there is no firm definition as to the level of reliability required to reach clinical acceptability. As with hypothesis testing, statistically significant levels of reliability may not translate into clinically acceptable levels, so that some authors’ claims about reliability may need to be interpreted with caution. Reliability is generally population specific, so that caution is also advised in making comparisons between studies.

The current consensus is that no single estimate is sufficient to provide the full picture about reliability, and that different types of estimate should be used together.

Bruton, A, Conway, J H and Holgate, S T (2000). ‘Reliability: What is it and how is it measured?’ Physiotherapy, 86, 2, 94-99.
Physiotherapy February 2000/vol 86/no 2

Introduction
Therapists regularly perform various measurements of varying reliability. The term ‘reliability’ here refers to the consistency or repeatability of such measurements. Irrespective of the area in which they work, therapists take measurements for any or all of the reasons outlined in table 1. How reliable these measurements are in themselves, and how reliable therapists are in performing them, is clearly essential knowledge to help clinicians decide whether or not a particular measurement is of any value.

Table 1: Common reasons why therapists perform measurements
- As part of patient assessment.
- As baseline or outcome measures.
- As aids to deciding upon treatment plans.
- As feedback for patients and other interested parties.
- As aids to making predictive judgements, eg about outcome.

This article focuses on the reliability of measures that generate quantitative data, and in particular ‘interval’ and ‘ratio’ data. Interval data have equal intervals between numbers but these are not related to true zero, so do not represent absolute quantity. Examples of interval data are IQ and degrees Centigrade or Fahrenheit. In the temperature scale, the difference between 10° and 20° is the same as between 70° and 80°, but is based on the numerical value of the scale, not the true nature of the variable itself. Therefore the actual difference in heat and molecular motion generated is not the same and it is not appropriate to say that someone is twice as hot as someone else. With ratio data, numbers represent units with equal intervals, measured from true zero, eg distance, age, time, weight, strength, blood pressure, range of motion, height. Numbers therefore reflect actual amounts of the variable being measured, and it is appropriate to say that one person is twice as heavy, tall, etc, as another. The kind of quantitative measures that therapists often carry out are outlined in table 2.

Table 2: Examples of quantitative measures performed by physiotherapists
- Strength measures (eg in newtons of force, kilos lifted).
- Angle or range of motion measures (eg in degrees, centimetres).
- Velocity or speed measures (eg in litres per minute for peak expiratory flow rate).
- Length or circumference measures (eg in metres, centimetres).

The aim of this paper is to explain the nature of reliability, and to describe, in general terms, some of the commonly used methods for quantifying it. It is not intended to be a detailed account of the statistical
minutiae associated with reliability measures, for which readers are referred to standard books on medical statistics.

Measurement Error
It is very rare to find any clinical measurement that is perfectly reliable, as all instruments and observers or measurers (raters) are fallible to some extent and all humans respond with some inconsistency. Thus any observed score (X) can be thought of as a function of two components, ie a true score (T) and an error component (E):

$X = T \pm E$

The difference between the true value and the observed value is measurement error. In statistical terms, ‘error’ refers to all sources of variability that cannot be explained by the independent (also known as the predictor, or explanatory) variable. Since the error components are generally unknown, it is only possible to estimate the amount of any measurement that is attributable to error and the amount that represents an accurate reading. This estimate is our measure of reliability.

Measurement errors may be systematic or random. Systematic errors are predictable errors, occurring in one direction only, constant and biased. For example, when using a measurement that is susceptible to a learning effect (eg strength testing), a retest may be consistently higher than a prior test (perhaps due to improved motor unit co-ordination). Such a systematic error would not therefore affect reliability, but would affect validity, as test values are not true representations of the quantity being measured. Random errors are due to chance and unpredictable, thus they are the basic concern of reliability.

Types of Reliability
Baumgarter (1989) has identified two types of reliability, ie relative reliability and absolute reliability.

Relative reliability is the degree to which individuals maintain their position in a sample over repeated measurements. Tables 3 and 4 give some maximum inspiratory pressure (MIP) measures taken on two occasions, 48 hours apart. In table 3, although the differences between the two measures vary from -16 to +22 centimetres of water, the ranking remains unchanged. That is, on both day 1 and day 2 subject 4 had the highest MIP, subject 1 the second highest, subject 5 the third highest, and so on. This form of reliability is often assessed by some type of correlation coefficient, eg Pearson’s correlation coefficient, usually written as r. For table 3 the data give a Pearson’s correlation coefficient of r = 0.94, generally accepted to indicate a high degree of correlation. In table 4, however, although the differences between the two measures look similar to those in table 3 (ie -15 to +22 cm of water), on this occasion the ranking has changed. Subject 4 has the highest MIP on day 1, but is second highest on day 2, subject 1 had the second highest MIP on day 1, but the lowest MIP on day 2, and so on. For table 4 data r = 0.51, which would be interpreted as a low degree of correlation. Correlation coefficients thus give information about association between two variables, and not necessarily about their proximity.

Table 3: Repeated maximum inspiratory pressure measures data demonstrating good relative reliability

           MIP (cm of water)                 Rank
Subject    Day 1    Day 2    Difference      Day 1    Day 2
1          110      120      +10             2        2
2           94      105      +11             4        4
3           86       70      -16             5        5
4          120      142      +22             1        1
5          107      107        0             3        3

Table 4: Repeated maximum inspiratory pressure measures data demonstrating poor relative reliability

           MIP (cm of water)                 Rank
Subject    Day 1    Day 2    Difference      Day 1    Day 2
1          110       95      -15             2        5
2           94      107      +13             4        3
3           86       97      +11             5        4
4          120      120        0             1        2
5          107      129      +22             3        1

Absolute reliability is the degree to which repeated measurements vary for individuals, ie the less they vary, the higher the reliability. This type of reliability is expressed either in the actual units of measurement, or as a proportion of the measured values. The standard error of measurement (SEM), coefficient of variation (CV) and Bland and Altman’s 95% limits of agreement (1986) are all examples of measures of absolute reliability. These will be described later.
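The relative-reliability comparison of tables 3 and 4 can be checked with a short calculation. The sketch below is illustrative only: the helper names `pearson_r` and `ranks` are not from the article, and the coefficient is computed from the usual product-moment formula. On the table 3 data it gives a value that rounds to 0.94; on the table 4 data it gives a value of roughly 0.5, consistent with the low correlation reported in the text.

```python
import math


def pearson_r(x, y):
    """Pearson's product-moment correlation for two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank 1 is the highest value (these data contain no ties)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


# MIP values (cm of water) copied from tables 3 and 4.
day1 = [110, 94, 86, 120, 107]
table3_day2 = [120, 105, 70, 142, 107]
table4_day2 = [95, 107, 97, 120, 129]

r_table3 = pearson_r(day1, table3_day2)
r_table4 = pearson_r(day1, table4_day2)

print("Table 3: r =", round(r_table3, 2), "ranks kept:", ranks(day1) == ranks(table3_day2))
print("Table 4: r =", round(r_table4, 2), "ranks kept:", ranks(day1) == ranks(table4_day2))
```

The rank check mirrors the definition of relative reliability: in table 3 every subject keeps the same position on both days, whereas in table 4 the positions change.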
Authors
Anne Bruton MA MCSP is currently involved in postgraduate research, Joy H Conway PhD MSc MCSP is a lecturer in physiotherapy, and Stephen T Holgate MD DSc FRCP is MRC professor of immunopharmacology, all at the University of Southampton.

This article was received on November 16, 1998, and accepted on September 7, 1999.

Address for Correspondence
Ms Anne Bruton, Health Research Unit, School of Health Professions and Rehabilitation Sciences, University of Southampton, Highfield, Southampton SO17 1BJ.

Funding
Anne Bruton is currently sponsored by a South and West Health Region R&D studentship.

Why Estimate Reliability?
Reliability testing is usually performed to assess one of the following:
- Instrumental reliability, ie the reliability of the measurement device.
- Rater reliability, ie the reliability of the researcher/observer/clinician administering the measurement device.
- Response reliability, ie the reliability/stability of the variable being measured.

How is Reliability Measured?
As described earlier, observed scores consist of the true value ± the error component. Since it is not possible to know the true value, the true reliability of any test is not calculable. It can however be estimated, based on the statistical concept of variance, ie a measure of the variability of differences among scores within a sample. The greater the dispersion of scores, the larger the variance; the more homogeneous the scores, the smaller the variance.

If a single measurer (rater) were to record the oxygen saturation of an individual 10 times, the resulting scores would not all be identical, but would exhibit some variance. Some of this total variance is due to true differences between scores (since oxygen saturation fluctuates), but some can be attributable to measurement error (E). Reliability (R) is the measure of the amount of the total variance attributable to true differences and can be expressed as the ratio of true score variance (T) to total variance or:

$R = \frac{T}{T + E}$

This ratio gives a value known as a reliability coefficient. As the observed score approaches the true score, reliability increases, so that with zero error there is perfect reliability and a coefficient of 1, because the observed score is the same as the true score. Conversely, as error increases reliability diminishes, so that with maximal error there is no reliability and the coefficient approaches 0. There is, however, no such thing as a minimum acceptable level of reliability that can be applied to all measures, as this will vary depending on the use of the test.

Indices of Reliability
In common with medical literature, physiotherapy literature shows no consistency in authors’ choice of reliability estimate calculated for their data. Table 5 summarises the more common reliability indices found in the literature, which are described below.

Table 5: Reliability indices in common use
- Hypothesis tests for bias, eg paired t-test, analysis of variance.
- Correlation coefficients, eg Pearson’s, ICC.
- Standard error of measurement (SEM).
- Coefficient of variation (CV).
- Repeatability coefficient.
- Bland and Altman 95% limits of agreement.

Indices Based on Hypothesis Testing for Bias
The paired t-test, and analysis of variance techniques are statistical methods for detecting systematic bias between groups of data. These estimates, based upon hypothesis testing, are often used in reliability studies. However, they give information only about systematic differences between the means of two sets of data, not about individual differences. Such tests should, therefore, not be used in isolation, but be complemented by other methods, eg Bland and Altman agreement tests (1986).

Correlation Coefficients (r)
As stated earlier, correlation coefficients give information about the degree of association between two sets of data, or the consistency of position within the two distributions. Provided the relative positions of each subject remain the same from test to test, high measures of correlation will be obtained. However, a correlation coefficient will not detect any systematic errors. So it is possible to have two sets of scores that are highly correlated, but not highly repeatable, as in table 6 where the hypothetical data give a Pearson’s correlation coefficient of r = 1, ie perfect correlation despite a systematic difference of 40 cm of water for each subject.

Thus correlation only tells how two sets of scores vary together, not the extent of agreement between them. Often researchers need to know that the actual values obtained by two measurements are the same, not just proportional to one another. Although published studies abound with correlation used as the sole indicator of reliability, their results can be misleading, and it is now recommended that they be no longer used in isolation (Keating and Matyas, 1998; Chinn, 1990).
Table 6: Repeated maximum inspiratory pressure measures data demonstrating a high Pearson’s correlation coefficient, but poor absolute reliability

           MIP (cm of water)                 Rank
Subject    Day 1    Day 2    Difference      Day 1    Day 2
1          110      150      +40             2        2
2           94      134      +40             4        4
3           86      126      +40             5        5
4          120      160      +40             1        1
5          107      147      +40             3        3

Intra-class Correlation Coefficient (ICC)
The intra-class correlation coefficient (ICC) is an attempt to overcome some of the limitations of the classic correlation coefficients. It is a single index calculated using variance estimates obtained through the partitioning of total variance into between and within subject variance (known as analysis of variance or ANOVA). It thus reflects both degree of consistency and agreement among ratings.

There are numerous versions of the ICC (Shrout and Fleiss, 1979) with each form being appropriate to specific situations. Readers interested in using the ICC can find worked examples relevant to rehabilitation in various published articles (Rankin and Stokes, 1998; Keating and Matyas, 1998; Stratford et al, 1984; Eliasziw et al, 1994). The use of the ICC implies that each component of variance has been estimated appropriately from sufficient data (at least 25 degrees of freedom), and from a sample representing the population to which the results will be applied (Chinn, 1991). In this instance, degrees of freedom can be thought of as the number of subjects multiplied by the number of measurements.

As with other reliability coefficients, there is no standard acceptable level of reliability using the ICC. It will range from 0 to 1, with values closer to one representing the higher reliability. Chinn (1991) recommends that any measure should have an intra-class correlation coefficient of at least 0.6 to be useful. The ICC is useful when comparing the repeatability of measures using different units, as it is a dimensionless statistic. It is most useful when three or more sets of observations are taken, either from a single sample or from independent samples. It does, however, have some disadvantages as described by Rankin and Stokes (1998) that make it unsuitable for use in isolation. As described earlier, any reliability coefficient is determined as the ratio of variance between subjects to the sum of error variance and subject variance. If the variance between subjects is sufficiently high (that is, the data come from a heterogeneous sample) then reliability will inevitably appear to be high. Thus if the ICC is applied to data from a group of individuals demonstrating a wide range of the measured characteristic, reliability will appear to be higher than when applied to a group demonstrating a narrow range of the same characteristic.

Standard Error of Measurement (SEM)
As mentioned earlier, if any measurement test were to be applied to a single subject an infinite number of times, it would be expected to generate responses that vary a little from trial to trial, as a result of measurement error. Theoretically these responses could be plotted and their distribution would follow a normal curve, with the mean equal to the true score, and errors occurring above and below the mean.

The more reliable the measurement response, the less error variability there would be around the mean. The standard deviation of measurement errors is therefore a reflection of the reliability of the test response, and is known as the standard error of measurement (SEM). The value for the SEM will vary from subject to subject, but there are equations for calculating a group estimate, eg

$\mathrm{SEM} = s_x \sqrt{1 - r_{xx}}$

where $s_x$ is the standard deviation of the set of observed test scores and $r_{xx}$ is the reliability coefficient for those data (often the ICC is used here).

The SEM is a measure of absolute reliability and is expressed in the actual units of measurement, making it easy to interpret, ie the smaller the SEM, the greater the reliability. It is only appropriate, however, for use with interval data (Atkinson and Nevill, 1998) since with ratio data the amount of random error may increase as the measured values increase.
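As a hedged illustration of how an ICC and the SEM group estimate fit together, the sketch below applies one particular ICC form to the table 3 data and then feeds it into the SEM equation. The article does not say which ICC version to use; the one-way random-effects, single-measurement form (case 1 in Shrout and Fleiss, 1979) is chosen here purely as an assumption, and all function names are illustrative. With only five subjects the data fall well short of the 25 degrees of freedom mentioned above, so the numbers serve only to show the mechanics.

```python
import math


def one_way_icc(scores):
    """One-way random-effects ICC for a single measurement.

    `scores` holds one list per subject, each with k repeated measures.
    Assumed form: (MSB - MSW) / (MSB + (k - 1) * MSW).
    """
    n = len(scores)
    k = len(scores[0])
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    subject_means = [sum(row) / k for row in scores]
    ss_between = k * sum((m - grand_mean) ** 2 for m in subject_means)
    ss_within = sum(
        (x - m) ** 2 for row, m in zip(scores, subject_means) for x in row
    )
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def sample_sd(values):
    """Sample standard deviation (n - 1 denominator)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


def sem(sd_observed, reliability):
    """Group estimate: SEM = s_x * sqrt(1 - r_xx)."""
    return sd_observed * math.sqrt(1.0 - reliability)


# Day 1 and day 2 MIP values (cm of water) per subject, from table 3.
table3 = [[110, 120], [94, 105], [86, 70], [120, 142], [107, 107]]

icc = one_way_icc(table3)
all_scores = [x for row in table3 for x in row]
sem_value = sem(sample_sd(all_scores), icc)

print("ICC (one-way, single measure):", round(icc, 3))
print("SEM (cm of water):", round(sem_value, 2))
```

Under these assumptions the ICC comes out noticeably lower than the Pearson coefficient of 0.94 quoted for the same data, which fits the statement that the ICC reflects agreement as well as consistency. The SEM is reported in cm of water, the units of the original measurements.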
Coefficient of Variation (CV)
The CV is an often-quoted estimate of measurement error, particularly in laboratory studies where multiple repeated tests are standard procedure. One form of the CV is calculated as the standard deviation of the data, divided by the mean and multiplied by 100 to give a percentage score. This expresses the standard deviation as a proportion of the mean, making it unit independent. However, as Bland (1987) points out, the problem with expressing the error as a percentage is that x% of the smallest observation will differ markedly from x% of the largest observation. Chinn (1991) suggests that it is preferable to use the ICC rather than the CV, as the former relates the size of the error variation to the size of the variation of interest. It has been suggested that the above form of the CV should no longer be used to estimate reliability, and that other more appropriate methods should be employed based on analysis of variance of logarithmically transformed data (Atkinson and Nevill, 1998).

Repeatability Coefficient
Another way to present measurement error over two tests, as recommended by the British Standards Institution (1979), is the value below which the difference between the two measurements will lie with probability 0.95. This is based upon the within-subject standard deviation (s). Provided the measurement errors are from a normal distribution this can be estimated by $1.96 \times \sqrt{2s^2}$, ie about 2.77s (or 2.83s when the multiplier 1.96 is rounded to 2), and is known as the repeatability coefficient (Bland and Altman, 1986). This name is rather confusing, as other coefficients (eg reliability coefficient) are expected to be unit free and in a range from zero to one. The method of calculation varies slightly in two different references (Bland and Altman, 1986; Bland, 1987), and to date it is not a frequently quoted statistic.

Bland and Altman Agreement Tests
In 1986 The Lancet published a paper by Bland and Altman that is frequently cited and has been instrumental in encouraging a change in the use of reliability estimates in the medical literature. In the past, studies comparing the reliability of two different instruments designed to measure the same variable (eg two different types of goniometer) often quoted correlation coefficients and ICCs. These can both be misleading, however, and are not appropriate for method comparison studies for reasons described by Bland and Altman in their 1986 paper. These authors have therefore proposed an approach for assessing agreement between two different methods of clinical measurement. This involves calculating the mean for each method and using this in a series of agreement tests.

Step 1 consists of plotting the difference in the two results against the mean value from the two methods. Step 2 involves calculating the mean and standard deviation of the differences between the measures. Step 3 consists of calculating the 95% limits of agreement (as the mean difference plus or minus two standard deviations of the differences), and 95% confidence intervals for these limits of agreement. The advantages of this approach are that by using scatterplots, data can be visually interpreted fairly swiftly. Any outliers, bias, or relationship between variance in measures and size of the mean can therefore be observed easily. The 95% limits of agreement provide a range of error that may relate to clinical acceptability, although this needs to be interpreted with reference to the range of measures in the raw data.

In the same paper, Bland and Altman have a section headed ‘Repeatability’ in which they recommend the use of the ‘repeatability coefficient’ (described earlier) for studies involving repeated measures with the same instrument. In their final discussion, however, they suggest that their agreement testing approach may be used either for analysis of repeatability of a single measurement method, or for method comparison studies. Worked examples using Bland and Altman agreement tests can be found in their original paper, and more recently in papers by Atkinson and Nevill (1998) and Rankin and Stokes (1998).

Nature of Reliability
Unfortunately, the concept of reliability is complex, with less of the straightforward ‘black and white’ statistical theory that surrounds hypothesis testing. When testing a research hypothesis there are clear guidelines to help researchers and clinicians decide whether results indicate that the hypothesis can be supported or not. In contrast, the decision as to whether a particular measurement tool or method is reliable or not is more open to interpretation. The decision to be made is whether the level of measurement error is
considered acceptable for practical use. There are no firm rules for making this decision, which will inevitably be context based. An error of ±5° in goniometry measures may be clinically acceptable in some circumstances, but may be less acceptable if definitive clinical decisions (eg surgical intervention) are dependent on the measure. Because of this dependence on the context in which they are produced, it is therefore very difficult to make comparisons of reliability across different studies, except in very general terms.

Conclusion
This paper has attempted to explain the concept of reliability and describe some of the estimates commonly used to quantify it. Key points to note about reliability are summarised in the panel below. Reliability should not necessarily be conceived as a property that a particular instrument or measurer does or does not possess. Any instrument will have a certain degree of reliability when applied to certain populations under certain conditions. The issue to be addressed is what level of reliability is considered to be clinically acceptable. In some circumstances there may be a choice only between a measure with lower reliability or no measure at all, in which case the less than perfect measure may still add useful information.

In recent years several authors have recommended that no single reliability estimate should be used for reliability studies. Opinion is divided over exactly which estimates are suitable for which circumstances. Rankin and Stokes (1998) have recently suggested that a consensus needs to be reached to establish which tests should be adopted universally. In general, however, it is suggested that no single estimate is universally appropriate, and that a combination of approaches is more likely to give a true picture of reliability.

References
Atkinson, G and Nevill, A M (1998). ‘Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine’, Sports Medicine, 26, 217-238.
Baumgarter, T A (1989). ‘Norm-referenced measurement: reliability’ in: Safrit, M J and Wood, T M (eds) Measurement Concepts in Physical Education and Exercise Science, Champaign, Illinois, pages 45-72.
Bland, J M (1987). An Introduction to Medical Statistics, Oxford University Press.
Bland, J M and Altman, D G (1986). ‘Statistical methods for assessing agreement between two methods of clinical measurement’, The Lancet, February 8, 307-310.
British Standards Institution (1979). ‘Precision of test methods. 1: Guide for the determination and reproducibility for a standard test method’ BS5497, part 1. BSI, London.
Chinn, S (1990). ‘The assessment of methods of measurement’, Statistics in Medicine, 9, 351-362.
Chinn, S (1991). ‘Repeatability and method comparison’, Thorax, 46, 454-456.
Eliasziw, M, Young, S L, Woodbury, M G et al (1994). ‘Statistical methodology for the concurrent assessment of inter-rater and intra-rater reliability: Using goniometric measurements as an example’, Physical Therapy, 74, 777-788.
Keating, J and Matyas, T (1998). ‘Unreliable inferences from reliable measurements’, Australian Journal of Physiotherapy, 44, 5-10.
Rankin, G and Stokes, M (1998). ‘Reliability of assessment tools in rehabilitation: An illustration of appropriate statistical analyses’, Clinical Rehabilitation, 12, 187-199.
Shrout, P E and Fleiss, J L (1979). ‘Intraclass correlations: Uses in assessing rater reliability’, Psychological Bulletin, 86, 420-428.
Stratford, P, Agostino, V, Brazeau, C and Gowitzke, B A (1984). ‘Reliability of joint angle measurement: A discussion of methodology issues’, Physiotherapy Canada, 36, 1, 5-9.

Key Messages
Reliability is:
- Population specific.
- Not an all-or-none phenomenon.
- Open to interpretation.
- Not the same as clinical acceptability.
- Related to the variability in the group studied.
- Best estimated by more than one index.
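The message that reliability is best estimated by more than one index can be made concrete with the table 3 data, for which the article reports a Pearson’s correlation coefficient of 0.94. The sketch below follows the three Bland and Altman steps described in the text, using plus or minus two standard deviations for the limits of agreement as stated there. It is a simplified illustration: the variable names are assumptions, the plot in step 1 is represented only by its two coordinate lists, and the confidence intervals for the limits are omitted.

```python
import math

# Day 1 and day 2 MIP values (cm of water) from table 3.
day1 = [110, 94, 86, 120, 107]
day2 = [120, 105, 70, 142, 107]

# Step 1: difference and mean for each subject
# (the y and x coordinates of the Bland and Altman plot).
differences = [b - a for a, b in zip(day1, day2)]
pair_means = [(a + b) / 2 for a, b in zip(day1, day2)]

# Step 2: mean and standard deviation of the differences.
n = len(differences)
mean_diff = sum(differences) / n
sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in differences) / (n - 1))

# Step 3: 95% limits of agreement as mean difference +/- 2 SD.
lower_limit = mean_diff - 2 * sd_diff
upper_limit = mean_diff + 2 * sd_diff

print("Mean difference:", round(mean_diff, 1))
print("SD of differences:", round(sd_diff, 1))
print("Limits of agreement:", round(lower_limit, 1), "to", round(upper_limit, 1))
```

A correlation of 0.94 on its own would suggest highly reliable measures, whereas the limits of agreement show how far a day 2 reading may plausibly sit from its day 1 counterpart in cm of water. Whether that range is acceptable remains a clinical judgement, as the key messages indicate.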