The document discusses the use of statistics in analytical chemistry. It provides definitions and explanations of key statistical concepts used to analyze chemical data, including:
- Mean, median, standard deviation, and variance as measures of central tendency and spread of data.
- The normal distribution and how it relates to accuracy and precision.
- Confidence intervals and how they are used to estimate the uncertainty around a measured value based on the standard deviation.
- How the size of data sets affects the confidence interval and uncertainty of results.
2. Aim of Statistics in Analytical Chemistry:
• Modern analytical chemistry is concerned with the detection,
identification, and measurement of the chemical composition of
unknown substances using existing instrumental techniques, and
the development or application of new techniques and
instruments. It is a quantitative science.
• Quantitative results are obtained using devices or instruments that
allow us to determine the concentration of a chemical in a sample
from an observable signal. There is always some variation in that
signal over time due to noise and/or drift within the instrument.
• One of the uses of statistics in analytical chemistry is therefore to
provide an estimate of the likely value of that error; in other words,
to establish the uncertainty associated with the measurement.
3. Errors in Chemical Analysis
Impossible to eliminate errors.
How reliable are our data?
Data of unknown quality are useless!
•Carry out replicate measurements
•Analyse accurately known standards
•Perform statistical tests on data
4. Mean
• Mean:
• Technically, the mean can be viewed as the expected value (the outcome) you
would obtain, on average, from a measurement (the event) performed repeatedly. It has the same
units as each individual measurement value.
• It is also important to differentiate between the population mean, μ, and the
sample mean, $\bar{x}$.
Defined as follows:
$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$$
where $x_i$ = individual values of x and N = number of replicate measurements
5. Median
• The middle result when data are arranged in
order of size (for even numbers the mean of
middle two). Median can be preferred when
there is an “outlier” - one reading very
different from rest. Median less affected by
outlier than is mean.
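A minimal sketch of how an outlier pulls the mean more than the median. It reuses the four nitrite readings from the Q-test example later in this document; the variable names are illustrative only.

```python
import statistics

# Nitrite readings (mg/l) from the Q-test example later in this document;
# 0.380 is the suspect value.
readings = [0.403, 0.410, 0.401, 0.380]

mean_all = statistics.mean(readings)
median_all = statistics.median(readings)  # mean of the two middle values for even N

# The same statistics with the suspect reading left out
trimmed = [x for x in readings if x != 0.380]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(f"with suspect value:    mean = {mean_all:.4f}, median = {median_all:.4f}")
print(f"without suspect value: mean = {mean_trimmed:.4f}, median = {median_trimmed:.4f}")
```

Dropping the suspect reading shifts the mean by about 0.006 mg/l but the median by only 0.001 mg/l.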
6. Standard deviation
• The standard deviation (denoted σ) also
provides a measure of the spread of repeated
measurements either side of the mean. An
advantage of the standard deviation over the
variance is that its units are the same as those
of the measurement. The standard deviation
also allows you to determine how many
significant figures are appropriate when
reporting a mean value
7. Sample Standard Deviation, s
The equation for σ must be modified for small samples of data, i.e. small N:
$$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$$
Two differences cf. the equation for σ:
1. Use sample mean instead of population mean.
2. Use degrees of freedom, N - 1, instead of N.
Reason is that in working out the mean, the sum of the
differences from the mean must be zero. If N - 1 values are
known, the last value is defined. Thus only N - 1 degrees
of freedom. For large values of N, used in calculating
σ, N and N - 1 are effectively equal.
8. Alternative Expression for s
(suitable for calculators)
$$s = \sqrt{\frac{\sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}{N - 1}}$$
Note: NEVER round off figures before the end of the calculation
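A short sketch, using the three lindane results from a later example in this document, of why rounding must wait until the end. The calculator form subtracts two nearly equal numbers, so rounding the intermediate sums changes the answer. The `round_to` argument is a hypothetical knob added here purely to mimic premature rounding.

```python
import math
import statistics

def s_definition(values):
    """Sample standard deviation from squared deviations about the mean."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def s_calculator(values, round_to=None):
    """Calculator form: sqrt((sum(x^2) - (sum x)^2 / N) / (N - 1)).

    round_to (hypothetical) rounds the two intermediate sums to that many
    decimal places before subtracting, to imitate rounding too early.
    """
    n = len(values)
    sum_sq = sum(x * x for x in values)
    sq_sum_over_n = sum(values) ** 2 / n
    if round_to is not None:
        sum_sq = round(sum_sq, round_to)
        sq_sum_over_n = round(sq_sum_over_n, round_to)
    return math.sqrt((sum_sq - sq_sum_over_n) / (n - 1))

data = [7.47, 6.98, 7.27]  # % lindane, from a later worked example

exact = s_definition(data)
unrounded = s_calculator(data)
rounded_early = s_calculator(data, round_to=1)

print(f"definition:            s = {exact:.4f}")
print(f"calculator form:       s = {unrounded:.4f}")
print(f"sums rounded to 1 d.p. s = {rounded_early:.4f}")
```

The first two results agree with each other and with `statistics.stdev`; the early-rounded one does not.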
9. σ : measure of precision of a population of data,
given by:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
where μ = population mean; N is very large.
10. Variance
• The variance (denoted σ²) represents the
spread (the dispersion) of the repeated
measurements either side of the mean. As the
notation implies, the units of the variance are
the square of the units of the mean value. The
greater the variance, the greater the
probability that any given measurement will
have a value noticeably different from the
mean.
11. Two alternative methods for measuring the precision of a set of results:
VARIANCE: This is the square of the standard deviation:
$$s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$$
COEFFICIENT OF VARIATION (CV)
(or RELATIVE STANDARD DEVIATION):
Divide the standard deviation by the mean value and express as a percentage:
$$CV = \frac{s}{\bar{x}} \times 100\%$$
12. Reproducibility of a method for determining
the selenium content of foods. 9 measurements
were made on a single batch of brown rice.
Sample   Selenium content (μg/g), x_i   x_i²
1        0.07                           0.0049
2        0.07                           0.0049
3        0.08                           0.0064
4        0.07                           0.0049
5        0.07                           0.0049
6        0.08                           0.0064
7        0.08                           0.0064
8        0.09                           0.0081
9        0.08                           0.0064
Σx_i = 0.69    Σx_i² = 0.0533
Mean = Σx_i/N = 0.077 μg/g    (Σx_i)²/N = 0.4761/9 = 0.0529
Standard deviation:
$$s = \sqrt{\frac{0.0533 - 0.0529}{9 - 1}} = 0.00707106 \approx 0.007\ \mu\text{g/g}$$
Coefficient of variation = 9.2%    Concentration = 0.077 ± 0.007 μg/g
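The selenium figures can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module; the variable names are illustrative, and the data are the nine values tabulated above.

```python
import statistics

# Nine replicate selenium results from the table above (same units as the table)
selenium = [0.07, 0.07, 0.08, 0.07, 0.07, 0.08, 0.08, 0.09, 0.08]

mean = statistics.mean(selenium)
s = statistics.stdev(selenium)   # sample standard deviation, N - 1 in the denominator
cv_percent = 100 * s / mean      # relative standard deviation as a percentage

print(f"mean = {mean:.3f}")
print(f"s    = {s:.5f}")
print(f"CV   = {cv_percent:.1f}%")
```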
13. Standard Error of a Mean
The standard deviation relates to the probable error in a single measurement.
If we take a series of N measurements, the probable error of the mean is less than
the probable error of any one measurement.
The standard error of the mean, $s_m$, is defined as follows:
$$s_m = \frac{s}{\sqrt{N}}$$
14. Accuracy & Precision
• Accuracy:
• Accuracy is defined as the closeness of a result
to the true value.
• This can be applied to a single measurement,
but is more commonly applied to the mean
value of several repeated measurements, or
replicates.
15. Accuracy & Precision
• Accuracy describes how close the measurement or result comes to the
"true" or accepted value. Accuracy is usually expressed as either the
absolute difference between the measured and "true" values (eq 1) or as
the relative difference between the "true" and experimental values (eq
2).
16. Example:
• A standard sample has an accepted absorbance
value of 0.516. The measured absorbance of this
sample is 0.509.
Then:
• S = 0.509 - 0.516 = -0.007
and:
• R = (0.509 - 0.516)/0.516
• R = -0.014 = -1.4% = -14 ppt (parts per thousand) = -14000 ppm
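A small sketch of the same calculation, assuming the convention "measured minus accepted" so that the sign shows the direction of the error. The function names are illustrative.

```python
def absolute_error(measured, accepted):
    """Signed absolute error: negative when the result is below the accepted value."""
    return measured - accepted

def relative_error(measured, accepted):
    """Signed error as a fraction of the accepted value."""
    return (measured - accepted) / accepted

accepted_absorbance = 0.516
measured_absorbance = 0.509

abs_err = absolute_error(measured_absorbance, accepted_absorbance)
rel_err = relative_error(measured_absorbance, accepted_absorbance)

print(f"absolute error = {abs_err:.3f}")
print(f"relative error = {100 * rel_err:.1f}% "
      f"= {1e3 * rel_err:.0f} ppt = {1e6 * rel_err:.0f} ppm")
```

The error is negative because the measured absorbance is below the accepted value. The ppm figure printed here comes from the unrounded relative error, so it differs slightly from the 14000 ppm obtained by converting the rounded 1.4%.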
17. Precision
• Precision is defined as the extent to which results
agree with one another. In other words, it is a
measure of consistency, and is usually evaluated
in terms of the range or spread of results.
• Practically, this means that precision is inherently
related to the standard deviation of the repeated
measurements.
• Precision describes the reproducibility of the
result.
18. Illustrating the difference between “accuracy” and “precision”
[Figure: four panels - low accuracy, low precision; low accuracy, high precision;
high accuracy, low precision; high accuracy, high precision]
19. Types of errors in experimental data
• The term "error" is defined as the difference
between a measured, calculated or observed
value and the "true" or accepted value.
• Systematic (determinate) errors
• Gross (illegitimate) errors
• Random (indeterminate) errors
20. Nature of Random Errors
• Uncontrollable variables are the source of
random errors
• The combined effect of random errors
produces the fluctuation of replicate
measurements around the mean
• Random errors are the major source of
uncertainty.
21. Normal distribution & confidence
intervals
• Probability Distributions:
• If we repeat the measurements enough times, we expect that the average will be close to
the true value, with the actual results spread around it. The distribution of a few
measurements might look something like a histogram, in which the frequency is the
number of times that a particular result occurs.
[Figure: histogram of frequency against measured value]
• In the example histogram, the values are distributed relatively evenly around a point somewhere between 1.2 and 1.6, so
the mean value of these measurements is probably around 1.4 or 1.5. This type of plot is
called a probability distribution.
22. The Normal Distribution:
• A normal distribution implies that if you take a large enough
number of measurements of the same property for the same
sample under the same conditions, the values will be distributed
around the expected value, or mean, and that the frequency with
which a particular result (i.e. value) occurs will become lower the
farther away the result is from the mean.
• Put another way, a normal distribution is a probability curve where
there is a high probability of an event (i.e. a particular value)
occurring near the mean value, with a decreasing chance of an
event occurring as we move away from the mean.
23. The Normal distribution curve and equation look like this:
The Normal distribution is also known as the Gaussian distribution
The equation for a Gaussian curve is defined in terms of μ and σ, as follows:
$$y = \frac{e^{-(x - \mu)^2 / 2\sigma^2}}{\sigma \sqrt{2\pi}}$$
24. Two Gaussian curves with two different
standard deviations, σA and σB (= 2σA)
General Gaussian curve plotted in
units of z, where
$$z = \frac{x - \mu}{\sigma}$$
i.e. deviation from the mean of a
datum in units of standard
deviation. Plot can be used for
data with given value of mean,
and any standard deviation.
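A minimal sketch of the Gaussian curve and the z transformation. The mean and standard deviations below are arbitrary illustrative numbers, not values from the slides. It shows that doubling σ halves the peak height, and that a point one standard deviation above the mean has z = 1 on either curve.

```python
import math
from statistics import NormalDist

def gaussian(x, mu, sigma):
    """Height of the Gaussian curve at x for mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def z_score(x, mu, sigma):
    """Deviation of x from the mean in units of the standard deviation."""
    return (x - mu) / sigma

mu = 50.0              # arbitrary illustrative mean
sigma_a = 5.0          # arbitrary illustrative standard deviation
sigma_b = 2 * sigma_a  # second curve is twice as broad

peak_a = gaussian(mu, mu, sigma_a)
peak_b = gaussian(mu, mu, sigma_b)

print(f"peak height, curve A: {peak_a:.4f}")
print(f"peak height, curve B: {peak_b:.4f}")
print(f"z one sigma above the mean: A = {z_score(mu + sigma_a, mu, sigma_a)}, "
      f"B = {z_score(mu + sigma_b, mu, sigma_b)}")
```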
25. SAMPLE = finite number of observations
POPULATION = total (infinite) number of observations
Properties of Gaussian curve defined in terms of population.
Then see where modifications needed for small samples of data
Main properties of Gaussian curve:
Population mean (μ): defined as earlier (N → ∞). In absence of systematic error,
μ is the true value (maximum on Gaussian curve).
Remember, sample mean ($\bar{x}$) defined for small values of N.
(Sample mean → population mean when N is about 20 or more)
Population standard deviation (σ) - defined on next overhead
26. The Normal Distribution
• The normal distributions are a very important class of
statistical distributions.
• All normal distributions are symmetric and have bell-
shaped density curves with a single peak.
• To speak specifically of any normal distribution, two
quantities have to be specified: the mean μ, where the
peak of the density occurs, and the standard deviation σ,
which indicates the spread or girth of the bell curve
• A standard normal distribution is a normal distribution
with mean 0 and standard deviation 1
27. The 68-95-99.7% Rule (important)
• All normal density curves satisfy the following:
• 68% of the observations fall within 1 standard
deviation of the mean.
• 95% of the observations fall within 2 standard
deviations of the mean.
• 99.7% of the observations fall within 3 standard
deviations of the mean
• Thus, for a normal distribution, almost all values
lie within 3 standard deviations of the mean
28. Only a small fraction of observations (0.3% = 1 in 333) lie outside this range.
29. Another way of looking at it!
This is merely a consequence of the 68-95-99.7 rule.
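The rule can be checked numerically. The sketch below uses the standard result that the fraction of a normal population within k standard deviations of the mean is erf(k/√2); the function name is illustrative.

```python
import math

def fraction_within(k):
    """Fraction of a normal population within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * fraction_within(k):.2f}%")

outside_3 = 1 - fraction_within(3)
print(f"outside 3 standard deviations: {100 * outside_3:.2f}%")
```

The unrounded fraction outside three standard deviations is about 0.27%, which the slides round to 0.3%.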
31. How can we relate the observed mean value ($\bar{x}$) to the true mean (μ)?
The latter can never be known exactly.
The range of uncertainty depends on how closely s corresponds to σ.
We can calculate the limits (above and below) around $\bar{x}$ within which μ must lie,
with a given degree of probability.
32. Define some terms:
CONFIDENCE LIMITS
interval around the mean that probably contains μ.
CONFIDENCE INTERVAL
the magnitude of the confidence limits
CONFIDENCE LEVEL
fixes the level of probability that the mean is within the confidence limits
Examples later. First assume that the known s is a good
approximation to σ.
33. Percentages of area under Gaussian curves between certain limits of z (= (x - μ)/σ)
50% of area lies between ±0.67σ
80%     "     ±1.29σ
90%     "     ±1.64σ
95%     "     ±1.96σ
99%     "     ±2.58σ
What this means, for example, is that 80 times out of 100 the true mean will lie
within ±1.29σ of any measurement we make.
Thus, at a confidence level of 80%, the confidence limits are ±1.29σ.
For a single measurement: CL for μ = x ± zσ (values of z on next overhead)
For the sample mean of N measurements ($\bar{x}$), the equivalent expression is:
$$\text{CL for } \mu = \bar{x} \pm \frac{z\sigma}{\sqrt{N}}$$
34. Values of z for determining Confidence
Limits
Confidence level, % z
50 0.67
68 1.0
80 1.29
90 1.64
95 1.96
95.4 2.00
99 2.58
99.7 3.00
99.9 3.29
Note: these figures assume that an excellent approximation
to the real standard deviation is known.
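The z values can be recovered from the standard normal distribution. This sketch uses `statistics.NormalDist` (Python 3.8+) and compares a subset of the tabulated values; the function name is illustrative.

```python
from statistics import NormalDist

def z_for_confidence(level_percent):
    """Two-sided z: level_percent of the area lies between -z and +z."""
    tail = (1 - level_percent / 100) / 2   # area left in each tail
    return NormalDist().inv_cdf(1 - tail)

# A subset of the tabulated values above
tabulated = {50: 0.67, 80: 1.29, 90: 1.64, 95: 1.96, 99: 2.58, 99.9: 3.29}
computed = {level: z_for_confidence(level) for level in tabulated}

for level, z_table in tabulated.items():
    print(f"{level:>5}%  table z = {z_table:.2f}  computed z = {computed[level]:.3f}")
```

For these levels the computed values agree with the table to within about 0.01.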
35. Confidence Limits when σ is known
Atomic absorption analysis for copper concentration in aircraft engine oil gave a value
of 8.53 μg Cu/ml. Pooled results of many analyses showed s → σ = 0.32 μg Cu/ml.
Calculate 90% and 99% confidence limits if the above result were based on (a) 1, (b) 4,
(c) 16 measurements.
(a)
$$90\%\ \text{CL} = 8.53 \pm \frac{(1.64)(0.32)}{\sqrt{1}} = 8.53 \pm 0.52\ \mu\text{g/ml}$$
i.e. 8.5 ± 0.5 μg/ml
$$99\%\ \text{CL} = 8.53 \pm \frac{(2.58)(0.32)}{\sqrt{1}} = 8.53 \pm 0.83\ \mu\text{g/ml}$$
i.e. 8.5 ± 0.8 μg/ml
(b)
$$90\%\ \text{CL} = 8.53 \pm \frac{(1.64)(0.32)}{\sqrt{4}} = 8.53 \pm 0.26\ \mu\text{g/ml}$$
i.e. 8.5 ± 0.3 μg/ml
$$99\%\ \text{CL} = 8.53 \pm \frac{(2.58)(0.32)}{\sqrt{4}} = 8.53 \pm 0.41\ \mu\text{g/ml}$$
i.e. 8.5 ± 0.4 μg/ml
(c)
$$90\%\ \text{CL} = 8.53 \pm \frac{(1.64)(0.32)}{\sqrt{16}} = 8.53 \pm 0.13\ \mu\text{g/ml}$$
i.e. 8.5 ± 0.1 μg/ml
$$99\%\ \text{CL} = 8.53 \pm \frac{(2.58)(0.32)}{\sqrt{16}} = 8.53 \pm 0.21\ \mu\text{g/ml}$$
i.e. 8.5 ± 0.2 μg/ml
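The six copper results follow from one expression, zσ/√N. A short sketch, with an illustrative function name and the z values taken from the table of z values above:

```python
import math

def confidence_half_width(z, sigma, n):
    """Half-width of the confidence limits for a mean of n measurements, sigma known."""
    return z * sigma / math.sqrt(n)

mean_cu = 8.53                 # reported copper value
sigma = 0.32                   # pooled standard deviation, same units
z_values = {90: 1.64, 99: 2.58}

results = {}
for n in (1, 4, 16):
    for level, z in z_values.items():
        results[(level, n)] = confidence_half_width(z, sigma, n)
        print(f"N = {n:>2}, {level}% CL: {mean_cu} ± {results[(level, n)]:.2f}")
```

Quadrupling the number of measurements halves the width of the limits, because the width scales as 1/√N.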
36. If we have no information on σ, and only have a value for s -
the confidence interval is larger,
i.e. there is a greater uncertainty.
Instead of z, it is necessary to use the parameter t, defined as follows:
t = (x - μ)/s
i.e. just like z, but using s instead of σ.
By analogy we have:
$$\text{CL for } \mu = \bar{x} \pm \frac{ts}{\sqrt{N}}$$
(where $\bar{x}$ = sample mean for N measurements)
The calculated values of t are given on the next overhead
37. Values of t for various levels of probability
Degrees of freedom (N-1)   80%    90%    95%    99%
1                          3.08   6.31   12.7   63.7
2                          1.89   2.92   4.30   9.92
3                          1.64   2.35   3.18   5.84
4                          1.53   2.13   2.78   4.60
5                          1.48   2.02   2.57   4.03
6                          1.44   1.94   2.45   3.71
7                          1.42   1.90   2.36   3.50
8                          1.40   1.86   2.31   3.36
9                          1.38   1.83   2.26   3.25
19                         1.33   1.73   2.10   2.88
59                         1.30   1.67   2.00   2.66
∞                          1.29   1.64   1.96   2.58
Note: (1) As (N-1) → ∞, so t → z
(2) For all values of (N-1) < ∞, t > z, i.e. greater uncertainty
38. Confidence Limits where σ is not known
Analysis of an insecticide gave the following values for % of the chemical lindane:
7.47, 6.98, 7.27. Calculate the CL for the mean value at the 90% confidence level.
x_i (%)   x_i²
7.47      55.8009
6.98      48.7204
7.27      52.8529
Σx_i = 21.72    Σx_i² = 157.3742
$$\bar{x} = \frac{\sum x_i}{N} = \frac{21.72}{3} = 7.24\%$$
$$s = \sqrt{\frac{\sum x_i^2 - \frac{(\sum x_i)^2}{N}}{N - 1}} = \sqrt{\frac{157.3742 - \frac{(21.72)^2}{3}}{2}} = 0.246 \approx 0.25\%$$
$$90\%\ \text{CL} = \bar{x} \pm \frac{ts}{\sqrt{N}} = 7.24 \pm \frac{(2.92)(0.25)}{\sqrt{3}} = 7.24 \pm 0.42\%$$
If repeated analyses showed that s → σ = 0.28%:
$$90\%\ \text{CL} = \bar{x} \pm \frac{z\sigma}{\sqrt{N}} = 7.24 \pm \frac{(1.64)(0.28)}{\sqrt{3}} = 7.24 \pm 0.27\%$$
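A sketch of the lindane calculation. The standard library has no t distribution, so the 90% t values are copied into a small lookup (`T_90`, an illustrative name) from the t table in this document rather than computed.

```python
import math
import statistics

# 90% t values keyed by degrees of freedom (N - 1), copied from the t table above
T_90 = {1: 6.31, 2: 2.92, 3: 2.35, 4: 2.13}

lindane = [7.47, 6.98, 7.27]   # % lindane
n = len(lindane)

mean = statistics.mean(lindane)
s = statistics.stdev(lindane)            # sample standard deviation
t = T_90[n - 1]
half_width_t = t * s / math.sqrt(n)      # sigma unknown: use t and s

sigma = 0.28                             # case where sigma is known from repeated analyses
half_width_z = 1.64 * sigma / math.sqrt(n)

print(f"mean = {mean:.2f}%, s = {s:.3f}%")
print(f"90% CL using t: {mean:.2f} ± {half_width_t:.2f}%")
print(f"90% CL using z: {mean:.2f} ± {half_width_z:.2f}%")
```

The code carries the unrounded s through to the end, so intermediate digits can differ slightly from the hand calculation, but both limits round to the same values.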
39. Testing a Hypothesis
Carry out measurements on an accurately known standard.
Experimental value is different from the true value.
Is the difference due to a systematic error (bias) in the method - or simply to random error?
Assume that there is no bias
(NULL HYPOTHESIS),
and calculate the probability
that the experimental error
is due to random errors.
Figure shows (A) the curve for
the true value (μA = μt) and
(B) the experimental curve (μB)
40. Bias = μB - μA = μB - x_t.
Test for bias by comparing $\bar{x} - x_t$ with the
difference caused by random error.
Remember confidence limit for μ (assumed to be x_t, i.e. assume no bias)
is given by:
$$\text{CL for } \mu = \bar{x} \pm \frac{ts}{\sqrt{N}}$$
At desired confidence level, random
errors can lead to:
$$\bar{x} - x_t = \pm \frac{ts}{\sqrt{N}}$$
If $|\bar{x} - x_t| > \frac{ts}{\sqrt{N}}$, then at the desired
confidence level bias (systematic error)
is likely (and vice versa).
41. Detection of Systematic Error (Bias)
A standard material known to contain
38.9% Hg was analysed by
atomic absorption spectroscopy.
The results were 38.9%, 37.4%
and 37.1%. At the 95% confidence level,
is there any evidence for
a systematic error in the method?
$$\bar{x} = 37.8\%, \quad \bar{x} - x_t = -1.1\%$$
$$\sum x_i = 113.4, \quad \sum x_i^2 = 4288.38$$
$$s = \sqrt{\frac{4288.38 - (113.4)^2/3}{2}} = 0.964\%$$
Assume null hypothesis (no bias). Only reject this if
$$|\bar{x} - x_t| > \frac{ts}{\sqrt{N}}$$
But t (from Table) = 4.30, s (calc. above) = 0.964% and N = 3
$$\frac{ts}{\sqrt{N}} = \frac{4.30 \times 0.964}{\sqrt{3}} = 2.39\%$$
$$|\bar{x} - x_t| = 1.1\% < \frac{ts}{\sqrt{N}}$$
Therefore the null hypothesis is maintained, and there is no
evidence for systematic error at the 95% confidence level.
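The bias test can be sketched as a small function, with s computed directly from the three results. The function name and return layout are illustrative; t = 4.30 is the tabulated value for 2 degrees of freedom at the 95% level.

```python
import math
import statistics

def bias_test(results, true_value, t_critical):
    """Compare |mean - true value| with t*s/sqrt(N).

    Returns (difference, threshold, bias_likely).
    """
    n = len(results)
    mean = statistics.mean(results)
    s = statistics.stdev(results)              # sample standard deviation
    threshold = t_critical * s / math.sqrt(n)  # largest difference random error explains
    difference = mean - true_value
    return difference, threshold, abs(difference) > threshold

difference, threshold, biased = bias_test([38.9, 37.4, 37.1], 38.9, t_critical=4.30)

print(f"mean - true value = {difference:.1f}%")
print(f"t*s/sqrt(N)       = {threshold:.2f}%")
print("bias likely" if biased else "no evidence of bias at this confidence level")
```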
42. Are two sets of measurements significantly different?
Suppose two samples are analysed under identical conditions.
Sample 1: mean $\bar{x}_1$ from $N_1$ replicate analyses
Sample 2: mean $\bar{x}_2$ from $N_2$ replicate analyses
Are these significantly different?
Using definition of pooled standard deviation, the equation on the last
overhead can be re-arranged:
$$\bar{x}_1 - \bar{x}_2 = \pm t\, s_{pooled} \sqrt{\frac{N_1 + N_2}{N_1 N_2}}$$
Only if the difference between the two samples is greater than the term on
the right-hand side can we assume a real difference between the samples.
43. Test for significant difference between two sets of data
Two different methods for the analysis of boron in plant samples
gave the following results (μg/g):
$\bar{x}_1$ = 28.0 (spectrophotometry)
$\bar{x}_2$ = 26.25 (fluorimetry)
Each based on 5 replicate measurements.
At the 99% confidence level, are the mean values significantly
different?
Calculate $s_{pooled}$ = 0.267. There are 8 degrees of freedom,
therefore (Table) t = 3.36 (99% level).
Level for rejecting null hypothesis is
$$\pm t\, s_{pooled} \sqrt{\frac{N_1 + N_2}{N_1 N_2}} = \pm (3.36)(0.267) \sqrt{\frac{10}{25}}$$
i.e. ± 0.5674, or ±0.57 μg/g.
But $\bar{x}_1 - \bar{x}_2 = 28.0 - 26.25 = 1.75$ μg/g
$$\text{i.e. } |\bar{x}_1 - \bar{x}_2| > t\, s_{pooled} \sqrt{\frac{N_1 + N_2}{N_1 N_2}}$$
Therefore, at this confidence level, there is a significant
difference, and there must be a systematic error in at least
one of the methods of analysis.
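A sketch of the comparison for the boron data, taking the pooled standard deviation and the t value as given in the text. The function name is illustrative.

```python
import math

def difference_threshold(t, s_pooled, n1, n2):
    """Largest difference between two means that random error alone can explain."""
    return t * s_pooled * math.sqrt((n1 + n2) / (n1 * n2))

threshold = difference_threshold(t=3.36, s_pooled=0.267, n1=5, n2=5)
observed = 28.0 - 26.25          # difference between the two method means
significant = abs(observed) > threshold

print(f"threshold  = ±{threshold:.4f}")
print(f"difference = {observed:.2f}")
print("significant difference" if significant else "no significant difference")
```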
44. Detection of Gross Errors
A set of results may contain an outlying result
- out of line with the others.
Should it be retained or rejected?
There is no universal criterion for deciding this.
One rule that can give guidance is the Q test.
Consider a set of results.
The parameter Qexp is defined as follows:
$$Q_{exp} = \frac{|x_q - x_n|}{w}$$
where x_q = questionable result
x_n = nearest neighbour
w = spread of entire set
45. Qexp is then compared to a set of values Qcrit:
Rejection of outlier recommended if Qexp > Qcrit for the desired confidence level.
Note:1. The higher the confidence level, the less likely is
rejection to be recommended.
2. Rejection of outliers can have a marked effect on mean
and standard deviation, esp. when there are only a few
data points. Always try to obtain more data.
3. If outliers are to be retained, it is often better to report
the median value rather than the mean.
Qcrit (reject if Qexp > Qcrit)
No. of observations   90%   95%   99%   (confidence level)
3 0.941 0.970 0.994
4 0.765 0.829 0.926
5 0.642 0.710 0.821
6 0.560 0.625 0.740
7 0.507 0.568 0.680
8 0.468 0.526 0.634
9 0.437 0.493 0.598
10 0.412 0.466 0.568
46. Q Test for Rejection of Outliers
The following values were obtained for
the concentration of nitrite ions in a sample
of river water: 0.403, 0.410, 0.401, 0.380 mg/l.
Should the last reading be rejected?
$$Q_{exp} = \frac{|0.380 - 0.401|}{0.410 - 0.380} = 0.7$$
But Qcrit = 0.829 (at 95% level) for 4 values
Therefore, Qexp < Qcrit, and we cannot reject the suspect value.
Suppose 3 further measurements taken, giving total values of:
0.403, 0.410, 0.401, 0.380, 0.400, 0.413, 0.411 mg/l. Should
0.380 still be retained?
$$Q_{exp} = \frac{|0.380 - 0.400|}{0.413 - 0.380} = 0.606$$
But Qcrit = 0.568 (at 95% level) for 7 values
Therefore, Qexp > Qcrit, and rejection of 0.380 is recommended.
But note that 5 times in 100 it will be wrong to reject this suspect value!
Also note that if 0.380 is retained, s = 0.011 mg/l, but if it is rejected,
s = 0.0056 mg/l, i.e. precision appears to be twice as good, just by
rejecting one value.
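A minimal sketch of the Q test applied to the nitrite data. The 95% critical values are copied from the Qcrit table in this document; the function names are illustrative.

```python
import statistics

# 95% critical values keyed by number of observations, from the Qcrit table above
Q_CRIT_95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625,
             7: 0.568, 8: 0.526, 9: 0.493, 10: 0.466}

def q_exp(values, suspect):
    """Gap between the suspect value and its nearest neighbour, divided by the spread."""
    others = list(values)
    others.remove(suspect)
    nearest = min(others, key=lambda x: abs(x - suspect))
    spread = max(values) - min(values)
    return abs(suspect - nearest) / spread

def rejection_recommended(values, suspect, q_crit_table=Q_CRIT_95):
    """True when Q_exp exceeds the critical value for this number of observations."""
    return q_exp(values, suspect) > q_crit_table[len(values)]

first = [0.403, 0.410, 0.401, 0.380]       # mg/l
extended = first + [0.400, 0.413, 0.411]   # three further measurements

for data in (first, extended):
    print(f"N = {len(data)}: Q_exp = {q_exp(data, 0.380):.3f}, "
          f"reject = {rejection_recommended(data, 0.380)}")

# Effect of the rejection on the apparent precision
s_retained = statistics.stdev(extended)
s_rejected = statistics.stdev([x for x in extended if x != 0.380])
print(f"s retained = {s_retained:.4f} mg/l, s rejected = {s_rejected:.4f} mg/l")
```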