1. INTRODUCTION TO STATISTICS & PROBABILITY
Chapter 1:
Looking at Data—Distributions (Part 2)
1.2 Describing Distributions with Numbers
Dr. Nahid Sultana
2. 1.2 Describing Distributions with Numbers
Objectives
Measures of center: mean, median
Measures of spread: quartiles, standard deviation
Five-number summary and boxplot
IQR and outliers
Choosing among summary statistics
Changing the unit of measurement
3. Measures of center: The Mean
The most common measure of center is the arithmetic
average, or mean, or sample mean.
To calculate the average, or mean, add all values, then
divide by the number of individuals.
It is the “center of mass.”
If the n observations are $x_1, x_2, x_3, \ldots, x_n$, their mean is:
$$\bar{x} = \frac{\text{sum of observations}}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
or, in more compact notation, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
4. Measures of center: The Mean (cont…)
Find the mean:
Here are the scores on the first exam in an introductory
statistics course for 10 students:
80  73  92  85  75  98  93  55  80  90
Find the mean first-exam score for these students.
Solution:
$$\bar{x} = \frac{80 + 73 + 92 + 85 + 75 + 98 + 93 + 55 + 80 + 90}{10} = \frac{821}{10} = 82.1$$
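The arithmetic can be checked with a few lines of Python. This is a minimal sketch: the ten scores are the ones listed for this class in the slides, while the names `scores`, `mean`, and `x_bar` are illustrative choices, not part of the original material.

```python
# First-exam scores for the 10 students (values taken from the slides).
scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]


def mean(values):
    """Arithmetic mean: add all values, then divide by how many there are."""
    return sum(values) / len(values)


x_bar = mean(scores)
print(x_bar)  # 82.1
```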
5. Measuring Center: The Median
Another common measure of center is the median.
The median M is the midpoint of a distribution, the
number such that half of the observations are smaller
and the other half are larger.
To find the median of a distribution:
1. Arrange all observations from smallest to largest.
2. If the number of observations n is odd, the median M is the
center observation in the ordered list.
3. If the number of observations n is even, the median M is the
average of the two center observations in the ordered list.
6. Measuring Center: The Median (cont...)
Find the median:
Here are the scores on the first exam in an introductory
statistics course for 10 students:
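80  73  92  85  75  98  93  55  80  90
Find the median first-exam score for these students.
Solution: In order, the scores are
55  73  75  80  80  85  90  92  93  98
Since n = 10 is even, the median is the average of the two center observations (the 5th and 6th): M = (80 + 85)/2 = 82.5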
Note: The location of the median is (n + 1)/2 in the sorted list.
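The three-step recipe for the median can be sketched in Python as follows. The function name `median` and the variable names are illustrative assumptions; the rule itself (middle value for odd n, average of the two middle values for even n) is the one stated in the slides.

```python
def median(values):
    """Median by the sort-and-pick-the-middle rule."""
    ordered = sorted(values)          # step 1: arrange smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # step 2: odd n, take the center observation
        return ordered[mid]
    # step 3: even n, average the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2


scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]
m = median(scores)
print(m)  # 82.5
```

With n = 10 the location (n + 1)/2 = 5.5 falls halfway between the 5th and 6th sorted values, which is why the two are averaged.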
8. Comparing Mean and Median (Cont...)
The mean and the median are equal when the distribution is exactly
symmetric, and close together when it is roughly symmetric.
In a skewed distribution, the mean is usually farther out in the long tail than is the median.
The median is a measure of center that is resistant to skew and
outliers. The mean is not.
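A small experiment illustrates the resistance of the median. The sketch below uses Python's standard `statistics` module; the altered data set, in which the score 98 is mistyped as 980, is a hypothetical example invented here and does not come from the slides.

```python
import statistics

scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]

# Hypothetical data-entry error: 98 recorded as 980.
with_error = [980 if x == 98 else x for x in scores]

print(statistics.mean(scores), statistics.median(scores))
print(statistics.mean(with_error), statistics.median(with_error))
```

One extreme value pulls the mean far away from the bulk of the data, while the median does not move because the order of the middle observations is unchanged.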
9. Measuring Spread: The Quartiles
A measure of center alone can be misleading. A useful numerical
description of a distribution requires both a measure of center and a
measure of spread.
We describe the spread or variability of a distribution by giving
several percentiles.
The median divides the data in two parts; half of the observations
are above the median and half are below the median. We could
call the median the 50th percentile.
The lower quartile (first quartile, Q1) is the median of the lower
half of the data; the upper quartile (third quartile, Q3) is the
median of the upper half of the data.
With the median, the quartiles divide the data into four equal
parts; 25% of the data are in each part.
10. Measuring Spread: The Quartiles (Cont.)
Calculate the quartiles and interquartile range:
1. Arrange the observations in increasing order and locate the median M.
2. The first quartile Q1 is the median of the lower half of the data, excluding M.
3. The third quartile Q3 is the median of the upper half of the data, excluding M.
11. Measuring Spread: The Quartiles (Cont.)
Example: Here are the scores on the first-exam in an introductory
statistics course for 10 students:
80  73  92  85  75  98  93  55  80  90
Find the quartiles for these first-exam scores.
Solution: In order, the scores are:
55  73  75  80  80  85  90  92  93  98
The median is M = (80 + 85)/2 = 82.5, the average of the two center observations.
Q1 = 75, the median of the first five numbers: 55, 73, 75, 80, 80.
Q3 = 92, the median of the last five numbers: 85, 90, 92, 93, 98.
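The median-of-halves procedure can be sketched in Python as below. The helper names `_median_sorted` and `quartiles` are illustrative assumptions. Statistical software often uses other quartile conventions (for example, interpolation between observations), so library results may differ slightly from this textbook rule.

```python
def _median_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves rule, excluding M itself."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]            # observations below the median position
    upper = ordered[n // 2 + n % 2 :]    # when n is odd, skip the median itself
    return _median_sorted(lower), _median_sorted(ordered), _median_sorted(upper)


scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]
q1, m, q3 = quartiles(scores)
print(q1, m, q3)  # 75 82.5 92
```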
12. The Five-Number Summary
The five-number summary of a distribution consists of
The smallest observation (Min)
The first quartile (Q1)
The median (M)
The third quartile (Q3)
The largest observation (Max)
written in order from smallest to largest.
Minimum   Q1   M   Q3   Maximum
13. Boxplots
A boxplot is a graph of the five-number summary.
Draw a central box from Q1 to Q3.
Draw a line inside the box to mark the median M.
Extend lines from the box out to the minimum and maximum
values that are not outliers.
14. Boxplots (Cont…)
Example: Here are the scores on the first-exam in an introductory
statistics course for 10 students:
80  73  92  85  75  98  93  55  80  90
Make a boxplot for these first-exam scores.
Solution: In order, the scores are:
55, 73, 75, 80, 80, 85, 90, 92, 93, 98
Min = 55
Q1 = 75
M = 82.5
Q3 = 92
Max = 98
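The five numbers that the boxplot is drawn from can be computed with a short Python sketch. The `FiveNumberSummary` tuple and the function names are hypothetical, introduced only for illustration; quartiles follow the median-of-halves rule from the slides, and the input is assumed to hold at least two observations.

```python
from typing import NamedTuple


class FiveNumberSummary(NamedTuple):
    minimum: float
    q1: float
    median: float
    q3: float
    maximum: float


def _mid(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    return ordered[n // 2] if n % 2 else (ordered[n // 2 - 1] + ordered[n // 2]) / 2


def five_number_summary(values):
    """Min, Q1, M, Q3, Max; assumes at least two observations."""
    ordered = sorted(values)
    n = len(ordered)
    lower, upper = ordered[: n // 2], ordered[n // 2 + n % 2 :]
    return FiveNumberSummary(ordered[0], _mid(lower), _mid(ordered), _mid(upper), ordered[-1])


scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]
summary = five_number_summary(scores)
print(summary)
```

The box spans `summary.q1` to `summary.q3`, the line inside it sits at `summary.median`, and the whiskers reach `summary.minimum` and `summary.maximum` when neither is an outlier.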
16. Boxplots and skewed data
Boxplots for a symmetric and a right-skewed distribution
[Figure: side-by-side boxplots of years until death (vertical axis, 0 to 15) for Disease X and Multiple Myeloma.]
Boxplots show symmetry or skew.
17. Suspected Outliers: 1.5 IQR Rule
Outliers are troublesome data points, and it is important to be
able to identify them.
The interquartile range IQR is the distance between the first and
third quartiles,
IQR = Q3 − Q1
IQR is used as part of a rule of thumb for identifying outliers.
The 1.5 IQR Rule for Outliers
Call an observation an outlier if it falls more than 1.5 IQR above
the third quartile or below the first quartile.
Suspected low outlier: any value < Q1 – 1.5 IQR
Suspected high outlier: any value > Q3 + 1.5 IQR
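The rule translates directly into code. In this minimal sketch the function name `suspected_outliers` is an illustrative assumption, and the quartiles are passed in rather than computed; the call uses the exam scores with Q1 = 75 and Q3 = 92 from the earlier example.

```python
def suspected_outliers(values, q1, q3):
    """Return the values flagged by the 1.5 x IQR rule."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr     # anything below this is a suspected low outlier
    high_fence = q3 + 1.5 * iqr    # anything above this is a suspected high outlier
    return [x for x in values if x < low_fence or x > high_fence]


scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]
flagged = suspected_outliers(scores, q1=75, q3=92)
print(flagged)  # []
```

Here IQR = 17, so the fences are 75 − 25.5 = 49.5 and 92 + 25.5 = 117.5. Even the lowest score, 55, lies inside them, so no score is flagged.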
18. Suspected Outliers: 1.5 IQR Rule (Cont..)
Individual #25 has a value of 7.9 years, which is 3.55 years
above the third quartile. This is more than 1.5 × IQR = 3.225
years. Thus, individual #25 is a suspected outlier.
19. Suspected Outliers: 1.5 IQR Rule (Cont..)
Modified boxplots plot suspected outliers individually.
The 8 largest call lengths are
438, 465, 479, 700, 700, 951, 1148, 2631
They are plotted as individual points, though 2 of them are
identical and so do not appear separately.
20. Measuring Spread: The Standard Deviation
The most common measure of spread looks at how far each
observation is from the mean. This measure is called the standard
deviation.
The standard deviation s measures the average distance of the
observations from their mean.
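It is calculated by
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
The quantity under the square root, the average squared distance (with divisor n − 1), is called the variance, $s^2$.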
21. Calculating The Standard Deviation
1. Calculate the mean.
2. Calculate each deviation: deviation = observation - mean.
3. Square each deviation.
4. Calculate the sum of the squared deviations.
5. Divide by the degrees of freedom, df = n - 1; the result is the variance.
6. Take the square root of the variance; this is the standard deviation.
xi     xi - mean       (xi - mean)²
1      1 - 5 = -4      (-4)² = 16
3      3 - 5 = -2      (-2)² = 4
4      4 - 5 = -1      (-1)² = 1
4      4 - 5 = -1      (-1)² = 1
4      4 - 5 = -1      (-1)² = 1
5      5 - 5 = 0       (0)² = 0
7      7 - 5 = 2       (2)² = 4
8      8 - 5 = 3       (3)² = 9
9      9 - 5 = 4       (4)² = 16
Mean = 5   Sum = 0     Sum = 52
Variance = 52/(9 - 1) = 6.5
Standard deviation = √6.5 ≈ 2.55
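The six steps can be traced in Python using the nine observations from the table. This is a sketch with illustrative variable names; the final line cross-checks the hand computation against `statistics.stdev` from the standard library, which also divides by n − 1.

```python
import math
import statistics

data = [1, 3, 4, 4, 4, 5, 7, 8, 9]

n = len(data)
mean = sum(data) / n                        # step 1: 45 / 9 = 5.0
deviations = [x - mean for x in data]       # step 2: these always sum to 0
squared = [d ** 2 for d in deviations]      # step 3
total = sum(squared)                        # step 4: 52.0
variance = total / (n - 1)                  # step 5: 52 / 8 = 6.5
std_dev = math.sqrt(variance)               # step 6

print(variance, round(std_dev, 2))          # 6.5 2.55
print(math.isclose(std_dev, statistics.stdev(data)))  # True
```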
22. Properties of The Standard Deviation
s measures spread about the mean and should be used only
when the mean is the measure of center.
s = 0 only when all observations have the same value and there
is no spread. Otherwise, s > 0.
s is not resistant to outliers.
s has the same units of measurement as the original
observations.
23. Choosing Measures of Center and Spread
We now have a choice between two descriptions of center and spread:
Mean and Standard Deviation
Median and Interquartile Range
The median and IQR are usually better than the mean and
standard deviation for describing a skewed distribution or a
distribution with outliers.
Use mean and standard deviation only for reasonably symmetric
distributions that don’t have outliers.
NOTE: Numerical summaries do not fully describe the shape of a
distribution. ALWAYS PLOT YOUR DATA FIRST!
24. Changing the Unit of Measurement
Variables can be recorded in different units of measurement.
Most often, one measurement unit is a linear transformation of
another measurement unit: xnew = a + bx.
Example 1: If a distance x is measured in kilometers, the same distance
in miles is xnew = 0.62 x
This transformation changes the units without changing the origin
—a distance of 0 kilometers is the same as a distance of 0 miles.
Example 2: Temperatures can be expressed in degrees Fahrenheit or
degrees Celsius. If x is a temperature in degrees Celsius, the same
temperature in degrees Fahrenheit is xnew = 32 + (9/5) x
This transformation changes both the unit size and the origin of
the measurements: the origin of the Celsius scale (0 °C, the
temperature at which water freezes) is 32° on the Fahrenheit scale.
25. Changing the Unit of Measurement (Cont…)
Linear transformations do not change the basic shape of a
distribution (skew, symmetry).
But they do change the measures of center and spread:
Multiplying each observation by a positive number b multiplies
both measures of center (mean, median) and spread (IQR, s) by b.
Adding the same number a (positive or negative) to each
observation adds a to measures of center and to quartiles but it
does not change measures of spread (IQR, s).
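Both rules can be checked numerically. The sketch below uses hypothetical Celsius readings invented for illustration and the Celsius-to-Fahrenheit conversion (a = 32, b = 9/5). The `iqr` helper relies on `statistics.quantiles`, whose default quartile convention differs from the median-of-halves rule, but the scaling behaviour is the same under either convention.

```python
import math
import statistics


def iqr(values):
    """Interquartile range using the statistics module's default quartile method."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


a, b = 32, 9 / 5
celsius = [10, 15, 20, 25, 40]                 # hypothetical readings
fahrenheit = [a + b * x for x in celsius]      # xnew = a + b x
shifted = [a + x for x in celsius]             # adding a only, no rescaling

# Center follows the full transformation a + b x.
print(math.isclose(statistics.mean(fahrenheit), a + b * statistics.mean(celsius)))
print(math.isclose(statistics.median(fahrenheit), a + b * statistics.median(celsius)))

# Spread is multiplied by b and ignores a.
print(math.isclose(statistics.stdev(fahrenheit), b * statistics.stdev(celsius)))
print(math.isclose(iqr(fahrenheit), b * iqr(celsius)))
print(math.isclose(statistics.stdev(shifted), statistics.stdev(celsius)))
print(math.isclose(iqr(shifted), iqr(celsius)))
```

Every comparison should print `True`: the mean and median move with both a and b, while s and the IQR respond only to the multiplier b.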