2. Chapter 3:
Describing, Exploring, and Comparing Data
3.1 Measures of Center
3.2 Measures of Variation
3.3 Measures of Relative Standing and Boxplots
Objectives:
1. Summarize data, using measures of central tendency, such as the mean, median, mode,
and midrange.
2. Describe data, using measures of variation, such as the range, variance, and standard
deviation.
3. Identify the position of a data value in a data set, using various measures of position,
such as percentiles, deciles, and quartiles.
4. Use the techniques of exploratory data analysis, including boxplots and five-number
summaries, to discover various aspects of data.
3. Recall: 3.1 Measures of Center
Measure of Center (Central Tendency)
A measure of center is a value at the center or
middle of a data set.
1. Mean: sample $\bar{x} = \frac{\sum x}{n}$, population $\mu = \frac{\sum x}{N}$, grouped data $\bar{x} = \frac{\sum (f \cdot x_m)}{n}$
2. Median: The middle value of ranked data
3. Mode: The value(s) that occur(s) with the
greatest frequency.
4. Midrange: $Mr = \frac{Min + Max}{2}$
5. Weighted Mean: $\bar{x} = \frac{\sum (w \cdot x)}{\sum w}$
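The five measures of center listed above can be sketched in a few lines of Python. This is only an illustration: the helper names (`modes`, `midrange`, `weighted_mean`) and the weighted-mean inputs are made up for the example, and the ten scores are the test scores used in a later example of this chapter.

```python
from collections import Counter
from statistics import mean, median


def modes(values):
    """Return every value that occurs with the greatest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def midrange(values):
    """Midrange = (Min + Max) / 2."""
    return (min(values) + max(values)) / 2


def weighted_mean(values, weights):
    """Weighted mean = sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


scores = [2, 3, 5, 6, 8, 10, 12, 15, 18, 20]

print("mean    :", mean(scores))           # sum of the values divided by n
print("median  :", median(scores))         # middle of the ranked data
print("mode(s) :", modes(scores + [5]))    # 5 now appears twice
print("midrange:", midrange(scores))
# Hypothetical grades 90, 80, 70 with weights 3, 2, 1.
print("weighted:", weighted_mean([90, 80, 70], [3, 2, 1]))
```

Returning a list from `modes` keeps the "value(s)" wording of the definition: a data set can have more than one mode.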
4. Key Concept: Variation is the single most important topic in statistics.
This section presents three important measures of variation: range, standard
deviation, and variance.
Recall: 3.2 Measures of Variation
1. Range = Max − Min
2. Variance
3. Standard Deviation
4. Coefficient of Variation: $CVAR = \frac{s}{\bar{X}} \cdot 100\%$. Use CVAR to compare variability when the units are different.
5. Chebyshev’s Theorem: $1 - 1/k^2$
6. Empirical Rule (Normal)
7. Range Rule of Thumb for Understanding Standard Deviation: $s \approx \frac{Range}{4}$ and $\mu \pm 2\sigma$
5. Recall: 3.2 Measures of Variation
The variance is the average of the squares of the distance
each value is from the mean.
The standard deviation is the square root of the variance.
The standard deviation is a measure of how spread out
your data are and how much data values deviate away from the
mean.
Notation
s = sample standard deviation
σ = population standard deviation
Usage & properties:
1. To determine the spread of the data.
2. To determine the consistency of a variable.
3. To determine the number of data values that fall within a specified
interval in a distribution (Chebyshev’s Theorem: at least $1 - 1/k^2$ of the
values lie within k standard deviations of the mean, for k > 1).
4. Used in inferential statistics.
5. The value of the standard deviation s is never negative. It is zero
only when all of the data values are exactly the same.
6. Larger values of s indicate greater amounts of variation.
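As a quick illustration of the Chebyshev bound mentioned in item 3, the sketch below evaluates 1 − 1/k² for a few values of k. The function name `chebyshev_min_fraction` is made up for this example.

```python
def chebyshev_min_fraction(k):
    """Minimum fraction of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem is only informative for k > 1")
    return 1 - 1 / k**2


for k in (2, 3):
    pct = 100 * chebyshev_min_fraction(k)
    print(f"k = {k}: at least {pct:.1f}% of the values")
```

For k = 2 the bound is 75%, and for k = 3 it is about 88.9%. The bound holds for any distribution, which is why it is weaker than the Empirical Rule for normal data.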
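Formulas:
Range: $R = Max - Min$
Range Rule of Thumb: $s \approx \frac{R}{4}$ and $\bar{x} \pm 2s$
Coefficient of Variation: $CV = \frac{s}{\bar{x}} (100\%)$
Sample Mean: $\bar{x} = \frac{\sum x}{n}$
Population Mean: $\mu = \frac{\sum x}{N}$
Population Variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
Population Standard Deviation: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
Sample Variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{n \sum x^2 - (\sum x)^2}{n(n - 1)}$
Sample Standard Deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{n \sum x^2 - (\sum x)^2}{n(n - 1)}}$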
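The definitional and shortcut forms of the sample variance can be checked against each other numerically. The sketch below is only an illustration: it reuses the ten test scores from a later example in this chapter and compares hand-written formulas with Python's standard `statistics` module.

```python
import math
from statistics import pvariance, stdev, variance

data = [2, 3, 5, 6, 8, 10, 12, 15, 18, 20]
n = len(data)
xbar = sum(data) / n

# Sample variance from the definition: sum of squared deviations over n - 1.
s2_definition = sum((x - xbar) ** 2 for x in data) / (n - 1)

# Sample variance from the shortcut formula.
s2_shortcut = (n * sum(x * x for x in data) - sum(data) ** 2) / (n * (n - 1))

# Population variance divides by N instead of n - 1.
sigma2 = sum((x - xbar) ** 2 for x in data) / n

s = math.sqrt(s2_definition)
cv = s / xbar * 100                           # coefficient of variation, in %
range_rule_s = (max(data) - min(data)) / 4    # rough estimate of s

print("s^2 (definition):", s2_definition)
print("s^2 (shortcut)  :", s2_shortcut)
print("sigma^2         :", sigma2)
print("s               :", s)
print("CV (%)          :", cv)
print("range rule s    :", range_rule_s)
```

The two sample-variance values agree up to floating-point rounding, and both match `statistics.variance(data)`. The range rule gives only a rough estimate, so it should not be expected to match s closely.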
TI Calculator: How to enter data:
1. Stat
2. Edit
3. Highlight & Clear
4. Type in your data in L1, ...
TI Calculator: Mean, SD, 5-number summary
1. Stat
2. Calc
3. Select 1 for 1-variable statistics
4. Type: L1 (2nd, then 1)
5. Scroll down for the 5-number summary
6. Key Concept: This section introduces measures of relative standing, which are
numbers showing the location of data values relative to the other values within
the same data set.
3.3 Measures of Relative Standing and Boxplots
Measures of Relative Standing (Position)
z-score: $z = \frac{x - \bar{x}}{s}$
Percentile: Percentile of $X = \frac{\text{number of values below } X}{\text{total number of values}} \cdot 100\%$, and locator $L = \frac{k}{100} \cdot n$
Quartile: Quartiles separate the data set into 4 equal groups. Q1 = P25, Q2 = MD, Q3 = P75, and $IQR = Q_3 - Q_1$
Outlier: A value $< Q_1 - 1.5 \, IQR$ or a value $> Q_3 + 1.5 \, IQR$
Boxplot: 5-number summary
7. z Scores
A z score (or standard score or standardized value) is the number of standard
deviations that a given value x is above or below the mean. The z score is calculated by
using one of the following:
A z-score or standard score for a value is obtained by subtracting the mean from
the value and dividing the result by the standard deviation.
Important Properties of z Scores
1. A z score is the number of standard deviations that a given value x is above or
below the mean.
2. z scores are expressed as numbers with no units of measurement.
3. A data value is significantly low if its z score is less than or equal to −2 or the
value is significantly high if its z score is greater than or equal to +2.
4. If an individual data value is less than the mean, its corresponding z score is a
negative number.
Sample: $z = \frac{x - \bar{x}}{s}$
Population: $z = \frac{x - \mu}{\sigma}$
Significant values: z scores ≤ −2.00 or z scores ≥ 2.00
8. Example 1
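Which of the following two data values is more extreme relative to the data
set from which it came?
The 4000 g weight of a baby (n = 400, $\bar{x}$ = 3152.0 g, s = 693.4 g)
The 99°F temperature of an adult (n = 106, $\bar{x}$ = 98.20°F, s = 0.62°F)
Weight: $z = \frac{x - \bar{x}}{s} = \frac{4000 - 3152}{693.4} = 1.22$
Temperature: $z = \frac{x - \bar{x}}{s} = \frac{99 - 98.20}{0.62} = 1.29$
Since 1.29 > 1.22, the temperature of 99°F is slightly more extreme than the 4000 g weight.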
9. Example 2
A student scored 65 on a calculus test that had a mean of 50 and a standard
deviation of 10; she scored 30 on a history test with a mean of 25 and a
standard deviation of 5. Compare her relative positions on the two tests.
Calculus: $z = \frac{x - \bar{x}}{s} = \frac{65 - 50}{10} = 1.5$
History: $z = \frac{x - \bar{x}}{s} = \frac{30 - 25}{5} = 1.0$
She has a higher relative position in the Calculus class.
10. Example 3
Platelets clump together and form clots to stop the bleeding during injury.
The lowest platelet count in a dataset is 75 (platelet counts are measured in
1000 cells/𝜇𝐿). Is this significantly low? ($\bar{x}$ = 239.4, s = 64.2)
$z = \frac{x - \bar{x}}{s} = \frac{75 - 239.4}{64.2} = -2.56$
Interpretation: 𝑧 = −2.56 < −2 → the platelet count of 75 is significantly low.
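The z-score calculations in Examples 1 to 3 can be reproduced with a short Python sketch. The function names `z_score` and `is_significant` are made up for this illustration; the cutoff of 2 follows the "significantly low / significantly high" property stated earlier.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above or below the mean."""
    return (x - mean) / sd


def is_significant(z, cutoff=2.0):
    """True when z <= -cutoff or z >= +cutoff."""
    return abs(z) >= cutoff


# Example 1: baby weight vs. adult temperature
z_weight = z_score(4000, 3152.0, 693.4)
z_temp = z_score(99, 98.20, 0.62)

# Example 2: calculus vs. history
z_calculus = z_score(65, 50, 10)
z_history = z_score(30, 25, 5)

# Example 3: lowest platelet count
z_platelet = z_score(75, 239.4, 64.2)

print(round(z_weight, 2), round(z_temp, 2))
print(z_calculus, z_history)
print(round(z_platelet, 2), is_significant(z_platelet))
```

Because z scores have no units, values measured in grams, degrees, and test points can be compared directly.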
11. Percentiles are measures of location, denoted P1, P2, . . . , P99, which divide a
set of data into 100 groups with about 1% of the values in each group.
(Percentiles separate the data set into 100 equal groups.)
A percentile rank for a datum represents the percentage of data values
below the datum (round the result to the nearest whole number):
Percentile of $X = \frac{\text{number of values below } X}{\text{total number of values}} \cdot 100\%$
Some texts: Percentile of $X = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \cdot 100\%$
Notation:
n = total number of values in the data set
k = percentile being used (Example: For the 25th percentile, k = 25.)
L = locator that gives the position of a value (Example: For the 12th
value in the sorted list, L = 12.)
Pk = kth percentile (Example: P25 is the 25th percentile.)
Locator: $L = \frac{k}{100} \cdot n$
12. Example 4
Fifty (50) cell phone data speeds listed below are arranged in increasing order.
Find the percentile for the data speed of 11.8 Mbps.
Formulas: Percentile of $X = \frac{\text{number of values below } X}{\text{total number of values}} \cdot 100\%$, and $L = \frac{k}{100} \cdot n$
0.8 1.4 1.8 1.9 3.2 3.6 4.5 4.5 4.6 6.2
6.5 7.7 7.9 9.9 10.2 10.3 10.9 11.1 11.1 11.6
11.8 12.0 13.1 13.5 13.7 14.1 14.2 14.7 15.0 15.1
15.5 15.8 16.0 17.5 18.2 20.2 21.1 21.5 22.2 22.4
23.1 24.5 25.7 28.5 34.6 38.5 43.0 55.6 71.3 77.8
Percentile of 11.8 $= \frac{\text{number of values below } 11.8}{\text{total number of values}} \cdot 100\% = \frac{20}{50} \cdot 100\% = 40\%$
Interpretation:
A data speed of 11.8 Mbps is in the 40th percentile and separates the lowest
40% of values from the highest 60% of values. We have P40 = 11.8 Mbps.
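The percentile of a data value in Example 4 can be checked with a small Python sketch. The function name `percentile_of_value` is made up for this illustration, and it uses the first formula (without the 0.5 adjustment that some texts add).

```python
speeds = [
    0.8, 1.4, 1.8, 1.9, 3.2, 3.6, 4.5, 4.5, 4.6, 6.2,
    6.5, 7.7, 7.9, 9.9, 10.2, 10.3, 10.9, 11.1, 11.1, 11.6,
    11.8, 12.0, 13.1, 13.5, 13.7, 14.1, 14.2, 14.7, 15.0, 15.1,
    15.5, 15.8, 16.0, 17.5, 18.2, 20.2, 21.1, 21.5, 22.2, 22.4,
    23.1, 24.5, 25.7, 28.5, 34.6, 38.5, 43.0, 55.6, 71.3, 77.8,
]


def percentile_of_value(data, x):
    """Percentage of values strictly below x, rounded to a whole number."""
    below = sum(1 for v in data if v < x)
    return round(100 * below / len(data))


print(percentile_of_value(speeds, 11.8))  # 20 of the 50 values are below 11.8
```

The data do not need to be sorted for this calculation, because only the count of values below x matters.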
13. Example 5
For the 50 cell phone data speeds listed, find the
20th percentile, denoted by P20.
0.8 1.4 1.8 1.9 3.2 3.6 4.5 4.5 4.6 6.2
6.5 7.7 7.9 9.9 10.2 10.3 10.9 11.1 11.1 11.6
11.8 12.0 13.1 13.5 13.7 14.1 14.2 14.7 15.0 15.1
15.5 15.8 16.0 17.5 18.2 20.2 21.1 21.5 22.2 22.4
23.1 24.5 25.7 28.5 34.6 38.5 43.0 55.6 71.3 77.8
Converting a Percentile to a Data Value
Locator: $L = \frac{k}{100} \cdot n$
Solution: k = 20, n = 50
$L = \frac{20}{100} \cdot 50 = 10$
Whole number: The value of the 20th percentile is midway
between the Lth (10th) value and the (L + 1)st (11th) value.
The 20th percentile is P20 = (6.2 + 6.5) / 2 = 6.35 Mbps.
14. Example 6
For these cell phone data speeds listed, find the 87th
percentile, denoted by P87.
0.8 1.4 1.8 1.9 3.2 3.6 4.5 4.5 4.6 6.2
6.5 7.7 7.9 9.9 10.2 10.3 10.9 11.1 11.1 11.6
11.8 12.0 13.1 13.5 13.7 14.1 14.2 14.7 15.0 15.1
15.5 15.8 16.0 17.5 18.2 20.2 21.1 21.5 22.2 22.4
23.1 24.5 25.7 28.5 34.6 38.5 43.0 55.6 71.3 77.8
Converting a Percentile to a Data Value
Locator: $L = \frac{k}{100} \cdot n$
Solution: k = 87, n = 50
$L = \frac{87}{100} \cdot 50 = 43.5$
Not a whole number: always round up.
The value of the 87th percentile is the 44th value.
P87 = 28.5 Mbps
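The locator rule used in Examples 5 and 6 (average two neighbors when L is a whole number, otherwise round up) can be sketched as follows. The function name `percentile_value` is made up for this illustration, and it assumes k is a whole number from 1 to 99. Other textbooks and software use different conventions and may return slightly different values.

```python
speeds = [
    0.8, 1.4, 1.8, 1.9, 3.2, 3.6, 4.5, 4.5, 4.6, 6.2,
    6.5, 7.7, 7.9, 9.9, 10.2, 10.3, 10.9, 11.1, 11.1, 11.6,
    11.8, 12.0, 13.1, 13.5, 13.7, 14.1, 14.2, 14.7, 15.0, 15.1,
    15.5, 15.8, 16.0, 17.5, 18.2, 20.2, 21.1, 21.5, 22.2, 22.4,
    23.1, 24.5, 25.7, 28.5, 34.6, 38.5, 43.0, 55.6, 71.3, 77.8,
]


def percentile_value(data, k):
    """Value of the kth percentile using the locator L = (k / 100) * n."""
    if not 1 <= k <= 99:
        raise ValueError("k must be a whole number from 1 to 99")
    values = sorted(data)
    # Integer arithmetic: whole = floor(L), remainder == 0 means L is whole.
    whole, remainder = divmod(k * len(values), 100)
    if remainder == 0:
        # L is a whole number: average the Lth and (L + 1)st values.
        return (values[whole - 1] + values[whole]) / 2
    # L is not a whole number: round up and take that value.
    return values[whole]


print(round(percentile_value(speeds, 20), 2))  # Example 5
print(percentile_value(speeds, 87))            # Example 6
```

Using `divmod` on integers avoids testing a floating-point locator for "wholeness", which can misfire because of rounding.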
15. Quartiles
Quartiles are measures of location, denoted Q1, Q2, and Q3, which divide a set of data into four groups with
about 25% of the values in each group. (Quartiles separate the data set into 4 equal groups. Q1=P25, Q2=MD,
Q3=P75 )
Q1 (First quartile): Same as P25. It separates the bottom 25% of the sorted values from the top 75%.
Q2 (Second quartile): Same as P50 and same as the median. It separates the bottom 50% of the sorted values from the top
50%.
Q3 (Third quartile): Same as P75. It separates the bottom 75% of the sorted values from the top 25%.
Caution: Just as there is not universal agreement on a procedure for finding percentiles, there is not universal
agreement on a single procedure for calculating quartiles, and different technologies often yield different results.
Deciles separate the data set into 10 equal groups. D1=P10, D4=P40
The Interquartile Range, IQR = Q3 – Q1.
Step 1 Arrange the data in order from lowest to highest.
Step 2 Find the median of the data values. This is the value for Q2.
Step 3 Find the median of the data values that fall below Q2. This is the value for Q1.
Step 4 Find the median of the data values that fall above Q2. This is the value for Q3.
Semi-interquartile range = $(Q_3 - Q_1)/2$
Midquartile = $(Q_3 + Q_1)/2$
10-90 percentile range = $P_{90} - P_{10}$
16. Example 7
Given the fifty (50) data speeds listed, find the
5-number summary.
0.8 1.4 1.8 1.9 3.2 3.6 4.5 4.5 4.6 6.2
6.5 7.7 7.9 9.9 10.2 10.3 10.9 11.1 11.1 11.6
11.8 12.0 13.1 13.5 13.7 14.1 14.2 14.7 15.0 15.1
15.5 15.8 16.0 17.5 18.2 20.2 21.1 21.5 22.2 22.4
23.1 24.5 25.7 28.5 34.6 38.5 43.0 55.6 71.3 77.8
The 5-number summary consists of:
1. Minimum
2. First quartile, Q1
3. Second quartile, Q2 (same as the median)
4. Third quartile, Q3
5. Maximum
Solution:
Min = 0.8 Mbps and Max = 77.8 Mbps
The median is MD = Q2 = (13.7 + 14.1) / 2 = 13.9 Mbps.
Q1 = 7.9 Mbps (median of the bottom half)
Q3 = 21.5 Mbps (median of the top half)
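The 5-number summary in Example 7 can be reproduced with the median-of-halves procedure. The sketch below is only one convention: the function name `five_number_summary` is made up, and for an odd number of values it assumes the median itself is left out of both halves. Calculators and software may use other quartile rules and give slightly different results.

```python
from statistics import median

speeds = [
    0.8, 1.4, 1.8, 1.9, 3.2, 3.6, 4.5, 4.5, 4.6, 6.2,
    6.5, 7.7, 7.9, 9.9, 10.2, 10.3, 10.9, 11.1, 11.1, 11.6,
    11.8, 12.0, 13.1, 13.5, 13.7, 14.1, 14.2, 14.7, 15.0, 15.1,
    15.5, 15.8, 16.0, 17.5, 18.2, 20.2, 21.1, 21.5, 22.2, 22.4,
    23.1, 24.5, 25.7, 28.5, 34.6, 38.5, 43.0, 55.6, 71.3, 77.8,
]


def five_number_summary(data):
    """Return (Min, Q1, Q2, Q3, Max) using the median of each half."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]                       # values below Q2
    # Assumption: with an odd n, skip the middle value itself.
    upper = values[half:] if n % 2 == 0 else values[half + 1:]
    return values[0], median(lower), median(values), median(upper), values[-1]


minimum, q1, q2, q3, maximum = five_number_summary(speeds)
print(minimum, q1, round(q2, 1), q3, maximum)
print("IQR =", round(q3 - q1, 1))
```

With 50 values, each half contains 25 values, so Q1 and Q3 are the 13th values of their halves.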
17. Example 8
Given the fifty (50) data speeds listed, construct a
boxplot. Identify Outliers if any.
0.8 1.4 1.8 1.9 3.2 3.6 4.5 4.5 4.6 6.2
6.5 7.7 7.9 9.9 10.2 10.3 10.9 11.1 11.1 11.6
11.8 12.0 13.1 13.5 13.7 14.1 14.2 14.7 15.0 15.1
15.5 15.8 16.0 17.5 18.2 20.2 21.1 21.5 22.2 22.4
23.1 24.5 25.7 28.5 34.6 38.5 43.0 55.6 71.3 77.8
Boxplots: A boxplot (or box-and-whisker diagram) is a graph that consists of a
line extending from the min to the max value, and a box with lines drawn at the
first quartile Q1, the median, and the third quartile Q3.
Constructing a Boxplot
1. Find the 5-number summary (Min, Q1, Q2, Q3, Max).
2. Construct a line segment extending from the minimum to the maximum data value.
3. Construct a box (rectangle) extending from Q1 to Q3, and draw a line in the box at the value of Q2 (median).
Solution: Min = 0.8, Q1 = 7.9, Q2 = 13.9, Q3 = 21.5, Max = 77.8
IQR = Q3 − Q1 = 21.5 − 7.9 = 13.6, Q1 − 1.5 IQR = −12.5 and Q3 + 1.5 IQR = 41.9
Any value smaller than −12.5 or larger than 41.9 is considered an outlier.
Outliers: 43.0, 55.6, 71.3, and 77.8
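The outlier check in Example 8 can be verified with a short Python sketch. The function name `outlier_fences` is made up for this illustration, and the quartiles Q1 = 7.9 and Q3 = 21.5 are taken from the 5-number summary rather than recomputed.

```python
speeds = [
    0.8, 1.4, 1.8, 1.9, 3.2, 3.6, 4.5, 4.5, 4.6, 6.2,
    6.5, 7.7, 7.9, 9.9, 10.2, 10.3, 10.9, 11.1, 11.1, 11.6,
    11.8, 12.0, 13.1, 13.5, 13.7, 14.1, 14.2, 14.7, 15.0, 15.1,
    15.5, 15.8, 16.0, 17.5, 18.2, 20.2, 21.1, 21.5, 22.2, 22.4,
    23.1, 24.5, 25.7, 28.5, 34.6, 38.5, 43.0, 55.6, 71.3, 77.8,
]


def outlier_fences(q1, q3):
    """Return (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


low_fence, high_fence = outlier_fences(7.9, 21.5)
outliers = [x for x in speeds if x < low_fence or x > high_fence]

print(round(low_fence, 1), round(high_fence, 1))
print(outliers)
```

No data speed can fall below the negative lower fence, so all four outliers in this data set lie on the high side.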
18. Skewness
A boxplot can often be used to identify skewness. A distribution of data is skewed if it is not
symmetric and extends more to one side than to the other.
Modified Boxplots
A modified boxplot is a regular boxplot constructed with these modifications:
1. A special symbol (such as an asterisk or point) is used to identify outliers as defined above,
and
2. the solid horizontal line extends only as far as the minimum data value that is not an outlier
and the maximum data value that is not an outlier.
Identifying Outliers for Modified Boxplots (an outlier is a value that lies very far away from the
vast majority of the other values in a data set)
1. Find the quartiles Q1, Q2, and Q3.
2. Find the interquartile range (IQR), where IQR = Q3 − Q1.
3. Evaluate 1.5 × IQR.
4. In a modified boxplot, a data value is an outlier if it is above Q3 by an amount greater than 1.5 ×
IQR or below Q1 by an amount greater than 1.5 × IQR.
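The steps above also determine where the line of a modified boxplot ends. The sketch below is an illustration only: the function name `modified_boxplot_parts` is made up, and the quartiles Q1 = 7.9 and Q3 = 21.5 for the 50 cell phone data speeds are passed in rather than recomputed.

```python
speeds = [
    0.8, 1.4, 1.8, 1.9, 3.2, 3.6, 4.5, 4.5, 4.6, 6.2,
    6.5, 7.7, 7.9, 9.9, 10.2, 10.3, 10.9, 11.1, 11.1, 11.6,
    11.8, 12.0, 13.1, 13.5, 13.7, 14.1, 14.2, 14.7, 15.0, 15.1,
    15.5, 15.8, 16.0, 17.5, 18.2, 20.2, 21.1, 21.5, 22.2, 22.4,
    23.1, 24.5, 25.7, 28.5, 34.6, 38.5, 43.0, 55.6, 71.3, 77.8,
]


def modified_boxplot_parts(data, q1, q3):
    """Return (low line end, high line end, outliers) for a modified boxplot."""
    step = 1.5 * (q3 - q1)                      # 1.5 x IQR
    outliers = [x for x in data if x > q3 + step or x < q1 - step]
    inside = [x for x in data if q1 - step <= x <= q3 + step]
    # The line extends only to the extreme values that are not outliers.
    return min(inside), max(inside), outliers


low_end, high_end, outliers = modified_boxplot_parts(speeds, 7.9, 21.5)
print("line runs from", low_end, "to", high_end)
print("plotted as separate points:", outliers)
```

For these data the line stops at 38.5 Mbps, and the four larger speeds would be drawn with the special outlier symbol.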
19. Example 9
Given the fifty (50) data speeds listed, find the
40th percentile, denoted by P40.
0.8 1.4 1.8 1.9 3.2 3.6 4.5 4.5 4.6 6.2
6.5 7.7 7.9 9.9 10.2 10.3 10.9 11.1 11.1 11.6
11.8 12.0 13.1 13.5 13.7 14.1 14.2 14.7 15.0 15.1
15.5 15.8 16.0 17.5 18.2 20.2 21.1 21.5 22.2 22.4
23.1 24.5 25.7 28.5 34.6 38.5 43.0 55.6 71.3 77.8
Converting a Percentile to a Data Value (No need)
Locator: $L = \frac{k}{100} \cdot n$
Solution: k = 40, n = 50
$L = \frac{40}{100} \cdot 50 = 20$
Whole number: The value of the 40th percentile is midway
between the Lth (20th) value and the 21st value.
The 40th percentile is P40 = (11.6 + 11.8) / 2 = 11.7 Mbps.
20. Example 10
A teacher gives a 20-point test to 10 students. Find the percentile rank of a score of
12.
18, 15, 12, 6, 8, 2, 3, 5, 20, 10
Sort in ascending order.
2, 3, 5, 6, 8, 10, 12, 15, 18, 20
There are 6 values below 12.
Formula (used by some texts): Percentile of $X = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \cdot 100\%$
Percentile of 12 $= \frac{6 + 0.5}{10} \cdot 100\% = 65\%$
Interpretation:
A student whose score was 12 did better than 65% of the class.
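The percentile rank in Example 10 uses the variant of the formula that adds 0.5 to the count of values below X. A minimal Python sketch of that variant is shown here; the function name `percentile_rank` is made up for this illustration.

```python
def percentile_rank(data, x):
    """Percentile rank of x: (values below x + 0.5) / total * 100."""
    below = sum(1 for v in data if v < x)
    return 100 * (below + 0.5) / len(data)


test_scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
print(percentile_rank(test_scores, 12))  # 6 values are below 12
```

The 0.5 term counts the value itself as half below and half above, so this variant gives a higher rank than the formula without it (60% for the same score).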
21. Example 11
A teacher gives a 20-point test to 10 students. Find the value corresponding to the
25th percentile.
18, 15, 12, 6, 8, 2, 3, 5, 20, 10
Sort in ascending order.
2, 3, 5, 6, 8, 10, 12, 15, 18, 20
Locator: $L = \frac{k}{100} \cdot n = \frac{25}{100} \cdot 10 = 2.5$, which is not a whole number, so round up to 3.
The 3rd value in the sorted list is 5.
Interpretation:
The value 5 corresponds to the 25th percentile.