2. Chapter 2:
Exploring Data with Tables and Graphs
2.1 Frequency Distributions for Organizing and
Summarizing Data
2.2 Histograms
2.3 Graphs that Enlighten and Graphs that Deceive
2.4 Scatterplots, Correlation, and Regression
Objectives:
1. Organize data using a frequency distribution.
2. Represent data in frequency distributions graphically using histograms, frequency polygons, and ogives.
3. Represent data using bar graphs, Pareto charts, time series graphs, and pie graphs.
4. Draw and interpret a stem and leaf plot.
5. Draw and interpret a scatter plot for a set of paired data.
3. Recall: 1.1 Statistical and Critical Thinking, Key Concept
Population: It consists of all subjects (human or otherwise) that are studied.
Sample: It is a subset of the population.
Census: The collection of data from every member of a population.
Voluntary Response Sample or Self-Selected Sample is one in which the respondents
themselves decide whether to be included.
Statistics: The science of planning studies and experiments, obtaining data, and
organizing, summarizing, presenting, analyzing, and interpreting those data and then
drawing conclusions based on them.
Data: Collections of observations, such as measurements, genders, or survey responses
Statistical Significance: Statistical significance is achieved in a study if the likelihood
of an event occurring by chance is 5% or less.
Practical Significance: It is possible that some treatment or finding is effective, but
common sense might suggest that the treatment or finding does not make enough of a
difference to justify its use or to be practical.
4. Recall: 1.2 Types of Data, Levels of Measurement
Data
  Qualitative: categorical
  Quantitative: numerical, can be ranked
    Discrete: countable (5, 29, 8000, etc.)
    Continuous: can be decimals (2.59, 312.1, etc.)
Parameter: A numerical measurement describing some characteristic of a population.
Statistic: A numerical measurement describing some characteristic of a sample.
Another way of classifying data uses four levels of measurement: nominal, ordinal, interval, and ratio.
Nominal level of measurement is characterized by data that
consist of names, labels, or categories only, and the data
cannot be arranged in some order (such as low to high).
Example: Survey responses of yes, no, and undecided
Ordinal level of measurement involves data that can be
arranged in some order, but differences (obtained by
subtraction) between data values either cannot be
determined or are meaningless.
Example: Course grades A, B, C, D, or F
Interval level of measurement involves data that can be
arranged in order, and the differences between data
values can be found and are meaningful. However, there
is no natural zero starting point at which none of the
quantity is present. A value of zero does not mean the
absence of the quantity. Arithmetic operations such as
addition and subtraction can be performed on values of the
variable.
Example: Years 1000, 2000, 1776, and 1492
Ratio level of measurement involves data that can be arranged in order,
differences can be found and are meaningful, and there is a
natural zero starting point (where zero indicates that none
of the quantity is present). Differences and ratios are both
meaningful. Arithmetic operations such as multiplication
and division can be performed on the values of the
variable.
Example: Class times of 50 minutes and 100 minutes
5. Recall: 1.3 Collecting Sample Data
Statistical methods are driven by the data that we collect. We typically obtain data from two distinct sources: observational studies and
experiments.
Experiment: Apply some treatment and then proceed to observe its effects on the individuals. (The individuals in experiments are called
experimental units, and they are often called subjects when they are people.)
The researcher manipulates the independent (explanatory) variable and tries to determine how the manipulation influences the dependent
(outcome) variable in an experimental study.
A confounding variable influences the dependent variable but cannot be separated from the independent variable.
Observational study: Observing and measuring specific characteristics without attempting to modify the individuals being studied.
In an observational study, the researcher merely observes and tries to draw conclusions based on the observations.
Cross-sectional study: Data are observed, measured, and collected at one point in time, not over a period of time.
Retrospective (or case control) study: Data are collected from a past time period by going back in time (through examination of records,
interviews, and so on).
Prospective (or longitudinal or cohort) study: Data are collected in the future from groups sharing common factors (called cohorts).
Simple random sample: A sample of n subjects selected in such a way that every possible
sample of the same size n has the same chance of being chosen.
A simple random sample is often called a random sample, but strictly
speaking, a random sample has the weaker requirement that all members
of the population have the same chance of being selected.
Some Sampling Techniques
Random – random number generator
Systematic – every kth subject
Stratified – divide population into “layers”
Cluster – use intact groups
Convenience – mall surveys
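The first four techniques listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of subject IDs 1 to 100, the sample sizes, the two strata, and the five clusters are all hypothetical choices, not values from the slides.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

population = list(range(1, 101))  # hypothetical subject IDs 1..100

# Random: every possible sample of size 10 is equally likely.
simple = random.sample(population, 10)

# Systematic: pick a random starting point, then take every k-th subject.
k = 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified: divide the population into layers, then sample from each layer.
strata = [population[:50], population[50:]]  # two hypothetical layers
stratified = [x for layer in strata for x in random.sample(layer, 5)]

# Cluster: divide into intact groups, then keep whole randomly chosen groups.
clusters = [population[i:i + 20] for i in range(0, 100, 20)]
cluster = [x for group in random.sample(clusters, 2) for x in group]

print(simple, systematic, stratified, cluster, sep="\n")
```

Note the difference between the last two: stratified sampling takes a few subjects from every layer, while cluster sampling takes every subject from a few groups.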
6. 2.1 Frequency Distributions for Organizing and Summarizing Data
Data collected in original form is called raw data.
Frequency Distribution (or Frequency Table)
A frequency distribution is the organization of raw data in table form,
using classes and frequencies. It shows how data are partitioned among
several categories (or classes) by listing the categories along with the
number (frequency) of data values in each of them.
Nominal- or ordinal-level data that can be placed in categories is
organized in categorical frequency distributions.
Key Concept: When working with large data sets, a frequency distribution
(or frequency table) is often helpful in organizing and summarizing data. A
frequency distribution helps us to understand the nature of the distribution of
a data set.
7. Categorical Frequency Distribution: Example 1
Construct a frequency distribution for the blood types of 25 people,
as determined by their blood tests:
Raw Data: A,B,B,AB,O O,O,B,AB,B B,B,O,A,O A,O,O,O,AB AB,A,O,B,A
Class   Tally        Frequency   Percent Frequency = f / n
A       IIIII        5           20%
B       IIIII II     7           28%
O       IIIII IIII   9           36%
AB      IIII         4           16%
$n = \sum f = 25$
$\sum rf = 1 = 100\%$
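The tally in Example 1 can be checked with a short Python sketch. It uses the standard library `collections.Counter`; the variable names are illustrative assumptions.

```python
from collections import Counter

# Raw data from Example 1, split into individual blood types.
raw = ("A,B,B,AB,O O,O,B,AB,B B,B,O,A,O "
       "A,O,O,O,AB AB,A,O,B,A").replace(" ", ",").split(",")

counts = Counter(raw)       # frequency f of each class
n = sum(counts.values())    # total number of data values

for blood_type in ["A", "B", "O", "AB"]:
    f = counts[blood_type]
    print(f"{blood_type:<3} f = {f:>2}   f / n = {f / n:.0%}")
```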
8. 2.1 Frequency Distributions for Organizing and Summarizing Data
Definitions:
Lower class limits: The smallest numbers that can belong to each of the different classes.
Upper class limits: The largest numbers that can belong to each of the different classes.
Class boundaries: The numbers used to separate the classes, but without the gaps created by class limits.
Class midpoints: The values in the middle of the classes. Each class midpoint can be found by adding the lower class
limit to the upper class limit and dividing the sum by 2.
Class width: The difference between two consecutive lower class limits in a frequency distribution.
Procedure for Constructing a Frequency Distribution
1. Select the number of classes, usually between 5 and 20.
2. Calculate the class width: $W = \frac{\text{Max} - \text{Min}}{\text{number of classes}}$, and round up accordingly.
3. Choose the value for the first lower class limit by using either the minimum value or a convenient value below
the minimum.
4. Using the first lower class limit and class width, list the other lower class limits.
5. List the lower class limits in a vertical column and then determine and enter the upper class limits.
6. Take each individual data value and put a tally mark in the appropriate class. Add the tally marks to get the
frequency.
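Steps 2 through 5 of the procedure can be sketched as a small Python function. This is a minimal sketch that assumes whole-number data, so each upper class limit is one less than the next lower class limit. The function name `class_limits` and its parameters are hypothetical. The call uses the minimum 100 and maximum 134 of the record high temperatures with 7 classes.

```python
import math

def class_limits(minimum, maximum, num_classes, first_lower=None, width=None):
    """Return a list of (lower limit, upper limit) pairs for whole-number data."""
    if width is None:
        # Step 2: divide the range by the number of classes and round up.
        width = math.ceil((maximum - minimum) / num_classes)
    if first_lower is None:
        # Step 3: default to the minimum as the first lower class limit.
        first_lower = minimum
    # Step 4: list the other lower class limits.
    lowers = [first_lower + i * width for i in range(num_classes)]
    # Step 5: each upper limit sits just below the next lower limit.
    return [(lo, lo + width - 1) for lo in lowers]

limits = class_limits(100, 134, 7)
for lo, hi in limits:
    print(f"{lo} - {hi}")
```

The optional `width` and `first_lower` arguments cover the case where more convenient values are chosen by hand, for example `class_limits(83, 308, 5, first_lower=75, width=50)`.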
9. Example 2: Constructing a (Grouped) Frequency Distribution
Construct a (grouped) frequency distribution for the data, the
record high temperatures for each of the 50 states, using 7 classes.
112 100 127 120 134 118 105 110 109 112 110 118 117 116 118 122
114 114 105 109 107 112 114 115 118 117 118 122 106 110 116 108
110 121 113 120 119 111 104 111 120 113 120 117 105 110 118 112
114 114
Class width: Divide the range by the number of classes, 7.
Range = High – Low = 134 – 100 = 34
$W = \frac{\text{Range}}{\text{number of classes}} = \frac{34}{7} \approx 4.86$
W = 5. Always round up.
Choose a convenient data value (the lowest value or one near it)
for the first lower class limit: 100.
Add the width to find the subsequent lower class limits.
Class Limits   Class Boundaries   Frequency
100 - 104      99.5 - 104.5       2
105 - 109      104.5 - 109.5      8
110 - 114      109.5 - 114.5      18
115 - 119      114.5 - 119.5      13
120 - 124      119.5 - 124.5      7
125 - 129      124.5 - 129.5      1
130 - 134      129.5 - 134.5      1
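The tally for Example 2 can be reproduced with a short Python sketch. It assumes the choices made in the example (first lower class limit 100, class width 5, 7 classes); the variable names are illustrative.

```python
temps = [
    112, 100, 127, 120, 134, 118, 105, 110, 109, 112, 110, 118, 117, 116, 118, 122,
    114, 114, 105, 109, 107, 112, 114, 115, 118, 117, 118, 122, 106, 110, 116, 108,
    110, 121, 113, 120, 119, 111, 104, 111, 120, 113, 120, 117, 105, 110, 118, 112,
    114, 114,
]

first_lower, width, num_classes = 100, 5, 7

# Tally: integer division maps each value to the index of its class.
freq = [0] * num_classes
for t in temps:
    freq[(t - first_lower) // width] += 1

for i, f in enumerate(freq):
    lo = first_lower + i * width
    hi = lo + width - 1
    print(f"{lo} - {hi}   {lo - 0.5} - {hi + 0.5}   {f}")
```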
10. Example 3
Construct a (grouped) frequency distribution for the data: drive-through
service times (seconds) for a fast food restaurant's lunches. Use 5 classes.
107 139 197 209 281 254 163 150 127 308 206 187 169 83 127 133 140
143 130 144 91 113 153 255 252 200 117 167 148 184 123 153 155 154
100 117 101 138 186 196 146 90 144 119 135 151 197 171 190 169
$W = \frac{\text{Range}}{\text{number of classes}} = \frac{308 - 83}{5} = 45$, so W = 50,
rounded up to a more convenient number.
The minimum data value is 83, which is not a very convenient starting
point, so go to a value below 83 and select the more convenient value of
75 as the first lower class limit. The lower class limits are then
75, 125, 175, 225, and 275.
Time (Seconds)   Frequency   Relative Frequency = f / n   Percent Frequency
75-124           11          11 / 50 = 0.22               22%
125-174          24          24 / 50 = 0.48               48%
175-224          10          10 / 50 = 0.20               20%
225-274          3           3 / 50 = 0.06                6%
275-324          2           2 / 50 = 0.04                4%
$n = \sum f = 50$
$\sum rf = 1 = 100\%$
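The frequencies and relative frequencies of Example 3 can be checked with the following Python sketch. It assumes the first lower class limit of 75 and the class width of 50 chosen in the example; the variable names are illustrative.

```python
times = [
    107, 139, 197, 209, 281, 254, 163, 150, 127, 308, 206, 187, 169, 83, 127, 133, 140,
    143, 130, 144, 91, 113, 153, 255, 252, 200, 117, 167, 148, 184, 123, 153, 155, 154,
    100, 117, 101, 138, 186, 196, 146, 90, 144, 119, 135, 151, 197, 171, 190, 169,
]

first_lower, width, num_classes = 75, 50, 5

# Frequency of each class.
freq = [0] * num_classes
for t in times:
    freq[(t - first_lower) // width] += 1

# Relative frequency = f / n.
n = sum(freq)
rel_freq = [f / n for f in freq]

for i, (f, rf) in enumerate(zip(freq, rel_freq)):
    lo = first_lower + i * width
    print(f"{lo}-{lo + width - 1}   f = {f}   rf = {rf:.2f}   {rf:.0%}")
```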
11. Example 4
Find the cumulative frequency distribution for Example 3.
Time (Seconds)   Frequency
75-124           11
125-174          24
175-224          10
225-274          3
275-324          2
There are 3 ways of writing the cumulative frequency distribution table (CFDT):
Time (Seconds)    Cumulative Frequency
Less than 125     11
Less than 175     35
Less than 225     45
Less than 275     48
Less than 325     50
Time (Seconds)    Cumulative Frequency
Less than 124.5   11
Less than 174.5   35
Less than 224.5   45
Less than 274.5   48
Less than 324.5   50
Time (Seconds)    Cumulative Frequency
75-124            11
75-174            35
75-224            45
75-274            48
75-324            50
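Each cumulative frequency is the running total of the class frequencies. A minimal Python sketch using the standard library `itertools.accumulate`, with the frequencies from Example 3 (variable names are illustrative):

```python
from itertools import accumulate

upper_limits = [124, 174, 224, 274, 324]
freq = [11, 24, 10, 3, 2]

# Running total of the frequencies.
cumulative = list(accumulate(freq))

for ul, cf in zip(upper_limits, cumulative):
    # "Less than 125" uses the next lower class limit; "Less than 124.5" uses the boundary.
    print(f"Less than {ul + 1} (or less than {ul + 0.5}): {cf}")
```

The last cumulative frequency always equals n, which gives a quick check on the arithmetic.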
12. Critical Thinking: Using Frequency Distributions to Understand Data
In statistics we are often interested in determining whether the data have a
normal distribution. A frequency distribution of data with a normal distribution
has these characteristics:
1. The frequencies start low, then increase to one or two high frequencies, and then
decrease to a low frequency.
2. The distribution is approximately symmetric. Frequencies preceding the maximum
frequency should be roughly a mirror image of those that follow the maximum
frequency.
Gaps:
1. The presence of gaps can show that the data are from two or more different
populations.
2. However, the converse is not true, because data from different populations
do not necessarily result in gaps.
13. Example 5: Exploring Data: What Does a Gap Tell Us?
The table shown is a frequency distribution of the weights (grams) of randomly
selected pennies.
Weight (grams) of Penny   Frequency
2.40-2.49                 18
2.50-2.59                 19
2.60-2.69                 0
2.70-2.79                 0
2.80-2.89                 0
2.90-2.99                 2
3.00-3.09                 25
3.10-3.19                 8
Examination of the frequencies reveals a
large gap between the lightest pennies and the
heaviest pennies.
This suggests that we have two different
populations.
Facts:
Pennies made before 1983 are 95% copper and 5% zinc.
Pennies made after 1983 are 2.5% copper and 97.5% zinc.
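One way to spot such a gap programmatically is to look for empty classes that lie between non-empty ones. The Python sketch below applies that idea to the penny table; treating an interior class with frequency 0 as a gap is an assumption made for illustration, not a rule stated in the slides.

```python
classes = ["2.40-2.49", "2.50-2.59", "2.60-2.69", "2.70-2.79",
           "2.80-2.89", "2.90-2.99", "3.00-3.09", "3.10-3.19"]
freq = [18, 19, 0, 0, 0, 2, 25, 8]

# Indices of the classes that contain at least one data value.
nonzero = [i for i, f in enumerate(freq) if f > 0]

# A gap is an empty class between the first and last non-empty classes.
gaps = [classes[i] for i in range(nonzero[0], nonzero[-1] + 1) if freq[i] == 0]

print("Empty interior classes:", gaps)
```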
14. Comparisons
Combining two or more relative frequency distributions in one table makes
comparisons of data much easier.
Example 6
The table shows the relative frequency distributions for the drive-through lunch
service times (seconds) for a fast food restaurant and a coffee shop.
Time (seconds)   Fast Food   Coffee Shop
25-74            -           22%
75-124           22%         44%
125-174          48%         28%
175-224          20%         6%
225-274          6%          -
275-324          4%          -
Because of the big differences in their menus, the service times are
expected to be very different.
By comparing the relative frequencies, we see that there are
major differences.
The coffee shop service times appear to be lower than those at
the fast food restaurant.
15. Example 7
Identify the class width, midpoints, boundaries, and the number of subjects.
Class Limits   Midpoint   Class Boundaries   Frequency
100 – 199      149.5      99.5 – 199.5       25
200 – 299      249.5      199.5 – 299.5      92
300 – 399      349.5      299.5 – 399.5      28
400 – 499      449.5      399.5 – 499.5      0
500 – 599      549.5      499.5 – 599.5      2
$\text{Midpoint} = \frac{UL + LL}{2}$
W = UL – LL + 1 = LL of 2nd class – LL of 1st class = 200 – 100 = 100
$n = \sum f = 25 + 92 + 28 + 0 + 2 = 147$
The class width can be calculated by subtracting successive lower class limits (or
boundaries), successive upper class limits (or boundaries), or the upper and
lower class boundaries of a single class.
The class midpoint can be calculated by averaging the upper and lower class
limits (or boundaries).
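The quantities in Example 7 can be computed with a short Python sketch. The class limits below are an assumption: they are inferred from the class boundaries listed above (99.5 – 199.5 and so on) for whole-number data. The frequencies are the terms of the sum for n.

```python
# Class limits inferred from the class boundaries (assumes whole-number data).
limits = [(100, 199), (200, 299), (300, 399), (400, 499), (500, 599)]
freq = [25, 92, 28, 0, 2]

# Class width: difference between two consecutive lower class limits.
width = limits[1][0] - limits[0][0]

# Midpoint: average of the lower and upper class limits.
midpoints = [(lo + hi) / 2 for lo, hi in limits]

# Boundaries: halfway between one upper limit and the next lower limit.
boundaries = [(lo - 0.5, hi + 0.5) for lo, hi in limits]

# Number of subjects: sum of the frequencies.
n = sum(freq)

print(width, midpoints, boundaries, n, sep="\n")
```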
16. Example 8
There are disproportionately more 0's and 5's among the last digits, which suggests that
the weights were reported instead of measured.
Therefore, the results are likely not accurate.
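The last-digit check can be sketched in Python. The weights below are hypothetical values invented purely for illustration, since the data for Example 8 are not reproduced in the slides. With measured whole-number weights, each last digit would be expected about 10% of the time, so 0 and 5 together would account for roughly 20%.

```python
from collections import Counter

# Hypothetical weights, invented for illustration only (not the Example 8 data).
weights = [150, 165, 170, 180, 135, 200, 145, 160, 172, 155, 190, 175]

# Frequency distribution of the last digits.
last_digits = Counter(w % 10 for w in weights)

# Share of values ending in 0 or 5.
share_0_or_5 = (last_digits[0] + last_digits[5]) / len(weights)

for digit in range(10):
    print(digit, last_digits[digit])
print(f"Share ending in 0 or 5: {share_0_or_5:.0%}")
```

A share far above 20%, as in this made-up list, is the pattern that points to reported rather than measured values.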