Descriptive Statistics Overview

1
Descriptive Statistics
2-1 Overview
2-2 Summarizing Data with Frequency Tables
2-3 Pictures of Data
2-4 Measures of Center
2-5 Measures of Variation
2-6 Measures of Position
2-7 Exploratory Data Analysis (EDA)

2
 Descriptive Statistics
summarize or describe the important
characteristics of a known set of
population data
 Inferential Statistics
use sample data to make inferences (or
generalizations) about a population
2 -1 Overview

3
1. Center: A representative or average value that
indicates where the middle of the data set is located
2. Variation: A measure of the amount that the values
vary among themselves
3. Distribution: The nature or shape of the
distribution of data (such as bell-shaped, uniform, or
skewed)
4. Outliers: Sample values that lie very far away from
the vast majority of other sample values
5. Time: Changing characteristics of the data over
time
Important Characteristics of Data

4
 Frequency Table
lists classes (or categories) of values,
along with frequencies (or counts) of the
number of values that fall into each class
2-2 Summarizing Data With
Frequency Tables

5
Qwerty Keyboard Word Ratings
Table 2-1
2 2 5 1 2 6 3 3 4 2
4 0 5 7 7 5 6 6 8 10
7 2 2 10 5 8 2 5 4 2
6 2 6 1 7 2 7 2 3 8
1 5 2 5 2 14 2 2 6 3
1 7

6
Frequency Table of Qwerty Word Ratings
Table 2-3
0 - 2 20
3 - 5 14
6 - 8 15
9 - 11 2
12 - 14 1
Rating Frequency

8
Lower Class Limits
Lower Class
Limits
0 - 2 20
3 - 5 14
6 - 8 15
9 - 11 2
12 - 14 1
Rating Frequency
are the smallest numbers that can actually belong to
different classes

9
Upper Class Limits
Upper Class
Limits
0 - 2 20
3 - 5 14
6 - 8 15
9 - 11 2
12 - 14 1
Rating Frequency
are the largest numbers that can actually belong to
different classes

10
are the numbers used to separate classes, but
without the gaps created by class limits
Class Boundaries

11
number separating classes
Class Boundaries
0 - 2 20
3 - 5 14
6 - 8 15
9 - 11 2
12 - 14 1
Rating Frequency
- 0.5
2.5
5.5
8.5
11.5
14.5

12
Class Boundaries
Class
Boundaries
0 - 2 20
3 - 5 14
6 - 8 15
9 - 11 2
12 - 14 1
Rating Frequency
- 0.5
2.5
5.5
8.5
11.5
14.5
number separating classes

13
midpoints of the classes
Class Midpoints

14
midpoints of the classes
Class Midpoints
Class
Midpoints
0 - 1 2 20
3 - 4 5 14
6 - 7 8 15
9 - 10 11 2
12 - 13 14 1
Rating Frequency

15
is the difference between two consecutive lower
class limits or two consecutive class boundaries
Class Width

16
Class Width
Class Width
3 0 - 2 20
3 3 - 5 14
3 6 - 8 15
3 9 - 11 2
3 12 - 14 1
Rating Frequency
is the difference between two consecutive lower
class limits or two consecutive class boundaries

17
1. Be sure that the classes are mutually exclusive.
2. Include all classes, even if the frequency is zero.
3. Try to use the same width for all classes.
4. Select convenient numbers for class limits.
5. Use between 5 and 20 classes.
6. The sum of the class frequencies must equal the
number of original data values.
Guidelines For Frequency Tables

18
3. Select for the first lower limit either the lowest score or a
convenient value slightly less than the lowest score.
4. Add the class width to the starting point to get the second lower
class limit, add the width to the second lower limit to get the
third, and so on.
5. List the lower class limits in a vertical column and enter the
upper class limits.
6. Represent each score by a tally mark in the appropriate class.
Total tally marks to find the total frequency for each class.
Constructing A Frequency Table
1. Decide on the number of classes .
2. Determine the class width by dividing the range by the number
of classes (range = highest score - lowest score) and round up.
class width ≈ round up of
range
number of classes

20
Relative Frequency Table
relative frequency =
class frequency
sum of all frequencies

21
Relative Frequency Table
0 - 2 20
3 - 5 14
6 - 8 15
9 - 11 2
12 - 14 1
Rating Frequency
0 - 2 38.5%
3 - 5 26.9%
6 - 8 28.8%
9 - 11 3.8%
12 - 14 1.9%
Rating
Relative
Frequency
20/52 = 38.5%
14/52 = 26.9%
etc.
Total frequency = 52

22
Cumulative Frequency Table
Cumulative
Frequencies
0 - 2 20
3 - 5 14
6 - 8 15
9 - 11 2
12 - 14 1
Rating Frequency
Less than 3 20
Less than 6 34
Less than 9 49
Less than 12 51
Less than 15 52
Rating
Cumulative
Frequency

23
Frequency Tables
0 - 2 20
3 - 5 14
6 - 8 15
9 - 11 2
12 - 14 1
Rating Frequency
0 - 2 38.5%
3 - 5 26.9%
6 - 8 28.8%
9 - 11 3.8%
12 - 14 1.9%
Rating
Relative
Frequency
Less than 3 20
Less than 6 34
Less than 9 49
Less than 12 51
Less than 15 52
Rating
Cumulative
Frequency

24
a value at the
center or middle
of a data set
Measures of Center

25
Mean
(Arithmetic Mean)
AVERAGE
the number obtained by adding the
values and dividing the total by the
number of values
Definitions

26
Notation
Σ denotes the addition of a set of values
x is the variable usually used to represent the individual
data values
n represents the number of data values in a sample
N represents the number of data values in a population

27
Notation
is pronounced ‘x-bar’ and denotes the mean of a set
of sample values
x =
n
Σ x
x

28
Notation
µ is pronounced ‘mu’ and denotes the mean of all values
in a population
is pronounced ‘x-bar’ and denotes the mean of a set
of sample values
Calculators can calculate the mean of data
x =
n
Σ x
x
N
µ =
Σ x

29
Definitions
 Median
the middle value when the original
data values are arranged in order of
increasing (or decreasing) magnitude

30
Definitions
 Median
 often denoted by x (pronounced ‘x-tilde’)
~

31
Definitions
 Median
 often denoted by x (pronounced ‘x-tilde’)
 is not affected by an extreme value
~

32
6.72 3.46 3.60 6.44
3.46 3.60 6.44 6.72
no exact middle -- shared by two numbers
3.60 + 6.44
2
(even number of values)
MEDIAN is 5.02

33
6.72 3.46 3.60 6.44 26.70
3.46 3.60 6.44 6.72 26.70
(in order - odd number of values)
exact middle MEDIAN is 6.44
6.72 3.46 3.60 6.44
3.46 3.60 6.44 6.72
no exact middle -- shared by two numbers
3.60 + 6.44
2
(even number of values)
MEDIAN is 5.02

34
Definitions
 Mode
the score that occurs most frequently
Bimodal
Multimodal
No Mode
denoted by M
the only measure of central tendency that can be
used with nominal data

35
a. 5 5 5 3 1 5 1 4 3 5
b. 1 2 2 2 3 4 5 6 6 6 7 9
c. 1 2 3 6 7 8 9 10
Examples
Mode is 5
Bimodal - 2 and 6
No Mode

36
 Midrange
the value midway between the highest and
lowest values in the original data set
Definitions

37
 Midrange
the value midway between the highest and
lowest values in the original data set
Definitions
Midrange =
highest score + lowest score
2

38
 Symmetric
Data is symmetric if the left half of its
histogram is roughly a mirror of its
right half.
 Skewed
Data is skewed if it is not symmetric
and if it extends more to one side than
the other.
Definitions

39
Skewness
Mode = Mean = Median
SYMMETRIC

40
Skewness
SKEWED LEFT
(negatively)
SYMMETRIC
Mean Mode
Median

41
Skewness
SKEWED LEFT
(negatively)
SYMMETRIC
Mean Mode
Median
SKEWED RIGHT
(positively)
MeanMode
Median

42
Waiting Times of Bank Customers
at Different Banks
in minutes
Jefferson Valley Bank
Bank of Providence
6.5
4.2
6.6
5.4
6.7
5.8
6.8
6.2
7.1
6.7
7.3
7.7
7.4
7.7
7.7
8.5
7.7
9.3
7.7
10.0

43
Bank of Providence
6.5
4.2
6.6
5.4
6.7
5.8
6.8
6.2
7.1
6.7
7.3
7.7
7.4
7.7
7.7
8.5
7.7
9.3
7.7
10.0
7.15
7.20
7.7
7.10
Bank of Providence
7.15
7.20
7.7
7.10
Mean
Median
Mode
Midrange
Waiting Times of Bank Customers
at Different Banks
in minutes

46
Measures of Variation
Range
value
highest lowest
value

47
a measure of variation of the scores
about the mean
(average deviation from the mean)
Standard Deviation

48
Sample Standard Deviation
Formula
calculators can compute the
sample standard deviation of data
Σ (x - x)2
n - 1
S =

49
Sample Standard Deviation
Shortcut Formula
n (n - 1)
s =
n (Σx2
) - (Σx)2
sample standard deviation of data

50
Σ x - x
Mean Absolute Deviation
Formula
n

51
Population Standard Deviation
population standard deviation
of data
2
Σ (x - µ)
N
σ =

52
Variance
standard deviation squared

53
Variance
standard deviation squared
s
σ
2
2
} use square key
on calculatorNotation

54
Sample
Variance
Population
Variance
Variance
Σ (x - x )2
n - 1
s
2
=
Σ (x - µ)2
N
σ 2
=

55
Estimation of Standard Deviation
Range Rule of Thumb
x - 2s x x + 2s
Range ≈ 4s
or
(minimum
usual value)
(maximum
usual value)

56
Estimation of Standard Deviation
Range Rule of Thumb
x - 2s x x + 2s
Range ≈ 4s
or
(minimum
usual value)
(maximum
usual value)
Range
4
s ≈ =
highest value - lowest value
4

57
x
The Empirical Rule
(applies to bell-shaped distributions)FIGURE 2-15

58
x - s x x + s
68% within
1 standard deviation
34% 34%
The Empirical Rule
(applies to bell-shaped distributions)FIGURE 2-15

59
x - 2s x - s x x + 2sx + s
68% within
34% 34%
95% within
2 standard deviations
The Empirical Rule
(applies to bell-shaped distributions)
13.5% 13.5%
FIGURE 2-15

60
x - 3s x - 2s x - s x x + 2s x + 3sx + s
68% within
34% 34%
95% within
2 standard deviations
99.7% of data are within 3 standard deviations of the mean
The Empirical Rule
(applies to bell-shaped distributions)
0.1% 0.1%
2.4% 2.4%
13.5% 13.5%
FIGURE 2-15

61
Chebyshev’s Theorem
 applies to distributions of any shape.
 the proportion (or fraction) of any set of data
lying within K standard deviations of the mean
is always at least 1 - 1/K
2
, where K is any
positive number greater than 1.
 at least 3/4 (75%) of all values lie within 2
standard deviations of the mean.
 at least 8/9 (89%) of all values lie within 3
standard deviations of the mean.

62
Measures of Variation Summary
For typical data sets, it is unusual for a
score to differ from the mean by more than
2 or 3 standard deviations.

63
 z Score (or standard score)
the number of standard deviations
that a given value x is above or below
the mean
Measures of Position

64
Sample
z = x - x
s
Population
z = x - µ
σ
z score

65
- 3 - 2 - 1 0 1 2 3
Z
Unusual
Values
Unusual
Values
Ordinary
Values
Interpreting Z Scores
FIGURE 2-16

66
Quartiles, Deciles,
Percentiles

67
Q1, Q2, Q3
divides ranked scores into four equal parts
Quartiles
25% 25% 25% 25%
Q3Q2Q1
(minimum) (maximum)
(median)

68
D1, D2, D3, D4, D5, D6, D7, D8, D9
divides ranked data into ten equal parts
Deciles
10% 10% 10% 10% 10% 10% 10% 10% 10% 10%
D1 D2 D3 D4 D5 D6 D7 D8 D9

70
Quartiles, Deciles, Percentiles
Fractiles
(Quantiles)
partitions data into approximately equal parts

71
Exploratory Data Analysis
the process of using statistical tools
(such as graphs, measures of center,
and measures of variation) to investigate
the data sets in order to understand their
important characteristics

72
Outliers
 a value located very far away from
almost all of the other values
 an extreme value
 can have a dramatic effect on the mean,
standard deviation, and on the scale of
the histogram so that the true nature of
the distribution is totally obscured

73
Boxplots
(Box-and-Whisker Diagram)
Reveals the:
 center of the data
 spread of the data
 distribution of the data
 presence of outliers
Excellent for comparing two
or more data sets

74
Boxplots
5 - number summary
 Minimum
 first quartile Q1
 Median (Q2)
 third quartile Q3
 Maximum

75
Boxplots
Boxplot of Qwerty Word Ratings
2 4 6 8 10 12 14
0
2 4 6
14
0

76
Bell-Shaped Skewed
Boxplots
Uniform

77
Exploring
 Measures of center: mean, median, and mode
 Measures of variation: Standard deviation and
range
 Measures of spread and relative location:
minimum values, maximum value, and quartiles
 Unusual values: outliers
 Distribution: histograms, stem-leaf plots, and
boxplots

Descriptive Statistics Overview

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Descriptive Statistics Overview

Similar to Descriptive Statistics Overview (20)

Recently uploaded

Recently uploaded (20)

Descriptive Statistics Overview

Editor's Notes