2. Objectives
Learn about basic descriptive quantitative analysis
How to perform these tasks in Excel
Starting point for 502B
Excel knowledge and quantitative skills are highly desired by:
Employers
EC stream
3. Introduction
Without data, it is anyone’s opinion
Why use tables, graphs, summary stats?
“At their best, tables, graphs, and statistics are instruments
for reasoning about complex quantitative information.”
Why learn how to design them appropriately?
“At their worst, tables, graphs and summary statistics are
instruments of evil used for deceiving a naive viewer.”
Does your mindset match my dataset!
http://www.ted.com/talks/hans_rosling_at_state.html
7. Frequency Distribution
A convenient way of summarizing a lot of tabular data
What is a Frequency Distribution?
A frequency distribution is a list or a table containing class groupings (categories or ranges within which the data fall) and the corresponding frequencies with which data fall within each class or category
For nominal/ordinal data
9. Table 1
Univariate Frequencies of Percentage of Sales Reported to Tax Authorities
% of Sales Reported  Frequency  Percent (%)
100%                 3307       40.56
90-99%               1096       13.44
80-89%               916        11.24
70-79%               703        8.62
60-69%               501        6.14
50-59%               694        8.51
<50%                 936        11.48
Total                8153       100
Source: 1999 World Bank World Business Environment Survey (WBES), excludes missing observations
http://www.enterprisesurveys.org/
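The Percent column of Table 1 follows directly from the Frequency column: each class frequency is divided by the total and multiplied by 100. A minimal Python sketch of that calculation is below. The function name `percent_distribution` is hypothetical, and the counts are copied from Table 1.

```python
# Frequencies copied from Table 1 (1999 WBES, missing observations excluded).
frequencies = {
    "100%": 3307,
    "90-99%": 1096,
    "80-89%": 916,
    "70-79%": 703,
    "60-69%": 501,
    "50-59%": 694,
    "<50%": 936,
}


def percent_distribution(freqs):
    """Convert class frequencies to percentages of the total (hypothetical helper)."""
    total = sum(freqs.values())
    return {label: round(100 * count / total, 2) for label, count in freqs.items()}


percents = percent_distribution(frequencies)
for label, count in frequencies.items():
    print(f"{label:<8}{count:>6}{percents[label]:>8.2f}")
print(f"{'Total':<8}{sum(frequencies.values()):>6}")
```

In Excel the same column could be built by dividing each frequency cell by the cell holding the total.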
10. Contingency/Pivot/Cross Table
May also want to produce a table with more
categories
Cross table or Contingency table or Pivot table
Suitable if you have two nominal/ordinal variables
Simple extension to a univariate table
Considers relationship between two variables
Row variable (Dependent)
Column variable (Independent)
11. Table 2
Percentage of Sales Reported to Tax Authorities by Region
% of Sales Reported  Africa  Transition Europe  Asia   Latin America  OECD  Former Soviet Countries  Total
100%                 490     554                416    794            446   607                      3,307
90-99%               266     196                142    119            145   228                      1,096
80-89%               158     152                117    192            73    224                      916
70-79%               162     117                103    153            43    125                      703
60-69%               140     69                 70     115            22    85                       501
50-59%               140     105                141    118            16    174                      694
<50%                 100     106                283    296            25    126                      936
Total                1,456   1,299              1,272  1,787          770   1,569                    8,153
Source: 1999 World Bank World Business Environment Survey (WBES)
* Excludes missing observations
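Raw counts in a cross table are hard to compare when the column totals differ, so a common next step is to turn each column into percentages of its own total. The sketch below does this for the Table 2 counts. The helper name `column_percentages` is hypothetical, and conditioning on the column (region) variable is an assumption that follows the row-dependent, column-independent convention stated earlier.

```python
# Row labels and counts copied from Table 2 (1999 WBES).
categories = ["100%", "90-99%", "80-89%", "70-79%", "60-69%", "50-59%", "<50%"]
counts = {
    "Africa": [490, 266, 158, 162, 140, 140, 100],
    "Transition Europe": [554, 196, 152, 117, 69, 105, 106],
    "Asia": [416, 142, 117, 103, 70, 141, 283],
    "Latin America": [794, 119, 192, 153, 115, 118, 296],
    "OECD": [446, 145, 73, 43, 22, 16, 25],
    "Former Soviet Countries": [607, 228, 224, 125, 85, 174, 126],
}


def column_percentages(table):
    """Express each region's counts as percentages of that region's total."""
    return {
        region: [100 * c / sum(column) for c in column]
        for region, column in table.items()
    }


shares = column_percentages(counts)
for region, column in shares.items():
    # Share of firms in each region that report 100% of sales.
    print(f"{region:<24}{column[0]:6.2f}% report {categories[0]} of sales")
```

Reading across the first row of percentages, rather than the raw counts, shows how the share of full reporters differs by region.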
12. Features of a Table
Title that accurately summarizes the data
Simple, indicates major variables, and time frame (if applicable)
Source: data set or origin of table
Explanatory footnotes
Easy to read & separated from text
Properly formatted for style (see APA Rules)
Necessary to advance analysis
See Module 7 for APA Table Checklist
Reproduced from APA manual
14. Bar Graph
Often used to describe categorical data
Ordinal/Nominal
Draws attention to the frequency of each category
16. Bar Graph
Figure 1
Percentage of sales reported to tax authority
Source: 1999 World Bank World Business Environment Survey (WBES)
Note. Excludes missing observations. n = 8314
18. Pie Graph
Emphasizes the proportion of each category
Something that may be good for our tax evasion data
Circle represents the total
Segments represent the shares of the total
Segment size is proportional to frequency
19. Pie Graph
Figure 1
Percentage of sales reported to tax authority
Source: 1999 World Bank World Business Environment Survey (WBES)
Note. Excludes missing observations. n = 8314
24. Bar Graph
Figure 1
Percentage of sales reported to tax authority
Source: 1999 World Bank World Business Environment Survey (WBES)
Note. Excludes missing observations. n = 8314
25. Segmented Bar Chart
Figure 1
Percentage of sales reported to tax authority
Source: 1999 World Bank World Business Environment Survey (WBES)
Note. Excludes missing observations. n = 8314
26. Pie Graph
Figure 2
Percentage of sales reported to tax authority by region
Source: 1999 World Bank World Business Environment Survey (WBES)
Note. Excludes missing observations. n = 8314
29. Time Series Graph
Time series are often used in social sciences
Data collected at various time intervals: daily, weekly, monthly,
quarterly, annually, etc.
Examples include GDP, Unemployment, University Tuition
Plot series of interest over time
Let’s look at a graph of the unemployment rate by gender and
age
31. Histogram
Used for continuous data
Frequency Distribution for continuous data
Summary graph showing the count of the data points falling in
various ranges
Rough approximation of the distribution of the data
A histogram is a way to summarize data
The distribution condenses the raw data into a more useful
form...
and allows for a quick visual interpretation of the data
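A histogram is built by splitting the range of a continuous variable into equal-width classes and counting the observations that fall in each one. The sketch below shows that counting step in Python. The function `histogram_counts` and the spending figures are hypothetical illustrations, not data from the slides.

```python
def histogram_counts(values, low, width, n_bins):
    """Count values in n_bins equal-width classes starting at low (hypothetical helper).

    Each class includes its lower edge; the last class also includes its upper edge.
    Values outside [low, low + width * n_bins] are ignored.
    """
    counts = [0] * n_bins
    high = low + width * n_bins
    for v in values:
        if v < low or v > high:
            continue  # outside the plotted range
        idx = min(int((v - low) // width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical spending amounts in dollars.
spending = [12.5, 14.0, 15.5, 15.9, 17.2, 19.9, 20.0]
counts = histogram_counts(spending, low=10, width=5, n_bins=2)
for i, c in enumerate(counts):
    print(f"{10 + 5 * i}-{15 + 5 * i}: {'*' * c}")
```

Changing `width` changes the picture considerably, which is why the choice of class ranges deserves some thought before charting.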
35. Principles of Graphical Excellence
Well-designed presentation of interesting data
Substance & design
Simplicity of design, complexity of data
Proportion and Balance
Clear, precise, efficient
Know what you are trying to show (have a story)
Make sure your graph shows it
Well formatted, professional
Choose format that reflects your data and the story
Informative and legible axes
Fully labelled & legible
Gets across main point(s) in the shortest time with the least ink in the
smallest space
Adds information not otherwise available to the reader
But supplemented with text describing the figure
Tells the truth about the data
Limits complexity and confusion
Avoid Chart Junk
36. Examples of Chartjunk
[Figure: two cluttered example charts by quarter (1st Qtr to 4th Qtr), with an overloaded legend: West, North, Northeast, Southwest, Mexico, Europe, Japan, East, South, International]
37. Examples of Chartjunk
[Figure: cluttered example chart by quarter (1st Qtr to 4th Qtr), annotated with the problems listed below]
Gridlines!
Vibration
Pointless fake 3-D effects
Filled “floor”
Clip art
In or out?
Filled “walls”
Borders and fills galore
Unintentional heavy or double lines
Filled labels
Serif font with thin & thick lines
38. Displaying Data: “Mistakes”
Graphs are also instruments of evil used for deceiving
a naive viewer.
Non-zero origin
Omitting data that refutes your “evidence”
Limiting scope of data
39. What is Wrong with this Graph?
Provincial Personal Income Taxes
Single Individual with $45,000 in
income claiming basic personal tax
credits
45. Describing Data Numerically
Central Tendency: Simple Arithmetic Mean, Median, Mode
Variation: Variance, Standard Deviation, Range
Association: Covariance, Correlation
Shape of the Distribution
46. Mode
A measure of central tendency
Value that occurs most often
Not affected by extreme values
Used for either numerical or categorical data
There may be no mode or several modes
What are the modes for the displayed data?
[Figure: two dot plots, one on an axis from 0 to 14 and one on an axis from 0 to 6]
47. Mode
A measure of central tendency
Value that occurs most often
Not affected by extreme values
Used for either numerical or categorical
data
There may be no mode
There may be several modes
[Dot plot on an axis from 0 to 14] Mode = 9
[Dot plot on an axis from 0 to 6] No Mode
48. Mode
There may be several modes
[Dot plot on an axis from 0 to 14] Modes = 5 and 9
49. Mode
Caution: Mode may not be representative of the data
{0.1, 0.1, 5000, 4900, 4500, 5200,…}
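The three cases on the Mode slides (one mode, several modes, no mode) can be reproduced with a short Python sketch. The helper `modes` is hypothetical, and returning an empty list when no value repeats is an assumption that follows the "No Mode" convention used on the slides. The last call uses the values listed in the caution above, without the elided remainder.

```python
from collections import Counter


def modes(values):
    """Return every most-frequent value, or [] when no value repeats (hypothetical helper)."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once: treated as "no mode"
    return sorted(v for v, c in counts.items() if c == top)


print(modes([1, 2, 2, 3]))        # one mode
print(modes([5, 5, 9, 9, 1]))     # two modes
print(modes([1, 2, 3]))           # no mode
print(modes([0.1, 0.1, 5000, 4900, 4500, 5200]))  # mode far from the bulk of the data
```

The final example shows why the mode can mislead: the only repeated value sits far below every other observation.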
50. Median
In an ordered list, the median is the “middle” number
(50% above, 50% below)
[Figure: two dot plots on an axis from 0 to 10]
51. Mean
The “balancing point” (centre of gravity) of the data
E.g. The data “balances” at 5
[Figure: number line from 1 to 9 balanced at 5, with deviations of -2, -1 and +3 from the balancing point]
52. Arithmetic Mean
The arithmetic mean (mean) is the most
common measure of central tendency
Calculated by summing the observed values
and dividing by the number of observations
For a sample of size n:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
where the $x_i$ are the observed values and $n$ is the number of observations
53. Arithmetic Mean
The most common measure of central tendency
Mean = sum of values divided by the number of values
Affected by extreme values (outliers)
What is the mean for these examples?
[Figure: two dot plots on an axis from 0 to 10]
54. Arithmetic Mean
The most common measure of central tendency
Mean = sum of values divided by the number of values
Affected by extreme values (outliers)
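[Dot plot on an axis from 0 to 10: values 1, 2, 3, 4, 5]
$$\text{Mean} = \frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$$
[Dot plot on an axis from 0 to 10: values 1, 2, 3, 4, 10]
$$\text{Mean} = \frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4$$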
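The effect of an outlier on the mean can be checked with the two five-value samples from this slide. The sketch below uses Python's standard `statistics` module and adds the median for contrast. The variable names are illustrative only.

```python
import statistics

# The two samples from the slide: identical except for the largest value.
without_outlier = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]

for label, data in (("without outlier", without_outlier), ("with outlier", with_outlier)):
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{label}: mean = {mean}, median = {median}")
```

Replacing 5 with 10 pulls the mean up while the median stays where it was, which is the sense in which the mean is affected by extreme values.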
55. Measures of Central Tendency
Overview
Mean: arithmetic average, $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Median: midpoint of ranked values (50% below, 50% above)
Mode: most frequently observed value
56. The “Shape of a Distribution”
Use information on mean, median, and mode to
“visualize” the data
A data distribution is said to be symmetric if its shape
is the same on both sides of the median
Symmetry implies that median=arithmetic mean
If a distribution is uni-modal and symmetric then
Median=mean=mode
57. The “Shape of a Distribution”
[Figure: histogram of number of observations (0 to 9) by value (1 to 7); the median splits the data 50% / 50%; unimodal]
Symmetric & Unimodal: Median = Mean = Mode
58. The “Shape of a Distribution”
[Figure: histogram of number of observations (0 to 9) by value (1 to 7); the median splits the data 50% / 50%; bimodal]
Symmetric & Bimodal: Median = Mean ≠ Mode
59. The “Shape of a Distribution”
[Figure: histogram of number of observations (0 to 8) by value (1 to 7); the median splits the data 50% / 50%; mode?]
Symmetric & no mode: Median = Mean (Uniform)
60. The “Shape of a Distribution”
An asymmetric distribution is said to be skewed
1. Negatively if Mean<Median<Mode
2. Positively if Mean>Median>Mode
Hence, by comparing our measures of central tendency,
we can start to visualize the shape and characteristics
of the data
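The comparison rule above can be written as a small Python function. This is only a sketch of the rule of thumb stated on the slide: the name `classify_skew` and the wording of the return values are hypothetical. The example inputs are the mean, median and mode quoted on the skewed-distribution figures in this section.

```python
def classify_skew(mean, median, mode):
    """Apply the mean/median/mode rule of thumb (hypothetical helper)."""
    if mean > median > mode:
        return "positively skewed"
    if mean < median < mode:
        return "negatively skewed"
    if mean == median == mode:
        return "symmetric and unimodal"
    return "undetermined by this rule"


print(classify_skew(mean=3.2, median=3, mode=2))
print(classify_skew(mean=4.7, median=5, mode=6))
```

The rule is a quick screen rather than a formal test, so an "undetermined" result simply means the three measures do not line up in either order.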
61. The “Shape of a Distribution”
[Figure: histogram on values 1 to 8 (frequency axis 0 to 12); Mode = 2, Median = 3 (50% / 50%), Mean = 3.2]
Mode < Median < Mean = positively skewed distribution
62. Example: Positively skewed variable
The Distribution of After-Tax Income shows the distribution of income across all Canadian households
63. Example: Positively skewed variable
The mode income is the most common income and was in the range from $15,000 to $19,999.
The median income is the level of income that separates the population into two groups of equal size and was $39,700.
The mean income is the average income and was $48,400.
64. Example: Positively skewed variable
A distribution in which the mean exceeds the median and the median exceeds the mode is positively skewed, which means it has a long tail of high values.
The distribution of income in Canada is positively skewed.
Most likely to report the median rather than the mean, since the long tail distorts the average
65. Example: Positively skewed variable
Volunteer hours
Charitable contributions
# of Cigarette packs smoked (excluding 0)
Collective bargaining agreement duration (in years)
# of beers consumed on a Saturday night
Duration of low income (in years)
Number of children
66. The “Shape of a Distribution”
[Figure: histogram on values 0 to 7 (frequency axis 0 to 12); Mode = 6, Median = 5 (50% / 50%), Mean = 4.7]
Mean < Median < Mode = negatively skewed distribution
69. Measures of Dispersion/Variability
Variation: Variance, Standard Deviation, Range
Measures of variation give information on the spread or variability of the data values.
[Figure: two distributions with the same center but different variation]
70. Range
Simplest measure of variation
Difference between the largest and the smallest
observations:
Range = Xlargest – Xsmallest
Example: [dot plot on an axis from 0 to 14]
71. Range
Simplest measure of variation
Difference between the largest and the smallest
observations:
Range = Xlargest – Xsmallest
Example: [dot plot on an axis from 0 to 14]
Range = 14 - 1 = 13
72. The Range
• Problem
• Ignores all but two data points
• These values may be “outliers” (i.e. not
representative)
73. Disadvantages of the Range
Ignores the way in which data are distributed
Sensitive to outliers
[Two dot plots on an axis from 7 to 12, distributed differently] Range = 12 - 7 = 5 in both cases
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
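The sensitivity of the range to a single outlier can be verified with the two 25-value lists from this slide. The sketch below is a minimal Python check; the function name `value_range` is hypothetical.

```python
def value_range(values):
    """Largest observation minus smallest observation (hypothetical helper)."""
    return max(values) - min(values)


# The two lists from the slide: eleven 1s, eight 2s, four 3s, a 4, and a final value.
typical = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

print(value_range(typical))
print(value_range(with_outlier))
```

Only one of the 25 observations differs between the lists, yet the range changes from 4 to 119.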
74. The Variance
• A single summary measure of dispersion would be
more helpful
• Takes account of all data values
75. The Variance
1. Variance
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
2. Standard Deviation
$$s = \sqrt{\text{variance}} = \sqrt{s^2}$$
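The sample variance and standard deviation can be computed step by step in Python and checked against the standard `statistics` module. The sketch below uses an illustrative eight-value sample that is an assumption, not data from the slides; the function name `sample_variance` is also hypothetical.

```python
import math
import statistics


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1 (hypothetical helper)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


# Illustrative sample (an assumption, not taken from the slides).
sample = [11, 12, 13, 16, 16, 17, 18, 21]

s2 = sample_variance(sample)
s = math.sqrt(s2)
print(f"mean = {sum(sample) / len(sample)}, variance = {s2:.3f}, s = {s:.3f}")
print(f"statistics.stdev gives {statistics.stdev(sample):.3f}")
```

In Excel, VAR.S and STDEV.S use the same n - 1 divisor, so they should agree with this calculation.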
77. Comparing Standard Deviations
[Figure: three dot plots on a common axis from 11 to 21]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.926
Data C: Mean = 15.5, s = 4.570
79. The Sample Covariance
The covariance measures the strength of the linear
relationship between two variables
The sample covariance:
$$\text{Cov}(x, y) = s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
Only concerned with the strength of the
relationship
No causal effect is implied
80. Interpreting Covariance
Covariance between two variables:
Cov(x,y) > 0 x and y tend to move in the same direction
Cov(x,y) < 0 x and y tend to move in opposite directions
Cov(x,y) = 0 x and y have no linear relationship (they are not necessarily independent)
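The three sign cases can be illustrated with a short Python sketch of the sample covariance. All data below are hypothetical, as is the function name `sample_cov`. The third example is chosen so that y depends entirely on x (y = x squared) yet the covariance is zero, which shows that a zero covariance rules out a linear relationship only.

```python
def sample_cov(xs, ys):
    """Sample covariance with the n - 1 divisor (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]      # moves with x
falling = [10, 8, 6, 4, 2]     # moves against x

centered = [-2, -1, 0, 1, 2]
squared = [v ** 2 for v in centered]  # fully determined by x, but not linearly

print(sample_cov(x, rising))
print(sample_cov(x, falling))
print(sample_cov(centered, squared))
```

Note that the size of the covariance depends on the units of x and y, which is the motivation for the unit-free correlation coefficient.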
81. Coefficient of Correlation
Measures the relative strength of the linear relationship
between two variables
Sample correlation coefficient:
$$r = \frac{\text{Cov}(x, y)}{s_x s_y}$$
82. Features of Correlation Coefficient, r
Unit free
Ranges between –1 and 1
The closer to –1, the stronger the negative linear
relationship
The closer to 1, the stronger the positive linear
relationship
The closer to 0, the weaker the linear
relationship
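The "unit free" and "between -1 and 1" features can be checked numerically. The sketch below computes r as the sample covariance divided by the product of the two sample standard deviations. The data and all function names are hypothetical; the second call rescales x (hours converted to minutes) to show that r does not change with the units.

```python
import math


def sample_cov(xs, ys):
    """Sample covariance with the n - 1 divisor (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def sample_sd(xs):
    """Sample standard deviation: square root of the covariance of xs with itself."""
    return math.sqrt(sample_cov(xs, xs))


def correlation(xs, ys):
    """Sample correlation coefficient r = Cov(x, y) / (s_x * s_y)."""
    return sample_cov(xs, ys) / (sample_sd(xs) * sample_sd(ys))


# Hypothetical data.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]

r = correlation(hours, score)
r_minutes = correlation([60 * h for h in hours], score)
r_reversed = correlation(hours, score[::-1])
print(round(r, 6), round(r_minutes, 6), round(r_reversed, 6))
```

In Excel, the CORREL function returns the same coefficient for two ranges of equal length.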
86. Fun with Graphs
Does your mindset match my dataset!
http://www.ted.com/talks/hans_rosling_at_state.html
87. Looking ahead
SRs to client (cc) and Turnitin on Wednesday by
noon
No class next week
Work on 598 critiques
598 Critiques due in class & Turnitin Nov. 30
Comments on your SRs will be ready Nov. 30
Final SRs (if required) due Dec. 8 @11:55PM PST
Note carefully the requirements
Moodle site will be inaccessible sometime in December
Final Grades reported via usource once approved by
the Director
Editor's notes
Graph makes the frequencies pop more
Or what could have been a bar chart can be made into a line chart by connecting the midpoints
Remember our cross table?
Can we present this graphically?
Note legend is on right as no room on left hand side
Or we can display this as a stacked bar where the proportion of each region in each category is displayed.
Called a segmented bar chart
Mancession Video 4 minutes
Unemployment Rates sheet
ExcelTutorial5_timeseriesgraph
The main defence of the lying graph is that at least it was approximately correct: we were just trying to show the general direction of change or magnitude.
So yes, taxes are low in BC, but not as low as shown in the original graph
Non-zero origins are a great way to lie
Very popular in government
Remember this time series graph. Look at what happens if we change the scale on the Y axis
Boy, that really changes your impression of the data and the underlying trend. The drop from 1992 to 1997 was 7%. Does this graph under or overstate a 7% change over this period?
Dr. Kendall used his diagram to demonstrate that we are drinking too much when really there are more people drinking due to population growth
9
No mode
If the mean=median and there is no mode, your distribution looks something like this
Not as frequently occurring in economic data, so I actually do not have many examples
What does the standard deviation tell us? It tells us how far from the mean the data points tend to be. A bigger number tells us that the observations are further away from the mean than if there is a small standard deviation. It tells us HOW representative of the data the mean is.
Since the standard deviation can be thought of measuring how far the data values lie from the mean, we take the mean and move one standard deviation in either direction. The mean for this example was about 15.5
For the first distribution we have 15.5+3.338= 18.838 and 15.5-3.338=12.162
Assuming this is how much restaurant patrons spend, what this means is that most of the patrons probably spend between $12.16 and $18.84.
In the second example, we have 15.5+0.926=16.43 and 15.5-0.926=14.57 which as you can see shows less spread in the data.
In the third example we have 15.5+4.57=20.07 and 15.5-4.57=10.93 which is the most spread.
Excel 4 minutes
Food Expenditures 2
ExcelTutorial9_Dispersion.mp4
Measures of Relationships Between Variables
More often than not, we are interested in describing the relationship between variables
On Oct. 28 we learned about scatter plots as a graphical way to describe a relationship between two variables.
We also learned about cross tabs aka contingency tables for nominal/ordinal variables
Let’s look a little more closely at measures of relationships for ratio-level data