Intro to quant_analysis_students

Week 11: Basic Descriptive
Quantitative Data Analysis
Tables, Graphs, & Summary Statistics
1

Objectives
 Learn about basic descriptive quantitative analysis
 How to perform these tasks in Excel
 Starting point for 502B
 Excel knowledge and quantitative skills are highly desired by
employers
 EC stream
2

Introduction
3
 Without data, it is anyone’s opinion
 Why use tables, graphs, summary stats?
“At their best, tables, graphs, and statistics are instruments
for reasoning about complex quantitative information.”
 Why learn how to design them appropriately?
“At their worst, tables, graphs and summary statistics are
instruments of evil used for deceiving a naive viewer.”
 Does your mindset match my dataset!
 http://www.ted.com/talks/hans_rosling_at_state.html

Quantitative Research Process
Page 4

Frequency Distribution
Page 7
 A convenient way of summarizing a lot of tabular data
 What is a Frequency Distribution?
 A frequency distribution is a list or a table …
 containing class groupings (categories or ranges within
which the data fall) ...
 and the corresponding frequencies with which data fall
within each class or category
 For nominal/ordinal data

Table 1
Univariate Frequencies of Percentage of Sales Reported to Tax Authorities

% of Sales Reported   Frequency   Percent (%)
100%                       3307         40.56
90-99%                     1096         13.44
80-89%                      916         11.24
70-79%                      703          8.62
60-69%                      501          6.14
50-59%                      694          8.51
<50%                        936         11.48
Total                      8153        100.00

Source: 1999 World Bank World Business Environment Survey (WBES), excludes missing observations
http://www.enterprisesurveys.org/
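
The Percent column of Table 1 is each class frequency divided by the total number of observations. A minimal Python sketch of that calculation follows, using the frequencies from the table; the variable names are illustrative and not part of the course material, which performs these tasks in Excel.

```python
# Frequencies copied from Table 1 (1999 WBES, missing observations excluded).
frequencies = {
    "100%": 3307,
    "90-99%": 1096,
    "80-89%": 916,
    "70-79%": 703,
    "60-69%": 501,
    "50-59%": 694,
    "<50%": 936,
}

total = sum(frequencies.values())

# Relative frequency of each class, as a percentage of all observations.
percents = {label: 100 * count / total for label, count in frequencies.items()}

for label, count in frequencies.items():
    print(f"{label:>7} {count:>6} {percents[label]:6.2f}")
print(f"{'Total':>7} {total:>6} {sum(percents.values()):6.2f}")
```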

Contingency/Pivot/Cross Table
10
 May also want to produce a table with more
categories
 Cross table or Contingency table or Pivot table
 Suitable if you have two nominal/ordinal variables
 Simple extension to a univariate table
 Considers relationship between two variables
 Row variable (Dependent)
 Column variable (Independent)

Table 2
Percentage of Sales Reported to Tax Authorities by Region
Page 11

% of Sales   Africa   Transition    Asia     Latin   OECD   Former Soviet   Total
Reported                  Europe           America              Countries
100%            490          554     416       794    446             607   3,307
90-99%          266          196     142       119    145             228   1,096
80-89%          158          152     117       192     73             224     916
70-79%          162          117     103       153     43             125     703
60-69%          140           69      70       115     22              85     501
50-59%          140          105     141       118     16             174     694
<50%            100          106     283       296     25             126     936
Total         1,456        1,299   1,272     1,787    770           1,569   8,153

Source: 1999 World Bank World Business Environment Survey (WBES)
* Excludes missing observations
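
Raw counts in a cross table are hard to compare because the regions have different sample sizes. One common next step is to convert each column to percentages of its own total. The sketch below does this for the "100%" row of Table 2; the counts come from the table, while the variable names and the choice of column percentages are assumptions made for illustration.

```python
regions = [
    "Africa", "Transition Europe", "Asia",
    "Latin America", "OECD", "Former Soviet Countries",
]

# Rows of Table 2: share of sales reported -> counts by region (same order as `regions`).
counts = {
    "100%":   [490, 554, 416, 794, 446, 607],
    "90-99%": [266, 196, 142, 119, 145, 228],
    "80-89%": [158, 152, 117, 192, 73, 224],
    "70-79%": [162, 117, 103, 153, 43, 125],
    "60-69%": [140, 69, 70, 115, 22, 85],
    "50-59%": [140, 105, 141, 118, 16, 174],
    "<50%":   [100, 106, 283, 296, 25, 126],
}

# Column totals: number of firms surveyed in each region.
column_totals = [sum(row[j] for row in counts.values()) for j in range(len(regions))]

# Percentage of firms in each region that report 100% of their sales.
full_share = {
    region: 100 * counts["100%"][j] / column_totals[j]
    for j, region in enumerate(regions)
}

for region, share in full_share.items():
    print(f"{region:<24} {share:5.1f}%")
```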

Features of a Table
12
 Title that accurately summarizes the data
 Simple, indicates major variables, and time frame (if applicable)
 Source: data set or origin of table
 Explanatory footnotes
 Easy to read & separated from text
 Properly formatted for style (see APA Rules)
 Necessary to advance analysis
 See Module 7 for APA Table Checklist
 Reproduced from APA manual

Bar Graph
Page 14
 Often used to describe categorical data
 Ordinal/Nominal
 Draws attention to the frequency of each category

Bar Graph
Page 16
Figure 1
Percentage of sales reported to tax authority
Note. Excludes missing observations. n = 8314

Relative Frequency Polygon
17

Pie Graph
Page 18
 Emphasizes the proportion of each category
 Something that may be good for our tax evasion data
 Circle represents the total
 Segments the shares of the total
 Segment size is proportional to frequency

Pie Graph
19
Figure 1

Bar Graph
Page 24
Figure 1

Segmented Bar Chart
Figure 1

Pie Graph
Page 26
Figure 2
Percentage of sales reported to tax authority by region

Time Series Graph
Page 29
 Time series are often used in social sciences
• Data collected at various time periods: daily, weekly, monthly,
quarterly, annually, etc.
 Examples include GDP, Unemployment, University Tuition
 Plot series of interest over time
 Let’s look at a graph of the unemployment rate by gender and
age

Histogram
Page 31
• Used for continuous data
• Frequency distribution for continuous data
• Summary graph showing the count of data points falling in various ranges
• Rough approximation of the distribution of the data
• A histogram is a way to summarize data
• The distribution condenses the raw data into a more useful form...
• and allows for a quick visual interpretation of the data
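
The counting step behind a histogram can be sketched in a few lines of Python. The data values and the bin width below are hypothetical, chosen only to illustrate how continuous observations are grouped into ranges and counted.

```python
from collections import Counter

# Hypothetical continuous measurements (not from the slides).
values = [1.2, 2.5, 2.7, 3.1, 3.3, 3.8, 4.0, 4.4, 5.9, 7.5]
bin_width = 2.0  # assumed width of each range


def bin_start(value, width):
    """Return the lower edge of the range [start, start + width) containing value."""
    return width * (value // width)


# Count how many data points fall in each range.
counts = Counter(bin_start(v, bin_width) for v in values)

# Print a text histogram: one '#' per observation in the range.
for start in sorted(counts):
    print(f"[{start:.0f}, {start + bin_width:.0f})  {'#' * counts[start]}  ({counts[start]})")
```

Changing `bin_width` changes the picture: wider ranges smooth the shape, narrower ranges show more detail but more noise.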

Scatter Graphs
Page 33
 Graphs relationship between two continuous
variables

Principles of Graphical Excellence
35
 Well-designed presentation of interesting data
 Substance & design
 Simplicity of design, complexity of data
 Proportion and Balance
 Clear, precise, efficient
 Know what you are trying to show (have a story)
• Make sure your graph shows it
 Well formatted, professional
 Choose format that reflects your data and the story
• Informative and legible axes
 Fully labelled & legible
 Gets across main point(s) in the shortest time with the least ink in the
smallest space
 Adds information not otherwise available to the reader
 But supplemented with text describing the figure
 Tells the truth about the data
 Limits complexity and confusion
 Avoid Chart Junk

Examples of Chartjunk
36
[Figure: two cluttered example charts plotted by quarter (1st Qtr to 4th Qtr), with legend entries West, North, Northeast, Southwest, Mexico, Europe, Japan, East, South, and International]

Examples of Chartjunk
37
[Figure: a chart by quarter (1st Qtr to 4th Qtr) annotated with its chartjunk problems]
• Gridlines!
• Vibration
• Pointless fake 3-D effects
• Filled "floor"
• Clip art
• In or out?
• Filled "walls"
• Borders and fills galore
• Unintentional heavy or double lines
• Filled labels
• Serif font with thin & thick lines

Displaying Data: “Mistakes”
Page 38
 Graphs are also instruments of evil used for deceiving
a naive viewer.
 Non-zero origin
 Omitting data that refutes your “evidence”
 Limiting scope of data

What is Wrong with this Graph?
39
Provincial Personal Income Taxes
Single Individual with $45,000 in
income claiming basic personal tax
credits

Exaggerates a change in data
Page 41
Source: Statistics Canada, CANSIM II, V31215364

Worst Recession Since the Depression (?)
43

Describing Data Numerically
45
Central Tendency: Simple Arithmetic Mean, Median, Mode
Variation: Range, Variance, Standard Deviation
Association: Covariance, Correlation
Shape of the Distribution

Mode
46
 A measure of central tendency
 Value that occurs most often
 Not affected by extreme values
 Used for either numerical or categorical data
 There may be no mode or several modes
 What are the modes for the displayed data?
[Figure: two dot plots, one on a 0 to 14 axis and one on a 0 to 6 axis]

Mode
47
[Figure: dot plot on a 0 to 14 axis] Mode = 9
[Figure: dot plot on a 0 to 6 axis] No mode

Mode
48
 There may be several modes
[Figure: dot plot on a 0 to 14 axis] Modes = 5 and 9
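
The three cases on these slides (one mode, several modes, no mode) can be reproduced with a short Python sketch. The data sets are hypothetical, since the slide's dot plots are not recoverable, and treating a data set in which no value repeats as having no mode is a convention assumed here.

```python
from collections import Counter


def modes(data):
    """Return the most frequent value(s); an empty list if no value repeats."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:  # every value occurs once: treated as "no mode"
        return []
    return sorted(value for value, count in counts.items() if count == top)


# Hypothetical data sets, one per case.
print(modes([1, 3, 5, 9, 9, 12]))        # one mode
print(modes([2, 5, 5, 7, 9, 9, 13]))     # two modes
print(modes([0, 1, 2, 3, 4, 5, 6]))      # no mode

# The mode can be unrepresentative: here it sits far from most of the data.
print(modes([0.1, 0.1, 5000, 4900, 4500, 5200]))
```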

Mode
49
 Caution: Mode may not be representative of the data
 {0.1, 0.1, 5000, 4900, 4500, 5200,…}

Median
50
 In an ordered list, the median is the “middle” number
(50% above, 50% below)
[Figure: two dot plots on a 0 to 10 axis]

Mean
51
 The “balancing point” (centre of gravity) of the data
 E.g. The data “balances” at 5
[Figure: number line from 1 to 9 balancing at 5; the deviations from the mean, -2, -1, and +3, sum to zero]

Arithmetic Mean
52
 The arithmetic mean (mean) is the most
common measure of central tendency
 Calculated by summing the value observations
and dividing by the number of observations
 For a sample of size n:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
where x_1, ..., x_n are the observed values and n is the number of observations

Arithmetic Mean
53
 The most common measure of central tendency
 Mean = sum of values divided by the number of values
 Affected by extreme values (outliers)
 What is the mean for these examples?
[Figure: two dot plots on a 0 to 10 axis]

Arithmetic Mean
54
Data 1, 2, 3, 4, 5: Mean = 3
\[ \frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3 \]
Data 1, 2, 3, 4, 10: Mean = 4
\[ \frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4 \]
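
The two five-value examples on this slide (1, 2, 3, 4, 5 and 1, 2, 3, 4, 10) also show why the mean is described as affected by extreme values. A minimal Python sketch, using only the standard library, compares the mean and the median for both; the variable names are illustrative.

```python
from statistics import mean, median

without_outlier = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]  # same data, largest value replaced by 10

for data in (without_outlier, with_outlier):
    # The mean sums every value, so one extreme value shifts it;
    # the median only depends on the middle of the ordered list.
    print(data, "mean =", mean(data), "median =", median(data))
```

Replacing 5 with 10 moves the mean from 3 to 4, while the median stays at 3.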

Measures of Central Tendency
55
Overview
Mean: arithmetic average, \( \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \)
Median: midpoint of ranked values (50% below, 50% above)
Mode: most frequently observed value

The “Shape of a Distribution”
56
 Use information on mean, median, and mode to
“visualize” the data
 A data distribution is said to be symmetric if its shape
is the same on both sides of the median
 Symmetry implies that median=arithmetic mean
 If a distribution is uni-modal and symmetric then
 Median=mean=mode

57
[Figure: histogram of # of obs. by value (1 to 7), unimodal and symmetric around the median, with 50% of observations on each side]
Symmetric: Median = Mean
Symmetric & unimodal: Median = Mean = Mode

58
[Figure: histogram of # of obs. by value (1 to 7), bimodal and symmetric around the median, with 50% of observations on each side]
Symmetric: Median = Mean
Symmetric & bimodal: Median = Mean ≠ Mode

59
[Figure: histogram of # of obs. by value (1 to 7), flat and symmetric around the median, with 50% of observations on each side and no distinct mode]
Symmetric: Median = Mean
Symmetric & no mode: Median = Mean (uniform)

60
 An asymmetric distribution is said to be skewed
1. Negatively if Mean < Median < Mode
2. Positively if Mean > Median > Mode
• Hence, by comparing our measures of central tendency, we can start to visualize the shape and characteristics of the data

61
[Figure: histogram on a 1 to 8 axis with Mode = 2, Median = 3 (50% of observations on each side), and Mean = 3.2]
Mode < Median < Mean = positively skewed distribution

Example: Positively skewed variable
62
 The Distribution of
After-Tax Income
 shows the distribution
of income across all
Canadian households

63
 The mode income is the
most common income and
was in the range from
$15,000 to $19,999.
 The median income is the
level of income that
separates the population into
two groups of equal size and
was $39,700.
 The mean income is the
average income and was
$48,400.

64
 A distribution in which the
mean exceeds the median
and the median exceeds
the mode is positively
skewed, which means it
has a long tail of high
values.
 The distribution of income
in Canada is positively
skewed.
 Most likely to report
median rather than mean
since long tail distorts
average

65
 Volunteer hours
 Charitable contributions
 # of Cigarette packs smoked (excluding 0)
 Collective bargaining agreement duration (in years)
 # of beers consumed on a Saturday night
 Duration of low income (in years)
 Number of children

66
[Figure: histogram on a 0 to 7 axis with Mode = 6, Median = 5 (50% of observations on each side), and Mean = 4.7]
Mean < Median < Mode = negatively skewed distribution
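
The comparison rule from these slides can be written as a small Python function. This is a sketch of the rule of thumb as stated in the deck, not a formal skewness statistic; the function name is hypothetical, and for the income example the mode is represented by 17,500, an assumed midpoint of the reported $15,000 to $19,999 range.

```python
def skew_direction(mean, median, mode):
    """Classify the shape of a distribution from its three measures of central tendency."""
    if mean > median > mode:
        return "positive"   # long tail of high values
    if mean < median < mode:
        return "negative"   # long tail of low values
    if mean == median == mode:
        return "symmetric"  # unimodal and symmetric
    return "unclear"        # the rule of thumb does not apply


print(skew_direction(mean=3.2, median=3, mode=2))            # first histogram example
print(skew_direction(mean=4.7, median=5, mode=6))            # second histogram example
print(skew_direction(mean=48400, median=39700, mode=17500))  # after-tax income example
```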

Examples
67
 University Grades
 Age
 Years in school
 Etc.

Measures of Dispersion/Variability
69
Variation: Range, Variance, Standard Deviation
[Figure: two distributions with the same center but different variation]
 Measures of variation
give information on the
spread or variability of
the data values.

Range
70
 Simplest measure of variation
 Difference between the largest and the smallest
observations:
Range = Xlargest – Xsmallest
Example: [Figure: dot plot on a 0 to 14 axis]

Range
71
Example: [Figure: dot plot on a 0 to 14 axis, with observations from 1 to 14]
Range = 14 - 1 = 13

The Range
72
• Problem
• Ignores all but two data points
• These values may be “outliers” (i.e. not
representative)

Disadvantages of the Range
73
 Ignores the way in which data are distributed
 Sensitive to outliers
[Figure: two differently distributed data sets on a 7 to 12 axis] Both have Range = 12 - 7 = 5
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5: Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120: Range = 120 - 1 = 119
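
The outlier example above is easy to verify in Python. The sketch below rebuilds the two 25-value data sets from the slide and computes the range of each; the helper name is illustrative.

```python
def value_range(data):
    """Range = largest observation minus smallest observation."""
    return max(data) - min(data)


# Eleven 1s, eight 2s, four 3s, one 4, and a final value of 5 or 120.
base = [1] * 11 + [2] * 8 + [3] * 4 + [4]
typical = base + [5]
with_outlier = base + [120]

print(value_range(typical))       # only the two extreme values matter
print(value_range(with_outlier))  # one outlier changes the range completely
```

Twenty-four of the twenty-five values are identical in the two data sets, yet the range jumps from 4 to 119.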

The Variance
74
• A single summary measure of dispersion would be
more helpful
• Takes account of all data values

The Variance
75
1. Variance
\[ s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
2. Standard Deviation
\[ s = \sqrt{s^2} = \sqrt{\text{variance}} \]

Measuring variation
76
Small standard deviation
Large standard deviation

Comparing Standard Deviations
77
[Figure: three dot plots on an 11 to 21 axis, all with the same mean]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.926
Data C: Mean = 15.5, s = 4.570
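
The sample variance and standard deviation can be computed directly from their definitions. In the sketch below, the three data sets are hypothetical: the slide's underlying values are not recoverable, so these were chosen as an assumption to have a mean of 15.5 and standard deviations matching the reported figures to two decimal places. The hand-written function is cross-checked against Python's `statistics` module.

```python
from math import isclose, sqrt
from statistics import mean, stdev, variance


def sample_variance(data):
    """s^2 = sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(data) / len(data)
    return sum((x - x_bar) ** 2 for x in data) / (len(data) - 1)


# Hypothetical data sets (assumed, not taken from the slides).
data_a = [11, 12, 13, 16, 16, 17, 18, 21]  # moderately spread out
data_b = [14, 15, 15, 15, 16, 16, 16, 17]  # tightly clustered around the mean
data_c = [11, 11, 11, 12, 19, 20, 20, 20]  # values far from the mean

for name, data in (("A", data_a), ("B", data_b), ("C", data_c)):
    s = sqrt(sample_variance(data))
    # The library functions use the same n - 1 definition.
    assert isclose(sample_variance(data), variance(data))
    assert isclose(s, stdev(data))
    print(f"Data {name}: mean = {mean(data)}, s = {s:.3f}")
```

All three sets share the same center, so only the standard deviation distinguishes them.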

The Sample Covariance
79
• The covariance measures the strength of the linear relationship between two variables
• The sample covariance:
\[ \mathrm{Cov}(x, y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \]
• Only concerned with the strength of the relationship
• No causal effect is implied

Interpreting Covariance
80
Covariance between two variables:
Cov(x,y) > 0: x and y tend to move in the same direction
Cov(x,y) < 0: x and y tend to move in opposite directions
Cov(x,y) = 0: x and y have no linear relationship (zero covariance alone does not establish independence)
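
A short numerical check helps with reading the sign of the covariance, and with one caveat: a covariance of zero rules out a linear relationship only. The sketch below uses hypothetical data in which one y is a straight-line function of x and another is exactly x squared; the function and variable names are illustrative.

```python
def sample_cov(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


x = [-2, -1, 0, 1, 2]            # hypothetical values
y_line = [2 * v + 1 for v in x]  # rises with x
y_down = [-v for v in x]         # falls as x rises
y_square = [v ** 2 for v in x]   # fully determined by x, but not linearly

print(sample_cov(x, y_line))    # positive: same direction
print(sample_cov(x, y_down))    # negative: opposite directions
print(sample_cov(x, y_square))  # zero, although y depends entirely on x
```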

Coefficient of Correlation
81
 Measures the relative strength of the linear relationship
between two variables
• Sample correlation coefficient:
\[ r = \frac{\mathrm{Cov}(x, y)}{s_X s_Y} \]

Features of
Correlation Coefficient, r
82
 Unit free
 Ranges between –1 and 1
 The closer to –1, the stronger the negative linear
relationship
 The closer to 1, the stronger the positive linear
relationship
• The closer to 0, the weaker the linear relationship
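
These features can be checked with a small Python sketch that computes r as the sample covariance divided by the product of the two sample standard deviations. The data are hypothetical and the function name is illustrative; recent Python versions also provide `statistics.correlation`, which is not used here so that the formula stays visible.

```python
from math import isclose
from statistics import mean, stdev


def correlation(xs, ys):
    """r = Cov(x, y) / (s_x * s_y), using sample (n - 1) definitions throughout."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))


x = [1, 2, 3, 4, 5]                # hypothetical values
y_exact = [2 * v + 1 for v in x]   # perfect positive linear relationship
y_noisy = [2, 1, 4, 3, 5]          # positive but imperfect relationship
x_rescaled = [100 * v for v in x]  # same x measured in different units

print(correlation(x, y_exact))           # approximately 1
print(correlation(x, y_noisy))           # between 0 and 1
print(correlation(x_rescaled, y_noisy))  # unchanged by rescaling: r is unit free
```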

Interpreting the Correlation Coefficient, r
83

Scatter Plots of Data with Various Correlation Coefficients
84
[Figure: six scatter plots of Y against X, with panels labelled r = -1 (Cov < 0), r = -.6 (Cov < 0), r = 0 (Cov = 0), r = +.3, r = +1, and r = 0]

Fun with Graphs
86
 Does your mindset match my dataset!
 http://www.ted.com/talks/hans_rosling_at_state.html

Looking ahead
 SRs to client (cc) and Turnitin on Wednesday by
noon
 No class next week
 Work on 598 critiques
 598 Critiques due in class & Turnitin Nov. 30
 Comments on your SRs will be ready Nov. 30
 Final SRs (if required) due Dec. 8 @11:55PM PST
 Note carefully the requirements
 Moodle site will be inaccessible sometime in December
 Final Grades reported via usource once approved by
the Director
87
