- Discriminant analysis uses independent variables to distinguish between groups defined by a dependent variable. It creates discriminant functions and scores to classify cases into groups.
- The usefulness of a discriminant model is determined by its classification accuracy rate, or ability to correctly predict group membership compared to chance accuracy.
- This document describes key outputs of discriminant analysis (discriminant functions and scores, the structure matrix, and group statistics) and evaluates a model by comparing its classification accuracy to chance accuracy.
1. SW388R7
Discriminant Analysis – Basic Relationships
Data Analysis &
Computers II
Slide 1
Discriminant Functions and Scores
Describing Relationships
Classification Accuracy
Sample Problems
2. SW388R7
Discriminant analysis
Discriminant analysis is used to analyze relationships between a
non-metric dependent variable and metric or dichotomous
independent variables.
Discriminant analysis attempts to use the independent variables
to distinguish among the groups or categories of the dependent
variable.
The usefulness of a discriminant model is based upon its
accuracy rate, or ability to predict the known group
memberships in the categories of the dependent variable.
3. SW388R7
Discriminant scores
Discriminant analysis works by creating a new variable called
the discriminant function score which is used to predict to
which group a case belongs.
Discriminant function scores are computed similarly to factor
scores, i.e. using eigenvalues. The computations find the
coefficients for the independent variables that maximize the
measure of distance between the groups defined by the
dependent variable.
The discriminant function is similar to a regression equation in
which the independent variables are multiplied by coefficients
and summed to produce a score.
4. SW388R7
Discriminant functions
Conceptually, we can think of the discriminant function or
equation as defining the boundary between groups.
Discriminant scores are standardized, so that if the score falls
on one side of the boundary (standard score less than zero), the
case is predicted to be a member of one group, and if the score
falls on the other side of the boundary (positive standard
score), it is predicted to be a member of the other group.
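As a rough illustration of the two ideas above (a score built as a weighted sum, and a boundary at zero for two groups), the sketch below uses made-up numbers. The function names, coefficients, constant, and group labels are assumptions for illustration only; they do not come from the course data or from SPSS output.

```python
def discriminant_score(values, coefficients, constant=0.0):
    """Multiply each independent variable by its coefficient and sum."""
    if len(values) != len(coefficients):
        raise ValueError("need one coefficient per independent variable")
    return constant + sum(c * v for c, v in zip(coefficients, values))


def predict_two_groups(score, negative_group="group A", positive_group="group B"):
    """Conceptual two-group rule: the sign of the score picks the group.

    A score of exactly zero is arbitrarily assigned to the positive group here.
    """
    return negative_group if score < 0 else positive_group


# Hypothetical coefficients and constant (illustration only).
coefficients = [0.05, -0.20]
constant = -1.0

case = [40.0, 12.0]  # values of two hypothetical independent variables
score = discriminant_score(case, coefficients, constant)
print(round(score, 2), predict_two_groups(score))
```

The zero cut point mirrors the conceptual description on the slide; it is not meant to reproduce how SPSS assigns cases when prior probabilities differ between groups.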
5. SW388R7
Number of functions
If the dependent variable defines two groups, one statistically
significant discriminant function is required to distinguish the
groups; if the dependent variable defines three groups, two
statistically significant discriminant functions are required to
distinguish among the three groups; etc.
If a discriminant function is able to distinguish among groups, it
must have a strong relationship to at least one of the
independent variables.
The number of possible discriminant functions in an analysis is
limited to the smaller of the number of independent variables
or one less than the number of groups defined by the
dependent variable.
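The limit on the number of functions is simple arithmetic. The helper below is a small sketch; its name is made up for illustration.

```python
def max_discriminant_functions(n_independent_vars, n_groups):
    """Smaller of the number of independent variables and (groups - 1)."""
    if n_independent_vars < 1 or n_groups < 2:
        raise ValueError("need at least one independent variable and two groups")
    return min(n_independent_vars, n_groups - 1)


# Two groups and four independent variables allow one function.
print(max_discriminant_functions(4, 2))
# Three groups and three independent variables allow two functions.
print(max_discriminant_functions(3, 3))
# A single independent variable allows one function, however many groups.
print(max_discriminant_functions(1, 4))
```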
6. SW388R7
Overall test of relationship
The overall test of relationship among the independent
variables and groups defined by the dependent variable is a
series of tests that each of the functions needed to distinguish
among the groups is statistically significant.
In some analyses, we might discover that two or more of the
groups defined by the dependent variable cannot be
distinguished using the available independent variables. While
it is reasonable to interpret a solution in which there are fewer
significant discriminant functions than the maximum number
possible, our problems will require that all of the possible
discriminant functions be significant.
7. SW388R7
Interpreting the relationship between
independent and dependent variables
The interpretative statement about the relationship between
the independent variable and the dependent variable is a
statement like: cases in group A tended to have higher scores
on variable X than cases in group B or group C.
This interpretation is complicated by the fact that the
relationship is not direct, but operates through the discriminant
function.
Dependent variable groups are distinguished by scores on
discriminant functions, not on values of independent variables.
The scores on functions are based on the values of the
independent variables that are multiplied by the function
coefficients.
8. SW388R7
Groups, functions, and variables
To interpret the relationship between an independent variable
and the dependent variable, we must first identify how the
discriminant functions separate the groups, and then identify the
role of the independent variable in each function.
SPSS provides a table called "Functions at Group Centroids"
(multivariate means) that indicates which groups are separated
by which functions.
SPSS provides another table called the "Structure Matrix" which,
like its counterpart in factor analysis, identifies the loading, or
correlation, between each independent variable and each
function. This tells us which variables to interpret for each
function. Each variable is interpreted on the function that it
loads most highly on.
9. SW388R7
Functions at Group Centroids
In order to specify the role that each independent
variable plays in predicting group membership on the
dependent variable, we must link together the
relationship between the discriminant functions and the
groups defined by the dependent variable, the role of
the significant independent variables in the
discriminant functions, and the differences in group
means for each of the variables.

Functions at Group Centroids
              Function
WELFARE       1        2
1           -.220     .235
2            .446    -.031
3           -.311    -.362
Unstandardized canonical discriminant functions evaluated at group means

Function 1 separates survey respondents who thought we spend
about the right amount of money on welfare (the positive value
of 0.446) from survey respondents who thought we spend too much
(negative value of -0.311) or too little money (negative value
of -0.220) on welfare.

Function 2 separates survey respondents who thought we spend
too little money on welfare (positive value of 0.235) from
survey respondents who thought we spend too much money
(negative value of -0.362) on welfare. We ignore the second
group (-0.031) in this comparison because it was distinguished
from the other two groups by function 1.
10. SW388R7
Structure Matrix
We do not interpret loadings in the structure matrix unless
they are 0.30 or higher in absolute value.

Based on the structure matrix, the predictor variables strongly
associated with discriminant function 1, which distinguished
between survey respondents who thought we spend about the right
amount of money on welfare and survey respondents who thought we
spend too much or too little money on welfare, were number of
hours worked in the past week (r=-0.582) and highest year of
school completed (r=0.687).

Structure Matrix
                                       Function
                                       1         2
HIGHEST YEAR OF SCHOOL COMPLETED      .687*     .136
NUMBER OF HOURS WORKED LAST WEEK     -.582*     .345
R SELF-EMP OR WORKS FOR SOMEBODY      .223      .889*
RESPONDENTS INCOME (a)                .101      .292*
Pooled within-groups correlations between discriminating
variables and standardized canonical discriminant functions
Variables ordered by absolute size of correlation within function.
*. Largest absolute correlation between each variable and
any discriminant function
a. This variable not used in the analysis.

Based on the structure matrix, the predictor variable strongly
associated with discriminant function 2, which distinguished
between survey respondents who thought we spend too little
money on welfare and survey respondents who thought we spend
too much money on welfare, was self-employment (r=0.889).
11. SW388R7
Group Statistics
Group Statistics
                                                                    Valid N (listwise)
WELFARE        Variable                           Mean   Std. Deviation   Unweighted   Weighted
1 TOO LITTLE   NUMBER OF HOURS WORKED LAST WEEK   43.96      13.240           56        56.000
               HIGHEST YEAR OF SCHOOL COMPLETED   13.73       2.401           56        56.000
               R SELF-EMP OR WORKS FOR SOMEBODY    1.93        .260           56        56.000
               RESPONDENTS INCOME                 13.70       5.034           56        56.000
2 ABOUT RIGHT  NUMBER OF HOURS WORKED LAST WEEK   37.90      13.235           50        50.000
               HIGHEST YEAR OF SCHOOL COMPLETED   14.78       2.558           50        50.000
               R SELF-EMP OR WORKS FOR SOMEBODY    1.90        .303           50        50.000
               RESPONDENTS INCOME                 14.00       5.503           50        50.000
3 TOO MUCH     NUMBER OF HOURS WORKED LAST WEEK   42.03      10.456           32        32.000
               HIGHEST YEAR OF SCHOOL COMPLETED   13.38       2.524           32        32.000
               R SELF-EMP OR WORKS FOR SOMEBODY    1.75        .440           32        32.000
               RESPONDENTS INCOME                 14.75       5.304           32        32.000
Total          NUMBER OF HOURS WORKED LAST WEEK   41.32      12.846          138       138.000
               HIGHEST YEAR OF SCHOOL COMPLETED   14.03       2.537          138       138.000
               R SELF-EMP OR WORKS

The average number of hours worked in the past week for survey
respondents who thought we spend about the right amount of money
on welfare (mean=37.90) was lower than the average number of
hours worked in the past week for survey respondents who thought
we spend too little money on welfare (mean=43.96) and survey
respondents who thought we spend too much money on welfare
(mean=42.03).

This enables us to make the statement: "survey respondents who
thought we spend about the right amount of money on welfare
worked fewer hours in the past week than survey respondents who
thought we spend too much or too little money on welfare."
12. SW388R7
Which independent variables to interpret
In a simultaneous discriminant analysis, in which all
independent variables are entered together, we only interpret
the relationships for independent variables that have a loading
of 0.30 or higher (in absolute value) on one or more discriminant functions. A
variable can have a high loading on more than one function,
which complicates the interpretation. We will interpret the
variable for the function on which it has the highest loading.
In a stepwise discriminant analysis, we limit the interpretation
of relationships between independent variables and groups
defined by the dependent variable to those independent
variables that met the statistical test for inclusion in the
analysis.
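The loading rule for a simultaneous analysis can be sketched as below, using the loadings from the welfare structure matrix on slide 10. The function name and the short variable labels are made up for illustration, and the cutoff is applied to the absolute value of the loading.

```python
LOADING_CUTOFF = 0.30


def variables_to_interpret(structure_matrix, cutoff=LOADING_CUTOFF):
    """Map each interpretable variable to the function it loads highest on.

    Variables whose largest absolute loading is below the cutoff are dropped.
    Functions are numbered from 1, as in the SPSS table.
    """
    assignment = {}
    for variable, loadings in structure_matrix.items():
        abs_loadings = [abs(x) for x in loadings]
        largest = max(abs_loadings)
        if largest >= cutoff:
            assignment[variable] = abs_loadings.index(largest) + 1
    return assignment


# Loadings on functions 1 and 2 from the welfare example (slide 10).
structure_matrix = {
    "educ": [0.687, 0.136],
    "hours_worked": [-0.582, 0.345],
    "self_employed": [0.223, 0.889],
    "income": [0.101, 0.292],
}

print(variables_to_interpret(structure_matrix))
```

Hours worked exceeds the cutoff on both functions but is assigned to function 1, where its loading is largest; income falls below the cutoff on both functions and is not interpreted.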
13. SW388R7
Discriminant analysis and classification
Discriminant analysis consists of two stages: in the first stage,
the discriminant functions are derived; in the second stage, the
discriminant functions are used to classify the cases.
While discriminant analysis does compute correlation measures
to estimate the strength of the relationship, these correlations
measure the relationship between the independent variables
and the discriminant scores.
A more useful measure to assess the utility of a discriminant
model is classification accuracy, which compares predicted
group membership based on the discriminant model to the
actual, known group membership which is the value for the
dependent variable.
14. SW388R7
Evaluating usefulness for discriminant models
The benchmark that we will use to characterize a discriminant
model as useful is a 25% improvement over the rate of accuracy
achievable by chance alone.
Even if the independent variables had no relationship to the
groups defined by the dependent variable, we would still
expect to be correct in our predictions of group membership
some percentage of the time. This is referred to as by chance
accuracy.
The estimate of by chance accuracy that we will use is the
proportional by chance accuracy rate, computed by squaring the
proportion of cases in each group and summing the results.
15. SW388R7
Comparing accuracy rates
To characterize our model as useful, we compare the cross-
validated accuracy rate produced by SPSS to 25% more than the
proportional by chance accuracy.
The cross-validated accuracy rate is a one-at-a-time hold out
method that classifies each case based on a discriminant
solution for all of the other cases in the analysis. It is a more
realistic estimate of the accuracy rate we should expect in the
population because discriminant analysis inflates accuracy rates
when the cases classified are the same cases used to derive the
discriminant functions.
Cross-validated accuracy rates are not produced by SPSS when
separate covariance matrices are used in the classification,
which we address more next week.
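The hold-out idea can be sketched without SPSS. The example below substitutes a deliberately simple one-variable nearest-group-mean classifier for the discriminant functions, so it illustrates only the leave-one-out procedure, not discriminant analysis itself. The classifier, the function names, and the six data points are all made up for illustration.

```python
def nearest_mean_predict(training_cases, value):
    """Predict the group whose training mean is closest to value."""
    sums, counts = {}, {}
    for x, group in training_cases:
        sums[group] = sums.get(group, 0.0) + x
        counts[group] = counts.get(group, 0) + 1
    return min(sums, key=lambda g: abs(value - sums[g] / counts[g]))


def resubstitution_accuracy(cases):
    """Classify every case with a rule derived from all cases."""
    hits = sum(nearest_mean_predict(cases, x) == g for x, g in cases)
    return hits / len(cases)


def leave_one_out_accuracy(cases):
    """Classify each case with a rule derived from all the other cases."""
    hits = 0
    for i, (x, group) in enumerate(cases):
        others = cases[:i] + cases[i + 1:]
        if nearest_mean_predict(others, x) == group:
            hits += 1
    return hits / len(cases)


# Hypothetical data: (score on one variable, known group).
cases = [(1.0, "A"), (2.0, "A"), (3.0, "A"),
         (4.0, "B"), (5.0, "B"), (13.0, "B")]

print(resubstitution_accuracy(cases), leave_one_out_accuracy(cases))
```

With these made-up points the hold-out accuracy comes out lower than the accuracy obtained by classifying the same cases that built the rule, which is the inflation the slide describes.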
16. SW388R7
Computing by chance accuracy
The proportion of cases in each group defined by the
dependent variable is reported in the table "Prior
Probabilities for Groups."

Prior Probabilities for Groups
                          Cases Used in Analysis
WELFARE          Prior    Unweighted    Weighted
1 TOO LITTLE      .406        56         56.000
2 ABOUT RIGHT     .362        50         50.000
3 TOO MUCH        .232        32         32.000
Total            1.000       138        138.000

The proportional by chance accuracy
rate was computed by squaring and
summing the proportion of cases in
each group from the table of prior
probabilities for groups (0.406² +
0.362² + 0.232² = 0.350).
A 25% increase over this would require
that our cross-validated accuracy be
43.7% (1.25 x 34.97% = 43.7%, using the
unrounded chance rate).
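The same arithmetic can be checked from the group sizes in the table of prior probabilities. This is a minimal sketch; the function name is made up for illustration.

```python
def proportional_by_chance_accuracy(group_sizes):
    """Sum of squared group proportions."""
    total = sum(group_sizes)
    return sum((n / total) ** 2 for n in group_sizes)


# Unweighted group sizes: too little, about right, too much.
group_sizes = [56, 50, 32]

chance = proportional_by_chance_accuracy(group_sizes)
criterion = 1.25 * chance  # 25% improvement over chance

print(f"by chance accuracy: {chance:.3f}")
print(f"criterion:          {criterion:.3f}")
```

Working from the counts rather than the rounded priors avoids the small rounding difference between 1.25 x 35.0% and the 43.7% reported on the slide.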
17. SW388R7
Comparing the cross-validated accuracy rate
Classification Results (b,c)
                                                 Predicted Group Membership
                          WELFARE            1 TOO LITTLE   2 ABOUT RIGHT   3 TOO MUCH   Total
Original            Count 1 TOO LITTLE            43              15             6          64
                          2 ABOUT RIGHT           26              30             6          62
                          3 TOO MUCH              17              10             9          36
                          Ungrouped cases          3               3             2           8
                    %     1 TOO LITTLE          67.2            23.4           9.4       100.0
                          2 ABOUT RIGHT         41.9            48.4           9.7       100.0
                          3 TOO MUCH            47.2            27.8          25.0       100.0
                          Ungrouped cases       37.5            37.5          25.0       100.0
Cross-validated (a) Count 1 TOO LITTLE            43              15             6          64
                          2 ABOUT RIGHT           26              30             6          62
                          3 TOO MUCH              17              11             8          36
                    %     1 TOO LITTLE          67.2            23.4           9.4       100.0
                          2 ABOUT RIGHT         41.9            48.4           9.7       100.0
                          3 TOO MUCH            47.2            30.6          22.2       100.0
a. Cross validation is done only for those cases in the analysis. In cross validation, each case is
classified by the functions derived from all cases other than that case.
b. 50.6% of original grouped cases correctly classified.
c. 50.0% of cross-validated grouped cases correctly classified.

SPSS reports the cross-validated accuracy rate in the footnotes
to the table "Classification Results." The cross-validated
accuracy rate computed by SPSS was 50.0%, which was greater than
or equal to the proportional by chance accuracy criterion of
43.7%.
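The two footnoted accuracy rates can be reproduced from the counts in the classification table: correct classifications sit on the diagonal. The sketch below uses a made-up function name and takes the 43.7% criterion from the by chance accuracy slide.

```python
def accuracy_rate(classification_counts):
    """Share of grouped cases on the diagonal (actual == predicted)."""
    correct = sum(row[i] for i, row in enumerate(classification_counts))
    total = sum(sum(row) for row in classification_counts)
    return correct / total


# Rows are actual groups, columns are predicted groups
# (too little, about right, too much); ungrouped cases are excluded.
original = [[43, 15, 6], [26, 30, 6], [17, 10, 9]]
cross_validated = [[43, 15, 6], [26, 30, 6], [17, 11, 8]]

criterion = 0.437  # 25% improvement over proportional by chance accuracy

print(f"original:        {accuracy_rate(original):.1%}")
print(f"cross-validated: {accuracy_rate(cross_validated):.1%}")
print("meets criterion:", accuracy_rate(cross_validated) >= criterion)
```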
18. SW388R7
Problem 1
1. In the dataset GSS2000.sav, is the following statement true, false, or an incorrect
application of a statistic? Assume that there is no problem with missing data, violation of
assumptions, or outliers. Use a level of significance of 0.05 for evaluating the statistical
relationship.
The variables "age" [age], "highest year of school completed" [educ], "sex" [sex], and "income"
[rincom98] are useful in distinguishing between groups based on responses to "seen x-rated
movie in last year" [xmovie]. These predictors differentiate survey respondents who had seen
an x-rated movie in the last year from survey respondents who had not seen an x-rated movie
in the last year.
Survey respondents who had seen an x-rated movie in the last year were younger than survey
respondents who had not seen an x-rated movie in the last year. Survey respondents who had
seen an x-rated movie in the last year were more likely to be male than survey respondents
who had not seen an x-rated movie in the last year.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
19. SW388R7
Dissecting problem 1 - 1
For these problems, we will assume that there is no problem
with missing data, violation of assumptions, or outliers.
In this problem, we are told to use 0.05 as alpha for the
discriminant analysis.
20. SW388R7
Dissecting problem 1 - 2
The variables listed first in the problem statement are the
independent variables (IVs): "age" [age], "highest year of school
completed" [educ], "sex" [sex], and "income" [rincom98].
The variable used to define groups is the dependent variable
(DV): "seen x-rated movie in last year" [xmovie].
When a problem states that a list of independent variables can
distinguish among groups, we do a discriminant analysis entering
all of the variables simultaneously.
21. SW388R7
Dissecting problem 1 - 3
The problem identifies two groups for the dependent variable:
• survey respondents who had seen an x-rated movie in the last year
• survey respondents who had not seen an x-rated movie in the last year
To distinguish among two groups, the analysis will be required
to find one statistically significant discriminant function.
22. SW388R7
Dissecting problem 1 - 4
The specific relationships listed in the problem indicate how
the independent variable relates to groups of the dependent
variable, i.e., the mean for age will be lower for respondents
who had seen an x-rated movie in the last year.
In order for the discriminant analysis to be true, we must have
enough statistically significant functions to distinguish among
the groups, the classification accuracy rate must be
substantially better than could be obtained by chance alone, and
each significant relationship must be interpreted correctly.
23. SW388R7
LEVEL OF MEASUREMENT - 1
Discriminant analysis requires that the dependent variable be
non-metric and the independent variables be metric or
dichotomous.
"Seen x-rated movie in last year" [xmovie] is a dichotomous
variable, which satisfies the level of measurement requirement.
It contains two categories: survey respondents who had seen an
x-rated movie in the last year and survey respondents who had
not seen an x-rated movie in the last year.
24. SW388R7
LEVEL OF MEASUREMENT - 2
"Age" [age] and "highest year of school completed" [educ] are
interval level variables, which satisfies the level of
measurement requirements for discriminant analysis.
"Income" [rincom98] is an ordinal level variable. If we follow
the convention of treating ordinal level variables as metric
variables, the level of measurement requirement for discriminant
analysis is satisfied. Since some data analysts do not agree
with this convention, a note of caution should be included in
our interpretation.
"Sex" [sex] is a dichotomous or dummy-coded nominal variable
which may be included in discriminant analysis.
25. SW388R7
Request simultaneous discriminant analysis
Select the Classify |
Discriminant… command
from the Analyze menu.
26. SW388R7
Selecting the dependent variable
First, highlight the
dependent variable
xmovie in the list
of variables.
Second, click on the right
arrow button to move the
dependent variable to the
Grouping Variable text box.
27. SW388R7
Defining the group values
When SPSS moves the dependent variable to the
Grouping Variable text box, it puts two question marks in
parentheses after the variable name. This is a reminder
that we have to enter the numbers that represent the
groups we want to include in the analysis.
First, to specify the
group numbers, click
on the Define Range…
button.
28. SW388R7
Completing the range of group values
The value labels for xmovie show
two categories:
1 = YES
2 = NO
The range of values that we need
to enter goes from 1 as the
minimum and 2 as the maximum.
First, type in 1 in
the Minimum text
box.
Second, type in
2 in the
Maximum text
box.
Third, click on the
Continue button to
close the dialog box.
29. SW388R7
Selecting the independent variables
Move the independent
variables listed in the
problem to the
Independents list box.
30. SW388R7
Specifying the method for including variables
SPSS provides us with two methods for including
variables: to enter all of the independent variables
at one time, and a stepwise method for selecting
variables using a statistical test to determine the
order in which variables are included.
Since the problem
states that there is a
relationship without
requesting the best
predictors, we accept
the default to Enter
independents together.
31. SW388R7
Requesting statistics for the output
Click on the Statistics…
button to select statistics
we will need for the
analysis.
32. SW388R7
Specifying statistical output
First, mark the Means
checkbox on the Descriptives
panel. We will use the group
means in our interpretation.
Second, mark the Univariate
ANOVAs checkbox on the
Descriptives panel. Perusing
these tests suggests which
variables might be useful
discriminators.
Third, mark the Box’s M
checkbox. Box’s M statistic
evaluates conformity to the
assumption of homogeneity of
group variances.
Fourth, click on the
Continue button to
close the dialog box.
33. SW388R7
Specifying details for classification
Click on the Classify…
button to specify details for
the classification phase of
the analysis.
34. SW388R7
Details for classification - 1
First, mark the option button to Compute from
group sizes on the Prior Probabilities panel.
This incorporates the size of the groups defined
by the dependent variable into the classification
of cases using the discriminant functions.
Second, mark the
Casewise results
checkbox on the
Display panel to
include
classification details
for each case in the
output.
Third, mark the Summary
table checkbox to include
summary tables
comparing actual and
predicted classification.
35. SW388R7
Details for classification - 2
Fourth, mark the Leave-one-out
classification checkbox to request SPSS to
include a cross-validated classification in
the output. This option produces a less
biased estimate of classification accuracy
by sequentially holding each case out of
the calculations for the discriminant
functions, and using the derived functions
to classify the case held out.
36. SW388R7
Details for classification - 3
Fifth, accept the default Within-groups
option button on the Use Covariance Matrix
panel. The covariance matrices are the
measure of the dispersion in the groups
defined by the dependent variable. If we
fail the homogeneity of group variances
test (Box’s M), our option is to use Separate-
groups covariance in classification.
Sixth, mark the Combined-
groups checkbox on the Plots
panel to obtain a visual plot of
the relationship between
functions and groups defined
by the dependent variable.
Seventh, click
on the Continue
button to close
the dialog box.
37. SW388R7
Completing the discriminant analysis request
Click on the OK
button to request the
output for the
discriminant analysis.
38. SW388R7
Sample size – ratio of cases to variables
Analysis Case Processing Summary
Unweighted Cases                                            N    Percent
Valid                                                      119     44.1
Excluded  Missing or out-of-range group codes               49     18.1
          At least one missing discriminating variable      66     24.4
          Both missing or out-of-range group codes
          and at least one missing discriminating variable  36     13.3
          Total                                            151     55.9
Total                                                      270    100.0

The minimum ratio of valid cases to independent
variables for discriminant analysis is 5 to 1, with a
preferred ratio of 20 to 1. In this analysis, there are 119
valid cases and 4 independent variables. The ratio of cases
to independent variables is 29.75 to 1, which satisfies the
minimum requirement. In addition, the ratio of 29.75 to
1 satisfies the preferred ratio of 20 to 1.
39. SW388R7
Sample size – minimum group size
Prior Probabilities for Groups
                   Cases Used in Analysis
XMOVIE    Prior    Unweighted    Weighted
1          .311        37         37.000
2          .689        82         82.000
Total     1.000       119        119.000

In addition to the requirement for the ratio of cases to
independent variables, discriminant analysis requires that
there be a minimum number of cases in the smallest group
defined by the dependent variable. The number of cases in the
smallest group must be larger than the number of independent
variables, and the group should preferably contain 20 or more
cases.
The number of cases in the smallest group in this problem is
37, which is larger than the number of independent variables
(4), satisfying the minimum requirement. In addition, the
number of cases in the smallest group satisfies the preferred
minimum of 20 cases.
If the sample size did not
initially satisfy the minimum
requirements, discriminant
analysis is not appropriate.
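Both sample size checks reduce to a few comparisons. The sketch below applies the thresholds stated on these slides (5 to 1 and 20 to 1 for the ratio; more cases than independent variables and 20 cases for the smallest group) to the numbers from this problem. The function name and dictionary keys are made up for illustration.

```python
def check_sample_size(valid_cases, n_independent_vars, smallest_group):
    """Apply the ratio and smallest-group rules of thumb from the slides."""
    ratio = valid_cases / n_independent_vars
    return {
        "ratio": ratio,
        "meets_minimum_ratio": ratio >= 5,
        "meets_preferred_ratio": ratio >= 20,
        "meets_minimum_group_size": smallest_group > n_independent_vars,
        "meets_preferred_group_size": smallest_group >= 20,
    }


# 119 valid cases, 4 independent variables, smallest group of 37 cases.
result = check_sample_size(valid_cases=119, n_independent_vars=4, smallest_group=37)
for name, value in result.items():
    print(f"{name}: {value}")
```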
40. SW388R7
NUMBER OF DISCRIMINANT FUNCTIONS - 1
The maximum possible number of discriminant
functions is the smaller of one less than the
number of groups defined by the dependent
variable and the number of independent
variables.
In this analysis there were 2 groups defined by
seen x-rated movie in last year and 4
independent variables, so the maximum
possible number of discriminant functions was
1.
41. SW388R7
NUMBER OF DISCRIMINANT FUNCTIONS - 2
In the table of Wilks' Lambda,
which tested functions for
statistical significance, the direct
analysis identified 1 discriminant
function that was statistically
significant. The Wilks' lambda
statistic for the test of function 1
(chi-square=24.159) had a
probability of <0.001, which was
less than or equal to the level
of significance of 0.05. The
significance of the maximum
possible number of discriminant
functions supports the
interpretation of a solution using
1 discriminant function.
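The reported probability can be sanity-checked from the chi-square value. The slide does not show the degrees of freedom, so the sketch below assumes df = 4 (number of independent variables times one less than the number of groups, the usual degrees of freedom for the test of the first function). It uses the closed-form upper-tail probability that holds for even degrees of freedom; the function name is made up for illustration.

```python
import math


def chi_square_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of df.

    For df = 2m the tail equals exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


chi_square = 24.159
assumed_df = 4  # assumption: 4 independent variables x (2 groups - 1)

p_value = chi_square_upper_tail_even_df(chi_square, assumed_df)
print(f"p = {p_value:.6f}", "significant" if p_value <= 0.05 else "not significant")
```

Under that assumption the probability falls well below 0.001, consistent with the value SPSS reports.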
42. SW388R7
Independent variables and group membership:
relationship of functions to groups
In order to specify the role that each independent
variable plays in predicting group membership on the
dependent variable, we must link together the
relationship between the discriminant functions and the
groups defined by the dependent variable, the role of
the significant independent variables in the
discriminant functions, and the differences in group
means for each of the variables.
Each function divides the groups into two
subgroups by assigning negative values to
one subgroup and positive values to the
other subgroup. Function 1 separates
survey respondents who had seen an x-rated
movie in the last year (-.714) from
survey respondents who had not seen an
x-rated movie in the last year (.322).
Functions at Group Centroids
          Function
XMOVIE    1
1         -.714
2          .322
Unstandardized canonical discriminant
functions evaluated at group means
43. SW388R7
Independent variables and group membership:
predictor loadings on functions
Data Analysis &
Computers II
Slide 43
We do not interpret loadings in the structure
matrix unless they are 0.30 or higher.
Based on the structure matrix, the
predictor variables strongly associated with
discriminant function 1 which distinguished
between survey respondents who had seen
an x-rated movie in the last year and
survey respondents who had not seen an
x-rated movie in the last year were age
(r=0.467) and sex (r=0.770).
Structure Matrix
            Function
            1
SEX         .770
AGE         .467
EDUC        .118
RINCOM98    .044
Pooled within-groups correlations between discriminating
variables and standardized canonical discriminant functions
Variables ordered by absolute size of correlation within function.
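The 0.30 rule for structure matrix loadings can be expressed as a simple filter. The loadings below are copied from the Structure Matrix table; the dictionary layout and variable names are illustrative assumptions.

```python
# Loadings on function 1, copied from the Structure Matrix table.
structure_matrix = {"SEX": 0.770, "AGE": 0.467, "EDUC": 0.118, "RINCOM98": 0.044}
threshold = 0.30  # loadings below this are not interpreted

# Keep only the predictors whose absolute loading reaches the threshold.
interpretable = {
    name: loading
    for name, loading in structure_matrix.items()
    if abs(loading) >= threshold
}
print(sorted(interpretable))
```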
44. SW388R7
Independent variables and group membership:
predictors associated with first function - 1
Data Analysis &
Computers II
Slide 44
Group Statistics
                                              Valid N (listwise)
XMOVIE             Mean     Std. Deviation    Unweighted   Weighted
1      AGE         37.24    10.838             37           37.000
       EDUC        13.86     2.720             37           37.000
       SEX          1.27      .450             37           37.000
       RINCOM98    13.76     5.209             37           37.000
2      AGE         42.70    11.461             82           82.000
       EDUC        14.18     2.534             82           82.000
       SEX          1.65      .481             82           82.000
       RINCOM98    14.00     5.308             82           82.000
Total  AGE         41.00    11.508            119          119.000
       EDUC        14.08     2.586            119          119.000
       SEX          1.53      .501            119          119.000
       RINCOM98    13.92     5.256            119          119.000
The average age for survey
respondents who had seen an
x-rated movie in the last year
(mean=37.24) was lower than the
average age for survey
respondents who had not seen an
x-rated movie in the last year
(mean=42.70).
This supports the relationship that
"survey respondents who had seen
an x-rated movie in the last year
were younger than survey
respondents who had not seen an
x-rated movie in the last year."
45. SW388R7
Independent variables and group membership:
predictors associated with first function - 2
Data Analysis &
Computers II
Slide 45
Group Statistics
                                              Valid N (listwise)
XMOVIE             Mean     Std. Deviation    Unweighted   Weighted
1      AGE         37.24    10.838             37           37.000
       EDUC        13.86     2.720             37           37.000
       SEX          1.27      .450             37           37.000
       RINCOM98    13.76     5.209             37           37.000
2      AGE         42.70    11.461             82           82.000
       EDUC        14.18     2.534             82           82.000
       SEX          1.65      .481             82           82.000
       RINCOM98    14.00     5.308             82           82.000
Total  AGE         41.00    11.508            119          119.000
       EDUC        14.08     2.586            119          119.000
       SEX          1.53      .501            119          119.000
       RINCOM98    13.92     5.256            119          119.000
Since sex is a dichotomous variable,
the mean is not directly interpretable.
Its interpretation must take into
account the coding by which 1
corresponds to male and 2
corresponds to female. The lower
mean for survey respondents who
had seen an x-rated movie in the last
year (mean=1.27), when compared
to the mean for survey respondents
who had not seen an x-rated movie in
the last year (mean=1.65), implies
that the group contained more survey
respondents who were male and
fewer survey respondents who were
female.
This supports the relationship that
"survey respondents who had seen
an x-rated movie in the last year
were more likely to be male than
survey respondents who had not seen
an x-rated movie in the last year."
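One way to read the mean of a variable coded 1 and 2 is to convert it to a proportion: with 1 = male and 2 = female, the mean minus 1 is the share of cases coded 2. The sketch below applies that arithmetic to the two group means on the slide; the helper name share_coded_two is hypothetical.

```python
def share_coded_two(mean_of_codes):
    """For a variable coded 1 or 2, the mean minus 1 is the share coded 2."""
    return mean_of_codes - 1.0

# Group means for SEX (1 = male, 2 = female) from the Group Statistics table.
female_share_seen = share_coded_two(1.27)       # had seen an x-rated movie
female_share_not_seen = share_coded_two(1.65)   # had not seen an x-rated movie

male_share_seen = 1.0 - female_share_seen
male_share_not_seen = 1.0 - female_share_not_seen

# The group with the lower mean has the larger share of males.
print(round(male_share_seen, 2), round(male_share_not_seen, 2))
```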
46. SW388R7
CLASSIFICATION USING THE DISCRIMINANT MODEL:
by chance accuracy rate
Data Analysis &
Computers II
Slide 46
The independent variables could be characterized as useful
predictors of membership in the groups defined by the
dependent variable if the cross-validated classification
accuracy rate was significantly higher than the accuracy
attainable by chance alone. Operationally, the cross-validated
classification accuracy rate should be at least 25% higher
than the proportional by chance accuracy rate.
The proportional by chance accuracy rate was computed by
squaring and summing the proportion of cases in each group
from the table of prior probabilities for groups (0.311² +
0.689² = 0.571).
Prior Probabilities for Groups
Cases Used in Analysis
XMOVIE Prior Unweighted Weighted
1 .311 37 37.000
2 .689 82 82.000
Total 1.000 119 119.000
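The by chance accuracy computation on this slide can be written out as follows. The priors are the rounded values from the table, and the 1.25 multiplier is the "25% higher" criterion stated above; the variable names are illustrative.

```python
# Prior probabilities from the Prior Probabilities for Groups table.
priors = {1: 0.311, 2: 0.689}

# Proportional by chance accuracy: square each proportion and sum.
chance_accuracy = sum(p ** 2 for p in priors.values())

# Criterion: 25% higher than the by chance accuracy rate.
criterion = 1.25 * chance_accuracy

print(round(chance_accuracy, 3), round(criterion, 3))
```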
47. SW388R7
CLASSIFICATION USING THE DISCRIMINANT MODEL:
criteria for classification accuracy
Data Analysis &
Computers II
Slide 47
Classification Results (b,c)
                                                 Predicted Group Membership
                           XMOVIE                1        2        Total
Original             Count 1                     15       22       37
                           2                     12       70       82
                           Ungrouped cases       13       36       49
                     %     1                     40.5     59.5     100.0
                           2                     14.6     85.4     100.0
                           Ungrouped cases       26.5     73.5     100.0
Cross-validated (a)  Count 1                     15       22       37
                           2                     12       70       82
                     %     1                     40.5     59.5     100.0
                           2                     14.6     85.4     100.0
a. Cross validation is done only for those cases in the analysis. In cross
validation, each case is classified by the functions derived from all cases other
than that case.
b. 71.4% of original grouped cases correctly classified.
c. 71.4% of cross-validated grouped cases correctly classified.
The cross-validated accuracy rate computed by SPSS
was 71.4%, which was greater than or equal to the
proportional by chance accuracy criterion of 71.4%
(1.25 x 57.1% = 71.4%).
The criterion for classification accuracy is satisfied.
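The accuracy rate reported by SPSS can be recomputed from the cross-validated counts in the Classification Results table. The sketch below compares percentages rounded to one decimal place, as the slide does; the two values are close enough that the outcome of the comparison depends on that rounding. The nested dictionary layout is an illustrative assumption.

```python
# Cross-validated counts: actual group -> predicted group -> number of cases.
cross_validated_counts = {
    1: {1: 15, 2: 22},
    2: {1: 12, 2: 70},
}

# Cases on the diagonal were classified into their actual group.
correct = sum(row[group] for group, row in cross_validated_counts.items())
total = sum(sum(row.values()) for row in cross_validated_counts.values())

# Percentages rounded to one decimal, matching the precision on the slide.
accuracy_pct = round(100 * correct / total, 1)
criterion_pct = round(100 * 1.25 * (0.311 ** 2 + 0.689 ** 2), 1)

meets_criterion = accuracy_pct >= criterion_pct
print(accuracy_pct, criterion_pct, meets_criterion)
```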
48. SW388R7
Answering the question in problem 1 - 1
Data Analysis &
Computers II
Slide 48
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application
of a statistic? Assume that there is no problem with missing data, violation of assumptions, or
outliers. Use a level of significance of 0.05 for evaluating the statistical relationship.
The variables "age" [age], "highest year of school completed" [educ], "sex" [sex], and "income"
[rincom98] are useful in distinguishing between groups based on responses to "seen x-rated
movie in last year" [xmovie]. These predictors differentiate survey respondents who had
seen an x-rated movie in the last year from survey respondents who had not seen an x-
rated movie in the last year.
Survey respondents who had seen an x-rated movie in the last year were younger than survey
respondents who had not seen an x-rated movie in the last year. Survey respondents who had
seen an x-rated movie in the last year were more likely to be male than survey respondents
who had not seen an x-rated movie in the last year.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
We found one statistically significant
discriminant function, making it possible to
distinguish among the two groups defined by
the dependent variable.
Moreover, the cross-validated classification
accuracy surpassed the by chance accuracy
criteria, supporting the utility of the model.
49. SW388R7
Answering the question in problem 1 - 2
Data Analysis &
Computers II
Slide 49
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application
of a statistic? Assume that there is no problem with missing data, violation of assumptions, or
outliers. Use a level of significance of 0.05 for evaluating the statistical relationship.
The variables "age" [age], "highest year of school completed" [educ], "sex" [sex], and "income"
[rincom98] are useful in distinguishing between groups based on responses to "seen x-rated
movie in last year" [xmovie]. These predictors differentiate survey respondents who had seen
an x-rated movie in the last year from survey respondents who had not seen an x-rated movie
in the last year.
We verified that each statement
about the relationship between
predictors and groups was correct.
Survey respondents who had seen an x-rated movie in the last year were younger than
survey respondents who had not seen an x-rated movie in the last year. Survey respondents
who had seen an x-rated movie in the last year were more likely to be male than survey
respondents who had not seen an x-rated movie in the last year.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
The answer to the question is true
with caution.
A caution is added because of the
inclusion of ordinal level variables.
50. SW388R7
Problem 2
Data Analysis &
Computers II
Slide 50
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a
statistic? Assume that there is no problem with missing data, violation of assumptions, or outliers.
Use a level of significance of 0.05 for evaluating the statistical relationship.
From the list of variables "respondent's degree of religious fundamentalism" [fund], "frequency of
prayer" [pray], and "frequency of attendance at religious services" [attend], the most useful
predictor for distinguishing between groups based on responses to "attitude toward abortion when
there is a strong chance of serious defect in the baby" [abdefect] is "frequency of prayer" [pray].
These predictors differentiate survey respondents who thought it should be possible for a woman
to obtain a legal abortion if there is a strong chance of a serious defect in the baby from survey
respondents who didn't think it should be possible for a woman to obtain a legal abortion if there
is a strong chance of a serious defect in the baby.
The most important predictor of groups based on responses to attitude toward abortion when
there is a strong chance of serious defect in the baby was frequency of prayer.
Survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if
there is a strong chance of a serious defect in the baby prayed more often than survey
respondents who thought it should be possible for a woman to obtain a legal abortion if there is a
strong chance of a serious defect in the baby.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
51. SW388R7
Dissecting problem 2 - 1
Data Analysis &
Computers II
Slide 51
The variables listed first in the problem
statement are the independent variables
(IVs): "respondent's degree of religious
fundamentalism" [fund], "frequency of
prayer" [pray], and "frequency of
attendance at religious services" [attend].
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a
statistic? Assume that there is no problem with missing data, violation of assumptions, or outliers.
Use a level of significance of 0.05 for evaluating the statistical relationship.
From the list of variables "respondent's degree of religious fundamentalism" [fund],
"frequency of prayer" [pray], and "frequency of attendance at religious services" [attend], the
most useful predictor for distinguishing between groups based on responses to "attitude toward
abortion when there is a strong chance of serious defect in the baby" [abdefect] is "frequency
of prayer" [pray]. These predictors differentiate survey respondents who thought it should be
possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in
the baby from survey respondents who didn't think it should be possible for a woman to obtain a
legal abortion if there is a strong chance of a serious defect in the baby.
The most important predictor of groups based on responses to attitude toward abortion when
there is a strong chance of serious defect in the baby was frequency of prayer.
The variable used to define
groups is the dependent
variable (DV): "attitude toward
abortion when there is a
strong chance of serious
defect in the baby" [abdefect]
When a problem asks us
to identify the best or
most useful predictors
from a list of
independent variables,
we do stepwise
discriminant analysis.
52. SW388R7
Dissecting problem 2 - 2
Data Analysis &
Computers II
Slide 52
The problem identifies two groups for the dependent variable:
•survey respondents who thought it should be possible for a woman
to obtain a legal abortion if there is a strong chance of a serious
defect in the baby
•survey respondents who didn't think it should be possible for a
woman to obtain a legal abortion if there is a strong chance of a
serious defect in the baby.
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a
statistic? Assume that there is no problem with missing data, violation of assumptions, or outliers.
Use a level of significance of 0.05 for evaluating the statistical relationship.
To distinguish among two groups, the analysis will be required to find one
statistically significant discriminant function.
From the list of variables "respondent's degree of religious fundamentalism" [fund], "frequency of
prayer" [pray], and "frequency of attendance at religious services" [attend], the most useful
predictor for distinguishing between groups based on responses to "attitude toward abortion when
there is a strong chance of serious defect in the baby" [abdefect] is "frequency of prayer" [pray].
These predictors differentiate survey respondents who thought it should be possible for a
woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby
from survey respondents who didn't think it should be possible for a woman to obtain a legal
abortion if there is a strong chance of a serious defect in the baby.
The most important predictor of groups based on responses to attitude toward abortion when
there is a strong chance of serious defect in the baby was frequency of prayer.
The importance of predictors is
based upon the stepwise
addition of variables to the
analysis.
53. SW388R7
Dissecting problem 2 - 3
Data Analysis &
Computers II
Slide 53
From the list of variables "respondent's degree of religious fundamentalism" [fund], "frequency of
prayer" [pray], and "frequency of attendance at religious services" [attend], the most useful
predictor for distinguishing between groups based on responses to "attitude toward abortion when
there is a strong chance of serious defect in the baby" [abdefect] is "frequency of prayer" [pray].
These predictors differentiate survey respondents who thought it should be possible for a woman
to obtain a legal abortion if there is a strong chance of a serious defect in the baby from survey
respondents who didn't think it should be possible for a woman to obtain a legal abortion if there
is a strong chance of a serious defect in the baby.
The specific relationships listed in the problem indicate how the
independent variable relates to groups of the dependent variable, i.e.,
the mean for frequency of prayer will be lower for respondents who
thought it should be possible for a woman to obtain a legal abortion if
there is a strong chance of a serious defect in the baby compared to
survey respondents who didn't think it should be possible for a
woman to obtain a legal abortion if there is a strong chance of a
serious defect in the baby.
The most important predictor of groups based on responses to attitude toward abortion when
there is a strong chance of serious defect in the baby was frequency of prayer.
Survey respondents who didn't think it should be possible for a woman to obtain a legal
abortion if there is a strong chance of a serious defect in the baby prayed more often than
survey respondents who thought it should be possible for a woman to obtain a legal abortion if
there is a strong chance of a serious defect in the baby.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
In a stepwise analysis, we only
interpret the independent
variables that are entered in
the stepwise analysis.
In order for a stepwise analysis to be
true, we must have enough statistically
significant functions to distinguish among
the groups, the order of entry must be
correct, and each significant relationship
must be interpreted correctly.
54. SW388R7
LEVEL OF MEASUREMENT - 1
Data Analysis &
Computers II
Slide 54
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application
of a statistic? Assume that there is no problem with missing data, violation of assumptions, or
outliers. Use a level of significance of 0.05 for evaluating the statistical relationship.
From the list of variables "respondent's degree of religious fundamentalism" [fund], "frequency
of prayer" [pray], and "frequency of attendance at religious services" [attend], the most useful
predictor for distinguishing between groups based on responses to "attitude toward abortion
when there is a strong chance of serious defect in the baby" [abdefect] is "frequency of prayer"
[pray]. These predictors differentiate survey respondents who thought it should be possible for
a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby
from survey respondents who didn't think it should be possible for a woman to obtain a legal
abortion if there is a strong chance of a serious defect in the baby.
The most important predictor of groups based on responses to attitude toward abortion when
there is a strong chance of serious defect in the baby was frequency of prayer.
Survey respondents who didn't think it should be possible for a woman to obtain a legal
abortion if there is a strong chance of a serious defect in the baby prayed more often than
survey respondents who thought it should be possible for a woman to obtain a legal abortion if
there is a strong chance of a serious defect in the baby.
Discriminant analysis requires that the
dependent variable be non-metric and the
independent variables be metric or dichotomous.
"Attitude toward abortion when there is a strong
chance of serious defect in the baby" [abdefect]
is a nominal level variable, which satisfies the
level of measurement requirement.
55. SW388R7
LEVEL OF MEASUREMENT - 2
Data Analysis &
Computers II
Slide 55
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application
of a statistic? Assume that there is no problem with missing data, violation of assumptions, or
outliers. Use a level of significance of 0.05 for evaluating the statistical relationship.
From the list of variables "respondent's degree of religious fundamentalism" [fund], "frequency
of prayer" [pray], and "frequency of attendance at religious services" [attend], the most useful
predictor for distinguishing between groups based on responses to "attitude toward abortion
when there is a strong chance of serious defect in the baby" [abdefect] is "frequency of prayer"
[pray]. These predictors differentiate survey respondents who thought it should be possible for
a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby
from survey respondents who didn't think it should be possible for a woman to obtain a legal
abortion if there is a strong chance of a serious defect in the baby.
The most important predictor of groups based on responses to attitude toward abortion when
there is a strong chance of serious defect in the baby was frequency of prayer.
Survey respondents who didn't think it should be possible for a woman to obtain a legal
abortion if there is a strong chance of a serious defect in the baby prayed more often than
survey respondents who thought it should be possible for a woman to obtain a legal abortion if
there is a strong chance of a serious defect in the baby.
"Respondent's degree of religious
fundamentalism" [fund], "frequency of
prayer" [pray], and "frequency of
attendance at religious services"
[attend] are ordinal level variables. If
we follow the convention of treating
ordinal level variables as metric
variables, the level of measurement
requirement for discriminant analysis is
satisfied. Since some data analysts do
not agree with this convention, a note
of caution should be included in our
interpretation.
56. SW388R7
Request stepwise discriminant analysis
Data Analysis &
Computers II
Slide 56
Select the Classify |
Discriminant… command
from the Analyze menu.
57. SW388R7
Selecting the dependent variable
Data Analysis &
Computers II
Slide 57
First, highlight the
dependent variable
abdefect in the list
of variables.
Second, click on the right
arrow button to move the
dependent variable to the
Grouping Variable text box.
58. SW388R7
Defining the group values
Data Analysis &
Computers II
Slide 58
When SPSS moves the dependent variable to the
Grouping Variable text box, it puts two question marks in
parentheses after the variable name. This is a reminder
that we have to enter the numbers that represent the
groups we want to include in the analysis.
First, to specify the
group numbers, click
on the Define Range…
button.
59. SW388R7
Completing the range of group values
Data Analysis &
Computers II
Slide 59
The value labels for abdefect
show two categories:
1 = YES
2 = NO
The range of values that we need
to enter goes from 1 as the
minimum to 2 as the maximum.
First, type in 1 in
the Minimum text
box.
Second, type in
2 in the
Maximum text
box.
Third, click on the
Continue button to
close the dialog box.
60. SW388R7
Selecting the independent variables
Data Analysis &
Computers II
Slide 60
Move the independent
variables listed in the
problem to the
Independents list box.
61. SW388R7
Specifying the method for including variables
Data Analysis &
Computers II
Slide 61
SPSS provides us with two methods for including
variables: to enter all of the independent variables
at one time, and a stepwise method for selecting
variables using a statistical test to determine the
order in which variables are included.
Since the problem calls
for identifying the best
predictors, we click on
the option button to
Use stepwise method.
62. SW388R7
Requesting statistics for the output
Data Analysis &
Computers II
Slide 62
Click on the Statistics…
button to select statistics
we will need for the
analysis.
63. SW388R7
Specifying statistical output
Data Analysis &
Computers II
Slide 63
First, mark the Means
checkbox on the Descriptives
panel. We will use the group
means in our interpretation.
Second, mark the Univariate
ANOVAs checkbox on the
Descriptives panel. Perusing
these tests suggests which
variables might be useful
discriminators.
Third, mark the Box’s M
checkbox. Box’s M statistic
evaluates conformity to the
assumption of homogeneity of
group variances.
Fourth, click on the
Continue button to
close the dialog box.
64. SW388R7
Specifying details for the stepwise method
Data Analysis &
Computers II
Slide 64
Click on the Method…
button to specify the
statistical criteria to
use for including variables.