Slide 1
Stepwise Multiple Regression
Slide 2
Different Methods for Entering Variables in Multiple Regression
 Different types of multiple regression are distinguished by the method for entering the
independent variables into the analysis.
 In standard (or simultaneous) multiple regression, all of the independent variables are
entered into the analysis at the same time.
 In hierarchical (or sequential) multiple regression, the independent variables are entered in
an order prescribed by the analyst.
 In stepwise (or statistical) multiple regression, the independent variables are entered
according to their statistical contribution in explaining the variance in the dependent
variable.
 No matter what method of entry is chosen, a multiple regression that includes the same
independent variables and the same dependent variable will produce the same multiple
regression equation.
 The number of cases required for stepwise regression is greater than the number for the
other forms. We will use the norm of 40 cases for each independent variable.
Slide 3
Purpose of Stepwise Multiple Regression
 Stepwise regression is designed to find the most parsimonious set of predictors that are
most effective in predicting the dependent variable.
 Variables are added to the regression equation one at a time, using the statistical criterion
of maximizing the R² of the included variables.
 After each variable is entered, each of the included variables is tested to see if the model
would be better off if it were excluded. This does not happen often.
 The process of adding more variables stops when all of the available variables have been
included or when it is not possible to make a statistically significant improvement in R²
using any of the variables not yet included.
 Since variables will not be added to the regression equation unless they make a statistically
significant addition to the analysis, all of the independent variables selected for inclusion
will have a statistically significant relationship to the dependent variable.
 An example of how SPSS does stepwise regression is shown below.
Slide 4
Stepwise Multiple Regression in SPSS
 Each time SPSS includes or removes a variable from the analysis, SPSS considers it a new
step or model, i.e. there will be one model and result for each variable included in the
analysis.
 SPSS provides a table of variables included in the analysis and a table of variables
excluded from the analysis. It is possible that none of the variables will be included. It is
possible that all of the variables will be included.
 The order of entry of the variables can be used as a measure of relative importance.
 Once a variable is included, its interpretation in stepwise regression is the same as it would
be using other methods for including regression variables.
Slide 5
Pros and Cons of Stepwise Regression
 Stepwise multiple regression can be used when the goal is to produce a predictive model
that is parsimonious and accurate because it excludes variables that do not contribute to
explaining differences in the dependent variable.
 Stepwise multiple regression is less useful for testing hypotheses about statistical
relationships. It is widely regarded as atheoretical and its usage is not recommended.
 Stepwise multiple regression can be useful in finding relationships that have not been
tested before. Its findings invite one to speculate on why an unusual relationship makes
sense.
 It is not legitimate to do a stepwise multiple regression and present the results as though
one were testing a hypothesis that included the variables found to be significant in the
stepwise regression.
 Using statistical criteria to determine relationships is vulnerable to over-fitting the data set
used to develop the model at the expense of generalizability.
 When stepwise regression is used, some form of validation analysis is a necessity. We will
use 75/25% cross-validation.
Slide 6
75/25% Cross-validation
 To do cross validation, we randomly split the data set into a 75% training sample and a
25% validation sample. We will use the training sample to develop the model, and we test
its effectiveness on the validation sample to test the applicability of the model to cases not
used to develop it.
 In order to be successful, the following two questions must be answered affirmatively:
 Did the stepwise regression of the training sample produce the same subset of
predictors produced by the regression model of the full data set?
 If yes, compare the R² for the 25% validation sample to the R² for the 75% training
sample. If the shrinkage (R² for the 75% training sample - R² for the 25% validation
sample) is 2% (0.02) or less, we conclude that validation was successful.
 Note: shrinkage may be a negative value, indicating that the accuracy rate for the
validation sample is larger than the accuracy rate for the training sample. Negative
shrinkage (increase in accuracy) is evidence of a successful validation analysis.
 If the validation is successful, we base our interpretation on the model that included all
cases.
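The two-question check above can be sketched in Python. This is only an illustration of the rule stated on the slide: the function names and the example R² values are hypothetical, and the small tolerance is an assumption added to guard against floating-point rounding.

```python
def shrinkage(r2_training: float, r2_validation: float) -> float:
    """Shrinkage = R² of the 75% training sample minus R² of the 25% validation sample."""
    return r2_training - r2_validation


def validation_successful(same_predictors: bool,
                          r2_training: float,
                          r2_validation: float,
                          max_shrinkage: float = 0.02) -> bool:
    """Apply both questions: same predictor subset, then shrinkage of 0.02 or less."""
    if not same_predictors:
        return False
    # Negative shrinkage (validation R² larger than training R²) also passes.
    return shrinkage(r2_training, r2_validation) <= max_shrinkage + 1e-9


# Hypothetical values, not taken from the SPSS output in these slides.
print(validation_successful(True, 0.150, 0.135))   # shrinkage 0.015
print(validation_successful(True, 0.150, 0.100))   # shrinkage 0.050
```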
Slide 7
Correlations between dependent variable and independent variables
[Diagrams: DV overlapping IV1; DV overlapping IV2]
DV and IV1 are
correlated at r = .70. The area of
overlap is r² = .49.
We have two independent variables, IV1
and IV2, which each have a relationship
to the dependent variable. The areas of
IV1 and IV2 which overlap with DV are
r² values, i.e. the proportion of the DV
that is explained by the IV.
DV and IV2 are
correlated at r = .40. The area of
overlap is r² = .16.
Slide 8
Correlations between independent variables
[Diagram: IV1 overlapping IV2]
The two independent
variables, IV1 and IV2,
are correlated at r = .20.
This correlation
represents redundant
information in the
independent variables.
Slide 9
Variance in the dependent variable explained by the independent variables
The variance explained in DV
is divided into three areas.
The total variance explained
is the sum of the three areas.
[Diagram: DV overlapping IV1 and IV2]
The brown area is
the variance in DV
that is explained by
both IV1 and IV2.
The green area is the
variance in DV uniquely
explained by IV1.
The orange area is the
variance in DV uniquely
explained by IV2.
Slide 10
Correlations at step 1 of the stepwise regression
Since IV1 had the stronger relationship with
DV (.70 versus .40), it will be the variable
entered first in the stepwise regression.
As the only variable in the regression
equation, it is given full credit (.70) for its
relationship to DV.
The partial correlation and the part
correlation have the same value as the
zero-order correlation at .70.
[Diagram: DV overlapping IV1]
Slide 11
Change in variance explained when a second variable is included
At step 2, IV2 enters the model, increasing the total
variance explained from .49 to .56, an increase of .07.
By itself, IV2 explained .16 of the variance in DV, but
since it was itself correlated with IV1, a portion of
what it could explain had already been attributed to
IV1.
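The figures on this slide can be checked with the standard formula for R² with two predictors, computed from the three correlations given earlier (.70, .40, and .20). The sketch below is an illustration only; the function name is hypothetical.

```python
def r_squared_two_predictors(r_y1: float, r_y2: float, r_12: float) -> float:
    """R² for a regression of DV on two IVs, from the three pairwise correlations."""
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


r2_step1 = 0.70 ** 2                                    # IV1 alone
r2_step2 = r_squared_two_predictors(0.70, 0.40, 0.20)   # IV1 and IV2 together

print(round(r2_step1, 2), round(r2_step2, 2), round(r2_step2 - r2_step1, 2))
# prints: 0.49 0.56 0.07
```

The increase of .07 is well below the .16 that IV2 explains by itself, because part of what IV2 could explain is already credited to IV1.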
Slide 12
Differences in correlations when a second variable is entered
While the zero-order correlations do not change, both the
partial and the part correlations decrease.
Partial correlation represents the relationship between the
dependent variable and an independent variable when the
relationship between the dependent variable and other
independent variables has been removed from the
variance of both the dependent and the independent
variable.
Part (or semi-partial) correlation represents the portion of the total
variance in the dependent variable that is explained by only that
independent variable. The square of the part correlation is the
amount of change in R² from including this variable.
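For the two-predictor example (r = .70, .40, and .20), the part and partial correlations can be computed with the standard textbook formulas. This is a sketch under that assumption; the function names are hypothetical and the results are not SPSS output.

```python
import math


def part_correlation(r_y1: float, r_y2: float, r_12: float) -> float:
    """Part (semi-partial) correlation of DV with IV1, removing IV2 from IV1 only."""
    return (r_y1 - r_y2 * r_12) / math.sqrt(1 - r_12 ** 2)


def partial_correlation(r_y1: float, r_y2: float, r_12: float) -> float:
    """Partial correlation of DV with IV1, removing IV2 from both DV and IV1."""
    return (r_y1 - r_y2 * r_12) / math.sqrt((1 - r_y2 ** 2) * (1 - r_12 ** 2))


# IV1: zero-order .70; IV2: zero-order .40; IV1 with IV2: .20
print(f"IV1 part {part_correlation(0.70, 0.40, 0.20):.3f}, "
      f"partial {partial_correlation(0.70, 0.40, 0.20):.3f}")
# prints: IV1 part 0.633, partial 0.690
print(f"IV2 part {part_correlation(0.40, 0.70, 0.20):.3f}, "
      f"partial {partial_correlation(0.40, 0.70, 0.20):.3f}")
# prints: IV2 part 0.265, partial 0.372
```

Both values fall below the zero-order correlations, and the squared part correlation for IV2 (about .07) matches the change in R² when IV2 enters at step 2.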
Slide 13
Zero-order, partial, and part correlations
[Diagrams: DV overlapping IV1 and IV2; DV overlapping IV1 alone]
The zero-order correlation is
based on the relationship between
the independent variable and the
dependent variable, ignoring all
other independent variables.
The partial correlation for
IV1 is the green area divided
by the area in DV and IV1
that is not part of IV2, i.e.
green divided by green +
yellow.
Part correlation for IV1 is
the green area divided by
all parts of DV, i.e. including
areas associated with IV2.
NOTE: diagrams are scaled to r² rather than r.
Slide 14
Zero-order, partial, and part correlations
[Diagrams: DV overlapping IV2 alone; DV overlapping IV1 and IV2]
The zero-order correlation
is based on the
relationship between the
independent variable and
the dependent variable,
ignoring all other
independent variables.
The partial correlation for IV2
is the orange area divided by
the area in DV and IV2 that
is not part of IV1, i.e. orange
divided by orange + yellow.
Part correlation for IV2 is
the orange area divided by
all parts of DV, i.e. including
areas associated with IV1.
Slide 15
How SPSS Stepwise Regression Chooses Variables - 1
The table of Correlations shows
that the variable with the
strongest individual relationship
with the dependent variable is
RACE OF HOUSEHOLD=WHITE,
with a correlation of -.247.
Provided that the relationship
between this variable and the
dependent variable is
statistically significant, this will
be the variable that enters first.
We can use the table of
correlations to identify which
variable will be entered at the
first step of the stepwise
regression.
Slide 16
How SPSS Stepwise Regression Chooses Variables - 2
The correlation between RACE OF
HOUSEHOLD=WHITE and
importance of ethnic group to R is
statistically significant at p < .001.
It will be the first variable entered
into the regression equation.
Slide 17
How SPSS Stepwise Regression Chooses Variables - 3
Model 1 contains the variable RACE
OF HOUSEHOLD=WHITE, with a
Multiple R of .247, producing an R²
of .061 (.247²), which is
statistically significant at p < .001.
We cannot use the table of correlations
to show which variable will be entered
second, since the variable entered
second must take into account its
correlation to the independent variable
entered first.
Slide 18
How SPSS Stepwise Regression Chooses Variables - 4
Partial correlation is a measure of the
relationship of the dependent variable to an
independent variable, where the variance
explained by previously entered independent
variables has been removed from both.
The table of Excluded
Variables, however,
shows the Partial
Correlation between
each candidate for entry
and the dependent
variable.
In this example, RACE OF
HOUSEHOLD=BLACK
has the largest Partial
Correlation (.252) and
is statistically significant
at p < .001, so it will be
entered on the next
step.
Slide 19
How SPSS Stepwise Regression Chooses Variables - 5
As expected, Model 2 contains the
variable RACE OF
HOUSEHOLD=WHITE and RACE OF
HOUSEHOLD=BLACK. The R² for
Model 2 increased by 0.059 to a total
of .120. The increase in R² was
statistically significant at p < .001.
Slide 20
How SPSS Stepwise Regression Chooses Variables - 6
The increase in R² of .059 is the square
of the Part Correlation for RACE OF
HOUSEHOLD=BLACK (.244² = 0.059).
Part correlation, also referred to as
semi-partial correlation, is the unique
relationship between this independent
variable and the dependent variable.
Slide 21
How SPSS Stepwise Regression Chooses Variables - 7
In the table of Excluded
Variables for model 2,
the next largest partial
correlation is HOW
OFTEN R ATTENDS
RELIGIOUS SERVICES
at .149.
This is the variable that will be
added in Model 3 because the
relationship is statistically
significant at p = .032.
[Callouts on the table: Partial Correlation column; Sig. column]
Slide 22
How SPSS Stepwise Regression Chooses Variables - 8
As expected, Model 3 contains the
variable RACE OF
HOUSEHOLD=WHITE and RACE OF
HOUSEHOLD=BLACK, and HOW OFTEN
R ATTENDS RELIGIOUS SERVICES .
The R² for Model 3 increased by 0.019
to a total of .140. The increase in R²
was statistically significant at p = .032.
Slide 23
How SPSS Stepwise Regression Chooses Variables - 9
[Callouts on the table: Partial Correlation column; Sig. column]
In the table of Excluded
Variables for model 3,
the next largest partial
correlation is THINK OF
SELF AS LIBERAL OR
CONSERVATIVE at .089.
However, the partial
correlation is not significant
(p=.203), so no additional
variables will be added to
the model.
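The entry rule traced across slides 15 to 23 can be sketched as a small Python function: take the excluded variable with the largest partial correlation and enter it only if it is statistically significant. The function name and the alpha of .05 are assumptions for illustration. Each table below holds only the one candidate the slides report; the other rows of the Excluded Variables tables are not given in the text and are left out rather than invented.

```python
def next_variable_to_enter(excluded, alpha=0.05):
    """Return the excluded variable to enter next, or None if no entry is warranted.

    `excluded` maps a variable name to (partial correlation, significance).
    """
    if not excluded:
        return None
    # The candidate with the largest absolute partial correlation is tried first.
    name, (_, sig) = max(excluded.items(), key=lambda item: abs(item[1][0]))
    return name if sig <= alpha else None


# Values reported on the slides; p < .001 is entered as the upper bound .001.
after_model_1 = {"hhrace_2": (0.252, 0.001)}
after_model_2 = {"attend": (0.149, 0.032)}
after_model_3 = {"polviews": (0.089, 0.203)}

print(next_variable_to_enter(after_model_1))  # prints: hhrace_2
print(next_variable_to_enter(after_model_2))  # prints: attend
print(next_variable_to_enter(after_model_3))  # prints: None
```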
Slide 24
What SPSS Displays when Nothing is Significant
If none of the independent
variables has a statistically
significant relationship to the
dependent variable, SPSS
displays an empty table for
Variables Entered/Removed.
Slide 25
The Problem in BlackBoard - 1
The introductory problem statement tells us:
• the data set to use:
GSS2002_PrejudiceAndAltruism.SAV
• the method for including variables in the regression
• The dependent variable for the analysis
• the list of independent variables that stepwise
regression will select from
Slide 26
This Week’s Problems
 The problems this week take the 13 questions on prejudice from the general social survey
and explore the relationship of each to the demographic characteristics of age, education,
income, political views (conservative versus liberal), religiosity (attendance at church),
socioeconomic index, gender, and race.
 I had no specific hypothesis about which demographic factors would be related to which
question on prejudice, beyond an expectation that race would be a significant contributor
to explaining differences on each of the questions.
 My analyses were exploratory (to identify what demographic characteristics were
associated with different aspects of prejudice) and, thus, appropriate for stepwise
regression.
Slide 27
The Problem in BlackBoard - 2
In these problems, we will assume that our data
satisfies the assumptions required by multiple
regression without explicitly testing for it.
We should recognize that failing to use a
needed transformation could preclude a variable
from being selected as a predictor.
In your analyses, you would, of course, want to
test for conformity to all of the assumptions.
Slide 28
The Problem in BlackBoard - 3
The next sequence of specific instructions tells us
whether each variable should be treated as
metric or non-metric, along with the reference
category to use when dummy-coding non-metric
variables.
Though we will not use the script to test for
assumptions, we can use it to do the dummy
coding that we need for the problem.
Slide 29
The Problem in BlackBoard - 4
The next pair of instructions tells us the
probability values to use for alpha, both
for the tests of statistical
relationships and for the diagnostic
tests.
Slide 30
The Problem in BlackBoard - 5
The final instruction tells us the
random number seed to use in
the validation analysis.
If you do not use this number
for the seed, it is likely that
you will get different results
from those shown in the
feedback.
Slide 31
The Statement about Level of Measurement
The first statement in the problem asks about
level of measurement. Stepwise multiple
regression requires the dependent variable and
the metric independent variables be interval
level, and the non-metric independent variables
be dummy-coded if they are not dichotomous.
The only way we would violate the level of
measurement would be to use a nominal
variable as the dependent variable, or to
attempt to dummy-code an interval level
variable that was not grouped.
Slide 32
Marking the Statement about Level of Measurement - 1
Mark the check box as a correct statement because:
• "Importance of ethnic identity" [ethimp] is ordinal level, but the
problem calls for treating it as metric, applying the common
convention of treating ordinal variables as interval level.
• The metric independent variable "age" [age] was interval level,
satisfying the requirement for independent variables.
• The metric independent variable "highest year of school
completed" [educ] was interval level, satisfying the requirement
for independent variables.
• "Income" [rincom98] is ordinal level, but the problem calls for
treating it as metric, applying the common convention of
treating ordinal variables as interval level.
Stepwise multiple regression requires the dependent
variable and the metric independent variables be interval
level, and the non-metric independent variables be
dummy-coded if they are not dichotomous.
Slide 33
Marking the Statement about Level of Measurement - 2
In addition:
• "Description of political views" [polviews] is ordinal level, but the
problem calls for treating it as metric, applying the common
convention of treating ordinal variables as interval level.
• "Frequency of attendance at religious services" [attend] is
ordinal level, but the problem calls for treating it as metric,
applying the common convention of treating ordinal variables as
interval level.
• The metric independent variable "socioeconomic index" [sei]
was interval level, satisfying the requirement for independent
variables.
• The non-metric independent variable "sex" [sex] was
dichotomous level, satisfying the requirement for independent
variables.
• The non-metric independent variable "race of the household"
[hhrace] was nominal level, but will satisfy the requirement for
independent variables when dummy coded.
Slide 34
The Statement for Sample Size
The statement for sample
size indicates that the
available data satisfies the
requirement.
Because of the tendency for
stepwise regression to over-fit
the data, we have a larger
sample size requirement, i.e. 40
cases per independent variable
(Tabachnick and Fidell, p. 117)
To obtain the number of cases
available for this analysis, we run
the stepwise regression.
Slide 35
Using the Script to Create Dummy-coded Variables - 1
Before we can run the
stepwise regression, we need
to dummy code sex and race.
We will use the script to
create the dummy-coded
variables.
Select the Run
Script command
from the Utilities
menu.
Slide 36
Using the Script to Create Dummy-coded Variables - 2
Navigate to the
My Documents
folder, if
necessary.
Highlight the script file
SatisfyingRegressionAssumptionsWithMetricAndNonMetricVariables.SBS.
Click on the Run button
to open the script.
Slide 37
Using the Script to Create Dummy-coded Variables - 3
Move the non-metric
variable "sex" [sex] to the
Non-metric independent
variables list box.
With the variable highlighted,
select the reference category,
2=FEMALE from the
Reference category drop
down menu.
Slide 38
Using the Script to Create Dummy-coded Variables - 4
Move the non-metric variable
"race of the household"
[hhrace] to the Non-metric
independent variables list box.
With the variable highlighted,
select the reference category,
3=OTHER from the Reference
category drop down menu.
The OK button to run the
regression is deactivated until
we select a dependent
variable.
Slide 39
Using the Script to Create Dummy-coded Variables - 5
We select the dependent variable
"importance of ethnic identity"
[ethimp], though since we are
not going to interpret the output,
we could select any variable.
To have the script save the dummy-
coded variables, clear the check box
Delete variables created in this analysis.
Slide 40
Using the Script to Create Dummy-coded Variables - 6
Click on the OK button to run the
regression, creating the dummy-
coded variables as a by-product.
Slide 41
The Dummy-Coded Variables in the Data Editor
If we scroll the variable list to
the right, we see that the
three dummy-coded variables
have been added to the data
set.
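What the script produces can be illustrated outside SPSS. The sketch below dummy-codes a variable against a reference category, using the codes named on the slides (sex: reference 2=FEMALE; hhrace: reference 3=OTHER). The function name and the four example cases are hypothetical, and this is not the script's actual code.

```python
def dummy_code(name, values, reference):
    """Create one 0/1 variable per category other than the reference category.

    Missing values (None) stay missing in every dummy variable.
    """
    categories = sorted({v for v in values if v is not None} - {reference})
    return {
        f"{name}_{cat}": [None if v is None else int(v == cat) for v in values]
        for cat in categories
    }


# Four made-up cases: 1=male, 2=female; 1=white, 2=black, 3=other.
sex = [1, 2, 2, 1]
hhrace = [1, 2, 3, 1]

print(dummy_code("sex", sex, reference=2))
# prints: {'sex_1': [1, 0, 0, 1]}
print(dummy_code("hhrace", hhrace, reference=3))
# prints: {'hhrace_1': [1, 0, 0, 1], 'hhrace_2': [0, 1, 0, 0]}
```

Two categories yield one dummy variable and three categories yield two, which is why three new variables appear in the data editor.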
Slide 42
Run the Stepwise Regression - 1
To run the regression, select
Regression > Linear from
the Analyze menu.
Slide 43
Run the Stepwise Regression - 2
Move the dependent variable
•"importance of ethnic identity"
[ethimp]
to the Dependent text box.
Move the independent variables:
•"age" [age],
•"highest year of school completed" [educ],
•"income" [rincom98],
•"description of political views" [polviews],
•"frequency of attendance at religious services"
[attend],
•"socioeconomic index" [sei],
•"survey respondents were male" [sex_1],
•"survey respondents who were white" [hhrace_1],
•"survey respondents who were black" [hhrace_2]
to the Independent(s) list box.
Slide 44
Run the Stepwise Regression - 3
Select Stepwise from the
Method drop down menu.
The critical step to produce a
stepwise regression is the selection
of the method for entering
variables.
Slide 45
Run the Stepwise Regression - 4
Click on the Statistics
button to specify additional
output.
Slide 46
Run the Stepwise Regression - 5
We mark the check boxes for
optional statistics:
• R squared change,
• Descriptives,
• Part and partial
correlations,
• Collinearity diagnostics, and
• Durbin-Watson.
Click on the
Continue button to
close the dialog box.
Slide 47
Run the Stepwise Regression - 6
Click on the OK
button to produce
the output.
Slide 48
Answering the Sample Size Question
The analysis included 9 independent variables (6 metric
independent variables plus 3 dummy-coded variables). The
number of cases available for the analysis was 209, not
satisfying the requirement for 360 cases based on the rule
of thumb that the required number of cases for stepwise
multiple regression should be 40 x the number of
independent variables recommended by Tabachnick and
Fidell (p. 117).
We should consider mentioning the sample size issue as a
limitation of the analysis.
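The sample size arithmetic is simple enough to check in a few lines of Python. The function name is hypothetical; the figures (9 independent variables, 209 cases, 40 cases per variable) come from the slide.

```python
def required_cases(n_predictors: int, cases_per_predictor: int = 40) -> int:
    """Rule of thumb used here for stepwise regression: 40 cases per independent variable."""
    return n_predictors * cases_per_predictor


n_predictors = 6 + 3          # 6 metric variables plus 3 dummy-coded variables
available_cases = 209
required = required_cases(n_predictors)

print(required, available_cases >= required)
# prints: 360 False
```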
Slide 49
Marking the Statement for Sample Size
The check box is not marked
because we did not satisfy
the sample size
requirement.
Slide 50
Statements about Variables Included in Stepwise Regression
Three statements in the problem list
different combinations of the
variables included in the stepwise
regression.
To determine which is correct, we
look at the table of Variables Entered
and Removed in the SPSS output.
Slide 51
Answering the Question about Variables Included in Stepwise Regression - 1
Three independent variables satisfied
the statistical criteria for entry into the
model. The variable "survey
respondents who were white"
[hhrace_1] had the largest individual
impact on the dependent variable
"importance of ethnic identity" [ethimp].
The second variable included in the
model was "survey respondents who
were black" [hhrace_2]. The third
variable included in the model was
"frequency of attendance at religious
services" [attend].
The column for Variables
Removed is empty, telling us
that no variables were
removed after being
entered.
Slide 52
Marking the Statement about Variables Included in Stepwise Regression
Three independent variables satisfied the
statistical criteria for entry into the model.
The variable "survey respondents who
were white" [hhrace_1] had the largest
individual impact on the dependent
variable "importance of ethnic identity"
[ethimp]. The second variable included in
the model was "survey respondents who
were black" [hhrace_2]. The third variable
included in the model was "frequency of
attendance at religious services" [attend].
We mark the check box for the first of the
three statements.
Slide 53
Statement about the Strength of the Relationship
The next two statements focus on the
strength of the overall relationship
between the dependent variable and
the set of predictors that are selected
in the stepwise entry of variables. The
statement assumes that the overall
relationship will be statistically
significant, which will be true if any
variables are selected for the model.
We will use Cohen’s scale for
assigning an adjective to the
strength of the relationship:
•less than .10 = trivial
•.10 up to 0.30 = weak
•.30 up to .50 = moderately strong
•.50 or greater = strong
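The scale above maps directly onto a lookup function. This is a minimal sketch; the function name is hypothetical and the cut points are those listed on the slide.

```python
def strength_label(multiple_r: float) -> str:
    """Assign the adjective from the scale above to a Multiple R."""
    r = abs(multiple_r)
    if r < 0.10:
        return "trivial"
    if r < 0.30:
        return "weak"
    if r < 0.50:
        return "moderately strong"
    return "strong"


print(strength_label(0.374))
# prints: moderately strong
```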
Slide 54
Statement about the Strength of the Relationship
The overall relationship was
statistically significant (F(3, 205)
= 11.11, p < .001). The null
hypothesis that "all of the partial
slopes (b coefficients) = 0" is
rejected, supporting the
research hypothesis that "at
least one of the partial slopes (b
coefficients) is not equal to 0".
Applying Cohen's criteria for effect size,
the relationship was correctly characterized
as moderately strong (Multiple R = .374).
Three independent variables satisfied the statistical
criteria for inclusion in the model. We interpret the
results for the last step for all of the questions about
statistical relationships (Model 3 in this example).
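The overall F statistic can be approximately reproduced from the Model 3 R² using the standard formula F = (R²/k) / ((1 - R²)/(n - k - 1)). The sketch below is an illustration; the function name is hypothetical, and the inputs (R² = .140, 209 cases, 3 predictors) are taken from the slides.

```python
def overall_f(r_squared: float, n_cases: int, n_predictors: int):
    """Return (F, df for the model, df for the residual) for the overall regression."""
    df_model = n_predictors
    df_residual = n_cases - n_predictors - 1
    f = (r_squared / df_model) / ((1 - r_squared) / df_residual)
    return f, df_model, df_residual


f, df1, df2 = overall_f(0.140, n_cases=209, n_predictors=3)
print(f"F({df1}, {df2}) = {f:.2f}")
```

The result is about 11.12 on 3 and 205 degrees of freedom. The small difference from the reported 11.11 comes from using R² rounded to three decimals.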
Slide 55
Marking the Statement about the Strength of the Relationship
The Multiple R of .374 translates to a
moderately strong relationship, so we
mark the check box for the second
statement on strength of the
relationship.
Slide 56
Statements about Relationships to the Dependent Variable for Individual Predictors
The next set of statements focus on
individual relationships between
predictors and the dependent
variable. In order for a statement to
be true, it must have a statistically
significant individual relationship (i.e.
it entered into the model), and the
direction of the relationship must be
interpreted correctly.
Slide 57
Answering Question about Relationship of RACE OF HOUSEHOLD=WHITE
Again, we base our interpretation about
statistical relationships on the last model for
variables entered, i.e. Model 3 for this problem.
We reject the null hypothesis that the partial slope (b
coefficient) for the variable "survey respondents who
were white" = 0 and conclude that the partial slope (b
coefficient) for the variable "survey respondents who
were white" is not equal to 0. The negative sign of the b
coefficient (-0.518) means that survey respondents who
were white attached less importance to ethnic identity
compared to the average for all survey respondents.
The statement that "survey respondents who were white
attached less importance to ethnic identity compared to the
average for all survey respondents" is correct. The
individual relationship between the independent variable
"survey respondents who were white" [hhrace_1] and the
dependent variable "importance of ethnic identity" [ethimp]
was statistically significant, ß = -.290, t(199) = -4.38, p < .001.
Slide 58
Marking the Statement about Relationship of RACE OF HOUSEHOLD=WHITE
Since the statement “survey respondents
who were white attached less importance
to ethnic identity compared to the
average for all survey respondents” is
supported by our statistical results, we
mark the check box.
Slide 59
Answering Question about Relationship of RACE OF HOUSEHOLD=BLACK
We reject the null hypothesis that the partial slope (b coefficient)
for the variable "survey respondents who were black" = 0 and
conclude that the partial slope (b coefficient) for the variable
"survey respondents who were black" is not equal to 0. The
positive sign of the b coefficient (0.524) means that survey
respondents who were black attached greater importance to ethnic
identity compared to the average for all survey respondents.
The statement that "survey respondents who were black
attached greater importance to ethnic identity compared to
the average for all survey respondents" is correct. The
individual relationship between the independent variable
"survey respondents who were black" [hhrace_2] and the
dependent variable "importance of ethnic identity" [ethimp]
was statistically significant, ß = .225, t(199) = 3.37, p < .001.
Slide 60
Marking the Statement about Relationship of RACE OF HOUSEHOLD=BLACK
Since the statement “survey respondents
who were black attached greater
importance to ethnic identity compared
to the average for all survey
respondents” is supported by our
statistical results, we mark the check
box.
Since the previous statement was
correct, this statement cannot be
true, so the check box is not
marked.
Slide 61
Answering Question about Relationship of ATTEND RELIGIOUS SERVICES
We reject the null hypothesis that the partial slope (b coefficient) for the
variable "frequency of attendance at religious services" = 0 and conclude that
the partial slope (b coefficient) for the variable "frequency of attendance at
religious services" is not equal to 0. The positive sign of the b coefficient
(0.062) means that higher values of frequency of attendance at religious
services were associated with higher values of "importance of ethnic
identity".
The statement that "survey respondents who attended
religious services more often attached greater importance to
ethnic identity" is correct. The individual relationship between
the independent variable "frequency of attendance at religious
services" [attend] and the dependent variable "importance of
ethnic identity" [ethimp] was statistically significant, ß = .141,
t(199) = 2.16, p = .032.
Slide 62
Marking the Statement about Relationship of ATTEND RELIGIOUS SERVICES
Since the statement “survey respondents
who attended religious services more
often attached greater importance to
ethnic identity” is supported by our
statistical results, we mark the check
box.
The following check box is not marked
because the statement contradicts the
finding we have just made.
Slide 63
Answering Question about Relationship of AGE
The statement that "survey respondents
who were older attached greater importance
to ethnic identity" is not correct. The
variable "age" [age] was not among the list
of variables included in the stepwise model.
Slide 64
Marking the Statement for Age
The check box for the
statement for age is not
marked because the variable
did not enter the model in
the stepwise regression.
Slide 65
Statement about Cross-validation
The final statement concerns the
generalizability of our findings to
the larger population. To answer
this question, we will do a
75/25% cross-validation.
The findings from our analysis are
generalizable to the extent that they are
applicable to cases not included in the
analysis. Since we cannot collect new
cases, we will divide our sample into two
subsets, using one subset to create the
model and testing the findings on the second
subset of cases, which were not included in
the analysis that created the model.
Slide 66
Creating the Training Sample and the Validation Sample - 1
The 75/25% cross-validation requires
that we randomly divide the cases for
this analysis into two parts:
75% of the cases will be used to run
the stepwise regression (the training
sample), which will be tested for
accuracy on the remaining 25% of the
cases (the validation sample).
To set the seed for the random
number generator, select
Random Number Generator
from the Transform menu.
NOTE: you must use the random number
seed that is stated in the problem in order
to produce the same results that I found.
Any other seed will generate a different
random sequence that can produce results
that are very different from mine.
Slide 67
Creating the Training Sample and the Validation Sample - 2
First, mark the check
box for Set Starting Point.
Second, select
the option button
for a Fixed
Value.
Third, type the seed
number provided in the
problem directions: 726201.
Fourth, click on the
OK button to complete
the action.
NOTE: SPSS does not provide any
feedback that the seed has been set or
changed. If you are in doubt, you can
reopen the dialog box and see what it
indicates.
Slide 68
Creating the Training Sample and the Validation Sample - 3
We will create a variable that will
contain the information about whether a
case is in the training sample or the
validation sample. We will name this
variable “split” and use a value of 1 to
indicate the training sample and a value
of 0 to indicate the validation sample.
To create the new
variable, select Compute
from the Transform
menu.
Slide 69
Creating the Training Sample and the Validation Sample - 4
Type the name of
the new variable,
split, in the Target
Variable text box.
Type the formula as
shown in the
Numeric Expression
text box.
Click on the OK
button to create
the variable.
The formula uses the SPSS UNIFORM
function to create a uniform distribution
of decimal numbers between 0 and 1. If
the generated number for a case is less
than or equal to 0.75, the statement in
the text box is True and the split
variable will be assigned a 1 for that
case. If the generated number is larger
than 0.75, the statement is false and
the case will be assigned a 0 for split.
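The same logic can be sketched in Python: draw a uniform random number for each case and assign 1 when it is 0.75 or less, otherwise 0. The function name and the count of 1,000 cases are hypothetical. Python's random number generator is not the one SPSS uses, so the seed 726201 will not reproduce the SPSS split; it only makes this sketch repeatable.

```python
import random


def make_split(n_cases: int, seed: int, training_share: float = 0.75) -> list:
    """Assign each case 1 (training sample) or 0 (validation sample)."""
    rng = random.Random(seed)  # separate generator so the seed affects only this split
    return [1 if rng.random() <= training_share else 0 for _ in range(n_cases)]


split = make_split(1000, seed=726201)
share_training = sum(split) / len(split)
print(f"{share_training:.1%} of cases assigned to the training sample")
```

As in SPSS, the share assigned to training will be close to 75% but rarely exactly 75%.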
Slide 70
Creating the Training Sample and the Validation Sample - 5
If we scroll the data editor
window to the right, we see
the split variable in a new
column.
Slide 71
Creating the Training Sample and the Validation Sample - 6
If we create a frequency distribution for
the split variable, we see that the
breakdown is approximately, not exactly,
75% and 25%. This is a consequence of
generating random numbers: you have
no control over the sequence that is
generated beyond setting an initial seed.
Though I have done it to create
specific results for homework
problems, it is not acceptable to
run repeated series of random
numbers until one gets a
sequence that has desirable
properties.
Slide 72
An Additional Task before Running the Stepwise Regression on the Training Sample
 Before we run the regression on the training sample, we need an additional step that will
enable us to compare the accuracy of the model for the training sample to the accuracy of
the model for the validation sample, using the R² for each as our measure of accuracy.
 We need to exclude from the analysis cases that are missing data for any of the variables
that we have designated as candidates for inclusion. If we don’t specifically do this, SPSS
may include different cases in predicting values for the dependent variable than it does in
determining which variables to include in the model.
 In model building, SPSS does listwise exclusion of missing data and omits any cases that
have missing data for any variable. In predicting scores on the dependent variable, it
excludes only cases that are missing data for the variables included in the stepwise model.
Thus, when selecting variables, SPSS assumes that only respondents who answer all
questions are valid cases; in predicting scores, it assumes that failing to answer a question
on a variable that is not included has no importance in the analysis.
Slide 73
Selecting Cases with Valid Data for All Variables in the Analysis - 1
To include only those
cases that have valid
data for all variables in
the analysis, choose the
Select Cases command
from the Data menu.
Slide 74
Selecting Cases with Valid Data for All Variables in the Analysis - 2
First, mark the
option button for If
condition is
satisfied.
Second, click on
the If button to
add the condition.
Slide 75
Selecting Cases with Valid Data for All Variables in the Analysis - 3
Type
NMISS(ethimp,age,educ,rincom98,polviews,
attend,sei,sex_1,hhrace_1,hhrace_2) = 0
in the condition textbox. In the parentheses,
we type the names of the dependent variable
and all of the independent variables.
The SPSS NMISS function counts the number
of variables in the list that have missing data.
Telling SPSS to include cases for which this
calculation results in 0 indicates that the case
was not missing data for any of the variables.
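The logic of the NMISS condition can be sketched in Python as a count of missing values per case. This is an illustration only: the helper names nmiss and select_complete_cases are made up, missing data is assumed to be represented as None, and the two example cases are invented.

```python
# Variable names taken from the condition typed in the SPSS dialog.
VARIABLES = ["ethimp", "age", "educ", "rincom98", "polviews",
             "attend", "sei", "sex_1", "hhrace_1", "hhrace_2"]


def nmiss(case, variables):
    """Count how many of the listed variables are missing (None) for one case."""
    return sum(1 for name in variables if case.get(name) is None)


def select_complete_cases(cases, variables):
    """Keep only the cases for which the missing-value count is 0."""
    return [case for case in cases if nmiss(case, variables) == 0]


# Two invented cases: the second is missing a value for educ.
complete = {name: 1 for name in VARIABLES}
incomplete = dict(complete, educ=None)
kept = select_complete_cases([complete, incomplete], VARIABLES)
print(len(kept))
```

Only the first case survives the filter, mirroring how SPSS marks excluded cases in the data editor.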
Slide 76
Selecting Cases with Valid Data for All Variables in the Analysis - 4
Click on the
Continue button to
close the dialog box.
Slide 77
Selecting Cases with Valid Data for All Variables in the Analysis - 5
Click on the OK
button to
execute the
command.
Slide 78
Selecting Cases with Valid Data for All Variables in the Analysis - 6
The excluded cases
have a slash
through the case
number.
Slide 79
Run the Stepwise Regression on the Training Sample - 1
To run the regression, select
Regression > Linear from
the Analyze menu.
Slide 80
Run the Stepwise Regression on the Training Sample - 2
Move the dependent variable
•"importance of ethnic identity"
[ethimp]
to the Dependent text box.
Move the independent variables:
•"age" [age]
•"highest year of school completed" [educ],
•"income" [rincom98],
•"description of political views" [polviews],
•"frequency of attendance at religious services"
[attend],
•"socioeconomic index" [sei],
•"survey respondents were male" [sex_1],
•"survey respondents who were white" [hhrace_1],
•"survey respondents who were black" [hhrace_2]
to the Independent(s) list box.
Slide 81
Run the Stepwise Regression on the Training Sample - 3
Select Stepwise from the
Method drop down menu.
The critical steps to produce a
stepwise regression on the training
sample are the selection of the
stepwise method for entering
variables and the inclusion of the
training sample cases.
Slide 82
Run the Stepwise Regression on the Training Sample - 4
First, highlight
the split variable.
To select the training sample, we
move the split variable to the
Selection Variable text box.
Second, click on the
right arrow button to the
left of the Selection
Variable text box.
Slide 83
Run the Stepwise Regression on the Training Sample - 5
Click on the Rule button
to specify the value that
we want split to use to
select cases.
Slide 84
Run the Stepwise Regression on the Training Sample - 6
First, type 1 in
the Value text
box. Recall that
this is the value
of split indicating
training cases.
Second, click on the
Continue button to
close the dialog box.
Slide 85
Run the Stepwise Regression on the Training Sample - 7
Click on the Statistics
button to specify additional
output.
Slide 86
Run the Stepwise Regression on the Training Sample - 8
We mark the check boxes for
optional statistics:
• R squared change,
• Descriptives,
• Part and partial
correlations,
• Collinearity diagnostics, and
• Durbin-Watson.
Click on the
Continue button to
close the dialog box.
Slide 88
Run the Stepwise Regression on the Training Sample - 10
Click on the OK
button to produce
the output.
Slide 89
Validating the Model - 1
The first step in our validation is to make
certain that the model based on the
training sample reasonably approximates
the model based on the full sample.
Here we see that both models included 3
variables.
If the number of models
(steps) were different, the
validation would fail.
Slide 90
Validating the Model - 2
Second, we verify that the model
based on the training sample
included the same three variables as
the model based on the full data set.
We do not require that the variables
be entered in the same order, as the
difference in samples can easily
result in small shifts.
The variables that entered the stepwise
regression of the training sample are the
same as those that entered the stepwise
regression using the full sample ("frequency of
attendance at religious services" [attend],
"survey respondents who were black"
[hhrace_2] and "survey respondents who
were white" [hhrace_1]).
Slide 91
Validating the Model - 3
Third, we compare the accuracy of the model
for the validation sample to the accuracy of
the model for the training sample.
We have to calculate the R² for the
validation sample (split ~= 1.0) by hand
from the Multiple R: .402² = .162.
The R² for the 75% training sample was
0.131 and the R² for the 25% validation
sample was 0.162, resulting in a value of
.131 – .162 = -.031 for shrinkage. Since
-.031 is <= .02, the validation is
successful.
If the shrinkage were greater
than .02 (2%), the validation
would fail.
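The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch using the values reported above; the function name shrinkage is made up for the example, and rounding to three decimals is an assumption chosen to match the way the slide reports its figures.

```python
def shrinkage(r2_training, r2_validation):
    """Shrinkage is the training-sample R² minus the validation-sample R²."""
    return r2_training - r2_validation


multiple_r_validation = 0.402                          # Multiple R for split ~= 1
r2_validation = round(multiple_r_validation ** 2, 3)   # squared by hand
r2_training = 0.131                                    # R² for the training sample

value = round(shrinkage(r2_training, r2_validation), 3)
validation_ok = value <= 0.02                          # 2% criterion from the slide
print(r2_validation, value, validation_ok)
```

A negative shrinkage, as here, means the validation sample was predicted slightly better than the training sample, so it passes the 2% criterion.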
Slide 92
Marking the Check Box for the Cross-validation Statement
The validation analysis supported
the generalizability of the findings
of the analysis to the population
represented by the sample in the
data set.
We mark the check box for the
validation.
Slide 93
The Question Graded in Blackboard
When the problem was
submitted, Blackboard
confirmed that all marked
answers were correct.
Slide 94
Logic Diagram for Solving Homework Problems: Level of Measurement
Level of measurement ok?
• Yes: Run script to dummy-code non-metric variables, if needed, then run stepwise regression.
• No: Ordinal level variable treated as metric?
  • Yes: Consider limitation in discussion of findings, then proceed as for a yes answer to the first question.
  • No: Do not mark check box. Mark: Inappropriate application of the statistic. Stop.
Slide 95
Logic Diagram for Solving Homework Problems: Sample Size and Overall Relationship
Sample size ok (number of IVs x 40)?
• No: Do not mark check box. Consider limitation in discussion of findings.
• Yes: Mark check box for correct sample size.
1+ variables entered in model? (Model will be statistically significant if any variables entered.)
• No: Stop (no significant predictors).
• Yes: Model is not trivial (Multiple R >= .10)?
  • No: Stop (model is not usable).
  • Yes: Continue.
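The checks in this diagram can be written as two small Python functions. This is a sketch under stated assumptions: the function names and return strings are invented for illustration, and the case counts and Multiple R values in the example calls are made up. Only the thresholds (40 cases per independent variable, Multiple R of at least .10) come from the slides.

```python
def sample_size_ok(n_cases, n_independent_vars, cases_per_var=40):
    """Stepwise regression norm used here: 40 cases per independent variable."""
    return n_cases >= n_independent_vars * cases_per_var


def overall_relationship(n_entered, multiple_r, minimum_r=0.10):
    """Follow the diagram: stop if nothing entered or if the model is trivial."""
    if n_entered < 1:
        return "stop: no significant predictors"
    if multiple_r < minimum_r:
        return "stop: model is not usable"
    return "continue"


# Nine candidate independent variables, as in the example problem;
# the case counts and Multiple R values below are hypothetical.
print(sample_size_ok(n_cases=400, n_independent_vars=9))
print(sample_size_ok(n_cases=300, n_independent_vars=9))
print(overall_relationship(n_entered=3, multiple_r=0.36))
print(overall_relationship(n_entered=0, multiple_r=0.0))
```

With nine candidate variables the norm calls for at least 360 cases, so the first call passes and the second does not.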
Slide 96
Logic Diagram for Solving Homework Problems: Strength of Overall Relationship
Strength of model correctly characterized?
• No: Do not mark check box.
• Yes: Mark check box for correct strength.
Subset of entered variables correctly identified?
• No: Do not mark check box for correct subset.
• Yes: Mark check box for correct subset.
Slide 97
Logic Diagram for Solving Homework Problems: Individual Relationships
Variable entered and not removed?
• No: Do not mark check box for individual relationship.
• Yes: Correct interpretation of direction of relationship?
  • No: Do not mark check box for individual relationship.
  • Yes: Mark check box for individual relationship.
Additional variables entered?
• Yes: Return to the first question for the next variable.
• No: Continue.
Slide 98
Logic Diagram for Solving Homework Problems: Cross-validation
Create split variable using specified seed.
Select cases with no missing values for all variables.
Run stepwise regression on training sample.
Same variables entered in full model?
• No: Do not mark check box for supporting validation.
• Yes: Shrinkage < or = 2%?
  • No: Do not mark check box for supporting validation.
  • Yes: Mark check box for supporting validation.
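The two decisions in this diagram can be combined into one Python function. This is a minimal sketch, not part of the SPSS procedure: the function name validation_supported is an assumption, and the variable sets are compared without regard to order because the slides do not require the same order of entry. The example call uses the variables and R² values reported in the validation slides.

```python
def validation_supported(full_vars, training_vars,
                         r2_training, r2_validation, max_shrinkage=0.02):
    """Return True only if both validation questions are answered yes."""
    # Question 1: did the training sample select the same predictors?
    if set(full_vars) != set(training_vars):
        return False
    # Question 2: is the shrinkage 2% or less (negative shrinkage passes)?
    return (r2_training - r2_validation) <= max_shrinkage


supported = validation_supported(
    full_vars=["attend", "hhrace_2", "hhrace_1"],
    training_vars=["hhrace_2", "attend", "hhrace_1"],  # order may differ
    r2_training=0.131,
    r2_validation=0.162,
)
print(supported)
```

If the training sample had entered a different set of variables, the function would return False without looking at the shrinkage, as in the diagram.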

Solving stepwise regression problems

  • 2. Slide 2 Different Methods for Entering Variables in Multiple Regression  Different types of multiple regression are distinguished by the method for entering the independent variables into the analysis.  In standard (or simultaneous) multiple regression, all of the independent variables are entered into the analysis at the same.  In hierarchical (or sequential) multiple regression, the independent variables are entered in an order prescribed by the analyst.  In stepwise (or statistical) multiple regression, the independent variables are entered according to their statistical contribution in explaining the variance in the dependent variable.  No matter what method of entry is chosen, a multiple regression that includes the same independent variables and the same dependent variables will produce the same multiple regression equation.  The number of cases required for stepwise regression is greater than the number for the other forms. We will use the norm of 40 cases for each independent variable.
  • 3. Slide 3 Purpose of Stepwise Multiple Regression  Stepwise regression is designed to find the most parsimonious set of predictors that are most effective in predicting the dependent variable.  Variables are added to the regression equation one at a time, using the statistical criterion of maximizing the R² of the included variables.  After each variable is entered, each of the included variables are tested to see if the model would be better off it were excluded. This does not happen often.  The process of adding more variables stops when all of the available variables have been included or when it is not possible to make a statistically significant improvement in R² using any of the variables not yet included.  Since variables will not be added to the regression equation unless they make a statistically significant addition to the analysis, all of the independent variable selected for inclusion will have a statistically significant relationship to the dependent variable.  An example of how SPSS does stepwise regression is shown below.
  • 4. Slide 4 Stepwise Multiple Regression in SPSS  Each time SPSS includes or removes a variable from the analysis, SPSS considers it a new step or model, i.e. there will be one model and result for each variable included in the analysis.  SPSS provides a table of variables included in the analysis and a table of variables excluded from the analysis. It is possible that none of the variables will be included. It is possible that all of the variables will be included.  The order of entry of the variables can be used as a measure of relative importance.  Once a variable is included, its interpretation in stepwise regression is the same as it would be using other methods for including regression variables.
  • 5. Slide 5 Pros and Cons of Stepwise Regression  Stepwise multiple regression can be used when the goal is to produce a predictive model that is parsimonious and accurate because it excludes variables that do not contribute to explaining differences in the dependent variable.  Stepwise multiple regression is less useful for testing hypotheses about statistical relationships. It is widely regarded as atheoretical and its usage is not recommended.  Stepwise multiple regression can be useful in finding relationships that have not been tested before. Its findings invite one to speculate on why an unusual relationship makes sense.  It is not legitimate to do a stepwise multiple regression and present the results as though one were testing a hypothesis that included the variables found to be significant in the stepwise regression.  Using statistical criteria to determine relationships is vulnerable to over-fitting the data set used to develop the model at the expense of generalizability.  When stepwise regression is used, some form of validation analysis is a necessity. We will use 75/25% cross-validation.
  • 6. Slide 6 75/25% Cross-validation  To do cross-validation, we randomly split the data set into a 75% training sample and a 25% validation sample. We use the training sample to develop the model, and we test its effectiveness on the validation sample to assess the applicability of the model to cases not used to develop it.  For the validation to be successful, the following two questions must be answered affirmatively:  Did the stepwise regression of the training sample produce the same subset of predictors produced by the regression model of the full data set?  If yes, compare the R² for the 25% validation sample to the R² for the 75% training sample. If the shrinkage (R² for the 75% training sample - R² for the 25% validation sample) is 2% (0.02) or less, we conclude that validation was successful.  Note: shrinkage may be a negative value, indicating that the accuracy rate for the validation sample is larger than the accuracy rate for the training sample. Negative shrinkage (increase in accuracy) is evidence of a successful validation analysis.  If the validation is successful, we base our interpretation on the model that included all cases.
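The two-question decision rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are invented here, and the R² values in the usage lines are hypothetical numbers chosen to exercise each branch, not results from the problem.

```python
def shrinkage(r2_training, r2_validation):
    """R² for the 75% training sample minus R² for the 25% validation sample."""
    return r2_training - r2_validation


def validation_successful(same_predictors, r2_training, r2_validation,
                          max_shrinkage=0.02):
    """Apply the slide's two questions in order."""
    # Question 1: did the training sample select the same predictors
    # as the stepwise regression on the full data set?
    if not same_predictors:
        return False
    # Question 2: is shrinkage 0.02 or less? Negative shrinkage
    # (validation R² larger than training R²) also passes.
    return shrinkage(r2_training, r2_validation) <= max_shrinkage


# Hypothetical R² values, for illustration only.
print(validation_successful(True, 0.150, 0.140))   # small positive shrinkage
print(validation_successful(True, 0.150, 0.100))   # shrinkage above 0.02
print(validation_successful(True, 0.150, 0.180))   # negative shrinkage
print(validation_successful(False, 0.150, 0.150))  # different predictors
```

Keeping the predictor check first mirrors the slide: the R² comparison is only meaningful when both analyses chose the same variables.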
  • 7. Slide 7 Correlations between dependent variable and independent variables We have two independent variables, IV1 and IV2, which each have a relationship to the dependent variable. The areas of IV1 and IV2 which overlap with DV are r² values, i.e. the proportion of the DV that is explained by the IV. DV and IV1 are correlated at r = .70. The area of overlap is r² = .49. DV and IV2 are correlated at r = .40. The area of overlap is r² = .16.
  • 8. Slide 8 Correlations between independent variables IV1 IV2 The two independent variables, IV1 and IV2, are correlated at r = .20. This correlation represents redundant information in the independent variables.
  • 9. Slide 9 Variance in the dependent variable explained by the independent variables The variance explained in DV is divided into three areas. The total variance explained is the sum of the three areas. DV IV1 IV2 The brown area is the variance in DV that is explained by both IV1 and IV2. The green area is the variance in DV uniquely explained by IV1. The orange area is the variance in DV uniquely explained by IV2.
  • 10. Slide 10 Correlations at step 1 of the stepwise regression Since IV1 had the stronger relationship with DV (.70 versus .40), it will be the variable entered first in the stepwise regression. As the only variable in the regression equation, it is given full credit (.70) for its relationship to DV. The partial correlation and the part correlation have the same value as the zero-order correlation at .70. DV IV1
  • 11. Slide 11 Change in variance explained when a second variable is included At step 2, IV2 enters the model, increasing the total variance explained from .49 to .56, an increase of .07. By itself, IV2 explained .16 of the variance in DV, but since it was itself correlated with IV1, a portion of what it could explain had already been attributed to IV1.
  • 12. Slide 12 Differences in correlations when a second variable is entered While the zero-order correlations do not change, both the partial and the part correlations decrease. Partial correlation represents the relationship between the dependent variable and an independent variable when the relationship between the dependent variable and other independent variables has been removed from the variance of both the dependent and the independent variable. Part (or semi-partial) correlation reflects the portion of the total variance in the dependent variable that is explained by only that independent variable. The square of the part correlation is the amount of change in R² from including this variable.
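The figures on this slide can be checked from the three correlations given on the earlier slides (DV with IV1 = .70, DV with IV2 = .40, IV1 with IV2 = .20). The sketch below uses the standard formula for R² with two predictors; the function name is made up for this illustration.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R² for a regression of DV on two predictors, from their correlations.

    r_y1, r_y2: zero-order correlations of each predictor with DV
    r_12:       correlation between the two predictors
    """
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


r2_step1 = 0.70 ** 2                                   # IV1 alone
r2_step2 = r_squared_two_predictors(0.70, 0.40, 0.20)  # IV1 and IV2
r2_change = r2_step2 - r2_step1

print(f"step 1 R² = {r2_step1:.2f}")
print(f"step 2 R² = {r2_step2:.2f}")
print(f"change    = {r2_change:.2f}")
```

The change is much smaller than the .16 that IV2 explains on its own because the term `2 * r_y1 * r_y2 * r_12` removes the variance the two predictors share.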
  • 13. Slide 13 Zero-order, partial, and part correlations The zero-order correlation is based on the relationship between the independent variable and the dependent variable, ignoring all other independent variables. The partial correlation for IV1 is the green area divided by the area in DV and IV1 that is not part of IV2, i.e. green divided by green + yellow. Part correlation for IV1 is the green area divided by all parts of DV, i.e. including areas associated with IV2. NOTE: diagrams are scaled to r² rather than r.
  • 14. Slide 14 Zero-order, partial, and part correlations The zero-order correlation is based on the relationship between the independent variable and the dependent variable, ignoring all other independent variables. The partial correlation for IV2 is the orange area divided by the area in DV and IV2 that is not part of IV1, i.e. orange divided by orange + yellow. Part correlation for IV2 is the orange area divided by all parts of DV, i.e. including areas associated with IV1.
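For the two-predictor case, the part and partial correlations described on these slides can be computed directly from the zero-order correlations. The following is a sketch using the textbook formulas and the example values r = .70, .40, and .20 from the earlier slides; the function names are chosen here for illustration.

```python
from math import sqrt


def part_correlation(r_ya, r_yb, r_ab):
    """Part (semi-partial) correlation of predictor A with DV.

    Predictor B is removed from A only, not from DV.
    """
    return (r_ya - r_yb * r_ab) / sqrt(1 - r_ab ** 2)


def partial_correlation(r_ya, r_yb, r_ab):
    """Partial correlation of predictor A with DV.

    Predictor B is removed from both A and DV.
    """
    return (r_ya - r_yb * r_ab) / sqrt((1 - r_yb ** 2) * (1 - r_ab ** 2))


r_y1, r_y2, r_12 = 0.70, 0.40, 0.20

part_iv1 = part_correlation(r_y1, r_y2, r_12)
partial_iv1 = partial_correlation(r_y1, r_y2, r_12)
part_iv2 = part_correlation(r_y2, r_y1, r_12)
partial_iv2 = partial_correlation(r_y2, r_y1, r_12)

print(f"IV1: zero-order {r_y1:.3f}, partial {partial_iv1:.3f}, part {part_iv1:.3f}")
print(f"IV2: zero-order {r_y2:.3f}, partial {partial_iv2:.3f}, part {part_iv2:.3f}")
print(f"IV2 part correlation squared = {part_iv2 ** 2:.3f}")
```

Both derived correlations fall below the zero-order value for each predictor, and the squared part correlation for IV2 equals the increase in R² when IV2 is added after IV1.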
  • 15. Slide 15 How SPSS Stepwise Regression Chooses Variables - 1 The table of Correlations shows that the variable with the strongest individual relationship with the dependent variable is RACE OF HOUSEHOLD=WHITE, with a correlation of -.247. Provided that the relationship between this variable and the dependent variable is statistically significant, this will be the variable that enters first. We can use the table of correlations to identify which variable will be entered at the first step of the stepwise regression.
  • 16. Slide 16 How SPSS Stepwise Regression Chooses Variables - 2 The correlation between RACE OF HOUSEHOLD=WHITE and importance of ethnic group to R is statistically significant at p < .001. It will be the first variable entered into the regression equation.
  • 17. Slide 17 How SPSS Stepwise Regression Chooses Variables - 3 Model 1 contains the variable RACE OF HOUSEHOLD=WHITE, with a Multiple R of .247, producing an R² of .061 (.247²), which is statistically significant at p < .001. We cannot use the table of correlations to show which variable will be entered second, since the variable entered second must take into account its correlation to the independent variable entered first.
  • 18. Slide 18 How SPSS Stepwise Regression Chooses Variables - 4 The table of Excluded Variables, however, shows the Partial Correlation between each candidate for entry and the dependent variable. Partial correlation is a measure of the relationship of the dependent variable to an independent variable, where the variance explained by previously entered independent variables has been removed from both. In this example, RACE OF HOUSEHOLD=BLACK has the largest Partial Correlation (.252) and is statistically significant at p < .001, so it will be entered on the next step.
  • 19. Slide 19 How SPSS Stepwise Regression Chooses Variables - 5 As expected, Model 2 contains the variables RACE OF HOUSEHOLD=WHITE and RACE OF HOUSEHOLD=BLACK. The R² for Model 2 increased by .059 to a total of .120. The increase in R² was statistically significant at p < .001.
  • 20. Slide 20 How SPSS Stepwise Regression Chooses Variables - 6 The increase in R² of .059 is the square of the Part Correlation for RACE OF HOUSEHOLD=BLACK (.244² = 0.059). Part correlation, also referred to as semi-partial correlation, is the unique relationship between this independent variable and the dependent variable.
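The arithmetic linking the part correlation to the change in R² can be reproduced from the rounded values shown on the slides. Because the inputs are rounded to three decimals, the sum agrees with the reported Model 2 total of .120 only to within rounding.

```python
multiple_r_model1 = 0.247  # Multiple R for Model 1 (slide 17)
part_r_black = 0.244       # Part correlation for RACE OF HOUSEHOLD=BLACK (slide 20)

r2_model1 = multiple_r_model1 ** 2  # variance explained by Model 1
r2_change = part_r_black ** 2       # squared part correlation = increase in R²
r2_model2 = r2_model1 + r2_change   # variance explained by Model 2

print(f"Model 1 R² = {r2_model1:.3f}")
print(f"R² change  = {r2_change:.4f}")
print(f"Model 2 R² = {r2_model2:.4f} (reported as .120)")
```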
  • 21. Slide 21 How SPSS Stepwise Regression Chooses Variables - 7 In the table of Excluded Variables for Model 2, the next largest partial correlation is HOW OFTEN R ATTENDS RELIGIOUS SERVICES at .149. This is the variable that will be added in Model 3 because the relationship is statistically significant at p = .032.
  • 22. Slide 22 How SPSS Stepwise Regression Chooses Variables - 8 As expected, Model 3 contains the variables RACE OF HOUSEHOLD=WHITE, RACE OF HOUSEHOLD=BLACK, and HOW OFTEN R ATTENDS RELIGIOUS SERVICES. The R² for Model 3 increased by .019 to a total of .140. The increase in R² was statistically significant at p = .032.
  • 23. Slide 23 How SPSS Stepwise Regression Chooses Variables - 9 In the table of Excluded Variables for Model 3, the next largest partial correlation is THINK OF SELF AS LIBERAL OR CONSERVATIVE at .089. However, the partial correlation is not significant (p = .203), so no additional variables will be added to the model.
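The entry logic walked through on these slides can be sketched as a small forward-selection loop over a correlation matrix. This is a simplified illustration, not SPSS's implementation: it omits the removal check that stepwise regression performs after each entry, it uses a fixed F-to-enter threshold of 3.84 (an assumption standing in for the p < .05 entry test), and the sample size of 100 is hypothetical. The correlations are the IV1/IV2 example values from the earlier slides. At each step, the candidate with the largest gain in R² is also the one with the largest absolute partial correlation, so the choice matches the rule described above.

```python
def forward_stepwise(names, corr, dv, n_cases, f_to_enter=3.84):
    """Forward entry from a correlation matrix.

    Returns a list of (predictor name, cumulative R²) in order of entry.
    """
    size = len(names)
    y = names.index(dv)
    # Residual (co)variances after removing the predictors entered so far.
    resid = [row[:] for row in corr]
    candidates = [i for i in range(size) if i != y]
    steps, r2 = [], 0.0
    while candidates:
        # Gain in R² for each candidate = its squared part correlation.
        gains = {j: resid[j][y] ** 2 / resid[j][j] for j in candidates}
        best = max(gains, key=gains.get)
        # F for the change in R²; residual df counts the candidate as entered.
        df_resid = n_cases - (len(steps) + 1) - 1
        f_change = gains[best] / ((1.0 - r2 - gains[best]) / df_resid)
        if f_change < f_to_enter:
            break  # no remaining candidate makes a significant improvement
        r2 += gains[best]
        steps.append((names[best], r2))
        candidates.remove(best)
        # Partial the entered predictor out of every remaining variable.
        pivot = resid[best][best]
        column = [resid[i][best] for i in range(size)]
        resid = [[resid[i][j] - column[i] * column[j] / pivot
                  for j in range(size)] for i in range(size)]
    return steps


names = ["DV", "IV1", "IV2"]
corr = [
    [1.00, 0.70, 0.40],
    [0.70, 1.00, 0.20],
    [0.40, 0.20, 1.00],
]
for name, r2 in forward_stepwise(names, corr, dv="DV", n_cases=100):
    print(f"{name} entered, cumulative R² = {r2:.2f}")
```

Working from the correlation matrix keeps the sketch free of external libraries; a real analysis would also need to guard against perfectly collinear predictors, which would make `resid[j][j]` zero.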
  • 24. Slide 24 What SPSS Displays when Nothing is Significant If none of the independent variables has a statistically significant relationship to the dependent variable, SPSS displays an empty table for Variables Entered/Removed.
  • 25. Slide 25 The Problem in BlackBoard - 1 The introductory problem statement tells us: • the data set to use: GSS2002_PrejudiceAndAltruism.SAV • the method for including variables in the regression • The dependent variable for the analysis • the list of independent variables that stepwise regression will select from
  • 26. Slide 26 This Week’s Problems  The problems this week take the 13 questions on prejudice from the general social survey and explore the relationship of each to the demographic characteristics of age, education, income, political views (conservative versus liberal), religiosity (attendance at church), socioeconomic index, gender, and race.  I had no specific hypothesis about which demographic factors would be related to which question on prejudice, beyond an expectation that race would be a significant contributor to explaining differences on each of the questions.  My analyses were exploratory (to identify what demographic characteristics were associated with different aspects of prejudice) and, thus, appropriate for stepwise regression.
  • 27. Slide 27 The Problem in BlackBoard - 2 In these problems, we will assume that our data satisfies the assumptions required by multiple regression without explicitly testing for it. We should recognize that failing to use a needed transformation could preclude a variable from being selected as a predictor. In your analyses, you would, of course, want to test for conformity to all of the assumptions.
  • 28. Slide 28 The Problem in BlackBoard - 3 The next sequence of specific instructions tells us whether each variable should be treated as metric or non-metric, along with the reference category to use when dummy-coding non-metric variables. Though we will not use the script to test for assumptions, we can use it to do the dummy coding that we need for the problem.
  • 29. Slide 29 The Problem in BlackBoard - 4 The next pair of instructions tells us the probability values to use for alpha for both the tests of statistical relationships and for the diagnostic tests.
  • 30. Slide 30 The Problem in BlackBoard - 5 The final instruction tells us the random number seed to use in the validation analysis. If you do not use this number for the seed, it is likely that you will get different results from those shown in the feedback.
  • 31. Slide 31 The Statement about Level of Measurement The first statement in the problem asks about level of measurement. Stepwise multiple regression requires the dependent variable and the metric independent variables be interval level, and the non-metric independent variables be dummy-coded if they are not dichotomous. The only way we would violate the level of measurement would be to use a nominal variable as the dependent variable, or to attempt to dummy-code an interval level variable that was not grouped.
  • 32. Slide 32 Marking the Statement about Level of Measurement - 1 Mark the check box as a correct statement because: • "Importance of ethnic identity" [ethimp] is ordinal level, but the problem calls for treating it as metric, applying the common convention of treating ordinal variables as interval level. • The metric independent variable "age" [age] was interval level, satisfying the requirement for independent variables. • The metric independent variable "highest year of school completed" [educ] was interval level, satisfying the requirement for independent variables. • "Income" [rincom98] is ordinal level, but the problem calls for treating it as metric, applying the common convention of treating ordinal variables as interval level. Stepwise multiple regression requires the dependent variable and the metric independent variables be interval level, and the non-metric independent variables be dummy-coded if they are not dichotomous.
  • 33. Slide 33 Marking the Statement about Level of Measurement - 2 In addition: • "Description of political views" [polviews] is ordinal level, but the problem calls for treating it as metric, applying the common convention of treating ordinal variables as interval level. • "Frequency of attendance at religious services" [attend] is ordinal level, but the problem calls for treating it as metric, applying the common convention of treating ordinal variables as interval level. • The metric independent variable "socioeconomic index" [sei] was interval level, satisfying the requirement for independent variables. • The non-metric independent variable "sex" [sex] was dichotomous level, satisfying the requirement for independent variables. • The non-metric independent variable "race of the household" [hhrace] was nominal level, but will satisfy the requirement for independent variables when dummy coded.
  • 34. Slide 34 The Statement for Sample Size The statement for sample size indicates that the available data satisfies the requirement. Because of the tendency for stepwise regression to over-fit the data, we have a larger sample size requirement, i.e. 40 cases per independent variable (Tabachnick and Fidell, p. 117). To obtain the number of cases available for this analysis, we run the stepwise regression.
  • 35. Slide 35 Using the Script to Create Dummy-coded Variables - 1 Before we can run the stepwise regression, we need to dummy code sex and race. We will use the script to create the dummy-coded variables. Select the Run Script command from the Utilities menu.
  • 36. Slide 36 Using the Script to Create Dummy-coded Variables - 2 Navigate to the My Documents folder, if necessary. Highlight the script file SatisfyingRegressionAssumptionsWithMetricAndNonMetricVariables.SBS. Click on the Run button to open the script.
  • 37. Slide 37 Using the Script to Create Dummy-coded Variables - 3 Move the non-metric variable "sex" [sex] to the Non-metric independent variables list box. With the variable highlighted, select the reference category, 2=FEMALE, from the Reference category drop down menu.
  • 38. Slide 38 Using the Script to Create Dummy-coded Variables - 4 Move the non-metric variable "race of the household" [hhrace] to the Non-metric independent variables list box. With the variable highlighted, select the reference category, 3=OTHER, from the Reference category drop down menu. The OK button to run the regression is deactivated until we select a dependent variable.
  • 39. Slide 39 Using the Script to Create Dummy-coded Variables - 5 We select the dependent variable "importance of ethnic identity" [ethimp], though since we are not going to interpret the output, we could select any variable. To have the script save the dummy-coded variables, clear the check box Delete variables created in this analysis.
  • 40. Slide 40 Using the Script to Create Dummy-coded Variables - 6 Click on the OK button to run the regression, creating the dummy-coded variables as a by-product.
  • 41. Slide 41 The Dummy-Coded Variables in the Data Editor If we scroll the variable list to the right, we see that the three dummy-coded variables have been added to the data set.
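What the script does when it dummy-codes with a reference category can be sketched in plain Python. This is an illustration only, not the SPSS script: the helper name and the four sample cases are hypothetical, and the codes 1=MALE, 1=WHITE, and 2=BLACK are inferred from the variable names sex_1, hhrace_1, and hhrace_2 used on the slides (the slides state only the reference codes 2=FEMALE and 3=OTHER).

```python
def dummy_code(values, var_name, reference):
    """Create one 0/1 indicator per category, skipping the reference category.

    Returns a dict mapping new variable names (e.g. 'hhrace_1') to lists
    of 0/1 values; missing values (None) stay missing.
    """
    categories = sorted({v for v in values if v is not None and v != reference})
    coded = {}
    for category in categories:
        coded[f"{var_name}_{category}"] = [
            None if v is None else int(v == category) for v in values
        ]
    return coded


# Hypothetical cases for illustration.
sex = [1, 2, 2, 1]     # assumed 1=MALE, 2=FEMALE (reference)
hhrace = [1, 2, 3, 1]  # assumed 1=WHITE, 2=BLACK, 3=OTHER (reference)

dummies = {}
dummies.update(dummy_code(sex, "sex", reference=2))
dummies.update(dummy_code(hhrace, "hhrace", reference=3))

for name, column in dummies.items():
    print(name, column)
```

A variable with k categories yields k - 1 indicators, which is why sex and race of household together produce three dummy-coded variables.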
  • 42. Slide 42 Run the Stepwise Regression - 1 To run the regression, select Regression > Linear from the Analyze menu.
  • 43. Slide 43 Run the Stepwise Regression - 2 Move the dependent variable • "importance of ethnic identity" [ethimp] to the Dependent text box. Move the independent variables: • "age" [age], • "highest year of school completed" [educ], • "income" [rincom98], • "description of political views" [polviews], • "frequency of attendance at religious services" [attend], • "socioeconomic index" [sei], • "survey respondents were male" [sex_1], • "survey respondents who were white" [hhrace_1], • "survey respondents who were black" [hhrace_2] to the Independent(s) list box.
  • 44. Slide 44 Run the Stepwise Regression - 3 Select Stepwise from the Method drop down menu. The critical step to produce a stepwise regression is the selection of the method for entering variables.
  • 45. Slide 45 Run the Stepwise Regression - 4 Click on the Statistics button to specify additional output.
  • 46. Slide 46 Run the Stepwise Regression - 5 We mark the check boxes for optional statistics: • R squared change, • Descriptives, • Part and partial correlations, • Collinearity diagnostics, and • Durbin-Watson. Click on the Continue button to close the dialog box.
  • 47. Slide 47 Run the Stepwise Regression - 6 Click on the OK button to produce the output.
  • 48. Slide 48 Answering the Sample Size Question The analysis included 9 independent variables (6 metric independent variables plus 3 dummy-coded variables). The number of cases available for the analysis was 209, not satisfying the requirement for 360 cases based on the rule of thumb that the required number of cases for stepwise multiple regression should be 40 x the number of independent variables recommended by Tabachnick and Fidell (p. 117). We should consider mentioning the sample size issue as a limitation of the analysis.
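The sample size check on this slide is simple arithmetic, sketched below with the counts given above. The function name is made up for the illustration.

```python
def required_cases(n_predictors, cases_per_predictor=40):
    """Rule of thumb used here: 40 cases per independent variable."""
    return n_predictors * cases_per_predictor


n_metric = 6        # metric independent variables
n_dummy = 3         # dummy-coded variables (sex_1, hhrace_1, hhrace_2)
n_available = 209   # cases available for the analysis

needed = required_cases(n_metric + n_dummy)
print(f"required = {needed}, available = {n_available}, "
      f"satisfied = {n_available >= needed}")
```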
  • 49. Slide 49 Marking the Statement for Sample Size The check box is not marked because we did not satisfy the sample size requirement.
  • 50. Slide 50 Statements about Variables Included in Stepwise Regression Three statements in the problem list different combinations of the variables included in the stepwise regression. To determine which is correct, we look at the table of Variables Entered and Removed in the SPSS output.
  • 51. Slide 51 Answering the Question about Variables Included in Stepwise Regression - 1 Three independent variables satisfied the statistical criteria for entry into the model. The variable "survey respondents who were white" [hhrace_1] had the largest individual impact on the dependent variable "importance of ethnic identity" [ethimp]. The second variable included in the model was "survey respondents who were black" [hhrace_2]. The third variable included in the model was "frequency of attendance at religious services" [attend]. The column for Variables Removed is empty, telling us that no variables were removed after being entered.
  • 52. Slide 52 Marking the Statement about Variables Included in Stepwise Regression Three independent variables satisfied the statistical criteria for entry into the model. The variable "survey respondents who were white" [hhrace_1] had the largest individual impact on the dependent variable "importance of ethnic identity" [ethimp]. The second variable included in the model was "survey respondents who were black" [hhrace_2]. The third variable included in the model was "frequency of attendance at religious services" [attend]. We mark the check box for the first of the three statements.
  • 53. Slide 53 Statement about the Strength of the Relationship The next two statements focus on the strength of the overall relationship between the dependent variable and the set of predictors that are selected in the stepwise entry of variables. The statement assumes that the overall relationship will be statistically significant, which will be true if any variables are selected for the model. We will use Cohen’s scale for assigning an adjective to the strength of the relationship: •less than .10 = trivial •.10 up to 0.30 = weak •.30 up to .50 = moderately strong •.50 or greater = strong
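The scale above maps directly onto a small lookup function. This sketch takes the cut points from the slide; the function name is invented, and .374 is the Multiple R reported for Model 3 in this problem.

```python
def strength_label(multiple_r):
    """Adjective for the strength of a relationship, using the slide's cut points."""
    r = abs(multiple_r)
    if r < 0.10:
        return "trivial"
    if r < 0.30:
        return "weak"
    if r < 0.50:
        return "moderately strong"
    return "strong"


for r in (0.05, 0.247, 0.374, 0.62):
    print(f"R = {r:.3f}: {strength_label(r)}")
```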
  • 54. Slide 54 Statement about the Strength of the Relationship The overall relationship was statistically significant (F(3, 205) = 11.11, p < .001). The null hypothesis that "all of the partial slopes (b coefficients) = 0" is rejected, supporting the research hypothesis that "at least one of the partial slopes (b coefficients) is not equal to 0". Applying Cohen's criteria for effect size, the relationship was correctly characterized as moderately strong (Multiple R = .374). Three independent variables satisfied the statistical criteria for inclusion in the model. We interpret the results for the last step for all of the questions about statistical relationships (Model 3 in this example).
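The reported F statistic can be recovered from the Multiple R, the number of cases (209), and the number of predictors in Model 3 (3) using the standard formula for the overall F test in regression. The sketch below is a consistency check on the slide's numbers, with a function name chosen for this illustration; it does not compute the p-value.

```python
def overall_f(multiple_r, n_cases, n_predictors):
    """Overall F test for a regression model, computed from Multiple R.

    Returns (F, df for the model, df for the residual).
    """
    r2 = multiple_r ** 2
    df_model = n_predictors
    df_resid = n_cases - n_predictors - 1
    f_value = (r2 / df_model) / ((1 - r2) / df_resid)
    return f_value, df_model, df_resid


f_value, df_model, df_resid = overall_f(0.374, n_cases=209, n_predictors=3)
print(f"F({df_model}, {df_resid}) = {f_value:.2f}")
```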
  • 55. Slide 55 Marking the Statement about the Strength of the Relationship The Multiple R of .374 translates to a moderately strong relationship, so we mark the check box for the second statement on strength of the relationship.
  • 56. Slide 56 Statements about Relationships to the Dependent Variable for Individual Predictors The next set of statements focus on individual relationships between predictors and the dependent variable. In order for a statement to be true, it must have a statistically significant individual relationship (i.e. it entered into the model), and the direction of the relationship must be interpreted correctly.
  • 57. Slide 57 Answering Question about Relationship of RACE OF HOUSEHOLD=WHITE Again, we base our interpretation about statistical relationships on the last model for variables entered, i.e. Model 3 for this problem. We reject the null hypothesis that the partial slope (b coefficient) for the variable "survey respondents who were white" = 0 and conclude that the partial slope (b coefficient) for the variable "survey respondents who were white" is not equal to 0. The negative sign of the b coefficient (-0.518) means that survey respondents who were white attached less importance to ethnic identity compared to the average for all survey respondents. The statement that "survey respondents who were white attached less importance to ethnic identity compared to the average for all survey respondents" is correct. The individual relationship between the independent variable "survey respondents who were white" [hhrace_1] and the dependent variable "importance of ethnic identity" [ethimp] was statistically significant, ß = -.290, t(199) = -4.38, p < .001.
  • 58. Slide 58 Marking the Statement about Relationship of RACE OF HOUSEHOLD=WHITE Since the statement “survey respondents who were white attached less importance to ethnic identity compared to the average for all survey respondents” is supported by our statistical results, we mark the check box.
  • 59. Slide 59 Answering Question about Relationship of RACE OF HOUSEHOLD=BLACK We reject the null hypothesis that the partial slope (b coefficient) for the variable "survey respondents who were black" = 0 and conclude that the partial slope (b coefficient) for the variable "survey respondents who were black" is not equal to 0. The positive sign of the b coefficient (0.524) means that survey respondents who were black attached greater importance to ethnic identity compared to the average for all survey respondents. The statement that "survey respondents who were black attached greater importance to ethnic identity compared to the average for all survey respondents" is correct. The individual relationship between the independent variable "survey respondents who were black" [hhrace_2] and the dependent variable "importance of ethnic identity" [ethimp] was statistically significant, ß = .225, t(199) = 3.37, p < .001.
  • 60. Slide 60 Marking the Statement about Relationship of RACE OF HOUSEHOLD=BLACK Since the statement "survey respondents who were black attached greater importance to ethnic identity compared to the average for all survey respondents" is supported by our statistical results, we mark the check box. Because this statement is correct, the statement that follows it cannot also be true, so its check box is not marked.
  • 61. Slide 61 Answering Question about Relationship of ATTEND RELIGIOUS SERVICES We reject the null hypothesis that the partial slope (b coefficient) for the variable "frequency of attendance at religious services" = 0 and conclude that the partial slope (b coefficient) for the variable "frequency of attendance at religious services" is not equal to 0. The positive sign of the b coefficient (0.062) means that higher values of frequency of attendance at religious services were associated with higher values of "importance of ethnic identity". The statement that "survey respondents who attended religious services more often attached greater importance to ethnic identity" is correct. The individual relationship between the independent variable "frequency of attendance at religious services" [attend] and the dependent variable "importance of ethnic identity" [ethimp] was statistically significant, ß = .141, t(199) = 2.16, p = .032.
  • 62. Slide 62 Marking the Statement about Relationship of ATTEND RELIGIOUS SERVICES Since the statement “survey respondents who attended religious services more often attached greater importance to ethnic identity" is supported by our statistical results, we mark the check box. The following check box is not marked because the statement contradicts the finding we have just made.
  • 63. Slide 63 Answering Question about Relationship of AGE The statement that "survey respondents who were older attached greater importance to ethnic identity" is not correct. The variable "age" [age] was not among the list of variables included in the stepwise model.
  • 64. Slide 64 Marking the Statement for Age The check box for the statement for age is not marked because the variable did not enter the model in the stepwise regression.
  • 65. Slide 65 Statement about Cross-validation The final statement concerns the generalizability of our findings to the larger population. To answer this question, we will do a 75/25% cross-validation. The findings from our analysis are generalizable to the extent that they are applicable to cases not included in the analysis. Since we cannot collect new cases, we will divide our sample into two subsets, using one subset to create the model and testing the findings on the second subset of cases, which were not included in the analysis that created the model.
  • 66. Slide 66 Creating the Training Sample and the Validation Sample - 1 The 75/25% cross-validation requires that we randomly divide the cases for this analysis into two parts: 75% of the cases will be used to run the stepwise regression (the training sample), which will be tested for accuracy on the remaining 25% of the cases (the validation sample). To set the seed for the random number generator, select Random Number Generator from the Transform menu. NOTE: you must use the random number seed that is stated in the problem in order to produce the same results that I found. Any other seed will generate a different random sequence that can produce results that are very different from mine.
  • 67. Slide 67 Creating the Training Sample and the Validation Sample - 2 First, mark the check box for Set Starting Point. Second, select the option button for a Fixed Value. Third, type the seed number provided in the problem directions: 726201. Fourth, click on the OK button to complete the action. NOTE: SPSS does not provide any feedback that the seed has been set or changed. If you are in doubt, you can reopen the dialog box and see what it indicates.
  • 68. Slide 68 Creating the Training Sample and the Validation Sample - 3 We will create a variable that will contain the information about whether a case is in the training sample or the validation sample. We will name this variable “split” and use a value of 1 to indicate the training sample and a value of 0 to indicate the validation sample. To create the new variable, select Compute from the Transform menu.
  • 69. Slide 69 Creating the Training Sample and the Validation Sample - 4 Type the name of the new variable, split, in the Target Variable text box. Type the formula as shown in the Numeric Expression text box. Click on the OK button to create the variable. The formula uses the SPSS UNIFORM function to create a uniform distribution of decimal numbers between 0 and 1. If the generated number for a case is less than or equal to 0.75, the statement in the text box is True and the split variable will be assigned a 1 for that case. If the generated number is larger than 0.75, the statement is false and the case will be assigned a 0 for split.
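The same idea can be sketched in Python: seed a random number generator, draw a uniform number between 0 and 1 for each case, and assign 1 when the draw is 0.75 or less. This is an analogy, not a replacement for the SPSS steps: Python's generator differs from SPSS's, so the seed 726201 will not reproduce the SPSS split, and the case count of 1000 is hypothetical.

```python
import random


def make_split(n_cases, seed, training_share=0.75):
    """Return a list of 1 (training sample) and 0 (validation sample) flags."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    return [1 if rng.random() <= training_share else 0 for _ in range(n_cases)]


split = make_split(1000, seed=726201)
n_training = sum(split)
print(f"training cases:   {n_training}")
print(f"validation cases: {len(split) - n_training}")
```

As with the SPSS version, the resulting breakdown is only approximately 75/25, because each case is assigned independently.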
  • 70. Slide 70 Creating the Training Sample and the Validation Sample - 5 If we scroll the data editor window to the right, we see the split variable in a new column.
  • 71. Slide 71 Creating the Training Sample and the Validation Sample - 6 If we created a frequency distribution for the split variable, we see that the breakdown is approximately, not exactly, correct. This is a consequence of generating random numbers – you have no control over the sequence that it generates beyond setting an initial seed. Though I have done it to create specific results for homework problems, it is not acceptable to run repeated series of random numbers until one gets a sequence that has desirable properties.
  • 72. Slide 72 An Additional Task before Running the Stepwise Regression on the Training Sample  Before we run the regression on the training sample, we need an additional step that will enable us to compare the accuracy of the model for the training sample to the accuracy of the model for the validation sample, using the R² for each as our measure of accuracy.  We need to exclude from the analysis cases that are missing data for any of the variables that we have designated as candidates for inclusion. If we don't specifically do this, SPSS may include different cases in predicting values for the dependent variable than it does in determining which variables to include in the model.  In model building, SPSS does listwise exclusion of missing data and omits any cases that have missing data for any variable. In predicting scores on the dependent variable, it excludes cases that are missing data for only the variables included in the stepwise model. Thus, when selecting variables, SPSS assumes that only respondents who answer all questions are valid cases; in predicting scores, it assumes that failing to answer a question on a variable that is not included has no importance in the analysis.
  • 73. Slide 73 Selecting Cases with Valid Data for All Variables in the Analysis - 1 To include only those cases that have valid data for all variables in the analysis, choose the Select Cases command from the Data menu.
  • 74. Slide 74 Selecting Cases with Valid Data for All Variables in the Analysis - 2 First, mark the option button for If condition is satisfied. Second, click on the If button to add the condition.
  • 75. Slide 75 Selecting Cases with Valid Data for All Variables in the Analysis - 3 Type NMISS(ethimp,age,educ,rincom98,polviews, attend,sei,sex_1,hhrace_1,hhrace_2) = 0 in the condition textbox. In the parentheses, we type the names of the dependent variable and all of the independent variables. The SPSS NMISS function counts the number of variables in the list that have missing data. Telling SPSS to include cases for which this calculation results in 0 indicates that the case was not missing data for any of the variables.
  • 76. Slide 76 Selecting Cases with Valid Data for All Variables in the Analysis - 4 Click on the Continue button to close the dialog box.
  • 77. Slide 77 Selecting Cases with Valid Data for All Variables in the Analysis - 5 Click on the OK button to execute the command.
  • 78. Slide 78 Selecting Cases with Valid Data for All Variables in the Analysis - 6 The excluded cases have a slash through the case number.
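The effect of the NMISS(...) = 0 condition can be sketched as follows. This is a Python illustration rather than SPSS syntax: missing values are represented by `None`, the helper names `nmiss` and `select_complete_cases` are assumptions, and the two records are made up for the example. Only the variable names come from the slides.

```python
# Dependent variable and candidate independent variables named on the slide.
VARIABLES = ["ethimp", "age", "educ", "rincom98", "polviews",
             "attend", "sei", "sex_1", "hhrace_1", "hhrace_2"]

def nmiss(case, variables):
    """Count how many of the listed variables are missing (None) for a case."""
    return sum(1 for name in variables if case.get(name) is None)

def select_complete_cases(cases, variables):
    """Keep only the cases with no missing data on any listed variable."""
    return [case for case in cases if nmiss(case, variables) == 0]

# Two made-up records: one complete, one missing income.
complete = {name: 1 for name in VARIABLES}
missing_income = dict(complete, rincom98=None)

kept = select_complete_cases([complete, missing_income], VARIABLES)
print(len(kept))
```

Only the complete record is kept, which mirrors the slashed-out case numbers in the data editor.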
  • 79. Slide 79 Run the Stepwise Regression on the Training Sample - 1 To run the regression, select Regression > Linear from the Analyze menu.
  • 80. Slide 80 Run the Stepwise Regression on the Training Sample - 2 Move the dependent variable "importance of ethnic identity" [ethimp] to the Dependent text box. Move the independent variables • "age" [age], • "highest year of school completed" [educ], • "income" [rincom98], • "description of political views" [polviews], • "frequency of attendance at religious services" [attend], • "socioeconomic index" [sei], • "survey respondents were male" [sex_1], • "survey respondents who were white" [hhrace_1], and • "survey respondents who were black" [hhrace_2] to the Independent(s) list box.
  • 81. Slide 81 Run the Stepwise Regression on the Training Sample - 3 Select Stepwise from the Method drop down menu. The critical steps to produce a stepwise regression on the training sample are the selection of the stepwise method for entering variables and the inclusion of the training sample cases.
  • 82. Slide 82 Run the Stepwise Regression on the Training Sample - 4 To select the training sample, we move the split variable to the Selection Variable text box. First, highlight the split variable. Second, click on the right arrow button to the left of the Selection Variable text box.
  • 83. Slide 83 Run the Stepwise Regression on the Training Sample - 5 Click on the Rule button to specify the value that we want split to use to select cases.
  • 84. Slide 84 Run the Stepwise Regression on the Training Sample - 6 First, type 1 in the Value text box. Recall that this is the value of split indicating training cases. Second, click on the Continue button to close the dialog box.
  • 85. Slide 85 Run the Stepwise Regression on the Training Sample - 7 Click on the Statistics button to specify additional output.
  • 86. Slide 86 Run the Stepwise Regression on the Training Sample - 8 We mark the check boxes for optional statistics: • R squared change, • Descriptives, • Part and partial correlations, • Collinearity diagnostics, and • Durbin-Watson. Click on the Continue button to close the dialog box.
  • 87. Slide 87 Run the Stepwise Regression on the Training Sample - 9 We mark the check boxes for optional statistics: • R squared change, • Descriptives, • Part and partial correlations, • Collinearity diagnostics, and • Durbin-Watson. Click on the Continue button to close the dialog box.
  • 88. Slide 88 Run the Stepwise Regression on the Training Sample - 10 Click on the OK button to produce the output.
  • 89. Slide 89 Validating the Model - 1 The first step in our validation is to make certain that the model based on the training sample reasonably approximates the model based on the full sample. Here we see that both models included 3 variables. If the number of models (steps) were different, the validation would fail.
  • 90. Slide 90 Validating the Model - 2 Second, we verify that the model based on the training sample included the same three variables as the model based on the full data set. We do not require that the variables be entered in the same order, as the difference in samples can easily result in small shifts. The variables that entered the stepwise regression on the training sample are the same ones that entered the stepwise regression on the full sample: "frequency of attendance at religious services" [attend], "survey respondents who were black" [hhrace_2], and "survey respondents who were white" [hhrace_1].
  • 91. Slide 91 Validating the Model - 3 Third, we compare the accuracy of the model for the validation sample to the accuracy of the model for the training sample. We have to calculate the R² for the validation sample (split ~= 1.0) by hand from the Multiple R: .402² = .162. The R² for the 75% training sample was .131 and the R² for the 25% validation sample was .162, resulting in a value of .131 – .162 = -.031 for shrinkage. Since -.031 is <= .02, the validation is successful. If the shrinkage had been greater than .02 (2%), the validation would have failed.
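The hand calculation on this slide can be checked with a few lines of Python. This is only an arithmetic sketch; the function name `shrinkage` is an assumption, and the inputs are the Multiple R and R² values quoted above.

```python
def shrinkage(r2_training, multiple_r_validation):
    """Return (shrinkage, validation R squared).

    The validation R squared is the square of the validation sample's
    Multiple R; shrinkage is training R squared minus validation R squared.
    """
    r2_validation = multiple_r_validation ** 2
    return r2_training - r2_validation, r2_validation

# Values quoted on the slide: training R squared .131, validation Multiple R .402.
shrink, r2_val = shrinkage(0.131, 0.402)
print(round(r2_val, 3), round(shrink, 3))
```

A negative shrinkage means the model fit the validation sample slightly better than the training sample, so it falls well within the 2% limit.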
  • 92. Slide 92 Marking the Check Box for the Cross-validation Statement The validation analysis supported the generalizability of the findings of the analysis to the population represented by the sample in the data set. We mark the check box for the validation.
  • 93. Slide 93 The Question Graded in Blackboard When the problem was submitted, Blackboard confirmed that all marked answers were correct.
  • 94. Slide 94 Logic Diagram for Solving Homework Problems: Level of Measurement • Level of measurement ok? No: do not mark check box; mark "Inappropriate application of the statistic"; stop. Yes: continue. • Ordinal level variable treated as metric? Yes: consider limitation in discussion of findings. No: continue. • Run script to dummy-code non-metric variables, if needed. • Run stepwise regression.
  • 95. Slide 95 Logic Diagram for Solving Homework Problems: Sample Size and Overall Relationship • Sample size ok (number of IVs x 40)? No: do not mark check box; consider limitation in discussion of findings. Yes: mark check box for correct sample size. • 1+ variables entered in model? No: stop (no significant predictors). Yes: continue (the model will be statistically significant if any variables entered). • Model is not trivial (Multiple R >= .10)? No: stop (model is not usable). Yes: continue.
  • 96. Slide 96 Logic Diagram for Solving Homework Problems: Strength of Overall Relationship • Strength of model correctly characterized? No: do not mark check box. Yes: mark check box for correct strength. • Subset of entered variables correctly identified? No: do not mark check box for correct subset. Yes: mark check box for correct subset.
  • 97. Slide 97 Logic Diagram for Solving Homework Problems: Individual Relationships • Variable entered and not removed? No: do not mark check box for individual relationship. Yes: continue. • Correct interpretation of direction of relationship? No: do not mark check box for individual relationship. Yes: mark check box for individual relationship. • Additional variables entered? Yes: repeat the checks for the next variable. No: continue.
  • 98. Slide 98 Logic Diagram for Solving Homework Problems: Cross-validation • Create split variable using specified seed. • Select cases with no missing values for all variables. • Run stepwise regression on training sample. • Same variables entered as in full model? No: do not mark check box for supporting validation. Yes: continue. • Shrinkage <= 2%? No: do not mark check box for supporting validation. Yes: mark check box for supporting validation.
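The cross-validation decision described in the slides can be sketched as a small function. This is an illustrative Python sketch under stated assumptions: the function name `validation_supported` and its parameters are hypothetical, the two R² values are taken as already computed, and the order in which the variables are listed below is made up (the slides say only that entry order does not have to match).

```python
def validation_supported(full_vars, training_vars,
                         r2_training, r2_validation, max_shrinkage=0.02):
    """Return True when the check box for supporting validation would be marked.

    Two conditions must hold: the training-sample model entered the same
    variables as the full-sample model (order is ignored), and the
    shrinkage (training R squared minus validation R squared) is at most
    max_shrinkage.
    """
    if set(full_vars) != set(training_vars):
        return False
    return (r2_training - r2_validation) <= max_shrinkage

# Variables named on slide 90, listed in two arbitrary orders.
full_model = ["attend", "hhrace_2", "hhrace_1"]
training_model = ["hhrace_2", "attend", "hhrace_1"]

print(validation_supported(full_model, training_model, 0.131, 0.162))
```

Comparing sets rather than lists reflects the rule that small shifts in entry order between samples do not count against the validation.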