This document discusses multiple regression analysis. It begins by explaining the linear multiple regression model and key steps in regression modeling such as specifying the model, collecting data, and evaluating the model. It then covers assumptions of multiple regression including linearity and independence of errors. The document presents a mini-case study predicting home heating oil consumption based on temperature and insulation. It provides the multiple regression equation developed from the case study data and uses the equation to make predictions about oil consumption. Finally, it discusses interpreting the coefficient of multiple determination (R²), which indicates how well the model explains the variation in the dependent variable.
2. Learning Objectives
Explain the linear multiple regression model [for cross-sectional data]
Interpret linear multiple regression computer output
Explain multicollinearity
Describe the types of multiple regression models
3. Regression Modeling Steps
Define problem or question
Specify model
Collect data
Do descriptive data analysis
Estimate unknown parameters
Evaluate model
Use model for prediction
4. Simple vs. Multiple
Y = β0 + β1x1 + β2x2 + ... + βPxP + ε
Simple regression: β1 represents the unit change in Y per unit change in X. It does not take into account any other variable besides the single independent variable.
Multiple regression: βi represents the unit change in Y per unit change in Xi, taking into account the effect of the other X variables (the "net regression coefficient").
5. Assumptions
Linearity - the Y variable is linearly related
to the value of the X variable.
Independence of Error - the error
(residual) is independent for each value of X.
Homoscedasticity - the variation around
the line of regression is constant for all
values of X.
Normality - the values of Y are normally
distributed at each value of X.
6. Regression hypotheses
Ho: γ = 0 (the variable has no significant effect)
H1: γ ≠ 0 (the variable is significant)
(all variables must be significant at the chosen level)
And for the residuals,
Ho: the residuals are not serially correlated, not
heteroscedastic, and are normally distributed
H1: the residuals are serially correlated,
heteroscedastic, or not normally distributed.
7. Best regression model
Null, Ho: the residuals are not serially correlated
Alt, H1: the residuals are serially correlated
The R² value must be high
(it should be 60% or more for a good model)
No serial correlation
(the LM test probability value must be higher than 0.05)
No heteroscedasticity
(the residual test p-value must be higher than 0.05)
Residuals are normally distributed
(the histogram normality-test p-value must be higher than 0.05)
(these checks are sketched in code below)
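These checks can be run in a few lines. Below is a minimal sketch using Python's statsmodels on hypothetical data; the names y, X, and res are placeholders, not from the slides.

```python
# Sketch: checking the "best model" criteria on hypothetical data
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_breusch_godfrey, het_breuschpagan
from statsmodels.stats.stattools import jarque_bera

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))   # constant + 2 regressors
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=100)

res = sm.OLS(y, X).fit()
print("R-squared:", res.rsquared)                # want >= 0.60 by the rule of thumb

# Serial correlation: Breusch-Godfrey LM test (H0: no serial correlation)
lm_stat, lm_pval, _, _ = acorr_breusch_godfrey(res, nlags=2)
print("LM test p-value:", lm_pval)               # want > 0.05

# Heteroscedasticity: Breusch-Pagan test (H0: homoscedastic residuals)
bp_stat, bp_pval, _, _ = het_breuschpagan(res.resid, res.model.exog)
print("Breusch-Pagan p-value:", bp_pval)         # want > 0.05

# Normality: Jarque-Bera test (H0: residuals are normally distributed)
jb_stat, jb_pval, _, _ = jarque_bera(res.resid)
print("Jarque-Bera p-value:", jb_pval)           # want > 0.05
```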
9. Goal
Develop a statistical model
that can predict the values of
a dependent (response)
variable based upon the
values of the independent
(explanatory) variables.
10. Simple Regression
A statistical model that utilizes
one quantitative independent
variable “X” to predict the
quantitative dependent
variable “Y.”
11. OLS interpretation

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          -3188.845     1822.720     -1.749487     0.0866
Income     0.819235      0.003190     256.7871      0.0000

R-squared            0.9999      Mean dependent var       3522.160
Adjusted R-squared   0.9999      S.D. dependent var       3077.678
S.E. of regression   82.86681    Akaike info criterion    11.73539
Sum squared resid    337614.8    Schwarz criterion        11.87820
Log likelihood       -292.3779   Hannan-Quinn criterion   11.76500
F-statistic          65939.59    Durbin-Watson stat       0.568044
Prob(F-statistic)    0.0000
12. For OLS results
Coefficient signs explain the direction of the relationship between the explanatory and dependent variables.
The standard error of a coefficient indicates the accuracy of the estimated ordinary least squares (OLS) coefficient with respect to its population parameter. Each standard error is the square root of the variance of the corresponding coefficient.
The t-test is a statistical hypothesis testing technique used to test the linearity of the relationship between the response variable and the different predictor variables. In other words, it is used to determine whether or not there is a linear correlation between the response and predictor variables, and whether that linear relationship is statistically significant.
It is estimated by dividing the coefficient by its standard error:
t statistic = coefficient / standard error
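Continuing the hypothetical res object from the sketch above, this ratio can be verified by hand:

```python
# t statistics by hand: each coefficient divided by its standard error
t_manual = res.params / res.bse
print(t_manual)
print(res.tvalues)   # statsmodels' reported t statistics; should match
```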
13. OLS results
The probability value must be between zero and 0.05 for the term to be significant.
The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect).
A low p-value (< 0.05) indicates that you can reject the null hypothesis. In other words, a predictor that has a low p-value is likely to be a meaningful addition to your model, because changes in the predictor's value are related to changes in the response variable.
(For the residual normality test, Ho: the residuals are normally distributed.)
14. OLS results
The R² value explains what percentage of the variation in the dependent variable Y is explained by the explanatory variables X, that is, the effects of the independent variables on the dependent variable.
Adjusted R² can increase or decrease as independent variables are added. Too many explanatory variables may even give it a negative sign.
15. OLS results
The F statistic tells how jointly significant the explanatory variables are in affecting the dependent variable. The higher the F value, the better the model.
For its probability value, the lower the value, the better the model. It tells the statistical significance of the model as a whole.
16. OLS results
The mean dependent variable is the average value of the dependent variable.
AIC, SIC, and HQC are used to choose the best model: the lower the value, the better the model. The model with the lowest AIC is the one to adopt.
The Durbin-Watson statistic tells the serial correlation: if DW is less than two, it is evidence of positive serial correlation and the model is suffering from serial correlation.
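A brief continuation of the same hypothetical example, comparing two candidate models by information criteria and checking Durbin-Watson:

```python
# Compare candidate models by AIC/BIC (lower is better) and check
# serial correlation with Durbin-Watson (values near 2 are good).
from statsmodels.stats.stattools import durbin_watson

res_both = sm.OLS(y, X).fit()          # constant + both regressors
res_one  = sm.OLS(y, X[:, :2]).fit()   # constant + first regressor only

print("both:", res_both.aic, res_both.bic)
print("one :", res_one.aic, res_one.bic)
print("Durbin-Watson:", durbin_watson(res_both.resid))
```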
17. Multiple Regression
A statistical model that utilizes two
or more quantitative and qualitative
explanatory variables (x1,..., xp) to
predict a quantitative dependent
variable Y.
Caution: have at least two quantitative
explanatory variables (rule of thumb)
18. Multiple regression
Multiple regression is a statistical technique that
can be used to analyze the relationship between
a single dependent variable and several
independent variables. The objective of multiple
regression analysis is to use the independent
variables whose values are known to predict the
value of the single dependent variable.
19. Multiple regression assumptions
Assumption #1: Your dependent variable should
be measured on a continuous scale (i.e., it is either
an interval or ratio variable). Examples of variables
that meet this criterion include revision time
(measured in hours), intelligence (measured using IQ
score), exam performance (measured from 0 to 100),
weight (measured in kg), and so forth.
20. Multiple regression assumptions
Assumption #2: You have two or more
independent variables, which can be
either continuous (i.e., an interval or ratio variable)
or categorical (i.e., an ordinal or nominal variable).
For examples of continuous and ordinal variables,
see the bullet above. Examples of nominal
variables include gender (e.g., 2 groups: male and
female) and ethnicity.
21. Multiple regression assumptions
Assumption #3: You should have independence of
observations (i.e., independence of residuals),
which you can easily check using the
Durbin-Watson statistic.
22. Multiple regression assumptions
Assumption #4: There needs to be a linear
relationship between (a) the dependent
variable and each of your independent
variables, and (b) the dependent variable and
the independent variables collectively.
There are a number of ways to check for these
linear relationships, such as inspecting
scatterplots and partial regression plots.
23. Multiple regression assumptions
Assumption #5: Your data needs to
show homoscedasticity, which is where the
variances along the line of best fit remain similar
as you move along the line.
24. Multiple regression assumptions
Assumption #6: Your data must not
show multicollinearity (i.e., no strong
relationships among the independent
variables), which occurs when you
have two or more independent variables that
are highly correlated with each other. This
leads to problems with understanding which
independent variable contributes to the
variance explained in the dependent variable,
as well as technical issues in calculating a
multiple regression model.
26. Multiple regression assumptions
Finally, you need to check that the residuals
(errors) are approximately normally
distributed.
Estimate the residual histogram (normality test);
the p-value must be higher than 0.05.
30. Types of Models
Positive linear relationship
Negative linear relationship
No relationship between X and Y
Positive curvilinear relationship
U-shaped curvilinear
Negative curvilinear relationship
33. Linear Model
The relationship between one dependent and two or more independent
variables is a linear function:

Y = β0 + β1X1 + β2X2 + ... + βPXP + ε

where Y is the dependent (response) variable, X1, ..., XP are the
independent (explanatory) variables, β1, ..., βP are the population
slopes, β0 is the population Y-intercept, and ε is the random error.
34. Method of Least Squares
The straight line that best fits the data.
Determine the straight line for which the differences
between the actual values (Y) and the values that
would be predicted from the fitted line of regression
(Y-hat) are as small as possible.
35. Measures of Variation
Explained variation (sum of
squares due to regression)
Unexplained variation (error
sum of squares)
Total sum of squares
36. Coefficient of Multiple Determination
When null hypothesis is rejected, a
relationship between Y and the X
variables exists.
Strength measured by R² [several types]
40. Multiple Regression Equation
Y = β0 + β1x1 + β2x2 + ... + βPxP + ε
where:
β0 = Y-intercept {a constant value}
β1 = slope of Y with variable x1, holding the effects of variables x2, x3, ...,
xP constant
βP = slope of Y with variable xP, holding all
other variables' effects constant
41. Mini-Case
Predict the consumption of home
heating oil during January for
homes located around Serene
Lakes. Two explanatory variables
are selected: average daily
atmospheric temperature (°F) and
the amount of attic insulation (inches).
42. Oil (Gal)   Temp (°F)   Insulation (in.)
275.30 40 3
363.80 27 3
164.30 40 10
40.80 73 6
94.30 64 6
230.90 34 6
366.70 9 6
300.60 8 10
237.80 23 10
121.40 63 3
31.40 65 10
203.50 41 6
441.10 21 3
323.00 38 3
52.50 58 10
Mini-Case
Develop a model for estimating
heating oil used for a single-family
home in the month of January, based
on average temperature (°F) and
amount of insulation in inches.
43. Mini-Case
Oil is the dependent variable; temperature and insulation are the
independent variables.
What preliminary conclusions can home owners draw from the
data?
What could a home owner expect heating oil consumption (in
gallons) to be if the outside temperature is 15 °F when the attic
insulation is 10 inches thick?
Model: Oil = β0 + β1·Temp + β2·Insulation + error term
(fitted in the sketch below)
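The model can be fitted from the slide's own data. A minimal sketch with statsmodels' formula API; the column names oil, temp, and insul are mine, not from the slides:

```python
# Mini-case data from the slides: January heating-oil use (gallons),
# average temperature (deg F), and attic insulation (inches).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "oil":   [275.30, 363.80, 164.30, 40.80, 94.30, 230.90, 366.70, 300.60,
              237.80, 121.40, 31.40, 203.50, 441.10, 323.00, 52.50],
    "temp":  [40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58],
    "insul": [3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10],
})

res_oil = smf.ols("oil ~ temp + insul", data=df).fit()
print(res_oil.params)   # should be close to the slide output:
                        # intercept ~562.15, temp ~-5.44, insul ~-20.01

# Predicted consumption at 15 deg F with 10 inches of insulation
new = pd.DataFrame({"temp": [15], "insul": [10]})
print(res_oil.predict(new))   # roughly 562.15 - 5.44*15 - 20.01*10, about 280 gal
```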
44. Multiple Regression Equation
[mini-case]
Dependent variable: Gallons Consumed
-------------------------------------------------------------------------------------
Standard T
Parameter Estimate Error Statistic P-Value
--------------------------------------------------------------------------------------
CONSTANT 562.151 21.0931 26.6509 0.0000
Insulation -20.0123 2.34251 -8.54313 0.0000
Temperature -5.43658 0.336216 -16.1699 0.0000
--------------------------------------------------------------------------------------
R-squared = 96.561 percent
R-squared (adjusted for d.f.) = 95.9879 percent
Standard Error of Est. = 26.0138
46. Multiple Regression Equation
[mini-case]
Y-hat = 562.15 - 5.44x1 - 20.01x2
thus:
For a home with zero inches of attic
insulation and an outside temperature of
0 °F, 562.15 gallons of heating oil would
be consumed.
[ caution .. data boundaries .. extrapolation ]
47. Extrapolation is the process of creating new data
points outside a discrete set of known data points.
[Figure: predictions of Y within the relevant range of X are
interpolation; predictions beyond that range are extrapolation.]
48. Multiple Regression Equation
[mini-case]
Y-hat = 562.15 - 5.44x1 - 20.01x2
For a home with zero attic insulation and an outside temperature of zero,
562.15 gallons of heating oil would be consumed.
[ caution .. data boundaries .. extrapolation ]
For each incremental increase in degree F of
temperature, for a given amount of attic
insulation, heating oil consumption drops 5.44
gallons.
49. Multiple Regression Equation
[mini-case]
Y-hat = 562.15 - 5.44x1 - 20.01x2
For a home with zero attic insulation and an outside temperature of zero,
562 gallons of heating oil would be consumed. [ caution … ]
For each incremental increase in degree F of temperature, for a given
amount of attic insulation, heating oil consumption drops 5.44 gallons.
For each incremental increase in inches of
attic insulation, at a given temperature,
heating oil consumption drops 20.01
gallons.
51. Coefficient of Multiple Determination
[mini-case]
R²y.12 = 0.9656
96.56 percent of the variation in
heating oil can be explained by
the variation in temperature and
insulation.
This is a very strong effect of temperature
and attic insulation on oil consumption.
52. Coefficient of Multiple Determination
Proportion of variation in Y 'explained' by all X variables taken
together:
R²Y.12 = Explained variation / Total variation = SSR / SST
(computed below for the mini-case)
SSR is the sum of squares due to regression, Σ(Ŷ − Ȳ)²; SST is the
total sum of squares, Σ(Y − Ȳ)². R-square can take on any value
between 0 and 1, with a value closer to 1 indicating that a
greater proportion of variance is accounted for by the model.
Never decreases when a new X variable is added to the model
Only Y values determine SST
Disadvantage when comparing models
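Continuing the mini-case sketch, R² can be computed directly from its definition:

```python
# R-squared from its definition, using the fitted mini-case model res_oil
import numpy as np

y_obs = df["oil"]
ssr = np.sum((res_oil.fittedvalues - y_obs.mean()) ** 2)  # explained variation
sst = np.sum((y_obs - y_obs.mean()) ** 2)                 # total variation
print(ssr / sst)          # ~0.9656
print(res_oil.rsquared)   # statsmodels agrees
```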
53. Coefficient of Multiple Determination (Adjusted)
Proportion of variation in Y 'explained' by all X variables
taken together
Reflects:
Sample size
Number of independent variables
Smaller [more conservative] than R²Y.12
Used to compare models
54. Coefficient of Multiple Determination
(adjusted)
R²(adj) y.123...P
The proportion of Y that is explained by the
set of independent [explanatory] variables
selected, adjusted for the number of
independent variables and the sample size.
55. Coefficient of Multiple Determination
(adjusted) [Mini-Case]
R²adj = 0.9599
95.99 percent of the variation in
heating oil consumption can be
explained by the model, adjusted
for the number of independent
variables and the sample size.
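A worked check: the usual formula is R²adj = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the sample size and k the number of independent variables. Plugging in the mini-case numbers reproduces the slide's value:

```python
# Adjusted R-squared for the mini-case: n = 15 observations, k = 2 predictors
r2, n, k = 0.9656, 15, 2
print(1 - (1 - r2) * (n - 1) / (n - k - 1))   # ~0.9599, matching the slide
```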
56. Coefficient of Partial Determination
Proportion of variation in Y 'explained' by variable XP,
holding all others constant
Must estimate separate models
Denoted R²Y1.2 in the two-X-variable case:
the coefficient of partial determination of X1 with Y,
holding X2 constant
Useful in selecting X variables
57. Coefficient of Partial Determination [p. 878]
R²y1.234...P
The coefficient of partial determination of
variable Y with x1, holding constant
the effects of variables x2, x3, x4, ..., xP.
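One way to estimate it, sketched on the mini-case data: fit the model with and without x1 and compare error sums of squares. This assumes the conventional definition R²Y1.2 = (SSE(X2) - SSE(X1, X2)) / SSE(X2):

```python
# Partial determination of temp with oil, holding insul constant
res_full = smf.ols("oil ~ temp + insul", data=df).fit()
res_red  = smf.ols("oil ~ insul", data=df).fit()

# caution: statsmodels' .ssr attribute is the *residual* (error) sum of squares
r2_partial = (res_red.ssr - res_full.ssr) / res_red.ssr
print(r2_partial)
```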
58. Testing Overall Significance
Shows if there is a linear relationship between all X
variables together & Y
Uses p-value
Hypotheses
H0: β1 = β2 = ... = βP = 0
No linear relationship
H1: At least one coefficient is not 0
At least one X variable affects Y
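With statsmodels, the overall F test is reported directly on the fitted results; continuing the mini-case fit:

```python
# Overall significance: H0 is that every slope coefficient equals zero
print(res_full.fvalue)     # large F favors the model
print(res_full.f_pvalue)   # p < 0.05 => at least one X affects Y
```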
59. Testing Model Portions
Examines the contribution of a set of X
variables to the relationship with Y
Null hypothesis:
Variables in the set do not significantly
improve the model when all other
variables are included
Must estimate separate models
Used in selecting X variables
61. Multicollinearity
It is the occurrence of high
intercorrelations among two or
more independent variables in
a multiple regression model.
62. Multicollinearity is a problem
Multicollinearity is a problem because it
produces regression model results that are
less reliable.
This is due to wider confidence intervals
(larger standard errors) that
can lower the statistical significance of
regression coefficients.
63. Multicollinearity
High correlation between X variables
Coefficients measure combined effect
Leads to unstable coefficients depending on X
variables in model
Always exists; matter of degree
Example: Using both total number of rooms and
number of bedrooms as explanatory variables in same
model (independent variables)
64. Detecting Multicollinearity
Examine correlation matrix
Correlations between pairs of X variables are more
than with Y variable
A few solutions (remedies):
Obtain new sample data
Eliminate one correlated X variable
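Both checks are easy to sketch on the mini-case predictors. The variance inflation factor (VIF) threshold of 10 (some use 5) is a common rule of thumb, not from the slides:

```python
# Detecting multicollinearity: pairwise correlations plus VIFs
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

print(df[["temp", "insul"]].corr())   # correlation matrix of the X variables

X_vif = sm.add_constant(df[["temp", "insul"]])
for i, col in enumerate(X_vif.columns):
    if col != "const":
        print(col, variance_inflation_factor(X_vif.values, i))
```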
65. Evaluating Multiple Regression Model Steps
Examine variation measures
Do residual analysis
Test parameter significance
Overall model
Portions of model
Individual coefficients
Test for multicollinearity
68. Dummy-Variable Regression Model
Involves categorical X variable with
two levels
e.g., female-male,
employed-not employed, etc.
Variable levels coded 0 & 1
69. Dummy-Variable Regression Model
Assumes only intercept is different
Slopes are constant across categories
71. Dummy Variables
Permits use of qualitative data
(e.g.: seasonal, class standing, location, gender).
0, 1 coding (nominative data).
As part of diagnostic checking, incorporate
outliers (i.e.: large residuals) and influence measures.
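A minimal sketch of 0/1 coding with pandas; the gender column and the coding direction are hypothetical:

```python
# 0/1 coding of a two-level categorical variable
import pandas as pd

d = pd.DataFrame({"gender": ["male", "female", "female", "male"]})
d["female"] = (d["gender"] == "female").astype(int)  # 1 = female, 0 = male
print(d)
```

For variables with more levels, pandas' get_dummies produces the same kind of 0/1 columns in one call.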
73. Interaction Regression Model
Hypothesizes interaction between pairs of X
variables
Response to one X variable varies at different
levels of another X variable
Contains two-way cross product terms
Y = β0 + β1x1 + β2x2 + β3x1x2 + ε
Can be combined with other models
e.g. dummy variable models
74. Effect of Interaction
Given: Yi = β0 + β1X1i + β2X2i + β3X1iX2i + εi
Without the interaction term, the effect of X1 on Y is measured by β1
With the interaction term, the effect of X1 on Y is measured by β1 + β3X2
The effect increases as X2i increases
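A sketch of fitting an interaction model with statsmodels' formula API on simulated data; all names and the true coefficients are made up for illustration:

```python
# Interaction model: in a patsy formula, x1*x2 expands to x1 + x2 + x1:x2
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
data = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
data["y"] = 1 + 2*data.x1 + 3*data.x2 + 0.5*data.x1*data.x2 + rng.normal(size=50)

res_int = smf.ols("y ~ x1 * x2", data=data).fit()
print(res_int.params)   # the x1:x2 coefficient estimates beta3
```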
80. The Difference between Linear and Nonlinear
Regression Models
The difference between linear and nonlinear
regression models isn’t as straightforward as it
sounds.
You’d think that linear equations produce straight
lines and nonlinear equations model curvature.
Unfortunately, that’s not correct.
81. Linear Regression Equations
A linear regression model follows a very particular
form. In statistics, a regression model is linear
when all terms in the model are one of the
following:
The constant
A parameter multiplied by an independent
variable (IV)
Then, you build the equation by only adding the
terms together. These rules limit the form to just
one type:
82. Linear regression
Dependent variable = constant + parameter * IV + … +
parameter * IV
83. The regression example below models the relationship between body mass
index (BMI) and body fat percent.
It is a linear model that uses a quadratic (squared) term to model the
curved relationship.
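The point generalizes: a squared term keeps the model linear in its parameters, so ordinary least squares still applies. A sketch on simulated stand-in data, not the BMI data behind the example:

```python
# A quadratic term inside a *linear* regression model
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
body = pd.DataFrame({"bmi": rng.uniform(18, 40, size=80)})
body["fat"] = 5 + 1.2*body.bmi + 0.03*body.bmi**2 + rng.normal(size=80)

# I(bmi**2) adds the squared term; the model stays linear in the parameters
res_quad = smf.ols("fat ~ bmi + I(bmi ** 2)", data=body).fit()
print(res_quad.params)
```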
84. Nonlinear Regression Equations
I showed how linear regression models have
one basic configuration.
Now, we’ll focus on the “non” in nonlinear! If a
regression equation doesn’t follow the rules for
a linear model, then it must be a nonlinear
model.
It’s that simple! A nonlinear model is literally
not linear.
85. Non linear regression
Consequently, nonlinear regression can fit
an enormous variety of curves.
However, because there are so many
candidates, you may need to conduct some
research to determine which functional
form provides the best fit for your data.
86. Non linear regression
Here I present a handful of examples that
illustrate the diversity of nonlinear regression
models. Keep in mind that each function can
fit a variety of shapes, and there are many
nonlinear functions. Also, notice how nonlinear
regression equations are not comprised of
only addition and multiplication! In the table,
thetas are the parameters and Xs are the
independent variables.
87. Nonlinear Models
Non-linear models that can be expressed in
linear form
Can be estimated by least squares in
linear form
Require data transformation (see the sketch below)
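For example, the exponential model y = a·e^(bx) becomes linear after taking logs: ln y = ln a + bx, which OLS can estimate. A sketch on simulated data:

```python
# Linearizing an exponential model by a log transform, then fitting OLS
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 5, size=60)
y = 2.0 * np.exp(0.7 * x) * np.exp(rng.normal(scale=0.1, size=60))

res_log = sm.OLS(np.log(y), sm.add_constant(x)).fit()
print(np.exp(res_log.params[0]), res_log.params[1])   # recover a (~2.0) and b (~0.7)
```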