This document provides an overview of linear regression analysis. It begins by defining regression analysis and describing its uses in prediction, forecasting, and understanding relationships between variables. It then covers simple and multivariate linear regression, modeling the relationship between one or more predictor variables and a response variable. The document explains how to fit linear regression models in R and how to evaluate model performance using analysis of variance (ANOVA) and other metrics such as the coefficient of correlation. Key concepts such as residuals, least squares estimation, and the assumptions of linear regression are also introduced.
2. Content
What is Regression Analysis
When to use regression
Intuition behind linear regression - Machine learning
Simple Linear Regression
Multivariate Linear Regression
Performance Analysis
ANOVA
Goodness of fit
Confidence & Prediction bands
Assumptions
3. What is Regression Analysis?
In statistics, regression analysis is a statistical process for estimating the relationships among variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed.

Regression analysis is widely used for prediction and forecasting. It is also used to understand which of the independent variables are related to the dependent variable, and to explore the forms of these relationships.
4. When to use regression?
Regression analysis is used to describe the relationship between:
A single response variable: Y; and
One or more predictor variables: X1, X2, …, Xp
• p = 1: Simple Regression
• p > 1: Multivariate Regression
The response variable Y must be a continuous variable. The predictor variables X1, …, Xp can be continuous, discrete, or categorical.
5. The Meaning of the term “Linear”
Linearity in the Variables
The first meaning of linearity is that the conditional expectation of Y, E(Y|Xi), is a linear function of Xi; the regression curve in this case is a straight line. Under this interpretation, E(Y|Xi) = β1 + β2Xi² is not a linear function, because Xi enters with power two.

Linearity in the Parameters
The second interpretation of linearity is that the conditional expectation of Y, E(Y|Xi), is a linear function of the parameters, the β's; it may or may not be linear in the variable X. Under this interpretation, E(Y|Xi) = β1 + β2Xi² is a linear (in the parameters) regression model, as is any model in which each β appears only to the first power.
6. The Meaning of the term “Linear” Cond...
Now consider the model:
E(Y|Xi) = β1 + β2²Xi
The preceding model is an example of a nonlinear (in the parameters) regression model, because the parameter β2 appears squared. From now on, the term "linear" regression will always mean a regression that is linear in the parameters, the β's (that is, the parameters are raised to the first power only).
8. Hypothesis (for one variable)
A learning-algorithm view of simple regression: a training set of (size, price) pairs is fed to a learning algorithm, which produces a hypothesis h; given the size of a house, h returns the estimated price.

[Figure: flowchart Training Set → Learning Algorithm → h, with Size of house → h → Estimated price; and a scatterplot of the training data, Price in 1000s of dollars (0–500) against Size in feet² (500–3000).]
10. Cost function intuition: J(θ1) at θ1 = 1
[Figure: left panel, h(x) for fixed θ1 = 1 (a function of x), plotted through the training points (1,1), (2,2), (3,3); right panel, J as a function of the parameter θ1.]
With θ1 = 1 the line passes through every training point, so the cost is zero (here m = 3):
J(1) = (1/2m)[(1−1)² + (2−2)² + (3−3)²] = 0
11. Cost function intuition: J(θ1) at θ1 = 0.5
[Figure: left panel, h(x) for fixed θ1 = 0.5 (a function of x), plotted against the training points (1,1), (2,2), (3,3); right panel, J as a function of the parameter θ1, with the value at θ1 = 0.5 marked.]
With θ1 = 0.5 the fitted values fall below the data, and the cost is positive:
J(0.5) = (1/2m)[(0.5−1)² + (1−2)² + (1.5−3)²] = 3.5/6 ≈ 0.58
12. Minimizing the cost function
[Figure: left panel, h(x) as a function of x for the best-fitting θ1; right panel, J(θ1) with its minimum at θ1 = 1.]
Learning the parameter means choosing θ1 to minimize the cost:
min over θ1 of J(θ1)
14. Linear Regression in R
SINGLE PREDICTOR
15. Data cleaning & preprocessing
Prior to any analysis, the data should always be inspected for:
• Data-entry errors
• Missing values
• Outliers
• Unusual distributions
• Changes in variability
• Clustering
• Non-linear bivariate relationships
• Unexpected patterns
• …

Useful tools:
• Numerical summaries: 5-number summaries, correlations, …
• Graphical summaries: boxplots, histograms, scatterplots
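As a minimal sketch of such an inspection in R, assuming the production data frame that is loaded on the next slide:

# Numerical summaries: 5-number summaries and correlations
summary(production)
cor(production$RunTime, production$RunSize)
# Missing values and obvious data-entry errors
sum(is.na(production))
# Graphical summaries
boxplot(production$RunTime)
hist(production$RunSize)
plot(RunTime ~ RunSize, data = production)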
16. Simple Linear Regression
Objective
Describe the relationship between two variables, say X and Y, as a straight line; that is, Y is modeled as a linear function of X.

The variables
X: explanatory variable (horizontal axis)
Y: response variable (vertical axis)

After data collection, we have pairs of observations: (X1,Y1),…,(Xn,Yn)

X    Y
X1   Y1
X2   Y2
…    …
Xn   Yn
17. Simple LR model
The regression of variable Y on variable X is given by:
yi = β0 + β1xi + ϵi,   i = 1,…,n
where:
Random error: ϵi ~ N(0, σ²), independent
Linear function: β0 + β1xi = E(Y|X = xi)
Unknown parameters:
- β0 (intercept): the point at which the line intercepts the y-axis;
- β1 (slope): the increase in Y per unit change in X.

Residuals
The difference between the observed value yi and the fitted value ŷi is called the residual and is given by:
ei = yi − ŷi

Least Squares Method (estimation of the unknown parameters)
We want to find the equation of the line that "best" fits the data, i.e., find β0 and β1 such that the fitted values of yi, given by
ŷi = β0 + β1xi,
are as "close" as possible to the observed values yi. The usual way of calculating β0 and β1 is based on minimizing the sum of the squared residuals, the residual sum of squares (RSS):
RSS = Σi ei² = Σi (yi − ŷi)² = Σi (yi − β0 − β1xi)²
18. Simple LR in R
> # Download the data from a url
> production <- read.table(
+   "http://www.stat.tamu.edu/~sheather/book/docs/datasets/production.txt",
+   header=T, sep="")
> # Analyze the data
> head(production)
  Case RunTime RunSize
1    1     195     175
2    2     215     189
3    3     243     344
4    4     162      88
5    5     185     114
6    6     231     338
> table(is.na(production))
FALSE
   60
> str(production)
'data.frame': 20 obs. of 3 variables:
 $ Case   : int 1 2 3 4 5 6 7 8 9 10 ...
 $ RunTime: int 195 215 243 162 185 231 234 166 253 196 ...
 $ RunSize: int 175 189 344 88 114 338 271 173 284 277 ...
> attach(production)
> # Let's plot the data
> plot(RunTime ~ RunSize)
> # Fit the regression model using lm()
> production.lm <- lm(RunTime ~ RunSize, data=production)
> # Use the function summary() to get some results
> summary(production.lm)

Call:
lm(formula = RunTime ~ RunSize, data = production)

Residuals:
    Min      1Q  Median      3Q     Max
-28.597 -11.079   3.329   8.302  29.627

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 149.74770    8.32815   17.98 6.00e-13 ***
RunSize       0.25924    0.03714    6.98 1.61e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.25 on 18 degrees of freedom
Multiple R-squared: 0.7302,  Adjusted R-squared: 0.7152
F-statistic: 48.72 on 1 and 18 DF, p-value: 1.615e-06

> # Plot a line fitting the model
> abline(production.lm)
> # Append fitted values and residuals to the data frame
> production <- data.frame(production,
+   fitted.value=fitted(production.lm),
+   residual=resid(production.lm))
> head(production)
  Case RunTime RunSize fitted.value    residual
1    1     195     175     195.1152  -0.1152469
2    2     215     189     198.7447  16.2553496
3    3     243     344     238.9273   4.0726679
4    4     162      88     172.5611 -10.5610965
5    5     185     114     179.3014   5.6985827
6    6     231     338     237.3719  -6.3718734
20. Multivariate Linear Regression
Objective
Generalize the simple regression methodology in order to describe the relationship between a response variable Y and a set of predictors X1, X2, …, Xp in terms of a linear function.

The variables
X1, …, Xp: explanatory variables
Y: response variable

After data collection, we have the observations:
(X11,…,X1p,Y1),…,(Xn1,…,Xnp,Yn)

X1    …    Xp    Y
X11   …    X1p   Y1
X21   …    X2p   Y2
…     …    …     …
Xn1   …    Xnp   Yn
22. Multivariate LR model
The model is given by:
yi = β0 + β1xi1 + … + βpxip + ϵi,   i = 1,…,n
where:
Random error: ϵi ~ N(0, σ²), independent
Linear function: β0 + β1x1 + … + βpxp = E(Y|x1,…,xp)
Unknown parameters:
- β0: intercept (overall mean)
- βk: regression coefficient of the k-th predictor

Residuals
The difference between the observed value yi and the fitted value ŷi is called the residual and is given by:
ei = yi − ŷi

Least Squares Method (estimation of the unknown parameters)
We want to find the equation of the hyperplane that "best" fits the data, i.e., find β0, β1, …, βp such that the fitted values of yi, given by
ŷi = β0 + β1xi1 + … + βpxip,
are as "close" as possible to the observed values yi. As before, β0, β1, …, βp are calculated by minimizing the sum of the squared residuals, the residual sum of squares (RSS):
RSS = Σi ei² = Σi (yi − ŷi)² = Σi (yi − β0 − β1xi1 − … − βpxip)²
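Fitting such a model in R uses the same lm() call as the single-predictor case, with extra terms joined by '+'. A sketch on simulated data (the variable names x1, x2 are illustrative, not from the deck):

# Hypothetical data: two predictors and one response
set.seed(1)
x1 <- runif(30); x2 <- runif(30)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(30, sd = 0.3)
# Additional predictors are joined with '+' in the formula
fit <- lm(y ~ x1 + x2)
summary(fit)   # one t-test per coefficient, plus the overall F-test
coef(fit)      # estimates of beta0, beta1, beta2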
24. Analysis of Variance (ANOVA)
The total sample variability decomposes into a part explained by the model and an unexplained (error) part:
TSS (total sample variability) = SSreg (variability explained by the model) + RSS (unexplained, or error, variability)

> anova(production.lm)
Analysis of Variance Table

Response: RunTime
          Df  Sum Sq Mean Sq F value    Pr(>F)
RunSize    1 12868.4 12868.4  48.717 1.615e-06 ***
Residuals 18  4754.6   264.1
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table gives us the following information:
• degrees of freedom
• the sum of squares
• the mean square
• the F ratio
• the p-value
25. ANOVA Cond…
Select a model:
y = β0 + β1x1 + β2x2 + β3x3 + … + ε
Use sample data to estimate the unknown parameters, then evaluate how useful the model is.

If we want to test the usefulness of a particular term in our model, we perform a t-test and look at the p-value for that term. However, if we want to test whether any of the terms in our model are useful in predicting y, we use the F-test.

The F-test is a test of the hypothesis:
H0: β1 = β2 = … = βk = 0
H1: At least one of the coefficients is non-zero

Note 1: H0 always includes all of the parameters except the y-intercept β0.
Note 2: this test has the general set-up:
H0: None of the explanatory variables are helping
H1: At least one of the explanatory variables is helping
which shares the general format seen throughout the last couple of chapters:
H0: Model not useful
H1: Model useful

Once we know the test statistic of our F-test, we will often want to determine whether it is significant. As in all our tests, if the test statistic is more extreme (i.e., greater) than the critical value, we reject H0. By rejecting H0 we are saying that our model is significantly better than simply estimating y with its mean, avg(y).
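This overall F-test can be reproduced explicitly in R by comparing the fitted model against the intercept-only model with anova(); a sketch using the production model fitted earlier:

# Intercept-only model: estimate RunTime by its mean
null.lm <- lm(RunTime ~ 1, data = production)
# F-test of H0: beta1 = 0 against the full model
anova(null.lm, production.lm)
# The F value and p-value match summary(production.lm)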
26. The Coefficient Of Correlation
The correlation coefficient (denoted r) is a measure of the strength of the linear relationship between x and y. It always lies between −1 and 1.
If r is near −1 or 1, there is a strong linear relationship.
If r is near 0, there is little or no linear relationship.
A positive correlation occurs when an increase in one variable typically leads to an increase in the other variable; a negative correlation occurs when an increase in one variable typically leads to a decrease in the other.

r = SSXY / √(SSXX · SSYY)
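In R, r is computed with cor(); a quick check against the production fit from the earlier slides:

# Pearson correlation between predictor and response
r <- cor(production$RunSize, production$RunTime)
r      # about 0.85: a fairly strong positive linear relationship
r^2    # about 0.73, matching the Multiple R-squared of the fit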
27. Measuring Goodness of Fit
Coefficient of determination, r²
Represents the proportion of the total sample variability explained by the regression model, and so indicates how well the model fits the data. For simple linear regression, the r² statistic equals the square of the correlation between Y and X.

r² = (SSyy − SSE) / SSyy = 1 − SSE / SSyy

About 100(r²)% of the sample variation in y can be explained by (or attributed to) using x to predict y in the straight-line model. Ideally this value will be close to 1.

Adjusted r², r²adj
The adjusted r² takes into account the number of degrees of freedom and is preferable to r².

Important note: neither r² nor r²adj gives a direct indication of how well the model will perform in the prediction of a new observation.
28. Confidence & Prediction band
Confidence Bands
Reflect the uncertainty about the regression line (how well the line is determined).

Prediction Bands
Also include the uncertainty about future observations.

Attention
These limits rely strongly on the assumption of normally distributed errors with constant variance and should not be used if this assumption is violated for the data being analyzed.

[Figure: scatterplot of RunTime (50–350) against RunSize (50–350) with the fitted line, confidence bands (dashed), and prediction bands (dotted).]

> predict(production.lm, interval="confidence")
        fit      lwr      upr
1  195.1152 187.2000 203.0305
2  198.7447 191.0450 206.4443
…
20 167.3762 154.4448 180.3077
> predict(production.lm, interval="prediction")
        fit      lwr      upr
1  195.1152 160.0646 230.1659
2  198.7447 163.7421 233.7472
…
20 167.3762 130.8644 203.8881

# Create a new data frame containing the values of X
# at which we want the predictions to be made
pred.frame <- data.frame(RunSize=seq(55,345,by=10))
# Confidence bands
pc <- predict(production.lm, int="c", newdata=pred.frame)
# Prediction bands
pp <- predict(production.lm, int="p", newdata=pred.frame)
require(graphics)
# Standard scatterplot with extended limits
plot(RunSize, RunTime, ylim=range(RunSize, pp, na.rm=T))
pred.Size <- pred.frame$RunSize
# Add the fitted line and the two sets of bands
matlines(pred.Size, pc, lty=c(1,2,2), lwd=1.5, col=1)
matlines(pred.Size, pp, lty=c(1,3,3), lwd=1.5, col=1)
29. Validity of regression model
For all four data sets, the fitted regression is the same:
ŷ = 3.0 + 0.5x
All models have r² = 0.67, σ̂ = 1.24, and the slope coefficients are significant at the < 1% level. Yet the scatterplots show that only one of the data sets is appropriately described by a straight line, which is why a fitted model must always be validated against the data.

[Figure: scatterplots of the four data sets, each with the same fitted line.]
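These appear to be the classic numbers of Anscombe's quartet, which ships with R as the anscombe data set; assuming so, the point can be reproduced directly:

data(anscombe)
# Four very different data sets, one identical fitted line
fit1 <- lm(y1 ~ x1, data = anscombe)
fit2 <- lm(y2 ~ x2, data = anscombe)
coef(fit1)  # intercept ~3.0, slope ~0.5
coef(fit2)  # the same, despite a clearly curved relationship
summary(fit1)$r.squared  # ~0.67 for every set
# Only the scatterplots reveal the difference
plot(y1 ~ x1, data = anscombe); abline(fit1)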
30. Residual plots
A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.

[Figure: three residual plots. The first shows a random pattern, indicating a good fit for a linear model; the other two patterns are non-random (U-shaped and inverted U), suggesting a better fit for a non-linear model.]
33. Influential point
Outliers
Data points that diverge in a big way from the overall pattern are called outliers. There are four ways a data point might be considered an outlier:
• it could have an extreme X value compared to other data points;
• it could have an extreme Y value compared to other data points;
• it could have extreme X and Y values;
• it might be distant from the rest of the data, even without extreme X or Y values.

Influential Points
An influential point is an outlier that greatly affects the slope of the regression line.
34. Influential point Cond…
Leverage/Influential Points
Good leverage points have their standardized residuals within the interval [−2, 2]; outliers are leverage points whose standardized residuals fall outside the interval [−2, 2].

How to deal with them:
• Remove invalid data points, if they look unusual or are different from the rest of the data.
• Fit a different regression model (higher-order terms, transformation), if the model is not valid for the data.
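Standardized residuals are returned by rstandard() in R; a sketch of flagging points outside [−2, 2] for the production model:

# Standardized residuals of the fitted model
sr <- rstandard(production.lm)
# Observations whose standardized residual falls outside [-2, 2]
which(abs(sr) > 2)
# Leverage (hat values) helps separate leverage points from outliers
hatvalues(production.lm)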
35. Normality & constant variance of errors
The normality and constant variance assumptions are necessary for inference:
• hypothesis testing
• confidence intervals
• prediction intervals

Check the normal Q-Q plot of the standardized residuals, and check the plot of standardized residuals vs. X. When these assumptions do not hold, we can try to correct the problem using data transformations.
36. Normality & constant variance check
> production.lm <- lm(RunTime~RunSize, data=production)
> # Residual plots
> plot(production.lm)
37. Cook’s distance
Cook's distance, D
Cook's distance combines the effects of leverage and the magnitude of the residual. It is used to evaluate the impact of a given observation on the estimated regression coefficients; D > 1 indicates undue influence. The Cook's distance plot is obtained by applying the function plot() to the linear model object.
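The distances themselves are available via cooks.distance(), and plot() with which = 4 draws only the Cook's distance panel; a sketch for the production model:

# Cook's distance for each observation
cd <- cooks.distance(production.lm)
# Flag observations with undue influence (D > 1)
which(cd > 1)
# Cook's distance plot alone (panel 4 of plot.lm)
plot(production.lm, which = 4)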
38. Transformation
When to use transformation?
Transformations can be used to correct for:
• non-constant variance
• non-linearity
• non-normality
There are many ways to transform variables to achieve linearity for regression analysis.

[Table: common transformation methods.]
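A minimal sketch of one such transformation in R: a log transform of a strictly positive response whose variance grows with the mean (the simulated data here are illustrative):

# Hypothetical data with variance increasing in x
set.seed(2)
x <- runif(50, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(50, sd = 0.3))
# Untransformed fit: residual spread grows with the fitted values
raw.lm <- lm(y ~ x)
# Log transform stabilizes the variance and linearizes the trend
log.lm <- lm(log(y) ~ x)
plot(fitted(log.lm), resid(log.lm))  # should look patternless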
39. Assumptions for Simple LR
There are four principal assumptions which justify the use of linear regression models for purposes of prediction:
I. linearity of the relationship between the dependent & independent variables: Y = β0 + β1X + ϵ
II. independence of the errors (no serial correlation)
III. homoscedasticity (constant variance) of the errors
a) versus time
b) versus the predictions (or versus any independent variable)
IV. normality of the error distribution.

If any of these assumptions is violated (i.e., if there is nonlinearity, serial correlation, heteroscedasticity, and/or non-normality), then the forecasts, confidence intervals, and economic insights yielded by a regression model may be inefficient, seriously biased, or misleading.

What can go wrong?
Violations in the linear regression model:
• linearity (e.g., a quadratic relationship or higher-order terms)
Violations in the residual assumptions:
• non-normal distribution
• non-constant variances
• dependence
• outliers

Checks:
Residuals vs. each predictor variable
o nonlinearity: add higher-order terms in that variable
Residuals vs. fitted values
o variance increasing with the response: transformation
Residuals Q-Q norm plot
o deviation from a straight line: non-normality
40. Violations of linearity
Violations of linearity are extremely serious: if you fit a linear model to data which are nonlinearly related, your predictions are likely to be seriously in error, especially when you extrapolate beyond the range of the sample data.

How to detect
Plot observed vs. predicted values, or residuals vs. predicted values. Look carefully for evidence of a "bowed" pattern, indicating that the model makes systematic errors whenever it is making unusually large or small predictions.

How to fix
Consider applying a nonlinear transformation to the dependent and/or independent variables. For example, if the data are strictly positive, a log transformation may be feasible.
Another possibility is adding another regressor which is a nonlinear function of one of the other variables. For example, if you have regressed Y on X, and the graph of residuals versus predicted values suggests a parabolic curve, then it may make sense to regress Y on both X and X² (see the sketch below). The latter transformation is possible even when X and/or Y have negative values, whereas logging may not be.
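A sketch of the second fix in R; I() protects the arithmetic inside the formula, and mydata is a hypothetical data frame with columns Y and X:

# Suppose residuals vs. fitted suggest a parabolic pattern
quad.lm <- lm(Y ~ X + I(X^2), data = mydata)   # mydata is hypothetical
# Compare against the straight-line model with an F-test
anova(lm(Y ~ X, data = mydata), quad.lm)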
41. Violations of homoscedasticity
Violations of homoscedasticity make it difficult to gauge the true standard deviation of the forecast errors, usually resulting in confidence intervals that are too wide or too narrow. In particular, if the variance of the errors is increasing over time, confidence intervals for out-of-sample predictions will tend to be unrealistically narrow. Heteroscedasticity may also have the effect of giving too much weight to a small subset of the data (namely the subset where the error variance was largest) when estimating coefficients.

How to detect
Plot residuals vs. time, and residuals vs. predicted values, and check for residuals that are getting larger (i.e., more spread out) either as a function of time or as a function of the predicted value. (To be really thorough, you might also want to plot residuals versus some of the independent variables; see the sketch below.)

How to fix
In time series models, heteroscedasticity often arises due to the effects of inflation and/or real compound growth, perhaps magnified by a multiplicative seasonal pattern. Some combination of logging and/or deflating will often stabilize the variance in this case. A simple fix would be to work with shorter intervals of data in which volatility is more nearly constant. Heteroscedasticity can also be a byproduct of a significant violation of the linearity and/or independence assumptions, in which case it may also be fixed as a byproduct of fixing those problems.
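For a fitted lm these diagnostic plots are one call away; a sketch for the production model:

# Residuals vs. fitted values: look for a funnel shape
plot(fitted(production.lm), resid(production.lm))
abline(h = 0, lty = 2)
# Scale-location plot (panel 3 of plot.lm) shows the same thing
plot(production.lm, which = 3)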
42. Violations of normality
Violations of normality compromise the estimation of coefficients and the calculation of confidence intervals. Sometimes the error distribution is "skewed" by the presence of a few large outliers. Since parameter estimation is based on the minimization of squared error, a few extreme observations can exert a disproportionate influence on parameter estimates. Calculation of confidence intervals and various significance tests for coefficients are all based on the assumption of normally distributed errors. If the error distribution is significantly non-normal, confidence intervals may be too wide or too narrow.

How to detect
The best test for normally distributed errors is a normal probability plot of the residuals.
o This is a plot of the fractiles of the error distribution versus the fractiles of a normal distribution having the same mean and variance. If the distribution is normal, the points on this plot should fall close to the diagonal line.
o A bow-shaped pattern of deviations from the diagonal indicates that the residuals have excessive skewness (i.e., they are not symmetrically distributed, with too many large errors in the same direction).
o An S-shaped pattern of deviations indicates that the residuals have excessive kurtosis, i.e., there are either too many or too few large errors in both directions.

How to fix
Violations of normality often arise either because (a) the distributions of the dependent and/or independent variables are themselves significantly non-normal, and/or (b) the linearity assumption is violated. In such cases, a nonlinear transformation of the variables might cure both problems.
In some cases, the problem with the residual distribution is mainly due to one or two very large errors. Such values should be scrutinized closely: are they genuine (i.e., not the result of data-entry errors), are they explainable, are similar events likely to occur again in the future, and how influential are they in your model-fitting results? (The "influence measures" report is a guide to the relative influence of extreme observations.) If they are merely errors, or if they can be explained as unique events not likely to be repeated, then you may have cause to remove them. In some cases, however, it may be that the extreme values in the data provide the most useful information about the values of some of the coefficients and/or provide the most realistic guide to the magnitudes of forecast errors.
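In R, the normal probability plot of residuals is qqnorm() followed by qqline(); a small sketch for the production model, with the Shapiro-Wilk test as an optional formal supplement:

# Normal Q-Q plot of the residuals
qqnorm(resid(production.lm))
qqline(resid(production.lm))   # points should hug this line
# A formal (sample-size sensitive) test of normality
shapiro.test(resid(production.lm))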