This document provides an overview of linear regression analysis. It begins by defining regression analysis and describing its uses in prediction, forecasting, and understanding relationships between variables. It then covers simple and multivariate linear regression, modeling the relationship between one or more predictor variables and a response variable. The document explains how to fit linear regression models in R and how to evaluate model performance using analysis of variance (ANOVA) and other metrics such as the coefficient of correlation. Key concepts such as residuals, least squares estimation, and the assumptions of linear regression are also introduced.
2. Content
What is Regression Analysis
When to use regression
Intuition behind linear regression - Machine learning
Simple Linear Regression
Multivariate Linear Regression
Performance Analysis
ANOVA
Goodness of fit
Confidence & Prediction bands
Assumptions
3. What is Regression Analysis?
In statistics, regression analysis is a statistical process for estimating the relationships among variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed.

Regression analysis is widely used for prediction and forecasting. It is also used to understand which of the independent variables are related to the dependent variable, and to explore the forms of these relationships.
4. When to use regression?
Regression analysis is used to describe the relationship between:
A single response variable: Y; and
One or more predictor variables: X1, X2, …, Xp
• p = 1: Simple Regression
• p > 1: Multivariate Regression
The response variable Y must be a continuous variable. The predictor variables X1, …, Xp can be continuous, discrete, or categorical.
5. The Meaning of the term “Linear”
Linearity in the Variables
The first meaning of linearity is that the conditional expectation of Y, E(Y|Xi), is a linear function of Xi; the regression curve in this case is a straight line. Under this interpretation, E(Y|Xi) = β1 + β2Xi² is not a linear function, because Xi enters with power two.

Linearity in the Parameters
The second interpretation of linearity is that the conditional expectation of Y, E(Y|Xi), is a linear function of the parameters, the β's; it may or may not be linear in the variable X. Under this interpretation, E(Y|Xi) = β1 + β2Xi² is a linear (in the parameters) regression model, as is any model in which each β appears only to the first power.
6. The Meaning of the term “Linear” Cond...
Now consider the model:
E(Y|Xi) = β1 + β2²Xi
The preceding model is an example of a nonlinear (in the parameters) regression model, because the parameter β2 appears squared. From now on, the term "linear" regression will always mean a regression that is linear in the parameters, the β's (that is, the parameters are raised to the first power only).
8. Hypothesis (for one variable)
A learning-algorithm view of simple regression: a training set of (size, price) pairs is fed to a learning algorithm, which produces a hypothesis h; given the size of a house, h returns the estimated price.

[Figure: flowchart Training Set → Learning Algorithm → h, with Size of house → h → Estimated price; and a scatterplot of the training data, Price in 1000s of dollars (0–500) against Size in feet² (500–3000).]
10. Cost function intuition: J(θ1) at θ1 = 1
[Figure: left panel, h(x) for fixed θ1 = 1 (a function of x), plotted through the training points (1,1), (2,2), (3,3); right panel, J as a function of the parameter θ1.]
With θ1 = 1 the line passes through every training point, so the cost is zero (here m = 3):
J(1) = (1/2m)[(1−1)² + (2−2)² + (3−3)²] = 0
11. Cost function intuition: J(θ1) at θ1 = 0.5
[Figure: left panel, h(x) for fixed θ1 = 0.5 (a function of x), plotted against the training points (1,1), (2,2), (3,3); right panel, J as a function of the parameter θ1, with the value at θ1 = 0.5 marked.]
With θ1 = 0.5 the fitted values fall below the data, and the cost is positive:
J(0.5) = (1/2m)[(0.5−1)² + (1−2)² + (1.5−3)²] = 3.5/6 ≈ 0.58
12. Minimizing the cost function
[Figure: left panel, h(x) as a function of x for the best-fitting θ1; right panel, J(θ1) with its minimum at θ1 = 1.]
Learning the parameter means choosing θ1 to minimize the cost:
min over θ1 of J(θ1)
14. Linear Regression in R
SINGLE PREDICTOR
15. Data cleaning & preprocessing
Prior to any analysis, the data should always be inspected for:
• Data-entry errors
• Missing values
• Outliers
• Unusual distributions
• Changes in variability
• Clustering
• Non-linear bivariate relationships
• Unexpected patterns
• …

Useful tools:
• Numerical summaries: 5-number summaries, correlations, …
• Graphical summaries: boxplots, histograms, scatterplots
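As a minimal sketch of such an inspection in R, assuming the production data frame that is loaded on the next slide:

# Numerical summaries: 5-number summaries and correlations
summary(production)
cor(production$RunTime, production$RunSize)
# Missing values and obvious data-entry errors
sum(is.na(production))
# Graphical summaries
boxplot(production$RunTime)
hist(production$RunSize)
plot(RunTime ~ RunSize, data = production)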
16. Simple Linear Regression
Objective
Describe the relationship between two variables, say X and Y, as a straight line; that is, Y is modeled as a linear function of X.

The variables
X: explanatory variable (horizontal axis)
Y: response variable (vertical axis)

After data collection, we have pairs of observations: (X1,Y1),…,(Xn,Yn)

X    Y
X1   Y1
X2   Y2
…    …
Xn   Yn
17. Simple LR model
The regression of variable Y on variable X is given by:
yi = β0 + β1xi + ϵi,   i = 1,…,n
where:
Random error: ϵi ~ N(0, σ²), independent
Linear function: β0 + β1xi = E(Y|X = xi)
Unknown parameters:
- β0 (intercept): the point at which the line intercepts the y-axis;
- β1 (slope): the increase in Y per unit change in X.

Residuals
The difference between the observed value yi and the fitted value ŷi is called the residual and is given by:
ei = yi − ŷi

Least Squares Method (estimation of the unknown parameters)
We want to find the equation of the line that "best" fits the data, i.e., find β0 and β1 such that the fitted values of yi, given by
ŷi = β0 + β1xi,
are as "close" as possible to the observed values yi. The usual way of calculating β0 and β1 is based on minimizing the sum of the squared residuals, the residual sum of squares (RSS):
RSS = Σi ei² = Σi (yi − ŷi)² = Σi (yi − β0 − β1xi)²
18. Simple LR in R
> # Download the data from a url
> production <- read.table(
+   "http://www.stat.tamu.edu/~sheather/book/docs/datasets/production.txt",
+   header=T, sep="")
> # Analyze the data
> head(production)
  Case RunTime RunSize
1    1     195     175
2    2     215     189
3    3     243     344
4    4     162      88
5    5     185     114
6    6     231     338
> table(is.na(production))
FALSE
   60
> str(production)
'data.frame': 20 obs. of 3 variables:
 $ Case   : int 1 2 3 4 5 6 7 8 9 10 ...
 $ RunTime: int 195 215 243 162 185 231 234 166 253 196 ...
 $ RunSize: int 175 189 344 88 114 338 271 173 284 277 ...
> attach(production)
> # Let's plot the data
> plot(RunTime ~ RunSize)
> # Fit the regression model using lm()
> production.lm <- lm(RunTime ~ RunSize, data=production)
> # Use the function summary() to get some results
> summary(production.lm)

Call:
lm(formula = RunTime ~ RunSize, data = production)

Residuals:
    Min      1Q  Median      3Q     Max
-28.597 -11.079   3.329   8.302  29.627

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 149.74770    8.32815   17.98 6.00e-13 ***
RunSize       0.25924    0.03714    6.98 1.61e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.25 on 18 degrees of freedom
Multiple R-squared: 0.7302,  Adjusted R-squared: 0.7152
F-statistic: 48.72 on 1 and 18 DF, p-value: 1.615e-06

> # Plot a line fitting the model
> abline(production.lm)
> # Append fitted values and residuals to the data frame
> production <- data.frame(production,
+   fitted.value=fitted(production.lm),
+   residual=resid(production.lm))
> head(production)
  Case RunTime RunSize fitted.value    residual
1    1     195     175     195.1152  -0.1152469
2    2     215     189     198.7447  16.2553496
3    3     243     344     238.9273   4.0726679
4    4     162      88     172.5611 -10.5610965
5    5     185     114     179.3014   5.6985827
6    6     231     338     237.3719  -6.3718734
20. Multivariate Linear Regression
Objective
Generalize the simple regression methodology in order to describe the relationship between a response variable Y and a set of predictors X1, X2, …, Xp in terms of a linear function.

The variables
X1, …, Xp: explanatory variables
Y: response variable

After data collection, we have the observations:
(X11,…,X1p,Y1),…,(Xn1,…,Xnp,Yn)

X1    …    Xp    Y
X11   …    X1p   Y1
X21   …    X2p   Y2
…     …    …     …
Xn1   …    Xnp   Yn
22. Multivariate LR model
The model is given by:
yi = β0 + β1xi1 + … + βpxip + ϵi,   i = 1,…,n
where:
Random error: ϵi ~ N(0, σ²), independent
Linear function: β0 + β1x1 + … + βpxp = E(Y|x1,…,xp)
Unknown parameters:
- β0: intercept (overall mean)
- βk: regression coefficient of the k-th predictor

Residuals
The difference between the observed value yi and the fitted value ŷi is called the residual and is given by:
ei = yi − ŷi

Least Squares Method (estimation of the unknown parameters)
We want to find the equation of the hyperplane that "best" fits the data, i.e., find β0, β1, …, βp such that the fitted values of yi, given by
ŷi = β0 + β1xi1 + … + βpxip,
are as "close" as possible to the observed values yi. As before, β0, β1, …, βp are calculated by minimizing the sum of the squared residuals, the residual sum of squares (RSS):
RSS = Σi ei² = Σi (yi − ŷi)² = Σi (yi − β0 − β1xi1 − … − βpxip)²
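Fitting such a model in R uses the same lm() call as the single-predictor case, with extra terms joined by '+'. A sketch on simulated data (the variable names x1, x2 are illustrative, not from the deck):

# Hypothetical data: two predictors and one response
set.seed(1)
x1 <- runif(30); x2 <- runif(30)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(30, sd = 0.3)
# Additional predictors are joined with '+' in the formula
fit <- lm(y ~ x1 + x2)
summary(fit)   # one t-test per coefficient, plus the overall F-test
coef(fit)      # estimates of beta0, beta1, beta2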
24. Analysis of Variance (ANOVA)
The total sample variability decomposes into a part explained by the model and an unexplained (error) part:
TSS (total sample variability) = SSreg (variability explained by the model) + RSS (unexplained, or error, variability)

> anova(production.lm)
Analysis of Variance Table

Response: RunTime
          Df  Sum Sq Mean Sq F value    Pr(>F)
RunSize    1 12868.4 12868.4  48.717 1.615e-06 ***
Residuals 18  4754.6   264.1
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table gives us the following information:
• degrees of freedom
• the sum of squares
• the mean square
• the F ratio
• the p-value
25. ANOVA Cond…
Select a model:
y = β0 + β1x1 + β2x2 + β3x3 + … + ε
Use sample data to estimate the unknown parameters, then evaluate how useful the model is.

If we want to test the usefulness of a particular term in our model, we perform a t-test and look at the p-value for that term. However, if we want to test whether any of the terms in our model are useful in predicting y, we use the F-test.

The F-test is a test of the hypothesis:
H0: β1 = β2 = … = βk = 0
H1: At least one of the coefficients is non-zero

Note 1: H0 always includes all of the parameters except the y-intercept β0.
Note 2: this test has the general set-up:
H0: None of the explanatory variables are helping
H1: At least one of the explanatory variables is helping
which shares the general format seen throughout the last couple of chapters:
H0: Model not useful
H1: Model useful

Once we know the test statistic of our F-test, we will often want to determine whether it is significant. As in all our tests, if the test statistic is more extreme (i.e., greater) than the critical value, we reject H0. By rejecting H0 we are saying that our model is significantly better than simply estimating y with its mean, avg(y).
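This overall F-test can be reproduced explicitly in R by comparing the fitted model against the intercept-only model with anova(); a sketch using the production model fitted earlier:

# Intercept-only model: estimate RunTime by its mean
null.lm <- lm(RunTime ~ 1, data = production)
# F-test of H0: beta1 = 0 against the full model
anova(null.lm, production.lm)
# The F value and p-value match summary(production.lm)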
26. The Coefficient Of Correlation
The correlation coefficient (denoted r) is a measure of the strength of the linear relationship between x and y. It always lies between −1 and 1.
If r is near −1 or 1, there is a strong linear relationship.
If r is near 0, there is little or no linear relationship.
A positive correlation occurs when an increase in one variable typically leads to an increase in the other variable; a negative correlation occurs when an increase in one variable typically leads to a decrease in the other.

r = SSXY / √(SSXX · SSYY)
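In R, r is computed with cor(); a quick check against the production fit from the earlier slides:

# Pearson correlation between predictor and response
r <- cor(production$RunSize, production$RunTime)
r      # about 0.85: a fairly strong positive linear relationship
r^2    # about 0.73, matching the Multiple R-squared of the fit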
27. Measuring Goodness of Fit
Coefficient of determination, r²
Represents the proportion of the total sample variability explained by the regression model, and so indicates how well the model fits the data. For simple linear regression, the r² statistic equals the square of the correlation between Y and X.

r² = (SSyy − SSE) / SSyy = 1 − SSE / SSyy

About 100(r²)% of the sample variation in y can be explained by (or attributed to) using x to predict y in the straight-line model. Ideally this value will be close to 1.

Adjusted r², r²adj
The adjusted r² takes into account the number of degrees of freedom and is preferable to r².

Important note: neither r² nor r²adj gives a direct indication of how well the model will perform in the prediction of a new observation.
28. Confidence & Prediction band
Confidence Bands
Reflect the uncertainty about the regression line (how well the line is determined).

Prediction Bands
Also include the uncertainty about future observations.

Attention
These limits rely strongly on the assumption of normally distributed errors with constant variance and should not be used if this assumption is violated for the data being analyzed.

[Figure: scatterplot of RunTime (50–350) against RunSize (50–350) with the fitted line, confidence bands (dashed), and prediction bands (dotted).]

> predict(production.lm, interval="confidence")
        fit      lwr      upr
1  195.1152 187.2000 203.0305
2  198.7447 191.0450 206.4443
…
20 167.3762 154.4448 180.3077
> predict(production.lm, interval="prediction")
        fit      lwr      upr
1  195.1152 160.0646 230.1659
2  198.7447 163.7421 233.7472
…
20 167.3762 130.8644 203.8881

# Create a new data frame containing the values of X
# at which we want the predictions to be made
pred.frame <- data.frame(RunSize=seq(55,345,by=10))
# Confidence bands
pc <- predict(production.lm, int="c", newdata=pred.frame)
# Prediction bands
pp <- predict(production.lm, int="p", newdata=pred.frame)
require(graphics)
# Standard scatterplot with extended limits
plot(RunSize, RunTime, ylim=range(RunSize, pp, na.rm=T))
pred.Size <- pred.frame$RunSize
# Add the fitted line and the two sets of bands
matlines(pred.Size, pc, lty=c(1,2,2), lwd=1.5, col=1)
matlines(pred.Size, pp, lty=c(1,3,3), lwd=1.5, col=1)
29. Validity of regression model
For all four data sets, the fitted regression is the same:
ŷ = 3.0 + 0.5x
All models have r² = 0.67, σ̂ = 1.24, and the slope coefficients are significant at the < 1% level. Yet the scatterplots show that only one of the data sets is appropriately described by a straight line, which is why a fitted model must always be validated against the data.

[Figure: scatterplots of the four data sets, each with the same fitted line.]
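These appear to be the classic numbers of Anscombe's quartet, which ships with R as the anscombe data set; assuming so, the point can be reproduced directly:

data(anscombe)
# Four very different data sets, one identical fitted line
fit1 <- lm(y1 ~ x1, data = anscombe)
fit2 <- lm(y2 ~ x2, data = anscombe)
coef(fit1)  # intercept ~3.0, slope ~0.5
coef(fit2)  # the same, despite a clearly curved relationship
summary(fit1)$r.squared  # ~0.67 for every set
# Only the scatterplots reveal the difference
plot(y1 ~ x1, data = anscombe); abline(fit1)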
30. Residual plots
A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.

[Figure: three residual plots. The first shows a random pattern, indicating a good fit for a linear model; the other two patterns are non-random (U-shaped and inverted U), suggesting a better fit for a non-linear model.]
33. Influential point
Outliers
Data points that diverge in a big way from the overall pattern are called outliers. There are four ways a data point might be considered an outlier:
• it could have an extreme X value compared to other data points;
• it could have an extreme Y value compared to other data points;
• it could have extreme X and Y values;
• it might be distant from the rest of the data, even without extreme X or Y values.

Influential Points
An influential point is an outlier that greatly affects the slope of the regression line.
34. Influential point Cond…
Leverage/Influential Points
Good leverage points have their standardized residuals within the interval [−2, 2]; outliers are leverage points whose standardized residuals fall outside the interval [−2, 2].

How to deal with them:
• Remove invalid data points, if they look unusual or are different from the rest of the data.
• Fit a different regression model (higher-order terms, transformation), if the model is not valid for the data.
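Standardized residuals are returned by rstandard() in R; a sketch of flagging points outside [−2, 2] for the production model:

# Standardized residuals of the fitted model
sr <- rstandard(production.lm)
# Observations whose standardized residual falls outside [-2, 2]
which(abs(sr) > 2)
# Leverage (hat values) helps separate leverage points from outliers
hatvalues(production.lm)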
35. Normality & constant variance of errors
The normality and constant variance assumptions are necessary for inference:
• hypothesis testing
• confidence intervals
• prediction intervals

Check the normal Q-Q plot of the standardized residuals, and check the plot of standardized residuals vs. X. When these assumptions do not hold, we can try to correct the problem using data transformations.
36. Normality & constant variance check
> production.lm <- lm(RunTime~RunSize, data=production)
> # Residual plots
> plot(production.lm)
37. Cook’s distance
Cook's distance, D
Cook's distance combines the effects of leverage and the magnitude of the residual. It is used to evaluate the impact of a given observation on the estimated regression coefficients; D > 1 indicates undue influence. The Cook's distance plot is obtained by applying the function plot() to the linear model object.
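The distances themselves are available via cooks.distance(), and plot() with which = 4 draws only the Cook's distance panel; a sketch for the production model:

# Cook's distance for each observation
cd <- cooks.distance(production.lm)
# Flag observations with undue influence (D > 1)
which(cd > 1)
# Cook's distance plot alone (panel 4 of plot.lm)
plot(production.lm, which = 4)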
38. Transformation
When to use transformation?
Transformations can be used to correct for:
• non-constant variance
• non-linearity
• non-normality
There are many ways to transform variables to achieve linearity for regression analysis.

[Table: common transformation methods.]
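A minimal sketch of one such transformation in R: a log transform of a strictly positive response whose variance grows with the mean (the simulated data here are illustrative):

# Hypothetical data with variance increasing in x
set.seed(2)
x <- runif(50, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(50, sd = 0.3))
# Untransformed fit: residual spread grows with the fitted values
raw.lm <- lm(y ~ x)
# Log transform stabilizes the variance and linearizes the trend
log.lm <- lm(log(y) ~ x)
plot(fitted(log.lm), resid(log.lm))  # should look patternless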
39. Assumptions for Simple LR
There are four principal assumptions which justify the use of linear regression models for purposes of prediction:
I. linearity of the relationship between the dependent & independent variables: Y = β0 + β1X + ϵ
II. independence of the errors (no serial correlation)
III. homoscedasticity (constant variance) of the errors
a) versus time
b) versus the predictions (or versus any independent variable)
IV. normality of the error distribution.

If any of these assumptions is violated (i.e., if there is nonlinearity, serial correlation, heteroscedasticity, and/or non-normality), then the forecasts, confidence intervals, and economic insights yielded by a regression model may be inefficient, seriously biased, or misleading.

What can go wrong?
Violations in the linear regression model:
• linearity (e.g., a quadratic relationship or higher-order terms)
Violations in the residual assumptions:
• non-normal distribution
• non-constant variances
• dependence
• outliers

Checks:
Residuals vs. each predictor variable
o nonlinearity: add higher-order terms in that variable
Residuals vs. fitted values
o variance increasing with the response: transformation
Residuals Q-Q norm plot
o deviation from a straight line: non-normality
40. Violations of linearity
Violations of linearity are extremely serious: if you fit a linear model to data which are nonlinearly related, your predictions are likely to be seriously in error, especially when you extrapolate beyond the range of the sample data.

How to detect
Plot observed vs. predicted values, or residuals vs. predicted values. Look carefully for evidence of a "bowed" pattern, indicating that the model makes systematic errors whenever it is making unusually large or small predictions.

How to fix
Consider applying a nonlinear transformation to the dependent and/or independent variables. For example, if the data are strictly positive, a log transformation may be feasible.
Another possibility is adding another regressor which is a nonlinear function of one of the other variables. For example, if you have regressed Y on X, and the graph of residuals versus predicted values suggests a parabolic curve, then it may make sense to regress Y on both X and X² (see the sketch below). The latter transformation is possible even when X and/or Y have negative values, whereas logging may not be.
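A sketch of the second fix in R; I() protects the arithmetic inside the formula, and mydata is a hypothetical data frame with columns Y and X:

# Suppose residuals vs. fitted suggest a parabolic pattern
quad.lm <- lm(Y ~ X + I(X^2), data = mydata)   # mydata is hypothetical
# Compare against the straight-line model with an F-test
anova(lm(Y ~ X, data = mydata), quad.lm)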
41. Violations of homoscedasticity
Violations of homoscedasticity make it difficult to gauge the true standard deviation of the forecast errors, usually resulting in confidence intervals that are too wide or too narrow. In particular, if the variance of the errors is increasing over time, confidence intervals for out-of-sample predictions will tend to be unrealistically narrow. Heteroscedasticity may also have the effect of giving too much weight to a small subset of the data (namely the subset where the error variance was largest) when estimating coefficients.

How to detect
Plot residuals vs. time, and residuals vs. predicted values, and check for residuals that are getting larger (i.e., more spread out) either as a function of time or as a function of the predicted value. (To be really thorough, you might also want to plot residuals versus some of the independent variables; see the sketch below.)

How to fix
In time series models, heteroscedasticity often arises due to the effects of inflation and/or real compound growth, perhaps magnified by a multiplicative seasonal pattern. Some combination of logging and/or deflating will often stabilize the variance in this case. A simple fix would be to work with shorter intervals of data in which volatility is more nearly constant. Heteroscedasticity can also be a byproduct of a significant violation of the linearity and/or independence assumptions, in which case it may also be fixed as a byproduct of fixing those problems.
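For a fitted lm these diagnostic plots are one call away; a sketch for the production model:

# Residuals vs. fitted values: look for a funnel shape
plot(fitted(production.lm), resid(production.lm))
abline(h = 0, lty = 2)
# Scale-location plot (panel 3 of plot.lm) shows the same thing
plot(production.lm, which = 3)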
42. Violations of normality
Violations of normality compromise the estimation of coefficients and the calculation of confidence intervals. Sometimes the error distribution is "skewed" by the presence of a few large outliers. Since parameter estimation is based on the minimization of squared error, a few extreme observations can exert a disproportionate influence on parameter estimates. Calculation of confidence intervals and various significance tests for coefficients are all based on the assumption of normally distributed errors. If the error distribution is significantly non-normal, confidence intervals may be too wide or too narrow.

How to detect
The best test for normally distributed errors is a normal probability plot of the residuals.
o This is a plot of the fractiles of the error distribution versus the fractiles of a normal distribution having the same mean and variance. If the distribution is normal, the points on this plot should fall close to the diagonal line.
o A bow-shaped pattern of deviations from the diagonal indicates that the residuals have excessive skewness (i.e., they are not symmetrically distributed, with too many large errors in the same direction).
o An S-shaped pattern of deviations indicates that the residuals have excessive kurtosis, i.e., there are either too many or too few large errors in both directions.

How to fix
Violations of normality often arise either because (a) the distributions of the dependent and/or independent variables are themselves significantly non-normal, and/or (b) the linearity assumption is violated. In such cases, a nonlinear transformation of the variables might cure both problems.
In some cases, the problem with the residual distribution is mainly due to one or two very large errors. Such values should be scrutinized closely: are they genuine (i.e., not the result of data-entry errors), are they explainable, are similar events likely to occur again in the future, and how influential are they in your model-fitting results? (The "influence measures" report is a guide to the relative influence of extreme observations.) If they are merely errors, or if they can be explained as unique events not likely to be repeated, then you may have cause to remove them. In some cases, however, it may be that the extreme values in the data provide the most useful information about the values of some of the coefficients and/or provide the most realistic guide to the magnitudes of forecast errors.
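In R, the normal probability plot of residuals is qqnorm() followed by qqline(); a small sketch for the production model, with the Shapiro-Wilk test as an optional formal supplement:

# Normal Q-Q plot of the residuals
qqnorm(resid(production.lm))
qqline(resid(production.lm))   # points should hug this line
# A formal (sample-size sensitive) test of normality
shapiro.test(resid(production.lm))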