Chapter2_Simple_Reg.pdf.pdf

Definition of the Simple Regression Model
Deriving the Ordinary Least Squares Estimates
Properties of OLS on any Sample of Data
Units of Measurement and Functional Form
Expected Values and Variance of OLS
The Simple Regression Model
ECON 371: Econometrics
September 3, 2021
ECON 371: Econometrics The Simple Regression Model

I We begin with cross-sectional analysis and will (eventually)
assume we can collect a random sample from the population
of interest.
I We begin with the following premise, once we have a
population in mind. There are two variables, x and y, and we
would like to “study how y varies with changes in x”.
I We have seen examples: x is amount of fertilizer, and y is
soybean yield. x is years of schooling, y is hourly wage.

We must confront three issues:
I 1. How do we allow factors other than x to affect y? There is
never an exact relationship between two variables (in
interesting cases).
I 2. What is the functional relationship between y and x?
I 3. How can we be sure we are capturing a ceteris paribus
relationship between y and x (as is so often the goal)?

Consider the following equation relating y to x:
y = β0 + β1x + u, (1)
which is assumed to hold in the population of interest.
This equation defines the simple linear regression model (or
two-variable regression model).
I y and x are not treated symmetrically. We want to explain y
in terms of x. From a causality standpoint, it makes no sense
to “explain” past educational attainment in terms of future
labor earnings. For example, we want to explain student
performance (y) in terms of class size (x).

Terminology
y x
Dependent Variable Independent Variable
Explained Variable Explanatory Variable
Response Variable Control Variable
Predicted Variable Predictor Variable
Regressand Regressor
The terms “explained” and “explanatory” are probably best, as
they are the most descriptive and widely applicable. But
“dependent” and “independent” are used often. (The
“independence” here is not really statistical independence.)

We mentioned the error term or disturbance, u, before. The
equation
y = β0 + β1x + u, (2)
explicitly allows for other factors, contained in u, to affect y.
I This equation also addresses the functional form issue (in a
simple way). Namely, y is assumed to be linearly related to x.
We call β0 the intercept parameter and β1 the slope
parameter. These describe a population, and our ultimate
goal is to estimate them.
I In the simple regression framework, we effectively treat all
factors affecting y,other than x as being unobserved.

The equation also addresses the ceteris paribus issue. In
y = β0 + β1x + u, (3)
all other factors that affect y are in u. We want to know how y
changes when x changes, holding u fixed. Let ∆ denote “change”.
Then holding u fixed means ∆u = 0. So
∆y = β1∆x + ∆u = β1∆x when ∆u = 0 (4)
This equation effectively defines β1 as a slope, with the only
difference being the (very important) caveat ∆u = 0.

EXAMPLE: Yield and Fertilizer
yield = β0 + β1fertilizer + u (5)
where u contains things such as land quality, rainfall on a plot of
land, and so on. The slope parameter, β1, is of primary interest: it
tells us how yield will change when the amount of fertilizer
changes, holding all else fixed.
Note: The linear function is probably not realistic here. The effect
of fertilizer is likely to diminish at large amounts of fertilizer.

EXAMPLE: Wage and Education
wage = β0 + β1educ + u (6)
where u contains somewhat nebulous factors (“ability”) but, also,
past workforce experience and tenure on the current job.
∆wage = β1∆educ when ∆u = 0 (7)
Is each year of education really worth the same dollar amount no
matter how much education one starts with?
I The linear form of the equation being used implies that a
one-unit change in x has the same impact on y regardless of
the initial value of x. That is not very realistic in several
applied cases.

We said we must confront three issues:
I 1. How do we allow factors other than x to affect y?
I 2. What is the functional relationship between y and x?
I 3. How can we be sure we are capturing a ceteris paribus
relationship between y and x?
We have argued that the simple regression model
y = β0 + β1x + u (8)
addresses each of them.

I This seems too easy! How can we hope to generally estimate
the ceteris paribus effect of y on x when we have assumed all
other factors affecting y are unobserved and lumped into u?
I The key is that the SLR model is a population model. When
it comes to estimating β1 (and β0) using a random sample of
data, we must restrict how u and x are related to each other.
I But x and u are properly viewed as having distributions in the
population. For example, if x = educ then, in principle, we
could figure out its distribution in the population just over,
say, 30 years old. Suppose u is cognitive ability. Assuming we
can measure what that means, it also has a distribution in the
population.
I What we must do is restrict the way u and x relate to each
other in the population.

First, we make a simplifying assumption (without loss of
generality): the average, or expected, value of u is zero in the
population:
E(u) = 0 (9)
where E(·) is the expected value operator.
Normalizing land quality or “ability” to be zero in the population is
harmless.

The presence of β0 in
y = β0 + β1x + u (10)
allows us to assume E(u) = 0. If the average of u is different from
zero, say α0, we just adjust the intercept, leaving the slope the
same:
y = (β0 + α0) + β1x + (u − α0) (11)
where α0 = E(u). The new error is u − α0 and the new intercept
is β0 + α0. The important point is that the slope, β1, has not
changed.

I The expectation operator, E(·), is a population concept, i.e.,
it refers to the whole group of interest, not only the sample
we have available to us.
I Its intuition is (loosely) similar to the average of a random
variable, in the population.
I Useful facts about the expectation operator for W and H any
random variables: Note that E(W ) is constant.
E(c) = c for any constant c (12)
E(aW + b) = aE(W ) + b for any constants a, b. (13)
E(W + H) = E(W ) + E(H) (14)
E(W − E(W )) = 0 (15)

As you have seen before, we use the variance of a random variable,
say W as a measure of its dispersion.
Var(W ) = σ2
= E

(W − E(W ))2

in the population (16)
Var(aW + b) = a2
Var(W ) for any constants a, b. (17)
We can show that:
Var(W ) = E W 2

− [E(W )]2
(18)
In a given sample, we estimate the variance by:
S2
= (n − 1)−1
n
X
i=1
(xi − x̄)2
(19)

Usually, we are interested in how two variables are related to each
other. One important measure of the (linear) connection between
them is given by the covariance.
I In the population the covariance between X and Y is given by:
Cov(X, Y ) = E(XY ) − E(X)E(Y ) (20)
I Then, if E(X) or E(Y ) equals zero, Cov(X, Y ) = E(XY ).
I If X and Y are independent Cov(X, Y ) = 0.
I Cov(aX + b, cY + d) = acCov(X, Y ).

I The covariance measures the amount of linear dependence
between two random variables. Cov(X, Y ) 0 indicates that
the random variables move in the same direction.
Cov(X, Y ) 0 that they move in opposite directions.
I Zero covariance does NOT imply that the variables are not
related (there could be a nonlinear connection between them)
I Interpreting the magnitude of the covariance can be tricky.
For that, we are better served by looking at correlation:
Corr(X, Y ) =
Cov(X, Y )
p
Var(X)Var(Y )
(21)
I The Correlation coefficient is between −1 and 1. A positive
(negative) correlation indicates that the variables move in the
same (opposite) ways. The closer to 1 or −1 the strongest
the linear relationship is.

An assumption that meshes well with our introductory treatment
involves the mean of the error term for each “slice” of the
population determined by values of x:
E(u|x) = E(u), all values x (22)
where E(u|x) means “the expected value of u given x”.
Then, we say u is mean independent of x.

I Suppose u is “ability” and x is years of education. We need,
for example,
E(ability|x = 8) = E(ability|x = 12) = E(ability|x = 16)
so that the average ability is the same in the different portions
of the population with an 8th grade education, a 12th grade
education, and a four-year college education.
Because people choose education levels partly based on
ability, this assumption is almost certainly false.
I Suppose u is “land quality” and x is fertilizer amount. Then
E(u|x) = E(u) if fertilizer amounts are chosen independently
of quality.

Combining E(u|x) = E(u) (the substantive assumption) with
E(u) = 0 (a normalization) gives
E(u|x) = 0, all values x (23)
Called the zero conditional mean assumption. Because the
conditional expected value is a linear operator, E(u|x) = 0 implies
E(y|x) = β0 + β1x (24)
which shows the population regression function is a linear
function of x.

I This relation is crucial for the intuition of the parameter β1.
I A different approach to simple regression ignores the causality
issue and just starts with a linear model for E(y|x).
I The straight line in the graph in page 26 is the PRF,
E(y|x) = β0 + β1x. The conditional distribution of y at three
different values of x are superimposed. for a given value of x,
we see a range of y values: remember, y = β0 + β1x + u, and
u has a distribution in the population.

EXAMPLE: College versus High School GPA.
Consider the population of students attending a university. We
assume β0 = 1.5 and β1 = 0.5 are known (which is unrealistic).
E(colGPA|hsGPA) = 1.5 + 0.5 hsGPA (25)
If hsGPA = 3.6 then the average colGPA among students with this
particular high school GPA is
1.5 + 0.5(3.6) = 3.3 (26)
I Anyone with hsGPA = 3.6 most likely will not have colGPA
= 3.3. This is the expected college GPA based on averages
within that slice of the population.
I Regression analysis is essentially about explaining effects on
averages.

I Given data on x and y, how can we estimate the population
parameters, β0 and β1?
I Let {(xi , yi ) : i = 1, 2, ..., n} be a random sample of size n
(the number of observations) from the population.
I Plug any observation into the population equation:
yi = β0 + β1xi + ui (27)
where the i subscript indicates a particular observation.
I We observe yi and xi , but not ui (but we know it is there).

We use the two population restrictions:
E(u) = 0 (28)
Cov(x, u) = 0 (29)
to obtain estimating equations for β0 and β1. We talked about the
first condition. The second condition means that x and u are
uncorrelated. Both conditions are implied by
E(u|x) = 0 (30)

With E(u) = 0, Cov(x, u) = 0 is the same as E(xu) = 0. Next we
plug in for u:
E(y − β0 − β1x) = 0 (31)
E[x(y − β0 − β1x)] = 0 (32)
These are the two conditions in the population that effectively
determine β0 and β1.

So we use their sample counterparts (which is a method of
moments approach to estimation):
n−1
n
X
i=1
(yi − β̂0 − β̂1xi ) = 0 (33)
n−1
n
X
i=1
xi (yi − β̂0 − β̂1xi ) = 0 (34)
where β̂0 and β̂1 are the estimates from the data.
These are two linear equations in the two unknowns β̂0 and β̂1.

Pass the summation operator through the first equation:
n−1
n
X
i=1
(yi − β̂0 − β̂1xi ) (35)
= n−1
n
X
i=1
yi − n−1
n
X
i=1
β̂0 − n−1
n
X
i=1
β̂1xi (36)
= n−1
n
X
i=1
yi − β̂0 − β̂1 n−1
n
X
i=1
xi
!
(37)
= ȳ − β̂0 − β̂1x̄ (38)
We use the standard notation ȳ = n−1
Pn
i=1 yi for the average of
the n numbers {yi : i = 1, 2, ..., n}. For emphasis, we call ȳ a
sample average.

We have shown that the first equation,
n−1
n
X
i=1
(yi − β̂0 − β̂1xi ) = 0 (39)
implies
ȳ = β̂0 + β̂1x̄ (40)
Now, use this equation to write the intercept in terms of the slope:
β̂0 = ȳ − β̂1x̄ (41)

Plug this into the second equation (but where we take away the
division by n):
n
X
i=1
xi (yi − β̂0 − β̂1xi ) = 0 (42)
so
n
X
i=1
xi [yi − (ȳ − β̂1x̄) − β̂1xi ] = 0 (43)
Simple algebra gives
n
X
i=1
xi (yi − ȳ) = β̂1
n
X
i=1
xi (xi − x̄)
#
(44)

So, the equation to solve is
n
X
i=1
(xi − x̄)(yi − ȳ) = β̂1
n
X
i=1
(xi − x̄)2
#
(45)
If
Pn
i=1(xi − x̄)2 0, we can write
β̂1 =
Pn
i=1(xi − x̄)(yi − ȳ)
Pn
i=1(xi − x̄)2
=
Sample Covariance(xi , yi )
Sample Variance(xi )
(46)

I The previous formula for β̂1 is important. It shows us how to
take the data we have and compute the slope estimate. For
reasons we will see, β̂1 is called the ordinary least squares
(OLS) slope estimate.
I It can be computed whenever the sample variance of the xi is
not zero, which only rules out the case where each xi has the
same value. The intuition is that the variation in x is what
permits us to identify its impact in y.
I The graph on page 29 shows we have no way to determine the
slope in a relationship between wage and educ if we observe a
sample where everyone has 12 years of schooling.

I Once we have β̂1, we compute β̂0 = ȳ − β̂1x̄. This is the OLS
intercept estimate.
I These days, we let the computer do the calculations, which
are tedious even if n is small.
I Where does the name “ordinary least squares” come from?
For any candidates β̂0 and β̂1, define a fitted value for each i
as
ŷi = β̂0 + β̂1xi (47)
We have n of these. This is the value we predict for yi given
that x = xi .

I The “mistake” we make is the residual:
ûi = yi − ŷi = yi − β̂0 − β̂1xi (48)
Suppose we measure the size of the mistake, for each i, by
squaring it. Then we add them all up:
n
X
i=1
û2
i =
n
X
i=1
(yi − β̂0 − β̂1xi )2
(49)
This is called the sum of squared residuals. Finally, choose
β̂0 and β̂1 to minimize the sum of squared residuals. Using
calculus (or other arguments), it can be shown that the
solutions are the same we obtained before.

Once we have the numbers β̂0 and β̂1 for a given data set, we
write the OLS regression line:
ŷ = β̂0 + β̂1x (50)
which allows us to predict y for any (sensible) value of x. Also
called the sample regression function.
I The intercept, β̂0, is the predicted y when x = 0, if x = 0
makes sense.
I The slope, β̂1, allows us to predict changes in y for any
(reasonable) change in x:
∆ŷ = β̂1∆x (51)
If ∆x = 1, so x increases by one unit, ∆ŷ = β̂1.

EXAMPLE: Effects of Education on Hourly Wage (WAGE2.DTA)
I wage is in dollars per hour. Data are from 1991. educ is
highest grade completed. Only men.
I At first the negative intercept is a puzzle: wage is predicted to
be $ − 0.90 when educ = 0!
I Each additional year of schooling is estimated to be worth
about $0.54 per hour in increased wages.
[
wage = −0.90 + 0.54 educ
n = 526

I In the Stata output, β̂0 = −0.90 is the coefficient labeled
“cons.” β̂1 = 0.54 is the coefficient labeled “educ”.
I We will learn about the other numbers as we go.
I General form of the Stata command:
reg y x
The order of y and x is critical! The first variable in the
command is taken to be the dependent variable, while the
subsequent are the independent variables.

Reminder: When we write
wage = β0 + β1educ + u (52)
in the population, we do not know β0 and β1. β̂0 = −0.90 and
β̂1 = 0.54 are our estimates from this sample of 526 men. They
may or may not be close to the population values.
The function
[
wage = −0.90 + 0.54 educ (53)
is the OLS regression line.

EXAMPLE: Effects of Campaign Spending on Voting Outcomes
(VOTE.DTA)
I Two-party vote, Congressional elections. voteA is the
percentage of votes received by candidate A; shareA is the
percentage share of campaign spending.
I Then, y = voteA, x = shareA.


voteA = 26.81 + 0.464 shareA (54)
n = 173 (55)
I If shareA = 50,
voteA = 26.81 + .0464(50) = 50.01,or
essentially 50 percent of the votes.
I If shareA increases by 10 - which means 10 percentage points
- voteA is predicted to increase by 4.64 percentage points in
votes cast to that candidate.
I Is this effect causal? Or, do better candidates spend more
(more contributions?)? As with simple correlation analysis,
these simple regression estimates do not allow us to tell.

Assuming we have values for β̂0 and β̂1,
ŷ = β̂0 + β̂1x (56)
We can get the OLS fitted values by plugging the xi into the
equation:
ŷi = β̂0 + β̂1xi , i = 1, 2, ..., n (57)
The OLS residuals are
ûi = yi − ŷi = yi − β̂0 − β̂1xi , i = 1, 2, ..., n (58)

I Most residuals will be different from zero (they do not lie on
the regression line). Some will be positive, others negative.
I Positive residuals indicate that the regression line (and, hence,
the predicted values) underestimate the value of yi , similarly if
ûi is negative, the line overestimate.

Algebraic Properties of OLS Statistics
Remember how we obtained β̂0 and β̂1. When an intercept is
included, we will have:
n
X
i=1
(yi − β̂0 − β̂1xi ) = 0 (59)
The OLS residuals always add up to zero, by construction,
n
X
i=1
ûi = 0 (60)
Because yi = ŷi + ûi by definition,
n−1
n
X
i=1
yi = n−1
n
X
i=1
ŷi + n−1
n
X
i=1
ûi (61)
and so ȳ = ŷ.

Similarly, because of the way we obtained our estimates,
n
X
i=1
xi (yi − β̂0 − β̂1xi ) = 0 (62)
The sample covariance (and therefore the sample correlation)
between the explanatory variables and the residuals is always zero:
n
X
i=1
xi ûi = 0 (63)
Because the ŷi are linear functions of the xi , the fitted values and
residuals are uncorrelated, too:
n
X
i=1
ŷi ûi = 0 (64)

Both properties hold by construction. β̂0 and β̂1 were chosen to
make them true.
A third property is that the point (x̄, ȳ) is always on the OLS
regression line. That is, if we plug in the average for x, we predict
the sample average for y:
ȳ = β̂0 + β̂1x̄ (65)
Again, we chose the estimates to make this true.

Goodness-of-Fit
I For each observation, write
yi = ŷi + ûi (66)
I Define the total (SST), explained (SSE) and residual (SSR)
sum of squares as
SST =
n
X
i=1
(yi − ȳ)2
(67)
SSE =
n
X
i=1
(ŷi − ȳ)2
(68)
SSR =
n
X
i=1
û2
i (69)

These are sample variances when divided by n (or n − 1). SST
n−1 is
the sample variance of yi , SSE
n−1 is the sample variance of ŷi , and
SSR
n−1 is the sample variance of ûi .
By writing
SST =
n
X
i=1
(yi − ȳ)2
=
n
X
i=1
[(yi − ŷi ) − (ŷi − ȳ)]2
(70)
=
n
X
i=1
[ûi − (ŷi − ȳ)]2
(71)
and using that the fitted values and residuals are uncorrelated, can
show
SST = SSE + SSR (72)

Assuming SST 0, we can define the fraction of the total
variation in yi that is explained by xi (or the OLS regression line) as
R2
=
SSE
SST
= 1 −
SSR
SST
(73)
Called the R-squared of the regression.
It can be shown to be equal to the square of the correlation
between yi and ŷi . Therefore,
0 ≤ R2
≤ 1 (74)

I R2 = 0 means no linear relationship between yi and xi .
R2 = 1 means a perfect linear relationship.
I As R2 increases, the yi are closer and closer to falling on the
OLS regression line.
I Do not want to fixate on R2. It is a useful summary measure
but does not tell us about causality.

Wage example:
[
wage = −0.90 + 0.54 educ (75)
n = 526, R2
= .1648 (76)
Voting example:

voteA = 26.81 + 0.464 shareA (77)
n = 173, R2
= .856 (78)

Two important issues need to be understood:
I How changing the units of measurement of the dependent
and/or independent variable affects OLS estimates? It is very
important to know how y and x are measured in order to
interpret regression functions.
I How to incorporate functional forms usually used in economics
into regression analysis.

Consider an equation estimated from CEOSAL.DTA, where annual
CEO salary is in thousands of dollars and the return on equity is a
percent:

salary = 963.191 + 18.501 roe (79)
n = 209, R2
= .0132 (80)
When roe = 0 (it never is in the data),
salary = 963.191. But
salary is in thousands of dollars, so $963,191.
A one percentage point (unit) increase in roe increases predicted
salary by 18.501, or $18,501.

What if we measure roe as a decimal, rather than a percent?
Define
roedec = roe/100 (81)
What will happen to the intercept, slope, and R2 when we regress
salary on roedec?
Nothing should happen to the intercept: roedec = 0 is the same as
roe = 0. But the slope will increase by 100. The goodness-of-fit
should not change, and it does not.

The new regression is

salary = 963.191 + 1, 850.1 roedec (82)
n = 209, R2
= .0132 (83)
Now a one percentage point change in roe is the same as
∆roedec = .01, and so we get the same effect as before. Note
that,
y = β0 + β1x + u ⇒ y = β0 +

β1
c

(cx) + u (84)
then the slope is scaled so to preserve the relationship between
variables.

What if we measure salary in dollars, rather than thousands of
dollars?
salarydol = 1, 000 · salary (85)
Both the intercept and slope get multiplied by 1,000.

salarydol = 963, 191 + 18, 501 roe (86)
n = 209, R2
= .0132 (87)
Note that,
y = β0 + β1x + u ⇒ cy = cβ0 + (cβ1)x + cu (88)

Using the Natural Logarithm in Simple Regression
Recall the wage example:
[
wage = −0.90 + 0.54 educ (89)
n = 526, R2
= .133 (90)
Might be an okay approximation, but unsatisfying for a couple of
reasons:
I The negative intercept is a bit strange (even though the
equation gives sensible predictions for education ranging from
8 to 20).
I More importantly, the dollar value of another year of schooling
is constant. So the 16th year of education is worth the same
as the second. We expect additional years of schooling to be
worth more, in dollar terms, than previous years.

How can we incorporate an increasing effect? One way is to
postulate a constant percentage effect. We can approximate this
using the natural log (ln, but JW uses log).
Let the dependent variable be ln(wage):
ln(wage) = β0 + β1educ + u (91)
Holding u fixed,
∆ ln(wage) = β1∆educ (92)
so
β1 =
∆ ln(wage)
∆educ
(93)

Useful result from calculus:
100 · ∆ ln(wage) ≈ %∆wage (94)
This means when ln(wage) = β0 + β1educ + u, we have a simple
interpretation of β1:
100β1 ≈ %∆wage when ∆educ = 1 (95)
If we do not multiply by 100, we have the decimal version (the
proportionate change).
In this example, 100β1 is often called the return to education (just
like an investment). This measure is free of units of measurement
(currency, price level).

I We still want to explain wage in terms of educ! This gives us
a way to get a (rough) constant percentage effect.
I Predicting wage is more complicated but not usually the
interesting question.
Using WAGE2.DTA. Have to define ln(wage) in general.

ln(wage) = 0.58 + .082 educ (96)
n = 526, R2
= .1858 (97)
The estimated return to each year of education is about 8.2%.
Warning: This R-squared is not directly comparable to the
R-squared when wage is the dependent variable.

We can use the ln on both sides of the equation to get constant
elasticity models. For example, if
ln(salary) = β0 + β1 ln(sales) + u (98)
then,
β1 ≈
%∆salary
%∆sales
(99)
The elasticity is free of units of salary and sales.
A constant elasticity model for salary and sales makes more sense
than a constant dollar effect.

I The estimated elasticity of CEO salary with respect to firms
sales is about 0.257. (Note that we do not need to even know
salary is in thousands of dollars while sales is in millions.)
I A 10 percent increase in sales is associated with(causes?)
about a
0.257(10) = 2.57 (100)
percent increase in salary.

Model Dep. Var. Indep. Var. β1
Level-Level y x ∆y = β1∆x
Level-Log y ln(x) ∆y = (β1/100)%∆x
Log-Level ln(y) x %∆y = (100β1)∆x
Log-Log ln(y) ln(x) %∆y = β1%∆x
The possibility of using the natural log to get nonlinear
relationships between y and x raises a question: What do we mean
now by ”linear” regression? The answer is that the model is linear
in the parameters, β0 and β1. We can use any transformations of
the dependent and independent variables to get interesting
interpretations for the parameters.

Expected Value of OLS
I We motivated simple regression using a population model.
But our analysis so far has been purely algebraic, based on a
sample of data. (So, residuals always average to zero when we
apply OLS to a sample, regardless of any underlying model.)
I Now our job gets harder. We have to study statistical
properties of the OLS estimator, referring to a population
model and assuming random sampling.

I Mathematical statistics: How do our estimators behave across
different samples of data? On average, would we get the right
answer if we could repeatedly sample?
I We need to find the expected value of the OLS estimators –
in effect, the average outcome across all possible random
samples – and determine if we are right on average.
I Leads to the notion of unbiasedness, which is a “desirable”
characteristic for estimators.
E(β̂) = β (101)

I Remember, our objective is to estimate β1, the slope
population parameter that describes the relationship between
y and x.
I β̂1 is an estimator of that parameter obtained for a specific
sample.
I Different samples will generate different estimates (β̂1) for the
“true” β1, i.e. (β̂1) is a random variable.
I Unbiasedness is the idea that if we could take as many
random samples on Y as we want from the population, and
compute an estimate each time, the average of these
estimates would be equal to β1.

Assumption SLR.1 (Linear in Parameters)
I The population model can be written as
y = β0 + β1x + u (102)
where β0 and β1 are the (unknown) population parameters.
I We view x and u as outcomes of random variables; thus, y is
random.
I Stating this assumption formally shows that our goal is to
estimate β0 and β1.

Assumption SLR.2 (Random Sampling)
I We have a random sample of size n, {(xi , yi ) : i = 1, ..., n},
following the population model.
I We know how to use this data to estimate β0 and β1 by OLS.
I Because each i is a draw from the population, we can write,
for each i,
yi = β0 + β1xi + ui (103)
I Notice that ui here is the unobserved error for observation i.
It is not the residual that we compute from the data!

Assumption SLR.3 (Sample Variation in the Explanatory
Variable)
I The sample outcomes on xi are not all the same value.
I This is the same as saying the sample variance of
{xi : i = 1, ..., n} is not zero.
I In practice, this is no assumption at all. If the xi are all the
same value, we cannot learn how x affects y in the population.

Assumption SLR.4 (Zero Conditional Mean)
I In the population, the error term has zero mean given any
value of the explanatory variable:
E(u|x) = E(u) = 0. (104)
I This is the key assumption for showing that OLS is unbiased,
with the zero value not being important once we assume
E(u|x) does not change with x.
I Note that we can compute the OLS estimates whether or not
this assumption holds, or even if there is an underlying
population model.

I We will focus on β̂1. There are a few approaches to showing
unbiasedness.
I One explicitly computes the expected value of β̂1 conditional
on the sample outcomes on x, {xi : i = 1, 2, ..., n}. Even
though this is the more proper way to understand the
problem, technically we can obtain the same results by
treating the conditioning variables as if they were fixed in
repeated samples.
I That is, to treat the xi as nonrandom in the derivation. So,
the randomness in β̂1 comes through the ui (equivalently, the
yi ). Nevertheless, it is important to remember that xi are
random variables and that we are taking expectations
conditional on knowing them.

I The nonrandomness of the xi , also called “fixed in repeated
samples”, is not realistic in most cases but gets one to the
same place. See the textbook(5e, pp. 49-50).
I We use it only as a simplifying device. When thinking about
applying OLS to the simple regression model, it is best to
think in terms of Assumptions SLR.1 to SLR.4.

How do we show β̂1 is unbiased for β1? What we need to show is
E(β̂1) = β1 (105)
where the expected value means averaging across random samples.
Step 1: Write down a formula for β̂1. It is convenient to use
β̂1 =
Pn
i=1(xi − x̄)yi
Pn
i=1(xi − x̄)2
(106)
which is one of several equivalent forms.

It is convenient to define SSTx =
Pn
i=1(xi − x̄)2, to total variation
in the xi , and write
β̂1 =
Pn
i=1(xi − x̄)yi
SSTx
(107)
Remember, SSTx is just some positive number. The existence of
β̂1 is guaranteed by SLR.3.
Step 2: Replace each yi with yi = β0 + β1xi + ui (which uses
SLR.1 and the fact that we have data from SLR.2).

The numerator becomes
n
X
i=1
(xi − x̄)yi =
n
X
i=1
(xi − x̄)(β0 + β1xi + ui ) (108)
= β0
n
X
i=1
(xi − x̄) + β1
n
X
i=1
(xi − x̄)xi +
n
X
i=1
(xi − x̄)ui (109)
= 0 + β1
n
X
i=1
(xi − x̄)2
+
n
X
i=1
(xi − x̄)ui (110)
= β1SSTx +
n
X
i=1
(xi − x̄)ui (111)
We used
Pn
i=1(xi − x̄) = 0 and
Pn
i=1(xi − x̄)xi =
Pn
i=1(xi − x̄)2.

We have shown
β̂1 =
β1SSTx +
Pn
i=1(xi − x̄)ui
SSTx
= β1 +
Pn
i=1(xi − x̄)ui
SSTx
(112)
Note how the last piece is the slope coefficient from the OLS
regression of ui on xi , i = 1, ..., n. We cannot do this regression
because the ui are not observed.

Now define
wi =
(xi − x̄)
SSTx
(113)
so we have
β̂1 = β1 +
n
X
i=1
wi ui (114)
I β̂1 is a linear function of the unobserved errors, ui . The wi are
all functions of {x1, x2, ..., xn}.
I The (random) difference between β̂1 and β1 is due to this
linear function of the unobservables.

Step 3: Find E(β̂1).
I Under Assumptions SLR.2 and SLR.4, E(ui |x1, x2, ..., xn) = 0.
That means, conditional on {x1, x2, ..., xn},
E(wi ui |x1, x2, ..., xn) = wi E(ui |x1, x2, ..., xn) = 0
because wi is a function of {x1, x2, ..., xn}. (In the next slides I
omit the conditioning in the expectations)
I This would not be true if, in the population, u and x are
correlated.

Now we can complete the proof: conditional on {x1, x2, ..., xn},
E(β̂1) = E β1 +
n
X
i=1
wi ui
!
(115)
= β1 +
n
X
i=1
E(wi ui ) = β1 +
n
X
i=1
wi E(ui ) (116)
= β1 (117)
Remember, β1 is the fixed constant in the population. The
estimator, β̂1, varies across samples and is the random outcome:
before we collect our data, we do not know what β̂1 will be.

THEOREM (Unbiasedness of OLS)
Under Assumptions SLR.1 through SLR.4
E(β̂0) = β0 and E(β̂1) = β1. (118)
I Omit the proof for β̂0.
I Now simulate data in Stata:
y = 3 + 2x + u (119)
I x ∼ Normal(0, 9), u ∼ Normal(0, 36), and they are
independent.

I Each sample leads to a different estimate, β̂0 and β̂1. Some
will be very close to the true values β0 = 3 and β1 = 2.
Nevertheless, some could be very far from those values.
I If we repeat the experiment again and again, and average the
estimates, we would get very close to 2.
I The problem is, we do not know which kind of sample we
have. We can never know whether we are close to the
population value.
I We hope that our sample is “typical” and produces a slope
estimate close to β1 but we can never know.

Unbiasedness is a property of the procedure or rule. After
estimating an equation like

lwage = 0.58 + .082 educ (120)
n = 526, R2
= .1858 (121)
it is tempting to say 8.2% is an ”unbiased estimate” of the return
to education. Technically, this statement is incorrect. The rule
used to get β̂0 = 0.58 and β̂1 = .082 is unbiased (if we even
believe u is unrelated to educ).

We use four assumptions:
I SLR.1: y = β0 + β1x + u
I SLR.2: random sampling
I SLR.3: sample variation in the xi
I SLR.4: E(u|x) = 0
The focus should mainly be on the last of these. What are the
omitted factors? Are they likely to be correlated with x? If so,
SLR.4 fails and OLS will be biased.

EXAMPLE: Student Performance and Student-Teacher Ratios
(MATHPNL.DTA)

math4 = 69.09 − 0.2076 ptr (122)
n = 2199, R2
= .0015 (123)
Notice the minus sign for β̂1 = −0.2076. This implies that one
more student per teacher (on an average of 21.5) decreases the
estimated pass rate by about 0.2076 percentage points (note that
math4 is a percent). A decrease of one standard deviation (about
2.9) in ptr is predicted to increase the pass rate by about
0.2076(2.9) = 0.60 percentage points.

I But, is the OLS estimator likely to be unbiased in this setting?
Students from advantaged backgrounds tend to go to schools
with smaller ptr and would perform better, on average,
without small classes. We may just be picking up this
negative correlation, rather than something causal. When u
contains factors affecting y that are also correlated with x we
will have an spurious correlation.
I We will study how to determine the sign of the bias later with
multiple regression.
I Incidentally, notice the very small R-squared. Only about
0.15% of the variation in math4 across schools is explained by
ptr. In other words, there are lots of unobservables in u.

I The low R-squared means that using this equation for
predicting math4 likely will produce poor results. But a high
variance for u (leading to a low R-squared) is separate from
whether u and ptr are correlated. We must judge that based
on introspection or external evidence.
I Final comment: because the ptr cannot be less than one, we
will never predict math4 above the intercept, 69.09. Smallest
ptr is 7.3, so our largest prediction for math4 is
69.09 − 0.2076(7.3) ≈ 67.57. There are 1059 schools with
pass rates higher than 67.57.

Variance of the OLS Estimators
I Under SLR.1 to SLR.4, the OLS estimators are unbiased. This
tells us that, on average, the estimates will equal the
population values.
I But we need a measure of dispersion (spread) in the sampling
distribution of the estimators. We use the variance (and,
ultimately, the standard deviation).
I We could characterize the variance of the OLS estimators
under SLR.1 to SLR.4 (and we will later). For now, it is
easiest to introduce an assumption that simplifies the
calculations.

Assumption SLR.5 (Homoskedasticity, or Constant Variance)
The error has the same variance given any value of the explanatory
variable x:
Var(u|x) = σ2
0 (124)
where σ2 is (virtually always) unknown.
Because we assume SLR.4, that is, E(u|x) = 0 whenever we
assume SLR.5, we can also write
E(u2
|x) = σ2
= E(u2
) (125)

Under the population Assumptions SLR.1 (y = β0 + β1x + u),
SRL.4 (E(u|x) = 0) and SLR.5 (Var(u|x) = σ2),
E(y|x) = β0 + β1x (126)
Var(y|x) = σ2
(127)
So the average or expected value of y is allowed to change with x
– in fact, this is what interests us – but the variance does not
change with x. (See Graphs on pages 54 and 55)

The constant variance assumption may not be realistic; it must be
determined on a case-by-case basis.
EXAMPLE: Suppose that we would like to analyse the relationship
between savings and income; y = sav, x = inc; and we think
E(sav|inc) = β0 + β1inc (128)
with β1 0. This means average family saving increases with
income. If we impose SLR.5 then
Var(sav|inc) = σ2
(129)
which means the variability in saving does not change with income.
There are reasons to think savings would be more variable as
income increases. So SLR.5 is not realistic on this case.

EXAMPLE: Suppose y is an average score, at the school level,
and x is school enrollment. The variance of an average is inversely
related to the number of units being averaged.
Var

1
n
n
X
i=1
xi
#
=
1
n2
Var
n
X
i=1
xi
#
1
n2
n
X
i=1
Var[xi ] =
Var[x]
n
Then,
Var(y|x) =
η2
n
(130)
often holds, where η2 is the variance in individual scores.
Will discuss later how to use the data to test whether SLR.5 holds.

THEOREM (Sampling Variances of OLS)
Under Assumptions SLR.1 to SLR.5,
Var(β̂1|x) =
σ2
Pn
i=1(xi − x̄)2
=
σ2
SSTx
(131)
Var(β̂0|x) =
σ2 n−1
Pn
i=1 x2
i

SSTx
(132)
(conditional on the outcomes {x1, x2, ..., xn}).

To show this, write, as before,
β̂1 = β1 +
n
X
i=1
wi ui (133)
where wi = (xi − x̄)/SSTx . Because β1 is a constant, it does not
affect Var(β̂1). Now, we need to use the fact that, for uncorrelated
random variables, the variance of the sum is the sum of the
variances.

The {ui : i = 1, 2, ..., n} are actually independent across i, and so
they are uncorrelated. So (remember that if we know x, we know
w)
Var(β̂1|x) = Var
n
X
i=1
wi ui |x
!
(134)
=
n
X
i=1
Var(wi ui |x) =
n
X
i=1
w2
i Var(ui |x) (135)
=
n
X
i=1
w2
i σ2
= σ2
n
X
i=1
w2
i (136)
where the second-to-last equality uses Assumption SLR.5, so that
the variance of ui does not depend on xi .

Now we have
n
X
i=1
w2
i =
n
X
i=1
(xi − x̄)2
(SSTx )2
=
Pn
i=1(xi − x̄)2
(SSTx )2
(137)
=
SSTx
(SSTx )2
=
1
SSTx
(138)
We have shown
Var(β̂1) =
σ2
SSTx
(139)

I This is the “standard” formula for the variance of the OLS
slope estimator. It is not valid if Assumption SLR.5 does not
hold.
I The homoskedasticity assumption was not used to show
unbiasedness of the OLS estimators. That requires only SLR.1
to SLR.4.

Usually we are interested in β1. We can easily study the two
factors that affect its variance.
Var(β̂1) =
σ2
SSTx
(140)
1. As the error variance increases, i.e, as σ2 increases, so does
Var(β̂1). The more “noise” in the relationship between y and
x – that is, the larger variability in u – the harder it is to learn
about β1.
2. By contrast, more variation in {xi } is a good thing:
SSTx ↑ implies Var(β̂1) ↓ (141)

Notice that SSTx /n is the sample variance in x. We can think of
this as getting close to the population variance of x, σ2
x , as n gets
large. This means
SSTx ≈ nσ2
x (142)
which means, as n grows, Var(β̂1) shrinks at the rate 1/n. This is
why more data is a good thing: it shrinks the sampling variance of
our estimators.

The standard deviation of β̂1 is the square root of the variance. So
sd(β̂1) =
σ
√
SSTx
(143)
This turns out to be the measure of variation that appears in
confidence intervals and test statistics.

Estimating the Error Variance
In the formula
Var(β̂1) =
σ2
SSTx
(144)
we can compute SSTx from {xi : i = 1, ..., n}. But we need to
estimate σ2.
Recall that
σ2
= E(u2
). (145)

Therefore, if we could observe a sample on the errors,
{ui : i = 1, 2, ..., n}, an unbiased estimator of σ2 would be the
sample average
n−1
n
X
i=1
u2
i (146)
But this not an estimator because we cannot compute it from the
data we observe, since ui are unobserved.
How about replacing each ui with its “estimate”, the OLS residual
ûi ?
ui = yi − β0 − β1xi (147)
ûi = yi − β̂0 − β̂1xi (148)

ûi can be computed from the data because it depends on the
estimators β̂0 and β̂1. Except by fluke,
ûi 6= ui (149)
for any i.
ûi = yi − β̂0 − β̂1xi = (β0 + β1xi + ui ) − β̂0 − β̂1xi (150)
= ui − (β̂0 − β0) − (β̂1 − β1)xi (151)
E(β̂0) = β0 and E(β̂1) = β1, but the estimators almost always
differ from the population values in a sample.

Now, what about this as an estimator of σ2?
n−1
n
X
i=1
û2
i = SSR/n (152)
It is a true estimator and easily computed from the data after OLS.
As it turns out, this estimator is slightly biased: its expected value
is a little less than σ2.

The estimator does not account for the two restrictions on the
residuals, used to obtain β̂0 and β̂1:
n
X
i=1
ûi = 0 (153)
n
X
i=1
xi ûi = 0 (154)
There is no such restriction on the unobserved errors.

The unbiased estimator of σ2 uses a degrees-of-freedom
adjustment. The residuals have only n − 2 degrees-of-freedom, not
n.
σ̂2
=
SSR
(n − 2)
(155)
THEOREM: Unbiased Estimator of σ2
Under Assumptions SLR.1 to SLR.5,
E(σ̂2
) = σ2
(156)

In regression output, it is
σ̂ =
√
σ̂2 =
s
SSR
(n − 2)
(157)
that is usually reported. This is an estimator of sd(u), the
standard deviation of the population error.
One small glitch is that σ̂ is not unbiased for σ. This will not
matter for our purposes (and there is no unbiased estimator of σ).

I σ̂ is called the standard error of the regression, which
means it is an estimate of the standard deviation of the error
in the regression. Stata calls it the root mean squared error.
I Given σ̂, we can now estimate sd(β̂1) and sd(β̂0). The
estimates of these are called the standard errors of the β̂j .
We will use these a lot.
I Almost all regression packages report the standard errors in a
column next to the coefficient estimates.

I We just plug σ̂ in for σ:
se(β̂1) =
σ̂
√
SSTx
(158)
where both the numerator and denominator are computed
from the data.
I For reasons we will see, it is useful to report the standard
errors below the corresponding coefficient, usually in
parentheses.

EXAMPLE: Return to Education Using WAGE1.DTA

lwage = 0.58
(0.0973)
+ 0.082
(0.0076)
educ (159)
n = 526, R2
= .1858 (160)
In this regression, σ̂ = 0.48008 (see ”Root MSE” in the Stata
output). Sometimes it is reported, but not usually.

Summary and Tasks Left Undone
I In the table of Stata output, we have talked about the
coefficients (Coef.), and how to interpret them. It is critical to
become comfortable with that.
I Of course, we know what the sample size (Number of obs)
means. We have talked about the R-squared and how it
depends on the total, explained (model) and residual sum of
squares.
I We just talked about σ̂, the standard error of the regression
(Root MSE). Its main importance, for now, is how it appears
in the standard errors of the intercept and slope estimates.
The standard errors are in the column labeled ”Std. Err.”

I We have talked about conditions under which OLS is
unbiased, and we have derived the sampling variance of the
OLS estimators(which underlies the standard error
calculations).
I SLR.1: y = β0 + β1x + u
I SLR.2: random sampling
I SLR.3: sample variation in the xi
I SLR.4: E(u|x) = 0
I SLR.5: Var(u|x) = Var(u) = σ2.

I The first four assumptions are used for unbiasedness and we
add the fifth, homoskedasticity, for the ”standard” variance
calculation.
I We could continue by discussing statistical inference and
confidence intervals for the population parameters. But the
mechanics are very similar to multiple regression, and so we
hold off.
I Also, there are a lot of ways to combine data to estimate a
slope and intercept. How do we know the OLS approach is a
good one? We will argue, in the context of multiple
regression, that OLS is ”best” in a certain sense under the
five given assumptions. These are often called the
Gauss-Markov Assumptions (for cross-sectional regression).

Chapter2_Simple_Reg.pdf.pdf

Recommended

Recommended

More Related Content

Similar to Chapter2_Simple_Reg.pdf.pdf

Similar to Chapter2_Simple_Reg.pdf.pdf (20)

More from ROBERTOENRIQUEGARCAA1

More from ROBERTOENRIQUEGARCAA1 (20)

Recently uploaded

Recently uploaded (20)

Chapter2_Simple_Reg.pdf.pdf