Econ 4650
Spring 2019
Introduction to R
Basic Statistics & R
This chapter provides a good review of basic statistics that we will use throughout the course. It should serve
as a reference guide for you as we discuss estimation, sampling distributions, and hypothesis testing. We will
utilize normal, F, and t distributions frequently. Tables for these distributions are provided at the end of the
text.
In this module, you will learn how to enter data into R from a CSV file, describe the data, and produce
simple plots.
Assuming you have R installed on your computer, you can type (or copy and paste) everything you see in the
shaded area (written in this font), and you should see the results (after the ##) just as you see them in
this PDF. (Note: if you need to install R, or would like more information on the software, you can do so at
https://cran.r-project.org/. There are a few tasks we will use R for in this course that you can complete only
on a computer on which you have administrative rights.)
For example, let’s use R as a simple calculator to find 2 + 2. What I do is type “2 + 2” and press the enter
key
2 + 2
## [1] 4
We see that R returns the correct answer.
Now, let’s try 2 x 3
2 * 3
## [1] 6
and 2^3
2^3
## [1] 8
We can also get the natural log
log(2.71821)
## [1] 0.9999736
log(exp(1))
## [1] 1
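The log function also accepts a base argument for logarithms in other bases. As a quick sketch (not in the
text), we can undo the 2^3 calculation from above
log(8, base = 2)
## [1] 3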
Let’s input the data for the height of a random sample of U.S. women as in the text (see page 9 of the custom
edition of the text). As in the text, we will call the data X
X <- c(61.0, 63.5, 66.0, 68.5, 71.0)
We can see, in R, what is in our workspace by typing ls()
ls()
## [1] "X"
R knows that there is something called X and can report back the values in X
X
## [1] 61.0 63.5 66.0 68.5 71.0
The textbook states that the mean (µ) of these data is known to be equal to 66 with a known standard
deviation (σ) of 2.5 inches. Let’s use R to calculate the standardized Z-values from our data
Z <- (X - 66)/2.5
and we can see the results
Z
## [1] -2 -1 0 1 2
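As a brief aside (a sketch beyond the text’s example), once we have Z-values, the pnorm function returns the
probability that a standard normal variable falls below each one
pnorm(Z)
## [1] 0.02275013 0.15865525 0.50000000 0.84134475 0.97724987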
In R, we can calculate the mean, variance, and standard deviation for a particular sample of data
mean(X)
## [1] 66
var(X)
## [1] 15.625
sd(X)
## [1] 3.952847
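Since the standard deviation is just the square root of the variance, we can verify that the two commands agree
sqrt(var(X))
## [1] 3.952847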
Let’s assume that heights are distributed according to the normal distribution with mean 66 and standard
deviation 2.5. We can have R randomly sample from this distribution. Let’s create a sample of 100 draws
X100 <- rnorm(100, mean = 66, sd = 2.5)
and find
mean(X100)
## [1] 66.2055
sd(X100)
## [1] 2.910129
Now let’s see what a simple histogram of these data looks like
hist(X100)
[Figure: Histogram of X100, with X100 on the horizontal axis and Frequency on the vertical axis]
When you do this on your computer, you will get different results. That is because your random draw will be
different.
We can have the same draw by doing this
set.seed(100)
X100a <- rnorm(100, mean = 66, sd = 2.5)
And we find
mean(X100a)
## [1] 66.00728
sd(X100a)
## [1] 2.551776
With the same histogram
hist(X100a)
[Figure: Histogram of X100a, with X100a on the horizontal axis and Frequency on the vertical axis]
The quantiles of the data can be determined by
quantile(X100a)
## 0% 25% 50% 75% 100%
## 60.32019 64.47788 65.85145 67.63973 72.45490
So the mean of this particular random sample is 66.007 and the
median is 65.85. As an aside, let’s look at
the distribution of a draw of size 20,000 from a standard normal
set.seed(666)
zdraw <- rnorm(20000, mean = 0, sd = 1)
mean(zdraw)
## [1] 0.007248452
quantile(zdraw, prob = c(0.01, 0.05, 0.5, 0.95, 0.99))
## 1% 5% 50% 95% 99%
## -2.34318795 -1.63898524 0.01879137 1.65229300 2.30314467
You can compare this draw with the theoretical values for our
requested percentiles (.01, .05, .5, .95, .99) of
-2.326, -1.645, 0, 1.645, and 2.326.
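Those theoretical values can themselves be computed with qnorm, the quantile function of the standard normal
qnorm(c(0.01, 0.05, 0.5, 0.95, 0.99))
## [1] -2.326348 -1.644854 0.000000 1.644854 2.326348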
Here is a plot of the large draw
plot(density(zdraw))
[Figure: density.default(x = zdraw), N = 20000, Bandwidth = 0.1242; a kernel density plot of zdraw with
Density on the vertical axis]
Throughout the class we will generate random data to learn about econometrics, and for results to match,
setting the random number seed via set.seed() is a useful R command. Later on we will learn how to use
many of the built-in random distribution functions in R (just as we did here by generating a sample of size
100 drawn from a normal distribution).
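As a preview (a sketch you can try on your own; none of these are needed yet), every distribution in R comes
as a family of four functions: an r prefix gives random draws, d the density, p the CDF, and q the quantiles
runif(3) # three random draws from the uniform distribution on [0, 1]
dnorm(0) # the standard normal density evaluated at zero
pt(2, df = 10) # P(T <= 2) for a t distribution with 10 degrees of freedom
qf(0.95, df1 = 3, df2 = 46) # the 95th percentile of an F(3, 46) distribution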
Read Data
Entering data the way we did before is awkward. So now let’s
see how to read in a set of data that comes in
a csv file (comma separated values). These files can be read by
Excel and Excel can export csv files. They
are easily read in R (and other statistical packages).
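The reverse direction works too: write.csv exports a dataframe to a csv file that Excel can open. A minimal
sketch, assuming we want to save our earlier height data to a hypothetical file named heights.csv in the
working directory
write.csv(data.frame(height = X), "heights.csv", row.names = FALSE)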
Let’s begin with data from Table 1 on a sample of 22 single-family homes in Diamond Bar, California.
I downloaded the Excel file and saved it as a csv file named STATS17.csv. Before reading the file into R, I
changed my working directory (i.e., the place on my computer that R will read and write files to) with the
command setwd(). I can confirm that STATS17.csv is in my working directory by asking R to report all of the
file names saved in this folder with the dir() command
setwd("C:/myfile/University of Utah/2019
Spring/Econometrics")
dir()
## [1] "Assignment 1.pdf" "Chapter 1.pdf" "STATS17.csv"
Indeed, the file is listed as being saved in my working
directory. I can read it into R and name it data17 by
typing
data17 <- read.csv("STATS17.csv", header = T)
Note that R is case sensitive, so Data17 will be different from data17, or datA17, etc. (In the read.csv
command, the header argument tells R that the first row contains the variable names, or labels.)
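Another handy command for a first look at a freshly read dataframe is str, which reports the number of
observations and each variable’s type (numeric, character, and so on); we omit its output here, but it is
worth running yourself
str(data17)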
Here are what the data look like
data17
## X OBS PRICE SQFT
## 1 1 1 425000 1349
## 2 2 2 451500 1807
## 3 3 3 508560 1651
## 4 4 4 448050 1293
## 5 5 5 500580 1745
## 6 6 6 524160 1900
## 7 7 7 500580 1759
## 8 8 8 399330 1740
## 9 9 9 442020 1950
## 10 10 10 537660 1771
## 11 11 11 515100 2078
## 12 12 12 589000 2268
## 13 13 13 696000 2400
## 14 14 14 540750 2050
## 15 15 15 659200 2267
## 16 16 16 492450 1986
## 17 17 17 567047 2950
## 18 18 18 684950 2712
## 19 19 19 668470 2799
## 20 20 20 733360 2933
## 21 21 21 775590 3203
## 22 22 22 788888 2988
Here are the summary statistics for data17
summary(data17)
## X OBS PRICE SQFT
## Min. : 1.00 Min. : 1.00 Min. :399330 Min. :1293
## 1st Qu.: 6.25 1st Qu.: 6.25 1st Qu.:494483 1st Qu.:1762
## Median :11.50 Median :11.50 Median :530910 Median :2018
## Mean :11.50 Mean :11.50 Mean :565829 Mean :2164
## 3rd Qu.:16.75 3rd Qu.:16.75 3rd Qu.:666153 3rd Qu.:2634
## Max. :22.00 Max. :22.00 Max. :788888 Max. :3203
We can generate a scatterplot of these data
plot(data17$SQFT, data17$PRICE)
[Figure: scatterplot with data17$SQFT on the horizontal axis and data17$PRICE on the vertical axis]
Here is another way to access the variable names, or labels, for the SQFT and PRICE data from the data17
dataframe
attach(data17)
## The following object is masked _by_ .GlobalEnv:
##
## X
plot(SQFT, PRICE)
[Figure: scatterplot with SQFT on the horizontal axis and PRICE on the vertical axis]
We can also compute the correlation coefficient between the two variables
cor(SQFT, PRICE)
## [1] 0.8768234
We see that the correlation coefficient aligns with the graphical representation in the scatterplot.
Attaching a dataframe is a convenient way to refer to its variables by name. It is always a good idea to
detach a dataframe when we are done with it. We can do this by using the detach() command.
detach(data17)
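As an aside, an alternative to the attach/detach pattern is the with() function, which makes the dataframe’s
names available for a single expression only, so there is nothing to detach afterwards. For example, this
reproduces the correlation we computed above
with(data17, cor(SQFT, PRICE))
## [1] 0.8768234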
Let’s see once again what R has in its memory.
ls()
## [1] "data17" "X" "X100" "X100a" "Z" "zdraw"
We can clean (i.e., erase) all objects from memory in R.
rm(list=ls())
ls()
## character(0)
That covers some of the R basics you will use in this course. We recommend that you consult this reference
throughout the course.
Ordinary Least Squares
Introduction
This chapter introduces the most commonly used regression estimation technique, Ordinary Least Squares
(OLS). OLS is a regression estimation technique that calculates the coefficients β̂ so as to minimize the sum
of the squared residuals; that is, OLS minimizes
∑ ei² = ∑ (Yi − Ŷi)²
where the sum is taken over the n sample observations.
An estimator is a mathematical technique that is applied to a
sample of data to produce real world numerical
estimates of the true population regression coefficients. Thus,
OLS is an estimator, and a β̂ produced by
OLS is an estimate.
How does OLS work?
A Single-Independent-Variable Regression Model
Recall the theoretical equation: Yi = β0 + β1Xi + εi
OLS selects those estimates of β0 and β1 that minimize the total squared residuals, where the sum is taken
over all the sample data points. The formulas for these coefficients are:
β̂1 = ∑ (Xi − X̄)(Yi − Ȳ) / ∑ (Xi − X̄)²
β̂0 = Ȳ − β̂1X̄
where X̄ = ∑ Xi / N and Ȳ = ∑ Yi / N
Manual Calculation of Estimated Regression Coefficients
In this section we will calculate the estimated regression coefficients β̂0 and β̂1 manually using R. The
first thing we will do is read a .csv file into R as a dataframe named mydata. The file, FINAID2.csv, is
available on Canvas in this course module. Save it to your current working directory for ease of use.
rm(list=ls())
setwd("C:/myfile/University of Utah/2019 Spring/Econometrics")
mydata <- read.csv("FINAID2.csv", header = TRUE)
head(mydata)
## OBS FINAID PARENT HSRANK MALE
## 1 1 19640 0 92 0
## 2 2 8325 9147 44 1
## 3 3 12950 7063 89 0
## 4 4 700 33344 97 1
## 5 5 7000 20497 95 1
## 6 6 11325 10487 96 0
This last function shows the first 6 rows, to verify that the .csv was read in correctly and to let us examine
the variable names and discover whether they are numeric or character. We will focus on two of the variables
for our initial regression: one dependent variable and one independent variable. The dependent variable, or
left-hand-side (LHS) variable, is called Yi in our formula; the independent variable, or right-hand-side (RHS)
variable, is labelled Xi in our formula.
For our first example we will use FINAID as the dependent
variable and HSRANK as the independent variable
(our first model specification).
Next, we will attach the dataframe we just created. Then we will
proceed to manually calculate estimated
coefficients using OLS estimation, as done in the first section
of this chapter of the text.
attach(mydata)
For ease of use and understanding, for this example we will
identify the dependent variable, FINAID, as Y in
R and the independent variable, HSRANK, as X.
Y <- FINAID
X <- HSRANK
Next we need to tell R what to use for Ȳ and X̄, easily done with the mean command.
YBAR <- mean(Y)
XBAR <- mean(X)
Next we will create vectors for (Yi − Ȳ) and (Xi − X̄), calling them YDIFF and XDIFF respectively.
YDIFF <- Y - YBAR
XDIFF <- X - XBAR
To get an idea of what we are creating here, let’s combine the vectors we just created into a table called
table, using the column bind, or cbind, command. Then we will use the head command to view it.
table <- cbind(Y, X, YDIFF, XDIFF)
head(table)
## Y X YDIFF XDIFF
## [1,] 19640 92 7963.74 10.38
## [2,] 8325 44 -3351.26 -37.62
## [3,] 12950 89 1273.74 7.38
## [4,] 700 97 -10976.26 15.38
## [5,] 7000 95 -4676.26 13.38
## [6,] 11325 96 -351.26 14.38
This is comparable to the first few columns of Table 1 on page 39 of the text. Continuing on, we can have R
perform the remaining intermediate calculations to estimate our regression coefficients. We will calculate
(Xi − X̄)² and name it XDIFFSQ, as such:
XDIFFSQ <- (XDIFF)^2
As well as (Xi − X̄)(Yi − Ȳ), which we will call PRODUCT
PRODUCT <- XDIFF * YDIFF
We will now calculate β̂1, using the sum command in R, and
assigning the name BETAHAT1 to the output.
BETAHAT1 <- sum(PRODUCT)/sum(XDIFFSQ)
BETAHAT1
## [1] 64.68226
We can also now calculate β̂0 as
BETAHAT0 <- YBAR - BETAHAT1 * XBAR
BETAHAT0
## [1] 6396.894
Let’s check that our estimators look right. We are going to determine the estimated regression coefficients
with OLS automatically in R by using the lm function. Intuitively, lm allows us to write out the equation
similar to how we do on paper: the LHS variable, a tilde (~) in place of an equals sign, and then the RHS
variable. In the (almost universal) case that we want to save our results, we store them in a new variable
we create. Here, we’ll name it lm1.
lm1 <- lm(Y ~ X)
lm1
##
## Call:
## lm(formula = Y ~ X)
##
## Coefficients:
## (Intercept) X
## 6396.89 64.68
We see from the output that the intercept, β̂0, matches what we calculated, as does the coefficient
corresponding to X.
Having computed the estimated regression coefficients using OLS manually (and checked them with lm), we will
now move on to a more typical procedure for estimating regression models with OLS in R.
OLS Estimation with R functions
We will start out by examining the data. Exploring the data using numerical and graphical techniques is
an important step in creating a model (part of a larger process we will develop over this course). R has a
very useful summary command for descriptive statistics; let’s run it on our data set, first on our dependent
variable, then on our independent variable, so we can explore the general shape of the data sample for each
variable.
summary(FINAID)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 7869 11262 11676 15809 22148
Next let’s explore this variable visually using a graph called a
histogram.
hist(FINAID)
[Figure: Histogram of FINAID, with FINAID on the horizontal axis and Frequency on the vertical axis]
The next commands refine the histogram by adding a smooth curve, called a kernel density curve, that adds
information to our exploration. You can think of this curve as an approximation to the actual probability
density function (PDF).
hist(FINAID, prob = TRUE)
lines(density(FINAID))
[Figure: Histogram of FINAID on a density scale, with a kernel density curve overlaid]
And now, do the same analyses for our independent variable,
HSRANK.
summary(HSRANK)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.00 80.00 89.00 81.62 96.75 99.00
hist(HSRANK, prob = TRUE)
lines(density(HSRANK))
[Figure: Histogram of HSRANK on a density scale, with a kernel density curve overlaid]
Now we are prepared to run our regression. Once again we will utilize lm, which will run an OLS regression
on the data, along with a new function, abline, which will use the estimates from the regression to draw a
line on the plot.
As before we will want to save our results, storing them in an arbitrarily named variable, mymodel. To see the
results, we again use the summary function, this time with the newly created model variable as the argument
to the function.
mymodel <- lm(FINAID ~ HSRANK)
summary(mymodel)
##
## Call:
## lm(formula = FINAID ~ HSRANK)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11971.1 -3966.3 -838.7 4436.4 9412.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6396.89 3281.34 1.949 0.0571 .
## HSRANK 64.68 39.15 1.652 0.1050
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5273 on 48 degrees of freedom
## Multiple R-squared: 0.05381, Adjusted R-squared: 0.03409
## F-statistic: 2.73 on 1 and 48 DF, p-value: 0.105
The resulting summary contains a great deal of information. For this module we will focus on reading the
(Intercept) and HSRANK coefficients, specifically the values in the Estimate column. To be clear, R is
telling us that β̂0 is the value in the Estimate column next to (Intercept) and β̂1 is the value in the
Estimate column next to HSRANK. The other aspect we want to focus on is the Multiple R-squared, R’s way of
identifying what we normally call R², and the Adjusted R-squared. Adjusted R-squared will come in handy once
we start adding additional independent variables. Note the particularly low value of R². We will examine how
these change as independent variables are added.
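As a side note (a sketch of a useful extraction, not something we need yet), the full coefficient table can
be pulled out of the summary object as a matrix, which is handy if you want to compute with the estimates,
standard errors, t values, and p-values rather than read them off the screen
coef(summary(mymodel)) # returns the Estimate, Std. Error, t value, and Pr(>|t|) columns as a matrix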
Now we insert the linear regression line into a plot of the data. Note that by convention, we put our
independent variable on the X-axis and the dependent variable on the Y-axis; with plot this means naming
the independent variable as the first argument and the dependent variable as the second. The third argument
uses the abline command to insert the regression estimated in the last step as a line in the plot.
plot(HSRANK, FINAID, abline(mymodel))
[Figure: scatterplot with HSRANK on the horizontal axis and FINAID on the vertical axis, with the fitted
regression line drawn through the points]
Multivariate Regression Model
Recall the multivariate equation:
Yi = β0 + β1X1i + β2X2i + ... + βKXKi + εi
A multivariate regression coefficient indicates the change in the dependent variable associated with a
one-unit increase in one independent variable, holding the other independent variables in the equation
constant. For example, β1 measures the impact on Y of a one-unit increase in X1, holding constant X2,
X3, ..., and XK.
The OLS estimation of multivariate models is identical in
general approach to the OLS estimation of models
with just one independent variable.
To demonstrate multivariate regression, we will extend our initial model twice. The first extension will
add PARENT as an independent variable; the second will add MALE, a special type of independent variable
called a dummy variable.
As we are adding new variables, we should proceed with data
exploration, and a plot to begin our understanding
of the relationship between dependent and independent
variables.
summary(PARENT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3382 8934 12284 17818 64305
hist(PARENT, prob = TRUE)
lines(density(PARENT))
[Figure: Histogram of PARENT on a density scale, with a kernel density curve overlaid]
plot(PARENT, FINAID)
[Figure: scatterplot with PARENT on the horizontal axis and FINAID on the vertical axis]
Having done our initial data exploration, we will run and interpret a multivariate regression. In R this is
very simple: just add the new independent variable to the RHS of the model specification:
mymodel <- lm(FINAID ~ PARENT + HSRANK)
summary(mymodel)
##
## Call:
## lm(formula = FINAID ~ PARENT + HSRANK)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6353.1 -1905.9 280.9 1942.5 6675.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8926.92907 1739.08253 5.133 5.36e-06 ***
## PARENT -0.35677 0.03169 -11.260 6.06e-15 ***
## HSRANK 87.37815 20.67413 4.226 0.000108 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2771 on 47 degrees of freedom
## Multiple R-squared: 0.7441, Adjusted R-squared: 0.7332
## F-statistic: 68.33 on 2 and 47 DF, p-value: 1.229e-14
Notice in this model the change in the coefficient on HSRANK, as well as the changes in Multiple R-squared
and Adjusted R-squared. Multiple R-squared increased, as we expect it to when adding additional independent
variables, but more importantly Adjusted R-squared has increased, which accounts for the change in degrees
of freedom and is a more relevant measure for comparing this model specification to the previous one.
When looking at the model, it is important to intuitively understand the signs of the individual estimated
coefficients. Holding parents’ ability to contribute constant, does the financial aid increase of $87.40 for
each percentage point increase in GPA rank make sense? And, vice versa, holding GPA rank constant, does the
estimated coefficient on the parental financial contribution seem logical? To pull up the coefficients on
their own, let’s use the coef command in R and look at what it tells us.
coef(mymodel)
## (Intercept) PARENT HSRANK
## 8926.9290669 -0.3567721 87.3781524
It will be helpful to create some plots to visualize these relationships. The plots will be similar to the
plots on page 44 of the text, which illustrate these points. For the impact of parents’ ability to
contribute on financial aid, we can ask R the following:
plot(PARENT, FINAID, abline(coef(mymodel)[1], coef(mymodel)[2]))
[Figure: scatterplot with PARENT on the horizontal axis and FINAID on the vertical axis, with the fitted
line drawn through the points]
This command told R to plot the PARENT and FINAID variables, using abline (think of it as literally plotting
an intercept "a" and a slope "b" to make a line) to grab the intercept coefficient ([1]) and the coefficient
of PARENT, which happened to be the first independent variable we wrote in our model (R thinks of it as the
coefficient after the intercept one, hence the [2]). So we can see the apparent negative relationship between
parents’ ability to pay and financial aid.
Similarly we can call a graph for financial aid and high school GPA rank:
plot(HSRANK, FINAID, abline(coef(mymodel)[1], coef(mymodel)[3]))
[Figure: scatterplot with HSRANK on the horizontal axis and FINAID on the vertical axis, with the fitted
line drawn through the points]
Finally, we will introduce the idea of dummy variables and include one in a regression. A dummy is simply
a categorical variable, as in male or female, young or old, New York or San Francisco, and so forth. In the
data it is coded as a 1 or 0. Our dummy variable here is MALE. With categorical variables, the only data
exploration that makes sense is to see the count of each category; a convenient way to do this in R is the
table function:
table(MALE)
## MALE
## 0 1
## 27 23
Here the count of MALE is 23; the rest are presumably female.
And now we are ready to include that as an independent
variable:
mymodel <- lm(FINAID ~ PARENT + HSRANK + MALE)
summary(mymodel)
##
## Call:
## lm(formula = FINAID ~ PARENT + HSRANK + MALE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5401.7 -2388.1 292.1 2069.4 5233.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.813e+03 1.743e+03 5.630 1.04e-06 ***
## PARENT -3.427e-01 3.151e-02 -10.879 2.61e-14 ***
## HSRANK 8.326e+01 2.015e+01 4.132 0.00015 ***
## MALE -1.570e+03 7.843e+02 -2.002 0.05120 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2687 on 46 degrees of freedom
## Multiple R-squared: 0.7646, Adjusted R-squared: 0.7493
## F-statistic: 49.81 on 3 and 46 DF, p-value: 1.721e-14
Note again the change in Adjusted R-squared. Can you explain why that change is relevant? And why did
Multiple R-squared increase; would we expect that every time we add independent variables?
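One way to convince yourself is to fit the two specifications side by side and pull out the fit measures we
read from the summaries above. This is a sketch; fit2 and fit3 are names of our own choosing, and the values
in the comments come from the summary output shown earlier
fit2 <- lm(FINAID ~ PARENT + HSRANK)
fit3 <- lm(FINAID ~ PARENT + HSRANK + MALE)
summary(fit2)$r.squared # 0.7441; Multiple R-squared never falls when a variable is added
summary(fit3)$r.squared # 0.7646
summary(fit2)$adj.r.squared # 0.7332; Adjusted R-squared rises only if the added variable
summary(fit3)$adj.r.squared # 0.7493; offsets the lost degree of freedom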
Good data hygiene requires that we detach our data set when we are finished with it.
detach(mydata)
Total, Explained, and Residual Sums of Squares
The squared deviation of Y around its mean is used to measure how much of the variation of the dependent
variable is explained by the estimated regression equation. It is called the total sum of squares (TSS):
TSS = ∑ (Yi − Ȳ)²
TSS has two components, variation that can be explained by the regression and variation that cannot:
∑ (Yi − Ȳ)² = ∑ (Ŷi − Ȳ)² + ∑ ei²
TSS = ESS + RSS
The Explained Sum of Squares (ESS) measures the amount of the squared deviation of Yi from its mean that is
explained by the regression. The Residual Sum of Squares (RSS) is the part of the TSS that is unexplained by
the estimated regression. The smaller RSS is relative to the TSS, the better the estimated regression line
fits the data.
To apply this, we return to the manual calculation of estimated coefficients with OLS from the section Manual
Calculation of Estimated Regression Coefficients. To calculate ESS and RSS, we need Ŷi. We have R determine
Ŷi using the estimated equation of the model, like the one found on page 34 of the text,
Ŷi = β̂0 + β̂1Xi
YHAT <- BETAHAT0 + BETAHAT1 * X
YHAT
## [1] 12347.662 9242.913 12153.615 12671.073 12541.709 12606.391 12735.755
## [8] 10924.652 9566.325 11571.475 12218.297 11248.063 12282.980 11636.157
## [15] 12735.755 11571.475 12218.297 11700.839 12735.755 9631.007 12735.755
## [22] 8984.184 9566.325 12282.980 11700.839 12735.755 12800.438 12218.297
## [29] 12671.073 10213.147 12671.073 11830.204 11830.204 12735.755 12541.709
## [36] 12800.438 7690.539 12153.615 9048.867 12347.662 11959.568 12024.251
## [43] 11830.204 12800.438 12153.615 10083.783 11830.204 11636.157 12800.438
## [50] 10795.288
Now that we have our Ŷi data, we can calculate our residuals,
ei, where ei = Yi − Ŷi. Just a reminder, these
residual errors represent the difference between the estimated
Ŷi values, and the actual sample values, Yi.
RESID <- Y - YHAT
RESID
## [1] 7292.33816 -917.91343 796.38493 -11971.07314 -5541.70862
## [6] -1281.39088 6429.24460 -3924.65216 -1641.32473 -96.47474
## [11] 6571.70267 -2358.06345 5307.02041 6128.84300 1364.24460
## [16] 7393.52526 -7718.29733 -3750.83926 -5735.75540 -2356.00698
## [21] -4735.75540 -4694.18440 -1391.32473 -932.97959 3624.16074
## [26] 9412.24460 4619.56235 6771.70267 -1496.07314 3886.85269
## [31] -5671.07314 -3980.20378 -11830.20378 -5735.75540 3558.29138
## [36] -4800.43765 809.46077 -4578.61507 4701.13334 -5347.66184
## [41] -759.56829 2425.74945 3434.79622 7669.56235 -2603.61507
## [46] 5886.21721 359.79622 163.84300 8839.56235 -1595.28764
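As a sanity check (a sketch, assuming the lm1 object from the earlier section is still in memory), our manual
residuals should match the ones lm computed for the same model
all.equal(RESID, unname(residuals(lm1)))
## [1] TRUE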
Now we can calculate the ESS as:
ESS <- sum((YHAT - YBAR)^2)
ESS
## [1] 75893113
We can also calculate the RSS with ease:
RSS <- sum(RESID^2)
RSS
## [1] 1334607437
Finally, the TSS will be:
TSS <- sum((Y - YBAR)^2)
TSS
## [1] 1410500550
As stated above, TSS = ESS + RSS; let’s check that now.
ESS + RSS
## [1] 1410500550
Looks good!
The Overall Fit of the Estimated Model
R²
R² is the ratio of the explained sum of squares to the total sum of squares:
R² = ESS/TSS = 1 − RSS/TSS = 1 − ∑ ei² / ∑ (Yi − Ȳ)²
R² measures the percentage of the variation of Y around Ȳ that is explained by the regression. The higher
R² is, the more closely the estimated regression equation fits the sample data. R² lies in the interval
0 ≤ R² ≤ 1, and a value of R² close to one shows an excellent overall fit.
Let’s manually calculate R² as ESS/TSS:
R2 <- ESS/TSS
R2
## [1] 0.0538058
Now we will grab the R² value from the lm model we executed earlier, by telling the summary function to
return it for us specifically:
summary(lm1)$r.squared
## [1] 0.0538058
The R² value matches ours, a good sign. However, as you may have noticed, it is rather low, a sign that the
estimated regression line is not fitting the sample data very well.
Correlation Coefficient (r)
The simple correlation coefficient, r ∈ [−1, 1], is a measure of the strength and direction of the linear
relationship between two variables. The sign of r indicates the direction of the correlation between the two
variables.
If r = +1, the two variables are perfectly positively correlated. If r = −1, the two variables are perfectly
negatively correlated. If r = 0, the two variables are totally uncorrelated.
Checking this in R, we use the cor command:
cor(Y,X)
## [1] 0.2319608
revealing a relatively weak positive correlation.
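As an aside, in a regression with a single independent variable, R² is exactly the square of r, which we can
verify with the values we already have
cor(Y, X)^2
## [1] 0.0538058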
The Adjusted R² (R̄²)
R̄² measures the percentage of the variation of Y around its mean that is explained by the regression
equation, adjusted for degrees of freedom:
R̄² = 1 − [∑ ei² / (N − K − 1)] / [∑ (Yi − Ȳ)² / (N − 1)]
N − K − 1 is the degrees of freedom: the excess of the number of observations, N, over the number of
coefficients estimated, K + 1 (including the intercept). The value of R̄² can be used to compare the fits of
equations with the same dependent variable and different numbers of independent variables.
For one final exercise we will calculate R̄² for our last model specification. The first step is to extract
the residuals. We can pull the residuals from mymodel by asking R to specifically return them:
RESIDS <- mymodel$residuals
Now we can square them and sum them up:
RSQ <- sum(RESIDS^2)
Notice that we can reuse the TSS we calculated previously: it only makes use of the sample Yi values and
their mean, not any estimated data, so it is still relevant to this model’s R̄². Knowing that, we can
calculate R̄² as
RBAR <- 1 - ((RSQ/46)/(TSS/49))
RBAR
## [1] 0.7492617
We can compare that to the R̄² calculated in the mymodel summary, as such:
summary(mymodel)$adj.r.squared
## [1] 0.7492617
Good; we see our calculation was correct.
Econ 4650, Spring 2019
Assignment 1
Introduction
Welcome to Assignment 1!
It is suggested that you do the following before attempting this assignment:*
• Read the appropriate textbook chapters (Chapter 1 & Chapter
2 in the Custom Text;
Chapter 17 & Chapter 1 in the full text)
• Read the Econ_4650_Sprin_2019_Intro_to_R.pdf in Module
1.a
• Install the software R
• Read OrdinaryLeastSquares pdf in Module 1.b
• Take a deep breath!
*Note: you may also watch the Module 1 video tutorials, but this is not required. Also, please note that
there have been adjustments to the number of assignments and the structure of the course since the time the
videos were made.
Definitions for Annual Consumption of Chicken Data
Y = per capita chicken consumption (in pounds) in a given year
PC = the price of chicken (in cents per pound) in a given year
PB = the price of beef (in cents per pound) in a given year
YD = U.S. per capita disposable income (in hundreds of dollars) in a given year
Assignment 1
[1] Download the data file CHICK6.csv from Assignment 1.
[2] Load the CHICK6.csv data into R and rename it so that it includes your own name or initials (for example,
we might name our dataset JunfuCHICK6).
[3] Generate the mean and the standard deviation for the
variables Y, PC, PB, and YD from the CHICK6.csv
data set.
[4] You are going to run a regression (in [5]) where Y is the
dependent variable, and PC, PB, and YD are
the independent variables. Discuss whether you think the
independent variable PB will have a negative or
positive effect on the dependent variable. In your decision, try
to use as much economic theory as you can –
theory is what motivates what variables are included in a model
and what sign we anticipate for the model’s
estimates.
[5] Estimate the model in R and present the results.
[6] Interpret the results for the coefficient on PB from your model. Make sure to include whether or not the
result aligned with your expectations.