Econ 4650
Spring 2019
Introduction to R
Basic Statistics & R
This chapter provides a good review of basic statistics that we
will use throughout the course. It should serve
as a reference guide for you as we discuss estimation, sampling
distributions, and hypothesis testing. We will
utilize normal, F, and t distributions frequently. Tables for these
distributions are provided at the end of the
text.
In this module, you will learn how to enter data into R from a
CSV file, describe the data, and produce
simple plots.
Assuming you have R installed on your computer, you can type (or copy and paste) everything you see in the shaded area (written in this font), and you should see the results (after the ##) just as they appear in this PDF. (Notes: [1] if you need to install R, or would like more information on the software, you can find both at https://cran.r-project.org/; [2] there are a few tasks we will use R for in this course that require a computer on which you have administrative rights.)
For example, let's use R as a simple calculator to find 2 + 2. What I do is type "2 + 2" and press the Enter key:
2 + 2
## [1] 4
We see that R returns the correct answer.
Now, let’s try 2 x 3
2 * 3
## [1] 6
and 2³
2^3
## [1] 8
We can also get the natural log
log(2.71821)
## [1] 0.9999736
log(exp(1))
## [1] 1
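Note (an addition to the handout): log() computes the natural log by default, but it also accepts a base argument if another base is needed, for example
log(8, base = 2)
## [1] 3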
Let's input the data on the heights of a random sample of U.S. women (see page 9 of the custom edition of the text). As in the text, we will call the data X
X <- c(61.0, 63.5, 66.0, 68.5, 71.0)
We can see, in R, what is in our workspace by typing ls()
ls()
## [1] "X"
R knows that there is something called X and can report back
the values in X
X
## [1] 61.0 63.5 66.0 68.5 71.0
The textbook states that the mean (µ) of these data is known to
be equal to 66 with a known standard
deviation (σ) of 2.5 inches. Let’s use R to calculate the
standardized Z-values from our data
Z <- (X - 66)/2.5
and we can see the results
Z
## [1] -2 -1 0 1 2
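As a quick check (an addition to the handout), the standardized values average to zero here because the sample mean of X happens to equal the assumed µ of 66
mean(Z)
## [1] 0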
In R, we can calculate the mean, variance, and standard
deviation for a particular sample of data
mean(X)
## [1] 66
var(X)
## [1] 15.625
sd(X)
## [1] 3.952847
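Note that the standard deviation is simply the square root of the variance, which we can verify
sqrt(var(X))
## [1] 3.952847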
Let’s assume that heights are distributed according to the
normal distribution with mean 66 and standard
deviation 2.5. We can have R randomly sample from this
distribution. Let’s create a sample of 100 draws
X100 <- rnorm(100, mean = 66, sd = 2.5)
and find
mean(X100)
## [1] 66.2055
sd(X100)
## [1] 2.910129
Now let's see what a simple histogram of these data looks like
hist(X100)
[Figure: "Histogram of X100"; x-axis X100, y-axis Frequency]
When you do this on your computer, you will get different
results. That is because your random draw will be
different.
We can reproduce the same draw each time by setting the random number seed, like this
set.seed(100)
X100a <- rnorm(100, mean = 66, sd = 2.5)
And we find
mean(X100a)
## [1] 66.00728
sd(X100a)
## [1] 2.551776
With the same histogram
hist(X100a)
[Figure: "Histogram of X100a"; x-axis X100a, y-axis Frequency]
The quantiles of the data can be determined by
quantile(X100a)
## 0% 25% 50% 75% 100%
## 60.32019 64.47788 65.85145 67.63973 72.45490
So the mean of this particular random sample is 66.007 and the
median is 65.85. As an aside, let’s look at
the distribution of a draw of size 20,000 from a standard normal
set.seed(666)
zdraw <- rnorm(20000, mean = 0, sd = 1)
mean(zdraw)
## [1] 0.007248452
quantile(zdraw, prob = c(0.01, 0.05, 0.5, 0.95, 0.99))
## 1% 5% 50% 95% 99%
## -2.34318795 -1.63898524 0.01879137 1.65229300 2.30314467
You can compare this draw with the theoretical values for our
requested percentiles (.01, .05, .5, .95, .99) of
-2.326, -1.645, 0, 1.645, and 2.326.
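As a side note (not in the original handout), R can produce those theoretical percentiles itself with the qnorm function
qnorm(c(0.01, 0.05, 0.5, 0.95, 0.99))
## [1] -2.326348 -1.644854 0.000000 1.644854 2.326348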
Here is a plot of the large draw
plot(density(zdraw))
[Figure: density plot titled "density.default(x = zdraw)", N = 20000, Bandwidth = 0.1242; y-axis Density]
Throughout the class we will generate random data to learn about econometrics; for results to match, setting the random number seed via set.seed() is the key R command. Later on we will learn how to use many of the built-in random distribution functions in R (just as we did here by generating a sample of size 100 drawn from a normal distribution).
Read Data
Entering data the way we did before is awkward. So now let’s
see how to read in a set of data that comes in
a csv file (comma separated values). These files can be read by
Excel and Excel can export csv files. They
are easily read in R (and other statistical packages).
Let’s begin with data from Table 1 on a sample of 22 single-
family homes in Diamond Bar, California.
I downloaded the Excel file and saved it as a csv file under STATS17.csv. Before reading the file into R, I changed my working directory (i.e., the place on my computer that R will read and write files to) with the command setwd(). I can verify that STATS17.csv is in my working directory by asking R to report all of the file names saved in this folder with the dir() command
setwd("C:/myfile/University of Utah/2019 Spring/Econometrics")
dir()
## [1] "Assignment 1.pdf" "Chapter 1.pdf" "STATS17.csv"
Indeed, the file is listed as being saved in my working
directory. I can read it into R and name it data17 by
typing
data17 <- read.csv("STATS17.csv", header = T)
Note that R is case sensitive, so Data17 will be different from data17, datA17, etc. (In the read command, the header argument is telling R that the first row contains the variable names, or labels.)
Here are what the data look like
data17
## X OBS PRICE SQFT
## 1 1 1 425000 1349
## 2 2 2 451500 1807
## 3 3 3 508560 1651
## 4 4 4 448050 1293
## 5 5 5 500580 1745
## 6 6 6 524160 1900
## 7 7 7 500580 1759
## 8 8 8 399330 1740
## 9 9 9 442020 1950
## 10 10 10 537660 1771
## 11 11 11 515100 2078
## 12 12 12 589000 2268
## 13 13 13 696000 2400
## 14 14 14 540750 2050
## 15 15 15 659200 2267
## 16 16 16 492450 1986
## 17 17 17 567047 2950
## 18 18 18 684950 2712
## 19 19 19 668470 2799
## 20 20 20 733360 2933
## 21 21 21 775590 3203
## 22 22 22 788888 2988
Here are the summary statistics for data17
summary(data17)
## X OBS PRICE SQFT
## Min. : 1.00 Min. : 1.00 Min. :399330 Min. :1293
## 1st Qu.: 6.25 1st Qu.: 6.25 1st Qu.:494483 1st Qu.:1762
## Median :11.50 Median :11.50 Median :530910 Median :2018
## Mean :11.50 Mean :11.50 Mean :565829 Mean :2164
## 3rd Qu.:16.75 3rd Qu.:16.75 3rd Qu.:666153 3rd Qu.:2634
## Max. :22.00 Max. :22.00 Max. :788888 Max. :3203
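summary() does not report standard deviations; as a supplement to the handout, we could compute them for the two variables of interest with sapply, which applies a function to each selected column and returns the sample standard deviation of each
sapply(data17[c("PRICE", "SQFT")], sd)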
We can generate a scatterplot of these data
plot(data17$SQFT,data17$PRICE)
[Figure: scatterplot of data17$PRICE against data17$SQFT]
Here is another way to access the variable names, or labels, for the SQFT and PRICE data in the data17 dataframe
attach(data17)
## The following object is masked _by_ .GlobalEnv:
##
## X
plot(SQFT, PRICE)
[Figure: scatterplot of PRICE against SQFT]
Above, we attached data17 so that we could easily work with the variables in the file.
We can also get the correlation between the variables SQFT and
PRICE
cor(SQFT, PRICE)
## [1] 0.8768234
We see that the strong positive correlation coefficient aligns with the upward pattern in the scatterplot.
Attaching a dataframe is a convenient way to refer to its variables by name. It is always a good idea to detach dataframes when we are done, which we can do with the detach() command.
detach(data17)
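As an aside, an alternative to attach()/detach() is the $ notation we used earlier, which avoids masking issues entirely and reproduces the same correlation
cor(data17$SQFT, data17$PRICE)
## [1] 0.8768234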
Let’s see once again what R has in its memory.
ls()
## [1] "data17" "X" "X100" "X100a" "Z" "zdraw"
We can clean (i.e., erase) all objects from memory in R.
rm(list=ls())
ls()
## character(0)
That covers some of the R basics you will use in this course. We recommend that you consult this reference throughout the course.
Ordinary Least Squares
Introduction
This chapter introduces the most commonly used regression
estimation technique, Ordinary Least Squares
(OLS). OLS is a regression estimation technique that calculates
the coefficients β̂ so as to minimize the sum
of the squared residuals, that is OLS minimizes
$\sum_{i=1}^{n} e_i^2$, or $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$.
An estimator is a mathematical technique that is applied to a
sample of data to produce real world numerical
estimates of the true population regression coefficients. Thus,
OLS is an estimator, and a β̂ produced by
OLS is an estimate.
How does OLS work?
A Single-Independent-Variable Regression Model
Recall the theoretical equation: Yi = β0 + β1Xi + εi
OLS selects those estimates of β0 and β1 that minimize the total
squared residuals, where the sum is taken
over all the sample data points. The formulas for these coefficients are:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

where $\bar{X} = \sum X_i / N$ and $\bar{Y} = \sum Y_i / N$.
Manual Calculation of Estimated Regression Coefficients
In this section we will calculate estimated regression
coefficients β̂0 and β̂1 manually using R. The first thing
we will do is read a .csv file into R as a dataframe named
mydata. The file, "FINAID2.csv", is available on Canvas in this course module. Save it to your current working directory for ease of use.
rm(list=ls())
setwd("C:/myfile/University of Utah/2019 Spring/Econometrics")
mydata <- read.csv("FINAID2.csv", header = TRUE)
head(mydata)
## OBS FINAID PARENT HSRANK MALE
## 1 1 19640 0 92 0
## 2 2 8325 9147 44 1
## 3 3 12950 7063 89 0
## 4 4 700 33344 97 1
## 5 5 7000 20497 95 1
## 6 6 11325 10487 96 0
This last function shows the first 6 rows to verify the .csv was
read in correctly, and to let us examine the
variable names and discover whether they are numeric or
character. We will focus on two of the variables
for our initial regression: one dependent variable and one
independent variable. The dependent variable, or
left-hand-side (LHS) variable, is called Yi in our formula; the
independent variable, or right-hand-side (RHS)
variable is labelled Xi in our formula.
For our first example we will use FINAID as the dependent
variable and HSRANK as the independent variable
(our first model specification).
Next, we will attach the dataframe we just created. Then we will
proceed to manually calculate estimated
coefficients using OLS estimation, as done in the first section
of this chapter of the text.
attach(mydata)
For ease of use and understanding, for this example we will
identify the dependent variable, FINAID, as Y in
R and the independent variable, HSRANK, as X.
Y <- FINAID
X <- HSRANK
Next we need to tell R what to use for Ȳ and X̄, easily done with the mean command.
YBAR <- mean(Y)
XBAR <- mean(X)
Next we will create vectors for (Yi − Ȳ) and (Xi − X̄), calling them YDIFF and XDIFF, respectively.
YDIFF <- Y - YBAR
XDIFF <- X - XBAR
To get an idea of what we are creating here, let's combine the vectors we just created into a table called table, using the column bind, or cbind, command. Then we will use the head command to view it.
table <-cbind(Y, X, YDIFF, XDIFF)
head(table)
## Y X YDIFF XDIFF
## [1,] 19640 92 7963.74 10.38
## [2,] 8325 44 -3351.26 -37.62
## [3,] 12950 89 1273.74 7.38
## [4,] 700 97 -10976.26 15.38
## [5,] 7000 95 -4676.26 13.38
## [6,] 11325 96 -351.26 14.38
This is comparable to the first few columns of Table 1 on page
39 of the text. Continuing on, we can have R
perform the remaining intermediate calculations to estimate our
regression coefficients. We will calculate
(Xi − X̄)² and name it XDIFFSQ, as such:
XDIFFSQ <- (XDIFF)^2
As well as (Xi − X̄)(Yi − Ȳ), which we will call PRODUCT
PRODUCT <- XDIFF * YDIFF
We will now calculate β̂1, using the sum command in R, and
assigning the name BETAHAT1 to the output.
BETAHAT1 <- sum(PRODUCT)/sum(XDIFFSQ)
BETAHAT1
## [1] 64.68226
We can also now calculate β̂0 as
BETAHAT0 <- YBAR - BETAHAT1 * XBAR
BETAHAT0
## [1] 6396.894
Let’s check that our estimators look right. We are going to
determine estimated regression coefficients with
OLS automatically through R by using the lm function.
Intuitively, lm allows us to write out the equation much as we do on paper: the LHS variable, a tilde (~) in place of an equals sign, and the RHS variable after it. Since we almost always want to save our results, we store them in a new, arbitrarily named variable; here, we'll name it lm1.
lm1 <- lm(Y ~ X)
lm1
##
## Call:
## lm(formula = Y ~ X)
##
## Coefficients:
## (Intercept) X
## 6396.89 64.68
We see from the output that the intercept, or β̂0, matches what we calculated, as does the coefficient corresponding to X.
Having computed the estimated regression coefficients using OLS manually (and checked them with lm), we will now move on to a more typical procedure for estimating regression models with OLS in R.
OLS Estimation with R functions
We will start out by examining the data. Exploring the data
using numerical and graphical techniques is
an important step in creating a model (part of a larger process
we will develop over this course). R has a
very useful summary command for descriptive statistics; let’s
run it on our data set, first on our dependent
variable, then on our independent variable. We can explore the
general shape of the data sample for this
variable.
summary(FINAID)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 7869 11262 11676 15809 22148
Next let’s explore this variable visually using a graph called a
histogram.
hist(FINAID)
[Figure: "Histogram of FINAID"; x-axis FINAID, y-axis Frequency]
The next commands refine the histogram by adding a smooth curve, called a kernel density curve, that adds information to our exploration. You can think of this as an approximation to the actual probability density function (PDF).
hist(FINAID, prob = TRUE)
lines(density(FINAID))
[Figure: "Histogram of FINAID" with kernel density curve; x-axis FINAID, y-axis Density]
And now, do the same analyses for our independent variable,
HSRANK.
summary(HSRANK)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.00 80.00 89.00 81.62 96.75 99.00
hist(HSRANK, prob = TRUE)
lines(density(HSRANK))
[Figure: "Histogram of HSRANK" with kernel density curve; x-axis HSRANK, y-axis Density]
Now we are prepared to run our regression. Once again we will
utilize lm, which will run an OLS regression
on the data, and a new function, abline, which will use the estimated coefficients from the regression to draw a line on the plot.
As before we will want to save our results, storing them in an
arbitrarily named variable, mymodel. To see the
results, we again use the summary function, this time with the
newly created model variable as the argument
to the function.
mymodel <- lm(FINAID ~ HSRANK)
summary(mymodel)
##
## Call:
## lm(formula = FINAID ~ HSRANK)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11971.1 -3966.3 -838.7 4436.4 9412.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6396.89 3281.34 1.949 0.0571 .
## HSRANK 64.68 39.15 1.652 0.1050
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5273 on 48 degrees of freedom
## Multiple R-squared: 0.05381, Adjusted R-squared: 0.03409
## F-statistic: 2.73 on 1 and 48 DF, p-value: 0.105
The resulting summary contains a great deal of information. For
this module we will focus on reading the
(Intercept) and HSRANK coefficients, specifically their estimates in the column titled Estimate.
To be clear, R is telling us β̂0 is the value in the Estimate
column next to (Intercept) and β̂1 is the value
in the Estimate column next to HSRANK. The other aspect we
want to focus on is the Multiple R-Squared,
R’s way of identifying what we normally call R2, and Adjusted
R-squared. Adjusted R-squared will come
in handy once we start adding additional independent variables.
Note the particularly low value of R2. We
will examine how these change as independent variables are
added.
Now we insert the linear regression line into a plot of the data.
Note that by convention, we put our
independent variable on the X-axis and the dependent variable
on the Y -axis; with plot this means naming
the independent variable as the first argument and the dependent
variable as the second. The third argument
uses the abline command to insert the regression estimated in the last step as a line in the plot.
plot(HSRANK, FINAID, abline(mymodel))
[Figure: scatterplot of FINAID against HSRANK with the fitted regression line]
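A note on style (an addition to the handout): passing abline(mymodel) as an argument to plot happens to work because R evaluates it for its side effect while the plot is drawn. The more conventional, equivalent form is two separate statements:
plot(HSRANK, FINAID)
abline(mymodel)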
Multivariate Regression Model
Recall the multivariate equation:
Yi = β0 + β1X1i + β2X2i + · · · + βKXKi + εi
A multivariate regression coefficient indicates the change in the dependent variable associated with a one-unit increase in one independent variable, holding the other independent variables in the equation constant. For example, β1 measures the impact on Y of a one-unit increase in X1, holding constant X2, X3, . . . , XK.
The OLS estimation of multivariate models is identical in
general approach to the OLS estimation of models
with just one independent variable.
To demonstrate multivariate regression, we will extend our
initial model twice. The first extension will
add PARENT as an independent variable; the second will add MALE, a special type of variable called a dummy variable, as an independent variable.
As we are adding new variables, we should proceed with data
exploration, and a plot to begin our understanding
of the relationship between dependent and independent
variables.
summary(PARENT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3382 8934 12284 17818 64305
hist(PARENT, prob = TRUE)
lines(density(PARENT))
[Figure: "Histogram of PARENT" with kernel density curve; x-axis PARENT, y-axis Density]
plot(PARENT, FINAID)
[Figure: scatterplot of FINAID against PARENT]
Having done our initial data exploration, we will run and interpret a multivariate regression. In R this is very simple: just add the new independent variable to the RHS of the model specification:
mymodel <- lm(FINAID ~ PARENT + HSRANK)
summary(mymodel)
##
## Call:
## lm(formula = FINAID ~ PARENT + HSRANK)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6353.1 -1905.9 280.9 1942.5 6675.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8926.92907 1739.08253 5.133 5.36e-06 ***
## PARENT -0.35677 0.03169 -11.260 6.06e-15 ***
## HSRANK 87.37815 20.67413 4.226 0.000108 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2771 on 47 degrees of freedom
## Multiple R-squared: 0.7441, Adjusted R-squared: 0.7332
## F-statistic: 68.33 on 2 and 47 DF, p-value: 1.229e-14
Notice in this model the newly added coefficient for PARENT, the change in the coefficient on HSRANK, and the changes in Multiple R-squared and Adjusted R-squared. Multiple R-squared increased, as we expect it to when adding additional independent variables, but more importantly Adjusted R-squared has increased, which accounts for the change in degrees of freedom and is a more relevant measure for comparing this model specification to the previous one.
When looking at the model, it is important to intuitively understand the signs of the individual estimated coefficients. Holding parents' ability to contribute constant, does the financial aid increase of $87.38 for each percentage-point increase in high school GPA rank make sense? And vice versa, holding GPA rank constant, does the estimated coefficient on parents' financial contribution seem logical? To pull up the coefficients on their own, let's use the coef command in R and look at what it tells us.
coef(mymodel)
## (Intercept) PARENT HSRANK
## 8926.9290669 -0.3567721 87.3781524
It will be helpful if we create some plots to visualize these
relationships.
The plots will be similar to the plots on page 44 of the text
which illustrate these points. For the impact of
parents’ ability to contribute on financial aid, we can ask R the
following:
plot(PARENT, FINAID, abline(coef(mymodel)[1],
coef(mymodel)[2]))
[Figure: scatterplot of FINAID against PARENT with the fitted regression line]
This command told R to plot the PARENT and FINAID variables and then use abline (think of it as drawing the line with intercept "a" and slope "b") to grab the intercept coefficient ([1]) and the coefficient of PARENT, which happened to be the first independent variable we wrote in our model (R stores it as the coefficient after the intercept, hence the [2]). So we can see the apparent negative relationship between parents' ability to pay and financial aid.
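Relying on positions like [1] and [2] is fragile if the specification changes; as an addition to the handout, note that coefficients can also be extracted by name
coef(mymodel)["PARENT"]
## PARENT
## -0.3567721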
Similarly we can call a graph for financial aid and high school
GPA rank:
plot(HSRANK, FINAID, abline(coef(mymodel)[1],
coef(mymodel)[3]))
[Figure: scatterplot of FINAID against HSRANK with the fitted regression line]
Finally, we will introduce the idea of dummy variables and
include them in a regression. A dummy is simply
a categorical variable, as in male or female, young or old, New
York or San Francisco, and so forth. In the
data it is coded as a 1 or 0. Our dummy variable here is MALE.
With categorical variables, the only data
exploration that makes sense is to see the count of the category;
a convenient way to do this in R is the
table function:
table(MALE)
## MALE
## 0 1
## 27 23
Here the count of MALE is 23; the rest are presumably female.
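Because the dummy is coded as 0 or 1, its mean gives the sample proportion of males directly, a handy shortcut
mean(MALE)
## [1] 0.46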
And now we are ready to include that as an independent
variable:
mymodel <- lm(FINAID ~ PARENT + HSRANK + MALE)
summary(mymodel)
##
## Call:
## lm(formula = FINAID ~ PARENT + HSRANK + MALE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5401.7 -2388.1 292.1 2069.4 5233.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.813e+03 1.743e+03 5.630 1.04e-06 ***
## PARENT -3.427e-01 3.151e-02 -10.879 2.61e-14 ***
## HSRANK 8.326e+01 2.015e+01 4.132 0.00015 ***
## MALE -1.570e+03 7.843e+02 -2.002 0.05120 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2687 on 46 degrees of freedom
## Multiple R-squared: 0.7646, Adjusted R-squared: 0.7493
## F-statistic: 49.81 on 3 and 46 DF, p-value: 1.721e-14
Note again the change in Adjusted R-squared. Can you explain why that change is relevant? And why did Multiple R-squared increase, and would we expect that as we add independent variables?
Good data hygiene requires that we detach our data set when we
are finished with it.
detach(mydata)
Total, Explained, and Residual Sums of Squares
The squared differences of Y around its mean are used to measure how much of the variation of the dependent variable is explained by the estimated regression equation. Their sum is called the total sum of squares (TSS):
$$TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$
TSS has two components, variation that can be explained by the
regression and variation that cannot:
$$\sum_i (Y_i - \bar{Y})^2 = \sum_i (\hat{Y}_i - \bar{Y})^2 + \sum_i e_i^2$$

$$TSS = ESS + RSS$$
Explained Sum of Squares (ESS) measures the amount of the squared deviation of Yi from its mean that is
explained by the regression. Residual Sum of Squares (RSS) is
the part of the TSS that is unexplained by
the estimated regression. The smaller RSS is relative to the
TSS, the better the estimated regression line fits
the data.
To apply this, we return to the manual OLS calculation from the section Manual Calculation of Estimated Regression Coefficients. To calculate ESS and RSS, we need Ŷi, which we have R determine using the estimated equation of the model, like the one found on page 34 of the text,
Ŷi = β̂0 + β̂1Xi
YHAT <- BETAHAT0 + BETAHAT1 * X
YHAT
## [1] 12347.662 9242.913 12153.615 12671.073 12541.709 12606.391 12735.755
## [8] 10924.652 9566.325 11571.475 12218.297 11248.063 12282.980 11636.157
## [15] 12735.755 11571.475 12218.297 11700.839 12735.755 9631.007 12735.755
## [22] 8984.184 9566.325 12282.980 11700.839 12735.755 12800.438 12218.297
## [29] 12671.073 10213.147 12671.073 11830.204 11830.204 12735.755 12541.709
## [36] 12800.438 7690.539 12153.615 9048.867 12347.662 11959.568 12024.251
## [43] 11830.204 12800.438 12153.615 10083.783 11830.204 11636.157 12800.438
## [50] 10795.288
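As a cross-check (not in the original handout), these fitted values can also be obtained automatically from the lm1 model we estimated earlier; the first six should match the values above
head(fitted(lm1))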
Now that we have our Ŷi data, we can calculate our residuals, ei, where ei = Yi − Ŷi. Just a reminder: these residuals represent the difference between the actual sample values, Yi, and the estimated values, Ŷi.
RESID <- Y - YHAT
RESID
## [1] 7292.33816 -917.91343 796.38493 -11971.07314 -5541.70862
## [6] -1281.39088 6429.24460 -3924.65216 -1641.32473 -96.47474
## [11] 6571.70267 -2358.06345 5307.02041 6128.84300 1364.24460
## [16] 7393.52526 -7718.29733 -3750.83926 -5735.75540 -2356.00698
## [21] -4735.75540 -4694.18440 -1391.32473 -932.97959 3624.16074
## [26] 9412.24460 4619.56235 6771.70267 -1496.07314 3886.85269
## [31] -5671.07314 -3980.20378 -11830.20378 -5735.75540 3558.29138
## [36] -4800.43765 809.46077 -4578.61507 4701.13334 -5347.66184
## [41] -759.56829 2425.74945 3434.79622 7669.56235 -2603.61507
## [46] 5886.21721 359.79622 163.84300 8839.56235 -1595.28764
Now we can calculate the ESS as:
ESS <- sum((YHAT - YBAR)^2)
ESS
## [1] 75893113
We can also calculate the RSS with ease:
RSS <- sum(RESID^2)
RSS
## [1] 1334607437
Finally, the TSS will be:
TSS <- sum((Y - YBAR)^2)
TSS
## [1] 1410500550
As stated above, TSS = ESS + RSS; let's check that now.
ESS + RSS
## [1] 1410500550
Looks good!
The Overall Fit of the Estimated Model
R2
R2 is the ratio of the explained sum of squares to the total sum
of squares:
$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum e_i^2}{\sum (Y_i - \bar{Y})^2}$$
R2 measures the percentage of the variation of Y around Ȳ that
is explained by the regression. The higher
R2 is, the closer the estimated regression equation fits the
sample data. R2 lies in the interval 0 ≤ R2 ≤ 1.
A value of R2 close to one indicates an excellent overall fit.
Let’s manually calculate R2 as ESS/TSS:
R2 <- ESS/TSS
R2
## [1] 0.0538058
Now we will grab the R2 value from the lm model we executed earlier by telling the summary function to return it specifically:
summary(lm1)$r.squared
## [1] 0.0538058
The R2 value matches ours, a good sign. However, as you may have noticed, it is rather low, a sign that the estimated regression line does not fit the sample data very well.
Correlation Coefficient (r)
The simple correlation coefficient, r ∈ [−1, 1], is a measure of
the strength and direction of the linear
relationship between two variables. The sign of r indicates the
direction of the correlation between the two
variables.
If r = +1, the two variables are perfectly positively correlated. If r = −1, the two variables are perfectly negatively correlated. If r = 0, the two variables are totally uncorrelated.
Checking this in R, we use the cor command:
cor(Y,X)
## [1] 0.2319608
revealing a relatively weak positive correlation.
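For a regression with a single independent variable, the square of this correlation coefficient equals the R2 we computed above, a useful consistency check
cor(Y, X)^2
## [1] 0.0538058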
The Adjusted R² (R̄²)
R̄² measures the percentage of the variation of Y around its mean that is explained by the regression equation, adjusted for degrees of freedom.

$$\bar{R}^2 = 1 - \frac{\sum e_i^2 / (N - K - 1)}{\sum (Y_i - \bar{Y})^2 / (N - 1)}$$
N − K − 1 is the degrees of freedom, the excess of the number of observations N over the number of coefficients estimated, K + 1 (including the intercept). The value of R̄² can be used to compare the fits of equations with the same dependent variable and different numbers of independent variables.
For one final exercise we will calculate R̄² for our last model specification. The first step is to extract the residuals, which we can pull from mymodel directly:
RESIDS <- mymodel$residuals
Now we can square them and sum them up:
RSQ <- sum(RESIDS^2)
Notice that we can reuse the TSS calculated previously: it depends only on the sample Yi values and their mean, not on any estimated quantities, so it is still valid for this model's R̄². With N = 50 and K = 3 (so N − K − 1 = 46 and N − 1 = 49), we can calculate R̄² as
RBAR <- 1 - ((RSQ/46)/(TSS/49))
RBAR
## [1] 0.7492617
We can compare that to the R̄² calculated by the mymodel summary, as such:
summary(mymodel)$adj.r.squared
## [1] 0.7492617
Good; we see our calculation was correct.
Econ 4650, Spring 2019
Assignment 1
Introduction
Welcome to Assignment 1!
It is suggested you do the following before attempting this
assignment*
• Read the appropriate textbook chapters (Chapter 1 & Chapter
2 in the Custom Text;
Chapter 17 & Chapter 1 in the full text)
• Read the Econ_4650_Sprin_2019_Intro_to_R.pdf in Module
1.a
• Install the software R
• Read OrdinaryLeastSquares pdf in Module 1.b
• Take a deep breath!
*Note: you may also watch the Module 1 video tutorials, but this is not required. Also, please note that there have been adjustments to the number of assignments and the structure of the course since the time the videos were made.
Definitions for Annual Consumption of Chicken Data
Y = per capita chicken consumption (in pounds) in a given year
PC = the price of chicken (in cents per pound) in a given year
PB = the price of beef (in cents per pound) in a given year
YD = U.S. per capita disposable income (in hundreds of dollars) in a given year
Assignment 1
[1] Download the data file CHICK6.csv from Assignment 1.
[2] Load the CHICK6.csv data into R and rename it so that it
includes your own name or initials (for example,
we might name our dataset JunfuCHICK6).
[3] Generate the mean and the standard deviation for the
variables Y, PC, PB, and YD from the CHICK6.csv
data set.
[4] You are going to run a regression (in [5]) where Y is the
dependent variable, and PC, PB, and YD are
the independent variables. Discuss whether you think the
independent variable PB will have a negative or
positive effect on the dependent variable. In your decision, try
to use as much economic theory as you can –
theory is what motivates what variables are included in a model
and what sign we anticipate for the model’s
estimates.
[5] Estimate the model in R and present the results.
[6] Interpret the results for the coefficient PB from your model.
Make sure to include whether or not the
result aligned with your expectations.
1
IntroductionDefinitions for Annual Consumption of Chicken
DataAssignment 1

Más contenido relacionado

Similar a Econ 4650Spring 2019Introduction to RBasic Statistics .docx

Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with RYanchang Zhao
 
ComputeFest 2012: Intro To R for Physical Sciences
ComputeFest 2012: Intro To R for Physical SciencesComputeFest 2012: Intro To R for Physical Sciences
ComputeFest 2012: Intro To R for Physical Sciencesalexstorer
 
R-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdfR-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdfKabilaArun
 
R-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdfR-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdfattalurilalitha
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Yao Yao
 
Big Data Mining in Indian Economic Survey 2017
Big Data Mining in Indian Economic Survey 2017Big Data Mining in Indian Economic Survey 2017
Big Data Mining in Indian Economic Survey 2017Parth Khare
 
Summerization notes for descriptive statistics using r
Summerization notes for descriptive statistics using r Summerization notes for descriptive statistics using r
Summerization notes for descriptive statistics using r Ashwini Mathur
 
Itroroduction to R language
Itroroduction to R languageItroroduction to R language
Itroroduction to R languagechhabria-nitesh
 

Similar a Econ 4650Spring 2019Introduction to RBasic Statistics .docx (20)

Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with R
 
R basics
R basicsR basics
R basics
 
R for Statistical Computing
R for Statistical ComputingR for Statistical Computing
R for Statistical Computing
 
R studio
R studio R studio
R studio
 
ComputeFest 2012: Intro To R for Physical Sciences
ComputeFest 2012: Intro To R for Physical SciencesComputeFest 2012: Intro To R for Physical Sciences
ComputeFest 2012: Intro To R for Physical Sciences
 
R-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdfR-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdf
 
R-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdfR-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdf
 
R-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdfR-Language-Lab-Manual-lab-1.pdf
R-Language-Lab-Manual-lab-1.pdf
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
 
R programming
R programmingR programming
R programming
 
Big Data Mining in Indian Economic Survey 2017
Big Data Mining in Indian Economic Survey 2017Big Data Mining in Indian Economic Survey 2017
Big Data Mining in Indian Economic Survey 2017
 
R Programming Intro
R Programming IntroR Programming Intro
R Programming Intro
 
Summerization notes for descriptive statistics using r
Summerization notes for descriptive statistics using r Summerization notes for descriptive statistics using r
Summerization notes for descriptive statistics using r
 
Unit 1 - R Programming (Part 2).pptx
Unit 1 - R Programming (Part 2).pptxUnit 1 - R Programming (Part 2).pptx
Unit 1 - R Programming (Part 2).pptx
 
R Programming Homework Help
R Programming Homework HelpR Programming Homework Help
R Programming Homework Help
 
Itroroduction to R language
Itroroduction to R languageItroroduction to R language
Itroroduction to R language
 
Datamining with R
Datamining with RDatamining with R
Datamining with R
 
Lecture1_R.ppt
Lecture1_R.pptLecture1_R.ppt
Lecture1_R.ppt
 
Lecture1_R.ppt
Lecture1_R.pptLecture1_R.ppt
Lecture1_R.ppt
 
Lecture1 r
Lecture1 rLecture1 r
Lecture1 r
 

Más de tidwellveronique

EDUC 742EDUC 742Reading Summary and Reflective Comments .docx
EDUC 742EDUC 742Reading Summary and Reflective Comments .docxEDUC 742EDUC 742Reading Summary and Reflective Comments .docx
EDUC 742EDUC 742Reading Summary and Reflective Comments .docxtidwellveronique
 
EDUC 380 Blog Post Samples Module 1 The Brain Below .docx
EDUC 380 Blog Post Samples Module 1 The Brain  Below .docxEDUC 380 Blog Post Samples Module 1 The Brain  Below .docx
EDUC 380 Blog Post Samples Module 1 The Brain Below .docxtidwellveronique
 
EDUC 741Course Project Part 1 Grading RubricCriteriaLevels .docx
EDUC 741Course Project Part 1 Grading RubricCriteriaLevels .docxEDUC 741Course Project Part 1 Grading RubricCriteriaLevels .docx
EDUC 741Course Project Part 1 Grading RubricCriteriaLevels .docxtidwellveronique
 
EDUC 740Prayer Reflection Report Grading RubricCriteriaLev.docx
EDUC 740Prayer Reflection Report Grading RubricCriteriaLev.docxEDUC 740Prayer Reflection Report Grading RubricCriteriaLev.docx
EDUC 740Prayer Reflection Report Grading RubricCriteriaLev.docxtidwellveronique
 
EDUC 6733 Action Research for EducatorsReading LiteracyDraft.docx
EDUC 6733 Action Research for EducatorsReading LiteracyDraft.docxEDUC 6733 Action Research for EducatorsReading LiteracyDraft.docx
EDUC 6733 Action Research for EducatorsReading LiteracyDraft.docxtidwellveronique
 
EDUC 637Technology Portfolio InstructionsGeneral OverviewF.docx
EDUC 637Technology Portfolio InstructionsGeneral OverviewF.docxEDUC 637Technology Portfolio InstructionsGeneral OverviewF.docx
EDUC 637Technology Portfolio InstructionsGeneral OverviewF.docxtidwellveronique
 
EDUC 364 The Role of Cultural Diversity in Schooling A dialecti.docx
EDUC 364 The Role of Cultural Diversity in Schooling A dialecti.docxEDUC 364 The Role of Cultural Diversity in Schooling A dialecti.docx
EDUC 364 The Role of Cultural Diversity in Schooling A dialecti.docxtidwellveronique
 
EDUC 144 Writing Tips The writing assignments in this cla.docx
EDUC 144 Writing Tips  The writing assignments in this cla.docxEDUC 144 Writing Tips  The writing assignments in this cla.docx
EDUC 144 Writing Tips The writing assignments in this cla.docxtidwellveronique
 
EDUC 1300- LEARNING FRAMEWORK Portfolio Page Prompts .docx
EDUC 1300- LEARNING FRAMEWORK Portfolio Page Prompts .docxEDUC 1300- LEARNING FRAMEWORK Portfolio Page Prompts .docx
EDUC 1300- LEARNING FRAMEWORK Portfolio Page Prompts .docxtidwellveronique
 
EDU734 Teaching and Learning Environment Week 5.docx
EDU734 Teaching and  Learning Environment Week 5.docxEDU734 Teaching and  Learning Environment Week 5.docx
EDU734 Teaching and Learning Environment Week 5.docxtidwellveronique
 
EDU 505 – Contemporary Issues in EducationCOURSE DESCRIPTION.docx
EDU 505 – Contemporary Issues in EducationCOURSE DESCRIPTION.docxEDU 505 – Contemporary Issues in EducationCOURSE DESCRIPTION.docx
EDU 505 – Contemporary Issues in EducationCOURSE DESCRIPTION.docxtidwellveronique
 
EDU 3338 Lesson Plan TemplateCandidate NameCooperatin.docx
EDU 3338 Lesson Plan TemplateCandidate NameCooperatin.docxEDU 3338 Lesson Plan TemplateCandidate NameCooperatin.docx
EDU 3338 Lesson Plan TemplateCandidate NameCooperatin.docxtidwellveronique
 
EDU 3215 Lesson Plan Template & Elements Name Andres Rod.docx
EDU 3215 Lesson Plan Template & Elements  Name Andres Rod.docxEDU 3215 Lesson Plan Template & Elements  Name Andres Rod.docx
EDU 3215 Lesson Plan Template & Elements Name Andres Rod.docxtidwellveronique
 
EDST 1100R SITUATED LEARNING EDST 1100 N Situated Learning .docx
EDST 1100R SITUATED LEARNING  EDST 1100 N Situated Learning .docxEDST 1100R SITUATED LEARNING  EDST 1100 N Situated Learning .docx
EDST 1100R SITUATED LEARNING EDST 1100 N Situated Learning .docxtidwellveronique
 
EDU 151 Thematic Unit Required ComponentsThematic Unit Requireme.docx
EDU 151 Thematic Unit Required ComponentsThematic Unit Requireme.docxEDU 151 Thematic Unit Required ComponentsThematic Unit Requireme.docx
EDU 151 Thematic Unit Required ComponentsThematic Unit Requireme.docxtidwellveronique
 
EDSP 429Differentiated Instruction PowerPoint InstructionsThe .docx
EDSP 429Differentiated Instruction PowerPoint InstructionsThe .docxEDSP 429Differentiated Instruction PowerPoint InstructionsThe .docx
EDSP 429Differentiated Instruction PowerPoint InstructionsThe .docxtidwellveronique
 
EDSP 429Fact Sheet on Disability Categories InstructionsThe pu.docx
EDSP 429Fact Sheet on Disability Categories InstructionsThe pu.docxEDSP 429Fact Sheet on Disability Categories InstructionsThe pu.docx
EDSP 429Fact Sheet on Disability Categories InstructionsThe pu.docxtidwellveronique
 
EDSP 370Individualized Education Plan (IEP) InstructionsThe .docx
EDSP 370Individualized Education Plan (IEP) InstructionsThe .docxEDSP 370Individualized Education Plan (IEP) InstructionsThe .docx
EDSP 370Individualized Education Plan (IEP) InstructionsThe .docxtidwellveronique
 
EDSP 377Scenario InstructionsScenario 2 Teaching communicatio.docx
EDSP 377Scenario InstructionsScenario 2 Teaching communicatio.docxEDSP 377Scenario InstructionsScenario 2 Teaching communicatio.docx
EDSP 377Scenario InstructionsScenario 2 Teaching communicatio.docxtidwellveronique
 
EDSP 377Autism Interventions1. Applied Behavior Analysis (ABA).docx
EDSP 377Autism Interventions1. Applied Behavior Analysis (ABA).docxEDSP 377Autism Interventions1. Applied Behavior Analysis (ABA).docx
EDSP 377Autism Interventions1. Applied Behavior Analysis (ABA).docxtidwellveronique
 

Más de tidwellveronique (20)

EDUC 742EDUC 742Reading Summary and Reflective Comments .docx
EDUC 742EDUC 742Reading Summary and Reflective Comments .docxEDUC 742EDUC 742Reading Summary and Reflective Comments .docx
EDUC 742EDUC 742Reading Summary and Reflective Comments .docx
 
EDUC 380 Blog Post Samples Module 1 The Brain Below .docx
EDUC 380 Blog Post Samples Module 1 The Brain  Below .docxEDUC 380 Blog Post Samples Module 1 The Brain  Below .docx
EDUC 380 Blog Post Samples Module 1 The Brain Below .docx
 
EDUC 741Course Project Part 1 Grading RubricCriteriaLevels .docx
EDUC 741Course Project Part 1 Grading RubricCriteriaLevels .docxEDUC 741Course Project Part 1 Grading RubricCriteriaLevels .docx
EDUC 741Course Project Part 1 Grading RubricCriteriaLevels .docx
 
EDUC 740Prayer Reflection Report Grading RubricCriteriaLev.docx
EDUC 740Prayer Reflection Report Grading RubricCriteriaLev.docxEDUC 740Prayer Reflection Report Grading RubricCriteriaLev.docx
EDUC 740Prayer Reflection Report Grading RubricCriteriaLev.docx
 
EDUC 6733 Action Research for EducatorsReading LiteracyDraft.docx
EDUC 6733 Action Research for EducatorsReading LiteracyDraft.docxEDUC 6733 Action Research for EducatorsReading LiteracyDraft.docx
EDUC 6733 Action Research for EducatorsReading LiteracyDraft.docx
 
EDUC 637Technology Portfolio InstructionsGeneral OverviewF.docx
EDUC 637Technology Portfolio InstructionsGeneral OverviewF.docxEDUC 637Technology Portfolio InstructionsGeneral OverviewF.docx
EDUC 637Technology Portfolio InstructionsGeneral OverviewF.docx
 
EDUC 364 The Role of Cultural Diversity in Schooling A dialecti.docx
EDUC 364 The Role of Cultural Diversity in Schooling A dialecti.docxEDUC 364 The Role of Cultural Diversity in Schooling A dialecti.docx
EDUC 364 The Role of Cultural Diversity in Schooling A dialecti.docx
 
EDUC 144 Writing Tips The writing assignments in this cla.docx
EDUC 144 Writing Tips  The writing assignments in this cla.docxEDUC 144 Writing Tips  The writing assignments in this cla.docx
EDUC 144 Writing Tips The writing assignments in this cla.docx
 
EDUC 1300- LEARNING FRAMEWORK Portfolio Page Prompts .docx
EDUC 1300- LEARNING FRAMEWORK Portfolio Page Prompts .docxEDUC 1300- LEARNING FRAMEWORK Portfolio Page Prompts .docx
EDUC 1300- LEARNING FRAMEWORK Portfolio Page Prompts .docx
 
EDU734 Teaching and Learning Environment Week 5.docx
EDU734 Teaching and  Learning Environment Week 5.docxEDU734 Teaching and  Learning Environment Week 5.docx
EDU734 Teaching and Learning Environment Week 5.docx
 
EDU 505 – Contemporary Issues in EducationCOURSE DESCRIPTION.docx
EDU 505 – Contemporary Issues in EducationCOURSE DESCRIPTION.docxEDU 505 – Contemporary Issues in EducationCOURSE DESCRIPTION.docx
EDU 505 – Contemporary Issues in EducationCOURSE DESCRIPTION.docx
 
EDU 3338 Lesson Plan TemplateCandidate NameCooperatin.docx
EDU 3338 Lesson Plan TemplateCandidate NameCooperatin.docxEDU 3338 Lesson Plan TemplateCandidate NameCooperatin.docx
EDU 3338 Lesson Plan TemplateCandidate NameCooperatin.docx
 
EDU 3215 Lesson Plan Template & Elements Name Andres Rod.docx
EDU 3215 Lesson Plan Template & Elements  Name Andres Rod.docxEDU 3215 Lesson Plan Template & Elements  Name Andres Rod.docx
EDU 3215 Lesson Plan Template & Elements Name Andres Rod.docx
 
EDST 1100R SITUATED LEARNING EDST 1100 N Situated Learning .docx
EDST 1100R SITUATED LEARNING  EDST 1100 N Situated Learning .docxEDST 1100R SITUATED LEARNING  EDST 1100 N Situated Learning .docx
EDST 1100R SITUATED LEARNING EDST 1100 N Situated Learning .docx
 
EDU 151 Thematic Unit Required ComponentsThematic Unit Requireme.docx
EDU 151 Thematic Unit Required ComponentsThematic Unit Requireme.docxEDU 151 Thematic Unit Required ComponentsThematic Unit Requireme.docx
EDU 151 Thematic Unit Required ComponentsThematic Unit Requireme.docx
 
EDSP 429Differentiated Instruction PowerPoint InstructionsThe .docx
EDSP 429Differentiated Instruction PowerPoint InstructionsThe .docxEDSP 429Differentiated Instruction PowerPoint InstructionsThe .docx
EDSP 429Differentiated Instruction PowerPoint InstructionsThe .docx
 
EDSP 429Fact Sheet on Disability Categories InstructionsThe pu.docx
EDSP 429Fact Sheet on Disability Categories InstructionsThe pu.docxEDSP 429Fact Sheet on Disability Categories InstructionsThe pu.docx
EDSP 429Fact Sheet on Disability Categories InstructionsThe pu.docx
 
EDSP 370Individualized Education Plan (IEP) InstructionsThe .docx
EDSP 370Individualized Education Plan (IEP) InstructionsThe .docxEDSP 370Individualized Education Plan (IEP) InstructionsThe .docx
EDSP 370Individualized Education Plan (IEP) InstructionsThe .docx
 
EDSP 377Scenario InstructionsScenario 2 Teaching communicatio.docx
EDSP 377Scenario InstructionsScenario 2 Teaching communicatio.docxEDSP 377Scenario InstructionsScenario 2 Teaching communicatio.docx
EDSP 377Scenario InstructionsScenario 2 Teaching communicatio.docx
 
EDSP 377Autism Interventions1. Applied Behavior Analysis (ABA).docx
EDSP 377Autism Interventions1. Applied Behavior Analysis (ABA).docxEDSP 377Autism Interventions1. Applied Behavior Analysis (ABA).docx
EDSP 377Autism Interventions1. Applied Behavior Analysis (ABA).docx
 

Último

SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxAmanpreet Kaur
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Association for Project Management
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibitjbellavia9
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701bronxfugly43
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptxMaritesTamaniVerdade
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docxPoojaSen20
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxVishalSingh1417
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...Poonam Aher Patil
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and ModificationsMJDuyan
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxVishalSingh1417
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...pradhanghanshyam7136
 

Último (20)

SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 

Econ 4650Spring 2019Introduction to RBasic Statistics .docx

## -2.34318795 -1.63898524  0.01879137  1.65229300  2.30314467

You can compare this draw with the theoretical values for our requested percentiles (0.01, 0.05, 0.5, 0.95, 0.99) of -2.326, -1.645, 0, 1.645, and 2.326. Here is a plot of the large draw
plot(density(zdraw))

[Figure: kernel density plot of zdraw; density.default(x = zdraw), N = 20000, Bandwidth = 0.1242]

Throughout the class we will generate random data to learn about econometrics, and for results to match, setting the random number seed via set.seed() is a useful R command. Later on we will learn how to use many of the built-in random distribution functions in R. (Just like we did here by generating a sample of size 100 drawn from a normal distribution.)
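To see concretely what set.seed() buys you, here is a minimal sketch (the object names draw1 and draw2 are our own, not from the text): resetting the seed before a second draw reproduces the first draw exactly.

set.seed(100)
draw1 <- rnorm(5, mean = 66, sd = 2.5)   # first draw
set.seed(100)                            # reset to the same seed
draw2 <- rnorm(5, mean = 66, sd = 2.5)   # second draw
identical(draw1, draw2)                  # TRUE: the two draws match exactly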
Read Data

Entering data the way we did before is awkward. So now let's see how to read in a set of data that comes in a csv file (comma-separated values). These files can be read by Excel, and Excel can export csv files. They are easily read in R (and other statistical packages).

Let's begin with data from Table 1 on a sample of 22 single-family homes in Diamond Bar, California. I downloaded the Excel file and saved it as a csv file under STATS17.csv. Before reading the file into R, I changed my working directory (i.e., the place on my computer that R will read and write files to) with the command setwd(). I can ensure that STATS17.csv is in my working directory by asking R to report all of the file names saved in this folder with the dir() command

setwd("C:/myfile/University of Utah/2019 Spring/Econometrics")
dir()

## [1] "Assignment 1.pdf" "Chapter 1.pdf"    "STATS17.csv"

Indeed, the file is listed as being saved in my working directory. I can read it into R and name it data17 by typing

data17 <- read.csv("STATS17.csv", header = T)
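As an aside (our addition, not from the text): if you prefer not to change the working directory, read.csv() also accepts a full path directly. A sketch reusing the author's example folder; substitute your own path:

data17 <- read.csv("C:/myfile/University of Utah/2019 Spring/Econometrics/STATS17.csv",
                   header = TRUE)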
Note that R is case sensitive, so Data17 is different from data17, datA17, etc. (In the read.csv command, the header argument tells R that the first row contains the variable names, or labels.) Here are what the data look like

data17

##     X OBS  PRICE SQFT
## 1   1   1 425000 1349
## 2   2   2 451500 1807
## 3   3   3 508560 1651
## 4   4   4 448050 1293
## 5   5   5 500580 1745
## 6   6   6 524160 1900
## 7   7   7 500580 1759
## 8   8   8 399330 1740
## 9   9   9 442020 1950
## 10 10  10 537660 1771
## 11 11  11 515100 2078
## 12 12  12 589000 2268
## 13 13  13 696000 2400
## 14 14  14 540750 2050
## 15 15  15 659200 2267
## 16 16  16 492450 1986
## 17 17  17 567047 2950
## 18 18  18 684950 2712
## 19 19  19 668470 2799
## 20 20  20 733360 2933
## 21 21  21 775590 3203
## 22 22  22 788888 2988
Here are the summary statistics for data17

summary(data17)

##        X              OBS            PRICE             SQFT
##  Min.   : 1.00   Min.   : 1.00   Min.   :399330   Min.   :1293
##  1st Qu.: 6.25   1st Qu.: 6.25   1st Qu.:494483   1st Qu.:1762
##  Median :11.50   Median :11.50   Median :530910   Median :2018
##  Mean   :11.50   Mean   :11.50   Mean   :565829   Mean   :2164
##  3rd Qu.:16.75   3rd Qu.:16.75   3rd Qu.:666153   3rd Qu.:2634
##  Max.   :22.00   Max.   :22.00   Max.   :788888   Max.   :3203

We can generate a scatterplot of these data

plot(data17$SQFT, data17$PRICE)

[Figure: scatterplot of data17$PRICE against data17$SQFT]
Here is another way to be able to access the variable names, or labels, for the SQFT and PRICE data from the data17 frame

attach(data17)

## The following object is masked _by_ .GlobalEnv:
##
##     X

(This message appears because the X vector of heights we created earlier is still in the workspace and masks the column named X inside data17.)

plot(SQFT, PRICE)

[Figure: scatterplot of PRICE against SQFT, drawn from the attached data17]
Above, we have attached data17 so that we could easily work with the variables in the file. We can also get the correlation between the variables SQFT and PRICE

cor(SQFT, PRICE)
## [1] 0.8768234

We see that the correlation coefficient aligns with the graphical representation of the scatterplot. Attaching a dataframe is a nice way to be able to grab labels. It is always a good idea to detach labels from dataframes when we are done. We can do this by using the detach() command.

detach(data17)

Let's see once again what R has in its memory.

ls()

## [1] "data17" "X"      "X100"   "X100a"  "Z"      "zdraw"

We can clean (i.e., erase) all objects from memory in R.

rm(list = ls())
ls()

## character(0)

That covers some of the R basics you will use in this course. We recommend that you consult this reference throughout the course.

Ordinary Least Squares

Introduction

This chapter introduces the most commonly used regression
estimation technique, Ordinary Least Squares (OLS). OLS is a regression estimation technique that calculates the coefficients β̂ so as to minimize the sum of the squared residuals; that is, OLS minimizes

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

An estimator is a mathematical technique that is applied to a sample of data to produce real-world numerical estimates of the true population regression coefficients. Thus, OLS is an estimator, and a β̂ produced by OLS is an estimate.

How does OLS work?

A Single-Independent-Variable Regression Model

Recall the theoretical equation:

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$

OLS selects those estimates of β0 and β1 that minimize the total squared residuals, where the sum is taken over all the sample data points. The formulas for these coefficients are:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

where X̄ = ∑Xi/N and Ȳ = ∑Yi/N.
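As a brief aside (our addition; this is the standard derivation, not text from the original): the formulas above are the solution to the minimization problem. Writing the sum of squared residuals as a function of candidate coefficients b0 and b1,

$$\min_{b_0,\, b_1} \; S(b_0, b_1) = \sum_{i=1}^{N} (Y_i - b_0 - b_1 X_i)^2$$

setting both partial derivatives to zero gives the normal equations

$$\frac{\partial S}{\partial b_0} = -2 \sum_{i=1}^{N} (Y_i - b_0 - b_1 X_i) = 0, \qquad \frac{\partial S}{\partial b_1} = -2 \sum_{i=1}^{N} X_i (Y_i - b_0 - b_1 X_i) = 0,$$

and solving the two equations simultaneously yields β̂0 and β̂1 as given above.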
Manual Calculation of Estimated Regression Coefficients

In this section we will calculate the estimated regression coefficients β̂0 and β̂1 manually using R. The first thing we will do is read a .csv file into R as a dataframe named mydata. The file, "FINAID2.csv", is available on Canvas from this course module. Save it to the same folder as your current working directory for ease of use.

rm(list = ls())
setwd("C:/myfile/University of Utah/2019 Spring/Econometrics")
mydata <- read.csv("FINAID2.csv", header = TRUE)
head(mydata)

##   OBS FINAID PARENT HSRANK MALE
## 1   1  19640      0     92    0
## 2   2   8325   9147     44    1
## 3   3  12950   7063     89    0
## 4   4    700  33344     97    1
## 5   5   7000  20497     95    1
## 6   6  11325  10487     96    0

This last function shows the first 6 rows, letting us verify that the .csv was read in correctly, examine the variable names, and discover whether the variables are numeric or character.
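One quick way to check each column's storage type is the str() function (a small sketch we have added; str() is base R):

str(mydata)   # reports each column's type (e.g., int or num) alongside a few sample values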
We will focus on two of the variables for our initial regression: one dependent variable and one independent variable. The dependent variable, or left-hand-side (LHS) variable, is called Yi in our formula; the independent variable, or right-hand-side (RHS) variable, is labelled Xi in our formula.

For our first example we will use FINAID as the dependent variable and HSRANK as the independent variable (our first model specification). Next, we will attach the dataframe we just created. Then we will proceed to manually calculate estimated coefficients using OLS estimation, as done in the first section of this chapter of the text.

attach(mydata)

For ease of use and understanding, for this example we will identify the dependent variable, FINAID, as Y in R and the independent variable, HSRANK, as X.

Y <- FINAID
X <- HSRANK

Next we need to tell R what to use for Ȳ and X̄, easily done with the mean command.

YBAR <- mean(Y)
XBAR <- mean(X)

Next we will create vectors for (Yi − Ȳ) and (Xi − X̄), calling them YDIFF and XDIFF respectively.

YDIFF <- Y - YBAR
XDIFF <- X - XBAR

To get an idea of what we are creating here, let's combine the
vectors we just created into a table called table, using the column bind, or cbind, command. Then we will use the head command to view it.

table <- cbind(Y, X, YDIFF, XDIFF)
head(table)

##          Y  X     YDIFF  XDIFF
## [1,] 19640 92   7963.74  10.38
## [2,]  8325 44  -3351.26 -37.62
## [3,] 12950 89   1273.74   7.38
## [4,]   700 97 -10976.26  15.38
## [5,]  7000 95  -4676.26  13.38
## [6,] 11325 96   -351.26  14.38

This is comparable to the first few columns of Table 1 on page 39 of the text. Continuing on, we can have R perform the remaining intermediate calculations to estimate our regression coefficients. We will calculate (Xi − X̄)² and name it XDIFFSQ, as such:

XDIFFSQ <- (XDIFF)^2

As well as (Xi − X̄)(Yi − Ȳ), which we will call PRODUCT

PRODUCT <- XDIFF * YDIFF

We will now calculate β̂1, using the sum command in R, and assigning the name BETAHAT1 to the output.

BETAHAT1 <- sum(PRODUCT)/sum(XDIFFSQ)
BETAHAT1

## [1] 64.68226

We can also now calculate β̂0 as

BETAHAT0 <- YBAR - BETAHAT1 * XBAR
BETAHAT0

## [1] 6396.894
Let's check that our estimates look right. We are going to determine the estimated regression coefficients with OLS automatically through R by using the lm function. Intuitively, lm allows us to write out the equation similar to how we do on paper: the LHS variable first, a tilde (~) instead of an equals sign, and the RHS variable after. In the (almost universal) case that we want to save our results, we store them in a new variable with a name of our choosing. Here, we'll name it lm1.

lm1 <- lm(Y ~ X)
lm1

##
## Call:
## lm(formula = Y ~ X)
##
## Coefficients:
## (Intercept)            X
##     6396.89        64.68

We see from the output that the intercept, or β̂0, matches what we calculated, as does the coefficient corresponding to X. Having computed the estimated regression coefficients using OLS manually (and checked them with lm), we will now move on to a more typical procedure for estimating regression models with OLS in R.
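A useful aside (our addition): the object returned by lm also carries the fitted values and residuals directly, the same quantities we compute by hand elsewhere in this chapter.

head(fitted(lm1))   # the fitted Yhat values implied by the estimated coefficients
head(resid(lm1))    # the residuals e = Y - Yhat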
OLS Estimation with R functions

We will start out by examining the data. Exploring the data using numerical and graphical techniques is an important step in creating a model (part of a larger process we will develop over this course). R has a very useful summary command for descriptive statistics; let's run it on our data set, first on our dependent variable, then on our independent variable. We can explore the general shape of the data sample for this variable.

summary(FINAID)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       0    7869   11262   11676   15809   22148

Next let's explore this variable visually using a graph called a histogram.

hist(FINAID)

[Figure: histogram of FINAID]
The next commands refine the histogram by adding a smooth curve called a kernel density curve, which adds information to our exploration. You can think of this curve as an approximation of the actual probability density function (PDF).

hist(FINAID, prob = TRUE)
lines(density(FINAID))

[Figure: histogram of FINAID with kernel density curve]
And now, we do the same analyses for our independent variable, HSRANK.

summary(HSRANK)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   20.00   80.00   89.00   81.62   96.75   99.00

hist(HSRANK, prob = TRUE)
lines(density(HSRANK))

[Figure: histogram of HSRANK with kernel density curve]
Now we are prepared to run our regression. Once again we will utilize lm, which will run an OLS regression on the data, and later a new function, abline, which will use the estimated coefficients from the regression to draw a line on the plot. As before we will want to save our results, storing them in a variable with a name of our choosing, mymodel. To see the results, we again use the summary function, this time with the newly created model variable as the argument to the function.

mymodel <- lm(FINAID ~ HSRANK)
summary(mymodel)

##
## Call:
## lm(formula = FINAID ~ HSRANK)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -11971.1  -3966.3   -838.7   4436.4   9412.2
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  6396.89    3281.34   1.949   0.0571 .
## HSRANK         64.68      39.15   1.652   0.1050
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5273 on 48 degrees of freedom
## Multiple R-squared:  0.05381, Adjusted R-squared:  0.03409
## F-statistic:  2.73 on 1 and 48 DF,  p-value: 0.105

The resulting summary contains a great deal of information. For this module we will focus on reading the (Intercept) and HSRANK coefficients, specifically their values in the column titled Estimate. To be clear, R is telling us that β̂0 is the value in the Estimate column next to (Intercept) and β̂1 is the value in the Estimate column next to HSRANK. The other aspect we want to focus on is the Multiple R-squared, R's way of identifying what we normally call R², and the Adjusted R-squared. Adjusted R-squared will come in handy once we start adding additional independent variables. Note the particularly low value of R². We will examine how these change as independent variables are added.
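As an added extra (not in the original text): base R can also report confidence intervals for the estimated coefficients with confint().

confint(mymodel, level = 0.95)   # 95% confidence intervals for (Intercept) and HSRANK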
Now we insert the linear regression line into a plot of the data. Note that by convention, we put our independent variable on the X-axis and the dependent variable on the Y-axis; with plot this means naming the independent variable as the first argument and the dependent variable as the second. The third argument will be using the abline command to insert the regression estimated in the last step as a line in the plot.

plot(HSRANK, FINAID, abline(mymodel))

[Figure: scatterplot of FINAID against HSRANK with fitted regression line]
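A small note on style (our addition): passing abline() as an argument to plot() works only because abline draws on the plot as a side effect. The more conventional idiom is two separate calls, which produces the same picture:

plot(HSRANK, FINAID)   # draw the scatterplot first
abline(mymodel)        # then overlay the fitted regression line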
Multivariate Regression Model

Recall the multivariate equation:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_K X_{Ki} + \epsilon_i$$

A multivariate regression coefficient indicates the change in the dependent variable associated with a one-unit increase in the corresponding independent variable, holding the other independent variables in the equation constant. For example, β1 measures the impact on Y of a one-unit increase in X1, holding constant X2, X3, . . . , and XK.

The OLS estimation of multivariate models is identical in general approach to the OLS estimation of models with just one independent variable. To demonstrate multivariate regression, we will extend our initial model twice. The first extension will add PARENT as an independent variable; the second will add MALE, a special type of variable called a dummy variable, as an independent variable. As we are adding new variables, we should proceed with data exploration, and a plot to begin our understanding of the relationship between dependent and independent variables.

summary(PARENT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       0    3382    8934   12284   17818   64305

hist(PARENT, prob = TRUE)
lines(density(PARENT))

[Figure: histogram of PARENT with kernel density curve]

plot(PARENT, FINAID)

[Figure: scatterplot of FINAID against PARENT]
Having done our initial data exploration, we will run and interpret a multivariate regression. In R this is
very simple: just add the new independent variable to the RHS of the model specification:

mymodel <- lm(FINAID ~ PARENT + HSRANK)
summary(mymodel)

##
## Call:
## lm(formula = FINAID ~ PARENT + HSRANK)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6353.1 -1905.9   280.9  1942.5  6675.5
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8926.92907 1739.08253   5.133 5.36e-06 ***
## PARENT        -0.35677    0.03169 -11.260 6.06e-15 ***
## HSRANK        87.37815   20.67413   4.226 0.000108 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2771 on 47 degrees of freedom
## Multiple R-squared:  0.7441, Adjusted R-squared:  0.7332
## F-statistic: 68.33 on 2 and 47 DF,  p-value: 1.229e-14

Notice in this model the new coefficient for PARENT and the change in the coefficient for HSRANK, as well as the changes in Multiple R-squared and Adjusted R-squared. Multiple R-squared increased, as we expect it to when adding additional independent variables, but more importantly Adjusted R-squared has increased; it accounts for the change in degrees of freedom and is the more relevant measure for comparing this model specification to the previous one.

When looking at the model, it is important to intuitively understand the signs of the individual estimated coefficients. Holding parents' ability to contribute constant, does the financial aid increase of $87.38 for each percentage point increase in GPA rank make sense? And vice versa, holding GPA rank constant, does the estimated coefficient on the parent financial contribution seem logical?

To pull up the coefficients on their own, let's use the coef command in R and look at what it tells us.

coef(mymodel)

##  (Intercept)       PARENT       HSRANK
## 8926.9290669   -0.3567721   87.3781524
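As an added sketch (the input values here are hypothetical, chosen only for illustration), predict() evaluates the fitted equation at new data:

# predicted aid for a hypothetical student with PARENT = 10000 and HSRANK = 90
predict(mymodel, newdata = data.frame(PARENT = 10000, HSRANK = 90))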
It will be helpful if we create some plots to visualize these relationships. The plots will be similar to the plots on page 44 of the text, which illustrate these points. For the impact of parents' ability to contribute on financial aid, we can ask R the following:

plot(PARENT, FINAID, abline(coef(mymodel)[1], coef(mymodel)[2]))

[Figure: scatterplot of FINAID against PARENT with fitted line]
This command told R to plot the PARENT and FINAID variables, using abline (think of it as literally plotting an intercept "a" and a slope "b" to make a line) to grab the intercept coefficient ([1]) and the coefficient of PARENT, which happened to be the first independent variable we wrote in our model (R thinks of it as the coefficient after the intercept, hence the [2]). So we can see the apparent negative relationship between parents' ability to pay and financial aid. Similarly we can call a graph for financial aid and high school GPA rank:

plot(HSRANK, FINAID, abline(coef(mymodel)[1], coef(mymodel)[3]))

[Figure: scatterplot of FINAID against HSRANK with fitted line]
Finally, we will introduce the idea of dummy variables and include one in a regression. A dummy is simply a categorical variable, as in male or female, young or old, New York or San Francisco, and so forth. In the data it is coded as a 1 or 0. Our dummy variable here is MALE. With categorical variables, the only data exploration that makes sense is to see the count of each category; a convenient way to do this in R is the table function:

table(MALE)
## MALE
##  0  1
## 27 23

Here the count of MALE is 23; the rest are presumably female. And now we are ready to include MALE as an independent variable:

mymodel <- lm(FINAID ~ PARENT + HSRANK + MALE)
summary(mymodel)

##
## Call:
## lm(formula = FINAID ~ PARENT + HSRANK + MALE)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5401.7 -2388.1   292.1  2069.4  5233.8
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  9.813e+03  1.743e+03   5.630 1.04e-06 ***
## PARENT      -3.427e-01  3.151e-02 -10.879 2.61e-14 ***
## HSRANK       8.326e+01  2.015e+01   4.132  0.00015 ***
## MALE        -1.570e+03  7.843e+02  -2.002  0.05120 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2687 on 46 degrees of freedom
## Multiple R-squared:  0.7646, Adjusted R-squared:  0.7493
## F-statistic: 49.81 on 3 and 46 DF,  p-value: 1.721e-14

Note again the change in Adjusted R-squared. Can you explain why that change is relevant? Why did Multiple R-squared increase, and would we expect that as we add independent variables?
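Because MALE takes only the values 0 and 1, its coefficient simply shifts the intercept. Using the (rounded) estimates above, the fitted equation splits into two parallel equations (our own reading aid, not output from the text):

$$\widehat{FINAID} = 9813 - 0.3427\,PARENT + 83.26\,HSRANK \qquad (MALE = 0)$$

$$\widehat{FINAID} = (9813 - 1570) - 0.3427\,PARENT + 83.26\,HSRANK \qquad (MALE = 1)$$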
Good data hygiene requires that we detach our data set when we are finished with it.

detach(mydata)

Total, Explained, and Residual Sums of Squares

The squared deviation of Y around its mean measures the total variation of the dependent variable; how much of that variation the estimated regression equation explains is judged against this benchmark. It is called the total sum of squares (TSS):

$$TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

TSS has two components, variation that can be explained by the regression and variation that cannot:
$$\sum_i (Y_i - \bar{Y})^2 = \sum_i (\hat{Y}_i - \bar{Y})^2 + \sum_i e_i^2$$

$$TSS = ESS + RSS$$

The Explained Sum of Squares (ESS) measures the amount of the squared deviation of Yi from its mean that is explained by the regression. The Residual Sum of Squares (RSS) is the part of the TSS that is unexplained by the estimated regression. The smaller RSS is relative to the TSS, the better the estimated regression line fits the data.

To apply this, we return to the manual calculation of estimated coefficients with OLS in the section Manual Calculation of Estimated Regression Coefficients. To calculate ESS and RSS, we need Ŷi. We have R determine Ŷi using the estimated equation of the model, like the one found on page 34 of the text:

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$

YHAT <- BETAHAT0 + BETAHAT1 * X
YHAT

##  [1] 12347.662  9242.913 12153.615 12671.073 12541.709 12606.391 12735.755
##  [8] 10924.652  9566.325 11571.475 12218.297 11248.063 12282.980 11636.157
## [15] 12735.755 11571.475 12218.297 11700.839 12735.755  9631.007 12735.755
## [22]  8984.184  9566.325 12282.980 11700.839 12735.755 12800.438 12218.297
## [29] 12671.073 10213.147 12671.073 11830.204 11830.204 12735.755 12541.709
## [36] 12800.438  7690.539 12153.615  9048.867 12347.662 11959.568 12024.251
## [43] 11830.204 12800.438 12153.615 10083.783 11830.204 11636.157 12800.438
## [50] 10795.288

Now that we have our Ŷi data, we can calculate our residuals, ei, where ei = Yi − Ŷi. Just a reminder: these residuals represent the difference between the actual sample values, Yi, and the estimated values, Ŷi.

RESID <- Y - YHAT
RESID

##  [1]   7292.33816   -917.91343    796.38493 -11971.07314  -5541.70862
##  [6]  -1281.39088   6429.24460  -3924.65216  -1641.32473    -96.47474
## [11]   6571.70267  -2358.06345   5307.02041   6128.84300   1364.24460
## [16]   7393.52526  -7718.29733  -3750.83926  -5735.75540  -2356.00698
## [21]  -4735.75540  -4694.18440  -1391.32473   -932.97959   3624.16074
## [26]   9412.24460   4619.56235   6771.70267  -1496.07314   3886.85269
## [31]  -5671.07314  -3980.20378 -11830.20378  -5735.75540   3558.29138
## [36]  -4800.43765    809.46077  -4578.61507   4701.13334  -5347.66184
## [41]   -759.56829   2425.74945   3434.79622   7669.56235  -2603.61507
## [46]   5886.21721    359.79622    163.84300   8839.56235  -1595.28764

Now we can calculate the ESS as:

ESS <- sum((YHAT - YBAR)^2)
ESS
## [1] 75893113

We can also calculate the RSS with ease:

RSS <- sum(RESID^2)
RSS

## [1] 1334607437

Finally, the TSS will be:
TSS <- sum((Y - YBAR)^2)
TSS

## [1] 1410500550

As stated above, TSS = ESS + RSS; let's check that now.

ESS + RSS

## [1] 1410500550

Looks good!
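As a cross-check (our addition): for lm fits, deviance() returns the residual sum of squares, so we can verify the manual RSS against the automatic fit lm1 from earlier.

deviance(lm1)        # RSS of the simple regression; should match RSS above
sum(resid(lm1)^2)    # the same quantity computed directly from the residuals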
The Overall Fit of the Estimated Model

R²

R² is the ratio of the explained sum of squares to the total sum of squares:

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum e_i^2}{\sum (Y_i - \bar{Y})^2}$$

R² measures the percentage of the variation of Y around Ȳ that is explained by the regression. The higher R² is, the closer the estimated regression equation fits the sample data. R² lies in the interval 0 ≤ R² ≤ 1. A value of R² close to one shows an excellent overall fit. Let's manually calculate R² as ESS/TSS:

R2 <- ESS/TSS
R2

## [1] 0.0538058

Now we will grab the R² value from the lm model we executed earlier, by telling the summary function to return it for us specifically:

summary(lm1)$r.squared

## [1] 0.0538058

The R² value matches ours, a good sign. However, as you may have noticed, it is rather low, a sign that the estimated regression line is not fitting the sample data very well.

Correlation Coefficient (r)

The simple correlation coefficient, r ∈ [−1, 1], is a measure of the strength and direction of the linear relationship between two variables. The sign of r indicates the direction of the correlation between the two variables. If r = +1, the two variables are perfectly positively correlated. If r = −1, the two variables are perfectly negatively correlated. If r = 0, the two variables are totally uncorrelated. Checking this in R, we use the cor command:

cor(Y, X)

## [1] 0.2319608

revealing a relatively weak positive correlation.
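A connection worth noting (our addition): in a regression with a single independent variable, R² is exactly the square of the simple correlation coefficient, which you can confirm here.

cor(Y, X)^2   # 0.2319608^2, which reproduces the R-squared of about 0.0538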
The Adjusted R² (R̄²)

R̄² measures the percentage of the variation of Y around its mean that is explained by the regression equation, adjusted for degrees of freedom:

$$\bar{R}^2 = 1 - \frac{\sum e_i^2 / (N - K - 1)}{\sum (Y_i - \bar{Y})^2 / (N - 1)}$$

N − K − 1 is the degrees of freedom, which is the excess of the number of observations, N, over the number of coefficients estimated (K + 1, including the intercept). The value of R̄² can be used to compare the fits of equations with the same dependent variable and different numbers of independent variables.

For one final exercise we will calculate R̄² for our last model specification. The first step is to extract the residuals. We can pull the residuals from mymodel by asking R to specifically return them:

RESIDS <- mymodel$residuals

Now we can square them and sum them up:

RSQ <- sum(RESIDS^2)

Notice that we can reuse the TSS we calculated previously: it only makes use of the sample Yi values and their mean, not any estimated data, so it is still relevant to this model's R̄². Knowing that, and with N = 50 observations and K = 3 independent variables (so N − K − 1 = 46 and N − 1 = 49), we can calculate R̄² as

RBAR <- 1 - ((RSQ/46)/(TSS/49))
RBAR

## [1] 0.7492617

We can compare that to the calculated R̄² from the mymodel summary as such:

summary(mymodel)$adj.r.squared

## [1] 0.7492617

Good, we see our calculation was correct.
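To avoid hard-coding 46 and 49, here is a sketch of the same calculation with the degrees of freedom pulled from the model and the data (our addition; df.residual() is base R, and RSQ and TSS are as defined above):

N <- length(Y)                                          # number of observations (50)
RBAR2 <- 1 - (RSQ / df.residual(mymodel)) / (TSS / (N - 1))
RBAR2                                                   # should again equal 0.7492617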
Econ 4650, Spring 2019

Assignment 1

Introduction

Welcome to Assignment 1! It is suggested you do the following before attempting this assignment*

• Read the appropriate textbook chapters (Chapter 1 & Chapter 2 in the Custom Text; Chapter 17 & Chapter 1 in the full text)
• Read the Econ_4650_Sprin_2019_Intro_to_R.pdf in Module 1.a
• Install the software R
• Read the OrdinaryLeastSquares pdf in Module 1.b
• Take a deep breath!

*Note: You may also watch the Module 1 video tutorials, but this is not required. Also, please note that there have been adjustments to the number of assignments and the structure of the course since the time the videos were made.
Definitions for Annual Consumption of Chicken Data

Y = per capita chicken consumption (in pounds) in a given year
PC = the price of chicken (in cents per pound) in a given year
PB = the price of beef (in cents per pound) in a given year
YD = U.S. per capita disposable income (in hundreds of dollars) in a given year

Assignment 1

[1] Download the data file CHICK6.csv from Assignment 1.

[2] Load the CHICK6.csv data into R and rename it so that it includes your own name or initials (for example, we might name our dataset JunfuCHICK6).

[3] Generate the mean and the standard deviation for the variables Y, PC, PB, and YD from the CHICK6.csv data set.

[4] You are going to run a regression (in [5]) where Y is the dependent variable, and PC, PB, and YD are the independent variables. Discuss whether you think the independent variable PB will have a negative or positive effect on the dependent variable. In your decision, try to use as much economic theory as you can; theory is what motivates which variables are included in a model and what sign we anticipate for the model's estimates.

[5] Estimate the model in R and present the results.

[6] Interpret the results for the coefficient on PB from your model.
Make sure to include whether or not the result aligned with your expectations.
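If you want a starting point for the R portion of the assignment, a minimal sketch follows (our addition; it assumes CHICK6.csv sits in your working directory and that its column names match the definitions above, and myCHICK6 is a placeholder you should replace with a name containing your own initials):

myCHICK6 <- read.csv("CHICK6.csv", header = TRUE)      # step [2]
sapply(myCHICK6[, c("Y", "PC", "PB", "YD")], mean)     # step [3]: means
sapply(myCHICK6[, c("Y", "PC", "PB", "YD")], sd)       # step [3]: standard deviations
chickmodel <- lm(Y ~ PC + PB + YD, data = myCHICK6)    # step [5]: the regression
summary(chickmodel)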