This document provides an example of using regression analysis to model the relationship between restaurant characteristics and price in Manhattan. It discusses:
1) Loading data on 168 NYC restaurants from a Zagat survey, including price, food quality, decor, service, and location (East vs West of the river).
2) Using a boxplot to examine the relationship between being located East of the river and price, finding higher prices for East side restaurants.
3) Fitting a linear regression model relating price to these characteristics. The adjusted R-squared is almost unchanged between the full and reduced models, suggesting that removing the non-significant Service variable does not reduce the model's ability to explain variation in price.
Why regression
Wine Quality
In “Super Crunchers” Ian Ayres gives a formula for wine quality as:
Wine quality = 12.145 + 0.00117 × winter rainfall
             + 0.0614 × average growing season temperature
             − 0.00386 × harvest rainfall
What is this formula telling us?
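To get a feel for it, we can code the formula as a small R function and evaluate it for one vintage; the input values below are entirely hypothetical, invented for illustration.

> # Ayres' wine-quality formula as an R function
> wine.quality <- function(winter.rain, temp, harvest.rain) {
      12.145 + 0.00117 * winter.rain + 0.0614 * temp -
          0.00386 * harvest.rain
  }
> wine.quality(600, 17, 100)   # hypothetical inputs; gives about 13.5

Each coefficient is the change in predicted quality per unit change in that input, holding the other inputs fixed.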
Motivating examples
1 The relationship between restaurant characteristics and
location, and the prices charged
2 The relationship between wine price and critic ratings
3 The relationship between various “risk factors” and the
occurrence of heart disease
4 One more example I haven't decided on yet (so it isn't in the notes yet; suggestions welcome!)
5 Later we will examine the relationship between various
characteristics of golfers and the money they earn
Along the way, we will consider “smaller” datasets to illustrate
specific points.
Motivating example 1: New York Restaurants
1 This is intended to give an example of where we want to be
by the end of Term 1 - so you have an idea of what we are
learning and why.
2 In other words, relax, sit back and get a feel for what you will
be able to do after we’ve spent a whole term studying it.
Zagat Price Guide
Example (Manhattan Restaurant Pricing)
1 Sheather [2009] suggests you have been retained to advise a
chef on menu pricing for a new Italian restaurant in
Manhattan
2 He provides data from “Zagat Survey 2001: New York City
Restaurants, Zagat, New York”
3 Given a model, you can predict the effect of various restaurant
characteristics on the kind of price you can charge
4 Specifically, we could try to decide whether you can charge
more for a restaurant that is “East” of the river. This is
denoted by an “indicator” variable which takes on the value 1
for a restaurant East of the river, and 0 otherwise
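As an aside, if location arrived in the data as text labels rather than as 0/1, such an indicator could be constructed directly; the labels below are hypothetical (in the Zagat data, East is already coded as 0/1):

> location <- c("East", "West", "East")   # hypothetical labels
> as.numeric(location == "East")          # 1 if East of the river, 0 otherwise
[1] 1 0 1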
Loading the data
> nyc.df <- read.csv("data/nyc.csv")
> summary(nyc.df)
Case Restaurant Price
Min. : 1.00 Amarone : 1 Min. :19.0
1st Qu.: 42.75 Anche Vivolo: 1 1st Qu.:36.0
Median : 84.50 Andiamo : 1 Median :43.0
Mean : 84.50 Arno : 1 Mean :42.7
3rd Qu.:126.25 Artusi : 1 3rd Qu.:50.0
Max. :168.00 Baci : 1 Max. :65.0
(Other) :162
Food Decor Service
Min. :16.0 Min. : 6.00 Min. :14.0
1st Qu.:19.0 1st Qu.:16.00 1st Qu.:18.0
Median :20.5 Median :18.00 Median :20.0
Mean :20.6 Mean :17.69 Mean :19.4
3rd Qu.:22.0 3rd Qu.:19.00 3rd Qu.:21.0
Max. :25.0 Max. :25.00 Max. :24.0
East
Min. :0.000
1st Qu.:0.000
Median :1.000
Mean :0.631
3rd Qu.:1.000
Max. :1.000
> pairs(nyc.df[, c(3:6)], main = "Pairs plot for Zagat price data")
One of our predictor variables is not continuous. It is a
qualitative (categorical, nominal, factor) variable with two
levels, which we use as an indicator (dummy) variable: it equals
1 if the restaurant is East of the river, and 0 otherwise. We can
best examine the relationship between this variable and Price
by means of a boxplot.
> boxplot(Price ~ East, data = nyc.df, col = "orange",
main = "Effect of East on price", ylab = "Price",
xlab = "East")
Fitting a model
> nyc.lm1 <- lm(Price ~ Food + Decor + Service +
East, data = nyc.df)
> summary(nyc.lm1)
Call:
lm(formula = Price ~ Food + Decor + Service + East, data = nyc.df)
Residuals:
Min 1Q Median 3Q Max
-14.0465 -3.8837 0.0373 3.3942 17.7491
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -24.023800 4.708359 -5.102 9.24e-07
Food 1.538120 0.368951 4.169 4.96e-05
Decor 1.910087 0.217005 8.802 1.87e-15
Service -0.002727 0.396232 -0.007 0.9945
East 2.068050 0.946739 2.184 0.0304
Residual standard error: 5.738 on 163 degrees of freedom
Multiple R-squared: 0.6279, Adjusted R-squared: 0.6187
F-statistic: 68.76 on 4 and 163 DF, p-value: < 2.2e-16
The model
Price = −24.02 + 1.54 x_Food + 1.91 x_Decor + 0 x_Service + 2.07 x_East + ε_i
We can see that the higher the value of x_Food, the higher the price
charged: for every unit increase in x_Food, the predicted value of
y_Price increases by 1.54, holding the other variables fixed.
There doesn’t appear to be a relationship between Service and
Price. This is rather interesting, as it implies you could employ
Basil Fawlty to look after all the diners.
What to do about Service
The most important thing to note for now is that these values are
only estimates! We will study inference more formally later, but for
now we shall use a simple rule of thumb.
Key Point: Rule of two
If the absolute value of an estimate divided by its standard error
is less than 2, we can't even be sure what sign the estimate
should have.
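R has already done this division for us: it is the "t value" column of the coefficient table. As a sketch, we could also compute it by hand:

> # estimate divided by standard error, for each coefficient
> est <- coef(summary(nyc.lm1))
> est[, "Estimate"] / est[, "Std. Error"]

For Service this ratio is about −0.007, far inside the rule-of-two boundary, while the other coefficients are comfortably outside it.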
Modifying the model
Some people might remove this variable from our model.
> nyc.lm2 <- update(nyc.lm1, Price ~ . - Service)
> summary(nyc.lm2)
Call:
lm(formula = Price ~ Food + Decor + East, data = nyc.df)
Residuals:
Min 1Q Median 3Q Max
-14.0451 -3.8809 0.0389 3.3918 17.7557
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -24.0269 4.6727 -5.142 7.67e-07
Food 1.5363 0.2632 5.838 2.76e-08
Decor 1.9094 0.1900 10.049 < 2e-16
East 2.0670 0.9318 2.218 0.0279
Residual standard error: 5.72 on 164 degrees of freedom
Multiple R-squared: 0.6279, Adjusted R-squared: 0.6211
F-statistic: 92.24 on 3 and 164 DF, p-value: < 2.2e-16
A look at the adjusted R²
Key Point: Adjusted R²
The R² is a diagnostic (taking values between 0 and 1) which tells us
the "proportion of variation in Y explained by our model". The
adjusted R² incorporates a penalty to account for the number
of variables we have used.
What do you notice about the adjusted R²?
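As a sketch of where the adjustment comes from: with n observations and p predictors, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1). We can verify this against R's reported value:

> n <- nrow(nyc.df)   # 168 restaurants
> p <- 3              # Food, Decor and East
> r2 <- summary(nyc.lm2)$r.squared
> 1 - (1 - r2) * (n - 1)/(n - p - 1)   # 0.6211, matching the summary
> summary(nyc.lm2)$adj.r.squared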
We could also compare the two models:
Model 1: Price = −24.02 + 1.54 x_Food + 1.91 x_Decor + 0 x_Service + 2.07 x_East + ε_i
Model 2: Price = −24.03 + 1.54 x_Food + 1.91 x_Decor + 2.07 x_East + ε_i
So it looks as if we could indeed suggest prices to charge, based on
ratings of the other variables.
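For instance, predicted prices for a hypothetical restaurant with food rating 22 and decor rating 20, on each side of the river (the ratings are invented for illustration):

> predict(nyc.lm2, newdata = data.frame(Food = 22, Decor = 20,
      East = c(1, 0)))

The two predictions differ by exactly the East coefficient, $2.07.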
Residual checking
As well as fitting models, we have to make sure they are sensible.
This involves checking all the assumptions we made when fitting
the model. We haven't yet said anything formally about the
assumptions, but just to introduce the ideas let's check two:
1 Check that the residuals are Normally distributed
2 Check that the residuals have constant variance
Checking the Normality of the residuals
One method we could use is to examine a histogram of the
residuals:
> hist(resid(nyc.lm2), freq = FALSE)
> curve(dnorm(x, 0, summary(nyc.lm2)$sigma),
add = TRUE, col = "red")
Figure: Histogram of residuals from the model fit (x-axis: resid(nyc.lm2), y-axis: Density), with the normal model superimposed.
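A common complement to the histogram, added here as a sketch rather than as part of the original slides, is a normal Q-Q plot of the residuals:

> qqnorm(resid(nyc.lm2))   # points near a straight line suggest Normality
> qqline(resid(nyc.lm2))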
Checking the constant variance assumption
We will use some kind of plot of residuals against the fitted values
(the points on the regression line corresponding to individuals in
the dataset). We start with a simple plot of residuals against fitted
values.
> plot(resid(nyc.lm2) ~ fitted(nyc.lm2))
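A horizontal reference line at zero, a small addition to the slide's code, makes departures from constant spread easier to judge:

> abline(h = 0, lty = 2)   # dashed reference line at zero residual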
Assumption checking
We could conclude that (a) the residuals appear Normal and (b)
the variance appears constant.
As we are happy with our model, we can answer the most
substantive question: having a restaurant East of the river seems to
add $2.07 to the price you can charge for a meal.
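We study inference formally later, but for reference, the uncertainty around that $2.07 can already be summarised with a confidence interval:

> confint(nyc.lm2, "East")   # 95% confidence interval for the East premium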
Summary of what we have done
1 We have specified a problem and collected some data
2 We have carried out an exploratory data analysis
3 We have fitted an appropriate model to the data
4 We have checked the assumptions made when fitting that
model
5 We have made some adjustments to the model
6 We have attempted to draw some conclusions
What we need to do
1 Think about the kinds of problems we can examine by
regression modelling
2 Learn (revise and extend what you did in STAT1401) how to
carry out an exploratory data analysis
3 Learn about the types of models we can build, and the
assumptions we make when building them
4 Learn more about how to check the model assumptions, and
understand some of the problems that arise when they are not met
5 Learn more about how to alter the structure of a model, in
particular how to decide in observational studies which
variables to include and exclude
6 Learn how to interpret the results of model fitting and, when
appropriate, how to carry out statistical inference on the
results
How we can assess this
1 Ask you to discuss the reasons for a particular study and how
we deal with the different variables (exam)
2 Ask you to carry out an EDA (coursework), or comment on an
EDA that has been carried out (exam)
3 Fit (coursework) an appropriate model for a particular dataset.
Discuss and explain the principles behind various models
(exam)
4 Carry out residual checks and make adjustments to a model
(coursework), comment on residual checks / explain why
adjustments have been made (exam)
5 Carry out and report a model building exercise (coursework),
explain someone else’s model building (exam)
6 Interpret the results of your own (coursework) or someone
else’s model fitting (exam)
References
R.D. Cook and S. Weisberg. Applied Regression Including
Computing and Graphics. John Wiley, Hoboken, NJ, 1999.
Simon Sheather. A Modern Approach to Regression with R.
Springer Texts in Statistics. Springer-Verlag, New York, 2009.