This document presents an analysis of automobile data. It begins with data manipulation steps, including removing missing data and converting variables to appropriate data types. Exploratory data analysis is conducted through scatter plots and box plots to examine relationships between variables such as mileage and weight, grouped by number of cylinders. Simple and multiple linear regression models are fit to predict mileage, and model diagnostics identify violations of assumptions such as homoscedasticity. Transforming the response variable to the log scale addresses these issues. The modified multiple regression model has the highest R-squared value of the models fitted, suggesting that it fits the data best.
[Plot: residuals (vertical axis) against fitted values (horizontal axis)]
Figure 1: Residuals vs. fitted values. The figure suggests that the homoscedasticity assumption, i.e. constant variance,
is violated. Thus, the multiple linear regression model developed above does not seem to be suitable. The response
variable may need to be transformed and the model refitted. A log transformation of the response (mpg) may rectify
the non-constant variance.
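
As a minimal sketch of the check described in the caption, the following compares residuals vs. fitted values for a raw and a log-transformed response. It uses R's built-in mtcars data and a two-predictor model as stand-ins, which is an assumption made for illustration; the report's own auto data frame and model are not reproduced here.

```r
# Stand-in data: the built-in mtcars set (an assumption; the report's `auto`
# data frame is not available in this sketch).
fit_raw <- lm(mpg ~ wt + hp, data = mtcars)
fit_log <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Residuals vs. fitted values for both fits, side by side.
op <- par(mfrow = c(1, 2))
plot(fitted(fit_raw), resid(fit_raw),
     xlab = "fitted", ylab = "residuals", main = "mpg")
abline(h = 0, lty = 2)
plot(fitted(fit_log), resid(fit_log),
     xlab = "fitted", ylab = "residuals", main = "log(mpg)")
abline(h = 0, lty = 2)
par(op)

# Rough numeric companion to the plots: rank correlation between the absolute
# residuals and the fitted values. Values near 0 are consistent with constant
# variance. This is a heuristic, not a formal test.
spread_raw <- cor(abs(resid(fit_raw)), fitted(fit_raw), method = "spearman")
spread_log <- cor(abs(resid(fit_log)), fitted(fit_log), method = "spearman")
c(raw = spread_raw, log = spread_log)
```

A funnel shape in the left panel that flattens out in the right panel would support the log transformation; the rank correlation is only a quick summary of the same pattern.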
[Plot: normal Q-Q plot, sample quantiles (vertical axis) against theoretical quantiles (horizontal axis)]
Figure 2: Normal Q-Q plot (qqnorm and qqline) used to check the normality assumption. The residuals do not seem to
deviate much from normality, although there seem to be a few outliers in the data set.
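
The normality check can be sketched as follows, again with mtcars as a stand-in for the report's data (an assumption). The Shapiro-Wilk test and the cutoff of 2 for standardized residuals are additions for illustration and are not part of the original analysis.

```r
# Stand-in model on the built-in mtcars data (an assumption).
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- resid(fit)

# Q-Q plot of the residuals with a reference line through the quartiles.
qqnorm(res, xlab = "theoretical", ylab = "sample")
qqline(res)

# Shapiro-Wilk test as a numeric complement to the plot; a small p-value
# would suggest a departure from normality.
sw <- shapiro.test(res)
sw$p.value

# Flag observations whose standardized residual exceeds 2 in absolute value
# as candidate outliers (a common rule of thumb, chosen here for illustration).
std_res <- rstandard(fit)
candidates <- names(std_res)[abs(std_res) > 2]
candidates
```

Points flagged this way are only candidates; whether to keep or drop them depends on the data, not on the cutoff alone.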
[Plot: histogram of residuals (horizontal axis: residuals, vertical axis: count)]
[Plot: Index plot of Leverages (horizontal axis: Index, vertical axis: Leverages)]
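
A minimal sketch of how a residual histogram and an index plot of leverages like the ones above might be produced, using mtcars as a stand-in (an assumption). The cutoff of twice the average leverage is a common rule of thumb added here for illustration; the report does not state which threshold, if any, was used.

```r
# Stand-in model on the built-in mtcars data (an assumption).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Histogram of the residuals.
hist(resid(fit), xlab = "residuals", ylab = "count", main = "")

# Leverages are the diagonal entries of the hat matrix.
lev <- hatvalues(fit)
p <- length(coef(fit))   # number of estimated coefficients, incl. intercept
n <- nrow(mtcars)

# Index plot of leverages with a reference line at twice the average
# leverage, 2 * p / n (rule of thumb, not taken from the report).
cutoff <- 2 * p / n
plot(lev, type = "h", xlab = "Index", ylab = "Leverages",
     main = "Index plot of Leverages")
abline(h = cutoff, lty = 2)

# Names of the observations above the reference line.
high_lev <- names(lev)[lev > cutoff]
high_lev
```

The leverages always sum to the number of estimated coefficients, so their average is p / n, which is why the cutoff scales with both.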
8 Modifying the model
# Refit with log(mpg) as the response to stabilize the residual variance
lm.fit3 <- lm(log(mpg) ~ weight + horsepower + acceleration + cylinders +
              cylinders * horsepower, data=auto)
summary(lm.fit3)
##
## Call:
## lm(formula = log(mpg) ~ weight + horsepower + acceleration +
## cylinders + cylinders * horsepower, data = auto)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.36326 -0.08544 -0.00528  0.08195  0.63418
##
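
Because the refitted model predicts log(mpg), its predictions and its R-squared are on the log scale. The sketch below shows one way to move back to the mpg scale, using mtcars and a two-predictor model as stand-ins (an assumption); the `new_car` values are hypothetical.

```r
# Stand-in models on the built-in mtcars data (an assumption).
fit_raw <- lm(mpg ~ wt + hp, data = mtcars)
fit_log <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Predictions from the log model are on the log scale; exponentiate the
# point prediction and interval limits to express them in mpg.
new_car <- data.frame(wt = 3.0, hp = 120)   # hypothetical car
pred_log <- predict(fit_log, newdata = new_car, interval = "prediction")
pred_mpg <- exp(pred_log)
pred_mpg

# The R-squared reported by summary() for the log model describes variation
# in log(mpg), so it is not directly comparable with the raw-scale value.
# One rough comparison on a common scale is the squared correlation between
# observed mpg and the back-transformed fitted values.
r2_raw <- summary(fit_raw)$r.squared
r2_log_on_mpg <- cor(mtcars$mpg, exp(fitted(fit_log)))^2
c(raw = r2_raw, log_backtransformed = r2_log_on_mpg)
```

Exponentiating the prediction interval limits gives a valid interval for mpg; the exponentiated point prediction is better read as an estimate of the median than of the mean mpg.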