This document analyzes an energy efficiency dataset to predict heating load using linear regression. It finds that heating load is highly correlated with cooling load, so only heating load is used as the response variable. Stepwise regression identifies relative compactness, surface area, wall area, overall height, glazing area, and glazing area distribution as significant predictors of heating load. The regression has high R-squared but residual analysis shows the model is not a good fit for the data due to distinct variable values.
1. Analyzing the Energy Efficiency Data Set
Ankit Ghosalkar
Introduction
The main purpose of the report is to study and analyze the Energy Efficiency Dataset. We have
performed linear regression on this dataset using response variable called Heating Load. Along with this
variable, we also performed an analysis on other variables in the dataset and studied the significance of
these variables to model the linear regression. We have performed and worked on three model
selection criterion: Ordinary Least squares(OLS), Akaike Information Criterion(AIC), Bayesian Information
Criterion(BIC) to represent the data linearly.
Data Set Information
We perform energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ
with respect to Glazing Area, Glazing Area Distribution and Orientation, amongst other parameters. The
dataset comprises of 768 samples and 8 features aiming to predict one real valued response. The
dataset made use of X as the predictor variables and Y as response variables i.e. the predictor variables
are X1…..X8 and the response variables are Y1 and Y2.The variables have been renamed to the
following:
Information about the 10 variables is
1. Relative Compactness
2. Surface Area
3. Wall Area
4. Roof Area
5. Overall Height
6. Orientation
7. Glazing Area
8. Glazing Area Distribution
9. Heating Load
10. Cooling Load
From the above variables we are considering Heating Load and Cooling Load as response variables. First
of all we calculated the correlation between these two response variables and other variables and also
studied the correlation among all the variables.
>Cor(file)
2. As shown in correlation matrix, the correlation between two response variable is very high (i.e. 0.975).
Hence by studying one of the response variables, we can predict the value of other response variable. So
we have considered Heating Load as the only response variable. Overall Height variable has a very good
correlation with the response variable Heating Load (i.e. 0.889), hence we can say that Overall Height is
good predictor of response variable Heating Load. Also Relative Compactness has a good correlation
with the response variable, so it is also a good predictor of response variable. The two predictor
variables also have good correlation with each other. So the two predictor variable can themselves act
as a good predictor-response pair.
Study of Variables
Before we do the linear regression for the given dataset and variables, we will need to study the
variables of the datasets. And this will be done by creating the pairwise scatterplot of the variables in
dataset. The scatterplot is useful to visualize the relationship between the variables of the dataset and
scatterplots are used to make an initial diagnoses before any statistical analysis are conducted.
3. The Scatterplot of Dataset is as follow:
The scatterplot concurs with the fact that there is a linear relationship between two response variables.
There is a very high correlation between the two variables. So instead of analyzing both the variables we
are analyzing only one response variable. The same analysis will apply to the other variable. The
correlation between most variables is not linear because most of the variables have a few distinct
values.
Linear Regression
lm.mo<-
lm(Heating.Load~Relative.Compactness+Surface.Area+Wall.Area+Roof.Area+Overall.Height+Orientation
+Glazing.Area+Glazing.Area.Distribution,data=file)
summary(lm.mo)
4. Linear Regression was performed with the response variable as Heating Load and the rest of the
variables as predictors. The summary shows us that Roof Area does not have any relationship with the
response variable. It does not predict the value of response variable. The p-value of Orientation is not
significant. All other attributes have a significant p-value. Also the R-squared value is very high. It is
91.62% which implies that it fits the model data. During linear regression we analyze the model based
on hypothesis that: All regression coefficients are 0. So,
Null Hypothesis=All regression coefficients are zero.
Alternate Hypothesis=Atleast one coefficient is not zero.
Since the p-value for the model is less than alpha i.e. 2.2e-16 we can reject the Null Hypothesis.
Residual Analysis
The Residual Analysis on the model resulted in following output
> lm.mo<-
lm(Heating.Load~Relative.Compactness+Surface.Area+Wall.Area+Roof.Area+Overall.Height+Orientation
+Glazing.Area+Glazing.Area.Distribution,data=file)
> par(mfrow=c(2,2))
> plot(lm.mo)
5. The Residuals v/s Fitted values graph shows that the variables are scattered on the both side of the
reference line. Also there is a very high variance in the values. This is due to the fact that most variables
have few distinct values because of which the values in graph are scattered. Also the graph appears to
be fanning out to the right. Hence the model is not a good fit for the data. The normal Q-Q graph is not
linear.
6. Stepwise Regression
We have performed stepwise regression in the backward direction.
Akaike Information Criterion (AIC)
>step (lm.mo, direction=”backward”)
7. During each step the attribute with the lowest p-value is removed. The AIC of the model increases when
the attribute with the lowest p-value is removed. After removing Roof Area the AIC remains unchanged.
This happens due to the fact that Roof Area does not play a part in determining the response variable
but after removing Orientation the model improves as AIC changes by 1.94 to 1659.48. We can see that
attributes like Relative Compactness, Surface Area, Wall Area, Overall Height, Glazing Area and Glazing
Area Distribution play a significant part in deciding the value of response variable. This concurs with the
p-values which we got during regression.
Bayesian Information Criterion (BIC)
>step(lm.mo,direction=”backward”,k=log(nrow(file)))
8. The behavior of BIC is similar to AIC. Initially the value of BIC was 1698.57. It remains unchanged after
removal of Roof Area since it is not significant in determining the value of response variable. The model
improves after the removal of Orientation by 6.58. Thus AIC and BIC give similar outputs and the
significant variables remain the same for both.
Conclusion
Heating Load and Cooling Load depend on same variables because of high correlation.The variables
which play an important part in predicting their values are Relative Compactness, Surface Area, Wall
Area, Overall Height, Glazing Area and Glazing Area Distribution.
The paper which cites this dataset is:
A. Tsanas, A. Xifara: 'Accurate quantitative estimation of energy performance of residential buildings
using statistical machine learning tools', Energy and Buildings, Vol. 49, pp. 560-567, 2012