MLR Project (Onion)

A STUDY ON THE
CRIME RATE IN 47
STATES OF THE USA
IE 5318 PROJECT 2
GROUP MEMBERS (ONION):
UKESH CHAWAL
OLATEJU OLANREWAJU
ADEBIYIAKINSANYA
FALL
2014
Universityof TexasatArlington
FALL 2014

Page | 1
INDEX
TOPIC PAGE NUMBER
1. Proposal 2
2. Preliminary Multiple Linear Regression Model Analysis 4
3. Exploration of Interaction Terms 13
4. ModelSearch 18
5. ModelSelection 22
6. Final Multiple Linear RegressionModel 28
7. Final Discussion 30
(i) Appendix 1 32
(ii) Appendix 2 33
(iii) Appendix 3 34
(iv) Appendix 4 35
(v) Appendix 5 36

Page | 2
1. Proposal
Description of the problem and the variables:
Our project is focused on the increase in the crime rate which is a problem in any kind of society and we
wanted to know the possible relationship between the crime rate in 47 states of the USA (response
variable, Y) and four factors (predictor variables) we believe are the reason for causing response variable.
These variables include:
 X1 – unemployment rate of urban males per 1000
 X2 – the number of families per 1000 earnings below one-half of the median income
 X3 – state population size in hundred thousands
 X4 – police expenditure per person by state and local government
Present and discuss the matrix scatter plot of the variables
a
)
c
)
b
i
)
h
)
g
)
f
)
e
)
d
)
j
)
j
a
)
b
c
)
d
)
e
)
f
)
g
)
h
)
i
)

Page | 3
Figure 1: Matrix Scatter plot
Response vs Predictors Scatter Plot
Crime rate(Y) vs unemployment rate(X1) in figure 1-a: Looking at the figure we can see there is no
linear relationship between Y and X1. And it has a constant trend. There are no outliers.
Crime rate(Y) vsnumber of families earning below one-half income (X2) in figure 1-b: Looking at
the figure we can see there is a linear relationship between Y and X2. And it has a negative correlation
pattern. There are no outliers. It will be useful if X2 is added to the model because there is some linear
trend that is not explained by the model. So adding X2 helps to explain more variation.
Crime rate(Y) vsstate population size (X3) in figure 1-c: Looking at the figure we can see there is an
upward trend between Y and X3. So, there is a linear relationship. Also, there are no outliers. It will be
useful if X3 is added to the model because there is some linear trend that is not explained by the model.
So adding X3 helps to explain more variation.
Crime rate(Y) vspolice expenditure (X4) in figure 1-d: Looking at the figure we can see there is a
positive correlation pattern between Y and X4. So, there is a linear relationship. Also, there are no
outliers. It will be useful if X4 is added to the model because there is some linear trend that is not
explained by the model. So adding X4 helps to explain more variation.
Predictor vs Predictor Scatter Plot
Unemployment rate(X1) vsnumber of families earning below one-half income (X2) in figure
1-e:Looking at the figure there is a random cloud and no linear trend. The model is reasonable. There are
no outliers.
Unemployment rate(X1) vsstate population size (X3) in figure 1-f:Looking at the figure we can see
there is an upward trend between X1 and X3. So, there is a linear relationship. Also, there are no outliers.
Unemployment rate(X1) vspolice expenditure (X4) in figure 1-g: Looking at the figure there is a
curvature in the pattern. The model is not reasonable. There are no outliers.
Number of families earning below one-half income (X2)vsstate population size (X3) in figure 1-h:
Looking at the figure there is a curvature in the pattern. The model is not reasonable. There are no
outliers.
Number of families earning below one-half income (X2) vspolice expenditure (X4) in figure 1-
i:Looking at the figure we can see there is a negative downward trend between X2 and X4. So, there is a
linear relationship. Also, there are no outliers.
State population size (X3) vspolice expenditure (X4) in figure 1-j:Looking at the figure we can see
there is an upward trend between X3 and X4. So, there is a linear relationship. Also, there is one possible
x-outlier.
Check response-predictor and predictor-predictor pairwise correlation:
It is important to check the correlations between the response-predictor; and also, to check the
correlations between the predictor-predictor before we model the regression of our data,so that we can be
aware of the possible problem we might face while modelling the data regression, the table in figure 1
below shows the correlations between the response- predictor and predictor-predictor.

Page | 4
Figure 2: Pearson Correlation Coefficients.
From the figure 2, it is observed that Pearson correlation coefficients between Y and X1 equals to
0.03112 which is less than our standard r (sample correlation, 0.7) that implies they do not have strong
correlation. Similarly, Pearson correlation coefficients between Y and X2 (-0.17876) and Y and X3
(0.37930) are again less than r that implies they do not have strong correlation. But we can see that
Pearson correlation coefficients between Y and X4 (0.70735) is greater than r that implies Y and X4 have
strong correlation. Because Y and X4 have the strong correlation between them, it is desirable. Moreover,
Pearson correlation coefficients between X1 and X2 (-0.03038), X1 and X3 (-0.07302), X1 and X4 (-
0.00975), X2 and X3 (-0.19457), X2 and X4 (-0.63801), X3 and X4 (0.57151) are again less than r that
implies they do not have strong correlation. Because the correlations between predictor-predictor are all
less than r, therefore we may not have serious multicollinearity problem.
Discuss Potential Complications
Since, the correlations between predictor-predictor are all less than r (sample correlation), therefore we
may not have serious multicollinearity problem in our model. But, looking at our scatter plot we can see
unemployment rate(X1) vs police expenditure (X4) and number of families earning below one-half
income (X2) vs state population size (X3) has a curve linear pattern which may have potential
curvilinearity problem. Also, state population size (X3) vs police expenditure (X4) shows there is one
possible x-outlier. So, we can use residual plot to precede the analysis by checking the outlier and curve
trend.
2. Preliminary Multiple Linear Regression Model Analysis
From the scatter plot correlation matrices it will be necessary to show the relationship between the crime
rate in 47 states and the four predictor variables. The multiple linear regression form is given as:
Yi= β0+β1Xi1+…βkXik+ϵi, where β0,β1,βk and ϵ are random variables representing the regression
coefficients and vertical variations between the observed and fitted values of Y.
A preliminary model was fit from our collected data as follows: crime rate= β0+β1unemployment_rate+
β2median_income+β3state_population+β4police_expenditure, where β0, β1, β2, β3, β4 are the model
parameters.

Page | 5
FIGURE 3: RESIDUALS VS PREDICTORS PLOT
In order to check the fitness of the model form, a residual plot is necessary to check for curvature. From
figure 3, it can be concluded that the model form is acceptable because the residual plot of each predictor
variable does not show a curvilinear trend.
FIGURE 4: RESIDUAL VS PREDICTED VALUE PLOT
The residual plot of each predictor varaible is also used to check the constant varriance assumption and
from the figure above, it can be concluded that the constant variance assumption is satisfied since there is
no funnel shape or curvature in our plot.
The Modified Levene test or Brown Forsythe test is conducted to confirm the constant variance in the
plot. SAS software is used to perform this test. To perform this test the first thing we do is divide the data
into two groups using the median of the independent variable. The median of our data was 79.5. So we

Page | 6
divide our data into 2 groups one is lower than 79.5 and other being greater than 79.5. In order to perform
the modified levene test, the equality of the variances using an F-test should be performed first to check if
the test has equal variances or unequal followed by the t-test to check if the variances are constant or not.
FIGURE 5: MODIFIED LEVENE TEST
Using 95% confidence (α = 0.05) and the p-value(α=0.05), we test using the following hypothesis:
Null hypothesis: H0 = variances are equal
Alternative hypothesis: H1 = variances are not equal
Decision rule: Reject H0 if p-value < α
Looking at the figure 5, from Equality of variances, we can see that p-value (0.0375) <α (0.05).
Decision: We reject H0.
Conclusion: Unequal variance is used for t-test.
Thus, we should conduct the two-sample t-test using unequal variances that is the pooled method from the
figure 5 above. Using 95% confidence (α = 0.05) and the p value, we test using the following hypothesis:
Null hypothesis: H0 = Means are equal
Alternative hypothesis: H1 = Means are not equal
Decision rule: Reject H0 if p value< α
Looking at the figure 5 above, we can see that p value (0.2169) >α (0.05),
Decision: We fail to reject H0.
Conclusion: Means are equal.
Thus, we are 95 % confident that the means of the absolute deviations are unequal that is the
model has constant variance. The result is consistent with that of the residual plot from figure 4
above.

Page | 7
FIGURE 6: NORMALITY PLOT
The above figure gives us the following information:
 The plot is not straight with shorter right tail and shorter left tail.
 Normality is not ok.
Thus from this information we conclude that model assumption for normality is not satisfied.
Furthermore, The CORR Procedure using SAS software is used to test the normality. Here, we compare
the observed value of the distribution of the coefficient of correlation between residuals (e) and their
expected values under normality (enrm) with its critical value. The observed value of the coefficient of
correlation (ῥ obtained from figure 7) while the critical value (c) is obtained from statistical tables.
FIGURE 7: CORRELATION TEST FORNORMALITY
Using 90% confidence (α = 0.1) we perform test using the following hypothesis:
Null hypothesis: H0 = Normality is OK vs
Alternative hypothesis: H1 = Normality is violated
Decision rule: Reject H0 if ῥ < c (α, n)
To calculate c (0.1, 34) from statistical tables, we do interpolation and the value is 0.9734.
From figure 7, ῥ = 0.98859 and we can see ῥ (0.98859) > c (0.9734)
Decision: We fail to reject H0
Conclusion: Normality is not violated.
This result is not consistent with the result from NPP Plot. If the NPP plot and the normality test using
SAS software give two different results, the result form the plot is more reliable. Thus, our overall
conclusion is that the normality is not satisfied or violated.

Page | 8
Our data was taken in 47 states of USA and time constraint is not taken into considerations. So, we
consider our residuals are uncorrelated. Hence time sequence plot to check uncorrelated errors assumption
is not relevant for our model.
FIGURE 8: SAS OUTPUT OF T VALUE AND F VALUE
1. Checking Outliers
Leverage, influence and Bonferroni outlier test are diagnostic measures use to check for outliers.
a) Bonferroni outlier test is used to check for y-outliers and this can be done by comparing the absolute
values of the deleted studentized residuals. The Bonferroni outlier hypothesis testing is conducted as
follows: H0: The data point is y-outlying vs. H1: The data point is not y-outlying. The decision rule is
reject H0 if /ti/>t(1-α/2n; n-p-1), where α= 0.05, t (0.9993, 28) is 3.54557 from figure 8, since the absolute
values of the deleted studentised residuals (From Appendix 1) are less than the Bonferroni cut-off, we
reject H0 and conclude that the there are no y-outlying points in our data sets. b) X-outliers can be checked
by identifying the data point in our data sets that has its leverage value (hii) which is given as HatDiagH >
0.29(2𝑝/𝑛 ) which is the cut-off value, where p is the number of parameters, which is 5 and n is the
number of observations in our data sets which is 34. From Appendix 1, observation 4 and 29 have
leverage values greater than 0.29, therefore they are considered as x-outliers.
2. Checking Influence
DFFITS, DFBETAS and Cook’s distance measures are used to check the influence of the outliers found
in our datasets on the fitted values, they are used to investigate if the outlying data point will affect the
fitted regression function. Since our datasets is larger than 30, we use flag /DFFITS >2√ 𝑝/𝑛 (0.7699) and
from Appendix 1, we can see that observation 4 has an absolute DFFITS value of 1.4805, observation 11
has absolute DFFITS value of 0.8163 and observation 29 has an absolute DFFITS values of 2.5746,
which is all greater than 0.7699(the cut-off value). Also since our data is larger than 30, we flag
/DFBETAS/>2/√ 𝑛 (0.34299) and from Appendix 1, we can see that observation 2 has absolute value of
DFBETAS for X3 has the only value greater than the cut-off value while other values are less than the
cut-off values therefore it is slightly influential, thus observation 2 can be ignored as an influential point.
Furthermore, observation 4(X3), 18(X4), 22(X2), 24(X1), 26(X3, X4), 29(X3, X4), 31(X1) all have the
absolute values of DFBETAS greater than cut-off values while other values are less than the cut-off
value, thus all the observations are slightly influential and they can all be ignored as an influential point.
Using the cook’s distance, the data point is influential when Di> f (0.5; p, n-p), which is 0.89089 from
figure 8. And from our table below the observation 4 cook’s distance value is less than the cut-off value
of 0.89089, thus x-outlying observation 4 is not influential; but x-outlying observation 29 is influential
because the cook’s distance value is greater than the cut-off value of 0.89089.
Thus, looking at the overall result we do not have adequate reason to remove any outliers. There are no
data points that are unusual so it is fine to include all the points in our data. Thus our data still has 34
observations.

Page | 9
FIGURE 9: VARIANCE INFLATION DIAGNOSTICS
Variance Inflation factors are used to check multicollinearity between predictors by regressing predictors
on each other. And the following are the guidelines of VIFs:
 ∑ /𝑝 − 1
𝑝−1(𝑉𝐼𝐹)𝑘
𝑘=1 >> 1;the mean values of the VIFs, which is 1.75869 is slightly greater than 1
 Max(VIFk) > 10, the maximum VIF is 2.59828 which is far lower than 10
 From figure 9 above none of the VIFk > 5
Therefore,we can conclude that we do not have serious multicollinearity problems and this was also
mentioned earlier in our scatter plot correlation matrices observations.
Transformation
In order to see the difference or to improve the model we performed log transformation for the previous
model. So, the transformed multiple linear regression form is given as:
Log crime rate= β0+β1unemployment_rate+ β2median_income+β3state_population+β4police_expenditure,
where β0, β1, β2, β3, β4 are the model parameters.
Now we need to perform verification of model assumptions in order to present our transformed model to
be preliminary model.
FIGURE 10: TRANSFORMED MODEL OF RESIDUALS AGAINST PREDICTORS

Page | 10
FIGURE 11: TRANSFORMED MODEL OF RESIDUAL AGAINST PREDICTED VALUE
Since there is no curvature in our plot of residuals against predictors and residual against predicted value
as seen in figure 10 and 11, thus it can be inferred that the transformed model is appropriate. Because of
the scattered patterns,it can be concluded that the constant variance error assumption is satisfied.
Method Variances DF t Value Pr > |t|
Pooled Equal 32 0.07 0.9441
Satterthwaite Unequal 28.709 0.07 0.9449
Equality of Variances
Method Num DF Den DF F Value Pr > F
FoldedF 15 17 1.56 0.3731
FIGURE 12: TRANSFORMED MODIFIED LEVENE TEST
The constant variance error assumption is checked by conducting a modified levene test, the median value
1.935281 is used to divide the data into two groups. In conducting the modified levene test, the equality
of the variances is tested first by using the F-test. Hypothesis is conducted at 95% confidence level as
follows:
Null hypothesis (H0): equal group variances vs
Alternative hypothesis (H1): unequal group variances.
Decision rule: Reject H0 if p<α.
Looking at figure 12, it can be seen that the p-value of 0.3731> 0.05, thus we fail to reject H0 and
conclude that the group variances are equal. From this result the two-sample t-test hypothesis is
conducted by testing at 95% confidence level as follows:
Null hypothesis (H0): constant variances vs
Alternative hypothesis (H1): constant variances violated.
Decision rule: Reject H0 if p<α.
It can also be seen from figure 12, that p-value of 0.9441 >0.05, thus we fail to reject H0 and conclude
that the variances are constant which is consistent with the constant variance assumption noticed from the
residual plot.

Page | 11
FIGURE 13: TRANSFORMED MODEL OF NORMALITY PLOT
From the above figure, we can see the plot is straight and thus normality is ok.
Pearson CorrelationCoefficients, N= 34
e enrm
e
Residual
1.00000 0.98290
enrm
Normal Scores
0.98290 1.00000
FIGURE 14: TRANSFORMED MODEL OF CORELLATION TEST FOE NORMALITY
Using 90% confidence (α = 0.1) we perform test using the following hypothesis:
To calculate c (0.1, 34) from statistical tables we do interpolation and the value calculated is 0.9734.
Conclusion: Normality is ok.
This result is consistent with the result from the NPP Plot.
Also, using Bonferroni outlier test for y-outliers in this case, we got our leverage cut off value to be
3.5712 and since the absolute values of the deleted studentised residuals (From Appendix 5) are less than
the Bonferroni cut-off, we conclude that the there are no y-outlying points in our data sets. Also, leverage
cut-off value is 0.29 and from Appendix 5, there are no observations greater than the cut-off value,
therefore they are no x-outliers in this case. Thus there is no need to check their influence.

Page | 12
FIGURE 15: SAS OUPUT OF PRELIMINARY MODEL
Preliminary model
The transformed model satisfied the entire necessary model assumptions therefore it will be considered as
our preliminary model, thus we fit the model with this equation we got from SAS output from figure 15:
Log crime rate = 0.83627+0.00044715unemployment_rate +0.00260median_income
-0.00067161state_population +0.00659police_expenditure
Using the ANOVA table in figure 15, necessary information can be obtained to explain and check the
model form. The F-value and the Pr>F are used as the cut-off values needed to conduct F-test, which are
used to check if the regression is significance, the f-value (12.04) and the Pr>F which is used to F-test
statistics is given as 0.0001, in order to check if the model is significant, F-test is conducted at
significance level of 0.0001 which is our α
Null hypothesis (H0): All βk = 0 VS
Alternative (H1): At least one βk≠0, where k =1, 2….4
Decision rule: Reject H0 if p-value<α.
Looking at figure 15 below, it is noticed that 0.0001< 0.01, thus we reject H0 and conclude regression is
significant.
Looking at Figure 15, our regression model provides estimates of the crime rate in 47 states of the USA
which is dependent on the four predictors. Because we are providing estimates of the actual data,

Page | 13
variability is major concern. The ANOVA table (in Figure 15) provides variability information in the
columns Sum of squares and Mean Square. The value 0.75249, found in the cell that is the intersection of
model row and sum of squares column, is called the SSR (regression model sum of squares). It represents
the amount of variability that is explained by our model. The row and mean square column is called the
MSR (regression model mean of square); which value is 0.18812.The value 0.452324 represents the SSE
(error/residual sum of squares), this is the amount of variability that is unexplained by our model. Below
it is the value 1.20573; this is the value of the total variability of the observations of crime rate in 47 states
of the USA due to variation in the four predictor variables; which is the SSTO – total sum of squares. At
the intersection of the error row and mean square column is the MSE (error/residual mean square), given
as 0.01563, this value represents an unbiased estimator of the variance of the crime rate in 47states in
USA variable, as it provides information on the spread of the data points from their mean and from each
other. The value of the MSE, 0.01563 is low; thus, we can say that our data points are not very spread out
from each other; as noticeable in Figure 15. The Root MSE is the square root of the MSE value found in
the ANOVA table; it is the unbiased estimator of the standard deviation of crime rate in 47 states of the
USA. R-square, is also found in this table. Given as 0.6241, this value (known as the coefficient of
determination) represents the proportion of variability that is explained by our model; thus we can say that
our regression model only accounts for 62.41% of the variability in the observations of the crime rate in
47 states of the USA, also more predictor variables can be added to get more explained proportion of
variability. The adjust R-square is very useful in adjusting the R-square in order to determine the number
of predictors that are in the model, the adjusted R-square doesn’t determine the fit of the model; it inform
us of how suitable our model is in comparison with the number of predictors in the model thereby
measuring the marginal contribution of each predictors in the model. Since the adjust R-square (0.5722) is
not very close to the R-square(0.6241) this shows that the model may include too many predictors
Based on the observation mentioned earlier on in figure 9 which is also the same as figure 15, the
variance inflation factor guidelines, it was observed that that none of all the VIFs are greater than 5and
the average of the VIFs(1.758685) is not much bigger than 1, then we conclude that serious
multicollinearity is not a problem.
From figure 15, t-test at significance level of 0.05 is used to conduct test to check if the predictors are
significance to our model. The Pr>/t/ represents the p-value.
Null hypothesis (H0): All βk = 0 VS Alternative (H1): At least one βk≠0, where k =1, 2….4
Decision rule: Reject H0 if p-value<α, Looking at figure 15 above, it is noticed that X2 (0.0015) and X4
(<0.0001) have their p-value less than α(0.05), thus we reject H0and conclude that only the two predictors
are significant to this model; since β2 andβ4 are not equal to zero.
3. Exploration of Interaction Terms
It is necessary to check for potential interaction effect because the effects of predictors on the response
may not be additive but they may depend on the levels of the other predictors. Interaction terms are
obtained by multiplying predictors with each other, which is then used to check if they may have any
effect on the response variable.

Page | 14
FIGURE 16: PARTIAL REGRESSION PLOT X1X2
Here in figure 16, the points have no trend (random cloud). So we conclude not to add X1X2 because it
doesn’t have any marginal contribution to our model since it doesn’t have any particular trend.
Similarly in figure 17, the points have no trend (random cloud).This interaction term will not be useful in
our model, which means X1X3cannot be added into the model because it doesn’t have any marginal
contribution to the model since it has no trend.

Page | 15
The points in figure 18 shows a linear trend, therefore; it will be useful to add the interaction terms X1X4
to the model because there will be marginal contribution of the interaction terms to the model.
The points in figure 19 shows a linear trend, therefore; it will be useful to add the interaction terms X2X3
to the model because there will be marginal contribution of the interaction terms to the model.

Page | 16
Since there is a trend observed in figure 20, it will be useful to add the interaction terms to the model
since it will have marginal contribution to the model.
Since there is a trend observed in figure 21, it will be useful to add the interaction terms to the model
since it will have marginal contribution to the model.
Therefore looking at all the six interaction terms from figure 16 to figure 21, interaction terms between
X1X4, X2X3, X2X4 and X3X4 are the possible useful interaction terms that can be added to our model since
they have linear trend and have significant effects on our response variable (crime rate).

Page | 17
Interaction terms are usually highly correlated with their original predictors when rxx(correlation
coefficients) >0.7 as seen in figure 22 below.
FIGURE 22: CORRELATIONS BETWEEN PREDICTORS AND INTERACTION TERMS
It is seen from figure 22, that all of the six interaction terms are highly correlated with one of their
original predictors. As a result of this high correlations, there might be serious multicollinearity problems
when interaction terms are added to the model, to reduce the effect of this high correlation; the interaction
terms will be standardized by centering the means to 0 and scaling the variances to 1 to reduce the effect
of the high correlation.
FIGURE 23: CORRELATIONS BETWEEN PREDICTORS AND STANDARDIZED
INTERACTION TERMS
Looking at figure 23 above, the high correlation effects in all the six interaction terms has been reduced
by standardizing the interaction terms, thus serious multicollinearity problem has been reduced and now
we can add the standardized interaction terms into the model.

Page | 18
4. Model Search
The model search technique used in obtaining potential best models is backwards deletion, best
subsets and stepwise regression. Here we will find two potentially best models. These best models
predictors are significant at α=0.1 and multicollinearity is not a serious problem.
Best Subsets Selection
This technique contain crime rate on every available subset of our four predictors using SAS
software. We will find the potentially two best models based on the following criteria:
 High R-square and low SSE
 High adjusted R-square and low MSE
 Low values of Mallow’s CP such that (Cp ≈ p), where “p” is the number of parameters.
 Low values of AIC (Akaike’s Information Criterion) and SBC (Schwarz’ Bayesian Criterion)
FIGURE 24: BEST SUBSETS SELECTION – 1 VARIABLE MODELS
FIGURE 26: BEST SUBSETS SELECTION – 3 VARIABLE MODELS (2ND
BEST MODEL)
FIGURE 27: BEST SUBSETS SELECTION – 4 VARIABLE MODELS (1ST
BEST MODEL)

Page | 19
The two potentially best models can be selected using the criteria mentioned above. Here, looking at the
Adjusted R-square in figure 28 and 29 it started to decrease instead of increasing trend as in previous
figure. This means we do not have to look for 7 variable models because it will not help in explaining
variability, thus it does not make sense to consider models with more than 6 variables.
Now, using the criteria mentioned above from figure 24 – 29, we observed the R-square, Adjusted R-
square and the CP values to know the 1st
and 2nd
best model. We came to conclusion that Figure 27 is the
1st
best model because of the R-square (.7750) and Adjusted R-Square (.7440) which are higher values
then the previous model. Also CP value is 5.3977 which are comparatively low than other models and p
value is 5 which implies (Cp ≈ p) is satisfied. While Figure 26 is the 2nd
best model due to R-square
(.7411) and Adjusted R-square (.7152) noticeably high and the CP (7.8307) which is relatively low and p
is 4 is this case and thus satisfies (Cp ≈ p).
Step-wise Regression
This technique consists of Backward and Forwards selection for addition and deletion of a
predictor variable from the model. We start with no predictor by identifying the predictors that are
significant at α= 0.1. This is done by comparing the p-value for t-test to α= 0.1 for adding and deleting of
predictor variable. The predictor with the largest p-value is identified and deleted when p-value is greater
than α. The potentially good model is identified when no predictor variable can be added or deleted from
the model.

Page | 20
FIGURE 30: STEP-WISE REGRESSION RESULTS
Using step-wise regression, the potentially best models are 3 and 4 variables models from figure 30.
These two best models was selected also using subset technique.
Backward Deletion
This technique is a bit similar to step-wise regression. The different in this technique is that, we
begin with full set of predictors by regressing crime rate on full set of its predictors. The potentially good
model is identify until the remaining predictor variable are significant [p-value is less than α].
FIGURE 31: BACKWARDS DELETION RESULTS

Page | 21
We observed from figure 31 that 7 predictors were removed from the full set of our 9 predictors using
backward deletion technique. Thus the potentially best models are 3 and 4 variables models same as the
two selected by step-wise regression and subset technique. We also concluded that multicollinearity is not
a serious problem.
Thus, our two potentials best models with significant predictors and no serious milticollinearity problem
are presented below in figures 32 and 33 below.
R-Square = 0.7750
FIGURE 32: BEST MODEL 1
Best Model 1: Preliminary model:
Log Yi= 0.49799 + 0.00397X2 + 0.00837X4 + 0.08940stdx2x4 – 0.02166stdx3x4
Where Y is Crime rate,X2 is median income, X3 is state population and X4 is police expenditure.

Page | 22
R-Square = 0.7411
FIGURE 33: BEST MODEL 2
Best Model 2: Preliminary model:
Log Yi= 0.54640 + 0.00390X2 + 0.0078X4 + 0.09498stdx2x4
Where Y is Crime rate,X2 is median income, and X4 is police expenditure.
5. Model Selection
In selecting the final best model, the model assumptions must be verified and diagnostics must be
checked for each model.
Best Model 1:
Figure 34: RESIDUAL PLOT - Best Model 1

Page | 23
The best model 1 consist of 3 – predictor models, so we need to check for the constant variance by
plotting residual against predictor variable in figure 34. From the regression plot, we can see there is no
curvature, no funnel shape and there is a constant variance in the residual plot of x2, x4 and stdx2x4.
Figure 35: MODIFIED LEVENE TEST- BEST MODEL 1
Using modified levene test to confirm this observation with dividing point = 1.935281 and testing at 95%
confidence using p-value.
Looking at the figure 35, from Equality of variances, we can see that p-value (0.0375) <α (0.05).
Decision: We reject H0.
Conclusion: Unequal variance is used for t-test.
Thus, we should conduct the two-sample t-test using unequal variances from the figure 35 above. Using
95% confidence (α = 0.05) and the p value, we test using the following hypothesis:
Looking at the figure 5 above, we can see that p value (0.2169) >α (0.05),
Thus, we are 95 % confident that the means of the absolute deviations are unequal that is the
model has constant variance. The result is consistent with that of the residual plot from figure 34
above.

Page | 24
Figure 36: NORMAL PROBABILITY PLOT- Best Model 1
From the normal probability plot figure 36, we can see shorter-right tail and shorter left tail and we
conclude normality is not okay.
Figure 37: CORRELATION TEST FOR NORMALITY- Best Model 1
Using the correlation test to verify normality at 90% confidence (α = 0.1) we perform test using the
following hypothesis:
This result is not consistent with the result from NPP Plot. Thus we conclude that the normality
assumption is violated after based on more reliable NPP Plot.
Normal Score

Page | 25
We verify for outlier in our model by using the leverages values for x-outliers and Bonferroni outlier for
y-outliers. The leverages cutoff value is 0.29 while the Bonferroni value is 3.54557 at ∝ = 0.05. Our
model has no outliers since all the values are smaller than leverages cutoff value and the Bonferroni value
(Appendix 4). Thus there is no need to check for influence.
Best Model 2:
Figure 38: RESIDUAL PLOT - Best Model 2
From the regression plot figure 38, we can see there is no curvature, no funnel shape and thus we can
conclude there is a constant variance.
Figure 39: MODIFIED LEVENE- Best Model 2
Using modified levene test to confirm this observation with dividing point = 1.935281 and testing at 95%
confidence using p-value.
Looking at the figure 39, from Equality of variances, we can see that p-value (0.3731) >α (0.05).
Conclusion: Equal variance is used for t-test.

Page | 26
Thus, we should conduct the two-sample t-test using unequal variances that is the pooled method from the
figure 39 above. Using 95% confidence (α = 0.05) and the p value, we test using the following
hypothesis:
Looking at the figure 5 above, we can see that p value (0. 9449) >α (0.05),
Thus, we are 95 % confident that the means of the absolute deviations are equal and the model has
constant variance. The result is consistent with that of the residual plot from figure 38 above.
Figure 40: NORMAL PROBABILITY PLOT - Best Model 2
From the normal probability plot figure 40, we can see a shorter-right tail and shorter left tail and we
conclude normality is not okay.
Figure 41: CORRELATION TEST FOR NORMALITY- Best Model 2
Using the correlation test to verify normality at 90% confidence (α = 0.1) we perform test using the
following hypothesis:

Page | 27
This result is not consistent with the result from NPP Plot. Thus we conclude that the normality
assumption is violated based on more reliable NPP Plot.
We verify for outlier in our model by using the leverages values for x-outliers and Bonferroni outlier for
y-outliers. The leverages cutoff value is 0.29 while the Bonferroni value is 3. 3.56123 at ∝ = 0.05. Our
model has no outliers since all the values are smaller than leverages cutoff value and the Bonferroni value
(Appendix 5). Thus there is no need to check for influence.
Comparison ofthe two best models
In selecting the overall best regression model depends on the results of each model and the
diagnostic from the residuals. The comparison table of the two best models using the assumptions in
selecting the overall best model is shown below:
Best Model 1 Best Model 2
R-square and SSE: The rule is higher the R-square
and lower the SSE the better the model is. Here,
R-square value is 0.7750 while the SSE value is
0.27645. Comparatively this model is better than
the other one.
R-square and SSE: Here,the R-square value is
0.7411 while the SSE value is 0.31219.
Adjusted R-square and MSE: The rule is higher
the Adjusted R-square and lower the SSE the
better the model is. Here Adjusted R-square value
is 0.7440 while the MSE value is 0.00953.
Comparatively this model is better than the other
one.
Adjusted R-square and MSE: Here,Adjusted R-
square value is 0.7152 while the MSE value is
0.01041.
Low value of Mallow’s CP: Here,the CP value is
5.3977 which is the lower than the other model.
Moreover, p value here is 5 and CP value is close
to p value in this case which satisfies (Cp ≈ p).
Comparatively this model is better than the other
one.
Low value of Mallow’s CP: Here,the CP value is
7.8307 which is the higher than the other model.
Moreover, p value here is 4 and CP value is not
close to p value in this case compared to the other
model.
AIC and SBC: The rule is the lower the AIC and
SBC value, the better the model is. Here,AIC
value is -154.2522 and SBC value is -146.62044
which is lower than the other model. Thus this
model is better than the other model.
AIC and SBC: Here,AIC value is -151.4769 and
SBC value is -145.3714 which is lower than the
other model.
Hence,we can conclude from the above table that “Best Model 1” is the best overall regression model for
relationship between the crime rates in 47 states of the USA.

Page | 28
6. Final Multiple Linear Regression Model
As mentioned above, the general multiple linear regression form Yi= β0+β1Xi1+…βkXik+ϵi,and
our selected model from the SAS output is:
Log crime rate = 0.49799 + 0.00397median income + 0.00837police expenditure + 0.08940std
median income police expenditure – 0.02166stdstate populations police expenditure.
We noticed from the analysis that the four initial factors (unemployment rate, median income, state
population and police expenditure in relationship with the crime rate in 47 states), but only median
income and police expenditure is used in predicting the relationship between the crime rate in 47 states.
This model focus on the relationship between the crime rates in 47 states based on the low earning and
police expenditure. From the scatter plot matrix, we observed that there is a linear relationship between
the crime rate and number of families with median income with a negative pattern and also a linear
relationship between the crime rate and police expenditure with a positive correlation pattern and a linear
trend. Both variable explained more variation when added.
R-Square = 0.7750
Figure 42: FINAL MODEL – SAS OUTPUT
We conducted 95% confidence F-test to know the significance of our final model and we analyze the fit
of the model by testing null hypothesis, H0: 𝛽1 = 𝛽2 = 0 versus alternative hypothesis, H1: 𝛽1 𝑜𝑟 𝛽2 ≠
0. Decision rule: reject H0 when p < 𝛼 where p is the p-value, Pr> 𝐹. The ANOVA table in figure 42,
we noticed the p-value < 0.0001 which is less than 0.05, we reject null hypothesis and conclude that our
selected final model is significant.

Page | 29
The ANOVA table (in Figure 42) provides variability information in the columns Sum of squares and
Mean Square. The value 0.92927, found in the cell that is the intersection of model row and sum of
squares column, is called the SSR (regression model sum of squares). It represents the amount of
variability that is explained by our model. The row and mean square column is called the MSR
(regression model mean of square); which value is 0.23232.The value 0.27645 represents the SSE
(error/residual sum of squares), this is the amount of variability that is unexplained by our model. Below
it is the value 1.20573; this is the value of the total variability of the observations of crime rate in 47 states
of the USA due to variation in the predictor variables; which is the SSTO – total sum of squares. At the
intersection of the error row and mean square column is the MSE (error/residual mean square), given as
0.00953, this value represents an unbiased estimator of the variance of the crime rate in 47states in USA
variable, as it provides information on the spread of the data points from their mean and from each other.
The value of the MSE, 0.00953 is low; thus, we can say that our data points are not very spread out from
each other. Given the R-square value is 0.7750 (known as the coefficient of determination) represents the
proportion of variability that is explained by our model; thus we can say that our regression model only
accounts for 77.50% of the variability in the observations of the crime rate in 47 states of the USA. The
adjust R-square is very useful in adjusting the R-square in order to determine the number of predictors
that are in the model, the adjusted R-square doesn’t determine the fit of the model; it inform us of how
suitable our model is in comparison with the number of predictors in the model thereby measuring the
marginal contribution of each predictors in the model. Since adjusted R-square (0.7707) which is close to
R-square mean the number of predictors in this model is adequate. Thus, this model is valid for making
prediction for the relationship between the median income and police expenditure with our response.
Interpret inferences
Lower CI Limit Upper CI Limit
Median income (0.00397) 0.00213 0.00571
Police expenditure (0.00837) 0.00603 0.01071
Std median income police
expenditure (0.08940)
0.0266 0.1522
Std state population police
expenditure (0.02166)
-0.05156 0.00824
Figure 43: BONFERRONI SIMULTANEOUS 95% CONFIDENCE INTERVALS
From the values seen in figure 43, we are 95% confident that median income lies in between 0.00213 and
0.00571 and police expenditure lies in between 0.00603 and 0.01071 based on the quadratic relation that
includes adjustment variable [0.0266, 0.1522] and [-0.05156, 0.00824] simultaneously.
We are particularly interested in studying the case of xh when the median income is 139 and the police
expenditure per person is 69 because during this observation it has the smallest crime rate per 1000
in 47 states which is 34.2 (Refer to Appendix 2).
The SAS System
Xnew
1 139 69
Predicted
Lower
CI
Upper
CI
Lower
CB
Upper
CB
Lower
PI
Upper
PI
2.12936 1.23423 3.35637 1.03128 4.01023 -2.2342 7.00123
Figure 44: INTERVAL ESTIMATION AND PREDICTION

Page | 30
Confidence Interval
Looking at figure 44, we are 95% confident that the average number of crime rate in 47 states when
median income is 139 and the police expenditure per person is 69 lies in between 1.23423 and 3.35637.
Confidence Band Boundaries
Similarly, we can say with 95% confidence that the boundary values of the confidence band for the
average number of crime rate in 47 states when median income is 139 and the police expenditure per
person is 69 will lie in between 1.03128 and 4.01023.
Prediction Interval
Also, we can conclude with 95% confidence that an actual average number of crime rate in 47 states
when median income is 139 and the police expenditure per person is 69 will lie in between -2.2342 and
7.00123.
7. Final Discussion
Our project was focused on the possible relationship between the crime rate in 47 states of the USA and
four factors unemployment rate, number of families per 1000, state population size and police
expenditure. We collected total of 34 data and using multiple regression we found out that police
expenditure per person is the most useful predictor variable because it showed that police expenditure per
person has the most likely relationship with the crime rate.
A preliminary model was fit from our collected data as follows: crime rate= β0+β1unemployment_rate+
β2median_income+β3state_population+β4police_expenditure, where β0, β1, β2, β3, β4 are the model
parameters. Then we checked the constant variance and normality. Variances were constant but the model
did not satisfy the normality. Moreover we checked for the outliers, leverage, influence and variance
inflation. We did not have adequate reason to remove any outliers because there were no data points that
are unusual. Also we found that there was no serious multicollinearity problem because our max VIF
value was 2.59828. Then we did log transformation and re-checked model assumptions.
We then explored the interactions using partial regression. Looking at all the six interaction terms from
figure 16 to figure 21, interaction terms between X1X4, X2X3, X2X4 and X3X4 were the possible useful
interaction terms that can be added to our model since they have linear trend and have significant effects
on our response variable (crime rate).
We then obtained two potentially “best” models by using best subsets, backwards deletion and stepwise
regression model search techniques. We found that the potentially best models are 3 variables model and
4 variables model.
After that for each model we verified the model assumptions and checked the diagnostics as well. The
first best model (model with 3 variables) was chosen as the best overall model comparing it has higher R-
square values and adjusted R-square values, lower SSE, MSE, AIC and SBC values. Also, CP value is
close to p value in this case.
Then we present our final model as: Log crime rate = 0.49799 + 0.00397median income + 0.00837police
expenditure + 0.08940std median income police expenditure – 0.02166stdstate population police
expenditure. We found that the R-square value of our overall best model was 0.7750 which represents the
proportion of variability that is explained by our model; thus we can say that our regression model
accounts for 77.50% of the variability in the observations of the crime rate in 47 states of the USA.

Page | 31
For Further analysis, we can add some more predictor variables like mean number of years of schooling
or labor force participation rate to see if the addition of this variables can help explain more variations in
our data or not. We can also check their respective relationship with our predictor variable (the crime rate
in 47 states of the USA).

MLR Project (Onion)

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (15)

Similar a MLR Project (Onion)

Similar a MLR Project (Onion) (20)

Más de Chawal Ukesh

Más de Chawal Ukesh (10)

MLR Project (Onion)