South Dakota State University
STAT 786, Fall 2016
Project 1
ASSOCIATION OF INSULIN WITH CHEMICAL AND
DEMOGRAPHIC VARIABLES
Authors:
1. Acharya, Subash
2. Khan, Riaz
3. Shrestha, Mahesh
4. Suehring, Aaron
INTRODUCTION
The body breaks down carbohydrates from food into glucose, a form of sugar
that tissues use for energy (USDHHS 2014a). Once in the bloodstream, glucose
can be absorbed by organs and tissues only with the help of insulin, a hormone
made by the pancreas that regulates blood glucose levels as part of metabolism
(Joshi et al. 2007; USDHHS 2014a). Glucose accumulates in the bloodstream if
the pancreas cannot secrete enough insulin (Mayo Clinic). Prolonged periods of
elevated blood glucose can lead to prediabetes or diabetes, kidney damage, and
nerve damage (USDHHS 2014a).
Using data from the National Health and Nutrition Examination Survey
(NHANES), our objective was to determine whether insulin in the bloodstream
(pmol/L) was related to a group of chemical predictor variables measured in
the urine and to demographic predictor variables, for a sample of 974
individuals. The chemical predictor variables we examined were urinary iodine,
urinary creatinine, urinary perchlorate, urinary nitrate, and urinary
thiocyanate. The demographic variables were gender, age, and household income.
Iodine (μg/L), measured in the urine as a quantitative variable, is an element
that must come from the diet—it cannot be made in the body (WebMD). Iodine is
needed by the thyroid gland to produce hormones. Iodine has been shown to have a
negative correlation with blood glucose and insulin (Al-Attas et al. 2012). Further,
iodine has been associated with insulin resistance in people with type 2 diabetes (Al-
Attas et al. 2012). Insulin resistance is a condition where organs and body tissue do
not respond to insulin, and therefore cannot easily absorb glucose from the
bloodstream. Thus, the body needs to produce more insulin (USDHHS 2014b).
Creatinine (mg/dL), also measured in the urine as a quantitative variable, is a
waste product of muscle metabolism that is filtered by the kidneys and excreted
in the urine (WebMD 2015). Creatinine testing is a common measure of kidney
health, and high levels of creatinine in the blood can indicate kidney failure
in patients with diabetes (National Kidney Foundation; WebMD 2015).
Perchlorate (ng/mL), measured in the urine as a quantitative variable, is a
chemical commonly found in the environment. Exposure occurs through the
ingestion of food or water that contains perchlorate (ATSDR 2015). Perchlorate
in the body is associated with the inhibition of iodide uptake by the thyroid,
which can lead to hypothyroidism (Blount et al. 2006; ATSDR 2015). Similarly,
exposure to nitrate (ng/mL), measured in the urine as a quantitative variable,
is primarily linked to the ingestion of food or water that contains nitrates
(CDC 2013). Like perchlorate, nitrate is known to disrupt thyroid function by
inhibiting iodide uptake (CDC 2013). Thiocyanate (ng/mL), also measured in the
urine as a quantitative variable, is another compound known to inhibit iodide
uptake by the thyroid. Exposure to thiocyanate occurs through cigarette smoke
and some Brassica vegetables (Leung et al. 2014). Like perchlorate and nitrate,
thiocyanate could lead to lower thyroid hormone production (Steinmaus et al.
2013).
The demographic variables included in the analysis were gender, age, and
household income. Gender, a qualitative variable coded as 1 for male and 2 for
female, was included for its possible relationship with insulin resistance. Given that
increased body fat has been associated with insulin resistance (USDHHS 2014b),
higher levels of visceral and hepatic adipose tissue in men may contribute to higher
levels of insulin resistance (Geer & Shen 2009).
Aging has also been associated with increased body fat and increased levels
of insulin resistance (Ryan 2000). Therefore, age, measured as a quantitative
variable, was included in the analysis. Elderly people were found to be more
prone to insulin resistance, although it is uncertain whether the cause is
biological or environmental (e.g., decreased physical activity, weight gain)
(Refaie et al. 2006).
Lastly, household income, coded in qualitative bins, was included in the
analysis. Proximity to resources in high-income areas has been shown to be
related to insulin resistance (Auchincloss et al. 2007). A potential
explanation is that grocery stores in poorer areas have less-healthy food
options (Horowitz et al. 2004). Alternatively, in a low-income, poorly educated
population of patients with diabetes, 48% reported an unwillingness to use
insulin, a phenomenon known as psychological insulin resistance (Machinani et
al. 2013). This may result in a relationship between lower-income households
and lower levels of insulin.
A previous study by Blount et al. (2006) used NHANES data to
examine the relationship between perchlorate and thyroid hormone levels. The
covariates incorporated in their analysis that were also examined in ours include
urinary creatinine, urinary iodine, urinary nitrate, urinary thiocyanate, and age. They
found a relationship between perchlorate and thyroid hormone production in women;
however, no such significant relationship was found in men (Blount et al. 2006). A
study by Steinmaus et al. (2013) also examined data from NHANES. They looked for
an interaction effect between perchlorate, thiocyanate, and iodine on thyroid hormone
levels. They found greater effects when all three variables were assessed together
than when they were examined separately (Steinmaus et al. 2013).
EXPLORATORY DATA ANALYSIS
Figure 1 is a scatter plot matrix of the data, with the response in the
leftmost column. The smoothed curves suggest that creatinine has some
predictive power. However, the corresponding scatter plot shows that the
variability is not homogeneous, which implies that a transformation of the data
may be needed; this is addressed later. The other variables appear to have very
low predictive power.
Figure 1: Scatter plot of the data
ANALYSIS
Data Formatting
Four files containing the data needed to model insulin level in the blood
were acquired from NHANES. In those files, each respondent was given a unique
sequence number. The final dataset was created by merging the four datasets on
the common sequence number and extracting the desired response and predictor
variables. In total, 974 sequence numbers had data for all the variables to be
included in the analysis. The final dataset was then randomly divided into two
equal parts: a train dataset and a test dataset. The train dataset was used to
perform the initial analysis, whereas the test dataset was used for model
validation.
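The merge-and-split step described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' R code; the toy records, the column names, and the helper names `merge_on_seqn` and `split_in_half` are assumptions made for the example, not taken from the NHANES files.

```python
import random

def merge_on_seqn(*tables):
    """Inner-join several {seqn: {column: value}} tables on the sequence number."""
    common = set(tables[0])
    for table in tables[1:]:
        common &= set(table)          # keep only respondents present in every file
    merged = {}
    for seqn in sorted(common):
        row = {}
        for table in tables:
            row.update(table[seqn])   # collect the columns from each file
        merged[seqn] = row
    return merged

def split_in_half(merged, seed=0):
    """Randomly divide the merged records into train and test halves."""
    ids = sorted(merged)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    train = {s: merged[s] for s in ids[:half]}
    test = {s: merged[s] for s in ids[half:]}
    return train, test

# Toy stand-ins for the separate files (hypothetical values).
insulin = {1: {"insulin": 60.0}, 2: {"insulin": 45.5}, 3: {"insulin": 80.2}, 4: {"insulin": 52.1}}
urine = {1: {"creatinine": 120.0}, 2: {"creatinine": 95.0}, 4: {"creatinine": 150.0}, 5: {"creatinine": 70.0}}
demo = {1: {"gender": 1}, 2: {"gender": 2}, 3: {"gender": 2}, 4: {"gender": 1}}

merged = merge_on_seqn(insulin, urine, demo)
train, test = split_in_half(merged, seed=1)
print(sorted(merged), len(train), len(test))
```

Only respondents that appear in every file survive the join, which is why the usable sample can be smaller than any single file.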
Insulin level was selected as the response variable, and urinary iodine,
urinary creatinine, urinary perchlorate, urinary nitrate, urinary thiocyanate,
gender, age, and household income were treated as the predictor variables. In
total, eight predictor variables were selected for estimating the insulin level
in the blood. Two of them, gender and household income, were qualitative, and
the remaining six were quantitative.
Initial Model Fitting and Diagnostics
As an initial analysis of the data, a linear model was fitted to the train
data using all the variables, and its diagnostic plots were analyzed to check
whether the assumptions of the linear model were met.
Figure 2: Diagnostics plots, full model, train data
A plot of residuals vs. fitted values for this fitted model suggested that the
variance of the error is not constant. Furthermore, the normal Q-Q plot of the
residuals suggested some departure from normality of the errors. We concluded
from the diagnostic plots that the assumptions of constant error variance and
normality of the error terms were violated. In addition, the estimated
intercept and coefficients for the different predictor variables spanned a wide
range, so the predictor variables were rescaled. Urinary iodine, urinary
creatinine, and urinary thiocyanate were divided by 100, whereas urinary
nitrate was divided by 1000. These scaling factors were chosen to bring all the
predictor variables onto a common scale.
Figure 3: Graphical representation of Box-Cox transformation
As previously stated, there appears to be unequal error variance and
non-normality of the error terms. Given the shape and spread of the
distribution of the response variable, transforming the response is an
appropriate remedy. The Box-Cox procedure was used to determine the best
transformation to apply. The R function boxcox(), from the MASS package,
profiles the log-likelihood over a range of powers lambda. Figure 3 shows this
range and indicates that the optimal lambda is near zero, so it is reasonable
to set lambda to zero, which corresponds to a log transformation of the
response variable. A linear model was again fitted to the data using all the
predictor variables and the log-transformed response variable, and the
diagnostic plots (Figure 4) were analyzed to check whether the transformation
was appropriate. The plot of residuals versus fitted values showed that the
constant error variance assumption was met, and the normality plot indicated
that the residuals were approximately normally distributed.
Figure 4: Diagnostics of full model after transformation, train data
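To make the Box-Cox step concrete, here is a minimal Python sketch of the profile log-likelihood that a Box-Cox plot displays. It is not the authors' R code: the data are synthetic, and the function name `boxcox_profile_loglik` and all variable names are assumptions for illustration. The response is simulated so that its logarithm is linear in the predictor, so the maximizing lambda should land near zero.

```python
import numpy as np

def boxcox_profile_loglik(X, y, lambdas):
    """Profile log-likelihood of the Box-Cox power for the linear model y ~ X."""
    n = len(y)
    log_y = np.log(y)
    loglik = []
    for lam in lambdas:
        # Box-Cox transform; the limit as lambda -> 0 is the natural log.
        z = log_y if abs(lam) < 1e-12 else (y**lam - 1.0) / lam
        beta, *_ = np.linalg.lstsq(X, z, rcond=None)
        sse = np.sum((z - X @ beta) ** 2)
        # Gaussian log-likelihood (up to a constant) plus the Jacobian term.
        loglik.append(-0.5 * n * np.log(sse / n) + (lam - 1.0) * log_y.sum())
    return np.array(loglik)

# Synthetic right-skewed response (illustrative only, not the NHANES data).
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0.5, 3.0, n)
X = np.column_stack([np.ones(n), x])
y = np.exp(1.0 + 0.4 * x + rng.normal(0.0, 0.6, n))

lambdas = np.linspace(-2.0, 2.0, 41)
loglik = boxcox_profile_loglik(X, y, lambdas)
best_lambda = lambdas[np.argmax(loglik)]
print(f"lambda maximizing the profile log-likelihood: {best_lambda:.1f}")
```

When the maximizing lambda is close to zero, rounding it to zero and using the log transformation keeps the model easy to interpret.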
Multicollinearity Check
To check for serious multicollinearity in our data, we used the variance
inflation factor (VIF) criterion. Each predictor $X_k$ is regressed on the rest
of the predictors, and its VIF is $\mathrm{VIF}_k = (1 - R_k^2)^{-1}$, where
$R_k^2$ is the coefficient of multiple determination for the regression of
$X_k$ on the other predictors. The suggested cutoff for serious
multicollinearity is a VIF of 10 (Kutner et al. 2004). For our data, the
highest VIF was 1.76, suggesting there is not a serious multicollinearity
problem. Table 1 shows the VIF values and the correlations between the
predictor variables of the data.
Table 1: VIF and Correlation of the training data
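The VIF calculation can be sketched directly from its definition. The following Python example is an illustration under assumptions: the three synthetic predictors and the function name `vif` are hypothetical and are not the study variables.

```python
import numpy as np

def vif(X):
    """VIF of each column of X, an n-by-p predictor matrix without an intercept column."""
    n, p = X.shape
    out = np.empty(p)
    for k in range(p):
        target = X[:, k]
        # Regress predictor k on an intercept plus all the other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, k, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[k] = 1.0 / (1.0 - r2)
    return out

# Synthetic predictors: x1 and x2 are moderately correlated, x3 is independent.
rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
x3 = rng.normal(size=n)
predictors = np.column_stack([x1, x2, x3])

vifs = vif(predictors)
print(np.round(vifs, 2))
```

The same values can be read off the diagonal of the inverse of the predictor correlation matrix, which gives a convenient cross-check.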
Model selection
In total, we fit four models. Model 1 included all the variables except age
and income. When the model was fitted with all the predictors, these two terms
were not significant at any conventional significance level (up to 0.10).
Additionally, when the response was regressed on each predictor separately,
these two predictors did not show a significant linear relationship. Model 2
was selected by stepwise selection based on the Akaike information criterion
(AIC). Model 3 is a simpler model, derived from model 2 by deleting the
variable with the highest p-value. Model 4 is a further simplification in which
one more variable, again the one with the highest p-value, was deleted from
model 3.
Table 2: Linear predictors of each of the models considered
Figures 5 and 6 show the residual plots and the normal Q-Q plots for all the
proposed models. From visual inspection, the residuals appear to meet the
normality and constant variance assumptions. However, the Q-Q plots suggest
that there are some outlying observations in the data. Further analysis to
identify influential points is discussed later.
Figure 5: Residual plot of four candidate models, training data
Figure 6: Q-Q plot of four candidate models, training data
With Influential Observations
Table 3 shows the regression results for the candidate models based on the
training and validation datasets, with influential and outlying observations
included. For each of the four candidate models, we recorded the point
estimates for the intercept and the slope of each predictor variable, along
with the corresponding standard errors. We also color-coded the intercept and
slope estimates based on each term's level of significance. We performed a
lack-of-fit test for all four models to determine the significance of each
model. Given that the p-value for every model was significant (p-value < 0.05),
we treated all four as competitive models and calculated several model
selection statistics to determine the best one.
Table 3: Regression results for candidate models based on training and validation
dataset
We calculated the AIC value for each model using
$\mathrm{AIC} = 2k - 2\ln(L)$, where $k$ is the number of estimated parameters
and $L$ is the maximum value of the likelihood function. Model 2 produced the
lowest AIC value of the four models, and was therefore considered the best
model using this criterion. We compared the SSE values for the four models
using $\mathrm{SSE} = \sum_i (Y_i - \hat{Y}_i)^2$, where $Y_i$ is the $i$th
observation and $\hat{Y}_i$ is the predicted value for the $i$th observation.
Although model 1 and model 2 produced almost identical SSE values, model 1
produced the lowest SSE and was therefore considered the best model using this
criterion.
We compared the MSE values for the four models using
$\mathrm{MSE} = \sum_i (Y_i - \hat{Y}_i)^2 / (n - p)$, where $n$ is the number
of observations and $p$ is the number of parameters to be estimated. All four
models produced similar values; however, model 2 produced the lowest MSE and
was therefore considered the best model. We calculated and compared the PRESS
statistic for all four models using
$\mathrm{PRESS} = \sum_i (Y_i - \hat{Y}_{i(i)})^2$, where $\hat{Y}_{i(i)}$ is
the prediction of the $i$th value with the $i$th observation removed. The PRESS
criterion measures how well the fitted values of a given subset model can
predict the observed response values (Kutner et al. 2004). Although models 2,
3, and 4 were all very similar, model 4 produced the lowest PRESS value and was
therefore considered the best model using this criterion. Next, we calculated
Mallows' $C_p$ for all four models using
$C_p = \mathrm{SSE}_p / \mathrm{MSE}_{\text{full}} - n + 2p$, where
$\mathrm{SSE}_p$ is the error sum of squares of the candidate model with $p$
parameters and $\mathrm{MSE}_{\text{full}}$ is the residual mean square of the
full model including all the potential predictors. Model 2 produced the
smallest $C_p$ value, and was therefore considered the best model. Lastly, we
calculated and compared the MSPR for all four models using
$\mathrm{MSPR} = \sum_{i=1}^{n^*} (Y_i - \hat{Y}_i)^2 / n^*$, where $Y_i$ and
$\hat{Y}_i$ are the observed responses and their point predictions in the
validation dataset, respectively, and $n^*$ is the number of cases in the
validation dataset. Although models 3 and 4 produced very similar results,
model 3 produced the lowest MSPR and was therefore considered the best model.
All four models were then applied to the validation dataset, and the results
were recorded in Table 3. The results show that model 1 was the best model;
however, three of the predictor variables were not significant.
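The six criteria described above can be computed for an ordinary least squares fit as in the following Python sketch. It is an illustration, not the authors' R code. The synthetic data and all names are assumptions. AIC is computed from the Gaussian likelihood with the error variance counted as a parameter, which is one common convention. $C_p$ uses the candidate model's SSE and the full model's MSE, as in Kutner et al. (2004).

```python
import numpy as np

def fit_ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def selection_criteria(X, y, X_full, X_val, y_val):
    """SSE, MSE, AIC, PRESS, Cp on training data and MSPR on validation data."""
    n, p = X.shape
    beta = fit_ols(X, y)
    resid = y - X @ beta
    sse = float(resid @ resid)
    mse = sse / (n - p)
    # Maximized Gaussian log-likelihood; k counts the p coefficients plus sigma^2.
    loglik = -0.5 * n * (np.log(2.0 * np.pi) + np.log(sse / n) + 1.0)
    aic = 2 * (p + 1) - 2 * loglik
    # Leverages h_ii from the thin QR factorization, then PRESS without refitting.
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)
    press = float(np.sum((resid / (1.0 - h)) ** 2))
    # Mallows' Cp: candidate SSE scaled by the MSE of the full model.
    resid_full = y - X_full @ fit_ols(X_full, y)
    mse_full = float(resid_full @ resid_full) / (n - X_full.shape[1])
    cp = sse / mse_full - (n - 2 * p)
    mspr = float(np.mean((y_val - X_val @ beta) ** 2))
    return {"SSE": sse, "MSE": mse, "AIC": aic, "PRESS": press, "Cp": cp, "MSPR": mspr}

rng = np.random.default_rng(2)

def make_data(n):
    creatinine = rng.uniform(0.2, 3.0, n)          # hypothetical scaled predictor
    gender = rng.integers(1, 3, n).astype(float)   # coded 1 or 2
    irrelevant = rng.normal(size=n)                # predictor with no true effect
    y = 3.8 + 0.15 * creatinine + 0.1 * gender + rng.normal(0.0, 0.7, n)
    return np.column_stack([np.ones(n), creatinine, gender, irrelevant]), y

X_full_tr, y_tr = make_data(60)
X_full_va, y_va = make_data(60)
cols = [0, 1, 2]                                   # candidate model drops the last column
crit = selection_criteria(X_full_tr[:, cols], y_tr, X_full_tr, X_full_va[:, cols], y_va)
crit_full = selection_criteria(X_full_tr, y_tr, X_full_tr, X_full_va, y_va)
print({name: round(value, 3) for name, value in crit.items()})
```

For the full model, $C_p$ equals its number of parameters by construction, which is a quick sanity check on the formula.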
From the results presented in Table 3, model 2 seems to be an appealing
choice, as it has the lowest AIC and Mallows' $C_p$ values. The SSE, MSE, and
PRESS statistics do not vary much across the four models. However, judging by
the MSPR value, models 3 and 4 look to be better choices than the other two.
Because the different model selection criteria in Table 3 do not agree, we
suspected the presence of influential points and therefore applied measures to
detect influential observations.
Up to this point, none of our data had been screened for influential
observations. Figures 4 and 5 show some observations whose residuals deviate
strongly from the mean. This could be a direct consequence of potential
outliers and influential points in the data. A point may be an outlier in terms
of the predictor variables without being an outlier in terms of the response
variable. Conversely, a point may be an outlier in terms of the response
variable without being an outlier in terms of the predictor variables. Further,
a point may be an outlier in terms of both the response and the predictor
variables. In each of these cases, an outlying point does not necessarily
influence the regression line. Therefore, it was necessary to assess the data
for influential points.
We used two measures to assess the presence of influential points: DFFITS
and Cook's Distance. The DFFITS value for the $i$th case is given by
$(\mathrm{DFFITS})_i = t_i \left( h_{ii} / (1 - h_{ii}) \right)^{1/2}$, where
$t_i$ is the studentized deleted residual and the second term of the product is
the leverage factor of the observation. An observation with a high absolute
DFFITS value is identified as an influential point per this criterion. We used
$2\sqrt{p/n}$ as the threshold value (Kutner et al. 2004).
DFFITS considers the influence of the $i$th case on the fitted value
$\hat{Y}_i$, while Cook's Distance measures the influence of the $i$th case on
all $n$ fitted values. Cook's Distance for the $i$th case is calculated using
$D_i = \frac{e_i^2}{p \, \mathrm{MSE}} \times \frac{h_{ii}}{(1 - h_{ii})^2}$,
where $e_i$ is the $i$th residual. Thus, a high Cook's Distance results from a
large residual, a high leverage value, or both. A higher $D_i$ indicates a
higher degree of influence of the $i$th case on the fitted values. We used a
Cook's Distance cutoff value of 4/n to identify influential points, where n is
the number of observations (Introduction to SAS). It should be noted that the
influential points identified by both criteria depend on the model itself, as
the calculation involves $h_{ii}$, the $i$th diagonal element of the hat
matrix. We combined the influential points flagged by both measures and removed
them from the data for the corresponding model. This process was repeated for
each of the four models.
Figure 7: DFFIT and Cook’s Distance plot for model 1
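The two influence measures can be sketched from quantities of a single least squares fit. The Python example below is illustrative only. It uses synthetic data with one planted outlier, and the names `influence_measures` and `flag_influential` are hypothetical. The cutoffs coded here, $2\sqrt{p/n}$ for absolute DFFITS and 4/n for Cook's Distance, are the guidelines discussed above and are treated as assumptions.

```python
import numpy as np

def influence_measures(X, y):
    """DFFITS and Cook's Distance for every case of an OLS fit of y on X."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)                 # diagonal of the hat matrix
    sse = e @ e
    mse = sse / (n - p)
    # Studentized deleted residuals, computed without refitting n times.
    t = e * np.sqrt((n - p - 1) / (sse * (1.0 - h) - e**2))
    dffits = t * np.sqrt(h / (1.0 - h))
    cooks = e**2 / (p * mse) * h / (1.0 - h) ** 2
    return dffits, cooks

def flag_influential(X, y):
    """Flag a case if either measure exceeds its cutoff."""
    n, p = X.shape
    dffits, cooks = influence_measures(X, y)
    return (np.abs(dffits) > 2.0 * np.sqrt(p / n)) | (cooks > 4.0 / n)

# Synthetic line with one planted high-leverage outlier at index 0.
rng = np.random.default_rng(3)
n = 50
x = rng.uniform(0.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, n)
x[0], y[0] = 3.0, 13.0
X = np.column_stack([np.ones(n), x])

dffits, cooks = influence_measures(X, y)
flagged = flag_influential(X, y)
print("flagged cases:", np.flatnonzero(flagged))

# Refit without the flagged cases, mirroring the removal step in the text.
beta_clean, *_ = np.linalg.lstsq(X[~flagged], y[~flagged], rcond=None)
```

Because the leverages depend on the design matrix, the flagged set has to be recomputed for each candidate model.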
Figure 7 shows the influential points identified by these measures for model 1.
This was repeated for all four proposed models. The models were fitted again without
these influential points. Table 4 summarizes the regression results for the candidate
models based on the datasets after removing the influential points.
Table 4: Regression results after removing influential points
Looking at the results presented in Table 4, we see that model 4 produces the
best model selection values in terms of AIC, SSE, MSE, PRESS, and $C_p$. Five
out of the six criteria we considered for model evaluation indicate that model
4 is the best model. The remaining statistic for model 4, MSPR, is only
slightly above the lowest of all the models. These results are consistent for
the validation data as well, with the exception of one statistic. Therefore, we
chose model 4 for this study. This model is supported by the statistics and is
very simple, as it includes only two predictor variables. For this chosen
model, we ran a multiple comparison test to check whether all the coefficients
were significantly different from zero. The R function glht(), from the
multcomp package, was used for this purpose. This function takes the null
hypothesis in the form $H_0: K\theta = m$ (Hothorn et al. 2016). The default
choice of $m$ is $0$, and $K$ was specified as a diagonal matrix of size 3.
Table 5 summarizes the results of the multiple comparison, and they support our
choice of model 4.
Table 5: Simultaneous inference results
Robust regression
To investigate whether our choice of model, made after removing the
influential points, holds up, we also fitted the models by robust regression.
Robust regression dampens the effect of influential cases and safeguards
against their influence (Kutner et al. 2004). The R function rlm(), from the
MASS package, was used for this purpose. By default, this function fits the
model by iteratively reweighted least squares (IRLS) (Yegorov 2016), using
Huber's weight function and rescaling the residuals by the median absolute
deviation (MAD). Starting from the residuals of a least squares regression, it
obtains the case weights and fits the model again using weighted least squares.
The weights are re-estimated after each iteration until convergence is obtained
(Kutner et al. 2004).
Table 6: Results from robust regression
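The iteratively reweighted least squares idea can be sketched in a few lines. The Python example below is a simplified illustration with Huber weights and a MAD-based scale; it is not a reimplementation of rlm(). The tuning constant 1.345, the convergence rule, the synthetic data, and the name `huber_irls` are assumptions chosen for the example.

```python
import numpy as np

def huber_irls(X, y, k=1.345, tol=1e-8, max_iter=100):
    """Robust linear fit by IRLS with Huber weights and a MAD scale estimate."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # least squares starting values
    w = np.ones(len(y))
    for _ in range(max_iter):
        resid = y - X @ beta
        scale = np.median(np.abs(resid)) / 0.6745    # MAD-based residual scale
        if scale == 0.0:
            break
        u = np.abs(resid) / scale
        w = k / np.maximum(u, k)                     # Huber weight: 1 if u <= k, else k/u
        sw = np.sqrt(w)
        # Weighted least squares via rescaled rows.
        new_beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta, w

# Synthetic line (true slope 3) with ten gross outliers at the largest x values.
rng = np.random.default_rng(4)
n = 100
x = rng.uniform(0.0, 1.0, n)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.2, n)
outliers = np.argsort(x)[-10:]
y[outliers] += 15.0
X = np.column_stack([np.ones(n), x])

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_rob, weights = huber_irls(X, y)
print("OLS slope:", round(float(beta_ols[1]), 2), "robust slope:", round(float(beta_rob[1]), 2))
```

Cases with large scaled residuals receive weights below one, so they pull on the fit less than they would under ordinary least squares.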
Table 6 summarizes the results obtained from robust regression. The test and
training data give similar coefficient values and standard errors. Further,
these results are consistent with the ordinary least squares results. Given
that robust regression is much less sensitive to influential observations, this
agreement indicates that the data used for the ordinary least squares analysis
had been adequately screened for influential observations.
Regression Tree
We also fitted a regression tree, a simple yet powerful nonparametric
regression technique (Kutner et al. 2004). Applying this method to the training
dataset without the influential cases gave MSE, PRESS, and MSPR values of 0.44,
203.36, and 0.61, respectively, which supports our choice of model.
CONCLUSION
Based on our analysis, we select model 4 as our final model; that is, the
blood insulin level can be modeled as a linear function of urinary creatinine
level and gender. However, the coefficient of the gender variable has opposite
signs in the training and the validation data. This holds for all four models
and for the robust regression (Table 6). It appears to be a consequence of the
random split of the full dataset into train and test data, as Figure 8
illustrates. The boxplot of the train data shows that females have a higher
insulin level on average, whereas the test data tell a different story. The
full data agree with the train dataset. Therefore, we rely on the coefficients
from the training dataset when interpreting the effect of gender on insulin
level.
Figure 8: Boxplot of response based on gender
The model we chose to describe the relationship of blood insulin level with
urinary creatinine level and gender was found to be statistically significant.
However, it had very low predictive power, as indicated by the low coefficient
of determination ($R^2$ less than 5%). Figure 9 shows the original values and
the predicted values along with the 95% confidence and prediction bands. This
figure indicates that the model has little practical application because the
prediction and confidence bands are similar across the entire range of the
data. This emphasizes the importance of examining the practical usefulness of a
statistically significant model.
Figure 9: Original and predicted values of training data
REFERENCES
Al-Attas, O. S., Al-Daghri, N. M., Alkharfy, K. M., Alokail, M. S., Al-Johani, N. J., Abd-
Alrahman, S. H., Yakout, S. M., Draz, H. M., & Sabico, S. (2012). Urinary iodine
is associated with insulin resistance in subjects with diabetes mellitus type 2.
Experimental and Clinical Endocrinology & Diabetes, 120(10), 618-622.
ATSDR (Agency for Toxic Substances and Disease Registry), Division of Toxicology
and Environmental Medicine. (2015). Public health statement: Perchlorates.
https://www.atsdr.cdc.gov/ToxProfiles/tp162-c1-b.pdf. Accessed 4 Dec. 2016.
Auchincloss, A. H., Roux, A. V. D., Brown, D. G., O'Meara, E. S., & Raghunathan, T.
E. (2007). Association of insulin resistance with distance to wealthy areas: the
Multi-Ethnic Study of Atherosclerosis. American Journal of Epidemiology,
165(4), 389-397.
Blount, B. C., Pirkle, J. L., Osterloh, J. D., Valentin-Blasini, L., & Caldwell, K. L. (2006).
Urinary perchlorate and thyroid hormone levels in adolescent and adult men
and women living in the United States. Environmental Health Perspectives,
1865-1871.
CDC (Centers for Disease Control and Prevention). (2013). National Health and
Nutrition Examination Survey. https://wwwn.cdc.gov/nchs/nhanes/2011-
2012/PERNTS_G.htm. Accessed 4 Dec. 2016.
Geer, E. B., & Shen, W. (2009). Gender differences in insulin resistance, body
composition, and energy balance. Gender Medicine, 6, 60-75.
Horowitz, C. R., Colson, K. A., Hebert, P. L., & Lancaster, K. (2004). Barriers to buying
healthy foods for people with diabetes: evidence of environmental disparities.
American Journal of Public Health, 94(9), 1549-1554.
Hothorn, T., Bretz, F., Westfall, P., Heiberger, R. M., Schuetzenmeister, A., Scheibe, S.
(2016). Simultaneous Inference in General Parametric Models. https://cran.r-
project.org/web/packages/multcomp/multcomp.pdf. Accessed 10 Dec. 2016.
Joshi, S. R., Parikh, R. M., & Das, A. K. (2007). Insulin-history, biochemistry,
physiology and pharmacology. Journal-Association of Physicians of India,
55(L), 19.
Kutner, M. H., Nachtsheim, C. J., Neter, J. (2004). Applied Linear Regression Models
(4th ed.). McGraw-Hill Irwin.
Leung, A. M., Katz, P. M., He, X., Feig, D. S., Pearce, E. N., & Braverman, L. E. (2014).
Urinary perchlorate and thiocyanate concentrations in pregnant women from
Toronto, Canada. Thyroid, 24(1), 175-176.
Machinani, S., Bazargan-Hejazi, S., & Hsia, S. H. (2013). Psychological insulin
resistance among low-income, US racial minority patients with type 2 diabetes.
Primary Care Diabetes, 7(1), 51-55.
Mayo Clinic. Diabetes treatment: Using insulin to manage blood sugar.
http://www.mayoclinic.org/diseases-conditions/diabetes/in-depth/diabetes-
treatment/art-20044084. Accessed 2 Dec 2016.
National Kidney Foundation. Diabetes – a major risk factor for kidney disease.
https://www.kidney.org/atoz/content/diabetes. Accessed 2 Dec. 2016.
Ryan, A. S. (2000). Insulin resistance with aging. Sports Medicine, 30(5), 327-346.
Refaie, M. R., Sayed-Ahmed, N. A., Bakr, A. M., Aziz, M. Y. A., El Kannishi, M. H., &
Abdel-Gawad, S. S. (2006). Aging is an Inevitable Risk Factor for Insulin
Resistance. Journal of Taibah University Medical Sciences, 1(1), 30-41.
Steinmaus, C., Miller, M. D., Cushing, L., Blount, B. C., & Smith, A. H. (2013).
Combined effects of perchlorate, thiocyanate, and iodine on thyroid function in
the National Health and Nutrition Examination Survey 2007–08. Environmental
Research, 123, 17-24.
Introduction to SAS. UCLA: Statistical Consulting Group.
http://www.ats.ucla.edu/stat/sas/dae/rreg.htm. Accessed 9 Dec. 2016.
USDHHS (United States Department of Health and Human Services), National
Institute of Diabetes and Digestive and Kidney Diseases. (2014a). Causes of
diabetes. https://www.niddk.nih.gov/health-information/diabetes/causes.
Accessed 5 Dec. 2016.
USDHHS (United States Department of Health and Human Services), National
Institute of Diabetes and Digestive and Kidney Diseases (2014b). Prediabetes
and insulin resistance. https://www.niddk.nih.gov/health-
information/diabetes/types/prediabetes-insulin-resistance. Accessed 5 Dec.
2016.
WebMD. Iodine. http://www.webmd.com/vitamins-supplements/ingredientmono-35-
iodine.aspx?activeingredientid=35. Accessed 2 Dec 2016.
WebMD. (2015). Creatinine and creatinine clearance blood tests.
http://www.webmd.com/a-to-z-guides/creatinine-and-creatinine-clearance-
blood-tests - 1. Accessed 2 Dec. 2016.
Yegorov, O. (2016). Robust Fitting of Linear Models.
http://stat.ethz.ch/R-manual/R-devel/library/MASS/html/rlm.html. Accessed 10
Dec. 2016.
APPENDIX
Part of the Data
R code: Submitted as separate .R files.

CONTROL OF BLOOD GLUCOSE IN DIABETIC PATIENTS
 
Stability Of Peptides And Proteins
Stability Of Peptides And ProteinsStability Of Peptides And Proteins
Stability Of Peptides And Proteins
 
Protein and peptide d d s
Protein and peptide d d sProtein and peptide d d s
Protein and peptide d d s
 
biochemistry of Insulin and diabetes
biochemistry of Insulin and diabetes biochemistry of Insulin and diabetes
biochemistry of Insulin and diabetes
 
Insulin
InsulinInsulin
Insulin
 
Insulin
InsulinInsulin
Insulin
 
Insulin and its mechanism of action
Insulin and its mechanism of actionInsulin and its mechanism of action
Insulin and its mechanism of action
 
Insulin presentation
Insulin presentationInsulin presentation
Insulin presentation
 
Diabetes Mellitus
Diabetes MellitusDiabetes Mellitus
Diabetes Mellitus
 

Similar a AMRS_Project_1_Report

(c) 8 What volume of 0.150 M FeCl3(aq) solution i.docx
(c)    8 What volume of 0.150 M FeCl3(aq) solution i.docx(c)    8 What volume of 0.150 M FeCl3(aq) solution i.docx
(c) 8 What volume of 0.150 M FeCl3(aq) solution i.docx
katherncarlyle
 
ARTICLENight-shift work and incident diabetes among Africa.docx
ARTICLENight-shift work and incident diabetes among Africa.docxARTICLENight-shift work and incident diabetes among Africa.docx
ARTICLENight-shift work and incident diabetes among Africa.docx
davezstarr61655
 
ObesityNutritionObesity and iron deficiency a quantitati.docx
ObesityNutritionObesity and iron deficiency a quantitati.docxObesityNutritionObesity and iron deficiency a quantitati.docx
ObesityNutritionObesity and iron deficiency a quantitati.docx
cherishwinsland
 
Paper - analysis of primary literature
Paper - analysis of primary literaturePaper - analysis of primary literature
Paper - analysis of primary literature
Laura Patriarca
 
Jenney et al 2016 Early avoidance
Jenney et al 2016 Early avoidanceJenney et al 2016 Early avoidance
Jenney et al 2016 Early avoidance
Teddy Uzamere
 
THE FIRST SYSTEM OF REFERENCE FOR THE MEDICAL PRACTICE OF HOMEOPATHY IN FRANCE
THE FIRST SYSTEM OF REFERENCE FOR THE MEDICAL PRACTICE OF HOMEOPATHY IN FRANCETHE FIRST SYSTEM OF REFERENCE FOR THE MEDICAL PRACTICE OF HOMEOPATHY IN FRANCE
THE FIRST SYSTEM OF REFERENCE FOR THE MEDICAL PRACTICE OF HOMEOPATHY IN FRANCE
home
 
Life-Science-Vol-3-3-14_2...
Life-Science-Vol-3-3-14_2...Life-Science-Vol-3-3-14_2...
Life-Science-Vol-3-3-14_2...
Masoumeh Hosseini
 
FINAL CURO 2016 Szymonik
FINAL CURO 2016 SzymonikFINAL CURO 2016 Szymonik
FINAL CURO 2016 Szymonik
Joanna Szymonik
 
ATT_1424628128019_90 HUSSEINI MASOOMEH......
ATT_1424628128019_90 HUSSEINI MASOOMEH......ATT_1424628128019_90 HUSSEINI MASOOMEH......
ATT_1424628128019_90 HUSSEINI MASOOMEH......
Masoumeh Hosseini
 
Systematic Reviews and Meta- and Pooled AnalysesEffects of.docx
Systematic Reviews and Meta- and Pooled AnalysesEffects of.docxSystematic Reviews and Meta- and Pooled AnalysesEffects of.docx
Systematic Reviews and Meta- and Pooled AnalysesEffects of.docx
ssuserf9c51d
 
Works Cited Milne, Anne C., Alison Avenell, and Jan Potter. Meta-.docx
Works Cited Milne, Anne C., Alison Avenell, and Jan Potter. Meta-.docxWorks Cited Milne, Anne C., Alison Avenell, and Jan Potter. Meta-.docx
Works Cited Milne, Anne C., Alison Avenell, and Jan Potter. Meta-.docx
keilenettie
 
Hypoglycaemia and improved testicular parameters in Sesamum radiatum treated ...
Hypoglycaemia and improved testicular parameters in Sesamum radiatum treated ...Hypoglycaemia and improved testicular parameters in Sesamum radiatum treated ...
Hypoglycaemia and improved testicular parameters in Sesamum radiatum treated ...
lukeman Joseph Ade shittu
 

Similar a AMRS_Project_1_Report (20)

(c) 8 What volume of 0.150 M FeCl3(aq) solution i.docx
(c)    8 What volume of 0.150 M FeCl3(aq) solution i.docx(c)    8 What volume of 0.150 M FeCl3(aq) solution i.docx
(c) 8 What volume of 0.150 M FeCl3(aq) solution i.docx
 
ARTICLENight-shift work and incident diabetes among Africa.docx
ARTICLENight-shift work and incident diabetes among Africa.docxARTICLENight-shift work and incident diabetes among Africa.docx
ARTICLENight-shift work and incident diabetes among Africa.docx
 
Central Lechera Asturiana, estudio de intervención Naturlinea
Central Lechera Asturiana, estudio de intervención Naturlinea Central Lechera Asturiana, estudio de intervención Naturlinea
Central Lechera Asturiana, estudio de intervención Naturlinea
 
Testosterone Concentrations in Women
Testosterone Concentrations in Women Testosterone Concentrations in Women
Testosterone Concentrations in Women
 
Does the Mediterranean diet predict longevity in the elderly? A Swedish persp...
Does the Mediterranean diet predict longevity in the elderly? A Swedish persp...Does the Mediterranean diet predict longevity in the elderly? A Swedish persp...
Does the Mediterranean diet predict longevity in the elderly? A Swedish persp...
 
ObesityNutritionObesity and iron deficiency a quantitati.docx
ObesityNutritionObesity and iron deficiency a quantitati.docxObesityNutritionObesity and iron deficiency a quantitati.docx
ObesityNutritionObesity and iron deficiency a quantitati.docx
 
Paper - analysis of primary literature
Paper - analysis of primary literaturePaper - analysis of primary literature
Paper - analysis of primary literature
 
Jenney et al 2016 Early avoidance
Jenney et al 2016 Early avoidanceJenney et al 2016 Early avoidance
Jenney et al 2016 Early avoidance
 
AAPA poster 2016
AAPA poster 2016AAPA poster 2016
AAPA poster 2016
 
THE FIRST SYSTEM OF REFERENCE FOR THE MEDICAL PRACTICE OF HOMEOPATHY IN FRANCE
THE FIRST SYSTEM OF REFERENCE FOR THE MEDICAL PRACTICE OF HOMEOPATHY IN FRANCETHE FIRST SYSTEM OF REFERENCE FOR THE MEDICAL PRACTICE OF HOMEOPATHY IN FRANCE
THE FIRST SYSTEM OF REFERENCE FOR THE MEDICAL PRACTICE OF HOMEOPATHY IN FRANCE
 
Deep learning-approach
Deep learning-approachDeep learning-approach
Deep learning-approach
 
International Journal of Reproductive Medicine & Gynecology
International Journal of Reproductive Medicine & GynecologyInternational Journal of Reproductive Medicine & Gynecology
International Journal of Reproductive Medicine & Gynecology
 
Life-Science-Vol-3-3-14_2...
Life-Science-Vol-3-3-14_2...Life-Science-Vol-3-3-14_2...
Life-Science-Vol-3-3-14_2...
 
FINAL CURO 2016 Szymonik
FINAL CURO 2016 SzymonikFINAL CURO 2016 Szymonik
FINAL CURO 2016 Szymonik
 
ATT_1424628128019_90 HUSSEINI MASOOMEH......
ATT_1424628128019_90 HUSSEINI MASOOMEH......ATT_1424628128019_90 HUSSEINI MASOOMEH......
ATT_1424628128019_90 HUSSEINI MASOOMEH......
 
Systematic Reviews and Meta- and Pooled AnalysesEffects of.docx
Systematic Reviews and Meta- and Pooled AnalysesEffects of.docxSystematic Reviews and Meta- and Pooled AnalysesEffects of.docx
Systematic Reviews and Meta- and Pooled AnalysesEffects of.docx
 
Potential role of uric acid in correlation with epidemics of hypertension and...
Potential role of uric acid in correlation with epidemics of hypertension and...Potential role of uric acid in correlation with epidemics of hypertension and...
Potential role of uric acid in correlation with epidemics of hypertension and...
 
Works Cited Milne, Anne C., Alison Avenell, and Jan Potter. Meta-.docx
Works Cited Milne, Anne C., Alison Avenell, and Jan Potter. Meta-.docxWorks Cited Milne, Anne C., Alison Avenell, and Jan Potter. Meta-.docx
Works Cited Milne, Anne C., Alison Avenell, and Jan Potter. Meta-.docx
 
Hypoglycaemia and improved testicular parameters in Sesamum radiatum treated ...
Hypoglycaemia and improved testicular parameters in Sesamum radiatum treated ...Hypoglycaemia and improved testicular parameters in Sesamum radiatum treated ...
Hypoglycaemia and improved testicular parameters in Sesamum radiatum treated ...
 
Serum il17 and il6 levels in a sample of iraqi patients with rheumatoid arthr...
Serum il17 and il6 levels in a sample of iraqi patients with rheumatoid arthr...Serum il17 and il6 levels in a sample of iraqi patients with rheumatoid arthr...
Serum il17 and il6 levels in a sample of iraqi patients with rheumatoid arthr...
 

AMRS_Project_1_Report

  • 1. South Dakota State University STAT 786, Fall 2016 Project 1 ASSOCIATION OF INSULIN WITH CHEMICAL AND DEMOGRAPHIC VARIABLES Authors: 1. Acharya, Subash 2. Khan, Riaz 3. Shrestha, Mahesh 4. Suehring, Aaron
  • 2. 2/19 INTRODUCTION The body breaks down carbohydrates from foods into glucose. Glucose—a form of sugar—is used by tissues in the body for energy (USDHHS 2014a). Once in the bloodstream, glucose needs the insulin hormone in order to be absorbed by organs and tissues (USDHHS 2014a). The insulin hormone, made by the pancreas for regulating blood glucose levels as part of metabolism, assists glucose to be absorbed by organs and tissues (Joshi et al. 2007; USDHHS 2014a). Glucose accumulates in the bloodstream if the pancreas cannot secrete enough insulin (Mayo Clinic). Prolonged periods of elevated blood glucose can lead to diabetes or prediabetes, kidney damage, and nerve damage (USDHHS 2014a). Using data from the National Health and Nutrition Examination Survey (NHANES), our objective was to determine if there was a relationship between insulin in the bloodstream (pmol/L), and a group of chemical predictor variables measured in the urine and demographic predictor variables for a sample of 974 individuals. The chemical predictor variables we examined were urinary iodine, urinary creatinine, urinary perchlorate, urinary nitrate, and urinary thiocyanate. Demographic variables included gender, age, and household income. Iodine (μg/L), measured in the urine as a quantitative variable, is an element that must come from the diet—it cannot be made in the body (WebMD). Iodine is needed by the thyroid gland to produce hormones. Iodine has been shown to have a negative correlation with blood glucose and insulin (Al-Attas et al. 2012). Further, iodine has been associated with insulin resistance in people with type 2 diabetes (Al- Attas et al. 2012). Insulin resistance is a condition where organs and body tissue do not respond to insulin, and therefore cannot easily absorb glucose from the bloodstream. Thus, the body needs to produce more insulin (USDHHS 2014b). 
Creatinine (mg/dL), also measured in the urine as a quantitative variable, is a waste product of muscle deterioration and is filtered through the kidneys and excreted through the urine (WebMD 2015). Testing the level of creatinine in a person’s body is a common measure of kidney health, and high levels of creatinine in your blood indicate kidney failure in patients with diabetes (National Kidney Foundation; WebMD 2015). Perchlorate (ng/mL), measured in the urine as a quantitative variable, is an element commonly found in the environment. Exposure to it occurs through the ingestion of food or water that contains perchlorate (ATSDR 2015). Perchlorate in the
  • 3. 3/19 body is associated with the inhibition of iodide uptake by the thyroid, which can lead to hypothyroidism (Blount et al. 2006; ATSDR 2015). Similarly, exposure to Nitrate (ng/mL), measured in the urine as a quantitative variable, is primarily linked to the ingestion of food or water that contain nitrates (CDC 2013). Similar to perchlorate, it is known to disrupt thyroid function by inhibiting iodide uptake (CDC 2013). Thiocyanate (ng/mL) is another element known to inhibit iodide uptake by the thyroid. Measured in the urine as a quantitative variable, exposure to thiocyanate occurs from cigarette smoke and some Brassica genus vegetables (Leung et al. 2014). Similar to perchlorate and nitrate, thiocyanate could lead to lower thyroid hormone production (Steinmaus et al. 2013). The demographic variables included in the analysis were gender, age, and household income. Gender, a qualitative variable coded as 1 for male and 2 for female, was included for its possible relationship with insulin resistance. Given that increased body fat has been associated with insulin resistance (USDHHS 2014b), higher levels of visceral and hepatic adipose tissue in men may contribute to higher levels of insulin resistance (Greer & Shen 2009). Aging has also been associated with increased body fat and increased levels of insulin resistance (Ryan 2000). Therefore, age, measured as a quantitative variable, was included in the analysis. Elderly were found to be more prone to insulin resistance, although it’s uncertain whether the cause is biological or environmental (i.e., decreased physical activity, weight gain) (Refaie et al. 2006). Lastly, household income, coded in qualitative bins, was included in the analysis. It has been shown that proximity to resources in high-income areas is related to insulin resistance (Auchincloss et al. 2007). A potential explanation is that grocery stores in poorer areas have less-healthy food options (Horowitz et al. 2004). 
Alternatively, in a low-income, poorly educated population of patients with diabetes, 48% reported an unwillingness to use insulin (Machinani et al. 2013). Known as psychological insulin resistance, this refers to a person’s unwillingness to use insulin (Machinani et al. 2013). This may result in a relationship between lower-income households and lower levels of insulin. A previous study by Blount et al. (2006) used this data from NHANES to examine the relationship between perchlorate and thyroid hormone levels. The covariates incorporated in their analysis that were also examined in ours include urinary creatinine, urinary iodine, urinary nitrate, urinary thiocyanate, and age. They
  • 4. 4/19 found a relationship between perchlorate and thyroid hormone production in women; however, no such significant relationship was found in men (Blount et al. 2006). A study by Steinmaus et al. (2013) also examined data from NHANES. They looked for an interaction effect between perchlorate, thiocyanate, and iodine on thyroid hormone levels. They found greater effects when all three variables were assessed together than when they were examined separately (Steinmaus et al. 2013). EXPLANATORY DATA EXPLORATION Figure 1 is a simple scatter plot of the data. We have the response in the left most column. Looking at the smoothed curve, there seems to be some predictive power with creatinine. However, there is lack of homogeneity of variability as suggested by the corresponding scatter plot. This implies that there may be a need for transformation of the data, which would be done later. The other variables appear to have very low predictive power. Figure 1: Scatter plot of the data ANALYSIS Data Formatting Four files, which contained the data for estimation of insulin level in the blood, were acquired from NHANES. In those files, each respondent was given a unique sequence number. The final dataset was created by aggregating the four datasets to extract the desired response variable and predictor variables based on common
  • 5. 5/19 sequence number. In total, 974 unique sequence numbers had values for all the variables to be included in the analysis. After forming the final dataset, we randomly divided it into two equal parts: a train dataset and a test dataset. The train dataset was used to perform the initial analysis, whereas the test dataset was used for model validation. Insulin level was selected as the response variable, and urinary iodine, urinary creatinine, urinary perchlorate, urinary nitrate, urinary thiocyanate, gender, age and household income were treated as the predictor variables. In total, eight predictor variables were selected for estimating the insulin level in the blood. Two of the variables, gender and household income, were qualitative, and the six remaining variables were quantitative. Initial Model Fitting and Diagnostics As an initial analysis, a linear model was fitted to the train data using all the variables, and its diagnostic plots were examined to check whether the assumptions of the linear model were met. Figure 2: Diagnostics plots, full model, train data The plot of residuals vs. fitted values for this model suggested that the variance of the error is not constant. The normal Q-Q plot of the residuals also suggested that some departure from normality of the errors exists. We concluded from the diagnostic plots that the assumptions of constant error variance and normality of the error terms were violated. Furthermore, the estimated
  • 6. 6/19 intercept and coefficients for the different predictor variables spanned a wide range, so the predictor variables were rescaled. Urinary iodine, urinary creatinine, and urinary thiocyanate were divided by 100, whereas urinary nitrate was divided by 1000. These scaling factors were chosen to bring all the predictor variables onto a common scale. Figure 3: Graphical representation of Box-Cox transformation As previously stated, there appeared to be unequal error variance and non-normality of the error terms. Given the shape and spread of the distribution of the response variable, transforming the response was an appropriate remedy. The Box-Cox procedure was used to determine the best power transformation to apply to the response variable. The R function boxcox() identified a range of plausible powers. Figure 3 shows this range, with the optimal lambda near zero, so it is reasonable to set lambda to zero, which corresponds to a log transformation of the response. A linear model was then refitted using all the predictor variables and the log-transformed response variable, and the diagnostic plots (Figure 4) were examined to check whether the transformation was appropriate. The plot of residuals versus fitted values showed that the constant error variance assumption was met, and the normal Q-Q plot indicated that the residuals were approximately normally distributed.
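The Box-Cox search described above can be sketched as a grid search over lambda using the profile log-likelihood. This is only an illustration in Python under stated assumptions: the report used R's boxcox(), the model here has a single predictor, and the function names and toy data are hypothetical.

```python
import math

def boxcox_loglik(y, x, lam):
    """Profile log-likelihood (up to a constant) of a simple linear
    regression of the Box-Cox transformed response on x."""
    n = len(y)
    # Box-Cox transform; lambda = 0 is the log transformation.
    z = [math.log(v) if lam == 0 else (v ** lam - 1) / lam for v in y]
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    szz = sum((b - mz) ** 2 for b in z)
    sse = szz - sxz ** 2 / sxx  # residual sum of squares of the fit
    # The second term is the log-Jacobian of the transformation.
    return -0.5 * n * math.log(sse / n) + (lam - 1) * sum(math.log(v) for v in y)

def best_lambda(y, x, grid):
    """Return the grid value of lambda with the highest log-likelihood."""
    return max(grid, key=lambda lam: boxcox_loglik(y, x, lam))

# Toy data (not the NHANES data): the response is roughly exp(0.5 * x).
x_toy = [1, 2, 3, 4, 5, 6, 7, 8]
noise = [0.05, -0.03, 0.02, -0.04, 0.03, -0.02, 0.04, -0.05]
y_toy = [math.exp(0.5 * a + e) for a, e in zip(x_toy, noise)]
print(best_lambda(y_toy, x_toy, [-1.0, -0.5, 0.0, 0.5, 1.0]))
```

Because the toy response is exponential in the predictor, the search selects lambda = 0, which mirrors the choice of a log transformation in the report.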
  • 7. 7/19 Figure 4: Diagnostics of full model after transformation, train data Multicollinearity Check To check for serious multicollinearity in our data, we used the variance inflation factor (VIF) criterion. Each predictor $X_k$ is regressed on the remaining predictors, and its VIF is computed as $\mathrm{VIF}_k = 1/(1 - R_k^2)$, where $R_k^2$ is the coefficient of multiple determination for the regression of $X_k$ on the other predictors. A VIF above 10 is the suggested cutoff for serious multicollinearity (Kutner et al. 2004). The highest VIF in our data was 1.76, suggesting that multicollinearity is not a serious problem. Table 1 shows the VIF values and the correlations between the predictor variables. Table 1: VIF and Correlation of the training data
  • 8. 8/19 Model selection In total, we fit four models. In model 1, we included all the variables except for the age and income. When fitting with all the predictors, these terms were not significant at any default significance level (maximum 0.1). Additionally, the response was regressed with all the predictors separately and these two predictors did not show any significant linear relationship. Model 2 was selected based on the Akaike information criterion (AIC) based stepwise selection. Model 3 is a simpler model, derived from model 2 after deleting the variable having the highest p-value. Model 4 is a further simplification where one more variable was deleted from model 3 based on the highest p-value. Table 2: shows the linear predictors of each of the models considered Figure 5 and 6 show the residual scatter plots and the normal Q-Q plot for all the models proposed. From visual inspection, they appear to meet the normality and variance constancy assumptions of the residuals. However, from the Q-Q plot, there seems to be some outlying observations in the data. Further analysis attempted to identify influential points which will be discussed later.
  • 9. 9/19 Figure 5: Residual plot of four candidate models, training data Figure 6: Q-Q plot of four candidate models, training data With Influential Observations Table 3 shows the regression results for candidate models based on training and validation dataset, with influential and outlying observations. For each of the four candidate models, we recorded the point estimates for the intercept and the slope of each predictor variable, along with the corresponding standard error. We also color- coded the slope coefficients and intercept point estimate based on the predictor’s level of significance. We performed a lack-of-fit test for all four of the models to determine significance of the model. Given that the p-value for all the models was significant (p-
  • 10. 10/19 value < 0.05), we found that there were four competitive models. We calculated several model selection statistics to determine the best model. Table 3: Regression results for candidate models based on training and validation dataset We calculated the AIC value for each model using $\mathrm{AIC} = 2k - 2\ln(L)$, where $k$ is the number of estimated parameters and $L$ is the maximum value of the likelihood function. Model 2 produced the lowest AIC value of the four models, and was therefore considered the best model using this criterion. We compared the SSE values for the four models using $\mathrm{SSE} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$, where $Y_i$ is the $i$th observation and $\hat{Y}_i$ is the predicted value for the $i$th observation. Although model 1 and model 2 produced almost identical SSE values, model 1 produced the lowest SSE and was therefore considered the best model using this criterion. We compared the MSE values for the four models using $\mathrm{MSE} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 / (n - p)$, where $n$ is the number of observations and $p$ is the number of parameters to be estimated.
  • 11. 11/19 All four models produced similar values; however, model 2 produced the lowest MSE and was therefore considered the best model using this criterion. We calculated and compared the PRESS statistic for all four models using $\mathrm{PRESS} = \sum_{i=1}^{n} (Y_i - \hat{Y}_{i(i)})^2$, where $\hat{Y}_{i(i)}$ is the prediction of the $i$th value from the model fitted with the $i$th observation removed. The PRESS criterion measures how well the fitted values of a given subset model predict the observed response values (Kutner et al. 2004). Although models 2, 3, and 4 were all very similar, model 4 produced the lowest PRESS value and was therefore considered the best model using this criterion. Next, we calculated the $C_p$ value for all four models using $C_p = \mathrm{SSE}_p / \mathrm{MSE}_{\text{full}} - n + 2p$, where $\mathrm{SSE}_p$ is the error sum of squares of the candidate model with $p$ parameters and $\mathrm{MSE}_{\text{full}}$ is the mean square error of the full model including all the potential predictors. Model 2 produced the smallest $C_p$ value, and was therefore considered the best model using this criterion. Lastly, we calculated and compared the MSPR for all four models using $\mathrm{MSPR} = \sum_{i=1}^{n^*} (Y_i - \hat{Y}_i)^2 / n^*$, where $Y_i$ and $\hat{Y}_i$ are the original observations and the point predictions of the response, respectively, and $n^*$ is the number of cases in the validation dataset. Although models 3 and 4 produced very similar results, model 3 produced the lowest MSPR and was therefore considered the best model using this criterion. All four models were then applied to the validation dataset, and the results were recorded in Table 3. The results show that model 1 was the best model; however, three of its predictor variables were not significant. From the results presented in Table 3, model 2 seems to be an appealing choice as it has the lowest AIC and $C_p$ values. The SSE, MSE and PRESS statistics do not vary much across the four models. However, judging by the MSPR value, models 3 and 4 look to be better choices than the other two. Because the model selection criteria in Table 3 do not agree on a single model, this suggests the presence of influential points. 
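The selection criteria defined above can be sketched for a one-predictor model as follows. This Python illustration rests on several assumptions that are not from the report: a Gaussian likelihood for AIC, k counted as the regression coefficients plus the error variance, hypothetical function names, and toy data. Mallows' $C_p$ is omitted because it requires a full model for comparison.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for a one-predictor model."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - b1 * mx, b1

def criteria(x, y, x_val, y_val):
    """SSE, MSE, AIC and PRESS on training data, MSPR on validation data."""
    n, p = len(y), 2  # p = intercept + slope
    b0, b1 = fit_line(x, y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - p)
    # Maximised Gaussian log-likelihood; k = p + 1 also counts the error variance.
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(sse / n) + 1)
    aic = 2 * (p + 1) - 2 * log_lik
    # PRESS: refit without case i, then predict case i.
    press = 0.0
    for i in range(n):
        c0, c1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (c0 + c1 * x[i])) ** 2
    # MSPR: mean squared prediction error on the held-out cases.
    mspr = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x_val, y_val)) / len(y_val)
    return {"SSE": sse, "MSE": mse, "AIC": aic, "PRESS": press, "MSPR": mspr}

# Toy training and validation data (not the NHANES data).
print(criteria([0, 1, 2, 3], [0, 1, 1, 2], [4, 5], [2, 3]))
```

PRESS is never smaller than SSE, since each deleted residual is at least as large in magnitude as the ordinary residual. A PRESS or MSPR close to the training error suggests the model predicts new cases about as well as it fits the training data.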
Therefore, we conducted measures to detect influential observations. Up to this point none of our data was screened for the presence of influential observations. Looking at Figure 4 and 5, there seems to be presence of observations with high deviations of the residuals from the mean. This could be a direct consequence of potential outliers and influential points in the data. Although a point may be an outlier in terms of the range of predictor variables, it may not be an outlier in terms of the response variable. Conversely, a point may be an outlier in terms of
  • 12. 12/19 the response variable, yet it may not be an outlier in terms of the predictor variable. Further, a point may be an outlier in terms of both the response and predictor variables. In these instances, it is possible that although the point is outlying for all or only one of the variables, it may not have an influence on the regression line. Therefore, it was necessary to assess the data for influential points. We used two measures to assess the presence of influential points: DFFITS and Cook’s Distance. DFFITS is given by $\mathrm{DFFITS}_i = t_i \left( h_{ii} / (1 - h_{ii}) \right)^{1/2}$, where $t_i$ is the studentized deleted residual and the second term of the product is the leverage factor of the $i$th observation. An observation with a high absolute DFFITS value is identified as an influential point per this criterion. We used 2 as the threshold value (Kutner et al. 2004). DFFITS measures the influence of the $i$th case on its own fitted value $\hat{Y}_i$, while Cook’s Distance measures the influence of the $i$th case on all $n$ fitted values. Cook’s Distance for the $i$th case is calculated using $D_i = \frac{e_i^2}{p \, \mathrm{MSE}} \times \frac{h_{ii}}{(1 - h_{ii})^2}$, where $e_i$ is the $i$th residual. Thus, we get a high Cook’s Distance value for a large residual and/or a high leverage value. A higher $D_i$ indicates a higher degree of influence of the $i$th case on the fitted values. We used a Cook’s Distance cutoff value of 4/n to identify influential points, where n was the number of observations (Introduction to SAS). It should be noted that the influential points identified by both criteria depend on the model itself, as the calculation involves $h_{ii}$, the $i$th diagonal element of the hat matrix. We combined the influential points flagged by the two measures and removed them from the data for the corresponding model. This process was repeated for each of the four models. Figure 7: DFFITS and Cook’s Distance plot for model 1
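A minimal sketch of the two influence measures for a one-predictor model follows, using the cutoffs stated above (|DFFITS| > 2 and Cook's Distance > 4/n). It is a Python illustration, not the authors' R code; the function names and toy data are hypothetical.

```python
import math

def influence(x, y):
    """Return (DFFITS_i, Cook's D_i) for each case of a simple linear regression."""
    n, p = len(x), 2  # p = intercept + slope
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    b0 = my - b1 * mx
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in resid)
    mse = sse / (n - p)
    rows = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - mx) ** 2 / sxx  # leverage: diagonal of the hat matrix
        # Studentized deleted residual, computed without refitting the model.
        t = e * math.sqrt((n - p - 1) / (sse * (1 - h) - e * e))
        dffits = t * math.sqrt(h / (1 - h))
        cooks = (e * e / (p * mse)) * h / (1 - h) ** 2
        rows.append((dffits, cooks))
    return rows

def flag_influential(rows, dffits_cut=2.0):
    """Indices flagged by either |DFFITS| > cutoff or Cook's D > 4/n."""
    n = len(rows)
    return [i for i, (d, c) in enumerate(rows) if abs(d) > dffits_cut or c > 4 / n]

# Toy data (not the NHANES data): the last case is far off the trend.
x_toy = [1, 2, 3, 4, 5, 6, 7, 8]
y_toy = [1.2, 1.9, 3.1, 3.8, 5.2, 5.9, 7.1, 20.0]
print(flag_influential(influence(x_toy, y_toy)))
```

Taking the union of the two flags matches the report's approach of pooling the points identified by both measures before refitting.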
  • 13. 13/19 Figure 7 shows the influential points identified by these measures for model 1. This was repeated for all four proposed models. The models were fitted again without these influential points. Table 4 summarizes the regression results for the candidate models based on the datasets after removing the influential points. Table 4: Regression results after removing influential points Looking at the results presented in Table 4, we see that model 4 produces the best model selection values in terms of AIC, SSE, MSE, PRESS, and $C_p$. Five of the six criteria we considered for model evaluation indicate that model 4 is the best model. The remaining statistic for model 4, MSPR, is only slightly above the lowest value among the models. These results are consistent for the validation data as well, with the exception of one statistic. Therefore, we chose model 4 for this study. This model is supported by the statistics and is very simple, as it includes only two predictor variables. For this chosen model, we ran a multiple comparison test to check whether all the coefficients were significantly different from zero. The R function glht() from the multcomp package was used for this. This function takes the null hypothesis in the form
  • 14. 14/19 $H_0: K\theta = m$ (Hothorn et al. 2016). The default choice of $m$ is zero, and $K$ was specified as a diagonal matrix of size 3. Table 5 summarizes the results of the multiple comparison, and they support our choice of model 4. Table 5: Simultaneous inference results Robust regression To investigate whether our choice of model, which was based on the data after removing the influential points, holds up, we fitted a robust regression. Robust regression dampens the effect of influential cases and safeguards against their influence (Kutner et al. 2004). The R function rlm() was used for this purpose. By default, this function uses the iteratively reweighted least squares (IRLS) method to fit the model (Yegorov 2016). With its default settings it uses the Huber weight function and estimates the scale of the residuals with the median absolute deviation (MAD). Starting from the residuals of a least squares fit, it obtains the initial weights and fits the model again using weighted least squares. The weights are re-estimated after each iteration until convergence is reached (Kutner et al. 2004). Table 6: Results from robust regression
  • 15. 15/19 Table 6 summarizes the regression results obtained from robust regression. Comparison between the test and training data indicates similar coefficient values and standard errors. Further, these results are consistent with the results obtained from ordinary least squares. Given that robust regression is insensitive to influential observations, this indicates that the data used for the ordinary least squares analysis was sufficiently screened for influential observations. Regression Tree We also fitted a regression tree, which is a non-parametric, simple, and powerful regression technique (Kutner et al. 2004). Applying this method to the training dataset without the influential cases, the MSE, PRESS and MSPR were found to be 0.44, 203.36, and 0.61 respectively, which validates our choice of model. CONCLUSION Based on our analysis, we selected model 4 as our final model, i.e., the blood insulin level can be modeled as a linear function of urinary creatinine level and gender. However, the coefficient of the gender variable has opposite signs in the training and the validation data. This holds true for all four models (and also in the robust regression, Table 6). This was a direct consequence of how the data were sampled when the full dataset was divided into train and test data, as Figure 8 illustrates. The boxplot of the train data shows that females have a higher insulin level on average, whereas the test data tell a different story. The full data agrees with the train dataset. Therefore, we keep the coefficients found from the training dataset when interpreting the effect of gender on insulin level. Figure 8: Boxplot of response based on gender
  • 16. 16/19 The model we chose to describe the relationship of blood insulin level with urinary creatinine level and gender was found to be statistically significant. However, it had very low predictive power, as indicated by its low coefficient of determination ($R^2$ less than 5%). Figure 9 shows the original values and the predicted values along with the 95% confidence and prediction bands. This figure indicates that the model has little practical application because the prediction and confidence bands are similar across the entire range of the data. This emphasizes the importance of examining the practical usefulness of a statistically significant model.
Figure 9: Original and predicted values of training data
REFERENCES
Al-Attas, O. S., Al-Daghri, N. M., Alkharfy, K. M., Alokail, M. S., Al-Johani, N. J., Abd-Alrahman, S. H., Yakout, S. M., Draz, H. M., & Sabico, S. (2012). Urinary iodine is associated with insulin resistance in subjects with diabetes mellitus type 2. Experimental and Clinical Endocrinology & Diabetes, 120(10), 618-622.
ATSDR (Agency for Toxic Substances and Disease Registry), Division of Toxicology and Environmental Medicine. (2015). Public health statement: Perchlorates. https://www.atsdr.cdc.gov/ToxProfiles/tp162-c1-b.pdf. Accessed 4 Dec. 2016.
Auchincloss, A. H., Roux, A. V. D., Brown, D. G., O'Meara, E. S., & Raghunathan, T.
  • 17. 17/19 E. (2007). Association of insulin resistance with distance to wealthy areas: the multi-ethnic study of atherosclerosis. American Journal of Epidemiology, 165(4), 389-397.
Blount, B. C., Pirkle, J. L., Osterloh, J. D., Valentin-Blasini, L., & Caldwell, K. L. (2006). Urinary perchlorate and thyroid hormone levels in adolescent and adult men and women living in the United States. Environmental Health Perspectives, 1865-1871.
CDC (Centers for Disease Control and Prevention). (2013). National Health and Nutrition Examination Survey. https://wwwn.cdc.gov/nchs/nhanes/2011-2012/PERNTS_G.htm. Accessed 4 Dec. 2016.
Geer, E. B., & Shen, W. (2009). Gender differences in insulin resistance, body composition, and energy balance. Gender Medicine, 6, 60-75.
Horowitz, C. R., Colson, K. A., Hebert, P. L., & Lancaster, K. (2004). Barriers to buying healthy foods for people with diabetes: evidence of environmental disparities. American Journal of Public Health, 94(9), 1549-1554.
Hothorn, T., Bretz, F., Westfall, P., Heiberger, R. M., Schuetzenmeister, A., & Scheibe, S. (2016). Simultaneous Inference in General Parametric Models. https://cran.r-project.org/web/packages/multcomp/multcomp.pdf. Accessed 10 Dec. 2016.
Joshi, S. R., Parikh, R. M., & Das, A. K. (2007). Insulin-history, biochemistry, physiology and pharmacology. Journal-Association of Physicians of India, 55(L), 19.
Kutner, M. H., Nachtsheim, C. J., & Neter, J. (2004). Applied Linear Regression Models (4th ed.). McGraw-Hill Irwin.
Leung, A. M., Katz, P. M., He, X., Feig, D. S., Pearce, E. N., & Braverman, L. E. (2014). Urinary perchlorate and thiocyanate concentrations in pregnant women from Toronto, Canada. Thyroid, 24(1), 175-176.
Machinani, S., Bazargan-Hejazi, S., & Hsia, S. H. (2013). Psychological insulin resistance among low-income, US racial minority patients with type 2 diabetes. Primary Care Diabetes, 7(1), 51-55.
Mayo Clinic. Diabetes treatment: Using insulin to manage blood sugar.
http://www.mayoclinic.org/diseases-conditions/diabetes/in-depth/diabetes-treatment/art-20044084. Accessed 2 Dec. 2016.
National Kidney Foundation. Diabetes – a major risk factor for kidney disease. https://www.kidney.org/atoz/content/diabetes. Accessed 2 Dec. 2016.
  • 18. 18/19 Ryan, A. S. (2000). Insulin resistance with aging. Sports Medicine, 30(5), 327-346.
Refaie, M. R., Sayed-Ahmed, N. A., Bakr, A. M., Aziz, M. Y. A., El Kannishi, M. H., & Abdel-Gawad, S. S. (2006). Aging is an Inevitable Risk Factor for Insulin Resistance. Journal of Taibah University Medical Sciences, 1(1), 30-41.
Steinmaus, C., Miller, M. D., Cushing, L., Blount, B. C., & Smith, A. H. (2013). Combined effects of perchlorate, thiocyanate, and iodine on thyroid function in the National Health and Nutrition Examination Survey 2007–08. Environmental Research, 123, 17-24.
Introduction to SAS. UCLA: Statistical Consulting Group. http://www.ats.ucla.edu/stat/sas/dae/rreg.htm. Accessed 9 Dec. 2016.
USDHHS (United States Department of Health and Human Services), National Institute of Diabetes and Digestive and Kidney Diseases. (2014a). Causes of diabetes. https://www.niddk.nih.gov/health-information/diabetes/causes. Accessed 5 Dec. 2016.
USDHHS (United States Department of Health and Human Services), National Institute of Diabetes and Digestive and Kidney Diseases. (2014b). Prediabetes and insulin resistance. https://www.niddk.nih.gov/health-information/diabetes/types/prediabetes-insulin-resistance. Accessed 5 Dec. 2016.
WebMD. Iodine. http://www.webmd.com/vitamins-supplements/ingredientmono-35-iodine.aspx?activeingredientid=35. Accessed 2 Dec. 2016.
WebMD. (2015). Creatinine and creatinine clearance blood tests. http://www.webmd.com/a-to-z-guides/creatinine-and-creatinine-clearance-blood-tests - 1. Accessed 2 Dec. 2016.
Yegorov, O. (2016). Robust Fitting of Linear Models. http://stat.ethz.ch/R-manual/R-devel/library/MASS/html/rlm.html. Accessed 10 Dec. 2016.
  • 19. 19/19 APPENDIX
Part of the Data
R code: Submitted as separate .R files.