A Note on Correlation Analysis:
statistically spurious vs. non-spurious results
Aris Spanos
This note uses an empirical example to illustrate (i) how one can inadvertently derive spurious correlation results and (ii) how such results can be transformed into statistically reliable ones.
The example uses the following annual data for the period 2000-2009:
$x_t$: per capita consumption of cheese (US)
$y_t$: number of people who died by becoming tangled in their bedsheets

year:   2000  2001  2002  2003  2004  2005  2006  2007  2008  2009
$x_t$:  29.8  30.1  30.5  30.6  31.3  31.7  32.6  33.1  32.7  32.8
$y_t$:  327   456   509   497   596   573   661   741   809   717
The data were downloaded from the following web site:
http://www.tylervigen.com/
The original question of interest: are these variables correlated?
The correlation coefficient between two random variables $X$ and $Y$ is defined by:
$$\rho = \frac{Cov(X,Y)}{\sqrt{Var(X)\,Var(Y)}} = \frac{E[(X-\mu_x)(Y-\mu_y)]}{\sqrt{E[(X-\mu_x)^2]\,E[(Y-\mu_y)^2]}} = \frac{\sigma_{xy}}{\sqrt{\sigma_x^2\cdot\sigma_y^2}}.$$
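This parameter is usually estimated using the following estimator:
$$\hat{\rho} = \frac{\frac{1}{n}\sum_{t=1}^{n}(x_t-\bar{x})(y_t-\bar{y})}{\sqrt{\left[\frac{1}{n}\sum_{t=1}^{n}(x_t-\bar{x})^2\right]\left[\frac{1}{n}\sum_{t=1}^{n}(y_t-\bar{y})^2\right]}},$$
where $\bar{x}=\frac{1}{n}\sum_{t=1}^{n}x_t$ and $\bar{y}=\frac{1}{n}\sum_{t=1}^{n}y_t$ are the estimators of the means:
$$\left(\mu_x=E(X),\ \mu_y=E(Y)\right),$$
$\hat{\sigma}_x^2=\left[\frac{1}{n}\sum_{t=1}^{n}(x_t-\bar{x})^2\right]$ and $\hat{\sigma}_y^2=\left[\frac{1}{n}\sum_{t=1}^{n}(y_t-\bar{y})^2\right]$ are the estimators of the variances:
$$\left(\sigma_x^2=E(X-\mu_x)^2,\ \sigma_y^2=E(Y-\mu_y)^2\right),$$
and $\hat{\sigma}_{xy}=\frac{1}{n}\sum_{t=1}^{n}(x_t-\bar{x})(y_t-\bar{y})$ is the estimator of the covariance:
$$\sigma_{xy}=E[(X-\mu_x)(Y-\mu_y)].$$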
In the case of the above data the resulting estimates are:
$$\bar{x}=31.52,\quad \bar{y}=588.6,\quad \hat{\sigma}_x^2=1.515,\quad \hat{\sigma}_y^2=21619.2,\quad \hat{\sigma}_{xy}=171.409,$$
giving rise to the estimated correlation coefficient:
$$\hat{\rho}=\frac{171.409}{\sqrt{(1.515)(21619.156)}}=.94713.$$
The key issue for this inference result is whether it is reliable or statistically spurious. This is irrespective of whether it is substantively meaningful or not. Statistical spuriousness arises easily when any of the probabilistic assumptions imposed on the data is invalid. These assumptions are imposed to ensure that the employed inference procedures have the properties assumed; these properties were derived assuming the assumptions are valid.
What is often insufficiently appreciated in practice is that the above estimators:
$$\left(\bar{x},\ \bar{y},\ \hat{\sigma}_x^2,\ \hat{\sigma}_y^2,\ \hat{\sigma}_{xy}\right)$$
of the unknown parameters:
$$\left(\mu_x,\ \mu_y,\ \sigma_x^2,\ \sigma_y^2,\ \sigma_{xy}\right)$$
are minimally 'good' (consistent) when the stochastic process $\{Z_t:=(X_t,Y_t),\ t=1,2,\ldots,n,\ldots\}$ underlying the data:
$$z_0:=\{(x_t,y_t),\ t=1,2,\ldots,n\}$$
is Independent and Identically Distributed (IID). When these assumptions are supplemented with the assumption of Normality, in addition to being consistent, the above estimators of $(\mu_x,\mu_y)$ are also unbiased, sufficient and fully efficient and the estimators of $(\sigma_x^2,\sigma_y^2,\sigma_{xy})$ are sufficient and asymptotically efficient.
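More formally, the implicit statistical model for correlation analysis is a simple bivariate Normal model:
$$Z_t \sim \text{NIID}(\mu,\Sigma),\quad t=1,2,\ldots,n,\ldots \qquad (1)$$
where
$$Z_t:=\begin{pmatrix} X_t \\ Y_t \end{pmatrix},\quad \mu:=\begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix},\quad \Sigma:=\begin{pmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{pmatrix}.$$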
The NIID probabilistic assumptions for the process $\{Z_t:=(X_t,Y_t),\ t=1,2,\ldots,n,\ldots\}$ can be used to derive a Neyman-Pearson (N-P) test for the hypotheses:
$$H_0:\ \rho=0 \ \text{ vs. } \ H_1:\ \rho\neq 0 \qquad (2)$$
that was originally proposed by Fisher (1915) and is based on the test statistic:
$$\tau(Z)=\frac{\sqrt{(n-2)}\,\hat{\rho}}{\sqrt{(1-\hat{\rho}^2)}}\ \overset{\rho=0}{\sim}\ St(n-2),$$
where "$St(n-2)$" denotes a Student's t distribution with $(n-2)$ degrees of freedom. Applying this test using the above data yields:
$$\tau(z_0)=\frac{\sqrt{8}\,(.94713)}{\sqrt{(1-(.94713)^2)}}=8.349,\qquad P(|\tau(Z)|>|\tau(z_0)|;\ \rho=0)=.000034.$$
The p-value of .000034 indicates that $\rho$ is statistically (highly) significant:
$$\hat{\rho}=.94713\,[.000034] \qquad (3)$$
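The test statistic and its p-value can be checked without a statistics package. The sketch below is illustrative only: the helper names are hypothetical, and the p-value is obtained by numerically integrating the Student's t density with Simpson's rule rather than by calling a library routine.

```python
import math

def student_t_pdf(u, df):
    """Density of Student's t with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + u * u / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=20000):
    """P(|T| > |t_stat|) via composite Simpson integration of the density on [0, |t_stat|]."""
    b = abs(t_stat)
    h = b / steps  # steps must be even for Simpson's rule
    total = student_t_pdf(0.0, df) + student_t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * student_t_pdf(i * h, df)
    # By symmetry, P(|T| <= b) is twice the integral over [0, b]
    return 1.0 - 2.0 * (total * h / 3.0)

def correlation_t(rho_hat, n):
    """Test statistic sqrt(n-2)*rho_hat / sqrt(1 - rho_hat^2) for H0: rho = 0."""
    return math.sqrt(n - 2) * rho_hat / math.sqrt(1 - rho_hat ** 2)

tau = correlation_t(0.94713, 10)
p_value = two_sided_p(tau, df=8)
print(round(tau, 3), p_value)
```

The same two helpers can be reused for any of the t-statistics reported later in the note by changing the statistic and the degrees of freedom.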
What does linear regression have to do with correlation?
The above correlation results are directly related to the results based on a simple regression of $y_t$ on $x_t$:
$$y_t=\beta_0+\beta_1 x_t+u_t,\quad t=1,2,\ldots,n, \qquad (4)$$
since the correlation coefficient $\rho$ is related to the regression coefficient $\beta_1$:
$$\beta_1=\frac{Cov(X,Y)}{Var(X)} \ \text{ via } \ \rho=\beta_1\left(\frac{\sqrt{Var(X)}}{\sqrt{Var(Y)}}\right). \qquad (5)$$
In the case of the above data, the estimated regression (standard errors in parentheses) is:
$$y_t=\underset{(427.6)}{-2977.3}+\underset{(13.56)}{113.13}\,x_t+\hat{u}_t,\quad R^2=.897,\ s=50.056,\ n=10, \qquad (6)$$
with the t-test for the significance of $\beta_1$ yielding:
$$\tau_1(z_0)=\frac{113.13}{13.56}=8.349\,[.000034]. \qquad (7)$$
This result is directly related to the significance of the correlation coefficient, since:
$$\hat{\rho}=\hat{\beta}_1\left(\frac{\hat{\sigma}_x}{\hat{\sigma}_y}\right)=(113.13)\left(\frac{\sqrt{1.515}}{\sqrt{21619.156}}\right)=.947\,[.000034]. \qquad (8)$$
This confirms the above result that the two random variables ‘appear’
to be highly correlated!
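The link between the regression results in (6)-(7) and the correlation results in (3) and (8) can be verified numerically. The following is a minimal sketch in plain Python, again assuming the cheese series is read with one decimal place; variable names are illustrative.

```python
import math

x = [29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8]
y = [327, 456, 509, 497, 596, 573, 661, 741, 809, 717]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sxx = sum((a - x_bar) ** 2 for a in x)
syy = sum((b - y_bar) ** 2 for b in y)
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))

# OLS estimates of the simple regression of y on x
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
rss = sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y))
s = math.sqrt(rss / (n - 2))      # regression standard error
se_b1 = s / math.sqrt(sxx)        # standard error of the slope
t_b1 = b1 / se_b1                 # t-statistic for the slope
r2 = 1 - rss / syy

# Correlation coefficient and its t-statistic
rho_hat = sxy / math.sqrt(sxx * syy)
tau = math.sqrt(n - 2) * rho_hat / math.sqrt(1 - rho_hat ** 2)

# Reparameterization: slope times the ratio of standard deviations
rho_from_b1 = b1 * math.sqrt(sxx / syy)

print(round(b0, 1), round(b1, 2), round(se_b1, 2), round(r2, 3), round(s, 3))
print(round(t_b1, 3), round(tau, 3), round(rho_from_b1, 4))
```

The slope t-statistic and the correlation t-statistic coincide algebraically, which is why the two p-values agree.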
But are they? Are the above inference results reliable?
Looking at the t-plots of the above data in figures 1-2, it is clear that the mean of both data series is not constant; it is trending upwards. This indicates that at least one of the invoked assumptions, Identically Distributed (ID) [constant mean, variance and covariance], is invalid for this data!
[Fig. 1: t-plot of the $x_t$ data (horizontal axis: date; vertical axis: x)]
[Fig. 2: t-plot of the $y_t$ data (horizontal axis: date; vertical axis: y)]
What does this departure imply for the above estimation and testing results? First, $\bar{x}=\frac{1}{n}\sum_{t=1}^{n}x_t$ and $\bar{y}=\frac{1}{n}\sum_{t=1}^{n}y_t$ will be inconsistent estimators of the 'true' means:
$$\left(E(X_t)=\mu_x(t),\ E(Y_t)=\mu_y(t)\right),$$
which, in light of figures 1-2, change (increase) with $t$; the true means appear to be better described using a linear trend, i.e.
$$\mu_x(t)=\alpha_0+\alpha_1 t \ \text{ and } \ \mu_y(t)=\gamma_0+\gamma_1 t. \qquad (9)$$
Second, the inconsistency of $(\bar{x},\bar{y})$ implies that the estimators of all the above linear regression parameters:
$$\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x},\quad \hat{\beta}_1=\frac{\sum_{t=1}^{n}(x_t-\bar{x})(y_t-\bar{y})}{\sum_{t=1}^{n}(x_t-\bar{x})^2},\quad s^2=\frac{1}{n-2}\sum_{t=1}^{n}\hat{u}_t^2, \qquad (10)$$
as well as the estimator of the correlation coefficient, are also inconsistent, because all the formulae involve deviations of the form $(x_t-\bar{x})$ and $(y_t-\bar{y})$, which are deviations from the wrong (constant) mean!
Third, all inferences based on the estimated correlation coefficient and linear regression model are likely to be unreliable and the significance results statistically spurious. That is, equally unreliable will be the t-test statistic $\tau_1(z_0)$ and the goodness-of-fit measure $R^2$:
$$\tau_1(z_0)=\frac{\sqrt{\sum_{t=1}^{n}(x_t-\bar{x})^2}\,(\hat{\beta}_1)}{s},\quad R^2=1-\frac{\sum_{t=1}^{n}\hat{u}_t^2}{\sum_{t=1}^{n}(y_t-\bar{y})^2}, \qquad (11)$$
because they also involve the erroneous deviations:
$$(x_t-\bar{x}),\ (y_t-\bar{y})$$
instead of the more appropriate deviations from the estimated trending means:
$$(x_t-\hat{\alpha}_0-\hat{\alpha}_1 t),\ (y_t-\hat{\gamma}_0-\hat{\gamma}_1 t).$$
What is one supposed to do after establishing that the above significance results are likely to be statistically spurious?
Addressing the spuriousness of inference results
A way to secure the reliability of inference and sidestep the statistically spurious results is to account for the trending mean in the above data. This can be achieved by detrending the original data using the following auxiliary regressions:
$$x_t=\underset{(.240)}{29.367}+\underset{(.039)}{.392}\,t+\hat{u}_{1t},\quad R^2=.927,\ s=.352,\ n=10, \qquad (12)$$
$$y_t=\underset{(33.36)}{334.93}+\underset{(5.378)}{46.12}\,t+\hat{u}_{2t},\quad R^2=.902,\ s=48.84,\ n=10. \qquad (13)$$
The detrended data are evaluated using:
$$\left(\tilde{x}_t=x_t-.392\,t,\ \ \tilde{y}_t=y_t-46.12\,t\right),\quad t=1,2,\ldots,n.$$
The t-plots of the detrended data $\{(\tilde{x}_t,\tilde{y}_t),\ t=1,2,\ldots,n\}$ are shown in figures 3-4.
The estimated variances and covariance for the detrended data are:
$$\hat{\sigma}_{\tilde{x}}^2=.110,\quad \hat{\sigma}_{\tilde{y}}^2=2120.132,\quad \hat{\sigma}_{\tilde{x}\tilde{y}}=5.885.$$
It is worth noting how 'inflated' the original estimates of these moments:
$$\hat{\sigma}_x^2=1.515,\quad \hat{\sigma}_y^2=21619.2,\quad \hat{\sigma}_{xy}=171.409$$
were when compared to the above estimates.
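The detrending step and the resulting moments can be sketched as follows. This is an illustration under stated assumptions: the cheese series is read with one decimal place, the trend slopes are estimated by OLS on $t=1,\ldots,10$ and used at full precision (rather than the rounded .392 and 46.12), and the sample moments use the divisor $n-1$, which is what matches the reported figures. Function names are hypothetical.

```python
import math

x = [29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8]
y = [327, 456, 509, 497, 596, 573, 661, 741, 809, 717]

def slope_on_trend(z):
    """OLS slope of z_t on t = 1, ..., n (regression with an intercept)."""
    n = len(z)
    t = range(1, n + 1)
    t_bar, z_bar = (n + 1) / 2, sum(z) / n
    stt = sum((ti - t_bar) ** 2 for ti in t)
    stz = sum((ti - t_bar) * (zi - z_bar) for ti, zi in zip(t, z))
    return stz / stt

def cov(u, v):
    """Sample covariance with divisor n - 1 (an assumption, see the lead-in)."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v)) / (n - 1)

slope_x = slope_on_trend(x)
slope_y = slope_on_trend(y)

# Detrended series: subtract the estimated trend component
x_d = [xi - slope_x * t for t, xi in enumerate(x, start=1)]
y_d = [yi - slope_y * t for t, yi in enumerate(y, start=1)]

var_xd, var_yd, cov_d = cov(x_d, x_d), cov(y_d, y_d), cov(x_d, y_d)
rho_d = cov_d / math.sqrt(var_xd * var_yd)

print(round(slope_x, 3), round(slope_y, 2))
print(round(var_xd, 3), round(var_yd, 3), round(cov_d, 3), round(rho_d, 3))
```

Comparing `var_xd` with the original variance of 1.515 shows that most of the variation measured around the constant mean was variation in the trend itself.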
[Fig. 3: detrended $x_t$ data (horizontal axis: date; vertical axis: xd)]
[Fig. 4: detrended $y_t$ data (horizontal axis: date; vertical axis: yd)]
Testing the significance of the correlation coefficient using the detrended data yields:
$$\hat{\rho}=\frac{5.885}{\sqrt{(.11)(2120.132)}}=.385\,[.271], \qquad (14)$$
with the p-value indicating that $\rho$ is statistically insignificant, contradicting the previous result! There is no statistical correlation between these two variables after all! The source of the statistically unreliable results based on (6) is the fact that the misspecification due to the presence of trends usually induces sizeable discrepancies between actual and nominal error probabilities. Applying a .05 significance level test when the actual type I error probability is greater than .90 will lead an inference astray.
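The gap between nominal and actual error probabilities can be illustrated with a small simulation. The sketch below is not part of the original analysis: it assumes two series that are independent by construction, each following a linear trend plus Normal noise, with trend coefficients and noise standard deviations borrowed from the auxiliary regressions (12)-(13). It then applies the nominal .05 test of $\rho=0$ to the undetrended series and records how often the true null is rejected.

```python
import math
import random

def corr(u, v):
    """Sample correlation coefficient."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def simulate_rejection_rate(reps=2000, seed=12345):
    """Share of replications in which the nominal .05 test rejects rho = 0."""
    rng = random.Random(seed)
    n = 10
    crit = 2.306  # two-sided .05 critical value of Student's t with n - 2 = 8 d.f.
    rejections = 0
    for _ in range(reps):
        # Independent noise around two separate linear trends (assumed parameters)
        x = [29.367 + 0.392 * t + rng.gauss(0, 0.352) for t in range(1, n + 1)]
        y = [334.93 + 46.12 * t + rng.gauss(0, 48.84) for t in range(1, n + 1)]
        r = corr(x, y)
        tau = math.sqrt(n - 2) * r / math.sqrt(1 - r * r)
        if abs(tau) > crit:
            rejections += 1
    return rejections / reps

rate = simulate_rejection_rate()
print(rate)
```

Under these assumed parameters the rejection frequency is far above the nominal .05, even though the noise components of the two series are unrelated; the shared upward trend alone drives the apparent correlation.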
Correlation or Linear Regression? An equivalent way to derive the above inference result is to estimate the linear regression between the detrended data:
$$\tilde{y}_t=\underset{(1330)}{-1236}+\underset{(45.29)}{53.5}\,\tilde{x}_t+\hat{u}_t,\quad R^2=.148,\ s=45.066,\ n=10. \qquad (15)$$
The t-test for the statistical significance of the coefficient of $\tilde{x}_t$ yields:
$$\tau_1(z_0)=\frac{53.5}{45.29}=1.181\,[.271]. \qquad (16)$$
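Notice that the p-value is identical to the one for the correlation coefficient based on the detrended data. This is not accidental. As shown above, the correlation coefficient is a simple reparameterization of the regression coefficient. Given that:
$$\hat{\sigma}_{\tilde{x}}=\sqrt{.110}=.3316,\quad \hat{\sigma}_{\tilde{y}}=\sqrt{2120.132}=46.05,$$
it follows that:
$$\hat{\rho}=\hat{\beta}_1\left(\frac{\hat{\sigma}_{\tilde{x}}}{\hat{\sigma}_{\tilde{y}}}\right)=53.5\left(\frac{.3316}{46.05}\right)=.385\,[.271]. \qquad (17)$$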
In light of the above discussion, one might be forgiven for thinking that the statistical spuriousness was remedied by transmuting (or cleaning up) our data to comply with the invoked assumptions of the underlying statistical model in (1). Although there is an element of truth in this viewpoint, it is much too narrow to be illuminating enough for the broader issues raised by statistically spurious results stemming from different departures from the model assumptions. For instance, how does one transmute the data when Normality is false but $\rho$ is the parameter of interest?
A broader, and more illuminating, perspective is provided by viewing the remedy as stemming from respecifying the original model to account for the statistical systematic information in data $z_0$ that was not accounted for by the original statistical model. In the above case the (unaccounted) systematic information came in the form of trending means, which can be accounted for by the following respecified linear regression that includes a trend term to reflect (9):
$$y_t=\beta_0+\delta t+\beta_1 x_t+u_t,\quad t=1,2,\ldots,n. \qquad (18)$$
When this respecified regression is estimated using the original data, it yields:
$$y_t=\underset{(1422)}{-1236}+\underset{(19.68)}{25.18}\,t+\underset{(48.42)}{53.5}\,x_t+\hat{u}_t,\quad R^2=.916,\ s=48.178,\ n=10. \qquad (19)$$
This estimated regression is interesting because the coefficient estimates of $(\beta_0,\beta_1)$ are identical, and their standard errors and $s$ are almost identical, to those of the linear regression between the detrended data in (15). The discrepancies between them stem from the degrees of freedom: (15) treats the detrended data as given and divides the residual sum of squares by $n-2=8$, whereas (19) also estimates the trend coefficient and divides by $n-3=7$; for instance, $45.29\sqrt{8/7}=48.42$. Despite the minor differences in these estimates, the inference results pertaining to the parameters $(\beta_0,\beta_1,\sigma^2)$ lead to the same conclusions. This includes the t-test for the significance of $\beta_1$:
$$\tau_1(z_0)=\frac{53.5}{48.42}=1.105\,[.301], \qquad (20)$$
confirming that $\beta_1$ is statistically insignificant.
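The respecified regression with a trend term can be estimated directly from the original data using the closed-form solution for two regressors plus an intercept. The sketch below is a minimal illustration, assuming the cheese series is read with one decimal place; it also computes the slope standard error that the detrended regression would report, so the two can be compared. Names are illustrative.

```python
import math

x = [29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8]
y = [327, 456, 509, 497, 596, 573, 661, 741, 809, 717]
n = len(x)
t = list(range(1, n + 1))

def cross(u, v):
    """Sum of cross-products of deviations from the sample means."""
    u_bar, v_bar = sum(u) / len(u), sum(v) / len(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))

stt, sxx, syy = cross(t, t), cross(x, x), cross(y, y)
stx, sty, sxy = cross(t, x), cross(t, y), cross(x, y)

# Normal equations for the centered regressors (t, x), solved in closed form
det = stt * sxx - stx ** 2
delta = (sty * sxx - sxy * stx) / det   # trend coefficient
b1 = (sxy * stt - sty * stx) / det      # coefficient of x
b0 = sum(y) / n - delta * sum(t) / n - b1 * sum(x) / n

rss = syy - delta * sty - b1 * sxy
s = math.sqrt(rss / (n - 3))            # three estimated coefficients
se_b1 = s * math.sqrt(stt / det)
se_delta = s * math.sqrt(sxx / det)
r2 = 1 - rss / syy

# Same slope and same residuals as the detrended regression, but divisor n - 2
se_b1_detrended = math.sqrt(rss / (n - 2)) * math.sqrt(stt / det)

print(round(b0, 0), round(delta, 2), round(b1, 1), round(r2, 3), round(s, 3))
print(round(se_delta, 2), round(se_b1, 2), round(se_b1_detrended, 2))
print(round(b1 / se_b1, 3))
```

The ratio `se_b1 / se_b1_detrended` equals $\sqrt{(n-2)/(n-3)}$, so the two regressions differ only in the degrees of freedom used to estimate the error variance.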
In conclusion, it is important to emphasize that in the above example the sample size is too small ($n=10$) to test for departures from the other probabilistic assumptions. This is especially true for the assumption of independence. However, if other departures are present in this data, they would undermine the reliability of the original inference results based on (6) even further. In practice, one should do empirical modeling with a large enough $n$ to enable one to test the validity of all the model assumptions using trenchant misspecification testing. If any departures are detected one should then respecify the original model to account for all the systematic statistical information not accounted for. The respecification is considered successful when the respecified model turns out to be statistically adequate: all its probabilistic assumptions are shown to be valid for data $z_0$, i.e. the statistical model accounts for all the chance regularity patterns in the data.

Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 

Correlation of Cheese Consumption and Bedsheet Deaths

  • 1. A Note on Correlation Analysis: statistically spurious vs. non-spurious results

Aris Spanos

This note uses an empirical example to illustrate (i) how one can inadvertently derive spurious correlation results and (ii) how such results can be transformed into statistically reliable ones.

The example uses the following annual data for the period 2000-2009:
$x_t$: Per capita consumption of cheese (US)
$y_t$: Number of people who died by becoming tangled in their bedsheets

year:  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009
x_t:   29.8  30.1  30.5  30.6  31.3  31.7  32.6  33.1  32.7  32.8
y_t:    327   456   509   497   596   573   661   741   809   717

The data were downloaded from the following web site: http://www.tylervigen.com/

The original question of interest: are these variables correlated?

The correlation coefficient between two random variables $X$ and $Y$ is defined by:
\[ \rho = \frac{Cov(X,Y)}{\sqrt{Var(X)\,Var(Y)}} = \frac{E[(X-\mu_x)(Y-\mu_y)]}{\sqrt{E[(X-\mu_x)^2]\,E[(Y-\mu_y)^2]}} = \frac{\sigma_{xy}}{\sqrt{\sigma_x^2\cdot\sigma_y^2}}. \]

This parameter is usually estimated using the following estimator:
\[ \hat{\rho} = \frac{\frac{1}{n}\sum_{t=1}^{n}(X_t-\bar{X})(Y_t-\bar{Y})}{\sqrt{\left[\frac{1}{n}\sum_{t=1}^{n}(X_t-\bar{X})^2\right]\left[\frac{1}{n}\sum_{t=1}^{n}(Y_t-\bar{Y})^2\right]}}, \]
where $\bar{X}=\frac{1}{n}\sum_{t=1}^{n}X_t$ and $\bar{Y}=\frac{1}{n}\sum_{t=1}^{n}Y_t$ are the estimators of the means $\left(\mu_x=E(X_t),\ \mu_y=E(Y_t)\right)$,
$\hat{\sigma}_x^2=\frac{1}{n}\sum_{t=1}^{n}(X_t-\bar{X})^2$ and $\hat{\sigma}_y^2=\frac{1}{n}\sum_{t=1}^{n}(Y_t-\bar{Y})^2$ are the estimators of the variances $\left(\sigma_x^2=E(X_t-\mu_x)^2,\ \sigma_y^2=E(Y_t-\mu_y)^2\right)$,
and $\hat{\sigma}_{xy}=\frac{1}{n}\sum_{t=1}^{n}(X_t-\bar{X})(Y_t-\bar{Y})$ is the estimator of the covariance $\sigma_{xy}=E[(X_t-\mu_x)(Y_t-\mu_y)]$.
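The estimator formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_moments` and its `ddof` argument are hypothetical, and the choice `ddof=1` (divisor n-1 rather than the 1/n shown in the formulas) is an assumption made because it appears to reproduce the variance and covariance estimates reported in the note. The divisor cancels in the correlation coefficient, so that estimate is unaffected either way.

```python
import numpy as np

# Annual data for 2000-2009 as transcribed in the note.
x = np.array([29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8])  # cheese consumption
y = np.array([327, 456, 509, 497, 596, 573, 661, 741, 809, 717], dtype=float)  # bedsheet deaths


def sample_moments(x, y, ddof=0):
    """Return (mean_x, mean_y, var_x, var_y, cov_xy, rho).

    ddof=0 divides by n, as in the displayed estimators;
    ddof=1 divides by n - 1. The divisor cancels in rho.
    """
    n = len(x)
    dx, dy = x - x.mean(), y - y.mean()      # deviations from the (constant) sample means
    var_x = (dx ** 2).sum() / (n - ddof)
    var_y = (dy ** 2).sum() / (n - ddof)
    cov_xy = (dx * dy).sum() / (n - ddof)
    rho = cov_xy / np.sqrt(var_x * var_y)
    return x.mean(), y.mean(), var_x, var_y, cov_xy, rho


x_bar, y_bar, var_x, var_y, cov_xy, rho_hat = sample_moments(x, y, ddof=1)
print(f"means: {x_bar:.2f}, {y_bar:.1f}")
print(f"variances: {var_x:.3f}, {var_y:.3f}  covariance: {cov_xy:.3f}")
print(f"correlation: {rho_hat:.4f}")
```

Calling `sample_moments(x, y, ddof=0)` changes the variances and the covariance but returns the same correlation.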
  • 2. In the case of the above data the resulting estimates are:
\[ \bar{x}=31.52,\quad \bar{y}=588.6,\quad \hat{\sigma}_x^2=1.515,\quad \hat{\sigma}_y^2=21619.2,\quad \hat{\sigma}_{xy}=171.409, \]
giving rise to the estimated correlation coefficient:
\[ \hat{\rho}=\frac{171.409}{\sqrt{(1.515)(21619.156)}}=.94713. \]

The key issue for this inference result is whether it is reliable or statistically spurious. This is irrespective of whether it is substantively meaningful or not. Statistical spuriousness arises easily when any of the probabilistic assumptions imposed on the data, so as to ensure that the employed inference procedures have the properties assumed, are invalid; these properties were derived assuming these assumptions are valid.

What is often insufficiently appreciated in practice is that the above estimators $\left(\bar{X},\ \bar{Y},\ \hat{\sigma}_x^2,\ \hat{\sigma}_y^2,\ \hat{\sigma}_{xy}\right)$ of the unknown parameters $\left(\mu_x,\ \mu_y,\ \sigma_x^2,\ \sigma_y^2,\ \sigma_{xy}\right)$ are minimally ‘good’ (consistent) when the stochastic process $\{\mathbf{Z}_t:=(X_t,Y_t),\ t=1,2,\ldots,n,\ldots\}$ underlying the data $\mathbf{z}_0:=\{(x_t,y_t),\ t=1,2,\ldots,n\}$ is Independent and Identically Distributed (IID). When these assumptions are supplemented with the assumption of Normality, in addition to being consistent, the above estimators of $(\mu_x,\mu_y)$ are also unbiased, sufficient and fully efficient, and the estimators of $(\sigma_x^2,\sigma_y^2,\sigma_{xy})$ are sufficient and asymptotically efficient.

More formally, the implicit statistical model for correlation analysis is a simple bivariate Normal model:
\[ \mathbf{Z}_t \sim \mathrm{NIID}(\boldsymbol{\mu},\boldsymbol{\Sigma}),\quad t=1,2,\ldots,n,\ldots \tag{1} \]
where
\[ \mathbf{Z}_t:=\begin{pmatrix} X_t \\ Y_t \end{pmatrix},\quad \boldsymbol{\mu}:=\begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix},\quad \boldsymbol{\Sigma}:=\begin{pmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{pmatrix}. \]
  • 3. The NIID probabilistic assumptions for the process $\{\mathbf{Z}_t:=(X_t,Y_t),\ t=1,2,\ldots,n,\ldots\}$ can be used to derive a N-P test for the hypotheses:
\[ H_0:\ \rho=0 \quad\text{vs.}\quad H_1:\ \rho\neq 0 \tag{2} \]
that was originally proposed by Fisher (1915) and is based on the test statistic:
\[ \tau(\mathbf{Z})=\frac{\sqrt{n-2}\,\hat{\rho}}{\sqrt{1-\hat{\rho}^2}}\ \overset{\rho=0}{\sim}\ \mathrm{St}(n-2), \]
where "St(n-2)" denotes a Student’s t distribution with (n-2) degrees of freedom. Applying this test using the above data yields:
\[ \tau(\mathbf{z}_0)=\frac{\sqrt{8}\,(.94713)}{\sqrt{1-(.94713)^2}}=8.349,\qquad P\left(|\tau(\mathbf{Z})|>|\tau(\mathbf{z}_0)|;\ \rho=0\right)=.000034. \]
The p-value of .000034 indicates that $\rho$ is statistically (highly) significant:
\[ \hat{\rho}=.94713\,[.000034] \tag{3} \]

What does linear regression have to do with correlation?

The above correlation results are directly related to the results based on a simple regression of $y_t$ on $x_t$:
\[ y_t=\beta_0+\beta_1 x_t+u_t,\quad t=1,2,\ldots,n, \tag{4} \]
since the correlation coefficient $\rho$ is related to the regression coefficient $\beta_1$:
\[ \beta_1=\frac{Cov(X_t,Y_t)}{Var(X_t)} \quad\text{via}\quad \rho=\beta_1\left(\frac{\sqrt{Var(X_t)}}{\sqrt{Var(Y_t)}}\right). \tag{5} \]
In the case of the above data, the estimated regression (standard errors in parentheses) is:
\[ y_t=\underset{(427.6)}{-2977.3}+\underset{(13.56)}{113.13}\,x_t+\hat{u}_t,\quad R^2=.897,\ s=50.056,\ n=10, \tag{6} \]
with the t-test for the significance of $\beta_1$ yielding:
\[ \tau_1(\mathbf{z}_0)=\frac{113.13}{13.56}=8.349\,[.000034]. \tag{7} \]
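The equality of the two test statistics in (3) and (7) can be checked numerically. The following is a minimal sketch, assuming only NumPy is available; the helper `student_t_two_sided_p` is a hypothetical name that approximates the two-sided Student's t p-value by trapezoidal integration of the density, a choice made here to avoid extra dependencies rather than anything prescribed by the note.

```python
import math
import numpy as np

x = np.array([29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8])
y = np.array([327, 456, 509, 497, 596, 573, 661, 741, 809, 717], dtype=float)
n = len(x)


def student_t_two_sided_p(t_value, df, grid=200_001):
    """Approximate P(|T| > |t_value|) for T ~ St(df) by integrating the density on [0, |t|]."""
    u = np.linspace(0.0, abs(t_value), grid)
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = const * (1 + u ** 2 / df) ** (-(df + 1) / 2)
    h = u[1] - u[0]
    area = h * (pdf.sum() - 0.5 * (pdf[0] + pdf[-1]))  # trapezoidal rule
    return 1.0 - 2.0 * area


dx, dy = x - x.mean(), y - y.mean()

# Correlation coefficient and the t-type statistic for H0: rho = 0.
rho_hat = (dx * dy).sum() / math.sqrt((dx ** 2).sum() * (dy ** 2).sum())
tau_rho = math.sqrt(n - 2) * rho_hat / math.sqrt(1 - rho_hat ** 2)

# Simple OLS regression of y on x and the t-statistic for the slope.
b1 = (dx * dy).sum() / (dx ** 2).sum()
b0 = y.mean() - b1 * x.mean()
resid = y - b0 - b1 * x
s = math.sqrt((resid ** 2).sum() / (n - 2))   # residual standard error
se_b1 = s / math.sqrt((dx ** 2).sum())
tau_b1 = b1 / se_b1

p_value = student_t_two_sided_p(tau_rho, n - 2)
print(f"rho_hat={rho_hat:.4f}, tau_rho={tau_rho:.3f}, tau_b1={tau_b1:.3f}, p={p_value:.6f}")
```

Small differences from the figures printed in the note are to be expected, since the note computes from rounded intermediate values.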
  • 4. This result is directly related to the significance of the correlation coefficient, since:
\[ \hat{\rho}=\hat{\beta}_1\left(\frac{\hat{\sigma}_x}{\hat{\sigma}_y}\right)=(113.13)\left(\frac{\sqrt{1.515}}{\sqrt{21619.156}}\right)=.947\,[.000034]. \tag{8} \]
This confirms the above result that the two random variables ‘appear’ to be highly correlated! But are they? Are the above inference results reliable?

Looking at the t-plots of the above data in figures 1-2, it is clear that the mean of both data series is not constant; it is trending upwards. This indicates that at least one of the invoked assumptions, the ID [constant mean, variance and covariance], is invalid for this data!

[Fig. 1: t-plot of the $x_t$ data (x against date)]
[Fig. 2: t-plot of the $y_t$ data (y against date)]

What does this departure imply for the above estimation and testing results?

First, $\bar{X}=\frac{1}{n}\sum_{t=1}^{n}X_t$ and $\bar{Y}=\frac{1}{n}\sum_{t=1}^{n}Y_t$ will be inconsistent estimators of the ‘true’ means $\left(E(X_t)=\mu_x(t),\ E(Y_t)=\mu_y(t)\right)$, which, in light of figures 1-2, change (increase) with $t$; the true means appear to be better described using a linear trend, i.e.
\[ E(X_t)=\alpha_0+\alpha_1 t \quad\text{and}\quad E(Y_t)=\gamma_0+\gamma_1 t. \tag{9} \]

Second, the inconsistency of $\left(\bar{X},\bar{Y}\right)$ implies that the estimators of all the above linear regression parameters:
\[ \hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x},\qquad \hat{\beta}_1=\frac{\sum_{t=1}^{n}(y_t-\bar{y})(x_t-\bar{x})}{\sum_{t=1}^{n}(x_t-\bar{x})^2},\qquad s^2=\frac{1}{n-2}\sum_{t=1}^{n}\hat{u}_t^2, \tag{10} \]
  • 5. as well as the correlation coefficient, are also inconsistently estimated because all the formulae involve deviations of the form $(x_t-\bar{x})$ and $(y_t-\bar{y})$, which are deviations from the wrong (constant) mean!

Third, all inferences based on the estimated correlation coefficient and linear regression model are likely to be unreliable and the significance results statistically spurious. That is, equally unreliable will be the t-test statistic $\tau_1(\mathbf{z}_0)$ and the goodness-of-fit measure $R^2$:
\[ \tau_1(\mathbf{z}_0)=\frac{\sqrt{\sum_{t=1}^{n}(x_t-\bar{x})^2}\,\left(\hat{\beta}_1\right)}{s},\qquad R^2=1-\frac{\sum_{t=1}^{n}\hat{u}_t^2}{\sum_{t=1}^{n}(y_t-\bar{y})^2}, \tag{11} \]
because they also involve the erroneous deviations $(x_t-\bar{x}),\ (y_t-\bar{y})$ instead of the more appropriate deviations $(x_t-\hat{\alpha}_0-\hat{\alpha}_1 t),\ (y_t-\hat{\gamma}_0-\hat{\gamma}_1 t)$.

What is one supposed to do after establishing that the above significance results are likely to be statistically spurious?

Addressing the spuriousness of inference results

A way to secure the reliability of inference and sidestep the statistically spurious results is to account for the trending mean in the above data. This can be achieved by detrending the original data using the following auxiliary regressions (standard errors in parentheses):
\[ x_t=\underset{(.240)}{29.367}+\underset{(.039)}{.392}\,t+\hat{u}_{1t},\quad R^2=.927,\ s=.352,\ n=10, \tag{12} \]
\[ y_t=\underset{(33.36)}{334.93}+\underset{(5.378)}{46.12}\,t+\hat{u}_{2t},\quad R^2=.902,\ s=48.84,\ n=10. \tag{13} \]
The detrended data are evaluated using:
\[ \left(\tilde{x}_t=x_t-.392\,t,\quad \tilde{y}_t=y_t-46.12\,t\right),\quad t=1,2,\ldots,n. \]
The t-plots of the detrended data $\{(\tilde{x}_t,\tilde{y}_t),\ t=1,2,\ldots,n\}$ are shown in figures 3-4. The estimated variances and covariance for the detrended data are:
\[ \hat{\sigma}_{\tilde{x}}^2=.110,\quad \hat{\sigma}_{\tilde{y}}^2=2120.132,\quad \hat{\sigma}_{\tilde{x}\tilde{y}}=5.885. \]
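The detrending step can be sketched as follows. This is an illustration under assumptions, not the author's code: the helper names `fit_linear_trend` and `moments` are hypothetical, the time index is taken to be t = 1, ..., 10, and the n-1 divisor is assumed because it appears to match the reported moments. As in the note, only the trend term is subtracted, so the detrended series keep their intercept level.

```python
import numpy as np

x = np.array([29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8])
y = np.array([327, 456, 509, 497, 596, 573, 661, 741, 809, 717], dtype=float)
t = np.arange(1, len(x) + 1, dtype=float)   # assumed time index t = 1, ..., n


def fit_linear_trend(series, t):
    """OLS of `series` on a constant and t; returns (intercept, slope)."""
    dt = t - t.mean()
    slope = (dt * (series - series.mean())).sum() / (dt ** 2).sum()
    intercept = series.mean() - slope * t.mean()
    return intercept, slope


def moments(u, v):
    """Variances and covariance with the n-1 divisor (an assumption, see lead-in)."""
    n = len(u)
    du, dv = u - u.mean(), v - v.mean()
    return (du ** 2).sum() / (n - 1), (dv ** 2).sum() / (n - 1), (du * dv).sum() / (n - 1)


a0, a1 = fit_linear_trend(x, t)   # auxiliary regression for x
g0, g1 = fit_linear_trend(y, t)   # auxiliary regression for y

# Subtract only the estimated trend term from each series.
x_d = x - a1 * t
y_d = y - g1 * t

var_xd, var_yd, cov_d = moments(x_d, y_d)
var_x, var_y, cov_xy = moments(x, y)

print(f"x trend: {a0:.3f} + {a1:.3f} t;  y trend: {g0:.2f} + {g1:.2f} t")
print(f"detrended: var_x={var_xd:.3f}, var_y={var_yd:.3f}, cov={cov_d:.3f}")
print(f"original:  var_x={var_x:.3f}, var_y={var_y:.3f}, cov={cov_xy:.3f}")
```

Printing the original moments next to the detrended ones makes the size of the reduction visible at a glance.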
  • 6. It is worth noting how ‘inflated’ the original estimates of these moments were:
\[ \hat{\sigma}_x^2=1.515,\quad \hat{\sigma}_y^2=21619.2,\quad \hat{\sigma}_{xy}=171.409, \]
when compared to the above estimates.

[Fig. 3: t-plot of the detrended $x_t$ data (xd against date)]
[Fig. 4: t-plot of the detrended $y_t$ data (yd against date)]

Testing the significance of the correlation coefficient using the detrended data yields:
\[ \hat{\rho}=\frac{5.885}{\sqrt{(.11)(2120.132)}}=.385\,[.271], \tag{14} \]
with the p-value indicating that $\rho$ is statistically insignificant, contradicting the previous result! There is no statistical correlation between these two variables after all! The source of the statistically unreliable results based on (6) is the fact that the misspecification due to the presence of trends usually induces sizeable discrepancies between actual and nominal error probabilities. Applying a .05 significance level test when the actual type I error is greater than .90 will lead an inference astray.

Correlation or Linear Regression?

An equivalent way to derive the above inference result is to estimate the linear regression between the detrended data (standard errors in parentheses):
\[ \tilde{y}_t=\underset{(1330)}{-1236}+\underset{(45.29)}{53.5}\,\tilde{x}_t+\hat{u}_t,\quad R^2=.148,\ s=45.066,\ n=10. \tag{15} \]
The t-test for the statistical significance of the coefficient of $\tilde{x}_t$ yields:
\[ \tau_1(\mathbf{z}_0)=\frac{53.5}{45.29}=1.181\,[.271]. \tag{16} \]
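The results in (14)-(16) can be reproduced with a short sketch. Assumptions are as before: t = 1, ..., 10 is the time index, `np.polyfit` is used for the two trend fits as a convenience, and the hypothetical helper `student_t_two_sided_p` approximates the Student's t p-value by numerical integration instead of relying on a statistics library.

```python
import math
import numpy as np

x = np.array([29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8])
y = np.array([327, 456, 509, 497, 596, 573, 661, 741, 809, 717], dtype=float)
n = len(x)
t = np.arange(1, n + 1, dtype=float)   # assumed time index


def student_t_two_sided_p(t_value, df, grid=200_001):
    """Approximate P(|T| > |t_value|) for T ~ St(df) via the trapezoidal rule."""
    u = np.linspace(0.0, abs(t_value), grid)
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = const * (1 + u ** 2 / df) ** (-(df + 1) / 2)
    h = u[1] - u[0]
    return 1.0 - 2.0 * h * (pdf.sum() - 0.5 * (pdf[0] + pdf[-1]))


# Detrend: np.polyfit returns [slope, intercept] for a degree-1 fit.
slope_x, _ = np.polyfit(t, x, 1)
slope_y, _ = np.polyfit(t, y, 1)
x_d = x - slope_x * t
y_d = y - slope_y * t

dx, dy = x_d - x_d.mean(), y_d - y_d.mean()

# Correlation of the detrended series and its significance test.
rho_d = (dx * dy).sum() / math.sqrt((dx ** 2).sum() * (dy ** 2).sum())
tau_rho = math.sqrt(n - 2) * rho_d / math.sqrt(1 - rho_d ** 2)
p_value = student_t_two_sided_p(tau_rho, n - 2)

# Regression of detrended y on detrended x: slope and its t-statistic.
b1 = (dx * dy).sum() / (dx ** 2).sum()
b0 = y_d.mean() - b1 * x_d.mean()
resid = y_d - b0 - b1 * x_d
s = math.sqrt((resid ** 2).sum() / (n - 2))
tau_b1 = b1 / (s / math.sqrt((dx ** 2).sum()))

print(f"rho={rho_d:.3f}, tau_rho={tau_rho:.3f}, tau_b1={tau_b1:.3f}, p={p_value:.3f}")
```

The two statistics coincide, so the regression route and the correlation route lead to the same p-value on the detrended data.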
  • 7. Notice that the p-value is identical to the one for the correlation coefficient based on the detrended data. This is not accidental. As shown above, the correlation coefficient is a simple reparameterization of the regression coefficient. Given that $\hat{\sigma}_{\tilde{x}}=\sqrt{.110}=.3316$ and $\hat{\sigma}_{\tilde{y}}=\sqrt{2120.132}=46.05$, it follows that:
\[ \hat{\rho}=\hat{\beta}_1\left(\frac{\hat{\sigma}_{\tilde{x}}}{\hat{\sigma}_{\tilde{y}}}\right)=53.5\left(\frac{.3316}{46.05}\right)=.385\,[.271]. \tag{17} \]

In light of the above discussion, one might be forgiven for thinking that the statistical spuriousness was remedied by transmuting (or cleaning up) our data to comply with the invoked assumptions of the underlying statistical model in (1). Although there is an element of truth in this viewpoint, it is much too narrow to be illuminating enough for the broader issues raised by statistically spurious results stemming from different departures from the model assumptions. For instance, how does one transmute the data when Normality is false but $\rho$ is the parameter of interest?

A broader, and more illuminating, perspective is provided by viewing the remedy as stemming from respecifying the original model to account for the statistical systematic information in data $\mathbf{z}_0$ that was not accounted for by the original statistical model. In the above case the (unaccounted) systematic information came in the form of trending means, which can be accounted for by the following respecified linear regression that includes a trend term to reflect (9):
\[ y_t=\beta_0+\delta t+\beta_1 x_t+u_t,\quad t=1,2,\ldots,n. \tag{18} \]
Estimating this respecified regression using the original data yields (standard errors in parentheses):
\[ y_t=\underset{(1422)}{-1236}+\underset{(19.68)}{25.18}\,t+\underset{(48.42)}{53.5}\,x_t+\hat{u}_t,\quad R^2=.916,\ s=48.178,\ n=10. \tag{19} \]
This estimated regression is interesting because the coefficient estimates of $(\beta_0,\beta_1)$ are identical, and their standard errors and $s$ are almost
  • 8. identical, to those of the linear regression between the detrended data in (15). Indeed, the discrepancies between them are due to numerical approximation errors. Despite the minor differences in these estimates, the inference results pertaining to the parameters $\left(\beta_0,\beta_1,\sigma^2\right)$ are identical. This includes the t-test for the significance of $\beta_1$:
\[ \tau_1(\mathbf{z}_0)=\frac{53.5}{48.42}=1.105\,[.301], \tag{20} \]
confirming that $\beta_1$ is statistically insignificant.

In conclusion, it is important to emphasize that in the above example the sample size is too small ($n=10$) to test for departures from the other probabilistic assumptions. This is especially true for the assumption of independence. However, if other departures are present in this data, they would undermine the reliability of the original inference results based on (6) even further. In practice, one should do empirical modeling with a large enough $n$ to enable one to test the validity of all the model assumptions using trenchant misspecification testing. If any departures are detected, one should then respecify the original model to account for all the systematic statistical information not accounted for. The respecification is considered successful when the respecified model turns out to be statistically adequate: all its probabilistic assumptions are shown to be valid for data $\mathbf{z}_0$, i.e. the statistical model accounts for all the chance regularity patterns in the data.
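The respecified regression with a trend term can be estimated directly on the original data. The sketch below assumes t = 1, ..., 10 as the trend variable and uses `np.linalg.lstsq` with the textbook OLS covariance formula; the variable names are illustrative only. It also compares the coefficient on $x_t$ with the slope obtained from the detrended series.

```python
import numpy as np

x = np.array([29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8])
y = np.array([327, 456, 509, 497, 596, 573, 661, 741, 809, 717], dtype=float)
n = len(y)
t = np.arange(1, n + 1, dtype=float)   # assumed trend variable

# Design matrix for y_t = b0 + d*t + b1*x_t + u_t.
X = np.column_stack([np.ones(n), t, x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ beta
df = n - X.shape[1]                       # n - 3 residual degrees of freedom
s2 = resid @ resid / df                   # residual variance estimate
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
t_stats = beta / se
r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

# Slope from regressing detrended y on detrended x, for comparison.
dt = t - t.mean()
x_res = (x - x.mean()) - dt * (dt @ (x - x.mean())) / (dt @ dt)
y_res = (y - y.mean()) - dt * (dt @ (y - y.mean())) / (dt @ dt)
slope_detrended = (x_res @ y_res) / (x_res @ x_res)

print("coefficients (const, t, x):", np.round(beta, 2))
print("standard errors:           ", np.round(se, 2))
print(f"t-stat on x: {t_stats[2]:.3f}, s={np.sqrt(s2):.3f}, R^2={r2:.3f}")
print(f"slope from detrended data: {slope_detrended:.3f}")
```

In this sketch the coefficient on $x_t$ reproduces the detrended-data slope to machine precision, while its standard error comes out somewhat larger; that gap appears to reflect the residual degrees of freedom (n-3 here against n-2 in the detrended regression) more than rounding.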