A Note on Correlation Analysis:
statistically spurious vs. non-spurious results
Aris Spanos
This note uses an empirical example to illustrate (i) how one can inadvertently derive spurious correlation results and (ii) how such results can be transformed into statistically reliable ones.
The example uses the following annual data for the period 2000-2009:
− Per capita consumption of cheese in the US ($x_t$)
− Number of people who died by becoming tangled in their bedsheets ($y_t$)

year:  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009
$x_t$: 29.8  30.1  30.5  30.6  31.3  31.7  32.6  33.1  32.7  32.8
$y_t$: 327   456   509   497   596   573   661   741   809   717
The data were downloaded from the following web site:
http://www.tylervigen.com/
The original question of interest: are these variables correlated?
The correlation coefficient between two random variables $X$ and $Y$ is defined by:
$$\rho=\frac{Cov(X,Y)}{\sqrt{Var(X)\,Var(Y)}}=\frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sqrt{E[(X-\mu_X)^2]\,E[(Y-\mu_Y)^2]}}=\frac{\sigma_{XY}}{\sigma_X\cdot\sigma_Y}$$
This parameter is usually estimated using the following estimator:
$$\hat{\rho}=\frac{\frac{1}{n}\sum_{t=1}^{n}(x_t-\bar{x})(y_t-\bar{y})}{\sqrt{\left[\frac{1}{n}\sum_{t=1}^{n}(x_t-\bar{x})^2\right]\left[\frac{1}{n}\sum_{t=1}^{n}(y_t-\bar{y})^2\right]}}$$
where $\bar{x}=\frac{1}{n}\sum_{t=1}^{n}x_t$ and $\bar{y}=\frac{1}{n}\sum_{t=1}^{n}y_t$ are the estimators of the means $\left(\mu_X=E(X),\ \mu_Y=E(Y)\right)$; $\hat{\sigma}_X^2=\frac{1}{n}\sum_{t=1}^{n}(x_t-\bar{x})^2$ and $\hat{\sigma}_Y^2=\frac{1}{n}\sum_{t=1}^{n}(y_t-\bar{y})^2$ are the estimators of the variances $\left(\sigma_X^2=E(X-\mu_X)^2,\ \sigma_Y^2=E(Y-\mu_Y)^2\right)$; and $\hat{\sigma}_{XY}=\frac{1}{n}\sum_{t=1}^{n}(x_t-\bar{x})(y_t-\bar{y})$ is the estimator of the covariance $\sigma_{XY}=E[(X-\mu_X)(Y-\mu_Y)]$.
In the case of the above data the resulting estimates are:
$$\bar{x}=31.52,\quad \bar{y}=588.6,\quad \hat{\sigma}_X^2=1.515,\quad \hat{\sigma}_Y^2=21619.2,\quad \hat{\sigma}_{XY}=171.409$$
giving rise to the estimated correlation coefficient:
$$\hat{\rho}=\frac{171.409}{\sqrt{(1.515)(21619.156)}}=0.94713$$
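As a numerical cross-check (not part of the original note), these moments can be reproduced in Python with NumPy. The reported variances and covariance match the $(n-1)$-divisor convention; the correlation coefficient itself is identical under the $1/n$ divisor used in the formulae above.

```python
# Numerical cross-check of the reported sample moments and correlation.
# ddof=1 (the n-1 divisor) matches the figures quoted in the note.
import numpy as np

# Annual data, 2000-2009 (from tylervigen.com)
x = np.array([29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8])
y = np.array([327, 456, 509, 497, 596, 573, 661, 741, 809, 717], dtype=float)

xbar, ybar = x.mean(), y.mean()        # 31.52, 588.6
s2x = x.var(ddof=1)                    # ~1.515
s2y = y.var(ddof=1)                    # ~21619.16
sxy = np.cov(x, y, ddof=1)[0, 1]       # ~171.41
rho = sxy / np.sqrt(s2x * s2y)         # ~0.9471
print(xbar, ybar, s2x, s2y, sxy, rho)
```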
The key issue for this inference result is whether it is reliable or statistically spurious, irrespective of whether it is substantively meaningful or not. Statistical spuriousness arises easily when any of the probabilistic assumptions imposed on the data are invalid; these assumptions are imposed so as to ensure that the employed inference procedures have the properties attributed to them, since those properties were derived assuming the assumptions are valid.
What is often insufficiently appreciated in practice is that the above estimators:
$$\bar{x},\ \bar{y},\ \hat{\sigma}_X^2,\ \hat{\sigma}_Y^2,\ \hat{\sigma}_{XY},\ \hat{\rho}$$
of the unknown parameters:
$$\left(\mu_X,\ \mu_Y,\ \sigma_X^2,\ \sigma_Y^2,\ \sigma_{XY},\ \rho\right)$$
are minimally ‘good’ (consistent) when the stochastic process $\{\mathbf{Z}_t:=(X_t,Y_t),\ t=1,2,\ldots,n\}$ underlying the data $\mathbf{z}_0:=\{(x_t,y_t),\ t=1,2,\ldots,n\}$ is Independent and Identically Distributed (IID). When these assumptions are supplemented with the assumption of Normality, the above estimators of $(\mu_X,\mu_Y)$ are, in addition to being consistent, also unbiased, sufficient and fully efficient, and the estimators of $(\sigma_X^2,\sigma_Y^2,\sigma_{XY},\rho)$ are sufficient and asymptotically efficient.
More formally, the implicit statistical model for correlation analysis is a simple bivariate Normal model:
$$\mathbf{Z}_t \sim \mathrm{NIID}(\boldsymbol{\mu},\boldsymbol{\Sigma}),\quad t=1,2,\ldots,n \qquad (1)$$
where $\mathbf{Z}_t:=\begin{pmatrix} X_t \\ Y_t \end{pmatrix},\quad \boldsymbol{\mu}:=\begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix},\quad \boldsymbol{\Sigma}:=\begin{pmatrix} \sigma_X^2 & \sigma_{XY} \\ \sigma_{XY} & \sigma_Y^2 \end{pmatrix}$.
The NIID probabilistic assumptions for the process $\{\mathbf{Z}_t:=(X_t,Y_t),\ t=1,2,\ldots,n\}$ can be used to derive a N-P test for the hypotheses:
$$H_0:\ \rho=0\ \ \text{vs.}\ \ H_1:\ \rho\neq 0 \qquad (2)$$
that was originally proposed by Fisher (1915) and is based on the test statistic:
$$\tau(\mathbf{Z})=\frac{\sqrt{n-2}\,\hat{\rho}}{\sqrt{1-\hat{\rho}^2}}\ \overset{\rho=0}{\sim}\ \mathrm{St}(n-2)$$
where $\mathrm{St}(n-2)$ denotes a Student's t distribution with $(n-2)$ degrees of freedom. Applying this test using the above data yields:
$$\tau(\mathbf{z}_0)=\frac{\sqrt{8}\,(0.94713)}{\sqrt{1-(0.94713)^2}}=8.349,\qquad P(|\tau(\mathbf{Z})|>|\tau(\mathbf{z}_0)|;\ \rho=0)=0.000034$$
The p-value of .000034 indicates that $\rho$ is statistically (highly) significant:
$$\hat{\rho}=0.94713\,[0.000034] \qquad (3)$$
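The test statistic and p-value can be reproduced numerically; a sketch assuming SciPy is available (`scipy.stats.t.sf` gives the Student's t upper-tail probability):

```python
# Fisher's correlation t-test: tau = sqrt(n-2)*rho_hat / sqrt(1 - rho_hat^2),
# referred to a Student's t distribution with n-2 degrees of freedom.
import math
from scipy import stats

n = 10
rho_hat = 0.94713                     # estimate reported in the note
tau = math.sqrt(n - 2) * rho_hat / math.sqrt(1 - rho_hat**2)   # ~8.349
pval = 2 * stats.t.sf(abs(tau), df=n - 2)                      # two-sided p-value
print(tau, pval)
```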
What does linear regression have to do with correlation?
The above correlation results are directly related to the results based on a simple regression of $y_t$ on $x_t$:
$$y_t = \beta_0 + \beta_1 x_t + u_t,\quad t=1,2,\ldots,n \qquad (4)$$
since the correlation coefficient is related to the regression coefficient $\beta_1$:
$$\beta_1=\frac{Cov(X,Y)}{Var(X)},\ \ \text{via}\ \ \rho=\beta_1\left(\frac{\sqrt{Var(X)}}{\sqrt{Var(Y)}}\right) \qquad (5)$$
In the case of the above data, the estimated regression is:
$$y_t = \underset{(427.6)}{-2977.3} + \underset{(13.56)}{113.13}\,x_t + \hat{u}_t,\qquad R^2=.897,\ s=50.056,\ n=10 \qquad (6)$$
with the t-test for the significance of $\beta_1$ yielding:
$$\tau_{\beta_1}(\mathbf{z}_0)=\frac{113.13}{13.56}=8.349\,[0.000034] \qquad (7)$$
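The estimated regression can be checked with ordinary least squares in NumPy (a sketch; the standard-error formula below is textbook OLS algebra, not code from the note):

```python
# OLS check of the simple regression y_t = b0 + b1*x_t + u_t.
import numpy as np

x = np.array([29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8])
y = np.array([327, 456, 509, 497, 596, 573, 661, 741, 809, 717], dtype=float)
n = len(x)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # b0 ~ -2977.3, b1 ~ 113.13
resid = y - X @ beta
s2 = resid @ resid / (n - 2)                       # s ~ 50.06
se_b1 = np.sqrt(s2 / ((x - x.mean())**2).sum())    # ~13.56
t_b1 = beta[1] / se_b1                             # ~8.35, matching the rho test
print(beta, np.sqrt(s2), t_b1)
```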
This result is directly related to the significance of the correlation coefficient, since:
$$\hat{\rho}=\hat{\beta}_1\left(\frac{\hat{\sigma}_X}{\hat{\sigma}_Y}\right)=(113.13)\left(\frac{\sqrt{1.515}}{\sqrt{21619.156}}\right)=0.947\,[0.000034] \qquad (8)$$
This confirms the above result that the two random variables ‘appear’ to be highly correlated!
But are they? Are the above inference results reliable?
Looking at the t-plots of the above data in figures 1-2, it is clear that the mean of each data series is not constant; both are trending upwards. This indicates that at least one of the invoked assumptions, the ID [constant mean, variance and covariance], is invalid for these data!
[Fig. 1: t-plot of the cheese-consumption data $x_t$, trending upwards from about 30 to 33.5]
[Fig. 2: t-plot of the bedsheet-deaths data $y_t$, trending upwards from about 300 to 800]
What does this departure imply for the above estimation and testing results? First, $\bar{x}=\frac{1}{n}\sum_{t=1}^{n}x_t$ and $\bar{y}=\frac{1}{n}\sum_{t=1}^{n}y_t$ will be inconsistent estimators of the ‘true’ means:
$$\left(E(X_t)=\mu_X(t),\ \ E(Y_t)=\mu_Y(t)\right)$$
which, in light of figures 1-2, change (increase) with $t$; the true means appear to be better described using a linear trend, i.e.
$$\mu_X(t)=\alpha_0+\alpha_1 t\ \ \text{and}\ \ \mu_Y(t)=\gamma_0+\gamma_1 t \qquad (9)$$
Second, the inconsistency of $\left(\bar{x},\ \bar{y}\right)$ implies that the estimators of all the above linear regression parameters:
$$\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x},\qquad \hat{\beta}_1=\frac{\sum_{t=1}^{n}(x_t-\bar{x})(y_t-\bar{y})}{\sum_{t=1}^{n}(x_t-\bar{x})^2},\qquad s^2=\frac{1}{n-2}\sum_{t=1}^{n}\hat{u}_t^2 \qquad (10)$$
as well as the correlation coefficient, are also inconsistently estimated, because all the formulae involve deviations of the form $(x_t-\bar{x})$ and $(y_t-\bar{y})$, which are deviations from the wrong (constant) mean!
Third, all inferences based on the estimated correlation coefficient and linear regression model are likely to be unreliable and the significance results statistically spurious. That is, equally unreliable will be the t-test statistic $\tau_{\beta_1}(\mathbf{z}_0)$ and the goodness-of-fit measure $R^2$:
$$\tau_{\beta_1}(\mathbf{z}_0)=\frac{\sqrt{\sum_{t=1}^{n}(x_t-\bar{x})^2}\;\hat{\beta}_1}{s},\qquad R^2=1-\frac{\sum_{t=1}^{n}\hat{u}_t^2}{\sum_{t=1}^{n}(y_t-\bar{y})^2} \qquad (11)$$
because they also involve the erroneous deviations:
$$(x_t-\bar{x}),\ \ (y_t-\bar{y})$$
instead of the more appropriate deviations:
$$(x_t-\hat{\alpha}_0-\hat{\alpha}_1 t),\ \ (y_t-\hat{\gamma}_0-\hat{\gamma}_1 t)$$
What is one supposed to do after establishing that the above significance results are likely to be statistically spurious?
Addressing the spuriousness of inference results
A way to secure the reliability of inference and sidestep the statistically spurious results is to account for the trending mean in the above data. This can be achieved by detrending the original data using the following auxiliary regressions:
$$x_t = \underset{(0.240)}{29.367} + \underset{(0.039)}{0.392}\,t + \hat{u}_{1t},\qquad R^2=.927,\ s=.352,\ n=10 \qquad (12)$$
$$y_t = \underset{(33.36)}{334.93} + \underset{(5.378)}{46.12}\,t + \hat{u}_{2t},\qquad R^2=.902,\ s=48.84,\ n=10 \qquad (13)$$
The detrended data are evaluated using:
$$\left(\tilde{x}_t = x_t - 0.392\,t,\ \ \tilde{y}_t = y_t - 46.12\,t\right),\quad t=1,2,\ldots,n$$
The t-plots of the detrended data $\{(\tilde{x}_t,\tilde{y}_t),\ t=1,2,\ldots,n\}$ are shown in figures 3-4.
The estimated variances and covariance for the detrended data are:
$$\hat{\sigma}_{\tilde{x}}^2=0.110,\qquad \hat{\sigma}_{\tilde{y}}^2=2120.132,\qquad \hat{\sigma}_{\tilde{x}\tilde{y}}=5.885$$
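These detrended moments can be reproduced numerically; a sketch (not from the note itself) assuming NumPy, with the trend slopes estimated by least squares and the $(n-1)$-divisor convention matching the reported figures:

```python
# Detrend both series as in the note and recompute the second moments.
import numpy as np

x = np.array([29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8])
y = np.array([327, 456, 509, 497, 596, 573, 661, 741, 809, 717], dtype=float)
t = np.arange(1, 11, dtype=float)

bx = np.polyfit(t, x, 1)[0]   # trend slope of x_t, ~0.392
by = np.polyfit(t, y, 1)[0]   # trend slope of y_t, ~46.12
x_d = x - bx * t              # detrended series, as in the note
y_d = y - by * t

s2x_d = x_d.var(ddof=1)                  # ~0.110
s2y_d = y_d.var(ddof=1)                  # ~2120.1
sxy_d = np.cov(x_d, y_d, ddof=1)[0, 1]   # ~5.885
rho_d = sxy_d / np.sqrt(s2x_d * s2y_d)   # ~0.385, versus ~0.947 before detrending
print(bx, by, s2x_d, s2y_d, sxy_d, rho_d)
```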
It is worth noting how ‘inflated’ the original estimates of these moments were:
$$\hat{\sigma}_X^2=1.515,\qquad \hat{\sigma}_Y^2=21619.2,\qquad \hat{\sigma}_{XY}=171.409$$
when compared to the above estimates.
[Fig. 3: t-plot of the detrended data $\tilde{x}_t$]
[Fig. 4: t-plot of the detrended data $\tilde{y}_t$]
Testing the significance of the correlation coefficient using the detrended data yields:
$$\hat{\rho}=\frac{5.885}{\sqrt{(0.110)(2120.132)}}=0.385\,[.271] \qquad (14)$$
with the p-value indicating that $\rho$ is statistically insignificant; contradicting the previous result! There is no statistical correlation between these two variables after all! The source of the statistically unreliable results based on (6) is the fact that the misspecification due to the presence of trends usually induces sizeable discrepancies between actual and nominal error probabilities. Applying a .05 significance level test when the actual type I error is greater than .90 will lead an inference astray.
Correlation or Linear Regression? An equivalent way to derive the above inference result is to estimate the linear regression between the detrended data:
$$\tilde{y}_t = \underset{(1330.0)}{-1236.0} + \underset{(45.29)}{53.5}\,\tilde{x}_t + \hat{u}_t,\qquad R^2=.148,\ s=45.066,\ n=10 \qquad (15)$$
The t-test for the statistical significance of the coefficient of $\tilde{x}_t$ yields:
$$\tau_{\beta_1}(\mathbf{z}_0)=\frac{53.5}{45.29}=1.181\,[.271] \qquad (16)$$
Notice that the p-value is identical to the one for the correlation coefficient based on the detrended data. This is not accidental. As shown above, the correlation coefficient is a simple reparameterization of the regression coefficient. Given that:
$$\hat{\sigma}_{\tilde{x}}=\sqrt{0.110}=0.3316,\qquad \hat{\sigma}_{\tilde{y}}=\sqrt{2120.132}=46.05,$$
$$\hat{\rho}=\hat{\beta}_1\left(\frac{\hat{\sigma}_{\tilde{x}}}{\hat{\sigma}_{\tilde{y}}}\right)=53.5\left(\frac{0.3316}{46.05}\right)=0.385\,[.271] \qquad (17)$$
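This reparameterization can also be verified numerically; a sketch assuming NumPy, detrending by subtracting the fitted trend slope times $t$ as in the note:

```python
# Verify rho = b1 * (sigma_x / sigma_y) on the detrended series.
import numpy as np

x = np.array([29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8])
y = np.array([327, 456, 509, 497, 596, 573, 661, 741, 809, 717], dtype=float)
t = np.arange(1, 11, dtype=float)

x_d = x - np.polyfit(t, x, 1)[0] * t   # detrended cheese series
y_d = y - np.polyfit(t, y, 1)[0] * t   # detrended deaths series

b1 = np.cov(x_d, y_d, ddof=1)[0, 1] / x_d.var(ddof=1)   # regression slope ~53.5
rho_tilde = b1 * x_d.std(ddof=1) / y_d.std(ddof=1)      # ~0.385, as in (17)
print(b1, rho_tilde)
```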
In light of the above discussion, one might be forgiven for thinking that the statistical spuriousness was remedied by transmuting (or cleaning up) our data to comply with the invoked assumptions of the underlying statistical model in (1). Although there is an element of truth in this viewpoint, it is much too narrow to be illuminating enough for the broader issues raised by statistically spurious results stemming from different departures from the model assumptions. For instance, how does one transmute the data when Normality is false but $\rho$ is the parameter of interest?
A broader, and more illuminating, perspective is provided by viewing the remedy as stemming from respecifying the original model to account for the statistical systematic information in data $\mathbf{z}_0$ that was not accounted for by the original statistical model. In the above case the (unaccounted) systematic information came in the form of trending means, which can be accounted for by the following respecified linear regression that includes a trend term to reflect (9):
$$y_t = \beta_0 + \delta t + \beta_1 x_t + u_t,\quad t=1,2,\ldots,n \qquad (18)$$
When this respecified regression is estimated using the original data, it yields:
$$y_t = \underset{(1422.0)}{-1236.0} + \underset{(19.68)}{25.18}\,t + \underset{(48.42)}{53.5}\,x_t + \hat{u}_t,\qquad R^2=.916,\ s=48.178,\ n=10 \qquad (19)$$
This estimated regression is interesting because the coefficient estimates of $(\beta_0,\ \beta_1)$ are identical, and their standard errors almost identical, to those of the linear regression between the detrended data in (15). Indeed, the discrepancies between them are due to numerical approximation errors. Despite the minor differences in these estimates, the inference results pertaining to the parameters $\left(\beta_0,\ \beta_1,\ \sigma^2\right)$ are identical. This includes the t-test for the significance of $\beta_1$:
$$\tau_{\beta_1}(\mathbf{z}_0)=\frac{53.5}{48.42}=1.105\,[.301] \qquad (20)$$
confirming that $\beta_1$ is statistically insignificant.
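The trend-augmented regression can be estimated directly by least squares on the original data; a sketch assuming NumPy (the standard errors come from the usual OLS covariance formula $s^2(X'X)^{-1}$, not from code in the note):

```python
# OLS estimation of the respecified model y_t = b0 + d*t + b1*x_t + u_t.
import numpy as np

x = np.array([29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8])
y = np.array([327, 456, 509, 497, 596, 573, 661, 741, 809, 717], dtype=float)
n = len(x)
t = np.arange(1, n + 1, dtype=float)

X = np.column_stack([np.ones(n), t, x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # ~(-1236.0, 25.18, 53.5)
resid = y - X @ beta
s2 = resid @ resid / (n - 3)                   # error variance, 3 parameters
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
t_b1 = beta[2] / se[2]                         # ~1.10: x_t is insignificant
print(beta, np.sqrt(s2), t_b1)
```

Note how the trend term absorbs the common upward drift, leaving $x_t$ with no significant explanatory power for $y_t$.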
In conclusion, it is important to emphasize that in the above example the sample size is too small ($n=10$) to test for departures from the other probabilistic assumptions. This is especially true for the assumption of independence. However, if other departures are present in these data, they would undermine the reliability of the original inference results based on (6) even further. In practice, one should do empirical modeling with a sample size $n$ large enough to enable one to test the validity of all the model assumptions using trenchant misspecification testing. If any departures are detected, one should then respecify the original model to account for all the systematic statistical information not accounted for. The respecification is considered successful when the respecified model turns out to be statistically adequate: all its probabilistic assumptions are shown to be valid for data $\mathbf{z}_0$, i.e. the statistical model accounts for all the chance regularity patterns in the data.