13. Hypothesis testing
• Distinguish between two hypotheses
1. H0 – there is no difference between groups
2. H1 – there is a difference between groups
• Or…
1. H0 – there is no relation between two variables
2. H1 – there is some relation between the two
variables
14. From statistical values to p-values
• Various procedures give us statistical values
– T-tests (one sample, two sample, paired etc.)
– F-Tests
– Correlation tests (r values)
• What is a p value?
15. P value
• A p-value is the probability that, if we repeated our
experiment (with all the analyses) and there were
no true effect, we would obtain this statistical
value or a greater one.
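This definition can be checked by simulation. A minimal sketch using NumPy/SciPy (my illustration, not from the slides): when H0 is true, p-values are uniform, so about 5% of experiments come out "significant" at 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 2000 experiments where H0 is true: both groups come from the
# same distribution, so any "significant" result is a false positive.
n_experiments, n_per_group = 2000, 20
group_a = rng.normal(size=(n_per_group, n_experiments))
group_b = rng.normal(size=(n_per_group, n_experiments))
pvals = stats.ttest_ind(group_a, group_b, axis=0).pvalue

# Under H0 roughly 5% of the p-values land below 0.05.
false_positive_rate = (pvals < 0.05).mean()
```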
17. OK back to neuroimaging
• Assuming that we are doing a mass-univariate
analysis (we look at each voxel independently),
we have a t-map
• Now, using a theoretical distribution (given the
degrees of freedom), we can turn it into a p-
map
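The t-to-p conversion takes a few lines with SciPy; a sketch where the map shape and degrees of freedom are made-up illustration values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical t-map: one t value per voxel (here just noise).
df = 18                                   # degrees of freedom of the test
t_map = rng.standard_t(df, size=(4, 4, 4))

# Two-tailed p-map from the theoretical t distribution.
p_map = 2 * stats.t.sf(np.abs(t_map), df=df)
```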
18. Inference!
• We take our p-map and discard all voxels with
values > 0.05
– “The value for which P=0.05, or 1 in 20, is 1.96 or
nearly 2; it is convenient to take this point as a
limit in judging whether a deviation ought to be
considered significant or not. Deviations
exceeding twice the standard deviation are thus
formally regarded as significant.” (R. A. Fisher,
Statistical Methods for Research Workers, 1925)
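The cut-off in the quote can be verified directly; a quick check with SciPy (my addition, not part of the slides):

```python
from scipy import stats

# Two-tailed 5% cut-off of the standard normal distribution:
# the value quoted above as "1.96 or nearly 2".
cutoff = stats.norm.ppf(1 - 0.05 / 2)
print(round(cutoff, 2))
```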
• We are done – right?
19. Not quite done yet…
• Let me generate two vectors of values and test
using a t-test if they are different
• What is the probability that we observe p < 0.05?
– Well… 0.05
• Let me generate another set of values… and
another… 100 pairs of vectors
• What is the probability that at least one of the
tests comes out significant?
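The answer follows from independence; a small sketch (assuming the 100 tests are independent):

```python
# Chance of at least one false positive across 100 independent
# tests, each run at alpha = 0.05, when H0 is true everywhere.
alpha, n_tests = 0.05, 100
p_at_least_one = 1 - (1 - alpha) ** n_tests
print(round(p_at_least_one, 3))  # roughly 0.994
```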
21. Correcting for multiple comparisons
• Bonferroni correction (based on Boole’s
inequality)
– Divide your p-threshold by the number of tests
you have performed
– Or multiply your p-values by the number of tests
you have performed
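Both variants of the correction in one sketch, on four made-up p-values:

```python
import numpy as np

# Toy p-values from four hypothetical tests.
pvals = np.array([0.001, 0.02, 0.04, 0.3])
n_tests = len(pvals)

# Option 1: divide the threshold by the number of tests.
significant = pvals < 0.05 / n_tests      # only 0.001 survives

# Option 2 (equivalent): multiply the p-values instead, capping at 1.
p_adjusted = np.minimum(pvals * n_tests, 1.0)
```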
22. Bonferroni is a Family Wise Error
correction
It guarantees that the chance of getting at least
one false positive across all the tests is less than
your p-threshold
23. Permutation based FWE correction
• The assumptions behind the theoretical
distributions are often not met
• There are many dependencies between voxels
– Each test is not independent so Bonferroni
correction can be conservative
• We can however establish an empirical
distribution
24. Permutation based FWE correction
1. Break the relation: shuffle the participants
between the groups
2. Perform the test
3. Save the maximum statistical value across
voxels
4. Repeat
25. Permutation based FWE correction
Our FWE-corrected p-value is the percentage of
permutations that yielded a maximum statistical
value higher than the original (unshuffled) one
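The four steps above can be sketched on toy data; the group sizes, planted effect, and permutation count here are all made up:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Toy data: two groups of 12 "subjects", 500 voxels each,
# with an effect planted in voxel 0.
n_per_group, n_voxels = 12, 500
group_a = rng.normal(size=(n_per_group, n_voxels))
group_b = rng.normal(size=(n_per_group, n_voxels))
group_b[:, 0] += 1.5

data = np.vstack([group_a, group_b])
labels = np.array([0] * n_per_group + [1] * n_per_group)

def t_map(data, labels):
    """Two-sample t statistic at every voxel."""
    return stats.ttest_ind(data[labels == 0], data[labels == 1], axis=0).statistic

observed_max = np.abs(t_map(data, labels)).max()

# Steps 1-4: shuffle the labels, re-test, keep the maximum |t| across voxels.
n_perm = 500
max_null = np.empty(n_perm)
for i in range(n_perm):
    max_null[i] = np.abs(t_map(data, rng.permutation(labels))).max()

# FWE-corrected p-value of the strongest voxel.
p_fwe = (max_null >= observed_max).mean()
```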
26. False Discovery Rate
• Even conceptually, FWE correction seems
conservative
– At least one false positive out of 60 000 tests?
• Is there a more intuitive way of looking at
this?
27. False Discovery Rate
I present a number of voxels that I think show a
strong effect, but I admit that a certain
percentage of them might be false positives.
29. FDR procedures
• Benjamini-Hochberg procedure
– With its variant for dependent tests
(Benjamini–Yekutieli)
• Efron’s local FDR procedure
– Explicit modeling of the signal distribution
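A minimal sketch of the Benjamini-Hochberg step-up procedure on toy p-values of my own choosing:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of tests rejected at FDR level q (step-up procedure)."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    # Largest k with p_(k) <= (k / m) * q; reject the k smallest p-values.
    below = pvals[order] <= (np.arange(1, m + 1) / m) * q
    passed = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        passed[order[:k + 1]] = True
    return passed

# Toy p-values, already sorted for readability.
mask = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20, 0.74], q=0.05)
```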
30. Interim Summary
• FWE corrections
– Bonferroni – simple but struggles with
dependencies (over conservative)
– Permutations – less dependent on assumptions,
but time consuming
• FDR corrections
– B-H – simple but also struggles with dependencies
– Local FDR – data driven, but can fail in case of low
SNR
31. CLUSTER EXTENT TESTS
Test how big the blobs are
Random field theory
Smoothness estimation
Permutation test
The problem of cluster forming threshold
Fun fact: FWE with RFT
32. Intuition
If we are interested in continuous regions of
activation, why are we looking at voxels and not
blobs?
35. What contributes to expected cluster
size?
How likely is it to get a cluster of this size from
pure noise?
It depends on:
1. cluster forming threshold
2. smoothness of the map
3. size of the map
36. Where do we get those parameters?
1. cluster forming threshold
– Arbitrary decision
2. smoothness of the map
– Estimated from the residuals of the GLM
3. size of the map
– Calculated from the mask
37. Permutation based cluster extent
probability
1. Break the relation: shuffle the participants
between the groups
2. Perform the test
3. Threshold the map to get clusters
4. Save the sizes of all clusters
5. Repeat
38. Permutation based cluster extent
probability
Our cluster-extent p-value is the percentage of
permutations that yielded a maximum cluster size
bigger than the original (unshuffled) one
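The five steps can be sketched with SciPy's connected-component labeling; the volume sizes, group sizes, and cluster-forming threshold below are made-up illustration values:

```python
import numpy as np
from scipy import ndimage, stats

rng = np.random.default_rng(3)

def max_cluster_size(t_map, threshold):
    """Size of the largest suprathreshold cluster in a 3D map."""
    labeled, n_clusters = ndimage.label(t_map > threshold)
    return 0 if n_clusters == 0 else np.bincount(labeled.ravel())[1:].max()

# Toy data: two groups of 10 "subjects", each a 10x10x10 volume of noise.
group_a = rng.normal(size=(10, 10, 10, 10))
group_b = rng.normal(size=(10, 10, 10, 10))

def group_t_map(a, b):
    return stats.ttest_ind(a, b, axis=0).statistic

cft = 2.0  # cluster-forming threshold, an arbitrary choice
observed = max_cluster_size(group_t_map(group_a, group_b), cft)

# Steps 1-5: shuffle subjects between groups, re-test, threshold,
# and record the largest cluster of each permutation.
data = np.concatenate([group_a, group_b])
null_sizes = np.empty(200)
for i in range(200):
    perm = rng.permutation(len(data))
    null_sizes[i] = max_cluster_size(
        group_t_map(data[perm[:10]], data[perm[10:]]), cft)

p_cluster = (null_sizes >= observed).mean()
```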
53. P-value paradox
• There are no two entities or groups that are
truly identical
• There are no two variables that are completely
unrelated
• We just fail to obtain enough samples to see it
– Or our tools are not sensitive enough
54. More samples more “significance”
• The more subjects you have in your study, the
more likely you are to find something
significant
• The same applies to scan length and field
strength
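This is easy to demonstrate by simulation; a sketch (the effect size and sample sizes are made up) showing how a trivially small but real effect becomes "significant" once the sample is large enough:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# A tiny but real effect: group means differ by 0.1 standard deviations.
effect = 0.1
results = {}
for n in (20, 200, 2000, 20_000):
    a = rng.normal(0.0, 1.0, size=n)
    b = rng.normal(effect, 1.0, size=n)
    results[n] = stats.ttest_ind(a, b).pvalue
# With n = 20 the effect is invisible; with n = 20 000 the p-value is tiny.
```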
56. P-value failure
• P-values do not tell us much about the actual
size of the effect
• Neither do they tell us about the predictive
power of the relation we found
57. The interesting question
Is PCC involved in autism?
vs.
Given the cortical thickness of a subject’s PCC,
how well can I predict his or her diagnosis?
58. Why does this matter
• More subjects, longer scans, stronger fields –
everything becomes significant
– We are getting there
• Lack of faith in science from the public
– Poor reproducibility
59. What needs to be done
We need more replications
We need to start reporting null results
60. What you can do
• Report effect sizes and their confidence
intervals
– For all tests/voxels – not just the significant ones
• Share the unthresholded statistical maps
– It only takes 5 minutes on neurovault.org
• Report all the tests you have performed – not
just the significant ones