Frontiers of Computational Journalism week 7 - Randomness and Statistical Significance

Frontiers of
Computational Journalism
Columbia Journalism School
Week 7: Randomness and Spooky Significance
October 31, 2018

This class
• Randomness
• Significance testing in Journalism
• $%#$! P-Values
• Bayesian inference
• The Garden of Forking Paths
• Analysis of Competing Hypotheses

One star per box – “less” random

Two principles of randomness
1. Random data has “patterns” in it way more often than you
think.
2. This problem gets much more extreme when you have
less data.
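A minimal sketch of both principles, assuming a fair coin as the "no pattern" process. The sample sizes and trial count are illustrative choices, not from the slides. It reports how far the observed share of heads strays from 50% at each sample size.

```python
import random

def spread_of_rates(n_flips, n_trials=1000, seed=0):
    """Return (min, max) share of heads across n_trials samples of n_flips fair coin flips."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_trials):
        heads = sum(rng.random() < 0.5 for _ in range(n_flips))
        rates.append(heads / n_flips)
    return min(rates), max(rates)

for n in (10, 100, 1000):
    low, high = spread_of_rates(n)
    print(f"n={n:5d}: share of heads ranged from {low:.2f} to {high:.2f}")
```

The smaller the sample, the wider the range, even though nothing but chance is at work.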

Two dice: non-uniform distribution
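A short sketch that counts every outcome of two fair six-sided dice, to show why the sum is not uniform even though each die is.

```python
from collections import Counter
from itertools import product

# Count how many of the 36 equally likely (die1, die2) outcomes give each sum.
sum_counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

for total in sorted(sum_counts):
    ways = sum_counts[total]
    print(f"{total:2d}: {ways}/36 {'#' * ways}")
```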

Is something causing cancer?
Cancer rate per county. Darker = greater incidence of cancer.
From Graphical Inference for Infovis, Wickham et al.

Global temperature record
How likely is it that the temperature won't increase over the next decade?

From The Signal and the Noise, Nate Silver

It is conceivable that the 14 elderly people who are reported to have died soon
after receiving the vaccination died of other causes. Government officials in
charge of the program claim that it is all a coincidence, and point out that old
people drop dead every day. The American people have even become familiar
with a new statistic: Among every 100,000 people 65 to 75 years old, there will
be nine or ten deaths in every 24-hour period under most normal
circumstances.
Even using the official statistic, it is disconcerting that three elderly people in
one clinic in Pittsburgh, all vaccinated within the same hour, should die within a
few hours thereafter. This tragedy could occur by chance, but the fact remains
that it is extremely improbable that such a group of deaths should take place in
such a peculiar cluster by pure coincidence.
- New York Times editorial, 14 October 1976

Assuming that about 40 percent of elderly Americans were vaccinated within
the first 11 days of the program, then about 9 million people aged 65 and
older would have received the vaccine in early October 1976. Assuming that
there were 5,000 clinics nationwide, this would have been 164 vaccinations
per clinic per day. A person aged 65 or older has about a 1-in-7,000 chance
of dying on any particular day; the odds of at least three such people dying
on the same day from among a group of 164 patients are indeed very long,
about 480,000 to one against. However, under our assumptions, there were
55,000 opportunities for this “extremely improbable” event to occur—5,000
clinics, multiplied by 11 days. The odds of this coincidence occurring
somewhere in America, therefore, were much shorter—only about 8 to 1
- Nate Silver, The Signal and the Noise, Ch. 7 footnote 20
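The arithmetic in Silver's footnote can be sketched as follows. The inputs are the figures he quotes. Treating deaths as independent and each clinic-day as an independent opportunity is an assumption of this sketch.

```python
from math import comb

p_death = 1 / 7000          # daily chance of death for a person 65 or older (from the quote)
n_patients = 164            # vaccinations per clinic per day (from the quote)
opportunities = 5000 * 11   # clinics times days (from the quote)

# Binomial probability of at most 2 deaths among one clinic's patients in a day.
p_at_most_2 = sum(
    comb(n_patients, k) * p_death**k * (1 - p_death)**(n_patients - k)
    for k in range(3)
)
p_cluster = 1 - p_at_most_2                       # at least 3 deaths in one clinic-day
p_somewhere = 1 - (1 - p_cluster)**opportunities  # at least one such clinic-day anywhere

print(f"one clinic-day: about {1 / p_cluster:,.0f} to 1 against")
print(f"somewhere in the program: about {(1 - p_somewhere) / p_somewhere:.1f} to 1 against")
```

Both results should land close to the quoted figures of roughly 480,000 to 1 and 8 to 1.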

Significance Testing in Journalism

Randomization to detect insider trading

Looking at executives' trading in the week before their companies made news,
the Journal found that one of every 33 who dipped in and out posted average
returns of more than 20% (or avoided 20% downturns) in the following week.
By contrast, only one in 117 executives who traded in an annual pattern did that
well.
Executives’ Good Luck in Trading Own Stock, Wall Street Journal, 2012

Randomization to detect tennis fixing
Why look at betting data? Well, the main point of fixing a match is to make
money off the betting. In a normal match, some people bet that one player will
win and some people bet on the other, based on the odds that bookmakers
have set. But if huge bets start pouring in on one side, that looks very much like
a sign that some gamblers think they know more than the bookmaker about
how that match is going to go. Perhaps they know one player is going to tank.
…
To estimate how often they should have been expected to lose, I ran 1 million
computer simulations per player.
How BuzzFeed News Used Betting Data To Investigate Match-Fixing In Tennis, John
Templon, Buzzfeed, 2016
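A rough sketch of this kind of simulation, not BuzzFeed's actual code or data. It assumes each match has a bookmaker-implied win probability and that results are independent. The function name and the example player are hypothetical.

```python
import random

def chance_of_losing_at_least(win_probs, observed_losses, n_sims=100_000, seed=1):
    """Estimate how often a player would lose at least `observed_losses` of these
    matches if each result followed its bookmaker-implied win probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        losses = sum(rng.random() >= p for p in win_probs)
        if losses >= observed_losses:
            hits += 1
    return hits / n_sims

# Hypothetical player: ten matches, each with a 70% implied chance of winning.
hypothetical_win_probs = [0.7] * 10
print(chance_of_losing_at_least(hypothetical_win_probs, observed_losses=8))
```

A very small result says the losing streak would be rare by chance. It does not say why the streak happened.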

Problems with statistical tests alone
“It’s very, very dangerous to make blasé assumptions about a match being
dubious because of prematch movements,” Dan Weston, a tennis analyst and
trader who writes for the website of the sports book Pinnacle, said in a
telephone interview. (Using only data on betting and results to demonstrate
fixing has proven problematic in other sports.)
“By itself, the analysis of betting data does not prove match-fixing,” Schoofs
said in his statement. “That’s why we did not name the players and are
declining to comment, and also why our investigation went much wider than the
algorithm and was based on a cache of leaked documents, interviews across
three continents, and much more.”
Why Betting Data Alone Can’t Identify Match Fixers In Tennis, FiveThirtyEight

Detecting campaign finance violations?
In late October 2016, Donald Trump’s personal attorney Michael Cohen paid
adult star Stormy Daniels $130,000 in order to purchase her silence about an
alleged affair a decade earlier. … Sharp-eyed observers have noted that, in late
October 2016, the Trump campaign made a series of five large payments to
Trump-affiliated entities, totaling $129,999.72.
Ultimately, our model suggests that the probability of a set of payments
coincidentally coming so close to $130,000 is approximately 0.1%, or one out
of one thousand. In other words, about 99.9% of the time, random chance
would not produce a set of payments this close to $130,000. Therefore, the
probability that the Trump campaign payments were related to the Daniels
payoff is very high.
Statistical Model Strongly Suggests the Stormy Daniels Payoff Came from the
Trump Campaign, Will Stancil

Statistical Model Strongly Suggests the Stormy Daniels Payoff Came from the
Trump Campaign, Will Stancil
“The simulation confirmed that it is extremely unlikely that, by random chance
alone, a set of payments near a specific date would almost equal $130,000.”

P-value
p = Pr(data at least as extreme as your data | null hypothesis)
What’s it good for? What’s it bad for?
From A dirty dozen: twelve p-value misconceptions, S. Goodman

T-test for two groups with different variance. The test statistic is expected to
follow a t-distribution under the null hypothesis of equal scores.
Is one classroom better than another?

Things that depend on which classroom a student is in
Things that don’t depend on which classroom they’re in
Reasons for possible differences

observed difference
between classes

observed difference
between classes
14% of all resamples have a class difference > observed, so p = 0.14
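One way to get such a p-value is a permutation test, sketched here. It assumes the resamples are random reassignments of students to classrooms, and the scores are made up for illustration.

```python
import random

def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Share of random relabelings whose difference in means is at least the observed one."""
    rng = random.Random(seed)
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # pretend classroom labels were assigned at random
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        if diff >= observed:
            count += 1
    return count / n_resamples

# Hypothetical test scores for two classrooms.
class_a = [78, 85, 92, 88, 75, 81, 90, 84]
class_b = [72, 80, 85, 79, 83, 77, 74, 86]
print(permutation_p_value(class_a, class_b))
```

This is a one-sided test: it counts only resamples in which the first class comes out ahead by at least the observed margin.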

Bootstrapping: resample with replacement. This gives an excellent
approximation of the sampling distribution, even if non-normal.
Computing the sampling distribution
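A minimal bootstrap sketch for the sampling distribution of a mean. The scores are hypothetical, and the percentile interval is one common summary among several.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples=5000, seed=0):
    """Means of resamples drawn with replacement, each the same size as the sample."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)]

# Hypothetical test scores from one classroom.
scores = [78, 85, 92, 88, 75, 81, 90, 84]
means = sorted(bootstrap_means(scores))
low, high = means[int(0.025 * len(means))], means[int(0.975 * len(means))]
print(f"sample mean {statistics.fmean(scores):.1f}, "
      f"95% percentile interval {low:.1f} to {high:.1f}")
```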

A dirty dozen: twelve p-value misconceptions, S. Goodman

Conditional Probability
Pr(B|A) = Pr(AB)/Pr(A)

Accident
No Accident
Blue Yellow

Accident
No Accident
Blue Yellow
P(Accident|Blue) = 0.1

Relative risk as conditional probability
N = a+b+c+d
N(disease) = a+c
N(no disease) = b+d
Pr(disease) = (a+c) / (a+b+c+d)
Pr(disease|smoker) = a / (a+b)
Pr(disease|non-smoker) = c / (c+d)
RR = Pr(disease|smoker)/Pr(disease|non-smoker) = (a/(a+b)) / (c/(c+d))
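The same formula as a small function. The counts in the example call are hypothetical, chosen only to make the ratio easy to check by hand.

```python
def relative_risk(a, b, c, d):
    """a, b: smokers with and without the disease; c, d: non-smokers with and without it."""
    risk_smokers = a / (a + b)
    risk_non_smokers = c / (c + d)
    return risk_smokers / risk_non_smokers

# Hypothetical counts: 30 of 100 smokers and 10 of 100 non-smokers have the disease.
print(relative_risk(a=30, b=70, c=10, d=90))
```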

Base Rates - Taxi Accidents
Imagine you live in a city where 15% of all rides end in an
accident, and last year there were
- 75 accidents involving yellow cabs
- 25 accidents involving blue cabs
Which taxi company is more dangerous?

Base rate
We know
P(accident) = 0.15
P(blue|accident) = 0.25
P(yellow|accident) = 0.75
We do not know the “base rate”:
P(yellow)
or equivalently
N(yellow)
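The accident counts give P(yellow|accident) = 75/100. Turning that into P(accident|yellow) requires the missing base rate. This sketch tries a few hypothetical values of P(yellow), assuming every ride is in either a yellow or a blue cab.

```python
def accident_rate_given_company(p_company_given_accident, p_accident, p_company):
    """Bayes' theorem: P(accident|company) = P(company|accident) * P(accident) / P(company)."""
    return p_company_given_accident * p_accident / p_company

p_accident = 0.15                          # from the slide
p_yellow_given_accident = 75 / (75 + 25)   # from the accident counts

# P(yellow) is the unknown base rate, so these values are hypothetical.
for p_yellow in (0.5, 0.75, 0.9):
    yellow = accident_rate_given_company(p_yellow_given_accident, p_accident, p_yellow)
    blue = accident_rate_given_company(1 - p_yellow_given_accident, p_accident, 1 - p_yellow)
    print(f"P(yellow)={p_yellow:.2f}: "
          f"P(accident|yellow)={yellow:.3f}, P(accident|blue)={blue:.3f}")
```

Depending on the base rate, either company can come out as the more dangerous one.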

Evidence and Conditional Probability
Hypothesis H = Alice has a cold
Evidence E = we just saw her cough

Alice is coughing. Does she have a cold?
Most people with colds cough
P(coughing|cold) = 0.9

P(A|B) ≠ P(B|A)
Most people with colds cough
P(coughing|cold) = 0.9
but we want
P(cold | coughing)

Bayes’ Theorem
Tells us how to go from Pr(A|B) to Pr(B|A)
Pr(B|A) = Pr(A|B)Pr(B) / Pr(A)

Alice is coughing. Does she have a cold?
Prior P(H) = 0.05 (5% of our friends have a cold)
Likelihood P(E|H) = 0.9 (most people with colds cough)
Base rate P(E) = 0.1 (10% of everyone coughs today)
P(H|E) = P(E|H)P(H)/P(E)
= 0.9 * 0.05 / 0.1
= 0.45
If you believe your initial probability estimates, you should now
believe there's a 45% chance she has a cold.
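The same calculation as a small function, using the three numbers from the slide. The function name is an illustrative choice.

```python
def posterior(prior, likelihood, base_rate):
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / base_rate

p_cold_given_cough = posterior(prior=0.05, likelihood=0.9, base_rate=0.1)
print(round(p_cold_given_cough, 2))
```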

Bayes’ Theorem - Diagnostic tests
Suppose I tell you:
• 14 of 1000 women under 50 have breast cancer
• If a woman has cancer, a mammogram is positive 75%
of the time
• If a woman does not have cancer, a mammogram is
positive 10% of the time
If a woman has a positive mammogram, how likely is she
to have cancer?

The Signal and the Noise, Nate Silver

cancer
no cancer
positive negative

cancer
no cancer
positive negative
Pr(positive|cancer) = 0.75
= N(positive & cancer) / N(cancer)
N(cancer) = 4
N(positive & cancer) = 3

cancer
no cancer
positive negative
Pr(positive|no cancer) = 0.1
= N(positive & no cancer) / N(no cancer)
N(no cancer) = 1000
N(positive & no cancer) = 100

cancer
no cancer
positive negative
Pr(cancer) = 0.014
= N(cancer) / N

Conditional probabilities
Pr(positive|cancer) = 75%
Pr(positive|no cancer) = 10%
What is Pr(cancer|positive)?

cancer
no cancer
positive negative
Pr(cancer|positive)
= 9.6%

Bayesian diagnostics
Pr(cancer|positive) =
Pr(positive|cancer) Pr(cancer) / Pr(positive)
Pr(positive|cancer) = 0.75
Pr(cancer) = 0.014
Pr(positive) = Pr(positive|no cancer)Pr(no cancer) +
Pr(positive|cancer)Pr(cancer)
= 0.10*0.986 + 0.75*0.014
= 0.1091

Bayesian diagnostics
Pr(cancer|positive) =
Pr(positive|cancer) Pr(cancer) / Pr(positive)
= (0.75 * 0.014) / (0.1091)
= 0.0962
= 9.6% chance she has cancer
if mammogram is positive
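A sketch of the same computation in code, followed by a check that counts people instead of multiplying probabilities. The group size of 100,000 is an arbitrary choice for the counting version.

```python
def p_cancer_given_positive(p_cancer, p_pos_given_cancer, p_pos_given_no_cancer):
    """Posterior via Bayes' theorem, with Pr(positive) from the law of total probability."""
    p_positive = (p_pos_given_cancer * p_cancer
                  + p_pos_given_no_cancer * (1 - p_cancer))
    return p_pos_given_cancer * p_cancer / p_positive

print(f"{p_cancer_given_positive(0.014, 0.75, 0.10):.4f}")

# Same answer by counting, for a hypothetical group of 100,000 women.
n = 100_000
true_positives = n * 0.014 * 0.75
false_positives = n * (1 - 0.014) * 0.10
print(f"{true_positives / (true_positives + false_positives):.4f}")
```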

Evidence
Information that justifies a belief.
Presented with evidence E for X, we should believe X "more."
In terms of probability, P(X|E) > P(X)

Bayes “learns” from evidence
Pr(H|E) = Pr(E|H) Pr(H) / Pr(E)
or
Pr(H|E) = Pr(E|H)/Pr(E) * Pr(H)
Posterior Pr(H|E): How likely is H given evidence E?
Prior Pr(H): How likely was H to begin with?
Likelihood Pr(E|H): Probability of seeing E if H is true
Base rate Pr(E): How commonly do we see E at all?

A more complete theory
Compare probability of multiple alternatives.

Did the stoplight reduce accidents?

[Figure: nine small panels, numbered 1 to 9, each with axis ticks from 0 to 8]
Simulated without stoplight

[Figure: nine small panels, numbered 1 to 9, each with axis ticks from 0 to 8]
Simulated with a 50% effective stoplight
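The slides do not say how these panels were generated. As one possible sketch, assume yearly accident counts are Poisson-distributed and that a 50% effective stoplight halves the average. The baseline of 4 accidents per year and the helper names are hypothetical.

```python
import math
import random

def poisson_draw(rng, mean):
    """Draw one Poisson-distributed count using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def simulate_years(mean_per_year, n_years, seed):
    """Simulate accident counts for n_years independent years."""
    rng = random.Random(seed)
    return [poisson_draw(rng, mean_per_year) for _ in range(n_years)]

baseline_mean = 4.0   # hypothetical average accidents per year without a stoplight
effectiveness = 0.5   # the "50% effective" case from the slide

print("no stoplight:  ", simulate_years(baseline_mean, 9, seed=1))
print("with stoplight:", simulate_years(baseline_mean * (1 - effectiveness), 9, seed=2))
```

With only a few years of data, the two runs can overlap heavily, which is why a single before-and-after comparison is weak evidence.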

Probability distribution over hypotheses
Is the NYPD targeting mosques for stop-and-frisk?
[Figure: probability, from 0 to 1, assigned to each hypothesis]
H0: Never
H1: Once or twice
H2: Routinely
*Tricky: you have to imagine a hypothesis before you can assign it a
probability.

Parameter Estimation
Computing probability for a continuum of hypotheses
P(𝛳|E) = Pr(E|𝛳)/Pr(E) * Pr(𝛳)
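A minimal grid sketch of this idea, assuming 𝛳 is an unknown success probability, the prior is flat, and the evidence is a count of successes out of a number of trials. The 7-of-20 data and the function name are hypothetical.

```python
from math import comb

def grid_posterior(successes, trials, grid_size=101):
    """Posterior over theta on an evenly spaced grid, with a flat prior
    and a binomial likelihood."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1 / grid_size] * grid_size
    likelihood = [comb(trials, successes) * t**successes * (1 - t)**(trials - successes)
                  for t in thetas]
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalized)  # plays the role of Pr(E)
    return thetas, [u / evidence for u in unnormalized]

# Hypothetical evidence: 7 successes in 20 trials.
thetas, post = grid_posterior(successes=7, trials=20)
best = max(range(len(thetas)), key=lambda i: post[i])
print(f"most probable theta on the grid: {thetas[best]:.2f}")
```

Each grid point is one hypothesis about 𝛳, so the result is a probability for every hypothesis at once instead of a single accept-or-reject decision.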

Ok, but what’s a “significant” Bayes Factor?
From Bayes Factors, Kass and Raftery
There’s this, but the whole idea of “significance” is
probably flawed.

I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here's How.
John Bohannon

Science Isn’t Broken, FiveThirtyEight

“Statistical significance” is usually asking the wrong question.

Does the model reproduce the data?
Testing for Racial Discrimination in Police Searches of Motor Vehicles, Simoiu et al.

Analysis of Competing Hypotheses

Cognitive biases
Availability heuristic: we use examples that come to mind, instead of
statistics.
Preference for earlier information: what we learn first has a much
greater effect on our judgment.
Memory formation: whatever seems important at the time is what gets
remembered.
Confirmation bias: we seek out and give greater importance to
information that confirms our expectations.

Confirmation bias
Comes in many forms.
...unconsciously filtering information that doesn't fit expectations.
...not looking for contrary information.
...not imagining the alternatives.

Method of competing hypotheses
Start with multiple hypotheses H0, H1, ... HN
(Remember, if you can't imagine it, you can't conclude it!)
Go looking for information that gives you the best ability to discriminate
between hypotheses.
If the goal is to choose a hypothesis, evidence that merely supports Hi is much
less useful than evidence that supports Hi much more than Hj.

In practice: Triangulation
A good conclusion is one which is supported by multiple lines of evidence from
multiple methods.
“Philosophy ought to imitate the successful sciences in its methods, so far as to
proceed only from tangible premises which can be subjected to careful scrutiny,
and to trust rather to the multitude and variety of its arguments than to the
conclusiveness of any one. Its reasoning should not form a chain which is no
stronger than its weakest link, but a cable whose fibers may be ever so slender,
provided they are sufficiently numerous and intimately connected.”
- Charles Sanders Peirce

A difficult example
NYPD performs ~600,000 street stop and frisks per year.
What sorts of conclusions could we draw from this data?
How?

Stop and Frisk Causation
Suppose you take the address of every mosque in NYC, and
discover that there are 15% more stop-and-frisks within 100m of
mosques than the overall average.
Can we conclude that the police are targeting Muslims?
