Modeling Social Data, Lecture 2: Introduction to Counting

Introduction to Counting
APAM E4990
Modeling Social Data
Jake Hofman
Columbia University
January 27, 2017
Jake Hofman (Columbia University) Intro to Counting January 27, 2017 1 / 27

Why counting?
http://bit.ly/august2016poll
p(y | x), where y = support and x = age

Why counting?
http://bit.ly/ageracepoll2016
p(y | x1, x2), where y = support and (x1, x2) = (age, race)

Why counting?
? p(y | x1, x2, x3, ...), where y = support and (x1, x2, x3, ...) = (age, sex, race, party, ...)

Why counting?
Problem:
Traditionally difficult to obtain reliable estimates due to small
sample sizes or sparsity
(e.g., ∼ 100 age × 2 sex × 5 race × 3 party = 3,000 groups,
but typical surveys collect ∼ 1,000s of responses)

Why counting?
Potential solution:
Sacrifice granularity for precision by binning observations into
larger, but fewer, groups
(e.g., bin age into a few groups: 18-29, 30-49, 50-64, 65+)
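
A minimal sketch of the binning step, assuming integer ages of 18 or older. The function name `age_bin` and the sample ages are made up for illustration; the bin edges and the group arithmetic come from the slides.

```python
from bisect import bisect_right
from collections import Counter

# Lower edges of the bins named on the slide: 18-29, 30-49, 50-64, 65+
EDGES = [18, 30, 50, 65]
LABELS = ["18-29", "30-49", "50-64", "65+"]

def age_bin(age):
    """Map an age (assumed >= 18) to its bin label."""
    if age < EDGES[0]:
        raise ValueError("age below the youngest bin")
    return LABELS[bisect_right(EDGES, age) - 1]

ages = [18, 29, 30, 49, 50, 64, 65, 90]  # hypothetical respondents
bin_counts = Counter(age_bin(a) for a in ages)

# Binning shrinks the number of groups from ~3,000 to 120
groups_before = 100 * 2 * 5 * 3
groups_after = len(LABELS) * 2 * 5 * 3
```

With four age bins instead of roughly 100 distinct ages, the same survey spreads its responses over 25 times fewer groups.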

Why counting?
Potential solution:
Develop more sophisticated methods that generalize well from
small samples
(e.g., fit a model: support ∼ β0 + β1·age + β2·age² + ...)

Why counting?
(Partial) solution:
Obtain larger samples through other means, so we can just count
and divide to make estimates via relative frequencies
(e.g., with ∼ 1M responses, we have 100s per group and can
estimate support within a few percentage points)
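
A rough worked version of the arithmetic above. It assumes responses are spread evenly across groups and uses the usual normal approximation for a proportion in the worst case p = 0.5; the helper `margin_of_error` and the supporter counts are illustrative, not from the lecture.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion estimated from n responses."""
    return z * math.sqrt(p * (1 - p) / n)

groups = 100 * 2 * 5 * 3               # age x sex x race x party, from the slide
per_group_survey = 1_000 / groups      # typical survey: under one response per group
per_group_large = 1_000_000 / groups   # ~1M responses: a few hundred per group

# Relative frequency: count supporters in a group and divide by the group size
supporters, respondents = 180, 333     # made-up counts for one group
estimate = supporters / respondents

moe = margin_of_error(per_group_large)
```

Under these assumptions the margin comes out to roughly ±5 percentage points per group, while a survey of about 1,000 responses leaves most groups empty.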

Why counting?
Forecasting elections with non-representative polls
Wei Wang (a,*), David Rothschild (b), Sharad Goel (b), Andrew Gelman (a,c)
International Journal of Forecasting 31 (2015) 980–991
(a) Department of Statistics, Columbia University, New York, NY, USA
(b) Microsoft Research, New York, NY, USA
(c) Department of Political Science, Columbia University, New York, NY, USA
Keywords: Non-representative polling; Multilevel regression and poststratification; Election forecasting
Abstract
Election forecasts have traditionally been based on representative polls, in which randomly
sampled individuals are asked who they intend to vote for. While representative polling has
historically proven to be quite effective, it comes at considerable costs of time and money.
Moreover, as response rates have declined over the past several decades, the statistical
benefits of representative sampling have diminished. In this paper, we show that, with
proper statistical adjustment, non-representative polls can be used to generate accurate
election forecasts, and that this can often be achieved faster and at a lesser expense than
traditional survey methods. We demonstrate this approach by creating forecasts from a
novel and highly non-representative survey dataset: a series of daily voter intention polls
for the 2012 presidential election conducted on the Xbox gaming platform. After adjusting
the Xbox responses via multilevel regression and poststratification, we obtain estimates
which are in line with the forecasts from leading poll analysts, which were based on
aggregating hundreds of traditional polls conducted during the election cycle. We conclude
by arguing that non-representative polling shows promise not only for election forecasting,
but also for measuring public opinion on a broad range of social, economic and cultural
issues.
© 2014 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved.
1. Introduction
At the heart of modern opinion polling is representative sampling, built around the idea that every individual in a [...]
The wide-scale adoption of representative polling can be traced largely back to a pivotal polling mishap in the 1936 US presidential election campaign. During that campaign, the popular magazine Literary Digest [...]
[...] pollsters, including George Gallup, Archibald Crossley, and Elmo Roper, used considerably smaller but representative samples, and predicted the election outcome with a reasonable level of accuracy (Gosnell, 1937). Accordingly, non-representative or "convenience sampling" rapidly fell out of favor with polling experts.
So, why do we revisit this seemingly long-settled case? Two recent trends spur our investigation. First, random digit dialing (RDD), the standard method in modern representative polling, has suffered increasingly high non-response rates, due both to the general public's growing reluctance to answer phone surveys, and to expanding technical means of screening unsolicited calls (Keeter, Kennedy, Dimock, Best, & Craighill, 2006). By one measure, RDD response rates have decreased from 36% in 1997 to 9% in 2012 (Kohut, Keeter, Doherty, Dimock, & Christian, 2012), and other studies confirm this trend (Holbrook, Krosnick, & Pfent, 2007; Steeh, Kirgis, Cannon, & DeWitt, 2001; Tourangeau & Plewes, 2013). Assuming that the initial pool of targets is representative, such low response rates mean that those who ultimately answer the phone and elect to respond might not be. Even if the selection issues are not yet a serious problem for accuracy, as some have argued (Holbrook et al., 2007), the downward trend in response rates suggests an increasing need for post-sampling adjustments; indeed, the adjustment methods we present here should work just as well for surveys obtained by probability sampling as for convenience samples. The second trend driving our research is the fact that, with recent technological innovations, it is increasingly convenient and cost-effective to collect large numbers of highly non-representative samples via online surveys. The data that took the Literary Digest editors several months to collect in 1936 can now take only a few days, and, for some surveys, can cost just pennies per response. However, the challenge is to extract a meaningful signal from these unconventional samples.
In this paper, we show that, with proper statistical adjustments, non-representative polls are able to yield accurate presidential election forecasts, on par with those based on traditional representative polls. We proceed as follows. Section 2 describes the election survey that we conducted on the Xbox gaming platform during the 45 days leading up to the 2012 US presidential race. Our Xbox sample is highly biased in two key demographic dimensions [...]
[...] how to transform voter intent into projections of vote share and electoral votes. We conclude in Section 5 by discussing the potential for non-representative polling in other domains.
2. Xbox data
Our analysis is based on an opt-in poll which was available continuously on the Xbox gaming platform during the 45 days preceding the 2012 US presidential election. Each day, three to five questions were posted, one of which gauged voter intention via the standard query, "If the election were held today, who would you vote for?". Full details of the questionnaire are given in the Appendix. The respondents were allowed to answer at most once per day. The first time they participated in an Xbox poll, respondents were also asked to provide basic demographic information about themselves, including their sex, race, age, education, state, party ID, political ideology, and who they voted for in the 2008 presidential election. In total, 750,148 interviews were conducted, with 345,858 unique respondents – over 30,000 of whom completed five or more polls – making this one of the largest election panel studies ever.
Despite the large sample size, the pool of Xbox respondents is far from being representative of the voting population. Fig. 1 compares the demographic composition of the Xbox participants to that of the general electorate, as estimated via the 2012 national exit poll. The most striking differences are for age and sex. As one might expect, young men dominate the Xbox population: 18- to 29-year-olds comprise 65% of the Xbox dataset, compared to 19% in the exit poll; and men make up 93% of the Xbox sample but only 47% of the electorate. Political scientists have long observed that both age and sex are strongly correlated with voting preferences (Kaufmann & Petrocik, 1999), and indeed these discrepancies are apparent in the unadjusted time series of Xbox voter intent shown in Fig. 2. In contrast to estimates based on traditional, representative polls (indicated by the dotted blue line in Fig. 2), the uncorrected Xbox sample suggests a landslide victory for Mitt Romney, reminiscent of the infamous Literary Digest error.
3. Estimating voter intent with multilevel regression and poststratification
3.1. Multilevel regression and poststratification
http://bit.ly/nonreppoll

Why counting?
The good:
Shift away from sophisticated statistical methods on small samples
to simpler methods on large samples

Why counting?
The bad:
Even simple methods (e.g., counting) are computationally
challenging at large scales
(1M is easy, 1B a bit less so, 1T gets interesting)

Why counting?
Claim:
Solving the counting problem at scale enables you to investigate
many interesting questions in the social sciences

Learning to count
This week:
Counting at small/medium scales on a single machine
Following weeks:
Counting at large scales in parallel

Counting, the easy way
Split / Apply / Combine [1]
• Load dataset into memory
• Split: Arrange observations into groups of interest
• Apply: Compute distributions and statistics within each group
• Combine: Collect results across groups
[1] http://bit.ly/splitapplycombine
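
The four steps above can be sketched in plain Python. This is only an illustration on a tiny made-up dataset of (age group, supports) pairs, not the tooling used in the course.

```python
from collections import defaultdict
from statistics import mean

# Load: a tiny in-memory dataset of (age_group, supports) observations (made up)
responses = [
    ("18-29", 1), ("18-29", 0), ("18-29", 1),
    ("30-49", 1), ("30-49", 0),
    ("65+", 0),
]

# Split: arrange observations into groups of interest
groups = defaultdict(list)
for age_group, supports in responses:
    groups[age_group].append(supports)

# Apply: compute a statistic within each group (here, the fraction supporting)
# Combine: collect the per-group results into one dictionary
support_by_age = {g: mean(values) for g, values in groups.items()}
```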

The generic group-by operation
Split / Apply / Combine
for each observation as (group, value):
    place value in bucket for corresponding group
for each group:
    apply a function over values in bucket
    output group and result

Useful for computing arbitrary within-group statistics when we
have enough memory to hold every observation
(e.g., conditional distribution, median, etc.)
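
One way to write the generic group-by as a reusable function. The name `group_by` and the sample ratings are hypothetical; any function of a list of values can be passed in.

```python
from collections import defaultdict
from statistics import median

def group_by(observations, func):
    """Bucket values by group, then apply func to each bucket."""
    buckets = defaultdict(list)
    for group, value in observations:
        buckets[group].append(value)   # place value in bucket for its group
    # Every value is kept, so memory grows with the number of observations (N)
    return {group: func(values) for group, values in buckets.items()}

# Made-up (movie, rating) observations
ratings = [("movie_a", 4), ("movie_b", 2), ("movie_a", 5),
           ("movie_a", 1), ("movie_b", 3)]
medians = group_by(ratings, median)
sizes = group_by(ratings, len)
```

Because each bucket holds the raw values, statistics such as the median that need the whole group at once are straightforward here.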

Example: Anatomy of the long tail
Dataset    Users  Items  Rating levels  Observations
Movielens  100K   10K    10             10M
Netflix    500K   20K    5              100M

Example: Movielens
How many ratings are there at each star level?
[Figure: number of ratings (0 to 3,000,000) at each rating level, 1 to 5]

Example: Movielens
[Figure: number of ratings (0 to 3,000,000) at each rating level, 1 to 5]
group by rating value
for each group:
    count # ratings

Example: Movielens
What is the distribution of average ratings by movie?
[Figure: density of mean rating by movie, over ratings from 1 to 5]

Example: Movielens
group by movie id
for each group:
    compute average rating
[Figure: density of mean rating by movie, over ratings from 1 to 5]

Example: Movielens
What fraction of ratings are given to the most popular movies?
[Figure: cumulative fraction of ratings (CDF, 0% to 100%) by movie rank (0 to 9,000)]

Example: Movielens
[Figure: cumulative fraction of ratings (CDF, 0% to 100%) by movie rank (0 to 9,000)]
group by movie id
for each group:
    count # ratings
sort by group size
cumulatively sum group sizes
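
A small sketch of the popularity CDF on made-up rating events. The data and variable names are illustrative only, and ranks here start with the most-rated movie.

```python
from collections import Counter
from itertools import accumulate

# Made-up (user_id, movie_id) rating events
ratings = [(1, "a"), (2, "a"), (3, "a"), (4, "a"),
           (1, "b"), (2, "b"), (3, "b"),
           (1, "c"), (2, "c"),
           (1, "d")]

# Group by movie id and count the ratings in each group
counts = Counter(movie for _, movie in ratings)

# Sort by group size, most popular first; position in this list is the rank
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Cumulatively sum group sizes and normalize by the total number of ratings
total = sum(counts.values())
cdf = [running / total for running in accumulate(size for _, size in ranked)]
```

In this toy data the top two of four movies already account for 70% of all ratings.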

Example: Movielens
What is the median rank of each user’s rated movies?
[Figure: number of users (0 to 8,000) by user eccentricity (100 to 10,000)]

Example: Movielens
join movie ranks to ratings
group by user id
for each group:
    compute median movie rank
[Figure: number of users (0 to 8,000) by user eccentricity (100 to 10,000)]
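
The join-then-group computation might look like this on a toy dataset. The ratings are made up, and the sketch takes "eccentricity" to mean the median popularity rank of a user's rated movies, as the slide title suggests.

```python
from collections import Counter, defaultdict
from statistics import median

# Made-up (user_id, movie_id) rating events
ratings = [("u1", "a"), ("u2", "a"), ("u3", "a"), ("u4", "a"),
           ("u1", "b"), ("u2", "b"),
           ("u3", "c")]

# Rank movies by number of ratings (1 = most popular)
counts = Counter(movie for _, movie in ratings)
rank = {movie: r for r, (movie, _) in enumerate(counts.most_common(), start=1)}

# Join movie ranks to ratings, grouping by user id as we go
ranks_by_user = defaultdict(list)
for user, movie in ratings:
    ranks_by_user[user].append(rank[movie])

# For each user, compute the median rank of the movies they rated
eccentricity = {user: median(r) for user, r in ranks_by_user.items()}
```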

What do we do when the full dataset exceeds available memory?

Sampling?
Unreliable estimates for rare groups

Random access from disk?
1000x more storage, but 1000x slower [2]
[2] Numbers every programmer should know

Streaming
Read data one observation at a time, storing only needed state

The combinable group-by operation
Streaming
for each observation as (group, value):
    if new group:
        initialize result
    update result for corresponding group as function of
        existing result and current value
for each group:
    output group and result
Useful for computing the subset of within-group statistics that can
be updated one value at a time, with a limited memory footprint
(e.g., min, mean, max, variance, etc.)
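
A sketch of a combinable group-by that keeps only a small running state per group. The slides do not say how the variance is updated; this version uses Welford's online algorithm as one common choice, and the names `update` and `stream_stats` are hypothetical.

```python
import math

def update(state, value):
    """Fold one value into a group's running (n, mean, m2, min, max) state."""
    if state is None:                       # new group: initialize result
        return (1, value, 0.0, value, value)
    n, mean, m2, lo, hi = state
    n += 1
    delta = value - mean
    mean += delta / n
    m2 += delta * (value - mean)            # Welford's update of squared deviations
    return (n, mean, m2, min(lo, value), max(hi, value))

def stream_stats(observations):
    """One pass over (group, value) pairs, storing one small tuple per group."""
    results = {}
    for group, value in observations:
        results[group] = update(results.get(group), value)
    return {
        g: {"n": n, "mean": mean, "min": lo, "max": hi,
            "variance": m2 / (n - 1) if n > 1 else math.nan}
        for g, (n, mean, m2, lo, hi) in results.items()
    }

stats = stream_stats([("a", 2), ("a", 4), ("b", 5), ("a", 6)])
```

Memory grows with the number of groups rather than the number of observations, which is what lets this run over data that does not fit in memory.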

Example: Movielens
[Figure: number of ratings (0 to 3,000,000) at each rating level, 1 to 5]
for each rating:
    counts[rating]++
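
A streaming count of ratings per star level, reading one row at a time. The `user,movie,rating` column layout and the in-memory stand-in for a file are assumptions made for this sketch.

```python
import csv
import io
from collections import Counter

def count_ratings(lines):
    """Count ratings at each star level, reading one row at a time."""
    counts = Counter()
    for user_id, movie_id, rating in csv.reader(lines):
        counts[rating] += 1     # only one counter per distinct rating is kept
    return counts

# Stand-in for an open ratings file; real data would be read from disk
fake_file = io.StringIO("1,10,4\n1,20,5\n2,10,4\n3,30,3.5\n")
counts = count_ratings(fake_file)
```

The ratings are kept as strings here, so the keys are `"4"`, `"5"`, and `"3.5"`.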

Example: Movielens
for each rating:
    totals[movie id] += rating
    counts[movie id]++
for each group:
    totals[movie id] / counts[movie id]
[Figure: density of mean rating by movie, over ratings from 1 to 5]

Yet another group-by operation
Per-group histograms
for each observation as (group, value):
    histogram[group][value]++
for each group:
    compute result as a function of histogram
    output group and result
We can recover arbitrary statistics if we can afford to store counts
of all distinct values within each group
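
As an illustration, an exact median can be recovered from per-group histograms without storing individual observations. The function names and sample ratings below are made up.

```python
from collections import Counter, defaultdict

def build_histograms(observations):
    """Count occurrences of each distinct value within each group."""
    histogram = defaultdict(Counter)
    for group, value in observations:
        histogram[group][value] += 1
    return histogram

def median_from_histogram(hist):
    """Exact median computed from value counts alone."""
    n = sum(hist.values())
    values = sorted(hist)

    def kth(k):
        # k-th smallest observation (0-indexed), found by walking the counts
        seen = 0
        for v in values:
            seen += hist[v]
            if k < seen:
                return v

    # Average the two middle observations (they coincide when n is odd)
    return (kth((n - 1) // 2) + kth(n // 2)) / 2

obs = [("m1", 4), ("m1", 5), ("m1", 4), ("m1", 1), ("m2", 3.5), ("m2", 2)]
hists = build_histograms(obs)
medians = {g: median_from_histogram(h) for g, h in hists.items()}
```

Memory here scales with the number of distinct (group, value) pairs, at most V*G, no matter how many observations stream past.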

The group-by operation
For arbitrary input data:
Memory  Scenario             Distributions  Statistics
N       Small dataset        Yes            General
V*G     Small distributions  Yes            General
G       Small # groups       No             Combinable
V       Small # outcomes     No             No
1       Large # both         No             No
N = total number of observations
G = number of distinct groups
V = largest number of distinct values within group

Examples (w/ 8GB RAM)
Median rating by movie for Netﬂix
N ∼ 100M ratings
G ∼ 20K movies
V ∼ 10 half-star values
V*G ∼ 200K, store per-group histograms for arbitrary statistics
(scales to arbitrary N, if you’re patient)

Median rating by video for YouTube
N ∼ 10B ratings
G ∼ 1B videos
V*G ∼ 10B, fails because per-group histograms are too large to store in memory
G ∼ 1B, but the median has no (exact) streaming calculation

Mean rating by video for YouTube
N ∼ 10B ratings
G ∼ 1B videos
G ∼ 1B, use streaming to compute combinable statistics

The group-by operation
For pre-grouped input data:
Memory  Scenario             Distributions  Statistics
N       Small dataset        Yes            General
V*G     Small distributions  Yes            General
G       Small # groups       No             Combinable
V       Small # outcomes     Yes            General
1       Large # both         No             Combinable
N = total number of observations
G = number of distinct groups
V = largest number of distinct values within group
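
When the input arrives already grouped (for example, sorted by group), each group can be finished before the next begins. A minimal sketch, assuming all observations for a group are contiguous; the function name and data are illustrative.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

def distributions_pregrouped(observations):
    """Yield (group, value counts) from input whose groups are contiguous."""
    for group, rows in groupby(observations, key=itemgetter(0)):
        # Only the current group's histogram (at most V counters) is in memory
        yield group, Counter(value for _, value in rows)

pregrouped = [("a", 4), ("a", 5), ("a", 4), ("b", 3)]
result = list(distributions_pregrouped(pregrouped))
```

A caller that consumes the generator one group at a time, rather than collecting it into a list as done here for display, keeps memory at the V row of the table.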
