2. The what & why
Twitter-based election polling is a cheap alternative to traditional “offline” polls.
Twitter-based election polling should achieve a prediction accuracy similar to traditional polls.
[Diagram: millions of potential voters, inferred votes, biases]
4. @flickr:misteraitch
“No, you cannot predict elections with Twitter.”
D. Gayo-Avello. Internet Computing, IEEE 16.6 (2012): 91-94.
That hasn’t stopped people from
trying!
5. @flickr:practicalowl
Election | Method | Duration | Keywords | Error
Germany, Federal | Count tweets & hashtags | 5 weeks | 6 party names | 1.7%
Singapore, Presidential | Count tweets + sentiment | 1 week | 4 candidate names | 6.1%
USA, Presidential | Count tweets + sentiment | 6 months | 2 candidate names | 11.6%
Ireland, General | Count tweets + sentiment | 3 weeks | 5 party names + election hashtag | 3-6%
Netherlands, Senate | Count tweets | 1 month | 12 Dutch words | 1.3%
USA, Presidential | Count tweets | 6 weeks | 2 (keywords N/A) | 1.7%
Germany, Federal | Count hashtags + sentiment | 4 months | 6 party names + election hashtags | N/A
USA, France, Presidential | Sentiment | 2 months | 2 candidate names + election hashtag | N/A
USA, Republican nomination | Count tweets + sentiment | 1 year | 7 candidate names | N/A
Venezuela, Paraguay, Ecuador, Presidential | Count tweets + users | 7 months | 2 / 3 / 2 candidate names and aliases | 0.1%-19%
6. So far …
Twitter-based predictions lag behind traditional polls.
Most works focus on elections in the developed world.
Traditional polls are accurate.
Traditional polls are conducted often.
7. So far …
What do Twitter-based methods add?
8. In the developing world
… traditional polls are less likely to be reliable.
… the demographic bias of Twitter users is high.
Mean Absolute Error (MAE) of 20 traditional polls conducted in the run-up to the 2014 Indonesian presidential election (chart values, one per poll):
4.08%, 3.45%, 11.75%, 4.21%, 12.24%, 5.64%, 6.25%, 1.36%, 2.69%, 1.19%, 7.02%, 4.20%, 8.84%, 0.98%, 3.96%, 3.13%, 4.24%, 1.15%, 0.87%, 11.49%
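A minimal sketch of how a per-poll MAE like the ones charted here can be computed, assuming each poll reports one percentage per candidate. The function name, the poll, and the outcome below are hypothetical; only the 20 MAE values are taken from the slide.

```python
from statistics import mean


def mean_absolute_error(predicted, actual):
    """Mean absolute difference, in percentage points, across candidates."""
    return mean(abs(predicted[name] - actual[name]) for name in actual)


# Hypothetical poll and hypothetical outcome, for illustration only.
outcome = {"A": 55.0, "B": 45.0}
poll = {"A": 52.0, "B": 48.0}
poll_mae = mean_absolute_error(poll, outcome)

# The 20 per-poll MAE values shown on the slide (in percent).
POLL_MAES = [
    4.08, 3.45, 11.75, 4.21, 12.24, 5.64, 6.25, 1.36, 2.69, 1.19,
    7.02, 4.20, 8.84, 0.98, 3.96, 3.13, 4.24, 1.15, 0.87, 11.49,
]

# Smallest, average (rounded), and largest poll error.
summary = (min(POLL_MAES), round(mean(POLL_MAES), 2), max(POLL_MAES))
print(poll_mae, summary)
```

The spread between the smallest and largest value gives a sense of how much the individual polls disagree with the final result.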
9. Our contributions
A detailed analysis of all major factors of Twitter-based election forecasting, with a special emphasis on de-biasing through “offline” data.
An in-depth comparison of 20 traditional polls and Twitter-based forecasts for the 2014 Indonesian presidential election.
@flickr:carbonnyc
11. Processing pipeline
(1) Data collection: election type, data access, duration, keywords
(2) Data filtering: spam, organisations, geo-location
(3) Data de-biasing: age, gender, location
(4) Election prediction: candidate mentions, one vote per user, tweet sentiment
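The filtering stage could look roughly like the sketch below. It assumes each tweet carries a user name and coordinates, and that spam and organisation accounts have already been identified by some other means. The `Tweet` record, the function names, and the rough bounding box for Indonesia are illustrative assumptions, not the method used in this work (a bounding box is only a crude stand-in for proper geo-coding).

```python
from dataclasses import dataclass

# Approximate bounding box for Indonesia (assumption, deliberately coarse).
LAT_RANGE = (-11.0, 6.0)
LON_RANGE = (95.0, 141.0)


@dataclass
class Tweet:
    user: str
    text: str
    lat: float
    lon: float


def in_indonesia(tweet):
    """True if the tweet's coordinates fall inside the coarse bounding box."""
    return (LAT_RANGE[0] <= tweet.lat <= LAT_RANGE[1]
            and LON_RANGE[0] <= tweet.lon <= LON_RANGE[1])


def filter_tweets(tweets, excluded_users):
    """Drop tweets from excluded accounts and tweets located outside the box."""
    return [t for t in tweets
            if t.user not in excluded_users and in_indonesia(t)]


# Toy data: one tweet from Jakarta, one from a flagged account, one from London.
tweets = [
    Tweet("u1", "example text", -6.2, 106.8),
    Tweet("spam_bot", "example text", -6.2, 106.8),
    Tweet("u2", "example text", 51.5, -0.1),
]
kept = filter_tweets(tweets, excluded_users={"spam_bot"})
print([t.user for t in kept])
```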
14. 2014 Indonesian
presidential election
Joko Widodo vs. Prabowo Subianto
Widodo won 53.15% of the votes.
Widodo won in 23 of the 33 provinces.
Widodo was supported by the opposition.
July 9, 2014
15. Gathered tweets (POLLDATA)
Crawling period: April 15 - July 8, 2014
#Electoral tweets: 7,020,228
Max. tweets / day: 375,064
#Users: 490,270
Max. active users / day: 148,135
Manually curated keyword list (updated daily); only tweets geo-located in Indonesia are included.
16. Gathered tweets II (USERDATA)
Crawling period: July 25 - 30, 2014
#Users: 490,270
#Tweets: ~42,000,000
Most recent 100 tweets per user. Not used for prediction purposes.
18. Is spam a problem?
7.4% are spam users
2.1% are “slacktivists”
3.8% are non-personal users
Based on a manual classification of 600 randomly selected users in USERDATA
19. How large is the bias?
Based on a manual classification of 600 randomly selected users in USERDATA
[Bar charts: gender (Female, Male) and age (0-19, 20-49, 50+), each comparing Twitter users with the population]
20. How large is the bias?
[Bar charts: gender (Female, Male) and age (0-19, 20-49, 50+), each comparing Twitter users with the population]
Automatic classification of POLLDATA (age, gender).
21. How large is the bias?
Based on reverse geo-coding & population data for Indonesia.
[Map: location; labelled: Jakarta]
Internet penetration rate: 17%
23. From tweets to users
Predictor | Widodo | Subianto | MAE | Traditional polls | Province level: correct | min. MAE | max. MAE
tweet count | 56.45% | 43.55% | 3.3% | +7 / -13 | 23/33 | 0.27 | 26.09
user count | 54.45% | 45.55% | 1.3% | +4 / -16 | 24/33 | 0.05 | 25.01
(our baselines)
On the national level, “one user one vote” outperforms tweet-based predictions (confirming prior works).
On the province level the changes are minuscule.
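The difference between the two baselines can be sketched as follows, assuming the input is a list of (user, candidate) mentions. The function names, the tie rule (users who mention both candidates equally often are skipped), and the toy data are assumptions made for illustration, not details taken from the slides.

```python
from collections import Counter, defaultdict


def tweet_shares(mentions):
    """Vote share per candidate when every tweet counts as one vote."""
    counts = Counter(candidate for _, candidate in mentions)
    total = sum(counts.values())
    return {c: 100.0 * n / total for c, n in counts.items()}


def user_shares(mentions):
    """Vote share per candidate when every user counts as one vote."""
    per_user = defaultdict(Counter)
    for user, candidate in mentions:
        per_user[user][candidate] += 1
    votes = Counter()
    for counts in per_user.values():
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue  # tie between the top candidates: skip this user (assumed rule)
        votes[ranked[0][0]] += 1
    total = sum(votes.values())
    return {c: 100.0 * n / total for c, n in votes.items()}


# Toy data: one very active user, two quiet users, one undecided user.
mentions = (
    [("u1", "widodo")] * 8
    + [("u2", "subianto"), ("u3", "subianto")]
    + [("u4", "widodo"), ("u4", "subianto")]
)
by_tweet = tweet_shares(mentions)
by_user = user_shares(mentions)
print(by_tweet, by_user)
```

In the toy data a single prolific account dominates the tweet-based shares, while the user-based shares give each account the same weight.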
25. Location de-biasing
Predictor | Widodo | Subianto | MAE | Traditional polls
tweet count | 55.14% | 44.86% | 2.0% | +5 / -15
user count | 54.26% | 45.74% | 1.1% | +2 / -18
Decreasing the influence of tweets from overrepresented
locations in the dataset improves the prediction.
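One common way to reduce the influence of overrepresented locations is post-stratification: compute the vote shares within each region, then combine the regions in proportion to their share of the population. The slides do not state the exact weighting scheme, so the sketch below is an assumed stand-in, and the region names and numbers are hypothetical.

```python
def reweighted_shares(votes_by_region, population_share):
    """Combine per-region vote shares, weighted by each region's population share.

    votes_by_region: {region: {candidate: count}}
    population_share: {region: fraction of the national population}
    """
    totals = {}
    for region, votes in votes_by_region.items():
        n = sum(votes.values())
        if n == 0:
            continue  # no data for this region
        for candidate, count in votes.items():
            weighted = population_share[region] * count / n
            totals[candidate] = totals.get(candidate, 0.0) + weighted
    norm = sum(totals.values())
    return {c: 100.0 * v / norm for c, v in totals.items()}


# Hypothetical example: the urban region supplies most of the data
# but holds only a quarter of the population.
votes_by_region = {
    "urban": {"A": 80, "B": 20},
    "rural": {"A": 8, "B": 12},
}
population_share = {"urban": 0.25, "rural": 0.75}

raw_a = 100.0 * (80 + 8) / (80 + 20 + 8 + 12)
debiased = reweighted_shares(votes_by_region, population_share)
print(round(raw_a, 1), debiased)
```

In this toy case the raw count favours candidate A because of the urban sample, while the reweighted shares are evenly split.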
26. Gender de-biasing
Predictor | Widodo | Subianto | MAE | Traditional polls | Province level: correct | min. MAE | max. MAE
tweet count | 56.36% | 43.64% | 3.2% | +7 / -13 | 21/33 | 0.33 | 28.05
user count | 54.89% | 45.11% | 1.7% | +5 / -15 | 23/33 | 0.10 | 26.72
Correcting for gender biases degrades the prediction
accuracy on the national & province level.
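The “traditional polls” column appears to count how many of the 20 traditional polls had a lower (+) or higher (-) error than the Twitter-based forecast. That reading is an inference from the numbers, not something the slides state. A minimal sketch that reproduces the two rows above from the poll MAE values and the official result reported elsewhere in this deck (function and variable names are hypothetical):

```python
# MAE values of the 20 traditional polls (percent), as reported in this deck.
POLL_MAES = [
    4.08, 3.45, 11.75, 4.21, 12.24, 5.64, 6.25, 1.36, 2.69, 1.19,
    7.02, 4.20, 8.84, 0.98, 3.96, 3.13, 4.24, 1.15, 0.87, 11.49,
]
ACTUAL_WIDODO = 53.15  # official vote share for Widodo (percent)


def rank_against_polls(error, poll_errors=POLL_MAES):
    """Return (polls with lower error, polls with higher error)."""
    better = sum(1 for e in poll_errors if e < error)
    worse = sum(1 for e in poll_errors if e > error)
    return better, worse


# With two candidates whose shares sum to 100%, the MAE equals the
# absolute error on either candidate's share.
tweet_error = abs(56.36 - ACTUAL_WIDODO)
user_error = abs(54.89 - ACTUAL_WIDODO)

tweet_rank = rank_against_polls(tweet_error)
user_rank = rank_against_polls(user_error)
print(tweet_rank, user_rank)
```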
27. Impact of sentiment
Sentiment variants shown: POS and POS+NEG.
Predictor | Widodo | Subianto | MAE | Traditional polls
tweet count | 53.98% | 46.02% | 0.8% | +0 / -20
user count | 54.02% | 45.98% | 0.9% | +0 / -20
tweet count | 50.67% | 49.33% | 2.5% | +5 / -15
user count | 53.77% | 46.23% | 0.6% | +0 / -20
On the national level, sentiment yields the best forecast.
Province level: correct | min. MAE | max. MAE
14/33 | 0.01 | 54.90
19/33 | 0.26 | 26.51
14/33 | 0.01 | 49.79
19/33 | 0.01 | 26.40
The impact on the province level prediction is negative.
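The slides do not define the POS and POS+NEG variants. One plausible reading, sketched below purely as an assumption, is that POS counts only positive tweets about a candidate, while POS+NEG additionally credits a negative tweet about one candidate to the opponent (which only makes sense in a two-candidate race). The sentiment labels, function name, and toy data are hypothetical.

```python
from collections import Counter

# Two-candidate race: a negative tweet about one is credited to the other
# (assumed interpretation of POS+NEG, not confirmed by the slides).
OPPONENT = {"widodo": "subianto", "subianto": "widodo"}


def sentiment_shares(labelled, use_negative=False):
    """labelled: iterable of (candidate, sentiment), sentiment in {"pos", "neg", "neu"}."""
    votes = Counter()
    for candidate, sentiment in labelled:
        if sentiment == "pos":
            votes[candidate] += 1
        elif sentiment == "neg" and use_negative:
            votes[OPPONENT[candidate]] += 1
        # neutral tweets are ignored in both variants
    total = sum(votes.values())
    return {c: 100.0 * n / total for c, n in votes.items()}


# Toy data, for illustration only.
labelled = (
    [("widodo", "pos")] * 3
    + [("subianto", "pos")]
    + [("subianto", "neg")]
    + [("widodo", "neu")] * 5
)
pos_only = sentiment_shares(labelled)
pos_neg = sentiment_shares(labelled, use_negative=True)
print(pos_only, pos_neg)
```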
28. Impact of sentiment
More than 700 languages
are spoken in Indonesia
29. Conclusions
Simple Twitter-based predictors outperform (almost) all
traditional polls in Indonesia.
Accurate predictions on province level are challenging,
due to data sparsity & data diversity.
Currently: designing a Web application prototype to
automatically observe ongoing elections.