Using Tweets for Understanding Public Opinion During U.S. Primaries and Predicting Election Results
Monica Powell
Barnard College
Columbia University
3009 Broadway
New York, NY 10027
mmp2181@barnard.edu
Nadia Jabbar
Columbia University
Graduate School of Arts and Sciences
535 West 116th Street
New York, NY 10027
nj2290@columbia.edu
Abstract
Using social media for political analysis, especially during elections, has become popular in recent years, and many researchers and media outlets now use social media to understand public opinion and current trends. In this paper, we investigate methods for using Twitter to analyze public opinion and to predict U.S. Presidential Primary Election results. We analyzed over 13 million tweets collected from February 2016 to April 2016, during the primary elections, looking at tweets that mentioned Hillary Clinton, Bernie Sanders, Donald Trump, or Ted Cruz. First, we use sentiment analysis, geospatial analysis, network analysis, and visualization tools to examine public opinion on Twitter. We then use the Twitter data and analysis results to propose a model for predicting primary election results. Our results highlight the feasibility of using social media to examine public opinion and predict election results.
General Terms: Data Visualization, Prediction Models
Keywords: Twitter, Presidential Election, sentiment analysis, geomapping, R Shiny, D3.js, social media, data visualization, Hillary Clinton, Bernie Sanders, Donald Trump, Ted Cruz
1 INTRODUCTION
Microblogging platforms such as Twitter have become increasingly popular communication tools for social media users, who often use these platforms to express their opinions on a variety of topics and to discuss current issues. As more people use Twitter and other microblogging platforms, they post not only about personal matters but also about products and services they use, and they even discuss their political and/or religious views. As a result, these microblogging websites have become valuable sources for gathering public opinion and performing sentiment analysis.
Twitter has over 310 million monthly active users and 1 billion unique monthly visits to sites with embedded Tweets (?). Similar to other social networking websites, Twitter allows people to share information and express themselves in real-time. This immediacy makes Twitter a platform that users can utilize to express their political support for, or discontent with, particular individuals or policies. However, it can be argued whether or not Twitter influences election results and/or whether sentiments expressed on Twitter represent a random sample of a given population.
All of the 2016 presidential candidates have a presence on Twitter, and more than two-thirds of U.S. Congress members have created a Twitter account, with many actively using Twitter to reach their constituents (Wang et al., 2012). An individual's network and the sentiments associated with them on Twitter are unique to them and may or may not mirror their network offline.
In this paper, we analyze tweets obtained from February 2016 to April 2016 in order to examine public opinion on the 2016 U.S. Presidential Primary Elections that are currently taking place. We hypothesize that Twitter, and by extension other popular social media platforms such as Facebook and Google+, are good sources for understanding general public opinion regarding political elections. Furthermore, we hypothesize that Twitter (as well as other popular microblogging platforms) is also useful for predicting election results.
In order to test our hypotheses, we use several different techniques to extract useful information from the tweets, including sentiment analysis of the tweets, geospatial analysis, and network analysis. We use these methods to mine the collected tweets to examine general public opinion regarding the Democratic candidates Hillary Clinton and Bernie Sanders, as well as the Republican candidates Donald Trump and Ted Cruz.
We next use the information that is extracted
from the tweets to build several predictive models
and test them in order to analyze how well Twit-
ter is indicative of general public opinion regard-
ing the 2016 Primaries. In our predictive models,
we also incorporated polling data from several na-
tional polls conducted by different organizations
and gathered by FiveThirtyEight, a website that
focuses on opinion poll analysis. Additionally, we
incorporated the final results for those states where
the primaries have already happened into our pre-
dictive model in order to test the accuracy of our
model.
2 LITERATURE REVIEW
While there is some controversy regarding this topic, social media data can certainly be used for analyzing socio-political trends from the past, during the present, and for the future. Asur and Huberman (2012) effectively used Twitter to predict some real-world outcomes, such as box office revenues for movies pre-release and trends in the housing market sector. Their work suggested that Twitter data can be successfully used to predict consumer metrics. Furthermore, Varian and Choi (2009) used data from Google Trends to predict real-time events, and their work indicated that Google Trends can be used to predict retail sales for motor vehicle and parts dealers. In yet another study, Ginsberg et al. (2010) used social media data to predict flu epidemics, while Mao and Zeng (2011) used Twitter to perform sentiment analysis in order to predict stock market trends.
Social media has also been used to examine political trends. O'Connor et al. (2010) studied public opinion measured from polls along with sentiment measured from text analysis of Twitter posts. Their results showed a strong correlation (as high as 80 percent) between Twitter data and presidential elections. Furthermore, Tumasjan et al. (2010) studied the German federal election to investigate whether Twitter messages correctly mirror offline political sentiment, and they found that tweet sentiment regarding the candidates' political stances strongly correlated with the offline political landscape.
In 2012, Wang et al. created a system for real-time Twitter sentiment analysis for the presidential election, because the nature and popularity of Twitter allows researchers to analyze sentiment in real-time, as opposed to being forced to wait a certain period of time in order to implement more traditional methods of data collection. The Swedish general election was also tracked in real-time by researchers using data gathered from Twitter (Larsson, 2012). While the role of Twitter in election outcomes is debatable, Twitter's users are definitively not apolitical, and thus it was intriguing to investigate whether or not there is a direct correlation between political outcomes and Twitter activity.
Yet, some studies have concluded that Twitter and other social media are not strongly reflective of real-world outcomes. Gayo-Avello et al. (2012) analyzed the 2010 U.S. Congressional elections using Twitter data to test Twitter's predictive power, and were unable to find any correlation between the data analysis results and the actual electoral outcomes. However, it is important to note that the landscape of social media has dramatically changed in the last few years, and so Twitter may be a more accurate measure of public opinion today than it was a few years ago.
3 RESEARCH QUESTION
Using social media for political discourse, especially during political elections, has become common practice. Predicting election outcomes from social media data can be feasible, and as discussed previously, positive results have often been reported. In this paper, we will test the predictive power of the social media platform Twitter in the context of the 2016 U.S. Primary elections. We will use Twitter data to develop a picture of public opinion about the political candidates online, and analyze our results against the results of the primaries that have already happened. We will then create a predictive model using the Twitter data analysis results, and test it using the primary results for those states where elections have already taken place. We propose that while Twitter is a good platform for analyzing public opinion, it cannot immediately replace other measures of gathering public opinion, such as polling data.
4 DATA
Over 13 million tweets were gathered from Twitter from February 2016 to April 2016. The entire dataset, as well as random samples of tweets from the dataset, were used to analyze online sentiment towards Hillary Clinton, Bernie Sanders, Donald Trump, and Ted Cruz. We also looked specifically at dates when at least one primary election was held. These dates were February 9th, February 20th, February 23rd, February 27th, March 1st, March 5th, March 6th, March 8th, March 10th, March 12th, March 15th, March 22nd, March 26th, April 5th, April 9th, April 19th, April 26th, and May 3rd. We did not have data for February 1st, so this was the only primary election date left out of our analysis. Below is a list of the specific elections that happened on each date, as well as the states where the elections took place.
Tuesday, February 9:
New Hampshire
Saturday, February 20:
Nevada Democratic caucuses
South Carolina Republican primary
Tuesday, February 23:
Nevada Republican caucuses
Saturday, February 27:
South Carolina Democratic primary
Tuesday, March 1:
Alabama
Alaska Republican caucuses
American Samoa Democratic caucuses
Arkansas
Colorado caucuses (both parties, no preference
vote for Republicans)
Democrats Abroad party-run primary
Georgia
Massachusetts
Minnesota caucuses (both parties)
North Dakota Republican caucuses (completed by
March 1)
Oklahoma
Tennessee
Texas
Vermont
Virginia
Wyoming Republican caucuses
Saturday, March 5:
Kansas caucuses (both parties)
Kentucky Republican caucuses
Louisiana
Maine Republican caucuses
Nebraska Democratic caucuses
Sunday, March 6:
Maine Democratic caucuses
Puerto Rico (Republicans only)
Tuesday, March 8:
Hawaii Republican caucuses
Idaho (Republicans only)
Michigan
Mississippi
Thursday, March 10:
Virgin Islands Republican caucuses
Saturday, March 12:
Guam Republican convention
Northern Mariana Islands Democratic caucuses
Washington, DC Republican convention
Tuesday, March 15:
Florida
Illinois
Missouri
North Carolina
Northern Mariana Islands Republican caucuses
Ohio
Tuesday, March 22:
American Samoa Republican convention
Arizona
Idaho Democratic caucuses
Utah caucuses (both parties)
Saturday, March 26:
Alaska Democratic caucuses
Hawaii Democratic caucuses
Washington Democratic caucuses
Friday-Sunday, April 1-3:
North Dakota Republican state convention
Tuesday, April 5:
Wisconsin
Saturday, April 9:
Colorado Republican state convention
Wyoming Democratic caucuses
5 METHODS
By capturing tweets mentioning each presidential candidate and analyzing the sentiments behind those tweets, we could track people's opinions about each candidate and thus predict the final primary election results. A function was constructed in R to automatically collect tweets from each day for the months of February, March, and April. The tweets, along with information such as the user's Twitter handle, the location of the user, the text of the tweet, the description of the user's profile, and whether the tweet was retweeted, were encoded into JSON (JavaScript Object Notation) files. The rjson package in R was used to parse the JSON files. We extracted all tweets related to at least one of the four political candidates (Clinton, Sanders, Trump, and Cruz), and combined all extracted tweets into a .csv file for further analysis.
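The pipeline above was written in R with the rjson package; as a minimal sketch of the parse-filter-export step, the Python below assumes one JSON record per line and hypothetical field names (`screen_name`, `location`, `text`):

```python
import csv
import json

CANDIDATES = ["clinton", "sanders", "trump", "cruz"]

def mentions_candidate(text):
    """Return True if the tweet text mentions any of the four candidates."""
    lowered = text.lower()
    return any(name in lowered for name in CANDIDATES)

def extract_candidate_tweets(json_lines, csv_path):
    """Parse one-JSON-record-per-line tweets, keep candidate mentions,
    and write the kept records to a .csv file."""
    kept = []
    for line in json_lines:
        tweet = json.loads(line)
        if mentions_candidate(tweet.get("text", "")):
            kept.append(tweet)
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["screen_name", "location", "text"],
            extrasaction="ignore",
        )
        writer.writeheader()
        writer.writerows(kept)
    return kept

# Two illustrative raw records; only the first mentions a candidate.
raw = [
    '{"screen_name": "a", "location": "NY", "text": "I support Bernie Sanders"}',
    '{"screen_name": "b", "location": "TX", "text": "What a nice day"}',
]
```

The same filter-then-flatten shape applies whatever the exact field layout of the collected JSON files was.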
All of the Twitter data was analyzed using R, D3.js, and QGIS in order to determine whether or not certain dimensions of Twitter activity related to the presidential election correlate with primary election results. Specifically, the research methods implemented aimed to address whether or not more positive sentiment towards a particular candidate on Twitter significantly increases a candidate's probability of winning a primary election. The primary focus of the analysis was text mining for sentiments, geospatial analysis using GIS to look at specific states, and network analysis to evaluate the network elements of the tweets and examine useful network parameters of the mention network of all four candidates. Additionally, we used our selected parameters, as well as general polling results obtained from FiveThirtyEight, a website that focuses on opinion poll analysis and politics, and built several prediction models to test whether Twitter is a good indicator of offline public opinion and political election outcomes.
6 VISUALIZATIONS
We constructed three types of visualizations to test our hypotheses. We created several static visualizations in R to get an overall look at all of the tweets in relation to each of the four candidates. We then created interactive visualizations using R Shiny and D3.js to look more closely at changes in public opinion on primary days in order to evaluate Twitter trends during primary election days.
6.1 Static Visualizations
We first looked at the entire dataset, which consisted of 13,289,699 tweets over the three months of February, March, and April. These tweets were parsed and divided into four categories representing each of the four political candidates. For example, all tweets that mentioned Clinton were merged into a single data frame; the same was done for Sanders, Trump, and Cruz.
Preliminary data analysis was conducted on the over 13 million tweets that were collected in order to reveal high-level trends that would be relevant and provide context for further sentiment analysis. As the pie chart below illustrates, more than fifty percent of all of the tweets in the entire data set mentioned Trump; he was undoubtedly the most discussed candidate on Twitter. Sanders was the second most discussed candidate, while Cruz and Clinton were discussed the least.
Figure 1: Proportion of Total Tweets Mentioning
Each Candidate
The next visualization (depicted below) also relates to tweet volume by candidate and by party for all of the tweets in the final data set. The outer donut illustrates the proportion of tweets belonging to each party. It is clear that the Republican candidates had far more tweets (68 percent of all tweets) than the Democratic candidates (32 percent of all tweets). This is largely because of Trump, who was mentioned in more than 50 percent of all tweets in the data set. The inner donut shows tweet proportions for each of the four candidates. As can be seen, Trump-related tweets make up the vast majority of the final data set, with 54 percent of all tweets mentioning Trump. Sanders was the next most popular on Twitter, with 19 percent of tweets mentioning him, while Cruz had 14 percent of tweets mentioning him. Clinton was the least popular candidate on Twitter, with 12 percent of all tweets mentioning her.
Trump, of course, has won the most primary elections by a large margin in comparison to the other Republican candidates, which Twitter confirms here. If we were to go by tweet volumes alone to predict the Presidential Elections, the data would seem to support the claim that Trump will win by a landslide. Likewise, the fact that Clinton is less popular on Twitter than Sanders would seem to indicate that Sanders will win the Democratic primaries if we only look at tweet volume. However, looking at the primary elections that have happened thus far, Clinton has won more states than Sanders. Hence, this may indicate that tweet volume alone is not an entirely accurate predictor of real-world election results. Because tweet volume alone cannot predict a candidate's popularity, we expanded the scope of measures to examine.
Figure 2: Tweet Volume by Party and by Candi-
date
We next extracted all tweets that were geo-tagged. This considerably reduced the number of tweets, as it is estimated that only between 5 and 20 percent of all tweets are geo-tagged with a location. However, we wanted to look at the origins of our tweets, and we assume that the sub-sample of geo-tagged tweets is representative of the entire data set of tweets.
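The geo-tag filter itself is simple; a sketch, assuming a Twitter-API-style `coordinates` field that is null for non-geo-tagged tweets:

```python
def geotagged(tweets):
    """Keep only tweets that carry a (longitude, latitude) pair.

    The `coordinates` field name mirrors Twitter's API convention,
    though the exact layout of the collected data set is an assumption.
    """
    return [t for t in tweets if t.get("coordinates") is not None]

# Illustrative sample: one geo-tagged tweet, one without a location.
sample = [
    {"text": "Go Trump", "coordinates": [-74.0, 40.7]},
    {"text": "Feel the Bern", "coordinates": None},
]
```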
Figure 3 below is a world map showing the origins of all geo-tagged tweets in the complete data set. The yellow dots indicate where tweets originated on the map. Unsurprisingly, the vast majority of tweets originate from inside the United States. The primaries are for the U.S. Presidency, so it is expected that the four candidates would be most talked about within America. However, it was interesting to see that the highest concentration of tweets was on the East Coast of the U.S., while the West Coast was also very concentrated with tweet origins. Middle America was not very concentrated with tweets. If these geo-tagged tweets are reflective of the total sample of tweets used, then there may be a bias introduced in the data set, with a greater proportion of tweets from the East Coast and very few tweets originating from Middle America.
Outside of America, northern Europe, particularly the U.K., was also heavily concentrated with tweets pertaining to the candidates. We do not know why the U.K. in general had such a high proportion of tweets. It may be because British people are highly interested in American politics, because British Twitter users have a different system for geo-tagging than users in other countries, or because many Americans travel to Northern Europe and remain engaged with politics on Twitter while abroad. The four candidates were discussed in other regions of the world as well, but with much lower concentration. Europe showed more interest in American politics than any other region (excluding the United States).
Figure 3: World Origins of Tweets
We next looked at the tweet frequency by state for each candidate within the United States. We first look at Hillary Clinton's map, which is depicted below. The states with the largest number of tweets mentioning Clinton were California, Texas, Florida, Illinois, and of course, New York. Clinton has won all of the primary elections in these states with the exception of California, whose primary has not yet taken place. States like North and South Dakota, Montana, Wyoming, and Nebraska had an almost non-existent frequency of tweets mentioning Clinton. However, it is interesting that Clinton seems to be more popular in Utah on Twitter in comparison to the other candidates, even though she lost to Sanders in the Utah primary elections. This indicates that tweet volumes on Twitter may not be entirely accurate in predicting election results.
Figure 4: Clinton’s Tweet Frequencies Map
We next looked at Bernie Sanders's map of tweet frequencies by state (depicted below). It is interesting to see in which states he is more popular than Clinton. Surprisingly, discussions about Sanders are very popular on Twitter in Ohio, even though he lost the Ohio state primary to Clinton by a substantial margin.
Figure 5: Sanders’ Tweet Frequencies Map
We next looked at Ted Cruz's map of tweet frequencies by state (depicted below). Compared to Clinton and Sanders, Cruz is more popular in the West, with states like Nevada, Arizona, and Oregon showing more interest in him on Twitter. He is also mentioned more in states like Montana and Nebraska, where Clinton and Sanders had almost non-existent mentions.
Lastly, we looked at Donald Trump's map of tweet frequencies by state (depicted below). Interestingly, he is not as popular as Cruz in Montana, Wyoming, and Nebraska, where he is rarely mentioned on Twitter.
Figure 6: Cruz’s Tweet Frequencies Map
Figure 7: Trump’s Tweet Frequencies Map
After looking at the maps of all four candidates' tweet frequencies, it does seem that tweet frequencies alone are not always a good indicator of election results. As mentioned above, there were some instances (such as Clinton's very high volume of tweets in Utah and Sanders's very high volume of tweets in Ohio) where Twitter did not correlate with the real-world outcome of the elections (based on the assumption that a higher tweet volume should correlate with winning the majority of votes). However, overall, tweet volumes for each candidate in each state correlated with outcomes more often than not. Tweet volume per candidate will therefore be a predictor variable incorporated into the prediction model introduced later in this paper.
It seems that tweet volume performs sporadically as a predictor of election results. However, we can use an algorithm to evaluate and categorize the feelings expressed in text; this is called sentiment analysis. Hence, we next looked at textual sentiment analysis of the tweets to get better insight into public opinion on Twitter regarding the 2016 Primary Elections.
In order to extract sentiments for each of the tweets, the Syuzhet R package was utilized, which comes with four sentiment dictionaries and provides a method for accessing the robust, but computationally expensive, sentiment extraction tool developed by the NLP group at Stanford. The developers of this algorithm built a dictionary/lexicon containing a large number of words with associated scores for eight different emotions and two sentiments (positive/negative). Each individual word in the lexicon has a yes (one) or no (zero) for each emotion and sentiment, and we can calculate the total sentiment of a sentence by adding up the individual sentiments of each word in the sentence. It is important to note that sentiment analysis of tweets comes with its fair share of problems. For example, sentiment analysis algorithms can be more sensitive to expressions typical of men than of women. Furthermore, it can be argued that computers are not optimal at identifying emotions correctly in all cases; they are likely not great at identifying something like sarcasm. Most of these concerns should not have a large effect on our analysis here because we are looking at text in aggregate. Additionally, when using as large a dataset as the one for this study, it is likely that many more tweets will be correctly identified by sentiment, and the effects of incorrectly identified sentiments will be normalized. The entire data set was used to derive sentiment scores for all four candidates, and bar graphs depicting aggregates of the results are shown below.
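The additive lexicon lookup described above can be sketched as follows; the tiny word lists here are illustrative stand-ins for the Syuzhet dictionaries, which were used in R:

```python
# Toy positive/negative lexicon standing in for the Syuzhet dictionaries.
POSITIVE = {"great", "win", "support", "love"}
NEGATIVE = {"scandal", "lose", "hate", "bad"}

def sentence_sentiment(sentence):
    """Score a sentence as (#positive words) - (#negative words).

    Each word contributes +1, -1, or 0, and the sentence score is the
    sum over words, mirroring the additive scheme described above.
    """
    score = 0
    for word in sentence.lower().split():
        word = word.strip(".,!?#@")
        if word in POSITIVE:
            score += 1
        elif word in NEGATIVE:
            score -= 1
    return score
```

Averaging these per-tweet scores over all tweets mentioning a candidate gives the aggregate scores plotted in the bar graphs.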
The more positive the sentiment score, the more positive the overall sentiment of the tweets associated with a candidate. Sanders has the highest average sentiment score of all the candidates, while Trump has the second highest average sentiment score over all tweets. Both Clinton and Cruz have lower average sentiment scores over all the tweets.
When we look at the average very positive sentiment scores for each of the candidates over all of the tweets, Trump has, on average, more positive sentiment scores than the other candidates, while Sanders comes in a close second place. However, it is important to note that Trump has a very large proportion of tweets compared to Sanders, and this may be skewing the average very positive
Figure 8: Average Sentiment Score over all Tweets
for Each Candidate
sentiment scores. It may be interesting to equalize the data set to contain fewer tweets mentioning Trump and to see how this affects the average very positive sentiment scores. It is interesting that Clinton has the lowest average positive sentiment scores over all tweets mentioning her. Lastly, we look at the average very negative sentiment scores bar graph, and the results correspond to the other two graphs. Cruz has the highest average negative scores over all tweets relating to him, while Clinton comes in second place. Sanders, on the other hand, has the lowest average negative sentiment scores over all tweets mentioning him. Hence, if we were to go by these sentiment scores to predict election outcomes, it would seem that Sanders would win the Democratic primaries while Trump would win the Republican primaries.
6.2 R Shiny Visualization: Word Frequency
(Wordcloud)
An interactive visualization app was produced using the R Shiny platform to analyze the text of the tweets. Preliminary data analysis was conducted on the collected tweets in order to reveal trends relevant to further sentiment analysis. An R Shiny application was developed to generate a different wordcloud visualization for each date on which data was collected.
Figure 9: Average Very Positive Sentiment Score over all Tweets for Each Candidate
The wordcloud visualizations represent the words that were most prevalent in tweets related to a particular candidate. Each day had slightly different words that dominated a candidate's network, and on some days in particular there was a strong theme or increased polarization.
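A word-frequency tally of this kind underlies the wordclouds; the counting step can be sketched as follows (the app itself was built in R Shiny, and the stopword list here is illustrative):

```python
from collections import Counter

STOPWORDS = {"the", "a", "to", "is", "of", "and", "in", "rt"}  # illustrative

def top_words(tweets, n=3):
    """Count word occurrences across tweets, ignoring stopwords."""
    counts = Counter()
    for text in tweets:
        for word in text.lower().split():
            word = word.strip(".,!?\"'")
            if word and word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(n)

# Invented tweets echoing the February 27th example below.
tweets = [
    "#NeverTrump is trending",
    "#nevertrump again today",
    "the rally in Chicago",
]
```

The top-ranked words are then sized proportionally to their counts to render the wordcloud.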
For example, on February 27th, tweets related to Donald Trump mainly contained '#nevertrump' (Figure 12). Influencers on Twitter such as Marco Rubio, Glenn Beck, and Amanda Carpenter all published tweets containing the hashtag as a strategic move against Donald Trump prior to Super Tuesday on March 1st, which left Trump with a sentiment score of 0.09 (Figures 11 and 13).
Backlash and outrage over Hillary Clinton commending Nancy Reagan's involvement in the H.I.V./AIDS conversation following Reagan's death dominated tweets about Clinton on March 12th (Figure 14): "The problem with Mrs. Clinton's compliment: It was the Reagans who wanted nothing to do with the disease at the time" (Source: http://www.nytimes.com/politics/first-draft/2016/03/11/hillary-clinton-lauds-reagans-on-aids-a-backlash-erupts/). Sentiment analysis confirmed that tweets regarding Clinton were overall negative, as she had a sentiment score of only -0.01 on March 12th, while Sanders
Figure 10: Average Very Negative Sentiment
Score over all Tweets for Each Candidate
Figure 11: Marco Rubio Tweets #NeverTrump on
February 27th
and Cruz had higher sentiment scores (0.22 and 0.31, respectively) (Figure 15). Trump also had a relatively low sentiment score of -0.01 on March 12th, which was the same day that protesters disrupted a Trump rally in Chicago and forced the event to be canceled (Figures 15 and 16).
On March 26th, a scandal broke out involving Ted Cruz: the group Anonymous alleged that Cruz was involved in a sex scandal. Most tweets that mentioned Ted Cruz on March 26th involved the scandal (Figure 17). Although he generally had the highest sentiment score of all the candidates, on March 22nd and March 26th he had the lowest sentiment scores of all the candidates (0.02 and 0.01, respectively) (Figure 18).
In general, the words that appeared most frequently (as illustrated in the wordclouds) were predictive of a candidate's sentiment score, and this
Figure 12: February 27th, 2016 Wordcloud for Donald Trump
Figure 13: February 27th, 2016 Sentiment
consistency further reinforced the appropriateness and validity of the Syuzhet R package used for the sentiment calculations. The sentiment score provides a concrete quantitative measure of how a network feels towards a particular candidate, whereas the wordcloud represents the qualitative feelings of individuals and provides further context for the sentiment scores.
6.3 D3.js Visualizations
We next created several visualizations using the
D3.js platform. D3.js (D3 for Data-Driven Doc-
uments) is a JavaScript library for producing
dynamic, interactive data visualizations in web
browsers. It makes use of the widely imple-
mented SVG, HTML5, and CSS standards. All
of the visualizations produced using D3.js are
available at: http://aboutmonica.com/
final%20D3/.
A D3 force-layout visualization of the mention network for all four candidates was generated. The network was constructed only from tweets posted on the days of primary elections. This is a very large network with over 30,000 edges, and hence the resulting layout graph is very large and takes a while to load.
Figure 14: March 12th, 2016 Wordcloud for Hillary Clinton
Figure 15: March 12th, 2016 Sentiment
In the force-layout visualization at the link provided above, you can see a vast social mention network of tweets. Since this social network has directed edges, we can look at the direction of tweet mentions: where many nodes connect to one central node, the arrows all point to the central node, meaning that many Twitter users are tweeting at and mentioning that central node. From the network graph you can see that some Twitter users (represented by the nodes in the graph) have very large networks and are very densely connected by edges to other Twitter users. An edge between two Twitter users signifies that one of the users mentioned or re-tweeted the other, so those areas that are very dense and dark in the graph are likely people who were mentioned or re-tweeted many times. On the other hand, towards the outskirts of the graph, there are a few nodes that are connected to each other by only a few edges. These Twitter users are connected because they mentioned or re-tweeted each other during the time that the data was collected; however, they are separate from the other clusters in the graph, not being connected to other nodes by edges. This graph clearly depicts which Twitter users have larger networks (denser clusters
Figure 16: March 12th, 2016 Wordcloud for Donald Trump
Figure 17: March 26th, 2016 Wordcloud for Ted Cruz
around the nodes). Lastly, we also have nodes that are connected by only one or two edges (or, in some cases, none), indicating that they are neither mentioned by other users in the network nor mentioning other users in their tweets.
We next looked at each of the four candidates' networks separately and, using Gephi, derived network parameter values in order to better assess what is going on in this network in relation to each of the four candidates. Figure 19 below depicts the results of our analysis. There are some interesting results to point out. Cruz's average clustering coefficient is 0, while Trump's network is almost zero at 0.001. Hence, it seems that Cruz's tweet mention network is very small, with very little clustering of users and most users not being interconnected. In general, all of the candidates have very small clustering coefficients, with Sanders having the highest value at 0.005. This may be because the social network analyzed is a network of mention tweets, and it is unlikely that the candidates would reply to many of the tweets that mention them. Additionally, these tweets were collected in real-time, so a candidate may have responded to a tweet in the network at a later time that was not captured in our dataset. Furthermore, Sanders's network has the highest average degree at 2.109, while
Figure 18: March, 26th, 2016 Sentiment
Clinton leads closely behind at 2.013. This im-
plies that on average, a node in Sanders network
has 2.109 edges connected with it, meaning that
users are more likely to interact in Sanders net-
work in comparison to networks of the other can-
didates. In Sanders network it appears that nodes
are more likely to interact and mention other nodes
than other candidate networks. Sanders also has
the largest network diameter at 6, which indicates
that it is likely that he reaches a greater audience
than the other candidates. Lastly, it is interest-
ing to note that both Republican candidates have
lower average path lengths when compared to the
Democratic candidates, meaning that nodes can be
reached in fewer steps in the networks for the Re-
publican candidates.
Figure 19: Network Parameters for Each Candi-
date
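The four statistics we report (average degree, average clustering
coefficient, network diameter, average path length) were computed in
Gephi; the networkx sketch below approximates them for a directed
mention network. It is a hypothetical re-implementation, not the
paper's pipeline: like Gephi, it computes diameter and path length on
the largest weakly connected component, treated as undirected, since a
fragmented directed network has no finite diameter.

```python
import networkx as nx

def network_parameters(G):
    """Approximate the per-candidate network statistics for a
    directed mention network G (see Figure 19)."""
    n = G.number_of_nodes()
    # Total (in + out) degree averaged over all nodes.
    avg_degree = sum(d for _, d in G.degree()) / n
    avg_clustering = nx.average_clustering(G)
    # Distances are only defined within a connected piece, so use the
    # giant weakly connected component as an undirected graph.
    giant = G.subgraph(max(nx.weakly_connected_components(G), key=len))
    U = giant.to_undirected()
    return {
        "average degree": avg_degree,
        "average clustering coefficient": avg_clustering,
        "network diameter": nx.diameter(U),
        "average path length": nx.average_shortest_path_length(U),
    }

# Toy star network: three users all mention one central account.
G = nx.DiGraph([("a", "hub"), ("b", "hub"), ("c", "hub")])
params = network_parameters(G)
print(params["average degree"])    # 1.5 (6 edge endpoints / 4 nodes)
print(params["network diameter"])  # 2
```

A pure star like this has clustering coefficient 0, which mirrors why
candidate-centered mention networks, where most users only point at
the candidate and not at each other, show near-zero clustering.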
7 Prediction Model
After having explored and analyzed the Twitter data, we next focused
on building a prediction model. We created a panel dataset for all
four candidates and looked at the primary election days as well as
randomly chosen days from our Twitter dataset; in the end, the panel
covered a total of 70 days of Twitter data. We chose to look
specifically at the states of New York, Indiana, and Nebraska. The
next primaries will take place in Nebraska on May 10th, so we would
like to evaluate our model's predictions against the actual outcome of
the Nebraska primaries. We coded all our independent and dependent
variables for these three states. New York, which Trump and Clinton
won, was used as the training dataset for the prediction model. The
testing dataset covered the states of Indiana and Pennsylvania, and it
was fitted using the results of the training dataset to calculate
predicted values for who would win in Indiana and Pennsylvania. It is
important to note that for this analysis, it was necessary to look
only at tweets that were geo-tagged and belonged to one of the states
used in the panel data. This undoubtedly decreased the total number of
tweets available to analyze, as most of the tweets collected were not
geo-tagged at all. However, we were still able to obtain thousands of
tweets for most days for each candidate.
The dependent variable used in the analysis was called
electionresults; it was equal to 0 if the candidate did not win the
primary election (or the FiveThirtyEight polling average for that day)
and equal to 1 otherwise. The researchers at FiveThirtyEight have
collected, and continue to collect, national polls for the Republican
and Democratic primaries, and they generate a polling average from all
polls collected for each candidate. For the Democratic primary, a
total of 671 polls have been collected thus far, and a total of 681
polls have been collected for the Republican primary. This polling
average is adjusted for pollster quality, sample size, and recency,
and as a result it is a good indicator of public opinion regarding the
primaries and the candidates. Furthermore, FiveThirtyEight offers
daily polling averages from as early as July 10, 2015 up to the
current day. Hence, it was fairly simple for us to collect the daily
polling average and determine which candidate led the polls on each
day in our dataset.
For the independent variables, we used the tweet volume for each
candidate in each state, the average sentiment score (calculated from
each candidate's tweet corpus for each day and each state), and lastly
the network parameters described above. These parameters (average
degree, average clustering coefficient, network diameter, and average
path length) were not derived for each day and each specific state,
but from the entire Twitter dataset, and were thus constant over all
days.
In addition to the independent variables, we
also added control variables to the panel dataset.
The control variables used were the population
for each state and average income for each state.
Lastly, we used a lagged dependent variable as an
independent variable because in time series anal-
ysis, it is expected that the poll results from the
previous day would be predictors of the poll re-
sults for the current day, and we needed to account
for this correlation.
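The panel structure described above, including the lagged dependent
variable, can be sketched in pandas. The numbers below are
hypothetical placeholders, not values from our dataset; the point is
only the mechanics of lagging yesterday's poll outcome into today's
row for one candidate in one state.

```python
import pandas as pd

# Hypothetical daily rows for one candidate in one state; the actual
# panel covered four candidates over 70 days.
df = pd.DataFrame({
    "day":             pd.date_range("2016-04-01", periods=4),
    "tweet_volume":    [5200, 4800, 6100, 7300],
    "avg_sentiment":   [0.12, 0.08, 0.15, 0.21],
    "electionresults": [0, 1, 1, 1],  # 1 = won the poll average that day
})

# Lagged dependent variable: yesterday's poll outcome as a predictor
# for today's outcome, to account for day-to-day correlation.
df["electionresults_lag1"] = df["electionresults"].shift(1)
df = df.dropna()  # the first day has no previous day to lag from

print(df["electionresults_lag1"].tolist())  # [0.0, 1.0, 1.0]
```

Dropping the first row is the standard cost of a one-day lag: each
remaining observation pairs today's covariates with yesterday's
outcome.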
8 Findings
In order to train and test our data set, we used three
different statistical methods: Logistic Regression,
Random Forests, and Support Vector Machines.
We wanted to see if one of these three models
performed better than the others. The regression
equation for the logistic regression model is shown
in the figure below.
Figure 20: Logistic Regression Equation
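Figure 20 itself is not reproduced in this text. As a sketch, with
hypothetical symbol names for our predictors (the exact specification
is in the figure), the model takes the standard logistic form:

$$
P(\text{electionresults}_{it} = 1) =
\frac{1}{1 + e^{-\left(\beta_0
  + \beta_1\,\text{volume}_{it}
  + \beta_2\,\text{sentiment}_{it}
  + \beta_3\,\text{electionresults}_{i,t-1}
  + \boldsymbol{\gamma}^{\top}\mathbf{c}_{it}\right)}}
$$

where $i$ indexes candidates, $t$ indexes days, and $\mathbf{c}_{it}$
collects the network parameters and control variables (state
population and average income).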
All three models performed well in terms of prediction results, which
was surprising to us. As we mentioned earlier, Sanders is very popular
on Twitter compared to Clinton, and so we expected this to skew the
results, but it does not appear to have done so. After training the
models on the New York dataset, we tested them on the Indiana and
Pennsylvania datasets; for both, the models correctly predicted that
Clinton and Trump would win Pennsylvania (which they did) and that
Sanders and Trump would win Indiana (which they did). We used ROC
curves (depicted in Figure 21) to evaluate the predictive accuracy of
our models, and it appears that the Random Forests and Support Vector
Machine models performed better than the Logistic Regression model.
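Fitting and comparing the three model families is straightforward with
scikit-learn. The sketch below uses synthetic stand-in features (not
our panel data) to show the train/score/ROC-AUC loop; note that SVC
needs probability=True to produce the scores an ROC curve requires.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the panel: columns play the role of
# [tweet volume, average sentiment, lagged dependent variable].
X_train = rng.normal(size=(70, 3))
y_train = (X_train[:, 1] + 0.5 * X_train[:, 2] > 0).astype(int)
X_test = rng.normal(size=(30, 3))
y_test = (X_test[:, 1] + 0.5 * X_test[:, 2] > 0).astype(int)

models = {
    "logit": LogisticRegression(),
    "rf":    RandomForestClassifier(n_estimators=100, random_state=0),
    "svm":   SVC(probability=True, random_state=0),
}

aucs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Area under the ROC curve summarizes each model's ranking quality.
    aucs[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print({k: round(v, 3) for k, v in aucs.items()})
```

On real panel data, comparing these AUCs is one way to conclude, as we
did from Figure 21, that the Random Forests and SVM models edge out
the logistic regression.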
9 Conclusion
In this paper, we looked at Twitter data from the months of February,
March, and April in order to predict election outcomes for the 2016
Presidential Primaries. We analyzed several variables to explore the
Twitter data, including network parameters, text sentiment of the
tweets, and tweet volume for each of the four candidates. In order to
visualize our results, we built several static and interactive
visualizations. The prediction models that we developed performed very
well in predicting the election outcomes. However, we only tested our
models on two states, and we would like to run further tests using
other state primaries in order to assess the predictive accuracy of
our models.

Figure 21: ROC curves for all Three Prediction Models
10 References
"Company — About." Twitter, 31 Mar. 2016. Web. 07 May 2016.
<https://about.twitter.com/company>.
Larsson, A. O., & Moe, H. (2012). Studying
political microblogging: Twitter users in the
2010 Swedish election campaign. New Media &
Society, 14(5), 729-747.
Wang, H., Can, D., Kazemzadeh, A., Bar, F., &
Narayanan, S. (2012, July). A system for real-time
twitter sentiment analysis of 2012 us presidential
election cycle. In Proceedings of the ACL 2012
System Demonstrations (pp. 115-120). Associa-
tion for Computational Linguistics.
11 URLs
All D3 graphics used in this project are available for viewing online.
The R Shiny application was too large to upload, but its source code
can be viewed by clicking the menu items at
http://aboutmonica.com/final%20D3/
Republican Sentiments:
http://aboutmonica.com/final%20D3/republican%20sentiments/
Democrat Sentiments:
http://aboutmonica.com/final%20D3/democrat%20sentiments/
Volume of Tweets per Candidate:
http://aboutmonica.com/final%20D3/candidate%20tweet%20volume%20prop%20D3/