Using Tweets for Understanding Public Opinion During U.S. Primaries
and Predicting Election Results
Monica Powell
Barnard College
Columbia University
3009 Broadway
New York, NY 10027
mmp2181@barnard.edu
Nadia Jabbar
Columbia University
Graduate School of Arts and Sciences
535 West 116th Street
New York, NY 10027
nj2290@columbia.edu
Abstract
Using social media for political analysis,
especially during elections, has become
popular in the past few years, and many
researchers and media outlets now use social
media to understand public opinion and
current trends. In this paper,
we investigate methods for using Twit-
ter to analyze public opinion and to pre-
dict U.S. Presidential Primary Election re-
sults. We analyzed over 13 million tweets
from February 2016 to April 2016 during
the primary elections, and we looked at
tweets that mentioned either Hillary Clin-
ton, Bernie Sanders, Donald Trump or Ted
Cruz. First, we use sentiment analysis,
geospatial analysis, network analysis, and
visualization tools to examine public opinion
on Twitter. We then use the Twitter data and
analysis results to propose a model for
predicting primary election results. Our re-
sults highlight the feasibility of using so-
cial media to look at public opinion and
predict election results.
General Terms Data Visualization, Prediction
Models
Keywords Twitter, Presidential Election, senti-
ment analysis, geomapping, RShiny, D3.js, Social
media, data visualization, Hillary Clinton, Bernie
Sanders, Donald Trump, Ted Cruz.
1 INTRODUCTION
Microblogging platforms such as Twitter have be-
come increasingly popular communication tools
for social media users who often use these plat-
forms to express their opinions on a variety of top-
ics and to discuss current issues. As more people
use Twitter and other microblogging platforms,
they post not only about their personal matters, but
also about products and services they use, and they
even discuss their political and/or religious views.
As a result, these microblogging websites have be-
come valuable sources for gathering public opin-
ion and sentiment analysis.
Twitter has over 310 million monthly active users
and 1 billion unique monthly visits to sites with
embedded Tweets (Twitter, 2016). Similar to other social networking
websites, Twitter allows people to share informa-
tion and express themselves in real-time. This im-
mediacy makes Twitter a platform that users can
utilize in order to express their political support
or discontent for particular individuals or policies.
However, it is debatable whether Twitter
influences election results and whether sentiments
expressed on Twitter represent a random sample of
a given population.
All of the 2016 presidential candidates have a
presence on Twitter, more than two thirds of
U.S. Congress members have created a Twitter
account, and many are actively using Twitter to
reach their constituents (Wang et al., 2012). An in-
dividual’s network and sentiments associated with
them on Twitter are unique to them and may or
may not mirror their network offline.
In this paper, we analyze tweets obtained from
February 2016 to April 2016 in order to examine
public opinion on the 2016 U.S. Presidential Pri-
mary Elections that are currently taking place. We
hypothesize that Twitter, and by extension, other
popular microblogging websites such as Facebook
and Google+, are good sources for understand-
ing general public opinion regarding political elec-
tions. Furthermore, we hypothesize that Twitter
(as well as other popular microblogging platforms)
is also useful for predicting election results.
In order to test our hypotheses, we use several
different techniques to extract useful information
from the tweets, including sentiment analysis of
the tweets, geospatial analysis, and network anal-
ysis. We use these methods to mine the collective
tweets to examine general public opinion regard-
ing the Democratic candidates Hillary Clinton and
Bernie Sanders, as well as the Republican candi-
dates Donald Trump and Ted Cruz.
We next use the information that is extracted
from the tweets to build several predictive models
and test them in order to analyze how well Twit-
ter is indicative of general public opinion regard-
ing the 2016 Primaries. In our predictive models,
we also incorporated polling data from several na-
tional polls conducted by different organizations
and gathered by FiveThirtyEight, a website that
focuses on opinion poll analysis. Additionally, we
incorporated the final results for those states where
the primaries have already happened into our pre-
dictive model in order to test the accuracy of our
model.
2 LITERATURE REVIEW
While there is some controversy regarding this
topic, social media data can certainly be used for
analyzing past, present, and future socio-political
trends. Asur and
Huberman (2010) effectively used Twitter to pre-
dict some real-world outcomes, such as box of-
fice revenues for movies pre-release and trends in
the housing market sector. Their work suggested
that Twitter data can be successfully used to pre-
dict consumer metrics. Furthermore, Varian and
Choi (2009) used data from Google Trends to pre-
dict economic activity in near real-time, and their
work indicated that Google Trends can be used to
predict retail sales for motor vehicle and parts dealers. In yet another
study by Ginsberg et al. (2010), researchers used
search query data to predict flu epidemics, while
Mao and Zeng (2011) used Twitter to perform sen-
timent analysis in order to predict stock market
trends.
Social media has also been used to examine po-
litical trends. O'Connor et al. (2010) studied pub-
lic opinion measured from polls along with senti-
ment measured from text analysis of Twitter posts.
Their results showed a strong correlation (as high
as 80 percent) between Twitter sentiment and the
polls. Furthermore, Tumasjan et al. (2010)
studied the German federal election to investigate
whether Twitter messages correctly mirror offline
political sentiment, and they found that tweet sen-
timent regarding the candidates’ political stances
strongly correlated with the political landscape of-
fline.
In 2012, Wang et al. created a system for
real-time twitter sentiment analysis for the presi-
dential election because the nature and popularity
of Twitter allows researchers to analyze sentiment
in real-time, as opposed to being forced to wait af-
ter a certain period of time in order to implement
more traditional methods of data collection. The
2010 Swedish national election was also tracked in
real-time by researchers using data gathered from
Twitter (Larsson, 2012). While the role of Twitter
in election outcomes is debatable, Twitter users
are certainly not apolitical, and thus it is in-
triguing to investigate whether there is a
direct correlation between political outcomes and
Twitter activity.
Yet, some studies have concluded that Twit-
ter and other social media are not strongly re-
flective of real world outcomes. Gayo-Avello et
al. (2012) analyzed the 2010 U.S. Congressional
elections using Twitter data to test Twitter's pre-
dictive power, and were unable to find any cor-
relation between the data analysis results and the
actual electoral outcomes. However, it is impor-
tant to note that the landscape of social media has
dramatically changed in the last few years, and so
Twitter may be a more accurate measure of public
opinion today than it was a few years ago.
3 RESEARCH QUESTION
Using social media for political discourse, espe-
cially during political elections, has become com-
mon practice. Predicting election outcomes from
social media data can be feasible, and as discussed
previously, positive results have often been re-
ported. In this paper, we will test the predictive
power of the social media platform Twitter in the
context of the 2016 U.S. Primary elections. We
will use Twitter data to develop a picture of public
opinion about the political candidates online, and
analyze our results against the results of the pri-
maries that have already happened. We will then
create a predictive model using the Twitter data
analysis results, and test those models using the
primary results for those states where candidate
elections have already taken place. We propose
that while Twitter is a good platform for analyz-
ing public opinion, it cannot immediately replace
other measures for gathering public opinion, such
as polling data.
4 DATA
Over 13 million tweets were gathered on Twitter
from February 2016 to April 2016. The entire
dataset, as well as random samples of tweets from
the dataset were used to analyze online sentiments
towards Hillary Clinton, Bernie Sanders, Donald
Trump, and Ted Cruz. We also looked specifi-
cally at dates when at least one primary election
was held. These dates were February 9th, Febru-
ary 20th, February 23, February 27th, March 1st,
March 5th, March 6th, March 8th, March 10th,
March 12th, March 15th, March 22nd, March
26th, April 5th, April 9th, April 19th, April 26th,
and May 3rd. We did not have data for February
1st, and so this was the only primary election date
that was left out from our analysis. Below is a
list of the specific elections that happened on each
date, as well as the states where the elections took
place.
Tuesday, February 9:
New Hampshire
Saturday, February 20:
Nevada Democratic caucuses
South Carolina Republican primary
Tuesday, February 23:
Nevada Republican caucuses
Saturday, February 27:
South Carolina Democratic primary
Tuesday, March 1:
Alabama
Alaska Republican caucuses
American Samoa Democratic caucuses
Arkansas
Colorado caucuses (both parties, no preference
vote for Republicans)
Democrats Abroad party-run primary
Georgia
Massachusetts
Minnesota caucuses (both parties)
North Dakota Republican caucuses (completed by
March 1)
Oklahoma
Tennessee
Texas
Vermont
Virginia
Wyoming Republican caucuses
Saturday, March 5:
Kansas caucuses (both parties)
Kentucky Republican caucuses
Louisiana
Maine Republican caucuses
Nebraska Democratic caucuses
Sunday, March 6:
Maine Democratic caucuses
Puerto Rico (Republicans only)
Tuesday, March 8:
Hawaii Republican caucuses
Idaho (Republicans only)
Michigan
Mississippi
Thursday, March 10:
Virgin Islands Republican caucuses
Saturday, March 12:
Guam Republican convention
Northern Mariana Islands Democratic caucuses
Washington, DC Republican convention
Tuesday, March 15:
Florida
Illinois
Missouri
North Carolina
Northern Mariana Islands Republican caucuses
Ohio
Tuesday, March 22:
American Samoa Republican convention
Arizona
Idaho Democratic caucuses
Utah caucuses (both parties)
Saturday, March 26:
Alaska Democratic caucuses
Hawaii Democratic caucuses
Washington Democratic caucuses
Friday-Sunday, April 1-3:
North Dakota Republican state convention
Tuesday, April 5:
Wisconsin
Saturday, April 9:
Colorado Republican state convention
Wyoming Democratic caucuses
5 METHODS
By capturing tweets mentioning each presiden-
tial candidate and analyzing the sentiments be-
hind those tweets, we could track people's opinions
about each candidate and thus predict the final pri-
mary election results. A function was constructed
in R to automatically collect tweets from each day
for the months of February, March, and April. The
tweets, along with information such as the user's
Twitter handle, the user's location, the text of the
tweet, the user's profile description, and whether the
tweet was retweeted, were
encoded into JSON (JavaScript Object Notation)
files. The rjson package in the R software was
used to parse the JSON files. We extracted all
tweets related to at least one of the four political
candidates (Clinton, Sanders, Trump, and Cruz),
and combined all extracted tweets into a .csv file
for further analysis.
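For concreteness, a minimal sketch of this parsing step in R is shown below. It is not the exact code used in our pipeline: the tweets_json directory, the candidate_tweets.csv output, and the JSON field names (screen_name, text, location, created_at) are illustrative assumptions.

library(rjson)

# Return NA when a field is missing from a tweet record.
field <- function(t, name) if (is.null(t[[name]])) NA else t[[name]]

# Parse one daily JSON file (assumed to hold a list of tweet records).
parse_day <- function(json_path) {
  tweets <- fromJSON(file = json_path)
  do.call(rbind, lapply(tweets, function(t) {
    data.frame(screen_name = field(t, "screen_name"),
               text        = field(t, "text"),
               location    = field(t, "location"),
               created_at  = field(t, "created_at"),
               stringsAsFactors = FALSE)
  }))
}

# Combine all daily files and keep only tweets mentioning a candidate.
candidate_pattern <- "clinton|hillary|sanders|bernie|trump|cruz"
files  <- list.files("tweets_json", pattern = "\\.json$", full.names = TRUE)
all_df <- do.call(rbind, lapply(files, parse_day))
all_df <- all_df[grepl(candidate_pattern, tolower(all_df$text)), ]
write.csv(all_df, "candidate_tweets.csv", row.names = FALSE)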
All of the Twitter data was analyzed using the
R software, D3.js and QGIS in order to determine
whether or not certain dimensions of Twitter ac-
tivity related to presidential election correlate with
primary election results. Specifically, the research
methods implemented aimed to address whether
or not more positive sentiment toward a partic-
ular candidate on Twitter significantly increases that
candidate's probability of winning a primary elec-
tion. The primary focus of the analysis was text
mining for sentiments, geospatial analysis using
GIS to look at specific states, and network analy-
sis to evaluate the network elements of the tweets
and look at useful network parameters of the men-
tion network of all four candidates. Additionally,
we used our selected parameters, as well as gen-
eral polling results obtained from FiveThirtyEight,
a website that focuses on opinion poll analysis and
politics, and built several prediction models to test
if Twitter is a good indicator of offline public opin-
ion and political election outcomes.
6 VISUALIZATIONS
We constructed three types of visualizations to test
our hypotheses. We created several static visual-
izations using the R software package to get an
overall look at all of the tweets in relation to each
of the four candidates. We then created interac-
tive visualizations using R Shiny and D3.js to look
more closely at changes in public opinion and to
evaluate Twitter trends during primary election
days.
6.1 Static Visualizations
We first looked at the entire dataset, which con-
sisted of a total of 13,289,699 tweets for
the three months of February, March, and April.
These tweets were parsed and divided into four
categories representing each of the four politi-
cal candidates. So, for example, all tweets that
mentioned Clinton were merged into a single ob-
ject data frame. This was also done for Sanders,
Trump, and Cruz.
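The per-candidate split can be sketched in R as follows; the candidate_tweets.csv file and the matching patterns are illustrative assumptions rather than our exact code.

# Split the combined corpus into one data frame per candidate by simple
# keyword matching on the tweet text.
tweets   <- read.csv("candidate_tweets.csv", stringsAsFactors = FALSE)
patterns <- c(Clinton = "clinton|hillary", Sanders = "sanders|bernie",
              Trump   = "trump",           Cruz    = "cruz")
by_candidate <- lapply(patterns, function(p)
  tweets[grepl(p, tolower(tweets$text)), ])
sapply(by_candidate, nrow)  # tweet counts behind Figures 1 and 2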
Preliminary data-analysis was conducted on the
over 13 million tweets that were collected in or-
der to reveal high-level trends that would be rel-
evant and provide context for further sentiment
analysis. As the pie chart below illustrates, More
than fifty percent of all of the tweets in the entire
data set mentioned Trump. He is undoubtedly the
most discussed candidate on Twitter. Furthermore,
Sanders was the second most discussed candidate
on Twitter, while Cruz and Clinton were both dis-
cussed the least.
Figure 1: Proportion of Total Tweets Mentioning
Each Candidate
The next visualization (depicted below) is also
in relation to tweet volume by each candidate and
by party for all of the tweets in the final data set.
The outer donut illustrates the proportion of tweets
belonging to each party. It is clear that the Repub-
lican candidates had far more tweets (68 percent
of all tweets) than the Democratic candidates (32
percent of all tweets). This is largely because of
Trump, who was mentioned in more than 50 per-
cent of all tweets in the data set. The inner donut
shows tweets proportions for each of the four can-
didates. As can be seen, Trump-related tweets
make up the vast majority of the final data set
with 54 percent of all tweets mentioning Trump.
Sanders was the next most popular on Twitter with
19 percent of tweets mentioning him, while Cruz
had 14 percent of tweets mentioning him. Clinton
is the least popular candidate on Twitter with 12
percent of all tweets mentioning her.
Trump, of course, has won the most primary
elections by a large margin in comparison to the
other Republican candidates, a pattern the Twitter
data mirrors here. If we were to go by tweet vol-
umes alone to predict the Presidential Elections, it
would seem to support the claim that Trump will
win by a landslide. Likewise, the fact that Clin-
ton is less popular on Twitter compared to Sanders
would seem to indicate that Sanders will win the
Democratic primaries if we only look at tweet vol-
ume. However, looking at the primary elections
that have happened thus far, Clinton has won more
states than Sanders. Hence, this may indicate that
tweet volume is not entirely accurate for predict-
ing real-world election results. Because tweet
volume alone cannot predict a candidate's
popularity in the general election, we expanded
the scope of measures to examine.
Figure 2: Tweet Volume by Party and by Candi-
date
We next extracted all tweets that were geo-
tagged. This considerably reduced the number of
tweets, as it is estimated that only between 5 and 20
percent of all tweets are geo-tagged with a loca-
tion. However, we wanted to look at the origins of
our tweets, and we assume that the sub-sample of
geo-tagged tweets is strongly representative of the
entire data set of tweets.
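A sketch of extracting and plotting the geo-tagged subset is shown below. The longitude and latitude column names are hypothetical, and the maps package is used here only for a quick rendering; the maps in this paper were produced with QGIS.

library(maps)

all_df <- read.csv("candidate_tweets.csv", stringsAsFactors = FALSE)

# Keep only tweets that carry coordinates (hypothetical column names).
geo_df <- subset(all_df, !is.na(longitude) & !is.na(latitude))
nrow(geo_df) / nrow(all_df)  # share of the corpus that is geo-tagged

# One point per geo-tagged tweet on a world map (cf. Figure 3).
map("world", fill = TRUE, col = "grey90")
points(geo_df$longitude, geo_df$latitude, pch = 16, cex = 0.2, col = "gold")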
Figure 3 below is a world map showing the
origins of all tweets that were geo-tagged in the
complete data set of tweets. The yellow dots in-
dicate where tweets originated from on the map.
Unsurprisingly, the vast majority of tweets origi-
nate from inside the United States. The primaries
are for the U.S. Presidency so it is expected that
the four candidates would be most talked about
within America. However, it was interesting to see
that the highest concentration of tweets was on the
East Coast of the U.S., while the West Coast also
showed a high concentration of tweet origins. Middle
America produced comparatively few tweets. If
these geo-tagged tweets are reflective of the total
sample of tweets used, then there may be a bias in-
troduced in the data set with a greater proportion
of tweets from the East Coast, and very few tweets
originating from Middle America.
Outside of America, northern Europe, particu-
larly the U.K., was also heavily concentrated with
tweets pertaining to the candidates. We do not
know why the U.K. in general had such a high
proportion of tweets. It may be because British
people are highly interested in American politics,
because British Twitter users geo-tag differently
than users in other countries, or because many
Americans travel to Northern Europe
and remain engaged with politics on Twitter dur-
ing their travels. The four candidates were dis-
cussed in other regions of the world as well, but
with much less concentration. Europe showed
more interest in American politics than any other
region (excluding the United States).
Figure 3: World Origins of Tweets
We next looked at the tweet frequency by state
for each candidate within the United States. We
first look at Hillary Clinton's map, which is de-
picted below. The states with the most number of
tweets mentioning Clinton were California, Texas,
Florida, Illinois, and of course, New York. Clinton
has won all of the primary elections in these states
with the exception of California, which has not yet
taken place. States like North and South Dakota,
Montana, Wyoming, and Nebraska had an almost
nonexistent frequency of tweets mentioning Clinton.
However, it is interesting that Clinton seems to be
more popular in Utah on Twitter in comparison to
the other three candidates, even though she lost to
Sanders in the Utah primary elections. This in-
dicates that tweet volumes on Twitter may not be
entirely accurate in predicting election results.
Figure 4: Clinton’s Tweet Frequencies Map
We next looked at Bernie Sanders's map of tweet
frequencies by state (depicted below). It is in-
teresting to see in which states he is more popular
compared to Clinton. Surprisingly, discussions
about Sanders are very popular on Twitter in Ohio,
even though he lost in the Ohio state primaries to
Clinton by a substantial margin.
Figure 5: Sanders’ Tweet Frequencies Map
We next looked at Ted Cruz's map of tweet fre-
quencies by state (depicted below). Compared to
Clinton and Sanders, Ted Cruz is more popular in
the western states, with Nevada, Arizona,
and Oregon showing more interest in him on Twit-
ter. He is also mentioned more in states like Mon-
tana and Nebraska, where Clinton and Sanders had
almost non-existent mentions.
Lastly, we looked at Donald Trump's map of
tweet frequencies by state (depicted below). In-
terestingly, he is not as popular as Cruz in Mon-
tana, Wyoming and Nebraska, where he is rarely
mentioned on Twitter.
Figure 6: Cruz’s Tweet Frequencies Map
Figure 7: Trump’s Tweet Frequencies Map
After looking at all of the maps for the four
candidates' tweet frequencies, it does seem that
tweet frequencies are not always a good indica-
tor of election results when it comes to using
only tweet volume per candidate. As mentioned
above, there were some instances (such as Clin-
ton having a very high volume of tweets in Utah
and Sanders having a very high volume of tweets in Ohio)
where Twitter did not correlate with the real-world
outcome of the elections (based on the assump-
tion that a higher tweet volume should correlate
to winning the majority of votes). However, over-
all, tweet volume for each candidate in each state
correlated with the election results more often
than not. Tweet volume per candidate will
be a predictor variable incorporated into the pre-
diction model that will be introduced later in this
paper.
It seems that tweet volume performs inconsis-
tently as a predictor of election results. However,
we can use an algorithm to evaluate and catego-
rize the feelings expressed in text; this is called
sentiment analysis. Hence, we next looked at tex-
tual sentiment analysis of the tweets to gain better
insight into public opinion on Twitter regarding the
2016 Primary Elections.
In order to extract sentiments for each of the
tweets, the Syuzhet R package was utilized, which
comes with four sentiment dictionaries and pro-
vides a method for accessing the robust, but
computationally expensive, sentiment extraction
tool developed in the NLP group at Stanford.
The developers of this algorithm built a dictio-
nary/lexicon containing thousands of words with asso-
ciated scores for eight different emotions and two
sentiments (positive/negative). Each individual
word in the lexicon will have a yes (one) or no
(zero) for the emotions and sentiments, and we
can calculate the total sentiment of a sentence by
adding up the individual sentiments for each word
in the sentence. It is important to note that senti-
ment analysis of tweets comes with its fair share
of problems. For example, sentiment analysis al-
gorithms are built in such a way that they are
more sensitive to expressions typical of men than
women. Furthermore, it can be argued that com-
puters are not optimal at identifying emotions cor-
rectly in all cases; they are likely not great at
identifying something like sarcasm. Most of these
concerns will not have a large effect on our analysis
here because we are looking at text in aggregate. Additionally,
when using as large a dataset as the one for this
study, it is likely that many more tweets will be
correctly identified by sentiment, and the effects
of identifying sentiments incorrectly will be nor-
malized. The entire data set was used to derive
sentiment scores for all four candidates, and the
bar graphs depicting aggregates of the results are
shown below.
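A sketch of the Syuzhet scoring step is shown below; the candidate column used for aggregation is an assumed label added during preprocessing, not a field produced by the package.

library(syuzhet)

tweets <- read.csv("candidate_tweets.csv", stringsAsFactors = FALSE)

# Overall sentiment score per tweet (higher = more positive tone).
tweets$sentiment <- get_sentiment(tweets$text, method = "syuzhet")

# NRC counts per tweet: eight emotions plus positive/negative columns.
nrc    <- get_nrc_sentiment(tweets$text)
tweets <- cbind(tweets, nrc)

# Average scores per candidate, as summarized in the bar graphs.
aggregate(cbind(sentiment, positive, negative) ~ candidate,
          data = tweets, FUN = mean)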
The more positive the sentiment score is, the
more positive the overall sentiment is of the tweets
that are associated with each of the candidates.
Hence, Sanders has the highest average senti-
ment score compared to all other candidates while
Trump has the second highest average sentiment
score over all tweets. Both Clinton and Cruz have
lower average sentiment scores over all the tweets.
When we look at the average very positive sen-
timent scores for each of the candidates over all
of the tweets, Trump has on average, more pos-
itive sentiment scores than the other candidates
while Sanders comes close in second place. How-
ever, it is important to note that Trump has a very
large proportion of tweets compared to Sanders,
and this may be skewing the average very positive
sentiment scores.
Figure 8: Average Sentiment Score over all Tweets for Each Candidate
It may be interesting to equal-
ize the data set to contain fewer tweets mention-
ing Trump, and to see how this affects the average
very positive sentiment scores. It is interesting that
Clinton has the lowest average positive sentiment
scores over all tweets mentioning her. Lastly, we
look at the average very negative sentiment scores
bar graph, and the results correspond to the other
two graphs. Cruz has the highest average negative
scores over all tweets relating to him, while Clin-
ton comes in second place. Sanders, on the other
hand, has the lowest average negative sentiment
scores over the all tweets mentioning him. Hence,
if we were to go by these sentiment scores to pre-
dict election outcomes, it would seem that Sanders
would win the Democratic primaries while Trump
would win the Republican primaries.
6.2 R Shiny Visualization: Word Frequency
(Wordcloud)
An interactive visualization app using the R Shiny
platform was produced to analyze the text of the
tweets. Preliminary data-analysis was conducted
on the tweets that were collected in order to reveal
trends that would be relevant to further senti-
ment analysis. An R Shiny application was de-
veloped to generate a different wordcloud visual-
ization for each date that data was collected.
Figure 9: Average Very Positive Sentiment Score over all Tweets for Each Candidate
The wordcloud visualizations represent the words that
were most prevalent in tweets related to a par-
ticular candidate. Each day had slightly differ-
ent words that dominated a candidate’s network,
and on some days in particular there was a strong
theme or increased polarization.
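A minimal sketch of such a Shiny app is given below. The date and candidate columns and the simple tokenization are assumptions made for illustration; the actual application preprocesses the text in more detail.

library(shiny)
library(wordcloud)

tweets <- read.csv("candidate_tweets.csv", stringsAsFactors = FALSE)

ui <- fluidPage(
  selectInput("date", "Collection date", choices = sort(unique(tweets$date))),
  selectInput("cand", "Candidate",
              choices = c("Clinton", "Sanders", "Trump", "Cruz")),
  plotOutput("cloud")
)

server <- function(input, output) {
  output$cloud <- renderPlot({
    txt   <- tweets$text[tweets$date == input$date &
                           tweets$candidate == input$cand]
    words <- unlist(strsplit(tolower(txt), "[^a-z#@']+"))
    freq  <- sort(table(words), decreasing = TRUE)
    # Most frequent terms in that day's tweets about the chosen candidate.
    wordcloud(names(freq), as.numeric(freq), max.words = 100,
              scale = c(4, 0.5), random.order = FALSE)
  })
}

shinyApp(ui, server)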
For example, on February 27th tweets related
to Donald Trump mainly contained ’#nevertrump’
(Figure 12). Influencers on Twitter such as Marco
Rubio, Glenn Beck and Amanda Carpenter all
published tweets that contained the hashtag as a
strategic move against Donald Trump prior to Su-
per Tuesday on March 1st, which led Trump to
have a 0.09 sentiment score (Figure 11) (Figure
13).
Backlash and outrage to Hillary Clinton com-
mending Nancy Reagan’s involvement in the
H.I.V./AIDS conversation following Reagan’s
death dominated tweets about Clinton on March
12th (Figure 14). "The problem with Mrs.
Clinton's compliment: It was the Reagans who
wanted nothing to do with the disease at the time"
(Source: http://www.nytimes.com/politics/first-
draft/2016/03/11/hillary-clinton-lauds-reagans-
on-aids-a-backlash-erupts/). It was confirmed by
sentiment analysis that tweets regarding Clinton
were overall negative, as she had a sentiment
score of only -0.01 on March 12th, while Sanders
and Cruz had higher sentiment scores (0.22 and
0.31, respectively) (Figure 15). Trump also had a
relatively low sentiment score of -0.01 on March
12th, which was the same day that protesters
disrupted a Trump rally in Chicago and forced the
event to be canceled (Figure 15) (Figure 16).
Figure 10: Average Very Negative Sentiment Score over all Tweets for Each Candidate
Figure 11: Marco Rubio Tweets #NeverTrump on February 27th
On March 26th, a scandal broke out involving
Ted Cruz: the group Anonymous alleged that Ted
Cruz was involved in a sex scandal. Most tweets
that mentioned Ted Cruz on March 26th involved
the scandal (Figure 17). Although he generally
had the highest sentiment score of all the can-
didates, on March 22nd and March 26th he had the
lowest sentiment scores of all the candidates (0.02
and 0.01, respectively) (Figure 18).
In general, the words that appeared most fre-
quently (as illustrated in the wordcloud) were pre-
dictive of a candidate’s sentiment score, and this
consistency further reinforced the appropriateness
and validity of the Syuzhet R package that was
used for the sentiment calculations. The sentiment
score provides a concrete quantitative
measure of how a network feels towards a partic-
ular candidate, whereas the wordclouds represent
the qualitative feelings of individuals and provide
further context for the sentiment scores.
Figure 12: February 27th, 2016 Wordcloud for Donald Trump
Figure 13: February 27th, 2016 Sentiment
6.3 D3.js Visualizations
We next created several visualizations using the
D3.js platform. D3.js (D3 for Data-Driven Doc-
uments) is a JavaScript library for producing
dynamic, interactive data visualizations in web
browsers. It makes use of the widely imple-
mented SVG, HTML5, and CSS standards. All
of the visualizations produced using D3.js are
available at: http://aboutmonica.com/
final%20D3/.
A D3 force layout visualiza-
tion of the mention network for all four candidates
was generated. The network was constructed only
from tweets on the days of primary elections. This
is a very large network with over 30,000 edges,
and hence when the D3 visualization is produced,
the resulting layout graph is very large and takes a
while to load.
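The edge list behind this force layout can be exported from R into the {nodes, links} JSON shape that D3 force layouts commonly consume. The sketch below uses jsonlite and assumes the parsed tweet data frame from earlier, with mentions pulled out of the tweet text by a simple regular expression.

library(jsonlite)

tweets <- read.csv("candidate_tweets.csv", stringsAsFactors = FALSE)

# One directed edge per (author -> mentioned user) pair in the tweet text.
edges <- do.call(rbind, lapply(seq_len(nrow(tweets)), function(i) {
  mentions <- regmatches(tweets$text[i],
                         gregexpr("@\\w+", tweets$text[i]))[[1]]
  if (length(mentions) == 0) return(NULL)
  data.frame(source = tweets$screen_name[i],
             target = sub("^@", "", mentions),
             stringsAsFactors = FALSE)
}))

# D3 force layouts typically index links into the node array (0-indexed).
nodes <- data.frame(id = unique(c(edges$source, edges$target)),
                    stringsAsFactors = FALSE)
links <- data.frame(source = match(edges$source, nodes$id) - 1,
                    target = match(edges$target, nodes$id) - 1)

write_json(list(nodes = nodes, links = links), "mention_network.json")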
Figure 14: March, 12th, 2016 Wordcloud for
Hillary Clinton
Figure 15: March, 12th, 2016 Sentiment
In the force layout visualization at the provided
link above, you can see a vast social mention net-
work of tweets. Since this social network has di-
rected edges, we can look at the direction of tweet
mentions: where there are many nodes connected
to one central node, the arrows all point to
the central node. This means that there are many
Twitter users tweeting and mentioning the central
node. From the network graph you can see that
some Twitter users (represented by the nodes in
the graph) have very large networks and are very
densely connected by edges to other Twitter users.
Edges between two Twitter users signify that one of
the users mentioned or re-tweeted the other user,
so those areas that are very dense and dark in the
graph are likely people who were mentioned or
re-tweeted many times. On the other hand, to-
wards the outskirts of the graph, there are a few
nodes that are connected to each other by a few
edges. These Twitter users are connected by edges
because they mentioned or re-tweeted each other
during the time that the data was collected. How-
ever, they are separate from other clusters in the
graph by not being connected to other nodes by
edges. This graph clearly depicts which Twitter
users have larger networks (more dense clusters
around the nodes). Lastly, we also have nodes
that are connected by only one or two ties (or, in
some cases, none), indicating that they are not
being mentioned by other users in the network
and they are also not mentioning other users in
their tweets.
Figure 16: March, 12th, 2016 Wordcloud for Donald Trump
Figure 17: March, 26th, 2016 Wordcloud for Ted Cruz
We next looked at each of the four candidates'
networks separately, and using Gephi, we derived
network parameter values in order to better assess
what is going on in this network in relation to each
of the four candidates. Figure 19 below depicts the
results of our analysis. There are some interesting
results to point out. Cruz's average clustering co-
efficient is 0, while Trump's is almost zero
at 0.001. Hence, it seems that Cruz's tweet men-
tion network is very small with there being very
little clustering of users, and most users not be-
ing interconnected. In general, all of the candi-
dates have very small clustering coefficients with
Sanders having the highest value at 0.005. This
may be due to the fact that the social network an-
alyzed is a network of mentioned tweets, and it is
unlikely that the candidates would reply to many
of the tweets that mention them. Additionally,
these tweets were collected in real-time, so a can-
didate may have responded to any of the tweets
in the network at a later time that was not cap-
tured in our dataset.
Figure 18: March, 26th, 2016 Sentiment
Furthermore, Sanders' net-
work has the highest average degree at 2.109, while
Clinton follows closely behind at 2.013. This im-
plies that, on average, a node in Sanders' network
has 2.109 edges connected to it, meaning that
users are more likely to interact in Sanders' net-
work in comparison to the networks of the other can-
didates. Sanders also has
the largest network diameter at 6, which indicates
that it is likely that he reaches a greater audience
than the other candidates. Lastly, it is interest-
ing to note that both Republican candidates have
lower average path lengths when compared to the
Democratic candidates, meaning that nodes can be
reached in fewer steps in the networks for the Re-
publican candidates.
Figure 19: Network Parameters for Each Candi-
date
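The parameters in Figure 19 were derived in Gephi; the same quantities can be reproduced in R with igraph, as in the sketch below, which reuses the mention edge list (edges) built in the export sketch above.

library(igraph)

# Directed mention graph from the source -> target edge list.
g <- graph_from_data_frame(edges, directed = TRUE)

data.frame(
  avg_degree      = mean(degree(g, mode = "all")),
  avg_clustering  = transitivity(g, type = "average"),
  net_diameter    = diameter(g, directed = FALSE),
  avg_path_length = mean_distance(g, directed = FALSE)
)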
7 Prediction Model
After having explored and analyzed the Twitter
data, we next focused on building a prediction
model. We created a panel dataset for all four
candidates and looked at the primary election days
as well as randomly chosen days from our Twitter
data set. In the end, we had a total of 70 days of
Twitter data in the panel data set. We chose
to specifically look at the states of New York, In-
diana, and Nebraska. The next primaries will take
place in Nebraska on May 10th, and so we would
like to evaluate our model's prediction results against
the actual outcome of the Nebraska primaries. We
coded all of our independent and dependent variables
for these three states. New York was used as a
training dataset to train the prediction model on.
Trump and Clinton won this state. The testing
dataset covered the states of Indiana
and Pennsylvania, and the models fitted on the
training dataset were used to calculate predicted
values for who would win in Indiana and Pennsyl-
vania. It is important to note that for this analysis,
it was necessary to look at only those tweets that
were geo-tagged and belonged to one of the three
states used in the panel data. This undoubtedly
decreased the total size of the tweets available to
analyze, as most of the tweets collected were not
geotagged at all. However, we were still able to
obtain thousands of tweets for most of the days
for each candidate.
The dependent variable used in the analysis was
called electionresults, and it was equal
to 0 if the candidate did not win the primary elec-
tion (or the FiveThirtyEight poll average
for that day) and equal to 1 otherwise. The re-
searchers at FiveThirtyEight have collected and
continue to collect national polls for the Repub-
lican and Democratic primaries, and they generate
a polling average from all polls collected for each
candidate. For the Democratic primary, a total of
671 polls have been collected thus far, and a total
of 681 polls have been collected for the Republi-
can primary. This polling average is adjusted for
pollster quality, sample size, and recency, and as
a result, it is a good indicator of public opinion
regarding the primaries and the candidates. Fur-
thermore, FiveThirtyEight offers daily polling av-
erages from as early as July 10, 2015 up to the cur-
rent day. Hence, it was fairly simple for us to
collect the daily polling average to see which can-
didate won the polls for each day in our dataset.
For the independent variables, we used tweet
volume for each candidate in each state, the av-
erage sentiment score (which was calculated from
each candidate's tweet corpus for each day for
each state), and lastly we used the network pa-
rameters that we described above. These param-
eters (average degree, average clustering coeffi-
cient, network diameter, and average path length)
were not derived for each day and each specific
state, but were derived from the entire Twitter data
set, and were thus constant over all days.
In addition to the independent variables, we
also added control variables to the panel dataset.
The control variables used were the population
for each state and average income for each state.
Lastly, we used a lagged dependent variable as an
independent variable because in time series anal-
ysis, it is expected that the poll results from the
previous day would be predictors of the poll re-
sults for the current day, and we needed to account
for this correlation.
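A sketch of adding the lagged dependent variable is shown below; the panel data frame and its column names follow the variables just described and are otherwise assumed.

# Order the panel by candidate, state, and day, then lag electionresults
# within each candidate-state series (the first day in each series is NA).
panel <- panel[order(panel$candidate, panel$state, panel$date), ]
panel$electionresults_lag <- ave(panel$electionresults,
                                 panel$candidate, panel$state,
                                 FUN = function(x) c(NA, head(x, -1)))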
8 Findings
In order to train and test our data set, we used three
different statistical methods: Logistic Regression,
Random Forests, and Support Vector Machines.
We wanted to see if one of these three models
performed better than the others. The regression
equation for the logistic regression model is shown
in the figure below.
Figure 20: Logistic Regression Equation
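Figure 20 reproduces the fitted equation as an image; a plausible form of the model, written with the variables described above, is the following sketch (the exact specification in the figure may differ), where i indexes candidates, s states, and t days, and X_{is} collects the network parameters and state-level controls:

\Pr(\text{electionresults}_{ist} = 1) =
  \frac{1}{1 + \exp\!\left[-\left(\beta_0
    + \beta_1\,\text{volume}_{ist}
    + \beta_2\,\text{sentiment}_{ist}
    + \beta_3\,\text{electionresults}_{is,t-1}
    + \gamma^{\top} X_{is}\right)\right]}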
All three models performed well in terms of
prediction results, which was surprising to us. As
we mentioned earlier, it seems that Sanders is very
popular on Twitter compared to Clinton, and so we
expected that this would skew the results, but it
does not look like it did. After training the models
on the New York dataset, we tested the trained
models on the Indiana and Pennsylvania datasets,
and for both datasets, the models correctly pre-
dicted that Clinton and Trump would win Penn-
sylvania (which they did) and Sanders and Trump
would win Indiana (which they did). We used
ROC curves (Figure 21) in order to evaluate
the predictive accuracy of our models, and the
Random Forests and Support Vector Machine
models performed better than the Logistic
Regression model.
Figure 21: ROC curves for all Three Prediction Models
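A hedged sketch of the three classifiers and the ROC comparison is given below. The formula terms follow the variables described in Section 7, and the package choices (randomForest, e1071 for the SVM, pROC for the ROC curves) are ours; the paper does not name the implementations actually used.

library(randomForest)
library(e1071)
library(pROC)

form <- electionresults ~ tweet_volume + avg_sentiment + avg_degree +
  avg_clustering + net_diameter + avg_path_length + population +
  avg_income + electionresults_lag

train <- subset(panel, state == "New York" & !is.na(electionresults_lag))
test  <- subset(panel, state %in% c("Indiana", "Pennsylvania") &
                  !is.na(electionresults_lag))

# The classification models need a factor response.
train$win <- factor(train$electionresults)
test$win  <- factor(test$electionresults)
form_cls  <- update(form, win ~ .)

fit_glm <- glm(form, data = train, family = binomial)
fit_rf  <- randomForest(form_cls, data = train)
fit_svm <- svm(form_cls, data = train, probability = TRUE)

p_glm <- predict(fit_glm, test, type = "response")
p_rf  <- predict(fit_rf, test, type = "prob")[, "1"]
p_svm <- attr(predict(fit_svm, test, probability = TRUE),
              "probabilities")[, "1"]

# ROC curves for the three models (cf. Figure 21).
plot(roc(test$electionresults, p_glm), col = "black")
lines(roc(test$electionresults, p_rf), col = "red")
lines(roc(test$electionresults, p_svm), col = "blue")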
9 Conclusion
In this paper, we looked at Twitter data from the
months of February, March, and April in order to
predict election outcomes for the 2016 Presiden-
tial Primaries. We analyzed several variables to
explore the Twitter data, including network param-
eters, text sentiments of the tweets, and tweet vol-
ume for each of the four candidates. In order to
visualize our results, we built several static and
interactive visualizations. The prediction models
that we developed for our analysis performed very
well in predicting the election outcomes. However,
we only tested our models on two states, and we
would like to run further tests using other state
primaries in order to assess the predictive accuracy
of our models.
10 References
"Company — About." Company — About.
Twitter, 31 Mar. 2016. Web. 07 May 2016.
<https://about.twitter.com/company>.
Larsson, A. O., & Moe, H. (2012). Studying
political microblogging: Twitter users in the
2010 Swedish election campaign. New Media &
Society, 14(5), 729-747.
Wang, H., Can, D., Kazemzadeh, A., Bar, F., &
Narayanan, S. (2012, July). A system for real-time
twitter sentiment analysis of 2012 us presidential
election cycle. In Proceedings of the ACL 2012
System Demonstrations (pp. 115-120). Associa-
tion for Computational Linguistics.
11 URLs
All D3 graphics used in this project are available
for viewing online. The R Shiny application was
too large to upload but the source code is avail-
able to be viewed by clicking on the menu items at
http://aboutmonica.com/final%20D3/
Republican Sentiments:
http://aboutmonica.com/final%20D3/republican%
20sentiments/
Democrat Sentiments:
http://aboutmonica.com/final%20D3/democrat%
20sentiments/
Volume of Tweets per Candidate:
http://aboutmonica.com/final%20D3/candidate%
20tweet%20volume%20prop%20D3/

Furthermore, we hypothesize that Twitter (as well as other popular microblogging platforms) is also useful for predicting election results. In order to test our hypotheses, we use several different techniques to extract useful information from the tweets, including sentiment analysis, geospatial analysis, and network analysis.
  • 2. We use these methods to mine the collected tweets and examine general public opinion regarding the Democratic candidates Hillary Clinton and Bernie Sanders, as well as the Republican candidates Donald Trump and Ted Cruz. We next use the information extracted from the tweets to build several predictive models and test how well Twitter reflects general public opinion regarding the 2016 primaries. Our predictive models also incorporate polling data from several national polls conducted by different organizations and aggregated by FiveThirtyEight, a website that focuses on opinion poll analysis. Additionally, we incorporated the final results for those states where the primaries have already happened in order to test the accuracy of our models.
2 LITERATURE REVIEW
While there is some controversy regarding this topic, social media data can certainly be used for analyzing socio-political trends in the past, the present, and the future. Asur and Huberman (2012) used Twitter to predict real-world outcomes such as pre-release box office revenues for movies and trends in the housing market, and their work suggested that Twitter data can successfully predict consumer metrics. Furthermore, Varian and Choi (2009) used data from Google Trends to predict real-time events, indicating that Google Trends can be used to predict retail sales for motor vehicle and parts dealers. In another study, Ginsberg et al. (2010) used online search data to predict flu epidemics, while Mao and Zeng (2011) used Twitter sentiment analysis to predict stock market trends.
Social media has also been used to examine political trends. O'Connor et al. (2010) compared public opinion measured from polls with sentiment measured from text analysis of Twitter posts; their results showed a strong correlation (as high as 80 percent) between Twitter data and presidential polls. Furthermore, Tumasjan et al. (2010) studied the German federal election to investigate whether Twitter messages correctly mirror offline political sentiment, and found that tweet sentiment regarding the candidates' political stances strongly correlated with the offline political landscape. In 2012, Wang et al. created a system for real-time Twitter sentiment analysis of the presidential election, since the nature and popularity of Twitter allows researchers to analyze sentiment in real time instead of waiting for more traditional methods of data collection. The 2010 Swedish election was also tracked in real time by researchers using data gathered from Twitter (Larsson and Moe, 2012). While the role of Twitter in election outcomes is debatable, Twitter's users are decidedly not apolitical, and it is therefore intriguing to investigate whether there is a direct correlation between political outcomes and Twitter activity.
Yet some studies have concluded that Twitter and other social media are not strongly reflective of real-world outcomes. Gayo-Avello et al. (2012) analyzed the 2010 U.S. Congressional elections using Twitter data to test Twitter's predictive power, and were unable to find any correlation between their analysis results and the actual electoral outcomes.
However, it is important to note that the landscape of social media has changed dramatically in the last few years, so Twitter may be a more accurate measure of public opinion today than it was then.
3 RESEARCH QUESTION
Using social media for political discourse, especially during elections, has become common practice. Predicting election outcomes from social media data appears feasible and, as discussed previously, positive results have often been reported. In this paper, we test the predictive power of Twitter in the context of the 2016 U.S. primary elections. We use Twitter data to develop a picture of online public opinion about the candidates and compare our results against the outcomes of the primaries that have already happened. We then build predictive models using the Twitter analysis results and test them on the primary results for those states where elections have already taken place. We propose that while Twitter is a good platform for analyzing public opinion, it cannot immediately replace other measures of public opinion, such as polling data.
  • 3.
4 DATA
Over 13 million tweets were gathered from Twitter between February 2016 and April 2016. The entire dataset, as well as random samples drawn from it, was used to analyze online sentiment toward Hillary Clinton, Bernie Sanders, Donald Trump, and Ted Cruz. We also looked specifically at dates on which at least one primary election was held: February 9th, February 20th, February 23rd, February 27th, March 1st, March 5th, March 6th, March 9th, March 10th, March 12th, March 15th, March 22nd, March 26th, April 5th, April 9th, April 19th, April 26th, and May 3rd. We did not have data for February 1st, so this was the only primary election date left out of our analysis. Below is a list of the specific elections that happened on each date, as well as the states where they took place.
Tuesday, February 9: New Hampshire
Saturday, February 20: Nevada Democratic caucuses; South Carolina Republican primary
Tuesday, February 23: Nevada Republican caucuses
Saturday, February 27: South Carolina Democratic primary
Tuesday, March 1: Alabama; Alaska Republican caucuses; American Samoa Democratic caucuses; Arkansas; Colorado caucuses (both parties, no preference vote for Republicans); Democrats Abroad party-run primary; Georgia; Massachusetts; Minnesota caucuses (both parties); North Dakota Republican caucuses (completed by March 1); Oklahoma; Tennessee; Texas; Vermont; Virginia; Wyoming Republican caucuses
Saturday, March 5: Kansas caucuses (both parties); Kentucky Republican caucuses; Louisiana; Maine Republican caucuses; Nebraska Democratic caucuses
Sunday, March 6: Maine Democratic caucuses; Puerto Rico (Republicans only)
Tuesday, March 8: Hawaii Republican caucuses; Idaho (Republicans only); Michigan; Mississippi
Thursday, March 10: Virgin Islands Republican caucuses
Saturday, March 12: Guam Republican convention; Northern Mariana Islands Democratic caucuses; Washington, DC Republican convention
Tuesday, March 15: Florida; Illinois; Missouri; North Carolina; Northern Mariana Islands Republican caucuses; Ohio
Tuesday, March 22: American Samoa Republican convention; Arizona; Idaho Democratic caucuses; Utah caucuses (both parties)
Saturday, March 26: Alaska Democratic caucuses; Hawaii Democratic caucuses; Washington Democratic caucuses
Friday-Sunday, April 1-3: North Dakota Republican state convention
Tuesday, April 5: Wisconsin
  • 4.
Saturday, April 9: Colorado Republican state convention; Wyoming Democratic caucuses
5 METHODS
By capturing tweets mentioning each presidential candidate and analyzing the sentiments behind those tweets, we can track people's opinions about each candidate and thus predict the final primary election results. A function was constructed in R to automatically collect tweets for each day of February, March, and April. The tweets, along with the user's Twitter handle, the user's location, the text of the tweet, the user's profile description, whether the tweet was retweeted, and other information, were encoded into JSON (JavaScript Object Notation) files. The rjson package in R was used to parse the JSON files. We extracted all tweets related to at least one of the four candidates (Clinton, Sanders, Trump, and Cruz) and combined the extracted tweets into a .csv file for further analysis.
All of the Twitter data was analyzed using R, D3.js, and QGIS in order to determine whether certain dimensions of Twitter activity related to the presidential election correlate with primary election results. Specifically, the methods aimed to address whether more positive sentiment toward a particular candidate on Twitter significantly increases that candidate's probability of winning a primary election. The primary focus of the analysis was text mining for sentiment, geospatial analysis using GIS to look at specific states, and network analysis to evaluate the mention network of all four candidates and its parameters. Additionally, we combined our selected parameters with general polling results obtained from FiveThirtyEight, a website that focuses on opinion poll analysis and politics, and built several prediction models to test whether Twitter is a good indicator of offline public opinion and election outcomes.
6 VISUALIZATIONS
We constructed three types of visualizations to test our hypotheses. We created several static visualizations in R to get an overall look at all of the tweets relating to each of the four candidates. We then created interactive visualizations using R Shiny and D3.js to look more closely at changes in public opinion on primary days and to evaluate Twitter trends during primary election days.
6.1 Static Visualizations
We first looked at the entire dataset, which consisted of 13,289,699 tweets across February, March, and April. These tweets were parsed and divided into four categories, one per candidate; for example, all tweets that mentioned Clinton were merged into a single data frame, and the same was done for Sanders, Trump, and Cruz. Preliminary analysis was conducted on the more than 13 million collected tweets in order to reveal high-level trends that would be relevant to, and provide context for, the subsequent sentiment analysis. As the pie chart below illustrates, more than fifty percent of all tweets in the data set mentioned Trump; he is by far the most discussed candidate on Twitter. Sanders was the second most discussed candidate, while Cruz and Clinton were discussed the least.
Figure 1: Proportion of Total Tweets Mentioning Each Candidate
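As a rough illustration of the collection and filtering pipeline described in the Methods section, the sketch below parses per-day JSON files with rjson, flags candidate mentions with keyword matching, and computes the per-candidate shares summarized in Figure 1. The file layout, field names, and keyword patterns are assumptions for illustration, not the authors' original scripts.

    library(rjson)

    # One JSON file per day of collected tweets (hypothetical layout).
    json_files <- list.files("tweets", pattern = "\\.json$", full.names = TRUE)

    parse_file <- function(path) {
      records <- fromJSON(file = path)  # a list with one element per tweet
      data.frame(
        text     = sapply(records, function(r) r$text),
        user     = sapply(records, function(r) r$user),
        location = sapply(records, function(r) r$location),
        date     = sapply(records, function(r) r$date),
        stringsAsFactors = FALSE
      )
    }

    tweets <- do.call(rbind, lapply(json_files, parse_file))

    # Flag mentions of each candidate with a case-insensitive keyword match.
    patterns <- c(clinton = "clinton|hillary",
                  sanders = "sanders|bernie",
                  trump   = "trump",
                  cruz    = "cruz")
    for (cand in names(patterns)) {
      tweets[[cand]] <- grepl(patterns[[cand]], tweets$text, ignore.case = TRUE)
    }

    # Keep only tweets mentioning at least one candidate; save for later steps.
    tweets <- tweets[rowSums(tweets[, names(patterns)]) > 0, ]
    write.csv(tweets, "candidate_tweets.csv", row.names = FALSE)

    # Per-candidate share of all candidate mentions (the quantity behind Figure 1).
    round(100 * colSums(tweets[, names(patterns)]) / sum(tweets[, names(patterns)]), 1)

Note that shares computed this way are shares of mentions rather than of tweets, so a tweet naming two candidates counts toward both; the published proportions may have been computed slightly differently.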
The next visualization (Figure 2) also shows tweet volume, by candidate and by party, for all of the tweets in the final data set. The outer donut illustrates the proportion of tweets belonging to each party.
  • 5. It is clear that the Republican candidates had far more tweets (68 percent of all tweets) than the Democratic candidates (32 percent). This is largely because of Trump, who was mentioned in more than 50 percent of all tweets in the data set. The inner donut shows tweet proportions for each of the four candidates: Trump-related tweets make up the vast majority of the final data set, with 54 percent of all tweets mentioning Trump. Sanders was the next most discussed on Twitter, mentioned in 19 percent of tweets, while Cruz was mentioned in 14 percent. Clinton was mentioned the least, in 12 percent of all tweets.
Trump, of course, has won the most primary elections by a large margin compared with the other Republican candidates, which the Twitter volumes echo here. If we went by tweet volume alone, the data would seem to support the claim that Trump will win by a landslide. Likewise, the fact that Clinton is less discussed on Twitter than Sanders would seem to indicate that Sanders will win the Democratic primaries if we only look at tweet volume. However, looking at the primary elections held thus far, Clinton has won more states than Sanders, which suggests that tweet volume alone is not an entirely accurate predictor of real election results. Because tweet volume alone cannot predict a candidate's popularity, we expanded the set of measures we examined.
Figure 2: Tweet Volume by Party and by Candidate
We next extracted all tweets that were geo-tagged. This considerably reduced the number of tweets, as it is estimated that only between 5 and 20 percent of all tweets are geo-tagged with a location. However, we wanted to look at the origins of our tweets, and we assume that the sub-sample of geo-tagged tweets is broadly representative of the entire data set. Figure 3 below is a world map showing the origins of all geo-tagged tweets in the complete data set; the yellow dots indicate where tweets originated. Unsurprisingly, the vast majority of tweets originated from inside the United States. The primaries are for the U.S. presidency, so it is expected that the four candidates would be talked about most within America. It was interesting, however, that the highest concentration of tweets was on the East coast, with the West coast also heavily represented, while Middle America produced relatively few tweets. If these geo-tagged tweets are reflective of the total sample, then the data set may be biased toward the East coast, with very few tweets originating from Middle America.
Outside of America, northern Europe, and particularly the U.K., was also heavily concentrated with tweets pertaining to the candidates. We do not know why the U.K. had such a high proportion of tweets. It may be that British people are highly interested in American politics, that British Twitter users geo-tag differently than users in other countries, or that many Americans traveling in northern Europe remain engaged with politics on Twitter. The four candidates were discussed in other regions of the world as well, but at much lower concentrations; Europe showed more interest in American politics than any other region outside the United States.
Figure 3: World Origins of Tweets
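A minimal sketch of how the geo-tagged subsample behind Figure 3 could be plotted in R is shown below. It assumes hypothetical numeric lon and lat columns on the candidate tweet table (NA when a tweet is not geo-tagged); the published maps were produced with QGIS, so this is only an illustrative alternative.

    library(ggplot2)  # borders() draws polygons from the maps package

    tweets <- read.csv("candidate_tweets.csv", stringsAsFactors = FALSE)
    geo <- subset(tweets, !is.na(lon) & !is.na(lat))  # the geo-tagged subsample

    ggplot(geo, aes(x = lon, y = lat)) +
      borders("world", colour = "grey60", fill = "grey90") +
      geom_point(colour = "gold", size = 0.3, alpha = 0.4) +
      coord_quickmap() +
      labs(title = "Origins of geo-tagged tweets mentioning the four candidates") +
      theme_void()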
We next looked at tweet frequency by state for each candidate within the United States, starting with Hillary Clinton's map (Figure 4). The states with the largest number of tweets mentioning Clinton were California, Texas, Florida, Illinois, and, of course, New York.
  • 6. Clinton has won the primary elections in all of these states except California, whose primary has not yet taken place. States like North and South Dakota, Montana, Wyoming, and Nebraska had almost no tweets mentioning Clinton. It is interesting, however, that Clinton appears more popular on Twitter in Utah than the other three candidates, even though she lost the Utah primary to Sanders; this again indicates that tweet volume may not be an entirely accurate predictor of election results.
Figure 4: Clinton's Tweet Frequencies Map
We next looked at Bernie Sanders's map of tweet frequencies by state (Figure 5), and at the states where he is more popular than Clinton. Surprisingly, discussion of Sanders is very popular on Twitter in Ohio, even though he lost the Ohio primary to Clinton by a substantial margin.
Figure 5: Sanders' Tweet Frequencies Map
We then looked at Ted Cruz's map of tweet frequencies by state (Figure 6). Compared to Clinton and Sanders, Cruz is more popular on the West coast, with states like Nevada, Arizona, and Oregon showing more interest in him on Twitter. He is also mentioned more in states like Montana and Nebraska, where Clinton and Sanders had almost no mentions. Lastly, we looked at Donald Trump's map of tweet frequencies by state (Figure 7). Interestingly, he is not as popular as Cruz in Montana, Wyoming, and Nebraska, where he is rarely mentioned on Twitter.
Figure 6: Cruz's Tweet Frequencies Map
Figure 7: Trump's Tweet Frequencies Map
After looking at the maps of all four candidates' tweet frequencies, it seems that tweet volume per candidate is not always a good indicator of election results. As mentioned above, there were some instances (such as Clinton's very high volume of tweets in Utah and Sanders' very high volume of tweets in Ohio) where Twitter did not match the real-world outcome of the elections, based on the assumption that higher tweet volume should correlate with winning the majority of votes. Overall, however, state-level tweet volumes for each candidate agreed with election outcomes more often than they disagreed. Tweet volume per candidate will therefore be one of the predictor variables incorporated into the prediction model introduced later in this paper.
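The state-level maps in Figures 4-7 aggregate tweet counts by the state inferred from the user-location field. A sketch of that aggregation and a choropleth for one candidate is shown below; it assumes a hypothetical lowercase state column and is an illustrative reconstruction, not the authors' original mapping code.

    library(ggplot2)
    library(maps)  # supplies the polygons behind map_data("state")

    tweets  <- read.csv("candidate_tweets.csv", stringsAsFactors = FALSE)
    clinton <- subset(tweets, clinton & !is.na(state))

    # Count tweets per state; map_data("state") uses lowercase state names.
    freq <- as.data.frame(table(clinton$state), stringsAsFactors = FALSE)
    names(freq) <- c("region", "n_tweets")

    us <- map_data("state")
    us <- merge(us, freq, by = "region", all.x = TRUE)
    us <- us[order(us$order), ]  # restore polygon drawing order after the merge

    ggplot(us, aes(long, lat, group = group, fill = n_tweets)) +
      geom_polygon(colour = "white") +
      coord_quickmap() +
      labs(title = "Tweets mentioning Clinton, by state", fill = "Tweets") +
      theme_void()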
  • 7. Tweet volume, then, performs only sporadically as a predictor of election results. However, we can use an algorithm to evaluate and categorize the feelings expressed in text, an approach called sentiment analysis. We therefore next looked at textual sentiment analysis of the tweets to gain better insight into public opinion on Twitter regarding the 2016 primary elections.
To extract sentiments from the tweets, we used the Syuzhet R package, which ships with four sentiment dictionaries and also provides a method for accessing the robust, but computationally expensive, sentiment extraction tool developed by the NLP group at Stanford. The developers of this approach built a lexicon containing many words with associated scores for eight different emotions and two sentiments (positive/negative). Each word in the lexicon has a yes (one) or no (zero) for each emotion and sentiment, and the total sentiment of a sentence can be calculated by adding up the individual scores of its words. It is important to note that sentiment analysis of tweets comes with its fair share of problems. For example, sentiment analysis algorithms tend to be more sensitive to expressions typical of men than of women, and computers do not identify emotions correctly in all cases; they are likely not good at detecting something like sarcasm. Most of these concerns should not have a large effect on our analysis, because with a dataset this large most tweets are likely to be scored correctly and the effect of incorrectly identified sentiments should largely average out.
The entire data set was used to derive sentiment scores for all four candidates, and bar graphs aggregating the results are shown below. The more positive the score, the more positive the overall sentiment of the tweets associated with a candidate.
Figure 8: Average Sentiment Score over all Tweets for Each Candidate
Sanders has the highest average sentiment score of all the candidates, while Trump has the second highest average over all tweets; both Clinton and Cruz have lower average sentiment scores. When we look at the average very positive sentiment scores over all tweets, Trump has, on average, more positive scores than the other candidates, with Sanders a close second. However, Trump accounts for a very large proportion of the tweets compared to Sanders, and this may be skewing the average very positive sentiment scores; it may be worth equalizing the data set to contain fewer tweets mentioning Trump and seeing how this affects those averages. It is also notable that Clinton has the lowest average positive sentiment score over all tweets mentioning her. Lastly, the average very negative sentiment scores correspond to the other two graphs: Cruz has the highest average negative score over all tweets relating to him, with Clinton in second place, while Sanders has the lowest average negative score over all tweets mentioning him. Hence, if we were to go by these sentiment scores alone to predict election outcomes, Sanders would win the Democratic primaries and Trump would win the Republican primaries.
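Before turning to the word clouds, the sentiment-scoring step described above can be sketched as follows. get_sentiment() and get_nrc_sentiment() are functions provided by the syuzhet package; the file and column names follow the earlier sketches and are assumptions rather than the authors' exact schema.

    library(syuzhet)

    tweets <- read.csv("candidate_tweets.csv", stringsAsFactors = FALSE)

    # Overall sentiment per tweet; positive values indicate more positive language.
    tweets$sentiment <- get_sentiment(tweets$text, method = "syuzhet")

    # Per-emotion and positive/negative word counts from the NRC lexicon,
    # one of the dictionaries bundled with syuzhet.
    nrc <- get_nrc_sentiment(tweets$text)
    tweets$positive_words <- nrc$positive
    tweets$negative_words <- nrc$negative

    # Average sentiment score per candidate over all tweets (cf. Figure 8).
    candidates <- c("clinton", "sanders", "trump", "cruz")
    sapply(candidates, function(cand) mean(tweets$sentiment[tweets[[cand]]]))

    # Daily average for one candidate, the form used later in the panel dataset.
    aggregate(sentiment ~ date, data = tweets[tweets$trump, ], FUN = mean)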
6.2 R Shiny Visualization: Word Frequency (Wordcloud)
An interactive visualization app was produced with the R Shiny platform to analyze the text of the tweets. Preliminary analysis was conducted on the collected tweets in order to reveal trends that would be relevant to further sentiment analysis, and the R Shiny application was developed to generate a different wordcloud visualization for each date on which data was collected.
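A stripped-down version of such an app might look like the sketch below; the deployed application is larger, and the text-cleaning choices, file, and column names here are illustrative assumptions.

    library(shiny)
    library(tm)
    library(wordcloud)

    tweets <- read.csv("candidate_tweets.csv", stringsAsFactors = FALSE)

    ui <- fluidPage(
      titlePanel("Most frequent words by candidate and date"),
      selectInput("cand", "Candidate", c("clinton", "sanders", "trump", "cruz")),
      selectInput("day", "Date", sort(unique(tweets$date))),
      plotOutput("cloud")
    )

    server <- function(input, output) {
      output$cloud <- renderPlot({
        # Tweets mentioning the chosen candidate on the chosen day.
        txt <- tweets$text[tweets[[input$cand]] & tweets$date == input$day]
        corpus <- Corpus(VectorSource(txt))
        corpus <- tm_map(corpus, content_transformer(tolower))
        corpus <- tm_map(corpus, removePunctuation)
        corpus <- tm_map(corpus, removeWords, stopwords("english"))
        freq <- sort(rowSums(as.matrix(TermDocumentMatrix(corpus))),
                     decreasing = TRUE)
        wordcloud(names(freq), freq, max.words = 100, colors = "steelblue")
      })
    }

    shinyApp(ui, server)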
  • 8.
Figure 9: Average Very Positive Sentiment Score over all Tweets for Each Candidate
Figure 10: Average Very Negative Sentiment Score over all Tweets for Each Candidate
The wordcloud visualizations represent the words that were most prevalent in tweets related to a particular candidate. Slightly different words dominated a candidate's network each day, and on some days in particular there was a strong theme or increased polarization. For example, on February 27th tweets related to Donald Trump mainly contained '#nevertrump' (Figure 12). Twitter influencers such as Marco Rubio, Glenn Beck, and Amanda Carpenter all published tweets containing the hashtag as a strategic move against Donald Trump ahead of Super Tuesday on March 1st, which left Trump with a sentiment score of 0.09 (Figure 11, Figure 13).
Figure 11: Marco Rubio Tweets #NeverTrump on February 27th
Backlash against Hillary Clinton for commending Nancy Reagan's involvement in the H.I.V./AIDS conversation following Reagan's death dominated tweets about Clinton on March 12th (Figure 14). As The New York Times put it, "The problem with Mrs. Clinton's compliment: It was the Reagans who wanted nothing to do with the disease at the time" (http://www.nytimes.com/politics/first-draft/2016/03/11/hillary-clinton-lauds-reagans-on-aids-a-backlash-erupts/). Sentiment analysis confirmed that tweets about Clinton were negative overall: she had a sentiment score of only -0.01 on March 12th, while Sanders and Cruz had higher scores (0.22 and 0.31, respectively) (Figure 15). Trump also had a relatively low sentiment score of -0.01 on March 12th, the same day that protesters disrupted a Trump rally in Chicago and forced the event to be canceled (Figure 15, Figure 16). On March 26th, a scandal broke out involving Ted Cruz: the group Anonymous alleged that Cruz was involved in a sex scandal, and most tweets mentioning him on March 26th involved it (Figure 17). Although Cruz generally had the highest sentiment score of all the candidates, on March 22nd and March 26th he had the lowest scores of all the candidates (0.02 and 0.01, respectively) (Figure 18).
In general, the words that appeared most frequently (as illustrated in the wordclouds) were predictive of a candidate's sentiment score, and this consistency further reinforced the appropriateness and validity of the Syuzhet R package used for the sentiment calculations.
  • 9.
Figure 12: February 27th, 2016 Wordcloud for Donald Trump
Figure 13: February 27th, 2016 Sentiment
The sentiment score provides a concrete quantitative measure of how a network feels toward a particular candidate, whereas the wordclouds represent the qualitative feelings of individuals and provide further context for the sentiment scores.
6.3 D3.js Visualizations
We next created several visualizations using D3.js. D3.js (Data-Driven Documents) is a JavaScript library for producing dynamic, interactive data visualizations in web browsers, making use of the widely implemented SVG, HTML5, and CSS standards. All of the visualizations produced using D3.js are available at http://aboutmonica.com/final%20D3/.
A force-layout visualization of the mention network for all four candidates was generated in D3. The network was constructed only from tweets posted on primary election days. Even so, it is a very large network with over 30,000 edges, and the resulting layout graph takes a while to load in the browser.
Figure 14: March 12th, 2016 Wordcloud for Hillary Clinton
Figure 15: March 12th, 2016 Sentiment
The force-layout visualization at the link above shows a vast social mention network. Since the network has directed edges, we can look at the direction of mentions: where many nodes connect to one central node, the arrows all point toward that node, meaning many Twitter users are tweeting at or mentioning it. Some Twitter users (represented by nodes) have very large networks and are densely connected by edges to other users. An edge between two users signifies that one mentioned or retweeted the other, so the dense, dark areas of the graph are likely people who were mentioned or retweeted many times. Toward the outskirts of the graph, by contrast, are small groups of nodes connected by only a few edges: these users mentioned or retweeted each other during the collection period but are not connected to any of the larger clusters. The graph thus clearly depicts which Twitter users have larger networks, visible as denser clusters around their nodes.
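The network parameters reported in the next slide (Figure 19) were derived in Gephi; comparable quantities can be computed in R with igraph. The sketch below assumes a hypothetical per-candidate edge list of (mentioning user, mentioned user) pairs.

    library(igraph)

    # Hypothetical edge list: one row per mention, columns `from` and `to`.
    edges <- read.csv("edges_trump.csv", stringsAsFactors = FALSE)
    g <- graph_from_data_frame(edges, directed = TRUE)

    data.frame(
      avg_degree      = mean(degree(g, mode = "all")),
      avg_clustering  = transitivity(g, type = "average"),
      diameter        = diameter(g, directed = TRUE),
      avg_path_length = mean_distance(g, directed = TRUE)
    )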
  • 10.
Figure 16: March 12th, 2016 Wordcloud for Donald Trump
Figure 17: March 26th, 2016 Wordcloud for Ted Cruz
Lastly, there are nodes connected by only one or two ties (or, in some cases, none), indicating users who are neither mentioned by other users in the network nor mention other users in their own tweets.
We next looked at each of the four candidates' networks separately and, using Gephi, derived network parameter values in order to better assess what is going on in each candidate's network. Figure 19 below depicts the results. There are some interesting results to point out. Cruz's average clustering coefficient is 0, while Trump's is nearly zero at 0.001; Cruz's mention network thus shows very little clustering, with most users not interconnected. In general, all of the candidates have very small clustering coefficients, with Sanders having the highest value at 0.005. This may be because the network analyzed is a mention network, and candidates are unlikely to reply to many of the tweets that mention them; in addition, the tweets were collected in real time, so a candidate may have responded to tweets later, outside our collection window. Sanders's network has the highest average degree at 2.109, with Clinton close behind at 2.013. This implies that, on average, a node in Sanders's network has 2.109 edges, meaning users are more likely to interact and mention one another in Sanders's network than in the other candidates' networks. Sanders also has the largest network diameter at 6, suggesting that he reaches a wider audience than the other candidates. Lastly, both Republican candidates have lower average path lengths than the Democratic candidates, meaning that nodes in the Republican networks can be reached in fewer steps.
Figure 19: Network Parameters for Each Candidate
7 Prediction Model
After exploring and analyzing the Twitter data, we next focused on building a prediction model. We created a panel dataset for all four candidates covering the primary election days as well as randomly chosen days from our Twitter data set, for a total of 70 days of data. We chose to look specifically at the states of New York, Indiana, and Nebraska.
  • 11. The next primary will take place in Nebraska on May 10th, and we would like to evaluate our model's predictions against the actual outcome of that contest. We coded all of our independent and dependent variables for these three states. New York, which Trump and Clinton won, was used as the training dataset. The testing dataset covered the states of Indiana and Pennsylvania, and the models fitted on the training data were used to calculate predicted winners for Indiana and Pennsylvania. It is important to note that this analysis uses only tweets that were geo-tagged and belonged to one of the states in the panel. This undoubtedly reduced the number of tweets available, as most of the collected tweets were not geo-tagged at all; however, we still obtained thousands of tweets on most days for each candidate.
The dependent variable, electionresults, equals 0 if the candidate did not win the primary election (or, on non-election days, the FiveThirtyEight polling average for that day) and 1 otherwise. The researchers at FiveThirtyEight collect national polls for the Republican and Democratic primaries and generate a polling average for each candidate; a total of 671 polls have been collected so far for the Democratic primary and 681 for the Republican primary. This polling average is adjusted for pollster quality, sample size, and recency, and is therefore a good indicator of public opinion regarding the primaries and the candidates. Furthermore, FiveThirtyEight offers daily polling averages from as early as July 10, 2015 up to the current day, so it was straightforward to collect the daily polling average and determine which candidate led the polls on each day in our dataset.
For the independent variables, we used tweet volume for each candidate in each state, the average sentiment score (calculated from each candidate's tweet corpus for each day and state), and the network parameters described above. These parameters (average degree, average clustering coefficient, network diameter, and average path length) were not derived per day and per state but from the entire Twitter data set, and were therefore constant over all days. We also added control variables to the panel dataset: each state's population and average income. Lastly, we included a lagged dependent variable as a predictor, because in time series analysis the previous day's poll results are expected to predict the current day's results, and we needed to account for this correlation.
8 Findings
In order to train and test our models, we used three different statistical methods: Logistic Regression, Random Forests, and Support Vector Machines, to see whether any one of them performed better than the others. The regression equation for the logistic regression model is shown in Figure 20.
Figure 20: Logistic Regression Equation
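A condensed sketch of fitting the three models and comparing them with ROC curves is shown below. The panel file, column names, and train/test split are illustrative assumptions based on the variables listed above, not the authors' exact specification.

    library(randomForest)
    library(e1071)
    library(pROC)

    panel <- read.csv("panel.csv", stringsAsFactors = FALSE)
    panel$electionresults <- factor(panel$electionresults)  # 0 = lost, 1 = won

    train <- subset(panel, state == "new york")
    test  <- subset(panel, state %in% c("indiana", "pennsylvania"))

    form <- electionresults ~ tweet_volume + avg_sentiment + avg_degree +
      clustering + diameter + path_length + population + avg_income + lag_result

    logit <- glm(form, data = train, family = binomial)
    rf    <- randomForest(form, data = train, ntree = 500)
    svm_m <- svm(form, data = train, probability = TRUE)

    # Predicted probability that a candidate wins, for each test observation.
    p_logit <- predict(logit, newdata = test, type = "response")
    p_rf    <- predict(rf, newdata = test, type = "prob")[, "1"]
    p_svm   <- attr(predict(svm_m, newdata = test, probability = TRUE),
                    "probabilities")[, "1"]

    # Area under the ROC curve for each model (cf. Figure 21).
    sapply(list(logistic = p_logit, random_forest = p_rf, svm = p_svm),
           function(p) auc(roc(test$electionresults, p)))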
All three models performed well in terms of prediction results, which we found somewhat surprising. As we mentioned earlier, Sanders is much more popular on Twitter than Clinton, so we expected this to skew the results, but it does not appear to have done so. After training the models on the New York dataset, we tested them on the Indiana and Pennsylvania datasets; for both, the models correctly predicted that Clinton and Trump would win Pennsylvania (which they did) and that Sanders and Trump would win Indiana (which they did). We used ROC curves (Figure 21) to evaluate the predictive accuracy of our models, and the Random Forests and Support Vector Machine models performed better than the Logistic Regression model.
9 Conclusion
In this paper, we used Twitter data from February, March, and April 2016 to predict election outcomes for the 2016 presidential primaries. We analyzed several variables to explore the Twitter data, including network parameters, the text sentiment of the tweets, and tweet volume for each of the four candidates.
  • 12.
Figure 21: ROC curves for all Three Prediction Models
In order to visualize our results, we built several static and interactive visualizations. The prediction models we developed performed very well in predicting the election outcomes. However, we only tested the models on two states, and we would like to run further tests on other state primaries in order to better assess their predictive accuracy.
10 References
"Company | About." Twitter, 31 Mar. 2016. Web. 7 May 2016. <https://about.twitter.com/company>.
Larsson, A. O., & Moe, H. (2012). Studying political microblogging: Twitter users in the 2010 Swedish election campaign. New Media & Society, 14(5), 729-747.
Wang, H., Can, D., Kazemzadeh, A., Bar, F., & Narayanan, S. (2012, July). A system for real-time Twitter sentiment analysis of 2012 US presidential election cycle. In Proceedings of the ACL 2012 System Demonstrations (pp. 115-120). Association for Computational Linguistics.
11 URLs
All D3 graphics used in this project are available for viewing online. The R Shiny application was too large to upload, but its source code can be viewed via the menu items at http://aboutmonica.com/final%20D3/
Republican Sentiments: http://aboutmonica.com/final%20D3/republican%20sentiments/
Democrat Sentiments: http://aboutmonica.com/final%20D3/democrat%20sentiments/
Volume of Tweets per Candidate: http://aboutmonica.com/final%20D3/candidate%20tweet%20volume%20prop%20D3/