Sentiment Analysis and Political Disaffection in Italy

Introduction
Development of the classification system
Application and experimental results
Conclusions
Sentiment Analysis for Italian-language
microblogging: development and evaluation of an
automatic system
Corrado Monti
22-04-2013
Corrado Monti Sentiment Analysis for italian microblogging 1/23

Twitter
Main microblogging platform
Instant publication of short text content
Tweet → 140 characters
Widespread in Italy too
[Figure: Twitter in Italy, December 2012]
Italian internet users: 28.6 M people (100%)
Have heard of Twitter: 25.3 M people (88.6% of internet users)
Weekly Twitter users: 4.7 M people
Daily Twitter users: 1.24 M people

Textual classification
We would like to classify millions of texts
Rule-based approach: rules (keywords, regular
expressions) defined together with domain experts
Inaccurate, hard to define
Supervised machine learning
It needs a training set to learn how to classify new texts
Every text is transformed into a sequence of numerical features;
the algorithm learns what these numbers mean
More recent problem: Sentiment Analysis → recognize the
sentiment expressed by the author of the text
Many applications, both industrial and academic

Measuring collective feelings
One of the first works is O’Connor et al. (2010), who
measured economic confidence through Sentiment Analysis on
Twitter
[Figure: Twitter sentiment ratio compared with the Gallup Economic Confidence index, January 2008 to November 2009]
For which phenomena is this possible?
How valid are these measures? Do they also work in
Italy?

Goals of this work
Research question: can political disaffection be measured through
large-scale tweet classification?
It is a relevant phenomenon, especially in Italy
Considerable interest, both academic (sociology) and non-academic
We’d also like to build reusable classifiers

Political disaffection
How to define a “disaffected” tweet?
According to domain experts, it must
1. have a politics-related topic → topic detection
2. have negative sentiment → sentiment analysis
3. be directed at politicians in general
[Diagram: Tweet → Is political? (No: discarded) → Is negative? (No: discarded) → Is general? (No: discarded) → Politically disaffected tweet]
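The three-step filter above can be sketched as a chain of predicates. This is only an illustration, not the system described in these slides: `toy_political`, `toy_negative` and `toy_general` are hypothetical keyword checks standing in for the topic classifier, the sentiment classifier and the genericity rules.

```python
def is_disaffected(tweet, is_political, is_negative, is_general):
    """Return True only if the tweet passes all three filters, in order."""
    # `and` short-circuits, so a tweet is discarded at the first failed stage.
    return is_political(tweet) and is_negative(tweet) and is_general(tweet)


# Toy stand-ins (assumptions for illustration only).
def toy_political(tweet):
    text = tweet.lower()
    return "politici" in text or "governo" in text


def toy_negative(tweet):
    return "vergogna" in tweet.lower()


def toy_general(tweet):
    return "tutti" in tweet.lower()


def count_disaffected(tweets):
    """Return (number of disaffected tweets, number of political tweets)."""
    political = [t for t in tweets if toy_political(t)]
    disaffected = [
        t for t in political
        if is_disaffected(t, toy_political, toy_negative, toy_general)
    ]
    return len(disaffected), len(political)
```

Returning both counts keeps the sketch close to how the result is used later, where disaffected volume is divided by political volume.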

Training Set
28 340 tweets labelled by 40 political science students
Between April and June 2012
3 labellers for every text
We keep in the dataset only tweets with a unanimous “political”
label
For the other labels we measure agreement with Krippendorff's α:
“negative” → 0.78 → reliable labels
“generic” → 0.41 → very noisy labels
Resulting dataset:
                 negative    non-negative
political        7 965       4 544
not political    15 831
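For readers unfamiliar with the agreement measure, here is a minimal sketch of Krippendorff's α for nominal labels, computed from the coincidence counts as α = 1 − D_o/D_e. The function name and the input layout (one list of labels per tweet) are assumptions made for this example, and the slides do not say which implementation was used.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` holds, for each item, the labels assigned by its labellers.
    Items with fewer than two labels carry no agreement information.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Each ordered pair of labels from two different labellers counts 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement, both scaled by the same factor n.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # a single label value is used throughout
    return 1.0 - observed / expected
```

With three labellers per tweet, as in this training set, each tweet contributes six ordered label pairs, each weighted 1/2.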

1 – Topic Detection: politics
Time-robust classifier
Training set extension: 17 388 news headlines (January-October
2012)
Best feature extraction:
Character 5-grams → robust to differences in spacing
tf-idf
Discard terms with fewer than 4 occurrences
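The feature extraction listed above can be sketched in plain Python. Several details are assumptions, since the slides do not specify them: whitespace is collapsed before extracting n-grams, "occurrences" is read as total count over the corpus, and the idf variant is log(N/df) + 1.

```python
import math
from collections import Counter


def char_ngrams(text, n=5):
    """Character n-grams after lowercasing and collapsing whitespace (assumption)."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def fit_tfidf(texts, n=5, min_count=4):
    """Return (vocabulary, sparse tf-idf vectors as dicts)."""
    grams = [Counter(char_ngrams(t, n)) for t in texts]
    total = Counter()  # occurrences over the whole corpus
    doc_freq = Counter()  # number of texts containing each n-gram
    for g in grams:
        total.update(g)
        doc_freq.update(g.keys())

    # Discard n-grams with fewer than `min_count` occurrences.
    vocab = sorted(t for t, c in total.items() if c >= min_count)
    n_docs = len(texts)
    idf = {t: math.log(n_docs / doc_freq[t]) + 1.0 for t in vocab}
    vectors = [{t: g[t] * idf[t] for t in g if t in idf} for g in grams]
    return vocab, vectors
```

Sparse dicts are used because, with tens of thousands of n-gram features, most entries of each vector are zero.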

45 728 points with 78 642 features
The SMO SVM solver, k-Nearest Neighbors, Random Forest and kernel methods are too
resource-hungry
We preferred online algorithms
ALMA, Passive-Aggressive, PEGASOS, OIPCAC gave good results
Classifier   Accuracy        F-Measure      Global time
ALMA         0.88 ± 0.01     0.87 ± 0.01    13.5 ± 1
PA           0.89 ± 0.01     0.89 ± 0.01    10.6 ± 0.1
PEGASOS      0.88 ± 0.01     0.88 ± 0.01    1103 ± 10
OIPCAC       0.89 ± 0.001    0.89 ± 0.01    5911 ± 52
Other algorithms were tested, but with worse results (e.g. Naïve
Bayes)
We selected Passive-Aggressive: good results, low costs
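To show why an online algorithm is cheap here, this is a minimal sketch of the basic Passive-Aggressive update for binary classification on sparse features. It is the plain variant without the aggressiveness parameter C. The helper names are hypothetical and this is not the implementation used for the results above.

```python
def dot(w, x):
    """Dot product between a sparse weight dict and a sparse example dict."""
    return sum(w.get(f, 0.0) * v for f, v in x.items())


def pa_update(w, x, y):
    """One Passive-Aggressive step on example x with label y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * dot(w, x))  # hinge loss
    if loss > 0.0:
        norm_sq = sum(v * v for v in x.values())
        if norm_sq > 0.0:
            # Smallest change to w that brings the margin on x back to 1.
            tau = loss / norm_sq
            for f, v in x.items():
                w[f] = w.get(f, 0.0) + tau * y * v
    return w


def train(examples, epochs=5):
    """Run several passes over (x, y) pairs, updating one example at a time."""
    w = {}
    for _ in range(epochs):
        for x, y in examples:
            pa_update(w, x, y)
    return w


def predict(w, x):
    return 1 if dot(w, x) >= 0 else -1
```

Each update touches only the non-zero features of one example, so memory and time per step do not depend on the size of the training set.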

2 – Sentiment Analysis: feature extraction
Different approaches were tested:
n-grams of characters, words, n-grams of words
boolean presence, term frequency, tf-idf, fuzzy match between
n-grams
Best: term frequency of single words
Adjustments:
Fraction of uppercase words as a feature
Features of synonyms were joined together
Other adjustments did not give better results:
Treating words at the beginning or at the end of the
tweet as distinct features
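The best-performing features described above (word term frequency, the fraction of uppercase words, merged synonyms) can be sketched as follows. The synonym table and the feature name `__uppercase_fraction__` are hypothetical. How synonyms were actually identified is not stated in the slides.

```python
import re
from collections import Counter

# Hypothetical synonym table: each word maps to a shared representative.
SYNONYMS = {"schifo": "vergogna"}


def sentiment_features(tweet, synonyms=SYNONYMS):
    """Term frequency of single words plus the fraction of uppercase words."""
    words = re.findall(r"\w+", tweet)
    features = Counter()
    for word in words:
        lowered = word.lower()
        # Synonyms share one feature, so their counts are added together.
        features[synonyms.get(lowered, lowered)] += 1

    # Words written entirely in capitals (single letters are ignored).
    uppercase = sum(1 for word in words if len(word) > 1 and word.isupper())
    features["__uppercase_fraction__"] = uppercase / len(words) if words else 0.0
    return dict(features)
```

For example, `sentiment_features("Che SCHIFO questa vergogna")` counts `vergogna` twice, because `schifo` is mapped onto it.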

2 – Removal of the target of sentiment
We do not want the sentiment to be guessed from
specific entities (PD, Berlusconi...). It should be based on
the surrounding words.
Training set biases ⇒ overfitting
This problem is sometimes ignored, especially in Italian-language Sentiment
Analysis
How can we decide which words to remove?
Proposed solution: ontology-based filter
[Diagram: a SPARQL query on DBpedia yields a first list of objects; each object is checked against text from online newspapers; objects that never appear are discarded, the others form the list of objects to remove]
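The filter can be sketched in two steps, assuming the candidate names have already been retrieved from DBpedia (the SPARQL query itself is not shown in the slides and is not reproduced here). Function names and sample data are hypothetical. For simplicity only single-word targets are removed from tweets.

```python
import re


def select_targets(candidates, news_texts):
    """Keep only the candidate entity names that appear in the news text."""
    corpus = " ".join(news_texts).lower()
    targets = set()
    for name in candidates:
        pattern = r"\b" + re.escape(name.lower()) + r"\b"
        if re.search(pattern, corpus):
            targets.add(name.lower())
    return targets


def remove_targets(tweet, targets):
    """Tokenize the tweet and drop tokens that name a sentiment target."""
    tokens = re.findall(r"\w+", tweet.lower())
    return [t for t in tokens if t not in targets]
```

Checking candidates against newspaper text keeps the removal list limited to entities that are actually discussed, instead of every name in the ontology.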

2 – Sentiment Analysis: classification
12 476 points with 2 383 features
Classifier   Accuracy       F-Measure      Global time
ALMA         0.70 ± 0.03    0.75 ± 0.03    0.82 ± 0.3
PA           0.67 ± 0.06    0.71 ± 0.12    0.9 ± 10^-3
PEGASOS      0.69 ± 0.03    0.73 ± 0.05    76 ± 0.1
OIPCAC       0.71 ± 0.03    0.75 ± 0.02    121 ± 25
RF           0.72 ± 0.03    0.78 ± 0.03    2173 ± 48
We select ALMA: good results, low costs

3 – Genericity: rule-based approach
Problem: select tweets directed at politicians in general
Unhelpful training set labels → we simplify the problem and
use a rule-based approach
Rules used:
Presence of keywords defined with field experts
(based on the most reliable part of the training set)
OR
Entities from at least three different political areas
(identified through DBpedia ontologies)
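The two rules combine with a logical OR, which can be sketched as below. The keyword set and the entity-to-area mapping are hypothetical examples. In the system described here they come from field experts and from DBpedia respectively.

```python
# Hypothetical inputs for illustration only.
KEYWORDS = {"politici", "casta"}
ENTITY_AREA = {"pd": "centre-left", "pdl": "centre-right", "udc": "centre"}


def is_general(tokens, keywords=KEYWORDS, entity_area=ENTITY_AREA, min_areas=3):
    """True if a generic keyword is present OR entities span enough political areas."""
    lowered = {t.lower() for t in tokens}
    # Rule 1: at least one expert-defined keyword.
    if lowered & keywords:
        return True
    # Rule 2: entities from at least `min_areas` different political areas.
    areas = {entity_area[t] for t in lowered if t in entity_area}
    return len(areas) >= min_areas
```

A tweet naming PD, PdL and UdC together passes the second rule, while one naming only two of them does not.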

Application of the classification system
Hypothesis:
By applying this classification system to tweets from April-October
2012, can we obtain a good measure of the spread of
political disaffection in society?

Surveys
Surveys courtesy of IPSOS
We compute two indexes:
1. Inefficacy: fraction of Italians who, for every
party, rate their propensity to vote for it
as 1 on a 1-to-10 scale
Sociologists say that this captures the
sense of inefficacy perceived by citizens,
a symptom of political disaffection
2. Non-vote: fraction of Italians who say they will not
vote
It measures a behaviour
Influenced by other factors: election proximity,
“moral” perception of abstention
We have a survey roughly every 10 days in April-October
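The inefficacy index defined above reduces to a simple count. In this sketch each respondent is represented as a dict from party name to a propensity score between 1 and 10. That data layout and the party names are assumptions, not the IPSOS format.

```python
def inefficacy_index(responses):
    """Fraction of respondents whose propensity to vote is 1 for every party."""
    if not responses:
        return 0.0
    # A respondent counts only if all of their scores are the minimum, 1.
    count = sum(
        1 for scores in responses
        if scores and all(s == 1 for s in scores.values())
    )
    return count / len(responses)
```

A respondent who gives even one party a score above 1 is not counted.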

Tweet sample
We gather a set of users active in October and
also sample their followers
We obtain 261 313 users active in October
∼ 5% of Italian users and ∼ 10% of active Italian
users in October 2012
We select only those also active in April → 167 557
users
Scraping of their tweets in this time period:
35 882 423 tweets

Comparison
For every survey date, we compute the ratio between
disaffected tweet volume and political tweet volume in a
certain time window. We consider:
$\Delta^{14}_{1}$: last two weeks
$\Delta^{14}_{7}$: between 14 and 7 days before
$\Delta^{7}_{1}$: last week
[Figure: three June 2012 calendars highlighting each time window]
We compare these ratios with the polls through the Pearson
correlation coefficient
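The comparison can be sketched in two functions: one computes the disaffection ratio in a window before a survey date, the other computes Pearson's correlation. The window is assumed to run from `max_days` to `min_days` before the survey, both inclusive. The exact boundaries are not stated in the slides, and the data layout is hypothetical.

```python
from datetime import date, timedelta
from math import sqrt


def window_ratio(daily_counts, survey_day, max_days, min_days):
    """Disaffected / political tweet volume in the window before a survey.

    `daily_counts` maps a date to a (disaffected, political) pair of counts.
    """
    disaffected = political = 0
    for offset in range(min_days, max_days + 1):
        day = survey_day - timedelta(days=offset)
        d, p = daily_counts.get(day, (0, 0))
        disaffected += d
        political += p
    return disaffected / political if political else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)
```

In use, one ratio is computed per survey date and the resulting list is passed to `pearson` together with the survey index values.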

Inefficacy:
Window ($\Delta^{\max}_{\min}$)   ρ        95% confidence interval   P-value for ρ > 0
$\Delta^{14}_{1}$                 0.7860   0.476-0.922               0.031%
$\Delta^{14}_{7}$                 0.7749   0.454-0.917               0.042%
$\Delta^{7}_{1}$                  0.6880   0.310-0.878               0.226%
Non-vote:
Window ($\Delta^{\max}_{\min}$)   ρ        95% confidence interval   P-value for ρ > 0
$\Delta^{14}_{1}$                 0.5579   0.190-0.788               0.567%
$\Delta^{14}_{7}$                 0.5920   0.248-0.803               0.231%
$\Delta^{7}_{1}$                  0.4433   0.049-0.718               3.00%
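The slides do not say how the confidence intervals were obtained. One common choice for Pearson's ρ is the Fisher z-transformation, sketched here. The sample size is also not given. The value n = 16 used in the usage note is an assumption, chosen only because it approximately reproduces the first inefficacy row.

```python
from math import atanh, sqrt, tanh


def fisher_confidence_interval(r, n, z_crit=1.959964):
    """Approximate 95% confidence interval for a Pearson correlation r.

    Uses the Fisher z-transformation: atanh(r) is roughly normal
    with standard error 1 / sqrt(n - 3).
    """
    z = atanh(r)
    standard_error = 1.0 / sqrt(n - 3)
    lower = tanh(z - z_crit * standard_error)
    upper = tanh(z + z_crit * standard_error)
    return lower, upper
```

Under these assumptions, `fisher_confidence_interval(0.7860, 16)` gives an interval close to 0.476-0.922.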

[Figure: Twitter disaffection ratio and inefficacy indicator over time, mid-April to early October]

Interpretation
The data seem to indicate a fairly strong correlation between
disaffected tweets and the spread of the phenomenon in society
This does not mean that Twitter is a representative sample
We can guess that the amount of discussion about this
phenomenon is connected with how much it will spread

Further investigation
With this large tweet sample, we can study the causes of
oscillations with text mining techniques
We compute the ratio day by day and find when
disaffection peaks occur
For every peak
1. we take the news of that day
2. we compare it with the mass of tweets in the peak
3. we select the most similar news item
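The matching step can be sketched with bag-of-words vectors and cosine similarity. The slides do not state which similarity measure was used, so cosine similarity is an assumption here, and the function names are hypothetical.

```python
import re
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Word counts of a lowercased text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar_news(news_items, peak_tweets):
    """Return the news item closest to the pooled tweets of a peak day."""
    # All tweets of the peak are merged into one document.
    peak = bag_of_words(" ".join(peak_tweets))
    return max(news_items, key=lambda item: cosine(bag_of_words(item), peak))
```

Pooling the tweets of a peak into a single document makes the comparison reflect the dominant topic of the day instead of any single tweet.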

[Figure: daily Twitter disaffection ratio (tweets) and inefficacy indicator (surveys) over time, April to October, with peaks annotated by the matching news:]
(Conspiracy theories about State-sponsored massacres)
Lega scandal
Local elections
Brindisi bombing
Fiorito scandal

Conclusions
We built classifiers for political messages
Topic and sentiment classifiers are reusable
We showed a correlation between the amount of discussion on
Twitter and the spread of a phenomenon
Future work
Classifying users as nodes of a social network
Confirming or refuting models that could explain our data

Thanks
