This document summarizes the results and problems of entity-oriented sentiment analysis of tweets in the SentiRuEval 2014-2015 evaluation. It describes the task of determining sentiment towards companies mentioned in tweets. The best-performing systems achieved macro F-measures of 0.488 for telecom tweets and 0.36 for bank tweets. Performance varied between domains due to differences between training and test data. Many tweets were difficult to classify, either because of new events or vocabulary not covered in training, complex tweets mentioning multiple entities with different polarities, or irony. Most systems treated sentiment classification as a general task, and true entity-oriented approaches did not achieve better results. Improving sentiment vocabularies and handling current events were identified as opportunities for improvement.
Entity-oriented sentiment analysis of tweets: results and problems
1. Entity-oriented sentiment analysis of tweets: results and problems
Natalia Loukachevitch
Lomonosov Moscow State University
Yuliya Rubtsova
A.P. Ershov Institute of Informatics Systems
3. SentiRuEval 2014-2015
Testing of sentiment analysis systems for Russian texts:
Aspect-oriented analysis of reviews
• Restaurants
• Cars
Entity-oriented analysis of tweets: reputation monitoring
• Banks [8]
• Telecom companies [7]
4. SentiRuEval: entity-oriented analysis of tweets
A reputation-oriented tweet may express:
• a positive or negative opinion about a company
• a positive or negative fact concerning a company
Task: to determine the sentiment towards the mentioned company.
Participation: 9 participants, 33 runs.
5. SentiRuEval: entity-oriented analysis of tweets
Training collection (December 2013 - February 2014):
• 5,000 banking tweets
• 5,000 telecom tweets
Test collection (July 2014 - August 2014):
• 4,549 banking tweets
• 3,845 telecom tweets
6. Expert annotation
• 0 : tweet considered neutral
• 1 : positive fact or opinion
• -1 : negative fact or opinion
• +- : positive and negative sentiments in the same tweet
• -- : meaningless
7. Annotation problem
Test data were annotated using a voting scheme (agreement between 2 or 3 annotators).

Domain     Same label from at least 2 assessors   Full agreement    Final number of tweets in the test collection
Telecom    4,503 (90.06%)                         2,233 (44.66%)    3,845
Banks      4,915 (98.3%)                          3,818 (76.36%)    4,549
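The voting scheme described above can be sketched as a simple majority vote over assessor labels; the function name is an illustrative assumption, not part of the original system.

```python
from collections import Counter

def vote_label(labels):
    """Illustrative sketch of the voting scheme: return the majority
    label if at least two of the three assessors agree; otherwise
    return None and the tweet is discarded from the test collection."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None
```

For example, `vote_label([1, 1, 0])` keeps the tweet with label 1, while `vote_label([1, 0, -1])` discards it.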
8. Distribution of messages in collections according to sentiment classes

Telecom                         Neutral    Positive    Negative
Training collection             2,397      973         1,667
Gold standard test collection   2,816      413         944

Banks                           Neutral    Positive    Negative
Training collection             3,569      410         2,138
Gold standard test collection   3,592      350         670
9. Quality measure
Main quality measure, the macro-averaged F-measure:

Macro-F = (F-measure of the positive class + F-measure of the negative class) / 2

The F-measure of the neutral class is ignored, but this does not reduce the task to two-class prediction: neutral tweets misclassified as positive or negative still lower the precision of those classes. Additionally, micro-averaged F-measures were calculated for the two sentiment classes.
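As a sketch, the macro-averaged F-measure defined above can be computed as follows, using labels 1, -1, 0 from the annotation scheme; function names are illustrative assumptions.

```python
def f1_for_class(gold, pred, cls):
    """Precision, recall and F1 for a single class label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

def macro_f_sentiment(gold, pred):
    """Average the F1 of the positive (1) and negative (-1) classes only.
    The neutral class (0) is excluded from the average, but neutral
    tweets predicted as positive or negative still count as errors."""
    return (f1_for_class(gold, pred, 1) + f1_for_class(gold, pred, -1)) / 2
```

For instance, `macro_f_sentiment([1, 1, -1, 0, 0], [1, 0, -1, 0, 1])` yields 0.75: the positive class has F1 = 0.5 and the negative class F1 = 1.0.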
10. Results
Top 3 results for telecom tweets:
Run id     Macro-F    Micro-F
Baseline   0.1823     0.337
2          0.4882     0.5355
3          0.4804     0.5094
4          0.467      0.506

Top 3 results for bank tweets:
Run id     Macro-F    Micro-F
Baseline   0.1267     0.2377
4          0.3598     0.343
10         0.352      0.337
2          0.3354     0.3656

Manual labeling by one participant in the telecom domain: Macro-F 0.703, Micro-F 0.7487.
11. Classification methods
•lemmas and syntactic links presented as triples (head word,
dependent word, type of relation)
2
•rule-based approach accounting syntactic relations between
sentiment words and the target entities
3
•maximum entropy method on the basis of word n-grams, symbol n-
grams, and topic modeling results.
4
•word n-grams, letter n-grams, emoticons, punctuation marks,
smilies, a manual sentiment vocabulary, and automatically
generated sentiment list based on (PMI) of a word occurrences in
positive or negative training subsets.
10
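The PMI-based sentiment list of run 10 can be sketched as scoring each word by PMI(w, positive) - PMI(w, negative), which reduces to the log-ratio of the word's relative frequencies in the two training subsets. The function name and the add-one smoothing are assumptions for illustration, not details from the participants' system.

```python
import math
from collections import Counter

def pmi_sentiment_scores(pos_tweets, neg_tweets, min_count=1):
    """Illustrative sketch: score words by PMI(w, pos) - PMI(w, neg),
    using add-one smoothing so unseen counts do not produce log(0).
    Positive scores suggest positive words, negative scores negative ones."""
    pos_counts = Counter(w for t in pos_tweets for w in t.split())
    neg_counts = Counter(w for t in neg_tweets for w in t.split())
    n_pos = sum(pos_counts.values())
    n_neg = sum(neg_counts.values())
    scores = {}
    for w in set(pos_counts) | set(neg_counts):
        if pos_counts[w] + neg_counts[w] < min_count:
            continue
        p_w_pos = (pos_counts[w] + 1) / (n_pos + 1)
        p_w_neg = (neg_counts[w] + 1) / (n_neg + 1)
        # The PMI difference cancels P(w), leaving the log-ratio of
        # the smoothed relative frequencies in the two subsets.
        scores[w] = math.log2(p_w_pos / p_w_neg)
    return scores
```

A word occurring mostly in positive tweets receives a positive score, and vice versa; thresholding the scores yields an automatic sentiment list.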
12. Classification methods
• SVM with syntactic relations
• Linguistic syntax-based patterns (without machine learning)
• MaxEnt and SVM using various features
13. Explaining the difference in performance in the two domains
The best results in the banking and telecom domains differ: 0.36 vs. 0.488.
The difference between the training and test collections was measured with the Kullback-Leibler divergence.
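A minimal sketch of this comparison, assuming unigram word distributions with add-one smoothing (the function name and smoothing choice are illustrative assumptions):

```python
import math
from collections import Counter

def kl_divergence(test_tokens, train_tokens):
    """D_KL(P_test || P_train) over the joint vocabulary, with add-one
    smoothing so training probabilities are never zero. Larger values
    mean the test collection diverges more from the training one."""
    p = Counter(test_tokens)
    q = Counter(train_tokens)
    vocab = set(p) | set(q)
    n_p = sum(p.values()) + len(vocab)
    n_q = sum(q.values()) + len(vocab)
    return sum(((p[w] + 1) / n_p) *
               math.log(((p[w] + 1) / n_p) / ((q[w] + 1) / n_q))
               for w in vocab)
```

Identical token distributions give a divergence of 0; the more the test vocabulary shifts away from training (e.g. after new events), the larger the value.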
14. Explaining the difference in performance in the two domains
The topics of reputation-oriented tweets greatly depend on positive or negative events with regard to the target entities.
15. Problems of reputation analysis of tweets
At any moment, events influencing reputation can occur, and such events will be absent from the training data.
Training collections in both domains: December 2013 - February 2014; the Ukraine events did not yet influence the target entities.
Test collections: July - August 2014, after the Ukraine events of 2013-2014: sanctions against banks, problems with communication in Crimea.
16. Analyzing difficult tweets
• 71 tweets in the banking domain were wrongly classified by all participants
• 85 tweets in the telecom domain were difficult for almost all participants (at most 2 systems were correct)
17. First group. 1.1
Tweets contain evident sentiment words (such as понравиться, 'to like') that were absent from the training set.
A general vocabulary of Russian sentiment words could help.
18. First group. 1.2
Tweets contain words expressing well-known positive or negative situations, such as theft or murder, that are absent from the training collection.
A general vocabulary of connotative words would be useful.
19. First group. 1.3
Tweets contain words and phrases describing current events, concerning the current news flow.
Parallel analysis of the current news, revealing correlations between tweet words and general sentiment and connotation vocabularies in news texts, can help.
20. Second group
Misclassified tweets include tweets that are genuinely complicated. They:
• mention more than one entity with different attitudes
• contain several sentiment words with different polarity orientations
• contain irony
22. Were systems entity-oriented?
Test tweets mentioning two or more entities:
• 58 tweets in the banking domain (15 tweets with different polarity labels)
• 232 tweets in the telecom domain (71 tweets with different polarity labels)
Only 3 of 9 participants treated the task as an entity-oriented one; the other participants always assigned the same polarity class to all entities mentioned in a tweet.
Performance on these tweets was worse than for all tweets on average, and entity-oriented approaches did not achieve better results.
23. Conclusion
We described the tasks, approaches and results of the SentiRuEval testing:
– high dependence on the training collections;
– high impact of current dramatic events;
– the capability to do entity-oriented analysis is quite restricted;
– the largest improvement can be expected from integrating a general sentiment vocabulary and a general vocabulary of connotative words;
– most participants solved the general task of tweet classification;
– entity-oriented approaches did not achieve better results.
All prepared materials are accessible for research purposes
http://goo.gl/qHeAVo
24. Thank you!
You can help us to assess
tweets for SentiRuEval-2016
http://sentimeter.ru/assess/texts/
Yuliya Rubtsova
Speaker notes
In general: sentiment of the whole document, fragment or sentence.
Entity-oriented: sentiment about a specific entity (a politician, a political party, a company, etc.).
Aspect-oriented: sentiment about specific parts or properties of an entity (aspects).
Переходи в Билайн. «Все за 300» — отличный тариф! ('Switch to Beeline. "Everything for 300" is a great tariff!')
The goal of the Twitter sentiment analysis at SentiRuEval was to find tweets influencing the reputation of a company in two domains
The datasets were collected with the Twitter Streaming API.
To prepare the datasets, 20,000 messages were labeled: 5,000 messages in each domain for the training and test collections. Each collection was labeled by at least two assessors; the gold standard test collections were labeled by three assessors. Irrelevant or unclear messages were removed from the training and test sets.
To avoid inconsistency and disputes, the voting scheme was applied when labeling the test collections.
We noticed that users sometimes do not want to be rude and add positive emoticons to clearly negative or ironic messages. That is why simple methods based on the extraction of emoticons, used for classification at the whole-tweet level, do not work well.
Main quality measure: the macro-averaged F-measure.
The baselines are based on assigning the majority reputation-oriented category (the negative one in this case).
One of the participants performed independent expert labeling of the telecom tweets, which can be considered the maximum possible performance of automated systems in this task.
Most participants used the SVM classification method.
We computed the Kullback-Leibler divergence to compare the word probability distributions in the test collections with those in the training collections.
The first group includes tweets that were misclassified because of the restricted size of the training collection, which did not contain appropriate training examples.
These words are usually considered neutral and non-opinionated, but they carry positive or negative associations (so-called connotations). To address this problem, a general vocabulary of connotative words would be useful, because the appearance of these words in connection with a company influences its reputation.
Problematic tweets contain words and phrases describing current events, concerning the current news flow.
The appearance of some events and their influence on a company's reputation are very difficult to predict, and their mentions will always be absent from the training collection. In this case, parallel analysis of the current news, revealing correlations between tweet words and general sentiment and connotation vocabularies in news texts, can help.
This means that the integration of various vocabularies into the machine-learning framework can improve the performance of reputation-oriented automatic systems.