1. “Twitsum” : Automatic generation of
event summaries using microblog
streams
P.K.K.Madhawa
2012MCS044
2. Motivation - The problem with Twitter search
● Twitter ranks tweets based on user interaction with them (number of retweets, favorites)
● Top results for the query ‘Ebola’ (25th November 2014)
● How to distinguish newsworthy tweets drowned in a sea of noise?
3. Goal
● Distinguish newsworthy tweets based on syntactic features, without depending on manual annotations
● Group tweets discussing similar content together
4. Contributions
● A heuristic-based scheme for annotating tweets as subjective/objective
● A classifier capable of detecting objective tweets using only their syntactic information
● An entity-centric tweet clustering algorithm
5. Twitter summarization - Earlier approaches
Sub-event detection based methods
● Use of a Hidden Markov Model to detect sub-events during an American football match (D. Chakrabarti and K. Punera, 2011)
● Sub-event detection by identifying outlier peaks in the temporal distribution of tweets on a topic (Zubiaga et al., 2012)
Clustering based approaches
● A support platform for event detection using social intelligence (T. Baldwin, P. Cook and B. Han, 2012)
○ Tweets are filtered using manually selected keywords
6. Design
● Tweet storage - stores the set of tweets downloaded using the Streaming API
● Classifier - selects objective tweets
● Summarizer - removes duplicates and clusters the tweets based on their similarity
7. Design - Objectivity detection
● Tweets are periodically downloaded by querying the public timeline using the Streaming API
● Structure of a tweet object: tweet text, user name, created time, geo location, language code, favorite count, retweeted_status, retweet count
8. Data collection
● Training data is annotated using a heuristic measure
● Objective - the tweet was generated by a verified profile
● Subjective - the tweet contains at least one emoticon or emoji character
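The annotation heuristic can be sketched as follows. The `user.verified` and `text` fields follow the Twitter API tweet object; the emoticon/emoji patterns and the rule ordering (verified profile checked first) are simplifying assumptions, not the exact rules used:

```python
import re

# Simplified patterns; the actual emoticon/emoji lists used are not given
# in the slides, so these are illustrative assumptions.
EMOTICON_RE = re.compile(r"[:;=8][\-o*']?[)\](\[dDpP/\\|]")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def heuristic_label(tweet):
    """Return 'objective', 'subjective', or None (left out of training)."""
    if tweet["user"]["verified"]:          # verified profile -> objective
        return "objective"
    text = tweet["text"]
    if EMOTICON_RE.search(text) or EMOJI_RE.search(text):
        return "subjective"                # emoticon/emoji -> subjective
    return None                            # no heuristic applies
```

Tweets matching neither rule are simply excluded from the distantly labeled training set.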
9. Preprocessing
● All emoticons and emoji characters are removed from the corpus
● User mentions are replaced with the tag ‘MENTION’ (eg: “@john said this” converts to “MENTION said this”)
● Punctuation symbols including the pound (#) character are removed
● URLs are replaced with the tag ‘URL’ (eg: http://t.co/12d3 converts to URL)
● Numbers in a tweet are replaced by the tag ‘NUMERIC’
● Stop words are removed
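A minimal sketch of these steps using plain regular expressions; emoticon/emoji removal is omitted for brevity, the stop-word list is an illustrative subset, and the substitution order (URLs replaced before punctuation is stripped) is an implementation assumption:

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "this", "to", "of"}  # illustrative subset

def preprocess(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", "URL", text)   # URLs -> URL tag
    text = re.sub(r"@\w+", "MENTION", text)       # user mentions -> MENTION tag
    text = re.sub(r"\d+", "NUMERIC", text)        # numbers -> NUMERIC tag
    text = re.sub(r"[^\w\s]", " ", text)          # punctuation, including '#'
    return " ".join(t for t in text.split() if t not in STOP_WORDS)

print(preprocess("@john said this: http://t.co/12d3 #news 42"))
# prints: MENTION said URL news NUMERIC
```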
10. Feature extraction
● Tweets are tokenized using the TweetNLP tokenizer (K. Gimpel, N. Schneider and B. O’Connor, 2011)
● Words are stemmed using the Porter stemmer
● Stemmed unigrams and bigrams converted to binary Tf-Idf values (with Laplace smoothing)
● Binary feature - presence of slang words (using an external gazetteer)
● Binary feature - presence of bad words
● Unigrams, bigrams and trigrams of POS tags as binary Tf-Idf values
● Average number of misspelled words
● Average number of all-capital words
● Average number of hashtags
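Part of this feature set can be sketched with scikit-learn as below. The gazetteers are placeholders, and stemming, POS n-grams, and misspelling counts are omitted; this is a simplified sketch, not the exact pipeline:

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder gazetteers; the real ones are external word lists.
SLANG = {"lol", "omg"}
BAD_WORDS = {"damn"}

def extract_features(docs):
    # Word unigrams/bigrams as TF-IDF over binary term presence
    # (smooth_idf is on by default, giving add-one smoothing of idf).
    word_vec = TfidfVectorizer(ngram_range=(1, 2), binary=True)
    X_words = word_vec.fit_transform(docs)
    # Hand-crafted binary / ratio features, one row per document.
    extra = []
    for d in docs:
        toks = d.split()
        n = max(len(toks), 1)
        extra.append([
            float(any(t.lower() in SLANG for t in toks)),      # slang present
            float(any(t.lower() in BAD_WORDS for t in toks)),  # bad word present
            sum(t.isupper() for t in toks) / n,                # all-caps ratio
            sum(t.startswith("#") for t in toks) / n,          # hashtag ratio
        ])
    return hstack([X_words, csr_matrix(np.array(extra))])
```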
11. Classifier selection
● A dataset of 6,000 tweets on Ebola (3,000 from each class) is used to benchmark three classifiers:
○ Support Vector Machines
○ Logistic Regression
○ Naive Bayes
● Classifiers are trained on a random sample of 4,800 tweets; the remaining 1,200 are used as the test set
● Classifier parameters are found using 10-fold cross-validation
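The benchmarking procedure might look like the following scikit-learn sketch. The parameter grids, the use of `LinearSVC`, and the plain TF-IDF features are assumptions standing in for the full feature pipeline described earlier:

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

def benchmark(texts, labels):
    # 80/20 split, as in the slides (4,800 train / 1,200 test).
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=0.2, random_state=0)
    results = {}
    for name, clf, grid in [
        ("svm", LinearSVC(), {"linearsvc__C": [0.1, 1, 10]}),
        ("logreg", LogisticRegression(max_iter=1000),
         {"logisticregression__C": [0.1, 1, 10]}),
        ("nb", MultinomialNB(), {"multinomialnb__alpha": [0.1, 1.0]}),
    ]:
        pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
        # Parameters chosen by 10-fold cross-validation on the training set.
        search = GridSearchCV(pipe, grid, cv=10)
        search.fit(X_tr, y_tr)
        results[name] = search.score(X_te, y_te)  # held-out accuracy
    return results
```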
12. Classifier performance
● SVM was selected because it had higher recall than Logistic Regression
● A higher recall results in a larger fraction of newsworthy tweets being detected
13. Contribution from features
● Measured using an ablation test
● Features divided into three sets:
WRD - unigrams and bigrams
LEX - all other lexical features
POS - POS tag n-grams
14. Selection of the POS-tagger
● NLTK POS tagger
● Stanford tagger with the GATE Twitter model (L. Derczynski et al., 2013)
● SENNA tagger (Ronan Collobert, 2011) - a “deep” convolutional neural network based tagger
Eg: "Last US Ebola Patient Is Cured: Dr. Craig Spencer To Be Released… http://t.co/92JfMm2LaN | http://t.co/NoFij4iACl #news"
NLTK tagger:
[('Last', 'JJ'), ('US', 'NNP'), ('Ebola', 'NNP'), ('Patient', 'NNP'), ('Is', 'NNP'), ('Cured', 'NNP'), ('Dr', 'NNP'), ('Craig', 'NNP'), ('Spencer', 'NNP'), ('To', 'NNP'), ('Be', 'NNP'), ('Released', 'NNP'), ('\u2026', 'NNP'), ('|', 'NNP'), ('news', 'NN')]
16. Results
Data sets
● 1 million tweets containing the term ‘Ebola’
● 22,250 tweets related to the fifth Sri Lanka vs India ODI cricket match held on 16th November (objective - 465, subjective - 878)
○ Filtered using the terms “SLvIND”, “SLvsIND”, “INDvSL” and “INDvsSL”
● 6,800 tweets related to the fourth Sri Lanka vs England ODI cricket match held on 7th December (objective - 215, subjective - 242)
○ Filtered using the terms “SLvENG”, “SLvsENG”, “ENGvSL” and “ENGvsSL”
17. Gold standard data set
● A sample of 500 tweets on the topic ‘Ebola’ is annotated manually as objective or subjective (objective - 206, subjective - 294)
● The classifier is scored on this data
● Errors:
“RT @TheDailyEdge: UPDATE: Obama has reduced the US deficit by 70% and Ebola cases in the US by 100%.”
It is hard to judge the objectivity of such sentences based on syntactic information alone.
18. Comparison with prior research
● Event related tweet detection with user type recognition (L. Silva and E. Riloff, 2013)
○ A set of 6,000 tweets on disease outbreaks manually labeled using Amazon Mechanical Turk
● Twitter Sentiment Classification using Distant Supervision (A. Go, R. Bhayani and L. Huang, 2009)
○ An SVM model trained on syntactic features used for sentiment classification

Classifier                      Precision  Recall  F1-score
User type agnostic classifier   83.15      55.99   66.92
User type specific classifier   80.35      66.07   72.15

Features          Accuracy
Unigram + Bigram  81.6
Unigram + POS     81.9
19. Cross-domain applicability
● The classifier trained on Ebola tweets is applied to cricket-related tweets
● The classifier trained on SLvIndia match tweets performed well on SLvEngland tweets
20. Summarizer
● Duplicate and near-duplicate tweets are abundant due to retweets and tweets generated by ‘Tweet’ buttons on news sites
● Removes duplicates among the objective tweets detected by the classifier
● Tweets discussing the same entities are clustered together
21. Near-duplicate removal
● Objective tweets are stripped of the following symbols: ‘RT’, ‘@-mentions’ and punctuation
● Jaccard similarity of tokens is used to detect duplicate tweets
● Two tweets are considered duplicates if their Jaccard similarity is greater than a threshold d
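The Jaccard-based duplicate check can be sketched as a greedy filter over the stream; the threshold value below is illustrative, since the slides leave d unspecified:

```python
def jaccard(a, b):
    """Jaccard similarity of the token sets of two tweets."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def deduplicate(tweets, d=0.5):
    """Keep a tweet only if its Jaccard similarity to every
    already-kept tweet is at most the threshold d."""
    kept = []
    for t in tweets:
        if all(jaccard(t, k) <= d for k in kept):
            kept.append(t)
    return kept
```

Note the greedy pass is quadratic in the number of tweets; for large streams one would shard by shared tokens first, but that optimization is not described in the slides.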
22. Clustering
● The goal is to cluster together tweets mentioning the same entities
Eg: “#Miami #News NYC Doc Free of Ebola: Sources: Dr. Craig Spencer, the physician being treated for Ebola at Belle... http://t.co/iXSUk4axVV”
“#Ebola so the good doctor Craig Spencer will go home - well - the nurse too free to roam but lest we forget 3 countries still suffer deeply”
● Vectors of NER tags are converted to Tf-Idf scores, and cosine distance is used as the distance measure between two NER tag vectors
● DBSCAN is selected because the number of clusters need not be specified in advance and it can identify arbitrarily shaped clusters
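A minimal sketch of this clustering step with scikit-learn; `eps` and `min_samples` are illustrative values, and the NER output is assumed to arrive as one space-joined string of entity tags per tweet:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

def cluster_by_entities(entity_docs, eps=0.5, min_samples=2):
    """Cluster tweets by the named entities they mention.

    entity_docs: one space-joined string of NER tags per tweet,
    e.g. "Craig Spencer Ebola US".
    """
    X = TfidfVectorizer().fit_transform(entity_docs)
    # metric="cosine" makes DBSCAN use cosine *distance*
    # (1 - cosine similarity) between the Tf-Idf vectors.
    labels = DBSCAN(eps=eps, min_samples=min_samples,
                    metric="cosine").fit_predict(X)
    return labels  # label -1 marks tweets treated as noise
```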
23. Clustering - results
● The SVM classifier trained on the ebola-3000 data set is applied to a corpus of 24,038 unseen tweets retrieved on a single day (11-11-2014)
● 13,380 tweets were detected as objective, of which 8,138 were duplicates. Clustering resulted in 332 clusters, with 2,751 tweets labeled as noise
● Cluster quality depends on the quality of the Named Entity Recognizer
Entities: ['Craig', 'Ebola', 'Patient', 'Spencer', 'US']
24. Clustering - discussion
● In contrast, this tweet was labeled as noise:
“#Ebola Ebola Outbreak: US Free of Virus After New York Doctor Craig Spencer Cleared - International Business Times UK”
Entities: ['Business', 'Craig', 'Ebola', 'Free', 'International', 'New', 'Outbreak', 'Spencer', 'Times', 'US', 'Virus', 'York']
25. Future work
● Improve cross-domain applicability
○ Find better features with less dependence on the domain
● A better methodology for evaluating summaries
● Improve clustering to also consider verbs
● Generate an abstractive summary
○ Generate novel sentences from the information contained in tweets
● Generate summaries in real time