“Twitsum” : Automatic generation of
event summaries using microblog
streams
P.K.K.Madhawa
2012MCS044
Motivation - The problem with Twitter search
● Twitter ranks tweets based on
user interaction with them.
(number of retweets, favorites)
● Top results for the query
‘Ebola’ (25th November 2014)
● How to distinguish newsworthy
tweets drowned in a sea of
noise
Goal
● Distinguish newsworthy tweets based on syntactic features
without depending on manual annotations
● Group tweets discussing similar content together
Contributions
● A heuristic based scheme for annotating tweets as
subjective/objective
● A classifier capable of detecting objective tweets using only
the syntactic information of tweets
● An entity-centric tweet clustering algorithm
Twitter summarization - Earlier approaches
Sub-event detection based methods
● Use of a Hidden Markov Model to detect sub-events during an American football
match (D.Chakrabarti and K.Punera, 2011)
● Sub-event detection by identifying outlier peaks in the temporal distribution of
tweets on a topic. (Zubiaga et al., 2012)
Clustering based approaches
● A support platform for event detection using social intelligence (T.Baldwin, P.
Cook and B.Han, 2012)
○ Tweets are filtered using manually selected keywords
Design
● Tweet storage - stores
the set of tweets
downloaded using
streaming API
● Classifier - selection of
objective tweets
● Summarizer - removes
duplicates and clusters
the tweets based on their
similarity
Design - Objectivity detection
● Tweets are periodically
downloaded by querying
the public timeline using
Streaming API
● Structure of a tweet
object:
tweet text, user name, created time, geo
location, language code, favorite count,
retweeted_status, retweet count
Data collection
● Training data annotated using a heuristic
measure
● Objective - If the tweet is generated by a
verified profile
● Subjective - Tweets containing at least a
single emoticon or an emoji character
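The heuristic annotation above could be sketched as follows. This is a minimal illustration, not the deck's actual implementation: the `Tweet` class, its `verified` field, the rough emoticon and emoji patterns, and the rule that a tweet matching both heuristics is left unlabeled are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Rough, non-exhaustive patterns (assumptions); real inventories are much larger.
# The lookarounds keep ":/" inside "http://" from counting as an emoticon.
EMOTICON_RE = re.compile(r"(?<!\S)[:;=][\-o*']?[)\](\[dDpP/\\|](?!\S)")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


@dataclass
class Tweet:
    text: str
    verified: bool  # hypothetical field: whether the author's profile is verified


def heuristic_label(tweet: Tweet) -> Optional[str]:
    """Return 'objective', 'subjective', or None when no single rule applies."""
    has_marker = bool(EMOTICON_RE.search(tweet.text) or EMOJI_RE.search(tweet.text))
    if tweet.verified and not has_marker:
        return "objective"
    if has_marker and not tweet.verified:
        return "subjective"
    # Neither rule fires, or both do: leave the tweet out of the training set.
    return None
```

Tweets that receive `None` would simply be excluded from the training data under this sketch.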
Preprocessing
● All emoticons and emoji characters
are removed from the corpus
● User mentions are replaced with the
tag ‘MENTION’ (e.g. “@john said
this” converts to “MENTION said
this”)
● Punctuation symbols including the
pound(#) character are removed.
● URLs are replaced with the tag ‘URL’
(e.g. http://t.co/12d3 converts to URL)
● Numbers in a tweet are replaced by
the tag ‘NUMERIC’
● Remove stop words
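The preprocessing steps listed above could be chained roughly as below. This is a sketch under stated assumptions: the regular expressions, the order of the steps, and the tiny `STOP_WORDS` set are illustrative choices, not the ones used in the original system.

```python
import re
import string

# All patterns below are illustrative assumptions.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
EMOTICON_RE = re.compile(r"(?<!\S)[:;=][\-o*']?[)\](\[dDpP/\\|](?!\S)")
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)  # ASCII punctuation, includes '#'
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "in", "and", "this"}  # tiny sample list


def preprocess(text, stop_words=STOP_WORDS):
    """Apply the slide's preprocessing steps and return the remaining tokens."""
    text = EMOJI_RE.sub(" ", text)            # drop emoji characters
    text = EMOTICON_RE.sub(" ", text)         # drop emoticons such as :) or :-(
    text = URL_RE.sub(" URL ", text)          # URLs first, before punctuation is stripped
    text = MENTION_RE.sub(" MENTION ", text)  # @user -> MENTION
    text = NUMBER_RE.sub(" NUMERIC ", text)   # standalone numbers -> NUMERIC
    text = text.translate(PUNCT_TABLE)        # remove punctuation, including '#'
    return [t for t in text.split() if t.lower() not in stop_words]
```

Replacing URLs and mentions before stripping punctuation matters here, since otherwise the `:`, `/` and `@` characters that identify them would already be gone.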
Feature extraction
● Tweets are tokenized using TweetNLP
tokenizer (K. Gimpel, N. Schneider, and
B. O’Connor, 2011)
● Words are stemmed using Porter stemmer
● Stemmed unigrams, bigrams converted to
binary Tf-Idf values (with Laplace
smoothing)
● binary feature - presence of slang words
(using an external gazetteer)
● binary feature - presence of bad words
● Unigrams, bigrams and trigrams of POS
tags as binary Tf-Idf values
● Average number of misspelled words
● Average number of all-capital words
● Average number of hashtags
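A few of the features above can be illustrated in plain Python. The sketch assumes that the "average number" features are per-token ratios, that they are computed on the raw text (hashtags no longer exist after preprocessing), and that `SLANG` stands in for the external gazetteer. Tf-Idf weighting, stemming and POS tagging are omitted.

```python
SLANG = {"lol", "omg", "gonna"}  # hypothetical stand-in for the external slang gazetteer


def ngrams(tokens, n):
    """Return the set of n-grams present in a token list (binary presence)."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def lexical_features(raw_text, slang=SLANG):
    """Compute a few surface features from the raw (unpreprocessed) tweet text."""
    tokens = raw_text.split()
    n = max(len(tokens), 1)  # avoid division by zero on empty tweets
    words = [t.strip(".,!?:;\"'").lower() for t in tokens]
    all_caps = [t for t in tokens if len(t) > 1 and t.isalpha() and t.isupper()]
    hashtags = [t for t in tokens if len(t) > 1 and t.startswith("#")]
    return {
        "has_slang": int(any(w in slang for w in words)),
        "all_caps_ratio": len(all_caps) / n,
        "hashtag_ratio": len(hashtags) / n,
    }
```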
Classifier selection
● A dataset of 6,000 tweets on Ebola is used to
benchmark three classifiers (3,000 tweets
from each class)
○ Support Vector Machines
○ Logistic Regression
○ Naive Bayes
● Classifiers are trained on a random sample of
4,800 tweets; the remaining 1,200 are used as
the test set
● Classifier parameters are found using 10-fold
cross validation
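The 4,800 / 1,200 split and the 10-fold partition of the training set can be sketched with the standard library alone. The function names, the fixed seed and the integer stand-ins for tweets are assumptions for illustration; the split shown is not stratified by class.

```python
import random


def train_test_split(items, train_size, seed=0):
    """Shuffle a copy of the items and cut it into train and test parts."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


def k_fold_indices(n_items, k=10):
    """Yield (train_idx, val_idx) pairs; each item is in exactly one validation fold."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, val
        start += size


tweets = list(range(6000))                      # stand-ins for the 6,000 labeled tweets
train, test = train_test_split(tweets, 4800)    # 4,800 for training, 1,200 held out
folds = list(k_fold_indices(len(train), k=10))  # ten folds of 480 validation tweets
```

Each candidate parameter setting would be scored on the ten validation folds, and the held-out test set would be touched only once, for the final comparison.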
Classifier performance
● SVM was selected because it had higher recall than Logistic Regression
● A higher recall results in a larger fraction of newsworthy tweets being detected
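For reference, precision, recall and F1 for the objective class can be computed as in the sketch below. The label strings and the toy predictions are assumptions for the example, not results from the deck.

```python
def precision_recall_f1(gold, predicted, positive="objective"):
    """Compute precision, recall and F1 for one class from parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: 4 objective and 4 subjective tweets, with one error in each direction.
gold = ["objective"] * 4 + ["subjective"] * 4
pred = ["objective", "objective", "objective", "subjective",
        "objective", "subjective", "subjective", "subjective"]
scores = precision_recall_f1(gold, pred)
```

Recall is the fraction of truly objective tweets that the classifier recovers, which is why it is the deciding metric when the aim is to miss as few newsworthy tweets as possible.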
Contribution from features
● Measured using ablation test
● Features divided into three sets
WRD - unigram and bigrams
LEX - all other lexical features
Selection of the POS-tagger
● NLTK POS tagger
● Stanford tagger with GATE twitter model (L. Derczynski et al., 2013)
● SENNA tagger (Ronan Collobert, 2011) - “deep” recurrent convolutional neural
network based discriminant parser
Eg: "Last US Ebola Patient Is Cured: Dr. Craig Spencer To Be Released… http://t.co/92JfMm2LaN | http://t.co/NoFij4iACl #news"
NLTK tagger:
[('Last', 'JJ'), ('US', 'NNP'), ('Ebola', 'NNP'), ('Patient', 'NNP'), ('Is', 'NNP'), ('Cured', 'NNP'), ('Dr', 'NNP'), ('Craig',
'NNP'), ('Spencer', 'NNP'), ('To', 'NNP'), ('Be', 'NNP'), ('Released', 'NNP'), ('u2026', 'NNP'), ('|', 'NNP'), ('news',
'NN')]
Selection of the POS tagger...
"Last US Ebola Patient Is Cured: Dr. Craig Spencer To Be Released… http://t.co/92JfMm2LaN | http://t.co/NoFij4iACl #news"
SENNA tagger:
[('Last', 'JJ'), ('US', 'NNP'), ('Ebola', 'NNP'), ('Patient', 'NNP'), ('Is', 'VBZ'), ('Cured', 'VBN'), ('Dr', 'NNP'),
('Craig', 'NNP'), ('Spencer', 'NNP'), ('To', 'TO'), ('Be', 'VB'), ('Released', 'VBN'), ('u2026', 'JJ'), ('|', 'NN'), ('news',
'NN')]
Stanford tagger with GATE twitter model:
[('Last', 'JJ'), ('US', 'NNP'), ('Ebola', 'NNP'), ('Patient', 'NN'), ('Is', 'VBZ'), ('Cured', 'VBN'), ('Dr', 'NNP'), ('Craig',
'NNP'), ('Spencer', 'NNP'), ('To', 'TO'), ('Be', 'VB'), ('Released', 'VBN'), ('u2026', '.'), ('|', ':'), ('news', 'NN')]
Results
Data sets
● 1 million tweets containing the term ‘Ebola’
● 22,250 tweets related to the fifth Sri Lanka vs India ODI cricket match held on
16th November (objective- 465, subjective- 878)
○ Filtered using terms “SLvIND”, “SLvsIND”, “INDvSL” and “INDvsSL”.
● 6,800 tweets related to the fourth Sri Lanka vs England ODI cricket match held on
7th December (objective- 215, subjective- 242)
○ Filtered using terms “SLvENG”, “SLvsENG”, “ENGvSL” and “ENGvsSL”.
Gold standard data set
● A sample of 500 tweets on the topic ‘ebola’ is annotated manually as objective or
subjective (objective- 206, subjective- 294)
● Classifier scores on this data
● Errors:
“RT @TheDailyEdge: UPDATE: Obama has reduced the US deficit by 70% and Ebola cases in the
US by 100%.”
It is hard to judge the objectivity of such sentences based only on syntactic information.
Comparison with prior research
● Event related tweets detection with user type recognition (L. Silva, E. Riloff, 2013)
○ A set of 6,000 tweets on disease outbreaks manually labeled using Amazon Mechanical Turk
● Twitter Sentiment Classification using Distant Supervision (A. Go, R. Bhayani and
L. Huang, 2013)
○ An SVM model trained on syntactic features used for sentiment classification
Classifier                      Precision   Recall   F1-score
User type agnostic classifier   83.15       55.99    66.92
User type specific classifier   80.35       66.07    72.15

Features            Accuracy
Unigram + Bigram    81.6
Unigram + POS       81.9
Cross-domain applicability
● The classifier trained on Ebola tweets was applied to cricket-related tweets
● The classifier trained on SLvIndia match tweets performed well on SLvEngland tweets
Summarizer
● Duplicates and near-duplicate tweets are
abundant due to Retweets and tweets
generated by ‘Tweet’ buttons on news sites
● Removes duplicates in the objective tweets
detected by the classifier
● Tweets discussing the same entities are
clustered together
Near-duplicate removal
● Objective tweets are stripped of the following
symbols: ‘RT’, @-mentions and punctuation
● Jaccard similarity of tokens is used to detect
duplicate tweets
● Two tweets are considered near-duplicates if their
Jaccard similarity is greater than a threshold d
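Jaccard-based near-duplicate removal can be sketched as follows. The threshold value 0.8 is an assumption (the slides do not state the value of d), as is the greedy keep-first strategy; the pairwise comparison is quadratic and would need indexing or hashing for large streams.

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # treat two empty tweets as identical
    return len(a & b) / len(a | b)


def remove_near_duplicates(token_lists, threshold=0.8):
    """Keep a tweet only if it is not too similar to any tweet kept so far."""
    kept = []
    for tokens in token_lists:
        if all(jaccard(tokens, other) <= threshold for other in kept):
            kept.append(tokens)
    return kept


tweets = [
    ["ebola", "patient", "cured", "in", "us"],
    ["ebola", "patient", "cured", "in", "us"],         # exact duplicate (retweet)
    ["ebola", "patient", "cured", "in", "the", "us"],  # near-duplicate, similarity 5/6
    ["cricket", "match", "today"],
]
unique = remove_near_duplicates(tweets)
```

With the assumed threshold, `unique` keeps the first Ebola tweet and the cricket tweet.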
Clustering
● The goal is to cluster tweets mentioning the same entities together
Eg: “#Miami #News NYC Doc Free of Ebola: Sources: Dr. Craig Spencer, the
physician being treated for Ebola at Belle... http://t.co/iXSUk4axVV”
“#Ebola so the good doctor Craig Spencer will go home - well - the nurse too
free to roam but lest we forget 3 countries still suffer deeply”
● Vectors of NER tags are converted to Tf-Idf scores, and cosine distance is
used as the distance measure between two NER tag vectors
● DBSCAN is selected because it does not require the number of clusters to be
specified in advance and it can identify arbitrarily shaped clusters
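The clustering step can be illustrated with a small, dependency-free sketch: entity lists are turned into Tf-Idf vectors, compared with cosine distance, and grouped by a minimal DBSCAN. The smoothed IDF formula, the `eps` and `min_samples` values, and the toy entity lists are all assumptions; a production system would more likely use a library implementation.

```python
import math
from collections import Counter


def tfidf_vectors(entity_lists):
    """Map each tweet's entity list to a sparse Tf-Idf vector (dict)."""
    n_docs = len(entity_lists)
    df = Counter(e for ents in entity_lists for e in set(ents))
    # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1
    return [
        {e: c * (math.log((1 + n_docs) / (1 + df[e])) + 1) for e, c in Counter(ents).items()}
        for ents in entity_lists
    ]


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors; 1.0 if either is empty."""
    dot = sum(w * v.get(e, 0.0) for e, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def dbscan(vectors, eps=0.5, min_samples=2):
    """Return one label per vector: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(vectors)
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # expand only through core points
    return labels


entities = [
    ["Craig", "Spencer"],
    ["Craig", "Spencer", "NYC"],
    ["Obama", "Congress"],
    ["Obama", "Congress", "Senate"],
    ["Liberia"],
]
labels = dbscan(tfidf_vectors(entities))
```

In this toy run the two Craig Spencer tweets form one cluster, the two Obama tweets another, and the lone Liberia tweet is labeled noise (-1), mirroring how tweets with no close entity overlap end up unclustered.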
Clustering - results
● SVM classifier trained on ebola-3000 data set is applied on a corpus of 24,038
unseen tweets retrieved on a single day (11-11-2014)
● 13,380 tweets were detected as objective, 8,138 of them duplicates.
Clustering resulted in 332 clusters, with 2,751 tweets labeled as noise
● Clusters depend on the quality of the Named Entity Recognizer
Entities: ['Craig', 'Ebola', 'Patient', 'Spencer', 'US']
Clustering - discussion
● In contrast, this tweet was labeled as noise:
“#Ebola Ebola Outbreak: US Free of Virus After New York Doctor Craig
Spencer Cleared - International Business Times UK”
entities - ['Business', 'Craig', 'Ebola', 'Free', 'International', 'New', 'Outbreak', 'Spencer',
'Times', 'US', 'Virus', 'York']
Future work
● Improve cross-domain applicability
○ Finding better features with less dependence on the domain
● A better methodology to evaluate summaries
● Improve clustering to consider verbs also
● Generate an abstractive summary
○ Generate novel sentences from the information contained in tweets
● Generate summaries realtime

Más contenido relacionado

Destacado

Destacado (20)

Leveraging mobile network big data for urban planning
Leveraging mobile network big data for urban planningLeveraging mobile network big data for urban planning
Leveraging mobile network big data for urban planning
 
ABRA: Approximating Betweenness Centrality in Static and Dynamic Graphs with ...
ABRA: Approximating Betweenness Centrality in Static and Dynamic Graphs with ...ABRA: Approximating Betweenness Centrality in Static and Dynamic Graphs with ...
ABRA: Approximating Betweenness Centrality in Static and Dynamic Graphs with ...
 
An Introduction to Optimal Transport
An Introduction to Optimal TransportAn Introduction to Optimal Transport
An Introduction to Optimal Transport
 
AISTAT2016 SNFS
AISTAT2016 SNFSAISTAT2016 SNFS
AISTAT2016 SNFS
 
Probabilistic Data Structures and Approximate Solutions
Probabilistic Data Structures and Approximate SolutionsProbabilistic Data Structures and Approximate Solutions
Probabilistic Data Structures and Approximate Solutions
 
Neural word embedding as implicit matrix factorization の論文紹介
Neural word embedding as implicit matrix factorization の論文紹介Neural word embedding as implicit matrix factorization の論文紹介
Neural word embedding as implicit matrix factorization の論文紹介
 
Dbda勉強会~概要説明ochi20130803
Dbda勉強会~概要説明ochi20130803Dbda勉強会~概要説明ochi20130803
Dbda勉強会~概要説明ochi20130803
 
[DL輪読会]Unsupervised Learning of 3D Structure from Images
[DL輪読会]Unsupervised Learning of 3D Structure from Images[DL輪読会]Unsupervised Learning of 3D Structure from Images
[DL輪読会]Unsupervised Learning of 3D Structure from Images
 
[Dl輪読会]bridging the gaps between residual learning, recurrent neural networks...
[Dl輪読会]bridging the gaps between residual learning, recurrent neural networks...[Dl輪読会]bridging the gaps between residual learning, recurrent neural networks...
[Dl輪読会]bridging the gaps between residual learning, recurrent neural networks...
 
[DL輪読会]Learning to simplify fully convolutional networks for rough sketch
[DL輪読会]Learning to simplify fully convolutional networks for rough sketch[DL輪読会]Learning to simplify fully convolutional networks for rough sketch
[DL輪読会]Learning to simplify fully convolutional networks for rough sketch
 
A Gentle Introduction to Locality Sensitive Hashing with Apache Spark
A Gentle Introduction to Locality Sensitive Hashing with Apache SparkA Gentle Introduction to Locality Sensitive Hashing with Apache Spark
A Gentle Introduction to Locality Sensitive Hashing with Apache Spark
 
[DL輪読会]Learning convolutional neural networks for graphs
[DL輪読会]Learning convolutional neural networks for graphs[DL輪読会]Learning convolutional neural networks for graphs
[DL輪読会]Learning convolutional neural networks for graphs
 
Learning to remember rare events
Learning to remember rare eventsLearning to remember rare events
Learning to remember rare events
 
[DL輪読会]Let there be color
[DL輪読会]Let there be color[DL輪読会]Let there be color
[DL輪読会]Let there be color
 
[DL輪読会]TREE-STRUCTURED VARIATIONAL AUTOENCODER
[DL輪読会]TREE-STRUCTURED VARIATIONAL AUTOENCODER[DL輪読会]TREE-STRUCTURED VARIATIONAL AUTOENCODER
[DL輪読会]TREE-STRUCTURED VARIATIONAL AUTOENCODER
 
Improving Variational Inference with Inverse Autoregressive Flow
Improving Variational Inference with Inverse Autoregressive FlowImproving Variational Inference with Inverse Autoregressive Flow
Improving Variational Inference with Inverse Autoregressive Flow
 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word Embeddings
 
[Dl輪読会]dl hacks輪読
[Dl輪読会]dl hacks輪読[Dl輪読会]dl hacks輪読
[Dl輪読会]dl hacks輪読
 
[DL輪読会]Combining Fully Convolutional and Recurrent Neural Networks for 3D Bio...
[DL輪読会]Combining Fully Convolutional and Recurrent Neural Networks for 3D Bio...[DL輪読会]Combining Fully Convolutional and Recurrent Neural Networks for 3D Bio...
[DL輪読会]Combining Fully Convolutional and Recurrent Neural Networks for 3D Bio...
 
[DL輪読会]Image-to-Image Translation with Conditional Adversarial Networks
[DL輪読会]Image-to-Image Translation with Conditional Adversarial Networks[DL輪読会]Image-to-Image Translation with Conditional Adversarial Networks
[DL輪読会]Image-to-Image Translation with Conditional Adversarial Networks
 

Similar a Automatic generation of event summaries using microblog streams

Multilingual Tweet Intimacy Analysis using Bidirectional LSTM.pptx
Multilingual Tweet Intimacy Analysis using Bidirectional LSTM.pptxMultilingual Tweet Intimacy Analysis using Bidirectional LSTM.pptx
Multilingual Tweet Intimacy Analysis using Bidirectional LSTM.pptx
SAMIMAKTAR9
 
Questions about questions
Questions about questionsQuestions about questions
Questions about questions
moresmile
 
SubTopic Detection of Tweets Related to an Entity
SubTopic Detection of Tweets Related to an EntitySubTopic Detection of Tweets Related to an Entity
SubTopic Detection of Tweets Related to an Entity
Ankita Kumari
 
Monitoring-and-Predicting-Mental-Health-using-Morphological-and-Emotion-Analy...
Monitoring-and-Predicting-Mental-Health-using-Morphological-and-Emotion-Analy...Monitoring-and-Predicting-Mental-Health-using-Morphological-and-Emotion-Analy...
Monitoring-and-Predicting-Mental-Health-using-Morphological-and-Emotion-Analy...
MahmudulHaque71
 
Semantic Entity extraction from Sports Tweets
Semantic Entity extraction from Sports TweetsSemantic Entity extraction from Sports Tweets
Semantic Entity extraction from Sports Tweets
mitsmit
 
SENTIMENT ANALYSIS OF TWITTER DATA
SENTIMENT ANALYSIS OF TWITTER DATASENTIMENT ANALYSIS OF TWITTER DATA
SENTIMENT ANALYSIS OF TWITTER DATA
anargha gangadharan
 

Similar a Automatic generation of event summaries using microblog streams (20)

Extract Stressors for Suicide from Twitter Using Deep Learning
Extract Stressors for Suicide from Twitter Using Deep LearningExtract Stressors for Suicide from Twitter Using Deep Learning
Extract Stressors for Suicide from Twitter Using Deep Learning
 
An Ensemble Model for Cross-Domain Polarity Classification on Twitter
An Ensemble Model for Cross-Domain Polarity Classification on TwitterAn Ensemble Model for Cross-Domain Polarity Classification on Twitter
An Ensemble Model for Cross-Domain Polarity Classification on Twitter
 
Alz Hack II
Alz Hack IIAlz Hack II
Alz Hack II
 
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...
Who will RT this?: Automatically Identifying and Engaging Strangers on Twitte...
 
Multilingual Tweet Intimacy Analysis using Bidirectional LSTM.pptx
Multilingual Tweet Intimacy Analysis using Bidirectional LSTM.pptxMultilingual Tweet Intimacy Analysis using Bidirectional LSTM.pptx
Multilingual Tweet Intimacy Analysis using Bidirectional LSTM.pptx
 
Twitter Sentiment Analysis.pdf
Twitter Sentiment Analysis.pdfTwitter Sentiment Analysis.pdf
Twitter Sentiment Analysis.pdf
 
Social Networks analysis to characterize HIV at-risk populations - Progress a...
Social Networks analysis to characterize HIV at-risk populations - Progress a...Social Networks analysis to characterize HIV at-risk populations - Progress a...
Social Networks analysis to characterize HIV at-risk populations - Progress a...
 
Sentiment Analysis on Twitter
Sentiment Analysis on TwitterSentiment Analysis on Twitter
Sentiment Analysis on Twitter
 
wendi_ppt
wendi_pptwendi_ppt
wendi_ppt
 
Using Chaos to Disentangle an ISIS-Related Twitter Network
Using Chaos to Disentangle an ISIS-Related Twitter NetworkUsing Chaos to Disentangle an ISIS-Related Twitter Network
Using Chaos to Disentangle an ISIS-Related Twitter Network
 
Questions about questions
Questions about questionsQuestions about questions
Questions about questions
 
THE REACTION DATA ANALYSIS OFCOVID-19 VACCINATIONS
THE REACTION DATA ANALYSIS OFCOVID-19 VACCINATIONSTHE REACTION DATA ANALYSIS OFCOVID-19 VACCINATIONS
THE REACTION DATA ANALYSIS OFCOVID-19 VACCINATIONS
 
SubTopic Detection of Tweets Related to an Entity
SubTopic Detection of Tweets Related to an EntitySubTopic Detection of Tweets Related to an Entity
SubTopic Detection of Tweets Related to an Entity
 
Sentiment Analysis of Airline Tweets
Sentiment Analysis of Airline TweetsSentiment Analysis of Airline Tweets
Sentiment Analysis of Airline Tweets
 
Monitoring-and-Predicting-Mental-Health-using-Morphological-and-Emotion-Analy...
Monitoring-and-Predicting-Mental-Health-using-Morphological-and-Emotion-Analy...Monitoring-and-Predicting-Mental-Health-using-Morphological-and-Emotion-Analy...
Monitoring-and-Predicting-Mental-Health-using-Morphological-and-Emotion-Analy...
 
Twitter as a personalizable information service ii
Twitter as a personalizable information service iiTwitter as a personalizable information service ii
Twitter as a personalizable information service ii
 
Semantic Entity extraction from Sports Tweets
Semantic Entity extraction from Sports TweetsSemantic Entity extraction from Sports Tweets
Semantic Entity extraction from Sports Tweets
 
Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter Data
 
SENTIMENT ANALYSIS OF TWITTER DATA
SENTIMENT ANALYSIS OF TWITTER DATASENTIMENT ANALYSIS OF TWITTER DATA
SENTIMENT ANALYSIS OF TWITTER DATA
 
REAL TIME SENTIMENT ANALYSIS OF TWITTER DATA
REAL TIME SENTIMENT ANALYSIS OF TWITTER DATAREAL TIME SENTIMENT ANALYSIS OF TWITTER DATA
REAL TIME SENTIMENT ANALYSIS OF TWITTER DATA
 

Último

Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 

Último (20)

Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 

Automatic generation of event summaries using microblog streams

  • 1. “Twitsum” : Automatic generation of event summaries using microblog streams P.K.K.Madhawa 2012MCS044
  • 2. Motivation - The problem with Twitter search ● Twitter ranks tweets based on user interaction with them. (number of retweets, favorites) ● Top results for the query ‘Ebola’ (25th November 2014) ● How to distinguish newsworthy tweets drowned in a sea of noise
  • 3. Goal ● Distinguish newsworthy tweets based on syntactic features without depending on manual annotations ● Group tweets discussing the similar content together
  • 4. Contributions ● A heuristic based scheme for annotating tweets as subjective/objective ● A classifier capable of detecting objective tweets using only the syntactic information of tweets ● An entity-centric tweet clustering algorithm
  • 5. Twitter summarization - Earlier approaches Sub-event detection based methods ● Use of a Hidden Markov Model to detect sub-events during an American football match (D.Chakrabarti and K.Punera, 2011) ● Sub-event detection by identifying outlier peaks in the temporal distribution of tweets on a topic. (Zubiaga et al., 2012) Clustering based approaches ● A support platform for event detection using social intelligence (T.Baldwin, P. Cook and B.Han, 2012) ○ Tweets are filtered using manually selected keywords
  • 6. Design ● Tweet storage - stores the set of tweets downloaded using streaming API ● Classifier - selection of objective tweets ● Summarizer - removes duplicates and clusters the tweets based on their similarity
  • 7. Design - Objectivity detection ● Tweets are periodically downloaded by querying the public timeline using Streaming API ● Structure of a tweet object: tweet text, user name, created time, geo location, language code, favorite count, retweeted_status, retweet count
  • 8. Data collection ● Training data annotated using a heuristic measure ● Objective - If the tweet is generated by a verified profile ● Subjective - Tweets containing at least a single emoticon or an emoji character
  • 9. Preprocessing ● All emoticons and emoji characters are removed from the corpus ● User mentions are replaced with the tag ‘MENTION’ (eg: “@john said this” converts to “MENTION sad this”) ● Punctuation symbols including the pound(#) character are removed. ● Urls are replaced with the tag ‘URL’ (eg: http://t.co/12d3 converts to URL) ● Numbers in a tweet are replaced by the tag ‘NUMERIC’ ● Remove stop words
  • 10. Feature extraction ● Tweets are tokenized using TweetNLP tokenizer (K. Gimpel, N. Schneider, and B. O’Connor, 2011) ● Words are stemmed using Porter stemmer ● Stemmed unigrams, bigrams converted to binary Tf-Idf values (with Laplace smoothing) ● binary feature - presence of slang words (using an external gazetteer) ● binary feature - presence of bad words ● Unigrams, bigrams and trigrams of POS tags as binary Tf-Idf values ● Average number of misspelled words ● Average number of all-capital words ● Average number of hashtags
  • 11. Classifier selection ● A dataset of 6,000 tweets on Ebola is used to benchmark three classifiers (3,000 tweets from each class) ○ Support Vector Machines ○ Logistic Regression ○ Naive Bayes ● Classifiers trained on a random sample of 4800 tweets and remaining used as the test set. ● Classifier parameters are found using 10-fold cross validation
  • 12. Classifier performance ● SVM was selected because it had higher recall than Logistic Regression ● A higher recall results in a larger fraction of newsworthy tweets being detected
  • 13. Contribution from features ● Measured using ablation test ● Features divided into three sets WRD - unigram and bigrams LEX - all other lexical features
  • 14. Selection of the POS-tagger ● NLTK POS tagger ● Stanford tagger with GATE twitter model (L. Derczynski et al., 2013) ● SENNA tagger (Ronan Collobert, 2011) - “deep” recurrent convolutional neural network based discriminant parser Eg:"Last US Ebola Patient Is Cured: Dr. Craig Spencer To Be Released… http://t. co/92JfMm2LaN | http://t.co/NoFij4iACl #news" NLTK tagger: [('Last', 'JJ'), ('US', 'NNP'), ('Ebola', 'NNP'), ('Patient', 'NNP'), ('Is', 'NNP'), ('Cured', 'NNP'), ('Dr', 'NNP'), ('Craig', 'NNP'), ('Spencer', 'NNP'), ('To', 'NNP'), ('Be', 'NNP'), ('Released', 'NNP'), ('u2026', 'NNP'), ('|', 'NNP'), ('news', 'NN')]
  • 15. Selection of the POS tagger... "Last US Ebola Patient Is Cured: Dr. Craig Spencer To Be Released… http://t.co/92JfMm2LaN | http://t.co/NoFij4iACl #news" SENNA tagger: [('Last', 'JJ'), ('US', 'NNP'), ('Ebola', 'NNP'), ('Patient', 'NNP'), ('Is', 'VBZ'), ('Cured', 'VBN'), ('Dr', 'NNP'), ('Craig', 'NNP'), ('Spencer', 'NNP'), ('To', 'TO'), ('Be', 'VB'), ('Released', 'VBN'), ('u2026', 'JJ'), ('|', 'NN'), ('news', 'NN')] Stanford tagger with GATE twitter model: [('Last', 'JJ'), ('US', 'NNP'), ('Ebola', 'NNP'), ('Patient', 'NN'), ('Is', 'VBZ'), ('Cured', 'VBN'), ('Dr', 'NNP'), ('Craig', 'NNP'), ('Spencer', 'NNP'), ('To', 'TO'), ('Be', 'VB'), ('Released', 'VBN'), ('u2026', '.'), ('|', ':'), ('news', 'NN')]
  • 16. Results Data sets ● 1 million tweets containing the term ‘Ebola’ ● 22,250 tweets related to the fifth Sri Lanka vs India ODI cricket match held on 16th November (objective- 465, subjective- 878) ○ Filtered using the terms “SLvIND”, “SLvsIND”, “INDvSL” and “INDvsSL” ● 6,800 tweets related to the fourth Sri Lanka vs England ODI cricket match held on 7th December (objective- 215, subjective- 242) ○ Filtered using the terms “SLvENG”, “SLvsENG”, “ENGvSL” and “ENGvsSL”
  • 17. Gold standard data set ● A sample of 500 tweets on the topic ‘ebola’ is annotated manually as objective or subjective (objective- 206, subjective- 294) ● Classifier scores on this data ● Errors: “RT @TheDailyEdge: UPDATE: Obama has reduced the US deficit by 70% and Ebola cases in the US by 100%.” It is hard to judge the objectivity of such sentences based only on syntactic information.
  • 18. Comparison with prior research ● Event related tweets detection with user type recognition (L.Silva, E.Rillof, 2013) ○ A set of 6,000 tweets on disease outbreaks manually labeled using Amazon Mechanical Turk ● Twitter Sentiment Classification using Distant Supervision (A.Go, R.Bhayani and L.Huang, 2013) ○ An SVM model trained on syntactic features used for sentiment classification

    Classifier                      Precision  Recall  F1-score
    User type agnostic classifier   83.15      55.99   66.92
    User type specific classifier   80.35      66.07   72.15

    Features          Accuracy
    Unigram + Bigram  81.6
    Unigram + POS     81.9
  • 19. Cross-domain applicability ● The classifier trained on Ebola tweets is applied to cricket-related tweets ● The classifier trained on the SLvIndia match performed well on SLvEngland tweets
  • 20. Summarizer ● Duplicates and near-duplicate tweets are abundant due to Retweets and tweets generated by ‘Tweet’ buttons on news sites ● Removes duplicates in the objective tweets detected by the classifier ● Tweets discussing the same entities are clustered together
  • 21. Near-duplicate removal ● Objective tweets are stripped of the following symbols: ‘RT’, @-mentions and punctuation ● Jaccard similarity of tokens is used to detect duplicate tweets ● Two tweets are considered near-duplicates if their Jaccard similarity is greater than a threshold d
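A minimal sketch of near-duplicate removal with Jaccard similarity, J(A, B) = |A ∩ B| / |A ∪ B| over token sets. The threshold value `d=0.7`, the lowercasing step and the helper names are assumptions made for illustration; the slides do not state the value of d that was used.

```python
import re


def normalize(tweet):
    """Strip 'RT', @-mentions and punctuation, then return the token set."""
    text = re.sub(r"\bRT\b", " ", tweet)
    text = re.sub(r"@\w+", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return set(text.lower().split())   # lowercasing is an assumption


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def remove_near_duplicates(tweets, d=0.7):
    """Keep a tweet only if it is not too similar to any tweet already kept."""
    kept, kept_tokens = [], []
    for tweet in tweets:
        tokens = normalize(tweet)
        if all(jaccard(tokens, seen) <= d for seen in kept_tokens):
            kept.append(tweet)
            kept_tokens.append(tokens)
    return kept


tweets = [
    "RT @cnn: Ebola patient Craig Spencer is cured",
    "Ebola patient Craig Spencer is cured!",
    "Cricket match starts today",
]
print(remove_near_duplicates(tweets))
```

This greedy pass compares every tweet with all tweets kept so far, so it is quadratic in the worst case; that is acceptable for a sketch but may need an index such as MinHash for large streams.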
  • 22. Clustering ● The goal is to cluster tweets mentioning the same entities together Eg: “#Miami #News NYC Doc Free of Ebola: Sources: Dr. Craig Spencer, the physician being treated for Ebola at Belle... http://t.co/iXSUk4axVV” “#Ebola so the good doctor Craig Spencer will go home - well - the nurse too free to roam but lest we forget 3 countries still suffer deeply” ● Vectors of NER tags are converted to Tf-Idf scores, and cosine distance is used as the distance measure between two NER tag vectors ● DBSCAN is selected because it does not require the number of clusters in advance and it is capable of identifying arbitrarily shaped clusters
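The distance computation behind the clustering step can be sketched with the standard library alone. This shows only Tf-Idf weighting of entity lists and the cosine distance between them, plus the neighbourhood query that a density-based method such as DBSCAN relies on; it is not a full DBSCAN implementation. The smoothed IDF formula, the `eps` value and the entity lists are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(entity_lists):
    """Turn each tweet's entity list into a sparse Tf-Idf vector (a dict)."""
    n_docs = len(entity_lists)
    df = Counter(e for ents in entity_lists for e in set(ents))
    vectors = []
    for ents in entity_lists:
        tf = Counter(ents)
        vectors.append({
            # Assumed add-one smoothed IDF: ln((1 + N) / (1 + df)) + 1
            e: c * (math.log((1 + n_docs) / (1 + df[e])) + 1.0)
            for e, c in tf.items()
        })
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def neighbours(vectors, i, eps):
    """Indices of vectors within distance eps of vector i (excluding i)."""
    return [j for j, v in enumerate(vectors)
            if j != i and cosine_distance(vectors[i], v) <= eps]


# Hypothetical entity lists, as a named entity recognizer might return them.
entities = [
    ["Craig", "Spencer", "Ebola"],
    ["Craig", "Spencer", "Ebola", "US"],
    ["India", "Kohli"],
]
vecs = tfidf_vectors(entities)
print(neighbours(vecs, 0, eps=0.5))
```

Tweets whose entity vectors have no neighbours within `eps` are the ones a density-based clusterer would label as noise.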
  • 23. Clustering - results ● The SVM classifier trained on the ebola-3000 data set is applied to a corpus of 24,038 unseen tweets retrieved on a single day (11-11-2014) ● 13,380 tweets were detected as objective, and 8,138 of them as duplicates. Clustering resulted in 332 clusters, while 2,751 tweets were labeled as noise ● Clusters depend on the quality of the Named Entity Recognizer Entities: ['Craig', 'Ebola', 'Patient', 'Spencer', 'US']
  • 24. Clustering - discussion ● In contrast, this tweet is labeled as noise: “#Ebola Ebola Outbreak: US Free of Virus After New York Doctor Craig Spencer Cleared - International Business Times UK” entities - ['Business', 'Craig', 'Ebola', 'Free', 'International', 'New', 'Outbreak', 'Spencer', 'Times', 'US', 'Virus', 'York']
  • 25. Future work ● Improve cross-domain applicability ○ Find better features with less dependence on the domain ● A better methodology to evaluate summaries ● Improve clustering to consider verbs as well ● Generate an abstractive summary ○ Generate novel sentences from the information contained in tweets ● Generate summaries in real time