Group-13 Project 15 Sub event detection on social media
Learning from Twitter Hashtags: Leveraging Proximate Tags to Enhance Graph-based Keyphrase Extraction
1. Learning from Twitter Hashtags: Leveraging
Proximate Tags to Enhance Graph-based
Keyphrase Extraction
Abdelghani Bellaachia & Mohammed Al-Dhelaan
(Bell@gwu.edu, mdhelaan@gwu.edu)
Computer Science Department
George Washington University
Washington, DC, USA
2. Overview
• Twitter Introduction
• Why Extracting Keyphrases in Twitter?
• Learning from Twitter Hashtags
• Twitter Lexical Graph Expansion
• Proposed Approach for Graph Expansion
• How to Choose Hashtags
• Frequency Approach
• Hybrid Approach
• How to Build Lexical Graph
• Topic Modeling
• Graph-based Ranking Scheme
• Experiments
• Experimental Results
• Conclusion
3. Twitter Introduction
• Twitter is a micro-blogging social networking site.
• It enables users to blog or broadcast their thoughts and
messages.
• It gained popularity because of how quickly news spreads
through it.
• The main idea is that a user can follow the accounts of people or
organizations that seem interesting to the user.
• Once a user follows an account, all the news and tweets
issued by that account are shown in the user's timeline.
4. Tweets
• Tweets are the posts or messages broadcast by users.
• A tweet can include up to 140 characters.
• By nature, a tweet is meant to be broadcast to all of a user's
followers. However, it can be directed to a specific user with
the mention “@” feature.
• Tweets are generally public and anyone can view them, unless
the user has made his or her tweets private so that only
followers can see them (rarely used!).
• Tweets can include text, hashtags, mentions, or any
combination of them.
6. Hashtags
• Hashtags started as a user convention.
• They are used to index and organize tweets.
• They support trend discovery.
• Every hashtag is generally about a specific topic: including a
hashtag in a tweet directs that tweet to the topic, which has
its own audience.
• A tweet can contain multiple hashtags.
• A hashtag is a hyperlink to all tweets containing that hashtag.
7. Why Extracting Keyphrases
in Twitter?
• By 2011, Twitter had attracted over 200 million users, who
publish at least a billion tweets each week [2].
• With such a massive amount of user-generated text, summarizing
the topics in tweets becomes important.
• However, tweets are short text documents, so standard
summarization techniques are not applicable.
• Instead, extracting short keyphrases that represent the
topics in tweets can be an insightful approach.
8. Definitions
• Topical Tweets: the collection of tweets from which we
extract keyphrases. Also called the target set.
• Auxiliary Hashtag Tweets: the collection of tweets
gathered for a hashtag selected from the topical
tweets.
• In this research, we investigate whether expanding the
lexical graph of the topical tweets with auxiliary hashtag
tweets can improve the ranking of keyphrases extracted
from the target tweets.
9. Learning from Twitter Hashtags
• Tweets are short text documents.
• The shortage of text in tweets can be an obstacle when
trying to learn from it.
• However, tweets can contain an abundant number of links in
the form of hashtags.
• Can we improve the ranking using an auxiliary set of hashtag
tweets (external tweets)?
• How can we choose the best hashtags to fit the topic? Some
hashtags are general! Some are very specific!
• Can we expand the graph to include auxiliary hashtag tweets?
How does that affect the ranking?
10. Twitter Lexical Graph
Expansion
[Figure: the Target Tweets Set (tweets t) produces a Lexical Graph. Hashtags (H) found in the target tweets lead to an Auxiliary Tweets Set (tweets t), which is added to produce the Expanded Lexical Graph.]
11. Proposed Approach
• From a random collection of tweets:
• Identify topics
• Cluster tweets based on the topics found
• For every cluster (topic):
• Build a lexical graph to calculate word weights
• Expand the graph with auxiliary hashtag tweets
similar to the topic
• Generate keyphrases using the top keywords
• Rank the keyphrases
• Show the top 10 keyphrases
13. How to Choose Hashtags?
• Hashtags are user generated and vary in scope.
• Expanding the graph with the wrong hashtags (irrelevant or
overly general ones) can degrade the ranking.
• Two approaches to choosing hashtags for expanding the
graph:
• Frequency Approach – choose the most frequent
hashtag in each topical cluster of tweets (target
tweets).
• Hybrid Approach – measure the similarity between the
keywords of the tweets for each of the top-10 frequent
hashtags and the keywords of the target tweets.
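The Frequency Approach can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `top_hashtags`, the hashtag regular expression, the lowercasing of tags, and the sample tweets are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a hashtag is "#" followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")

def top_hashtags(tweets, n=10):
    """Return the n most frequent hashtags (lowercased) with their counts."""
    counts = Counter(
        tag.lower() for tweet in tweets for tag in HASHTAG_RE.findall(tweet)
    )
    return counts.most_common(n)

# Hypothetical topical cluster (target tweets).
cluster = [
    "Great game tonight #SuperBowl #NFL",
    "Who wins the #superbowl this year?",
    "Ads were the best part #SuperBowl #ads",
]

# Frequency Approach: take the single most frequent hashtag.
most_frequent = top_hashtags(cluster, n=1)[0][0]
```

Calling `top_hashtags(cluster, n=10)` instead would give the candidate list that the Hybrid Approach starts from.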
15. Hybrid Approach
[Figure: the keyword vector (k1, k2, k3, …, kn) of the Target Tweets is compared, using cosine similarity, with the keyword vector (k1, k2, k3, …, kn) of each of Hashtag 1 Tweets, Hashtag 2 Tweets, …, Hashtag 10 Tweets.]
K: keywords extracted from all tweets in the set
Select the most similar hashtag to expand the lexical graph
16. Hybrid Approach
• Let Target Tweets be a set of tweets {t1, t2, …, tn}
• From all tweets in the set, we have a vector of words
TT_terms = {k1, k2, …, kn}
[Figure: Target Tweets (t1, t2, t3, …, tn) mapped to TT_terms (k1, k2, k3, …, kn)]
• In the Target Tweets set, we have the set of hashtags
occurring across all its tweets. We call it
HashtagTitles = {h1, h2, …, hn}
17. Hybrid Approach
• For each hashtag in the HashtagTitles set = {h1, h2, …, hn},
we search Twitter for tweets with that hashtag that do not occur in the
Target Tweets set.
• The search results for each hashtag are grouped in a vector
of tweets called HT (Hashtag Tweets)
[Figure: each hashtag in HashtagTitles (h1, h2, h3, …, hn) maps to its own vector of tweets:]
h1 = Ht1, Ht2, …, Htn
h2 = Ht1, Ht2, …, Htn
:
hn = Ht1, Ht2, …, Htn
18. Hybrid Approach
• For each HT, we build a vector of words representing that
hashtag separately, which we call HT_terms
• We compute the cosine similarity between the two
vectors TT_terms and HT_terms
• Finally, we choose the most similar hashtag to expand the
graph with
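The selection step above can be sketched as follows. This is a minimal illustration under stated assumptions: term vectors are raw term counts (the slides do not say how terms are weighted), and the names `cosine_similarity`, `most_similar_hashtag`, and the sample data are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors (dict-like)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

def most_similar_hashtag(tt_terms, ht_terms_by_hashtag):
    """Pick the hashtag whose HT_terms vector is closest to TT_terms."""
    return max(
        ht_terms_by_hashtag,
        key=lambda h: cosine_similarity(tt_terms, ht_terms_by_hashtag[h]),
    )

# Hypothetical term counts for the target tweets and two candidate hashtags.
tt_terms = Counter(["game", "team", "score", "team"])
ht_terms = {
    "superbowl": Counter(["game", "team", "halftime"]),
    "news": Counter(["election", "vote", "team"]),
}
best = most_similar_hashtag(tt_terms, ht_terms)
```

In the Hybrid Approach, `ht_terms` would hold one entry per top-10 frequent hashtag.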
19. Hybrid Approach
• Measures the similarity between the content of the top frequent
hashtags' tweets and the content of the target tweets using cosine similarity
• The top-10 frequent hashtags are used since we assume
that the most relevant hashtag is frequent
• Selecting the most similar hashtag by cosine similarity
among the top-10 frequent hashtags combines both approaches,
which we expect to improve the accuracy of the selection
20. Hybrid Approach
• After selecting an auxiliary hashtag tweet set:
• Classify each of the hashtag's tweets as either relevant or
irrelevant
• by measuring the word overlap between the auxiliary tweet's
terms and the top-10 tf-idf terms in the target tweets
• If an auxiliary tweet contains at least two words from the top-10,
we classify it as relevant.
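The relevance rule above amounts to a set-overlap test. The sketch below assumes the top-10 tf-idf terms have already been computed elsewhere; the function name `is_relevant`, the parameter `min_overlap`, and the sample terms are hypothetical.

```python
def is_relevant(aux_tweet_terms, top_terms, min_overlap=2):
    """True if the auxiliary tweet shares at least min_overlap distinct
    words with the top tf-idf terms of the target tweets."""
    return len(set(aux_tweet_terms) & set(top_terms)) >= min_overlap

# Hypothetical top-10 tf-idf terms of the target tweets.
top10 = {"game", "team", "score", "bowl", "halftime",
         "ad", "coach", "field", "win", "player"}

# Keep only the auxiliary tweets classified as relevant.
aux_tweets = [["great", "game", "team"], ["game", "tonight"]]
relevant = [t for t in aux_tweets if is_relevant(t, top10)]
```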
21. How to Build Lexical Graph
• Let G=(V,E) be a weighted graph that represents the text
• Vertices V denote words
• We build an edge in E between every two words that
co-occur within a specific window size
• The weight of an edge between terms in the target tweets is
the frequency of their co-occurrence
• The frequency of co-occurrence shows how strong
the relationship between two nodes is
Edge_weight(Vi, Vj) = |co-occurrence(Vi, Vj)|
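A lexical graph of this kind can be built with nested dictionaries. The sketch below is one possible reading, with assumptions stated plainly: the graph is undirected, a window of size 2 means adjacent words, and the name `build_lexical_graph` and the sample tokens are hypothetical.

```python
from collections import defaultdict

def build_lexical_graph(token_lists, window=2):
    """Undirected co-occurrence graph.

    graph[u][v] is the number of times u and v appear within `window`
    consecutive tokens of the same tweet (window=2 means adjacent words).
    """
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in token_lists:
        for i, u in enumerate(tokens):
            for v in tokens[i + 1 : i + window]:
                if u == v:
                    continue  # no self-loops
                graph[u][v] += 1
                graph[v][u] += 1
    return graph

# Hypothetical preprocessed tweets (already tokenized and filtered).
tweets = [["super", "bowl", "game"], ["super", "bowl", "ad"]]
graph = build_lexical_graph(tweets, window=2)
```

Expanding the graph with auxiliary hashtag tweets would then mean passing the relevant auxiliary tweets through the same function and merging the counts; how the authors weight those added edges is not stated on this slide.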
24. Topic Modeling
• Latent Dirichlet Allocation (LDA) (D. M. Blei, A. Y. Ng, and
M. I. Jordan)
• Unsupervised model that identifies topics in a
collection of documents.
• A statistical model that uses the “bag of words”
assumption for each document.
• Documents are represented as probability
distributions over topics.
• Topics are represented as probability distributions
over a collection of words.
25. Topic Modeling
• Latent Dirichlet Allocation (LDA)
• Dirichlet priors α and β
• Multinomial distribution over topics θ
• Multinomial distribution over words φ
[Figure: LDA plate diagram with nodes α, θ, Z, w, φ, β and plates J and D]
26. Graph-based Ranking Scheme
• PageRank (Brin and Page, 1998)
• Voting idea!
• When a vertex links to another, it casts a vote for the
other vertex.
• The algorithm is recursive! The importance
of the vertex casting the vote determines the
importance of the vote.
• Node ranks are updated iteratively until convergence
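The voting idea can be illustrated with a small power-iteration sketch. The damping factor d = 0.85, the tolerance, the even redistribution of rank from nodes without out-links, and the toy graph are assumptions for the example rather than details taken from the slides.

```python
def pagerank(out_links, d=0.85, tol=1e-6, max_iter=100):
    """Iterate PageRank over a dict mapping each node to the nodes it links to."""
    nodes = set(out_links) | {v for targets in out_links.values() for v in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        new_rank = {v: (1.0 - d) / n for v in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if targets:
                # u splits its vote evenly among the vertices it links to
                share = d * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # assumption: a node without out-links votes for everyone
                for v in nodes:
                    new_rank[v] += d * rank[u] / n
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank

# Toy graph: a and b both vote for c, and c votes for a.
links = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

Here c collects two votes and ranks highest, and a outranks b because its only vote comes from the important vertex c.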
28. Graph-based Ranking Scheme
• TextRank (Mihalcea & Tarau, 2004)
• Create a graph for the text
• Words are represented as nodes (nouns and adjectives
only)
• Edges are the co-occurrences between words within a
window
• The frequency of co-occurring words is represented by the
edge weights
• TextRank uses edge weights to influence the rank
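A weighted TextRank update can be sketched on an undirected co-occurrence graph. The sketch follows the weighted score WS(Vi) = (1 - d) + d * Σj [w(j,i) / Σk w(j,k)] * WS(Vj) from the TextRank paper; the damping factor 0.85, the stopping rule, and the toy graph are assumptions for the example.

```python
def textrank(graph, d=0.85, tol=1e-6, max_iter=100):
    """Weighted TextRank on a symmetric graph given as graph[u][v] = weight."""
    # Total edge weight leaving each node, used to normalize its votes.
    strength = {v: sum(graph[v].values()) for v in graph}
    scores = {v: 1.0 for v in graph}
    for _ in range(max_iter):
        new_scores = {}
        for v in graph:
            # Each neighbor u gives v a share of its score proportional
            # to the weight of the edge between them.
            votes = sum(graph[u][v] / strength[u] * scores[u] for u in graph[v])
            new_scores[v] = (1.0 - d) + d * votes
        change = sum(abs(new_scores[v] - scores[v]) for v in graph)
        scores = new_scores
        if change < tol:
            break
    return scores

# Toy symmetric co-occurrence graph (hypothetical counts).
word_graph = {
    "super": {"bowl": 2},
    "bowl": {"super": 2, "game": 1, "ad": 1},
    "game": {"bowl": 1},
    "ad": {"bowl": 1},
}
word_scores = textrank(word_graph)
```

The heavier "super"–"bowl" edge gives "super" a larger share of the score of "bowl" than "game" or "ad" receive.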
30. Graph-based Ranking Scheme
• NE-Rank (Node-Edge Rank) (Bellaachia & Al-Dhelaan)
• Incorporates the node's weight into the formula
• Instead of using only node weights or only edge
weights, we try to use both features.
• In text, node weights are best represented by tf-idf, which
reflects the content of documents.
• PageRank focuses only on the relations between
objects, without the content.
• TextRank uses only the co-occurrence relation to
identify important words.
• NE-Rank takes the content into consideration through tf-idf
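The slides do not give the NE-Rank formula, so the following is only an illustrative guess at how node weights and edge weights might be combined: each node's edge-weighted vote total is scaled by its own weight. It should not be read as the authors' method; see [8] for the actual definition. The function name `node_edge_rank`, the assumption that node weights are tf-idf values normalized into (0, 1], and the sample data are all hypothetical.

```python
def node_edge_rank(graph, node_weight, d=0.85, tol=1e-6, max_iter=100):
    """Illustrative ranking that uses both node weights and edge weights.

    graph[u][v] is a symmetric edge weight; node_weight[v] is assumed to be
    a tf-idf value normalized into (0, 1].
    """
    strength = {v: sum(graph[v].values()) for v in graph}
    scores = {v: node_weight[v] for v in graph}
    for _ in range(max_iter):
        new_scores = {}
        for v in graph:
            # Edge-weighted votes, as in weighted TextRank.
            votes = sum(graph[u][v] / strength[u] * scores[u] for u in graph[v])
            # Assumption: the node's own weight scales its final score.
            new_scores[v] = node_weight[v] * ((1.0 - d) + d * votes)
        change = sum(abs(new_scores[v] - scores[v]) for v in graph)
        scores = new_scores
        if change < tol:
            break
    return scores

# Toy graph in which "game" and "ad" have the same edges but different tf-idf.
toy_graph = {
    "super": {"bowl": 2},
    "bowl": {"super": 2, "game": 1, "ad": 1},
    "game": {"bowl": 1},
    "ad": {"bowl": 1},
}
tfidf = {"super": 0.5, "bowl": 0.5, "game": 1.0, "ad": 0.1}
ne_scores = node_edge_rank(toy_graph, tfidf)
```

With edge weights alone, "game" and "ad" would tie; adding node weights separates them, which is the effect the slide describes.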
32. Experiment
• Crawled Twitter from 1/19/2012 to 2/6/2012
• The dataset has 31,227 tweets.
• 244,139 tokens
• 40,674 hashtags in tweets (4,079 unique hashtags).
• Hashtags were segmented into word tokens in the
tokenization step.
• We extracted 30 topics from the tweets.
• Let C be the collection of tweets, and 1..k the topics.
• Aggregate the tweets for topic k, yielding Ck
• C = C1 ∪ C2 ∪ … ∪ Ck
• Build a graph and extract keyphrases from every Ck
33. Experiment
• Preprocessing:
• Removed non-English tweets
• Removed URL links
• Normalized tweets from conversational style to
standard English; for example, “luv” became “love”
• Part-of-speech tagging to extract nouns and adjectives
only
• Stemming and stopword removal
34. Experiment
• Since NE-Rank showed better results than
other ranking methods in our previous research [8], we
used it to compare the ranking of 3 approaches:
• Single Approach: No graph expansion
• Expanded with hashtags – Frequency Approach
• Expanded with hashtags – Hybrid Approach
• We validated our results using an empirical
evaluation approach, described in the next slides
35. Experiment
• Since there are no gold-standard labels to compare against, we
designed an empirical evaluation approach that uses a
search engine to generate labels.
• To generate such labels, we searched Google using the top-5
LDA terms for each topic.
• We focused on only two fields of the search result
snippets: title and description
• If a keyphrase occurs in the search results, then
we consider it correct
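The labeling rule above leads directly to a precision measure over the top-10 keyphrases. The sketch below assumes the snippet titles and descriptions have already been collected as plain strings and that "occurs" means a case-insensitive substring match; both are assumptions, as are the name `precision_at_k` and the sample data.

```python
def precision_at_k(ranked_keyphrases, snippets, k=10):
    """Fraction of the top-k keyphrases that occur in the snippet text."""
    text = " ".join(snippets).lower()
    top = ranked_keyphrases[:k]
    if not top:
        return 0.0
    hits = sum(1 for phrase in top if phrase.lower() in text)
    return hits / len(top)

# Hypothetical snippet titles/descriptions and ranked keyphrases for one topic.
snippets = [
    "Super Bowl halftime show review",
    "The best ads of the big game",
]
ranked = ["super bowl", "halftime show", "pizza party", "stock market"]
score = precision_at_k(ranked, snippets, k=10)
```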
36. Experimental Results
Automatic Approach Using Search Engine (Top-10 Keyphrases)

Approach                                      Precision   BPref
Single NE-Rank                                0.40        0.67
Expanded with Hashtags – Frequency Approach   0.45        0.52
Expanded with Hashtags – Hybrid Approach      0.55        0.73
37. Conclusion
• Twitter Introduction
• Why Extracting Keyphrases in Twitter?
• Learning from Twitter Hashtags
• Twitter Lexical Graph Expansion
• Proposed Approach for Graph Expansion
• How to Choose Hashtags
• Frequency Approach
• Hybrid Approach
• How to Build Lexical Graph
• Topic Modeling
• Graph-based Ranking Scheme
• Experiments
• Experimental Results
• Conclusion
38. References
• [1] Liu, et al., 2010. “Automatic Keyphrase Extraction via Topic
Decomposition”
• [2] Lin, Snow, & Morgan “Smoothing Techniques
for Adaptive Online Language Models: Topic Tracking in Tweet
Streams,”
• [3] Liu, et al., 2011. “Why is “SXSW” Trending? Exploring Multiple Text
Sources For Twitter Topic Summarization”
• [4] X. Wan and J. Xiao, “Single document keyphrase extraction
using neighborhood knowledge,”
• [5] Weng, et al., 2010. “TwitterRank: Finding Topic-sensitive Influential
Twitterers”
• [6] Zhao, et al., 2011. “Topical Keyphrase Extraction from Twitter”
• [7] Mihalcea & Tarau, “Textrank: Bringing order into texts”
• [8] Bellaachia & Al-Dhelaan, “NE-Rank: A Novel Graph-based Keyphrase
Extraction in Twitter” in press