Learning from Twitter Hashtags: Leveraging
Proximate Tags to Enhance Graph-based
Keyphrase Extraction
Abdelghani Bellaachia & Mohammed Al-Dhelaan
(Bell@gwu.edu , mdhelaan@gwu.edu)

             Computer Science Department
             George Washington University
                 Washington, DC, USA
Overview
•   Twitter Introduction
•   Why Extracting Keyphrases in Twitter?
•   Learning from Twitter Hashtags
•   Twitter Lexical Graph Expansion
•   Proposed Approach for Graph Expansion
•   How to Choose Hashtags
     • Frequency Approach
     • Hybrid Approach
•   How to Build Lexical Graph
•   Topic Modeling
•   Graph-based Ranking Scheme
•   Experiments
•   Experimental Results
•   Conclusion
Twitter Introduction
• Twitter is a micro-blogging social network site.
• It enables users to blog or broadcast their thoughts and
  messages.
• It gained popularity because of how quickly news spreads
  through it.
• The main idea is that a user can follow the accounts of people
  or organizations that seem interesting to that user.
• Once a user follows an account, all the news and tweets
  issued by that account are shown in the user's timeline.
Tweets
• Tweets are the posts or messages broadcast by users.
• A tweet can contain at most 140 characters.
• By nature, a tweet is meant to be broadcast to all of a user's
  followers. However, it can be directed to a specific user with
  the mention “@” feature.
• Tweets are generally public and anyone can view them, unless
  the user has made them private so that only his/her followers
  can see them (rarely used!).
• Tweets can include text, hashtags, mentions, or any
  combination of them.
Tweets
• Example of a tweet containing a hashtag, text, and a link
  [Figure: screenshot of the example tweet]
Hashtags
• Hashtags started as a user convention.
• They are used to index and organize tweets.
• They support trend discovery.
• Each hashtag is generally about a specific topic: including a
  hashtag in a tweet directs that tweet to the topic, which has
  its own audience.
• A tweet may contain multiple hashtags.
• A hashtag is a hyperlink to all tweets containing that hashtag.
Why Extracting Keyphrases in Twitter?
• In 2011, Twitter had over 200 million users, who
  published at least a billion tweets each week [2].
• With such a massive amount of user-generated text, summarizing
  the topics in tweets becomes important.
• However, tweets are short text documents, so standard
  summarization techniques are not applicable.
• Instead, extracting short keyphrases that represent the
  topics in tweets can be an insightful approach.
Definitions
• Topical Tweets: the collection of tweets from which we
  extract keyphrases. Also called the target set.
• Auxiliary Hashtag Tweets: the collection of tweets
  gathered from a hashtag selected from the topical
  tweets.

• In this research, we investigate whether expanding the
  lexical graph of the topical tweets with auxiliary hashtag
  tweets can improve the ranking of keyphrases extracted
  from the target tweets.
Learning from Twitter Hashtags
• Tweets are short text documents.
• The scarcity of text in tweets can be an obstacle when
  trying to learn from it.
• However, tweets can contain an abundance of links in
  the form of hashtags.
• Can we improve the ranking using an auxiliary set of hashtag
  tweets (external tweets)?
• How can we choose the hashtags that best fit the topic? Some
  hashtags are general! Some are very specific!
• Can we expand the graph to include auxiliary hashtag tweets?
  How does that affect the ranking?
Twitter Lexical Graph Expansion
[Figure: the Target Tweets Set (tweets t) yields the Lexical Graph. Hashtags H found in the target tweets lead to an Auxiliary Tweets Set (tweets t), which is used to build the Expanded Lexical Graph.]
Proposed Approach
• From a random collection of tweets:
  • Identify topics
  • Cluster tweets based on the topics found
  • For every cluster (topic):
     • Build a lexical graph to calculate word weights
     • Expand the graph with auxiliary hashtag tweets
       similar to the topic
     • Generate keyphrases using the top keywords
     • Rank the keyphrases
     • Show the top 10 keyphrases
Proposed Approach for Graph Expansion
[Figure: diagram of the proposed graph expansion approach]
How to Choose Hashtags?
• Hashtags are user generated and vary in scope.
• Expanding the graph with the wrong hashtags (irrelevant or
  overly general ones) can degrade the ranking.
• Two approaches to choosing hashtags for expanding the
  graph:
   • Frequency Approach – choose the most frequent
     hashtag in each topical cluster of tweets (target
     tweets).
   • Hybrid Approach – measure the similarity between the
     keywords of the tweets of each of the top-10 most
     frequent hashtags and the keywords of the target tweets.
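
The Frequency Approach can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the hashtag regular expression, and the example tweets are all assumptions made for the sketch.

```python
import re
from collections import Counter

# Assumption: a hashtag is "#" followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def hashtag_counts(tweets):
    """Count how often each hashtag (lowercased) occurs in a list of tweet strings."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(tweet))
    return counts


def top_hashtags(tweets, k=10):
    """Return the k most frequent hashtags, most frequent first."""
    return [tag for tag, _ in hashtag_counts(tweets).most_common(k)]


def most_frequent_hashtag(tweets):
    """Frequency Approach: pick the single most frequent hashtag, or None."""
    top = top_hashtags(tweets, k=1)
    return top[0] if top else None


# Invented target tweets for one topical cluster.
target_tweets = [
    "Giants win! #superbowl #nfl",
    "What a halftime show #SuperBowl",
    "Parade on Tuesday #superbowl #nyc",
]
print(most_frequent_hashtag(target_tweets))
```

The same `top_hashtags` helper also yields the top-10 candidate list that the Hybrid Approach starts from.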
Frequency Approach
• The frequency approach does not always pick the right hashtag.
• Example: topic “Sandusky”
Hybrid Approach
[Figure: the keyword vector (k1, k2, …, kn) of the Target Tweets is compared, using cosine similarity, with the keyword vectors (k1, k2, …, kn) of Hashtag1 Tweets, Hashtag2 Tweets, …, Hashtag10 Tweets]
 K: keywords extracted from all tweets in the set
 Select the most similar hashtag to expand the lexical graph
Hybrid Approach
• Let Target Tweets be a set of tweets {t1, t2, …, tn}
• From all tweets in the set, we build a vector of words
  TT_terms = {k1, k2, …, kn}

    Target Tweets    TT_terms
    t1               k1
    t2               k2
    t3               k3
    .                .
    .                .
    tn               kn

• The Target Tweets set also gives us the set of hashtags
  occurring in its tweets. We call it
  HashtagTitles = {h1, h2, …, hn}
Hybrid Approach
• For each hashtag in the HashtagTitles set {h1, h2, …, hn},
  we search Twitter for all tweets with that hashtag that do not
  occur in the Target Tweets set.
• The search results for each hashtag are grouped in a vector
  of tweets called HT (Hashtag Tweets)

    HashtagTitles    HT
    h1               Ht1, Ht2, …, Htn
    h2               Ht1, Ht2, …, Htn
    .                .
    .                .
    hn               Ht1, Ht2, …, Htn
Hybrid Approach
• For each HT, we build a vector of words representing that
  hashtag, which we call HT_terms
• We compute the cosine similarity between the two
  vectors TT_terms and HT_terms
• Finally, we choose the most similar hashtag and expand the
  graph with its tweets
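
The selection step can be illustrated as follows. This is a sketch under stated assumptions: raw term frequencies stand in for whatever term weighting the authors used, tokenization is plain whitespace splitting, and `term_vector`, `cosine_similarity`, and `select_hashtag` are hypothetical names.

```python
import math
from collections import Counter


def term_vector(tweets):
    """Bag-of-words term-frequency vector over all tweets in a set."""
    return Counter(word for tweet in tweets for word in tweet.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_hashtag(target_tweets, hashtag_tweets):
    """Return (best hashtag, all scores).

    hashtag_tweets maps each candidate hashtag to its auxiliary tweets (HT).
    """
    tt_terms = term_vector(target_tweets)
    scores = {
        tag: cosine_similarity(tt_terms, term_vector(ht))
        for tag, ht in hashtag_tweets.items()
    }
    if not scores:
        return None, scores
    return max(scores, key=scores.get), scores


# Invented example data.
target = ["giants win the game", "great game by the giants"]
candidates = {
    "nfl": ["giants win big game", "the game was great"],
    "news": ["stock market falls", "election results today"],
}
best, scores = select_hashtag(target, candidates)
print(best, scores)
```

In this toy input the "news" tweets share no word with the target tweets, so their similarity is zero and "nfl" is selected.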
Hybrid Approach
• The hybrid approach uses cosine similarity to compare the
  content of the most frequent hashtags' tweets with the content
  of the target tweets
• Only the top-10 most frequent hashtags are considered, since we
  assume that the most relevant hashtag is frequent
• Selecting the most similar of the top-10 frequent hashtags
  combines both approaches, which improves the accuracy
  of the selection
Hybrid Approach
• After selecting an auxiliary hashtag tweet set:
  • Classify each of the hashtag’s tweets as either relevant or
    irrelevant
  • Relevance is measured by the word overlap between the terms
    of an auxiliary tweet and the top-10 tf-idf terms of the
    target tweets
  • If an auxiliary tweet contains at least two of the top-10
    terms, we classify it as relevant.
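
The relevance rule above amounts to a set-overlap test. The sketch below assumes the top-10 tf-idf terms of the target tweets have already been computed; the function names and the example terms are hypothetical.

```python
def is_relevant(aux_tweet_terms, top_terms, min_overlap=2):
    """True if the auxiliary tweet shares at least `min_overlap` distinct terms
    with the top tf-idf terms of the target tweets."""
    return len(set(aux_tweet_terms) & set(top_terms)) >= min_overlap


def filter_auxiliary(aux_tweets, top_terms, min_overlap=2):
    """Keep only the auxiliary tweets classified as relevant."""
    return [
        tweet
        for tweet in aux_tweets
        if is_relevant(tweet.lower().split(), top_terms, min_overlap)
    ]


# Invented top-10 terms and auxiliary tweets.
top_terms = ["giants", "game", "win", "bowl", "super",
             "halftime", "show", "parade", "patriots", "manning"]
aux_tweets = [
    "giants win again",          # shares "giants" and "win": relevant
    "what a game",               # shares only "game": irrelevant
    "traffic near the stadium",  # shares nothing: irrelevant
]
print(filter_auxiliary(aux_tweets, top_terms))
```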
How to Build Lexical Graph
• Let G = (V, E) be a weighted graph that represents the text
• Vertices V denote words
• We add an edge in E between two words if they
  co-occur within a specific window size
• The weight of an edge between terms in the target tweets is
  the frequency of their co-occurrence
• The co-occurrence frequency indicates how strong
  the relationship between two nodes is
           Edge_weight(Vi, Vj) = |co-occurrence|
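
A co-occurrence graph of this kind can be built as sketched below. The sketch assumes an undirected graph, already-tokenized tweets, and a window measured in tokens (a window of 2 links adjacent words only); `build_lexical_graph` is a hypothetical name.

```python
from collections import defaultdict


def build_lexical_graph(token_lists, window=2):
    """Return {(u, v): count} with u < v alphabetically.

    Two words are linked when they occur within `window` consecutive tokens
    of the same tweet; the edge weight counts how often that happens.
    """
    weights = defaultdict(int)
    for tokens in token_lists:
        for i, u in enumerate(tokens):
            # Words that follow u inside the same window.
            for v in tokens[i + 1 : i + window]:
                if u != v:
                    weights[tuple(sorted((u, v)))] += 1
    return dict(weights)


# Invented, already preprocessed tweets.
tweets = [["giants", "win", "game"], ["giants", "win"]]
graph = build_lexical_graph(tweets, window=2)
print(graph)  # {('giants', 'win'): 2, ('game', 'win'): 1}
```

Storing each edge under a sorted pair keeps the graph undirected, so "giants win" and "win giants" add to the same weight.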
Topic Modeling
• Latent Dirichlet Allocation (LDA) (D. M. Blei, A. Y. Ng, and
  M. I. Jordan)
   • An unsupervised model that identifies topics in a
     collection of documents.
   • A statistical model that makes the “bag of words”
     assumption for each document.
   • Each document is represented as a probability
     distribution over topics.
   • Each topic is represented as a probability distribution
     over the words of the collection.
Topic Modeling
•   Latent Dirichlet Allocation (LDA)
•   Dirichlet priors α and β
•   Multinomial distribution over topics θ
•   Multinomial distribution over words φ
[Figure: LDA plate diagram with nodes α, θ, z, w, β, φ and plates J and D]
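
For reference, the generative process behind these symbols is commonly written as below. This is the standard smoothed-LDA formulation, not a transcription of the slide; the indices d (document), k (topic), and j (word position) are notation assumed here.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of document } d \\
\varphi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution of topic } k \\
z_{d,j} &\sim \mathrm{Multinomial}(\theta_d) && \text{topic of the } j\text{-th word of } d \\
w_{d,j} &\sim \mathrm{Multinomial}(\varphi_{z_{d,j}}) && \text{the observed word}
\end{align*}
```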
Graph-based Ranking Scheme
• PageRank (Brin and Page, 1998)
  • Voting idea!
  • When a vertex links to another, it casts a vote for the
    other vertex.
  • The algorithm is recursive! The importance
    of the vertex casting the vote determines the
    importance of the vote.
  • Node ranks are updated iteratively until convergence
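
The voting idea is usually written as the recurrence below. This is the standard form as restated in the TextRank paper, given as a reminder rather than recovered from the slides; S, d, In, and Out are notation assumed here, with d the damping factor (commonly 0.85).

```latex
S(V_i) = (1 - d) + d \sum_{V_j \in \mathrm{In}(V_i)} \frac{S(V_j)}{\lvert \mathrm{Out}(V_j) \rvert}
```

Each vertex V_j splits its score evenly among the vertices it links to, so a vote from an important vertex with few outgoing links counts the most.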
Graph-based Ranking Scheme
• TextRank (Mihalcea & Tarau, 2004)
   • Creates a graph for the text
   • Nodes represent words (nouns and adjectives
     only)
   • Edges represent the co-occurrence of words within a
     window
   • Edge weights represent the frequency of
     co-occurrence
   • TextRank uses the edge weights to influence the rank
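
A weighted TextRank iteration on such a graph might look like the sketch below. It follows the weighted update from the TextRank paper, where each neighbor's vote is scaled by its edge weight divided by the neighbor's total edge weight. The undirected graph, the function name, the parameter defaults, and the toy graph are assumptions.

```python
def textrank(edges, d=0.85, tol=1e-6, max_iter=200):
    """Weighted TextRank on an undirected graph.

    edges: {(u, v): weight}. Returns {node: score}.
    """
    neighbors = {}
    for (u, v), w in edges.items():
        neighbors.setdefault(u, {})[v] = w
        neighbors.setdefault(v, {})[u] = w
    if not neighbors:
        return {}
    # Total edge weight attached to each node.
    strength = {n: sum(nb.values()) for n, nb in neighbors.items()}
    scores = {n: 1.0 for n in neighbors}
    for _ in range(max_iter):
        new = {
            n: (1 - d) + d * sum(w / strength[m] * scores[m] for m, w in nbrs.items())
            for n, nbrs in neighbors.items()
        }
        delta = max(abs(new[n] - scores[n]) for n in neighbors)
        scores = new
        if delta < tol:  # stop once the scores no longer change
            break
    return scores


# Invented star graph: "a" co-occurs with every other word.
edges = {("a", "b"): 1, ("a", "c"): 1, ("a", "d"): 1}
ranks = textrank(edges)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In the star graph the hub word receives votes from all three neighbors and therefore ranks first.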
Graph-based Ranking Scheme
• NE-Rank (Node-Edge Rank) (Bellaachia & Al-Dhelaan)
  • Incorporates the node’s weight into the formula
  • Instead of using only node weights or only edge
    weights, we use both features.
  • In text, node weights are best represented by tf-idf, which
    reflects the content of the documents.
  • PageRank focuses only on the relations between
    objects, without the content.
  • TextRank uses only the co-occurrence relation to
    identify important words.
  • NE-Rank takes the content into consideration through tf-idf
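
The NE-Rank formula itself did not survive in these slides, so the sketch below is only one plausible reading of the bullets above, not the authors' definition. It assumes the weighted TextRank update with both the random-jump term and the voting term multiplied by the node's tf-idf weight. Scaling the node weights into [0, 1] is a further assumption, made so that the iteration converges.

```python
def ne_rank(edges, node_weights, d=0.85, tol=1e-6, max_iter=200):
    """Assumed NE-Rank-style update on an undirected weighted graph.

    edges: {(u, v): co-occurrence weight}; node_weights: {word: tf-idf}.
    """
    neighbors = {}
    for (u, v), w in edges.items():
        neighbors.setdefault(u, {})[v] = w
        neighbors.setdefault(v, {})[u] = w
    if not neighbors:
        return {}
    # Assumption: scale node weights into [0, 1] so the iteration converges.
    top = max(node_weights.get(n, 0.0) for n in neighbors) or 1.0
    weight = {n: node_weights.get(n, 0.0) / top for n in neighbors}
    strength = {n: sum(nb.values()) for n, nb in neighbors.items()}
    scores = dict(weight)
    for _ in range(max_iter):
        new = {}
        for n, nbrs in neighbors.items():
            votes = sum(w / strength[m] * scores[m] for m, w in nbrs.items())
            # Node weight scales both the jump term and the edge-based votes.
            new[n] = (1 - d) * weight[n] + d * weight[n] * votes
        delta = max(abs(new[n] - scores[n]) for n in neighbors)
        scores = new
        if delta < tol:
            break
    return scores


# Invented triangle with equal edges; only the tf-idf weights differ.
edges = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1}
tfidf = {"a": 2.0, "b": 1.0, "c": 1.0}
ranks = ne_rank(edges, tfidf)
print(ranks)
```

With identical edge structure, the word with the higher tf-idf weight ends up ranked higher, which is the effect the bullets describe.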
Experiment
•   Crawled Twitter from 1/19/2012 to 2/6/2012
•   The dataset has 31,227 tweets
•   244,139 tokens
•   40,674 hashtags in tweets (4,079 unique hashtags)
•   Hashtags were segmented into word tokens in the
    tokenization step

•   We extracted 30 topics from the tweets
•   Let C be the collection of tweets and 1..k the topics
•   Aggregate the tweets of each topic k, yielding Ck
•   C = C1 ∪ C2 ∪ … ∪ Ck
•   Build a graph and extract keyphrases from every Ck
Experiment
• Preprocessing:
  • Removed non-English tweets
  • Removed URLs
  • Normalized tweets from conversational style to
    standard English, for example “luv” became “love”
  • Applied part-of-speech tagging to keep nouns and adjectives
    only
  • Applied stemming and stopword removal
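
Part of this pipeline can be sketched with the standard library alone. The sketch covers URL removal, slang normalization, and stopword removal. Language filtering, part-of-speech tagging, and stemming are left out because they need external tools. The slang lexicon and stopword list are tiny invented stand-ins, not the resources used in the experiment.

```python
import re

URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[a-z']+")

# Tiny illustrative resources (assumptions, not the authors' lists).
SLANG = {"luv": "love", "u": "you", "thx": "thanks"}
STOPWORDS = {"a", "an", "the", "is", "i", "you", "to", "of", "and", "this"}


def preprocess(tweet):
    """Lowercase, drop URLs, tokenize, normalize slang, remove stopwords."""
    text = URL_RE.sub(" ", tweet.lower())
    tokens = TOKEN_RE.findall(text)
    tokens = [SLANG.get(t, t) for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]


print(preprocess("I luv this game http://t.co/abc #NFL"))
# ['love', 'game', 'nfl']
```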
Experiment
• Since NE-Rank showed better results than
  other ranking methods in our previous research [8], we
  used it to compare the ranking of 3 approaches:
  • Single Approach: no graph expansion
  • Expanded with hashtags – Frequency Approach
  • Expanded with hashtags – Hybrid Approach

• We validated our results using the empirical
  evaluation approach described in the next slides
Experiment
• Since there are no gold labels to compare against, we
  designed an empirical evaluation approach that uses a
  search engine to generate labels.
• To generate the labels, we searched Google using the top-5
  LDA terms for each topic.
• We used only two fields of the search result snippets:
  title and description
• If a keyphrase occurs in the search results, we
  consider it correct
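
The precision part of this evaluation can be illustrated as follows. The sketch assumes the snippet titles and descriptions have already been fetched and that a keyphrase counts as correct when it appears as a case-insensitive substring of them. The exact matching rule is not stated in the slides, and the BPref computation is not covered here.

```python
def precision_at_k(keyphrases, snippets, k=10):
    """Fraction of the top-k keyphrases found in the search result snippets.

    keyphrases: ranked list of strings; snippets: list of title or
    description strings returned for the topic query.
    """
    top = keyphrases[:k]
    if not top:
        return 0.0
    text = " ".join(snippets).lower()
    hits = sum(1 for phrase in top if phrase.lower() in text)
    return hits / len(top)


# Invented ranked keyphrases and snippets for one topic.
ranked = ["super bowl", "giants win", "stock market", "halftime show"]
snippets = ["Giants win Super Bowl XLVI", "Halftime show review"]
print(precision_at_k(ranked, snippets))  # 0.75
```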
Experimental Results
Automatic Approach Using Search Engine
Top-10 Keyphrases

Method                                        Precision   BPref
Single NE-Rank                                0.40        0.67
Expanded with Hashtags – Frequency Approach   0.45        0.52
Expanded with Hashtags – Hybrid Approach      0.55        0.73
Conclusion
•   Twitter Introduction
•   Why Extracting Keyphrases in Twitter?
•   Learning from Twitter Hashtags
•   Twitter Lexical Graph Expansion
•   Proposed Approach for Graph Expansion
•   How to Choose Hashtags
     • Frequency Approach
     • Hybrid Approach
•   How to Build Lexical Graph
•   Topic Modeling
•   Graph-based Ranking Scheme
•   Experiments
•   Experimental Results
•   Conclusion
References
• [1] Liu, et al., 2010. “Automatic Keyphrase Extraction via Topic
  Decomposition”
• [2] Lin, Snow, & Morgan, “Smoothing Techniques
  for Adaptive Online Language Models: Topic Tracking in Tweet
  Streams”
• [3] Liu, et al., 2011. “Why is “SXSW” Trending? Exploring Multiple Text
  Sources for Twitter Topic Summarization”
• [4] Wan & Xiao, “Single Document Keyphrase Extraction
  Using Neighborhood Knowledge”
• [5] Weng, et al., 2010. “TwitterRank: Finding Topic-sensitive Influential
  Twitterers”
• [6] Zhao, et al., 2011. “Topical Keyphrase Extraction from Twitter”
• [7] Mihalcea & Tarau, 2004. “TextRank: Bringing Order into Texts”
• [8] Bellaachia & Al-Dhelaan, “NE-Rank: A Novel Graph-based Keyphrase
  Extraction in Twitter”, in press
The End

Thank You!
