Disambiguating Twitter Search
Kevin Teh
kkwteh@gmail.com
Insight Data Science Fellows Program
March 2013


Tuesday, February 26, 13
That’s not the python that I meant...




The solution? cluster-pluck.




cluster-pluck disambiguates Twitter search in real time



It works in Spanish too!




Tools
[Diagram with the labels: Word Filter, Web Application, 300,000 Tweets, Filter, User]
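The slide does not say how the word filter is built. One plausible reading is that the 300,000 tweets serve as a background corpus from which the most frequent words are collected. The sketch below illustrates that reading only; the function name `build_common_words` and the cutoff `top_n` are hypothetical and do not come from the slides.

```python
import re
from collections import Counter


def build_common_words(background_tweets, top_n=500):
    """Return the top_n most frequent lowercase words in a background corpus.

    Assumption: "common words" means "most frequent words in a large
    sample of tweets". The slides do not confirm this definition.
    """
    counts = Counter()
    for tweet in background_tweets:
        # [^\W\d_]+ matches runs of letters, including accented ones,
        # so the same tokenizer also handles Spanish text.
        counts.update(word.lower() for word in re.findall(r"[^\W\d_]+", tweet))
    return {word for word, _ in counts.most_common(top_n)}
```

The resulting set could then be passed to whatever step filters out common words before candidates are ranked.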
Algorithm
1. Read the query and download a corpus of 1500 tweets.
2. Count words.
3. Select potentially meaningful words:
   - filter out common words
   - rank remaining words by number of occurrences and select top 10
   - rank remaining words by rate of capitalization and select top 10
4. Cluster candidates into groups:
   - link two candidates if their relative proportion of co-occurrence is greater than 0.25
   - rank connected components by total occurrences and take top 3
5. Assign tweets to clusters.
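The algorithm slide can be read as a small pipeline. Below is a minimal sketch of it, not the author's implementation. It rests on several assumptions that the slides leave open: the candidate set is the union of the two top-10 lists; the "relative proportion of co-occurrence" of two words is the number of tweets containing both divided by the number of tweets containing the rarer one; a tweet is assigned to the cluster with which it shares the most words; and the list of common words is supplied by the caller. All function names are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(tweet):
    """Split a tweet into runs of letters, keeping the original case."""
    return re.findall(r"[^\W\d_]+", tweet)


def select_candidates(tweets, query, common_words, top_n=10):
    """Pick candidate words by occurrence count and by capitalization rate."""
    counts, capitalized = Counter(), Counter()
    skip = set(common_words) | {query.lower()}
    for tweet in tweets:
        for token in tokenize(tweet):
            word = token.lower()
            if word in skip:
                continue
            counts[word] += 1
            capitalized[word] += token[0].isupper()
    by_count = [w for w, _ in counts.most_common(top_n)]
    # Ties in capitalization rate are broken by occurrence count (assumption).
    by_caps = sorted(
        counts, key=lambda w: (capitalized[w] / counts[w], counts[w]), reverse=True
    )[:top_n]
    # Assumption: the two top-N lists are merged into one candidate set.
    return set(by_count) | set(by_caps), counts


def cluster_candidates(tweets, candidates, threshold=0.25):
    """Link co-occurring candidates and return the connected components."""
    present = [candidates & {t.lower() for t in tokenize(tw)} for tw in tweets]
    tweet_freq = Counter(w for words in present for w in words)
    pair_freq = Counter(
        pair for words in present for pair in combinations(sorted(words), 2)
    )
    neighbors = {w: set() for w in candidates}
    for (a, b), together in pair_freq.items():
        # Assumed definition of "relative proportion of co-occurrence".
        if together / min(tweet_freq[a], tweet_freq[b]) > threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)
    seen, components = set(), []
    for start in sorted(candidates):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first search over the link graph
            word = stack.pop()
            if word in component:
                continue
            component.add(word)
            stack.extend(neighbors[word] - component)
        seen |= component
        components.append(component)
    return components


def disambiguate(tweets, query, common_words, top_clusters=3):
    """Run the whole pipeline: candidates, clusters, tweet assignment."""
    candidates, counts = select_candidates(tweets, query, common_words)
    components = cluster_candidates(tweets, candidates)
    components.sort(key=lambda c: sum(counts[w] for w in c), reverse=True)
    clusters = components[:top_clusters]
    assignment = {i: [] for i in range(len(clusters))}
    for tweet in tweets:
        words = {t.lower() for t in tokenize(tweet)}
        overlaps = [len(words & cluster) for cluster in clusters]
        # Tweets that share no word with any cluster stay unassigned.
        if overlaps and max(overlaps) > 0:
            assignment[overlaps.index(max(overlaps))].append(tweet)
    return clusters, assignment
```

Calling `disambiguate(tweets, "python", common_words)` on a mixed corpus returns up to three word clusters together with the tweets assigned to each, for instance one group about programming and another about snakes. The letter-based tokenizer is not tied to English, so the same sketch can be run on Spanish tweets with a Spanish list of common words.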


Kevin Teh
          kkwteh@gmail.com



Math PhD -- May ’13, Topic: Noncommutative Geometry (Whatever that is)
B.A.Sc. -- April ’07, Engineering Science (Whatever that is)




