Entity Spotting in Informal Text
Meena Nagarajan
with Daniel Gruhl*, Jan Pieper*, Christine Robson*, Amit P. Sheth
Kno.e.sis, Wright State
IBM Research - Almaden, San Jose CA*
Thursday, October 29, 2009

Tracking Online Popularity
http://www.almaden.ibm.com/cs/projects/iis/sound/

• What is the buzz in the online music community?
• Ranking and displaying top X music artists, songs, tracks, albums...
• Spotting entities, despamming, sentiment identification, aggregation, top X lists...

Spotting music entities in user-generated content in online music forums (MySpace)

Chatter in Online Music Communities
http://knoesis.wright.edu/research/semweb/projects/music/

Goal: Semantic Annotation of artists, tracks, songs, albums...
MusicBrainz RDF

Ohh these sour times... rock!
Ohh these <track id=574623> sour times </track> ... rock!

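The annotation goal above can be illustrated with a minimal sketch. It assumes the spot (the title and its track id) is already known; the function name `annotate` and the exact tag spacing are illustrative choices, not the authors' implementation.

```python
import re


def annotate(comment: str, title: str, track_id: int) -> str:
    """Wrap the first case-insensitive occurrence of `title` in a <track> tag."""
    pattern = re.compile(r"\b" + re.escape(title) + r"\b", re.IGNORECASE)
    return pattern.sub(
        # Keep the user's original casing inside the tag.
        lambda m: f"<track id={track_id}> {m.group(0)} </track>",
        comment,
        count=1,
    )


print(annotate("Ohh these sour times... rock!", "Sour Times", 574623))
```

Matching is deliberately literal here; deciding whether the match really refers to the track is the disambiguation step the rest of the talk addresses.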
Multiple Senses in the same Domain
• 60 songs with Merry Christmas
• 3600 songs with Yesterday
• 195 releases of American Pie
• 31 artists covering American Pie

Caught AMERICAN PIE on cable so much fun!

Annotating UGC, other Challenges
• Several cultural named entities
  • artifacts of culture, common words in everyday language

LOVED UR MUSIC YESTERDAY!
Just showing some Love to you Madonna you are The Queen to me
Lily your face lights up when you smile!

Annotating UGC, other Challenges
• Informal text
  • slang, abbreviations, misspellings...
  • indifferent approach to grammar...
• Context-dependent terms
• Unknown distributions

Our Approach
Spotting and subsequent sense disambiguation of spots

Ohh these sour times... rock!
Ohh these <track id=574623> sour times </track> ... rock!

3.1 Ground Truth Data Set

• 3 artists: Madonna, Rihanna, Lily Allen
• 1858 spots (MySpace UGC) using naive spotter over MusicBrainz artist metadata
• Adjudicate if a spot is an entity or not (or inconclusive)
• Hand tagged by 4 authors

Our experimental evaluation focuses on user comments from the MySpace pages of three artists: Madonna, Rihanna and Lily Allen (see Table 2). The artists were selected to be popular enough to draw comment but different enough to provide variety. The entity definitions were taken from the MusicBrainz RDF (see Figure 1), which also includes some but not all common aliases and misspellings.

Table 2. Artists in the Ground Truth Data Set
Madonna      an artist with an extensive discography as well as a current album and concert tour
Rihanna      a pop singer with recent accolades including a Grammy Award and a very active MySpace presence
Lily Allen   an independent artist with song titles that include “Smile,” “Allright, Still”, “Naive”, and “Friday Night” who also generates a fair amount of buzz around her personal life not related to music

We establish a ground truth data set of 1858 entity spots for these artists (breakdown in Table 3). The data was obtained by crawling the artist’s MySpace page comments and identifying all exact string matches of the artist’s song titles. Only comments with at least one spot were retained. These spots were then hand ...

Table 3. Manual scoring agreements on naive entity spotter results.
Precision (best case    Artist            Good spots (agreement)   Bad spots (agreement)
for naive spotter)      (spots scored)    100%       75%           100%       75%
33%                     Rihanna (615)     165        18            351        8
73%                     Lily (523)        268        42            10         100
23%                     Madonna (720)     138        24            503        20

Experiments and Results

Experiments
All entities from MusicBrainz

1. Lightweight, edit-distance-based entity spotter

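A lightweight edit-distance spotter of the kind named above might look like the following sketch. It assumes Levenshtein distance over whitespace-separated word windows and a threshold `max_dist`; the slides do not say which distance, tokenization, or threshold the authors used, so all three are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def spot(comment: str, titles: list[str], max_dist: int = 1) -> list[tuple[str, str]]:
    """Return (matched text, title) pairs for word windows within max_dist edits."""
    words = comment.lower().split()
    hits = []
    for title in titles:
        n = len(title.split())
        for i in range(len(words) - n + 1):
            window = " ".join(words[i:i + n])
            if edit_distance(window, title.lower()) <= max_dist:
                hits.append((window, title))
    return hits


print(spot("i love smille", ["Smile"]))
print(spot("ohh these sour timez rock", ["Sour Times"]))
```

Punctuation is not stripped in this sketch, so "smile!" counts as one edit away from "smile"; a real spotter would normalize tokens first.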
Experiments
1. Naive spotter using all entities from all of MusicBrainz

This new Merry Christmas tune is so good!
? but which one ?
Disambiguate between the 60+ Merry Christmas entries in MusicBrainz

Experiments
2. Constrain set of possible entities from MusicBrainz
   - to increase spotting accuracy
   - constrain using cues from the comment to eliminate alternatives

This new Merry Christmas tune is so good!

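The effect of constraining the candidate set can be sketched as follows. The catalogue below is a made-up stand-in for MusicBrainz (the artist names are placeholders), and `build_index` and `candidates` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical miniature catalogue of (artist, title) pairs.
CATALOGUE = [
    ("Artist A", "Merry Christmas"),
    ("Artist B", "Merry Christmas"),
    ("Artist C", "Merry Christmas"),
    ("Artist A", "Yesterday"),
]


def build_index(catalogue):
    """Map each lower-cased title to the set of artists that have it."""
    index = defaultdict(set)
    for artist, title in catalogue:
        index[title.lower()].add(artist)
    return index


def candidates(comment, index, allowed_artists=None):
    """Map each title found in the comment to its remaining candidate artists."""
    text = comment.lower()
    found = {}
    for title, artists in index.items():
        if title in text:  # naive substring match
            if allowed_artists is not None:
                artists = artists & allowed_artists
            if artists:
                found[title] = sorted(artists)
    return found


INDEX = build_index(CATALOGUE)
comment = "This new Merry Christmas tune is so good!"
print(candidates(comment, INDEX))                  # every sense of the title
print(candidates(comment, INDEX, {"Artist B"}))    # after a restriction
```

With the full index the title stays ambiguous across three artists; once a cue narrows the allowed artists, a single candidate remains.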
Experiments
3. Eliminate non-music mentions
   Natural language and domain-specific cues

Your SMILE rocks!

Restricted Entity Spotting

2. Restricted Entity Spotting
• Investigating the relationship between number of entities used and spotting accuracy
• Understand systematic ways of scoping domain models for use in semantic annotation
• Experiments to gauge benefits of implementing particular constraints in annotator systems
  • harder artist age detector vs. easier gender detector?

2a. Random Restrictions

• From all of MusicBrainz (281890 artists, 6220519 tracks) to songs of one artist (for all three artists)
• Precision (best case for naive spotter): 33% (Rihanna), 73% (Lily Allen), 23% (Madonna)
• Domain restrictions of 10% of the RDF result in approximately 9.8 times improvement in precision

... sets of artists that are factors of 10 smaller (10%, 1%, etc.). These subsets always contain our three actual artists (Madonna, Rihanna and Lily Allen), because we are interested in simulating restrictions that remove invalid artists. The most restricted entity set contains just the songs of one artist (≈0.0001% of the MusicBrainz taxonomy). In order to rule out selection bias, we perform 200 random draws of sets of artists for each set size - a total of 1200 experiments. Figure 2 shows that the precision increases as the set of possible entities shrinks. For each set size, all 200 results are plotted and a best fit line has been added to indicate the average precision. Note that the figure is in log-log scale.

[Figure 2: log-log plot. x-axis: percent of the MusicBrainz taxonomy (.0001% to 100%); y-axis: precision of the spotter (.0001% to 100%). Series: Lily Allen's MySpace page, Rihanna's MySpace page, Madonna's MySpace page.]
Fig. 2. Precision of a naive spotter using differently sized portions of the MusicBrainz taxonomy to spot song titles on artist’s MySpace pages

We observe that the curves in Figure 2 conform to a power law formula, specifically a Zipf distribution ($1/n^{R^2}$). Zipf’s law was originally applied to demonstrate the Zipf distribution in frequency of words in natural language corpora, and has since been demonstrated in other corpora including web searches. Figure 2 shows that song titles in Informal English exhibit the same frequency characteristics as plain English. Furthermore, we can see that in the average case, a domain restriction to 10% of the MusicBrainz RDF will result in approximately a 9.8 times improvement in precision of a naive spotter.
This result is remarkably consistent across all three artists. The R2 values for the power lines on the three artists are 0.9776, 0.979, 0.9836, which gives a deviation of 0.61% in R2 value between spots on the three MySpace pages.

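A best-fit line in log-log space, as described for Figure 2, can be sketched with ordinary least squares on the logarithms. The data points below are synthetic and purely illustrative (they are not the measured results), and `fit_power_law` is a hypothetical helper name. Under a power law, a 9.8 times gain per tenfold restriction corresponds to a log-log slope of about -0.99, since 10^0.99 ≈ 9.8.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares line through (log10 x, log10 y); returns slope, intercept, R^2."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(lx, ly))
    ss_tot = sum((y - my) ** 2 for y in ly)
    return slope, intercept, 1 - ss_res / ss_tot


# Synthetic points: fraction of the taxonomy used vs. precision (illustrative only).
fractions = [1.0, 0.1, 0.01, 0.001]
precisions = [0.0003 * f ** -0.99 for f in fractions]

slope, intercept, r2 = fit_power_law(fractions, precisions)
gain_per_tenfold = 10 ** (-slope)  # precision multiplier when the entity set shrinks 10x
print(f"slope={slope:.3f}  R^2={r2:.4f}  gain per 10x restriction={gain_per_tenfold:.2f}")
```

On real measurements the points scatter around the line, so R^2 falls below 1; the values quoted in the text (0.9776, 0.979, 0.9836) indicate a close fit.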
2b. Real-world Constraints for Restrictions
“Happy 25th Rhi!” (eliminate using artist DOB - metadata in MusicBrainz)
“ur new album dummy is awesome” (eliminate using album release dates - metadata in MusicBrainz)

• Systematic scoping of the RDF
• Question: Do real-world constraints from metadata reduce size of the entity spot set in a meaningful way?
• Experiments: Derived manually and tested for usefulness

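Metadata-driven restrictions like the two examples above can be sketched as predicates over artist records. The `Artist` fields, the predicate names, and the sample records are all assumptions made for illustration; they do not reflect the MusicBrainz schema or the authors' code.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Artist:
    name: str
    birth_year: Optional[int] = None
    album_years: List[int] = field(default_factory=list)


Predicate = Callable[[Artist], bool]


def turned(age: int, comment_year: int) -> Predicate:
    """Artist whose birth year is consistent with turning `age` in `comment_year`."""
    return lambda a: a.birth_year == comment_year - age


def album_within(years: int, comment_year: int) -> Predicate:
    """Artist with at least one album released in the last `years` years."""
    return lambda a: any(comment_year - years <= y <= comment_year for y in a.album_years)


def restrict(artists: List[Artist], *predicates: Predicate) -> List[Artist]:
    """Keep only the artists that satisfy every predicate."""
    return [a for a in artists if all(p(a) for p in predicates)]


# Made-up records for illustration only.
ARTISTS = [
    Artist("Artist A", birth_year=1958, album_years=[1983, 2008]),
    Artist("Artist B", birth_year=1984, album_years=[2005, 2006, 2007]),
    Artist("Artist C", birth_year=1985, album_years=[2006, 2009]),
    Artist("Artist D", birth_year=1984, album_years=[1999]),
]

birthday = restrict(ARTISTS, turned(25, 2009))        # cue: "Happy 25th ..."
new_album = restrict(ARTISTS, album_within(1, 2009))  # cue: "ur new album ..."
both = restrict(ARTISTS, turned(25, 2009), album_within(2, 2009))
print([a.name for a in birthday], [a.name for a in new_album], [a.name for a in both])
```

Composing predicates keeps each constraint cheap to add, which fits the idea of trying whichever constraint is easiest to detect first.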
Real-world Constraints
Restrictions over MusicBrainz

We consider three classes of restrictions - career, age and album based ...

Key   Count     Restriction
Artist Career Length Restrictions - Applied to Madonna
B     22        80’s artists with recent (within 1 year) album
C     154       First album 1983
D     1,193     20-30 year career
Recent Album Restrictions - Applied to Madonna
E     6,491     Artists who released an album in the past year
F     10,501    Artists who released an album in the past 5 years
Artist Age Restrictions - Applied to Lily Allen
H     112       Artist born 1985, album in past 2 years
J     284       Artists born in 1985 (or bands founded in 1985)
L     4,780     Artists or bands under 25 with album in past 2 years
M     10,187    Artists or bands under 25 years old
Number of Album Restrictions - Applied to Lily Allen
K     1,530     Only one album, released in the past 2 years
N     19,809    Artists with only one album
Recent Album Restrictions - Applied to Rihanna
Q     83        3 albums exactly, first album last year
R     196       3+ albums, first album last year
S     1,398     First album last year
T     2,653     Artists with 3+ albums, one in the past year
U     6,491     Artists who released an album in the past year
Specific Artist Restrictions - Applied to each Artist
A     1         Madonna only
G     1         Lily Allen only
P     1         Rihanna only
Z     281,890   All artists in MusicBrainz
Table 4. The efficacy of various sample restrictions.

D. I’ve been your fan for 25 years!
M. Happy 25th

Real-world Constraints
• Applied different constraints to different artists
• Reduce potential entity spot size
• Run naive spotter
• Measure precision

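The four steps above can be sketched end to end on toy data. The catalogue, comments, and gold labels below are invented for illustration, and the substring-based `naive_spot` is an assumption about what "naive" means here.

```python
# Hypothetical catalogue: artist -> song titles.
CATALOGUE = {
    "Artist A": ["Smile"],
    "Artist B": ["Yesterday"],
    "Artist C": ["Love"],
}

COMMENTS = [
    "Smile is my favourite track",
    "loved ur music yesterday",
    "showing some love to you",
]

# Hand-labelled ground truth: only the first comment really mentions a song.
GOLD = {(COMMENTS[0], "Smile"): True}


def titles_for(catalogue, artists=None):
    """Titles of the allowed artists (all artists when `artists` is None)."""
    return [t for a, ts in catalogue.items() if artists is None or a in artists for t in ts]


def naive_spot(comments, titles):
    """Exact, case-insensitive substring matches of titles in comments."""
    return [(c, t) for c in comments for t in titles if t.lower() in c.lower()]


def precision(spots, gold):
    """Fraction of spots that the ground truth marks as real entity mentions."""
    if not spots:
        return 0.0
    return sum(1 for s in spots if gold.get(s, False)) / len(spots)


all_spots = naive_spot(COMMENTS, titles_for(CATALOGUE))
restricted_spots = naive_spot(COMMENTS, titles_for(CATALOGUE, {"Artist A"}))
print(precision(all_spots, GOLD), precision(restricted_spots, GOLD))
```

In this toy run the unrestricted spotter produces four spots of which one is valid, while restricting the catalogue to one artist leaves only the valid spot.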
Real-world Constraints
Rihanna: short career, recent album releases, 3 album releases etc.

“I heart your new album”
“I love all your 3 albums”
“You are most favorite new pop artist”

[Figure: log-log plot for Rihanna. x-axis: percentage scale from .0001% to 100%; y-axis: precision of the spotter, .0001% to 100%. Labelled points run from the entire MusicBrainz taxonomy to a naive spotter trained on only Rihanna songs.]

Real-world Constraints
Age restrictions, only one album, last year releases, extensive career etc...

[Figure: two log-log plots, one for Madonna and one for Lily Allen. x-axis: percentage scale from .0001% to 100%; y-axis: precision of the spotter, .0001% to 100%. In each plot, labelled points run from the entire MusicBrainz taxonomy to a naive spotter trained on only that artist's songs.]

Takeaways
• Real-world restrictions closely follow the distribution of random restrictions, conforming loosely to a Zipf distribution
• Confirms general effectiveness of limiting domain size regardless of restriction
• Choosing which constraints to implement is simple - pick whatever is easiest first
  • use metadata from the model to guide you

Non-music Mentions

Disambiguating Non-music References
UGC on Lily Allen’s page about her new track Smile

Got your new album Smile. Loved it!
Keep your SMILE on!

Binary Classification, SVM

Got your new album Smile. Loved it!
Keep your SMILE on!

Training data: 550 good spots, 550 bad spots
Test data: 120 good spots, 229 * 2 bad spots

Syntactic features                                                Notation-S
+ POS tag of s                                                    s.POS
  POS tag of one token before s                                   s.POSb
  POS tag of one token after s                                    s.POSa
  Typed dependency between s and sentiment word ∗                 s.POS-TDsent ∗
  Typed dependency between s and domain-specific term ∗           s.POS-TDdom ∗
  Boolean Typed dependency between s and sentiment ∗              s.B-TDsent ∗
  Boolean Typed dependency between s and domain-specific term ∗   s.B-TDdom ∗
Word-level features                                               Notation-W
+ Capitalization of spot s                                        s.allCaps
+ Capitalization of first letter of s                             s.firstCaps
+ s in Quotes                                                     s.inQuotes
Domain-specific features                                          Notation-D
  Sentiment expression in the same sentence as s                  s.Ssent
  Sentiment expression elsewhere in the comment                   s.Csent
  Domain-related term in the same sentence as s                   s.Sdom
  Domain-related term elsewhere in the comment                    s.Cdom
+ Refers to basic features, others are advanced features
∗ These features apply only to one-word-long spots.

Table 6. Features used by the SVM learner

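The word-level and domain-specific features of Table 6 are simple enough to sketch without an NLP toolkit. The word lists, the sentence splitter, and the exact reading of each feature below are assumptions for illustration; the syntactic (POS and typed-dependency) features are omitted because they need a parser.

```python
import re

# Hypothetical toy lexicons; the actual word lists are not given in the slides.
SENTIMENT_WORDS = {"love", "loved", "awesome", "rocks"}
DOMAIN_WORDS = {"album", "track", "song", "tune", "music"}
QUOTES = "\"'“”"


def words(text: str) -> set:
    return set(re.findall(r"[a-z']+", text.lower()))


def features(comment: str, start: int, end: int) -> dict:
    """Boolean features for the spot comment[start:end]."""
    spot = comment[start:end]
    # Crude sentence segmentation: runs of text ending in ., ! or ?
    spans = [(m.start(), m.end()) for m in re.finditer(r"[^.!?]+[.!?]*", comment)]
    s_start, s_end = next(((a, b) for a, b in spans if a <= start < b), (0, len(comment)))
    same = words(comment[s_start:s_end])
    rest = words(comment[:s_start] + " " + comment[s_end:])
    return {
        "s.allCaps": spot.isupper(),
        "s.firstCaps": spot[:1].isupper(),
        "s.inQuotes": (start > 0 and end < len(comment)
                       and comment[start - 1] in QUOTES and comment[end] in QUOTES),
        "s.Ssent": bool(same & SENTIMENT_WORDS),
        "s.Csent": bool(rest & SENTIMENT_WORDS),
        "s.Sdom": bool(same & DOMAIN_WORDS),
        "s.Cdom": bool(rest & DOMAIN_WORDS),
    }


c1 = "Got your new album Smile. Loved it!"
c2 = "Keep your SMILE on!"
f1 = features(c1, c1.index("Smile"), c1.index("Smile") + 5)
f2 = features(c2, c2.index("SMILE"), c2.index("SMILE") + 5)
print(f1)
print(f2)
```

For the two example comments, the first spot picks up a domain term in its sentence and a sentiment word elsewhere, while the second has neither; feature vectors like these would then be passed to a binary classifier such as an SVM.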
Most Useful Combinations
(ordered from precision intensive to recall intensive)
• 42-91  FP best: all features, other combinations
• 78-50  TP next best: word, domain, contextual (POS)
• 90-35  TP best: word, domain, contextual

Not all syntactic features are useless, contrary to general belief, wrt informal text

Naive MB spotter + NLP
• Annotate using naive spotter
  • best case baseline (artist is known)
• Follow with NLP analytics to weed out FPs
  • run on less than entire input data

PR tradeoffs: choosing feature combinations depending on end application requirement

[Figure: precision and recall (y-axis) against classifier accuracy splits (valid-invalid) (x-axis), starting from the naive spotter. Series: precision for Lily Allen, precision for Rihanna, precision for Madonna, recall (all three).]

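The precision/recall tradeoff can be worked through with a little arithmetic. The sketch below assumes that a "valid-invalid" split such as 90-35 means the classifier keeps 90% of the valid spots and rejects 35% of the invalid ones, and that recall is measured relative to the naive spotter's valid spots; both readings are assumptions, as is pooling the 100% and 75% agreement columns of Table 3 for Madonna. The resulting numbers are illustrative and are not taken from the chart.

```python
def precision_recall(good: int, bad: int, valid_acc: float, invalid_acc: float):
    """Precision and recall of the naive spots after a classifier filter.

    valid_acc:   fraction of good spots the classifier keeps
    invalid_acc: fraction of bad spots the classifier rejects
    """
    kept_good = good * valid_acc
    kept_bad = bad * (1.0 - invalid_acc)
    total = kept_good + kept_bad
    precision = kept_good / total if total else 0.0
    return precision, valid_acc


# Madonna counts from Table 3, pooling both agreement levels (an assumption).
GOOD, BAD = 138 + 24, 503 + 20

baseline = precision_recall(GOOD, BAD, 1.00, 0.00)          # naive spotter alone
recall_heavy = precision_recall(GOOD, BAD, 0.90, 0.35)      # 90-35 split
precision_heavy = precision_recall(GOOD, BAD, 0.42, 0.91)   # 42-91 split

for name, (p, r) in [("naive", baseline), ("90-35", recall_heavy), ("42-91", precision_heavy)]:
    print(f"{name}: precision={p:.3f} recall={r:.2f}")
```

Under these assumptions, the recall-intensive split keeps most valid spots at a modest precision gain, while the precision-intensive split more than doubles precision at the cost of over half the valid spots.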
Summary
• Real-time large-scale data processing
  • prohibits computationally intensive NLP techniques
• Simple inexpensive NL learners over a dictionary-based naive spotter can yield reasonable performance
  • restricting the taxonomy results in proportionally higher precision
• Spot + Disambiguate is a feasible approach for (especially cultural) NER in informal text

Thank You!
• Bing, Yahoo, Google: Meena Nagarajan
• Contact us
  • {dgruhl, jhpieper, crobson}@us.ibm.com, {meena, amit}@knoesis.org
• More about this work
  • http://www.almaden.ibm.com/cs/projects/iis/sound/
  • http://knoesis.wright.edu/researchers/meena

social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 

Último (20)

A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 

Destacado

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by HubspotMarius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

Destacado (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Entity Spotting in Informal Text

  • 1. Entity Spotting in Informal Text. Meena Nagarajan, with Daniel Gruhl*, Jan Pieper*, Christine Robson*, Amit P. Sheth. Kno.e.sis, Wright State; IBM Research - Almaden, San Jose CA*. Thursday, October 29, 2009
  • 2. Tracking Online Popularity: http://www.almaden.ibm.com/cs/projects/iis/sound/
  • 3. Tracking Online Popularity (http://www.almaden.ibm.com/cs/projects/iis/sound/) • What is the buzz in the online music community? • Ranking and displaying top X music artists, songs, tracks, albums... • Spotting entities, despamming, sentiment identification, aggregation, top X lists...
  • 4. Spotting music entities in user-generated content in online music forums (MySpace)
  • 5. Chatter in Online Music Communities: http://knoesis.wright.edu/research/semweb/projects/music/
  • 6. Goal: Semantic annotation of artists, tracks, songs, albums... against the MusicBrainz RDF. Example: "Ohh these sour times... rock!" becomes "Ohh these <track id=574623> sour times </track> ... rock!"
  • 7. Multiple Senses in the Same Domain • 60 songs with Merry Christmas • 3600 songs with Yesterday • 195 releases of American Pie • 31 artists covering American Pie • Example comment: "Caught AMERICAN PIE on cable so much fun!"
  • 8. Annotating UGC, other Challenges • Several cultural named entities • artifacts of culture, common words in everyday language • Example comments: "LOVED UR MUSIC YESTERDAY!", "Just showing some Love to you Madonna you are The Queen to me", "Lily your face lights up when you smile!"
  • 9. Annotating UGC, other Challenges • Informal text • slang, abbreviations, misspellings... • indifferent approach to grammar... • Context-dependent terms • Unknown distributions
  • 10. Our Approach: spotting and subsequent sense disambiguation of spots. Example: "Ohh these sour times... rock!" becomes "Ohh these <track id=574623> sour times </track> ... rock!"
  • 11. Ground Truth Data Set • 3 artists: Madonna, Rihanna, Lily Allen • 1858 spots (MySpace UGC) using naive spotter over MusicBrainz artist metadata • Adjudicate if a spot is an entity or not (or inconclusive) • Hand tagged by 4 authors
      Excerpt shown on the slide (3.1 Ground Truth Data Set): Our experimental evaluation focuses on user comments from the MySpace pages of three artists: Madonna, Rihanna and Lily Allen (see Table 2). The artists were selected to be popular enough to draw comment but different enough to provide variety. The entity definitions were taken from the MusicBrainz RDF (see Figure 1), which also includes some but not all common aliases and misspellings.
      Table 2. Artists in the Ground Truth Data Set
        Madonna: an artist with an extensive discography as well as a current album and concert tour
        Rihanna: a pop singer with recent accolades including a Grammy Award and a very active MySpace presence
        Lily Allen: an independent artist with song titles that include "Smile," "Allright, Still", "Naive", and "Friday Night" who also generates a fair amount of buzz around her personal life not related to music
      Excerpt shown on the slide: We establish a ground truth data set of 1858 entity spots for these artists (breakdown in Table 3). The data was obtained by crawling the artist's MySpace page comments and identifying all exact string matches of the artist's song titles. Only comments with at least one spot were retained. These spots were then hand ...
      Table 3. Manual scoring agreements on naive entity spotter results
        Artist (spots scored) | Good spots, 100% agreement | Good spots, 75% agreement | Bad spots, 100% agreement | Bad spots, 75% agreement | Precision (best case for naive spotter)
        Rihanna (615)         | 165 | 18 | 351 | 8   | 33%
        Lily (523)            | 268 | 42 | 10  | 100 | 73%
        Madonna (720)         | 138 | 24 | 503 | 20  | 23%
  • 12. Experiments and Results
  • 13. Experiments • 1. Lightweight, edit distance based entity spotter over all entities from MusicBrainz
  • 14. Experiments • 1. Naive spotter using all entities from all of MusicBrainz • Example: "This new Merry Christmas tune is so good!" But which one? • Disambiguate between the 60+ Merry Christmas entries in MusicBrainz
  • 15. Experiments • 2. Constrain the set of possible entities from MusicBrainz - to increase spotting accuracy - constrain using cues from the comment to eliminate alternatives • Example: "This new Merry Christmas tune is so good!"
  • 16. Experiments • 3. Eliminate non-music mentions • Natural language and domain-specific cues • Example: "Your SMILE rocks!"
  • 17. Restricted Entity Spotting
  • 18. 2. Restricted Entity Spotting • Investigating the relationship between the number of entities used and spotting accuracy • Understand systematic ways of scoping domain models for use in semantic annotation • Experiments to gauge the benefits of implementing particular constraints in annotator systems • harder artist age detector vs. easier gender detector?
  • 19. 2a. Random Restrictions • From all of MusicBrainz (281890 artists, 6220519 tracks) to songs of one artist (for all three artists) • Precision (best case for naive spotter): 33%, 73%, 23% • Domain restrictions of 10% of the RDF result in approximately 9.8 times improvement in precision
      Excerpt shown on the slide: ... sets of artists that are factors of 10 smaller (10%, 1%, etc). These subsets always contain our three actual artists (Madonna, Rihanna and Lily Allen), because we are interested in simulating restrictions that remove invalid artists. The most restricted entity set contains just the songs of one artist (≈0.0001% of the MusicBrainz taxonomy). In order to rule out selection bias, we perform 200 random draws of sets of artists for each set size - a total of 1200 experiments. Figure 2 shows that the precision increases as the set of possible entities shrinks. For each set size, all 200 results are plotted and a best fit line has been added to indicate the average precision. Note that the figure is in log-log scale.
      [Figure: precision of the spotter against percent of the MusicBrainz taxonomy, log-log scale, with points and power-law trend lines for Lily Allen, Rihanna and Madonna]
      Fig. 2. Precision of a naive spotter using differently sized portions of the MusicBrainz taxonomy to spot song titles on artist's MySpace pages
      Excerpt shown on the slide: We observe that the curves in Figure 2 conform to a power law formula, specifically a Zipf distribution (1/n^{R^2}). Zipf's law was originally applied to demonstrate the Zipf distribution in frequency of words in natural language corpora, and has since been demonstrated in other corpora including web searches. Figure 2 shows that song titles in Informal English exhibit the same frequency characteristics as plain English. Furthermore, we can see that in the average case, a domain restriction of 10% of the MusicBrainz RDF will result in approximately a 9.8 times improvement in precision of a naive spotter. This result is remarkably consistent across all three artists. The R^2 values for the power lines on the three artists are 0.9776, 0.979, 0.9836, which gives a deviation of 0.61% in R^2 value between spots on the three MySpace pages.
  • 20. 2b. Real-world Constraints for Restrictions • "Happy 25th Rhi!" (eliminate using artist DOB, metadata in MusicBrainz) • "ur new album dummy is awesome" (eliminate using album release dates, metadata in MusicBrainz) • Systematic scoping of the RDF • Question: Do real-world constraints from metadata reduce the size of the entity spot set in a meaningful way? • Experiments: constraints derived manually and tested for usefulness
  • 21. Real-world Constraints • Restrictions over MusicBrainz • Example comments: D. "I've been your fan for 25 years!"; M. "Happy 25th" • We consider three classes of restrictions - career, age and album based
      Table 4. The efficacy of various sample restrictions
        Key | Count   | Restriction
        Artist Career Length Restrictions - Applied to Madonna
        B   | 22      | 80's artists with recent (within 1 year) album
        C   | 154     | First album 1983
        D   | 1,193   | 20-30 year career
        Recent Album Restrictions - Applied to Madonna
        E   | 6,491   | Artists who released an album in the past year
        F   | 10,501  | Artists who released an album in the past 5 years
        Artist Age Restrictions - Applied to Lily Allen
        H   | 112     | Artist born 1985, album in past 2 years
        J   | 284     | Artists born in 1985 (or bands founded in 1985)
        L   | 4,780   | Artists or bands under 25 with album in past 2 years
        M   | 10,187  | Artists or bands under 25 years old
        Number of Album Restrictions - Applied to Lily Allen
        K   | 1,530   | Only one album, released in the past 2 years
        N   | 19,809  | Artists with only one album
        Recent Album Restrictions - Applied to Rihanna
        Q   | 83      | 3 albums exactly, first album last year
        R   | 196     | 3+ albums, first album last year
        S   | 1,398   | First album last year
        T   | 2,653   | Artists with 3+ albums, one in the past year
        U   | 6,491   | Artists who released an album in the past year
        Specific Artist Restrictions - Applied to each Artist
        A   | 1       | Madonna only
        G   | 1       | Lily Allen only
        P   | 1       | Rihanna only
        Z   | 281,890 | All artists in MusicBrainz
  • 22. Real-world Constraints • Applied different constraints to different artists • Reduce potential entity spot size • Run naive spotter • Measure precision
  • 23. Real-world Constraints • Rihanna: short career, recent album releases, 3 album releases etc. • Example comments: "I heart your new album", "I love all your 3 albums", "You are most favorite new pop artist" • [Figure: log-log plot of the precision of the spotter under the restrictions applied to Rihanna]
  • 24. Real-world Constraints • Age restrictions, only one album, last year releases, extensive career etc. • [Figure: two log-log panels, Madonna and Lily Allen, each showing the precision of the spotter under the restrictions applied to that artist]
  • 25. Takeaways • Real-world restrictions closely follow the distribution of random restrictions, conforming loosely to a Zipf distribution • Confirms the general effectiveness of limiting domain size regardless of restriction • Choosing which constraints to implement is simple - pick whatever is easiest first • use metadata from the model to guide you
  • 27. Disambiguating Non-music References • UGC on Lily Allen's page about her new track Smile • "Got your new album Smile. Loved it!" • "Keep your SMILE on!"
  • 28. Binary Classification, SVM • Examples: "Got your new album Smile. Loved it!", "Keep your SMILE on!" • Training data: 550 good spots, 550 bad spots • Test data: 120 good spots, 229 * 2 bad spots
      Table 6. Features used by the SVM learner
        Syntactic features (Notation-S)
        + POS tag of s | s.POS
        POS tag of one token before s | s.POSb
        POS tag of one token after s | s.POSa
        Typed dependency between s and sentiment word | s.POS-TDsent ∗
        Typed dependency between s and domain-specific term | s.POS-TDdom ∗
        Boolean typed dependency between s and sentiment | s.B-TDsent ∗
        Boolean typed dependency between s and domain-specific term | s.B-TDdom ∗
        Word-level features (Notation-W)
        + Capitalization of spot s | s.allCaps
        + Capitalization of first letter of s | s.firstCaps
        + s in quotes | s.inQuotes
        Domain-specific features (Notation-D)
        Sentiment expression in the same sentence as s | s.Ssent
        Sentiment expression elsewhere in the comment | s.Csent
        Domain-related term in the same sentence as s | s.Sdom
        Domain-related term elsewhere in the comment | s.Cdom
        + Refers to basic features; others are advanced features. ∗ These features apply only to one-word-long spots.
  • 29. Most Useful Combinations • Precision intensive: FP best: all features, other combinations (42-91) • TP next best: word, domain, contextual (POS) (78-50) • Recall intensive: TP best: word, domain, contextual (90-35) • Not all syntactic features are useless, contrary to general belief, wrt informal text
  • 30. Naive MB spotter + NLP • Annotate using naive spotter • best case baseline (artist is known) • follow with NLP analytics to weed out FPs • run on less than entire input data • PR tradeoffs: choosing feature combinations depending on end application requirement • [Figure: precision for Lily Allen, Rihanna and Madonna, and recall for all three, for the naive spotter and for the classifier at different accuracy splits]
  • 31. Summary • Real-time large-scale data processing • prohibits computationally intensive NLP techniques • Simple, inexpensive NL learners over a dictionary-based naive spotter can yield reasonable performance • restricting the taxonomy results in proportionally higher precision • Spot + Disambiguate is a feasible approach for (especially cultural) NER in informal text
  • 32. Thank You! • Bing, Yahoo, Google: Meena Nagarajan • Contact us • {dgruhl, jhpieper, crobson}@us.ibm.com, {meena, amit}@knoesis.org • More about this work • http://www.almaden.ibm.com/cs/projects/iis/sound/ • http://knoesis.wright.edu/researchers/meena
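
The naive spotter and the entity-set restrictions from slides 13 to 22 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the catalog is a toy stand-in for MusicBrainz metadata, and the names `naive_spot`, `restrict`, `precision` and the `born_or_founded` field are hypothetical. The slides mention an edit distance based spotter; this sketch substitutes plain case-insensitive exact matching.

```python
import re

def naive_spot(comment, titles):
    """Return (title, start, end) for every case-insensitive exact title match."""
    spots = []
    for title in titles:
        pattern = r"\b" + re.escape(title) + r"\b"
        for m in re.finditer(pattern, comment, flags=re.IGNORECASE):
            spots.append((title, m.start(), m.end()))
    return sorted(spots, key=lambda s: s[1])

def restrict(catalog, keep_artist):
    """Keep only the titles whose artist metadata passes the predicate."""
    return [title for title, artist in catalog if keep_artist(artist)]

def precision(spots, is_good):
    """Fraction of spots judged to be real entity mentions (0.0 if no spots)."""
    if not spots:
        return 0.0
    return sum(1 for s in spots if is_good(s)) / len(spots)

# Toy catalog (assumed data, not taken from MusicBrainz).
catalog = [
    ("Yesterday", {"name": "The Beatles", "born_or_founded": 1960}),
    ("Smile", {"name": "Lily Allen", "born_or_founded": 1985}),
]
comment = "Loved ur music yesterday! Smile is on repeat"

all_titles = [title for title, _ in catalog]
unrestricted = naive_spot(comment, all_titles)

# Restriction in the spirit of key J in Table 4: born or founded in 1985.
restricted_titles = restrict(catalog, lambda a: a["born_or_founded"] == 1985)
restricted = naive_spot(comment, restricted_titles)

# Hand judgment for this toy comment: only "Smile" refers to a song.
is_good = lambda spot: spot[0] == "Smile"
print(precision(unrestricted, is_good), precision(restricted, is_good))
```

On this toy input the restriction removes the spurious "yesterday" spot, which mirrors the slides' point that shrinking the candidate entity set raises the precision of the naive spotter.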
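
Some of the word-level and domain-specific features listed in Table 6 (slide 28) could be computed along these lines. This is a sketch under stated assumptions: the word lists `SENTIMENT_WORDS` and `DOMAIN_WORDS` and the function `spot_features` are hypothetical, sentences are split naively on `.`, `!` and `?`, and the POS and typed-dependency features as well as the SVM itself are left out.

```python
import re

# Hypothetical cue lists; the slides do not give the actual vocabularies.
SENTIMENT_WORDS = {"love", "loved", "awesome", "rocks", "good"}
DOMAIN_WORDS = {"album", "song", "track", "tune", "concert"}

def spot_features(comment, start, end):
    """Compute a few Table 6 style features for the spot comment[start:end]."""
    spot = comment[start:end]
    # Naive sentence boundaries: nearest ., ! or ? on each side of the spot.
    left = max(comment.rfind(c, 0, start) for c in ".!?") + 1
    ends = [i for i in (comment.find(c, end) for c in ".!?") if i != -1]
    right = min(ends) if ends else len(comment)
    # Sentence context without the spot itself, lowercased and tokenized.
    context = comment[left:start] + " " + comment[end:right]
    tokens = set(re.findall(r"[a-z']+", context.lower()))
    in_quotes = (
        start > 0
        and end < len(comment)
        and comment[start - 1] in "\"'"
        and comment[end] in "\"'"
    )
    return {
        "allCaps": spot.isupper(),                  # s.allCaps
        "firstCaps": spot[:1].isupper(),            # s.firstCaps
        "inQuotes": in_quotes,                      # s.inQuotes
        "Ssent": bool(tokens & SENTIMENT_WORDS),    # s.Ssent
        "Sdom": bool(tokens & DOMAIN_WORDS),        # s.Sdom
    }

comment = "Got your new album Smile. Loved it! Keep your SMILE on!"
i = comment.index("Smile")
j = comment.index("SMILE")
first = spot_features(comment, i, i + 5)
second = spot_features(comment, j, j + 5)
print(first)
print(second)
```

For the two example comments from slide 27, the first spot sits next to the domain term "album" while the second is all caps with no domain cue nearby. Feature dictionaries like these could then be vectorized and passed to a binary classifier.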