Entity Spotting in Informal Text
1. Entity Spotting in
Informal Text
Meena Nagarajan
with
Daniel Gruhl*, Jan Pieper*, Christine Robson*,
Amit P. Sheth
Kno.e.sis, Wright State
IBM Research - Almaden, San Jose CA*
Thursday, October 29, 2009 1
2. Tracking Online Popularity
http://www.almaden.ibm.com/cs/projects/iis/sound/
3. Tracking Online Popularity
http://www.almaden.ibm.com/cs/projects/iis/sound/
• What is the buzz in the online
Music Community?
• Ranking and displaying top X
music artists, songs, tracks,
albums..
• Spotting entities,
despamming, sentiment
identification, aggregation, top X
lists..
4. Spotting music entities in
user-generated content in
online music forums
(MySpace)
5. Chatter in Online Music Communities
http://knoesis.wright.edu/research/semweb/projects/music/
6. Goal: Semantic Annotation of
artists, tracks, songs, albums..
Music Brainz RDF
Ohh these sour times... rock!
Ohh these <track id=574623> sour times </track> ... rock!
7. Multiple Senses in the same
Domain
• 60 songs with Merry Christmas
• 3600 songs with Yesterday
• 195 releases of American Pie
• 31 artists covering American Pie
Caught AMERICAN PIE on cable so much fun!
8. Annotating UGC, other
Challenges
• Several Cultural named entities
• artifacts of culture, common words in
everyday language
LOVED UR MUSIC YESTERDAY!
Just showing some Love to you Madonna you are The Queen to me
Lily your face lights up when you smile!
9. Annotating UGC, other
Challenges
• Informal Text
• slang, abbreviations, misspellings..
• indifferent approach to grammar..
• Context dependent terms
• Unknown distributions
10. Our Approach
Spotting and subsequent sense
disambiguation of spots
Ohh these sour times... rock!
Ohh these <track id=574623> sour times </track> ... rock!
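The spotting half of this approach can be sketched as a plain dictionary lookup. This is a minimal illustration, not the authors' implementation: the `TRACKS` dictionary and the `annotate` function are hypothetical names, and the only entry is the track id shown in the example above. Sense disambiguation is not attempted here.

```python
import re

# Hypothetical one-entry dictionary; the id is the one shown in the slide example.
# Keys are lower-cased titles.
TRACKS = {"sour times": 574623}


def annotate(comment, tracks=TRACKS):
    """Wrap every exact, case-insensitive title match in a <track> tag."""
    if not tracks:
        return comment
    # Try longer titles first so a long title wins over a shorter one it contains.
    titles = sorted(tracks, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in titles) + r")\b",
        re.IGNORECASE,
    )

    def tag(match):
        track_id = tracks[match.group(1).lower()]
        return f"<track id={track_id}> {match.group(1)} </track>"

    return pattern.sub(tag, comment)


print(annotate("Ohh these sour times... rock!"))
```

A spotter like this tags every string match, which is why a second, disambiguating step is needed for titles that are also everyday words.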
11. Ground Truth Data Set
• 3 artists: Madonna, Rihanna, Lily Allen
• 1858 spots (MySpace UGC) using naive spotter over MusicBrainz artist metadata
• Adjudicate if a spot is an entity or not (or inconclusive)
• hand tagged by 4 authors
Precision (best case for naive spotter): Rihanna 33%, Lily 73%, Madonna 23%

3.1 Ground Truth Data Set
Our experimental evaluation focuses on user comments from the MySpace pages of three artists: Madonna, Rihanna and Lily Allen (see Table 2). The artists were selected to be popular enough to draw comment but different enough to provide variety. The entity definitions were taken from the MusicBrainz RDF (see Figure 1), which also includes some but not all common aliases and misspellings.

Madonna      an artist with an extensive discography as well as a current album and concert tour
Rihanna      a pop singer with recent accolades including a Grammy Award and a very active MySpace presence
Lily Allen   an independent artist with song titles that include “Smile,” “Alright, Still”, “Naive”, and “Friday Night” who also generates a fair amount of buzz around her personal life not related to music
Table 2. Artists in the Ground Truth Data Set

We establish a ground truth data set of 1858 entity spots for these artists (breakdown in Table 3). The data was obtained by crawling the artist’s MySpace page comments and identifying all exact string matches of the artist’s song titles. Only comments with at least one spot were retained. These spots were then hand

Artist (spots scored)   Good spots            Bad spots
                        100%       75%        100%       75%
                        agreement  agreement  agreement  agreement
Rihanna (615)           165        18         351        8
Lily (523)              268        42         10         100
Madonna (720)           138        24         503        20
Table 3. Manual scoring agreements on naive entity spotter results.
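As a worked check, the best-case precision figures quoted on this slide can be recomputed from the Table 3 counts. The sketch assumes (this is not stated on the slide) that a spot counts as good or bad whenever at least 75% of the taggers agree, and that the remaining spots are inconclusive and left out. `TABLE_3` and `naive_precision` are illustrative names.

```python
# Counts from Table 3: (good at 100%, good at 75%, bad at 100%, bad at 75%).
TABLE_3 = {
    "Rihanna": (165, 18, 351, 8),
    "Lily": (268, 42, 10, 100),
    "Madonna": (138, 24, 503, 20),
}


def naive_precision(good_100, good_75, bad_100, bad_75):
    """Share of adjudicated spots that are real music mentions."""
    good = good_100 + good_75
    bad = bad_100 + bad_75
    return good / (good + bad)


for artist, counts in TABLE_3.items():
    print(f"{artist}: {naive_precision(*counts):.1%}")
```

Under that assumption the values come out just above 33%, 73% and 23%, which agrees with the slide if its figures are truncated rather than rounded.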
13. Experiments
All entities from
MusicBrainz
1. Light weight, edit distance
based entity spotter
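The slide does not say how the edit-distance spotter works, so the following is only a rough sketch of one way to build such a spotter: slide a window of words over the comment and accept it when its Levenshtein distance to a title is small. The function names, the `max_distance` threshold and the sample comment are all assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def fuzzy_spots(comment, titles, max_distance=1):
    """Return (title, matched text) pairs for word windows close to a title."""
    words = comment.lower().split()
    spots = []
    for title in titles:
        width = len(title.split())
        for start in range(len(words) - width + 1):
            window = " ".join(words[start:start + width])
            if edit_distance(window, title.lower()) <= max_distance:
                spots.append((title, window))
    return spots


print(fuzzy_spots("luv ur song smille so much", ["Smile", "Friday Night"]))
```

The sketch splits on whitespace only, so punctuation attached to a word counts towards the distance; a real spotter would normalise the text first.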
14. Experiments
1. Naive spotter using all entities from all of
MusicBrainz
2. This new Merry Christmas tune is so good!
? but which one ?
Disambiguate between the 60+ Merry
Christmas entries in MusicBrainz
15. Experiments
2. Constrain set of possible
entities from MusicBrainz
- to increase spotting accuracy
- constrain using cues from the
comment to eliminate
alternatives
This new Merry
Christmas tune is
so good!
16. Experiments
3. Eliminate non-music
mentions
Natural language and domain
specific cues
Your SMILE rocks!
18. 2. Restricted Entity
Spotting
• Investigating the relationship between number
of entities used and spotting accuracy
• Understand systematic ways of scoping
domain models for use in semantic annotation
• Experiments to gauge benefits of implementing
particular constraints in annotator systems
• harder artist age detector vs. easier gender
detector ?
19. 2a. Random Restrictions
• From all of MusicBrainz (281890 artists, 6220519 tracks) to songs of one artist (for all three artists)
• Domain restrictions of 10% of the RDF result in approximately 9.8 times improvement in precision
Precision (best case for naive spotter): 33%, 73%, 23%

... sets of artists that are factors of 10 smaller (10%, 1%, etc). These subsets always contain our three actual artists (Madonna, Rihanna and Lily Allen), because we are interested in simulating restrictions that remove invalid artists. The most restricted entity set contains just the songs of one artist (≈0.0001% of the MusicBrainz taxonomy). In order to rule out selection bias, we perform 200 random draws of sets of artists for each set size - a total of 1200 experiments. Figure 2 shows that the precision increases as the set of possible entities shrinks. For each set size, all 200 results are plotted and a best fit line has been added to indicate the average precision. Note that the figure is in log-log scale.

[Figure: log-log plot; x-axis: Percent of the MusicBrainz taxonomy (.0001% to 100%); y-axis: Precision of the Spotter (.0001% to 100%); series: Lily Allen, Rihanna and Madonna, each with a best-fit line]
Fig. 2. Precision of a naive spotter using differently sized portions of the MusicBrainz taxonomy to spot song titles on artist’s MySpace pages

We observe that the curves in Figure 2 conform to a power law formula, specifically a Zipf distribution (1/n^{R^2}). Zipf’s law was originally applied to demonstrate the Zipf distribution in frequency of words in natural language corpora, and has since been demonstrated in other corpora including web searches. Figure 2 shows that song titles in Informal English exhibit the same frequency characteristics as plain English. Furthermore, we can see that in the average case, a domain restriction of 10% of the MusicBrainz RDF will result in approximately a 9.8 times improvement in precision of a naive spotter.
This result is remarkably consistent across all three artists. The R2 values for the power lines on the three artists are 0.9776, 0.979, 0.9836, which gives a deviation of 0.61% in R2 value between spots on the three MySpace pages.
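A best-fit line on a log-log plot corresponds to a power law, and its slope gives the gain per tenfold restriction. The sketch below fits such a line by ordinary least squares. The data points are hypothetical: they are generated so that each 10x cut multiplies precision by 9.8, the average-case figure quoted on this slide, and are not the measurements behind Figure 2. `fit_power_law` is an illustrative name.

```python
import math


def fit_power_law(fractions, precisions):
    """Least-squares line through (log10 x, log10 y).

    Returns (exponent, coefficient) of the model y = coefficient * x ** exponent.
    """
    xs = [math.log10(f) for f in fractions]
    ys = [math.log10(p) for p in precisions]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, 10 ** intercept


# Hypothetical points: fraction of the taxonomy used, and the resulting precision.
fractions = [1.0, 0.1, 0.01, 0.001]
precisions = [0.001 * 9.8 ** k for k in range(4)]

exponent, coefficient = fit_power_law(fractions, precisions)
print(f"precision ~ {coefficient:.4g} * fraction^{exponent:.3f}")
print(f"gain per 10x restriction: {10 ** -exponent:.2f}")
```

An exponent close to -1 means precision grows almost in inverse proportion to the size of the entity set, which is what a 9.8 times gain per tenfold cut amounts to.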
20. 2b. Real-world Constraints
for Restrictions
“Happy 25th Rhi!” (eliminate using Artist DOB - metadata in
MusicBrainz)
“ur new album dummy is awesome” (eliminate using Album release
dates - metadata in MusicBrainz)
• Systematic scoping of the RDF
• Question: Do real-world constraints from
metadata reduce size of the entity spot set in a
meaningful way?
• Experiments: Derived manually and tested for
usefulness
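The two example cues above (a birthday greeting and a mention of a new album) can be turned into filters over artist metadata. The sketch below is illustrative only: the `Artist` record, the field names and the three sample artists are made up and do not reflect the MusicBrainz RDF schema or real MusicBrainz data.

```python
from dataclasses import dataclass, field


@dataclass
class Artist:
    name: str
    birth_year: int
    album_years: list = field(default_factory=list)


# Hypothetical records for illustration; not real MusicBrainz data.
ARTISTS = [
    Artist("Artist A", 1984, [2005, 2007, 2009]),
    Artist("Artist B", 1958, [1983, 2008]),
    Artist("Artist C", 1984, [1999]),
]


def turning(age, comment_year, artists):
    """Keep artists whose birth year fits a birthday cue such as 'Happy 25th'."""
    return [a for a in artists if comment_year - a.birth_year == age]


def with_recent_album(comment_year, artists, window=1):
    """Keep artists with an album released within `window` years of the comment."""
    return [
        a for a in artists
        if any(0 <= comment_year - year <= window for year in a.album_years)
    ]


candidates = turning(25, 2009, ARTISTS)
candidates = with_recent_album(2009, candidates)
print([a.name for a in candidates])
```

Each filter shrinks the set of artists whose titles the spotter has to consider, which is the effect the experiments on this slide set out to measure.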
21. Real-world Constraints
Restrictions over MusicBrainz
We consider three classes of restrictions - career, age and album based
D. I’ve been your fan for 25 years!
M. Happy 25th
....

Key  Count    Restriction
Artist Career Length Restrictions - Applied to Madonna
B    22       80’s artists with recent (within 1 year) album
C    154      First album 1983
D    1,193    20-30 year career
Recent Album Restrictions - Applied to Madonna
E    6,491    Artists who released an album in the past year
F    10,501   Artists who released an album in the past 5 years
Artist Age Restrictions - Applied to Lily Allen
H    112      Artist born 1985, album in past 2 years
J    284      Artists born in 1985 (or bands founded in 1985)
L    4,780    Artists or bands under 25 with album in past 2 years
M    10,187   Artists or bands under 25 years old
Number of Album Restrictions - Applied to Lily Allen
K    1,530    Only one album, released in the past 2 years
N    19,809   Artists with only one album
Recent Album Restrictions - Applied to Rihanna
Q    83       3 albums exactly, first album last year
R    196      3+ albums, first album last year
S    1,398    First album last year
T    2,653    Artists with 3+ albums, one in the past year
U    6,491    Artists who released an album in the past year
Specific Artist Restrictions - Applied to each Artist
A    1        Madonna only
G    1        Lily Allen only
P    1        Rihanna only
Z    281,890  All artists in MusicBrainz
Table 4. The efficacy of various sample restrictions.
22. Real-world Constraints
• Applied different constraints to different
artists
• Reduce potential entity spot size
• Run naive spotter
• Measure precision
23. Real-world Constraints
Rihanna: short career, recent album releases, 3 album releases etc....
“I heart your new album”
“I love all your 3 albums”
“You are most favorite new pop artist”
[Figure: log-log plot for Rihanna; y-axis: Precision of the Spotter; both axes run from .0001% to 100%. Labelled points: naive spotter trained on only Rihanna songs; exactly 3 albums; at least 3 albums; artists whose first album was in the past 3 years; all artists who released an album in the past 3 years; entire MusicBrainz taxonomy]
24. Real-world Constraints
Age restrictions, only one album, last year releases,
extensive career etc...
[Figures: two log-log plots, one for Madonna and one for Lily Allen; y-axis: Precision of the Spotter; both axes run from .0001% to 100%; points are labelled with the restriction applied, from a naive spotter trained on only that artist's songs up to the entire MusicBrainz taxonomy]
25. Take aways..
• Real world restrictions closely follow distribution
of random restrictions, conforming loosely to a
Zipf distribution
• Confirms general effectiveness of limiting domain
size regardless of restriction
• Choosing which constraints to implement is simple
- pick whatever is easiest first
• use metadata from the model to guide you
27. Disambiguating Non-
music References
UGC on Lily Allen’s page about her new track Smile
Got your new album Smile. Loved it!
Keep your SMILE on!
28. Binary Classification, SVM
Got your new album Smile. Loved it!
Keep your SMILE on!
Training data
550 good spots
550 bad spots
Test data
120 good spots
229 * 2 bad spots

Syntactic features                                                    Notation-S
+ POS tag of s                                                        s.POS
POS tag of one token before s                                         s.POSb
POS tag of one token after s                                          s.POSa
Typed dependency between s and sentiment word *                       s.POS-TDsent
Typed dependency between s and domain-specific term *                 s.POS-TDdom
Boolean Typed dependency between s and sentiment *                    s.B-TDsent
Boolean Typed dependency between s and domain-specific term *         s.B-TDdom
Word-level features                                                   Notation-W
+ Capitalization of spot s                                            s.allCaps
+ Capitalization of first letter of s                                 s.firstCaps
+ s in Quotes                                                         s.inQuotes
Domain-specific features                                              Notation-D
Sentiment expression in the same sentence as s                        s.Ssent
Sentiment expression elsewhere in the comment                         s.Csent
Domain-related term in the same sentence as s                         s.Sdom
Domain-related term elsewhere in the comment                          s.Cdom
+ Refers to basic features, others are advanced features
* These features apply only to one-word-long spots.
Table 6. Features used by the SVM learner
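The word-level and domain-specific features in Table 6 can be computed with simple string tests. The sketch below is an assumption-laden illustration: the cue lists `SENTIMENT_WORDS` and `DOMAIN_WORDS` are invented (the slides do not give the lexicons), the sentence splitter is crude, and the syntactic features are left out because they need a POS tagger and a dependency parser. It produces feature dictionaries only; training the SVM is not shown.

```python
import re

# Hypothetical cue lists; the slides do not give the actual lexicons.
SENTIMENT_WORDS = {"love", "loved", "awesome", "rocks", "good"}
DOMAIN_WORDS = {"album", "track", "song", "tune", "concert"}
QUOTE_CHARS = {'"', "'"}


def has_cue(text, cues):
    """True if any lower-cased word of text appears in the cue set."""
    return any(word in cues for word in re.findall(r"[a-z']+", text.lower()))


def spot_features(comment, spot_text):
    """Word-level and domain features for the first occurrence of spot_text."""
    start = comment.index(spot_text)
    end = start + len(spot_text)
    # Crude sentence split on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", comment)
    position = next(i for i, s in enumerate(sentences) if spot_text in s)
    same_sentence = sentences[position]
    elsewhere = " ".join(s for i, s in enumerate(sentences) if i != position)
    before = comment[max(start - 1, 0):start]
    after = comment[end:end + 1]
    return {
        "s.allCaps": spot_text.isupper(),
        "s.firstCaps": spot_text[0].isupper(),
        "s.inQuotes": before in QUOTE_CHARS and after in QUOTE_CHARS,
        "s.Ssent": has_cue(same_sentence, SENTIMENT_WORDS),
        "s.Csent": has_cue(elsewhere, SENTIMENT_WORDS),
        "s.Sdom": has_cue(same_sentence, DOMAIN_WORDS),
        "s.Cdom": has_cue(elsewhere, DOMAIN_WORDS),
    }


print(spot_features("Got your new album Smile. Loved it!", "Smile"))
print(spot_features("Keep your SMILE on!", "SMILE"))
```

On the two example comments, the first has a domain term in the same sentence as the spot and a sentiment word elsewhere, while the second has neither, which is the kind of contrast the classifier is meant to pick up.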
29. Most Useful Combinations
Precision intensive
42-91   FP best: all features, other combinations
78-50   TP next best: word, domain, contextual (POS)
90-35   TP best: word, domain, contextual
Recall intensive
Not all syntactic features are useless, contrary to general belief, wrt informal text
30. Naive MB spotter + NLP
• Annotate using naive spotter
• best case baseline (artist is known)
• follow with NLP analytics to weed out FPs
• run on less than entire input data
PR tradeoffs: choosing feature combinations depending on end application requirement
[Figure: Precision & Recall (y-axis) against classifier accuracy splits (valid-invalid) (x-axis), starting from the naive spotter; series: precision for Lily Allen, precision for Rihanna, precision for Madonna, recall (all three)]
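The precision-recall tradeoff of running a classifier after the naive spotter can be worked through with simple arithmetic. The sketch assumes that a split such as 42-91 means the classifier keeps 42% of the valid spots and rejects 91% of the invalid ones; that reading is an assumption, as are the function name and the choice of Madonna's hand-tagged counts (good 138 + 24, bad 503 + 20) as input. The numbers it prints follow from these assumptions and are not results reported in the talk.

```python
def after_classifier(good, bad, valid_acc, invalid_acc):
    """Precision and recall of the spot set once the classifier has filtered it.

    valid_acc: fraction of good spots the classifier keeps.
    invalid_acc: fraction of bad spots the classifier rejects.
    """
    kept_good = good * valid_acc
    kept_bad = bad * (1.0 - invalid_acc)
    precision = kept_good / (kept_good + kept_bad)
    recall = valid_acc  # relative to what the naive spotter found
    return precision, recall


# Madonna's counts from the hand-tagged naive spots.
good, bad = 162, 523
for valid_acc, invalid_acc in [(0.42, 0.91), (0.78, 0.50), (0.90, 0.35)]:
    precision, recall = after_classifier(good, bad, valid_acc, invalid_acc)
    print(
        f"{valid_acc:.0%}-{invalid_acc:.0%}: "
        f"precision {precision:.1%}, recall {recall:.0%}"
    )
```

A split that rejects most invalid spots raises precision at the cost of recall, and the reverse holds for a split that keeps most valid spots, so the choice depends on what the end application needs.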
31. Summary..
• Real-time large-scale data processing
• prohibits computationally intensive NLP techniques
• Simple inexpensive NL learners over a dictionary-
based naive spotter can yield reasonable performance
• restricting the taxonomy results in proportionally
higher precision
• Spot + Disambiguate is a feasible approach for (especially
Cultural) NER in Informal Text
32. Thank You!
• Bing, Yahoo, Google: Meena Nagarajan
• Contact us
• {dgruhl, jhpieper, crobson}@us.ibm.com, {meena, amit}@knoesis.org
• More about this work
• http://www.almaden.ibm.com/cs/projects/iis/sound/
• http://knoesis.wright.edu/researchers/meena