Semantic Analysis in Language Technology
http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm

Relation Extraction

Marina Santini
santinim@stp.lingfil.uu.se

Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden

Spring 2016

Previous Lecture: Question Answering

Question Answering systems

•  Factoid questions:
   •  Google
   •  Wolfram
   •  Ask Jeeves
   •  Start
   •  ….
•  Approaches:
   •  IR-based
   •  Knowledge-based
   •  Hybrid

Katz et al. (2006)
http://start.csail.mit.edu/publications/FLAIRS0601KatzB.pdf

•  START answers natural language questions by presenting components of text and multi-media information drawn from a set of information resources that are hosted locally or accessed remotely through the Internet.
•  START targets high precision in its question answering.
•  The START system analyzes English text and produces a knowledge base which incorporates, in the form of nested ternary expressions (= triples), the information found in the text.

Is it true?: http://uncyclopedia.wikia.com/wiki/Ask_Jeeves

•  Ask Jeeves, more correctly known as Ask.com, is a search engine founded in 1996 in California.
•  Initially it represented a stereotypical English butler who would "fetch" the answer to any question asked.
•  Ask.com is now considered one of the great failures of the internet. The question and answer feature simply didn't work as well as hoped, and after trying his hand at being both a traditional search engine and a terrible kind of "artificial AI" with a bald spot, …
•  These days Jeeves is ranked as the 4th most successful search engine on the web, and the 4th most successful overall. This seems impressive until you consider that Google holds the top spot with 95% of the market. It has even fallen behind Bing; enough said.

Search engines that can be used as QA systems

•  Yahoo
•  Bing

Siri
http://en.wikipedia.org/wiki/Siri

•  Siri /ˈsɪri/ is an intelligent personal assistant and knowledge navigator which works as an application for Apple Inc.'s iOS.
•  The application uses a natural language user interface to answer questions, make recommendations, and perform actions by delegating requests to a set of Web services.
•  The software, both in its original version and as an iOS application, adapts to the user's individual language usage and individual searches (preferences) with continuing use, and returns results that are individualized.
•  The name Siri is Scandinavian, a short form of the Norse name Sigrid meaning "beauty" and "victory", and comes from the intended name for the original developer's first child.

Chatterbots

•  Siri… conversational "safety net".
•  Conversational agents (chatter bots and personal assistants)
   → customer care, customer analytics (replacing/integrating FAQs and help desks)

Avatar: a picture of a person or animal that represents you on a computer screen, for example in some chat rooms or when you are playing games over the Internet.

Eliza
http://en.wikipedia.org/wiki/ELIZA
ELIZA was written at MIT by Joseph Weizenbaum between 1964 and 1966.

General IR architecture for factoid questions

[Pipeline figure: documents are indexed; Question Processing (Query Formulation, Answer Type Detection) feeds Document Retrieval, which returns relevant documents; Passage Retrieval extracts candidate answer passages; Answer Processing produces the final answer.]

Things to extract from the question

•  Answer Type Detection
   •  Decide the named entity type (person, place) of the answer
•  Query Formulation
   •  Choose query keywords for the IR system
•  Question Type classification
   •  Is this a definition question, a math question, a list question?
•  Focus Detection
   •  Find the question words that are replaced by the answer
•  Relation Extraction
   •  Find relations between entities in the question

Common Evaluation Metrics

1. Accuracy (does the answer match the gold-labeled answer?)
2. Mean Reciprocal Rank:
•  The reciprocal rank of a query response is the inverse of the rank of the first correct answer.
•  The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of N queries Q:

   MRR = (1/N) · Σ_{i=1..N} (1/rank_i)

  
Common Evaluation Metrics: MRR

•  The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries Q.
•  (Example adapted from Wikipedia.)
•  3 ranked answers for a query, with the first one being the one the system thinks is most likely correct.
•  Given those 3 samples, we could calculate the mean reciprocal rank as (1/3 + 1/2 + 1)/3 ≈ 0.61.

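As a concrete illustration (not from the slides), a minimal Python sketch that computes MRR from ranked candidate lists; `queries` is a hypothetical list of (ranked answers, gold answer) pairs:

import math  # not strictly needed; kept minimal on purpose

def mean_reciprocal_rank(queries):
    """queries: list of (ranked_answers, gold_answer) pairs."""
    total = 0.0
    for ranked_answers, gold in queries:
        # Reciprocal rank is 1/position of the first correct answer, else 0.
        rr = 0.0
        for position, answer in enumerate(ranked_answers, start=1):
            if answer == gold:
                rr = 1.0 / position
                break
        total += rr
    return total / len(queries)

# The three queries from the slide: correct answer at ranks 3, 2 and 1.
sample = [(["a", "b", "gold1"], "gold1"),
          (["x", "gold2", "y"], "gold2"),
          (["gold3", "p", "q"], "gold3")]
print(mean_reciprocal_rank(sample))  # (1/3 + 1/2 + 1)/3 ≈ 0.61
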
  
Complex questions: "What is the 'hajj'?"

•  The (bottom-up) snippet method
   •  Find a set of relevant documents
   •  Extract informative sentences from the documents (using tf-idf, MMR)
   •  Order and modify the sentences into an answer
•  The (top-down) information extraction method
   •  Build specific answerers for different question types:
      •  definition questions,
      •  biography questions,
      •  certain medical questions

Information that should be in the answer for 3 kinds of questions

Architecture for complex question answering: definition questions
S. Blair-Goldensohn, K. McKeown and A. Schlaikjer. 2004. Answering Definition Questions: A Hybrid Approach.

[Pipeline figure for the query "What is the Hajj?" (Ndocs=20, Len=8): Document Retrieval returns 11 Web documents (1127 total sentences); Predicate Identification selects 383 Non-Specific Definitional sentences and 9 Genus-Species sentences; Data-Driven Analysis builds sentence clusters with importance ordering; Definition Creation assembles the answer.]

Sample Genus-Species sentences:
The Hajj, or pilgrimage to Makkah (Mecca), is the central duty of Islam.
The Hajj is a milestone event in a Muslim's life.
The hajj is one of five pillars that make up the foundation of Islam.
...

Generated definition:
The Hajj, or pilgrimage to Makkah [Mecca], is the central duty of Islam. More than two million Muslims are expected to take the Hajj this year. Muslims must perform the hajj at least once in their lifetime if physically and financially able. The Hajj is a milestone event in a Muslim's life. The annual hajj begins in the twelfth month of the Islamic year (which is lunar, not solar, so that hajj and Ramadan fall sometimes in summer, sometimes in winter). The Hajj is a week-long pilgrimage that begins in the 12th month of the Islamic lunar calendar. Another ceremony, which was not connected with the rites of the Ka'ba before the rise of Islam, is the Hajj, the annual pilgrimage to 'Arafat, about two miles east of Mecca, toward Mina…

State-of-the-art: examples

•  Top down: Ming Tan, Cicero dos Santos, Bing Xiang & Bowen Zhou. 2015. LSTM-Based Deep Learning Models for Non-Factoid Answer Selection.
•  Di Wang and Eric Nyberg. 2015. A Long Short-Term Memory Model for Answer Sentence Selection in Question Answering. In ACL 2015.
•  Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, Bowen Zhou. 2015. Applying Deep Learning to Answer Selection: A Study and an Open Task.

Deep Learning is a new area of Machine Learning research, said to be very promising. It is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound, and text. It is based on neural networks.

Practical activity

•  Start seems to be limited, but it understands natural language.
•  Google (presumably helped by the Knowledge Graph) is more accurate, but skips natural language (uses keywords).
•  Google is customized to the users' preferences (different results).
•  Interesting outcomes:
   •  Currency vs. Coin
   •  What's love?
   •  Lyric/song vs. Definition question

What's the meaning of life?

•  Google

Presumably from the Knowledge Graph…

Start and the 42 puzzle

[screenshot]

End of previous lecture

Acknowledgements
Most slides borrowed or adapted from:
Dan Jurafsky and Christopher Manning, Coursera
Dan Jurafsky and James H. Martin (2015)

J&M (2015, draft): https://web.stanford.edu/~jurafsky/slp3/

Relation Extraction
What is relation extraction?

Extracting relations from text

•  Company report: "International Business Machines Corporation (IBM or the company) was incorporated in the State of New York on June 16, 1911, as the Computing-Tabulating-Recording Co. (C-T-R)…"
•  Extracted Complex Relation:
   Company-Founding
      Company:        IBM
      Location:       New York
      Date:           June 16, 1911
      Original-Name:  Computing-Tabulating-Recording Co.
•  But we will focus on the simpler task of extracting relation triples:
   Founding-year(IBM, 1911)
   Founding-location(IBM, New York)

Extracting Relation Triples from Text

The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is an American private research university located in Stanford, California … near Palo Alto, California… Leland Stanford… founded the university in 1891.

Stanford EQ Leland Stanford Junior University
Stanford LOC-IN California
Stanford IS-A research university
Stanford LOC-NEAR Palo Alto
Stanford FOUNDED-IN 1891
Stanford FOUNDER Leland Stanford

Why Relation Extraction?

•  Create new structured knowledge bases, useful for any app
•  Augment current knowledge bases
   •  Adding words to the WordNet thesaurus, facts to Freebase or DBpedia
•  Support question answering
   •  The granddaughter of which actor starred in the movie "E.T."?
      (acted-in ?x "E.T.") (is-a ?y actor) (granddaughter-of ?x ?y)
•  But which relations should we extract?

Automated Content Extraction (ACE)

Automatic Content Extraction (ACE) is a research program for developing advanced information extraction technologies. Given a text in natural language, the ACE challenge is to detect:
•  entities
•  relations between entities
•  events

[Relation ontology figure, "Relation Extraction Task": PHYSICAL (Located, Near); PART-WHOLE (Geographical, Subsidiary); PERSON-SOCIAL (Business, Family, Lasting Personal); ORG AFFILIATION (Founder, Employment, Membership, Ownership, Student-Alum, Investor, Sports-Affiliation); GENERAL AFFILIATION (Citizen-Resident-Ethnicity-Religion, Org-Location-Origin); ARTIFACT (User-Owner-Inventor-Manufacturer).]

Automated Content Extraction (ACE)

•  Physical-Located (PER-GPE):           He was in Tennessee
•  Part-Whole-Subsidiary (ORG-ORG):      XYZ, the parent company of ABC
•  Person-Social-Family (PER-PER):       John's wife Yoko
•  Org-AFF-Founder (PER-ORG):            Steve Jobs, co-founder of Apple…

UMLS: Unified Medical Language System

•  134 entity types, 54 relations

Injury                     disrupts      Physiological Function
Bodily Location            location-of   Biologic Function
Anatomical Structure       part-of       Organism
Pharmacologic Substance    causes        Pathological Function
Pharmacologic Substance    treats        Pathologic Function

Extracting UMLS relations from a sentence

Doppler echocardiography can be used to diagnose left anterior descending artery stenosis in patients with type 2 diabetes
↓
Echocardiography, Doppler DIAGNOSES Acquired stenosis

Databases of Wikipedia Relations

Relations extracted from the Wikipedia Infobox:
Stanford  state  California
Stanford  motto  "Die Luft der Freiheit weht"
…

Relation databases that draw from Wikipedia

•  Resource Description Framework (RDF) triples: subject predicate object
   Golden Gate Park  location  San Francisco
   dbpedia:Golden_Gate_Park  dbpedia-owl:location  dbpedia:San_Francisco
•  The DBpedia project uses the Resource Description Framework (RDF) to represent the extracted information and consists of 3 billion RDF triples, 580 million extracted from the English edition of Wikipedia and 2.46 billion from other language editions (Wikipedia, March 2016).
•  Frequent Freebase relations:
   people/person/nationality                 location/location/contains
   people/person/profession                  people/person/place-of-birth
   biology/organism_higher_classification    film/film/genre

DBpedia is a project aiming to extract structured content from the information created as part of the Wikipedia project.

Freebase was a large collaborative knowledge base consisting of data composed mainly by its community members (cf. Semantic Web) --> Knowledge Graph: https://en.wikipedia.org/wiki/Freebase
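As an aside (not on the slide), a minimal sketch of how such a triple can be built programmatically, assuming the Python rdflib package; the namespace handles mirror the DBpedia URIs above:

from rdflib import Graph, Namespace

# Hypothetical namespace handles for the DBpedia identifiers shown above.
DBPEDIA = Namespace("http://dbpedia.org/resource/")
DBPEDIA_OWL = Namespace("http://dbpedia.org/ontology/")

g = Graph()
#      subject                   predicate             object
g.add((DBPEDIA.Golden_Gate_Park, DBPEDIA_OWL.location, DBPEDIA.San_Francisco))

for subj, pred, obj in g:
    print(subj, pred, obj)
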
  	
  
How to build relation extractors

1.  Hand-written patterns
2.  Supervised machine learning
3.  Semi-supervised and unsupervised
   •  Bootstrapping (using seeds)
   •  Distant supervision
   •  Unsupervised learning from the web

Relation Extraction
Using patterns to extract relations

Rules for extracting the IS-A relation

Early intuition from Hearst (1992):
•  "Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use"
•  What does Gelidium mean?
•  How do you know?

Hearst's Patterns for extracting IS-A relations
(Hearst, 1992): Automatic Acquisition of Hyponyms

"Y such as X ((, X)* (, and|or) X)"
"such Y as X"
"X or other Y"
"X and other Y"
"Y including X"
"Y, especially X"

Hearst's Patterns for extracting IS-A relations

Hearst pattern       Example occurrences
X and other Y        ...temples, treasuries, and other important civic buildings.
X or other Y         Bruises, wounds, broken bones or other injuries...
Y such as X          The bow lute, such as the Bambara ndang...
Such Y as X          ...such authors as Herrick, Goldsmith, and Shakespeare.
Y including X        ...common-law countries, including Canada and England...
Y, especially X      European countries, especially France, England, and Spain...
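To make the patterns concrete, a minimal sketch (mine, not Hearst's code) that approximates two of them with regular expressions; real systems match over noun-phrase chunks from a parser rather than raw word strings:

import re

# Crude regex stand-ins for two Hearst patterns; group indices say which
# capture is the hyponym (X) and which is the hypernym (Y).
HEARST = [
    # "Y such as X"  =>  hyponym X, hypernym Y
    (re.compile(r"(\w[\w ]*?),? such as (\w+)"), (2, 1)),
    # "X and other Y"  =>  hyponym X, hypernym Y
    (re.compile(r"(\w+),? and other (\w[\w ]*)"), (1, 2)),
]

def extract_isa(sentence):
    triples = []
    for pattern, (hypo_idx, hyper_idx) in HEARST:
        for m in pattern.finditer(sentence):
            triples.append((m.group(hypo_idx), "IS-A", m.group(hyper_idx)))
    return triples

print(extract_isa("red algae, such as Gelidium"))
# [('Gelidium', 'IS-A', 'red algae')]
print(extract_isa("bruises, wounds and other injuries"))
# [('wounds', 'IS-A', 'injuries')]
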
  
Hand-built patterns for relations

•  Plus:
   •  Human patterns tend to be high-precision
   •  Can be tailored to specific domains
•  Minus:
   •  Human patterns are often low-recall
   •  A lot of work to think of all possible patterns!
   •  Don't want to have to do this for every relation!
   •  We'd like better accuracy

Relation Extraction
Supervised relation extraction

Supervised machine learning for relations

•  Choose a set of relations we'd like to extract
•  Choose a set of relevant named entities
•  Find and label data
   •  Choose a representative corpus
   •  Label the named entities in the corpus
   •  Hand-label the relations between these entities
   •  Break into training, development, and test sets
•  Train a classifier on the training set

How to do classification in supervised relation extraction

1.  Find all pairs of named entities (usually in the same sentence)
2.  Decide if the 2 entities are related
3.  If yes, classify the relation

•  Why the extra step?
   •  Faster classification training by eliminating most pairs
   •  Can use distinct feature sets appropriate for each task

Word Features for Relation Extraction

Example sentence (Mention 1 = American Airlines, Mention 2 = Tim Wagner):
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said

•  Headwords of M1 and M2, and their combination:
   Airlines    Wagner    Airlines-Wagner
•  Bag of words and bigrams in M1 and M2:
   {American, Airlines, Tim, Wagner, American Airlines, Tim Wagner}
•  Words or bigrams in particular positions left and right of M1/M2:
   M2: -1 spokesman
   M2: +1 said
•  Bag of words or bigrams between the two entities:
   {a, AMR, of, immediately, matched, move, spokesman, the, unit}

A sketch of assembling these features follows.
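A minimal sketch (my illustration, not the lecture's code) of how these word features might be packed into a feature dictionary; the tokenized sentence and the mention spans are assumed to be given:

def word_features(tokens, m1_span, m2_span):
    """tokens: list of words; m1_span/m2_span: (start, end) token indices."""
    m1 = tokens[m1_span[0]:m1_span[1]]
    m2 = tokens[m2_span[0]:m2_span[1]]
    feats = {}
    # Headwords of M1 and M2 (crudely: the last token) and their combination.
    feats["hw_m1=" + m1[-1]] = 1
    feats["hw_m2=" + m2[-1]] = 1
    feats["hw_pair=" + m1[-1] + "-" + m2[-1]] = 1
    # Bag of words in the two mentions.
    for w in m1 + m2:
        feats["mention_bow=" + w] = 1
    # Words in particular positions left and right of M2.
    if m2_span[0] > 0:
        feats["m2_-1=" + tokens[m2_span[0] - 1]] = 1
    if m2_span[1] < len(tokens):
        feats["m2_+1=" + tokens[m2_span[1]]] = 1
    # Bag of words between the two entities (punctuation not filtered here).
    for w in tokens[m1_span[1]:m2_span[0]]:
        feats["between_bow=" + w] = 1
    return feats

sent = ("American Airlines , a unit of AMR , immediately matched "
        "the move , spokesman Tim Wagner said").split()
print(word_features(sent, (0, 2), (14, 16)))
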
  
Named Entity Type and Mention Level Features for Relation Extraction

Example sentence (Mention 1 = American Airlines, Mention 2 = Tim Wagner):
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said

•  Named-entity types
   •  M1: ORG
   •  M2: PERSON
•  Concatenation of the two named-entity types
   •  ORG-PERSON
•  Entity Level of M1 and M2 (NAME, NOMINAL, PRONOUN)
   •  M1: NAME   [it or he would be PRONOUN]
   •  M2: NAME   [the company would be NOMINAL]

Parse Features for Relation Extraction

Example sentence (Mention 1 = American Airlines, Mention 2 = Tim Wagner):
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said

•  Base syntactic chunk sequence from one mention to the other:
   NP  NP  PP  VP  NP  NP
•  Constituent path through the tree from one to the other:
   NP ↑ NP ↑ S ↑ S ↓ NP
•  Dependency path:
   Airlines  matched  Wagner  said

[Slide 46: parse tree of the example sentence "American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said."]

Classifiers for supervised methods

•  Now you can use any classifier you like
   •  MaxEnt
   •  Naïve Bayes
   •  SVM
   •  ...
•  Train it on the training set, tune on the dev set, test on the test set (a sketch follows)
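A minimal end-to-end sketch (not from the lecture) that plugs feature dictionaries like the ones above into a classifier, assuming scikit-learn; the feature names and labels are toy examples:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: feature dicts (e.g. from word_features above) with
# gold relation labels; a real set has thousands of labeled entity pairs.
X_train = [{"hw_pair=Airlines-Wagner": 1, "between_bow=spokesman": 1},
           {"hw_pair=Jobs-Apple": 1, "between_bow=co-founder": 1}]
y_train = ["ORG-AFF", "ORG-AFF-Founder"]

vec = DictVectorizer()
clf = LogisticRegression()  # logistic regression, i.e. a MaxEnt classifier
clf.fit(vec.fit_transform(X_train), y_train)

# Unseen feature names are simply ignored by transform().
X_test = [{"hw_pair=Gates-Microsoft": 1, "between_bow=co-founder": 1}]
print(clf.predict(vec.transform(X_test)))
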
  
Evaluation of Supervised Relation Extraction

•  Compute P/R/F1 for each relation:

P  = # of correctly extracted relations / total # of extracted relations
R  = # of correctly extracted relations / total # of gold relations
F1 = 2PR / (P + R)
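The same metrics as a small sketch (mine), computed over sets of extracted and gold relation triples:

def prf1(extracted, gold):
    """extracted, gold: sets of relation triples."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("Stanford", "LOC-IN", "California"), ("Stanford", "FOUNDED-IN", "1891")}
extracted = {("Stanford", "LOC-IN", "California"), ("Stanford", "IS-A", "city")}
print(prf1(extracted, gold))  # (0.5, 0.5, 0.5)
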
Summary: Supervised Relation Extraction

+  Can get high accuracies with enough hand-labeled training data, if the test set is similar enough to the training set
-  Labeling a large training set is expensive
-  Supervised models are brittle, and don't generalize well to different genres

Relation Extraction
Semi-supervised and unsupervised relation extraction

Seed-based or bootstrapping approaches to relation extraction

•  No training set? Maybe you have:
   •  A few seed tuples, or
   •  A few high-precision patterns
•  Can you use those seeds to do something useful?
•  Bootstrapping: use the seeds to directly learn to populate a relation

Roughly said: use seeds to initialize a process of annotation, then refine through iterations.

Relation Bootstrapping (Hearst 1992)

•  Gather a set of seed pairs that have relation R
•  Iterate:
   1.  Find sentences with these pairs
   2.  Look at the context between or around the pair and generalize the context to create patterns
   3.  Use the patterns to grep for more pairs

Bootstrapping

•  <Mark Twain, Elmira>   Seed tuple
•  Grep (google) for the environments of the seed tuple:
   "Mark Twain is buried in Elmira, NY."  →  X is buried in Y
   "The grave of Mark Twain is in Elmira"  →  The grave of X is in Y
   "Elmira is Mark Twain's final resting place"  →  Y is X's final resting place
•  Use those patterns to grep for new tuples
•  Iterate (a sketch follows)
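A schematic sketch (my illustration) of the bootstrapping loop; the tiny corpus, the seed pair, and the regex-based context generalization are toy assumptions, and a real system would also score patterns to limit semantic drift:

import re

def pattern_from(sentence, x, y):
    # Generalize a seed sentence into a regex by replacing the pair with slots.
    pat = re.escape(sentence)
    pat = pat.replace(re.escape(x), "(?P<x>.+?)")
    pat = pat.replace(re.escape(y), "(?P<y>.+?)")
    return re.compile(pat)

def bootstrap(seed_pairs, corpus, iterations=2):
    pairs, patterns = set(seed_pairs), set()
    for _ in range(iterations):
        # 1. Find sentences containing a known pair; generalize their contexts.
        for x, y in pairs:
            for s in corpus:
                if x in s and y in s:
                    patterns.add(pattern_from(s, x, y))
        # 2. Use the patterns to harvest new pairs (no pattern scoring here).
        for s in corpus:
            for pat in patterns:
                m = pat.fullmatch(s)
                if m:
                    pairs.add((m.group("x"), m.group("y")))
    return pairs

corpus = ["Mark Twain is buried in Elmira, NY.",
          "Emily Dickinson is buried in Amherst, MA."]
print(bootstrap({("Mark Twain", "Elmira, NY")}, corpus))
# {('Mark Twain', 'Elmira, NY'), ('Emily Dickinson', 'Amherst, MA')}
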
  
DIPRE: Extract <author, book> pairs
Brin, Sergey. 1998. Extracting Patterns and Relations from the World Wide Web.

•  Start with 5 seeds:

   Author                Book
   Isaac Asimov          The Robots of Dawn
   David Brin            Startide Rising
   James Gleick          Chaos: Making a New Science
   Charles Dickens       Great Expectations
   William Shakespeare   The Comedy of Errors

•  Find instances:
   The Comedy of Errors, by William Shakespeare, was
   The Comedy of Errors, by William Shakespeare, is
   The Comedy of Errors, one of William Shakespeare's earliest attempts
   The Comedy of Errors, one of William Shakespeare's most

•  Extract patterns (group by middle, take longest common prefix/suffix):
   ?x , by ?y ,
   ?x , one of ?y 's

•  Now iterate, finding new seeds that match the pattern

Distant Supervision

•  Combine bootstrapping with supervised learning
•  Instead of 5 seeds:
   •  Use a large database to get a huge number of seed examples
   •  Create lots of features from all these examples
   •  Combine in a supervised classifier

Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17.
Fei Wu and Daniel S. Weld. 2007. Autonomously Semantifying Wikipedia. CIKM 2007.
Mintz, Bills, Snow, Jurafsky. 2009. Distant supervision for relation extraction without labeled data. ACL 2009.

Distant supervision paradigm

•  Like supervised classification:
   •  Uses a classifier with lots of features
   •  Supervised by detailed hand-created knowledge
   •  Doesn't require iteratively expanding patterns
•  Like unsupervised classification:
   •  Uses very large amounts of unlabeled data
   •  Not sensitive to genre issues in the training corpus

Distantly supervised learning of relation extraction patterns

Example relation: Born-In, with database tuples <Edwin Hubble, Marshfield> and <Albert Einstein, Ulm>.

1.  For each relation
2.  For each tuple in a big database
3.  Find sentences in a large corpus with both entities:
       Hubble was born in Marshfield
       Einstein, born (1879), Ulm
       Hubble's birthplace in Marshfield
4.  Extract frequent features (parse, words, etc.):
       PER was born in LOC
       PER, born (XXXX), LOC
       PER's birthplace in LOC
5.  Train a supervised classifier using thousands of patterns:
       P(born-in | f1, f2, f3, …, f70000)

A sketch of the labeling step (2-3) follows.
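A minimal sketch (mine) of the distant-supervision labeling step: every corpus sentence that mentions both entities of a database tuple becomes a noisily labeled positive example for that relation; featurization and classifier training would follow as in the supervised setting:

def distant_label(database, corpus):
    """database: {relation: [(entity1, entity2), ...]}; corpus: list of sentences.
    Returns noisily labeled (sentence, e1, e2, relation) training examples."""
    examples = []
    for relation, tuples in database.items():
        for e1, e2 in tuples:
            for sentence in corpus:
                # Distant-supervision assumption: a sentence mentioning both
                # entities expresses the relation (often, but not always, true).
                if e1 in sentence and e2 in sentence:
                    examples.append((sentence, e1, e2, relation))
    return examples

db = {"born-in": [("Edwin Hubble", "Marshfield"), ("Albert Einstein", "Ulm")]}
corpus = ["Edwin Hubble was born in Marshfield.",
          "Albert Einstein, born (1879), Ulm",
          "Edwin Hubble's birthplace in Marshfield was modest."]
for example in distant_label(db, corpus):
    print(example)
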
  
Unsupervised relation extraction

•  Open Information Extraction: extract relations from the web with no training data and no list of relations
   1.  Use parsed data to train a "trustworthy tuple" classifier
   2.  In a single pass, extract all relations between NPs; keep those judged trustworthy
   3.  An assessor ranks relations based on text redundancy

(FCI, specializes in, software development)
(Tesla, invented, coil transformer)

M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open information extraction from the web. IJCAI.

Evaluation of Semi-supervised and Unsupervised Relation Extraction

•  Since it extracts totally new relations from the web:
   •  There is no gold set of correct instances of relations!
   •  Can't compute precision (we don't know which ones are correct)
   •  Can't compute recall (we don't know which ones were missed)
•  Instead, we can approximate precision (only): draw a random sample of relations from the output and check precision manually:

   P̂ = # of correctly extracted relations in the sample / total # of extracted relations in the sample

•  Can also compute precision at different levels of recall:
   •  Precision for the top 1,000 new relations, the top 10,000, the top 100,000
   •  In each case taking a random sample of that set
•  But there is no way to evaluate recall.
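A small sketch (mine) of this sampled-precision estimate; `judge` stands in for a hypothetical human judgment of whether an extracted relation is correct:

import random

def estimated_precision(extracted_relations, judge, sample_size=100):
    """Estimate precision by manually judging a random sample of the output."""
    sample = random.sample(extracted_relations,
                           min(sample_size, len(extracted_relations)))
    correct = sum(1 for relation in sample if judge(relation))  # human verdict
    return correct / len(sample)

# Toy usage with an automatic stand-in for the human judge:
triples = ([("Tesla", "invented", "coil transformer")] * 7 +
           [("Tesla", "invented", "the web")] * 3)
print(estimated_precision(triples, judge=lambda r: r[2] != "the web",
                          sample_size=10))  # 0.7
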
The end

Más contenido relacionado

La actualidad más candente

Hidden Markov Models with applications to speech recognition
Hidden Markov Models with applications to speech recognitionHidden Markov Models with applications to speech recognition
Hidden Markov Models with applications to speech recognition
butest
 
Text clustering
Text clusteringText clustering
Text clustering
KU Leuven
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
Yasir Khan
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
KU Leuven
 

La actualidad más candente (20)

Information retrieval 9 tf idf weights
Information retrieval 9 tf idf weightsInformation retrieval 9 tf idf weights
Information retrieval 9 tf idf weights
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
 
NLP Project Presentation
NLP Project PresentationNLP Project Presentation
NLP Project Presentation
 
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word Embeddings
 
NLTK
NLTKNLTK
NLTK
 
Spell checker using Natural language processing
Spell checker using Natural language processing Spell checker using Natural language processing
Spell checker using Natural language processing
 
Hidden Markov Models with applications to speech recognition
Hidden Markov Models with applications to speech recognitionHidden Markov Models with applications to speech recognition
Hidden Markov Models with applications to speech recognition
 
bag-of-words models
bag-of-words models bag-of-words models
bag-of-words models
 
Text clustering
Text clusteringText clustering
Text clustering
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updated
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Topic Modeling
Topic ModelingTopic Modeling
Topic Modeling
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Tutorial on Question Answering Systems
Tutorial on Question Answering Systems Tutorial on Question Answering Systems
Tutorial on Question Answering Systems
 
Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics
 

Similar a Relation Extraction

Chapter 1 Introduction to ISR (1).pdf
Chapter 1 Introduction to ISR (1).pdfChapter 1 Introduction to ISR (1).pdf
Chapter 1 Introduction to ISR (1).pdf
JemalNesre1
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data system
Trey Grainger
 
Salford uni pres 2011
Salford uni pres 2011Salford uni pres 2011
Salford uni pres 2011
oseamons
 
Salford uni pres 2011
Salford uni pres 2011Salford uni pres 2011
Salford uni pres 2011
oseamons
 
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Andre Freitas
 

Similar a Relation Extraction (20)

Epistemic networks for Epistemic Commitments
Epistemic networks for Epistemic CommitmentsEpistemic networks for Epistemic Commitments
Epistemic networks for Epistemic Commitments
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: Summarization
 
Evolution of Search
Evolution of SearchEvolution of Search
Evolution of Search
 
Answer Extraction for how and why Questions in Question Answering Systems
Answer Extraction for how and why Questions in Question Answering SystemsAnswer Extraction for how and why Questions in Question Answering Systems
Answer Extraction for how and why Questions in Question Answering Systems
 
Crowdsourcing the Quality of Knowledge Graphs: A DBpedia Study
Crowdsourcing the Quality of Knowledge Graphs:A DBpedia StudyCrowdsourcing the Quality of Knowledge Graphs:A DBpedia Study
Crowdsourcing the Quality of Knowledge Graphs: A DBpedia Study
 
Session 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectSession 01 designing and scoping a data science project
Session 01 designing and scoping a data science project
 
Session 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectSession 01 designing and scoping a data science project
Session 01 designing and scoping a data science project
 
Why Watson Won: A cognitive perspective
Why Watson Won: A cognitive perspectiveWhy Watson Won: A cognitive perspective
Why Watson Won: A cognitive perspective
 
Chapter 1 Introduction to ISR (1).pdf
Chapter 1 Introduction to ISR (1).pdfChapter 1 Introduction to ISR (1).pdf
Chapter 1 Introduction to ISR (1).pdf
 
A Knowledge Discovery Framework for Planetary Defense
A Knowledge Discovery Framework for Planetary DefenseA Knowledge Discovery Framework for Planetary Defense
A Knowledge Discovery Framework for Planetary Defense
 
Reflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data systemReflected Intelligence: Lucene/Solr as a self-learning data system
Reflected Intelligence: Lucene/Solr as a self-learning data system
 
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
 
Data Science Workshop - day 1
Data Science Workshop - day 1Data Science Workshop - day 1
Data Science Workshop - day 1
 
Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)
 
Salford uni pres 2011
Salford uni pres 2011Salford uni pres 2011
Salford uni pres 2011
 
Salford uni pres 2011
Salford uni pres 2011Salford uni pres 2011
Salford uni pres 2011
 
Intro to nlp
Intro to nlpIntro to nlp
Intro to nlp
 
Relation Extraction from the Web using Distant Supervision
Relation Extraction from the Web using Distant SupervisionRelation Extraction from the Web using Distant Supervision
Relation Extraction from the Web using Distant Supervision
 
Techniques For Deep Query Understanding
Techniques For Deep Query UnderstandingTechniques For Deep Query Understanding
Techniques For Deep Query Understanding
 
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
 

Más de Marina Santini

Más de Marina Santini (20)

Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology Applications
 
A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-
A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-
A Web Corpus for eCare: Collection, Lay Annotation and Learning -First Results-
 
An Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability FeaturesAn Exploratory Study on Genre Classification using Readability Features
An Exploratory Study on Genre Classification using Readability Features
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word Clouds
 
Lecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic WebLecture: Ontologies and the Semantic Web
Lecture: Ontologies and the Semantic Web
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question Answering
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
 
Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)Lecture: Vector Semantics (aka Distributional Semantics)
Lecture: Vector Semantics (aka Distributional Semantics)
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense Disambiguation
 
Lecture: Word Senses
Lecture: Word SensesLecture: Word Senses
Lecture: Word Senses
 
Sentiment Analysis
Sentiment AnalysisSentiment Analysis
Sentiment Analysis
 
Semantic Role Labeling
Semantic Role LabelingSemantic Role Labeling
Semantic Role Labeling
 
Semantics and Computational Semantics
Semantics and Computational SemanticsSemantics and Computational Semantics
Semantics and Computational Semantics
 
Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)Lecture 9: Machine Learning in Practice (2)
Lecture 9: Machine Learning in Practice (2)
 
Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1) Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1)
 
Lecture 5: Interval Estimation
Lecture 5: Interval Estimation Lecture 5: Interval Estimation
Lecture 5: Interval Estimation
 
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
 
Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)
 
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & EvaluationLecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
Lecture 3: Basic Concepts of Machine Learning - Induction & Evaluation
 

Último

Último (20)

FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptx
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 

Relation Extraction

  • 1. Seman&c  Analysis  in  Language  Technology   http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm 
 
 Relation Extraction Marina  San(ni   san$nim@stp.lingfil.uu.se     Department  of  Linguis(cs  and  Philology   Uppsala  University,  Uppsala,  Sweden     Spring  2016      
  • 2. Previous  Lecture:  Ques$on  Answering   2  
  • 3. Ques$on  Answering  systems   •  Factoid  ques(ons:       •  Google   •  Wolfram   •  Ask  Jeeves   •  Start     •  ….   3   •  Approaches:       •  IR-­‐based   •  Knowelege  based   •  Hybrid  
  • 4. Katz  et  al.  (2006)   hFp://start.csail.mit.edu/publica$ons/FLAIRS0601KatzB.pdf     •  START  answers  natural  language  ques(ons  by  presen(ng  components  of  text  and   mul(-­‐media  informa(on  drawn  from  a  set  of  informa(on  resources  that  are   hosted  locally  or  accessed  remotely  through  the  Internet.     •  START  targets  high  precision  in  its  ques(on  answering.     •  The  START  system  analyzes  English  text  and  produces  a  knowledge  base  which   incorporates,  in  the  form  of  nested  ternary  expressions  (=triples),  the   informa(on  found  in  the  text.   4  
  • 5. Is  it    true?:  hFp://uncyclopedia.wikia.com/wiki/Ask_Jeeves   •  Ask  Jeeves,  more  correctly  known  as  Ask.com,  is  a  search  engine  founded   in  1996  in  California.     •  Ini(ally  it  represented  a  stereotypical  English  butler  who  would  "fetch"   the  answer  to  any  ques(on  asked.   •  Ask.com  is  now  considered  one  of  the  great  failures  of  the  internet.  The   ques(on  and  answer  feature  simply  didn't  work  as  well  as  hoped,  and   a^er  trying  his  hand  at  being  both  a  tradi(onal  search  engine  and  a   terrible  kind  of  "ar(ficial  AI"  with  a  bald  spot,     •  These  days  Jeeves  is  ranked  as  the  4th  most  successful  search  engine  on   the  web,  and  the  4th  most  successful  overall.  This  seems  impressive  un$l   you  consider  that  Google  holds  the  top  spot  with  95%  of  the  market.  It   has  even  fallen  behind  Bing;  enough  said.   5  
  • 6. Search  engines  that  can  be  used  as  QA  systems   •  Yahoo   •  Bing   6  
  • 7. Siri   hFp://en.wikipedia.org/wiki/Siri     •  Siri  /ˈsɪri/  is  an  intelligent  personal  assistant  and  knowledge  navigator  which  works  as  an   applica(on  for  Apple  Inc.'s  iOS.   •   The  applica(on  uses  a  natural  language  user  interface  to  answer  ques$ons,  make   recommenda(ons,  and  perform  ac(ons  by  delega$ng  requests  to  a  set  of  Web  services.     •  The  so^ware,  both  in  its  original  version  and  as  an  iOS  applica(on,  adapts  to  the  user's   individual  language  usage  and  individual  searches  (preferences)  with  con(nuing  use,  and   returns  results  that  are  individualized.     •  The  name  Siri  is  Scandinavian,  a  short  form  of  the  Norse  name  Sigrid  meaning  "beauty"  and   "victory",  and  comes  from  the  intended  name  for  the  original  developer's  first  child.   7  
  • 8. ChaFerbots   •  Siri…  conversa(onal  ”safety  net”.   •  Conversa(onal  agents  (chaker  bots,   and  personal  assistants)     àcustomer  care,  customer  analy(cs   (replacing/integra(ng  FAQs  and  help   desk)   8   Avatar: a picture of a person or animal that represents you on a computer screen, for example in some chat rooms or when you are playing games over the Internet
  • 9. Eliza   hFp://en.wikipedia.org/wiki/ELIZA   ELIZA  was  wriFen  at  MIT  by  Joseph  Weizenbaum  between  1964  and  1966     9  
  • 10. General  IR  architecture  for  factoid  ques$ons   10   Document DocumentDocument Docume ntDocume ntDocume ntDocume ntDocume nt Question Processing Passage Retrieval Query Formulation Answer Type Detection Question Passage Retrieval Document Retrieval Answer Processing Answer passages Indexing Relevant Docs DocumentDocument Document
  • 11. Things  to  extract  from  the  ques$on   •  Answer  Type  Detec(on   •  Decide  the  named  en$ty  type  (person,  place)   of  the  answer   •  Query  Formula(on   •  Choose  query  keywords  for  the  IR  system   •  Ques(on  Type  classifica(on   •  Is  this  a  defini(on  ques(on,  a  math  ques(on,  a   list  ques(on?   •  Focus  Detec(on   •  Find  the  ques(on  words  that  are  replaced  by   the  answer   •  Rela(on  Extrac(on   •  Find  rela(ons  between  en((es  in  the  ques(on  11  
  • 12. 12   Common  Evalua$on  Metrics   1. Accuracy  (does  answer  match  gold-­‐labeled  answer?)   2. Mean  Reciprocal  Rank:     •  The  reciprocal  rank  of  a  query  response  is  the  inverse  of  the  rank  of  the   first  correct  answer.     •  The  mean  reciprocal  rank  is  the  average  of  the  reciprocal  ranks  of   results  for  a  sample  of  queries  Q   MRR = 1 rankii=1 N ∑ N =  
  • 13. Common  Evalua$on  Metrics:  MRR   •  The  mean  reciprocal  rank  is  the  average  of  the  reciprocal  ranks   of  results  for  a  sample  of  queries  Q.   •  (ex  adapted  from  Wikipedia)   •  3  ranked  answers  for  a  query,  with  the  first  one  being  the  one  it  thinks  is   most  likely  correct     •  Given  those  3  samples,  we  could  calculate  the  mean  reciprocal  rank  as   (1/3  +  1/2  +  1)/3  =  0.61.   13  
  • 14. Complex  ques$ons:  “What  is  the  ‘hajii’”?   •  The  (bokom-­‐up)  snippet  method   •  Find  a  set  of  relevant  documents   •  Extract  informa(ve  sentences  from  the  documents  (using  p-­‐idf,  MMR)   •  Order  and  modify  the  sentences  into  an  answer   •  The  (top-­‐down)  informa(on  extrac(on  method   •  build  specific  answerers  for  different  ques(on  types:   •  defini(on  ques(ons,   •  biography  ques(ons,     •  certain  medical  ques(ons  
  • 15. Informa$on  that  should  be  in  the  answer   for  3  kinds  of  ques$ons  
  • 16. Document Retrieval 11 Web documents 1127 total sentences Predicate Identification Data-Driven Analysis 383 Non-Specific Definitional sentences Sentence clusters, Importance ordering Definition Creation 9 Genus-Species Sentences The Hajj, or pilgrimage to Makkah (Mecca), is the central duty of Islam. The Hajj is a milestone event in a Muslim's life. The hajj is one of five pillars that make up the foundation of Islam. ... The Hajj, or pilgrimage to Makkah [Mecca], is the central duty of Islam. More than two million Muslims are expected to take the Hajj this year. Muslims must perform the hajj at least once in their lifetime if physically and financially able. The Hajj is a milestone event in a Muslim's life. The annual hajj begins in the twelfth month of the Islamic year (which is lunar, not solar, so that hajj and Ramadan fall sometimes in summer, sometimes in winter). The Hajj is a week-long pilgrimage that begins in the 12th month of the Islamic lunar calendar. Another ceremony, which was not connected with the rites of the Ka'ba before the rise of Islam, is the Hajj, the annual pilgrimage to 'Arafat, about two miles east of Mecca, toward Mina… "What is the Hajj?" (Ndocs=20, Len=8) Architecture  for  complex  ques$on  answering:     defini$on  ques$ons   S.  Blair-­‐Goldensohn,  K.  McKeown  and  A.  Schlaikjer.  2004.   Answering  Defini(on  Ques(ons:  A  Hyrbid  Approach.    
  • 17. State-­‐of-­‐the-­‐art:  ex   •  Top  downMing  Tan,  Cicero  dos  Santos,  Bing  Xiang  &  Bowen   Zhou.  2015.  LSTM-­‐Based  Deep  Learning  Models  for  non  factoid   Answer  Selec(on.   •  Di  Wang  and  Eric  Nyberg.  2015.  A  Long  Short-­‐Term  Memory   Model  for  Answer  Sentence  Selec(on  in  Ques(on  Answering.  In   ACL  2015.s   •  Minwei  Feng,  Bing  Xiang,  Michael  R.  Glass,  Lidan  Wang,  Bowen   Zhou.  2015.  Applying  deep  learning  to  answer  selec(on:  A  study   and  an  open  task.     17   Deep  Learning  is  a  new  area  of  Machine  Learning   research.  Said  to  be  very  promising.  It  is  about  learning   mul(ple  levels  of  representa(on  and  abstrac(on  that   help  to  make  sense  of  data  such  as  images,  sound,  and   text.  It  is  based  on  neural  networks.    
  • 18. Prac$cal  ac$vity   •  Start  seems  to  be  limited,  but  it  understands  natural  language   •  Google  (presumably  helped  by  Knowledge  Graph)  is  more   accurate,  but  skips  natural  language  (uses  keywords).     •  Google  is  customized  to  the  users’  preferences  (different  results)   •  Interes(ng  outcomes   •  Currency  vs.  Coin   •  What’s  love?   •  Lyric/song  vs.  Defini(on  ques(on   18  
  • 19. What’s  the  meaning  of  life?   •  Google   19   Presumably  from  Knowledge  Graph…  
  • 20. Start  and  the  42  puzzle   •  gg   20  
  • 21. End  of  previous  lecture   21  
  • 22. Acknowledgements Most  slides  borrowed  or  adapted  from:   Dan  Jurafsky  and  Christopher  Manning,  Coursera   Dan  Jurafsky  and  James  H.  Mar(n  (2015)         J&M(2015,  dra^):  hkps://web.stanford.edu/~jurafsky/slp3/              
  • 24. Extrac$ng  rela$ons  from  text   •  Company  report:  “Interna(onal  Business  Machines  Corpora(on  (IBM  or  the   company)  was  incorporated  in  the  State  of  New  York  on  June  16,  1911,  as  the   Compu(ng-­‐Tabula(ng-­‐Recording  Co.  (C-­‐T-­‐R)…”   •  Extracted  Complex  Rela(on:   Company-­‐Founding      Company    IBM      Loca(on      New  York      Date      June  16,  1911      Original-­‐Name      Compu(ng-­‐Tabula(ng-­‐Recording  Co.   •  But  we  will  focus  on  the  simpler  task  of  extrac(ng  rela(on  triples   Founding-­‐year(IBM,1911)   Founding-­‐loca(on(IBM,New  York)  24  
  • 25. Extrac$ng  Rela$on  Triples  from  Text    The  Leland  Stanford  Junior  University,   commonly  referred  to  as  Stanford   University  or  Stanford,  is  an  American   private  research  university  located  in   Stanford,  California  …  near  Palo  Alto,   California…  Leland  Stanford…founded   the  university  in  1891   Stanford EQ Leland Stanford Junior University Stanford LOC-IN California Stanford IS-A research university Stanford LOC-NEAR Palo Alto Stanford FOUNDED-IN 1891 Stanford FOUNDER Leland Stanford25  
• 26. Why Relation Extraction?
• Create new structured knowledge bases, useful for any app
• Augment current knowledge bases
• Adding words to the WordNet thesaurus, facts to Freebase or DBpedia
• Support question answering
• The granddaughter of which actor starred in the movie "E.T."?
(acted-in ?x "E.T.") (is-a ?y actor) (granddaughter-of ?x ?y)
• But which relations should we extract?
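To make the conjunctive triple query above concrete, here is a minimal, hypothetical sketch (not from the slides) of answering it by brute-force matching over a toy fact list; the facts and names are invented for illustration:

# Minimal sketch: answering a conjunctive triple query over a toy
# knowledge base by brute-force matching. The facts are invented.
FACTS = [
    ("acted-in", "Drew Barrymore", "E.T."),
    ("is-a", "John Barrymore", "actor"),
    ("granddaughter-of", "Drew Barrymore", "John Barrymore"),
]

def match(pattern, fact, bindings):
    """Try to unify one query pattern with one fact."""
    new = dict(bindings)
    for p, f in zip(pattern, fact):
        if p.startswith("?"):              # variable slot
            if p in new and new[p] != f:
                return None
            new[p] = f
        elif p != f:                       # constant must match exactly
            return None
    return new

def answer(query):
    """Enumerate variable bindings satisfying all query patterns."""
    results = [{}]
    for pattern in query:
        results = [b2 for b in results for fact in FACTS
                   if (b2 := match(pattern, fact, b)) is not None]
    return results

query = [("acted-in", "?x", "E.T."),
         ("is-a", "?y", "actor"),
         ("granddaughter-of", "?x", "?y")]
print(answer(query))   # [{'?x': 'Drew Barrymore', '?y': 'John Barrymore'}]

A real QA system would run such queries against a knowledge base like Freebase or DBpedia rather than a Python list, but the unification logic is the same idea.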
• 27. Automated Content Extraction (ACE)
Automatic Content Extraction (ACE) is a research program for developing advanced information extraction technologies. Given a text in natural language, the ACE challenge is to detect:
• entities
• relations between entities
• events
The ACE "Relation Extraction Task" defines six major relation types with subtypes:
• PHYSICAL: Located, Near
• PART-WHOLE: Geographical, Subsidiary
• PERSON-SOCIAL: Business, Family, Lasting Personal
• ORG AFFILIATION: Founder, Employment, Membership, Ownership, Student-Alum, Investor, Sports-Affiliation
• GENERAL AFFILIATION: Citizen-Resident-Ethnicity-Religion, Org-Location-Origin
• ARTIFACT: User-Owner-Inventor-Manufacturer
• 28. Automated Content Extraction (ACE)
• Physical-Located (PER-GPE): He was in Tennessee
• Part-Whole-Subsidiary (ORG-ORG): XYZ, the parent company of ABC
• Person-Social-Family (PER-PER): John's wife Yoko
• Org-AFF-Founder (PER-ORG): Steve Jobs, co-founder of Apple…
• 29. UMLS: Unified Medical Language System
• 134 entity types, 54 relations
Injury disrupts Physiological Function
Bodily Location location-of Biologic Function
Anatomical Structure part-of Organism
Pharmacologic Substance causes Pathological Function
Pharmacologic Substance treats Pathologic Function
• 30. Extracting UMLS relations from a sentence
Doppler echocardiography can be used to diagnose left anterior descending artery stenosis in patients with type 2 diabetes
→ Echocardiography, Doppler DIAGNOSES Acquired stenosis
• 31. Databases of Wikipedia Relations
Relations extracted from the Wikipedia Infobox:
Stanford state California
Stanford motto "Die Luft der Freiheit weht"
…
• 32. Relation databases that draw from Wikipedia
• Resource Description Framework (RDF) triples: subject predicate object
Golden Gate Park location San Francisco
dbpedia:Golden_Gate_Park dbpedia-owl:location dbpedia:San_Francisco
• DBpedia is a project aiming to extract structured content from the information created as part of the Wikipedia project. It uses RDF to represent the extracted information and consists of 3 billion RDF triples: 580 million extracted from the English edition of Wikipedia and 2.46 billion from other language editions (Wikipedia, March 2016).
• Freebase was a large collaborative knowledge base consisting of data composed mainly by its community members (cf. Semantic Web); its content has since been absorbed into the Knowledge Graph: https://en.wikipedia.org/wiki/Freebase
• Frequent Freebase relations:
people/person/nationality, location/location/contains
people/person/profession, people/person/place-of-birth
biology/organism_higher_classification, film/film/genre
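As a hedged illustration of what such RDF triples look like in practice, the following sketch retrieves the Golden Gate Park triple from DBpedia's public SPARQL endpoint; the endpoint URL and the dbpedia-owl (dbo) property IRI follow DBpedia's documentation but may change over time:

# Minimal sketch: querying DBpedia's public SPARQL endpoint for the
# triple shown above. Endpoint URL and property IRI are assumptions
# based on DBpedia's documentation, not prescribed by the slides.
import requests

QUERY = """
SELECT ?place WHERE {
  <http://dbpedia.org/resource/Golden_Gate_Park>
      <http://dbpedia.org/ontology/location> ?place .
}
"""

resp = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": QUERY, "format": "application/sparql-results+json"},
    timeout=30,
)
for row in resp.json()["results"]["bindings"]:
    print(row["place"]["value"])   # e.g. http://dbpedia.org/resource/San_Francisco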
• 33. How to build relation extractors
1. Hand-written patterns
2. Supervised machine learning
3. Semi-supervised and unsupervised
• Bootstrapping (using seeds)
• Distant supervision
• Unsupervised learning from the web
• 34. Relation Extraction: Using patterns to extract relations
• 35. Rules for extracting the IS-A relation
Early intuition from Hearst (1992):
• "Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use"
• What does Gelidium mean?
• How do you know?
• 37. Hearst's Patterns for extracting IS-A relations
(Hearst, 1992): Automatic Acquisition of Hyponyms
"Y such as X ((, X)* (, and|or) X)"
"such Y as X"
"X or other Y"
"X and other Y"
"Y including X"
"Y, especially X"
• 38. Hearst's Patterns for extracting IS-A relations
Hearst pattern: Example occurrences
X and other Y: ...temples, treasuries, and other important civic buildings.
X or other Y: Bruises, wounds, broken bones or other injuries...
Y such as X: The bow lute, such as the Bambara ndang...
Such Y as X: ...such authors as Herrick, Goldsmith, and Shakespeare.
Y including X: ...common-law countries, including Canada and England...
Y, especially X: European countries, especially France, England, and Spain...
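A minimal sketch of how one such pattern can be implemented; the crude one- or two-word noun-phrase regex is an assumption standing in for real NP chunking, so it will both miss and spuriously match on real text:

# Minimal sketch: extracting IS-A (hyponym, hypernym) pairs with one
# Hearst pattern, "Y such as X". The NP regex is a crude stand-in
# for real noun-phrase chunking.
import re

NP = r"[A-Za-z][A-Za-z-]*(?: [A-Za-z][A-Za-z-]*)?"   # 1-2 word "NP"
SUCH_AS = re.compile(rf"({NP}),? such as ({NP})")

def is_a_pairs(text):
    """Return (hyponym, hypernym) pairs matched by 'Y such as X'."""
    return [(m.group(2), m.group(1)) for m in SUCH_AS.finditer(text)]

sentence = ("Agar is a substance prepared from a mixture of red algae, "
            "such as Gelidium, for laboratory or industrial use")
print(is_a_pairs(sentence))
# [('Gelidium', 'red algae')], i.e. Gelidium IS-A red algae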
• 39. Hand-built patterns for relations
• Plus:
• Human patterns tend to be high-precision
• Can be tailored to specific domains
• Minus:
• Human patterns are often low-recall
• A lot of work to think of all possible patterns!
• Don't want to have to do this for every relation!
• We'd like better accuracy
• 41. Supervised machine learning for relations
• Choose a set of relations we'd like to extract
• Choose a set of relevant named entities
• Find and label data
• Choose a representative corpus
• Label the named entities in the corpus
• Hand-label the relations between these entities
• Break into training, development, and test sets
• Train a classifier on the training set
• 42. How to do classification in supervised relation extraction
1. Find all pairs of named entities (usually in the same sentence)
2. Decide if the two entities are related
3. If yes, classify the relation
• Why the extra step?
• Faster classification training, by eliminating most pairs early
• Can use distinct feature sets appropriate for each task
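A minimal sketch of this two-step setup on toy data; scikit-learn and the tiny feature dictionaries are assumed choices for illustration, not something the slides prescribe:

# Minimal sketch of two-step supervised relation extraction:
# (1) a binary "related?" filter over all candidate pairs,
# (2) a relation-type classifier trained only on related pairs.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy hand-labeled candidate pairs (invented data). "NO-RELATION"
# marks entity pairs that co-occur in a sentence but are unrelated.
pairs = [
    ({"head1": "Airlines", "head2": "AMR", "between": "a unit of"}, "part-of"),
    ({"head1": "Jobs", "head2": "Apple", "between": "co-founder of"}, "founder"),
    ({"head1": "Airlines", "head2": "Wagner", "between": "spokesman"},
     "NO-RELATION"),
]
X = [feats for feats, _ in pairs]
y = [label for _, label in pairs]

# Step 2: binary detector trained on ALL pairs, to discard most fast.
detector = make_pipeline(DictVectorizer(), LogisticRegression())
detector.fit(X, [label != "NO-RELATION" for label in y])

# Step 3: relation classifier trained only on the related pairs, so
# its feature set can be tuned to distinguishing relation types.
rel_X = [feats for feats, label in pairs if label != "NO-RELATION"]
rel_y = [label for label in y if label != "NO-RELATION"]
classifier = make_pipeline(DictVectorizer(), LogisticRegression())
classifier.fit(rel_X, rel_y)

print(classifier.predict([{"head1": "Gates", "head2": "Microsoft",
                           "between": "co-founder of"}]))  # expected: ['founder']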
• 43. Word Features for Relation Extraction
Example sentence: American Airlines (Mention 1), a unit of AMR, immediately matched the move, spokesman Tim Wagner (Mention 2) said
• Headwords of M1 and M2, and their combination: Airlines, Wagner, Airlines-Wagner
• Bag of words and bigrams in M1 and M2: {American, Airlines, Tim, Wagner, American Airlines, Tim Wagner}
• Words or bigrams in particular positions left and right of M1/M2: M2 -1: spokesman; M2 +1: said
• Bag of words or bigrams between the two entities: {a, AMR, of, immediately, matched, move, spokesman, the, unit}
• 44. Named Entity Type and Mention Level Features for Relation Extraction
Example sentence: American Airlines (Mention 1), a unit of AMR, immediately matched the move, spokesman Tim Wagner (Mention 2) said
• Named-entity types: M1: ORG; M2: PERSON
• Concatenation of the two named-entity types: ORG-PERSON
• Entity level of M1 and M2 (NAME, NOMINAL, PRONOUN):
M1: NAME [it or he would be PRONOUN]
M2: NAME [the company would be NOMINAL]
• 45. Parse Features for Relation Extraction
Example sentence: American Airlines (Mention 1), a unit of AMR, immediately matched the move, spokesman Tim Wagner (Mention 2) said
• Base syntactic chunk sequence from one mention to the other: NP NP PP VP NP NP
• Constituent path through the tree from one to the other: NP ↑ NP ↑ S ↑ S ↓ NP
• Dependency path: Airlines matched Wagner said
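Pulling the word, positional, and entity-type features of the last three slides together, a hedged sketch of what a feature function might look like; the span encoding and tokenization are assumptions for illustration, and a real system would get mentions, types, and parses from an NER tagger and a parser:

# Minimal sketch of a feature function for one entity-mention pair.
# m1/m2 are (start, end) token spans; type1/type2 are NE types.
def relation_features(tokens, m1, m2, type1, type2):
    feats = {
        "head1": tokens[m1[1] - 1],                    # headword of M1
        "head2": tokens[m2[1] - 1],                    # headword of M2
        "heads": tokens[m1[1] - 1] + "-" + tokens[m2[1] - 1],
        "m2_left": tokens[m2[0] - 1],                  # word at M2 -1
        "m2_right": tokens[m2[1]] if m2[1] < len(tokens) else "<END>",
        "ne_types": type1 + "-" + type2,               # e.g. ORG-PERSON
    }
    for w in tokens[m1[1]:m2[0]]:                      # bag of words between
        feats["between=" + w] = 1
    return feats

tokens = ("American Airlines , a unit of AMR , immediately matched "
          "the move , spokesman Tim Wagner said").split()
print(relation_features(tokens, (0, 2), (14, 16), "ORG", "PERSON"))
# yields head1=Airlines, head2=Wagner, m2_left=spokesman, m2_right=said, ...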
• 46. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
• 47. Classifiers for supervised methods
• Now you can use any classifier you like:
• MaxEnt
• Naïve Bayes
• SVM
• ...
• Train it on the training set, tune on the dev set, test on the test set
• 48. Evaluation of Supervised Relation Extraction
• Compute P/R/F1 for each relation:
P = (# of correctly extracted relations) / (total # of extracted relations)
R = (# of correctly extracted relations) / (total # of gold relations)
F1 = 2PR / (P + R)
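A small sketch of these formulas in code, treating relations as hashable triples; the example triples are toy data:

# Minimal sketch: scoring extracted relation triples against a gold set.
def prf1(extracted, gold):
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {("IBM", "founding-year", "1911"),
        ("IBM", "founding-location", "New York")}
pred = {("IBM", "founding-year", "1911"),
        ("IBM", "founder", "Watson")}           # one right, one wrong
print(prf1(pred, gold))                          # (0.5, 0.5, 0.5)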
• 49. Summary: Supervised Relation Extraction
+ Can get high accuracies with enough hand-labeled training data, if the test set is similar enough to the training set
- Labeling a large training set is expensive
- Supervised models are brittle; they don't generalize well to different genres
• 51. Seed-based or bootstrapping approaches to relation extraction
• No training set? Maybe you have:
• A few seed tuples, or
• A few high-precision patterns
• Can you use those seeds to do something useful?
• Bootstrapping: use the seeds to directly learn to populate a relation
Roughly speaking: use seeds to initialize a process of annotation, then refine through iterations
• 52. Relation Bootstrapping (Hearst 1992)
• Gather a set of seed pairs that have relation R
• Iterate:
1. Find sentences with these pairs
2. Look at the context between or around the pair and generalize the context to create patterns
3. Use the patterns to grep for more pairs
• 53. Bootstrapping
• <Mark Twain, Elmira>: seed tuple
• Grep (google) for the environments of the seed tuple:
"Mark Twain is buried in Elmira, NY." → X is buried in Y
"The grave of Mark Twain is in Elmira" → The grave of X is in Y
"Elmira is Mark Twain's final resting place" → Y is X's final resting place
• Use those patterns to grep for new tuples
• Iterate
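A toy sketch of the loop: generalize the seed's contexts into regular-expression patterns, then grep a (here, two-sentence, partly invented) corpus for new tuples; a real system would query the web and score patterns for precision before trusting them:

# Minimal sketch of relation bootstrapping: seed pair -> contexts ->
# patterns -> new pairs. find_sentences() stands in for a web search.
import re

def find_sentences(corpus, x, y):
    """Sentences mentioning both members of a seed pair."""
    return [s for s in corpus if x in s and y in s]

def make_pattern(sentence, x, y):
    """Generalize a context by replacing the seed pair with slots."""
    pat = re.escape(sentence)
    return pat.replace(re.escape(x), "(?P<X>.+?)") \
              .replace(re.escape(y), "(?P<Y>.+?)")

corpus = [
    "Mark Twain is buried in Elmira.",        # seed context
    "Emily Dickinson is buried in Amherst.",  # invented extra sentence
]
seeds = {("Mark Twain", "Elmira")}

patterns = {make_pattern(s, x, y)
            for (x, y) in seeds for s in find_sentences(corpus, x, y)}
for pat in patterns:                          # grep for new tuples
    for s in corpus:
        m = re.fullmatch(pat, s)
        if m:
            seeds.add((m.group("X"), m.group("Y")))
print(seeds)   # now also contains ('Emily Dickinson', 'Amherst')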
• 54. DIPRE: Extract <author, book> pairs
Brin, Sergey. 1998. Extracting Patterns and Relations from the World Wide Web.
• Start with 5 seeds:
Isaac Asimov: The Robots of Dawn
David Brin: Startide Rising
James Gleick: Chaos: Making a New Science
Charles Dickens: Great Expectations
William Shakespeare: The Comedy of Errors
• Find instances:
The Comedy of Errors, by William Shakespeare, was
The Comedy of Errors, by William Shakespeare, is
The Comedy of Errors, one of William Shakespeare's earliest attempts
The Comedy of Errors, one of William Shakespeare's most
• Extract patterns (group by middle, take longest common prefix/suffix):
?x , by ?y ,
?x , one of ?y 's
• Now iterate, finding new seeds that match the pattern
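A toy sketch of the "group by middle, take longest common prefix/suffix" step; the (before, middle, after) splitting is a simplification of what the DIPRE paper does, and the URL handling of the real system is omitted. Note the directions: we keep the longest common suffix of the text before ?x, and the longest common prefix of the text after ?y:

# Minimal sketch of DIPRE-style pattern generalization over the two
# ", by" sightings of the seed shown on the slide. Toy strings only.
import os

def longest_common_suffix(strings):
    """Mirror of os.path.commonprefix, applied to reversed strings."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

# (before ?x, middle, after ?y) contexts of the seed occurrences.
occurrences = [
    ("", " , by ", " , was"),
    ("", " , by ", " , is"),
]

for middle in {m for _, m, _ in occurrences}:   # group by middle
    group = [(pre, suf) for pre, m, suf in occurrences if m == middle]
    before = longest_common_suffix([pre for pre, _ in group])
    after = os.path.commonprefix([suf for _, suf in group])
    print(f"{before}?x{middle}?y{after}")        # -> "?x , by ?y , "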
• 55. Distant Supervision
• Combine bootstrapping with supervised learning
• Instead of 5 seeds:
• Use a large database to get a huge # of seed examples
• Create lots of features from all these examples
• Combine them in a supervised classifier
Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17.
Fei Wu and Daniel S. Weld. 2007. Autonomously Semantifying Wikipedia. CIKM 2007.
Mintz, Bills, Snow, Jurafsky. 2009. Distant supervision for relation extraction without labeled data. ACL 2009.
• 56. Distant supervision paradigm
• Like supervised classification:
• Uses a classifier with lots of features
• Supervised by detailed hand-created knowledge
• Doesn't require iteratively expanding patterns
• Like unsupervised classification:
• Uses very large amounts of unlabeled data
• Not sensitive to genre issues in the training corpus
• 57. Distantly supervised learning of relation extraction patterns
1. For each relation (e.g. Born-In)
2. For each tuple in a big database: <Edwin Hubble, Marshfield>, <Albert Einstein, Ulm>
3. Find sentences in a large corpus with both entities: "Hubble was born in Marshfield", "Einstein, born (1879), Ulm", "Hubble's birthplace in Marshfield"
4. Extract frequent features (parse, words, etc.): PER was born in LOC; PER, born (XXXX), LOC; PER's birthplace in LOC
5. Train a supervised classifier using thousands of patterns: P(born-in | f1, f2, f3, …, f70000)
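A minimal sketch of steps 2-4, constructing noisy training examples by pairing database tuples with corpus sentences; the toy corpus and the crude surname matching are assumptions for illustration:

# Minimal sketch of distant-supervision data construction: any
# sentence containing both entities of a KB tuple becomes a (noisy)
# positive example for that tuple's relation. Toy data throughout.
KB = {("Edwin Hubble", "Marshfield"): "born-in",
      ("Albert Einstein", "Ulm"): "born-in"}

corpus = [
    "Hubble was born in Marshfield, Missouri.",
    "Einstein, born (1879), Ulm, was a physicist.",
    "Einstein visited Marshfield once.",           # invented distractor
]

training = []
for (per, loc), relation in KB.items():
    surname = per.split()[-1]                      # crude entity matching
    for sent in corpus:
        if surname in sent and loc in sent:
            # Replace the arguments with type slots to get a feature.
            feat = sent.replace(surname, "PER").replace(loc, "LOC")
            training.append((feat, relation))

for feat, rel in training:
    print(rel, "<-", feat)
# These noisy examples then feed a standard supervised classifier.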
• 58. Unsupervised relation extraction
M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open information extraction from the web. IJCAI.
• Open Information Extraction: extract relations from the web with no training data and no predefined list of relations
1. Use parsed data to train a "trustworthy tuple" classifier
2. In a single pass, extract all relations between NPs; keep those judged trustworthy
3. An assessor ranks relations based on text redundancy
(FCI, specializes in, software development)
(Tesla, invented, coil transformer)
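In the same spirit, a toy sketch that extracts candidate (NP, relation phrase, NP) tuples between adjacent noun chunks; spaCy and its en_core_web_sm model are assumed to be installed, and the trustworthiness classifier and redundancy-based assessor of the real systems are omitted:

# Minimal open-IE-flavored sketch: keep the verb-centered span
# between each pair of adjacent noun chunks as a candidate relation.
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes this model is installed

def open_ie_candidates(text):
    doc = nlp(text)
    chunks = list(doc.noun_chunks)
    triples = []
    for left, right in zip(chunks, chunks[1:]):
        between = doc[left.end:right.start]
        if any(t.pos_ in ("VERB", "AUX") for t in between):
            triples.append((left.text, between.text.strip(), right.text))
    return triples

print(open_ie_candidates("Tesla invented the coil transformer."))
# [('Tesla', 'invented', 'the coil transformer')]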
• 59. Evaluation of Semi-supervised and Unsupervised Relation Extraction
• Since it extracts totally new relations from the web, there is no gold set of correct instances of relations!
• Can't compute precision (we don't know which extractions are correct)
• Can't compute recall (we don't know which relations were missed)
• Instead, we can approximate precision (only):
• Draw a random sample of relations from the output and check precision manually:
P̂ = (# of correctly extracted relations in the sample) / (total # of extracted relations in the sample)
• Can also compute precision at different levels of recall:
• Precision for the top 1,000 new relations, the top 10,000, the top 100,000
• In each case taking a random sample of that set
• But there is no way to evaluate recall
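A small sketch of the sampled-precision estimate; is_correct stands in for the manual human judgement, and the commented usage names (extracted_relations, confidence, manual_judgement) are placeholders, not real APIs:

# Minimal sketch: estimate precision from a random sample of the output.
import random

def estimated_precision(extracted, sample_size, is_correct):
    """Fraction of a random sample judged correct by a human."""
    sample = random.sample(extracted, min(sample_size, len(extracted)))
    return sum(is_correct(r) for r in sample) / len(sample)

# Toy usage: precision of the top 1,000 relations by confidence.
# p_at_1000 = estimated_precision(
#     sorted(extracted_relations, key=confidence, reverse=True)[:1000],
#     100, manual_judgement)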