Semantic Analysis in Language Technology
http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm

Vector Semantics
(aka Distributional Semantics)

Marina Santini
santinim@stp.lingfil.uu.se

Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden

Spring 2016
Previous Lecture: Word Sense Disambiguation
Similarity measures (dictionary-based)

Collocational features: supervised
•  Position-specific information about the words and collocations in window
•  guitar and bass player stand
•  word 1,2,3 grams in window of ±3 is common
… encoding local lexical and grammatical information that can often accurately isolate a given sense.
For example consider the ambiguous word bass in the following WSJ sentence:
(16.17) An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps.
A collocational feature vector, extracted from a window of two words to the right and left of the target word, made up of the words themselves, their respective parts-of-speech, and pairs of words, that is,

$[w_{i-2},\ POS_{i-2},\ w_{i-1},\ POS_{i-1},\ w_{i+1},\ POS_{i+1},\ w_{i+2},\ POS_{i+2},\ w^{i-1}_{i-2},\ w^{i+1}_{i}]$

would yield the following vector:
[guitar, NN, and, CC, player, NN, stand, VB, and guitar, player stand]
High performing systems generally use POS tags and word collocations of length 1, 2, and 3 from a window of words 3 to the left and 3 to the right (Zhong and Ng, 2010). A sketch of this extraction follows below.
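As a rough illustration (not from the original slides), here is a minimal Python sketch of collocational feature extraction, assuming the sentence is already POS-tagged; the function name and the exact choice of bigram features are our own (the excerpt above prints the preceding bigram as "and guitar").

```python
def collocational_features(tagged, i):
    """Words and POS tags in a +/-2 window around position i, plus two bigrams."""
    feats = []
    for offset in (-2, -1, 1, 2):              # w_{i-2} .. w_{i+2} around target
        word, pos = tagged[i + offset]
        feats += [word, pos]
    feats.append(tagged[i - 2][0] + " " + tagged[i - 1][0])  # preceding bigram
    feats.append(tagged[i + 1][0] + " " + tagged[i + 2][0])  # following bigram
    return feats

tagged = [("an", "DT"), ("electric", "JJ"), ("guitar", "NN"), ("and", "CC"),
          ("bass", "NN"), ("player", "NN"), ("stand", "VB")]
print(collocational_features(tagged, 4))  # target word: bass
# ['guitar', 'NN', 'and', 'CC', 'player', 'NN', 'stand', 'VB',
#  'guitar and', 'player stand']
```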
Bag-of-words features: supervised
•  Assume we've settled on a possible vocabulary of 12 words in "bass" sentences:

[fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]

•  The vector for: guitar and bass player stand
   [0,0,0,1,0,0,0,0,0,0,1,0]
   (a sketch of this encoding follows below)
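A minimal Python sketch of the bag-of-words encoding, using the fixed 12-word vocabulary from the slide; the function name is our own.

```python
vocab = ["fishing", "big", "sound", "player", "fly", "rod",
         "pound", "double", "runs", "playing", "guitar", "band"]

def bow_vector(context_words, vocab):
    # 1 if the vocabulary word occurs in the context window, else 0
    return [1 if v in context_words else 0 for v in vocab]

print(bow_vector("guitar and bass player stand".split(), vocab))
# [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
```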
Practical activity: Lesk algorithms
•  Michael Lesk (1986): Original Lesk
   •  Compare the target word's signature with the signature of each of the context words
•  Kilgarriff and Rosenzweig (2000): Simplified Lesk
   •  Compare the target word's signature with the context words (see the sketch below)
•  Vasilescu et al. (2004): Corpus Lesk
   •  Add all the words in a labelled corpus sentence for a word sense into the signature of that sense (remember the labelled sentences in Senseval 2).

signature <- set of words in the gloss and examples of sense
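A minimal Python sketch of Simplified Lesk, assuming `signatures` maps each sense to the set of words in its gloss and examples (e.g. taken from WordNet); the signature sets below are hypothetical, for illustration only.

```python
def simplified_lesk(target_senses, context_words):
    """Pick the sense whose signature overlaps most with the context words."""
    context = set(context_words)
    return max(target_senses,
               key=lambda sense: len(signatures[sense] & context))

signatures = {
    "time#n#5": {"pass", "time", "flies", "as", "an", "arrow"},  # hypothetical
    "time#n#1": {"instance", "occurrence"},                      # hypothetical
}
print(simplified_lesk(["time#n#5", "time#n#1"],
                      "time flies like an arrow".split()))
# -> time#n#5 (overlap of 4 words vs 0)
```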
Simplified Lesk: Time flies like an arrow
•  Common sense:
•  Modern English speakers unambiguously understand the sentence to mean "As a generalisation, time passes in the same way that an arrow generally flies (i.e. quickly)" (as in the common metaphor time goes by quickly).
Ref: wikipedia
•  But formally/logically/syntactically/semantically → ambiguous:
1.  (as an imperative) Measure the speed of flies like you would measure that of an arrow - i.e. (You should) time flies as you would time an arrow.
2.  (imperative) Measure the speed of flies like an arrow would - i.e. (You should) time flies in the same manner that an arrow would time them.
3.  (imperative) Measure the speed of flies that are like arrows - i.e. (You should) time those flies that are like an arrow.
4.  (declarative) Time moves in a way an arrow would.
5.  (declarative, i.e. neutrally stating a proposition) Certain flying insects, "time flies," enjoy an arrow.
Simplified Lesk algorithm (2000) and WordNet (3.1)
•  Disambiguating time:
   •  time#n#5 shares "pass" and "time flies as an arrow" with flies#v#8
•  Disambiguating flies:
   •  flies#v#8 shares "pass" and "time flies as an arrow" with time#n#5
So we select the following senses: time#n#5 and flies#v#8.
  
like & arrow
Disambiguating like:
•  like#a#1 shares like with flies#v#8

Arrow cannot be disambiguated
[Figure: WordNet entries for the selected senses - Time n#5, fly v#8, like a#1, Similar a#3]
Corpus Lesk Algorithm
•  Expands the approach by:
•  Adding all the words of any sense-tagged corpus data (like SemCor) for a word sense into the signature for that sense.
•  Signature = gloss + examples of a word sense
MacMillan dictionary
[Figure: MacMillan dictionary entries - Time n#1, Fly v#6, Like a#1, Arrow ???]
Implementation?
•  What if the next activity was:
•  Build an implementation of your solution of the simplified Lesk?
•  Watch out: licences (commercial, academic, creative commons, etc.)
Problems with thesaurus-based meaning
•  We don't have a thesaurus for every language
•  Even if we do, they have problems with recall
   •  Many words are missing
   •  Most (if not all) phrases are missing
   •  Some connections between senses are missing
   •  Thesauri work less well for verbs, adjectives
End of previous lecture
Vector/Distributional Semantics
•  The meaning of a word is computed from the distribution of words around it.
•  These words are represented as a vector of numbers.
•  Very popular and very intriguing!

http://esslli2016.unibz.it/?page_id=256
(Oversimplified) Preliminaries
(cf also Lect 03: SA, Turney Algorithm)
•  Probability
•  Joint probability
•  Marginals
•  PMI
•  PPMI
•  Smoothing
•  Dot product (aka inner product)
•  Window
Probability
•  Probability is the measure of how likely an event is.

Ex:
John has a box with a book, a map and a ruler in it (Cantos Gomez, 2013)
This sentence has 14 words and 5 nouns.

The probability of picking up a noun is:
P(noun) = 5/14 = 0.357
Joints and Marginals (oversimplifying)
•  Joint: the probability of word A occurring together with word B → the frequency with which the two words appear together
   •  P(A,B)
•  Marginals: the probability of a word A & the probability of the other word B
   •  P(A)    P(B)
Can also be said in other ways:
Dependent and independent events: Joints & Marginals
•  Two events are dependent if the outcome or occurrence of the first affects the outcome or occurrence of the second so that the probability is changed.
   •  Consider two dependent events, A and B. The joint probability that A and B occur together is:
   •  P(A and B) = P(A)*P(B given A)  OR  P(A and B) = P(B)*P(A given B)
•  If two events are independent, each probability is multiplied together to find the overall probability for the set of events.
   •  P(A and B) = P(A)*P(B)

Marginal probability is the probability of the occurrence of a single event in joint probability.

Equivalent notations (joint):
•  P(A,B) or P(A ∩ B)
Association measure
•  Pointwise mutual information:
•  How much more do events x and y co-occur than if they were independent?

$PMI(x,y) = \log_2 \frac{P(x,y)}{P(x)\,P(y)}$

Read: the joint probability of two dependent events (i.e., the 2 words that are supposed to be associated) divided by the product of the individual probabilities (i.e., we assume that the words are not associated, we assume they are independent), and we take the log of it.

It tells us how much more the two events co-occur than if they were independent (a sketch of the computation follows below).
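A minimal Python sketch of PMI from raw co-occurrence counts; the toy numbers are our own, not from the slides.

```python
from math import log2

def pmi(count_xy, count_x, count_y, n):
    p_xy = count_xy / n                   # joint probability P(x,y)
    p_x, p_y = count_x / n, count_y / n   # marginals P(x), P(y)
    return log2(p_xy / (p_x * p_y))

# x and y each occur 100 times in 10,000 tokens and co-occur 50 times:
print(pmi(50, 100, 100, 10_000))  # log2(0.005 / (0.01 * 0.01)) = 5.64
```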
POSITIVE PMI
•  We replace all the negative values with 0.
Smoothing (additive, Laplace, etc.)
•  In very simple words: we add an arbitrary value to the counts.
•  In a bag-of-words model of natural language processing and information retrieval, additive smoothing allows the assignment of non-zero probabilities to words which do not occur in the sample. This counters data sparseness: without it, an unseen word has count 0, and multiplying by its 0 probability zeroes out the whole product.
•  (Additive smoothing is commonly a component of naive Bayes classifiers.)
A sketch of add-k smoothing follows below.
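A minimal Python sketch of additive (add-k) smoothing over word counts; the vocabulary and counts are invented for illustration.

```python
def smoothed_prob(counts, word, vocab, k=1):
    # add k to every count, and k * |vocab| to the total, so nothing is zero
    total = sum(counts.values()) + k * len(vocab)
    return (counts.get(word, 0) + k) / total

vocab = ["wine", "tea", "it", "liquid"]
counts = {"wine": 2, "tea": 2, "it": 3}            # "liquid" is unseen
print(smoothed_prob(counts, "liquid", vocab))      # (0+1)/(7+4) = 0.09, not 0
print(smoothed_prob(counts, "it", vocab))          # (3+1)/(7+4) = 0.36
```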
Dot product (aka inner product)
•  Given two vectors v and w, the dot product is: v · w = Σᵢ vᵢwᵢ
•  The dot product is written using a central dot (a sketch follows below)
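A minimal Python sketch of the dot product of two vectors.

```python
def dot(v, w):
    # sum of elementwise products
    return sum(vi * wi for vi, wi in zip(v, w))

print(dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```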
Window (around the ambiguous word)
•  The number of words that we take into account before and after the word we want to disambiguate:
•  We can decide any arbitrary value, e.g.:
•  -3 ??? +3:
•  Ex: The president said central banks should maintain flows of cheap credit to households
(a window-extraction sketch follows below)
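A minimal Python sketch of extracting a ±3 window around a target position, using the slide's example sentence.

```python
def window(tokens, i, size=3):
    # up to `size` tokens before and after position i, excluding the target
    return tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size]

tokens = ("The president said central banks should maintain "
          "flows of cheap credit to households").split()
print(window(tokens, tokens.index("banks")))
# ['president', 'said', 'central', 'should', 'maintain', 'flows']
```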
Acknowledgements
Most slides borrowed or adapted from:
Dan Jurafsky and James H. Martin
Dan Jurafsky and Christopher Manning, Coursera

J&M (2015, draft): https://web.stanford.edu/~jurafsky/slp3/
Distributional Semantics
Term-context matrix
Distributional models of meaning
•  Also called vector-space models of meaning
•  Offer much higher recall than hand-built thesauri
   •  Although they tend to have lower precision
•  Zellig Harris (1954): "oculist and eye-doctor … occur in almost the same environments…. If A and B have almost identical environments we say that they are synonyms."
•  Firth (1957): "You shall know a word by the company it keeps!"
Intuition of distributional word similarity
•  Examples:
   A bottle of tesgüino is on the table.
   Everybody likes tesgüino.
   Tesgüino makes you drunk.
   We make tesgüino out of corn.
•  From context words humans can guess tesgüino means
   •  an alcoholic beverage like beer
•  Intuition for algorithm:
   •  Two words are similar if they have similar word contexts.
IR: Term-document matrix
•  Each cell: count of term t in a document d: tf(t,d)
•  Each document is a count vector in ℕ^|V|: a column below

             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle             1                1               8            15
soldier            2                2              12            36
fool              37               58               1             5
clown              6              117               0             0
Document similarity: Term-document matrix
•  Two documents are similar if their vectors are similar

             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle             1                1               8            15
soldier            2                2              12            36
fool              37               58               1             5
clown              6              117               0             0
The words in a term-document matrix
•  Each word is a count vector in ℕ^D: a row below

             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle             1                1               8            15
soldier            2                2              12            36
fool              37               58               1             5
clown              6              117               0             0
The words in a term-document matrix
•  Two words are similar if their vectors are similar

             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle             1                1               8            15
soldier            2                2              12            36
fool              37               58               1             5
clown              6              117               0             0
The intuition of distributional word similarity…
•  Instead of using entire documents, use smaller contexts
   •  Paragraph
   •  Window of 10 words
•  A word is now defined by a vector over counts of context words
Sample contexts: 20 words (Brown corpus)
•  equal amount of sugar, a sliced lemon, a tablespoonful of apricot preserve or jam, a pinch each of clove and nutmeg,
•  on board for their enjoyment. Cautiously she sampled her first pineapple and another fruit whose taste she likened to that of
•  of a recursive type well suited to programming on the digital computer. In finding the optimal R-stage policy from that of
•  substantially affect commerce, for the purpose of gathering data and information necessary for the study authorized in the first section of this
Term-context matrix for word similarity
•  Two words are similar in meaning if their context vectors are similar (a construction sketch follows below)

             aardvark   computer   data   pinch   result   sugar   …
apricot          0          0        0      1       0        1
pineapple        0          0        0      1       0        1
digital          0          2        1      0       1        0
information      0          1        6      0       4        0
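A minimal Python sketch of building a term-context count matrix from windows; the tiny corpus and the window size are our own, for illustration only.

```python
from collections import Counter, defaultdict

def term_context_counts(sentences, size=2):
    """Count, for each word, the words appearing within +/-size positions."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for i, w in enumerate(sent):
            left = sent[max(0, i - size):i]
            right = sent[i + 1:i + 1 + size]
            counts[w].update(left + right)
    return counts

corpus = [["a", "pinch", "of", "apricot", "jam"],
          ["digital", "computer", "data"]]
counts = term_context_counts(corpus)
print(counts["apricot"]["pinch"])  # 1: "pinch" is within 2 words of "apricot"
```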
Should we use raw counts?
•  For the term-document matrix
   •  We used tf-idf instead of raw term counts
•  For the term-context matrix
   •  Positive Pointwise Mutual Information (PPMI) is common
Pointwise Mutual Information
•  Pointwise mutual information:
   •  Do events x and y co-occur more than if they were independent?

   $PMI(x,y) = \log_2 \frac{P(x,y)}{P(x)\,P(y)}$

•  PMI between two words: (Church & Hanks 1989)
   •  Do words x and y co-occur more than if they were independent?

   $PMI(word_1, word_2) = \log_2 \frac{P(word_1, word_2)}{P(word_1)\,P(word_2)}$

•  Positive PMI between two words (Niwa & Nitta 1994)
   •  Replace all PMI values less than 0 with zero
Computing PPMI on a term-context matrix
•  Matrix F with W rows (words) and C columns (contexts)
•  f_ij is # of times w_i occurs in context c_j

$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}$   (the denominator is the sum of all words in all contexts = all the numbers in the matrix)

$p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}$   (the numerator is the count over all the contexts where the word appears)

$p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}$   (the numerator is the count of all the words that occur in that context)

$pmi_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}}$

$ppmi_{ij} = \begin{cases} pmi_{ij} & \text{if } pmi_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}$
p(w=information, c=data) = 6/19 = .32
p(w=information) = 11/19 = .58
p(c=data) = 7/19 = .37

$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}$

$p(w_i) = \frac{\sum_{j=1}^{C} f_{ij}}{N}$   (the count over all the contexts where the word appears)

$p(c_j) = \frac{\sum_{i=1}^{W} f_{ij}}{N}$   (the count of all the words that occur in that context)

N = the sum of all words in all contexts = all the numbers in the matrix
$pmi_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}}$

•  pmi(information, data) = log2( .32 / (.37 * .58) ) = .58 (a sketch of this computation follows below)

PPMI(w, context):
             computer   data   pinch   result   sugar
apricot         -        -     2.25      -      2.25
pineapple       -        -     2.25      -      2.25
digital        1.66     0.00     -      0.00      -
information    0.00     0.57     -      0.47      -
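A minimal Python sketch of the PPMI computation above, using the slide's term-context counts over the five contexts shown; variable names are our own.

```python
import numpy as np

F = np.array([[0, 0, 1, 0, 1],    # apricot
              [0, 0, 1, 0, 1],    # pineapple
              [2, 1, 0, 1, 0],    # digital
              [1, 6, 0, 4, 0]])   # information
                                  # columns: computer, data, pinch, result, sugar

def ppmi(F):
    N = F.sum()                        # all the numbers in the matrix (19)
    p = F / N                          # p_ij
    pw = p.sum(axis=1, keepdims=True)  # row marginals p_i*
    pc = p.sum(axis=0, keepdims=True)  # column marginals p_*j
    with np.errstate(divide="ignore"): # log2(0) -> -inf, silenced
        pmi = np.log2(p / (pw * pc))
    return np.maximum(pmi, 0)          # PPMI: clip negatives (and -inf) to 0

print(round(ppmi(F)[3, 1], 2))  # ppmi(information, data) = 0.57
```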
Weighing PMI
•  PMI is biased toward infrequent events
•  Various weighting schemes help alleviate this
   •  See Turney and Pantel (2010)
•  Add-one smoothing can also help
Add-2 Smoothed Count(w, context):
             computer   data   pinch   result   sugar
apricot         2         2      3       2        3
pineapple       2         2      3       2        3
digital         4         3      2       3        2
information     3         8      2       6        2

p(w, context) [add-2] and p(w):
             computer   data   pinch   result   sugar    p(w)
apricot        0.03     0.03   0.05    0.03     0.05     0.20
pineapple      0.03     0.03   0.05    0.03     0.05     0.20
digital        0.07     0.05   0.03    0.05     0.03     0.24
information    0.05     0.14   0.03    0.10     0.03     0.36
p(context)     0.19     0.25   0.17    0.22     0.17
Original vs add-2 smoothing

PPMI(w, context) [add-2]:
             computer   data   pinch   result   sugar
apricot        0.00     0.00   0.56    0.00     0.56
pineapple      0.00     0.00   0.56    0.00     0.56
digital        0.62     0.00   0.00    0.00     0.00
information    0.00     0.58   0.00    0.37     0.00

PPMI(w, context):
             computer   data   pinch   result   sugar
apricot         -        -     2.25      -      2.25
pineapple       -        -     2.25      -      2.25
digital        1.66     0.00     -      0.00      -
information    0.00     0.57     -      0.47      -
Distributional Semantics
Dependency relations
Using syntax to define a word's context
•  Zellig Harris (1968)
   •  "The meaning of entities, and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities"
•  Two words are similar if they have similar parse contexts
•  Duty and responsibility (Chris Callison-Burch's example)

Modified by adjectives:   additional, administrative, assumed, collective, congressional, constitutional …
Objects of verbs:         assert, assign, assume, attend to, avoid, become, breach …
Co-occurrence vectors based on syntactic dependencies
•  The contexts C are different dependency relations
   •  Subject-of "absorb"
   •  Prepositional-object of "inside"
•  Counts for the word cell (a counting sketch follows below):

Dekang Lin, 1998 "Automatic Retrieval and Clustering of Similar Words"
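A minimal Python sketch of dependency-based context counting, assuming parser output is already available as (word, relation, head) triples; the triples below are invented for illustration, not Lin's actual data.

```python
from collections import Counter, defaultdict

triples = [("cell", "subject-of", "absorb"),
           ("cell", "prep-object-of", "inside"),
           ("cell", "subject-of", "absorb"),
           ("wine", "object-of", "drink")]

counts = defaultdict(Counter)
for word, rel, head in triples:
    counts[word][(rel, head)] += 1  # each (relation, head) pair is one context

print(counts["cell"][("subject-of", "absorb")])  # 2
```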
PMI applied to dependency relations
•  "Drink it" more common than "drink wine"
•  But "wine" is a better "drinkable" thing than "it"

Object of "drink"   Count   PMI
it                    3      1.3
anything              3      5.2
wine                  2      9.3
tea                   2     11.8
liquid                2     10.5

Hindle, Don. 1990. Noun Classification from Predicate-Argument Structure. ACL

Sorted by PMI:
Object of "drink"   Count   PMI
tea                   2     11.8
liquid                2     10.5
wine                  2      9.3
anything              3      5.2
it                    3      1.3
Cosine for computing similarity

$\cos(\vec{v},\vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\vec{v}}{|\vec{v}|} \cdot \frac{\vec{w}}{|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$

(the numerator is a dot product; the middle form normalizes v and w to unit vectors)
v_i is the PPMI value for word v in context i
w_i is the PPMI value for word w in context i.
Cos(v,w) is the cosine similarity of v and w
(Sec. 6.3)
Cosine as a similarity metric
•  -1: vectors point in opposite directions
•  +1: vectors point in same directions
•  0: vectors are orthogonal
•  Raw frequency or PPMI are non-negative, so cosine range 0-1
Which pair of words is more similar?

             large   data   computer
apricot        1       0       0
digital        0       1       2
information    1       6       1

$\cos(\vec{v},\vec{w}) = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$

cosine(apricot, information) = (1+0+0) / (√(1+0+0) · √(1+36+1)) = 1/√38 = .16
cosine(digital, information) = (0+6+2) / (√(0+1+4) · √(1+36+1)) = 8/(√5 · √38) = .58
cosine(apricot, digital) = (0+0+0) / (√(1+0+0) · √(0+1+4)) = 0
(a sketch reproducing these values follows below)
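A minimal Python sketch that reproduces the three cosines above from the (large, data, computer) count vectors.

```python
from math import sqrt

def cosine(v, w):
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = sqrt(sum(vi * vi for vi in v))
    norm_w = sqrt(sum(wi * wi for wi in w))
    return dot / (norm_v * norm_w)

apricot, digital, information = [1, 0, 0], [0, 1, 2], [1, 6, 1]
print(round(cosine(apricot, information), 2))  # 0.16
print(round(cosine(digital, information), 2))  # 0.58
print(round(cosine(apricot, digital), 2))      # 0.0
```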
Other possible similarity measures
The end
