Introduction to Natural Language Processing
Rutu Mulkar-Mehta, PhD
Founder and Data Scientist @Ticary
@RutuMulkar

Co-hosted Meetup
Data Science Dojo
http://www.meetup.com/data-science-dojo
Natural Language Processing
http://www.meetup.com/Natural-Language-Processing-Meetup/
  
About Me
•  Founder and Data Scientist at Ticary
•  Background:
   – PhD in Natural Language Processing
   – Computer Science
•  Worked on applying NLP to:
   – Healthcare
   – SEO (Search Engine Optimization)
   – Other Stuff: Sentiment Analysis, Question Answering, Natural Language Understanding ++
  
Agenda
•  Understanding Natural Language
•  Introduction to different NLP Problems
•  Part of Speech tagging
•  Linguistic Resources
  
UNDERSTANDING NATURAL LANGUAGE
  
Some Example Sentences
•  Children make delicious snacks
•  I saw the Grand Canyon flying to New York
•  Stolen painting found by the tree
•  Two sentences:
   – Monkeys like bananas when they wake up.
   – Monkeys like bananas when they are ripe.
  
Why is NLP Hard?
Brazil crowds attend funeral of late candidate Campos

More than 100,000 people in Brazil have paid their last respects to the late presidential candidate, Eduardo Campos, who died in a plane crash on Wednesday.
They attended a funeral Mass and filled the streets of the city of Recife to follow the passage of his coffin.
Later this week, Mr. Campos's Socialist Party is expected to appoint former Environment Minister Marina Silva as a replacement candidate.
Mr. Campos's jet crashed in bad weather in Santos, near Sao Paulo.
Investigators are still trying to establish the exact causes of the crash, which killed six other people.
  
Why is NLP Hard?
•  To understand the current event, you need to understand several other concepts:
   – Current event
   – Background event
   – Property
   – References to other events
   – Pronouns
  
NLP TASKS
What can we solve with Natural Language Processing?

NLP Tasks
•  Text Categorization
•  Sentiment Analysis
•  Information Extraction
•  Information Retrieval
•  Question Answering
•  Text Summarization
•  Machine Translation
  
Text Categorization
Input Document → What is the document about:
   sports: 0.2%
   politics: 2%
   entertainment: 96%
   religion: …
   finance: …

Text Classification
[Figure: word clouds for finance.yahoo.com and sports.yahoo.com; make your own wordle using wordle.net]
Vocabulary used in one genre of text is different from vocabulary used in another genre
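
The point that each genre has its own vocabulary can be sketched as a tiny bag-of-words Naive Bayes classifier. This is only an illustration, not the method used for the slide: the training sentences, the genre labels, and the helper names `train` and `classify` are all made up here, and a real system would need far more data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def train(labeled_docs):
    """Count word frequencies per genre from (genre, text) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for genre, text in labeled_docs:
        doc_counts[genre] += 1
        word_counts[genre].update(tokenize(text))
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return genre probabilities using Naive Bayes with add-one smoothing."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for genre, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[genre] / total_docs)
        for word in tokenize(text):
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        log_scores[genre] = score
    # Normalize the log scores into probabilities that sum to 1.
    max_score = max(log_scores.values())
    exp_scores = {g: math.exp(s - max_score) for g, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {g: v / norm for g, v in exp_scores.items()}


# Hypothetical training data, invented for this example.
TRAINING = [
    ("finance", "stocks fell as the market reacted to interest rates"),
    ("finance", "the bank raised interest rates and bond prices dropped"),
    ("sports", "the team won the game with a late goal"),
    ("sports", "the coach praised the players after the match"),
]

word_counts, doc_counts = train(TRAINING)
probs = classify("interest rates hit the stock market", word_counts, doc_counts)
best = max(probs, key=probs.get)
```

The result is a probability per genre, in the same spirit as the percentages on the Text Categorization slide; the words "interest", "rates" and "market" pull the test sentence toward the finance genre.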
  
Sentiment Analysis
[Figure: Product Reviews – Kindle Paperwhite: "Sharp screen resolution" vs "Low battery life"]

Sentiment Analysis
•  What are people saying?
   –  Twitter
   –  Reviews
   –  Blogs
   –  Emails
•  Can be for:
   –  Products
   –  Companies
   –  Movies
   –  Books
  
Sentiment Analysis
Possible Features
•  Important keywords and key phrases:
   –  POS: dazzling, brilliant, phenomenal
   –  NEG: hideous, awful, unwatchable
•  Emoticons
   –  POS :-)
   –  NEG :-(
•  Ontologies
   –  WordNet: https://wordnet.princeton.edu/
   –  SentiWordNet: http://sentiwordnet.isti.cnr.it/
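
A minimal sketch of how the keyword and emoticon features above could be turned into a score. The lexicon contains only the cues listed on the slide, and the function names `sentiment_score` and `label` are hypothetical; this is a counting heuristic, not a trained model.

```python
# +1 for positive cues, -1 for negative cues (taken from the slide).
LEXICON = {
    "dazzling": 1, "brilliant": 1, "phenomenal": 1, ":-)": 1,
    "hideous": -1, "awful": -1, "unwatchable": -1, ":-(": -1,
}
PUNCTUATION = ".,!?\"'"


def sentiment_score(text):
    """Sum the lexicon values of all cues found in the text."""
    score = 0
    for token in text.lower().split():
        # Emoticons are matched as-is; words have surrounding punctuation stripped.
        cue = token if token in LEXICON else token.strip(PUNCTUATION)
        score += LEXICON.get(cue, 0)
    return score


def label(text):
    """Map the numeric score to POS, NEG or NEUTRAL."""
    score = sentiment_score(text)
    if score > 0:
        return "POS"
    if score < 0:
        return "NEG"
    return "NEUTRAL"
```

For example, `label("A dazzling, brilliant film :-)")` gives `"POS"`. A sentence whose opinion words are not in the lexicon, or whose meaning depends on sarcasm, falls through to `"NEUTRAL"` or gets the wrong label, which is why keyword features alone are rarely enough.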
  
Challenges
•  People express opinions in complex ways
   – “The acting was great and the plots were intense and mesmerizing, but I hated the movie”
•  Sarcasm, humor and other expressions
   – “It was a great movie for a Sunday nap. I only fell asleep twice, but it was very restful”
  
Information Extraction
Input Document → What are the key pieces of information?
   Location:
   Time:
   People:
   …

Extracting Named Entities from Documents

Other ways for IE:
Hypernyms (type of)
colors such as red, blue and …
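
The "X such as A, B and C" cue on the hypernym slide can be sketched with a single regular expression. The pattern and the function name `extract_hypernyms` are assumptions made for illustration; the sketch handles only single-word items and this one phrasing.

```python
import re

# "<hypernym> such as <item>, <item> and <item>" with single-word items.
SUCH_AS = re.compile(r"(\w+) such as ((?:\w+(?:, | and | or )?)+)")


def extract_hypernyms(sentence):
    """Return (item, hypernym) pairs found via the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        items = re.split(r", | and | or ", match.group(2))
        pairs.extend((item, hypernym) for item in items if item)
    return pairs
```

Calling `extract_hypernyms("She likes colors such as red, blue and green.")` yields one pair per listed color, each tagged as a type of "colors".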
  
Other ways for IE:
Synonyms
Find different relations between 2 concepts:
Microsoft bought Farecast
  
Information Retrieval
[Diagram: a query is matched against several input documents: "What are the documents relevant to the query?"]

Information Retrieval
Q) Which documents are most relevant to a given query?

A) Similar vocabulary between query and document?
Quantify similarity based on maximum overlap
   – Cosine Similarity
   – Jaccard Similarity
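
Both similarity measures named above can be sketched in a few lines over whitespace-split words. Jaccard similarity is |A ∩ B| / |A ∪ B| over the sets of distinct words; cosine similarity is the dot product of the two word-count vectors divided by the product of their lengths. The helper `rank` is a hypothetical addition showing how the score would order documents for a query.

```python
import math
from collections import Counter


def tokens(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def jaccard_similarity(a, b):
    """Shared distinct words divided by all distinct words."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    """Cosine of the angle between the two word-count vectors."""
    vec_a, vec_b = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return documents sorted from most to least similar to the query."""
    return sorted(documents, key=lambda d: cosine_similarity(query, d), reverse=True)
```

Production search engines usually weight the counts (for example with TF-IDF) before comparing, but the overlap idea is the same.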
  
Information Retrieval
Q) If you rewrite the query, will that give you more precise results?

A) Yes! It is called “Query Expansion”
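
One simple form of query expansion is to add synonyms of each query term before matching. The sketch below assumes a hand-written synonym table; `SYNONYMS`, `expand_query` and `matches` are hypothetical names, and a real system might draw the synonyms from a resource such as WordNet instead.

```python
# Hypothetical synonym table, invented for this example.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "buy": ["purchase", "acquire"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Append known synonyms of each query term, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def matches(document, terms):
    """True if the document contains at least one of the terms."""
    words = set(document.lower().split())
    return any(term in words for term in terms)
```

With the table above, a document reading "Used automobile for sale" is missed by the bare query "car" but found once the query is expanded.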
  
Commercial Search Tools
•  Lucene
   – http://lucene.apache.org/
•  ElasticSearch
   – https://www.elastic.co/
Underlying technology in most of these is the same, with some variations

Meetup about this topic scheduled for early 2016
  
Question Answering - Closed
Input Data Source → Questions:
   What event happened?
   When did the event happen?
   Why did the event happen?
   How long was the event?
   How did the event happen?

Question Answering - Open
  
Text Summarization

Types of Text Summarization
•  Keyword Summaries
   –  Extract significant keywords from text
   –  Easy to implement
   –  Hard for the end user to understand
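
A keyword summary can be approximated by counting words and dropping very common ones. This is a rough sketch: the stop-word list is a small illustrative sample, `keyword_summary` is a hypothetical name, and raw frequency is only one possible notion of "significant".

```python
from collections import Counter

# Small illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to", "was", "who"}


def keyword_summary(text, top_n=3):
    """Return the top_n most frequent words that are not stop words."""
    words = [w.strip(".,!?\"'()").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]
```

The output is a bare list of words, which shows why this kind of summary is easy to build but hard for a reader to interpret.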
  
Types of Text Summarization
•  Sentence/Phrase Extraction
   –  Extract relevant sentences
   –  Medium-hard to implement
   –  Easy for end user to understand
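
Sentence extraction can be sketched by scoring each sentence with the average corpus frequency of its words and keeping the top-scoring ones. The scoring rule and the name `extract_summary` are assumptions chosen for brevity; practical systems use richer relevance signals.

```python
import re
from collections import Counter


def words_of(sentence):
    """Lowercased alphabetic tokens of a sentence."""
    return re.findall(r"[a-z']+", sentence.lower())


def extract_summary(text, num_sentences=1):
    """Keep the sentences whose words are most frequent across the text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in words_of(s))

    def score(sentence):
        # Average frequency, so long sentences are not favoured automatically.
        ws = words_of(sentence)
        return sum(freq[w] for w in ws) / len(ws) if ws else 0.0

    chosen = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Present the chosen sentences in their original order.
    return [s for s in sentences if s in chosen]
```

Because the output consists of whole sentences from the source, it stays readable for the end user without any text generation.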
  
Types of Text Summarization
•  Natural Language Understanding and Generation
   –  Understand meaning of text
   –  Generate sentences from meaning of original text
   –  Hard to implement
   –  Easy for end user

President of University of Missouri resigned after graduate student hunger strike and class cancellations by faculty
  
Machine Translation
translate.google.com

Why is MT Hard?
•  It is not a 1 to 1 translation
   – In the previous example 4 words in English translate into 2 in Spanish
•  Grammar is different in different languages
   – SOV (Subject – Object – Verb)
      •  “She him loves” (Hindi, Japanese)
   – SVO (Subject – Verb – Object)
      •  “She loves him” (English, Mandarin)
  
Machine Translation
•  Waygoapp
•  Instantly translates Chinese, Japanese and Korean
•  Simply point and translate
•  Offline
http://waygoapp.com/
  
LINGUISTIC NUANCES
Back to the basics
  
Example
All the gobulins were gramzies.
It was grimbleton.
What are the underlined words?

gobulins
•  Noun
gramzies
•  Noun or Adjective
grimbleton
•  Noun or Adjective

Why is the example important?
We can get a sense of what the word means, based on how it is used in language.
  
Nouns
•  E.g. cat, car, computer, tree
•  Variations:
   – Number: singular, plural
      •  one car, two cars
   – Gender: masculine, feminine, neuter
   – Case: nominative, genitive, accusative, dative

Pronouns
•  E.g. she, ourselves, mine
•  Vary in
   –  Person
   –  Gender
      •  his, her
   –  Number
   –  Case: nominative, accusative, possessive, 2nd possessive
   –  Reflexive and Anaphoric Forms:
      •  herself, each other

Determiners
•  Articles
   – a, an, the
•  Demonstratives
   – this, that

Adjectives
•  Describe Properties
   – sunny, beautiful, calm
•  Attributive and predicative properties
•  Agreement
   – in gender, number
•  Comparative and superlative forms
   – derivative and periphrastic
•  positive form

Verbs
•  Tense: past, present, future
   – danced, dancing, will dance
•  Aspect: progressive, perfective
•  Voice: active, passive
•  Other: number, person
•  Arguments: transitive, intransitive, ditransitive

Other POS tags
•  Adverbs
   – happily
•  Prepositions
   – of, on, in
•  Particles
   – ran a bill vs ran up a bill
  
Morphological Analysis
•  Sleeps = sleep + V + 3rd Person + Singular
•  If we have a good enough grammar with all of these rules, we have a good shot at understanding syntax of language
  
Automatic Taggers
•  Almost all English POS taggers use the Penn Treebank list of tags
•  https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
   –  Nouns:
      •  NN (house), NNS (houses), NNP (White House), NNPS
   –  Verbs:
      •  VB (say), VBD (said), VBG (saying), VBN, VBP, VBZ
   –  Adjectives:
      •  JJ (good), JJR (better), JJS (best)
   –  Adverbs: RB, RBR, RBS
   –  Prepositions: IN
  
POS Tagging and Parsing
•  Stanford Core NLP
   – http://nlp.stanford.edu:8080/corenlp/
•  NLTK
   – Natural Language Toolkit
   – You need to provide your own training data, and train models for NLTK to be effective
  
Other Linguistic Features of Interest
– We want to get nouns and verbs into a root form
   E.g.
   •  am, are, is → be
   •  car, cars, car’s → car
– Two approaches:
   •  Stemming
   •  Lemmatization
  
Stemming and Lemmatization
•  Lemmatization
   –  Uses a vocabulary and morphological analysis of words
   –  Returns the base or dictionary form of a word
   –  The base form is known as the lemma
   –  e.g. am, are, is → be
•  Stemming
   –  Crude heuristic process
   –  Chops off the ends of words in the hope of reaching the root form
   –  e.g. Marked → Mark, Marker → Mark
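
The contrast between the two approaches can be sketched side by side: the stemmer only chops suffixes, while the lemmatizer looks words up in a vocabulary. Both are toy versions written for this example; the suffix list, the `LEMMAS` table and the function names are hypothetical and far smaller than what a real stemmer or lemmatizer uses.

```python
# Hypothetical mini-dictionary standing in for a real vocabulary.
LEMMAS = {"am": "be", "are": "be", "is": "be", "cars": "car", "marked": "mark"}

# Suffixes are tried in this order; the first match is removed.
SUFFIXES = ("ing", "ed", "er", "s")


def stem(word):
    """Crude stemmer: chop off the first matching suffix, no dictionary."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Dictionary lookup: return the lemma if known, else the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)
```

The stemmer maps both "Marked" and "Marker" to "mark", as on the slide, but it cannot relate "is" to "be"; only the dictionary-based lemmatizer can do that.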
  
Parsing Resources
•  NLTK
   – python, low accuracy, fast
   – http://www.nltk.org/
•  Stanford Core NLP
   – java, high accuracy, slow
   – http://nlp.stanford.edu/software/corenlp.shtml
•  SpaCy
   – python, medium accuracy, fast
   – https://spacy.io/

Other Resources: Ontologies
•  WordNet
   –  groups words when they have the same meaning
   –  represents hierarchical links between groups
   –  E.g. car is the same thing as an automobile
•  SentiWordNet
   –  WordNet + Sentiment
•  ConceptNet
   –  broader relationships than WordNet
   –  E.g. bread is typically found near a toaster.
•  FrameNet
   –  Frames represent concepts and their associated roles
  
SOMETHING TO THINK ABOUT

Semantics and Word Collocations
•  It is important to know which words occur together
   – Strong Beer vs Powerful Beer
   – Big Sister vs Large Sister
•  Two approaches have been used
   – Semantics: ontologies and word meanings
   – Statistics: word collocations and probabilities
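
The statistical approach can be sketched with pointwise mutual information (PMI), one common way to score how strongly two adjacent words go together: PMI(w1, w2) = log2( P(w1, w2) / (P(w1) · P(w2)) ). The function name `pmi_scores` and the tiny corpus in the usage note are invented for illustration; the slide does not say which statistic the speaker had in mind.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Pointwise mutual information for each adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores
```

On a toy corpus such as `"strong beer and strong tea but powerful computer and powerful engine and strong beer".split()`, the pair ("strong", "beer") gets a positive score, while ("powerful", "beer") never occurs and so has no score at all, mirroring the "Strong Beer vs Powerful Beer" contrast.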
  
Thank you for Listening
rutu@ticary.com
@RutuMulkar
  
	
  

Más contenido relacionado

La actualidad más candente

Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
Yasir Khan
 
Natural language processing (nlp)
Natural language processing (nlp)Natural language processing (nlp)
Natural language processing (nlp)
Kuppusamy P
 

La actualidad más candente (20)

Presentation on Text Classification
Presentation on Text ClassificationPresentation on Text Classification
Presentation on Text Classification
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with Python
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)
 
Natural Language Processing
Natural Language Processing Natural Language Processing
Natural Language Processing
 
NLP
NLPNLP
NLP
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
NLP PPT.pptx
NLP PPT.pptxNLP PPT.pptx
NLP PPT.pptx
 
SoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming textSoDA v2 - Named Entity Recognition from streaming text
SoDA v2 - Named Entity Recognition from streaming text
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
NLP
NLPNLP
NLP
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...
 
Natural language processing (nlp)
Natural language processing (nlp)Natural language processing (nlp)
Natural language processing (nlp)
 
Text Classification
Text ClassificationText Classification
Text Classification
 
Analyzing Text Preprocessing and Feature Selection Methods for Sentiment Anal...
Analyzing Text Preprocessing and Feature Selection Methods for Sentiment Anal...Analyzing Text Preprocessing and Feature Selection Methods for Sentiment Anal...
Analyzing Text Preprocessing and Feature Selection Methods for Sentiment Anal...
 

Destacado

Natural language processing 2
Natural language processing 2Natural language processing 2
Natural language processing 2
Tony Vo
 
Natural Language Processing in Practice
Natural Language Processing in PracticeNatural Language Processing in Practice
Natural Language Processing in Practice
Vsevolod Dyomkin
 
Practical Natural Language Processing
Practical Natural Language ProcessingPractical Natural Language Processing
Practical Natural Language Processing
Jaganadh Gopinadhan
 
Persona Driven Keyword Research
Persona Driven Keyword ResearchPersona Driven Keyword Research
Persona Driven Keyword Research
Michael King
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with r
Vivian S. Zhang
 

Destacado (20)

Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?Text analytics and R - Open Question: is it a good match?
Text analytics and R - Open Question: is it a good match?
 
Measuring Opinion Credibility in Twitter
Measuring Opinion Credibility in TwitterMeasuring Opinion Credibility in Twitter
Measuring Opinion Credibility in Twitter
 
Annotation processing
Annotation processingAnnotation processing
Annotation processing
 
Startupfest 2015: HARPER REED (Modest, Inc.) - Lightning Keynote
Startupfest 2015: HARPER REED (Modest, Inc.) - Lightning KeynoteStartupfest 2015: HARPER REED (Modest, Inc.) - Lightning Keynote
Startupfest 2015: HARPER REED (Modest, Inc.) - Lightning Keynote
 
Natural language procesing in R
Natural language procesing in RNatural language procesing in R
Natural language procesing in R
 
Natural language processing 2
Natural language processing 2Natural language processing 2
Natural language processing 2
 
Gordana Panajotović - NLP Master
Gordana Panajotović - NLP MasterGordana Panajotović - NLP Master
Gordana Panajotović - NLP Master
 
Challenges of social media analysis in the real world
Challenges of social media analysis in the real worldChallenges of social media analysis in the real world
Challenges of social media analysis in the real world
 
Natural Language Processing in Practice
Natural Language Processing in PracticeNatural Language Processing in Practice
Natural Language Processing in Practice
 
Text Mining Infrastructure in R
Text Mining Infrastructure in RText Mining Infrastructure in R
Text Mining Infrastructure in R
 
Introduction to nlp 2014
Introduction to nlp 2014Introduction to nlp 2014
Introduction to nlp 2014
 
Online Tweet Sentiment Analysis with Apache Spark
Online Tweet Sentiment Analysis with Apache SparkOnline Tweet Sentiment Analysis with Apache Spark
Online Tweet Sentiment Analysis with Apache Spark
 
Text analytics in Python and R with examples from Tobacco Control
Text analytics in Python and R with examples from Tobacco ControlText analytics in Python and R with examples from Tobacco Control
Text analytics in Python and R with examples from Tobacco Control
 
Practical Natural Language Processing
Practical Natural Language ProcessingPractical Natural Language Processing
Practical Natural Language Processing
 
Persona Driven Keyword Research
Persona Driven Keyword ResearchPersona Driven Keyword Research
Persona Driven Keyword Research
 
Introducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with rIntroducing natural language processing(NLP) with r
Introducing natural language processing(NLP) with r
 
Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
R by example: mining Twitter for consumer attitudes towards airlines
R by example: mining Twitter for consumer attitudes towards airlinesR by example: mining Twitter for consumer attitudes towards airlines
R by example: mining Twitter for consumer attitudes towards airlines
 

Similar a Intro to nlp

topics natural language processing and image processing
topics natural language processing and image processingtopics natural language processing and image processing
topics natural language processing and image processing
youkayaslam
 
Intro 2 document
Intro 2 documentIntro 2 document
Intro 2 document
Uma Kant
 

Similar a Intro to nlp (20)

Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations
 
Information Architecture Fundamentals
Information Architecture FundamentalsInformation Architecture Fundamentals
Information Architecture Fundamentals
 
Introduction to nlp
Introduction to nlpIntroduction to nlp
Introduction to nlp
 
1004-nlp.ppt
1004-nlp.ppt1004-nlp.ppt
1004-nlp.ppt
 
Nlp app
Nlp appNlp app
Nlp app
 
Relation Extraction
Relation ExtractionRelation Extraction
Relation Extraction
 
way_topics.ppt
way_topics.pptway_topics.ppt
way_topics.ppt
 
topics natural language processing and image processing
topics natural language processing and image processingtopics natural language processing and image processing
topics natural language processing and image processing
 
Intro 2 document
Intro 2 documentIntro 2 document
Intro 2 document
 
Intro
IntroIntro
Intro
 
Intro
IntroIntro
Intro
 
Text Analytics: Yesterday, Today and Tomorrow
Text Analytics: Yesterday, Today and TomorrowText Analytics: Yesterday, Today and Tomorrow
Text Analytics: Yesterday, Today and Tomorrow
 
Ted Talk
Ted TalkTed Talk
Ted Talk
 
Natural_Language_Processing_1.ppt
Natural_Language_Processing_1.pptNatural_Language_Processing_1.ppt
Natural_Language_Processing_1.ppt
 
Natural Language processing
Natural Language processingNatural Language processing
Natural Language processing
 
Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Mining Web content for Enhanced Search
Mining Web content for Enhanced Search
 
The Coming Explosion of Records at FamilySearch Syllabus
The Coming Explosion of Records at FamilySearch SyllabusThe Coming Explosion of Records at FamilySearch Syllabus
The Coming Explosion of Records at FamilySearch Syllabus
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
PLAIN2013 Rethink, Reorganize, Reword, Redesign
PLAIN2013   Rethink, Reorganize, Reword, RedesignPLAIN2013   Rethink, Reorganize, Reword, Redesign
PLAIN2013 Rethink, Reorganize, Reword, Redesign
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

Intro to nlp

  • 1. Introduction to Natural Language Processing Rutu Mulkar-Mehta, PhD Founder and Data Scientist @Ticary @RutuMulkar
  • 2. Co-hosted Meetup Data Science Dojo http://www.meetup.com/data-science-dojo Natural Language Processing http://www.meetup.com/Natural-Language-Processing-Meetup/
  • 4. About Me • Founder and Data Scientist at Ticary • Background: – PhD in Natural Language Processing – Computer Science • Worked on applying NLP to: – Healthcare – SEO (Search Engine Optimization) – Other Stuff: Sentiment Analysis, Question Answering, Natural Language Understanding ++
  • 5. Agenda   •  Understanding  Natural  Language   •  Introduction  to  different  NLP  Problems   •  Part  of  Speech  tagging   •  Linguistic  Resources    
  • 7. Some  Example  Sentences   •  Children  make  delicious  snacks   •  I  saw  the  Grand  Canyon  flying  to  New  York   •  Stolen  painting  found  by  the  tree     •  Two  sentences:   – Monkeys  like  bananas  when  they  wake  up.   – Monkeys  like  bananas  when  they  are  ripe.  
  • 8. Why  is  NLP  Hard?   Brazil  crowds  attend  funeral  of  late  candidate  Campos     More  than  100,000  people  in  Brazil  have  paid  their  last  respects  to  the   late  presidential  candidate,  Eduardo  Campos,  who  died  in  a  plane   crash  on  Wednesday.   They  attended  a  funeral  Mass  and  filled  the  streets  of  the  city  of   Recife  to  follow  the  passage  of  his  coffin.   Later  this  week,  Mr.  Campos's  Socialist  Party  is  expected  to  appoint   former  Environment  Minister  Marina  Silva  as  a  replacement   candidate.   Mr.  Campos's  jet  crashed  in  bad  weather  in  Santos,  near  Sao  Paulo.   Investigators  are  still  trying  to  establish  the  exact  causes  of  the  crash,   which  killed  six  other  people.  
  • 11. Why  is  NLP  Hard?   •  To  understand  the  current  event,  you  need  to   understand  several  other  concepts:   – Current  Event   – Background  Event   – Property   – references  to  other  events   – pronouns  
  • 12. NLP  TASKS   What  can  we  solve  with  Natural  Language  Processing  
  • 13. NLP  Tasks   •  Text  Categorization   •  Sentiment  Analysis   •  Information  Extraction   •  Information  Retrieval   •  Question  Answering   •  Text  Summarization   •  Machine  Translation  
  • 14. Text  Categorization   Input  Document   What  is  the  document  about:     sports:  0.2%     politics:  2%     entertainment:  96%     religion:  …     finance:  …  
  • 15. Text Classification finance.yahoo.com sports.yahoo.com (make your own wordle using wordle.net) The vocabulary used in one genre of text differs from the vocabulary used in another genre
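The idea that each genre has its own vocabulary can be sketched with a tiny naive Bayes classifier. This is a minimal illustration, not the method used in the talk: the training sentences, the category names and the helper names `train` and `classify` are made up for the example.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count word frequencies per category from (category, text) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for category, text in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return a probability per category (naive Bayes, add-one smoothing)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for category, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[category] / total_docs)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        log_scores[category] = score
    # Turn log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}

# Toy training data, invented for illustration only.
training = [
    ("sports", "the team won the game in the final minute"),
    ("sports", "the coach praised the players after the match"),
    ("finance", "the stock market fell as interest rates rose"),
    ("finance", "investors sold shares after the earnings report"),
]
word_counts, doc_counts = train(training)
probs = classify("the players won the match", word_counts, doc_counts)
print(max(probs, key=probs.get), probs)
```

The output is a probability per category, in the spirit of the "sports / politics / entertainment" percentages on the Text Categorization slide.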
  • 17. Sentiment Analysis Product Reviews – Kindle Paperwhite: "Sharp screen resolution", "Low battery life"
  • 18. Sentiment  Analysis   •  What  are  people  saying?   –  Twitter   –  Reviews   –  Blogs   –  Emails   •  Can  be  for:   –  Products   –  Companies   –  Movies   –  Books  
  • 19. Sentiment Analysis Possible Features • Important keywords and key phrases: – POS: dazzling, brilliant, phenomenal – NEG: hideous, awful, unwatchable • Emoticons – POS :-) – NEG :-( • Ontologies – WordNet: https://wordnet.princeton.edu/ – SentiWordNet: http://sentiwordnet.isti.cnr.it/
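The keyword and emoticon features listed on this slide can be turned into a very rough sentiment scorer. The sketch below is an assumption-laden toy: the two word lists come straight from the slide, while the function names `sentiment_score` and `label` and the scoring rule (positive cues minus negative cues) are invented for illustration.

```python
# Cue lists taken from the slide; a real system would use a much larger lexicon.
POSITIVE = {"dazzling", "brilliant", "phenomenal", ":-)"}
NEGATIVE = {"hideous", "awful", "unwatchable", ":-("}

def sentiment_score(text):
    """Return (number of positive cues) minus (number of negative cues)."""
    score = 0
    for token in text.lower().split():
        # Strip surrounding punctuation; emoticons are left intact.
        token = token.strip(".,!?\"'")
        if token in POSITIVE:
            score += 1
        elif token in NEGATIVE:
            score -= 1
    return score

def label(text):
    """Map the score to a coarse POS / NEG / NEUTRAL label."""
    score = sentiment_score(text)
    if score > 0:
        return "POS"
    if score < 0:
        return "NEG"
    return "NEUTRAL"

print(label("A dazzling, brilliant film :-)"))
print(label("Awful and unwatchable."))
print(label("The acting was great but I hated the movie"))
```

The last call comes out as NEUTRAL because neither "great" nor "hated" is in the tiny lexicon, and a plain cue count has no notion of "but" or sarcasm.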
  • 20. Challenges   •  People  express  opinions  in  complex  ways   – “The  acting  was  great  and  the  plots  were  intense   and  mesmerizing,  but  I  hated  the  movie”   •  Sarcasm,  humor  and  other  expressions   – “It  was  a  great  movie  for  a  Sunday  nap.  I  only  fell   asleep  twice,  but  it  was  very  restful”  
  • 23. Information  Extraction   Input  Document   What  are  the  key   pieces  of  information  ?     Location:   Time:   People:   …   Extracting  Named  Entities  from  Documents  
  • 25. Other ways for IE: Hypernyms (type of) colors such as red, blue and …
  • 26. Other ways for IE: Synonyms Find different relations between 2 concepts: Microsoft bought Farecast
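The "X such as A, B and C" pattern on the hypernym slide can be captured with a regular expression. This is a minimal sketch under simplifying assumptions: single-word items only, one pattern only, and the names `SUCH_AS` and `extract_hypernyms` are hypothetical.

```python
import re

# "<hypernym> such as <item>, <item> and <item>" with single-word items.
SUCH_AS = re.compile(r"(\w+) such as (\w+(?:(?:,? and |,? or |, )\w+)*)")

def extract_hypernyms(text):
    """Return (item, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1)
        items = re.split(r",? and |,? or |, ", match.group(2))
        pairs.extend((item, hypernym) for item in items if item)
    return pairs

print(extract_hypernyms("She likes colors such as red, blue and green."))
```

Each pair reads as "item is a type of hypernym", for example "red is a type of colors". Multi-word items ("dark blue") and other cue phrases ("including", "and other") would need additional patterns.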
  • 29. Information  Retrieval   Input  Document   What  are  the  documents   relevant  to  the  query?   Input  Document   Input  Document   Input  Document   Input  Document   query  
  • 31. Information  Retrieval   Q)  Which  documents  are  most  relevant  to  a   given  query?     A)  Similar  vocabulary  between  query  and   document?   Quantify  similarity  based  on  maximum  overlap   – Cosine  Similarity   – Jaccard  Similarity  
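Both similarity measures named on this slide fit in a few lines. The sketch below assumes the simplest possible setup: whitespace tokenization, raw term counts for cosine similarity and word sets for Jaccard similarity. The helper names and the sample documents are made up.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def jaccard_similarity(a, b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def cosine_similarity(a, b):
    """Cosine of the angle between the two term-count vectors."""
    vec_a, vec_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, documents):
    """Order documents from most to least similar to the query."""
    return sorted(documents, key=lambda d: cosine_similarity(query, d), reverse=True)

docs = ["stock market news", "plane crash in brazil"]
print(rank("plane crash", docs))
print(jaccard_similarity("a b c", "b c d"))
```

Search engines typically weight the counts (for example with TF-IDF) before comparing, which this sketch leaves out.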
  • 32. Information  Retrieval   Q)  If  you  rewrite  the  query  –  will  that  give  you   more  precise  results?     A)  Yes!  It  is  called  “Query  Expansion”  
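Query expansion can be sketched as adding related terms to the user's query before searching. In this illustration the synonym table is hypothetical and hard-coded; in practice the related terms might come from a resource such as WordNet. Expansion mainly helps match documents that use different wording from the query.

```python
# Hypothetical synonym table, invented for this example.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "cheap": ["inexpensive", "affordable"],
}

def expand_query(query, synonyms=SYNONYMS):
    """Return the query terms plus their synonyms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query("cheap car"))
```

The expanded term list would then be passed to the retrieval step in place of the original query.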
  • 33. Commercial  Search  Tools   •  Lucene   – http://lucene.apache.org/     •  ElasticSearch   – https://www.elastic.co/   Underlying  technology  in  most  of  these  is  the  same,  with  some  variations     Meetup  about  this  topic  scheduled  for  early  2016  
  • 35. Question Answering - Closed Input Data Source Questions: What event happened? When did the event happen? Why did the event happen? How long was the event? How did the event happen?
  • 41. Types of Text Summarization • Keyword Summaries – Extract significant keywords from text – Easy to implement – Hard for end user to understand
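A keyword summary really is easy to implement. A minimal sketch, assuming "significant" simply means "frequent and not a stopword"; the stopword list and the function name `keyword_summary` are made up for the example.

```python
from collections import Counter

# Small illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "was", "are", "near", "by"}

def keyword_summary(text, top_n=3):
    """Return the top_n most frequent non-stopword words."""
    words = [w.strip(".,!?;:\"'()").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]

text = ("The crash killed six people. Investigators study the crash. "
        "The crash happened near Santos.")
print(keyword_summary(text, top_n=3))
```

The result is a bare list of words, which shows why this kind of summary is hard for a reader to interpret.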
  • 42. Types of Text Summarization • Sentence/Phrase Extraction – Extract relevant sentences – Medium-Hard to implement – Easy for end user to understand
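Sentence extraction can be sketched by scoring each sentence by how frequent its words are in the whole text and keeping the best ones. This is one simple heuristic among many, not necessarily the approach the talk had in mind; the stopword list, the scoring rule and the function names are assumptions.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "was",
             "are", "were", "it", "by", "for", "at"}

def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (naive)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(sentence):
    """Lowercased words of a sentence with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]

def extract_summary(text, max_sentences=1):
    """Keep the sentences whose words are, on average, most frequent in the text."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    best = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Return the selected sentences in their original order.
    return [s for s in sentences if s in best]

text = ("The jet crashed in bad weather. "
        "Investigators say the jet crashed near Santos. "
        "Lunch was served at noon.")
print(extract_summary(text, max_sentences=1))
```

Because the output consists of sentences copied from the source, it is easy to read even though the selection logic is crude.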
  • 43. Types of Text Summarization • Natural Language Understanding and Generation – Understand meaning of text – Generate sentences from meaning of original text – Hard to implement – Easy for end user Example: President of University of Missouri resigned after graduate student hunger strike and class cancellations by faculty
  • 46. Why  is  MT  Hard?   •  It  is  not  a  1  to  1  translation   – In  the  previous  example  4  words  in  English   translate  into  2  in  Spanish   •  Grammar  is  different  in  different  languages   – SOV  (Subject  –  Object  –  Verb)   •  “She  him  loves”  (Hindi,  Japanese)   – SVO  (Subject  –  Verb  –  Object)     •  “She  loves  him”  (English,  Mandarin)  
  • 47. Machine Translation • Waygo app • Instantly translates Chinese, Japanese and Korean • Simply point and translate • Offline http://waygoapp.com/
  • 48. LINGUISTIC  NUANCES   Back  to  the  basics  
  • 49. Example   All  the  gobulins  were  gramzies.   It  was  grimbleton.   What  are  the  underlined  words?     gobulins     •  Noun   gramzies     •  Noun  or  Adjective   grimbleton   •  Noun  or  Adjective  
  • 50. Why  is  the  example  important?   We  can  get  a  sense  of  what  the  word  means,   based  on  how  it  is  used  in  language.  
  • 51. Nouns   •  E.g.  cat,  car,  computer,  tree   •  Variations:   – Number:  singular,  plural   •  one  car,  two  cars   – Gender:  masculine,  feminine,  neuter   – Case:  nominative,  genitive,  accusative,  dative  
  • 52. Pronouns • E.g. she, ourselves, mine • Vary in – Person – Gender • his, her – Number – Case: nominative, accusative, possessive, 2nd possessive – Reflexive and Anaphoric Forms: • herself, each other
  • 53. Determiners   •  Articles   – a,  an,  the   •  Demonstratives   – this,  that    
  • 54. Adjectives   •  Describe  Properties   – sunny,  beautiful,  calm   •  Attributive  and  predicative  properties   •  Agreement   – in  gender,  number   •  Comparative  and  superlative  forms   – derivative  and  periphrastic   •  positive  form  
  • 55. Verbs   •  Tense:  past,  present,  future   – danced,  dancing,  will  dance   •  Aspect:  progressive,  perfective   •  Voice:  active,  passive   •  Other:  number,  person   •  Arguments:  transitive,  intransitive,   ditransitive  
  • 56. Other  POS  tags   •  Adverbs   – happily   •  Prepositions   – of,  on,  in   •  Particles   – ran  a  bill  vs  ran  up  a  bill  
  • 57. Morphological  Analysis   •  Sleeps  =  sleep  +  v  +  3rd  Person  +  Singular   •  If  we  have  a  good  enough  grammar  with  all  of   these  rules,  we  have  a  good  shot  at   understanding  syntax  of  language  
  • 59. Automatic Taggers • Almost all the POS taggers use the Penn Treebank list of tags • https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html – Nouns: • NN (house), NNS (houses), NNP (White House), NNPS – Verbs: • VB (say), VBD (said), VBG (saying), VBN, VBP, VBZ – Adjectives: • JJ (good), JJR (better), JJS (best) – Adverbs: RB, RBR, RBS – Prepositions: IN
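The Penn Treebank tags above share prefixes by word class, so collapsing them into coarse categories takes only a prefix lookup. A small sketch covering just the tags on this slide; the table, the function name `coarse_pos` and the hand-tagged example are illustrative, not part of any tagger's API.

```python
# Prefix -> coarse category, limited to the tag families listed on the slide.
COARSE_BY_PREFIX = [
    ("NN", "noun"),
    ("VB", "verb"),
    ("JJ", "adjective"),
    ("RB", "adverb"),
    ("IN", "preposition"),
]

def coarse_pos(tag):
    """Map a Penn Treebank tag such as 'NNS' or 'VBD' to a coarse category."""
    for prefix, coarse in COARSE_BY_PREFIX:
        if tag.startswith(prefix):
            return coarse
    return "other"

# Hand-tagged tokens, as a tagger might return them.
tagged = [("houses", "NNS"), ("said", "VBD"), ("better", "JJR"), ("the", "DT")]
print([(word, coarse_pos(tag)) for word, tag in tagged])
```

This kind of grouping is handy when a task only needs "all nouns" or "all verbs" rather than the fine-grained distinctions.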
  • 61. POS Tagging and Parsing • Stanford Core NLP – http://nlp.stanford.edu:8080/corenlp/ • NLTK – Natural Language Toolkit – You need to provide your own training data, and train models for NLTK to be effective
  • 62. Other Linguistic Features of Interest – We want to get nouns and verbs into a root form E.g. • am, are, is → be • car, cars, car's → car – Two approaches: • Stemming • Lemmatization
  • 63. Stemming and Lemmatization • Lemmatization – use of a vocabulary – morphological analysis of words – returns the base or dictionary form of a word – base form is known as the lemma – e.g. am, are, is → be • Stemming – crude heuristic process – chops off the ends of words – in the hope of reducing words to a common base form – e.g. Marked → Mark, Marker → Mark
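The contrast between the two approaches can be shown with toy versions of each. This is a sketch only: the lemma table holds just the slide's examples, the suffix list is arbitrary, and neither function reflects how a production stemmer or lemmatizer works.

```python
# Lemmatization: look the word up in a vocabulary (here, a tiny hand-made table).
LEMMAS = {"am": "be", "are": "be", "is": "be", "cars": "car", "car's": "car"}

# Stemming: chop off common endings with no knowledge of the vocabulary.
SUFFIXES = ("ing", "ed", "er", "s")

def lemmatize(word):
    """Return the dictionary form if the word is in the table, else the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["Marked", "Marker", "is", "cars"]:
    print(w, "stem:", stem(w), "lemma:", lemmatize(w))
```

The stemmer maps "Marked" and "Marker" to "mark" but cannot relate "is" to "be"; the lemmatizer handles "is" but only knows the words in its table.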
  • 64. Parsing Resources • NLTK – Python, low accuracy, fast – http://www.nltk.org/ • Stanford Core NLP – Java, high accuracy, slow – http://nlp.stanford.edu/software/corenlp.shtml • spaCy – Python, medium accuracy, fast – https://spacy.io/
  • 65. Other Resources: Ontologies • WordNet – groups words when they have the same meaning – represents hierarchical links between groups – E.g. car is the same thing as an automobile • SentiWordNet – WordNet + Sentiment • ConceptNet – broader relationships than WordNet – E.g. bread is typically found near a toaster. • FrameNet – Frames represent concepts and their associated roles
  • 67. Semantics and Word Collocations • It is important to know which words occur together – Strong Beer vs Powerful Beer – Big Sister vs Large Sister • Two approaches have been used – Semantics: ontologies and word meanings – Statistics: word collocations and probabilities
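The statistical approach can be illustrated with pointwise mutual information (PMI), one common way to measure how strongly two words go together. PMI is chosen here as an example and is not named in the slides; the toy corpus and the function name `pmi` are invented.

```python
import math
from collections import Counter

def pmi(word_a, word_b, tokens):
    """log2( P(a followed by b) / (P(a) * P(b)) ), estimated from adjacent word pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair_count = bigrams[(word_a, word_b)]
    if pair_count == 0:
        return float("-inf")  # the pair never occurs in this corpus
    p_pair = pair_count / (len(tokens) - 1)
    p_a = unigrams[word_a] / len(tokens)
    p_b = unigrams[word_b] / len(tokens)
    return math.log2(p_pair / (p_a * p_b))

# Toy corpus, made up for illustration.
tokens = ("strong beer and strong coffee with a powerful engine and "
          "a powerful computer but strong beer again").split()

print(pmi("strong", "beer", tokens))
print(pmi("powerful", "beer", tokens))
```

A positive score for "strong beer" and no evidence at all for "powerful beer" mirrors the intuition on the slide; reliable estimates need a far larger corpus.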
  • 68. Thank  you  for  Listening   rutu@ticary.com   @RutuMulkar