Intro to NLP
1. Introduction to Natural Language Processing
Rutu Mulkar-Mehta, PhD
Founder and Data Scientist @Ticary
@RutuMulkar
2. Co-hosted Meetup
• Data Science Dojo: http://www.meetup.com/data-science-dojo
• Natural Language Processing: http://www.meetup.com/Natural-Language-Processing-Meetup/
4. About Me
• Founder and Data Scientist at Ticary
• Background:
– PhD in Natural Language Processing
– Computer Science
• Worked on applying NLP to:
– Healthcare
– SEO (Search Engine Optimization)
– Other stuff: Sentiment Analysis, Question Answering, Natural Language Understanding, and more
5. Agenda
• Understanding Natural Language
• Introduction to different NLP problems
• Part-of-Speech tagging
• Linguistic Resources
7. Some Example Sentences
• Children make delicious snacks
• I saw the Grand Canyon flying to New York
• Stolen painting found by the tree
• Two sentences:
– Monkeys like bananas when they wake up.
– Monkeys like bananas when they are ripe.
8. Why is NLP Hard?
"Brazil crowds attend funeral of late candidate Campos"
More than 100,000 people in Brazil have paid their last respects to the late presidential candidate, Eduardo Campos, who died in a plane crash on Wednesday. They attended a funeral Mass and filled the streets of the city of Recife to follow the passage of his coffin. Later this week, Mr. Campos's Socialist Party is expected to appoint former Environment Minister Marina Silva as a replacement candidate. Mr. Campos's jet crashed in bad weather in Santos, near Sao Paulo. Investigators are still trying to establish the exact causes of the crash, which killed six other people.
11. Why is NLP Hard?
• To understand the current event, you need to understand several other concepts:
– Current event
– Background event
– Property
– References to other events
– Pronouns
12. NLP TASKS
What can we solve with Natural Language Processing?
13. NLP Tasks
• Text Categorization
• Sentiment Analysis
• Information Extraction
• Information Retrieval
• Question Answering
• Text Summarization
• Machine Translation
14. Text Categorization
Given an input document, what is the document about?
sports: 0.2%
politics: 2%
entertainment: 96%
religion: …
finance: …
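The idea above can be sketched with a toy keyword-overlap classifier. This is a minimal illustration, not a real categorizer: the category names come from the slide, but the keyword lists are invented for the sketch, and a production system would instead learn weights (e.g. Naive Bayes or logistic regression) from labeled documents.

```python
from collections import Counter

# Hypothetical keyword lists per category -- a real classifier would
# learn these associations from labeled training documents.
CATEGORY_KEYWORDS = {
    "sports":        {"game", "score", "team", "season", "coach"},
    "politics":      {"election", "senate", "policy", "vote", "candidate"},
    "entertainment": {"movie", "album", "celebrity", "premiere", "actor"},
}

def categorize(document: str) -> dict:
    """Score each category by keyword overlap, normalized to sum to 1."""
    tokens = Counter(document.lower().split())
    hits = {cat: sum(tokens[w] for w in words)
            for cat, words in CATEGORY_KEYWORDS.items()}
    total = sum(hits.values()) or 1  # avoid division by zero
    return {cat: n / total for cat, n in hits.items()}

scores = categorize("The actor stunned fans at the movie premiere")
print(max(scores, key=scores.get))  # entertainment
```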
15. Text Classification
finance.yahoo.com vs. sports.yahoo.com (make your own word cloud using wordle.net)
Vocabulary used in one genre of text is different from vocabulary used in another genre.
16. NLP Tasks (recap)
18. Sentiment Analysis
• What are people saying?
– Twitter
– Reviews
– Blogs
– Emails
• Can be for:
– Products
– Companies
– Movies
– Books
19. Sentiment Analysis: Possible Features
• Important keywords and key phrases:
– POS: dazzling, brilliant, phenomenal
– NEG: hideous, awful, unwatchable
• Emoticons
– POS :-)
– NEG :-(
• Ontologies
– WordNet: https://wordnet.princeton.edu/
– SentiWordNet: http://sentiwordnet.isti.cnr.it/
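The feature types listed above can be combined into a minimal rule-based scorer. The word lists reuse the slide's examples plus a couple of invented fillers; a real system would pull polarity scores from SentiWordNet or a trained model instead.

```python
# Tiny illustrative lexicons -- stand-ins for SentiWordNet or learned weights.
POS_WORDS = {"dazzling", "brilliant", "phenomenal", "great"}
NEG_WORDS = {"hideous", "awful", "unwatchable", "hated"}
POS_EMOTICONS = {":-)", ":)"}
NEG_EMOTICONS = {":-(", ":("}

def sentiment_score(text: str) -> int:
    """Count positive cues minus negative cues; >0 means positive."""
    score = 0
    for raw in text.lower().split():
        token = raw.strip(".,!?")  # detach trailing punctuation from words
        if token in POS_WORDS or token in POS_EMOTICONS:
            score += 1
        elif token in NEG_WORDS or token in NEG_EMOTICONS:
            score -= 1
    return score

print(sentiment_score("The acting was dazzling :-)"))  # 2
print(sentiment_score("An awful, unwatchable movie"))  # -2
```

Note how this immediately fails on the "Challenges" slide's examples ("great movie for a Sunday nap"), which is exactly why sarcasm makes sentiment analysis hard.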
20. Challenges
• People express opinions in complex ways
– "The acting was great and the plots were intense and mesmerizing, but I hated the movie"
• Sarcasm, humor and other expressions
– "It was a great movie for a Sunday nap. I only fell asleep twice, but it was very restful"
22. NLP Tasks (recap)
23. Information Extraction
Given an input document, what are the key pieces of information?
Location: …
Time: …
People: …
Extracting Named Entities from documents.
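A first, deliberately naive pass at named-entity extraction just grabs runs of capitalized words. This heuristic is only a sketch (sentence-initial words produce false positives); real IE systems use trained sequence models such as CRFs or neural taggers.

```python
import re

def naive_entities(text: str):
    """Crude named-entity guesser: runs of consecutive capitalized words.
    Sentence-initial words are false positives -- a real system would use
    a trained sequence model, not this heuristic."""
    return re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", text)

sentence = "Eduardo Campos died in a plane crash near Sao Paulo on Wednesday."
print(naive_entities(sentence))  # ['Eduardo Campos', 'Sao Paulo', 'Wednesday']
```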
25. Other ways for IE: Hypernyms (type-of)
"colors such as red, blue and …"
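The "X such as A, B and C" construction on this slide is the classic Hearst pattern for mining hypernyms from raw text. A minimal regex sketch (covering only this one surface pattern, with a completed example phrase for illustration):

```python
import re

def hearst_such_as(text: str):
    """Extract (hypernym, hyponyms) pairs from the pattern
    'X such as A, B and C'. Only one of Hearst's several patterns."""
    pairs = []
    pattern = r"(\w+)\s+such\s+as\s+((?:\w+,\s*)*\w+(?:\s+and\s+\w+)?)"
    for hyper, hypos in re.findall(pattern, text):
        items = [i for i in re.split(r",\s*|\s+and\s+", hypos) if i]
        pairs.append((hyper, items))
    return pairs

print(hearst_such_as("colors such as red, blue and green"))
# [('colors', ['red', 'blue', 'green'])]
```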
26. Other ways for IE: Synonyms
Find different relations between two concepts:
"Microsoft bought Farecast"
27. NLP Tasks (recap)
29. Information Retrieval
Given a query and a collection of input documents, which documents are relevant to the query?
31. Information Retrieval
Q) Which documents are most relevant to a given query?
A) Similar vocabulary between query and document. Quantify similarity based on maximum overlap:
– Cosine Similarity
– Jaccard Similarity
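Both similarity measures named above are short to implement over bag-of-words representations. Jaccard compares token sets (|A ∩ B| / |A ∪ B|); cosine compares term-frequency vectors. The query and document strings below are illustrative.

```python
import math
from collections import Counter

def jaccard(a: str, b: str) -> float:
    """Set overlap of tokens: |A & B| / |A | B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def cosine(a: str, b: str) -> float:
    """Cosine of the angle between term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

query = "plane crash brazil"
doc = "brazil mourns candidate killed in plane crash"
print(round(jaccard(query, doc), 3))  # 0.429
print(round(cosine(query, doc), 3))   # 0.655
```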
32. Information Retrieval
Q) If you rewrite the query, will that give you more precise results?
A) Yes! It is called "Query Expansion".
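In its simplest form, query expansion adds related terms to the user's query before matching. The synonym table below is a hand-made toy; real systems expand with WordNet synsets, relevance feedback, or embedding neighbors.

```python
# Toy synonym table -- a stand-in for WordNet or learned expansions.
SYNONYMS = {
    "car": ["automobile", "auto"],
    "film": ["movie", "picture"],
}

def expand_query(query: str) -> list:
    """Return the original query terms plus any known synonyms."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query("classic film car chase"))
# ['classic', 'film', 'movie', 'picture', 'car', 'automobile', 'auto', 'chase']
```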
33. Commercial Search Tools
• Lucene: http://lucene.apache.org/
• ElasticSearch: https://www.elastic.co/
The underlying technology in most of these is the same, with some variations.
A Meetup about this topic is scheduled for early 2016.
34. NLP Tasks (recap)
35. Question Answering – Closed
Given an input data source, answer questions such as:
• What event happened?
• When did the event happen?
• Why did the event happen?
• How long was the event?
• How did the event happen?
41. Types of Text Summarization
• Keyword Summaries
– Extract significant keywords from text
– Easy to implement
– Hard for the end user to understand
42. Types of Text Summarization
• Sentence/Phrase Extraction
– Extract relevant sentences
– Medium-hard to implement
– Easy for the end user to understand
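The sentence-extraction approach above can be sketched in a few lines: score each sentence by the total corpus frequency of its words and keep the top-scoring ones in original order. This is a bare-bones sketch; real extractive summarizers add position features, redundancy removal, and better sentence splitting.

```python
import re
from collections import Counter

def extractive_summary(text: str, n: int = 1):
    """Keep the n sentences with the highest summed word frequency,
    in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))
    def score(sentence):
        return sum(freq[w] for w in re.findall(r"\w+", sentence.lower()))
    top = sorted(sentences, key=score, reverse=True)[:n]
    return [s for s in sentences if s in top]

article = ("The crash killed six people. "
           "Investigators study the crash. Weather was bad.")
print(extractive_summary(article, 1))  # ['The crash killed six people.']
```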
43. Types of Text Summarization
• Natural Language Understanding and Generation
– Understand the meaning of the text
– Generate sentences from the meaning of the original text
– Hard to implement
– Easy for the end user
Example: "President of University of Missouri resigned after graduate student hunger strike and class cancellations by faculty"
44. NLP Tasks (recap)
46. Why is MT Hard?
• It is not a one-to-one translation
– In the previous example, 4 words in English translate into 2 in Spanish
• Grammar is different in different languages
– SOV (Subject – Object – Verb)
• "She him loves" (Hindi, Japanese)
– SVO (Subject – Verb – Object)
• "She loves him" (English, Mandarin)
47. Machine Translation
• Waygo app: http://waygoapp.com/
• Instantly translates Chinese, Japanese and Korean
• Simply point and translate
• Works offline
49. Example
"All the gobulins were gramzies. It was grimbleton."
What part of speech is each invented word?
• gobulins: Noun
• gramzies: Noun or Adjective
• grimbleton: Noun or Adjective
50. Why is the example important?
We can get a sense of what a word means based on how it is used in language.
51. Nouns
• E.g. cat, car, computer, tree
• Variations:
– Number: singular, plural (one car, two cars)
– Gender: masculine, feminine, neuter
– Case: nominative, genitive, accusative, dative
52. Pronouns
• E.g. she, ourselves, mine
• Vary in:
– Person
– Gender (his, her)
– Number
– Case: nominative, accusative, possessive, 2nd possessive
– Reflexive and anaphoric forms (herself, each other)
54. Adjectives
• Describe properties: sunny, beautiful, calm
• Attributive and predicative properties
• Agreement in gender, number
• Comparative and superlative forms
– Derivative and periphrastic
– Positive form
56. Other POS Tags
• Adverbs: happily
• Prepositions: of, on, in
• Particles: "ran a bill" vs. "ran up a bill"
57. Morphological Analysis
• sleeps = sleep + V + 3rd person + singular
• If we have a good enough grammar with all of these rules, we have a good shot at understanding the syntax of language.
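The sleeps = sleep + V + 3rd person + singular decomposition can be mimicked with a few ordered suffix rules. This handles only a handful of regular verb forms; real morphological analyzers combine a full rule set with a lexicon of irregular forms.

```python
# Ordered suffix rules: (suffix to strip, replacement, features).
# "ies" must be tried before "s" so 'flies' -> 'fly', not 'flie'.
RULES = [
    ("ies", "y", ["V", "3rd-person", "singular"]),
    ("s",   "",  ["V", "3rd-person", "singular"]),
    ("ing", "",  ["V", "progressive"]),
    ("ed",  "",  ["V", "past"]),
]

def analyze(word: str):
    """Return (root, features) for a few regular English verb forms."""
    for suffix, repl, feats in RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + repl, feats
    return word, ["V", "base"]

print(analyze("sleeps"))  # ('sleep', ['V', '3rd-person', 'singular'])
```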
59. Automatic Taggers
• Almost all POS taggers use the Penn Treebank list of tags
• https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
– Nouns: NN (house), NNS (houses), NNP (White House), NNPS
– Verbs: VB (say), VBD (said), VBG (saying), VBN, VBP, VBZ
– Adjectives: JJ (good), JJR (better), JJS (best)
– Adverbs: RB, RBR, RBS
– Prepositions: IN
61. POS Tagging and Parsing
• Stanford Core NLP: http://nlp.stanford.edu:8080/corenlp/
• NLTK – Natural Language Toolkit
– You need to provide your own training data and train models for NLTK to be effective
62. Other Linguistic Features of Interest
• We want to get nouns and verbs into a root form, e.g.
– am, are, is → be
– car, cars, car's → car
• Two approaches:
– Stemming
– Lemmatization
63. Stemming and Lemmatization
• Lemmatization
– Uses a vocabulary and morphological analysis of words
– Returns the base or dictionary form of a word
– The base form is known as the lemma
– E.g. am, are, is → be
• Stemming
– A crude heuristic process that chops off the ends of words in the hope of achieving this goal
– E.g. marked → mark, marker → mark
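The contrast can be shown side by side. The suffix list and lemma table here are tiny stand-ins: a real stemmer would be something like the Porter stemmer, and a real lemmatizer would consult a full dictionary (e.g. WordNet) plus morphological rules.

```python
# Toy lemma dictionary -- stand-in for a real vocabulary lookup.
LEMMAS = {"am": "be", "are": "be", "is": "be", "marked": "mark"}

def stem(word: str) -> str:
    """Crude suffix chopping, as described above."""
    for suffix in ("ing", "ed", "er", "s"):
        # Require a reasonably long remaining stem to avoid mangling
        # short words like 'is' or 'red'.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lemmatize(word: str) -> str:
    """Dictionary lookup returning the lemma; falls back to the word."""
    return LEMMAS.get(word.lower(), word.lower())

print(stem("marked"), stem("marker"))    # mark mark
print(lemmatize("am"), lemmatize("is"))  # be be
```

Note that only the lemmatizer can map am/are/is to be; no amount of suffix chopping recovers that irregular relationship.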
64. Parsing Resources
• NLTK: Python, low accuracy, fast – http://www.nltk.org/
• Stanford Core NLP: Java, high accuracy, slow – http://nlp.stanford.edu/software/corenlp.shtml
• SpaCy: Python, medium accuracy, fast – https://spacy.io/
65. Other Resources: Ontologies
• WordNet
– Groups words when they have the same meaning
– Represents hierarchical links between groups
– E.g. "car" is the same thing as "automobile"
• SentiWordNet
– WordNet + Sentiment
• ConceptNet
– Broader relationships than WordNet
– E.g. bread is typically found near a toaster
• FrameNet
– Frames represent concepts and their associated roles
67. Semantics and Word Collocations
• It is important to know which words occur together
– "strong beer" vs. "powerful beer"
– "big sister" vs. "large sister"
• Two approaches have been used
– Semantics: ontologies and word meanings
– Statistics: word collocations and probabilities
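The statistical route starts from something as simple as counting adjacent word pairs over a corpus: "strong beer" shows up, "powerful beer" does not. The three-sentence corpus below is invented for illustration; real collocation finders compute association scores such as pointwise mutual information over these counts.

```python
from collections import Counter

def bigram_counts(corpus):
    """Count adjacent word pairs across a list of sentences --
    the raw statistics behind collocations like 'strong beer'."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts

corpus = [
    "strong beer and strong tea",
    "a strong beer",
    "powerful computer",
]
counts = bigram_counts(corpus)
print(counts[("strong", "beer")])    # 2
print(counts[("powerful", "beer")])  # 0
```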