2. What is Natural Language Processing (NLP)?
• A field of computer science concerned with the interactions
between computers and human (natural) languages.
• A subfield of Artificial Intelligence.
• Natural language:
refers to the languages spoken by people, as opposed to
artificial languages such as Java, Python, C++, etc.
Basha D (Natural Language Processing)
3. Forms of Natural Language
• The input/output of an NLP system can be:
– written text
– speech
• We will mostly be concerned with written text (not speech).
• To process written text, we need:
– lexical, syntactic, and semantic knowledge about the language
– discourse information and real-world knowledge
4. Components of NLP
• Natural Language Understanding
– Mapping the given natural-language input into a useful representation.
– Different levels of analysis are required:
morphological analysis,
syntactic analysis,
semantic analysis,
discourse analysis, …
• Natural Language Generation
– Producing natural-language output from some internal representation.
– Different levels of synthesis are required:
deep planning (what to say),
syntactic generation
• NL understanding is much harder than NL generation,
but both of them are hard.
5. Why is NL Understanding Hard?
• Natural language is extremely rich in form and structure, and
very ambiguous:
– How do we represent meaning?
– Which structures map to which meaning structures?
• One input can mean many different things, and ambiguity can arise at
different levels:
– Lexical (word-level) ambiguity – different meanings of words
– Syntactic ambiguity – different ways to parse the sentence
– Interpreting partial information – e.g., how to interpret pronouns
– Contextual information – the context of a sentence may affect its
meaning.
• Many inputs can mean the same thing.
6. Knowledge of Language
• Phonology – concerns how words are related to the sounds that
realize them.
• Morphology – concerns how words are constructed from more
basic meaning units called morphemes. A morpheme is the
primitive unit of meaning in a language.
• Syntax – concerns how words can be put together to form correct
sentences, what structural role each word plays in
the sentence, and which phrases are subparts of other phrases.
• Semantics – concerns what words mean and how these meanings
combine in sentences to form sentence meaning: the study of
context-independent meaning.
7. Knowledge of Language (cont.)
• Pragmatics – concerns how sentences are used in different
situations and how use affects the interpretation of the sentence.
• Discourse – concerns how the immediately preceding sentences
affect the interpretation of the next sentence, for example when
interpreting pronouns or the temporal aspects of the
information.
• World knowledge – includes general knowledge about the
world, and what each language user must know about the other's
beliefs and goals.
8. Ambiguity
I made her duck.
• How many different interpretations does this sentence have?
• What are the reasons for the ambiguity?
• The categories of knowledge of language can be thought of as
ambiguity-resolving components.
• How can each ambiguous piece be resolved?
• Does speech input make the sentence even more ambiguous?
– Yes – deciding word boundaries
9. Ambiguity (cont.)
• Some interpretations of: I made her duck.
1. I cooked duck for her.
2. I cooked the duck belonging to her.
3. I created a toy duck which she owns.
4. I caused her to quickly lower her head or body.
5. I used magic and turned her into a duck.
• duck – morphologically and syntactically ambiguous:
noun or verb.
• her – syntactically ambiguous: dative or possessive.
• make – semantically ambiguous: cook or create.
• make – syntactically ambiguous: it can take one direct object,
two objects, or a direct object plus a verb
(as in "caused her to duck").
10. Resolving Ambiguities
• We will introduce models and algorithms to resolve ambiguities
at different levels:
• Part-of-speech tagging – deciding whether duck is a verb or a
noun.
• Word-sense disambiguation – deciding whether make means
create or cook.
• Lexical disambiguation – resolving part-of-speech and
word-sense ambiguities are two important kinds of lexical
disambiguation.
• Syntactic ambiguity – her duck is an example of syntactic
ambiguity, and can be addressed by probabilistic parsing.
11. Resolving Ambiguities (cont.)
I made her duck – two parses, in bracketed notation:

(S (NP I) (VP (V made) (NP her) (NP duck)))      – her and duck are two separate objects
(S (NP I) (VP (V made) (NP (DET her) (N duck)))) – her duck is a single noun phrase
12. Zipf's Law
• States that the frequency of a word is inversely proportional to its rank,
where rank 1 is given to the most frequent word, rank 2 to the second most
frequent, and so on. This is also called a power-law distribution.
• Zipf's law gives the basic intuition behind stopwords: these are the
words with the highest frequencies (lowest ranks) in the text, and they are
typically of limited importance.
Broadly, there are three kinds of words present in any text corpus:
• Highly frequent words, called stop words, such as "is", "an", "the", etc.
• Significant words, which are typically more important for understanding the text
• Rarely occurring words, which are again less important than significant words
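The frequency-rank relationship can be observed directly: count the words in a corpus and sort by frequency. A minimal sketch, using only the standard library:

```python
from collections import Counter

def rank_frequencies(text):
    """Return (rank, word, frequency) tuples, rank 1 for the most frequent word."""
    counts = Counter(text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]

corpus = "the cat sat on the mat and the dog sat on the log"
for rank, word, freq in rank_frequencies(corpus)[:3]:
    print(rank, word, freq)
```

On a realistically large corpus, a plot of frequency against rank on log-log axes comes out roughly as a straight line, which is the power-law shape Zipf's law describes.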
13. Stopwords
• Generally speaking, stopwords are removed from the text for two reasons:
• They provide little useful information, especially in applications such as spam
detection or search engines.
• Since their frequency is very high, removing stopwords yields a
much smaller dataset, which means faster computation on the text data
and fewer features to deal with.
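Stopword removal is a simple filter over the token stream. A minimal sketch; the stopword list here is a tiny hand-picked assumption for illustration (real systems use larger curated lists, e.g. the one shipped with NLTK):

```python
# Hand-picked stopword list -- an assumption for illustration only.
STOPWORDS = {"is", "an", "the", "a", "and", "of", "to", "in"}

def remove_stopwords(text):
    """Tokenize on whitespace, lowercase, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

print(remove_stopwords("The quick brown fox jumps over the lazy dog"))
# ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
```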
14. NLP Tasks We Will Deal With
• Lexical processing
• Syntactic analysis
• Semantic processing
15. Lexical Processing
• Stopword removal
• Tokenization
• Bag-of-words representation
• Stemming and lemmatization
• Document-term matrix (DTM)
• TF-IDF representation
16. Lexical Processing
• Stopword removal – removing the less important words from the
corpus.
• Tokenization – a technique used to split the text into
smaller elements. These elements can be characters, words,
sentences, or even paragraphs, depending on the application we
are working on.
• Bag-of-words representation – a way to represent text in a format
we can feed into machine learning algorithms. The order of
occurrence does not matter: a bag-of-words model represents each
document by its token counts.
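Tokenization and the bag-of-words representation can be sketched together; note how the counts discard word order. The regex tokenizer below is one simple choice among many:

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (a simple regex-based sketch)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(text):
    """Order-free representation: token -> count."""
    return Counter(tokenize(text))

print(bag_of_words("NLP is fun, and fun is good"))
```

Because only counts are kept, "NLP is fun" and "fun is NLP" map to the same bag.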
17. Lexical Processing (cont.)
• Stemming – a rule-based technique that chops off the
suffix of a word to get its root form, called the "stem".
• Example: in "The driver is racing in his boss' car", the words
"driver" and "racing" are converted to their root forms by
chopping off the suffixes "er" and "ing". So "driver" is
converted to "driv" and "racing" is converted to "rac".
• Lemmatization – takes an input word and searches for its base
word by going recursively through the variations of dictionary
words. The base word in this case is called the lemma. Irregular words
such as "feet", "drove", "arose", "bought", etc. reduce to the lemmas
"foot", "drive", "arise", "buy" – something suffix-chopping cannot do.
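The suffix-chopping idea can be sketched with a toy stemmer. The suffix list and length guard here are assumptions for illustration; a real stemmer such as Porter's algorithm uses many more rules:

```python
# Toy suffix list -- an assumption for illustration, not Porter's rules.
SUFFIXES = ("ing", "er", "ed", "s")

def stem(word):
    """Chop off the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(stem("driver"), stem("racing"))  # driv rac
```

Note that stem("feet") just returns "feet": mapping it to "foot" needs a dictionary lookup, which is exactly what lemmatization provides.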
18. Lexical Processing (cont.)
• DTM – a document-term matrix describes the
frequency of terms occurring in a collection of documents.
• In a document-term matrix, rows correspond to documents in the
collection and columns correspond to terms.
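Building a DTM is just the bag-of-words count laid out against a shared vocabulary. A minimal sketch with whitespace tokenization:

```python
from collections import Counter

def document_term_matrix(docs):
    """Rows = documents, columns = terms (sorted vocabulary), cells = counts."""
    vocab = sorted({w for doc in docs for w in doc.lower().split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts[t] for t in vocab])
    return vocab, rows

docs = ["the cat sat", "the cat ran", "a dog ran"]
vocab, dtm = document_term_matrix(docs)
print(vocab)  # ['a', 'cat', 'dog', 'ran', 'sat', 'the']
print(dtm)    # [[0, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1], [1, 0, 1, 1, 0, 0]]
```

On real corpora this matrix is very sparse, which is why libraries store it in sparse form.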
19. Lexical Processing (cont.)
• The TF (term frequency) of a word is the number of times it
appears in a document, divided by the document length.
• For example, if a 100-word document contains the term "cat"
12 times, the TF for the word "cat" is
TF(cat) = 12/100 = 0.12
• The IDF (inverse document frequency) of a word is a measure
of how significant that term is in the whole corpus: terms that
occur in fewer documents get a higher IDF.
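A minimal TF-IDF sketch. The IDF formula used here, log(N / document frequency), is one common variant among several in use:

```python
import math

def tf(term, doc_tokens):
    """Term frequency: count of the term divided by document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, all_docs):
    """log(number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in all_docs if term in doc)
    return math.log(len(all_docs) / containing)

docs = [d.split() for d in ["cat sat here", "dog sat here", "cat ran away"]]
print(tf("cat", docs[0]))            # 1/3, like the slide's 12/100 = 0.12 example
print(idf("here", docs))             # log(3/2): appears in 2 of 3 documents
print(tf("cat", docs[0]) * idf("cat", docs))  # the TF-IDF weight
```

A term appearing in every document gets IDF = log(1) = 0, so ubiquitous words (stopwords) are weighted down to nothing.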
20. Syntactic Analysis
• Part-of-speech (POS) tagging
• Named entity recognition
• Constituency parsing
• Dependency parsing
21. Part-of-Speech (POS) Tagging
• Each word has a part-of-speech tag describing its category.
• The part-of-speech tag of a word is one of the major word groups
(or their subgroups):
– open classes – nouns, verbs, adjectives, adverbs
– closed classes – prepositions, determiners, conjunctions, pronouns, participles
• POS taggers try to find the POS tags of words.
• Is duck a verb or a noun? (A morphological analyzer alone cannot make
that decision.)
• A POS tagger may make that decision by looking at the surrounding
words:
– Duck! (verb)
– Duck is delicious for dinner. (noun)
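A common starting point is a most-frequent-tag baseline: tag each word with the tag it carried most often in a hand-tagged training corpus. The tiny corpus below is made up for illustration; note that this baseline ignores the surrounding words, which is exactly what real taggers add:

```python
from collections import Counter, defaultdict

# Tiny made-up tagged corpus -- an assumption for illustration.
tagged_corpus = [
    ("duck", "VERB"), ("duck", "NOUN"), ("duck", "NOUN"),
    ("is", "VERB"), ("delicious", "ADJ"),
]

def train_baseline(corpus):
    """Map each word to its single most frequent tag in the corpus."""
    counts = defaultdict(Counter)
    for word, tag in corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

tagger = train_baseline(tagged_corpus)
print(tagger["duck"])  # NOUN -- seen twice as NOUN, once as VERB
```

Because it always answers NOUN for duck, this baseline gets "Duck!" wrong; context-sensitive models (HMMs, neural taggers) exist to fix exactly such cases.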
22. Syntactic Analysis
• Parsing – a key task in syntactic analysis is parsing: breaking
a given sentence down into its grammatical constituents.
Parsing is an important step in many applications because it helps us
better understand the linguistic structure of sentences.
E.g.: "The quick brown fox jumps over the table"
• This structure divides the sentence into three main constituents:
– 'The quick brown fox' is a noun phrase
– 'jumps' is a verb phrase
– 'over the table' is a prepositional phrase.
23. Syntactic Analysis
• The IOB (or BIO) method tags each token in the sentence with one of three
labels: I – inside (the entity), O – outside (the entity), and B – beginning (of
the entity).
• IOB labeling is especially helpful when entities contain multiple words: we
want our system to read phrases like "Air India" or "New Delhi" as
single entities.
• The named entity recognition (NER) task identifies "entities" in the text. Entities
could be names of people, organizations (e.g., Air India, United Airlines),
places/cities (Mumbai, Chicago), dates and time points (May, Wednesday,
morning flight), numbers of specific types (e.g., money – 5000 INR), etc. POS
tagging by itself cannot identify such multi-word entities, so IOB
labeling is used: the NER task is to predict the IOB label of each word.
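The IOB scheme above can be made concrete with a small helper that converts entity token spans into per-token labels (a sketch; real NER data additionally carries the entity type, e.g. B-ORG, I-LOC):

```python
def iob_tags(tokens, entity_spans):
    """Convert (start, end) token-index spans into IOB labels,
    one label per token; `end` is exclusive, as in Python slices."""
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

tokens = ["I", "flew", "Air", "India", "to", "New", "Delhi"]
print(iob_tags(tokens, [(2, 4), (5, 7)]))
# ['O', 'O', 'B', 'I', 'O', 'B', 'I']
```

The B/I distinction is what lets a decoder tell two adjacent entities apart from one longer entity.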
24. Syntactic Analysis
• Constituency parsers – divide the sentence into constituent
phrases such as noun phrases, verb phrases, prepositional phrases,
etc. Each constituent phrase can itself be divided into further
phrases. For example, a constituency parse tree divides a
sentence into two main phrases – a noun phrase and a verb phrase –
and the verb phrase may be further divided into a verb and a
prepositional phrase, and so on.
25. Syntactic Analysis
• Dependency parsers – do not divide a sentence into constituent
phrases, but rather establish relationships directly between the
words themselves, producing a dependency parse tree (e.g., linking
a verb to its subject and object).
26. Semantic Analysis
• Assigning meanings to the structures created by syntactic
analysis.
• Mapping words and structures to particular domain objects in a way
consistent with our knowledge of the world.
• Semantics can play an important role in selecting among competing
syntactic analyses and discarding illogical analyses.
– I robbed the bank – is bank a river bank or a financial institution?
• We have to decide which formalisms will be used in the
meaning representation.