2. What is Natural Language Processing (NLP)?
• A field of computer science concerned with the interactions
between computers and human (natural) languages.
• A subfield of Artificial Intelligence.
• Natural language:
refers to the languages spoken by people, as opposed to
artificial languages such as Java, Python, C++, etc.
Basha D (Natural Language Processing)
3. Forms of Natural Language
• The input/output of an NLP system can be:
– written text
– speech
• We will mostly be concerned with written text (not speech).
• To process written text, we need:
– lexical, syntactic, and semantic knowledge about the language
– discourse information and real-world knowledge
4. Components of NLP
• Natural Language Understanding
– Mapping the given natural-language input into a useful representation.
– Different levels of analysis are required:
morphological analysis,
syntactic analysis,
semantic analysis,
discourse analysis, …
• Natural Language Generation
– Producing natural-language output from some internal representation.
– Different levels of synthesis are required:
deep planning (what to say),
syntactic generation
• NL understanding is much harder than NL generation,
but both of them are hard.
5. Why is NL Understanding Hard?
• Natural language is extremely rich in form and structure, and
very ambiguous:
– How do we represent meaning?
– Which structures map to which meaning structures?
• One input can mean many different things, and ambiguity can arise at
different levels:
– Lexical (word-level) ambiguity – different meanings of words
– Syntactic ambiguity – different ways to parse the sentence
– Interpreting partial information – e.g., how to interpret pronouns
– Contextual information – the context of a sentence may affect its
meaning.
• Many inputs can mean the same thing.
6. Knowledge of Language
• Phonology – concerns how words are related to the sounds that
realize them.
• Morphology – concerns how words are constructed from more
basic meaning units called morphemes. A morpheme is the
primitive unit of meaning in a language.
• Syntax – concerns how words can be put together to form correct
sentences, what structural role each word plays in
the sentence, and which phrases are subparts of other phrases.
• Semantics – concerns what words mean and how these meanings
combine in sentences to form sentence meaning: the study of
context-independent meaning.
7. Knowledge of Language (cont.)
• Pragmatics – concerns how sentences are used in different
situations and how use affects the interpretation of the sentence.
• Discourse – concerns how the immediately preceding sentences
affect the interpretation of the next sentence, for example when
interpreting pronouns or the temporal aspects of the
information.
• World knowledge – includes general knowledge about the
world, and what each language user must know about the other's
beliefs and goals.
8. Ambiguity
I made her duck.
• How many different interpretations does this sentence have?
• What are the reasons for the ambiguity?
• The categories of knowledge of language can be thought of as
ambiguity-resolving components.
• How can each ambiguous piece be resolved?
• Does speech input make the sentence even more ambiguous?
– Yes – deciding word boundaries
9. Ambiguity (cont.)
• Some interpretations of: I made her duck.
1. I cooked duck for her.
2. I cooked the duck belonging to her.
3. I created a toy duck which she owns.
4. I caused her to quickly lower her head or body.
5. I used magic and turned her into a duck.
• duck – morphologically and syntactically ambiguous:
noun or verb.
• her – syntactically ambiguous: dative or possessive.
• make – semantically ambiguous: cook or create.
• make – syntactically ambiguous: it can take one direct object,
two objects, or a direct object plus a verb
(as in "caused her to duck").
10. Resolving Ambiguities
• We will introduce models and algorithms to resolve ambiguities
at different levels:
• Part-of-speech tagging – deciding whether duck is a verb or a
noun.
• Word-sense disambiguation – deciding whether make means
create or cook.
• Lexical disambiguation – resolving part-of-speech and
word-sense ambiguities are two important kinds of lexical
disambiguation.
• Syntactic ambiguity – her duck is an example of syntactic
ambiguity, and can be addressed by probabilistic parsing.
11. Resolving Ambiguities (cont.)
I made her duck – two parses, in bracketed notation:

(S (NP I) (VP (V made) (NP her) (NP duck)))      – her and duck are two separate objects
(S (NP I) (VP (V made) (NP (DET her) (N duck)))) – her duck is a single noun phrase
12. Zipf's Law
• States that the frequency of a word is inversely proportional to its rank,
where rank 1 is given to the most frequent word, rank 2 to the second most
frequent, and so on. This is also called a power-law distribution.
• Zipf's law gives the basic intuition behind stopwords: these are the
words with the highest frequencies (lowest ranks) in the text, and they are
typically of limited importance.
Broadly, there are three kinds of words present in any text corpus:
• Highly frequent words, called stop words, such as "is", "an", "the", etc.
• Significant words, which are typically more important for understanding the text
• Rarely occurring words, which are again less important than significant words
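The frequency-rank relationship can be observed directly: count the words in a corpus and sort by frequency. A minimal sketch, using only the standard library:

```python
from collections import Counter

def rank_frequencies(text):
    """Return (rank, word, frequency) tuples, rank 1 for the most frequent word."""
    counts = Counter(text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]

corpus = "the cat sat on the mat and the dog sat on the log"
for rank, word, freq in rank_frequencies(corpus)[:3]:
    print(rank, word, freq)
```

On a realistically large corpus, a plot of frequency against rank on log-log axes comes out roughly as a straight line, which is the power-law shape Zipf's law describes.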
13. Stopwords
• Generally speaking, stopwords are removed from the text for two reasons:
• They provide little useful information, especially in applications such as spam
detection or search engines.
• Since their frequency is very high, removing stopwords yields a
much smaller dataset, which means faster computation on the text data
and fewer features to deal with.
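Stopword removal is a simple filter over the token stream. A minimal sketch; the stopword list here is a tiny hand-picked assumption for illustration (real systems use larger curated lists, e.g. the one shipped with NLTK):

```python
# Hand-picked stopword list -- an assumption for illustration only.
STOPWORDS = {"is", "an", "the", "a", "and", "of", "to", "in"}

def remove_stopwords(text):
    """Tokenize on whitespace, lowercase, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

print(remove_stopwords("The quick brown fox jumps over the lazy dog"))
# ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
```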
14. NLP Tasks We Will Deal With
• Lexical processing
• Syntactic analysis
• Semantic processing
15. Lexical Processing
• Stopword removal
• Tokenization
• Bag-of-words representation
• Stemming and lemmatization
• Document-term matrix (DTM)
• TF-IDF representation
16. Lexical Processing
• Stopword removal – removing the less important words from the
corpus.
• Tokenization – a technique used to split the text into
smaller elements. These elements can be characters, words,
sentences, or even paragraphs, depending on the application we
are working on.
• Bag-of-words representation – a way to represent text in a format
we can feed into machine learning algorithms. The order of
occurrence does not matter: a bag-of-words model represents each
document by its token counts.
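Tokenization and the bag-of-words representation can be sketched together; note how the counts discard word order. The regex tokenizer below is one simple choice among many:

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (a simple regex-based sketch)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(text):
    """Order-free representation: token -> count."""
    return Counter(tokenize(text))

print(bag_of_words("NLP is fun, and fun is good"))
```

Because only counts are kept, "NLP is fun" and "fun is NLP" map to the same bag.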
17. Lexical Processing (cont.)
• Stemming – a rule-based technique that chops off the
suffix of a word to get its root form, called the "stem".
• Example: in "The driver is racing in his boss' car", the words
"driver" and "racing" are converted to their root forms by
chopping off the suffixes "er" and "ing". So "driver" is
converted to "driv" and "racing" is converted to "rac".
• Lemmatization – takes an input word and searches for its base
word by going recursively through the variations of dictionary
words. The base word in this case is called the lemma. Irregular words
such as "feet", "drove", "arose", "bought", etc. reduce to the lemmas
"foot", "drive", "arise", "buy" – something suffix-chopping cannot do.
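The suffix-chopping idea can be sketched with a toy stemmer. The suffix list and length guard here are assumptions for illustration; a real stemmer such as Porter's algorithm uses many more rules:

```python
# Toy suffix list -- an assumption for illustration, not Porter's rules.
SUFFIXES = ("ing", "er", "ed", "s")

def stem(word):
    """Chop off the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(stem("driver"), stem("racing"))  # driv rac
```

Note that stem("feet") just returns "feet": mapping it to "foot" needs a dictionary lookup, which is exactly what lemmatization provides.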
18. Lexical Processing (cont.)
• DTM – a document-term matrix describes the
frequency of terms occurring in a collection of documents.
• In a document-term matrix, rows correspond to documents in the
collection and columns correspond to terms.
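Building a DTM is just the bag-of-words count laid out against a shared vocabulary. A minimal sketch with whitespace tokenization:

```python
from collections import Counter

def document_term_matrix(docs):
    """Rows = documents, columns = terms (sorted vocabulary), cells = counts."""
    vocab = sorted({w for doc in docs for w in doc.lower().split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts[t] for t in vocab])
    return vocab, rows

docs = ["the cat sat", "the cat ran", "a dog ran"]
vocab, dtm = document_term_matrix(docs)
print(vocab)  # ['a', 'cat', 'dog', 'ran', 'sat', 'the']
print(dtm)    # [[0, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1], [1, 0, 1, 1, 0, 0]]
```

On real corpora this matrix is very sparse, which is why libraries store it in sparse form.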
19. Lexical Processing (cont.)
• The TF (term frequency) of a word is the number of times it
appears in a document, divided by the document length.
• For example, if a 100-word document contains the term "cat"
12 times, the TF for the word "cat" is
TF(cat) = 12/100 = 0.12
• The IDF (inverse document frequency) of a word is a measure
of how significant that term is in the whole corpus: terms that
occur in fewer documents get a higher IDF.
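A minimal TF-IDF sketch. The IDF formula used here, log(N / document frequency), is one common variant among several in use:

```python
import math

def tf(term, doc_tokens):
    """Term frequency: count of the term divided by document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, all_docs):
    """log(number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in all_docs if term in doc)
    return math.log(len(all_docs) / containing)

docs = [d.split() for d in ["cat sat here", "dog sat here", "cat ran away"]]
print(tf("cat", docs[0]))            # 1/3, like the slide's 12/100 = 0.12 example
print(idf("here", docs))             # log(3/2): appears in 2 of 3 documents
print(tf("cat", docs[0]) * idf("cat", docs))  # the TF-IDF weight
```

A term appearing in every document gets IDF = log(1) = 0, so ubiquitous words (stopwords) are weighted down to nothing.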
20. Syntactic Analysis
• Part-of-speech (POS) tagging
• Named entity recognition
• Constituency parsing
• Dependency parsing
21. Part-of-Speech (POS) Tagging
• Each word has a part-of-speech tag describing its category.
• The part-of-speech tag of a word is one of the major word groups
(or their subgroups):
– open classes – nouns, verbs, adjectives, adverbs
– closed classes – prepositions, determiners, conjunctions, pronouns, participles
• POS taggers try to find the POS tags of words.
• Is duck a verb or a noun? (A morphological analyzer alone cannot make
that decision.)
• A POS tagger may make that decision by looking at the surrounding
words:
– Duck! (verb)
– Duck is delicious for dinner. (noun)
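A common starting point is a most-frequent-tag baseline: tag each word with the tag it carried most often in a hand-tagged training corpus. The tiny corpus below is made up for illustration; note that this baseline ignores the surrounding words, which is exactly what real taggers add:

```python
from collections import Counter, defaultdict

# Tiny made-up tagged corpus -- an assumption for illustration.
tagged_corpus = [
    ("duck", "VERB"), ("duck", "NOUN"), ("duck", "NOUN"),
    ("is", "VERB"), ("delicious", "ADJ"),
]

def train_baseline(corpus):
    """Map each word to its single most frequent tag in the corpus."""
    counts = defaultdict(Counter)
    for word, tag in corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

tagger = train_baseline(tagged_corpus)
print(tagger["duck"])  # NOUN -- seen twice as NOUN, once as VERB
```

Because it always answers NOUN for duck, this baseline gets "Duck!" wrong; context-sensitive models (HMMs, neural taggers) exist to fix exactly such cases.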
22. Syntactic Analysis
• Parsing – a key task in syntactic analysis is parsing: breaking
a given sentence down into its grammatical constituents.
Parsing is an important step in many applications because it helps us
better understand the linguistic structure of sentences.
E.g.: "The quick brown fox jumps over the table"
• This structure divides the sentence into three main constituents:
– 'The quick brown fox' is a noun phrase
– 'jumps' is a verb phrase
– 'over the table' is a prepositional phrase.
23. Syntactic Analysis
• The IOB (or BIO) method tags each token in the sentence with one of three
labels: I – inside (the entity), O – outside (the entity), and B – beginning (of
the entity).
• IOB labeling is especially helpful when entities contain multiple words: we
want our system to read phrases like "Air India" or "New Delhi" as
single entities.
• The named entity recognition (NER) task identifies "entities" in the text. Entities
could be names of people, organizations (e.g., Air India, United Airlines),
places/cities (Mumbai, Chicago), dates and time points (May, Wednesday,
morning flight), numbers of specific types (e.g., money – 5000 INR), etc. POS
tagging by itself cannot identify such multi-word entities, so IOB
labeling is used: the NER task is to predict the IOB label of each word.
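The IOB scheme above can be made concrete with a small helper that converts entity token spans into per-token labels (a sketch; real NER data additionally carries the entity type, e.g. B-ORG, I-LOC):

```python
def iob_tags(tokens, entity_spans):
    """Convert (start, end) token-index spans into IOB labels,
    one label per token; `end` is exclusive, as in Python slices."""
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

tokens = ["I", "flew", "Air", "India", "to", "New", "Delhi"]
print(iob_tags(tokens, [(2, 4), (5, 7)]))
# ['O', 'O', 'B', 'I', 'O', 'B', 'I']
```

The B/I distinction is what lets a decoder tell two adjacent entities apart from one longer entity.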
24. Syntactic Analysis
• Constituency parsers – divide the sentence into constituent
phrases such as noun phrases, verb phrases, prepositional phrases,
etc. Each constituent phrase can itself be divided into further
phrases. For example, a constituency parse tree divides a
sentence into two main phrases – a noun phrase and a verb phrase –
and the verb phrase may be further divided into a verb and a
prepositional phrase, and so on.
25. Syntactic Analysis
• Dependency parsers – do not divide a sentence into constituent
phrases, but rather establish relationships directly between the
words themselves, producing a dependency parse tree (e.g., linking
a verb to its subject and object).
26. Semantic Analysis
• Assigning meanings to the structures created by syntactic
analysis.
• Mapping words and structures to particular domain objects in a way
consistent with our knowledge of the world.
• Semantics can play an important role in selecting among competing
syntactic analyses and discarding illogical analyses.
– I robbed the bank – is bank a river bank or a financial institution?
• We have to decide which formalisms will be used in the
meaning representation.