Natural Language Processing (NLP)
D Basha
Subex Limited
Basha D (Natural Language Processing)
What is Natural Language Processing (NLP)?
• A field of computer science concerned with the interactions
between computers and human (natural) languages.
• A subfield of artificial intelligence.
• Natural language:
Refers to the languages spoken by people, as opposed to
artificial languages such as Java, Python, and C++.
Forms of Natural Language
• The input/output of an NLP system can be:
– written text
– speech
• We will mostly be concerned with written text (not speech).
• To process written text, we need:
– lexical, syntactic, semantic knowledge about the language
– discourse information, real world knowledge
Components of NLP
• Natural Language Understanding
– Mapping the given input in the natural language into a useful representation.
– Different levels of analysis are required:
morphological analysis,
syntactic analysis,
semantic analysis,
discourse analysis, …
• Natural Language Generation
– Producing output in the natural language from some internal representation.
– Different levels of synthesis are required:
deep planning (what to say),
syntactic generation
• NL understanding is much harder than NL generation,
but both of them are hard.
Why is NL Understanding hard?
• Natural language is extremely rich in form and structure, and
very ambiguous.
– How to represent meaning,
– Which structures map to which meaning structures.
• One input can mean many different things. Ambiguity can be at
different levels.
– Lexical (word level) ambiguity -- different meanings of words
– Syntactic ambiguity -- different ways to parse the sentence
– Interpreting partial information -- how to interpret pronouns
– Contextual information -- context of the sentence may affect the meaning of that
sentence.
• Many inputs can mean the same thing.
Knowledge of Language
• Phonology – concerns how words are related to the sounds that
realize them.
• Morphology – concerns how words are constructed from more
basic meaning units called morphemes. A morpheme is the
primitive unit of meaning in a language.
• Syntax – concerns how words can be put together to form correct
sentences and determines what structural role each word plays in
the sentence and what phrases are subparts of other phrases.
• Semantics – concerns what words mean and how these meanings
combine in sentences to form sentence meaning. The study of
context-independent meaning.
Knowledge of Language (cont.)
• Pragmatics – concerns how sentences are used in different
situations and how use affects the interpretation of the sentence.
• Discourse – concerns how the immediately preceding sentences
affect the interpretation of the next sentence. For example,
interpreting pronouns and interpreting the temporal aspects of the
information.
• World Knowledge – includes general knowledge about the
world, including what each language user must know about the
other’s beliefs and goals.
Ambiguity
I made her duck.
• How many different interpretations does this sentence have?
• What are the reasons for the ambiguity?
• The categories of knowledge of language can be thought of as
ambiguity resolving components.
• How can each ambiguous piece be resolved?
• Does speech input make the sentence even more ambiguous?
– Yes – deciding word boundaries
Ambiguity (cont.)
• Some interpretations of: I made her duck.
1. I cooked duck for her.
2. I cooked duck belonging to her.
3. I created a toy duck which she owns.
4. I caused her to quickly lower her head or body.
5. I used magic and turned her into a duck.
• duck – morphologically and syntactically ambiguous:
noun or verb.
• her – syntactically ambiguous: dative or possessive.
• make – semantically ambiguous: cook or create.
• make – syntactically ambiguous: transitive (one direct object),
ditransitive (two objects), or taking a direct object and a verb.
Resolve Ambiguities
• We will introduce models and algorithms to resolve ambiguities
at different levels.
• part-of-speech tagging -- deciding whether duck is a verb or a
noun.
• word-sense disambiguation -- deciding whether make means
create or cook.
• lexical disambiguation -- Resolution of part-of-speech and
word-sense ambiguities are two important kinds of lexical
disambiguation.
• syntactic ambiguity -- her duck is an example of syntactic
ambiguity, and can be addressed by probabilistic parsing.
Resolve Ambiguities (cont.)
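I made her duck
Parse 1 (her and duck are separate noun phrases):
[S [NP I] [VP [V made] [NP her] [NP duck]]]
Parse 2 (her duck is a single noun phrase):
[S [NP I] [VP [V made] [NP [DET her] [N duck]]]]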
Zipf's law
• States that the frequency of a word is inversely proportional to the rank of the
word, where rank 1 is given to the most frequent word, 2 to the second most
frequent and so on. This is an example of a power law distribution.
• Zipf's law helps us form the basic intuition for stopwords: these are the
words with the highest frequencies (the top ranks, i.e. the smallest rank numbers)
in the text, and are typically of limited 'importance'.
Broadly, there are three kinds of words present in any text corpus:
• Highly frequent words, called stop words, such as ‘is’, ‘an’, ‘the’, etc.
• Significant words, which are typically more important to understand the text
• Rarely occurring words, which are again less important than significant words
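The rank-frequency relationship described above can be sketched in a few lines of Python. This is only an illustration: the toy sentence and the helper name rank_frequency are assumptions made for this example, and a corpus this small will not follow Zipf's law closely.

```python
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(text.lower().split())
    ranked = counts.most_common()
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]

# Toy corpus invented for this illustration.
toy_text = "the cat sat on the mat and the dog sat on the rug"
table = rank_frequency(toy_text)

for rank, word, count in table[:3]:
    # In a large corpus, Zipf's law predicts that rank * count
    # stays roughly constant across ranks.
    print(rank, word, count, rank * count)
```

On a real corpus, plotting count against rank on log-log axes should give a roughly straight line, with the stopwords clustered at the top ranks.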
Stopwords
• Generally speaking, stopwords are removed from the text for two reasons:
• They provide little useful information, especially in applications such as a spam
detector or a search engine.
• Since stopwords occur very frequently, removing them results in a
much smaller dataset. The reduced size results
in faster computation on text data. There is also the advantage of having fewer
features to deal with once stopwords are removed.
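A minimal sketch of stopword removal follows. The stop list here is a small hypothetical set chosen for the example; libraries such as NLTK ship much longer lists.

```python
# Hypothetical stop list for illustration; real lists are much longer.
STOPWORDS = {"is", "an", "the", "a", "of", "in", "on"}

def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stop list (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in STOPWORDS]

tokens = "The price of an apple is low".split()
filtered = remove_stopwords(tokens)
print(filtered)  # the three content words remain
```

Here seven tokens shrink to three, which is the size reduction the second bullet refers to.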
NLP tasks that we deal with
• Lexical processing
• Syntactic Analysis
• Semantic processing
Lexical Processing
• Stop word removal
• Tokenization
• Bag of words representation
• Stemming and Lemmatization
• Document-term matrix (DTM)
• TF-IDF representation
Lexical Processing
• Stopword removal – removing the less important words from the
corpus.
• Tokenization – a technique used to split the text into
smaller elements. These elements can be characters, words,
sentences, or even paragraphs, depending on the application we
are working on.
• Bag-of-words representation – represents text in a format that
we can feed into machine learning algorithms. Here the order in which
words occur does not matter. A bag-of-words model is simply the
matrix of word counts that you get from the text data.
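Tokenization and the bag-of-words idea can be sketched as below. The regular expression used for tokenizing is a simplifying assumption (real tokenizers handle punctuation, contractions and so on more carefully), and the function names are made up for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (simple regex-based assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(text):
    """Count how often each token occurs; word order is discarded."""
    return Counter(tokenize(text))

bag_a = bag_of_words("The dog chased the cat.")
bag_b = bag_of_words("The cat chased the dog.")

print(bag_a)
# The two sentences mean different things but have identical bags,
# because the representation ignores the sequence of words.
print(bag_a == bag_b)
```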
Lexical Processing (cont.)
• Stemming – a rule-based technique that simply chops off the
suffix of a word to get its root form, which is called the ‘stem’.
• Example: in "The driver is racing in his boss’ car", the words
‘driver’ and ‘racing’ will be converted to their root form by just
chopping off the suffixes ‘er’ and ‘ing’. So, ‘driver’ will be
converted to ‘driv’ and ‘racing’ will be converted to ‘rac’.
• Lemmatization – takes an input word and searches for its base
word by going through the variations of dictionary
words. The base word in this case is called the lemma. Words
such as ‘feet’, ‘drove’, ‘arose’, ‘bought’, etc. cannot be reduced to
their correct base form by a stemmer, but a lemmatizer can handle them.
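The contrast between the two techniques can be sketched with a toy suffix stripper and a toy dictionary lookup. Both are assumptions made for illustration: the suffix list and the mini-dictionary are hypothetical, and real stemmers (for example the Porter stemmer) and lemmatizers use far richer rules and lexicons and may give different output.

```python
# Suffixes the toy stemmer strips, checked in this order (illustrative assumption).
SUFFIXES = ("ing", "er", "ed", "s")

# Hypothetical mini-dictionary mapping irregular forms to their lemmas.
LEMMAS = {"feet": "foot", "drove": "drive", "arose": "arise", "bought": "buy"}

def toy_stem(word, min_stem=3):
    """Chop the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)

print(toy_stem("driver"), toy_stem("racing"))   # stems from the slide's example
print(toy_stem("feet"), toy_lemmatize("feet"))  # stemmer leaves 'feet' unchanged
```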
Lexical Processing (cont.)
• DTM – a document-term matrix describes the
frequency of terms that occur in a collection of documents.
• In a document-term matrix, rows correspond to documents in the
collection and columns correspond to terms.
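A document-term matrix can be built by hand as in the sketch below. The three toy documents and the function name document_term_matrix are assumptions for the example; whitespace splitting stands in for a proper tokenizer.

```python
from collections import Counter

def document_term_matrix(docs):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        # Counter returns 0 for terms that do not occur in this document.
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

docs = ["the cat sat", "the dog sat", "the cat saw the dog"]
vocab, dtm = document_term_matrix(docs)

print(vocab)
for row in dtm:
    print(row)
```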
Lexical Processing (cont.)
• The TF (term frequency) of a word is the number of times it
appears in a document, here normalized by the document's length.
• For example, when a 100-word document contains the term “cat”
12 times, the TF for the word ‘cat’ is
• TFcat = 12/100, i.e. 0.12
• The IDF (inverse document frequency) of a word is a measure
of how significant that term is in the whole corpus. Terms that
occur in fewer documents get a higher IDF.
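The TF calculation above, together with one possible IDF, can be sketched as follows. The slides do not give an IDF formula, so log(N / df) is assumed here as one common variant (N is the number of documents, df the number of documents containing the term); libraries often use smoothed versions. The toy corpus is invented for the example.

```python
import math

def term_frequency(term, doc_tokens):
    """Occurrences of the term divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def inverse_document_frequency(term, corpus):
    """log(N / df): one common IDF variant, assumed for this sketch."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

# A 100-token document containing 'cat' 12 times, as in the example above.
doc = ["cat"] * 12 + ["filler"] * 88
# Hypothetical corpus of four documents; only the first mentions 'cat'.
corpus = [doc, ["dog"] * 10, ["bird"] * 10, ["fish"] * 10]

tf = term_frequency("cat", doc)
idf = inverse_document_frequency("cat", corpus)
print(tf, idf, tf * idf)
```

The TF-IDF weight is the product of the two: high for terms that are frequent in one document but rare across the corpus.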
Syntactic Analysis
• Part-of-speech (POS) tagging
• Named Entity Recognition
• Constituency parsing
• Dependency parsing
Part-of-Speech (POS) Tagging
• Each word has a part-of-speech tag to describe its category.
• The part-of-speech tag of a word is one of the major word groups
(or its subgroups).
– open classes -- noun, verb, adjective, adverb
– closed classes -- prepositions, determiners, conjunctions, pronouns, particles
• POS taggers try to find POS tags for the words.
• Is duck a verb or a noun? (A morphological analyzer cannot make this
decision.)
• A POS tagger may make that decision by looking at the surrounding
words.
– Duck! (verb)
– Duck is delicious for dinner. (noun)
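The idea of deciding a tag from the surrounding words can be sketched with a few hand-written context rules for the single word duck. This is a toy under stated assumptions, not how real taggers work: the word lists and the function tag_duck are hypothetical, and practical POS taggers learn such context patterns from annotated data.

```python
# Hypothetical word list for this sketch.
DETERMINERS = {"a", "an", "the", "her", "his", "my"}

def tag_duck(tokens, index):
    """Guess whether tokens[index] ('duck') is a NOUN or a VERB from its neighbours."""
    prev_word = tokens[index - 1].lower() if index > 0 else None
    next_word = tokens[index + 1].lower() if index + 1 < len(tokens) else None
    if prev_word in DETERMINERS:
        return "NOUN"  # e.g. "the duck"
    if next_word in {"is", "was"}:
        return "NOUN"  # subject of a following verb
    if prev_word is None and next_word in {"!", None}:
        return "VERB"  # bare imperative such as "Duck!"
    return "NOUN"      # default guess when no rule fires

print(tag_duck(["Duck", "!"], 0))
print(tag_duck("Duck is delicious for dinner .".split(), 0))
```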
Syntactic Analysis
• Parsing – a key task in syntactic analysis is parsing. It means
breaking down a given sentence into its 'grammatical constituents'.
Parsing is an important step in many applications, as it helps us
better understand the linguistic structure of sentences.
E.g.: "The quick brown fox jumps over the table"
• This structure divides the sentence into three main constituents:
'The quick brown fox' is a noun phrase
'jumps' is a verb phrase
'over the table' is a prepositional phrase.
Syntactic Analysis
• The IOB (or BIO) method tags each token in the sentence with one of three
labels: I - inside (the entity), O - outside (the entity) and B - beginning (of the
entity).
• IOB labeling is especially helpful if the entities contain multiple words. We
would want our system to read words like ‘Air India’, ‘New Delhi’, etc., as
single entities.
• The Named Entity Recognition (NER) task identifies ‘entities’ in the text. Entities could
refer to names of people, organizations (e.g. Air India, United Airlines),
places/cities (Mumbai, Chicago), dates and time points (May, Wednesday,
morning flight), numbers of specific types (e.g. money - 5000 INR) etc. POS
tagging by itself cannot identify such multi-word entities, so IOB
labeling is used. The NER task is then to predict the IOB label of each word.
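Once each token carries an IOB label, multi-word entities can be recovered by grouping a B token with the I tokens that follow it. The sketch below assumes plain B/I/O labels without entity types, and the labels are written by hand for the example; in a real system they would be predicted by an NER model.

```python
def extract_entities(tokens, labels):
    """Group tokens into entities using B (begin), I (inside), O (outside) labels."""
    entities, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":
            # A new entity starts; close any entity still open.
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            # O label (or a stray I with no open entity) ends the current entity.
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

tokens = ["Air", "India", "flies", "from", "New", "Delhi", "to", "Mumbai"]
labels = ["B", "I", "O", "O", "B", "I", "O", "B"]
entities = extract_entities(tokens, labels)
print(entities)
```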
Syntactic Analysis
• Constituency parsers – divide the sentence into constituent
phrases such as noun phrase, verb phrase, prepositional phrase,
etc. Each constituent phrase can itself be divided into further
phrases. The constituency parse tree given below divides the
sentence into two main phrases - a noun phrase and a verb phrase.
The verb phrase is further divided into a verb and a prepositional
phrase, and so on.
Syntactic Analysis
• Dependency Parsers do not divide a sentence into constituent
phrases, but rather establish relationships directly between the
words themselves. The figure below is an example of a
dependency parse tree.
Semantic Analysis
• Assigning meanings to the structures created by syntactic
analysis.
• Mapping words and structures to particular domain objects in a way
consistent with our knowledge of the world.
• Semantics can play an important role in selecting among competing
syntactic analyses and discarding illogical analyses.
– I robbed the bank -- is bank a river bank or a financial institution?
• We have to decide the formalisms which will be used in the
meaning representation.
A Typical information extraction system
Databases - WordNet and ConceptNet
• WordNet is a semantically oriented dictionary of English, similar
to a traditional thesaurus but with a richer structure.
• WordNet is available through NLTK, and we can use it to identify
the 'correct' sense of a word (i.e. for word sense disambiguation).
• ConceptNet is a representation that provides commonsense
linkages between words. For example, it states that bread is
commonly found near toasters. These everyday facts could be
useful if, e.g., you wanted to make a smart chatbot which says
- “Since you like toasters, do you also like bread? I can order some for
you.”
Distributional Semantics
• The term-document occurrence matrix, where each row is a term 
in the vocabulary and each column is a document (such as a 
webpage, tweet, book etc.)  
• The term-term co-occurrence matrix, where the ith row and jth 
column represents the occurrence of the ith word in the context
of the jth word.
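A term-term co-occurrence matrix can be sketched as nested counts. "Context" is assumed here to mean a symmetric window of neighbouring words; the window size, the toy sentence and the function name are choices made for this example.

```python
from collections import defaultdict

def cooccurrence_matrix(tokens, window=1):
    """Count how often each word appears within `window` positions of another word."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[word][tokens[j]] += 1
    return counts

tokens = "the cat sat on the mat".split()
cooc = cooccurrence_matrix(tokens, window=1)

# Row for 'the': the words seen next to it and how often.
print(dict(cooc["the"]))
```

With a symmetric window the matrix is symmetric: the count for (cat, the) equals the count for (the, cat).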
Distributional Semantics
• Word vectors created using these techniques (term-document, or
occurrence-context, matrices and term-term, or co-occurrence,
matrices) are high-dimensional and sparse.
• Word embeddings are a lower-dimensional representation of the
word vectors. There are broadly two ways to generate word
embeddings - frequency-based and prediction-based:
• In a frequency-based approach, you take the high-dimensional
occurrence-context or co-occurrence matrix.
Word embeddings are then generated by reducing the
dimensionality of the matrix using matrix factorization
(e.g. LSA).
Distributional Semantics
• The prediction-based approach involves training a shallow neural
network which learns to predict the words in the context of a
given input word. The two widely used prediction-based models
are the skip-gram model and the Continuous Bag of Words
(CBOW) model. In the skip-gram model, the input is the
current/target word and the outputs are the context words. The
embeddings are then given by the weight matrix between
the input layer and the hidden layer.
Also, word2vec and GloVe vectors are two of the most popular
pre-trained word embeddings available for use.
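The training data for a skip-gram model consists of (target word, context word) pairs. The sketch below only generates those pairs; it does not train a network. The window size and the function name skipgram_pairs are assumptions for the example.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for each word and its neighbours within `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "i made her duck".split()
pairs = skipgram_pairs(tokens, window=1)
for target, context in pairs:
    # Each pair is one training example: input = target, output = context.
    print(target, "->", context)
```

CBOW reverses the direction: the context words are the input and the target word is the output.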
Word Sense Disambiguation
• Word sense disambiguation (WSD) is the task of identifying the 
correct sense of an ambiguous word such as 'bank', 'bark', 'pitch' 
etc.
• Supervised techniques for word sense disambiguation require
training data in which the words are tagged with their senses.
• Supervised: Naive Bayes classifier.
• Unsupervised: Lesk algorithm.
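The core idea of the Lesk algorithm is to pick the sense whose dictionary definition (gloss) overlaps most with the words around the ambiguous word. Below is a simplified sketch of that idea. The two glosses for 'bank' and the stop list are hand-written assumptions for the example; a practical version would take glosses from a resource such as WordNet.

```python
# Hypothetical, hand-written glosses for two senses of 'bank'.
GLOSSES = {
    "bank/finance": "an institution that accepts deposits of money and makes loans",
    "bank/river": "sloping land beside a body of water such as a river",
}
# Small illustrative stop list so function words do not count as overlap.
STOPWORDS = {"a", "an", "the", "of", "and", "that", "such", "as",
             "to", "by", "i", "in", "on"}

def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def simplified_lesk(context, glosses):
    """Pick the sense whose gloss shares the most content words with the context."""
    context_set = content_words(context)
    return max(glosses,
               key=lambda sense: len(context_set & content_words(glosses[sense])))

sense_1 = simplified_lesk("I deposited money in the bank", GLOSSES)
sense_2 = simplified_lesk("We sat by the river on the bank", GLOSSES)
print(sense_1, sense_2)
```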
Natural Language Generation
• NLG is the process of constructing natural language outputs from
non-linguistic inputs.
• NLG can be viewed as the reverse process of NL understanding.
• An NLG system may have two main parts:
– Discourse planner -- decides what will be generated, i.e. which
sentences.
– Surface realizer -- realizes a sentence from its internal
representation.
• Lexical selection -- selecting the correct words to describe the
concepts.
Some NLP Applications
• Machine Translation – Translation between two natural 
languages. 
• Information Retrieval – Web search (uni-lingual or multi-lingual).
• Query Answering/Dialogue – Natural language interface with a
database system, or a dialogue system.
• Chat Bots 
• Sentiment Analysis
• Some Small Applications –
–  Grammar Checking, Spell Checking, Spell Correctors
Python Libraries for NLP
• NLTK – supports more languages than many other
libraries; no built-in support for word vectors
• spaCy – a fast NLP framework; provides built-in word vectors
• Gensim – designed primarily for unsupervised text modelling
• TextBlob – provides language translation and detection,
powered by Google Translate
Basha D (Natural Language Processing)

More Related Content

What's hot

Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
Yasir Khan
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
Jaganadh Gopinadhan
 
Machine Tanslation
Machine TanslationMachine Tanslation
Machine Tanslation
Mahsa Mohaghegh
 

What's hot (20)

Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introduction
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processing
 
Natural language processing PPT presentation
Natural language processing PPT presentationNatural language processing PPT presentation
Natural language processing PPT presentation
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Natural language processing (NLP)
Natural language processing (NLP) Natural language processing (NLP)
Natural language processing (NLP)
 
Natural Language Processing
Natural Language Processing Natural Language Processing
Natural Language Processing
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Natural Language Processing for Games Research
Natural Language Processing for Games ResearchNatural Language Processing for Games Research
Natural Language Processing for Games Research
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Language models
Language modelsLanguage models
Language models
 
Machine Tanslation
Machine TanslationMachine Tanslation
Machine Tanslation
 
Natural Language processing
Natural Language processingNatural Language processing
Natural Language processing
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 

Similar to Natural language processing

Natural Language Processing (NLP).pptx
Natural Language Processing (NLP).pptxNatural Language Processing (NLP).pptx
Natural Language Processing (NLP).pptx
SHIBDASDUTTA
 
Natural language processing (nlp)
Natural language processing (nlp)Natural language processing (nlp)
Natural language processing (nlp)
Kuppusamy P
 
Natural Language Processing Course in AI
Natural Language Processing Course in AINatural Language Processing Course in AI
Natural Language Processing Course in AI
SATHYANARAYANAKB
 

Similar to Natural language processing (20)

AI Lesson 41
AI Lesson 41AI Lesson 41
AI Lesson 41
 
Lesson 41
Lesson 41Lesson 41
Lesson 41
 
Natural Language Processing (NLP).pptx
Natural Language Processing (NLP).pptxNatural Language Processing (NLP).pptx
Natural Language Processing (NLP).pptx
 
Lesson 40
Lesson 40Lesson 40
Lesson 40
 
AI Lesson 40
AI Lesson 40AI Lesson 40
AI Lesson 40
 
Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4
 
Natural language processing (nlp)
Natural language processing (nlp)Natural language processing (nlp)
Natural language processing (nlp)
 
NLP_KASHK:Text Normalization
NLP_KASHK:Text NormalizationNLP_KASHK:Text Normalization
NLP_KASHK:Text Normalization
 
Nlp (1)
Nlp (1)Nlp (1)
Nlp (1)
 
Natural Language Processing Course in AI
Natural Language Processing Course in AINatural Language Processing Course in AI
Natural Language Processing Course in AI
 
L1 nlp intro
L1 nlp introL1 nlp intro
L1 nlp intro
 
intro.ppt
intro.pptintro.ppt
intro.ppt
 
Nlp
NlpNlp
Nlp
 
5810 oral lang anly transcr wkshp (fall 2014) pdf
5810 oral lang anly transcr wkshp (fall 2014) pdf  5810 oral lang anly transcr wkshp (fall 2014) pdf
5810 oral lang anly transcr wkshp (fall 2014) pdf
 
Natural Language Processing from Object Automation
Natural Language Processing from Object Automation Natural Language Processing from Object Automation
Natural Language Processing from Object Automation
 
6 POS SA.pptx
6 POS SA.pptx6 POS SA.pptx
6 POS SA.pptx
 
Natural Language Processing - Unit 1
Natural Language Processing - Unit 1Natural Language Processing - Unit 1
Natural Language Processing - Unit 1
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with Python
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
 
REPORT.doc
REPORT.docREPORT.doc
REPORT.doc
 

Recently uploaded

Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
only4webmaster01
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
SUHANI PANDEY
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 

Recently uploaded (20)

Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 

Natural language processing

  • 1. 1 Natural Language Processing (NLP) D Basha Subex Limited Basha D (Natural Language Processing)
  • 2. 2 What is Natural Language Processing (NLP) • A field of computer science that is concerned with interactions between computers and human(natural) languages. • A subfield of Artificial intelligence • Natural Language : Refers to the natural language spoken by people as opposed to the artificial languages like Java , Python,C++ etc. Basha D (Natural Language Processing)
  • 3. 3 Forms of Natural Language • The input/output of a NLP system can be: – written text – speech • We will mostly concerned with written text (not speech). • To process written text, we need: – lexical, syntactic, semantic knowledge about the language – discourse information, real world knowledge Basha D (Natural Language Processing)
  • 4. 4 Components of NLP • Natural Language Understanding – Mapping the given input in the natural language into a useful representation. – Different level of analysis required: morphological analysis, syntactic analysis, semantic analysis, discourse analysis, … • Natural Language Generation – Producing output in the natural language from some internal representation. – Different level of synthesis required: deep planning (what to say), syntactic generation • NL Understanding is much harder than NL Generation. But, still both of them are hard. Basha D (Natural Language Processing)
  • 5. 5 Why NL Understanding is hard? • Natural language is extremely rich in form and structure, and very ambiguous. – How to represent meaning, – Which structures map to which meaning structures. • One input can mean many different things. Ambiguity can be at different levels. – Lexical (word level) ambiguity -- different meanings of words – Syntactic ambiguity -- different ways to parse the sentence – Interpreting partial information -- how to interpret pronouns – Contextual information -- context of the sentence may affect the meaning of that sentence. • Many input can mean the same thing. Basha D (Natural Language Processing)
  • 6. Knowledge of Language • Phonology – concerns how words are related to the sounds that realize them. • Morphology – concerns how words are constructed from more basic meaning units called morphemes. A morpheme is the primitive unit of meaning in a language. • Syntax – concerns how words can be put together to form correct sentences, and determines what structural role each word plays in the sentence and which phrases are subparts of other phrases. • Semantics – concerns what words mean and how these meanings combine in sentences to form sentence meaning. It is the study of context-independent meaning.
  • 7. Knowledge of Language (cont.) • Pragmatics – concerns how sentences are used in different situations and how use affects the interpretation of the sentence. • Discourse – concerns how the immediately preceding sentences affect the interpretation of the next sentence, for example when interpreting pronouns and the temporal aspects of the information. • World Knowledge – includes general knowledge about the world, and what each language user must know about the other’s beliefs and goals.
  • 8. Ambiguity • Example sentence: I made her duck. • How many different interpretations does this sentence have? • What are the reasons for the ambiguity? • The categories of knowledge of language can be thought of as ambiguity-resolving components. • How can each ambiguous piece be resolved? • Does speech input make the sentence even more ambiguous? – Yes: with speech, word boundaries must also be decided.
  • 9. Ambiguity (cont.) • Some interpretations of: I made her duck. 1. I cooked duck for her. 2. I cooked duck belonging to her. 3. I created a toy duck which she owns. 4. I caused her to quickly lower her head or body. 5. I used magic and turned her into a duck. • duck – morphologically and syntactically ambiguous: noun or verb. • her – syntactically ambiguous: dative or possessive. • make – semantically ambiguous: cook or create. • make – syntactically ambiguous: it can take one object (as in 2), two objects (as in 1), or an object and a verb (as in 4).
  • 10. Resolve Ambiguities • We will introduce models and algorithms to resolve ambiguities at different levels. • part-of-speech tagging -- Deciding whether duck is a verb or a noun. • word-sense disambiguation -- Deciding whether make means create or cook. • lexical disambiguation -- Resolving part-of-speech and word-sense ambiguities are two important kinds of lexical disambiguation. • syntactic ambiguity -- her duck is an example of syntactic ambiguity, and can be addressed by probabilistic parsing.
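  • 11. Resolve Ambiguities (cont.) • Two parse trees for: I made her duck • Tree 1 (her and duck are separate noun phrases): [S [NP I] [VP [V made] [NP her] [NP duck]]] • Tree 2 (her duck is a single noun phrase): [S [NP I] [VP [V made] [NP [DET her] [N duck]]]]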
  • 12. Zipf's law • States that the frequency of a word is inversely proportional to its rank, where rank 1 is given to the most frequent word, rank 2 to the second most frequent, and so on. This is an example of a power-law distribution. • Zipf's law helps us form the basic intuition for stopwords: these are the words with the highest frequencies (that is, the top ranks) in the text, and they are typically of limited 'importance'. Broadly, there are three kinds of words present in any text corpus: • Highly frequent words, called stop words, such as ‘is’, ‘an’, ‘the’, etc. • Significant words, which are typically more important for understanding the text • Rarely occurring words, which are again less important than significant words
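The rank/frequency relationship described above can be checked on any text. The following is a minimal sketch using only the Python standard library; the function name `rank_frequency` and the sample sentence are illustrative assumptions, not part of the slides. Under Zipf's law the product rank × count stays roughly constant on a large corpus; a toy sentence like this one is far too small to show that, so the code only demonstrates how the table is built.

```python
import re
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, count, rank * count) rows, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    rows = []
    # most_common() sorts by descending count; rank 1 is the most frequent word
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, count, rank * count))
    return rows

sample = "the cat sat on the mat and the dog sat on the rug"  # toy text
rows = rank_frequency(sample)
for row in rows:
    print(row)
```

On real corpora, the words at the top of this table are typically the stopwords discussed on the slide.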
  • 13. Stopwords • Generally speaking, stopwords are removed from the text for two reasons: • They provide little useful information, especially in applications such as spam detection or search engines. • Since stopwords occur very frequently, removing them results in a much smaller dataset. The reduced size results in faster computation on text data. There is also the advantage of having fewer features to deal with once stopwords are removed.
  • 14. NLP tasks that we deal with • Lexical processing • Syntactic analysis • Semantic processing
  • 15. Lexical Processing • Stop word removal • Tokenization • Bag-of-words representation • Stemming and lemmatization • Document-term matrix (DTM) • TF-IDF representation
  • 16. Lexical Processing • Stopword removal – removing the less important words from the corpus. • Tokenization – a technique used to split the text into smaller elements. These elements can be characters, words, sentences, or even paragraphs, depending on the application. • Bag-of-words representation – represents text in a format that we can feed into machine learning algorithms. Here the order in which words occur does not matter. A bag-of-words model is simply the matrix of word counts that you get from the text data.
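The three steps on this slide can be sketched together in a few lines. This is a minimal illustration using only the standard library; the stopword list is a tiny hand-picked assumption (real lists, such as the one shipped with NLTK, are much longer), and the regular-expression tokenizer is a simplification of what production tokenizers do.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list)
STOPWORDS = {"is", "an", "the", "a", "in", "of", "and", "on"}

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def bag_of_words(text):
    """Count the remaining tokens; word order is discarded."""
    return Counter(remove_stopwords(tokenize(text)))

bow = bag_of_words("The driver is racing in the car, and the car is fast.")
print(bow)
```

Because a `Counter` ignores order, "the car is fast" and "fast is the car" produce the same bag of words, which is exactly the property the slide describes.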
  • 17. Lexical Processing (cont.) • Stemming – a rule-based technique that simply chops off the suffix of a word to get its root form, which is called the ‘stem’. • Example: in "The driver is racing in his boss’ car", the words ‘driver’ and ‘racing’ are converted to their root forms by chopping off the suffixes ‘er’ and ‘ing’. So ‘driver’ becomes ‘driv’ and ‘racing’ becomes ‘rac’. • Lemmatization – takes an input word and searches for its base word by going through the variations of dictionary words. The base word in this case is called the lemma. Words such as ‘feet’, ‘drove’, ‘arose’, ‘bought’, etc. cannot be reduced correctly by chopping off a suffix, but a lemmatizer can map them to ‘foot’, ‘drive’, ‘arise’ and ‘buy’.
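The difference between the two techniques can be illustrated with a deliberately naive sketch. The suffix list, the `min_stem_len` parameter, and the small `IRREGULAR` lookup table below are all assumptions made for illustration; this is not the Porter stemmer or a real dictionary-based lemmatizer, only a toy version of the idea on the slide.

```python
SUFFIXES = ("ing", "er", "ed", "s")  # illustrative suffix list, checked in order

def naive_stem(word, min_stem_len=3):
    """Chop the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

# Hypothetical mini dictionary standing in for a real lemmatizer's lexicon
IRREGULAR = {"feet": "foot", "drove": "drive", "arose": "arise", "bought": "buy"}

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    word = word.lower()
    return IRREGULAR.get(word, word)

for w in ["driver", "racing", "feet", "bought"]:
    print(w, naive_stem(w), toy_lemmatize(w))
```

The stemmer leaves ‘feet’ unchanged because no rule applies, while the dictionary lookup recovers ‘foot’; this is the trade-off between the speed of rules and the accuracy of a lexicon.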
  • 18. Lexical Processing (cont.) • DTM – A document-term matrix describes the frequency of the terms that occur in a collection of documents. • In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
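A document-term matrix with this row/column layout can be built directly from tokenized documents. The sketch below assumes whitespace tokenization and a toy three-document corpus; the function name `document_term_matrix` is illustrative.

```python
def document_term_matrix(documents):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in documents]
    # Sorted vocabulary gives a stable column order
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = [[tokens.count(term) for term in vocabulary] for tokens in tokenized]
    return vocabulary, matrix

docs = ["the cat sat", "the dog sat", "the cat saw the dog"]  # toy corpus
vocab, dtm = document_term_matrix(docs)
print(vocab)
for row in dtm:
    print(row)
```

Each row is the bag-of-words vector of one document; most entries are zero once the vocabulary grows, which is why such matrices are usually stored in sparse form.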
  • 19. Lexical Processing (cont.) • The TF (term frequency) of a word is the number of times it appears in a document, here normalized by the length of the document. • For example, when a 100-word document contains the term “cat” 12 times, the TF for the word ‘cat’ is TF_cat = 12/100, i.e. 0.12. • The IDF (inverse document frequency) of a word is a measure of how significant that term is in the whole corpus: terms that occur in fewer documents receive a higher IDF.
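The slide's TF example can be reproduced in code. The IDF formula is not given in the transcript, so the sketch below assumes one common variant, IDF = log(N / df), where N is the number of documents and df is the number of documents containing the term; libraries often use smoothed versions that give slightly different numbers. The corpus and function names are illustrative.

```python
import math

def term_frequency(term, tokens):
    """Occurrences of the term divided by the document length."""
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, corpus):
    """log(N / df), an assumed unsmoothed variant; 0.0 if the term never occurs."""
    docs_with_term = sum(1 for tokens in corpus if term in tokens)
    if docs_with_term == 0:
        return 0.0
    return math.log(len(corpus) / docs_with_term)

def tf_idf(term, tokens, corpus):
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)

# A 100-word document containing "cat" 12 times, as in the slide's example
doc = ["cat"] * 12 + ["filler"] * 88
corpus = [doc, ["dog", "filler"], ["bird", "filler"]]

print(term_frequency("cat", doc))
print(tf_idf("cat", doc, corpus))
print(tf_idf("filler", doc, corpus))
```

A word that appears in every document, like "filler" here, gets an IDF of zero under this variant, so its TF-IDF score vanishes no matter how often it occurs.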
  • 20. Syntactic Analysis • Part-of-speech (POS) tagging • Named Entity Recognition • Constituency parsing • Dependency parsing
  • 21. Part-of-Speech (POS) Tagging • Each word has a part-of-speech tag that describes its category. • The part-of-speech tag of a word is one of the major word groups (or one of their subgroups). – open classes -- noun, verb, adjective, adverb – closed classes -- prepositions, determiners, conjunctions, pronouns, particles • POS taggers try to find the POS tags for the words. • Is duck a verb or a noun? (A morphological analyzer cannot make this decision.) • A POS tagger may make that decision by looking at the surrounding words. – Duck! (verb) – Duck is delicious for dinner. (noun)
  • 22. Syntactic Analysis • Parsing – A key task in syntactic analysis is parsing, which means breaking down a given sentence into its 'grammatical constituents'. Parsing is an important step in many applications and helps us better understand the linguistic structure of sentences. E.g.: "The quick brown fox jumps over the table" • This structure divides the sentence into three main constituents: 'The quick brown fox' is a noun phrase, 'jumps' is a verb phrase, and 'over the table' is a prepositional phrase.
  • 23. Syntactic Analysis • The IOB (or BIO) method tags each token in the sentence with one of three labels: I - inside (the entity), O - outside (the entity) and B - beginning (of the entity). • IOB labeling is especially helpful if the entities contain multiple words. We would want our system to read words like ‘Air India’, ‘New Delhi’, etc., as single entities. • The Named Entity Recognition (NER) task identifies ‘entities’ in the text. Entities can refer to names of people, organizations (e.g. Air India, United Airlines), places/cities (Mumbai, Chicago), dates and time points (May, Wednesday, morning flight), numbers of specific types (e.g. money - 5000 INR), etc. POS tagging by itself cannot identify such entities, so IOB labeling is required. The NER task is therefore to predict the IOB label of each word.
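Once a tagger has predicted IOB labels, multi-word entities such as ‘Air India’ are recovered by grouping consecutive B and I tags. The sketch below assumes the common `B-TYPE` / `I-TYPE` / `O` label format; the entity type names `ORG` and `LOC` and the example sentence are illustrative assumptions, and the tags are written by hand rather than predicted by a model.

```python
def extract_entities(tokens, tags):
    """Group B-/I- tagged tokens into (entity text, entity type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that was still open
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # Continue the entity that is currently open
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with no matching open entity) ends the entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

tokens = ["Air", "India", "flies", "from", "New", "Delhi", "to", "Mumbai"]
tags = ["B-ORG", "I-ORG", "O", "O", "B-LOC", "I-LOC", "O", "B-LOC"]
entities = extract_entities(tokens, tags)
print(entities)
```

The B label is what lets two adjacent entities of the same type be told apart, which plain inside/outside labels could not do.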
  • 24. Syntactic Analysis • Constituency parsers – divide the sentence into constituent phrases such as noun phrases, verb phrases, prepositional phrases, etc. Each constituent phrase can itself be divided into further phrases. The constituency parse tree shown on the slide divides the sentence into two main phrases: a noun phrase and a verb phrase. The verb phrase is further divided into a verb and a prepositional phrase, and so on.
  • 25. Syntactic Analysis • Dependency parsers do not divide a sentence into constituent phrases, but rather establish relationships directly between the words themselves. The figure on the slide shows an example of a dependency parse tree.
  • 26. Semantic Analysis • Assigning meanings to the structures created by syntactic analysis. • Mapping words and structures to particular domain objects in a way consistent with our knowledge of the world. • Semantics can play an important role in selecting among competing syntactic analyses and discarding illogical analyses. – I robbed the bank -- is bank a river bank or a financial institution? • We have to decide which formalisms will be used in the meaning representation.
  • 27. A typical information extraction system
  • 28. Databases: WordNet and ConceptNet • WordNet is a semantically oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. • WordNet is available through NLTK, and we can use it to identify the 'correct' sense of a word (i.e. for word sense disambiguation). • ConceptNet is a representation that provides commonsense linkages between words. For example, it states that bread is commonly found near toasters. These everyday facts could be useful if, for example, you wanted to make a smart chatbot that says: “Since you like toasters, do you also like bread? I can order some for you.”
  • 29. Distributional Semantics • Two kinds of matrices are used to represent how words occur: • The term-document occurrence matrix, where each row is a term in the vocabulary and each column is a document (such as a webpage, tweet, book, etc.). • The term-term co-occurrence matrix, where the entry in the ith row and jth column represents the occurrence of the ith word in the context of the jth word.
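A term-term co-occurrence matrix can be sketched by counting, for each word, the words that appear within a fixed window around it. The slide does not define "context", so the symmetric window of `window` tokens on each side is an assumption, as are the function name and the toy sentences. The counts are stored in a dictionary keyed by word pairs rather than a dense matrix.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context word) pair occurs within the window."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # a word is not its own context
                    counts[(word, tokens[j])] += 1
    return counts

sentences = ["the cat sat on the mat", "the dog sat on the rug"]  # toy corpus
counts = cooccurrence_counts(sentences, window=1)
print(counts[("sat", "on")], counts[("cat", "sat")])
```

Words that share contexts (here "cat" and "dog" both precede "sat") end up with similar rows, which is the core idea of distributional semantics.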
  • 30. Distributional Semantics • Word vectors created from the term-document (occurrence-context) and term-term (co-occurrence) matrices are high-dimensional and sparse. • Word embeddings are a lower-dimensional representation of the word vectors. There are broadly two ways to generate word embeddings: frequency-based and prediction-based. • In a frequency-based approach, you take the high-dimensional occurrence-context or co-occurrence matrix. Word embeddings are then generated by reducing the dimensionality of the matrix using matrix factorization (e.g. LSA).
  • 31. Distributional Semantics • The prediction-based approach involves training a shallow neural network that learns to predict the words in the context of a given input word. The two widely used prediction-based models are the skip-gram model and the Continuous Bag of Words (CBOW) model. In the skip-gram model, the input is the current/target word and the outputs are the context words. The embeddings are then given by the weight matrix between the input layer and the hidden layer. Word2vec and GloVe vectors are two of the most popular pre-trained word embeddings available for use.
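The training data for a skip-gram model consists of (target word, context word) pairs. The sketch below shows only how such pairs could be generated from a token list, assuming a symmetric context window; it does not train the neural network, and the function name and `window` parameter are illustrative.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word and each of its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(["i", "made", "her", "duck"], window=1)
print(pairs)
```

In skip-gram the network receives the first element of each pair as input and is trained to predict the second; CBOW reverses the direction and predicts the target from its context words.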
  • 32. Word Sense Disambiguation • Word sense disambiguation (WSD) is the task of identifying the correct sense of an ambiguous word such as 'bank', 'bark', 'pitch', etc. • Supervised techniques for word sense disambiguation require the input words to be tagged with their senses. • Example of a supervised technique: the Naive Bayes classifier. • Example of an unsupervised technique: the Lesk algorithm.
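The idea behind the Lesk algorithm is to pick the sense whose dictionary definition (gloss) shares the most words with the sentence. The following is a simplified sketch of that idea, not the full algorithm: the two-sense inventory for 'bank' and the stopword list are hypothetical stand-ins for a real dictionary such as WordNet, and ties are broken by taking the first sense.

```python
# Illustrative stopword list (an assumption)
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "that", "and", "or", "i", "my", "at"}

# Hypothetical mini sense inventory standing in for a real dictionary
SENSES = {
    "bank": {
        "financial_institution": "an institution that accepts deposits of money and makes loans",
        "river_bank": "sloping land beside a river or other body of water",
    }
}

def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def simplified_lesk(word, sentence):
    """Return the sense whose gloss overlaps most with the sentence's words."""
    context = content_words(sentence)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk("bank", "I deposited money at the bank"))
print(simplified_lesk("bank", "We sat on the bank of the river"))
```

No sense-tagged training data is needed, only the glosses, which is why Lesk is listed as the unsupervised option on the slide.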
  • 33. Natural Language Generation • NLG is the process of constructing natural language outputs from non-linguistic inputs. • NLG can be viewed as the reverse process of NL understanding. • An NLG system may have two main parts: – Discourse Planner -- decides what will be generated, i.e. which sentences. – Surface Realizer -- realizes a sentence from its internal representation. • Lexical Selection -- selecting the correct words to describe the concepts.
  • 34. Some NLP Applications • Machine Translation – translation between two natural languages. • Information Retrieval – web search (uni-lingual or multi-lingual). • Query Answering/Dialogue – natural language interface to a database system, or a dialogue system. • Chatbots • Sentiment Analysis • Some Small Applications – grammar checking, spell checking, spell correction
  • 35. Python Libraries for NLP • NLTK – supports more languages than the other libraries; no built-in support for word vectors • spaCy – a fast NLP framework; provides built-in word vectors • Gensim – designed primarily for unsupervised text modelling • TextBlob – provides language translation and detection powered by Google Translate