Final Assignment Demo
Lekha Muraleedharan | 133050002
Sagar Ahire | 133050073
Deepali Gupta | 13305R001
Assignment 01

Spelling Correction
Roadmap for Today
● Edit Distance Approach
● Confusion Matrix Approach
● Alignment-based Approach
Edit Distance Approach
● Uses dynamic programming: Gets distance
values for 4 different types of errors and
returns their min
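The dynamic program described above can be sketched as follows. This is a minimal illustration, not the presenters' code: it assumes the four error types are insertion, deletion, substitution and adjacent transposition, each with unit cost, and the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(source: str, target: str) -> int:
    """Distance with four unit-cost operations:
    insertion, deletion, substitution, adjacent transposition."""
    m, n = len(source), len(target)
    # dist[i][j] = distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete every character of source[:i]
    for j in range(n + 1):
        dist[0][j] = j  # insert every character of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            best = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                best = min(best, dist[i - 2][j - 2] + 1)
            dist[i][j] = best
    return dist[m][n]
```

With unit costs, a misspelling such as recide is equally close to decide and reside, which is why ties need a separate tie-breaker.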
Edit Distance Approach: Challenges
● Ties for Edit Distance
○ Solution: Bigram Probabilities of word
○ For all tied candidates, the word with highest Bigram
probability is selected as result

● Favouring shorter words
○ Solution: Brevity Penalty
○ If 'r' is average word length in corpus & 'c' is
candidate word length, Brevity Penalty is given by
○ BP = e^((1 - r) / c)
○ However, the differences between probabilities are
too high to be noticeably affected by the penalty
○ E.g.: realitvely → really (actual: relatively)
Edit Distance Approach: Results
● Accuracy overall: 93%
● Accuracy for edit distance 1: 98.58%
● Accuracy for edit distance <=2: 96.27%
● Examples of common confusions:
Edit Distance 1
○ Wrong word: recide
○ Predicted correct: decide
○ Actual correct: reside

Edit Distance > 1
○ Wrong word: rememberable
○ Predicted correct: remember
○ Actual correct: memorable
Confusion Matrix Approach
● Generative model
● Product of error probability and word
probability used
● 4 types of errors:
○ Insertion
○ Deletion
○ Substitution
○ Transposition

● Makes single-error assumption
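The generative scoring described above can be sketched as follows. This is a simplified illustration under stated assumptions: a full confusion-matrix model would keep separate error probabilities per letter pair, whereas this sketch uses one hypothetical probability per error type. The names `candidates` and `correct`, the word counts and the error probabilities are all invented for the example.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def candidates(typo):
    """Map each string one edit away from `typo` to the error type
    that would have produced `typo` from it."""
    found = {}
    for i in range(len(typo) + 1):
        left, right = typo[:i], typo[i:]
        if right:
            # The typist inserted right[0]; removing it undoes the error.
            found.setdefault(left + right[1:], "insertion")
        if len(right) > 1:
            found.setdefault(left + right[1] + right[0] + right[2:],
                             "transposition")
        for ch in ALPHABET:
            if right and ch != right[0]:
                found.setdefault(left + ch + right[1:], "substitution")
            # The typist dropped ch; putting it back undoes the error.
            found.setdefault(left + ch + right, "deletion")
    return found


def correct(typo, word_counts, error_prob):
    """Pick the word maximising P(error) * P(word) under a single-error assumption."""
    total = sum(word_counts.values())
    scored = {
        cand: error_prob[kind] * word_counts[cand] / total
        for cand, kind in candidates(typo).items()
        if cand in word_counts
    }
    return max(scored, key=scored.get) if scored else typo


# Hypothetical toy data, not taken from the slides.
word_counts = {"achieve": 50, "period": 40, "they": 300, "the": 1000}
error_prob = {"insertion": 0.2, "deletion": 0.2,
              "substitution": 0.3, "transposition": 0.3}
```

Because only candidates one edit away are generated, the single-error assumption is built into the candidate set itself.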
Confusion Matrix Approach: Results
● Accuracy: 99%
Confusion Matrix Approach:
Examples of Common Confusions
● Vowels transposed, substituted, inserted,
deleted
○ acheive --> achieve
● Same letter errors
○ cc-->ccc or c-->cc (similarly for other
letters; both typing slips and common misspellings)
● Keyboard layout
○ preiod --> period (e and r next to each
other)
○ htey --> they (h and t are diagonally
placed)
● American-British spellings & pronunciation
○ airbourne --> airborne
○ humoural --> humoral
○ missle --> missile
● Words derived from the same root
○ fourty --> forty
○ desireable --> desirable; careing --> caring; interfereing --> interfering
● Pronunciation
○ arbitary --> arbitrary (the r sound is difficult to pronounce for some, because of
mother tongue)
○ marrage --> marriage (the i is silent)
○ orginal --> original (regional accents)
○ dimention --> dimension (-tion and -sion have the same sound)
○ critisisms --> criticisms and ansestors --> ancestors (both c and s are used for a
similar sound in different words)
○ immediatley --> immediately
○ levle --> level
Alignment-based Approach
● Uses MOSES
● Moses is the most widely used SMT
framework which includes tools for
preprocessing, training and tuning
● Uses GIZA++ to obtain alignments
● Given an incorrect sentence, finds the most
probable sentence, depending on four
factors
Moses: How it works
The four most important ingredients are:
1. Phrase Translation Table: Mapping of
source-language phrases to target-language phrases, with
translation probabilities
2. Language Model: Unigrams, bigrams and
trigrams on correct words
3. Distortion Model: Extent of reordering
4. Word Penalty: Makes sure translations are
not too short or long
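Moses combines component scores like these in a weighted log-linear model: each hypothesis gets a weighted sum of feature values, and the highest-scoring hypothesis wins. The sketch below only illustrates that combination; the feature names, weights and feature values are hypothetical and do not come from the slides or from a real Moses configuration.

```python
import math


def hypothesis_score(features, weights):
    """Weighted sum of feature values (a log-linear model score)."""
    return sum(weights[name] * features[name] for name in weights)


# Hypothetical weights; in practice they are set during tuning.
weights = {
    "translation": 1.0,
    "language_model": 1.0,
    "distortion": 0.5,
    "word_penalty": 0.2,
}

# Two hypothetical candidate outputs for the same input.
hyp_a = {
    "translation": math.log(0.6),      # log phrase-table probability
    "language_model": math.log(0.05),  # log language-model probability
    "distortion": 0.0,                 # no reordering
    "word_penalty": -9.0,              # grows more negative with length
}
hyp_b = {
    "translation": math.log(0.3),
    "language_model": math.log(0.001),
    "distortion": -2.0,
    "word_penalty": -13.0,
}

best = max([hyp_a, hyp_b], key=lambda h: hypothesis_score(h, weights))
```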
Alignment-based Approach:
Observations
● Absurd mappings for some sentences in the
phrase translation table lead to wrong
output (e.g. a b i ---> t)
● Does not make the single-error assumption,
so a word can change altogether (e.g.
beationsfully when beautiful was expected)
Alignment-based Approaches:
Results
● Language Model = Training Set and no
restriction on phrase length: 15%
● Language Model = Brown Corpus and no
restriction on phrase length: 20%
● Language Model = Brown Corpus and
phrase length = 3: 35.5%
Alignment-based Approach:
Error Analysis
● Single insertion / deletion
○ i -> e (aborigine -> aborigene)
○ n -> nn (bananas -> banannas)
○ t -> th (cartographer -> carthographer)
○ s -> z (business -> buziness)

● Pattern insertion / deletion
○ becuase -> bequatse (Expected: because)
○ autority -> auttorily (Expected: authority)

● Errors due to frequent pattern positions:
○ ‘-ly’, ‘-ed’, ‘-es’ in the end
■ hieroglph -> hierogly (Expected: hieroglyph)
In Summary

Approach             Accuracy
Edit Distance        93%
Confusion Matrix     99%
Alignment (MOSES)    35.5%
Assignment 02

Part-of-Speech Tagging
Roadmap for Today
● General Viterbi
● Problems faced and their Solutions
● Results
Viterbi Algorithm
● Implements POS Tagging as a sequence-labeling task using the HMM framework
● Corresponds to the HMM problem of finding
the most likely state sequence for an
observation sequence
● Uses dynamic programming
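A minimal sketch of Viterbi decoding for an HMM tagger follows. It works in log space and assumes every probability it looks up is non-zero (for example after smoothing). The toy tags, words and probabilities are hypothetical, and the function name `viterbi` is chosen here for illustration.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words`."""
    # best[i][tag] = (log prob of best path ending in `tag` at word i, previous tag)
    best = [{}]
    for tag in tags:
        score = math.log(start_p[tag]) + math.log(emit_p[tag][words[0]])
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            # Best previous tag for reaching `tag` at position i
            score, prev = max(
                (best[i - 1][p][0] + math.log(trans_p[p][tag]), p)
                for p in tags
            )
            best[i][tag] = (score + math.log(emit_p[tag][words[i]]), prev)
    # Follow back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


# Hypothetical toy model.
tags = ["AT0", "NN1", "VVZ"]
start_p = {"AT0": 0.8, "NN1": 0.1, "VVZ": 0.1}
trans_p = {
    "AT0": {"AT0": 0.05, "NN1": 0.9, "VVZ": 0.05},
    "NN1": {"AT0": 0.1, "NN1": 0.2, "VVZ": 0.7},
    "VVZ": {"AT0": 0.5, "NN1": 0.4, "VVZ": 0.1},
}
emit_p = {
    "AT0": {"the": 0.9, "dog": 0.05, "barks": 0.05},
    "NN1": {"the": 0.05, "dog": 0.8, "barks": 0.15},
    "VVZ": {"the": 0.05, "dog": 0.15, "barks": 0.8},
}
```

Summing log probabilities instead of multiplying raw probabilities avoids numeric underflow on long sentences, but a zero probability would still have to be smoothed away first because its logarithm is undefined.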
Challenges: Data Sparsity
● Not all transitions seen
● Not all POS tags seen for every word seen
(expected in general, but rare uses of a word
in a different part of speech are missed)
● Not all words seen
Since probabilities get multiplied, a single zero
kills the entire path.
Accuracy with no smoothing : 35.82%
Solutions to Data Sparsity
● Laplace Smoothing (Add 1/Add delta
smoothing)
● Suffix based smoothing for unknown words
Eliminates the problem caused by zeroes.
Good approximation for rare phenomena,
without biasing the results
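Add-delta smoothing of emission probabilities can be sketched as below. The counts, the vocabulary and the function name `smoothed_emission` are hypothetical; the slides do not give the delta value or the exact counts used.

```python
from collections import Counter


def smoothed_emission(tag, word, emit_counts, tag_counts, vocab_size, delta=1.0):
    """P(word | tag) with add-delta smoothing (delta=1 is Laplace smoothing)."""
    return (emit_counts[tag][word] + delta) / (tag_counts[tag] + delta * vocab_size)


# Hypothetical toy counts: how often each tag emitted each word.
emit_counts = {
    "NN1": Counter({"dog": 3, "work": 1}),
    "VVZ": Counter({"works": 2}),
}
tag_counts = {tag: sum(counts.values()) for tag, counts in emit_counts.items()}
vocab = {"dog", "work", "works"}

# "works" was never seen as NN1, yet its probability is no longer zero.
p_unseen = smoothed_emission("NN1", "works", emit_counts, tag_counts, len(vocab))
```

The smoothed probabilities for a tag still sum to one over the vocabulary, so the model remains a proper distribution.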
Results
● Accuracy: 91.09%
● Precision, Recall and F-Score:
○ Precision(tag) = Correct(tag) / Assigned(tag)
○ Recall(tag) = Correct(tag) / Corpus(tag)
○ F(tag) = 2pr / (p+r)
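The per-tag measures defined above can be computed as in this sketch. The function name `per_tag_scores` and the toy tag sequences are made up for illustration.

```python
from collections import Counter


def per_tag_scores(gold, predicted):
    """Return {tag: (precision, recall, f_score)} for two aligned tag lists."""
    corpus = Counter(gold)         # Corpus(tag): occurrences in the gold data
    assigned = Counter(predicted)  # Assigned(tag): times the tagger output it
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    scores = {}
    for tag in corpus.keys() | assigned.keys():
        p = correct[tag] / assigned[tag] if assigned[tag] else 0.0
        r = correct[tag] / corpus[tag] if corpus[tag] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[tag] = (p, r, f)
    return scores


# Hypothetical toy example.
gold = ["NN1", "VVZ", "NN2", "NN1"]
predicted = ["NN1", "NN2", "NN2", "AJ0"]
scores = per_tag_scores(gold, predicted)
```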
Results: Commonly Confused Tags
● ZZ0 (alphabetical symbols) is confused with AT0 (e.g. A Bend) and proper nouns
(e.g. P O Box)
● VVZ (-s form of lexical verb) is confused with NN2 (plural common
noun), e.g. means, works
● VVN (past participle verb form, e.g. forgotten) is confused with VVI
(infinitive verb form, e.g. forget) for cases like become (I have
become, to become)
● VVN is also confused with VVD (past tense verb), e.g. I have defeated vs. I defeated
● AVQ, i.e. wh-adverb (e.g. when, where, how, why, wherever), is
confused with CJS, i.e. subordinating conjunction (e.g. although,
when)
● AJ0 words tend to have an -ed ending (e.g. involved discussion), an -ing ending
(e.g. living proof) or a form identical to the infinitive verb (to deliberate;
deliberate meaning), hence they are often confused with verb forms.
● Similarly, the NN1 singular noun form and the AJ0 adjective form are the same
for many words (happy person, I am happy)
Aside: TO experiment
● Replace all instances of ‘TO0’ tag with ‘PRP’
and see the difference if any
● Result: Accuracy unchanged
● Our hypothesis: The separate TO0 tag may
come in handy in later stages of NLP
In Summary
Accuracy: 91.09%
Assignment 03

Metaphor Detection
Roadmap for Today
● Approach used
● Challenges
● Results
Assumptions
● Concentration only on Noun-Noun
metaphors of the form
Noun1 be-verb Noun2
● Examples:
○ Words are weapons (Metaphor)
○ Swords are weapons (Not metaphor)
Hypothesis
● Driving hypothesis:
Pairs of words used in metaphors are more
dissimilar than pairs of words used in normal
language
● Thus, similarity between pairs of words can
be measured to find if the sentence is a
metaphor
Word Similarity
● Uses the Path Similarity measure which
depends on the shortest path between two
words
● Similarity is calculated between pairs of
nouns in the sentence related by the nsubj
dependency
● The Stanford Parser is used for POS tagging
and dependency parsing
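Path similarity is commonly defined as 1 / (1 + length of the shortest path between the two concepts in the is-a hierarchy). The sketch below applies that definition to a tiny hand-made taxonomy so that it runs without WordNet installed; the taxonomy and the function name `path_similarity` are hypothetical stand-ins for the real WordNet graph.

```python
from collections import deque

# Hypothetical toy is-a hierarchy (child -> parent), not real WordNet data.
HYPERNYM = {
    "sword": "weapon",
    "weapon": "artifact",
    "artifact": "entity",
    "word": "language_unit",
    "language_unit": "abstraction",
    "abstraction": "entity",
}


def path_similarity(a, b):
    """Return 1 / (1 + shortest path length), or None if a word is missing."""
    # Treat the hierarchy as an undirected graph.
    neighbours = {}
    for child, parent in HYPERNYM.items():
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)
    if a not in neighbours or b not in neighbours:
        return None
    # Breadth-first search finds the shortest path.
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return 1 / (1 + dist)
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

In this toy graph sword and weapon are one step apart, while word and weapon are five steps apart, so the second pair scores much lower. Words missing from the graph return None, mirroring the no-entry cases that have to be ignored.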
Challenges
● Proper Nouns and Pronouns have no
WordNet entries
○ Thus, we must ignore them

● Other dependencies may give more clues
○ The teenage boy’s room is a disaster area vs.
○ The teenage boy’s room is a messy area
○ However, no way to calculate similarity across
different parts of speech
Results

                        Is Metaphor    Is Not Metaphor
Detected Metaphor       69.23%         17.95%
Detected Not Metaphor   30.77%         82.05%
False Positives
● Money is the main component of a
capitalist society
● Scars are marks on the body
○ Changes depending on selected sense of ‘scars’
False Negatives
● Life is a mere dream
● Children are roses
● Her eyes were fireflies
○ “fireflies” is tagged as adjective

● Scars are a roadmap to the soul
○ “roadmap” absent from WordNet
In Summary
● Overall accuracy: 75.64%
● False Positives: 17.95%
● False Negatives: 30.77%
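The overall accuracy is consistent with the two per-class rates if the test set held equally many metaphors and non-metaphors. That equal split is an assumption made here, since the slides do not state the class sizes. Under it, accuracy is the mean of the two correct-detection rates:

```python
def accuracy_with_equal_classes(true_positive_rate, true_negative_rate):
    """Overall accuracy when both classes have the same number of examples."""
    return (true_positive_rate + true_negative_rate) / 2


# Rates from the results table: metaphors detected as metaphors,
# and non-metaphors detected as non-metaphors.
overall = accuracy_with_equal_classes(69.23, 82.05)
```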
Overall Summary
Problem               Approach              Accuracy
Spell Correction      Edit Distance         98.58%
                      Kernighan             99%
                      Alignment             35.50%
POS Tagging           Viterbi               91.09%
Metaphor Detection    WordNet Similarity    75.64%

Azure Monitor & Application Insight to monitor Infrastructure & Application
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

NLP Assignment Final Presentation [IIT-Bombay]

  • 1. Final Assignment Demo
    Lekha Muraleedharan | 133050002
    Sagar Ahire | 133050073
    Deepali Gupta | 13305R001
  • 3. Roadmap for Today
    ● Edit Distance Approach
    ● Confusion Matrix Approach
    ● Alignment-based Approach
  • 4. Edit Distance Approach
    ● Uses dynamic programming: computes the distance for each of 4 different types of errors and returns the minimum
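A minimal sketch of the dynamic program described above, assuming the four error types are insertion, deletion, substitution and transposition (the types listed later for the confusion matrix approach). The function name `edit_distance` is illustrative and is not taken from the assignment code.

```python
def edit_distance(source: str, target: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = distance between source[:i] and target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            best = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # transposition of two adjacent characters
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                best = min(best, dist[i - 2][j - 2] + 1)
            dist[i][j] = best
    return dist[-1][-1]
```

On the slides' own example, `recide` is at distance 1 from both `decide` and `reside`, which is the kind of tie the bigram probabilities are meant to break.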
  • 5. Edit Distance Approach: Challenges
    ● Ties for Edit Distance
      ○ Solution: Bigram probabilities of the word
      ○ Among all tied candidates, the word with the highest bigram probability is selected as the result
    ● Favouring shorter words
      ○ Solution: Brevity Penalty
      ○ If 'r' is the average word length in the corpus and 'c' is the candidate word length, the Brevity Penalty is given by
      ○ BP = e ( 1 – r ) / c
      ○ However, the differences between probabilities are too large for the penalty to affect the result noticeably
      ○ E.g.: realitvely → really (actual: relatively)
  • 6. Edit Distance Approach: Results
    ● Accuracy overall: 93%
    ● Accuracy for edit distance 1: 98.58%
    ● Accuracy for edit distance <=2: 96.27%
    ● Examples of common confusions:
      Edit Distance 1
      ○ Wrong word: recide
      ○ Predicted correct: decide
      ○ Actual correct: reside
      Edit Distance > 1
      ○ Wrong word: rememberable
      ○ Predicted correct: remember
      ○ Actual correct: memorable
  • 7. Confusion Matrix Approach
    ● Generative model
    ● Product of error probability and word probability used
    ● 4 types of errors:
      ○ Insertion
      ○ Deletion
      ○ Substitution
      ○ Transposition
    ● Makes single-error assumption
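The scoring rule above (error probability times word probability) can be sketched as follows. All names and numbers here are hypothetical toy values chosen for illustration; in the approach described, the error probabilities would come from confusion matrices and the word probabilities from a corpus.

```python
# Hypothetical toy values, not taken from the assignment's data.
# P(typo | intended word), e.g. from a substitution confusion matrix.
TOY_ERROR_PROB = {
    ("recide", "reside"): 0.002,   # 'c' typed for 's'
    ("recide", "decide"): 0.0001,  # 'r' typed for 'd'
}
# P(intended word), e.g. relative frequency in a corpus.
TOY_WORD_PROB = {
    "reside": 0.00001,
    "decide": 0.0001,
}


def best_correction(typo, candidates, error_prob, word_prob):
    """Return the candidate maximising P(typo | word) * P(word)."""
    def score(word):
        return error_prob.get((typo, word), 0.0) * word_prob.get(word, 0.0)
    return max(candidates, key=score)


suggestion = best_correction(
    "recide", ["decide", "reside"], TOY_ERROR_PROB, TOY_WORD_PROB
)
```

With these toy numbers the more frequent word `decide` loses to `reside`, because the error model considers typing `c` for `s` far more likely than typing `r` for `d`.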
  • 8. Confusion Matrix Approach: Results
    ● Accuracy: 99%
  • 9. Confusion Matrix Approach: Examples of Common Confusions
    ● Vowels transposed, substituted, inserted, deleted
      ○ acheive --> achieve
    ● Same-letter errors
      ○ cc --> ccc or c --> cc (similarly for other letters, both as typing slips and as common misspellings)
    ● Keyboard layout
      ○ preiod --> period (e and r are next to each other)
      ○ htey --> they (h and t are diagonally placed)
  • 10. American – British spellings & pronunciation
    ○ airbourne --> airborne
    ○ humoural --> humoral
    ○ missle --> missile
    Words derived from the same root
    ○ fourty --> forty
    ○ desireable --> desirable ; careing --> caring ; interfereing --> interfering
    Pronunciation
    ○ arbitary --> arbitrary (the r sound is difficult for some speakers to pronounce because of their mother tongue)
    ○ marrage --> marriage (the i is silent)
    ○ orginal --> original (regional accents)
    ○ dimention --> dimension (-tion and -sion have the same sound)
    ○ critisisms --> criticisms and ansestors --> ancestors (both c and s are used for a similar sound in different words)
    ○ immediatley --> immediately
    ○ levle --> level
  • 11. Alignment-based Approach
    ● Uses MOSES
    ● Moses is the most widely used statistical machine translation (SMT) framework; it includes tools for preprocessing, training and tuning
    ● Uses GIZA++ to obtain alignments
    ● Given an incorrect sentence, finds the most probable sentence, depending on four factors
  • 12. Moses: How it works
    The four most important ingredients are:
    1. Phrase Translation Table: Mapping from source language to target language, with translation probabilities
    2. Language Model: Unigrams, bigrams and trigrams on correct words
    3. Distortion Model: Extent of reordering
    4. Word Penalty: Makes sure translations are not too short or too long
  • 13. Alignment-based Approach: Observations
    ● Absurd mappings in the phrase translation table for some sentences lead to wrong output (e.g. a b i ---> t)
    ● Does not make the single-error assumption, so a word can be changed altogether (e.g. beationsfully when beautiful was expected)
  • 14. Alignment-based Approaches: Results
    ● Language Model = Training Set and no restriction on phrase length: 15%
    ● Language Model = Brown Corpus and no restriction on phrase length: 20%
    ● Language Model = Brown Corpus and phrase length = 3: 35.5%
  • 15. Alignment-based Approach: Error Analysis
    ● Single insertion / deletion
      ○ i -> e (aborigine -> aborigene)
      ○ n -> nn (bananas -> banannas)
      ○ t -> th (cartographer -> carthographer)
      ○ s -> z (business -> buziness)
    ● Pattern insertion / deletion
      ○ becuase -> bequatse (Expected: because)
      ○ autority -> auttorily (Expected: authority)
    ● Errors due to frequent pattern positions:
      ○ ‘-ly’, ‘-ed’, ‘-es’ at the end
        ■ hieroglph -> hierogly (Expected: hieroglyph)
  • 16. In Summary
    Approach             Accuracy
    Edit Distance        93%
    Confusion Matrix     99%
    Alignment (MOSES)    35.5%
  • 18. Roadmap for Today
    ● General Viterbi
    ● Problems faced and their Solutions
    ● Results
  • 19. Viterbi Algorithm
    ● Implements POS tagging as a sequence-labeling task using the HMM framework
    ● Corresponds to the HMM problem of finding the most likely state sequence for an observation sequence
    ● Uses dynamic programming
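A minimal sketch of Viterbi decoding for an HMM tagger, under stated assumptions: the two-tag model and all probabilities below are hypothetical toy values, and plain probabilities are multiplied (rather than summing log-probabilities) to mirror the description on the slides.

```python
# Hypothetical toy HMM; none of these numbers come from the assignment.
TAGS = ["N", "V"]
START_P = {"N": 0.8, "V": 0.2}
TRANS_P = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
EMIT_P = {"N": {"fish": 0.7, "swim": 0.3}, "V": {"fish": 0.2, "swim": 0.8}}


def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words`."""
    if not words:
        return []
    # best[t][tag]: probability of the best path ending in `tag` at position t
    best = [{tag: start_p[tag] * emit_p[tag].get(words[0], 0.0) for tag in tags}]
    # back[t][tag]: previous tag on that best path
    back = [{}]
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for tag in tags:
            prev = max(tags, key=lambda p: best[t - 1][p] * trans_p[p][tag])
            best[t][tag] = (best[t - 1][prev] * trans_p[prev][tag]
                            * emit_p[tag].get(words[t], 0.0))
            back[t][tag] = prev
    # Follow back-pointers from the best final tag.
    last = max(tags, key=lambda tag: best[-1][tag])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


tagged = viterbi(["fish", "swim"], TAGS, START_P, TRANS_P, EMIT_P)
```

Unseen words get emission probability 0.0 here, so every path through them scores zero. That is the data sparsity problem that smoothing addresses.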
  • 20. Challenges: Data Sparsity
    ● Not all transitions seen
    ● Not all POS tags seen for every word seen (expected in general, but this misses rare uses of a word as a different part of speech)
    ● Not all words seen
    Since probabilities get multiplied, a single zero kills the entire path.
    Accuracy with no smoothing: 35.82%
  • 21. Solutions to Data Sparsity
    ● Laplace smoothing (add-1 / add-delta smoothing)
    ● Suffix-based smoothing for unknown words
    Eliminates the problem caused by zeroes. Good approximation for rare phenomena, without biasing the results.
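Add-delta smoothing can be sketched as below. The function name and the example counts are hypothetical; the formula is the standard one, where `delta = 1` gives Laplace (add-1) smoothing.

```python
def smoothed_prob(count, total, num_outcomes, delta=1.0):
    """Add-delta estimate of P(outcome | context).

    count        -- times this outcome was seen in the context
    total        -- times the context was seen overall
    num_outcomes -- number of possible outcomes (e.g. number of tags)
    """
    return (count + delta) / (total + delta * num_outcomes)


# Hypothetical transition counts out of one tag, with 5 possible next tags,
# 3 of which were never observed after it.
seen_counts = {"NN1": 7, "VVZ": 3}
total = sum(seen_counts.values())
unseen_prob = smoothed_prob(0, total, 5)           # no longer zero
seen_prob = smoothed_prob(seen_counts["NN1"], total, 5)
```

An unseen transition now receives a small non-zero probability, so a single missing count no longer zeroes out an entire path.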
  • 22. Results
    ● Accuracy: 91.09%
    ● Precision, Recall and F-Score:
      ○ Precision(tag) = Correct(tag) / Assigned(tag)
      ○ Recall(tag) = Correct(tag) / Corpus(tag)
      ○ F(tag) = 2pr / (p+r), where p = Precision(tag) and r = Recall(tag)
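The per-tag definitions above translate directly into code. This is a sketch with hypothetical names and a made-up four-token example, not the evaluation script used for the reported numbers.

```python
from collections import Counter


def per_tag_scores(gold, predicted):
    """Return {tag: (precision, recall, f_score)} for two aligned tag lists."""
    correct, assigned, corpus = Counter(), Counter(), Counter()
    for gold_tag, pred_tag in zip(gold, predicted):
        corpus[gold_tag] += 1      # Corpus(tag): occurrences in the gold data
        assigned[pred_tag] += 1    # Assigned(tag): times the tagger chose it
        if gold_tag == pred_tag:
            correct[gold_tag] += 1

    scores = {}
    for tag in set(corpus) | set(assigned):
        p = correct[tag] / assigned[tag] if assigned[tag] else 0.0
        r = correct[tag] / corpus[tag] if corpus[tag] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[tag] = (p, r, f)
    return scores


# Made-up example for illustration only.
gold = ["NN1", "VVZ", "NN2", "NN1"]
predicted = ["NN1", "NN2", "NN2", "AJ0"]
scores = per_tag_scores(gold, predicted)
```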
  • 23. Results: Commonly Confused Tags
    ● ZZ0 (letters of the alphabet) is confused with AT0 (A Bend) and proper nouns (P O Box)
    ● VVZ (-s form of lexical verb) is confused with NN2 (plural common noun), e.g. means, works
    ● VVN (past participle verb form, e.g. forgotten) is confused with VVI (infinitive verb form, e.g. forget) in cases like become (I have become, to become)
    ● VVN is also confused with VVD (past tense verb), e.g. I defeated, I have defeated
    ● AVQ, i.e. wh-adverb (e.g. when, where, how, why, wherever), is confused with CJS, i.e. subordinating conjunction (e.g. although, when)
    ● AJ0 words tend to have an -ed ending (e.g. involved discussion), an -ing ending (e.g. living proof) or a form identical to the infinitive verb (to deliberate; deliberate meaning), and hence are often confused with verb forms
    ● Similarly, the NN1 (singular noun) form and the AJ0 (adjective) form are the same for many words (happy person, I am happy)
  • 24. Aside: TO experiment
    ● Replace all instances of the ‘TO0’ tag with ‘PRP’ and see the difference, if any
    ● Result: Accuracy unchanged
    ● Our hypothesis: The separate TO0 tag may come in handy in later stages of NLP
  • 27. Roadmap for Today
    ● Approach used
    ● Challenges
    ● Results
  • 28. Assumptions
    ● Concentration only on Noun-Noun metaphors of the form Noun1 be-verb Noun2
    ● Examples:
      ○ Words are weapons (Metaphor)
      ○ Swords are weapons (Not metaphor)
  • 29. Hypothesis
    ● Driving hypothesis: Pairs of words used in metaphors are more dissimilar than pairs of words used in normal language
    ● Thus, the similarity between pairs of words can be measured to decide whether the sentence is a metaphor
  • 30. Word Similarity
    ● Uses the Path Similarity measure, which depends on the shortest path between two words
    ● Similarity is calculated between pairs of nouns in the sentence related by the nsubj dependency
    ● The Stanford Parser is used for POS tagging and dependency parsing
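To illustrate the measure, here is a self-contained sketch over a hypothetical toy taxonomy. It assumes the usual definition of path similarity, 1 / (shortest path length + 1), as in NLTK's WordNet interface. The taxonomy below is invented for the example and is much shallower than WordNet.

```python
from collections import deque

# Hypothetical child -> parent (hypernym) links, for illustration only.
TOY_HYPERNYMS = {
    "sword": "weapon",
    "weapon": "instrument",
    "instrument": "artifact",
    "artifact": "entity",
    "word": "language_unit",
    "language_unit": "abstraction",
    "abstraction": "entity",
}


def shortest_path_length(a, b, hypernyms):
    """Breadth-first search over the taxonomy treated as an undirected graph."""
    graph = {}
    for child, parent in hypernyms.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # no connecting path


def path_similarity(a, b, hypernyms):
    dist = shortest_path_length(a, b, hypernyms)
    return None if dist is None else 1.0 / (dist + 1)


literal = path_similarity("sword", "weapon", TOY_HYPERNYMS)     # close pair
metaphor = path_similarity("word", "weapon", TOY_HYPERNYMS)     # distant pair
```

In this toy taxonomy "sword" and "weapon" are one link apart while "word" and "weapon" are six links apart, so the second pair scores much lower, in line with the driving hypothesis. Any decision threshold would be a further assumption; the slides do not state one.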
  • 32. Challenges
    ● Proper nouns and pronouns have no WordNet entries
      ○ Thus, we must ignore them
    ● Other dependencies may give more clues
      ○ The teenage boy’s room is a disaster area vs.
      ○ The teenage boy’s room is a messy area
      ○ However, there is no way to calculate similarity across different parts of speech
  • 33. Results
                              Is Metaphor    Is Not Metaphor
    Detected Metaphor         69.23%         17.95%
    Detected Not Metaphor     30.77%         82.05%
  • 34. False Positives
    ● Money is the main component of a capitalist society
    ● Scars are marks on the body
      ○ Changes depending on the selected sense of ‘scars’
  • 35. False Negatives
    ● Life is a mere dream
    ● Children are roses
    ● Her eyes were fireflies
      ○ “fireflies” is tagged as an adjective
    ● Scars are a roadmap to the soul
      ○ “roadmap” is absent from WordNet
  • 36. In Summary
    ● Overall accuracy: 75.64%
    ● False Positives: 17.95%
    ● False Negatives: 30.77%
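The three figures above are consistent with each other if the test set contains equally many metaphors and non-metaphors. That balance is an assumption here; the slides do not state the class sizes. A short check:

```python
def accuracy_from_rates(false_negative_rate, false_positive_rate):
    """Overall accuracy (%) for a balanced two-class test set.

    Rates are percentages within each true class, as in the results table.
    ASSUMPTION: both classes have the same number of examples.
    """
    true_positive_rate = 100.0 - false_negative_rate
    true_negative_rate = 100.0 - false_positive_rate
    return (true_positive_rate + true_negative_rate) / 2


overall = accuracy_from_rates(false_negative_rate=30.77, false_positive_rate=17.95)
```

Under that assumption the result matches the reported 75.64%.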
  • 37. Overall Summary
    Problem               Approach              Accuracy
    Spell Correction      Edit Distance         98.58%
                          Kernighan             99%
                          Alignment             35.50%
    POS Tagging           Viterbi               91.09%
    Metaphor Detection    WordNet Similarity    75.64%