The final presentation I did with Lekha & Deepali for the Natural Language Processing assignments at IIT-Bombay.
Assignments included:
1: Spelling Correction
2: Part-of-speech Tagging
3: Metaphor Detection
4. Edit Distance Approach
● Uses dynamic programming: computes distance
values for the four different error types and
returns their minimum
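A minimal dynamic-programming sketch of this idea, assuming the standard Damerau-Levenshtein formulation (the function name and unit costs are illustrative choices, not taken from the assignment):

```python
def edit_distance(source, target):
    """Damerau-Levenshtein distance: at each cell, take the minimum over
    insertion, deletion, substitution and adjacent transposition."""
    m, n = len(source), len(target)
    # dp[i][j] = distance between source[:i] and target[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # i deletions
    for j in range(n + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # adjacent transposition, e.g. "hte" -> "the"
            if (i > 1 and j > 1 and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)
    return dp[m][n]

print(edit_distance("hte", "the"))  # 1: a single transposition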
5. Edit Distance Approach: Challenges
● Ties for Edit Distance
○ Solution: bigram probabilities of the word
○ Among all tied candidates, the word with the highest bigram
probability is selected as the result (see the sketch at the end of this slide)
● Favouring shorter words
○ Solution: Brevity Penalty
○ If 'r' is the average word length in the corpus and 'c' is the
candidate word's length, the Brevity Penalty is given by
○ BP = e^((1 − r) / c)
○ However, the differences between candidate probabilities are
too large to be noticeably affected by the penalty
○ E.g. realitvely → really (actual: relatively)
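A small sketch of the tie-break and brevity penalty described above; the slides do not specify how the two scores are combined, so multiplying them here is an assumption, and all names are illustrative:

```python
import math

def brevity_penalty(avg_corpus_len, candidate_len):
    """BP = e^((1 - r) / c), with r the average word length in the corpus
    and c the candidate word's length; penalises very short candidates."""
    return math.exp((1 - avg_corpus_len) / candidate_len)

def break_tie(tied_candidates, prev_word, bigram_prob, avg_corpus_len):
    """Among candidates tied on edit distance, prefer the one with the
    highest bigram probability, scaled here by the brevity penalty."""
    return max(
        tied_candidates,
        key=lambda w: bigram_prob(prev_word, w) * brevity_penalty(avg_corpus_len, len(w)),
    )
```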
7. Confusion Matrix Approach
● Generative model
● Product of error probability and word
probability used
● 4 types of errors:
○ Insertion
○ Deletion
○ Substitution
○ Transposition
● Makes single-error assumption
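A toy sketch of this noisy-channel scoring; the tiny probability tables below are made-up placeholders, not values from the assignment:

```python
# Generative model: score(c) = P(typo | c) * P(c), under the single-error assumption.
word_prob = {"period": 1e-4, "pried": 1e-6}        # P(c) from a unigram language model
error_prob = {                                      # P(typo | c) from the confusion matrices
    ("preiod", "period"): 1e-3,   # one transposition away
    ("preiod", "pried"): 1e-5,    # would need more than one edit
}

def best_correction(typo, candidates):
    """Return the candidate maximising P(typo | candidate) * P(candidate)."""
    return max(
        candidates,
        key=lambda c: error_prob.get((typo, c), 0.0) * word_prob.get(c, 0.0),
    )

print(best_correction("preiod", ["period", "pried"]))  # period
```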
9. Confusion Matrix Approach:
Examples of Common Confusions
● Vowels transposed, substituted, inserted,
deleted
○ acheive --> achieve
● Same-letter errors
○ cc --> ccc or c --> cc (similarly for other letters, both
typing slips and common misspellings)
● Keyboard layout
○ preiod --> period (e and r next to each
other)
○ htey --> they (h and t are diagonally
placed)
10. American – British spellings & pronunciation
● airbourne --> airborne
● humoural --> humoral
● missle --> missile
● Words derived from the same root
○ fourty --> forty
○ desireable --> desirable ; careing --> caring ; interfereing --> interfering
● Pronunciation
○ arbitary --> arbitrary (the r sound is difficult for some to pronounce
because of mother-tongue influence)
○ marrage --> marriage (the i is silent)
○ orginal --> original (regional accents)
○ dimention --> dimension (-tion and -sion have the same sound)
○ critisisms --> criticisms and ansestors --> ancestors (both c and s are used for
a similar sound in different words)
○ immediatley --> immediately
○ levle --> level
11. Alignment-based Approach
● Uses MOSES
● Moses is the most widely used SMT
framework and includes tools for
preprocessing, training and tuning
● Uses GIZA++ to obtain alignments
● Given an incorrect sentence, it finds the most
probable corrected sentence based on four
factors
12. Moses: How it works
Four most important ingredients are:
1. Phrase Translation Table: Mapping of
source-language phrases to target-language
phrases with translation probabilities
2. Language Model: Unigrams, bigrams and
trigrams on correct words
3. Distortion Model: Extent of reordering
4. Word Model: Makes sure translations are
not too short or long
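A toy illustration of how these four scores could be combined for one candidate translation; Moses actually uses a tuned log-linear model over such features, so the weights and the function below are purely illustrative, not Moses' real interface:

```python
import math

def candidate_score(phrase_probs, lm_probs, distortion_cost, word_count,
                    weights=(1.0, 1.0, 0.5, 0.1)):
    """Toy log-linear combination of the four ingredients listed above."""
    w_pt, w_lm, w_d, w_w = weights
    score = 0.0
    score += w_pt * sum(math.log(p) for p in phrase_probs)  # phrase translation table
    score += w_lm * sum(math.log(p) for p in lm_probs)      # language model (n-grams)
    score -= w_d * distortion_cost                           # distortion (reordering) model
    score -= w_w * word_count                                # word penalty (length control)
    return score
```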
13. Alignment-based Approach:
Observations
● Absurd mappings for some sentences in the
phrase translation table lead to wrong
output (e.g. a b i ---> t)
● Does not enforce the single-error assumption,
which can change the word altogether (e.g.
beationsfully when beautiful was expected)
14. Alignment-based Approach:
Results
● Language Model = Training Set and no
restriction on phrase length: 15%
● Language Model = Brown Corpus and no
restriction on phrase length: 20%
● Language Model = Brown Corpus and
phrase length = 3: 35.5%
15. Alignment-based Approach:
Error Analysis
● Single insertion / deletion
○ i -> e (aborigine -> aborigene)
○ n -> nn (bananas -> banannas)
○ t -> th (cartographer -> carthographer)
○ s -> z (business -> buziness)
● Pattern insertion / deletion
○ becuase -> bequatse (Expected: because)
○ autority -> auttorily (Expected: authority)
● Errors due to frequent pattern positions:
○ ‘-ly’, ‘-ed’, ‘-es’ in the end
■ hieroglph -> hierogly (Expected: hieroglyph)
18. Roadmap for Today
● General Viterbi
● Problems faced and their Solutions
● Results
19. Viterbi Algorithm
● Implements POS Tagging as a sequence-labeling task using the HMM framework
● Corresponds to the HMM problem of finding
the most likely state sequence for an
observation sequence
● Uses dynamic programming
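A compact sketch of Viterbi decoding for an HMM tagger; the dictionary-based probability tables are an assumed representation for illustration:

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Most likely tag sequence for `words`, given start, transition and
    emission probabilities (start_p[t], trans_p[prev][t], emit_p[t][word])."""
    # V[i][t] = probability of the best tag sequence for words[:i+1] ending in t
    V = [{t: start_p[t] * emit_p[t].get(words[0], 0.0) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for t in tags:
            prob, prev = max(
                (V[i - 1][p] * trans_p[p][t] * emit_p[t].get(words[i], 0.0), p)
                for p in tags
            )
            V[i][t] = prob
            back[i][t] = prev
    # Recover the path by following back-pointers from the best final tag
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```

With unsmoothed probabilities, any unseen word or transition drives every path to zero, which is exactly the sparsity problem discussed next.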
20. Challenges: Data Sparsity
● Not all transitions seen
● Not all POS tags seen for every seen word
(obvious in general, but this misses rare uses of
a word as a different part of speech)
● Not all words seen
Since probabilities get multiplied, a single zero
kills the entire path.
Accuracy with no smoothing : 35.82%
21. Solutions to Data Sparsity
● Laplace Smoothing (Add 1/Add delta
smoothing)
● Suffix based smoothing for unknown words
Eliminates the problem caused by zeroes and gives
a good approximation for rare phenomena
without biasing the results
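A sketch of both fixes; the choice of delta and of a three-character suffix is illustrative, not taken from the assignment:

```python
def add_delta_prob(count, context_count, vocab_size, delta=1.0):
    """Laplace / add-delta estimate: no event is left with probability zero."""
    return (count + delta) / (context_count + delta * vocab_size)

def unknown_word_emission(word, suffix_tag_counts, tag_counts, num_suffixes,
                          suffix_len=3):
    """Approximate P(word | tag) for unseen words by P(suffix | tag),
    e.g. '-ing' suggests a verb, '-ly' an adverb."""
    suffix = word[-suffix_len:]
    return {
        tag: add_delta_prob(suffix_tag_counts.get((suffix, tag), 0),
                            tag_counts[tag], num_suffixes)
        for tag in tag_counts
    }
```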
23. Results: Commonly Confused Tags
● ZZ0 (letters of the alphabet) is confused with AT0 (A Bend) and with proper
nouns (P O Box)
● VVZ (-s form of lexical verb) confused with NN2 (plural common
noun), e.g. means, works
● VVN (past participle verb form, e.g. forgotten) confused with VVI
(infinitive verb form, e.g. forget) for cases like become (I have
become, to become)
● Also VVN and VVD (past tense verb), e.g. I defeated, I have defeated
● AVQ, i.e. wh-adverb (e.g. when, where, how, why, wherever), is
confused with CJS, i.e. subordinating conjunction (e.g. although,
when)
● AJ0 adjectives tend to have an -ed ending (e.g. involved discussion), an -ing
ending (e.g. living proof) or a form identical to an infinitive verb (to deliberate ;
deliberate meaning), and hence are often confused with verb forms.
● Similarly, the NN1 singular noun form and the AJ0 adjective form are the same
for many words (happy person, I am happy)
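Confusion pairs like these can be read off a simple tally over the evaluation set; a minimal sketch, assuming aligned gold and predicted tag sequences:

```python
from collections import Counter

def tag_confusions(gold_tags, predicted_tags):
    """Count (gold, predicted) pairs where the tagger disagreed with the
    gold annotation; the most frequent pairs are the confusions listed above."""
    return Counter((g, p) for g, p in zip(gold_tags, predicted_tags) if g != p)

# tag_confusions(gold, pred).most_common(5) would surface pairs such as (VVZ, NN2)
```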
24. Aside: TO experiment
● Replace all instances of ‘TO0’ tag with ‘PRP’
and see the difference if any
● Result: Accuracy unchanged
● Our hypothesis: The separate TO0 tag may
come in handy in later stages of NLP
28. Assumptions
● Concentration only on Noun-Noun
metaphors of the form
Noun1 be-verb Noun2
● Examples:
○ Words are weapons (Metaphor)
○ Swords are weapons (Not metaphor)
29. Hypothesis
● Driving hypothesis:
Pairs of words used in metaphors are more
dissimilar than pairs of words used in normal
language
● Thus, similarity between pairs of words can
be measured to find if the sentence is a
metaphor
30. Word Similarity
● Uses the Path Similarity measure which
depends on the shortest path between two
words
● Similarity is calculated between pairs of
nouns in the sentence related by the nsubj
dependency
● The Stanford Parser is used for POS tagging
and dependency parsing
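A sketch of this similarity check using NLTK's WordNet interface; taking the first noun sense and using a 0.1 threshold are illustrative assumptions (the slides do not state a threshold):

```python
from nltk.corpus import wordnet as wn

def noun_similarity(word1, word2):
    """Path similarity between the first noun senses of the two words,
    or None if either word has no noun synset in WordNet."""
    synsets1 = wn.synsets(word1, pos=wn.NOUN)
    synsets2 = wn.synsets(word2, pos=wn.NOUN)
    if not synsets1 or not synsets2:
        return None
    return synsets1[0].path_similarity(synsets2[0])

def looks_metaphorical(subject, complement, threshold=0.1):
    """Flag 'Noun1 be-verb Noun2' as a metaphor when the two nouns are dissimilar."""
    sim = noun_similarity(subject, complement)
    return sim is not None and sim < threshold

print(noun_similarity("sword", "weapon"))  # relatively high: literal
print(noun_similarity("word", "weapon"))   # relatively low: candidate metaphor
```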
32. Challenges
● Proper Nouns and Pronouns have no
WordNet entries
○ Thus, we must ignore them
● Other dependencies may give more clues
○ The teenage boy’s room is a disaster area vs.
○ The teenage boy’s room is a messy area
○ However, no way to calculate similarity across
different parts of speech
34. False Positives
● Money is the main component of a
capitalist society
● Scars are marks on the body
○ Changes depending on selected sense of ‘scars’
35. False Negatives
● Life is a mere dream
● Children are roses
● Her eyes were fireflies
○ “fireflies” is tagged as an adjective
● Scars are a roadmap to the soul
○ “roadmap” is absent from WordNet