The final presentation I did with Lekha & Deepali for the Natural Language Processing assignments at IIT-Bombay.
Assignments included:
1: Spelling Correction
2: Part-of-speech Tagging
3: Metaphor Detection
4. Edit Distance Approach
● Uses dynamic programming: computes distance
values for the four different error types and
returns their minimum
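A minimal dynamic-programming sketch of this idea, assuming the standard Damerau-Levenshtein formulation (the function name and unit costs are illustrative choices, not taken from the assignment):

```python
def edit_distance(source, target):
    """Damerau-Levenshtein distance: at each cell, take the minimum over
    insertion, deletion, substitution and adjacent transposition."""
    m, n = len(source), len(target)
    # dp[i][j] = distance between source[:i] and target[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # i deletions
    for j in range(n + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # adjacent transposition, e.g. "hte" -> "the"
            if (i > 1 and j > 1 and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)
    return dp[m][n]

print(edit_distance("hte", "the"))  # 1: a single transposition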
5. Edit Distance Approach: Challenges
● Ties for Edit Distance
○ Solution: bigram probabilities of the word
○ Among all tied candidates, the word with the highest bigram
probability is selected as the result (see the sketch at the end of this slide)
● Favouring shorter words
○ Solution: Brevity Penalty
○ If 'r' is the average word length in the corpus and 'c' is the
candidate word's length, the Brevity Penalty is given by
○ BP = e^((1 − r) / c)
○ However, the differences between candidate probabilities are
too large to be noticeably affected by the penalty
○ E.g. realitvely → really (actual: relatively)
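A small sketch of the tie-break and brevity penalty described above; the slides do not specify how the two scores are combined, so multiplying them here is an assumption, and all names are illustrative:

```python
import math

def brevity_penalty(avg_corpus_len, candidate_len):
    """BP = e^((1 - r) / c), with r the average word length in the corpus
    and c the candidate word's length; penalises very short candidates."""
    return math.exp((1 - avg_corpus_len) / candidate_len)

def break_tie(tied_candidates, prev_word, bigram_prob, avg_corpus_len):
    """Among candidates tied on edit distance, prefer the one with the
    highest bigram probability, scaled here by the brevity penalty."""
    return max(
        tied_candidates,
        key=lambda w: bigram_prob(prev_word, w) * brevity_penalty(avg_corpus_len, len(w)),
    )
```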
7. Confusion Matrix Approach
● Generative model
● Product of error probability and word
probability used
● 4 types of errors:
○ Insertion
○ Deletion
○ Substitution
○ Transposition
● Makes single-error assumption
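A toy sketch of this noisy-channel scoring; the tiny probability tables below are made-up placeholders, not values from the assignment:

```python
# Generative model: score(c) = P(typo | c) * P(c), under the single-error assumption.
word_prob = {"period": 1e-4, "pried": 1e-6}        # P(c) from a unigram language model
error_prob = {                                      # P(typo | c) from the confusion matrices
    ("preiod", "period"): 1e-3,   # one transposition away
    ("preiod", "pried"): 1e-5,    # would need more than one edit
}

def best_correction(typo, candidates):
    """Return the candidate maximising P(typo | candidate) * P(candidate)."""
    return max(
        candidates,
        key=lambda c: error_prob.get((typo, c), 0.0) * word_prob.get(c, 0.0),
    )

print(best_correction("preiod", ["period", "pried"]))  # period
```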
9. Confusion Matrix Approach:
Examples of Common Confusions
● Vowels transposed, substituted, inserted,
deleted
○ acheive --> achieve
● Same-letter errors
○ cc --> ccc or c --> cc (similarly for other letters, both
typing slips and common misspellings)
● Keyboard layout
○ preiod --> period (e and r next to each
other)
○ htey --> they (h and t are diagonally
placed)
10. American – British spellings & pronunciation
● airbourne --> airborne
● humoural --> humoral
● missle --> missile
● Words derived from the same root
○ fourty --> forty
○ desireable --> desirable ; careing --> caring ; interfereing --> interfering
● Pronunciation
○ arbitary --> arbitrary (the r sound is difficult for some to pronounce
because of mother-tongue influence)
○ marrage --> marriage (the i is silent)
○ orginal --> original (regional accents)
○ dimention --> dimension (-tion and -sion have the same sound)
○ critisisms --> criticisms and ansestors --> ancestors (both c and s are used for
a similar sound in different words)
○ immediatley --> immediately
○ levle --> level
11. Alignment-based Approach
● Uses MOSES
● Moses is the most widely used SMT
framework and includes tools for
preprocessing, training and tuning
● Uses GIZA++ to obtain alignments
● Given an incorrect sentence, it finds the most
probable corrected sentence based on four
factors
12. Moses: How it works
Four most important ingredients are:
1. Phrase Translation Table: Mapping of
source-language phrases to target-language
phrases with translation probabilities
2. Language Model: Unigrams, bigrams and
trigrams on correct words
3. Distortion Model: Extent of reordering
4. Word Model: Makes sure translations are
not too short or long
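A toy illustration of how these four scores could be combined for one candidate translation; Moses actually uses a tuned log-linear model over such features, so the weights and the function below are purely illustrative, not Moses' real interface:

```python
import math

def candidate_score(phrase_probs, lm_probs, distortion_cost, word_count,
                    weights=(1.0, 1.0, 0.5, 0.1)):
    """Toy log-linear combination of the four ingredients listed above."""
    w_pt, w_lm, w_d, w_w = weights
    score = 0.0
    score += w_pt * sum(math.log(p) for p in phrase_probs)  # phrase translation table
    score += w_lm * sum(math.log(p) for p in lm_probs)      # language model (n-grams)
    score -= w_d * distortion_cost                           # distortion (reordering) model
    score -= w_w * word_count                                # word penalty (length control)
    return score
```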
13. Alignment-based Approach:
Observations
● Absurd mappings for some sentences in the
phrase translation table lead to wrong
output (e.g. a b i ---> t)
● Does not enforce the single-error assumption,
which can change the word altogether (e.g.
beationsfully when beautiful was expected)
14. Alignment-based Approach:
Results
● Language Model = Training Set and no
restriction on phrase length: 15%
● Language Model = Brown Corpus and no
restriction on phrase length: 20%
● Language Model = Brown Corpus and
phrase length = 3: 35.5%
15. Alignment-based Approach:
Error Analysis
● Single insertion / deletion
○ i -> e (aborigine -> aborigene)
○ n -> nn (bananas -> banannas)
○ t -> th (cartographer -> carthographer)
○ s -> z (business -> buziness)
● Pattern insertion / deletion
○ becuase -> bequatse (Expected: because)
○ autority -> auttorily (Expected: authority)
● Errors due to frequent pattern positions:
○ ‘-ly’, ‘-ed’, ‘-es’ in the end
■ hieroglph -> hierogly (Expected: hieroglyph)
18. Roadmap for Today
● General Viterbi
● Problems faced and their Solutions
● Results
19. Viterbi Algorithm
● Implements POS Tagging as a sequence-labeling task using the HMM framework
● Corresponds to the HMM problem of finding
the most likely state sequence for an
observation sequence
● Uses dynamic programming
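A compact sketch of Viterbi decoding for an HMM tagger; the dictionary-based probability tables are an assumed representation for illustration:

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Most likely tag sequence for `words`, given start, transition and
    emission probabilities (start_p[t], trans_p[prev][t], emit_p[t][word])."""
    # V[i][t] = probability of the best tag sequence for words[:i+1] ending in t
    V = [{t: start_p[t] * emit_p[t].get(words[0], 0.0) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for t in tags:
            prob, prev = max(
                (V[i - 1][p] * trans_p[p][t] * emit_p[t].get(words[i], 0.0), p)
                for p in tags
            )
            V[i][t] = prob
            back[i][t] = prev
    # Recover the path by following back-pointers from the best final tag
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```

With unsmoothed probabilities, any unseen word or transition drives every path to zero, which is exactly the sparsity problem discussed next.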
20. Challenges: Data Sparsity
● Not all transitions seen
● Not all POS tags seen for every seen word
(obvious in general, but this misses rare uses of
a word as a different part of speech)
● Not all words seen
Since probabilities get multiplied, a single zero
kills the entire path.
Accuracy with no smoothing : 35.82%
21. Solutions to Data Sparsity
● Laplace Smoothing (Add 1/Add delta
smoothing)
● Suffix based smoothing for unknown words
Eliminates the problem caused by zeroes and gives
a good approximation for rare phenomena
without biasing the results
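A sketch of both fixes; the choice of delta and of a three-character suffix is illustrative, not taken from the assignment:

```python
def add_delta_prob(count, context_count, vocab_size, delta=1.0):
    """Laplace / add-delta estimate: no event is left with probability zero."""
    return (count + delta) / (context_count + delta * vocab_size)

def unknown_word_emission(word, suffix_tag_counts, tag_counts, num_suffixes,
                          suffix_len=3):
    """Approximate P(word | tag) for unseen words by P(suffix | tag),
    e.g. '-ing' suggests a verb, '-ly' an adverb."""
    suffix = word[-suffix_len:]
    return {
        tag: add_delta_prob(suffix_tag_counts.get((suffix, tag), 0),
                            tag_counts[tag], num_suffixes)
        for tag in tag_counts
    }
```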
23. Results: Commonly Confused Tags
● ZZ0 (letters of the alphabet) is confused with AT0 (A Bend) and with proper
nouns (P O Box)
● VVZ (-s form of lexical verb) confused with NN2 (plural common
noun), e.g. means, works
● VVN (past participle verb form, e.g. forgotten) confused with VVI
(infinitive verb form, e.g. forget) for cases like become (I have
become, to become)
● Also VVN and VVD (past tense verb), e.g. I defeated, I have defeated
● AVQ, i.e. wh-adverb (e.g. when, where, how, why, wherever), is
confused with CJS, i.e. subordinating conjunction (e.g. although,
when)
● AJ0 adjectives tend to have an -ed ending (e.g. involved discussion), an -ing
ending (e.g. living proof) or a form identical to an infinitive verb (to deliberate ;
deliberate meaning), and hence are often confused with verb forms.
● Similarly, the NN1 singular noun form and the AJ0 adjective form are the same
for many words (happy person, I am happy)
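Confusion pairs like these can be read off a simple tally over the evaluation set; a minimal sketch, assuming aligned gold and predicted tag sequences:

```python
from collections import Counter

def tag_confusions(gold_tags, predicted_tags):
    """Count (gold, predicted) pairs where the tagger disagreed with the
    gold annotation; the most frequent pairs are the confusions listed above."""
    return Counter((g, p) for g, p in zip(gold_tags, predicted_tags) if g != p)

# tag_confusions(gold, pred).most_common(5) would surface pairs such as (VVZ, NN2)
```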
24. Aside: TO experiment
● Replace all instances of ‘TO0’ tag with ‘PRP’
and see the difference if any
● Result: Accuracy unchanged
● Our hypothesis: The separate TO0 tag may
come in handy in later stages of NLP
28. Assumptions
● Concentration only on Noun-Noun
metaphors of the form
Noun1 be-verb Noun2
● Examples:
○ Words are weapons (Metaphor)
○ Swords are weapons (Not metaphor)
29. Hypothesis
● Driving hypothesis:
Pairs of words used in metaphors are more
dissimilar than pairs of words used in normal
language
● Thus, similarity between pairs of words can
be measured to find if the sentence is a
metaphor
30. Word Similarity
● Uses the Path Similarity measure which
depends on the shortest path between two
words
● Similarity is calculated between pairs of
nouns in the sentence related by the nsubj
dependency
● The Stanford Parser is used for POS tagging
and dependency parsing
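A sketch of this similarity check using NLTK's WordNet interface; taking the first noun sense and using a 0.1 threshold are illustrative assumptions (the slides do not state a threshold):

```python
from nltk.corpus import wordnet as wn

def noun_similarity(word1, word2):
    """Path similarity between the first noun senses of the two words,
    or None if either word has no noun synset in WordNet."""
    synsets1 = wn.synsets(word1, pos=wn.NOUN)
    synsets2 = wn.synsets(word2, pos=wn.NOUN)
    if not synsets1 or not synsets2:
        return None
    return synsets1[0].path_similarity(synsets2[0])

def looks_metaphorical(subject, complement, threshold=0.1):
    """Flag 'Noun1 be-verb Noun2' as a metaphor when the two nouns are dissimilar."""
    sim = noun_similarity(subject, complement)
    return sim is not None and sim < threshold

print(noun_similarity("sword", "weapon"))  # relatively high: literal
print(noun_similarity("word", "weapon"))   # relatively low: candidate metaphor
```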
32. Challenges
● Proper Nouns and Pronouns have no
WordNet entries
○ Thus, we must ignore them
● Other dependencies may give more clues
○ The teenage boy’s room is a disaster area vs.
○ The teenage boy’s room is a messy area
○ However, no way to calculate similarity across
different parts of speech
34. False Positives
● Money is the main component of a
capitalist society
● Scars are marks on the body
○ Changes depending on selected sense of ‘scars’
35. False Negatives
● Life is a mere dream
● Children are roses
● Her eyes were fireflies
○ “fireflies” is tagged as an adjective
● Scars are a roadmap to the soul
○ “roadmap” is absent from WordNet