SlideShare una empresa de Scribd logo
1 de 17
Descargar para leer sin conexión
Edit distance and
applications
Anantharaman Narayana Iyer
Narayana dot Anantharaman at gmail dot com
3 Sep 2015
Objective
• Porter stemmer: we will mention it briefly and let you go through this
at home
• Discuss in depth the concept of edit distance computation using
dynamic programming
• Look at a few applications
References
• Speech And Language Processing,
2nd Edition by Dan Jurafsky and
James Martin
• Natural Language Processing,
Coursera online learning course by
Dan Jurafsky and Christopher
Manning
• Home page of Dr Martin Porter:
http://tartarus.org/~martin/Porter
Stemmer/
• Wikipedia (for certain definitions
of linguistic terminology)
References: Tools
http://rosettacode.org/wiki/Levenshtein_distance
http://norvig.com/spell-correct.html
Edit Distance
• Suppose we have 2 strings x and y and our
goal is to transform x to y.
• Example 1: x = alogarithm, y = logarithm
• Example 2: x = alogarithm, y = algorithm
• Edit operators
• Deletion
• Insertion
• Substitution
• The weighted number of operations that are
needed to transform x to y using the above
operators is called the edit distance
• Typically, deletion and insertion are assigned a
weight 1 while substitution has weight 2
(Levenshtein)
• Edit distance is a measure of string similarity
Example
• What is the edit distance between:
1. x = alogarithm, y = logarithm
2. x = alogarithm, y = algorithm
• In the example 1, by deleting the letter a from string x, we get the
string y. Hence edit distance is 1
• In the example 2, we need the following steps:
• Step 1: Delete o from x => algarithm
• Step 2: Replace the 4th letter a with o => algorithm
• In example 2, we performed 1 deletion and 1 replacement and hence
the edit distance = 1 + 2 = 3
Why is this needed?
• Applications in Natural Language Processing
• Spelling correction
• Approximate string matching
• Spam filtering
• Longest Common Subsequence
• Finding curse words or inappropriate words
• Finding similar sentences where one sentence differs
from the other within the edit distance with respect
to the words: Word level matching
• Adobe announced 4th quarter results today
• Adobe announced quarter results today
• Computational Biology
• Aligning DNA sequences
• Speech Recognition
How to find min edit distance?
• Let us consider x = intention y =
execution (Ref Dan Jurafsky)
• Initial State: The word we are
transforming (intention)
• Final State: The word we are
transforming to (execution)
• Operators: Insert, Delete, Substitute
• Path cost: This is what we need to
minimize, ie, the number of edit
operations weighted appropriately
Minimum Edit Distance
• The space of all edit sequence is huge and so it is not viable to
compute the edit distance using this approach
• Many paths to reach the end state from the start state.
• We only need to consider the shortest path: Minimum Edit Distance
• Let us consider 2 strings X, Y with lengths n and m respectively
• Define D(i, j) as the distance between X[1..i], Y[1..j]. This the distance
between X and Y is D(n, m)
Dynamic Programming algorithm To compute
minimum edit distance
• Create a matrix where the column represents the target string, where there is
one column for each symbol and the row represents the source string with one
symbol per row.
• For minimum edit distance this is the edit-distance matrix where the cell (i, j)
contains the distance of first i characters of the target and first j characters of the
source
• Each cell can be computed as a simple function of surrounding cells; thus, from
beginning of the matrix it is possible to compute every entry
• We compute D(i,j) for small i,j
• And compute larger D(i,j) based on previously computed smaller values
• i.e., compute D(i,j) for all i (0 < i < n) and j (0 < j < m)
Minimum Edit Distance Algorithm
• Initialization
D(i,0) = i
D(0,j) = j
• Recurrence Relation:
For each i = 1…M
For each j = 1…N
D(i-1,j) + 1
D(i,j)= min D(i,j-1) + 1
D(i-1,j-1) + 2; if X(i) ≠ Y(j)
0; if X(i) = Y(j)
• Termination:
D(N,M) is distance
Example
• Let X = “alogarithm” and Y = “algorithm”
• We need to transform X to Y using the dynamic programming
algorithm
• Let us start off with the base cases:
• (ε, ε) = 0, (εa, ε) = 1, …
• (ε, a) = 1, (ε, al) = 2, …
• Recurrence:
• We need to evaluate the distance between substrings: (εa, εa), (εa, εal), (εa,
εalo), (εa, εalog)…(εal, εa), (εal, εal), …, (εalg, εa), (εalg, εal), (εalg, εalo), (εalg,
εalog), …
Example (Ref Dan Jurafsky Coursera)
N 9
O 8
I 7
T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
Example (contd): (Ref Dan Jurafsky Coursera)
N 9 8 9 10 11 12 11 10 9 8
O 8 7 8 9 10 11 10 9 8 9
I 7 6 7 8 9 10 9 8 9 10
T 6 5 6 7 8 9 8 9 10 11
N 5 4 5 6 7 8 9 10 11 10
E 4 3 4 5 6 7 8 9 10 9
T 3 4 5 6 7 8 7 8 9 8
N 2 3 4 5 6 7 8 7 8 7
I 1 2 3 4 5 6 7 6 7 8
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
Backtrace
• Often it is not adequate to just compute minimum edit distance alone. It may also
be necessary to find the path to the minimum edit distance. This for example, is
needed for finding the alignment between 2 sequences
• Example:
• Intention and Execution
• We can implement backtracing by keeping pointers for each cell to record the
possible preceding cells that contributed to the minimum distance for this cell (i,
j). It is possible that this cell might have been reached just from one previous cell,
or two cells or all the three.
• When we reach the termination condition, start from the pointer from the final
cell (i = M, j = N) and trace back to find the alignment
Example – Backtracing (Adopted from Dan
Jurafsky)
Backup slides

Más contenido relacionado

La actualidad más candente

Introduction To Applied Machine Learning
Introduction To Applied Machine LearningIntroduction To Applied Machine Learning
Introduction To Applied Machine Learningananth
 
Search problems in Artificial Intelligence
Search problems in Artificial IntelligenceSearch problems in Artificial Intelligence
Search problems in Artificial Intelligenceananth
 
Basics of Machine Learning
Basics of Machine LearningBasics of Machine Learning
Basics of Machine LearningPranav Challa
 
Natural Language Processing: L03 maths fornlp
Natural Language Processing: L03 maths fornlpNatural Language Processing: L03 maths fornlp
Natural Language Processing: L03 maths fornlpananth
 
Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Balázs Hidasi
 
Overview of TensorFlow For Natural Language Processing
Overview of TensorFlow For Natural Language ProcessingOverview of TensorFlow For Natural Language Processing
Overview of TensorFlow For Natural Language Processingananth
 
Deep Learning For Practitioners, lecture 2: Selecting the right applications...
Deep Learning For Practitioners,  lecture 2: Selecting the right applications...Deep Learning For Practitioners,  lecture 2: Selecting the right applications...
Deep Learning For Practitioners, lecture 2: Selecting the right applications...ananth
 
Machine learning Lecture 1
Machine learning Lecture 1Machine learning Lecture 1
Machine learning Lecture 1Srinivasan R
 
Session-Based Recommendations with Recurrent Neural Networks (Balazs Hidasi, ...
Session-Based Recommendations with Recurrent Neural Networks(Balazs Hidasi, ...Session-Based Recommendations with Recurrent Neural Networks(Balazs Hidasi, ...
Session-Based Recommendations with Recurrent Neural Networks (Balazs Hidasi, ...hyunsung lee
 
Decision Trees
Decision TreesDecision Trees
Decision TreesStudent
 
Ot regularization and_gradient_descent
Ot regularization and_gradient_descentOt regularization and_gradient_descent
Ot regularization and_gradient_descentankit_ppt
 
Decision Tree - ID3
Decision Tree - ID3Decision Tree - ID3
Decision Tree - ID3Xueping Peng
 
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...Marina Santini
 
MS CS - Selecting Machine Learning Algorithm
MS CS - Selecting Machine Learning AlgorithmMS CS - Selecting Machine Learning Algorithm
MS CS - Selecting Machine Learning AlgorithmKaniska Mandal
 
H transformer-1d paper review!!
H transformer-1d paper review!!H transformer-1d paper review!!
H transformer-1d paper review!!taeseon ryu
 
Machine Learning 1 - Introduction
Machine Learning 1 - IntroductionMachine Learning 1 - Introduction
Machine Learning 1 - Introductionbutest
 
Introdution and designing a learning system
Introdution and designing a learning systemIntrodution and designing a learning system
Introdution and designing a learning systemswapnac12
 

La actualidad más candente (20)

Introduction To Applied Machine Learning
Introduction To Applied Machine LearningIntroduction To Applied Machine Learning
Introduction To Applied Machine Learning
 
Search problems in Artificial Intelligence
Search problems in Artificial IntelligenceSearch problems in Artificial Intelligence
Search problems in Artificial Intelligence
 
Basics of Machine Learning
Basics of Machine LearningBasics of Machine Learning
Basics of Machine Learning
 
Natural Language Processing: L03 maths fornlp
Natural Language Processing: L03 maths fornlpNatural Language Processing: L03 maths fornlp
Natural Language Processing: L03 maths fornlp
 
002.decision trees
002.decision trees002.decision trees
002.decision trees
 
Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017
 
Overview of TensorFlow For Natural Language Processing
Overview of TensorFlow For Natural Language ProcessingOverview of TensorFlow For Natural Language Processing
Overview of TensorFlow For Natural Language Processing
 
Deep Learning For Practitioners, lecture 2: Selecting the right applications...
Deep Learning For Practitioners,  lecture 2: Selecting the right applications...Deep Learning For Practitioners,  lecture 2: Selecting the right applications...
Deep Learning For Practitioners, lecture 2: Selecting the right applications...
 
Machine learning Lecture 1
Machine learning Lecture 1Machine learning Lecture 1
Machine learning Lecture 1
 
Session-Based Recommendations with Recurrent Neural Networks (Balazs Hidasi, ...
Session-Based Recommendations with Recurrent Neural Networks(Balazs Hidasi, ...Session-Based Recommendations with Recurrent Neural Networks(Balazs Hidasi, ...
Session-Based Recommendations with Recurrent Neural Networks (Balazs Hidasi, ...
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 
AI: AI & Problem Solving
AI: AI & Problem SolvingAI: AI & Problem Solving
AI: AI & Problem Solving
 
Ot regularization and_gradient_descent
Ot regularization and_gradient_descentOt regularization and_gradient_descent
Ot regularization and_gradient_descent
 
Decision Tree - ID3
Decision Tree - ID3Decision Tree - ID3
Decision Tree - ID3
 
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...
Lecture 02: Machine Learning for Language Technology - Decision Trees and Nea...
 
MS CS - Selecting Machine Learning Algorithm
MS CS - Selecting Machine Learning AlgorithmMS CS - Selecting Machine Learning Algorithm
MS CS - Selecting Machine Learning Algorithm
 
Machine Learning Basics
Machine Learning BasicsMachine Learning Basics
Machine Learning Basics
 
H transformer-1d paper review!!
H transformer-1d paper review!!H transformer-1d paper review!!
H transformer-1d paper review!!
 
Machine Learning 1 - Introduction
Machine Learning 1 - IntroductionMachine Learning 1 - Introduction
Machine Learning 1 - Introduction
 
Introdution and designing a learning system
Introdution and designing a learning systemIntrodution and designing a learning system
Introdution and designing a learning system
 

Similar a L06 stemmer and edit distance

Design and analysis of Algorithms - Lecture 08 (1).ppt
Design and analysis of Algorithms - Lecture 08 (1).pptDesign and analysis of Algorithms - Lecture 08 (1).ppt
Design and analysis of Algorithms - Lecture 08 (1).pptZeenaJaba
 
Algorithms & Complexity Calculation
Algorithms & Complexity CalculationAlgorithms & Complexity Calculation
Algorithms & Complexity CalculationAkhil Kaushik
 
Neural machine translation by jointly learning to align and translate.pptx
Neural machine translation by jointly learning to align and translate.pptxNeural machine translation by jointly learning to align and translate.pptx
Neural machine translation by jointly learning to align and translate.pptxssuser2624f71
 
Natural Language Query to SQL conversion using Machine Learning Approach
Natural Language Query to SQL conversion using Machine Learning ApproachNatural Language Query to SQL conversion using Machine Learning Approach
Natural Language Query to SQL conversion using Machine Learning ApproachMinhazul Arefin
 
Edit distance is a nlp technique to find the minimum distance between two words
Edit distance is a nlp technique to find the minimum distance between two wordsEdit distance is a nlp technique to find the minimum distance between two words
Edit distance is a nlp technique to find the minimum distance between two wordssaisuraj1510
 
Asymptotic Notations
Asymptotic NotationsAsymptotic Notations
Asymptotic NotationsRishabh Soni
 
Algorithm analysis insertion sort and asymptotic notations
Algorithm analysis insertion sort and  asymptotic notationsAlgorithm analysis insertion sort and  asymptotic notations
Algorithm analysis insertion sort and asymptotic notationsAmit Kumar Rathi
 
CUNI at MediaEval 2013 Similar Segments in Social Speech Task
CUNI at MediaEval 2013 Similar Segments in Social Speech TaskCUNI at MediaEval 2013 Similar Segments in Social Speech Task
CUNI at MediaEval 2013 Similar Segments in Social Speech TaskPetra Galuscakova
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringPier Luca Lanzi
 
Design and Analysis of Algorithms Lecture Notes
Design and Analysis of Algorithms Lecture NotesDesign and Analysis of Algorithms Lecture Notes
Design and Analysis of Algorithms Lecture NotesSreedhar Chowdam
 
2_EditDistance_Jan_08_2020.pptx
2_EditDistance_Jan_08_2020.pptx2_EditDistance_Jan_08_2020.pptx
2_EditDistance_Jan_08_2020.pptxAnum Zehra
 
XLnet RoBERTa Reformer
XLnet RoBERTa ReformerXLnet RoBERTa Reformer
XLnet RoBERTa ReformerSan Kim
 
DSA Complexity.pptx What is Complexity Analysis? What is the need for Compl...
DSA Complexity.pptx   What is Complexity Analysis? What is the need for Compl...DSA Complexity.pptx   What is Complexity Analysis? What is the need for Compl...
DSA Complexity.pptx What is Complexity Analysis? What is the need for Compl...2022cspaawan12556
 

Similar a L06 stemmer and edit distance (20)

Design and analysis of Algorithms - Lecture 08 (1).ppt
Design and analysis of Algorithms - Lecture 08 (1).pptDesign and analysis of Algorithms - Lecture 08 (1).ppt
Design and analysis of Algorithms - Lecture 08 (1).ppt
 
Algorithms & Complexity Calculation
Algorithms & Complexity CalculationAlgorithms & Complexity Calculation
Algorithms & Complexity Calculation
 
Neural machine translation by jointly learning to align and translate.pptx
Neural machine translation by jointly learning to align and translate.pptxNeural machine translation by jointly learning to align and translate.pptx
Neural machine translation by jointly learning to align and translate.pptx
 
DSA
DSADSA
DSA
 
Searching Algorithms
Searching AlgorithmsSearching Algorithms
Searching Algorithms
 
Unit ii algorithm
Unit   ii algorithmUnit   ii algorithm
Unit ii algorithm
 
Natural Language Query to SQL conversion using Machine Learning Approach
Natural Language Query to SQL conversion using Machine Learning ApproachNatural Language Query to SQL conversion using Machine Learning Approach
Natural Language Query to SQL conversion using Machine Learning Approach
 
Edit distance is a nlp technique to find the minimum distance between two words
Edit distance is a nlp technique to find the minimum distance between two wordsEdit distance is a nlp technique to find the minimum distance between two words
Edit distance is a nlp technique to find the minimum distance between two words
 
Asymptotic Notations
Asymptotic NotationsAsymptotic Notations
Asymptotic Notations
 
Algorithm analysis insertion sort and asymptotic notations
Algorithm analysis insertion sort and  asymptotic notationsAlgorithm analysis insertion sort and  asymptotic notations
Algorithm analysis insertion sort and asymptotic notations
 
DAA Notes.pdf
DAA Notes.pdfDAA Notes.pdf
DAA Notes.pdf
 
CUNI at MediaEval 2013 Similar Segments in Social Speech Task
CUNI at MediaEval 2013 Similar Segments in Social Speech TaskCUNI at MediaEval 2013 Similar Segments in Social Speech Task
CUNI at MediaEval 2013 Similar Segments in Social Speech Task
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 Clustering
 
Speech Signal Processing
Speech Signal ProcessingSpeech Signal Processing
Speech Signal Processing
 
Design and Analysis of Algorithms Lecture Notes
Design and Analysis of Algorithms Lecture NotesDesign and Analysis of Algorithms Lecture Notes
Design and Analysis of Algorithms Lecture Notes
 
NLP_KASHK:Minimum Edit Distance
NLP_KASHK:Minimum Edit DistanceNLP_KASHK:Minimum Edit Distance
NLP_KASHK:Minimum Edit Distance
 
2_EditDistance_Jan_08_2020.pptx
2_EditDistance_Jan_08_2020.pptx2_EditDistance_Jan_08_2020.pptx
2_EditDistance_Jan_08_2020.pptx
 
XLnet RoBERTa Reformer
XLnet RoBERTa ReformerXLnet RoBERTa Reformer
XLnet RoBERTa Reformer
 
DSA Complexity.pptx What is Complexity Analysis? What is the need for Compl...
DSA Complexity.pptx   What is Complexity Analysis? What is the need for Compl...DSA Complexity.pptx   What is Complexity Analysis? What is the need for Compl...
DSA Complexity.pptx What is Complexity Analysis? What is the need for Compl...
 
Algorithm.pptx
Algorithm.pptxAlgorithm.pptx
Algorithm.pptx
 

Más de ananth

Convolutional Neural Networks : Popular Architectures
Convolutional Neural Networks : Popular ArchitecturesConvolutional Neural Networks : Popular Architectures
Convolutional Neural Networks : Popular Architecturesananth
 
Overview of Convolutional Neural Networks
Overview of Convolutional Neural NetworksOverview of Convolutional Neural Networks
Overview of Convolutional Neural Networksananth
 
An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier ananth
 
Mathematical Background for Artificial Intelligence
Mathematical Background for Artificial IntelligenceMathematical Background for Artificial Intelligence
Mathematical Background for Artificial Intelligenceananth
 
Introduction to Artificial Intelligence
Introduction to Artificial IntelligenceIntroduction to Artificial Intelligence
Introduction to Artificial Intelligenceananth
 
Word representation: SVD, LSA, Word2Vec
Word representation: SVD, LSA, Word2VecWord representation: SVD, LSA, Word2Vec
Word representation: SVD, LSA, Word2Vecananth
 
Deep Learning For Speech Recognition
Deep Learning For Speech RecognitionDeep Learning For Speech Recognition
Deep Learning For Speech Recognitionananth
 
Convolutional Neural Networks: Part 1
Convolutional Neural Networks: Part 1Convolutional Neural Networks: Part 1
Convolutional Neural Networks: Part 1ananth
 
Machine Learning Lecture 3 Decision Trees
Machine Learning Lecture 3 Decision TreesMachine Learning Lecture 3 Decision Trees
Machine Learning Lecture 3 Decision Treesananth
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRUananth
 
L05 word representation
L05 word representationL05 word representation
L05 word representationananth
 
Natural Language Processing: L02 words
Natural Language Processing: L02 wordsNatural Language Processing: L02 words
Natural Language Processing: L02 wordsananth
 
Natural Language Processing: L01 introduction
Natural Language Processing: L01 introductionNatural Language Processing: L01 introduction
Natural Language Processing: L01 introductionananth
 
Deep Learning Primer - a brief introduction
Deep Learning Primer - a brief introductionDeep Learning Primer - a brief introduction
Deep Learning Primer - a brief introductionananth
 

Más de ananth (14)

Convolutional Neural Networks : Popular Architectures
Convolutional Neural Networks : Popular ArchitecturesConvolutional Neural Networks : Popular Architectures
Convolutional Neural Networks : Popular Architectures
 
Overview of Convolutional Neural Networks
Overview of Convolutional Neural NetworksOverview of Convolutional Neural Networks
Overview of Convolutional Neural Networks
 
An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier
 
Mathematical Background for Artificial Intelligence
Mathematical Background for Artificial IntelligenceMathematical Background for Artificial Intelligence
Mathematical Background for Artificial Intelligence
 
Introduction to Artificial Intelligence
Introduction to Artificial IntelligenceIntroduction to Artificial Intelligence
Introduction to Artificial Intelligence
 
Word representation: SVD, LSA, Word2Vec
Word representation: SVD, LSA, Word2VecWord representation: SVD, LSA, Word2Vec
Word representation: SVD, LSA, Word2Vec
 
Deep Learning For Speech Recognition
Deep Learning For Speech RecognitionDeep Learning For Speech Recognition
Deep Learning For Speech Recognition
 
Convolutional Neural Networks: Part 1
Convolutional Neural Networks: Part 1Convolutional Neural Networks: Part 1
Convolutional Neural Networks: Part 1
 
Machine Learning Lecture 3 Decision Trees
Machine Learning Lecture 3 Decision TreesMachine Learning Lecture 3 Decision Trees
Machine Learning Lecture 3 Decision Trees
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
 
L05 word representation
L05 word representationL05 word representation
L05 word representation
 
Natural Language Processing: L02 words
Natural Language Processing: L02 wordsNatural Language Processing: L02 words
Natural Language Processing: L02 words
 
Natural Language Processing: L01 introduction
Natural Language Processing: L01 introductionNatural Language Processing: L01 introduction
Natural Language Processing: L01 introduction
 
Deep Learning Primer - a brief introduction
Deep Learning Primer - a brief introductionDeep Learning Primer - a brief introduction
Deep Learning Primer - a brief introduction
 

Último

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 

Último (20)

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

L06 stemmer and edit distance

  • 1. Edit distance and applications Anantharaman Narayana Iyer Narayana dot Anantharaman at gmail dot com 3 Sep 2015
  • 2. Objective • Porter stemmer: we will mention it briefly and let you go through this at home • Discuss in depth the concept of edit distance computation using dynamic programming • Look at a few applications
  • 3. References • Speech And Language Processing, 2nd Edition by Dan Jurafsky and James Martin • Natural Language Processing, Coursera online learning course by Dan Jurafsky and Christopher Manning • Home page of Dr Martin Porter: http://tartarus.org/~martin/Porter Stemmer/ • Wikipedia (for certain definitions of linguistic terminology)
  • 5. Edit Distance • Suppose we have 2 strings x and y and our goal is to transform x to y. • Example 1: x = alogarithm, y = logarithm • Example 2: x = alogarithm, y = algorithm • Edit operators • Deletion • Insertion • Substitution • The weighted number of operations that are needed to transform x to y using the above operators is called the edit distance • Typically, deletion and insertion are assigned a weight 1 while substitution has weight 2 (Levenshtein) • Edit distance is a measure of string similarity
  • 6. Example • What is the edit distance between: 1. x = alogarithm, y = logarithm 2. x = alogarithm, y = algorithm • In the example 1, by deleting the letter a from string x, we get the string y. Hence edit distance is 1 • In the example 2, we need the following steps: • Step 1: Delete o from x => algarithm • Step 2: Replace the 4th letter a with o => algorithm • In example 2, we performed 1 deletion and 1 replacement and hence the edit distance = 1 + 2 = 3
  • 7. Why is this needed? • Applications in Natural Language Processing • Spelling correction • Approximate string matching • Spam filtering • Longest Common Subsequence • Finding curse words or inappropriate words • Finding similar sentences where one sentence differs from the other within the edit distance with respect to the words: Word level matching • Adobe announced 4th quarter results today • Adobe announced quarter results today • Computational Biology • Aligning DNA sequences • Speech Recognition
  • 8. How to find min edit distance? • Let us consider x = intention y = execution (Ref Dan Jurafsky) • Initial State: The word we are transforming (intention) • Final State: The word we are transforming to (execution) • Operators: Insert, Delete, Substitute • Path cost: This is what we need to minimize, ie, the number of edit operations weighted appropriately
  • 9. Minimum Edit Distance • The space of all edit sequence is huge and so it is not viable to compute the edit distance using this approach • Many paths to reach the end state from the start state. • We only need to consider the shortest path: Minimum Edit Distance • Let us consider 2 strings X, Y with lengths n and m respectively • Define D(i, j) as the distance between X[1..i], Y[1..j]. This the distance between X and Y is D(n, m)
  • 10. Dynamic Programming algorithm To compute minimum edit distance • Create a matrix where the column represents the target string, where there is one column for each symbol and the row represents the source string with one symbol per row. • For minimum edit distance this is the edit-distance matrix where the cell (i, j) contains the distance of first i characters of the target and first j characters of the source • Each cell can be computed as a simple function of surrounding cells; thus, from beginning of the matrix it is possible to compute every entry • We compute D(i,j) for small i,j • And compute larger D(i,j) based on previously computed smaller values • i.e., compute D(i,j) for all i (0 < i < n) and j (0 < j < m)
  • 11. Minimum Edit Distance Algorithm • Initialization D(i,0) = i D(0,j) = j • Recurrence Relation: For each i = 1…M For each j = 1…N D(i-1,j) + 1 D(i,j)= min D(i,j-1) + 1 D(i-1,j-1) + 2; if X(i) ≠ Y(j) 0; if X(i) = Y(j) • Termination: D(N,M) is distance
  • 12. Example • Let X = “alogarithm” and Y = “algorithm” • We need to transform X to Y using the dynamic programming algorithm • Let us start off with the base cases: • (ε, ε) = 0, (εa, ε) = 1, … • (ε, a) = 1, (ε, al) = 2, … • Recurrence: • We need to evaluate the distance between substrings: (εa, εa), (εa, εal), (εa, εalo), (εa, εalog)…(εal, εa), (εal, εal), …, (εalg, εa), (εalg, εal), (εalg, εalo), (εalg, εalog), …
  • 13. Example (Ref Dan Jurafsky Coursera) N 9 O 8 I 7 T 6 N 5 E 4 T 3 N 2 I 1 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N
  • 14. Example (contd): (Ref Dan Jurafsky Coursera) N 9 8 9 10 11 12 11 10 9 8 O 8 7 8 9 10 11 10 9 8 9 I 7 6 7 8 9 10 9 8 9 10 T 6 5 6 7 8 9 8 9 10 11 N 5 4 5 6 7 8 9 10 11 10 E 4 3 4 5 6 7 8 9 10 9 T 3 4 5 6 7 8 7 8 9 8 N 2 3 4 5 6 7 8 7 8 7 I 1 2 3 4 5 6 7 6 7 8 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N
  • 15. Backtrace • Often it is not adequate to just compute minimum edit distance alone. It may also be necessary to find the path to the minimum edit distance. This for example, is needed for finding the alignment between 2 sequences • Example: • Intention and Execution • We can implement backtracing by keeping pointers for each cell to record the possible preceding cells that contributed to the minimum distance for this cell (i, j). It is possible that this cell might have been reached just from one previous cell, or two cells or all the three. • When we reach the termination condition, start from the pointer from the final cell (i = M, j = N) and trace back to find the alignment
  • 16. Example – Backtracing (Adopted from Dan Jurafsky)