Indexing Text with
Approximate q-grams
Adriano Galati & Marjolijn Elsinga
Overview
• Approximate string matching
- Neighborhood generation
- Reduction to Exact Searching
- Intermediate Partitioning
• Indexing text using q-grams
• Filtration condition
• Finding approximate q-grams
- Trie data structure
- Non-deterministic automaton (NFA)
• Parameters
Approximate string matching
Text T = T_{1..n}
Pattern P = P_{1..m}
Goal: Retrieve all occurrences of P in T whose edit distance is at most k
Edit distance: ed(A, B)
Solutions
Many kinds of solutions exist; this is one of the most investigated areas in computer science
In the on-line version of the problem the pattern can be preprocessed, but the text cannot
Classical solution: dynamic programming over a matrix, in O(mn) time
Classical solution
Fill a matrix C_{0..m, 0..n} where C_{i,j} is the minimum edit distance between P_{1..i} and a suffix of T_{1..j}
Initialize the borders with C_{i,0} = i and C_{0,j} = 0
Fill the internal cells with
C_{i,j} = C_{i-1,j-1} if P_i = T_j, else 1 + min(C_{i-1,j-1}, C_{i-1,j}, C_{i,j-1})
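A minimal Python sketch of this O(mn) dynamic programming (not from the paper; the names are mine). It keeps one column at a time and reports every text position where an occurrence of P ends with at most k errors.

```python
def approx_search(pattern, text, k):
    """Report end positions of approximate occurrences of pattern in text.
    col[i] = minimum edit distance between pattern[:i] and a suffix of text[:j]."""
    m = len(pattern)
    col = list(range(m + 1))                   # column j = 0: C[i,0] = i
    ends = []
    for j, tc in enumerate(text, start=1):
        new = [0] * (m + 1)                    # C[0,j] = 0: an occurrence may start anywhere
        for i in range(1, m + 1):
            if pattern[i - 1] == tc:
                new[i] = col[i - 1]
            else:
                new[i] = 1 + min(col[i - 1],   # substitution
                                 col[i],       # extra character in the text
                                 new[i - 1])   # character of P missing from the text
        if new[m] <= k:
            ends.append(j)                     # an occurrence of P ends at text position j
        col = new
    return ends

print(approx_search("survey", "surgery", 2))   # [5, 6, 7]
```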
Solution (2)
If text is large, on-line algorithms are not practical
and preprocessing becomes necessary
Focus: Sequence retrieving indexes, with no
restrictions on the patterns and the occurrences
Approaches:
• Neighborhood Generation
• Reduction to Exact Searching
• Intermediate Partitioning
Neighborhood Generation
The set U_k(P) of strings matching a pattern P with k errors is finite
Therefore it can be enumerated
Each string in U_k(P) can be searched using a data structure
This structure is designed for exact matching
Neighborhood Generation (2)
+ O(n) space and construction time
- Not optimized for secondary memory
- Inefficient in space requirements
Is promising for searching short patterns
only
Reduction to Exact Searching
Indexes based on filters
Filter checks for simpler condition than the
matching condition, discarding large parts of the
text
Main principle: if two strings A and B match with k
errors and k+s non-overlapping samples are
extracted from A, then at least s of these must
appear without errors in B
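A tiny Python illustration of this filter condition (a toy check of my own, not the index itself): cut A into k+s non-overlapping pieces and count how many occur verbatim in B.

```python
def pieces_found_exactly(a, b, k, s):
    """Cut A into k+s non-overlapping pieces; if ed(A, B) <= k,
    at least s of them must occur exactly in B."""
    n_pieces = k + s
    step = len(a) // n_pieces
    pieces = [a[i * step:(i + 1) * step] for i in range(n_pieces)]
    return sum(1 for p in pieces if p and p in b)

# ed("abcdefgh", "abxdefgh") = 1; with k=1 and s=2 we cut 3 pieces,
# and indeed at least 2 of them appear exactly in B
print(pieces_found_exactly("abcdefgh", "abxdefgh", k=1, s=2))   # 2
```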
Reduction to Exact Searching (2)
+ can be built in linear time and need O(n)
space
+ with some methods it is possible to build an index that takes less space than the text itself
- They are based on suffix trees or on indexing all the q-grams
Intermediate Partitioning
 Reduces the search of the pieces to approximate search instead of exact search
 Main principle: if two strings A and B match with at most k errors and j disjoint substrings are taken from A, then at least one of these appears in B with at most ⌊k/j⌋ errors
 Split the pattern into j pieces, search each piece in the index allowing ⌊k/j⌋ errors, and extend the approximate matches to complete occurrences
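A small Python sketch of the partitioning step only (the helper is mine; searching the pieces in the index and extending the matches is not shown):

```python
def split_pattern(pattern, j, k):
    """Split P into j disjoint pieces of (nearly) equal length;
    each piece is then searched with at most e = floor(k / j) errors."""
    m = len(pattern)
    e = k // j
    bounds = [round(i * m / j) for i in range(j + 1)]
    pieces = [pattern[bounds[i]:bounds[i + 1]] for i in range(j)]
    return pieces, e

print(split_pattern("CCTCTCTCCCCT", j=3, k=8))   # (['CCTC', 'TCTC', 'CCCT'], 2)
```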
Question (Ingmar)
I think the main principle is incorrect, because if
AAABBBBBB
BBBBBBBBB
These match with k=3 errors. If we take the disjoint substrings AAA BBB BBB, then j=3. Now they say that one of these will appear in the other with ⌊3/3⌋ = 1 errors. However, AAA matches with 3 errors, BBB with 0 and BBB with 0
Answer
The pattern is split into j pieces, each piece is searched in the index allowing ⌊k/j⌋ errors
AAA BBB BBB
BBB BBB BBB
We match BBB with ABB and not with AAA and AAB, because it is not possible to match them with more than ⌊k/j⌋ errors, with k=3 and j=3, unless we change the parameters
Intermediate Partitioning (2)
+ an intermediate (optimizing) point between neighborhood generation (which gets worse with longer pieces) and reduction to exact searching (which gets worse with shorter pieces)
Has been used on the patterns but not yet
on the text itself
Indexing text using q-grams
Steps:
• Filtering text
• Finding approximate q-grams
Advantages:
• Takes little space
• Offers an alternative tradeoff between space and performance
• The user can decide what is more important: saving space or better performance
Filtration condition
Based on locating approximate matches of
pattern q-grams in text
Leads to a filtration tolerating higher error
levels compared to exact q-gram matching
Condition for an approximate match
Two strings A and B with ed(A, B) ≤ k
Write A = A_1 x_1 A_2 x_2 ... x_{j-1} A_j
Now: at least one string A_i appears in B with at most ⌊k/j⌋ errors
Only the q-grams for which this holds will be used for searching
Example: Condition
A: CCTC TCTC CCCT
B: CCCC CTCT TCTC
We see: k=8
We take: j=3
Now e=2, so at least one Ai appears in B
with at most 2 errors
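A quick Python check of this example, reusing approx_search from the earlier sketch: a piece passes the filter if it occurs somewhere in B with at most e = 2 errors.

```python
A_pieces = ["CCTC", "TCTC", "CCCT"]
B = "CCCCCTCTTCTC"
e = 2

surviving = [p for p in A_pieces if approx_search(p, B, e)]
print(surviving)   # here all three pieces qualify; the condition only needs one
```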
Question (Peter)
“Note that it is possible that j × ⌊k/j⌋ < k, so we are not only ‘distributing’ errors across pieces, but also ‘removing’ some of them”
How does this work?
Answer
A = A1 x1 A2 x2 A3
k = 5
j = 3
e = ⌊5/3⌋ = 1, so the three pieces absorb at most 3 × 1 = 3 < 5 errors
Q-grams vs. Q-samples
Q-grams overlap
Q-samples do not overlap
String: ABCDEF
Q-grams: {ABC, BCD, CDE, DEF}
Q-samples: {ABC, DEF}
In a q-gram index all the text q-grams are stored in
increasing order
In a q-sample index only some text q-grams are
stored
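Both notions in a couple of Python lines (illustrative only):

```python
def q_grams(s, q):
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def q_samples(s, q, h):
    """Non-overlapping samples of length q, taken every h characters (h >= q)."""
    return [s[i:i + q] for i in range(0, len(s) - q + 1, h)]

print(q_grams("ABCDEF", 3))        # ['ABC', 'BCD', 'CDE', 'DEF']
print(q_samples("ABCDEF", 3, 3))   # ['ABC', 'DEF']
```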
Constructing q-samples
We need to extract j pieces from each potential pattern
occurrence in the text
So: a q-sample every h text-characters
We need to guarantee that j q-samples are inside any
occurrence of P
Minimal length of an occurrence of P: m − k
h ≤ ⌊(m − k − q + 1) / j⌋
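A one-line helper (mine) for this bound: the largest sampling interval h that still guarantees j q-samples inside any occurrence of P.

```python
def max_sampling_interval(m, k, q, j):
    """Largest h satisfying h <= floor((m - k - q + 1) / j)."""
    return (m - k - q + 1) // j

print(max_sampling_interval(m=12, k=2, q=3, j=3))   # 2
```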
Question (Jacob)
Could you please explain how the restriction
of h is built up?
Answer
# q-samples fully inside an occurrence of P ≥ ⌊(m − k − q + 1) / h⌋
(an occurrence has length at least m − k, and a q-sample starts every h characters)
We need this to be at least j:
j ≤ ⌊(m − k − q + 1) / h⌋  ⇒  h ≤ ⌊(m − k − q + 1) / j⌋
Next step
Best match distance (bed) is calculated for
each test sequence of q-samples
This is the distance between the q-sample
sequence and the involved text (h)
The text area h is only examined if its bed is
at most k
Algorithm
Each q-sample sequence has its own counter M
M indicates the number of errors produced by the q-sample sequence and is initialized to M = j·(e + 1)
So: we start by assuming that each q-sample gives enough errors to disallow a match
Error-environment
After calculating the M for each q-sample
sequence, we obtain the e-environment of
each q-sample sequence
This is the set of possible q-samples that
appear inside the q-sample sequence with
at most e errors
Finishing
Now every text area has its own e-environments, connected to it through the q-samples
They can be checked with dynamic
programming
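To make the last few slides concrete, a rough Python sketch of the counter-based filtration under my reading of these slides (the paper's exact bookkeeping may differ): M starts at j·(e+1); whenever a pattern block matches its text q-sample within e errors, the pessimistic contribution e+1 is replaced by the observed distance, and the text area is verified by dynamic programming only if M ≤ k.

```python
def areas_to_verify(pattern_blocks, text_qsample_seqs, e, k, bed):
    """Counter-based filtration sketch.
    pattern_blocks:     the j pieces Q_1..Q_j of the pattern
    text_qsample_seqs:  for each candidate text area, its sequence of j q-samples
    bed:                best match distance between a text q-sample and a block
    Returns the indices of the text areas that still need DP verification."""
    j = len(pattern_blocks)
    candidates = []
    for r, qsamples in enumerate(text_qsample_seqs):
        M = j * (e + 1)                  # assume every piece contributes e+1 errors
        for Q_i, d_i in zip(pattern_blocks, qsamples):
            d = bed(d_i, Q_i)
            if d <= e:                   # d_i lies in the e-environment of Q_i
                M -= (e + 1) - d         # replace the pessimistic estimate by d
        if M <= k:                       # the area may contain a real match
            candidates.append(r)
    return candidates
```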
Finding approximate q-grams
Finding all the text q-samples that appear inside a given pattern block Q_i
Note: it is not necessary to generate all of U_e^q(Q_i), since we are interested only in the text q-samples (positions):
I_e^q(Q_i) = { r ∈ 1..⌊n/h⌋ : bed(d_r, Q_i) ≤ e }   (d_r is the r-th text q-sample)
Finding approximate q-grams (2)
 Idea: store all the different text q-samples in a trie data structure
 We fill a matrix C_{0..q, 0..|Q|} such that C_{i,l} is the sed between S_{1..i} and a suffix of Q_{1..l}
 S is relevant ⇔ C_{q,l} ≤ e for some l
 In a trie traversal of the q-samples, the characters of S are obtained one by one
Question (Laurence)
 Can you please show how the matrix in fig. 4 of section 3 is built? It is a bit unclear to me how the matrix is initialized and how the different cells are filled.
Answer
C_{i,j} = C_{i-1,j-1} if S_i = Q_j, else 1 + min(C_{i-1,j}, C_{i,j-1}, C_{i-1,j-1})
Answer
[Worked example from Fig. 4 of the paper: the matrix filled for the example string "survey" against several text q-samples; the numeric cell values did not survive this transcript]
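A compact Python sketch of the trie traversal with the matrix maintained row by row (my own rendering of the idea; it also includes the pruning rule from the next slide and returns q-samples rather than text positions).

```python
def build_trie(qsamples):
    """Trie over the distinct text q-samples; '$' marks a complete q-sample."""
    root = {}
    for s in qsamples:
        node = root
        for c in s:
            node = node.setdefault(c, {})
        node["$"] = s
    return root

def relevant_qsamples(trie, Q, e):
    """Return the q-samples S whose sed to some substring of Q is <= e."""
    out = []

    def visit(node, prev_row, depth):
        for c, child in node.items():
            if c == "$":
                continue
            # one more character of S: compute row depth+1 of the C matrix
            row = [depth + 1] + [0] * len(Q)
            for l in range(1, len(Q) + 1):
                if Q[l - 1] == c:
                    row[l] = prev_row[l - 1]
                else:
                    row[l] = 1 + min(prev_row[l], row[l - 1], prev_row[l - 1])
            if min(row) > e:              # pruning: rows are nondecreasing, so give up
                continue
            if "$" in child:              # leaf at depth q: some C[q,l] <= e, report S
                out.append(child["$"])
            visit(child, row, depth + 1)

    visit(trie, [0] * (len(Q) + 1), 0)    # row 0: the empty prefix matches with 0 errors
    return out

trie = build_trie(["sur", "sug", "vey", "rve"])
print(relevant_qsamples(trie, "survey", 1))   # ['sur', 'sug', 'vey', 'rve']
```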
Finding approximate q-grams (4)
When we reach the leaf nodes (depth q) we check in row q: if there is a cell with value ≤ e ⇒ the corresponding text positions are reported
Complexity: O(|Q| q) = O(mq)
Finding approximate q-grams (3)
Pruning:
• All the values from one row to the next are nondecreasing
• If all the values of a row are larger than e, at that point we can abandon that branch of the trie
Finding approximate q-grams (5)
Alternative way:
• To model the search with a non-deterministic
automaton (NFA)
Finding approximate q-grams (6)
Consider the NFA for e = 2 errors
Every row denotes the number of errors seen
Every column represents matching a prefix of S
Horizontal arrows represent matching a character
All the others increment the number of errors
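A plain set-based simulation of this NFA in Python (illustrative only; it runs in O(n·e·m) time, unlike the bit-parallel simulations used in practice). A state (d, i) means: a suffix of the text read so far matches S[:i] with d errors.

```python
def nfa_search(text, S, e):
    """Simulate the NFA for matching S with at most e errors.
    active[d]: prefix lengths of S matched by some text suffix using d errors."""
    m = len(S)
    active = [{min(d, m)} for d in range(e + 1)]   # row d starts having skipped d chars of S
    ends = []
    for pos, c in enumerate(text, start=1):
        new = [{0} for _ in range(e + 1)]          # initial state kept active (safe in every row)
        for d in range(e + 1):
            for i in active[d]:
                if i < m and S[i] == c:
                    new[d].add(i + 1)              # horizontal arrow: match c
                if d < e:
                    if i < m:
                        new[d + 1].add(i + 1)      # substitution
                    new[d + 1].add(i)              # extra text character (insertion)
        for d in range(e):                         # epsilon arrows: skip a character of S
            for i in list(new[d]):
                if i < m:
                    new[d + 1].add(i + 1)
        if any(m in states for states in new):     # final column reached in some row
            ends.append(pos)
        active = new
    return ends

print(nfa_search("surgery", "survey", 2))          # [5, 6, 7], same as the DP matrix
```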
Question (Bogdan)
I can imagine how the trie can be used together with the
matrix in order to benefit from common prefixes of
certain q-samples (by reusing the rows of the matrix
which are already computed for the common prefix).
However, I don't see how this can be done in the case of
the NFA. If it can't be done, this would mean that the
algorithm has to be run separately for each q-tuple,
which probably makes the NFA approach much worse.
Am I right to think that or is there a way to run the NFA in
a "smarter" way so as to benefit from common prefixes?
Bogdan (answer)
 Yes, you are right, the algorithm has to run for each q-tuple, but you have to consider its complexity, which is linear, O(e)
Parameters of the Problem
 Smaller e value ⇒ the search of the e-environment will be cheaper
 Larger e value gives more exact estimates of the actual number of errors, but with a higher cost to search the e-environment
 As j grows, longer test sequences with fewer errors per piece are used ⇒ the cost to find the relevant q-samples decreases but the amount of text verification increases
Parameters of the Problem (2)
1. Notice: since the index of this approach only stores non-overlapping q-samples, its space requirement is small
2. Notice: the space consumption of the index depends on the interval h
Parameters of the Problem (3)
 A standard q-gram index implementation stores all the locations of all the q-grams of the text
 The number of q-grams is n − q + 1
 Storing a position takes log n bits ⇒ the space consumption is n log n
 Ratio between this method and the standard approach:
v_r = ((n/h) · log(n/h)) / (n · log n) ≈ 1/h
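A quick numeric check of this ratio (arbitrary example values, not from the paper):

```python
import math

n, h = 10**6, 10
v_r = (n / h) * math.log2(n / h) / (n * math.log2(n))
print(round(v_r, 3), "vs the 1/h approximation", 1 / h)   # 0.083 vs 0.1
```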
Question (Bogdan)
 Could you please explain what the
"columns" used in the 5th section are?
 The table shows how the error level increases the number of processed columns of the matrix or the NFA
Question (Lee/Bram)
 The article talks about disjoint, non-overlapping q-grams. At the end they say that allowing overlapping q-grams will probably enhance the scheme. Any idea how our current algorithms would have to be changed for that, and what the advantages are?
 http://www.cs.utexas.edu/users/mobios/MoBIoSPapers/2003-IndexingProteinSequences-TR-04-06.pdf
Question (Lee)
In the second paragraph of section 4 they
say “In that particular case we can avoid
the use of counters…” Can you explain
that?
Answer
The error counters M are initialized at a high value
After that, all pattern blocks are compared to the corresponding text pieces and the counter value is updated to a lower value
In this particular case, when e = ⌊k/j⌋, the error counter can get as low as k+1, which is higher than the initial value
Any other questions?