King’s College London, University of London
MSc in Advanced Software Engineering
Approximate Indexing: Gapped Suffix Array
KyungHoon Park
Agenda
• Research Objective
• Gapped suffix array
• Application
• Going beyond gSA
• Q&A
Research Objective
Main questions
1. Given an already constructed suffix array, can a gapped
suffix array be built in O(n) time?
2. What are the limitations of the gapped suffix array?
How can they be overcome?
Research aims
1. To fully understand and implement the suffix array
and LCP.
2. To implement a gapped suffix array from the suffix
array in O(n) time.
3. To study and implement the gapped suffix array
paper.
4. If the approach can be extended to multiple
gapped suffix arrays, to research its other limitations.
Gapped Suffix Array
Definitions
T = t1 t2 … tn, P = p1 p2 … pm, strings of symbols over a
finite alphabet
m = length of search string
n = length of text
k = k-mistake (Hamming distance)
Suffix Array
T = mississippi
i   T[i]         SA  T[SA[i]]     LCP
0   mississippi  10  i            0
1   ississippi   7   ippi         1
2   ssissippi    4   issippi      1
3   sissippi     1   ississippi   4
4   issippi      0   mississippi  0
5   ssippi       9   pi           0
6   sippi        8   ppi          1
7   ippi         6   sippi        0
8   ppi          3   sissippi     2
9   pi           5   ssippi       1
10  i            2   ssissippi    3
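The table above can be reproduced with a short sketch. This is a naive sort-based construction written for illustration only, not the linear-time Skew algorithm the deck refers to; the function names `suffix_array` and `lcp_array` are assumptions introduced here.

```python
def suffix_array(text):
    # Naive construction: sort the start positions by the suffix that begins there.
    # This is O(n^2 log n) in the worst case, unlike the Skew algorithm.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # lcp[r] is the length of the longest common prefix of the suffixes
    # at ranks r - 1 and r; lcp[0] is 0 by convention.
    lcp = [0] * len(sa)
    for r in range(1, len(sa)):
        a, b = text[sa[r - 1]:], text[sa[r]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[r] = k
    return lcp


T = "mississippi"
SA = suffix_array(T)
LCP = lcp_array(T, SA)
```

Here `SA` and `LCP` correspond to the SA and LCP columns of the table, with 0-based text positions.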
Gapped Suffix Array
1. First introduced by Crochemore and Tischler
(2010).
2. Constructed after the SA.
3. A suffix array whose suffixes contain a gap over a specific
range of positions, which provides an approximate index.
4. The range of the gap is defined before the gapped
suffix array is constructed.
Gapped Suffix Array
T = mississippi, (1, 2)-gSA(3,1)
i   T[i]         SA  gSA  (1, 2)-gSA(3,1)
1   mississippi  10  10   i#
2   ississippi   7   7    i#pi
3   ssissippi    4   4    i#sippi
4   sissippi     1   1    i#sissippi
5   issippi      0   0    m#ssissippi
6   ssippi       9   9    p#
7   sippi        8   8    p#i
8   ippi         6   5    s#ippi
9   ppi          3   2    s#issippi
10  pi           5   6    s#ppi
11  i            2   3    s#ssippi
The T[i] column lists the suffixes in text order; the SA, gSA and gapped-suffix columns are in sorted order.
Definition
(g0, g1)-gSA(m, k)
gSA = gapped suffix array
g0 = start position of the gap
g1 = end position of the gap (exclusive)
m = length of search string
k = Hamming distance
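As a cross-check of the definition, the gapped order can be obtained by sorting the suffixes while ignoring the characters at offsets g0 to g1 - 1. The sketch below is a naive comparison sort, not the linear-time construction; `gapped_suffix_array` and `gapped` are hypothetical helper names.

```python
def gapped_suffix_array(text, g0, g1):
    # Sort key: the first g0 characters, then everything from offset g1 onward.
    # The characters at offsets g0 .. g1 - 1 (the gap) play no part in the order.
    def key(i):
        return (text[i:i + g0], text[i + g1:])
    return sorted(range(len(text)), key=key)


def gapped(text, i, g0, g1):
    # Display form of a gapped suffix, with '#' standing for each gap position.
    return text[i:i + g0] + "#" * (g1 - g0) + text[i + g1:]


T = "mississippi"
gSA = gapped_suffix_array(T, 1, 2)
gapped_suffixes = [gapped(T, i, 1, 2) for i in gSA]
```

For the (1, 2) gap, `gSA` gives the gSA column and `gapped_suffixes` gives the gapped strings in sorted order.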
Flow of constructing the gSA
1. Constructing the SA
• Skew algorithm
2. Defining the limitations
• Value of the k-mistake
• Range of gap
3. Constructing the gSA
• Sorting based on GRANK & HRANK
Limitations of gSA
1. The Hamming distance, pattern length and gap
range must be defined prior to construction.
2. A gSA cannot cover every approximate match
allowed by the defined k-mistake.
e.g. k = 2, gap = (1,3)
coat -> c##t, ##at, co## (supported)
#o#t, c#a# (not supported)
3. A gSA cannot support multiple gaps.
e.g. coach -> c#a#h
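The coat example can be checked by enumerating every way of masking k positions. The classification rule below is an assumption made for this sketch, based on one reading of the slide: a pattern whose wildcards sit only at its ends reduces to a plain SA search, a pattern whose wildcards are exactly positions g0 to g1 - 1 is served by the gSA, and anything else is unsupported.

```python
from itertools import combinations


def masked_patterns(pattern, k):
    # Every way of replacing exactly k positions of the pattern with '#'.
    for positions in combinations(range(len(pattern)), k):
        masked = "".join("#" if i in positions else c
                         for i, c in enumerate(pattern))
        yield masked, positions


def classify(masked, positions, g0, g1):
    # Assumed rule, not taken from the paper:
    if "#" not in masked.strip("#"):
        return "SA"           # wildcards only at the ends
    if positions == tuple(range(g0, g1)):
        return "gSA"          # wildcards coincide with the configured gap
    return "unsupported"


result = {masked: classify(masked, positions, 1, 3)
          for masked, positions in masked_patterns("coat", 2)}
unsupported = [p for p, where in result.items() if where == "unsupported"]
```

Under this rule the two unsupported patterns are the ones the slide lists, and `#oa#`, which the slide does not mention, falls under the SA.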
Constructing gSA - #1. GRANK
Example: m = 3, gap range = (1,2)
i      0 1 2 3 4 5 6 7 8 9 10
T[i]   m i s s i s s i p p i
GRANK  5 1 8 8 1 8 8 1 6 6 1
GRANK contains the ranks of the factors of T with
length up to g0, that is, the rank of each suffix when
only the characters before the start of the gap at
position g0 are compared.
Constructing gSA - #2. HRANK
HRANK contains the ranks of the suffixes that start
where the gap ends.
Because the suffix array is built before the gapped
suffix array, the rank of the suffix that starts at the end
of the gap can be read directly from the inverse suffix array (ISA):
HRANK[r] = ISA[SA[r]+g1]
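Both rank arrays can be computed in a few lines. This sketch assumes 1-based ranks and a rank of 0 when SA[r] + g1 falls past the end of T, which is what the example values in the deck suggest; the suffix array itself is built naively here and the function names are hypothetical.

```python
def suffix_array(text):
    # Naive construction, used only to keep the example self-contained.
    return sorted(range(len(text)), key=lambda i: text[i:])


def grank(text, sa, g0):
    # For each text position, the 1-based rank of the first suffix in SA order
    # that shares the same first g0 characters.
    ranks = [0] * len(text)
    group_rank = 0
    for r, pos in enumerate(sa):
        prev = sa[r - 1]
        if r == 0 or text[pos:pos + g0] != text[prev:prev + g0]:
            group_rank = r + 1
        ranks[pos] = group_rank
    return ranks


def hrank(text, sa, g1):
    # HRANK[r] = ISA[SA[r] + g1], with 1-based ranks and 0 past the end of T.
    n = len(text)
    isa = [0] * n
    for r, pos in enumerate(sa):
        isa[pos] = r + 1
    return [isa[pos + g1] if pos + g1 < n else 0 for pos in sa]


T = "mississippi"
SA = suffix_array(T)
GRANK = grank(T, SA, 1)
HRANK = hrank(T, SA, 2)
```

`GRANK` is indexed by text position and `HRANK` by SA rank, matching how the two columns are laid out in the deck's example.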
GRANK & HRANK
For example, the GRANK and HRANK parts of the
fourth suffix, sissippi, are laid out as
below.
s      i    s s i p p i
GRANK  Gap  HRANK
Radix sorting the suffixes on the pair (GRANK, HRANK)
created in this way makes it possible to build the
gSA in linear time.
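The sort on the (GRANK, HRANK) pair can be sketched as two stable bucket passes, least significant key first. This is a minimal illustration under assumptions: the suffix array is built naively, the prefix comparison used to form GRANK costs O(g0) per suffix, and `counting_sort` and `gsa_from_sa` are names introduced here rather than the paper's.

```python
def counting_sort(items, key, max_key):
    # Stable bucket sort on integer keys in the range 0 .. max_key.
    buckets = [[] for _ in range(max_key + 1)]
    for item in items:
        buckets[key(item)].append(item)
    return [item for bucket in buckets for item in bucket]


def gsa_from_sa(text, sa, g0, g1):
    n = len(text)
    isa = [0] * n
    for r, pos in enumerate(sa):
        isa[pos] = r + 1                      # 1-based rank of each suffix
    g = [0] * n                               # GRANK, indexed by text position
    group = 0
    for r, pos in enumerate(sa):
        prev = sa[r - 1]
        if r == 0 or text[pos:pos + g0] != text[prev:prev + g0]:
            group = r + 1
        g[pos] = group
    # Rank of the suffix that starts after the gap, 0 if there is none.
    h = [isa[p + g1] if p + g1 < n else 0 for p in range(n)]
    # LSD radix sort: secondary key (HRANK) first, then primary key (GRANK).
    order = counting_sort(range(n), lambda p: h[p], n)
    return counting_sort(order, lambda p: g[p], n)


T = "mississippi"
SA = sorted(range(len(T)), key=lambda i: T[i:])
gSA = gsa_from_sa(T, SA, 1, 2)
```

Because both passes are stable and each key is bounded by n, the two passes together take time linear in n once the SA is available.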
Example of (1,2)-gSA(3,1)
i   T[i]         SA  gSA  (1, 2)-gSA   GRANK  HRANK
1   mississippi  10  10   i#           5      0
2   ississippi   7   7    i#pi         1      6
3   ssissippi    4   4    i#sippi      8      8
4   sissippi     1   1    i#sissippi   8      9
5   issippi      0   0    m#ssissippi  1      11
6   ssippi       9   9    p#           8      0
7   sippi        8   8    p#i          8      1
8   ippi         6   5    s#ippi       1      7
9   ppi          3   2    s#issippi    6      10
10  pi           5   6    s#ppi        6      2
11  i            2   3    s#ssippi     1      3
GRANK is aligned with the T[i] column (text order); HRANK is aligned with the SA column (rank order).
Search in (1,2)-gSA(3,1)
For example, if P = mis (p0, p1, p2), three searches
are needed:
- search mi (p0, p1) in the SA
- search is (p1, p2) in the SA
- search m#s (p0, p2) in the gSA
P = cot
(1,2)-gSA(3,1)   c#t                    #ot        co#
Searching array  in the (1,2)-gSA(3,1)  in the SA  in the SA
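The three searches can be put together as follows. This is a sketch for patterns of length 3 and the (1,2) gap only; it uses plain binary search rather than the LCP-assisted search the deck mentions, builds both arrays naively, and the helper names `lower_bound`, `find` and `one_mismatch_search` are assumptions.

```python
def lower_bound(lo, hi, pred):
    # Smallest index in [lo, hi) where pred holds (pred flips False -> True once).
    while lo < hi:
        mid = (lo + hi) // 2
        if pred(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


def find(index, key_of, target):
    # Entries of a sorted index whose (truncated) key equals the target.
    first = lower_bound(0, len(index), lambda r: key_of(index[r]) >= target)
    last = lower_bound(first, len(index), lambda r: key_of(index[r]) > target)
    return index[first:last]


def one_mismatch_search(text, sa, gsa, pattern):
    # Start positions where the text matches a length-3 pattern with at most
    # one mismatch. Assumes gsa is the (1,2)-gSA of text.
    n, m = len(text), len(pattern)
    hits = set()
    # c#t: first and last character, looked up in the gapped suffix array.
    hits.update(find(gsa, lambda i: (text[i:i + 1], text[i + 2:i + 3]),
                     (pattern[0], pattern[2])))
    # #ot: last two characters in the SA; the match starts one position earlier.
    hits.update(j - 1 for j in find(sa, lambda i: text[i:i + 2], pattern[1:])
                if j >= 1)
    # co#: first two characters in the SA; the window must fit inside the text.
    hits.update(i for i in find(sa, lambda i: text[i:i + 2], pattern[:2])
                if i + m <= n)
    return sorted(hits)


T = "mississippi"
SA = sorted(range(len(T)), key=lambda i: T[i:])
GSA = sorted(range(len(T)), key=lambda i: (T[i:i + 1], T[i + 2:]))
matches = one_mismatch_search(T, SA, GSA, "mis")
```

For P = mis the gSA lookup finds the exact occurrence and the SA lookup for `is` adds the window `sis` at position 3.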
Application
Platform and Language
1. Language: C#
2. Platform: Microsoft .NET (.NET Framework v4.0)
Algorithms
1. Construction of suffix array with LCP
- Radix sort
- Skew algorithm
2. Construction of gapped suffix array with gLCP
- Radix sort
3. Approximate string search
- pattern analysis
- binary search with LCP
Gapped Suffix Array
Going beyond gSA
Limitation of gSA
Assuming k = 1 and the gap ends at position m - 1:
P = coat, (2,3)-gSA(4,1)
Pattern          #oat   c#at            co#t      coa#
Searching array  SA     Cannot support  gSA(4,1)  SA
P = coast, (3,4)-gSA(5,1)
Pattern          #oast  c#ast           co#st           coa#t     coas#
Searching array  SA     Cannot support  Cannot support  gSA(5,1)  SA
Countermeasure
P = coat, (2,3)-gSA(4,1)
Pattern          #oat   c#at      co#t      coa#
Searching array  SA     gSA(3,1)  gSA(4,1)  SA
P = coast, (3,4)-gSA(5,1)
Pattern          #oast  c#ast     co#st     coa#t     coas#
Searching array  SA     gSA(3,1)  gSA(4,1)  gSA(5,1)  SA
Countermeasure
P = cot: c#t, #ot, co#
gSA(3, 1) -> SA, gSA(3, 1)
P = coat: #oat, c#at, co#t, coa#
gSA(4, 1) -> SA, gSA(3, 1), gSA(4, 1)
P = coast: #oast, c#ast, co#st, coa#t, coas#
gSA(5, 1) -> SA, gSA(3, 1), gSA(4, 1), gSA(5, 1)
P = coasts: #oasts, c#asts, co#sts, coa#ts, coas#s, coast#
gSA(6, 1) -> SA, gSA(3, 1), gSA(4, 1), gSA(5, 1), gSA(6, 1)
gSA(m, 1) -> SA, gSA(3, 1) … gSA(m-2, 1), gSA(m-1, 1), gSA(m, 1)
Theorem: If the length of the gap is 1, the required
number of gSAs is |m - 2|, and both construction and
search can be performed in linear time.
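The count in the theorem can be illustrated by listing, for each single-mismatch position, which index would serve it. The mapping below is an assumption read off the countermeasure tables: a mismatch at either end of the pattern reduces to an SA search, and a mismatch at interior position j (0-based) uses the (j, j+1)-gSA(j+2, 1).

```python
def index_for_each_mismatch(pattern):
    # Map every single-mismatch variant of the pattern to the index that
    # would answer it, following the naming used in the deck's tables.
    m = len(pattern)
    plan = {}
    for j in range(m):
        masked = pattern[:j] + "#" + pattern[j + 1:]
        if j == 0 or j == m - 1:
            plan[masked] = "SA"
        else:
            plan[masked] = f"({j},{j + 1})-gSA({j + 2},1)"
    return plan


plan = index_for_each_mismatch("coast")
gapped_indexes = sorted({name for name in plan.values() if name != "SA"})
```

For a pattern of length m = 5 this yields three distinct gapped suffix arrays, in line with the m - 2 stated in the theorem.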
Total count of required gSAs
gSA(m, p)   Required gapped suffix arrays
gSA(3,1) -> SA, gSA(3,1)
gSA(4,1) -> SA, gSA(3,1), gSA(4,1)
gSA(4,2) -> SA, gSA(3,1), gSA(4,2)
gSA(5,1) -> SA, gSA(3,1), gSA(4,1), gSA(5,1)
gSA(5,2) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,2)
gSA(5,3) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,3)
gSA(6,1) -> SA, gSA(3,1), gSA(4,1), gSA(5,1), gSA(6,1)
gSA(6,2) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(6,2)
gSA(6,3) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,3)
gSA(6,4) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,4)
gSA(7,1) -> SA, gSA(3,1), gSA(4,1), gSA(5,1), gSA(6,1), gSA(7,1)
gSA(7,2) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(6,1), gSA(6,2), gSA(7,2)
gSA(7,3) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,1), gSA(6,2), gSA(6,3), gSA(7,3)
gSA(7,4) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,1), gSA(6,2), gSA(6,3), gSA(6,4), gSA(7,4)
gSA(7,5) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gS…
gC = total count of required gSAs
$$ gC = \sum_{i=1}^{p-1} \begin{cases} k - i & \text{if } k - i > 0 \\ 1 & \text{otherwise} \end{cases} $$
Multiple gaps, for various m
P = coat: ##at, #o#t, #oa#, c##t, c#a#, co##
gSA(4,2) -> SA, gSA(3,1), gSA(4,2)
P = coast: ##ast, #o#st, #oa#t, #oas#, c##st, c#a#t, c#as#, co##t, co#s#, coa##
gSA(5,2) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,2), (1,2)(3,4)-gSA(5,2)
P = coasts: ##asts, #o#sts, #oa#ts, #oas#s, #oast#, c##sts, c#a#ts, c#as#s, c#ast#, co##ts, co#s#s, co#st#, coa##s, coa#t#, coas##
gSA(6,2) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), (1,2)(3,4)-gSA(5,2), gSA(6,2), (1,2)(4,5)-gSA(6,2), (2,3)(4,5)-gSA(6,2)
P = coasts: ###sts, ##a#ts, ##as#s, ##ast#, #o##ts, #o#s#s, #o#st#, #oa##s, #oa#t#, #oas##, c###ts, c##s#s, c##st#, c#a##s, c#a#t#, c#as##, co###s, co##t#, co#s##, coa###
gSA(6,3) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), (1,2)(3,4)-gSA(5,2), gSA(5,3), gSA(6,3), (1,3)(4,5)-gSA(6,3), (1,2)(3,5)-gSA(6,3)
Two approaches to supporting multiple gaps
1. Search the index up to the first gap of the search pattern, and
after that compare every remaining character individually.
2. Keep constructing additional multiple-gapped suffix arrays,
as in the method above.
First approach
c # a # t
r = gSA[i](3,1),T[r]
T[ r+2 ]T[ r+3 ]T[ r+4 ]
c # a s # s
r = gSA[i](3,1),T[r]
T[r+3]T[r+4]T[r+5]
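A minimal sketch of the first approach follows. It assumes the simplest variant, in which only the fragment before the first gap is located by binary search (in a plain SA here, rather than a gSA with gLCP) and every later position is checked character by character; `lower_bound`, `matches_at` and `search_masked` are hypothetical names.

```python
def lower_bound(lo, hi, pred):
    # Smallest index in [lo, hi) where pred holds (pred flips False -> True once).
    while lo < hi:
        mid = (lo + hi) // 2
        if pred(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


def matches_at(text, start, masked):
    # Character-by-character check; '#' matches any single character.
    if start + len(masked) > len(text):
        return False
    return all(c == "#" or text[start + k] == c for k, c in enumerate(masked))


def search_masked(text, sa, masked):
    # Step 1: binary search the SA for the fragment before the first gap.
    cut = masked.find("#")
    head = masked if cut < 0 else masked[:cut]
    size = len(head)
    lo = lower_bound(0, len(sa), lambda r: text[sa[r]:sa[r] + size] >= head)
    hi = lower_bound(lo, len(sa), lambda r: text[sa[r]:sa[r] + size] > head)
    # Step 2: verify the rest of the pattern at every candidate position.
    return sorted(i for i in sa[lo:hi] if matches_at(text, i, masked))


T = "mississippi"
SA = sorted(range(len(T)), key=lambda i: T[i:])
hits = search_masked(T, SA, "i#s#s")
```

The verification step is what drives the worst case: when the first fragment is short, many candidates survive the binary search and each one is compared against the remaining m - fm characters.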
Worst case for searching with it
Let fm be the length of the first fragment.
Binary search for the first fragment with gLCP: O(log n + fm)
Comparing the rest of the pattern: O((m - fm) n)
Total: O((m - fm) n + log n + fm)
Summary
Further work
The gapped suffix array only supports searching for specific
patterns.
Supporting approximate indexing in all situations
will require more research and development into
multiple gapped suffix arrays.
A future task is to study multiple gapped suffix arrays and
their efficiency.
Conclusion
The result of Crochemore and Tischler that a gSA can be
constructed in linear time has been put into practice and
confirmed.
In addition, the potential of multiple gSAs was examined,
with the conclusion that this is an area requiring
more research.
Q&A
