SlideShare a Scribd company logo
1 of 270
Download to read offline
Semi-local string comparison:
              Algorithmic techniques and applications


                                    Alexander Tiskin

                                Department of Computer Science
                                     University of Warwick
                             http://go.warwick.ac.uk/alextiskin




Alexander Tiskin (Warwick)           Semi-local string comparison   1 / 132
1   Introduction                                7    Sparse string comparison

2   Matrix distance multiplication              8    Compressed string comparison

3   Semi-local string comparison                9    Beyond semi-locality

4   The seaweed method                          10   Conclusions and future work

5   Periodic string comparison

6   The transposition network
    method



    Alexander Tiskin (Warwick)   Semi-local string comparison                      2 / 132
1   Introduction                                7    Sparse string comparison

2   Matrix distance multiplication              8    Compressed string comparison

3   Semi-local string comparison                9    Beyond semi-locality

4   The seaweed method                          10   Conclusions and future work

5   Periodic string comparison

6   The transposition network
    method



    Alexander Tiskin (Warwick)   Semi-local string comparison                      3 / 132
Introduction

String matching: finding an exact pattern in a string
String comparison: finding similar patterns in two strings
Applications: computational biology, image recognition, . . .




  Alexander Tiskin (Warwick)   Semi-local string comparison     4 / 132
Introduction

String matching: finding an exact pattern in a string
String comparison: finding similar patterns in two strings
Applications: computational biology, image recognition, . . .
Standard types of string comparison:
     global: whole string vs whole string
     local: substrings vs substrings
Main focus of this work:
     semi-local: whole string vs substrings; prefixes vs suffixes
Closely related to approximate string matching (no relation to
approximation algorithms!)
Main tool: implicit unit-Monge matrices (a.k.a. seaweed matrices)

  Alexander Tiskin (Warwick)   Semi-local string comparison         4 / 132
Introduction
Terminology and notation


x− = x −       1
               2        x+ = x +     1
                                     2
Integers: {. . . − 2, −1, 0, 1, 2, . . .}
Half-integers:         . . . − 3 , − 1 , 1 , 2 , 2 , . . . = . . . (−2)+ , (−1)+ , 0+ , 1+ , 2+
                               2     2 2
                                             3 5

(i, j)      (i , j ) iff i < i and j < j                  (i, j)         (i , j ) iff i > i and j < j
A permutation matrix is a 0/1 matrix with exactly one nonzero per row
and per column
        
 0 1 0
1 0 0
 0 0 1




   Alexander Tiskin (Warwick)            Semi-local string comparison                                 5 / 132
Introduction
Terminology and notation


Given matrix D, its distribution matrix is made up of          -dominance sums:




   Alexander Tiskin (Warwick)   Semi-local string comparison                 6 / 132
Introduction
Terminology and notation


Given matrix D, its distribution matrix is made up of          -dominance sums:
Given matrix E , its density matrix is made up of quadrangle differences:
E (ˆ, ) = E (ˆ− , + ) − E (ˆ− , − ) − E (ˆ+ , + ) + E (ˆ+ , − )
    ı ˆ       ı ˆ            ı ˆ            ı ˆ            ı ˆ
where D Σ , E over integers; D, E         over half-integers




   Alexander Tiskin (Warwick)   Semi-local string comparison                 6 / 132
Introduction
Terminology and notation


Given matrix D, its distribution matrix is made up of          -dominance sums:
Given matrix E , its density matrix is made up of quadrangle differences:
E (ˆ, ) = E (ˆ− , + ) − E (ˆ− , − ) − E (ˆ+ , + ) + E (ˆ+ , − )
    ı ˆ       ı ˆ            ı ˆ            ı ˆ            ı ˆ
where D Σ , E over integers; D, E over half-integers
                                            
        Σ      0 1 2 3           0 1 2 3                 
 0 1 0          0 1 1 2         0 1 1 2           0 1 0
1 0 0 = 
                                  0 0 0 1 = 1 0 0
                                                        
                0 0 0 1
 0 0 1                                                0 0 1
                 0 0 0 0           0 0 0 0




   Alexander Tiskin (Warwick)   Semi-local string comparison                 6 / 132
Introduction
Terminology and notation


Given matrix D, its distribution matrix is made up of          -dominance sums:
Given matrix E , its density matrix is made up of quadrangle differences:
E (ˆ, ) = E (ˆ− , + ) − E (ˆ− , − ) − E (ˆ+ , + ) + E (ˆ+ , − )
    ı ˆ       ı ˆ            ı ˆ            ı ˆ            ı ˆ
where D Σ , E over integers; D, E over half-integers
                                            
        Σ      0 1 2 3           0 1 2 3                 
 0 1 0          0 1 1 2         0 1 1 2           0 1 0
1 0 0 = 
                                  0 0 0 1 = 1 0 0
                                                        
                0 0 0 1
 0 0 1                                                0 0 1
                 0 0 0 0           0 0 0 0
(D Σ ) = D for all D
Matrix E is simple, if (E )Σ = E ; equivalently, if it has all zeros in the left
column and bottom row



   Alexander Tiskin (Warwick)   Semi-local string comparison                 6 / 132
Introduction
Terminology and notation


Matrix E is Monge, if E         is nonnegative
Intuition: boundary-to-boundary distances in a (weighted) planar graph
Matrix E is unit-Monge, if E         is a permutation matrix
Intuition: boundary-to-boundary distances in a grid-like graph




   Alexander Tiskin (Warwick)     Semi-local string comparison           7 / 132
Introduction
Terminology and notation


Matrix E is Monge, if E               is nonnegative
Intuition: boundary-to-boundary distances in a (weighted) planar graph
Matrix E is unit-Monge, if E                is a permutation matrix
Intuition: boundary-to-boundary distances in a grid-like graph
Simple unit-Monge matrix: P Σ , where P is a permutation matrix
Seaweed matrix: P               used as an implicit representation of P Σ
                                        
       Σ      0                1 2 3
 0 1 0        0
1 0 0 =                       1 1 2  
              0                 0 0 1
 0 0 1
                0                0 0 0




   Alexander Tiskin (Warwick)            Semi-local string comparison       7 / 132
Introduction
Implicit unit-Monge matrices


Efficient P Σ queries: range tree on nonzeros of P                             [Bentley: 1980]
         binary search tree by i-coordinate
         under every node, binary search tree by j-coordinate
     •                            •                       •
 •               −→         •                 −→      •
         •                            •                       •
             •                            •                       •
      ↓
     •                            •                       •
 •               −→         •                 −→      •
         •                            •                       •
             •                            •                       •
      ↓
     •                            •                       •
 •               −→         •                 −→      •
         •                            •                       •
             •                            •                       •

     Alexander Tiskin (Warwick)               Semi-local string comparison               8 / 132
Introduction
Implicit unit-Monge matrices


Efficient P Σ queries: (contd.)
Every node of the range tree represents a canonical range (rectangular
region), and stores its nonzero count
Overall, ≤ n log n canonical ranges are non-empty
A P Σ query is equivalent to -dominance counting: how many nonzeros
are -dominated by query point?
Answer: sum up nonzero counts in ≤ log2 n disjoint canonical ranges
Total size O(n log n), query time O(log2 n)




   Alexander Tiskin (Warwick)   Semi-local string comparison             9 / 132
Introduction
Implicit unit-Monge matrices


Efficient P Σ queries: (contd.)
Every node of the range tree represents a canonical range (rectangular
region), and stores its nonzero count
Overall, ≤ n log n canonical ranges are non-empty
A P Σ query is equivalent to -dominance counting: how many nonzeros
are -dominated by query point?
Answer: sum up nonzero counts in ≤ log2 n disjoint canonical ranges
Total size O(n log n), query time O(log2 n)
There are asymptotically more efficient (but less practical) data structures
                                      log n
Total size O(n), query time O       log log n                          [J´J´+: 2004]
                                                                         a a
                                                               [Chan, Pˇtra¸cu: 2010]
                                                                       a s


   Alexander Tiskin (Warwick)   Semi-local string comparison                     9 / 132
1   Introduction                                7    Sparse string comparison

2   Matrix distance multiplication              8    Compressed string comparison

3   Semi-local string comparison                9    Beyond semi-locality

4   The seaweed method                          10   Conclusions and future work

5   Periodic string comparison

6   The transposition network
    method



    Alexander Tiskin (Warwick)   Semi-local string comparison                      10 / 132
Matrix distance multiplication
Seaweed braids


Distance algebra (a.k.a (min, +) or tropical algebra):
       addition ⊕ given by min
       multiplication            given by +
Matrix        -multiplication
A      B=C               C (i, k) =    j   A(i, j)       B(j, k) = minj A(i, j) + B(j, k)
Matrix classes closed under                -multiplication (for given n):
       general numerical (integer, real) matrices
       Monge matrices
       simple unit-Monge matrices (!)




    Alexander Tiskin (Warwick)         Semi-local string comparison                    11 / 132
Matrix distance multiplication
Seaweed braids


Recall that simple unit-Monge matrices are represented implicitly by
permutation (seaweed) matrices
Define PA                       Σ
                   PB = PC as PA        Σ    Σ
                                       PB = PC
The seaweed monoid Tn :
      simple unit-Monge matrices under
      equivalently, permutation (seaweed) matrices under
Also known as the 0-Hecke monoid of the symmetric group H0 (Sn )




   Alexander Tiskin (Warwick)      Semi-local string comparison        12 / 132
Matrix distance multiplication
Seaweed braids


PA           PB = PC can be seen as combing of seaweed braids

     •                            •                     •
         •               •                         •
                 •           •                                  •
 •                                    •        •
                     •                    •                 •
             •                   •                     •
         PA                      PB                    PC




     Alexander Tiskin (Warwick)               Semi-local string comparison   13 / 132
Matrix distance multiplication
Seaweed braids


PA           PB = PC can be seen as combing of seaweed braids

     •                            •                     •
         •               •                         •
                 •           •                                  •
 •                                    •        •
                     •                    •                 •
             •                   •                     •
         PA                      PB                    PC


PA



PB



     Alexander Tiskin (Warwick)               Semi-local string comparison   13 / 132
Matrix distance multiplication
Seaweed braids


PA           PB = PC can be seen as combing of seaweed braids

     •                            •                     •
         •               •                         •
                 •           •                                  •
 •                                    •        •
                     •                    •                 •
             •                   •                     •
         PA                      PB                    PC


PA



PB



     Alexander Tiskin (Warwick)               Semi-local string comparison   13 / 132
Matrix distance multiplication
Seaweed braids


PA           PB = PC can be seen as combing of seaweed braids

     •                            •                     •
         •               •                         •
                 •           •                                  •
 •                                    •        •
                     •                    •                 •
             •                   •                     •
         PA                      PB                    PC


PA

                                                                             PC

PB



     Alexander Tiskin (Warwick)               Semi-local string comparison        13 / 132
Matrix distance multiplication
Seaweed braids


The seaweed monoid Tn :

      n! elements (permutations of size n)
      n − 1 generators g1 , g2 , . . . , gn−1 (elementary crossings)

Idempotence:
gi2 = gi for all i                                                           =


Far commutativity:
gi gj = gj gi j − i > 1                                        ···   =       ···

Braid relations:
gi gj gi = gj gi gj j − i = 1                                            =


   Alexander Tiskin (Warwick)   Semi-local string comparison                       14 / 132
Matrix distance multiplication
Seaweed braids


Identity: 1        x =x
           
   • · · ·
  · • · ·
  · · • · =
1=         

    · · · •

Zero: 0        x =0
           
    · · · •
  · · • ·
0=
  · • · · =
            

   • · · ·


   Alexander Tiskin (Warwick)   Semi-local string comparison   15 / 132
Matrix distance multiplication
Seaweed braids


Related structures:
      positive braids: far comm; braid relations
      braids: gi gi−1 = 1; far comm; braid relations
      Coxeter’s presentation of Sn : gi2 = 1; far comm; braid relations
      locally free idempotent monoid: idem; far comm            [Vershik+: 2000]
Generalisations:
      general 0-Hecke monoids                 [Fomin, Greene: 1998; Buch+: 2008]
      Coxeter monoids           [Tsaranov: 1990; Richardson, Springer: 1990]
      J -trivial monoids                                        [Denton+: 2011]




   Alexander Tiskin (Warwick)   Semi-local string comparison                16 / 132
Matrix distance multiplication
Seaweed braids


Computation in the seaweed monoid: a confluent rewriting system can be
obtained by software (Semigroupe, GAP)




   Alexander Tiskin (Warwick)   Semi-local string comparison      17 / 132
Matrix distance multiplication
Seaweed braids


Computation in the seaweed monoid: a confluent rewriting system can be
obtained by software (Semigroupe, GAP)
T3 : 1, a = g1 , b = g2 ; ab, ba, aba = 0

aa → a                          bb → b                  bab → 0         aba → 0




   Alexander Tiskin (Warwick)            Semi-local string comparison             17 / 132
Matrix distance multiplication
Seaweed braids


Computation in the seaweed monoid: a confluent rewriting system can be
obtained by software (Semigroupe, GAP)
T3 : 1, a = g1 , b = g2 ; ab, ba, aba = 0

aa → a                          bb → b                   bab → 0         aba → 0

T4 : 1, a = g1 , b = g2 , c = g3 ; ab, ac, ba, bc, cb, aba, abc, acb, bac,
bcb, cba, abac, abcb, acba, bacb, bcba, abacb, abcba, bacba, abacba = 0

aa → a                          ca → ac                  bab → aba       cbac → bcba
bb → b                          cc → c                   cbc → bcb       abacba → 0




   Alexander Tiskin (Warwick)             Semi-local string comparison             17 / 132
Matrix distance multiplication
Seaweed braids


Computation in the seaweed monoid: a confluent rewriting system can be
obtained by software (Semigroupe, GAP)
T3 : 1, a = g1 , b = g2 ; ab, ba, aba = 0

aa → a                          bb → b                   bab → 0         aba → 0

T4 : 1, a = g1 , b = g2 , c = g3 ; ab, ac, ba, bc, cb, aba, abc, acb, bac,
bcb, cba, abac, abcb, acba, bacb, bcba, abacb, abcba, bacba, abacba = 0

aa → a                          ca → ac                  bab → aba       cbac → bcba
bb → b                          cc → c                   cbc → bcb       abacba → 0

Easy to use, but not an efficient algorithm


   Alexander Tiskin (Warwick)             Semi-local string comparison             17 / 132
Matrix distance multiplication
Seaweed matrix multiplication


The implicit unit-Monge matrix              -multiplication problem
Given permutation matrices PA , PB , compute PC , such that
 Σ     Σ     Σ
PA PB = PC (equivalently, PA PB = PC )




   Alexander Tiskin (Warwick)   Semi-local string comparison          18 / 132
Matrix distance multiplication
Seaweed matrix multiplication


The implicit unit-Monge matrix                      -multiplication problem
Given permutation matrices PA , PB , compute PC , such that
 Σ     Σ     Σ
PA PB = PC (equivalently, PA PB = PC )

Matrix        -multiplication: running time
type                            time
general                         O(n3 )                                               standard
                                    3          3
                                O n (log log n)
                                       log2 n
                                                                                [Chan: 2007]
Monge                           O(n2 )                                 via [Aggarwal+: 1987]
implicit unit-Monge             O(n1.5 )                                            [T: 2006]
                                O(n log n)                                          [T: 2010]



   Alexander Tiskin (Warwick)           Semi-local string comparison                     18 / 132
Matrix distance multiplication
Seaweed matrix multiplication

                                             PB
                                     ••
                                •             ••
                                    ••             •
                                                       • •
                                         •        •
                                •                       •
                                             ••       • •
                                                         • •

    ••   •
   •      • •
     • •
                  •
           ••
 ••
      •
                ••                            ?
    •          •
       •     •
         PA                                  PC


   Alexander Tiskin (Warwick)                         Semi-local string comparison   19 / 132
Matrix distance multiplication
Seaweed matrix multiplication

                                    PB,lo , PB,hi
                                     ••
                                •             ••
                                    ••             •
                                                       • •
                                         •        •
                                •                       •
                                             ••       • •
                                                         • •

    ••   •
   •      • •
     • •
                  •
           ••
 ••             ••
      •        •
    •        •
       •
   PA,lo , PA,hi


   Alexander Tiskin (Warwick)                         Semi-local string comparison   20 / 132
Matrix distance multiplication
Seaweed matrix multiplication

                                    PB,lo , PB,hi
                                     ••
                                •             ••
                                    ••             •
                                                       • •
                                         •        •
                                •                       •
                                             ••       • •
                                                         • •

    ••   •                                    ••
   •      • •                            •
     • •                                           •        •
                  •
           ••
 ••             ••              • •
      •        •                  •
    •        •                   •
       •                                               •
   PA,lo , PA,hi


   Alexander Tiskin (Warwick)                         Semi-local string comparison   20 / 132
Matrix distance multiplication
Seaweed matrix multiplication

                                    PB,lo , PB,hi
                                     ••
                                •             ••
                                    ••             •
                                                       • •
                                         •        •
                                •                       •
                                             ••       • •
                                                         • •

    ••   •                               •
   •      • •                                     •
     • •                                                   •
                  •                                     • •
           ••                   •
 ••             ••                                    • •
      •        •                              •
    •        •                               •
       •
   PA,lo , PA,hi


   Alexander Tiskin (Warwick)                         Semi-local string comparison   20 / 132
Matrix distance multiplication
Seaweed matrix multiplication

                                    PB,lo , PB,hi
                                      ••
                                •             ••
                                    ••             •
                                                       • •
                                         •        •
                                •                       •
                                             ••       • •
                                                         • •

    ••   •                               •• •
                                             •
   •      • •                                 •
     • •                                       •         •
                  •                                       •
           ••                    •                      • •
 ••             ••              • •                   • •
      •        •                   •
    •        •                    • ••
       •                                               •
   PA,lo , PA,hi                    PC ,lo + PC ,hi


   Alexander Tiskin (Warwick)                         Semi-local string comparison   20 / 132
Matrix distance multiplication
Seaweed matrix multiplication

                                    PB,lo , PB,hi
                                      ••
                                •             ••
                                    ••             •
                                                       • •
                                         •        •
                                •                       •
                                             ••       • •
                                                         • •

    ••   •                               •• •
                                             •
   •      • •                                 •
     • •                                       •         •
                  •                                       •
           ••                    •                      • •
 ••             ••              • •                   • •
      •        •                   •
    •        •                    • ••
       •                                               •
   PA,lo , PA,hi                    PC ,lo + PC ,hi


   Alexander Tiskin (Warwick)                         Semi-local string comparison   21 / 132
Matrix distance multiplication
Seaweed matrix multiplication

                                    PB,lo , PB,hi
                                     ••
                                •             ••
                                    ••             •
                                                       • •
                                         •        •
                                •                       •
                                             ••       • •
                                                         • •

    ••   •                                    ••          •
   •      • •                            •             • •
     • •                                          ••
                  •                                     • •
           ••                        •
 ••             ••              • •                   • •
      •        •                   •
    •        •                    • ••
       •                         •
   PA,lo , PA,hi                             PC


   Alexander Tiskin (Warwick)                         Semi-local string comparison   21 / 132
Matrix distance multiplication
Seaweed matrix multiplication

                                             PB
                                     ••
                                •             ••
                                    ••             •
                                                       • •
                                         •        •
                                •                       •
                                             ••       • •
                                                         • •

    ••   •                                    ••        •
   •      • •                            •           • •
     • •                                          ••
                  •                                   • •
           ••                        •
 ••             ••              • •                   • •
      •        •                   •
    •        •                    • ••
       •                         •
         PA                                  PC


   Alexander Tiskin (Warwick)                         Semi-local string comparison   22 / 132
Matrix distance multiplication
Seaweed matrix multiplication


Implicit unit-Monge matrix              -multiplication: the algorithm
 Σ                Σ           Σ
PC (i, k) = minj PA (i, j) + PB (j, k)
Divide-and-conquer on the range of j
Divide PA horizontally, PB vertically: two subproblems of effective size n/2
 Σ
PA,lo       Σ       Σ
           PB,lo = PC ,lo        Σ
                                PA,hi      Σ       Σ
                                          PB,hi = PC ,hi
Conquer:
   -low nonzeros of PC ,lo and          -high nonzeros of PC ,hi appear in PC
The remaining nonzeros of PC ,lo and PC ,hi are “wrong”, and need to be
corrected to obtain the remaining nonzeros of PC
Correction can be done in time O(n) using the unit-Monge property
Overall time O(n log n)

   Alexander Tiskin (Warwick)      Semi-local string comparison                 23 / 132
Matrix distance multiplication
Bruhat order


Comparing permutations by the “degree of sortedness”
Bruhat order
Permutation A is lower (“more sorted”) than permutation B in the Bruhat
order (A B), if B can be transformed to A by successive pairwise sorting
between arbitrary pairs of elements.

Permutation matrices: PA PB , if PB can be transformed to PA by
successive submatrix substitution: ( 0 1 )
                                     10    (1 0)
                                            01




   Alexander Tiskin (Warwick)   Semi-local string comparison        24 / 132
Matrix distance multiplication
Bruhat order


Bruhat comparability: running time
O(n2 )                                                                  folklore
O(n log n)                                                            [T: NEW]

PA      PB iff PA ≤ PB elementwise, time O(n2 )
               Σ    Σ                                                   folklore
                   R               R
PA      PB iff     PA        PB = Id , time O(n log n)                 [T: NEW]
where P R denotes clockwise rotation of matrix P




   Alexander Tiskin (Warwick)          Semi-local string comparison         25 / 132
1   Introduction                                7    Sparse string comparison

2   Matrix distance multiplication              8    Compressed string comparison

3   Semi-local string comparison                9    Beyond semi-locality

4   The seaweed method                          10   Conclusions and future work

5   Periodic string comparison

6   The transposition network
    method



    Alexander Tiskin (Warwick)   Semi-local string comparison                      26 / 132
Semi-local string comparison
Semi-local LCS and edit distance


Consider strings (= sequences) over an alphabet of size σ
Distinguish contiguous substrings and not necessarily contiguous
subsequences
Special cases of substring: prefix, suffix
Notation: strings a, b of length m, n respectively
Assume where necessary: m ≤ n; m, n reasonably close




   Alexander Tiskin (Warwick)      Semi-local string comparison    27 / 132
Semi-local string comparison
Semi-local LCS and edit distance


Consider strings (= sequences) over an alphabet of size σ
Distinguish contiguous substrings and not necessarily contiguous
subsequences
Special cases of substring: prefix, suffix
Notation: strings a, b of length m, n respectively
Assume where necessary: m ≤ n; m, n reasonably close
The longest common subsequence (LCS) score:
      length of longest string that is a subsequence of both a and b
      equivalently, alignment score, where score(match) = 1 and
      score(mismatch) = 0
In biological terms, “loss-free alignment” (unlike “lossy” BLAST)

   Alexander Tiskin (Warwick)      Semi-local string comparison        27 / 132
Semi-local string comparison
Semi-local LCS and edit distance


The LCS problem
Give the LCS score for a vs b




   Alexander Tiskin (Warwick)      Semi-local string comparison   28 / 132
Semi-local string comparison
Semi-local LCS and edit distance


The LCS problem
Give the LCS score for a vs b

LCS: running time
O(mn)                                                                      [Wagner, Fischer:    1974]
    mn
O log2 n                    σ = O(1)                                      [Masek, Paterson:     1980]
                                                                              [Crochemore+:     2003]
     mn(log log n)2
O       log2 n
                                                                         [Paterson, Danˇ´cık:   1994]
                                                                      [Bille, Farach-Colton:    2008]

Running time varies depending on the RAM model version
We assume word-RAM with word size log n (where it matters)


    Alexander Tiskin (Warwick)         Semi-local string comparison                              28 / 132
Semi-local string comparison
Semi-local LCS and edit distance


LCS on the alignment graph (directed, acyclic)
  B A A B C A B C A B A C A                                       blue = 0
B                                                                  red = 1
A
A
B
C
B
C
A
score(“BAABCBCA”, “BAABCABCABACA”) = len(“BAABCBCA”) = 8
LCS = highest-score path from top-left to bottom-right


   Alexander Tiskin (Warwick)      Semi-local string comparison       29 / 132
Semi-local string comparison
Semi-local LCS and edit distance


LCS: dynamic programming                                              [WF: 1974]
Sweep cells in any              -compatible order
Cell update: time O(1)
Overall time O(mn)




   Alexander Tiskin (Warwick)          Semi-local string comparison         30 / 132
Semi-local string comparison
Semi-local LCS and edit distance



                                     ‘Begin at the beginning,’ the King said
                                     gravely, ‘and go on till you come to the end:
                                     then stop.’
                                                      L. Carroll, Alice in Wonderland
                                     (The standard approach in dynamic
                                     programming)




   Alexander Tiskin (Warwick)      Semi-local string comparison                31 / 132
Semi-local string comparison
Semi-local LCS and edit distance




Sometimes dynamic programming can be run from both ends for extra
flexibility




   Alexander Tiskin (Warwick)      Semi-local string comparison     32 / 132
Semi-local string comparison
Semi-local LCS and edit distance




Sometimes dynamic programming can be run from both ends for extra
flexibility
Is there a better, fully flexible alternative (e.g. for comparing compressed
strings, comparing strings dynamically or in parallel, etc.)?

   Alexander Tiskin (Warwick)      Semi-local string comparison          32 / 132
Semi-local string comparison
Semi-local LCS and edit distance


LCS: micro-block dynamic programming                                    [MP: 1980; BF: 2008]
Sweep cells in micro-blocks, in any                  -compatible order
Micro-block size:
      t = O(log n) when σ = O(1)
                    log n
      t=O         log log n     otherwise
Micro-block interface:
      O(t) characters, each O(log σ) bits, can be reduced to O(log t) bits
      O(t) small integers, each O(1) bits
Micro-block update: time O(1), by precomputing all possible interfaces
                          mn                                  mn(log log n)2
Overall time O          log2 n
                                  when σ = O(1), O               log2 n
                                                                               otherwise

   Alexander Tiskin (Warwick)          Semi-local string comparison                        33 / 132
Semi-local string comparison
Semi-local LCS and edit distance


The semi-local LCS problem
Give the (implicit) matrix of O (m + n)2 LCS scores:
      string-substring LCS: string a vs every substring of b
      prefix-suffix LCS: every prefix of a vs every suffix of b
      suffix-prefix LCS: every suffix of a vs every prefix of b
      substring-string LCS: every substring of a vs string b




   Alexander Tiskin (Warwick)      Semi-local string comparison   34 / 132
Semi-local string comparison
Semi-local LCS and edit distance


The semi-local LCS problem
Give the (implicit) matrix of O (m + n)2 LCS scores:
      string-substring LCS: string a vs every substring of b
      prefix-suffix LCS: every prefix of a vs every suffix of b
      suffix-prefix LCS: every suffix of a vs every prefix of b
      substring-string LCS: every substring of a vs string b

Cf.: dynamic programming gives prefix-prefix LCS




   Alexander Tiskin (Warwick)      Semi-local string comparison   34 / 132
Semi-local string comparison
Semi-local LCS and edit distance


Semi-local LCS on the alignment graph
  B A A B C A B C A B A C A                                       blue = 0
B                                                                  red = 1
A
A
B
C
B
C
A
score(“BAABCBCA”, “CABCABA”) = len(“ABCBA”) = 5
String-substring LCS: all highest-score top-to-bottom paths
Semi-local LCS: all highest-score boundary-to-boundary paths

   Alexander Tiskin (Warwick)      Semi-local string comparison       35 / 132
Semi-local string comparison
Score matrices and seaweed matrices


The score matrix H
 0     1   2   3    4   5   6     6   7   8   8   8    8    8           a = “BAABCBCA”
 -1 0      1   2    3   4   5     5   6   7   7   7    7    7
 -2 -1 0       1    2   3   4     4   5   6   6   6    6    7
                                                                        b = “BAABCABCABACA”
 -3 -2 -1 0         1   2   3     3   4   5   5   6    6    7           H(i, j) = score(a, b i : j )
 -4 -3 -2 -1 0          1   2     2   3   4   4   5    5    6
                                                                        H(4, 11) = 5
 -5 -4 -3 -2 -1 0           1     2   3   4   4   5    5    6
 -6 -5 -4 -3 -2 -1 0              1   2   3   3   4    4    5           H(i, j) = j − i if i > j
 -7 -6 -5 -4 -3 -2 -1 0               1   2   2   3    3    4
 -8 -7 -6 -5 -4 -3 -2 -1 0                1   2   3    3    4
 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                 1   2    3    4
-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                  1    2    3
-11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                   1    2
-12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                    1
-13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0

     Alexander Tiskin (Warwick)               Semi-local string comparison                             36 / 132
Semi-local string comparison
Score matrices and seaweed matrices


Semi-local LCS: output representation and running time
size               query time
O(n2 )             O(1)                                                                 trivial
O(m1/2 n)          O(log n)    string-substring                               [Alves+: 2003]
O(n)               O(n)        string-substring                               [Alves+: 2005]
O(n log n)         O(log2 n)                                                        [T: 2006]
  . . . or any    2D orthogonal range counting data structure
running time
O(mn2 )                                                                                 naive
O(mn)                       string-substring                   [Schmidt: 1998; Alves+: 2005]
O(mn)                                                                               [T: 2006]
     mn
O log0.5 n                                                                          [T: 2006]
     mn(log log n)2
O       log2 n
                                                                                    [T: 2007]

    Alexander Tiskin (Warwick)         Semi-local string comparison                       37 / 132
Semi-local string comparison
Score matrices and seaweed matrices


The score matrix H and the seaweed matrix P
H(i, j): the number of matched characters for a vs substring b i : j
j − i − H(i, j): the number of unmatched characters
Properties of matrix j − i − H(i, j):
      simple unit-Monge
      therefore, = P Σ , where P = −H               is a permutation matrix
P is the seaweed matrix, giving an implicit representation of H
Range tree for P: memory O(n log n), query time O(log2 n)




   Alexander Tiskin (Warwick)    Semi-local string comparison                 38 / 132
Semi-local string comparison
Score matrices and seaweed matrices


The score matrix H and the seaweed matrix P
 0     1   2   3    4   5   6       6   7   8       8       8       8       8   a = “BAABCBCA”
 -1 0      1   2    3   4   5       5   6   7       7       7       7       7
                                                                        •       b = “BAABCABCABACA”
 -2 -1 0       1    2   3   4       4   5   6       6       6       6       7
                                                        •
 -3 -2 -1 0         1   2   3       3   4   5       5       6       6       7   H(i, j) = score(a, b i : j )
 -4 -3 -2 -1 0          1   2       2   3   4       4       5       5       6
                                •                                               H(4, 11) = 5
 -5 -4 -3 -2 -1 0           1       2   3   4       4       5       5       6
 -6 -5 -4 -3 -2 -1 0                1   2   3       3       4       4       5   H(i, j) = j − i if i > j
 -7 -6 -5 -4 -3 -2 -1 0                 1   2       2       3       3       4
                                                •
 -8 -7 -6 -5 -4 -3 -2 -1 0                  1       2       3       3       4
                                                                •
 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                       1       2       3       4
-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                            1       2       3
-11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                                1       2
-12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                                    1
-13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0

     Alexander Tiskin (Warwick)                     Semi-local string comparison                               39 / 132
Semi-local string comparison
Score matrices and seaweed matrices


The score matrix H and the seaweed matrix P
 0     1   2   3    4   5   6       6   7   8       8       8       8       8   a = “BAABCBCA”
 -1 0      1   2    3   4   5       5   6   7       7       7       7       7
                                                                        •       b = “BAABCABCABACA”
 -2 -1 0       1    2   3   4       4   5   6       6       6       6       7
                                                        •
 -3 -2 -1 0         1   2   3       3   4   5       5       6       6       7   H(i, j) = score(a, b i : j )
 -4 -3 -2 -1 0          1   2       2   3   4       4       5       5       6
                                •                                               H(4, 11) = 5
 -5 -4 -3 -2 -1 0           1       2   3   4       4       5       5       6
 -6 -5 -4 -3 -2 -1 0                1   2   3       3       4       4       5   H(i, j) = j − i if i > j
 -7 -6 -5 -4 -3 -2 -1 0                 1   2       2       3       3       4
                                                •                               blue: difference in H is 0
 -8 -7 -6 -5 -4 -3 -2 -1 0                  1       2       3       3       4
                                                                •               red: difference in H is 1
 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                       1       2       3       4
-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                            1       2       3
-11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                                1       2
-12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                                    1
-13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0

     Alexander Tiskin (Warwick)                     Semi-local string comparison                               39 / 132
Semi-local string comparison
Score matrices and seaweed matrices


The score matrix H and the seaweed matrix P
 0     1   2   3    4   5   6       6   7   8       8       8       8       8   a = “BAABCBCA”
 -1 0      1   2    3   4   5       5   6   7       7       7       7       7
                                                                        •       b = “BAABCABCABACA”
 -2 -1 0       1    2   3   4       4   5   6       6       6       6       7
                                                        •
 -3 -2 -1 0         1   2   3       3   4   5       5       6       6       7   H(i, j) = score(a, b i : j )
 -4 -3 -2 -1 0          1   2       2   3   4       4       5       5       6
                                •                                               H(4, 11) = 5
 -5 -4 -3 -2 -1 0           1       2   3   4       4       5       5       6
 -6 -5 -4 -3 -2 -1 0                1   2   3       3       4       4       5   H(i, j) = j − i if i > j
 -7 -6 -5 -4 -3 -2 -1 0                 1   2       2       3       3       4
                                                •                               blue: difference in H is 0
 -8 -7 -6 -5 -4 -3 -2 -1 0                  1       2       3       3       4
                                                                •               red: difference in H is 1
 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                       1       2       3       4
-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                            1       2       3   green: P(i, j) = 1
-11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                                1       2
-12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                                    1   H(i, j) = j − i − P Σ (i, j)
-13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0

     Alexander Tiskin (Warwick)                     Semi-local string comparison                               39 / 132
Semi-local string comparison
Score matrices and seaweed matrices


The score matrix H and the seaweed matrix P
                                                              a = “BAABCBCA”
                                               •              b = “BAABCABCABACA”
                                      •
                                                              H(4, 11) =
                            •                                 11 − 4 − P Σ (i, j) =
                                                              11 − 4 − 2 = 5

                                •
                                          •




   Alexander Tiskin (Warwick)       Semi-local string comparison                      40 / 132
Semi-local string comparison
Score matrices and seaweed matrices


The seaweed braid in the alignment graph
  B A A B C A B C A B A C A                                a = “BAABCBCA”
B
A                                                          b = “BAABCABCABACA”
A                                                          H(4, 11) =
B                                                          11 − 4 − P Σ (i, j) =
C                                                          11 − 4 − 2 = 5
B
C
A
P(i, j) = 1 corresponds to seaweed top i                  bottom j




   Alexander Tiskin (Warwick)    Semi-local string comparison                      41 / 132
Semi-local string comparison
Score matrices and seaweed matrices


The seaweed braid in the alignment graph
  B A A B C A B C A B A C A                                         a = “BAABCBCA”
B
A                                                                   b = “BAABCABCABACA”
A                                                                   H(4, 11) =
B                                                                   11 − 4 − P Σ (i, j) =
C                                                                   11 − 4 − 2 = 5
B
C
A
P(i, j) = 1 corresponds to seaweed top i                           bottom j
Also define top              right, left        right, left               bottom seaweeds
Gives bijection between top-left and bottom-right graph boundaries

   Alexander Tiskin (Warwick)             Semi-local string comparison                      41 / 132
Semi-local string comparison
Score matrices and seaweed matrices




Seaweed braid: a highly symmetric object (element of the 0-Hecke monoid
of the symmetric group)
Can be built recursively by assembling subbraids from separate parts
Highly flexible: local alignment, compression, parallel computation. . .

   Alexander Tiskin (Warwick)    Semi-local string comparison             42 / 132
Semi-local string comparison
Weighted alignment


The LCS problem is a special case of the weighted alignment score
problem with weighted matches (wM ), mismatches (wX ) and gaps (wG )

      LCS score: wM = 1, wX = wG = 0
      Levenshtein score: wM = 2, wX = 1, wG = 0




   Alexander Tiskin (Warwick)   Semi-local string comparison       43 / 132
Semi-local string comparison
Weighted alignment


The LCS problem is a special case of the weighted alignment score
problem with weighted matches (wM ), mismatches (wX ) and gaps (wG )

      LCS score: wM = 1, wX = wG = 0
      Levenshtein score: wM = 2, wX = 1, wG = 0

Alignment score is rational, if wM , wX , wG are rational numbers
Equivalent to LCS score on blown-up strings




   Alexander Tiskin (Warwick)   Semi-local string comparison        43 / 132
Semi-local string comparison
Weighted alignment


The LCS problem is a special case of the weighted alignment score
problem with weighted matches (wM ), mismatches (wX ) and gaps (wG )

      LCS score: wM = 1, wX = wG = 0
      Levenshtein score: wM = 2, wX = 1, wG = 0

Alignment score is rational, if wM , wX , wG are rational numbers
Equivalent to LCS score on blown-up strings
Edit distance: minimum cost to transform a into b by weighted character
edits (insertion, deletion, substitution)
Corresponds to weighted alignment score with wM = 0, insertion/deletion
weight −wG , substitution weight −wX



   Alexander Tiskin (Warwick)   Semi-local string comparison        43 / 132
Semi-local string comparison
Weighted alignment


Weighted alignment graph
  B A A B C A B C A B A C A                                            blue = 0
B                                                                red (solid) = 2
A                                                              red (dotted) = 1
A
B
C
B
C
A
Levenshtein score(“BAABCBCA”, “CABCABA”) = 11




   Alexander Tiskin (Warwick)   Semi-local string comparison                44 / 132
Semi-local string comparison
Weighted alignment


Alignment graph for blown-up strings
     $B $A $A $B $C $A $B $C $A $B $A $C $A                         blue = 0
$B                                                             red = 0.5 or 1
$A
$A
$B
$C
$B
$C
$A
Levenshtein score(“BAABCBCA”, “CABCABA”) = 2 · 5.5




   Alexander Tiskin (Warwick)   Semi-local string comparison             45 / 132
Semi-local string comparison
Weighted alignment


Rational-weighted semi-local alignment reduced to semi-local LCS
     $B $A $A $B $C $A $B $C $A $B $A $C $A
$B
$A
$A
$B
$C
$B
$C
$A

Let wM = 1, wX = µ , wG = 0
                 ν
Increase × ν 2 in complexity (can be reduced to ν)


   Alexander Tiskin (Warwick)   Semi-local string comparison       46 / 132
Alexander Tiskin (Warwick)   Semi-local string comparison   47 / 132
1   Introduction                                7    Sparse string comparison

2   Matrix distance multiplication              8    Compressed string comparison

3   Semi-local string comparison                9    Beyond semi-locality

4   The seaweed method                          10   Conclusions and future work

5   Periodic string comparison

6   The transposition network
    method



    Alexander Tiskin (Warwick)   Semi-local string comparison                      48 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   49 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   51 / 132
The seaweed method
Seaweed combing


Semi-local LCS: seaweed combing                                       [T: 2006]
Initialise seaweed braid: crossings in all mismatch cells
Sweep cells in any              -compatible order
Match cell: two seaweeds uncrossed; skip
Mismatch cell: two seaweeds cross
      if the same seaweeds crossed before, uncross them
      otherwise skip, keep seaweeds crossed
Cell update: time O(1)
Overall time O(mn)




   Alexander Tiskin (Warwick)          Semi-local string comparison        52 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   53 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   55 / 132
The seaweed method
Micro-block seaweed combing


Semi-local LCS: micro-block seaweed combing                                [T: 2007]
Initialise seaweed braid: crossings in all mismatch cells
Sweep cells in micro-blocks, in any                    -compatible order
                                     log n
Micro-block size: t = O            log log n
Micro-block interface:
      O(t) characters, each O(log σ) bits, can be reduced to O(log t) bits
      O(t) integers, each O(log n) bits, can be reduced to O(log t) bits
Micro-block update: time O(1), by precomputing all possible interfaces
                        mn(log log n)2
Overall time O             log2 n




   Alexander Tiskin (Warwick)            Semi-local string comparison           56 / 132
The seaweed method
Cyclic LCS


The cyclic LCS problem
Give the maximum LCS score for a vs all cyclic rotations of b




   Alexander Tiskin (Warwick)   Semi-local string comparison    57 / 132
The seaweed method
Cyclic LCS


The cyclic LCS problem
Give the maximum LCS score for a vs all cyclic rotations of b

Cyclic LCS: running time
  mn    2
O log n                                                                       naive
O(mn log m)                                                            [Maes: 1990]
O(mn)                           [Bunke, B¨hler: 1993; Landau+: 1998; Schmidt: 1998]
                                         u
                2
O mn(log 2log n)
       log n
                                                                          [T: 2007]




   Alexander Tiskin (Warwick)           Semi-local string comparison           57 / 132
The seaweed method
Cyclic LCS


Cyclic LCS: the algorithm
                                                               mn(log log n)2
Micro-block seaweed combing on a vs bb, time O                    log2 n
Make n string-substring LCS queries, time negligible




   Alexander Tiskin (Warwick)   Semi-local string comparison                    58 / 132
The seaweed method
Longest repeating subsequence


The longest repeating subsequence problem
Find the longest subsequence of a that is a square (a repetition of two
identical strings)




   Alexander Tiskin (Warwick)   Semi-local string comparison              59 / 132
The seaweed method
Longest repeating subsequence


The longest repeating subsequence problem
Find the longest subsequence of a that is a square (a repetition of two
identical strings)

Longest repeating subsequence: running time
O(m3 )                                                                    naive
O(m2 )                                                         [Kosowski: 2004]
   2      log 2
O m (log 2 m m)
      log
                                                                      [T: 2007]




   Alexander Tiskin (Warwick)   Semi-local string comparison               59 / 132
The seaweed method
Longest repeating subsequence


Longest repeating subsequence: the algorithm
                                                               m2 (log log m)2
Micro-block seaweed combing on a vs a, time O                      log2 m
Make m − 1 suffix-prefix LCS queries, time negligible




   Alexander Tiskin (Warwick)   Semi-local string comparison                     60 / 132
The seaweed method
Approximate matching


The approximate pattern matching problem
Give the substring closest to a by alignment score, starting at each
position in b

Assume rational alignment score
Approximate pattern matching: running time
O(mn)                                                                              [Sellers: 1980]
   mn
O log n                     σ = O(1)                                  via [Masek, Paterson: 1980]
     mn(log log n)2
O       log2 n
                                                               via [Bille, Farach-Colton: 2008]




    Alexander Tiskin (Warwick)         Semi-local string comparison                           61 / 132
The seaweed method
Approximate matching


Approximate pattern matching: the algorithm
Micro-block seaweed combing on a vs b (with blow-up), time
                2
O mn(log 2log n)
      log n
The implicit semi-local edit score matrix:
      an anti-Monge matrix
      approximate pattern matching ∼ row minima
Row minima in O(n) element queries                                  [Aggarwal+: 1987]
Each query in time O(log2 n) using the range tree representation,
combined query time negligible
                                mn(log log n)2
Overall running time O             log2 n
                                                  , same as [Bille, Farach-Colton: 2008]


   Alexander Tiskin (Warwick)        Semi-local string comparison                   62 / 132
Alexander Tiskin (Warwick)   Semi-local string comparison   63 / 132
1   Introduction                                7    Sparse string comparison

2   Matrix distance multiplication              8    Compressed string comparison

3   Semi-local string comparison                9    Beyond semi-locality

4   The seaweed method                          10   Conclusions and future work

5   Periodic string comparison

6   The transposition network
    method



    Alexander Tiskin (Warwick)   Semi-local string comparison                      64 / 132
Periodic string comparison
Wraparound seaweed combing


The periodic string-substring LCS problem
Give (implicit) LCS scores for a vs each substring of b = . . . uuu . . . = u ±∞

Let u be of length p
May assume that every character of a occurs in u
Only substrings of b of length at most mp (otherwise LCS score is m)




   Alexander Tiskin (Warwick)   Semi-local string comparison               65 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   66 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   68 / 132
Periodic string comparison
Wraparound seaweed combing


Periodic string-substring LCS: Wraparound seaweed combing
Initialise seaweed braid: crossings in all mismatch cells
Sweep cells row-by-row: each row starts at match cell, wraps at boundary
Match cell: two seaweeds uncrossed; skip
Mismatch cell: two seaweeds cross
      if the same seaweeds crossed before (with wrapping), uncross them
      otherwise skip, keep seaweeds crossed
Cell update: time O(1)
Overall time O(mn)
String-substring LCS score: count seaweeds with multiplicities


   Alexander Tiskin (Warwick)   Semi-local string comparison         69 / 132
Periodic string comparison
Wraparound seaweed combing


The tandem LCS problem
Give LCS score for a vs b = u k

We have n = kp; may assume k ≤ m
Tandem LCS: running time
O(mkp)                                                                               naive
O m(k + p)                                                     [Landau, Ziv-Ukelson: 2001]
O(mp)                                                                            [T: 2009]

Direct application of wraparound seaweed combing




   Alexander Tiskin (Warwick)   Semi-local string comparison                          70 / 132
Periodic string comparison
Wraparound seaweed combing


The tandem alignment problem
Give the substring closest to a by alignment score among certain
substrings of b = u ±∞ :
      global: substrings u k of length kp across all k
      cyclic: substrings of length kp across all k
      local: substrings of any length

Tandem alignment: running time
O(m2 p)               all                                                        naive
O(mp)                 global                                   [Myers, Miller:   1989]
O(mp log p)           cyclic                                        [Benson:     2005]
O(mp)                 cyclic                                               [T:   2009]
O(mp)                 local                                    [Myers, Miller:   1989]

   Alexander Tiskin (Warwick)   Semi-local string comparison                      71 / 132
Periodic string comparison
Wraparound seaweed combing


Cyclic tandem alignment: the algorithm
Periodic seaweed combing for a vs b (with blow-up), time O(mp)
For each k ∈ [1 : m]:
      solve tandem LCS (under given alignment score) for a vs u k
      obtain scores for a vs p successive substrings of b of length kp by LCS
      batch query: time O(1) per substring
Running time O(mp)




   Alexander Tiskin (Warwick)   Semi-local string comparison             72 / 132
Alexander Tiskin (Warwick)   Semi-local string comparison   73 / 132
1   Introduction                                7    Sparse string comparison

2   Matrix distance multiplication              8    Compressed string comparison

3   Semi-local string comparison                9    Beyond semi-locality

4   The seaweed method                          10   Conclusions and future work

5   Periodic string comparison

6   The transposition network
    method



    Alexander Tiskin (Warwick)   Semi-local string comparison                      74 / 132
The transposition network method
Transposition networks


Comparison network: a circuit of comparators
A comparator sorts two inputs and outputs them in prescribed order
Comparison networks traditionally used for non-branching merging/sorting

Classical comparison networks
               # comparators
merging        O(n log n)                                      [Batcher: 1968]
sorting        O(n log2 n)                                     [Batcher: 1968]
               O(n log n)                                       [Ajtai+: 1983]




   Alexander Tiskin (Warwick)   Semi-local string comparison              75 / 132
The transposition network method
Transposition networks


Comparison network: a circuit of comparators
A comparator sorts two inputs and outputs them in prescribed order
Comparison networks traditionally used for non-branching merging/sorting

Classical comparison networks
               # comparators
merging        O(n log n)                                      [Batcher: 1968]
sorting        O(n log2 n)                                     [Batcher: 1968]
               O(n log n)                                       [Ajtai+: 1983]

Comparison networks are visualised by wire diagrams
Transposition network: all comparisons are between adjacent wires


   Alexander Tiskin (Warwick)   Semi-local string comparison              75 / 132
The transposition network method
Transposition networks


Seaweed combing as a transposition network
                                                         −7
                                                        −5
                                                      −3
                                                     −1
               A B C A                             +1
          A                                       +3
                                                +5
          C                                    +7

          B                                                              −7
          C                                                             −1
                                                                      +3
                                                                     −3
                                                                   −5
                                                                  +7
                                                                +5
                                                               +1

Character mismatches correspond to comparators
Inputs anti-sorted (sorted in reverse); each value traces a seaweed


   Alexander Tiskin (Warwick)   Semi-local string comparison                  76 / 132
The transposition network method
Transposition networks


Global LCS: transposition network with binary input
                                                                           0
                                                                       0
                                                                   0
                                                               0
               A B C A                                     1
          A                                            1
                                                   1
          C                                    1

          B                                                                                                0
                                                                                                       0
          C                                                                                        1
                                                                                               0
                                                                                           0
                                                                                       1
                                                                                   1
                                                                               1

Inputs still anti-sorted, but may not be distinct
Comparison between equal values is indeterminate


   Alexander Tiskin (Warwick)   Semi-local string comparison                                                   77 / 132
The transposition network method
Parameterised string comparison


Parameterised string comparison
String comparison sensitive e.g. to
      low similarity: small λ = LCS(a, b)
      high similarity: small κ = dist LCS (a, b) = m + n − 2λ
Can also use weighted alignment score or edit distance

Assume m = n, therefore κ = 2(n − λ)




   Alexander Tiskin (Warwick)     Semi-local string comparison   78 / 132
The transposition network method
Parameterised string comparison


Low-similarity comparison: small λ
      sparse set of matches, may need to look at them all
      preprocess matches for fast searching, time O(n log σ)
High-similarity comparison: small κ
      set of matches may be dense, but only need to look at small subset
      no need to preprocess, linear search is OK
Flexible comparison: sensitive to both high and low similarity, e.g. by both
comparison types running alongside each other




   Alexander Tiskin (Warwick)     Semi-local string comparison          79 / 132
The transposition network method
Parameterised string comparison


Parameterised string comparison: running time
Low-similarity, after preprocessing in O(n log σ)
O(nλ)                                                                       [Hirschberg: 1977]
                                                                    [Apostolico, Guerra: 1985]
                                                                          [Apostolico+: 1992]
High-similarity, no preprocessing
O(n · κ)                                                                      [Ukkonen: 1985]
                                                                                [Myers: 1986]
Flexible
O(λ · κ · log n)          no preproc                                [Myers: 1986; Wu+: 1990]
O(λ · κ)                  after preproc                                           [Rick: 1995]




   Alexander Tiskin (Warwick)        Semi-local string comparison                         80 / 132
The transposition network method
Parameterised string comparison


Parameterised string comparison: the waterfall algorithm
Low-similarity: O(n · λ)                                    High-similarity: O(n · κ)
      0    0    0    0    0       0   0   0                       0    0     0   0   0   0   0   0
 1                                             0             1                                       0
 1                                             1             1                                       0
 1                                             0             1                                       0
 1                                             1             1                                       0
 1                                             1             1                                       1
 1                                             0             1                                       1
 1                                             1             1                                       1
 1                                             1             1                                       0

      0    0    1    1    0       0   1   0                       1    1     1   0   1   1   0   0

Trace 0s through network in contiguous blocks and gaps

     Alexander Tiskin (Warwick)               Semi-local string comparison                               81 / 132
The transposition network method
Dynamic string comparison


The dynamic LCS problem
Maintain current LCS score under updates to one or both input strings

Both input strings are streams, updated on-line:
      appending characters at left or right
      deleting characters at left or right
Assume for simplicity m ≈ n, i.e. m = Θ(n)
Goal: linear time per update
      O(n) per update of a (n = |b|)
      O(m) per update of b (m = |a|)



   Alexander Tiskin (Warwick)   Semi-local string comparison        82 / 132
The transposition network method
Dynamic string comparison


Dynamic LCS in linear time: update models
left            right
–               app+del                               standard DP [Wagner, Fischer: 1974]
app             app             a fixed                 [Landau+: 1998], [Kim, Park: 2004]
app             app                                                       [Ishida+: 2005]
app+del         app+del                                                         [T: NEW]

Main idea:
      for append only, maintain seaweed matrix Pa,b
      for append+delete, maintain partial seaweed layout by tracing a
      transposition network



   Alexander Tiskin (Warwick)            Semi-local string comparison                83 / 132
The transposition network method
Bit-parallel string comparison


Bit-parallel string comparison
String comparison using standard instructions on words of size w

Bit-parallel string comparison: running time
O(mn/w )                        [Allison, Dix: 1986; Myers: 1999; Crochemore+: 2001]




   Alexander Tiskin (Warwick)           Semi-local string comparison            84 / 132
The transposition network method
Bit-parallel string comparison


Bit-parallel string comparison: binary transposition network
In every cell: input bits s, c; output bits s , c ; match/mismatch flag µ
                      c                               s      0       1   0   1   0   1   0   1
 µ       ¬                                            c      0       0   1   1   0   0   1   1
                                                      µ      0       0   0   0   1   1   1   1
 s                                s                   s      0       1   1   1   0   0   1   1
                                                      c      0       0   0   1   0   1   0   1

                      c
                      c                               s      0       1   0   1   0   1   0   1
 µ
           ∧                                          c      0       0   1   1   0   0   1   1
                                                      µ      0       0   0   0   1   1   1   1
 s                    +           s                   s      0       1   1   0   0   0   1   1
                                                      c      0       0   0   1   0   1   0   1

                      c




     Alexander Tiskin (Warwick)       Semi-local string comparison                               85 / 132
The transposition network method
Bit-parallel string comparison


Bit-parallel string comparison: binary transposition network
In every cell: input bits s, c; output bits s , c ; match/mismatch flag µ
                      c                               s      0       1   0   1   0   1   0   1
 µ       ¬                                            c      0       0   1   1   0   0   1   1
                                                      µ      0       0   0   0   1   1   1   1
 s                                s                   s      0       1   1   1   0   0   1   1
                                                      c      0       0   0   1   0   1   0   1

                      c
                      c                               s      0       1   0   1   0   1   0   1
 µ
           ∧                                          c      0       0   1   1   0   0   1   1
                                                      µ      0       0   0   0   1   1   1   1
 s                    +           s                   s      0       1   1   0   0   0   1   1
                                                      c      0       0   0   1   0   1   0   1

                      c

2c + s ← (s + (s ∧ µ) + c) ∨ (s ∧ ¬µ)
S ← (S + (S ∧ M)) ∨ (S ∧ ¬M), where S, M are words of bits s, µ
     Alexander Tiskin (Warwick)       Semi-local string comparison                               85 / 132
Alexander Tiskin (Warwick)   Semi-local string comparison   86 / 132
1   Introduction                                7    Sparse string comparison

2   Matrix distance multiplication              8    Compressed string comparison

3   Semi-local string comparison                9    Beyond semi-locality

4   The seaweed method                          10   Conclusions and future work

5   Periodic string comparison

6   The transposition network
    method



    Alexander Tiskin (Warwick)   Semi-local string comparison                      87 / 132
Sparse string comparison
Semi-local LCS between permutations


The LCS problem on permutation strings
Give LCS score for a vs b
In each of a, b all characters distinct: total m = n matches
Equivalent to
      longest increasing subsequence (LIS) in a string
      maximum clique in a permutation graph
      maximum planar matching in an embedded bipartite graph

LCS on permutation strings: running time
O(n log n)                                   implicit in [Erd¨s, Szekeres:
                                                             o               1935]
                                 [Robinson: 1938; Knuth: 1970; Dijkstra:     1980]
O(n log log n)          unit-RAM                           [Chang, Wang:     1992]
                                                 [Bespamyatnikh, Segal:      2000]

   Alexander Tiskin (Warwick)       Semi-local string comparison              88 / 132
Sparse string comparison
Semi-local LCS between permutations


The semi-local LCS problem on permutation strings
Give semi-local LCS scores of a vs b
In each of a, b all characters distinct: total m = n matches
Equivalent to
      longest increasing subsequence (LIS) in every substring of a string

Semi-local LCS on permutation strings: running time
O(n2 log n)                                                                         naive
O(n2 )                restricted                           [Albert+: 2003; Chen+: 2005]
O(n1.5 log n)         randomised, restricted                             [Albert+: 2007]
O(n1.5 )                                                                        [T: 2006]
O(n log2 n)                                                                    [T: NEW]



   Alexander Tiskin (Warwick)      Semi-local string comparison                      89 / 132
Sparse string comparison
Semi-local LCS between permutations

    D E H C B A F G
C
F
A
E
D
H
G
B




    Alexander Tiskin (Warwick)   Semi-local string comparison   90 / 132
Sparse string comparison
Semi-local LCS between permutations

    D E H C B A F G
C
F
A
E
D
H
G
B




    Alexander Tiskin (Warwick)   Semi-local string comparison   90 / 132
Sparse string comparison
Semi-local LCS between permutations

    D E H C B A F G
C
F
A
E
D
H
G
B




    Alexander Tiskin (Warwick)   Semi-local string comparison   91 / 132
Sparse string comparison
Semi-local LCS between permutations

    D E H C B A F G
C
F
A
E
D
H
G
B




    Alexander Tiskin (Warwick)   Semi-local string comparison   91 / 132
Sparse string comparison
Semi-local LCS between permutations

    D E H C B A F G
C
F
A
E
D
H
G
B




    Alexander Tiskin (Warwick)   Semi-local string comparison   92 / 132
Sparse string comparison
Semi-local LCS between permutations

    D E H C B A F G
C
F
A
E
D
H
G
B




    Alexander Tiskin (Warwick)   Semi-local string comparison   92 / 132
Sparse string comparison
Semi-local LCS between permutations


Semi-local LCS on permutation strings: the algorithm
Divide-and-conquer on the alignment graph
Divide graph (say) horizontally; two subproblems of effective size n/2
Conquer: seaweed matrix         -multiplication, time O(n log n)
Overall time O(n log2 n)




   Alexander Tiskin (Warwick)    Semi-local string comparison           93 / 132
Sparse string comparison
Longest piecewise monotone subsequences


A k-increasing sequence: a concatenation of k increasing sequences
A k-modal sequence: a concatenation of k alternating increasing and
decreasing sequences

The longest k-increasing (k-modal) subsequence problem
Give the longest k-increasing (k-modal) subsequence of string b

Longest k-increasing (k-modal) subsequence: running time
O(nk log n)          k-modal                                            [Demange+: 2007]
O(nk log n)                                                    via [Hunt, Szymanski: 1977]
O(n log2 n)                                                                     [T: NEW]

Main idea: LCS for id k (respectively, (idid)k/2 ) vs b

   Alexander Tiskin (Warwick)   Semi-local string comparison                          94 / 132
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin
20121020 semi local-string_comparison_tiskin

More Related Content

What's hot

Mba admission in india
Mba admission in indiaMba admission in india
Mba admission in indiaEdhole.com
 
complex variable PPT ( SEM 2 / CH -2 / GTU)
complex variable PPT ( SEM 2 / CH -2 / GTU)complex variable PPT ( SEM 2 / CH -2 / GTU)
complex variable PPT ( SEM 2 / CH -2 / GTU)tejaspatel1997
 
Computational mathematic
Computational mathematicComputational mathematic
Computational mathematicAbdul Khaliq
 
Engineering Mathematics-IV_B.Tech_Semester-IV_Unit-II
Engineering Mathematics-IV_B.Tech_Semester-IV_Unit-IIEngineering Mathematics-IV_B.Tech_Semester-IV_Unit-II
Engineering Mathematics-IV_B.Tech_Semester-IV_Unit-IIRai University
 
Spm Add Maths Formula List Form4
Spm Add Maths Formula List Form4Spm Add Maths Formula List Form4
Spm Add Maths Formula List Form4guest76f49d
 
Independence Complexes
Independence ComplexesIndependence Complexes
Independence ComplexesRickard Fors
 
C.v.n.m (m.e. 130990119004-06)
C.v.n.m (m.e. 130990119004-06)C.v.n.m (m.e. 130990119004-06)
C.v.n.m (m.e. 130990119004-06)parth98796
 
B.tech ii unit-4 material vector differentiation
B.tech ii unit-4 material vector differentiationB.tech ii unit-4 material vector differentiation
B.tech ii unit-4 material vector differentiationRai University
 
Fuzzy sets
Fuzzy setsFuzzy sets
Fuzzy setsVeni7
 
Interpolation with Finite differences
Interpolation with Finite differencesInterpolation with Finite differences
Interpolation with Finite differencesDr. Nirav Vyas
 
Bresenham circles and polygons derication
Bresenham circles and polygons dericationBresenham circles and polygons derication
Bresenham circles and polygons dericationKumar
 
Maths formulae booklet (iit jee)
Maths formulae booklet (iit jee)Maths formulae booklet (iit jee)
Maths formulae booklet (iit jee)s9182647608y
 

What's hot (19)

Mba admission in india
Mba admission in indiaMba admission in india
Mba admission in india
 
complex variable PPT ( SEM 2 / CH -2 / GTU)
complex variable PPT ( SEM 2 / CH -2 / GTU)complex variable PPT ( SEM 2 / CH -2 / GTU)
complex variable PPT ( SEM 2 / CH -2 / GTU)
 
Circle
CircleCircle
Circle
 
Computational mathematic
Computational mathematicComputational mathematic
Computational mathematic
 
Engineering Mathematics-IV_B.Tech_Semester-IV_Unit-II
Engineering Mathematics-IV_B.Tech_Semester-IV_Unit-IIEngineering Mathematics-IV_B.Tech_Semester-IV_Unit-II
Engineering Mathematics-IV_B.Tech_Semester-IV_Unit-II
 
Complex varible
Complex varibleComplex varible
Complex varible
 
Spm Add Maths Formula List Form4
Spm Add Maths Formula List Form4Spm Add Maths Formula List Form4
Spm Add Maths Formula List Form4
 
Independence Complexes
Independence ComplexesIndependence Complexes
Independence Complexes
 
Notes
NotesNotes
Notes
 
C.v.n.m (m.e. 130990119004-06)
C.v.n.m (m.e. 130990119004-06)C.v.n.m (m.e. 130990119004-06)
C.v.n.m (m.e. 130990119004-06)
 
B.tech ii unit-4 material vector differentiation
B.tech ii unit-4 material vector differentiationB.tech ii unit-4 material vector differentiation
B.tech ii unit-4 material vector differentiation
 
Fuzzy sets
Fuzzy setsFuzzy sets
Fuzzy sets
 
Graph
GraphGraph
Graph
 
Cs580
Cs580Cs580
Cs580
 
Interpolation with Finite differences
Interpolation with Finite differencesInterpolation with Finite differences
Interpolation with Finite differences
 
Bresenham circles and polygons derication
Bresenham circles and polygons dericationBresenham circles and polygons derication
Bresenham circles and polygons derication
 
Maths formulae booklet (iit jee)
Maths formulae booklet (iit jee)Maths formulae booklet (iit jee)
Maths formulae booklet (iit jee)
 
Complex variables
Complex variablesComplex variables
Complex variables
 
B.Tech-II_Unit-IV
B.Tech-II_Unit-IVB.Tech-II_Unit-IV
B.Tech-II_Unit-IV
 

Viewers also liked

20101029 reals and_integers_matiyasevich_ekb_lecture03-04
20101029 reals and_integers_matiyasevich_ekb_lecture03-0420101029 reals and_integers_matiyasevich_ekb_lecture03-04
20101029 reals and_integers_matiyasevich_ekb_lecture03-04Computer Science Club
 
20101007 proof complexity_hirsch_lecture04
20101007 proof complexity_hirsch_lecture0420101007 proof complexity_hirsch_lecture04
20101007 proof complexity_hirsch_lecture04Computer Science Club
 
20140427 parallel programming_zlobin_lecture11
20140427 parallel programming_zlobin_lecture1120140427 parallel programming_zlobin_lecture11
20140427 parallel programming_zlobin_lecture11Computer Science Club
 
20140531 serebryany lecture01_fantastic_cpp_bugs
20140531 serebryany lecture01_fantastic_cpp_bugs20140531 serebryany lecture01_fantastic_cpp_bugs
20140531 serebryany lecture01_fantastic_cpp_bugsComputer Science Club
 
20140511 parallel programming_kalishenko_lecture12
20140511 parallel programming_kalishenko_lecture1220140511 parallel programming_kalishenko_lecture12
20140511 parallel programming_kalishenko_lecture12Computer Science Club
 

Viewers also liked (9)

20120513 dynamics morozov
20120513 dynamics morozov20120513 dynamics morozov
20120513 dynamics morozov
 
20101002 ontology konev_lecture06
20101002 ontology konev_lecture0620101002 ontology konev_lecture06
20101002 ontology konev_lecture06
 
20101029 reals and_integers_matiyasevich_ekb_lecture03-04
20101029 reals and_integers_matiyasevich_ekb_lecture03-0420101029 reals and_integers_matiyasevich_ekb_lecture03-04
20101029 reals and_integers_matiyasevich_ekb_lecture03-04
 
20100822 satandsmt bruttomesso
20100822 satandsmt bruttomesso20100822 satandsmt bruttomesso
20100822 satandsmt bruttomesso
 
20101007 proof complexity_hirsch_lecture04
20101007 proof complexity_hirsch_lecture0420101007 proof complexity_hirsch_lecture04
20101007 proof complexity_hirsch_lecture04
 
20141223 kuznetsov distributed
20141223 kuznetsov distributed20141223 kuznetsov distributed
20141223 kuznetsov distributed
 
20140427 parallel programming_zlobin_lecture11
20140427 parallel programming_zlobin_lecture1120140427 parallel programming_zlobin_lecture11
20140427 parallel programming_zlobin_lecture11
 
20140531 serebryany lecture01_fantastic_cpp_bugs
20140531 serebryany lecture01_fantastic_cpp_bugs20140531 serebryany lecture01_fantastic_cpp_bugs
20140531 serebryany lecture01_fantastic_cpp_bugs
 
20140511 parallel programming_kalishenko_lecture12
20140511 parallel programming_kalishenko_lecture1220140511 parallel programming_kalishenko_lecture12
20140511 parallel programming_kalishenko_lecture12
 

Similar to 20121020 semi local-string_comparison_tiskin

EMF.0.14.VectorCalculus-IV.pdf
EMF.0.14.VectorCalculus-IV.pdfEMF.0.14.VectorCalculus-IV.pdf
EMF.0.14.VectorCalculus-IV.pdfrsrao8
 
3 4 character tables
3 4 character tables3 4 character tables
3 4 character tablesAlfiya Ali
 
Passive network-redesign-ntua
Passive network-redesign-ntuaPassive network-redesign-ntua
Passive network-redesign-ntuaIEEE NTUA SB
 
Chap10 slides
Chap10 slidesChap10 slides
Chap10 slidesHJ DS
 
The Wishart and inverse-wishart distribution
 The Wishart and inverse-wishart distribution The Wishart and inverse-wishart distribution
The Wishart and inverse-wishart distributionPankaj Das
 
01_ELMAGTER_DNN_VEKTOR-ANALYSIS_FULL.pdf
01_ELMAGTER_DNN_VEKTOR-ANALYSIS_FULL.pdf01_ELMAGTER_DNN_VEKTOR-ANALYSIS_FULL.pdf
01_ELMAGTER_DNN_VEKTOR-ANALYSIS_FULL.pdfhasanahputri2
 
Distributed Architecture of Subspace Clustering and Related
Distributed Architecture of Subspace Clustering and RelatedDistributed Architecture of Subspace Clustering and Related
Distributed Architecture of Subspace Clustering and RelatedPei-Che Chang
 
Jurnal informatika
Jurnal informatika Jurnal informatika
Jurnal informatika MamaMa28
 
Mesh Generation and Topological Data Analysis
Mesh Generation and Topological Data AnalysisMesh Generation and Topological Data Analysis
Mesh Generation and Topological Data AnalysisDon Sheehy
 
Markov Chain Monitoring - Application to demand prediction in bike sharing sy...
Markov Chain Monitoring - Application to demand prediction in bike sharing sy...Markov Chain Monitoring - Application to demand prediction in bike sharing sy...
Markov Chain Monitoring - Application to demand prediction in bike sharing sy...Harshal Chaudhari
 
Fundamentals of electromagnetics
Fundamentals of electromagneticsFundamentals of electromagnetics
Fundamentals of electromagneticsRichu Jose Cyriac
 
Lecture 2: Stochastic Hydrology
Lecture 2: Stochastic Hydrology Lecture 2: Stochastic Hydrology
Lecture 2: Stochastic Hydrology Amro Elfeki
 
A Study on Compositional Semantics of Words in Distributional Spaces
A Study on Compositional Semantics of Words in Distributional SpacesA Study on Compositional Semantics of Words in Distributional Spaces
A Study on Compositional Semantics of Words in Distributional SpacesPierpaolo Basile
 
Algorithm chapter 8
Algorithm chapter 8Algorithm chapter 8
Algorithm chapter 8chidabdu
 

Similar to 20121020 semi local-string_comparison_tiskin (20)

A superglue for string comparison
A superglue for string comparisonA superglue for string comparison
A superglue for string comparison
 
EMF.0.14.VectorCalculus-IV.pdf
EMF.0.14.VectorCalculus-IV.pdfEMF.0.14.VectorCalculus-IV.pdf
EMF.0.14.VectorCalculus-IV.pdf
 
3 4 character tables
3 4 character tables3 4 character tables
3 4 character tables
 
Graph
GraphGraph
Graph
 
Passive network-redesign-ntua
Passive network-redesign-ntuaPassive network-redesign-ntua
Passive network-redesign-ntua
 
Chap10 slides
Chap10 slidesChap10 slides
Chap10 slides
 
The Wishart and inverse-wishart distribution
 The Wishart and inverse-wishart distribution The Wishart and inverse-wishart distribution
The Wishart and inverse-wishart distribution
 
UCSD NANO106 - 01 - Introduction to Crystallography
UCSD NANO106 - 01 - Introduction to CrystallographyUCSD NANO106 - 01 - Introduction to Crystallography
UCSD NANO106 - 01 - Introduction to Crystallography
 
ALG5.1.ppt
ALG5.1.pptALG5.1.ppt
ALG5.1.ppt
 
01_ELMAGTER_DNN_VEKTOR-ANALYSIS_FULL.pdf
01_ELMAGTER_DNN_VEKTOR-ANALYSIS_FULL.pdf01_ELMAGTER_DNN_VEKTOR-ANALYSIS_FULL.pdf
01_ELMAGTER_DNN_VEKTOR-ANALYSIS_FULL.pdf
 
Distributed Architecture of Subspace Clustering and Related
Distributed Architecture of Subspace Clustering and RelatedDistributed Architecture of Subspace Clustering and Related
Distributed Architecture of Subspace Clustering and Related
 
Jurnal informatika
Jurnal informatika Jurnal informatika
Jurnal informatika
 
Mesh Generation and Topological Data Analysis
Mesh Generation and Topological Data AnalysisMesh Generation and Topological Data Analysis
Mesh Generation and Topological Data Analysis
 
Markov Chain Monitoring - Application to demand prediction in bike sharing sy...
Markov Chain Monitoring - Application to demand prediction in bike sharing sy...Markov Chain Monitoring - Application to demand prediction in bike sharing sy...
Markov Chain Monitoring - Application to demand prediction in bike sharing sy...
 
Fundamentals of electromagnetics
Fundamentals of electromagneticsFundamentals of electromagnetics
Fundamentals of electromagnetics
 
Lecture 2: Stochastic Hydrology
Lecture 2: Stochastic Hydrology Lecture 2: Stochastic Hydrology
Lecture 2: Stochastic Hydrology
 
talk9.ppt
talk9.ppttalk9.ppt
talk9.ppt
 
A Study on Compositional Semantics of Words in Distributional Spaces
A Study on Compositional Semantics of Words in Distributional SpacesA Study on Compositional Semantics of Words in Distributional Spaces
A Study on Compositional Semantics of Words in Distributional Spaces
 
Algorithm chapter 8
Algorithm chapter 8Algorithm chapter 8
Algorithm chapter 8
 
[ACM-ICPC] UVa-10245
[ACM-ICPC] UVa-10245[ACM-ICPC] UVa-10245
[ACM-ICPC] UVa-10245
 

More from Computer Science Club

20140531 serebryany lecture02_find_scary_cpp_bugs
20140531 serebryany lecture02_find_scary_cpp_bugs20140531 serebryany lecture02_find_scary_cpp_bugs
20140531 serebryany lecture02_find_scary_cpp_bugsComputer Science Club
 
20140531 serebryany lecture01_fantastic_cpp_bugs
20140531 serebryany lecture01_fantastic_cpp_bugs20140531 serebryany lecture01_fantastic_cpp_bugs
20140531 serebryany lecture01_fantastic_cpp_bugsComputer Science Club
 
20140420 parallel programming_kalishenko_lecture10
20140420 parallel programming_kalishenko_lecture1020140420 parallel programming_kalishenko_lecture10
20140420 parallel programming_kalishenko_lecture10Computer Science Club
 
20140413 parallel programming_kalishenko_lecture09
20140413 parallel programming_kalishenko_lecture0920140413 parallel programming_kalishenko_lecture09
20140413 parallel programming_kalishenko_lecture09Computer Science Club
 
20140329 graph drawing_dainiak_lecture02
20140329 graph drawing_dainiak_lecture0220140329 graph drawing_dainiak_lecture02
20140329 graph drawing_dainiak_lecture02Computer Science Club
 
20140329 graph drawing_dainiak_lecture01
20140329 graph drawing_dainiak_lecture0120140329 graph drawing_dainiak_lecture01
20140329 graph drawing_dainiak_lecture01Computer Science Club
 
20140310 parallel programming_kalishenko_lecture03-04
20140310 parallel programming_kalishenko_lecture03-0420140310 parallel programming_kalishenko_lecture03-04
20140310 parallel programming_kalishenko_lecture03-04Computer Science Club
 
20140216 parallel programming_kalishenko_lecture01
20140216 parallel programming_kalishenko_lecture0120140216 parallel programming_kalishenko_lecture01
20140216 parallel programming_kalishenko_lecture01Computer Science Club
 
20130928 automated theorem_proving_harrison
20130928 automated theorem_proving_harrison20130928 automated theorem_proving_harrison
20130928 automated theorem_proving_harrisonComputer Science Club
 

More from Computer Science Club (20)

Computer Vision
Computer VisionComputer Vision
Computer Vision
 
20140531 serebryany lecture02_find_scary_cpp_bugs
20140531 serebryany lecture02_find_scary_cpp_bugs20140531 serebryany lecture02_find_scary_cpp_bugs
20140531 serebryany lecture02_find_scary_cpp_bugs
 
20140531 serebryany lecture01_fantastic_cpp_bugs
20140531 serebryany lecture01_fantastic_cpp_bugs20140531 serebryany lecture01_fantastic_cpp_bugs
20140531 serebryany lecture01_fantastic_cpp_bugs
 
20140420 parallel programming_kalishenko_lecture10
20140420 parallel programming_kalishenko_lecture1020140420 parallel programming_kalishenko_lecture10
20140420 parallel programming_kalishenko_lecture10
 
20140413 parallel programming_kalishenko_lecture09
20140413 parallel programming_kalishenko_lecture0920140413 parallel programming_kalishenko_lecture09
20140413 parallel programming_kalishenko_lecture09
 
20140329 graph drawing_dainiak_lecture02
20140329 graph drawing_dainiak_lecture0220140329 graph drawing_dainiak_lecture02
20140329 graph drawing_dainiak_lecture02
 
20140329 graph drawing_dainiak_lecture01
20140329 graph drawing_dainiak_lecture0120140329 graph drawing_dainiak_lecture01
20140329 graph drawing_dainiak_lecture01
 
20140310 parallel programming_kalishenko_lecture03-04
20140310 parallel programming_kalishenko_lecture03-0420140310 parallel programming_kalishenko_lecture03-04
20140310 parallel programming_kalishenko_lecture03-04
 
20140223-SuffixTrees-lecture01-03
20140223-SuffixTrees-lecture01-0320140223-SuffixTrees-lecture01-03
20140223-SuffixTrees-lecture01-03
 
20140216 parallel programming_kalishenko_lecture01
20140216 parallel programming_kalishenko_lecture0120140216 parallel programming_kalishenko_lecture01
20140216 parallel programming_kalishenko_lecture01
 
20131106 h10 lecture6_matiyasevich
20131106 h10 lecture6_matiyasevich20131106 h10 lecture6_matiyasevich
20131106 h10 lecture6_matiyasevich
 
20131027 h10 lecture5_matiyasevich
20131027 h10 lecture5_matiyasevich20131027 h10 lecture5_matiyasevich
20131027 h10 lecture5_matiyasevich
 
20131027 h10 lecture5_matiyasevich
20131027 h10 lecture5_matiyasevich20131027 h10 lecture5_matiyasevich
20131027 h10 lecture5_matiyasevich
 
20131013 h10 lecture4_matiyasevich
20131013 h10 lecture4_matiyasevich20131013 h10 lecture4_matiyasevich
20131013 h10 lecture4_matiyasevich
 
20131006 h10 lecture3_matiyasevich
20131006 h10 lecture3_matiyasevich20131006 h10 lecture3_matiyasevich
20131006 h10 lecture3_matiyasevich
 
20131006 h10 lecture3_matiyasevich
20131006 h10 lecture3_matiyasevich20131006 h10 lecture3_matiyasevich
20131006 h10 lecture3_matiyasevich
 
20131006 h10 lecture2_matiyasevich
20131006 h10 lecture2_matiyasevich20131006 h10 lecture2_matiyasevich
20131006 h10 lecture2_matiyasevich
 
20130922 h10 lecture1_matiyasevich
20130922 h10 lecture1_matiyasevich20130922 h10 lecture1_matiyasevich
20130922 h10 lecture1_matiyasevich
 
20130928 automated theorem_proving_harrison
20130928 automated theorem_proving_harrison20130928 automated theorem_proving_harrison
20130928 automated theorem_proving_harrison
 
20130922 lecture3 matiyasevich
20130922 lecture3 matiyasevich20130922 lecture3 matiyasevich
20130922 lecture3 matiyasevich
 

20121020 semi local-string_comparison_tiskin

  • 1. Semi-local string comparison: Algorithmic techniques and applications Alexander Tiskin Department of Computer Science University of Warwick http://go.warwick.ac.uk/alextiskin Alexander Tiskin (Warwick) Semi-local string comparison 1 / 132
  • 2. 1 Introduction 7 Sparse string comparison 2 Matrix distance multiplication 8 Compressed string comparison 3 Semi-local string comparison 9 Beyond semi-locality 4 The seaweed method 10 Conclusions and future work 5 Periodic string comparison 6 The transposition network method Alexander Tiskin (Warwick) Semi-local string comparison 2 / 132
  • 3. 1 Introduction 7 Sparse string comparison 2 Matrix distance multiplication 8 Compressed string comparison 3 Semi-local string comparison 9 Beyond semi-locality 4 The seaweed method 10 Conclusions and future work 5 Periodic string comparison 6 The transposition network method Alexander Tiskin (Warwick) Semi-local string comparison 3 / 132
  • 4. Introduction String matching: finding an exact pattern in a string String comparison: finding similar patterns in two strings Applications: computational biology, image recognition, . . . Alexander Tiskin (Warwick) Semi-local string comparison 4 / 132
  • 5. Introduction String matching: finding an exact pattern in a string String comparison: finding similar patterns in two strings Applications: computational biology, image recognition, . . . Standard types of string comparison: global: whole string vs whole string local: substrings vs substrings Main focus of this work: semi-local: whole string vs substrings; prefixes vs suffixes Closely related to approximate string matching (no relation to approximation algorithms!) Main tool: implicit unit-Monge matrices (a.k.a. seaweed matrices) Alexander Tiskin (Warwick) Semi-local string comparison 4 / 132
  • 6. Introduction Terminology and notation x− = x − 1 2 x+ = x + 1 2 Integers: {. . . − 2, −1, 0, 1, 2, . . .} Half-integers: . . . − 3 , − 1 , 1 , 2 , 2 , . . . = . . . (−2)+ , (−1)+ , 0+ , 1+ , 2+ 2 2 2 3 5 (i, j) (i , j ) iff i < i and j < j (i, j) (i , j ) iff i > i and j < j A permutation matrix is a 0/1 matrix with exactly one nonzero per row and per column   0 1 0 1 0 0 0 0 1 Alexander Tiskin (Warwick) Semi-local string comparison 5 / 132
  • 7. Introduction Terminology and notation Given matrix D, its distribution matrix is made up of -dominance sums: Alexander Tiskin (Warwick) Semi-local string comparison 6 / 132
  • 8. Introduction Terminology and notation Given matrix D, its distribution matrix is made up of -dominance sums: Given matrix E , its density matrix is made up of quadrangle differences: E (ˆ, ) = E (ˆ− , + ) − E (ˆ− , − ) − E (ˆ+ , + ) + E (ˆ+ , − ) ı ˆ ı ˆ ı ˆ ı ˆ ı ˆ where D Σ , E over integers; D, E over half-integers Alexander Tiskin (Warwick) Semi-local string comparison 6 / 132
  • 9. Introduction Terminology and notation Given matrix D, its distribution matrix is made up of -dominance sums: Given matrix E , its density matrix is made up of quadrangle differences: E (ˆ, ) = E (ˆ− , + ) − E (ˆ− , − ) − E (ˆ+ , + ) + E (ˆ+ , − ) ı ˆ ı ˆ ı ˆ ı ˆ ı ˆ where D Σ , E over integers; D, E over half-integers      Σ 0 1 2 3 0 1 2 3   0 1 0 0 1 1 2 0 1 1 2 0 1 0 1 0 0 =  0 0 0 1 = 1 0 0      0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 Alexander Tiskin (Warwick) Semi-local string comparison 6 / 132
  • 10. Introduction Terminology and notation Given matrix D, its distribution matrix is made up of -dominance sums: Given matrix E , its density matrix is made up of quadrangle differences: E (ˆ, ) = E (ˆ− , + ) − E (ˆ− , − ) − E (ˆ+ , + ) + E (ˆ+ , − ) ı ˆ ı ˆ ı ˆ ı ˆ ı ˆ where D Σ , E over integers; D, E over half-integers      Σ 0 1 2 3 0 1 2 3   0 1 0 0 1 1 2 0 1 1 2 0 1 0 1 0 0 =  0 0 0 1 = 1 0 0      0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 (D Σ ) = D for all D Matrix E is simple, if (E )Σ = E ; equivalently, if it has all zeros in the left column and bottom row Alexander Tiskin (Warwick) Semi-local string comparison 6 / 132
  • 11. Introduction Terminology and notation Matrix E is Monge, if E is nonnegative Intuition: boundary-to-boundary distances in a (weighted) planar graph Matrix E is unit-Monge, if E is a permutation matrix Intuition: boundary-to-boundary distances in a grid-like graph Alexander Tiskin (Warwick) Semi-local string comparison 7 / 132
  • 12. Introduction Terminology and notation Matrix E is Monge, if E is nonnegative Intuition: boundary-to-boundary distances in a (weighted) planar graph Matrix E is unit-Monge, if E is a permutation matrix Intuition: boundary-to-boundary distances in a grid-like graph Simple unit-Monge matrix: P Σ , where P is a permutation matrix Seaweed matrix: P used as an implicit representation of P Σ    Σ 0 1 2 3 0 1 0 0 1 0 0 =  1 1 2  0 0 0 1 0 0 1 0 0 0 0 Alexander Tiskin (Warwick) Semi-local string comparison 7 / 132
  • 13. Introduction Implicit unit-Monge matrices Efficient P Σ queries: range tree on nonzeros of P [Bentley: 1980] binary search tree by i-coordinate under every node, binary search tree by j-coordinate • • • • −→ • −→ • • • • • • • ↓ • • • • −→ • −→ • • • • • • • ↓ • • • • −→ • −→ • • • • • • • Alexander Tiskin (Warwick) Semi-local string comparison 8 / 132
  • 14. Introduction Implicit unit-Monge matrices Efficient P Σ queries: (contd.) Every node of the range tree represents a canonical range (rectangular region), and stores its nonzero count Overall, ≤ n log n canonical ranges are non-empty A P Σ query is equivalent to -dominance counting: how many nonzeros are -dominated by query point? Answer: sum up nonzero counts in ≤ log2 n disjoint canonical ranges Total size O(n log n), query time O(log2 n) Alexander Tiskin (Warwick) Semi-local string comparison 9 / 132
  • 15. Introduction Implicit unit-Monge matrices Efficient P Σ queries: (contd.) Every node of the range tree represents a canonical range (rectangular region), and stores its nonzero count Overall, ≤ n log n canonical ranges are non-empty A P Σ query is equivalent to -dominance counting: how many nonzeros are -dominated by query point? Answer: sum up nonzero counts in ≤ log2 n disjoint canonical ranges Total size O(n log n), query time O(log2 n) There are asymptotically more efficient (but less practical) data structures log n Total size O(n), query time O log log n [J´J´+: 2004] a a [Chan, Pˇtra¸cu: 2010] a s Alexander Tiskin (Warwick) Semi-local string comparison 9 / 132
  • 16. 1 Introduction 7 Sparse string comparison 2 Matrix distance multiplication 8 Compressed string comparison 3 Semi-local string comparison 9 Beyond semi-locality 4 The seaweed method 10 Conclusions and future work 5 Periodic string comparison 6 The transposition network method Alexander Tiskin (Warwick) Semi-local string comparison 10 / 132
  • 17. Matrix distance multiplication Seaweed braids Distance algebra (a.k.a (min, +) or tropical algebra): addition ⊕ given by min multiplication given by + Matrix -multiplication A B=C C (i, k) = j A(i, j) B(j, k) = minj A(i, j) + B(j, k) Matrix classes closed under -multiplication (for given n): general numerical (integer, real) matrices Monge matrices simple unit-Monge matrices (!) Alexander Tiskin (Warwick) Semi-local string comparison 11 / 132
  • 18. Matrix distance multiplication Seaweed braids Recall that simple unit-Monge matrices are represented implicitly by permutation (seaweed) matrices Define PA Σ PB = PC as PA Σ Σ PB = PC The seaweed monoid Tn : simple unit-Monge matrices under equivalently, permutation (seaweed) matrices under Also known as the 0-Hecke monoid of the symmetric group H0 (Sn ) Alexander Tiskin (Warwick) Semi-local string comparison 12 / 132
  • 19. Matrix distance multiplication Seaweed braids PA PB = PC can be seen as combing of seaweed braids • • • • • • • • • • • • • • • • • • PA PB PC Alexander Tiskin (Warwick) Semi-local string comparison 13 / 132
  • 20. Matrix distance multiplication Seaweed braids PA PB = PC can be seen as combing of seaweed braids • • • • • • • • • • • • • • • • • • PA PB PC PA PB Alexander Tiskin (Warwick) Semi-local string comparison 13 / 132
  • 21. Matrix distance multiplication Seaweed braids PA PB = PC can be seen as combing of seaweed braids • • • • • • • • • • • • • • • • • • PA PB PC PA PB Alexander Tiskin (Warwick) Semi-local string comparison 13 / 132
  • 22. Matrix distance multiplication Seaweed braids PA PB = PC can be seen as combing of seaweed braids • • • • • • • • • • • • • • • • • • PA PB PC PA PC PB Alexander Tiskin (Warwick) Semi-local string comparison 13 / 132
  • 23. Matrix distance multiplication Seaweed braids The seaweed monoid Tn : n! elements (permutations of size n) n − 1 generators g1 , g2 , . . . , gn−1 (elementary crossings) Idempotence: gi2 = gi for all i = Far commutativity: gi gj = gj gi j − i > 1 ··· = ··· Braid relations: gi gj gi = gj gi gj j − i = 1 = Alexander Tiskin (Warwick) Semi-local string comparison 14 / 132
  • 24. Matrix distance multiplication Seaweed braids Identity: 1 x =x   • · · · · • · · · · • · = 1=  · · · • Zero: 0 x =0   · · · • · · • · 0= · • · · =  • · · · Alexander Tiskin (Warwick) Semi-local string comparison 15 / 132
  • 25. Matrix distance multiplication Seaweed braids Related structures: positive braids: far comm; braid relations braids: gi gi−1 = 1; far comm; braid relations Coxeter’s presentation of Sn : gi2 = 1; far comm; braid relations locally free idempotent monoid: idem; far comm [Vershik+: 2000] Generalisations: general 0-Hecke monoids [Fomin, Greene: 1998; Buch+: 2008] Coxeter monoids [Tsaranov: 1990; Richardson, Springer: 1990] J -trivial monoids [Denton+: 2011] Alexander Tiskin (Warwick) Semi-local string comparison 16 / 132
  • 26. Matrix distance multiplication Seaweed braids Computation in the seaweed monoid: a confluent rewriting system can be obtained by software (Semigroupe, GAP) Alexander Tiskin (Warwick) Semi-local string comparison 17 / 132
  • 27. Matrix distance multiplication Seaweed braids Computation in the seaweed monoid: a confluent rewriting system can be obtained by software (Semigroupe, GAP) T3 : 1, a = g1 , b = g2 ; ab, ba, aba = 0 aa → a bb → b bab → 0 aba → 0 Alexander Tiskin (Warwick) Semi-local string comparison 17 / 132
  • 28. Matrix distance multiplication Seaweed braids Computation in the seaweed monoid: a confluent rewriting system can be obtained by software (Semigroupe, GAP) T3 : 1, a = g1 , b = g2 ; ab, ba, aba = 0 aa → a bb → b bab → 0 aba → 0 T4 : 1, a = g1 , b = g2 , c = g3 ; ab, ac, ba, bc, cb, aba, abc, acb, bac, bcb, cba, abac, abcb, acba, bacb, bcba, abacb, abcba, bacba, abacba = 0 aa → a ca → ac bab → aba cbac → bcba bb → b cc → c cbc → bcb abacba → 0 Alexander Tiskin (Warwick) Semi-local string comparison 17 / 132
  • 29. Matrix distance multiplication Seaweed braids Computation in the seaweed monoid: a confluent rewriting system can be obtained by software (Semigroupe, GAP) T3 : 1, a = g1 , b = g2 ; ab, ba, aba = 0 aa → a bb → b bab → 0 aba → 0 T4 : 1, a = g1 , b = g2 , c = g3 ; ab, ac, ba, bc, cb, aba, abc, acb, bac, bcb, cba, abac, abcb, acba, bacb, bcba, abacb, abcba, bacba, abacba = 0 aa → a ca → ac bab → aba cbac → bcba bb → b cc → c cbc → bcb abacba → 0 Easy to use, but not an efficient algorithm Alexander Tiskin (Warwick) Semi-local string comparison 17 / 132
  • 30. Matrix distance multiplication Seaweed matrix multiplication The implicit unit-Monge matrix -multiplication problem Given permutation matrices PA , PB , compute PC , such that Σ Σ Σ PA PB = PC (equivalently, PA PB = PC ) Alexander Tiskin (Warwick) Semi-local string comparison 18 / 132
  • 31. Matrix distance multiplication Seaweed matrix multiplication The implicit unit-Monge matrix -multiplication problem Given permutation matrices PA , PB , compute PC , such that Σ Σ Σ PA PB = PC (equivalently, PA PB = PC ) Matrix -multiplication: running time type time general O(n3 ) standard 3 3 O n (log log n) log2 n [Chan: 2007] Monge O(n2 ) via [Aggarwal+: 1987] implicit unit-Monge O(n1.5 ) [T: 2006] O(n log n) [T: 2010] Alexander Tiskin (Warwick) Semi-local string comparison 18 / 132
  • 32. Matrix distance multiplication Seaweed matrix multiplication PB •• • •• •• • • • • • • • •• • • • • •• • • • • • • • •• •• • •• ? • • • • PA PC Alexander Tiskin (Warwick) Semi-local string comparison 19 / 132
  • 33. Matrix distance multiplication Seaweed matrix multiplication PB,lo , PB,hi •• • •• •• • • • • • • • •• • • • • •• • • • • • • • •• •• •• • • • • • PA,lo , PA,hi Alexander Tiskin (Warwick) Semi-local string comparison 20 / 132
  • 34. Matrix distance multiplication Seaweed matrix multiplication PB,lo , PB,hi •• • •• •• • • • • • • • •• • • • • •• • •• • • • • • • • • • •• •• •• • • • • • • • • • • PA,lo , PA,hi Alexander Tiskin (Warwick) Semi-local string comparison 20 / 132
  • 35. Matrix distance multiplication Seaweed matrix multiplication PB,lo , PB,hi •• • •• •• • • • • • • • •• • • • • •• • • • • • • • • • • • • •• • •• •• • • • • • • • • • PA,lo , PA,hi Alexander Tiskin (Warwick) Semi-local string comparison 20 / 132
  • 36. Matrix distance multiplication Seaweed matrix multiplication PB,lo , PB,hi •• • •• •• • • • • • • • •• • • • • •• • •• • • • • • • • • • • • • •• • • • •• •• • • • • • • • • • • •• • • PA,lo , PA,hi PC ,lo + PC ,hi Alexander Tiskin (Warwick) Semi-local string comparison 20 / 132
  • 37. Matrix distance multiplication Seaweed matrix multiplication PB,lo , PB,hi •• • •• •• • • • • • • • •• • • • • •• • •• • • • • • • • • • • • • •• • • • •• •• • • • • • • • • • • •• • • PA,lo , PA,hi PC ,lo + PC ,hi Alexander Tiskin (Warwick) Semi-local string comparison 21 / 132
  • 38. Matrix distance multiplication Seaweed matrix multiplication PB,lo , PB,hi •• • •• •• • • • • • • • •• • • • • •• • •• • • • • • • • • • •• • • • •• • •• •• • • • • • • • • • • •• • • PA,lo , PA,hi PC Alexander Tiskin (Warwick) Semi-local string comparison 21 / 132
  • 39. Matrix distance multiplication Seaweed matrix multiplication PB •• • •• •• • • • • • • • •• • • • • •• • •• • • • • • • • • • •• • • • •• • •• •• • • • • • • • • • • •• • • PA PC Alexander Tiskin (Warwick) Semi-local string comparison 22 / 132
  • 40. Matrix distance multiplication Seaweed matrix multiplication Implicit unit-Monge matrix -multiplication: the algorithm Σ Σ Σ PC (i, k) = minj PA (i, j) + PB (j, k) Divide-and-conquer on the range of j Divide PA horizontally, PB vertically: two subproblems of effective size n/2 Σ PA,lo Σ Σ PB,lo = PC ,lo Σ PA,hi Σ Σ PB,hi = PC ,hi Conquer: -low nonzeros of PC ,lo and -high nonzeros of PC ,hi appear in PC The remaining nonzeros of PC ,lo and PC ,hi are “wrong”, and need to be corrected to obtain the remaining nonzeros of PC Correction can be done in time O(n) using the unit-Monge property Overall time O(n log n) Alexander Tiskin (Warwick) Semi-local string comparison 23 / 132
  • 41. Matrix distance multiplication Bruhat order Comparing permutations by the “degree of sortedness” Bruhat order Permutation A is lower (“more sorted”) than permutation B in the Bruhat order (A B), if B can be transformed to A by successive pairwise sorting between arbitrary pairs of elements. Permutation matrices: PA PB , if PB can be transformed to PA by successive submatrix substitution: ( 0 1 ) 10 (1 0) 01 Alexander Tiskin (Warwick) Semi-local string comparison 24 / 132
  • 42. Matrix distance multiplication Bruhat order Bruhat comparability: running time O(n2 ) folklore O(n log n) [T: NEW] PA PB iff PA ≤ PB elementwise, time O(n2 ) Σ Σ folklore R R PA PB iff PA PB = Id , time O(n log n) [T: NEW] where P R denotes clockwise rotation of matrix P Alexander Tiskin (Warwick) Semi-local string comparison 25 / 132
  • 43. 1 Introduction 7 Sparse string comparison 2 Matrix distance multiplication 8 Compressed string comparison 3 Semi-local string comparison 9 Beyond semi-locality 4 The seaweed method 10 Conclusions and future work 5 Periodic string comparison 6 The transposition network method Alexander Tiskin (Warwick) Semi-local string comparison 26 / 132
  • 44. Semi-local string comparison Semi-local LCS and edit distance Consider strings (= sequences) over an alphabet of size σ Distinguish contiguous substrings and not necessarily contiguous subsequences Special cases of substring: prefix, suffix Notation: strings a, b of length m, n respectively Assume where necessary: m ≤ n; m, n reasonably close Alexander Tiskin (Warwick) Semi-local string comparison 27 / 132
  • 45. Semi-local string comparison Semi-local LCS and edit distance Consider strings (= sequences) over an alphabet of size σ Distinguish contiguous substrings and not necessarily contiguous subsequences Special cases of substring: prefix, suffix Notation: strings a, b of length m, n respectively Assume where necessary: m ≤ n; m, n reasonably close The longest common subsequence (LCS) score: length of longest string that is a subsequence of both a and b equivalently, alignment score, where score(match) = 1 and score(mismatch) = 0 In biological terms, “loss-free alignment” (unlike “lossy” BLAST) Alexander Tiskin (Warwick) Semi-local string comparison 27 / 132
  • 46. Semi-local string comparison Semi-local LCS and edit distance The LCS problem Give the LCS score for a vs b Alexander Tiskin (Warwick) Semi-local string comparison 28 / 132
  • 47. Semi-local string comparison Semi-local LCS and edit distance The LCS problem Give the LCS score for a vs b LCS: running time O(mn) [Wagner, Fischer: 1974] mn O log2 n σ = O(1) [Masek, Paterson: 1980] [Crochemore+: 2003] mn(log log n)2 O log2 n [Paterson, Danˇ´cık: 1994] [Bille, Farach-Colton: 2008] Running time varies depending on the RAM model version We assume word-RAM with word size log n (where it matters) Alexander Tiskin (Warwick) Semi-local string comparison 28 / 132
  • 48. Semi-local string comparison Semi-local LCS and edit distance LCS on the alignment graph (directed, acyclic) B A A B C A B C A B A C A blue = 0 B red = 1 A A B C B C A score(“BAABCBCA”, “BAABCABCABACA”) = len(“BAABCBCA”) = 8 LCS = highest-score path from top-left to bottom-right Alexander Tiskin (Warwick) Semi-local string comparison 29 / 132
  • 49. Semi-local string comparison Semi-local LCS and edit distance LCS: dynamic programming [WF: 1974] Sweep cells in any -compatible order Cell update: time O(1) Overall time O(mn) Alexander Tiskin (Warwick) Semi-local string comparison 30 / 132
  • 50. Semi-local string comparison Semi-local LCS and edit distance ‘Begin at the beginning,’ the King said gravely, ‘and go on till you come to the end: then stop.’ L. Carroll, Alice in Wonderland (The standard approach in dynamic programming) Alexander Tiskin (Warwick) Semi-local string comparison 31 / 132
  • 51. Semi-local string comparison Semi-local LCS and edit distance Sometimes dynamic programming can be run from both ends for extra flexibility Alexander Tiskin (Warwick) Semi-local string comparison 32 / 132
  • 52. Semi-local string comparison Semi-local LCS and edit distance Sometimes dynamic programming can be run from both ends for extra flexibility Is there a better, fully flexible alternative (e.g. for comparing compressed strings, comparing strings dynamically or in parallel, etc.)? Alexander Tiskin (Warwick) Semi-local string comparison 32 / 132
  • 53. Semi-local string comparison Semi-local LCS and edit distance LCS: micro-block dynamic programming [MP: 1980; BF: 2008] Sweep cells in micro-blocks, in any -compatible order Micro-block size: t = O(log n) when σ = O(1) log n t=O log log n otherwise Micro-block interface: O(t) characters, each O(log σ) bits, can be reduced to O(log t) bits O(t) small integers, each O(1) bits Micro-block update: time O(1), by precomputing all possible interfaces mn mn(log log n)2 Overall time O log2 n when σ = O(1), O log2 n otherwise Alexander Tiskin (Warwick) Semi-local string comparison 33 / 132
  • 54. Semi-local string comparison Semi-local LCS and edit distance The semi-local LCS problem Give the (implicit) matrix of O (m + n)2 LCS scores: string-substring LCS: string a vs every substring of b prefix-suffix LCS: every prefix of a vs every suffix of b suffix-prefix LCS: every suffix of a vs every prefix of b substring-string LCS: every substring of a vs string b Alexander Tiskin (Warwick) Semi-local string comparison 34 / 132
  • 55. Semi-local string comparison Semi-local LCS and edit distance The semi-local LCS problem Give the (implicit) matrix of O (m + n)2 LCS scores: string-substring LCS: string a vs every substring of b prefix-suffix LCS: every prefix of a vs every suffix of b suffix-prefix LCS: every suffix of a vs every prefix of b substring-string LCS: every substring of a vs string b Cf.: dynamic programming gives prefix-prefix LCS Alexander Tiskin (Warwick) Semi-local string comparison 34 / 132
  • 56. Semi-local string comparison Semi-local LCS and edit distance Semi-local LCS on the alignment graph B A A B C A B C A B A C A blue = 0 B red = 1 A A B C B C A score(“BAABCBCA”, “CABCABA”) = len(“ABCBA”) = 5 String-substring LCS: all highest-score top-to-bottom paths Semi-local LCS: all highest-score boundary-to-boundary paths Alexander Tiskin (Warwick) Semi-local string comparison 35 / 132
  • 57. Semi-local string comparison Score matrices and seaweed matrices The score matrix H 0 1 2 3 4 5 6 6 7 8 8 8 8 8 a = “BAABCBCA” -1 0 1 2 3 4 5 5 6 7 7 7 7 7 -2 -1 0 1 2 3 4 4 5 6 6 6 6 7 b = “BAABCABCABACA” -3 -2 -1 0 1 2 3 3 4 5 5 6 6 7 H(i, j) = score(a, b i : j ) -4 -3 -2 -1 0 1 2 2 3 4 4 5 5 6 H(4, 11) = 5 -5 -4 -3 -2 -1 0 1 2 3 4 4 5 5 6 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4 4 5 H(i, j) = j − i if i > j -7 -6 -5 -4 -3 -2 -1 0 1 2 2 3 3 4 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 Alexander Tiskin (Warwick) Semi-local string comparison 36 / 132
  • 58. Semi-local string comparison Score matrices and seaweed matrices Semi-local LCS: output representation and running time size query time O(n2 ) O(1) trivial O(m1/2 n) O(log n) string-substring [Alves+: 2003] O(n) O(n) string-substring [Alves+: 2005] O(n log n) O(log2 n) [T: 2006] . . . or any 2D orthogonal range counting data structure running time O(mn2 ) naive O(mn) string-substring [Schmidt: 1998; Alves+: 2005] O(mn) [T: 2006] mn O log0.5 n [T: 2006] mn(log log n)2 O log2 n [T: 2007] Alexander Tiskin (Warwick) Semi-local string comparison 37 / 132
  • 59. Semi-local string comparison Score matrices and seaweed matrices The score matrix H and the seaweed matrix P H(i, j): the number of matched characters for a vs substring b i : j j − i − H(i, j): the number of unmatched characters Properties of matrix j − i − H(i, j): simple unit-Monge therefore, = P Σ , where P = −H is a permutation matrix P is the seaweed matrix, giving an implicit representation of H Range tree for P: memory O(n log n), query time O(log2 n) Alexander Tiskin (Warwick) Semi-local string comparison 38 / 132
  • 60. Semi-local string comparison Score matrices and seaweed matrices The score matrix H and the seaweed matrix P 0 1 2 3 4 5 6 6 7 8 8 8 8 8 a = “BAABCBCA” -1 0 1 2 3 4 5 5 6 7 7 7 7 7 • b = “BAABCABCABACA” -2 -1 0 1 2 3 4 4 5 6 6 6 6 7 • -3 -2 -1 0 1 2 3 3 4 5 5 6 6 7 H(i, j) = score(a, b i : j ) -4 -3 -2 -1 0 1 2 2 3 4 4 5 5 6 • H(4, 11) = 5 -5 -4 -3 -2 -1 0 1 2 3 4 4 5 5 6 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4 4 5 H(i, j) = j − i if i > j -7 -6 -5 -4 -3 -2 -1 0 1 2 2 3 3 4 • -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4 • -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 Alexander Tiskin (Warwick) Semi-local string comparison 39 / 132
  • 61. Semi-local string comparison Score matrices and seaweed matrices The score matrix H and the seaweed matrix P 0 1 2 3 4 5 6 6 7 8 8 8 8 8 a = “BAABCBCA” -1 0 1 2 3 4 5 5 6 7 7 7 7 7 • b = “BAABCABCABACA” -2 -1 0 1 2 3 4 4 5 6 6 6 6 7 • -3 -2 -1 0 1 2 3 3 4 5 5 6 6 7 H(i, j) = score(a, b i : j ) -4 -3 -2 -1 0 1 2 2 3 4 4 5 5 6 • H(4, 11) = 5 -5 -4 -3 -2 -1 0 1 2 3 4 4 5 5 6 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4 4 5 H(i, j) = j − i if i > j -7 -6 -5 -4 -3 -2 -1 0 1 2 2 3 3 4 • blue: difference in H is 0 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4 • red: difference in H is 1 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 Alexander Tiskin (Warwick) Semi-local string comparison 39 / 132
  • 62. Semi-local string comparison Score matrices and seaweed matrices The score matrix H and the seaweed matrix P 0 1 2 3 4 5 6 6 7 8 8 8 8 8 a = “BAABCBCA” -1 0 1 2 3 4 5 5 6 7 7 7 7 7 • b = “BAABCABCABACA” -2 -1 0 1 2 3 4 4 5 6 6 6 6 7 • -3 -2 -1 0 1 2 3 3 4 5 5 6 6 7 H(i, j) = score(a, b i : j ) -4 -3 -2 -1 0 1 2 2 3 4 4 5 5 6 • H(4, 11) = 5 -5 -4 -3 -2 -1 0 1 2 3 4 4 5 5 6 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4 4 5 H(i, j) = j − i if i > j -7 -6 -5 -4 -3 -2 -1 0 1 2 2 3 3 4 • blue: difference in H is 0 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4 • red: difference in H is 1 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 green: P(i, j) = 1 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 H(i, j) = j − i − P Σ (i, j) -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 Alexander Tiskin (Warwick) Semi-local string comparison 39 / 132
  • 63. Semi-local string comparison Score matrices and seaweed matrices The score matrix H and the seaweed matrix P a = “BAABCBCA” • b = “BAABCABCABACA” • H(4, 11) = • 11 − 4 − P Σ (i, j) = 11 − 4 − 2 = 5 • • Alexander Tiskin (Warwick) Semi-local string comparison 40 / 132
  • 64. Semi-local string comparison Score matrices and seaweed matrices The seaweed braid in the alignment graph B A A B C A B C A B A C A a = “BAABCBCA” B A b = “BAABCABCABACA” A H(4, 11) = B 11 − 4 − P Σ (i, j) = C 11 − 4 − 2 = 5 B C A P(i, j) = 1 corresponds to seaweed top i bottom j Alexander Tiskin (Warwick) Semi-local string comparison 41 / 132
  • 65. Semi-local string comparison Score matrices and seaweed matrices The seaweed braid in the alignment graph B A A B C A B C A B A C A a = “BAABCBCA” B A b = “BAABCABCABACA” A H(4, 11) = B 11 − 4 − P Σ (i, j) = C 11 − 4 − 2 = 5 B C A P(i, j) = 1 corresponds to seaweed top i bottom j Also define top right, left right, left bottom seaweeds Gives bijection between top-left and bottom-right graph boundaries Alexander Tiskin (Warwick) Semi-local string comparison 41 / 132
  • 66. Semi-local string comparison Score matrices and seaweed matrices Seaweed braid: a highly symmetric object (element of the 0-Hecke monoid of the symmetric group) Can be built recursively by assembling subbraids from separate parts Highly flexible: local alignment, compression, parallel computation. . . Alexander Tiskin (Warwick) Semi-local string comparison 42 / 132
  • 67. Semi-local string comparison Weighted alignment The LCS problem is a special case of the weighted alignment score problem with weighted matches (wM ), mismatches (wX ) and gaps (wG ) LCS score: wM = 1, wX = wG = 0 Levenshtein score: wM = 2, wX = 1, wG = 0 Alexander Tiskin (Warwick) Semi-local string comparison 43 / 132
  • 68. Semi-local string comparison Weighted alignment The LCS problem is a special case of the weighted alignment score problem with weighted matches (wM ), mismatches (wX ) and gaps (wG ) LCS score: wM = 1, wX = wG = 0 Levenshtein score: wM = 2, wX = 1, wG = 0 Alignment score is rational, if wM , wX , wG are rational numbers Equivalent to LCS score on blown-up strings Alexander Tiskin (Warwick) Semi-local string comparison 43 / 132
  • 69. Semi-local string comparison Weighted alignment The LCS problem is a special case of the weighted alignment score problem with weighted matches (wM ), mismatches (wX ) and gaps (wG ) LCS score: wM = 1, wX = wG = 0 Levenshtein score: wM = 2, wX = 1, wG = 0 Alignment score is rational, if wM , wX , wG are rational numbers Equivalent to LCS score on blown-up strings Edit distance: minimum cost to transform a into b by weighted character edits (insertion, deletion, substitution) Corresponds to weighted alignment score with wM = 0, insertion/deletion weight −wG , substitution weight −wX Alexander Tiskin (Warwick) Semi-local string comparison 43 / 132
  • 70. Semi-local string comparison Weighted alignment Weighted alignment graph B A A B C A B C A B A C A blue = 0 B red (solid) = 2 A red (dotted) = 1 A B C B C A Levenshtein score(“BAABCBCA”, “CABCABA”) = 11 Alexander Tiskin (Warwick) Semi-local string comparison 44 / 132
  • 71. Semi-local string comparison Weighted alignment Alignment graph for blown-up strings $B $A $A $B $C $A $B $C $A $B $A $C $A blue = 0 $B red = 0.5 or 1 $A $A $B $C $B $C $A Levenshtein score(“BAABCBCA”, “CABCABA”) = 2 · 5.5 Alexander Tiskin (Warwick) Semi-local string comparison 45 / 132
  • 72. Semi-local string comparison Weighted alignment Rational-weighted semi-local alignment reduced to semi-local LCS $B $A $A $B $C $A $B $C $A $B $A $C $A $B $A $A $B $C $B $C $A Let wM = 1, wX = µ , wG = 0 ν Increase × ν 2 in complexity (can be reduced to ν) Alexander Tiskin (Warwick) Semi-local string comparison 46 / 132
  • 73. Alexander Tiskin (Warwick) Semi-local string comparison 47 / 132
  • 74. 1 Introduction 7 Sparse string comparison 2 Matrix distance multiplication 8 Compressed string comparison 3 Semi-local string comparison 9 Beyond semi-locality 4 The seaweed method 10 Conclusions and future work 5 Periodic string comparison 6 The transposition network method Alexander Tiskin (Warwick) Semi-local string comparison 48 / 132
  • 75. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 49 / 132
  • 76. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 77. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 78. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 79. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 80. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 81. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 82. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 83. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 84. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 85. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 86. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 87. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 88. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 89. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 90. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 91. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 92. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 93. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 94. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 95. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 96. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 97. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 98. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 99. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 100. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 101. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 102. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 103. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 104. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 105. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 106. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 107. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 108. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 109. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 110. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 111. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 112. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 113. The seaweed method Seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 51 / 132
  • 114. The seaweed method Seaweed combing Semi-local LCS: seaweed combing [T: 2006] Initialise seaweed braid: crossings in all mismatch cells Sweep cells in any -compatible order Match cell: two seaweeds uncrossed; skip Mismatch cell: two seaweeds cross if the same seaweeds crossed before, uncross them otherwise skip, keep seaweeds crossed Cell update: time O(1) Overall time O(mn) Alexander Tiskin (Warwick) Semi-local string comparison 52 / 132
  • 115. The seaweed method Micro-block seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 53 / 132
  • 116. The seaweed method Micro-block seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 117. The seaweed method Micro-block seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 118. The seaweed method Micro-block seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 119. The seaweed method Micro-block seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 120. The seaweed method Micro-block seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 121. The seaweed method Micro-block seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 122. The seaweed method Micro-block seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 123. The seaweed method Micro-block seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 124. The seaweed method Micro-block seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 125. The seaweed method Micro-block seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 126. The seaweed method Micro-block seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 127. The seaweed method Micro-block seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 128. The seaweed method Micro-block seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 129. The seaweed method Micro-block seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 55 / 132
  • 130. The seaweed method Micro-block seaweed combing Semi-local LCS: micro-block seaweed combing [T: 2007] Initialise seaweed braid: crossings in all mismatch cells Sweep cells in micro-blocks, in any -compatible order log n Micro-block size: t = O log log n Micro-block interface: O(t) characters, each O(log σ) bits, can be reduced to O(log t) bits O(t) integers, each O(log n) bits, can be reduced to O(log t) bits Micro-block update: time O(1), by precomputing all possible interfaces mn(log log n)2 Overall time O log2 n Alexander Tiskin (Warwick) Semi-local string comparison 56 / 132
  • 131. The seaweed method Cyclic LCS The cyclic LCS problem Give the maximum LCS score for a vs all cyclic rotations of b Alexander Tiskin (Warwick) Semi-local string comparison 57 / 132
  • 132. The seaweed method Cyclic LCS The cyclic LCS problem Give the maximum LCS score for a vs all cyclic rotations of b Cyclic LCS: running time mn 2 O log n naive O(mn log m) [Maes: 1990] O(mn) [Bunke, B¨hler: 1993; Landau+: 1998; Schmidt: 1998] u 2 O mn(log 2log n) log n [T: 2007] Alexander Tiskin (Warwick) Semi-local string comparison 57 / 132
  • 133. The seaweed method Cyclic LCS Cyclic LCS: the algorithm mn(log log n)2 Micro-block seaweed combing on a vs bb, time O log2 n Make n string-substring LCS queries, time negligible Alexander Tiskin (Warwick) Semi-local string comparison 58 / 132
  • 134. The seaweed method Longest repeating subsequence The longest repeating subsequence problem Find the longest subsequence of a that is a square (a repetition of two identical strings) Alexander Tiskin (Warwick) Semi-local string comparison 59 / 132
  • 135. The seaweed method Longest repeating subsequence The longest repeating subsequence problem Find the longest subsequence of a that is a square (a repetition of two identical strings) Longest repeating subsequence: running time O(m3 ) naive O(m2 ) [Kosowski: 2004] 2 log 2 O m (log 2 m m) log [T: 2007] Alexander Tiskin (Warwick) Semi-local string comparison 59 / 132
  • 136. The seaweed method Longest repeating subsequence Longest repeating subsequence: the algorithm m2 (log log m)2 Micro-block seaweed combing on a vs a, time O log2 m Make m − 1 suffix-prefix LCS queries, time negligible Alexander Tiskin (Warwick) Semi-local string comparison 60 / 132
  • 137. The seaweed method Approximate matching The approximate pattern matching problem Give the substring closest to a by alignment score, starting at each position in b Assume rational alignment score Approximate pattern matching: running time O(mn) [Sellers: 1980] mn O log n σ = O(1) via [Masek, Paterson: 1980] mn(log log n)2 O log2 n via [Bille, Farach-Colton: 2008] Alexander Tiskin (Warwick) Semi-local string comparison 61 / 132
  • 138. The seaweed method Approximate matching Approximate pattern matching: the algorithm Micro-block seaweed combing on a vs b (with blow-up), time 2 O mn(log 2log n) log n The implicit semi-local edit score matrix: an anti-Monge matrix approximate pattern matching ∼ row minima Row minima in O(n) element queries [Aggarwal+: 1987] Each query in time O(log2 n) using the range tree representation, combined query time negligible mn(log log n)2 Overall running time O log2 n , same as [Bille, Farach-Colton: 2008] Alexander Tiskin (Warwick) Semi-local string comparison 62 / 132
  • 139. Alexander Tiskin (Warwick) Semi-local string comparison 63 / 132
  • 140. 1 Introduction 7 Sparse string comparison 2 Matrix distance multiplication 8 Compressed string comparison 3 Semi-local string comparison 9 Beyond semi-locality 4 The seaweed method 10 Conclusions and future work 5 Periodic string comparison 6 The transposition network method Alexander Tiskin (Warwick) Semi-local string comparison 64 / 132
  • 141. Periodic string comparison Wraparound seaweed combing The periodic string-substring LCS problem Give (implicit) LCS scores for a vs each substring of b = . . . uuu . . . = u ±∞ Let u be of length p May assume that every character of a occurs in u Only substrings of b of length at most mp (otherwise LCS score is m) Alexander Tiskin (Warwick) Semi-local string comparison 65 / 132
  • 142. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 66 / 132
  • 143. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 144. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 145. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 146. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 147. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 148. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 149. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 150. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 151. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 152. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 153. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 154. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 155. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 156. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 157. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 158. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 159. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 160. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 161. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 162. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 163. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 164. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 165. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 166. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 167. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 168. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 169. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 170. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 171. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 172. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 173. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 174. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 175. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 176. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 177. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 178. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 179. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 180. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 181. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 182. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 183. Periodic string comparison Wraparound seaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 68 / 132
  • 184. Periodic string comparison Wraparound seaweed combing Periodic string-substring LCS: Wraparound seaweed combing Initialise seaweed braid: crossings in all mismatch cells Sweep cells row-by-row: each row starts at match cell, wraps at boundary Match cell: two seaweeds uncrossed; skip Mismatch cell: two seaweeds cross if the same seaweeds crossed before (with wrapping), uncross them otherwise skip, keep seaweeds crossed Cell update: time O(1) Overall time O(mn) String-substring LCS score: count seaweeds with multiplicities Alexander Tiskin (Warwick) Semi-local string comparison 69 / 132
  • 185. Periodic string comparison Wraparound seaweed combing The tandem LCS problem Give LCS score for a vs b = u k We have n = kp; may assume k ≤ m Tandem LCS: running time O(mkp) naive O m(k + p) [Landau, Ziv-Ukelson: 2001] O(mp) [T: 2009] Direct application of wraparound seaweed combing Alexander Tiskin (Warwick) Semi-local string comparison 70 / 132
  • 186. Periodic string comparison Wraparound seaweed combing The tandem alignment problem Give the substring closest to a by alignment score among certain substrings of b = u ±∞ : global: substrings u k of length kp across all k cyclic: substrings of length kp across all k local: substrings of any length Tandem alignment: running time O(m2 p) all naive O(mp) global [Myers, Miller: 1989] O(mp log p) cyclic [Benson: 2005] O(mp) cyclic [T: 2009] O(mp) local [Myers, Miller: 1989] Alexander Tiskin (Warwick) Semi-local string comparison 71 / 132
  • 187. Periodic string comparison Wraparound seaweed combing Cyclic tandem alignment: the algorithm Periodic seaweed combing for a vs b (with blow-up), time O(mp) For each k ∈ [1 : m]: solve tandem LCS (under given alignment score) for a vs u k obtain scores for a vs p successive substrings of b of length kp by LCS batch query: time O(1) per substring Running time O(mp) Alexander Tiskin (Warwick) Semi-local string comparison 72 / 132
  • 188. Alexander Tiskin (Warwick) Semi-local string comparison 73 / 132
  • 189. 1 Introduction 7 Sparse string comparison 2 Matrix distance multiplication 8 Compressed string comparison 3 Semi-local string comparison 9 Beyond semi-locality 4 The seaweed method 10 Conclusions and future work 5 Periodic string comparison 6 The transposition network method Alexander Tiskin (Warwick) Semi-local string comparison 74 / 132
  • 190. The transposition network method Transposition networks Comparison network: a circuit of comparators A comparator sorts two inputs and outputs them in prescribed order Comparison networks traditionally used for non-branching merging/sorting Classical comparison networks # comparators merging O(n log n) [Batcher: 1968] sorting O(n log2 n) [Batcher: 1968] O(n log n) [Ajtai+: 1983] Alexander Tiskin (Warwick) Semi-local string comparison 75 / 132
  • 191. The transposition network method Transposition networks Comparison network: a circuit of comparators A comparator sorts two inputs and outputs them in prescribed order Comparison networks traditionally used for non-branching merging/sorting Classical comparison networks # comparators merging O(n log n) [Batcher: 1968] sorting O(n log2 n) [Batcher: 1968] O(n log n) [Ajtai+: 1983] Comparison networks are visualised by wire diagrams Transposition network: all comparisons are between adjacent wires Alexander Tiskin (Warwick) Semi-local string comparison 75 / 132
  • 192. The transposition network method Transposition networks Seaweed combing as a transposition network −7 −5 −3 −1 A B C A +1 A +3 +5 C +7 B −7 C −1 +3 −3 −5 +7 +5 +1 Character mismatches correspond to comparators Inputs anti-sorted (sorted in reverse); each value traces a seaweed Alexander Tiskin (Warwick) Semi-local string comparison 76 / 132
  • 193. The transposition network method Transposition networks Global LCS: transposition network with binary input 0 0 0 0 A B C A 1 A 1 1 C 1 B 0 0 C 1 0 0 1 1 1 Inputs still anti-sorted, but may not be distinct Comparison between equal values is indeterminate Alexander Tiskin (Warwick) Semi-local string comparison 77 / 132
  • 194. The transposition network method Parameterised string comparison Parameterised string comparison String comparison sensitive e.g. to low similarity: small λ = LCS(a, b) high similarity: small κ = dist LCS (a, b) = m + n − 2λ Can also use weighted alignment score or edit distance Assume m = n, therefore κ = 2(n − λ) Alexander Tiskin (Warwick) Semi-local string comparison 78 / 132
  • 195. The transposition network method Parameterised string comparison Low-similarity comparison: small λ sparse set of matches, may need to look at them all preprocess matches for fast searching, time O(n log σ) High-similarity comparison: small κ set of matches may be dense, but only need to look at small subset no need to preprocess, linear search is OK Flexible comparison: sensitive to both high and low similarity, e.g. by both comparison types running alongside each other Alexander Tiskin (Warwick) Semi-local string comparison 79 / 132
  • 196. The transposition network method Parameterised string comparison Parameterised string comparison: running time Low-similarity, after preprocessing in O(n log σ) O(nλ) [Hirschberg: 1977] [Apostolico, Guerra: 1985] [Apostolico+: 1992] High-similarity, no preprocessing O(n · κ) [Ukkonen: 1985] [Myers: 1986] Flexible O(λ · κ · log n) no preproc [Myers: 1986; Wu+: 1990] O(λ · κ) after preproc [Rick: 1995] Alexander Tiskin (Warwick) Semi-local string comparison 80 / 132
  • 197. The transposition network method Parameterised string comparison Parameterised string comparison: the waterfall algorithm Low-similarity: O(n · λ) High-similarity: O(n · κ) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 0 1 0 1 1 1 0 1 1 0 0 Trace 0s through network in contiguous blocks and gaps Alexander Tiskin (Warwick) Semi-local string comparison 81 / 132
  • 198. The transposition network method Dynamic string comparison The dynamic LCS problem Maintain current LCS score under updates to one or both input strings Both input strings are streams, updated on-line: appending characters at left or right deleting characters at left or right Assume for simplicity m ≈ n, i.e. m = Θ(n) Goal: linear time per update O(n) per update of a (n = |b|) O(m) per update of b (m = |a|) Alexander Tiskin (Warwick) Semi-local string comparison 82 / 132
  • 199. The transposition network method Dynamic string comparison Dynamic LCS in linear time: update models left right – app+del standard DP [Wagner, Fischer: 1974] app app a fixed [Landau+: 1998], [Kim, Park: 2004] app app [Ishida+: 2005] app+del app+del [T: NEW] Main idea: for append only, maintain seaweed matrix Pa,b for append+delete, maintain partial seaweed layout by tracing a transposition network Alexander Tiskin (Warwick) Semi-local string comparison 83 / 132
  • 200. The transposition network method Bit-parallel string comparison Bit-parallel string comparison String comparison using standard instructions on words of size w Bit-parallel string comparison: running time O(mn/w ) [Allison, Dix: 1986; Myers: 1999; Crochemore+: 2001] Alexander Tiskin (Warwick) Semi-local string comparison 84 / 132
  • 201. The transposition network method Bit-parallel string comparison Bit-parallel string comparison: binary transposition network In every cell: input bits s, c; output bits s , c ; match/mismatch flag µ c s 0 1 0 1 0 1 0 1 µ ¬ c 0 0 1 1 0 0 1 1 µ 0 0 0 0 1 1 1 1 s s s 0 1 1 1 0 0 1 1 c 0 0 0 1 0 1 0 1 c c s 0 1 0 1 0 1 0 1 µ ∧ c 0 0 1 1 0 0 1 1 µ 0 0 0 0 1 1 1 1 s + s s 0 1 1 0 0 0 1 1 c 0 0 0 1 0 1 0 1 c Alexander Tiskin (Warwick) Semi-local string comparison 85 / 132
  • 202. The transposition network method Bit-parallel string comparison Bit-parallel string comparison: binary transposition network In every cell: input bits s, c; output bits s , c ; match/mismatch flag µ c s 0 1 0 1 0 1 0 1 µ ¬ c 0 0 1 1 0 0 1 1 µ 0 0 0 0 1 1 1 1 s s s 0 1 1 1 0 0 1 1 c 0 0 0 1 0 1 0 1 c c s 0 1 0 1 0 1 0 1 µ ∧ c 0 0 1 1 0 0 1 1 µ 0 0 0 0 1 1 1 1 s + s s 0 1 1 0 0 0 1 1 c 0 0 0 1 0 1 0 1 c 2c + s ← (s + (s ∧ µ) + c) ∨ (s ∧ ¬µ) S ← (S + (S ∧ M)) ∨ (S ∧ ¬M), where S, M are words of bits s, µ Alexander Tiskin (Warwick) Semi-local string comparison 85 / 132
  • 203. Alexander Tiskin (Warwick) Semi-local string comparison 86 / 132
  • 204. 1 Introduction 7 Sparse string comparison 2 Matrix distance multiplication 8 Compressed string comparison 3 Semi-local string comparison 9 Beyond semi-locality 4 The seaweed method 10 Conclusions and future work 5 Periodic string comparison 6 The transposition network method Alexander Tiskin (Warwick) Semi-local string comparison 87 / 132
  • 205. Sparse string comparison Semi-local LCS between permutations The LCS problem on permutation strings Give LCS score for a vs b In each of a, b all characters distinct: total m = n matches Equivalent to longest increasing subsequence (LIS) in a string maximum clique in a permutation graph maximum planar matching in an embedded bipartite graph LCS on permutation strings: running time O(n log n) implicit in [Erd¨s, Szekeres: o 1935] [Robinson: 1938; Knuth: 1970; Dijkstra: 1980] O(n log log n) unit-RAM [Chang, Wang: 1992] [Bespamyatnikh, Segal: 2000] Alexander Tiskin (Warwick) Semi-local string comparison 88 / 132
  • 206. Sparse string comparison Semi-local LCS between permutations The semi-local LCS problem on permutation strings Give semi-local LCS scores of a vs b In each of a, b all characters distinct: total m = n matches Equivalent to longest increasing subsequence (LIS) in every substring of a string Semi-local LCS on permutation strings: running time O(n2 log n) naive O(n2 ) restricted [Albert+: 2003; Chen+: 2005] O(n1.5 log n) randomised, restricted [Albert+: 2007] O(n1.5 ) [T: 2006] O(n log2 n) [T: NEW] Alexander Tiskin (Warwick) Semi-local string comparison 89 / 132
  • 207. Sparse string comparison Semi-local LCS between permutations D E H C B A F G C F A E D H G B Alexander Tiskin (Warwick) Semi-local string comparison 90 / 132
  • 208. Sparse string comparison Semi-local LCS between permutations D E H C B A F G C F A E D H G B Alexander Tiskin (Warwick) Semi-local string comparison 90 / 132
  • 209. Sparse string comparison Semi-local LCS between permutations D E H C B A F G C F A E D H G B Alexander Tiskin (Warwick) Semi-local string comparison 91 / 132
  • 210. Sparse string comparison Semi-local LCS between permutations D E H C B A F G C F A E D H G B Alexander Tiskin (Warwick) Semi-local string comparison 91 / 132
  • 211. Sparse string comparison Semi-local LCS between permutations D E H C B A F G C F A E D H G B Alexander Tiskin (Warwick) Semi-local string comparison 92 / 132
  • 212. Sparse string comparison Semi-local LCS between permutations D E H C B A F G C F A E D H G B Alexander Tiskin (Warwick) Semi-local string comparison 92 / 132
  • 213. Sparse string comparison Semi-local LCS between permutations Semi-local LCS on permutation strings: the algorithm Divide-and-conquer on the alignment graph Divide graph (say) horizontally; two subproblems of effective size n/2 Conquer: seaweed matrix -multiplication, time O(n log n) Overall time O(n log2 n) Alexander Tiskin (Warwick) Semi-local string comparison 93 / 132
  • 214. Sparse string comparison Longest piecewise monotone subsequences A k-increasing sequence: a concatenation of k increasing sequences A k-modal sequence: a concatenation of k alternating increasing and decreasing sequences The longest k-increasing (k-modal) subsequence problem Give the longest k-increasing (k-modal) subsequence of string b Longest k-increasing (k-modal) subsequence: running time O(nk log n) k-modal [Demange+: 2007] O(nk log n) via [Hunt, Szymanski: 1977] O(n log2 n) [T: NEW] Main idea: LCS for id k (respectively, (idid)k/2 ) vs b Alexander Tiskin (Warwick) Semi-local string comparison 94 / 132