SlideShare una empresa de Scribd logo
1 de 56
Descargar para leer sin conexión
Efficient Top-k Algorithms for Fuzzy Search in String
                     Collections

                             Rares Vernica             Chen Li

                             Department of Computer Science
                              University of California, Irvine


  First International Workshop on Keyword Search on Structured
                            Data, 2009




 Rares Vernica (UC Irvine)         Top-k Fuzzy String Search     KEYS 2009   1 / 17
Outline



1   Motivation


2   Efficient Top-k Algorithms
      Problem Formulation
      Algorithms Overview
      Top-k Single-Pass Search Algorithm


3   Experimental Evaluation




    Rares Vernica (UC Irvine)   Top-k Fuzzy String Search   KEYS 2009   2 / 17
Need for Approximate String Queries
               ID       FirstName        LastName                   # Movies
               10       Al               Swartzberg                        1
               11       Hanna            Wartenegg                         1
               12       Rik              Swartzwelder                     30
               13       Joey             Swartzentruber                    1
               14       Rene             Swartenbroekx                     4
               15       Arnold           Schwarzenegger                 283
               16       Luc              Swartenbroeckx                    1
                          .
                          .               .
                                          .                                 .
                                                                            .
                          .               .                                 .
                              Figure: Actor names and popularities




  Rares Vernica (UC Irvine)             Top-k Fuzzy String Search               KEYS 2009   3 / 17
Need for Approximate String Queries
               ID       FirstName        LastName                   # Movies
               10       Al               Swartzberg                        1
               11       Hanna            Wartenegg                         1
               12       Rik              Swartzwelder                     30
               13       Joey             Swartzentruber                    1
               14       Rene             Swartenbroekx                     4
               15       Arnold           Schwarzenegger                 283
               16       Luc              Swartenbroeckx                    1
                          .
                          .               .
                                          .                                 .
                                                                            .
                          .               .                                 .
                              Figure: Actor names and popularities

                 SELECT * FROM Actors
                 WHERE LastName = ’Shwartzenetrugger’
                 ORDER BY ’# Movies’ DESC LIMIT 1;


  Rares Vernica (UC Irvine)             Top-k Fuzzy String Search               KEYS 2009   3 / 17
Need for Approximate String Queries
               ID       FirstName        LastName                   # Movies
               10       Al               Swartzberg                        1
               11       Hanna            Wartenegg                         1
               12       Rik              Swartzwelder                     30
               13       Joey             Swartzentruber                    1
               14       Rene             Swartenbroekx                     4
               15       Arnold           Schwarzenegger                 283
               16       Luc              Swartenbroeckx                    1
                          .
                          .               .
                                          .                                 .
                                                                            .
                          .               .                                 .
                              Figure: Actor names and popularities

                 SELECT * FROM Actors
                 WHERE LastName = ’Shwartzenetrugger’
                 ORDER BY ’# Movies’ DESC LIMIT 1;

                 0 Results
  Rares Vernica (UC Irvine)             Top-k Fuzzy String Search               KEYS 2009   3 / 17
Need for Ranking

         ID       FirstName    LastName                      # Movies     ED
         10       Al           Swartzberg                           1      8
         11       Hanna        Wartenegg                            1      8
         12       Rik          Swartzwelder                        30      8
         13       Joey         Swartzentruber                       1      4
         14       Rene         Swartenbroekx                        4      9
         15       Arnold       Schwarzenegger                    283       5
         16       Luc          Swartenbroeckx                       1      9
                    .
                    .           .
                                .                                    .
                                                                     .     .
                                                                           .
                    .           .                                    .     .
Figure: Actor names, popularities, and edit distances to a query string
“Shwartzenetrugger”.




   Rares Vernica (UC Irvine)     Top-k Fuzzy String Search               KEYS 2009   4 / 17
Need for Ranking

         ID       FirstName    LastName                      # Movies     ED
         10       Al           Swartzberg                           1      8
         11       Hanna        Wartenegg                            1      8
         12       Rik          Swartzwelder                        30      8
         13       Joey         Swartzentruber                       1      4
         14       Rene         Swartenbroekx                        4      9
         15       Arnold       Schwarzenegger                    283       5
         16       Luc          Swartenbroeckx                       1      9
                    .
                    .           .
                                .                                    .
                                                                     .     .
                                                                           .
                    .           .                                    .     .
Figure: Actor names, popularities, and edit distances to a query string
“Shwartzenetrugger”.




   Rares Vernica (UC Irvine)     Top-k Fuzzy String Search               KEYS 2009   4 / 17
Need for Ranking

         ID       FirstName    LastName                      # Movies     ED
         10       Al           Swartzberg                           1      8
         11       Hanna        Wartenegg                            1      8
         12       Rik          Swartzwelder                        30      8
         13       Joey         Swartzentruber                       1      4
         14       Rene         Swartenbroekx                        4      9
         15       Arnold       Schwarzenegger                    283       5
         16       Luc          Swartenbroeckx                       1      9
                    .
                    .           .
                                .                                    .
                                                                     .     .
                                                                           .
                    .           .                                    .     .
Figure: Actor names, popularities, and edit distances to a query string
“Shwartzenetrugger”.




   Rares Vernica (UC Irvine)     Top-k Fuzzy String Search               KEYS 2009   4 / 17
Need for Ranking

         ID       FirstName    LastName                      # Movies     ED
         10       Al           Swartzberg                           1      8
         11       Hanna        Wartenegg                            1      8
         12       Rik          Swartzwelder                        30      8
         13       Joey         Swartzentruber                       1      4
         14       Rene         Swartenbroekx                        4      9
         15       Arnold       Schwarzenegger                    283       5
         16       Luc          Swartenbroeckx                       1      9
                    .
                    .           .
                                .                                    .
                                                                     .     .
                                                                           .
                    .           .                                    .     .
Figure: Actor names, popularities, and edit distances to a query string
“Shwartzenetrugger”.


     Which one result should the system return?


   Rares Vernica (UC Irvine)     Top-k Fuzzy String Search               KEYS 2009   4 / 17
Need for Ranking

         ID       FirstName    LastName                      # Movies     ED
         10       Al           Swartzberg                           1      8
         11       Hanna        Wartenegg                            1      8
         12       Rik          Swartzwelder                        30      8
         13       Joey         Swartzentruber                       1      4
         14       Rene         Swartenbroekx                        4      9
         15       Arnold       Schwarzenegger                    283       5
         16       Luc          Swartenbroeckx                       1      9
                    .
                    .           .
                                .                                    .
                                                                     .     .
                                                                           .
                    .           .                                    .     .
Figure: Actor names, popularities, and edit distances to a query string
“Shwartzenetrugger”.


     Which one result should the system return?
     Which value is more important, # Movies or similarity?
   Rares Vernica (UC Irvine)     Top-k Fuzzy String Search               KEYS 2009   4 / 17
Top-k Similar Strings

    Given:
            Weighted string collection
                    e.g., actors’ LastName and # Movies
            Query string
                    e.g., “Shwartzenetrugger”
            Similarity function
                    e.g, Edit Distance
            Scoring function (score of a string in terms of similarity and weight)
                    e.g., linear combination of similarity and popularity
            Integer k
    Return: k best strings in terms of overall score to the query string.




  Rares Vernica (UC Irvine)           Top-k Fuzzy String Search             KEYS 2009   5 / 17
Top-k Similar Strings

    Given:
            Weighted string collection
                    e.g., actors’ LastName and # Movies
            Query string
                    e.g., “Shwartzenetrugger”
            Similarity function
                    e.g, Edit Distance
            Scoring function (score of a string in terms of similarity and weight)
                    e.g., linear combination of similarity and popularity
            Integer k
    Return: k best strings in terms of overall score to the query string.
    Advantages over Range Search:
            specify k instead of a similarity threshold
            guaranteed k results; a range search might have 0 results


  Rares Vernica (UC Irvine)           Top-k Fuzzy String Search             KEYS 2009   5 / 17
Algorithms Overview




                                                             Range
                                                             Search
          Range
          Search



      Iterative Range            Single-Pass
          Search                   Search                 Two-Phase Search




  Rares Vernica (UC Irvine)   Top-k Fuzzy String Search            KEYS 2009   6 / 17
q-grams: Overlapping substrings of fixed length



    Find similar strings: e.g., “Vernica” and “Veronica”
    q-gram: substring of length q of a string: e.g., q = 2




  Rares Vernica (UC Irvine)   Top-k Fuzzy String Search      KEYS 2009   7 / 17
q-grams: Overlapping substrings of fixed length



    Find similar strings: e.g., “Vernica” and “Veronica”
    q-gram: substring of length q of a string: e.g., q = 2

Vernica




  Rares Vernica (UC Irvine)   Top-k Fuzzy String Search      KEYS 2009   7 / 17
q-grams: Overlapping substrings of fixed length



    Find similar strings: e.g., “Vernica” and “Veronica”
    q-gram: substring of length q of a string: e.g., q = 2

Vernica → {Ve




  Rares Vernica (UC Irvine)   Top-k Fuzzy String Search      KEYS 2009   7 / 17
q-grams: Overlapping substrings of fixed length



    Find similar strings: e.g., “Vernica” and “Veronica”
    q-gram: substring of length q of a string: e.g., q = 2

Vernica → {Ve,er




  Rares Vernica (UC Irvine)   Top-k Fuzzy String Search      KEYS 2009   7 / 17
q-grams: Overlapping substrings of fixed length



    Find similar strings: e.g., “Vernica” and “Veronica”
    q-gram: substring of length q of a string: e.g., q = 2

Vernica → {Ve,er,rn




  Rares Vernica (UC Irvine)   Top-k Fuzzy String Search      KEYS 2009   7 / 17
q-grams: Overlapping substrings of fixed length



    Find similar strings: e.g., “Vernica” and “Veronica”
    q-gram: substring of length q of a string: e.g., q = 2

Vernica → {Ve,er,rn,ni




  Rares Vernica (UC Irvine)   Top-k Fuzzy String Search      KEYS 2009   7 / 17
q-grams: Overlapping substrings of fixed length



    Find similar strings: e.g., “Vernica” and “Veronica”
    q-gram: substring of length q of a string: e.g., q = 2

Vernica → {Ve,er,rn,ni,ic




  Rares Vernica (UC Irvine)   Top-k Fuzzy String Search      KEYS 2009   7 / 17
q-grams: Overlapping substrings of fixed length



    Find similar strings: e.g., “Vernica” and “Veronica”
    q-gram: substring of length q of a string: e.g., q = 2

Vernica → {Ve,er,rn,ni,ic,ca}




  Rares Vernica (UC Irvine)   Top-k Fuzzy String Search      KEYS 2009   7 / 17
q-grams: Overlapping substrings of fixed length



    Find similar strings: e.g., “Vernica” and “Veronica”
    q-gram: substring of length q of a string: e.g., q = 2

Vernica → {Ve,er,rn,ni,ic,ca}

Veronica → {Ve,er,ro,on,ni,ic,ca}




  Rares Vernica (UC Irvine)   Top-k Fuzzy String Search      KEYS 2009   7 / 17
q-grams: Overlapping substrings of fixed length



    Find similar strings: e.g., “Vernica” and “Veronica”
    q-gram: substring of length q of a string: e.g., q = 2

Vernica → {Ve,er,rn,ni,ic,ca}

Veronica → {Ve,er,ro,on,ni,ic,ca}

    Similar strings share a certain number of grams




  Rares Vernica (UC Irvine)   Top-k Fuzzy String Search      KEYS 2009   7 / 17
q-gram Inverted List Index

    q=2

       ID      String         Weight
        1      ab              0.80
        2      ccd             0.70
        3      cd              0.60
        4      abcd            0.50
        5      bcc             0.40
             Figure: Dataset




  Rares Vernica (UC Irvine)            Top-k Fuzzy String Search   KEYS 2009   8 / 17
q-gram Inverted List Index

    q=2

       ID      String         Weight
        1      ab              0.80                         ab      cc      cd        bc
        2
        3
               ccd
               cd
                               0.70
                               0.60          ⇒               1
                                                             4
                                                                    2
                                                                    5
                                                                             2
                                                                             3
                                                                                       4
                                                                                       5
        4      abcd            0.50                                          4
        5      bcc             0.40
                                                          Figure: Gram inverted-list index
             Figure: Dataset




  Rares Vernica (UC Irvine)            Top-k Fuzzy String Search                 KEYS 2009   8 / 17
q-gram Inverted List Index

    q=2

       ID      String         Weight
        1      ab              0.80                         ab      cc      cd        bc
        2      ccd             0.70                          1      2        2         4
        3      cd              0.60                          4      5        3         5
        4      abcd            0.50                                          4
        5      bcc             0.40
                                                          Figure: Gram inverted-list index
             Figure: Dataset

    Query string: “bcd”




  Rares Vernica (UC Irvine)            Top-k Fuzzy String Search                 KEYS 2009   8 / 17
q-gram Inverted List Index

    q=2

       ID      String         Weight
        1      ab              0.80                         ab      cc      cd        bc
        2      ccd             0.70                          1      2        2         4
        3      cd              0.60                          4      5        3         5
        4      abcd            0.50                                          4
        5      bcc             0.40
                                                          Figure: Gram inverted-list index
             Figure: Dataset

    Query string: “bcd”




  Rares Vernica (UC Irvine)            Top-k Fuzzy String Search                 KEYS 2009   8 / 17
q-gram Inverted List Index

    q=2

       ID      String         Weight
        1      ab              0.80                         ab      cc      cd        bc
        2
        3
               ccd
               cd
                               0.70
                               0.60          ⇐               1
                                                             4
                                                                    2
                                                                    5
                                                                             2
                                                                             3
                                                                                       4
                                                                                       5
        4      abcd            0.50                                          4
        5      bcc             0.40
                                                          Figure: Gram inverted-list index
             Figure: Dataset

    Query string: “bcd”
    Identified strings are verified by computing the real similarity.
    Verification is usually an expensive process.


  Rares Vernica (UC Irvine)            Top-k Fuzzy String Search                 KEYS 2009   8 / 17
Top-k Single-pass Search Algorithm

                                                             ID    String    Weight
                                                              1    ab         0.80
                                                              2    ccd        0.70
                                                              3    cd         0.60
                                                              4    abcd       0.50
                                                              5    bcc        0.40
                                                                   Figure: Dataset



                                                              ab     cc     cd    bc
                                                               1     2       2    4
                                                               4     5       3    5
                                                                             4
                                                          Figure: Gram inverted-list
                                                          index
  Rares Vernica (UC Irvine)   Top-k Fuzzy String Search                     KEYS 2009   9 / 17
Top-k Single-pass Search Algorithm

                                                             ID    String    Weight
                                                              1    ab         0.80
                                                              2    ccd        0.70
Setup                                                         3    cd         0.60
   Assign IDs s.t.                                            4    abcd       0.50
   ascending order of IDs ≡                                   5    bcc        0.40
   descending order of weights                                     Figure: Dataset



                                                              ab     cc     cd    bc
                                                               1      2      2     4
                                                               4      5      3     5
                                                                             4
                                                          Figure: Gram inverted-list
                                                          index
  Rares Vernica (UC Irvine)   Top-k Fuzzy String Search                     KEYS 2009   9 / 17
Top-k Single-pass Search Algorithm

                                                             ID    String    Weight
                                                              1    ab         0.80
                                                              2    ccd        0.70
Setup                                                         3    cd         0.60
   Assign IDs s.t.                                            4    abcd       0.50
   ascending order of IDs ≡                                   5    bcc        0.40
   descending order of weights                                     Figure: Dataset
   Sort the IDs on each list in ascending
   order
                                                              ab     cc     cd    bc
                                                               1      2      2     4
                                                               4      5      3     5
                                                                             4
                                                          Figure: Gram inverted-list
                                                          index
  Rares Vernica (UC Irvine)   Top-k Fuzzy String Search                     KEYS 2009   9 / 17
Top-k Single-pass Search Algorithm

                                                             ID    String    Weight
                                                              1    ab         0.80
                                                              2    ccd        0.70
Setup                                                         3    cd         0.60
   Assign IDs s.t.                                            4    abcd       0.50
   ascending order of IDs ≡                                   5    bcc        0.40
   descending order of weights                                     Figure: Dataset
   Sort the IDs on each list in ascending
   order
   Scan the lists corresponding to the                        ab     cc     cd    bc
   grams in the query.                                         1      2      2     4
   e.g., “bcd”                                                 4      5      3     5
                                                                             4
                                                          Figure: Gram inverted-list
                                                          index
  Rares Vernica (UC Irvine)   Top-k Fuzzy String Search                     KEYS 2009   9 / 17
Top-k Single-pass Search Algorithm

                                                              ID    String    Weight
                                                               1    ab         0.80
                                                               2    ccd        0.70
                                                               3    cd         0.60
Naïve approach: Round-Robin
                                                               4    abcd       0.50
    Scan all the lists in the same time                        5    bcc        0.40
    Maintain a list of “open” IDs (might
                                                                    Figure: Dataset
    still appear on some of the lists)
    Store the best k “closed” IDs in a
    top-k buffer
                                                               ab     cc     cd    bc
    Stop when the top-k buffer cannot                           1      2      2     4
    improve                                                     4      5      3     5
                                                                              4
                                                           Figure: Gram inverted-list
                                                           index
   Rares Vernica (UC Irvine)   Top-k Fuzzy String Search                     KEYS 2009   9 / 17
Top-k Single-pass Search Algorithm



                                                         ab     bc       cd
                                                          1      2        2
Heap-based
                                                         20      3        4
     Traverse the lists in a sorted                      21      4        5
     order using a heap on the top                         .
                                                           .      .
                                                                  .        .
                                                                           .
     IDs of the lists                                      .      .        .
                                                                19       19
     Advantages:
                                                                20       20
         1   No need to maintain the list                        .        .
             of “open” IDs                                       .
                                                                 .        .
                                                                          .
         2   Skip elements
                                                   Figure: Gram
                                                   inverted-lists for query
                                                   “abcd”


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009   10 / 17
Top-k Single-pass Search Algorithm



                                                       ab       bc       cd
                                                       →1        2        2
Heap-based
                                                       20        3        4
     Traverse the lists in a sorted                    21        4        5
     order using a heap on the top                       .
                                                         .        .
                                                                  .        .
                                                                           .
     IDs of the lists                                    .        .        .
                                                                19       19
     Advantages:                                                                      1
                                                                20       20
         1   No need to maintain the list                        .        .
             of “open” IDs                                       .
                                                                 .        .
                                                                          .
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm



                                                       ab      bc        cd
                                                       →1      →2         2
Heap-based
                                                       20       3         4
     Traverse the lists in a sorted                    21       4         5
     order using a heap on the top                       .
                                                         .       .
                                                                 .         .
                                                                           .
     IDs of the lists                                    .       .         .
                                                                19       19
     Advantages:                                                                      1
                                                                20       20
         1   No need to maintain the list                        .        .           2
             of “open” IDs                                       .
                                                                 .        .
                                                                          .
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm



                                                       ab      bc       cd
                                                       →1      →2       →2
Heap-based
                                                       20       3        4
     Traverse the lists in a sorted                    21       4        5
     order using a heap on the top                       .
                                                         .       .
                                                                 .        .
                                                                          .
     IDs of the lists                                    .       .        .
                                                                19       19
     Advantages:                                                                      1
                                                                20       20
         1   No need to maintain the list                        .        .           2
             of “open” IDs                                       .
                                                                 .        .
                                                                          .           2
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm



                                                       ab      bc       cd
                                                       →1      →2       →2
Heap-based
                                                       20       3        4
     Traverse the lists in a sorted                    21       4        5
     order using a heap on the top                       .
                                                         .       .
                                                                 .        .
                                                                          .
     IDs of the lists                                    .       .        .
                                                                19       19
     Advantages:                                                                      1
                                                                20       20
         1   No need to maintain the list                        .        .           2
             of “open” IDs                                       .
                                                                 .        .
                                                                          .           2
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm


                                                                                    1
                                                     ab      bc       cd
                                                                              Figure:
                                                     →1      →2       →2      Top-k
Heap-based
                                                     20       3        4      buffer,
     Traverse the lists in a sorted                  21       4        5      k =1
     order using a heap on the top                     .
                                                       .       .
                                                               .        .
                                                                        .
     IDs of the lists                                  .       .        .
                                                              19       19
     Advantages:                                                                    2
                                                              20       20
         1   No need to maintain the                           .        .           2
             list of “open” IDs                                .
                                                               .        .
                                                                        .
         2   Skip elements
                                                 Figure: Gram                 Figure:
                                                 inverted-lists for query     Min-
                                                 “abcd”                       heap


  Rares Vernica (UC Irvine)    Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm


                                                                                      1
                                                      ab       bc       cd
                                                                                Figure:
                                                       1       →2       →2      Top-k
Heap-based
                                                     →20        3        4      buffer,
     Traverse the lists in a sorted                   21        4        5      k =1
     order using a heap on the top                      .
                                                        .        .
                                                                 .        .
                                                                          .
     IDs of the lists                                   .        .        .
                                                                19       19
     Advantages:                                                                   2
                                                                20       20
         1   No need to maintain the list                        .        .        2
             of “open” IDs                                       .
                                                                 .        .
                                                                          .       20
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm


                                                                                      1
                                                      ab       bc       cd
                                                                                Figure:
                                                       1       →2       →2      Top-k
Heap-based
                                                     →20        3        4      buffer,
     Traverse the lists in a sorted                   21        4        5      k =1
     order using a heap on the top                      .
                                                        .        .
                                                                 .        .
                                                                          .
     IDs of the lists                                   .        .        .
                                                                19       19
     Advantages:                                                                   2
                                                                20       20
         1   No need to maintain the list                        .        .        2
             of “open” IDs                                       .
                                                                 .        .
                                                                          .       20
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm


                                                                                      2
                                                      ab       bc       cd
                                                                                Figure:
                                                       1       →2       →2      Top-k
Heap-based
                                                     →20        3        4      buffer,
     Traverse the lists in a sorted                   21        4        5      k =1
     order using a heap on the top                      .
                                                        .        .
                                                                 .        .
                                                                          .
     IDs of the lists                                   .        .        .
                                                                19       19
     Advantages:                                                                  20
                                                                20       20
         1   No need to maintain the list                        .        .
             of “open” IDs                                       .
                                                                 .        .
                                                                          .
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm


                                                                                      2
                                                      ab       bc       cd
                                                                                Figure:
                                                       1        2        2      Top-k
Heap-based
                                                     →20       →3       →4      buffer,
     Traverse the lists in a sorted                   21        4        5      k =1
     order using a heap on the top                      .
                                                        .        .
                                                                 .        .
                                                                          .
     IDs of the lists                                   .        .        .
                                                                19       19
     Advantages:                                                                   3
                                                                20       20
         1   No need to maintain the list                        .        .        4
             of “open” IDs                                       .
                                                                 .        .
                                                                          .       20
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm


                                                                                      2
                                                      ab       bc       cd
                                                                                Figure:
                                                       1        2        2      Top-k
Heap-based
                                                     →20       →3       →4      buffer,
     Traverse the lists in a sorted                   21        4        5      k =1
     order using a heap on the top                      .
                                                        .        .
                                                                 .        .
                                                                          .
     IDs of the lists                                   .        .        .
                                                                19       19
     Advantages:                                                                   3
                                                                20       20
         1   No need to maintain the list                        .        .        4
             of “open” IDs                                       .
                                                                 .        .
                                                                          .       20
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm


                                                                                      2
                                                      ab       bc       cd
                                                                                Figure:
                                                       1        2        2      Top-k
Heap-based
                                                     →20       →3       →4      buffer,
     Traverse the lists in a sorted                   21        4        5      k =1
     order using a heap on the top                      .
                                                        .        .
                                                                 .        .
                                                                          .
     IDs of the lists                                   .        .        .
                                                                19       19
     Advantages:                                                                   4
                                                                20       20
         1   No need to maintain the list                        .        .       20
             of “open” IDs                                       .
                                                                 .        .
                                                                          .
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm


                                                                                      2
                                                      ab       bc       cd
                                                                                Figure:
                                                       1        2        2      Top-k
Heap-based
                                                     →20       →3       →4      buffer,
     Traverse the lists in a sorted                   21        4        5      k =1
     order using a heap on the top                      .
                                                        .        .
                                                                 .        .
                                                                          .
     IDs of the lists                                   .        .        .
                                                                19       19
     Advantages:                                                                   4
                                                                20       20
         1   No need to maintain the list                        .        .       20
             of “open” IDs                                       .
                                                                 .        .
                                                                          .
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm


                                                                                      2
                                                      ab       bc       cd
                                                                                Figure:
                                                       1        2        2      Top-k
Heap-based
                                                     →20       →3       →4      buffer,
     Traverse the lists in a sorted                   21        4        5      k =1
     order using a heap on the top                      .
                                                        .        .
                                                                 .        .
                                                                          .
     IDs of the lists                                   .        .        .
                                                                19       19
     Advantages:                                                                  20
                                                                20       20
         1   No need to maintain the list                        .        .
             of “open” IDs                                       .
                                                                 .        .
                                                                          .
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Top-k Single-pass Search Algorithm


                                                                                      2
                                                      ab        bc       cd
                                                                                Figure:
                                                       1         2        2     Top-k
Heap-based
                                                     →20         3        4     buffer,
     Traverse the lists in a sorted                   21         4        5     k =1
     order using a heap on the top                      .
                                                        .         .
                                                                  .        .
                                                                           .
     IDs of the lists                                   .         .        .
                                                              19       19
     Advantages:                                                                  20
                                                             →20      →20
         1   No need to maintain the list                      .        .
             of “open” IDs                                     .
                                                               .        .
                                                                        .
         2   Skip elements
                                                   Figure: Gram                 Figure:
                                                   inverted-lists for query     Min-
                                                   “abcd”                       heap


  Rares Vernica (UC Irvine)      Top-k Fuzzy String Search                KEYS 2009       10 / 17
Algorithms Overview




                                                             Range
                                                             Search
          Range
          Search



      Iterative Range            Single-Pass
          Search                   Search                 Two-Phase Search




  Rares Vernica (UC Irvine)   Top-k Fuzzy String Search           KEYS 2009   11 / 17
Experimental Setting

       Datasets:
            IMDB Actor Names1
                    actor names and the number of movies they played in
                    1.2 million actors, average name length 15
                    weight is the number of movies (log normalized)
            WEB Corpus Word Grams2
                    sequences of English words and their frequency on the Web
                    2.4 million sequences, average sequence length 20
                    weight is the frequency (log normalized)
       Jaccard similarity and normalized edit similarity, q = 3
       Index and data are stored in main memory at all times
       Implemented in C++ (GNU compiler) on Ubuntu Linux OS
       Intel 2.40GHz PC, 2GB RAM

  1
      http://www.imdb.com/interfaces
  2
      http://www.ldc.upenn.edu/Catalog LDC2006T13
  Rares Vernica (UC Irvine)         Top-k Fuzzy String Search             KEYS 2009   12 / 17
Benefits of Skipping Elements


             50
                          SPS
             40           SPS*
                                                                      Average running time for
 Time (ms)




             30                                                       top-10 queries. IMDB
                                                                      dataset with Jaccard
                                                                      similarity. Single-Pass
             20
                                                                      Search (SPS) algorithm
                                                                      and SPS without
             10                                                       skipping (SPS*).

             0
              0.2         0.4       0.6   0.8      1.0         1.2
                        Dataset Size (millions)
        Rares Vernica (UC Irvine)         Top-k Fuzzy String Search              KEYS 2009   13 / 17
Potential of the Two-Phase Algorithm


             40


             30                                                       Running time for 3
 Time (ms)




                                                                      top-10 queries. WEB
                                                                      Corpus dataset with
             20                                                       normalized edit
                                                                      similarity. Two-Phase
                                                                      algorithm with different
             10
                                                                      initial thresholds.

             0          Q1           Q2                Q3
                                    Queries
        Rares Vernica (UC Irvine)         Top-k Fuzzy String Search               KEYS 2009   14 / 17
Optimum Initial Threshold for the Two-Phase Algorithm


             100                                                       Average running time for
                                                                       top-10 queries. Web
              80                                                       Corpus dataset with
                                                                       normalized edit
 Time (ms)




              60                                                       similarity. Single-Pass
                                                                       Search (SPS) algorithm,
                                                                       Two-Phase (2PH)
              40                                                       algorithm, 2PH with the
                                                 SPS                   optimum initial threshold
              20                                 2PH                   (2PH Opt).
                                                 2PH Opt
                                                                       The Iterative Range Search
              0                                                        algorithm was to expensive to be
                                                                       plotted. The average running time
               0.4          0.8     1.2   1.6         2.0        2.4   was around 5s.
                           Dataset Size (millions)
        Rares Vernica (UC Irvine)         Top-k Fuzzy String Search                   KEYS 2009    15 / 17
Summary



   Approximate ranking queries in string collections
   Useful when mismatch between query and data
   Propose three approaches to solve the problem:
      1    Use existing approximate range search algorithms as a “black box”
           Proves to be the most expensive
      2    Use particularities of the top-k problem
           Proves to be very efficient
      3    Combine (1) and (2) sequentially
           Proves to be slightly more efficient




 Rares Vernica (UC Irvine)     Top-k Fuzzy String Search        KEYS 2009   16 / 17
The Flamingo Project




                                  This work is part of
                           The Flamingo Project at UC Irvine
                          http://flamingo.ics.uci.edu



  Rares Vernica (UC Irvine)         Top-k Fuzzy String Search   KEYS 2009   17 / 17
A Quick Note on Related Work


Fagin et. al [1]                               Our Setting
     similarity on multiple                           similarity on one string
     numerical attributes                             attribute
     traverse list of IDs                             traverse list of IDs
     one list per attribute                           one list per q-gram
     lists sorted on similarity to                    lists sorted on global weight
     that attribute
     lists have different orders of                   lists have the same order of
     IDs                                              IDs
     all IDs appear on all the lists                  a subset of IDs appear on
                                                      each list

[1] R.Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware.
In PODS, 2001
   Rares Vernica (UC Irvine)     Top-k Fuzzy String Search               KEYS 2009   17 / 17

Más contenido relacionado

Destacado

Business Plan Powerpoint 1
Business Plan Powerpoint 1Business Plan Powerpoint 1
Business Plan Powerpoint 1
haleydawn
 
Sample Business Proposal Presentation
Sample Business Proposal PresentationSample Business Proposal Presentation
Sample Business Proposal Presentation
Daryll Cabagay
 

Destacado (8)

Semantic Search for Sourcing and Recruiting
Semantic Search for Sourcing and RecruitingSemantic Search for Sourcing and Recruiting
Semantic Search for Sourcing and Recruiting
 
Talent Sourcing and Matching - Artificial Intelligence and Black Box Semantic...
Talent Sourcing and Matching - Artificial Intelligence and Black Box Semantic...Talent Sourcing and Matching - Artificial Intelligence and Black Box Semantic...
Talent Sourcing and Matching - Artificial Intelligence and Black Box Semantic...
 
Using the Web of Data for Information Extraction
Using the Web of Data for Information ExtractionUsing the Web of Data for Information Extraction
Using the Web of Data for Information Extraction
 
Full Text Search In PostgreSQL
Full Text Search In PostgreSQLFull Text Search In PostgreSQL
Full Text Search In PostgreSQL
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Branding ppt
Branding pptBranding ppt
Branding ppt
 
Business Plan Powerpoint 1
Business Plan Powerpoint 1Business Plan Powerpoint 1
Business Plan Powerpoint 1
 
Sample Business Proposal Presentation
Sample Business Proposal PresentationSample Business Proposal Presentation
Sample Business Proposal Presentation
 

Último

Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 

Último (20)

How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 

Efficient Top-k Algorithms for Fuzzy Search in String Collections

  • 1. Efficient Top-k Algorithms for Fuzzy Search in String Collections Rares Vernica Chen Li Department of Computer Science University of California, Irvine First International Workshop on Keyword Search on Structured Data, 2009 Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 1 / 17
  • 2. Outline 1 Motivation 2 Efficient Top-k Algorithms Problem Formulation Algorithms Overview Top-k Single-Pass Search Algorithm 3 Experimental Evaluation Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 2 / 17
  • 3. Need for Approximate String Queries ID FirstName LastName # Movies 10 Al Swartzberg 1 11 Hanna Wartenegg 1 12 Rik Swartzwelder 30 13 Joey Swartzentruber 1 14 Rene Swartenbroekx 4 15 Arnold Schwarzenegger 283 16 Luc Swartenbroeckx 1 . . . . . . . . . Figure: Actor names and popularities Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 3 / 17
  • 4. Need for Approximate String Queries ID FirstName LastName # Movies 10 Al Swartzberg 1 11 Hanna Wartenegg 1 12 Rik Swartzwelder 30 13 Joey Swartzentruber 1 14 Rene Swartenbroekx 4 15 Arnold Schwarzenegger 283 16 Luc Swartenbroeckx 1 . . . . . . . . . Figure: Actor names and popularities SELECT * FROM Actors WHERE LastName = ’Shwartzenetrugger’ ORDER BY ’# Movies’ DESC LIMIT 1; Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 3 / 17
  • 5. Need for Approximate String Queries ID FirstName LastName # Movies 10 Al Swartzberg 1 11 Hanna Wartenegg 1 12 Rik Swartzwelder 30 13 Joey Swartzentruber 1 14 Rene Swartenbroekx 4 15 Arnold Schwarzenegger 283 16 Luc Swartenbroeckx 1 . . . . . . . . . Figure: Actor names and popularities SELECT * FROM Actors WHERE LastName = ’Shwartzenetrugger’ ORDER BY ’# Movies’ DESC LIMIT 1; 0 Results Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 3 / 17
  • 6. Need for Ranking ID FirstName LastName # Movies ED 10 Al Swartzberg 1 8 11 Hanna Wartenegg 1 8 12 Rik Swartzwelder 30 8 13 Joey Swartzentruber 1 4 14 Rene Swartenbroekx 4 9 15 Arnold Schwarzenegger 283 5 16 Luc Swartenbroeckx 1 9 . . . . . . . . . . . . Figure: Actor names, popularities, and edit distances to a query string “Shwartzenetrugger”. Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17
  • 7. Need for Ranking ID FirstName LastName # Movies ED 10 Al Swartzberg 1 8 11 Hanna Wartenegg 1 8 12 Rik Swartzwelder 30 8 13 Joey Swartzentruber 1 4 14 Rene Swartenbroekx 4 9 15 Arnold Schwarzenegger 283 5 16 Luc Swartenbroeckx 1 9 . . . . . . . . . . . . Figure: Actor names, popularities, and edit distances to a query string “Shwartzenetrugger”. Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17
  • 8. Need for Ranking ID FirstName LastName # Movies ED 10 Al Swartzberg 1 8 11 Hanna Wartenegg 1 8 12 Rik Swartzwelder 30 8 13 Joey Swartzentruber 1 4 14 Rene Swartenbroekx 4 9 15 Arnold Schwarzenegger 283 5 16 Luc Swartenbroeckx 1 9 . . . . . . . . . . . . Figure: Actor names, popularities, and edit distances to a query string “Shwartzenetrugger”. Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17
  • 9. Need for Ranking ID FirstName LastName # Movies ED 10 Al Swartzberg 1 8 11 Hanna Wartenegg 1 8 12 Rik Swartzwelder 30 8 13 Joey Swartzentruber 1 4 14 Rene Swartenbroekx 4 9 15 Arnold Schwarzenegger 283 5 16 Luc Swartenbroeckx 1 9 . . . . . . . . . . . . Figure: Actor names, popularities, and edit distances to a query string “Shwartzenetrugger”. Which one result should the system return? Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17
  • 10. Need for Ranking ID FirstName LastName # Movies ED 10 Al Swartzberg 1 8 11 Hanna Wartenegg 1 8 12 Rik Swartzwelder 30 8 13 Joey Swartzentruber 1 4 14 Rene Swartenbroekx 4 9 15 Arnold Schwarzenegger 283 5 16 Luc Swartenbroeckx 1 9 . . . . . . . . . . . . Figure: Actor names, popularities, and edit distances to a query string “Shwartzenetrugger”. Which one result should the system return? Which value is more important, # Movies or similarity? Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 4 / 17
  • 11. Top-k Similar Strings Given: Weighted string collection e.g., actors’ LastName and # Movies Query string e.g., “Shwartzenetrugger” Similarity function e.g, Edit Distance Scoring function (score of a string in terms of similarity and weight) e.g., linear combination of similarity and popularity Integer k Return: k best strings in terms of overall score to the query string. Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 5 / 17
  • 12. Top-k Similar Strings Given: Weighted string collection e.g., actors’ LastName and # Movies Query string e.g., “Shwartzenetrugger” Similarity function e.g, Edit Distance Scoring function (score of a string in terms of similarity and weight) e.g., linear combination of similarity and popularity Integer k Return: k best strings in terms of overall score to the query string. Advantages over Range Search: specify k instead of a similarity threshold guaranteed k results; a range search might have 0 results Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 5 / 17
  • 13. Algorithms Overview Range Search Range Search Iterative Range Single-Pass Search Search Two-Phase Search Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 6 / 17
  • 14. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2 Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  • 15. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2 Vernica Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  • 16. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2 Vernica → {Ve Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  • 17. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2 Vernica → {Ve,er Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  • 18. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2 Vernica → {Ve,er,rn Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  • 19. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2 Vernica → {Ve,er,rn,ni Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  • 20. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2 Vernica → {Ve,er,rn,ni,ic Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  • 21. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2 Vernica → {Ve,er,rn,ni,ic,ca} Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  • 22. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2 Vernica → {Ve,er,rn,ni,ic,ca} Veronica → {Ve,er,ro,on,ni,ic,ca} Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  • 23. q-grams: Overlapping substrings of fixed length Find similar strings: e.g., “Vernica” and “Veronica” q-gram: substring of length q of a string: e.g., q = 2 Vernica → {Ve,er,rn,ni,ic,ca} Veronica → {Ve,er,ro,on,ni,ic,ca} Similar strings share a certain number of grams Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 7 / 17
  • 24. q-gram Inverted List Index q=2 ID String Weight 1 ab 0.80 2 ccd 0.70 3 cd 0.60 4 abcd 0.50 5 bcc 0.40 Figure: Dataset Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17
  • 25. q-gram Inverted List Index q=2 ID String Weight 1 ab 0.80 ab cc cd bc 2 3 ccd cd 0.70 0.60 ⇒ 1 4 2 5 2 3 4 5 4 abcd 0.50 4 5 bcc 0.40 Figure: Gram inverted-list index Figure: Dataset Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17
  • 26. q-gram Inverted List Index q=2 ID String Weight 1 ab 0.80 ab cc cd bc 2 ccd 0.70 1 2 2 4 3 cd 0.60 4 5 3 5 4 abcd 0.50 4 5 bcc 0.40 Figure: Gram inverted-list index Figure: Dataset Query string: “bcd” Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17
  • 27. q-gram Inverted List Index q=2 ID String Weight 1 ab 0.80 ab cc cd bc 2 ccd 0.70 1 2 2 4 3 cd 0.60 4 5 3 5 4 abcd 0.50 4 5 bcc 0.40 Figure: Gram inverted-list index Figure: Dataset Query string: “bcd” Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17
  • 28. q-gram Inverted List Index q=2 ID String Weight 1 ab 0.80 ab cc cd bc 2 3 ccd cd 0.70 0.60 ⇐ 1 4 2 5 2 3 4 5 4 abcd 0.50 4 5 bcc 0.40 Figure: Gram inverted-list index Figure: Dataset Query string: “bcd” Identified strings are verified by computing the real similarity. Verification is usually an expensive process. Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 8 / 17
  • 29. Top-k Single-pass Search Algorithm ID String Weight 1 ab 0.80 2 ccd 0.70 3 cd 0.60 4 abcd 0.50 5 bcc 0.40 Figure: Dataset ab cc cd bc 1 2 2 4 4 5 3 5 4 Figure: Gram inverted-list index Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17
  • 30. Top-k Single-pass Search Algorithm ID String Weight 1 ab 0.80 2 ccd 0.70 Setup 3 cd 0.60 Assign IDs s.t. 4 abcd 0.50 ascending order of IDs ≡ 5 bcc 0.40 descending order of weights Figure: Dataset ab cc cd bc 1 2 2 4 4 5 3 5 4 Figure: Gram inverted-list index Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17
  • 31. Top-k Single-pass Search Algorithm ID String Weight 1 ab 0.80 2 ccd 0.70 Setup 3 cd 0.60 Assign IDs s.t. 4 abcd 0.50 ascending order of IDs ≡ 5 bcc 0.40 descending order of weights Figure: Dataset Sort the IDs on each list in ascending order ab cc cd bc 1 2 2 4 4 5 3 5 4 Figure: Gram inverted-list index Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17
  • 32. Top-k Single-pass Search Algorithm ID String Weight 1 ab 0.80 2 ccd 0.70 Setup 3 cd 0.60 Assign IDs s.t. 4 abcd 0.50 ascending order of IDs ≡ 5 bcc 0.40 descending order of weights Figure: Dataset Sort the IDs on each list in ascending order Scan the lists corresponding to the ab cc cd bc grams in the query. 1 2 2 4 e.g., “bcd” 4 5 3 5 4 Figure: Gram inverted-list index Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17
  • 33. Top-k Single-pass Search Algorithm ID String Weight 1 ab 0.80 2 ccd 0.70 3 cd 0.60 Naïve approach: Round-Robin 4 abcd 0.50 Scan all the lists in the same time 5 bcc 0.40 Maintain a list of “open” IDs (might Figure: Dataset still appear on some of the lists) Store the best k “closed” IDs in a top-k buffer ab cc cd bc Stop when the top-k buffer cannot 1 2 2 4 improve 4 5 3 5 4 Figure: Gram inverted-list index Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 9 / 17
  • 34. Top-k Single-pass Search Algorithm ab bc cd 1 2 2 Heap-based 20 3 4 Traverse the lists in a sorted 21 4 5 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 20 20 1 No need to maintain the list . . of “open” IDs . . . . 2 Skip elements Figure: Gram inverted-lists for query “abcd” Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 35. Top-k Single-pass Search Algorithm ab bc cd →1 2 2 Heap-based 20 3 4 Traverse the lists in a sorted 21 4 5 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 1 20 20 1 No need to maintain the list . . of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 36. Top-k Single-pass Search Algorithm ab bc cd →1 →2 2 Heap-based 20 3 4 Traverse the lists in a sorted 21 4 5 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 1 20 20 1 No need to maintain the list . . 2 of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 37. Top-k Single-pass Search Algorithm ab bc cd →1 →2 →2 Heap-based 20 3 4 Traverse the lists in a sorted 21 4 5 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 1 20 20 1 No need to maintain the list . . 2 of “open” IDs . . . . 2 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 38. Top-k Single-pass Search Algorithm ab bc cd →1 →2 →2 Heap-based 20 3 4 Traverse the lists in a sorted 21 4 5 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 1 20 20 1 No need to maintain the list . . 2 of “open” IDs . . . . 2 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 39. Top-k Single-pass Search Algorithm 1 ab bc cd Figure: →1 →2 →2 Top-k Heap-based 20 3 4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 2 20 20 1 No need to maintain the . . 2 list of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 40. Top-k Single-pass Search Algorithm 1 ab bc cd Figure: 1 →2 →2 Top-k Heap-based →20 3 4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 2 20 20 1 No need to maintain the list . . 2 of “open” IDs . . . . 20 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 41. Top-k Single-pass Search Algorithm 1 ab bc cd Figure: 1 →2 →2 Top-k Heap-based →20 3 4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 2 20 20 1 No need to maintain the list . . 2 of “open” IDs . . . . 20 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 42. Top-k Single-pass Search Algorithm 2 ab bc cd Figure: 1 →2 →2 Top-k Heap-based →20 3 4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 20 20 20 1 No need to maintain the list . . of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 43. Top-k Single-pass Search Algorithm 2 ab bc cd Figure: 1 2 2 Top-k Heap-based →20 →3 →4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 3 20 20 1 No need to maintain the list . . 4 of “open” IDs . . . . 20 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 44. Top-k Single-pass Search Algorithm 2 ab bc cd Figure: 1 2 2 Top-k Heap-based →20 →3 →4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 3 20 20 1 No need to maintain the list . . 4 of “open” IDs . . . . 20 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 45. Top-k Single-pass Search Algorithm 2 ab bc cd Figure: 1 2 2 Top-k Heap-based →20 →3 →4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 4 20 20 1 No need to maintain the list . . 20 of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 46. Top-k Single-pass Search Algorithm 2 ab bc cd Figure: 1 2 2 Top-k Heap-based →20 →3 →4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 4 20 20 1 No need to maintain the list . . 20 of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 47. Top-k Single-pass Search Algorithm 2 ab bc cd Figure: 1 2 2 Top-k Heap-based →20 →3 →4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 20 20 20 1 No need to maintain the list . . of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 48. Top-k Single-pass Search Algorithm 2 ab bc cd Figure: 1 2 2 Top-k Heap-based →20 3 4 buffer, Traverse the lists in a sorted 21 4 5 k =1 order using a heap on the top . . . . . . IDs of the lists . . . 19 19 Advantages: 20 →20 →20 1 No need to maintain the list . . of “open” IDs . . . . 2 Skip elements Figure: Gram Figure: inverted-lists for query Min- “abcd” heap Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 10 / 17
  • 49. Algorithms Overview Range Search Range Search Iterative Range Single-Pass Search Search Two-Phase Search Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 11 / 17
  • 50. Experimental Setting Datasets: IMDB Actor Names1 actor names and the number of movies they played in 1.2 million actors, average name length 15 weight is the number of movies (log normalized) WEB Corpus Word Grams2 sequences of English words and their frequency on the Web 2.4 million sequences, average sequence length 20 weight is the frequency (log normalized) Jaccard similarity and normalized edit similarity, q = 3 Index and data are stored in main memory at all times Implemented in C++ (GNU compiler) on Ubuntu Linux OS Intel 2.40GHz PC, 2GB RAM 1 http://www.imdb.com/interfaces 2 http://www.ldc.upenn.edu/Catalog LDC2006T13 Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 12 / 17
  • 51. Benefits of Skipping Elements 50 SPS 40 SPS* Average running time for Time (ms) 30 top-10 queries. IMDB dataset with Jaccard similarity. Single-Pass 20 Search (SPS) algorithm and SPS without 10 skipping (SPS*). 0 0.2 0.4 0.6 0.8 1.0 1.2 Dataset Size (millions) Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 13 / 17
  • 52. Potential of the Two-Phase Algorithm 40 30 Running time for 3 Time (ms) top-10 queries. WEB Corpus dataset with 20 normalized edit similarity. Two-Phase algorithm with different 10 initial thresholds. 0 Q1 Q2 Q3 Queries Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 14 / 17
  • 53. Optimum Initial Threshold for the Two-Phase Algorithm 100 Average running time for top-10 queries. Web 80 Corpus dataset with normalized edit Time (ms) 60 similarity. Single-Pass Search (SPS) algorithm, Two-Phase (2PH) 40 algorithm, 2PH with the SPS optimum initial threshold 20 2PH (2PH Opt). 2PH Opt The Iterative Range Search 0 algorithm was to expensive to be plotted. The average running time 0.4 0.8 1.2 1.6 2.0 2.4 was around 5s. Dataset Size (millions) Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 15 / 17
  • 54. Summary Approximate ranking queries in string collections Useful when mismatch between query and data Propose three approaches to solve the problem: 1 Use existing approximate range search algorithms as a “black box” Proves to be the most expensive 2 Use particularities of the top-k problem Proves to be very efficient 3 Combine (1) and (2) sequentially Proves to be slightly more efficient Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 16 / 17
  • 55. The Flamingo Project This work is part of The Flamingo Project at UC Irvine http://flamingo.ics.uci.edu Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 17 / 17
  • 56. A Quick Note on Related Work Fagin et. al [1] Our Setting similarity on multiple similarity on one string numerical attributes attribute traverse list of IDs traverse list of IDs one list per attribute one list per q-gram lists sorted on similarity to lists sorted on global weight that attribute lists have different orders of lists have the same order of IDs IDs all IDs appear on all the lists a subset of IDs appear on each list [1] R.Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware. In PODS, 2001 Rares Vernica (UC Irvine) Top-k Fuzzy String Search KEYS 2009 17 / 17