Lorenzo Alberton
                             @lorenzoalberton




“Modern” Algorithms
 and Data Structures
                               Part 1
          Bloom Filters, Merkle Trees




    Cassandra-London, Monday 18th April 2011
                                                1
Bloom Filters
                        Burton Howard Bloom, 1970




http://portal.acm.org/citation.cfm?doid=362686.362692   2
Bloom Filter


          Space-efficient
           probabilistic
          data structure
           used to test
         set membership
         http://en.wikipedia.org/wiki/Bloom_filter   3
Bloom Filter
Space-efficient probabilistic data structure that is used to test
whether an element is a member of a set

                Hash Table ⇒ chance of collision

                     hash(x)               hash(y)




         False positives are possible, false negatives are not.
It might be beneficial to build an exception list of known false positives.
                                                                         4
Bloom Filter
Space-efficient probabilistic data structure that is used to test
whether an element is a member of a set


                     Not a Key-Value store


                  Array of bits indicating the
                presence of a key in the filter

    Removing an element from the filter is not possible (*)

                                                                     5
Bloom Filter: Add & Query
m bits (initially set to 0)
k hash functions

  Add: for each hash function f, if f(x) = A, set S[A] = 1

  [Figure: elements x and y are each hashed by f, g and h into the bit array S (positions 0 ... m-1); the bits at those positions are set to 1]

  Query: compute f(z), g(z), h(z); one bit set to 0 ⇒ z ∉ S
                                                                      6
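The Add and Query steps can be sketched in a few lines of Python. This is a minimal illustration, not Cassandra's implementation: the class name `BloomFilter`, the use of SHA-256 and the salting scheme are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: m bits and k salted SHA-256 hashes (illustrative choices)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # S: m bits, initially all 0

    def _indices(self, item: str):
        # Derive k positions by salting one hash function with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for a in self._indices(item):
            self.bits[a] = 1  # if f(x) = A, set S[A] = 1

    def __contains__(self, item: str) -> bool:
        # A single 0 bit proves the item was never added.
        return all(self.bits[a] for a in self._indices(item))


bf = BloomFilter(m=64, k=3)
bf.add("x")
bf.add("y")
print("x" in bf)  # True: added elements are always reported as present
print("z" in bf)  # usually False; True here would be a false positive
```

A `True` answer only means "possibly present", so a caller would still confirm it against the real data.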
Bloom Filter: Hash Functions
k hash functions: uniform random distribution in [0, m)

  k different hash functions

  The same hash function with different salts

  Double or triple hashing [1]; for double hashing: $g_i(x) = \bigl(h_1(x) + i\,h_2(x)\bigr) \bmod m$

  2 hash functions can mimic k hashing functions

  Hash function families:
     ‣ Cryptographic Hash Functions
       (MD5, SHA-1, SHA-256, Tiger, Whirlpool ...)
     ‣ Murmur Hashes
       http://code.google.com/p/smhasher/

  [1] Dillinger, Peter C.; Manolios, Panagiotis (2004b), "Bloom Filters in Probabilistic Verification",
      http://www.ccs.neu.edu/home/pete/pub/bloom-filters-verification.pdf

                                 http://www.strchr.com/hash_functions                                      7
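Double hashing can be sketched as follows. The function name `double_hash_indices` is hypothetical, and deriving both base hashes from two halves of one SHA-256 digest is an assumption made to keep the example short; any two reasonably independent hash functions would serve.

```python
import hashlib
from typing import List


def double_hash_indices(item: str, k: int, m: int) -> List[int]:
    """Return k bit positions g_i(x) = (h1(x) + i * h2(x)) mod m, for i = 0..k-1."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")    # first base hash
    h2 = int.from_bytes(digest[8:16], "big")  # second base hash
    return [(h1 + i * h2) % m for i in range(k)]


print(double_hash_indices("x", k=7, m=1000))
```

Only one digest is computed per element, however large k is, which is the practical appeal of the technique.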
Bloom Filter: Usage

  Typical roles:
     ‣ Guard against expensive operations (like disk access)
     ‣ First line of defence in high performance (distributed) caches
     ‣ Peer to Peer communication
     ‣ Routing - Resource Location

  Used in: Squid Proxy Cache, Google BigTable, Cassandra, HBase,
           various RDBMSs, Google Chrome, Cisco Routers, ...
                                                                                            8
Bloom Filter: Usage in Cassandra



        Used to save I/O during key look-ups
           (check for non-existent keys)

           One bloom filter per SSTable.



  org.apache.cassandra.utils.BloomFilter

                                               9
Bloom Filter: False Positive Rate




         m = number of bits in the filter
         n = number of elements
         k = number of hashing functions




         http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html   10
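With m, n and k as defined here, and assuming the k hash functions are independent and uniform, the standard approximation for the false positive probability p, together with the value of k that minimises it, is:

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
k_{\text{opt}} = \frac{m}{n}\,\ln 2
```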
Bloom Filter: False Positive Rate


           A bloom filter with an optimal value for k
      and a 1% error rate needs only 9.6 bits per key.
   Each additional 4.8 bits per key reduces the error rate tenfold.

   10,000 words, 1% error rate:    7 hash functions, ~12 KB of memory
   10,000 words, 0.1% error rate: 11 hash functions, ~18 KB of memory
             http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/   11
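These figures can be checked with a short calculation. The sketch below assumes the standard sizing formulas for an optimally tuned filter, m/n = -ln(p) / (ln 2)^2 and k = (m/n) ln 2; the helper names `bits_per_key` and `optimal_k` are made up for the example.

```python
import math


def bits_per_key(p: float) -> float:
    """Bits per element needed for false positive rate p, using the optimal k."""
    return -math.log(p) / (math.log(2) ** 2)


def optimal_k(p: float) -> float:
    """Real-valued optimal number of hash functions for false positive rate p."""
    return bits_per_key(p) * math.log(2)


n = 10_000  # number of words
for p in (0.01, 0.001):
    m = math.ceil(n * bits_per_key(p))  # total bits in the filter
    kilobytes = m / 8 / 1000
    print(f"p={p}: {bits_per_key(p):.1f} bits/key, "
          f"k ~ {optimal_k(p):.2f}, {kilobytes:.1f} KB")
```

The optimum for k is real-valued and has to be rounded to an integer in practice. For the 0.1% case the formula gives a value just under 10, slightly below the 11 quoted from the linked article.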
Bloom Filter: False Positive Rate
    [Figure: false positive probability (y-axis) plotted against bloom filter size (n) (x-axis)]
                                  http://en.wikipedia.org/wiki/Bloom_filter   12
Counting Bloom Filter
 Can handle deletions
 Use counters instead of 0/1s
 When adding an element, increment the counters
 When deleting an element, decrement the counters
 Counters must be large enough to avoid overflow (4 bits)
 [Figure: x and y are each hashed by f, g and h into the counter array S; a cell hit twice holds 2]
S 1 0 0 0 1 0 0 0 2 0 0 0 1 0 1
                                                           13
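A counting variant can be sketched like this. The class name `CountingBloomFilter`, the salted SHA-256 hashing and the choice to saturate at 15 (the largest 4-bit value) rather than overflow are assumptions for illustration.

```python
import hashlib


class CountingBloomFilter:
    """Toy counting Bloom filter with saturating counters."""

    def __init__(self, m: int, k: int, max_count: int = 15) -> None:
        self.m = m
        self.k = k
        self.max_count = max_count  # 15 is the largest value a 4-bit counter holds
        self.counters = [0] * m

    def _indices(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for a in self._indices(item):
            if self.counters[a] < self.max_count:  # saturate instead of overflowing
                self.counters[a] += 1

    def remove(self, item: str) -> None:
        # Only call this for items that were really added; removing a
        # never-added item can cause false negatives for other items.
        for a in self._indices(item):
            if self.counters[a] > 0:
                self.counters[a] -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[a] > 0 for a in self._indices(item))


cbf = CountingBloomFilter(m=64, k=3)
cbf.add("x")
cbf.add("y")
cbf.remove("x")
print("y" in cbf)  # True: removing x leaves y's counters above zero
```

The price of supporting deletion is space: each cell takes several bits instead of one.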
Stable (Time-Based) Bloom Filter
  Input Stream → Duplicate Filter (1 0 0 0 1 0 0 0 1 0) → Output Stream

  Before each insertion, P random cells are decremented by one.
  The k cells for the new value xi are set to Max (usually < 7).
  http://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf

  Alternatively, set an expiry time for each cell, with a TTL
  dependent on the volume of data.
  http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/

                                                                                     14
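The decrement-then-set rule can be sketched as below. The class and method names, the salted SHA-256 hashing, the parameter values and the seeded random generator are all assumptions for the example; the check for a duplicate is done before the cells are aged.

```python
import hashlib
import random


class StableBloomFilter:
    """Toy stable Bloom filter for duplicate detection in a stream."""

    def __init__(self, m: int, k: int, p: int, max_value: int = 3, seed: int = 0) -> None:
        self.m, self.k, self.p, self.max_value = m, k, p, max_value
        self.cells = [0] * m
        self.rng = random.Random(seed)  # seeded only to make the sketch repeatable

    def _indices(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def seen_then_add(self, item: str) -> bool:
        """Report whether item looks like a duplicate, then record it."""
        indices = list(self._indices(item))
        duplicate = all(self.cells[a] > 0 for a in indices)
        # Age the filter: decrement P randomly chosen cells by one.
        for _ in range(self.p):
            j = self.rng.randrange(self.m)
            if self.cells[j] > 0:
                self.cells[j] -= 1
        # Record the new value: set its k cells to Max.
        for a in indices:
            self.cells[a] = self.max_value
        return duplicate


sbf = StableBloomFilter(m=128, k=3, p=4)
print(sbf.seen_then_add("a"))  # False: the filter starts empty
print(sbf.seen_then_add("a"))  # True: the cells for "a" were just set to Max
```

Because old cells decay, an element seen long ago may be reported as new, so this variant accepts false negatives as well as false positives.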
Bloom Filters: Further reading
Compressed Bloom Filters
Improve performance when the Bloom filter is passed as a message,
and its transmission size is a limiting factor.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.3346

Retouched Bloom Filters
Allow networked applications to trade off selected false positives
against false negatives
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.172.8453

Bloomier Filters
Extended to handle approximate functions (each element of the set
has an associated function value)
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.4154
http://arxiv.org/abs/0807.0928

Attenuated B.F., Spectral B.F., Distance-Sensitive B.F. ...
                                                                                            15
Merkle Trees
                                Ralph C. Merkle, 1979




http://www.springerlink.com/content/q865hwxq73ex1am9/   16
Merkle Trees (Hash Trees)


   Data Structure containing a
  tree of summary information
   about a larger piece of data
      to verify its contents


          http://en.wikipedia.org/wiki/Hash_Tree   17
Merkle Trees (Hash Trees)
  Leaves: hashes of data blocks.
  Nodes: hashes of their children.

  Used to detect inconsistencies between replicas (anti-entropy)
  and to minimise the amount of transferred data.

  ROOT = hash(A, B)
    A = hash(C, D)
      C = hash(001)   ← Data Block 001
      D = hash(002)   ← Data Block 002
    B = hash(E, F)
      E = hash(003)   ← Data Block 003
      F = hash(004)   ← Data Block 004
                                                                                              18
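Computing the root of such a tree can be sketched as follows. The function names are hypothetical, SHA-256 is an arbitrary choice for the example, and duplicating the last hash on odd-sized levels is one common convention rather than anything the slides specify.

```python
import hashlib
from typing import List


def h(data: bytes) -> bytes:
    """Hash helper; SHA-256 is an illustrative choice."""
    return hashlib.sha256(data).digest()


def merkle_root(blocks: List[bytes]) -> bytes:
    """Hash the blocks into leaves, then hash pairs upwards until one node is left."""
    if not blocks:
        raise ValueError("at least one data block is required")
    level = [h(b) for b in blocks]  # leaves: hashes of data blocks
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # assumption: duplicate the last hash on odd levels
        # inner nodes: hash of the concatenated child hashes
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


blocks = [b"001", b"002", b"003", b"004"]
print(merkle_root(blocks).hex())
```

Two replicas holding identical blocks produce the same root, so comparing one hash is enough to confirm that they agree.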
Merkle Trees
   Node A  <--  gossip exchange  -->  Node B

                 Minimal data transfer
            Differences are easy to locate

    SHA-1, Whirlpool or Tiger (TTH) hash functions
                                                     19
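Locating the differences between two replicas can be sketched as a top-down walk that follows only the subtrees whose hashes disagree. The helper names are hypothetical, SHA-256 stands in for the hash functions listed, the block count is assumed to be a power of two, and both trees are held locally here, whereas real nodes would exchange the hashes over the network.

```python
import hashlib
from typing import List


def build_levels(blocks: List[bytes]) -> List[List[bytes]]:
    """Return the tree as levels: index 0 holds the leaves, the last level the root."""
    # Assumes len(blocks) is a power of two, to keep the sketch short.
    levels = [[hashlib.sha256(b).digest() for b in blocks]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([hashlib.sha256(prev[i] + prev[i + 1]).digest()
                       for i in range(0, len(prev), 2)])
    return levels


def differing_leaves(a: List[List[bytes]], b: List[List[bytes]]) -> List[int]:
    """Descend from the root, expanding only nodes whose hashes differ."""
    suspects = [0]  # node indices at the current level, starting at the root
    for depth in range(len(a) - 1, 0, -1):
        suspects = [child
                    for i in suspects if a[depth][i] != b[depth][i]
                    for child in (2 * i, 2 * i + 1)]
    return [i for i in suspects if a[0][i] != b[0][i]]


node_a = build_levels([b"001", b"002", b"003", b"004"])
node_b = build_levels([b"001", b"002", b"XXX", b"004"])
print(differing_leaves(node_a, node_b))  # [2]: only the third block differs
```

Matching subtrees are never opened, so the number of hashes compared grows with the number of differences and the tree depth rather than with the total amount of data.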
Merkle Trees: Usage

  Peer to Peer communication: DC++

  Amazon Dynamo, Google BigTable, Cassandra, HBase, ZFS, Google Wave, ...
                                                               20
Merkle Trees: Usage in Cassandra


      Ensure the P2P network of nodes receives
        data blocks unaltered and unharmed.
       Anti-entropy during major compactions
           (via Scuttlebutt reconciliation).

        One Merkle Tree per Column Family
       (in Dynamo, one per node / key range)

  org.apache.cassandra.utils.MerkleTree

       http://wiki.apache.org/cassandra/ArchitectureAntiEntropy   21
References

Bloom Filters
http://bit.ly/bundles/quipo/1

Merkle Trees
http://bit.ly/bundles/quipo/2




                                22
We’re Hiring!




http://mediasift.com/careers
                               23
Lorenzo Alberton
                  @lorenzoalberton




   Thank you!

      lorenzo@alberton.info




http://www.alberton.info/talks
                                     24

Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees

  • 1. Lorenzo Alberton @lorenzoalberton “Modern” Algorithms and Data Structures Part 1 Bloom Filters, Merkle Trees Cassandra-London, Monday 18th April 2011 1
  • 2. Bloom Filters Burton Howard Bloom, 1970 http://portal.acm.org/citation.cfm?doid=362686.362692 2
  • 3. Bloom Filter Space-efficient probabilistic data structure used to test set membership http://en.wikipedia.org/wiki/Bloom_filter 3
  • 4. Bloom Filter Space-efficient probabilistic data structure that is used to test whether an element is a member of a set 4
  • 5. Bloom Filter Space-efficient probabilistic data structure that is used to test whether an element is a member of a set Hash Table ⇒ chance of collision hash(x) hash(y) 4
  • 6. Bloom Filter Space-efficient probabilistic data structure that is used to test whether an element is a member of a set Hash Table ⇒ chance of collision hash(x) hash(y) False positives are possible, false negatives are not. It might be beneficial to build an exception list of known false positives. 4
  • 7. Bloom Filter Space-efficient probabilistic data structure that is used to test whether an element is a member of a set 5
  • 8. Bloom Filter Space-efficient probabilistic data structure that is used to test whether an element is a member of a set Not a Key-Value store 5
  • 9. Bloom Filter Space-efficient probabilistic data structure that is used to test whether an element is a member of a set Not a Key-Value store Array of bits indicating the presence of a key in the filter 5
  • 10. Bloom Filter Space-efficient probabilistic data structure that is used to test whether an element is a member of a set Not a Key-Value store Array of bits indicating the presence of a key in the filter (*) Removing an element from the filter is not possible 5
  • 11-20. Bloom Filter: Add & Query. S is an array of m bits (initially set to 0), used with k hash functions (f, g and h in the example). Add: for each hash function, if f(x) = A, set S[A] = 1; adding x and y sets the bits at f(x), g(x), h(x), f(y), g(y) and h(y). Query: compute f(z), g(z) and h(z); if at least one of those bits is set to 0, then z ∉ S.
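The add and query steps on this slide can be sketched in a few lines of Python. This is only an illustration, not Cassandra's implementation: the class name `BloomFilter`, the method `might_contain`, and the choice of salted SHA-256 to stand in for the k hash functions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: m bits and k salted hash functions (illustrative only)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # S: m bits, initially all set to 0

    def _positions(self, item: str):
        # Assumption: the k hash functions are simulated by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # If f(x) = A, set S[A] = 1, for each of the k hash functions.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # One bit set to 0 means the item is definitely not in the set.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=64, k=3)
bf.add("x")
bf.add("y")
print(bf.might_contain("x"))  # True: false negatives are not possible
print(bf.might_contain("z"))  # usually False; True would be a false positive
```

A `True` answer only means "possibly in the set", which is why the filter is normally placed in front of the expensive lookup rather than replacing it.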
  • 21-22. Bloom Filter: Hash Functions. k hash functions, each with a uniform random distribution over the m array positions. Options: k different hash functions; the same hash function with different salts; double or triple hashing [1], g_i(x) = h_1(x) + i·h_2(x) mod m, where 2 hash functions can mimic k hash functions. Candidates: cryptographic hash functions (MD5, SHA-1, SHA-256, Tiger, Whirlpool ...); Murmur hashes (http://code.google.com/p/smhasher/). See also http://www.strchr.com/hash_functions. [1] Dillinger, Peter C.; Manolios, Panagiotis (2004b), "Bloom Filters in Probabilistic Verification", http://www.ccs.neu.edu/home/pete/pub/bloom-filters-verification.pdf
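As a rough sketch of the double hashing idea, the function below derives k positions from two base hashes using g_i(x) = (h_1(x) + i·h_2(x)) mod m. The function name `double_hash_positions` is hypothetical, and taking h_1 and h_2 from two halves of one SHA-256 digest (and forcing h_2 to be odd so the step is never zero) are assumptions made for this example, not something the slides prescribe.

```python
import hashlib


def double_hash_positions(item: str, k: int, m: int) -> list:
    """Return k positions g_i(x) = (h1(x) + i * h2(x)) mod m for i = 0..k-1."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    # Assumption: two 64-bit slices of one digest act as the two base hashes.
    h1 = int.from_bytes(digest[:8], "big")
    # Assumption: force h2 to be odd so the step between positions is never 0.
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % m for i in range(k)]


positions = double_hash_positions("x", k=7, m=1024)
print(positions)
```

Only one digest is computed per element, regardless of k, which is the practical appeal of this approach over k independent hash functions.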
  • 23. Bloom Filter: Usage. Guard against expensive operations (like disk access); first line of defence in high performance (distributed) caches; Peer to Peer communication; routing and resource location; ... Used by: Squid Proxy Cache, Google BigTable, various RDBMSs, Google Chrome, Cisco routers, Cassandra, HBase.
  • 24-25. Bloom Filter: Usage in Cassandra. Used to save I/O during key look-ups (check for non-existent keys). One bloom filter per SSTable. See org.apache.cassandra.utils.BloomFilter.
  • 26-27. Bloom Filter: False Positive Rate. m = number of bits in the filter; n = number of elements; k = number of hashing functions. http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html
  • 28. Bloom Filter: False Positive Rate. A bloom filter with an optimal value for k and a 1% error rate only needs 9.6 bits per key. Add 4.8 bits/key and the error rate decreases by 10 times. 10,000 words, 1% error rate: 7 hash functions, ~12 KB of memory. 10,000 words, 0.1% error rate: 11 hash functions, ~18 KB of memory. http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/
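The sizing figures above can be approximated with the commonly quoted Bloom filter formulas: false positive rate p ≈ (1 − e^(−kn/m))^k, optimal k ≈ (m/n)·ln 2, and m ≈ −n·ln p / (ln 2)². The sketch below uses those textbook approximations; the function names are hypothetical, and because the slide's numbers are rounded, the computed values may differ slightly from them.

```python
import math


def bloom_parameters(n: int, p: float):
    """Return (m, k): bits and hash functions for n elements at target rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round(m / n * math.log(2))
    return m, k


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false positive probability for m bits, n elements, k hashes."""
    return (1 - math.exp(-k * n / m)) ** k


m1, k1 = bloom_parameters(10_000, 0.01)
print(m1 / 10_000, "bits per key,", k1, "hash functions,", m1 // 8, "bytes")
print(false_positive_rate(m1, 10_000, k1))
```

For the 1% case this lands close to the 9.6 bits per key and roughly 12 KB quoted on the slide.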
  • 29. Bloom Filter: False Positive Rate. [Plot: false positive probability against bloom filter size (n).] http://en.wikipedia.org/wiki/Bloom_filter
  • 30. Counting Bloom Filter. Can handle deletions: use counters instead of 0/1 bits. When adding an element, increment the counters; when deleting an element, decrement the counters. Counters must be large enough to avoid overflow (4 bits). Example after adding x and y (hashes f, g, h): S = 1 0 0 0 1 0 0 0 2 0 0 0 1 0 1
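A counting Bloom filter along these lines might look like the sketch below. The class and method names are hypothetical, the k hash functions are again simulated with salted SHA-256, and leaving a saturated counter untouched on removal is a design assumption of this example (it avoids introducing false negatives after an overflow).

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter with small counters instead of bits, so elements can be deleted."""

    def __init__(self, m: int, k: int, max_count: int = 15) -> None:
        self.m = m
        self.k = k
        self.max_count = max_count  # 15 is the largest value a 4-bit counter holds
        self.counters = [0] * m

    def _positions(self, item: str):
        # Assumption: k hash functions simulated by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            if self.counters[pos] < self.max_count:
                self.counters[pos] += 1  # saturate instead of overflowing

    def remove(self, item: str) -> None:
        # Only call this for items that were previously added.
        for pos in self._positions(item):
            # Assumption: a saturated counter stays stuck, since its true count is unknown.
            if 0 < self.counters[pos] < self.max_count:
                self.counters[pos] -= 1

    def might_contain(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))


cbf = CountingBloomFilter(m=64, k=3)
cbf.add("x")
cbf.add("y")
cbf.remove("x")
print(cbf.might_contain("y"))  # True: removing x does not erase y
```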
  • 31-33. Stable (Time-Based) Bloom Filter. Input stream → duplicate filter → output stream. Before each insertion, P random cells are decremented by one; the k cells for the new value x_i are then set to Max (usually < 7). http://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf Alternatively, set an expiry time for each cell, with a TTL dependent on the volume of data. http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/
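The decrement-then-set rule described above can be sketched as follows. This is a simplified illustration rather than the algorithm exactly as published: the class name, the `seen_then_add` method, the parameter defaults and the salted SHA-256 hashing are all assumptions of this example.

```python
import hashlib
import random


class StableBloomFilter:
    """Duplicate filter for unbounded streams: old entries decay over time."""

    def __init__(self, m: int, k: int, p: int, max_value: int = 3, seed=None) -> None:
        self.m = m
        self.k = k
        self.p = p                  # number of random cells decremented per insertion
        self.max_value = max_value  # "Max" on the slide
        self.cells = [0] * m
        self.rng = random.Random(seed)

    def _positions(self, item: str):
        # Assumption: k hash functions simulated by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def seen_then_add(self, item: str) -> bool:
        """Report whether item looks like a duplicate, then record it."""
        positions = list(self._positions(item))
        duplicate = all(self.cells[pos] > 0 for pos in positions)
        # Before the insertion, P random cells are decremented by one.
        for _ in range(self.p):
            j = self.rng.randrange(self.m)
            if self.cells[j] > 0:
                self.cells[j] -= 1
        # The k cells for the new value are set to Max.
        for pos in positions:
            self.cells[pos] = self.max_value
        return duplicate


sbf = StableBloomFilter(m=128, k=3, p=4, seed=1)
print(sbf.seen_then_add("a"))  # False: the filter starts empty
print(sbf.seen_then_add("a"))  # True: seen immediately before
```

Because cells decay, an element that has not been seen for a long time is eventually forgotten, so this variant can return false negatives as well as false positives.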
  • 34. Bloom Filters: Further reading. Compressed Bloom Filters: improve performance when the Bloom filter is passed as a message and its transmission size is a limiting factor. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.3346 Retouched Bloom Filters: allow networked applications to trade off selected false positives against false negatives. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.172.8453 Bloomier Filters: extended to handle approximate functions (each element of the set has an associated function value). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.4154 http://arxiv.org/abs/0807.0928 Also: Attenuated B.F., Spectral B.F., Distance-Sensitive B.F. ...
  • 35. Merkle Trees. Ralph C. Merkle, 1979. http://www.springerlink.com/content/q865hwxq73ex1am9/
  • 36. Merkle Trees (Hash Trees): a data structure containing a tree of summary information about a larger piece of data, used to verify its contents. http://en.wikipedia.org/wiki/Hash_Tree
  • 37. Merkle Trees (Hash Trees). Leaves: hashes of data blocks. Nodes: hashes of their children. Used to detect inconsistencies between replicas (anti-entropy) and to minimise the amount of transferred data. Example tree: ROOT = hash(A, B); A = hash(C, D), B = hash(E, F); C = hash(001), D = hash(002), E = hash(003), F = hash(004), computed over data blocks 001 to 004.
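The example tree on this slide can be built bottom-up as in the sketch below. The function name `build_merkle_tree`, the use of SHA-1, and the rule of duplicating the last hash when a level has an odd number of nodes are assumptions made for illustration.

```python
import hashlib


def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def build_merkle_tree(blocks):
    """Return the tree as a list of levels: leaf hashes first, root hash last."""
    level = [sha1(block) for block in blocks]  # leaves: hashes of data blocks
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: pad an odd level by repeating its last hash.
            level = level + [level[-1]]
        # Inner nodes: hash of the concatenation of the two children.
        level = [sha1(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


blocks = [b"001", b"002", b"003", b"004"]
levels = build_merkle_tree(blocks)
root = levels[-1][0]
print(root.hex())
```

Changing any single data block changes its leaf hash and every hash on the path up to the root, which is what makes a root comparison a cheap consistency check.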
  • 38-40. Merkle Trees. Node A and Node B: gossip exchange. Minimal data transfer; differences are easy to locate. SHA-1, Whirlpool or Tiger (TTH) hash functions.
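To show why differences are easy to locate, the sketch below compares two replicas top-down and descends only into subtrees whose hashes differ. It is a single-process illustration under stated assumptions: both replicas hold the same number of blocks, the function names are hypothetical, and hashes are recomputed on demand, whereas real nodes would build the tree once and exchange only hashes over the network.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def tree_hash(blocks, lo: int, hi: int) -> bytes:
    """Hash of the subtree covering blocks[lo:hi]."""
    if hi - lo == 1:
        return h(blocks[lo])
    mid = (lo + hi) // 2
    return h(tree_hash(blocks, lo, mid) + tree_hash(blocks, mid, hi))


def differing_blocks(a, b, lo: int = 0, hi=None) -> list:
    """Return the indexes of blocks that differ between replicas a and b."""
    if hi is None:
        hi = len(a)
    if tree_hash(a, lo, hi) == tree_hash(b, lo, hi):
        return []  # identical subtree: nothing below it needs to be transferred
    if hi - lo == 1:
        return [lo]
    mid = (lo + hi) // 2
    return differing_blocks(a, b, lo, mid) + differing_blocks(a, b, mid, hi)


node_a = [b"001", b"002", b"003", b"004"]
node_b = [b"001", b"002", b"0X3", b"004"]
print(differing_blocks(node_a, node_b))  # [2]
```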
  • 41-43. Merkle Trees: Usage. Peer to Peer communication (DC++, ...). Also used by: Amazon Dynamo, Google BigTable, Google Wave, Cassandra, HBase, ZFS.
  • 44-46. Merkle Trees: Usage in Cassandra. Ensure the P2P network of nodes receives data blocks unaltered and unharmed. Anti-entropy during major compactions (via Scuttlebutt reconciliation). One Merkle Tree per Column Family (in Dynamo, one per node / key range). See org.apache.cassandra.utils.MerkleTree. http://wiki.apache.org/cassandra/ArchitectureAntiEntropy
  • 49. Lorenzo Alberton @lorenzoalberton Thank you! lorenzo@alberton.info http://www.alberton.info/talks 24

Editor's Notes

  4. Two keys might map into the same bucket.
  16. An empty Bloom Filter is an array of m bits, all set to 0. There must be k hash functions defined, each of which maps an element to one of the m array positions with a uniform random distribution. To add an element, feed it to each of the k hash functions to get k array positions, and set the bits at those positions to 1. To test for an element, feed it to each of the k hash functions to get k array positions: if any of the bits at these positions is 0, the element is not in the set. Union and intersection of Bloom filters can be computed with simple bitwise OR and AND operations.
  47. Tiger is a cryptographic hash function optimised for 64-bit platforms (1995). Digest size: 192 bits (truncated versions: 128 and 160 bits). MurmurHash is very fast and has a low collision rate (2008). Another good non-cryptographic hash function is the Jenkins Hash Function (Bob Jenkins, 1997). Hashing with checksum functions is possible, and may produce a sufficiently uniform distribution of hash values, as long as the hash range size n is small compared to the range of the checksum or fingerprint function. The CRC32 checksum provides only 16 bits (the higher half of the result) that are usable for hashing.
  48. Popular in distributed web caches (small cost, big potential gain). The Google Chrome web browser uses Bloom filters to speed up its Safe Browsing service. In relational databases, Bloom filters are often used for JOINs.
  50. All the bits for an element not yet inserted might already be set. There is a clear tradeoff between m and the probability of a false positive. The value of k that minimizes the probability of false positives is approximately (ln 2)·m/n ≈ 0.7·m/n.
  52. An optimal number of hash functions k has been assumed.
  53. Standard bloom filters can't handle deletions: if deleting x means resetting 1s to 0s, then deleting an entry might delete several others.
  54. 2006. Precisely eliminating duplicates in an unbounded data stream (i.e. when you don't know the size of the data set up front) is not feasible in many streaming scenarios. A common characteristic of traditional duplicate-elimination algorithms is the underlying assumption that the whole data set is stored and can be accessed if needed. Use cases: URL crawlers, network monitoring (number of accesses by IP in the past hour), trending topics. In many data stream applications, the allocated space is rather small compared to the size of the stream. As more and more elements arrive, the fraction of zeros in the Bloom Filter decreases continuously and the false positive rate increases accordingly, finally reaching the limit, 1, where every distinct element is reported as a duplicate and the Bloom Filter is useless. With a regular Bloom Filter, there is no way to distinguish recent elements from past ones.
  58. RBF: permits the removal of selected false positives at the expense of generating random false negatives.
  60. They are used to protect any kind of data stored, handled and transferred in and between computers.
  61. Each inner node is the hash value of the concatenation of its two children. The principal advantage of a Merkle tree is that each branch of the tree can be checked independently, without requiring nodes to download the entire tree or the entire data set.
  62. For each key range of data, each member of the replica group computes a Merkle tree (a hash encoding tree in which differences can be located quickly) and sends it to its neighbours. By comparing the received Merkle tree with its own tree, each member can quickly determine which data portion is out of sync. If there is a difference, it sends the diff to the left-behind members. Tiger is a cryptographic hash function optimised for 64-bit platforms (1995). Digest size: 192 bits (truncated versions: 128 and 160 bits).
  63. For each key range of data, each member in the replica group compute a Merkel tree (a hash encoding tree where the difference can be located quickly) and send it to other neighbors. By comparing the received Merkel tree with its own tree, each member can quickly determine which data portion is out of sync. If so, it will send the diff to the left-behind members.\n\nTiger is a cryptographic hash function optimised for 64-bit platform (1995)\nSize: 192 bits (truncated versions: 128 and 160 bits)\n
  64. For each key range of data, each member in the replica group compute a Merkel tree (a hash encoding tree where the difference can be located quickly) and send it to other neighbors. By comparing the received Merkel tree with its own tree, each member can quickly determine which data portion is out of sync. If so, it will send the diff to the left-behind members.\n\nTiger is a cryptographic hash function optimised for 64-bit platform (1995)\nSize: 192 bits (truncated versions: 128 and 160 bits)\n
  65. For each key range of data, each member in the replica group compute a Merkel tree (a hash encoding tree where the difference can be located quickly) and send it to other neighbors. By comparing the received Merkel tree with its own tree, each member can quickly determine which data portion is out of sync. If so, it will send the diff to the left-behind members.\n\nTiger is a cryptographic hash function optimised for 64-bit platform (1995)\nSize: 192 bits (truncated versions: 128 and 160 bits)\n
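The comparison step in the note above (finding which portion of the data is out of sync) might look roughly like the following. It is a hedged sketch, not Cassandra's implementation: `Node`, `build` and `diff` are hypothetical names, SHA-256 replaces Tiger because Python's standard library does not guarantee a Tiger implementation, and both replicas are assumed to split the data into the same number of key ranges so that the two trees have the same shape.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


def h(data: bytes) -> bytes:
    # Assumption: SHA-256 as a stand-in for the hash used in production.
    return hashlib.sha256(data).digest()


@dataclass
class Node:
    digest: bytes
    lo: int  # index of the first key range covered (inclusive)
    hi: int  # index one past the last key range covered (exclusive)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(range_hashes: List[bytes], lo: int = 0, hi: Optional[int] = None) -> Node:
    """Build a hash tree over per-key-range digests."""
    if hi is None:
        hi = len(range_hashes)
    if hi <= lo:
        raise ValueError("need at least one key range")
    if hi - lo == 1:
        return Node(range_hashes[lo], lo, hi)
    mid = (lo + hi) // 2
    left = build(range_hashes, lo, mid)
    right = build(range_hashes, mid, hi)
    return Node(h(left.digest + right.digest), lo, hi, left, right)


def diff(mine: Node, theirs: Node) -> List[int]:
    """Return the indices of key ranges whose hashes differ."""
    if mine.digest == theirs.digest:
        return []  # identical subtrees: nothing below needs to be inspected
    if mine.left is None or mine.right is None:
        return [mine.lo]  # a differing leaf identifies one out-of-sync range
    return diff(mine.left, theirs.left) + diff(mine.right, theirs.right)


if __name__ == "__main__":
    replica_a = [h(b"k0=v0"), h(b"k1=v1"), h(b"k2=v2"), h(b"k3=v3")]
    replica_b = [h(b"k0=v0"), h(b"k1=v1"), h(b"k2=stale"), h(b"k3=v3")]
    print(diff(build(replica_a), build(replica_b)))
```

Because equal digests prune an entire subtree, the comparison only descends into the branches that actually contain a difference.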
  66-71. Hash trees can be used to protect any kind of data stored, handled and transferred in and between computers. Before downloading a file on a p2p network, the top hash is acquired from a trusted source. Once the top hash (root hash) is available, the hash tree can be received from any non-trusted source. Currently the main use of hash trees is to make sure that data blocks received from other peers in a peer-to-peer network arrive undamaged and unaltered, and even to check that the other peers do not lie and send fake blocks.
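To illustrate how a block from an untrusted peer can be checked against a trusted top hash, here is a small sketch. The function names (`merkle_levels`, `proof_for`, `verify_block`), the use of SHA-256 and the padding rule for odd-sized levels are all assumptions made for the example; real p2p protocols define their own hash function and tree layout.

```python
import hashlib


def h(data: bytes) -> bytes:
    # Assumption: SHA-256; actual protocols may use Tiger or another hash.
    return hashlib.sha256(data).digest()


def merkle_levels(blocks):
    """Return all tree levels (leaves first), padding odd levels by repetition."""
    if not blocks:
        raise ValueError("need at least one data block")
    level = [h(block) for block in blocks]
    levels = []
    while True:
        if len(level) > 1 and len(level) % 2 == 1:
            level = level + [level[-1]]
        levels.append(level)
        if len(level) == 1:
            return levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def proof_for(levels, index):
    """Collect the sibling hashes on the path from leaf `index` up to the root."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1  # the other child of the same parent
        path.append((level[sibling], sibling < index))
        index //= 2
    return path


def verify_block(block, path, trusted_root):
    """Recompute the root from one block and its sibling path, then compare."""
    digest = h(block)
    for sibling, sibling_is_left in path:
        digest = h(sibling + digest) if sibling_is_left else h(digest + sibling)
    return digest == trusted_root


if __name__ == "__main__":
    blocks = [b"chunk-0", b"chunk-1", b"chunk-2", b"chunk-3"]
    levels = merkle_levels(blocks)
    trusted_root = levels[-1][0]  # in practice obtained from a trusted source
    path = proof_for(levels, 2)   # may come from any untrusted peer
    print(verify_block(b"chunk-2", path, trusted_root))
    print(verify_block(b"fake", path, trusted_root))
```

The receiver needs only the trusted root and a number of sibling hashes proportional to the tree height, not the whole file, to decide whether a single block is genuine.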
  72-73. Merkle trees are exchanged; if they disagree, Cassandra does a range-repair via compaction (using the Scuttlebutt reconciliation). To ensure the data stays in sync even when no READ or WRITE occurs, replica nodes periodically gossip with each other to figure out whether anyone is out of sync. For each key range of data, each member of the replica group computes a Merkle tree (a hash tree in which differences can be located quickly) and sends it to its neighbors. By comparing the received Merkle tree with its own tree, each member can quickly determine which portion of the data is out of sync. If any portion is, it sends the diff to the members that have fallen behind.
      Anti-entropy is the "catch-all" way to guarantee eventual consistency, but it is also pretty expensive and is therefore not done frequently. By combining the data sync with read repair and hinted handoff, we can keep the replicas pretty up-to-date.
      The key difference in Cassandra's implementation of anti-entropy is that the Merkle trees are built per column family, and they are not maintained for longer than it takes to send them to neighboring nodes. Instead, the trees are generated as snapshots of the dataset during major compactions: this means that excess data might be sent across the network, but it saves local disk IO, and is preferable for very large datasets.
  74. \n
  75. \n
  76. \n