Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees

The first part of a series of talks about modern algorithms and data structures used by NoSQL databases like HBase and Cassandra. An explanation of Bloom Filters and several derivatives, and Merkle Trees.

Lorenzo Alberton
@lorenzoalberton

“Modern” Algorithms
and Data Structures
Part 1
Bloom Filters, Merkle Trees

Cassandra-London, Monday 18th April 2011
1

Bloom Filters
Burton Howard Bloom, 1970

http://portal.acm.org/citation.cfm?doid=362686.362692 2

Bloom Filter

Space-efficient probabilistic data structure used to test set membership
http://en.wikipedia.org/wiki/Bloom_filter 3

Bloom Filter
Space-efficient probabilistic data structure that is used to test
whether an element is a member of a set

Hash Table ⇒ chance of collision

[Figure: hash(x) and hash(y) in a hash table]

4

Bloom Filter
Space-efficient probabilistic data structure that is used to test
whether an element is a member of a set

Not a Key-Value store

Array of bits indicating the
presence of a key in the filter

5

Bloom Filter: Add & Query
m bits (initially set to 0)
k hash functions

Add: if f(x) = A, set S[A] = 1

[Figure: bit array S with positions 0, 1, 2, ..., m-1. Adding x sets the bits at f(x), g(x) and h(x) to 1; adding y sets the bits at f(y), g(y) and h(y) to 1. Querying z checks the bits at f(z), g(z) and h(z).]

6
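
The Add and Query steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular database: the class name `BloomFilter`, the method names, and the choice of salted SHA-256 as the k hash functions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k salted hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # the array S, initially all zeros

    def _positions(self, item: str):
        # Assumption: the k hash functions are SHA-256 with k different salts.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Add: set S[A] = 1 for every hash position A of the item.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # Query: a single 0 bit proves the item was never added;
        # all 1 bits means "probably present" (false positives are possible).
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1024, k=3)
bf.add("x")
bf.add("y")
print(bf.might_contain("x"), bf.might_contain("y"))
```

An item that was added always answers `True`; an item that was never added usually answers `False`, but may answer `True` if all of its positions happen to have been set by other items.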

Bloom Filter: Hash Functions
k hash functions: uniform random distribution in [0, m)

k different hash functions

The same hash functions with different salts

Double or triple hashing [1]: $g_i(x) = h_1(x) + i\,h_2(x) \bmod m$

2 hash functions can mimic k hashing functions

‣ Cryptographic Hash Functions
  (MD5, SHA-1, SHA-256, Tiger, Whirlpool ...)
‣ Murmur Hashes
  http://code.google.com/p/smhasher/

[1] Dillinger, Peter C.; Manolios, Panagiotis (2004b), "Bloom Filters in Probabilistic Verification",
http://www.ccs.neu.edu/home/pete/pub/bloom-filters-verification.pdf

http://www.strchr.com/hash_functions 7
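
The double hashing formula above can be illustrated with a short sketch. The function name `double_hash_positions` is hypothetical, and taking h1 and h2 from two halves of a single SHA-256 digest is an assumption made for brevity; any two reasonably independent hash functions would do. The index i is assumed to run from 0 to k-1.

```python
import hashlib


def double_hash_positions(item: str, k: int, m: int) -> list:
    """Derive k bit positions from two base hashes: (h1 + i * h2) mod m."""
    digest = hashlib.sha256(item.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")     # first base hash (assumption)
    h2 = int.from_bytes(digest[8:16], "big")   # second base hash (assumption)
    return [(h1 + i * h2) % m for i in range(k)]


positions = double_hash_positions("x", k=5, m=1000)
print(positions)
```

Only two hash computations are needed regardless of k, which is the point of the technique. One caveat of this naive form: if h2 happens to be a multiple of m, all k positions coincide.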

Bloom Filter: Usage

Guard against expensive operations (like disk access)
First line of defence in high performance (distributed) caches
Peer to Peer communication
Routing - Resource Location

Used in: Cassandra, HBase, Squid Proxy Cache, Google BigTable, various RDBMS’, Google Chrome, Cisco Routers, ...

8

Bloom Filter: Usage in Cassandra

Used to save I/O during key look-ups
(check for non-existent keys)

One bloom filter per SSTable.

org.apache.cassandra.utils.BloomFilter

9

Bloom Filter: False Positive Rate

m = number of bits in the filter
n = number of elements
k = number of hashing functions

http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html 10
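
For reference, the usual textbook approximation of the false positive probability p, written in terms of m, n and k as defined above, is given below. The notation is a reconstruction and may differ from the form shown on the original slide.

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
k_{\text{opt}} = \frac{m}{n}\ln 2
\;\Rightarrow\;
p \approx \left(\tfrac{1}{2}\right)^{k_{\text{opt}}} \approx 0.6185^{\,m/n}
```

The second expression says that, with the optimal number of hash functions, the error rate depends only on the number of bits per element, m/n.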

Bloom Filter: False Positive Rate

A bloom filter with an optimal value for k
and 1% error rate only needs 9.6 bits per key.
Add 4.8 bits/key and the error rate decreases by 10 times.

10,000 words, 1% error rate: 7 hash functions, ~12 KB of memory
10,000 words, 0.1% error rate: 11 hash functions, ~18 KB of memory
http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/ 11
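
The sizing figures above can be checked with the standard formulas m = -n ln p / (ln 2)^2 and k = (m/n) ln 2. The helper name `bloom_parameters` is hypothetical, and rounding k to the nearest integer is an assumption, so the hash function count it reports may differ slightly from the numbers quoted on the slide.

```python
import math


def bloom_parameters(n: int, p: float):
    """Return (m bits, k hash functions) for n elements and target error rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round(m / n * math.log(2))
    return m, k


m_bits, k = bloom_parameters(10_000, 0.01)
print(f"{m_bits / 10_000:.2f} bits per key, {k} hash functions, "
      f"{m_bits / 8 / 1000:.1f} KB")
```

For 10,000 elements at a 1% error rate this lands close to the slide's 9.6 bits per key and roughly 12 KB.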

Bloom Filter: False Positive Rate
[Figure: plot of false positive probability against bloom filter size (n)]
http://en.wikipedia.org/wiki/Bloom_filter 12

Counting Bloom Filter
Can handle deletions
Use counters instead of 0/1s
When adding an element, increment the counters
When deleting an element, decrement the counters
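Counters must be large enough to avoid overflow (4 bits)

[Figure: counter array S = 1 0 0 0 1 0 0 0 2 0 0 0 1 0 1 after adding x and y; f(x), g(x), h(x), f(y), g(y) and h(y) point at the incremented counters, and a cell hit twice holds 2]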
13
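
A counting Bloom filter along the lines of this slide might look like the sketch below. The class and method names are hypothetical, salted SHA-256 again stands in for the k hash functions, and the saturation policy (a counter that reaches 15 is never changed again) is one common choice among several for 4-bit counters.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter with small counters instead of bits, so items can be removed."""

    def __init__(self, m: int, k: int, max_count: int = 15) -> None:
        self.m, self.k = m, k
        self.max_count = max_count   # 15 is the largest value of a 4-bit counter
        self.counters = [0] * m

    def _positions(self, item: str):
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            if self.counters[pos] < self.max_count:   # saturate, never overflow
                self.counters[pos] += 1

    def might_contain(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))

    def remove(self, item: str) -> None:
        # Only meaningful for items that were actually added.
        if not self.might_contain(item):
            return
        for pos in self._positions(item):
            if 0 < self.counters[pos] < self.max_count:  # saturated cells stay stuck
                self.counters[pos] -= 1


cbf = CountingBloomFilter(m=1024, k=3)
cbf.add("x")
cbf.add("y")
print(cbf.might_contain("x"), cbf.might_contain("y"))
```

Removing an item that was never added can corrupt the counters of other items, which is why `remove` is guarded by a membership check here; that check cannot catch false positives.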

Stable (Time-Based) Bloom Filter
Before each insertion, P random cells are decremented by one.
The k cells for the new value xi are set to Max (usually < 7)
http://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf

[Figure: Input Stream → Duplicate Filter (cells 1 0 0 0 1 0 0 0 1 0) → Output Stream]
14
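
The decay-then-set rule on this slide can be sketched as a duplicate detector for a stream. This is a simplified illustration rather than the algorithm exactly as published: the names are hypothetical, the P cells are picked independently at random (with replacement), `max_value=3` is an arbitrary choice below 7, and a fixed seed is used only to keep runs repeatable.

```python
import hashlib
import random


class StableBloomFilter:
    """Duplicate filter whose cells decay, so old stream items are forgotten."""

    def __init__(self, m: int, k: int, p: int, max_value: int = 3, seed: int = 0):
        self.m, self.k, self.p, self.max_value = m, k, p, max_value
        self.cells = [0] * m
        self._rng = random.Random(seed)

    def _positions(self, item: str):
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def seen_then_add(self, item: str) -> bool:
        positions = list(self._positions(item))
        # Report a duplicate if all k cells are still non-zero.
        duplicate = all(self.cells[pos] > 0 for pos in positions)
        # Before the insertion, decrement P randomly chosen cells by one.
        for _ in range(self.p):
            pos = self._rng.randrange(self.m)
            if self.cells[pos] > 0:
                self.cells[pos] -= 1
        # Then set the k cells of the new value to Max.
        for pos in positions:
            self.cells[pos] = self.max_value
        return duplicate


sbf = StableBloomFilter(m=1024, k=3, p=4)
print(sbf.seen_then_add("event-1"), sbf.seen_then_add("event-1"))
```

Unlike a plain Bloom filter, this variant can produce false negatives: an item seen long ago may have decayed out of the filter.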

Bloom Filters: Further reading
Compressed Bloom Filters
Improve performance when the Bloom filter is passed as a message,
and its transmission size is a limiting factor.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.3346

Retouched Bloom Filters
Allow networked applications to trade off selected false positives
against false negatives
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.172.8453

Bloomier Filters
Extended to handle approximate functions (each element of the set
has an associated function value)
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.4154 http://arxiv.org/abs/0807.0928

Attenuated B.F., Spectral B.F., Distance-Sensitive B.F. ...
15

Merkle Trees
Ralph C. Merkle, 1979

http://www.springerlink.com/content/q865hwxq73ex1am9/ 16

Merkle Trees (Hash Trees)

Data Structure containing a
tree of summary information
about a larger piece of data
to verify its contents

http://en.wikipedia.org/wiki/Hash_Tree 17

Merkle Trees (Hash Trees)
Leaves: hashes of data blocks.
Nodes: hashes of their children.
Used to detect inconsistencies between replicas (anti-entropy) and to minimise the amount of transferred data.

[Figure: hash tree]
ROOT = hash(A, B)
A = hash(C, D), B = hash(E, F)
C = hash(001), D = hash(002), E = hash(003), F = hash(004)
Data Blocks 001, 002, 003, 004
18
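
The tree in the figure can be computed bottom-up. The sketch below is illustrative only: SHA-256 is an assumption, the function names are hypothetical, concatenating the two child hashes stands in for hash(A, B), and duplicating the last hash on odd-sized levels is just one common convention. It assumes at least one data block.

```python
import hashlib


def h(data: bytes) -> bytes:
    # Assumption: SHA-256 as the hash function.
    return hashlib.sha256(data).digest()


def merkle_root(blocks) -> bytes:
    """Hash the data blocks into leaves, then hash pairs upward to a single root."""
    level = [h(block) for block in blocks]          # leaves: hashes of data blocks
    while len(level) > 1:
        if len(level) % 2:                          # odd count: repeat the last hash
            level.append(level[-1])
        # inner nodes: hash of the two children
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


root = merkle_root([b"001", b"002", b"003", b"004"])
print(root.hex())
```

Changing any single data block changes its leaf hash, every node on the path above it, and therefore the root, so two replicas with equal roots hold the same data with overwhelming probability.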

Merkle Trees
[Figure: Node A and Node B compare their trees via a gossip exchange]

Minimal data transfer
Differences are easy to locate

SHA-1, Whirlpool or Tiger (TTH) hash functions
19
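
Why differences are easy to locate can be shown by walking two trees from the root down and descending only into subtrees whose hashes disagree. This is a local sketch under stated assumptions: both replicas hold the same number of blocks, SHA-256 is used as the hash, the function names are hypothetical, and both trees sit in one process, whereas real replicas would exchange only the hashes along the differing paths.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()   # assumption: SHA-256


def build_levels(blocks):
    """Return the tree as a list of levels: leaves first, root level last."""
    level = [h(block) for block in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]    # pad odd levels with the last hash
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_blocks(a_levels, b_levels):
    """Indices of leaf blocks that differ, found by top-down comparison."""
    top = len(a_levels) - 1
    suspects = [0] if a_levels[top][0] != b_levels[top][0] else []
    for depth in range(top - 1, -1, -1):
        next_suspects = []
        for parent in suspects:
            for child in (2 * parent, 2 * parent + 1):
                if (child < len(a_levels[depth])
                        and a_levels[depth][child] != b_levels[depth][child]):
                    next_suspects.append(child)
        suspects = next_suspects
    return suspects


replica_a = build_levels([b"001", b"002", b"003", b"004"])
replica_b = build_levels([b"001", b"002", b"00X", b"004"])
print(differing_blocks(replica_a, replica_b))
```

Matching subtrees are skipped entirely, so the number of hashes compared grows with the depth of the tree and the number of differing blocks rather than with the total amount of data.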

Merkle Trees: Usage
Peer to Peer communication: DC++

Also used in: Amazon Dynamo, Cassandra, Google BigTable, HBase, Google Wave, ZFS, ...

20

Merkle Trees: Usage in Cassandra

Ensure the P2P network of nodes receives
data blocks unaltered and unharmed.
Anti-entropy during major compactions
(via Scuttlebutt reconciliation).

http://wiki.apache.org/cassandra/ArchitectureAntiEntropy 21

References

Bloom Filters
http://bit.ly/bundles/quipo/1

Merkle Trees
http://bit.ly/bundles/quipo/2

22

Lorenzo Alberton
@lorenzoalberton

Thank you!

lorenzo@alberton.info

http://www.alberton.info/talks
24

Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees

6. Bloom Filter. Hash Table ⇒ chance of collision. False positives are possible, false negatives are not. It might be beneficial to build an exception list of known false positives.

10. Bloom Filter. Removing an element from the filter is not possible.

20. Bloom Filter: Add & Query. Query z: if one of the bits at f(z), g(z), h(z) is set to 0, then z ∉ S.

21. Bloom Filter: Hash Functions k Hash functions: uniform random distribution in [1...m) k different hash functions The same hash functions with different salts Double or triple hashing : g (x) = h (x) + ih (x) mod m [1] i 1 2 2 hash functions can mimic k hashing functions Dillinger, Peter C.; Manolios, Panagiotis (2004b), "Bloom Filters in Probabilistic Verification", [1] http://www.ccs.neu.edu/home/pete/pub/bloom-filters-verification.pdf http://www.strchr.com/hash_functions 7

22. Bloom Filter: Hash Functions k Hash functions: uniform random distribution in [1...m) k different hash functions ‣ Cryptographic Hash different salts The same hash functions withFunctions (MD5, SHA-1, SHA-256, Tiger, Whirlpool ...) Double or triple hashing : g (x) = h (x) + ih (x) mod m [1] i 1 2 2 hash functions can mimic k hashing functions ‣ Murmur Hashes http://code.google.com/p/smhasher/ Dillinger, Peter C.; Manolios, Panagiotis (2004b), "Bloom Filters in Probabilistic Verification", [1] http://www.ccs.neu.edu/home/pete/pub/bloom-filters-verification.pdf http://www.strchr.com/hash_functions 7

23. Bloom Filter: Usage Guard against First line of defence Peer to Peer Routing - expensive operations in high performance communication Resource Location (like disk access) (distributed) caches ... Squid Google Various Google Cisco Cassandra HBase Proxy Cache BigTable RDBMS’ Chrome Routers 8

24. Bloom Filter: Usage in Cassandra Used to save I/O during key look-ups (check for non-existent keys) One bloom ﬁlter per SSTable. 9

25. Bloom Filter: Usage in Cassandra Used to save I/O during key look-ups (check for non-existent keys) One bloom ﬁlter per SSTable. org.apache.cassandra.utils.BloomFilter 9

26. Bloom Filter: False Positive Rate m = number of bits in the ﬁlter n = number of elements k = number of hashing functions http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html 10

27. Bloom Filter: False Positive Rate m = number of bits in the ﬁlter n = number of elements k = number of hashing functions http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html 10

28. Bloom Filter: False Positive Rate A bloom ﬁlter with an optimal value for k and 1% error rate only needs 9.6 bits per key. Add 4.8 bits/key and the error rate decreases by 10 times. 10.000 words, 1% error rate 10.000 words, 0.1% error rate 7 hash functions 11 hash functions ~12 KB of memory ~18 KB of memory http://www.igvita.com/2008/12/27/scalable-datasets-bloom-ﬁlters-in-ruby/ 11

29. Bloom Filter: False Positive Rate false positive probability bloom ﬁlter size (n) http://en.wikipedia.org/wiki/Bloom_ﬁlter 12

30. Counting Bloom Filter Can handle deletions Use counters instead of 0/1s When adding an element, increment the counters When deleting an element, decrement the counters Counters must be large enough to avoid overﬂow (4 bits) x y g(y) f(y) g(x) f(x) h(x) h(y) S 1 0 0 0 1 0 0 0 2 0 0 0 1 0 1 13

33. Stable (Time-Based) Bloom Filter
Input Stream → Duplicate Filter (1 0 0 0 1 0 0 0 1 0) → Output Stream
Before each insertion, P random cells are decremented by one. The k cells for the new value x_i are set to Max (usually < 7).
http://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf
Alternatively, set an expiry time for each cell, with a TTL dependent on the volume of data.
http://www.igvita.com/2010/01/06/flow-analysis-time-based-bloom-filters/
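The decrement-then-set rule on this slide can be sketched as below. This is a simplified illustration rather than the algorithm exactly as published: the class and method names are hypothetical, the P cells are drawn independently at random (so a cell may be picked twice), the default Max of 3 is arbitrary, and salted SHA-256 stands in for the k hash functions.

```python
import hashlib
import random


class StableBloomFilter:
    """Stable Bloom filter sketch for duplicate detection in a stream."""

    def __init__(self, m, k, p, max_value=3, seed=None):
        self.m = m                    # number of cells
        self.k = k                    # hash functions per element
        self.p = p                    # cells decremented before each insertion
        self.max_value = max_value    # the "Max" value written on insertion
        self.cells = [0] * m
        self.rng = random.Random(seed)

    def _positions(self, item):
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def seen_before(self, item):
        """Report whether item looks like a duplicate, then record it."""
        positions = list(self._positions(item))
        duplicate = all(self.cells[pos] > 0 for pos in positions)
        # Decay step: decrement P randomly chosen cells by one.
        for _ in range(self.p):
            pos = self.rng.randrange(self.m)
            if self.cells[pos] > 0:
                self.cells[pos] -= 1
        # Refresh step: set the k cells of the new value to Max.
        for pos in positions:
            self.cells[pos] = self.max_value
        return duplicate
```

Because old cells decay towards zero while recent ones are refreshed to Max, the filter forgets stale elements instead of filling up, at the cost of occasional false negatives.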

34. Bloom Filters: Further reading
‣ Compressed Bloom Filters: improve performance when the Bloom filter is passed as a message, and its transmission size is a limiting factor. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.3346
‣ Retouched Bloom Filters: allow networked applications to trade off selected false positives against false negatives. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.172.8453
‣ Bloomier Filters: extended to handle approximate functions (each element of the set has an associated function value). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.4154 http://arxiv.org/abs/0807.0928
‣ Attenuated B.F., Spectral B.F., Distance-Sensitive B.F. ...

35. Merkle Trees
Ralph C. Merkle, 1979
http://www.springerlink.com/content/q865hwxq73ex1am9/

36. Merkle Trees (Hash Trees)
Data structure containing a tree of summary information about a larger piece of data, used to verify its contents.
http://en.wikipedia.org/wiki/Hash_Tree

37. Merkle Trees (Hash Trees)
Leaves: hashes of data blocks. Nodes: hashes of their children.
Used to detect inconsistencies between replicas (anti-entropy) and to minimise the amount of transferred data.
ROOT = hash(A, B)
A = hash(C, D), B = hash(E, F)
C = hash(001), D = hash(002), E = hash(003), F = hash(004)
Leaves cover Data Blocks 001, 002, 003, 004
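The tree on this slide can be built bottom-up in a few lines. The sketch below uses SHA-1 from the standard library; the function names are hypothetical, and duplicating the last hash when a level has an odd number of nodes is one common convention assumed here, not something the slide specifies.

```python
import hashlib


def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def build_merkle_tree(blocks):
    """Return the tree as a list of levels: levels[0] = leaf hashes, levels[-1] = [root]."""
    if not blocks:
        raise ValueError("need at least one data block")
    level = [sha1(block) for block in blocks]  # leaves: hashes of data blocks
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: pair an odd node out with a copy of itself.
            level = level + [level[-1]]
        # Each parent is the hash of the concatenation of its two children.
        level = [sha1(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


if __name__ == "__main__":
    tree = build_merkle_tree([b"001", b"002", b"003", b"004"])
    print("root:", tree[-1][0].hex())
```

Changing any single data block changes its leaf hash and every hash on the path up to the root, which is what makes a root comparison a cheap consistency check.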

40. Merkle Trees
Node A ⇄ Node B: gossip exchange
Minimal data transfer
Differences are easy to locate
SHA-1, Whirlpool or Tiger (TTH) hash functions

43. Merkle Trees: Usage
Peer to Peer communication (DC++, ...)
Amazon Dynamo, Google BigTable, Google Wave, Cassandra, HBase, ZFS

46. Merkle Trees: Usage in Cassandra
Ensure the P2P network of nodes receives data blocks unaltered and unharmed.
Anti-entropy during major compactions (via Scuttlebutt reconciliation).
One Merkle Tree per Column Family (in Dynamo, one per node / key range).
org.apache.cassandra.utils.MerkleTree
http://wiki.apache.org/cassandra/ArchitectureAntiEntropy

47. References
Bloom Filters: http://bit.ly/bundles/quipo/1
Merkle Trees: http://bit.ly/bundles/quipo/2

48. We’re Hiring! http://mediasift.com/careers

49. Thank you!
Lorenzo Alberton, @lorenzoalberton, lorenzo@alberton.info, http://www.alberton.info/talks

Editor's Notes

Two keys might map into the same bucket.
An empty Bloom filter is an array of m bits, all set to 0. There must be k hash functions defined, each of which maps an element to one of the m array positions with a uniform random distribution.
To add an element, feed it to each of the k hash functions to get k array positions, and set the bits at those positions to 1.
To test for an element, feed it to each of the k hash functions to get k array positions: if any of the bits at these positions is 0, the element is definitely not in the set; if all of them are 1, the element is probably in the set (it may be a false positive).
Union and intersection of Bloom filters: simple bitwise OR and AND operations (on filters with the same m and the same hash functions).
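The add, query and union operations described in this note can be sketched as follows. The class is hypothetical, a Python integer stands in for the m-bit array, and salted SHA-256 is assumed as the source of the k hash functions.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter sketch: m bits, k salted hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = 0  # a Python int used as an m-bit array, initially all zeros

    def _positions(self, item):
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos  # set the bit at each of the k positions

    def __contains__(self, item):
        # Any zero bit means "definitely absent"; all ones means "probably present".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other):
        """Bitwise OR of two filters that share m, k and the same hash functions."""
        if (self.m, self.k) != (other.m, other.k):
            raise ValueError("filters must have the same m and k")
        merged = BloomFilter(self.m, self.k)
        merged.bits = self.bits | other.bits
        return merged
```

A query never returns a false negative: every bit set by `add` stays set, so an added element always passes the membership test.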
Tiger is a cryptographic hash function optimised for 64-bit platforms (1995). Size: 192 bits (truncated versions: 128 and 160 bits).
Murmur hash is very fast and has a low collision rate (2008).
Another good non-cryptographic hash function is the Jenkins hash function (Bob Jenkins, 1997).
Hashing with checksum functions is possible, and may produce a sufficiently uniform distribution of hash values, as long as the hash range size is small compared to the range of the checksum or fingerprint function. The CRC32 checksum provides only 16 bits (the higher half of the result) that are usable for hashing.

Popular in distributed web caches (small cost, big potential gain).
The Google Chrome web browser uses Bloom filters to speed up its Safe Browsing service.
In relational databases, Bloom filters are often used for JOINs.

All the bits for an element not yet inserted might already be set, which produces a false positive.
There is a clear tradeoff between m and the probability of a false positive.
The value of k that minimizes the probability of false positives is (ln 2) m/n ≈ 0.7 m/n.

An optimal number of hash functions k has been assumed.

Standard Bloom filters can't handle deletions: if deleting x means resetting 1s to 0s, then deleting an entry might delete several others.
2006. Precisely eliminating duplicates in an unbounded data stream (i.e. when you don't know the size of the data set up front) is not feasible in many streaming scenarios: exact duplicate-elimination algorithms share the underlying assumption that the whole data set is stored and can be accessed if needed.
Use cases: URL crawlers, network monitoring (number of accesses by IP in the past hour), trending topics.
In many data stream applications, the allocated space is rather small compared to the size of the stream. As more and more elements arrive, the fraction of zeros in the Bloom filter decreases continuously and the false positive rate increases accordingly, finally reaching the limit, 1, where every distinct element is reported as a duplicate and the Bloom filter is useless.
With a regular Bloom filter, there is no way to distinguish recent elements from past ones.
Retouched Bloom Filters (RBF) permit the removal of selected false positives at the expense of generating random false negatives.

Hash trees are used to protect any kind of data stored, handled and transferred in and between computers.

Each inner node is the hash value of the concatenation of its two children.
The principal advantage of a Merkle tree is that each branch of the tree can be checked independently without requiring nodes to download the entire tree or the entire data set.
For each key range of data, each member of the replica group computes a Merkle tree (a hash tree in which differences can be located quickly) and sends it to its neighbors. By comparing the received Merkle tree with its own, each member can quickly determine which portion of the data is out of sync. If a portion differs, it sends the diff to the members that are behind.

Tiger is a cryptographic hash function optimised for 64-bit platforms (1995). Size: 192 bits (truncated versions: 128 and 160 bits).
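The tree comparison described in this note can be sketched as a top-down walk that only descends into subtrees whose hashes differ. This is an illustration under simplifying assumptions: both replicas hold the same number of blocks, that number is a power of two, both trees are available locally as lists of levels, and the function names are hypothetical.

```python
import hashlib


def merkle_levels(blocks):
    """Bottom-up list of levels; assumes len(blocks) is a power of two."""
    level = [hashlib.sha1(block).digest() for block in blocks]
    levels = [level]
    while len(level) > 1:
        level = [hashlib.sha1(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_blocks(mine, theirs):
    """Walk both trees from the root and return the indices of leaves that differ."""
    top = len(mine) - 1
    if mine[top][0] == theirs[top][0]:
        return []          # roots match: the replicas are in sync
    suspects = [0]         # indices of mismatching nodes at the current level
    for depth in range(top, 0, -1):
        children = []
        for node in suspects:
            # Node i at this level has children 2i and 2i+1 one level down.
            for child in (2 * node, 2 * node + 1):
                if mine[depth - 1][child] != theirs[depth - 1][child]:
                    children.append(child)
        suspects = children
    return suspects


if __name__ == "__main__":
    replica_a = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
    replica_b = [b"k1=v1", b"k2=v2", b"k3=STALE", b"k4=v4"]
    print(differing_blocks(merkle_levels(replica_a), merkle_levels(replica_b)))
```

Matching subtrees are never explored, so the number of hashes compared grows with the number of differences and the tree depth rather than with the size of the data set.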
Hash trees can be used to protect any kind of data stored, handled and transferred in and between computers.
Before downloading a file on a P2P network, the top hash is acquired from a trusted source. Once the top hash (root hash) is available, the hash tree can be received from any non-trusted source.
Currently the main use of hash trees is to make sure that data blocks received from other peers in a peer-to-peer network are received undamaged and unaltered, and even to check that the other peers do not lie and send fake blocks.
Merkle trees are exchanged; if they disagree, Cassandra does a range-repair via compaction (using the Scuttlebutt reconciliation).
To ensure the data stays in sync even when no reads or writes touch it, replica nodes periodically gossip with each other to figure out whether anyone is out of sync. For each key range of data, each member of the replica group computes a Merkle tree (a hash tree in which differences can be located quickly) and sends it to its neighbors. By comparing the received Merkle tree with its own tree, each member can quickly determine which portion of the data is out of sync. If there is one, it sends the diff to the left-behind members.

Anti-entropy is the "catch-all" way to guarantee eventual consistency, but it is also pretty expensive and is therefore not done frequently. By combining the data sync with read repair and hinted handoff, we can keep the replicas pretty up-to-date.

The key difference in Cassandra's implementation of anti-entropy is that the Merkle trees are built per column family, and they are not maintained for longer than it takes to send them to neighboring nodes. Instead, the trees are generated as snapshots of the dataset during major compactions: this means that excess data might be sent across the network, but it saves local disk IO, and is preferable for very large datasets.
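The comparison step can be sketched in Python as follows. This is not Cassandra's implementation: the bucketing scheme (first byte of a SHA-256 of the key, modulo a power-of-two bucket count), the array layout of the tree, and the names `bucket_of`, `build_tree` and `differing_buckets` are all assumptions chosen to keep the example small. It only shows why two replicas can locate the out-of-sync portion without comparing every row.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def bucket_of(key: str, buckets: int) -> int:
    # Hypothetical partitioning: a real system would use its token ranges.
    return h(key.encode())[0] % buckets


def build_tree(rows: dict, buckets: int = 8):
    """Array-backed tree: node 1 is the root, node i has children 2i and 2i+1,
    and the leaves (one per bucket) sit at positions buckets .. 2*buckets-1."""
    if buckets < 1 or buckets > 256 or buckets & (buckets - 1):
        raise ValueError("buckets must be a power of two between 1 and 256")
    leaf_rows = [[] for _ in range(buckets)]
    for key in sorted(rows):
        leaf_rows[bucket_of(key, buckets)].append(f"{key}={rows[key]}")
    tree = [b""] * (2 * buckets)
    for i, items in enumerate(leaf_rows):
        tree[buckets + i] = h("\n".join(items).encode())
    for i in range(buckets - 1, 0, -1):
        tree[i] = h(tree[2 * i] + tree[2 * i + 1])
    return tree


def differing_buckets(mine, theirs, node=1):
    """Descend only into subtrees whose hashes disagree."""
    if mine[node] == theirs[node]:
        return []  # whole subtree is in sync, nothing to transfer
    buckets = len(mine) // 2
    if node >= buckets:
        return [node - buckets]  # reached a leaf: this bucket needs repair
    return (differing_buckets(mine, theirs, 2 * node)
            + differing_buckets(mine, theirs, 2 * node + 1))


replica_a = {f"user{i}": f"value{i}" for i in range(100)}
replica_b = dict(replica_a)
replica_b["user42"] = "stale value"  # one row fell behind

out_of_sync = differing_buckets(build_tree(replica_a), build_tree(replica_b))
print(out_of_sync == [bucket_of("user42", 8)])  # True
```

With fewer buckets the trees are cheaper to build and send, but each mismatch covers more rows, which is the "excess data might be sent across the network" trade-off mentioned above.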
Editor's Notes