How does Twitter track the top trending topics?
How does Amazon keep track of the top-selling items for the day?
How many cabs have been booked this month using your App?
Is the password that a new user is choosing a common/compromised password?
Modern web-scale systems process billions of transactions and generate terabytes of data every single day. To answer questions against this data, one would typically initiate a multi-minute query against a NoSQL datastore or kick off a batch job written in a distributed processing framework such as Spark or Flink. These jobs are throughput-oriented and not suited for real-time, low-latency queries. Yet you and your customers would like to have all this information "right now".
By the end of this talk, you'll see that you can power these low-latency queries with an incredibly low memory footprint "IF" you are willing to accept answers that are, say, 96-99% accurate. This talk introduces some of the go-to probabilistic data structures used by organisations with large amounts of data - specifically the Bloom filter, Count-Min Sketch and HyperLogLog.
1. Approximate now is better than Accurate later
Probabilistic data structures in Big Data
#ISSLearningFest
2. Why are we talking about this?
● 500 million tweets a day
● 3.5 billion Google searches per day
● 50 million Grab rides per day
● 3+ Petabytes of data in a mid-size bank
https://www.researchgate.net/
3. Volume & Velocity
● Volume
○ Massive amounts of data
○ Distributed across several machines
● Velocity
○ Real time ingestion of data
→ Simple operations such as counting become hard
→ Everyone wants knowledge from the data now!
4. Accuracy is overrated!
▸ Do you need 100% accurate information now?
▸ What does 99% accuracy look like?
▹ 355k = somewhere between 352k and 358k
▸ If you are okay with 99% accuracy, you:
▹ Can get real-time results
▹ Save a whole lot of Memory/CPU/Disk
5. Two problems
1. Membership (Volume)
Does the dataset contain a particular element?
2. Frequency (Velocity)
Who are the heavy hitters, or what are the top-k elements?
6. Two problems
1. Membership (does the dataset contain a particular element?):
a. GMail: Is my chosen password in the list of compromised passwords?
b. Huge file: Is my data in this file?
2. Frequency (heavy hitters)
a. Twitter: Number of tweets per trending topic
b. Amazon: Total number of SanDisk flash drives bought today
Other problems: Cardinality, Quantiles, Similarity
7. Structure of presentation
1. Pick a Problem
2. Do a back-of-the-envelope calculation
3. Introduce the data structure
4. Show how it works internally
9. Is your password in the list of compromised passwords?
Problem
● 1 billion* ~10-byte passwords
○ 10 GB on disk
○ 64 GB in memory (Java)*
*incl. char[] and object overheads
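A quick back-of-the-envelope check of these numbers in Python. The 64 bytes per entry for Java is an assumed round figure covering the String object, its backing char[], and HashSet bookkeeping; real JVM overhead varies by version and flags.

```python
# Sizing sketch for 1 billion ~10-byte passwords.
NUM_PASSWORDS = 1_000_000_000
RAW_BYTES_PER_PASSWORD = 10   # the password text itself
JVM_BYTES_PER_ENTRY = 64      # assumed String/char[]/HashSet overhead per entry

on_disk_gb = NUM_PASSWORDS * RAW_BYTES_PER_PASSWORD / 1e9
in_memory_gb = NUM_PASSWORDS * JVM_BYTES_PER_ENTRY / 1e9

print(f"On disk:   ~{on_disk_gb:.0f} GB")   # ~10 GB
print(f"In memory: ~{in_memory_gb:.0f} GB") # ~64 GB
```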
10. Possible solutions - Use a database
● Store the passwords in a database and run “select … where password='admin123'”
+ Cost
- Slower responses for disk reads
10 GB++ on disk
11. Possible solutions - Use an in-memory Set
● In-memory HashSet
+ Speed
- Cost
64 GB++ in memory
12. What about replication?
What if a node goes down?
Typical approach
● Replicate + Shard data across several machines
● Run distributed jobs
10 GB * N on disk
64 GB * N in memory
13. Are you okay with 99% accuracy?
What does 1% inaccuracy mean?
● GMail: You are telling the user that the password they chose is compromised when it wasn’t.
○ The user will retry
● Huge file: You are telling your application that the data is available in the file when it wasn’t.
○ Search through the file
It’s not that bad.
14. Introducing Bloom filter
A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.
False positive matches are possible, but false negatives are not.
“The password you chose is compromised” (when it is actually not) - Possible
“The password you chose is fine” (when it was actually compromised) - Not possible
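To make the definition concrete, here is a minimal Bloom filter sketch in Python. It is illustrative only: the bit-array size is an arbitrary assumption, and MD5/SHA-1 digests stand in for the k independent hash functions that a real implementation would choose (and size) from the expected item count and target false-positive rate.

```python
import hashlib

class BloomFilter:
    """Minimal illustrative Bloom filter: each item sets one bit per hash function."""

    def __init__(self, num_bits=1_000_000):  # size chosen arbitrarily for the demo
        self.num_bits = num_bits
        self.bits = bytearray(num_bits // 8 + 1)

    def _indexes(self, item):
        # MD5 and SHA-1 stand in for "k" independent hash functions.
        for algo in (hashlib.md5, hashlib.sha1):
            digest = algo(item.encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, item):
        # Any 0 bit means "definitely not present" (no false negatives);
        # all bits set means "probably present" (false positives possible).
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(item))

compromised = BloomFilter()
compromised.add("admin123")
print(compromised.might_contain("admin123"))     # always True once added
print(compromised.might_contain("s0me-r4re-pw")) # False, or rarely a false positive
```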
23. Problem
● Grab - 50 million rides per day
○ What are the top trips this month?
● YouTube - 5 billion videos watched per day
○ What are the “most watched” videos today?
● Twitter - 500 million tweets per day
○ What are the topics trending this week?
24. Lies, damned lies and internet statistics!
● Do you care if 21,429 people tweeted about #NationalGirlfriendDay, or are you fine with 21.5k?
25. Grab - “What are the top trips this month?”
Problem - Let’s pick one
● ~50 million rides per day
○ 1,500 million rides per month
○ Say, 10% unique rides
○ 150 million unique rides per month
● Keys (10 bytes) + Counters (4 bytes)
● 150 million keys + counters
● 10.2 GB in memory

             Key       Value
1 entry      10 bytes  4 bytes
150 million  1.5 GB    650 MB
In memory    9.6 GB    650 MB

Stats source: https://expandedramblings.com/index.php/grab-facts-statistics/
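The table’s figures check out under the same assumption as the password slide (~64 bytes of JVM overhead per key):

```python
# Sanity check of the slide's numbers (64 B/entry JVM overhead is an assumption).
unique_rides = 150_000_000
key_bytes, counter_bytes, jvm_overhead_bytes = 10, 4, 64

print(unique_rides * key_bytes / 1e9)           # 1.5  -> 1.5 GB of raw keys
print(unique_rides * counter_bytes / 1e6)       # 600  -> ~650 MB of counters w/ overhead
print(unique_rides * jvm_overhead_bytes / 1e9)  # 9.6  -> 9.6 GB of keys in JVM memory
# 9.6 GB of keys + ~0.65 GB of counters gives the ~10.2 GB total quoted above.
```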
26. Introducing Count-Min Sketch
A Count-Min Sketch is a memory-efficient probabilistic data structure that allows one to estimate frequency-related properties of data,
e.g. top-K frequent elements or heavy hitters.
27. What are we gaining?
● For storing 150 million keys and counters:
○ 99% accuracy
○ 5 hash functions
○ 4 million counters
○ Each counter is a 32-bit integer
● Total memory required: ~82 MB
(as compared to 10.2 GB)
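Those numbers are easy to sanity-check if we read “4 million counters” as per hash function, which matches the next slide’s point that each hash function gets its own array:

```python
# Rough check of the quoted ~82 MB (assumes 4M counters per hash function).
hash_functions = 5
counters_per_function = 4_000_000
bytes_per_counter = 4  # 32-bit integer

total_mb = hash_functions * counters_per_function * bytes_per_counter / 1e6
print(total_mb)  # 80.0 -> ~82 MB once array and bookkeeping overheads are added
```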
28. How does it work?
Frequency array: instead of maintaining an array of bits like a Bloom filter, a Count-Min Sketch maintains an array of counters.

Bloom filter (every cell holds 0 or 1; the outputs of all hash functions are populated in the same array):
Index  1  2  3
Bit    0  1  0

Count-Min Sketch (each hash function gets its own array of counters):
Index    0  1  2
MD5      3  5  1
Murmur3  1  4  0
29. How does it work?
1. The input (trending topic/destination) is hashed by “k” hash functions
2. Each hash function has its own array of counters, which is incremented based on the hashed value
31. How to estimate frequency?
1. Get the frequencies (counter values) from each hash function’s array
2. The minimum of these frequencies is the estimated count (see the sketch below)
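Putting slides 28-31 together, a minimal Count-Min Sketch might look like the following. Two rows mirror the MD5/Murmur3 pictures; MD5 and SHA-1 are used here only because Murmur3 isn’t in the Python standard library, and the width of 1000 is an arbitrary demo value.

```python
import hashlib

class CountMinSketch:
    """Minimal illustrative Count-Min Sketch: one counter row per hash function."""

    def __init__(self, width=1000, hash_algos=(hashlib.md5, hashlib.sha1)):
        self.width = width
        self.hash_algos = hash_algos
        self.rows = [[0] * width for _ in hash_algos]

    def _indexes(self, key):
        for algo in self.hash_algos:
            digest = algo(key.encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.width

    def add(self, key):
        # Each hash function increments a counter in its own row.
        for row, i in zip(self.rows, self._indexes(key)):
            row[i] += 1

    def estimate(self, key):
        # Collisions only inflate counters, so the minimum across rows
        # is the tightest estimate (and never an underestimate).
        return min(row[i] for row, i in zip(self.rows, self._indexes(key)))

cms = CountMinSketch()
for topic in ["#NationalGirlfriendDay"] * 3 + ["#ISSLearningFest"]:
    cms.add(topic)
print(cms.estimate("#NationalGirlfriendDay"))  # 3 (could be higher on collisions)
```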
32. How to find Top-K elements? – 1/3
1. Maintain a heavy-hitters heap of fixed capacity (3, 10, 100, etc.)

Index    0  1  2  3  4  5  6  7  8  9
MD5      3  5  1  0  0  7  0  1  0  1
Murmur3  1  4  0  3  2  3  1  0  2  0

Heavy-hitters heap (capacity 3):
shentclark  4
citybedok   3
yishair     2
33. How to find Top-K elements? – 2/3
2. Every time a key is inserted into the frequency arrays, check whether the estimated count of the key is greater than the minimum of the heavy-hitters heap.

Index    0  1  2  3  4  5  6  7  8  9
MD5      3  5  1  0  0  7  0  3  0  1
Murmur3  1  4  0  3  2  3  1  0  3  0

New key “clemharbor” lands on index 7 (MD5) and index 8 (Murmur3):
estimated count = min(3, 3) = 3

Heavy-hitters heap (capacity 3):
shentclark  4
citybedok   3
yishair     2
34. How to find Top-K elements? – 3/3
3. If the estimated frequency is greater than the minimum value in the top-K heap, replace the minimum key with the new key.

Index    0  1  2  3  4  5  6  7  8  9
MD5      3  5  1  0  0  7  0  3  0  1
Murmur3  1  4  0  3  2  3  1  0  3  0

“clemharbor” (estimated count 3) beats the heap minimum (“yishair”, 2) and replaces it.

Heavy-hitters heap (capacity 3):
shentclark  4
citybedok   3
clemharbor  3
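Slides 32-34 combined into a self-contained toy: the Count-Min rows from the previous sketch plus a fixed-capacity heavy-hitters structure. A plain dict with a min() scan stands in for the heap for readability; a real implementation would use an actual min-heap. The trip names are the ones on the slides.

```python
import hashlib

class TopK:
    """Illustrative heavy hitters: Count-Min rows + a fixed-capacity candidate set."""

    def __init__(self, k=3, width=1000, hash_algos=(hashlib.md5, hashlib.sha1)):
        self.k = k
        self.width = width
        self.hash_algos = hash_algos
        self.rows = [[0] * width for _ in hash_algos]
        self.heavy_hitters = {}  # key -> latest estimated count

    def _indexes(self, key):
        for algo in self.hash_algos:
            digest = algo(key.encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.width

    def add(self, key):
        for row, i in zip(self.rows, self._indexes(key)):
            row[i] += 1
        estimate = min(row[i] for row, i in zip(self.rows, self._indexes(key)))
        if key in self.heavy_hitters or len(self.heavy_hitters) < self.k:
            self.heavy_hitters[key] = estimate
        else:
            # Compare against the current minimum; replace it if we beat it.
            min_key = min(self.heavy_hitters, key=self.heavy_hitters.get)
            if estimate > self.heavy_hitters[min_key]:
                del self.heavy_hitters[min_key]
                self.heavy_hitters[key] = estimate

    def top_k(self):
        return sorted(self.heavy_hitters.items(), key=lambda kv: -kv[1])

trips = ["shentclark"] * 4 + ["citybedok"] * 3 + ["yishair"] * 2 + ["clemharbor"] * 3
tracker = TopK(k=3)
for trip in trips:
    tracker.add(trip)
print(tracker.top_k())  # [('shentclark', 4), ('citybedok', 3), ('clemharbor', 3)]
```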
35. Summary - Accuracy is overrated!
Membership
● Bloom filter
Frequency
● Count-Min Sketch
More:
● HyperLogLog, Counting Bloom filter, Cuckoo filter, t-digest
Probabilistic data structures give up some accuracy, but give us performance and space benefits.
Sometimes, probabilistic data structures are used to calculate approximate real-time numbers, and a batch job at end of day runs to update the accurate number.
39. Downsides – Bloom Filter
• “Vanilla” Bloom filters are not growable
• You need to know the number of items beforehand
• Remedy: Scalable Bloom filter
• Deletes aren’t possible in a “vanilla” Bloom filter
• We wouldn’t know which item set the bit at a given index
• Remedy: Counting Bloom filter
• Persistence is awkward
• We can’t persist just the modified bits to disk
40. Downsides - CMS
● Each counter is only an upper bound on the true frequency
● Biased counting - it returns the minimum of the over-estimated frequencies
● Cannot return the unique keys themselves