How does Twitter track the top trending topics?
How does Amazon keep track of the top-selling items for the day?
How many cabs have been booked this month using your App?
Is the password that a new user is choosing a common/compromised password?
Modern web-scale systems process billions of transactions and generate terabytes of data every single day. To answer questions against this data, one would typically initiate a multi-minute query against a NoSQL datastore or kick off a batch job written in a distributed processing framework such as Spark or Flink. These jobs are throughput-oriented and not suited for real-time, low-latency queries. Yet you and your customers would like to have all this information "right now".
By the end of this talk, you'll see that you can power these low-latency queries with an incredibly low memory footprint "IF" you are willing to accept answers that are, say, 96-99% accurate. This talk introduces some of the go-to probabilistic data structures used by organisations with large amounts of data - specifically the Bloom filter, Count-Min Sketch and HyperLogLog.
1. Approximate now is better than Accurate later
Probabilistic data structures in Big Data
#ISSLearningFest
2. Why are we talking about this?
● 500 million tweets a day
● 3.5 billion Google searches per day
● 50 million Grab rides per day
● 3+ Petabytes of data in a mid-size bank
https://www.researchgate.net/
3. Volume & Velocity
● Volume
○ Massive amounts of data
○ Distributed across several machines
● Velocity
○ Real time ingestion of data
→ Simple operations such as counting become hard
→ Everyone wants knowledge from the data now!
4. Accuracy is overrated!
▸ Do you need 100% accurate information now?
▸ What does 99% accuracy look like?
▹ 355k = somewhere between 352k and 358k
▸ If you are okay with 99% accuracy, you:
▹ Can get real-time results
▹ Save a whole lot of Memory/CPU/Disk
5. Two problems
1. Membership (Volume)
Does the dataset contain a particular element?
2. Frequency (Velocity)
Who are the heavy hitters, or what are the top-k elements?
6. Two problems
1. Membership (does the dataset contain a particular element?):
a. GMail: Is my chosen password in the list of compromised passwords?
b. Huge file: Is my data in this file?
2. Frequency (heavy hitters)
a. Twitter: Number of tweets per trending topic
b. Amazon: Total number of SanDisk flash drives bought today
Other problems: Cardinality, Quantiles, Similarity
7. Structure of presentation
1. Pick a Problem
2. Do a back-of-the-envelope calculation
3. Introduce the data structure
4. Show how it works internally
9. Is your password in the list of compromised passwords?
Problem
● 1 billion* ~10-byte passwords
○ 10 GB on disk
○ 64 GB in memory (Java)*
*incl. char[] and object overheads
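A quick back-of-the-envelope check of these numbers in Python. The 64 bytes per entry for Java is an assumed round figure covering the String object, its backing char[], and HashSet bookkeeping; real JVM overhead varies by version and flags.

```python
# Sizing sketch for 1 billion ~10-byte passwords.
NUM_PASSWORDS = 1_000_000_000
RAW_BYTES_PER_PASSWORD = 10   # the password text itself
JVM_BYTES_PER_ENTRY = 64      # assumed String/char[]/HashSet overhead per entry

on_disk_gb = NUM_PASSWORDS * RAW_BYTES_PER_PASSWORD / 1e9
in_memory_gb = NUM_PASSWORDS * JVM_BYTES_PER_ENTRY / 1e9

print(f"On disk:   ~{on_disk_gb:.0f} GB")   # ~10 GB
print(f"In memory: ~{in_memory_gb:.0f} GB") # ~64 GB
```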
10. Possible solutions - Use a database
● Store the passwords in a database and run “select … where password='admin123'”
+ Cost
- Slower responses for disk reads
10 GB++ on disk
11. Possible solutions - Use an in-memory Set
● In-memory HashSet
+ Speed
- Cost
64 GB++ in memory
12. What about replication?
What if a node goes down?
Typical approach
● Replicate + Shard data across several machines
● Run distributed jobs
10 GB * N on disk
64 GB * N in memory
13. Are you okay with 99% accuracy?
What does 1% inaccuracy mean?
● GMail: You are telling the user that the password they chose is compromised when it wasn’t.
○ The user will retry
● Huge file: You are telling your application that the data is available in the file when it wasn’t.
○ Search through the file
It’s not that bad.
14. Introducing Bloom filter
A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.
False positive matches are possible, but false negatives are not.
“The password you chose is compromised” (when it is actually not) - Possible
“The password you chose is fine” (when it was actually compromised) - Not possible
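To make the definition concrete, here is a minimal Bloom filter sketch in Python. It is illustrative only: the bit-array size is an arbitrary assumption, and MD5/SHA-1 digests stand in for the k independent hash functions that a real implementation would choose (and size) from the expected item count and target false-positive rate.

```python
import hashlib

class BloomFilter:
    """Minimal illustrative Bloom filter: each item sets one bit per hash function."""

    def __init__(self, num_bits=1_000_000):  # size chosen arbitrarily for the demo
        self.num_bits = num_bits
        self.bits = bytearray(num_bits // 8 + 1)

    def _indexes(self, item):
        # MD5 and SHA-1 stand in for "k" independent hash functions.
        for algo in (hashlib.md5, hashlib.sha1):
            digest = algo(item.encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, item):
        # Any 0 bit means "definitely not present" (no false negatives);
        # all bits set means "probably present" (false positives possible).
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(item))

compromised = BloomFilter()
compromised.add("admin123")
print(compromised.might_contain("admin123"))     # always True once added
print(compromised.might_contain("s0me-r4re-pw")) # False, or rarely a false positive
```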
23. Problem
● Grab - 50 million rides per day
○ What are the top trips this month?
● YouTube - 5 billion videos watched per day
○ What are the “most watched” videos today?
● Twitter - 500 million tweets per day
○ What are the topics trending this week?
24. Lies, damned lies and internet statistics!
● Do you care if 21,429 people tweeted about #NationalGirlfriendDay, or are you fine with 21.5k?
25. Grab - “What are the top trips this month?”
Problem - Let’s pick one
● ~50 million rides per day
○ 1,500 million rides per month
○ Say, 10% unique rides
○ 150 million unique rides per month
● Keys (10 bytes) + Counters (4 bytes)
● 150 million keys + counters
● 10.2 GB in memory

             Key       Value
1 entry      10 bytes  4 bytes
150 million  1.5 GB    650 MB
In memory    9.6 GB    650 MB

Stats source: https://expandedramblings.com/index.php/grab-facts-statistics/
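The table’s figures check out under the same assumption as the password slide (~64 bytes of JVM overhead per key):

```python
# Sanity check of the slide's numbers (64 B/entry JVM overhead is an assumption).
unique_rides = 150_000_000
key_bytes, counter_bytes, jvm_overhead_bytes = 10, 4, 64

print(unique_rides * key_bytes / 1e9)           # 1.5  -> 1.5 GB of raw keys
print(unique_rides * counter_bytes / 1e6)       # 600  -> ~650 MB of counters w/ overhead
print(unique_rides * jvm_overhead_bytes / 1e9)  # 9.6  -> 9.6 GB of keys in JVM memory
# 9.6 GB of keys + ~0.65 GB of counters gives the ~10.2 GB total quoted above.
```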
26. Introducing Count-Min Sketch
A Count-Min Sketch is a memory-efficient probabilistic data structure that allows one to estimate frequency-related properties of data,
e.g. top-K frequent elements or heavy hitters.
27. What are we gaining?
● For storing 150 million keys and counters:
○ 99% accuracy
○ 5 hash functions
○ 4 million counters
○ Each counter is a 32-bit integer
● Total memory required: ~82 MB
(as compared to 10.2 GB)
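Those numbers are easy to sanity-check if we read “4 million counters” as per hash function, which matches the next slide’s point that each hash function gets its own array:

```python
# Rough check of the quoted ~82 MB (assumes 4M counters per hash function).
hash_functions = 5
counters_per_function = 4_000_000
bytes_per_counter = 4  # 32-bit integer

total_mb = hash_functions * counters_per_function * bytes_per_counter / 1e6
print(total_mb)  # 80.0 -> ~82 MB once array and bookkeeping overheads are added
```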
28. How does it work?
Frequency array: instead of maintaining an array of bits like a Bloom filter, a Count-Min Sketch maintains an array of counters.

Bloom filter (every cell holds 0 or 1; the outputs of all hash functions are populated in the same array):
Index  1  2  3
Bit    0  1  0

Count-Min Sketch (each hash function gets its own array of counters):
Index    0  1  2
MD5      3  5  1
Murmur3  1  4  0
29. How does it work?
1. The input (trending topic/destination) is hashed by “k” hash functions
2. Each hash function has its own array of counters, which is incremented based on the hashed value
31. How to estimate frequency?
1. Get the frequencies (counter values) from each hash function’s array
2. The minimum of these frequencies is the estimated count (see the sketch below)
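Putting slides 28-31 together, a minimal Count-Min Sketch might look like the following. Two rows mirror the MD5/Murmur3 pictures; MD5 and SHA-1 are used here only because Murmur3 isn’t in the Python standard library, and the width of 1000 is an arbitrary demo value.

```python
import hashlib

class CountMinSketch:
    """Minimal illustrative Count-Min Sketch: one counter row per hash function."""

    def __init__(self, width=1000, hash_algos=(hashlib.md5, hashlib.sha1)):
        self.width = width
        self.hash_algos = hash_algos
        self.rows = [[0] * width for _ in hash_algos]

    def _indexes(self, key):
        for algo in self.hash_algos:
            digest = algo(key.encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.width

    def add(self, key):
        # Each hash function increments a counter in its own row.
        for row, i in zip(self.rows, self._indexes(key)):
            row[i] += 1

    def estimate(self, key):
        # Collisions only inflate counters, so the minimum across rows
        # is the tightest estimate (and never an underestimate).
        return min(row[i] for row, i in zip(self.rows, self._indexes(key)))

cms = CountMinSketch()
for topic in ["#NationalGirlfriendDay"] * 3 + ["#ISSLearningFest"]:
    cms.add(topic)
print(cms.estimate("#NationalGirlfriendDay"))  # 3 (could be higher on collisions)
```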
32. How to find Top-K elements? – 1/3
1. Maintain a heavy-hitters heap of fixed capacity (3, 10, 100, etc.)

Index    0  1  2  3  4  5  6  7  8  9
MD5      3  5  1  0  0  7  0  1  0  1
Murmur3  1  4  0  3  2  3  1  0  2  0

Heavy-hitters heap (capacity 3):
shentclark  4
citybedok   3
yishair     2
33. How to find Top-K elements? – 2/3
2. Every time a key is inserted into the frequency arrays, check whether the estimated count of the key is greater than the minimum of the heavy-hitters heap.

Index    0  1  2  3  4  5  6  7  8  9
MD5      3  5  1  0  0  7  0  3  0  1
Murmur3  1  4  0  3  2  3  1  0  3  0

New key “clemharbor” lands on index 7 (MD5) and index 8 (Murmur3):
estimated count = min(3, 3) = 3

Heavy-hitters heap (capacity 3):
shentclark  4
citybedok   3
yishair     2
34. How to find Top-K elements? – 3/3
3. If the estimated frequency is greater than the minimum value in the top-K heap, replace the minimum key with the new key.

Index    0  1  2  3  4  5  6  7  8  9
MD5      3  5  1  0  0  7  0  3  0  1
Murmur3  1  4  0  3  2  3  1  0  3  0

“clemharbor” (estimated count 3) beats the heap minimum (“yishair”, 2) and replaces it.

Heavy-hitters heap (capacity 3):
shentclark  4
citybedok   3
clemharbor  3
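Slides 32-34 combined into a self-contained toy: the Count-Min rows from the previous sketch plus a fixed-capacity heavy-hitters structure. A plain dict with a min() scan stands in for the heap for readability; a real implementation would use an actual min-heap. The trip names are the ones on the slides.

```python
import hashlib

class TopK:
    """Illustrative heavy hitters: Count-Min rows + a fixed-capacity candidate set."""

    def __init__(self, k=3, width=1000, hash_algos=(hashlib.md5, hashlib.sha1)):
        self.k = k
        self.width = width
        self.hash_algos = hash_algos
        self.rows = [[0] * width for _ in hash_algos]
        self.heavy_hitters = {}  # key -> latest estimated count

    def _indexes(self, key):
        for algo in self.hash_algos:
            digest = algo(key.encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.width

    def add(self, key):
        for row, i in zip(self.rows, self._indexes(key)):
            row[i] += 1
        estimate = min(row[i] for row, i in zip(self.rows, self._indexes(key)))
        if key in self.heavy_hitters or len(self.heavy_hitters) < self.k:
            self.heavy_hitters[key] = estimate
        else:
            # Compare against the current minimum; replace it if we beat it.
            min_key = min(self.heavy_hitters, key=self.heavy_hitters.get)
            if estimate > self.heavy_hitters[min_key]:
                del self.heavy_hitters[min_key]
                self.heavy_hitters[key] = estimate

    def top_k(self):
        return sorted(self.heavy_hitters.items(), key=lambda kv: -kv[1])

trips = ["shentclark"] * 4 + ["citybedok"] * 3 + ["yishair"] * 2 + ["clemharbor"] * 3
tracker = TopK(k=3)
for trip in trips:
    tracker.add(trip)
print(tracker.top_k())  # [('shentclark', 4), ('citybedok', 3), ('clemharbor', 3)]
```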
35. Summary - Accuracy is overrated!
Membership
● Bloom filter
Frequency
● Count-Min Sketch
More:
● HyperLogLog, Counting Bloom filter, Cuckoo filter, t-digest
Probabilistic data structures give up some accuracy, but give us performance and space benefits.
Sometimes, probabilistic data structures are used to calculate approximate real-time numbers, and a batch job at end of day runs to update the accurate number.
39. Downsides – Bloom Filter
• “Vanilla” Bloom filters are not growable
• You need to know the number of items beforehand
• Remedy: Scalable Bloom filter
• Deletes aren’t possible in a “vanilla” Bloom filter
• We wouldn’t know which item set the bit at a given index
• Remedy: Counting Bloom filter
• Persistence is awkward
• We can’t persist just the modified bits to disk
40. Downsides - CMS
● Each counter is only an upper bound on the true frequency
● Biased counting - it returns the minimum of the over-estimated frequencies
● Cannot return the unique keys themselves