Approximate now is better than accurate later
Probabilistic data structures in Big Data
#ISSLearningFest
Why are we talking about this?
● 500 million tweets a day
● 3.5 billion Google searches per day
● 50 million Grab rides per day
● 3+ Petabytes of data in a mid-size bank
https://www.researchgate.net/
Volume & Velocity
● Volume
○ Massive amounts of data
○ Distributed across several machines
● Velocity
○ Real time ingestion of data
→ Simple operations such as counting are hard
→ Everyone wants knowledge from the data now!
Accuracy is overrated !
▸ Do you need 100% accurate information now?
▸ What is 99% accuracy?
▹ A reported 355k means the true value lies roughly between 352k and 358k
▸ If you are okay with 99% accuracy, you:
▹ Can get real time results
▹ Save a whole lot of Memory/CPU/Disk
Two problems
1. Membership (Volume)
Does the dataset contain a particular element?
2. Frequency (Velocity)
Who are the heavy-hitters or what are the top-k elements?
Two problems
1. Membership (does the dataset contain a particular element?):
a. GMail: Is my chosen password in the list of compromised passwords?
b. Huge file: Is my data in this file?
2. Frequency (heavy-hitters)
a. Twitter: Number of tweets per trending topic
b. Amazon: Total number of SanDisk flash drives bought today
Other problems: Cardinality, Quantiles, Similarity
Structure of presentation
1. Pick a Problem
2. Do a back of the envelope calculation
3. Introduce the data-structure
4. How it internally works
1. Membership
“Does the dataset contain a particular element?”
Is your password in the list of compromised
passwords?
Problem
● 1 billion passwords × ~10 bytes each
○ 10 GB on disk
○ 64 GB in memory (Java)*
*incl. char[] and object overheads
Possible solutions - Use a database
● Store the passwords in a database and query with “select … where password=admin123”
+ Cost
- Slower responses because of disk reads
10 GB++ on disk
Possible solutions - Use an in-memory Set
● In-memory HashSet
+ Speed
- Cost
64 GB++ in memory
What about replication?
What if node goes down?
Typical approach
● Replicate + shard data across several machines
● Run distributed jobs
10 GB × N on disk
64 GB × N in memory
Are you okay with 99% accuracy?
What does 1% inaccuracy mean?
● GMail: You are telling the user that the password that
they chose is compromised when it wasn’t.
○ User will retry
● Huge file: You are telling your application that the
data is available in the file when it wasn’t.
○ Search through the file
It’s not that bad.
Introducing Bloom filter
A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.
False positive matches are possible, but false negatives are not.
“The password you chose is compromised” (when it is actually not) - Possible
“The password you chose is fine” (when it was actually compromised) - Not possible
How does it work?
Let’s add one more password
Check membership
But... but... you said it wasn’t accurate!
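The add and check steps can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation behind the slides: the class name `BloomFilter`, the method names, the tiny sizes, and the use of SHA-256 with double hashing to derive the k positions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: one bit array shared by all k hash functions."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Assumption: derive k positions from one SHA-256 digest (double hashing)
        # instead of running k separate hash functions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd, so never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        # Set the bit at every hashed position.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "probably present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


compromised = BloomFilter(num_bits=1024, num_hashes=3)
for password in ["admin123", "letmein", "qwerty"]:
    compromised.add(password)

# An added password always has all of its bits set, so there are no false negatives.
print(compromised.might_contain("admin123"))
# A password that was never added is usually rejected, but may be a false positive
# if other passwords happened to set the same bits.
print(compromised.might_contain("correct horse battery staple"))
```

A false positive happens when every bit for an unseen password was already set by other passwords, which is why the answer is "probably yes" rather than "yes".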
How to reduce false positives?
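● Use a larger bit array (m)
● Use more hash functions (k), up to the optimum for the chosen m and number of items; beyond that, extra hash functions fill the array faster and raise the false-positive rate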
https://hur.st/bloomfilter/
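As a rough back-of-the-envelope check, the standard Bloom filter sizing formulas, m = -n·ln(p) / (ln 2)² bits and k = (m/n)·ln 2 hash functions, can be applied to the 1 billion password example with a 1% false-positive rate. The function name `bloom_parameters` is made up for this sketch.

```python
import math


def bloom_parameters(num_items: int, false_positive_rate: float) -> tuple[int, int]:
    """Return (bits, hash functions) from the standard Bloom filter sizing formulas."""
    num_bits = math.ceil(
        -num_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    num_hashes = max(1, round(num_bits / num_items * math.log(2)))
    return num_bits, num_hashes


# 1 billion passwords, 1% false positives (the "99% accuracy" of the slides).
num_bits, num_hashes = bloom_parameters(1_000_000_000, 0.01)
print(f"{num_bits / 8 / 1e9:.2f} GB, {num_hashes} hash functions")
```

Under these assumptions the filter needs roughly 1.2 GB and 7 hash functions, compared with 10 GB on disk for the raw list.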
Where are Bloom filters used?
● Parquet/ORC data formats
● Several NoSQL databases
2. Frequency
“Who are the heavy-hitters/top-k elements?”
Problem
● Grab - 50 million rides per day
○ What are the top trips this month?
● YouTube - 5 billion videos watched per day
○ What are the “most watched” videos today?
● Twitter - 500 million tweets per day
○ What are the topics trending this week?
Lies, damned lies and internet statistics!
● Do you care if 21,429 people tweeted about #NationalGirlfriendDay, or are you fine with 21.5k?
Problem - Let’s pick one
Grab - “What are the top trips this month?”
● ~50 million rides per day
○ 1,500 million rides per month
○ Say, 10% unique rides
○ 150 million unique rides per month
● Keys (10 bytes) + Counters (4 bytes)
● 150 million keys + counters: 10.2 GB in memory
              Key        Value
1 entry       10 bytes   4 bytes
150 million   1.5 GB     600 MB
In memory     9.6 GB     650 MB
Stats source: https://expandedramblings.com/index.php/grab-facts-statistics/
Introducing Count-Min Sketch
A Count-Min Sketch is a memory-efficient probabilistic data structure that
allows one to estimate frequency-related properties of data
E.g. Top-K frequent elements or heavy hitters
What are we gaining?
● For storing 150 million keys and counters
○ 99% accuracy
○ 5 hash functions
○ 4 million counters
○ Each counter is a 32-bit integer
● Total memory required: ~82 MB
(as compared to 10.2 GB)
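The figures above can be sanity-checked with the standard Count-Min Sketch bounds: depth d = ⌈ln(1/δ)⌉ and width w = ⌈e/ε⌉, where an estimate exceeds the true count by more than ε·N with probability at most δ (N is the total number of events). The sketch below assumes that "99% accuracy" means δ = 0.01, that "4 million counters" is the width of each hash function's array, and that N is the 1,500 million rides per month from the Grab example; the function names are made up for illustration.

```python
import math


def cms_depth(failure_probability: float) -> int:
    """Number of hash functions (rows) for a given failure probability delta."""
    return math.ceil(math.log(1 / failure_probability))


def cms_error_factor(width: int) -> float:
    """Epsilon implied by a given row width: epsilon = e / width."""
    return math.e / width


depth = cms_depth(0.01)            # delta = 1%
width = 4_000_000                  # assumed: counters per hash function
memory_bytes = depth * width * 4   # 32-bit counters

total_events = 1_500_000_000       # rides per month from the Grab example
max_overestimate = cms_error_factor(width) * total_events

print(depth, memory_bytes / 1e6, round(max_overestimate))
```

Under these assumptions the depth comes out at 5 and the memory at about 80 MB, close to the ~82 MB quoted above, with each count over-estimated by at most about a thousand rides, 99% of the time.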
How does it work?
Frequency array: Instead of maintaining an array of bits like Bloom filter,
Count-Min sketch maintains an array of counters
Bloom filter: a single array of bits (0 or 1); the output of all hashes is populated in the same array
Index  1 2 3
Bit    0 1 0
Count-Min Sketch: each hashing function gets its own array of counters
Index    0 1 2
MD5      3 5 1
Murmur3  1 4 0
How does it work?
1. Input (trending topics/destination) would be hashed by “k” hash functions
2. Each hash function would have its own array of counters, which gets
incremented based on hashed value
How to estimate frequency?
1. Look up the key’s counter value in the array of each hashing function
2. The minimum of these counter values is the estimated count
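The update and estimate steps can be sketched as follows. This is a minimal illustration, not the deck's implementation: the class name `CountMinSketch`, the small sizes, and the use of a row-salted SHA-256 in place of separate hash functions such as MD5 and Murmur3 are assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Minimal Count-Min Sketch: one array of counters per hash function."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key: str, row: int) -> int:
        # Assumption: salting SHA-256 with the row number stands in for
        # "k" independent hash functions.
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key: str, count: int = 1) -> None:
        # Increment one counter in every row.
        for row in range(self.depth):
            self.rows[row][self._index(key, row)] += count

    def estimate(self, key: str) -> int:
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.rows[row][self._index(key, row)]
                   for row in range(self.depth))


sketch = CountMinSketch(width=1000, depth=5)
for topic in ["#bigdata", "#python", "#bigdata", "#java", "#bigdata"]:
    sketch.add(topic)

# The estimate is never below the true count (3 here); collisions can only raise it.
print(sketch.estimate("#bigdata"))
```

Taking the minimum across rows works because a key would have to collide with other keys in every row for its estimate to be inflated.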
How to find Top-K elements? – 1/3
1. Maintain a heavy-hitters heap of fixed capacity (3, 10, 100, etc.)
Frequency arrays:
Index    0 1 2 3 4 5 6 7 8 9
MD5      3 5 1 0 0 7 0 1 0 1
Murmur3  1 4 0 3 2 3 1 0 2 0
Heavy-hitters heap (key, count):
shentclark 4
citybedok 3
yishair 2
How to find Top-K elements? – 2/3
2. Every time a key is inserted into the frequency array, check whether the estimated
count of the key is greater than the minimum of the heavy-hitters heap.
Frequency arrays after inserting clemharbor:
Index    0 1 2 3 4 5 6 7 8 9
MD5      3 5 1 0 0 7 0 3 0 1
Murmur3  1 4 0 3 2 3 1 0 3 0
Heavy-hitters heap (key, count):
shentclark 4
citybedok 3
yishair 2
New key: clemharbor, estimated count = 3 (the minimum of its MD5 and Murmur3 counters)
How to find Top-K elements? – 3/3
3. If the estimated count is greater than the minimum value in the top-K heap, replace the
minimum key with the new key
Frequency arrays:
Index    0 1 2 3 4 5 6 7 8 9
MD5      3 5 1 0 0 7 0 3 0 1
Murmur3  1 4 0 3 2 3 1 0 3 0
Heavy-hitters heap after the update (key, count):
shentclark 4
citybedok 3
clemharbor 3 (estimated count = 3, replaces yishair 2)
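The three steps can be put together in a short sketch that replays the example keys. The class names `CountMinSketch` and `TopK` are made up for illustration, and for brevity the heavy hitters are kept in a plain dictionary with a linear scan for the minimum rather than a real heap; a production version would likely use a min-heap.

```python
import hashlib


class CountMinSketch:
    """Compact Count-Min Sketch so that this example is self-contained."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key: str, row: int) -> int:
        # Assumption: row-salted SHA-256 stands in for independent hash functions.
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key: str) -> int:
        """Increment the key's counter in every row and return the new estimate."""
        counts = []
        for row, counters in enumerate(self.rows):
            idx = self._index(key, row)
            counters[idx] += 1
            counts.append(counters[idx])
        return min(counts)


class TopK:
    """Tracks the k keys with the highest estimated counts."""

    def __init__(self, capacity: int, sketch: CountMinSketch) -> None:
        self.capacity = capacity
        self.sketch = sketch
        self.tracked: dict[str, int] = {}  # key -> estimated count

    def add(self, key: str) -> None:
        estimate = self.sketch.add(key)
        # Step 1: already tracked, or there is still room.
        if key in self.tracked or len(self.tracked) < self.capacity:
            self.tracked[key] = estimate
            return
        # Step 2: compare with the current minimum of the heavy hitters.
        smallest = min(self.tracked, key=self.tracked.get)
        # Step 3: replace the minimum key if the new key's estimate is greater.
        if estimate > self.tracked[smallest]:
            del self.tracked[smallest]
            self.tracked[key] = estimate

    def items(self) -> list[tuple[str, int]]:
        return sorted(self.tracked.items(), key=lambda kv: kv[1], reverse=True)


top = TopK(capacity=3, sketch=CountMinSketch(width=2000, depth=5))
stream = (["shentclark"] * 4 + ["citybedok"] * 3
          + ["yishair"] * 2 + ["clemharbor"] * 3)
for destination in stream:
    top.add(destination)

# clemharbor replaces yishair once its estimate reaches 3 (assuming no counter collisions).
print(top.items())
```

Note that the comparison is strict, as in the slides: clemharbor only enters the heap on its third ride, when its estimate of 3 exceeds yishair's 2.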
Summary - Accuracy is overrated !
Membership
● Bloom filter
Frequency
● Count Min Sketch
More :
● HyperLogLog, Counting Bloom filter, Cuckoo filter, t-digest
Probabilistic Data Structures give up accuracy but give us performance and space benefits
Sometimes, PDSAs are used to calculate approximate real-time numbers, and a batch job at EOD runs to update the accurate number.
That’s it !
Appendix
Downsides – Bloom Filter
• “Vanilla” Bloom filters are not growable
○ You need to know the number of items beforehand
○ Alternative: Scalable Bloom filter
• Deletes aren’t possible in a “vanilla” Bloom filter
○ We wouldn’t know which item set the bit at a given index
○ Alternative: Counting Bloom filter
• Persistence
○ Can’t persist only the modified bits to disk
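To illustrate how a Counting Bloom filter makes deletes possible, here is a minimal sketch that replaces each bit with a small counter. The class name `CountingBloomFilter`, the sizes, and the SHA-256 double hashing are assumptions made for the example, not a reference implementation.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter variant that keeps a counter per slot so items can be removed."""

    def __init__(self, num_counters: int, num_hashes: int) -> None:
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.counters = [0] * num_counters

    def _positions(self, item: str) -> list[int]:
        # Assumption: derive k positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_counters for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.counters[pos] += 1

    def might_contain(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))

    def remove(self, item: str) -> bool:
        # Only safe for items that were really added; removing a false positive
        # would corrupt the counters of other items.
        if not self.might_contain(item):
            return False
        for pos in self._positions(item):
            self.counters[pos] -= 1
        return True


passwords = CountingBloomFilter(num_counters=1024, num_hashes=3)
passwords.add("admin123")
passwords.add("letmein")
passwords.remove("admin123")

# "letmein" is still reported as present after the other item is removed.
print(passwords.might_contain("letmein"))
```

The price is memory: each slot now holds a counter instead of a single bit.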
Downsides - CMS
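● Each counter has an upper bound (a fixed-width integer can overflow)
● Biased counting - collisions can only inflate counters, so the returned minimum may over-estimate but never under-estimate
● Cannot return the unique keys it has seen; keys must be tracked separately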

Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 

Último (20)

Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 

Approximate "Now" is Better Than Accurate "Later"

  • 1. Approximate now is better than Accurate later Probabilistic data structures in Big Data #ISSLearningFest
  • 2. Why are we talking about this? ● 500 million tweets a day ● 3.5 billion Google searches per day ● 50 million Grab rides per day ● 3+ Petabytes of data in a mid-size bank https://www.researchgate.net/
  • 3. Volume & Velocity ● Volume ○ Massive amounts of data ○ Distributed across several machines ● Velocity ○ Real-time ingestion of data → Simple operations such as counting are hard → Everyone wants knowledge from the data now!
  • 4. Accuracy is overrated! ▸ Do you need 100% accurate information now? ▸ What is 99% accuracy? ▹ 355k = 352k to 358k ▸ If you are okay with 99% accuracy, you: ▹ Can get real-time results ▹ Save a whole lot of memory/CPU/disk
  • 5. Two problems 1. Membership (Volume): Does the dataset contain a particular element? 2. Frequency (Velocity): Who are the heavy hitters, or what are the top-k elements?
  • 6. Two problems 1. Membership (does the dataset contain a particular element?): a. GMail: Is my chosen password in the list of compromised passwords? b. Huge file: Is my data in this file? 2. Frequency (heavy hitters): a. Twitter: Number of tweets per trending topic b. Amazon: Total number of SanDisk flash drives bought today Other problems: Cardinality, Quantiles, Similarity
  • 7. Structure of presentation 1. Pick a Problem 2. Do a back of the envelope calculation 3. Introduce the data-structure 4. How it internally works
  • 8. 1. Membership “Does the dataset contain a particular element?”
  • 9. Problem: Is your password in the list of compromised passwords? ● 1 billion × ~10-byte passwords ○ 10 GB on disk ○ 64 GB in memory (Java)* *incl. char[] and object overheads
  • 10. Possible solutions - Use a database ● Database store and “select … where password=admin123” + Cost - Slower responses for disk reads 10 GB++ on disk
  • 11. Possible solutions - Use an in-memory Set ● In-memory HashSet + Speed - Cost 64 GB++ in memory
  • 12. What about replication? What if a node goes down? Typical approach ● Replicate + shard data across several machines ● Run distributed jobs 10 GB * N on disk 64 GB * N in memory
  • 13. Are you okay with 99% accuracy? What does 1% inaccuracy mean? ● GMail: You are telling the user that the password that they chose is compromised when it wasn’t. ○ User will retry ● Huge file: You are telling your application that the data is available in the file when it wasn’t. ○ Search through the file It’s not that bad.
  • 14. Introducing Bloom filter A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible but false negatives are not. “The password you chose is compromised” (when it is actually not) - Possible “The password you chose is fine” (when it was actually compromised) - Not possible
  • 15. How does it work?
  • 16. How does it work?
  • 17. Let’s add one more password?
  • 19. But... but... you said it wasn’t accurate!
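The add and lookup steps walked through in the slides above can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the class name `BloomFilter`, the method names, and the use of SHA-256 with double hashing to derive the `k` positions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash positions per item."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m                           # number of bits in the array
        self.k = k                           # number of hash positions per item
        self.bits = bytearray((m + 7) // 8)  # bit array, all zeros initially

    def _positions(self, item: str) -> list:
        # Assumption: derive k positions from a single SHA-256 digest using
        # double hashing (h1 + i * h2) rather than k separate hash functions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        # Set every bit the item hashes to.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # True means "possibly present" (false positives can happen);
        # False means "definitely not present" (no false negatives).
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


compromised = BloomFilter(m=1024, k=3)
compromised.add("admin123")
print(compromised.might_contain("admin123"))  # an added item is always reported
```

A lookup that returns `False` is always correct, because an added item sets all of its bits. A lookup that returns `True` may be a false positive when other items happen to have set the same bits.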
  • 20. How to reduce false positives? ● Make the array size (m) larger ● Use more hash functions (up to the optimum for the given array size and number of items) https://hur.st/bloomfilter/
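As a rough back-of-the-envelope check, the standard Bloom filter sizing formulas give the array size and hash count for a target false-positive rate: m = -n ln p / (ln 2)^2 and k = (m / n) ln 2. The sketch below applies them to the 1 billion passwords and 1% error rate used earlier in the deck; the function name `bloom_parameters` is an assumption made for the example.

```python
import math


def bloom_parameters(n: int, p: float) -> tuple:
    """Return (bits, hash_count) for n items at false-positive rate p."""
    # Optimal number of bits for the target false-positive rate.
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    # Optimal number of hash functions for that m and n.
    k = max(1, round(m / n * math.log(2)))
    return m, k


m, k = bloom_parameters(n=1_000_000_000, p=0.01)
print(f"{m / 8 / 1e9:.2f} GB of bits, {k} hash functions")
```

Under these formulas the filter needs roughly 1.2 GB and 7 hash functions, compared with the 64 GB in-memory set from the earlier slide. Adding hash functions beyond the optimum fills the array faster and raises the false-positive rate again.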
  • 21. Where are Bloom filters used? ● Parquet/ORC data formats ● Several NoSQL databases
  • 22. 2. Frequency “Who are the heavy-hitters/top-k elements?”
  • 23. Problem ● Grab - 50 million rides per day ○ What are the top trips this month? ● YouTube - 5 billion videos watched per day ○ What are the “most watched” videos today? ● Twitter - 500 million tweets per day ○ What are the topics trending this week?
  • 24. Lies, damned lies and internet statistics! ● Do you care if 21,429 people tweeted about #NationalGirlfriendDay or are you fine with 21.5k?
  • 25. Problem - Let’s pick one: Grab - “What are the top trips this month?” ● ~50 million rides per day ○ 1,500 million rides per month ○ Say, 10% unique rides ○ 150 million unique rides per month ● Keys (10 bytes) + counters (4 bytes) ● 150 million keys + counters: 10.2 GB in memory. Table (key / value): 1 entry = 10 bytes / 4 bytes; 150 million entries = 1.5 GB / 650 MB; in memory = 9.6 GB / 650 MB. Stats source: https://expandedramblings.com/index.php/grab-facts-statistics/
  • 26. Introducing Count-Min Sketch A Count-Min Sketch is a memory-efficient probabilistic data structure that allows one to estimate frequency-related properties of data, e.g. top-K frequent elements or heavy hitters
  • 27. What are we gaining? ● For storing 150 million keys and counters ○ 99% accuracy ○ 5 hash functions ○ 4 million counters ○ Each counter is a 32-bit integer ● Total memory required: ~82 MB (as compared to 10.2 GB)
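The memory figure on this slide can be approximated with simple arithmetic, assuming the 4 million counters are per hash function (one row per hash). The function name below is hypothetical.

```python
def sketch_memory_bytes(hash_functions: int, counters_per_row: int,
                        bytes_per_counter: int) -> int:
    """Memory for a Count-Min Sketch: one row of counters per hash function."""
    return hash_functions * counters_per_row * bytes_per_counter


# Figures from the slide: 5 hash functions, 4 million 32-bit counters each.
total = sketch_memory_bytes(5, 4_000_000, 4)
print(total / 1e6, "MB")
```

This gives about 80 MB of raw counters, in line with the ~82 MB quoted on the slide; the deck does not say what accounts for the small difference.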
  • 28. How does it work? Frequency array: Instead of maintaining an array of bits like a Bloom filter, a Count-Min Sketch maintains an array of counters. [Figure: a Bloom filter bit array whose cells hold 0 or 1, next to a Count-Min Sketch with one row of counters per hash function; at indexes 0-2 the MD5 row holds 3, 5, 1 and the Murmur3 row holds 1, 4, 0.] In a Bloom filter, the output of all hashes populates the same array. In a Count-Min Sketch, each hash function gets its own array.
  • 29. How does it work? 1. Input (trending topics/destination) would be hashed by “k” hash functions 2. Each hash function would have its own array of counters, which gets incremented based on hashed value
  • 30. How does it work? 1. Input (trending topics/destination) would be hashed by “k” hash functions 2. Each hash function would have its own array of counters, which gets incremented based on hashed value
  • 31. How to estimate frequency? 1. Get the frequencies (counter values) for the key from each hash function’s array 2. The minimum of these frequencies is the estimated count
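The increment and estimate steps described above can be sketched as follows. This is an illustrative version only: the slides show MD5 and Murmur3 as the hash functions, whereas this example simulates independent hashes by salting SHA-256 with the row number, and the class and method names are assumptions.

```python
import hashlib


class CountMinSketch:
    """Minimal Count-Min Sketch: one row of counters per hash function."""

    def __init__(self, depth: int, width: int) -> None:
        self.depth = depth  # number of hash functions (rows)
        self.width = width  # number of counters per row
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row: int, key: str) -> int:
        # Assumption: salt one hash with the row number to stand in for
        # the separate hash functions (MD5, Murmur3) shown on the slides.
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key: str, count: int = 1) -> None:
        # Increment the key's counter in every row.
        for row in range(self.depth):
            self.rows[row][self._index(row, key)] += count

    def estimate(self, key: str) -> int:
        # Collisions only ever inflate counters, so the smallest counter
        # across rows is the closest estimate and never an undercount.
        return min(
            self.rows[row][self._index(row, key)]
            for row in range(self.depth)
        )


sketch = CountMinSketch(depth=5, width=1000)
for _ in range(3):
    sketch.add("clemharbor")
print(sketch.estimate("clemharbor"))
```

Because counters are only incremented, the estimate is always at least the true count; taking the minimum across rows limits how much collisions can inflate it.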
  • 32. How to find Top-K elements? – 1/3 1. Maintain a heavy hitters heap of fixed capacity (3, 10, 100 etc.) [Figure: frequency arrays at indexes 0-9, MD5 row: 3 5 1 0 0 7 0 1 0 1, Murmur3 row: 1 4 0 3 2 3 1 0 2 0; heavy hitters heap: shentclark 4, citybedok 3, yishair 2.]
  • 33. How to find Top-K elements? – 2/3 2. Every time a key gets inserted into the frequency array, check if the estimated count of the key is greater than the minimum of the heavy hitters heap. [Figure: after inserting clemharbor, MD5 row: 3 5 1 0 0 7 0 3 0 1, Murmur3 row: 1 4 0 3 2 3 1 0 3 0; heavy hitters heap: shentclark 4, citybedok 3, yishair 2; clemharbor estimated count = 3.]
  • 34. How to find Top-K elements? – 3/3 3. If the frequency is greater than the minimum value in the top-K heap, replace the minimum key with the new key [Figure: same frequency arrays as the previous slide; in the heavy hitters heap, clemharbor (estimated count = 3) replaces yishair (2), leaving shentclark 4, citybedok 3, clemharbor 3.]
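The three top-K steps above can be sketched end to end. This is a simplified illustration under stated assumptions: the fixed-capacity "heap" is kept as a small dictionary that is scanned for its minimum (adequate for small K, and simpler than updating entries inside a real heap), the hash functions are simulated with salted SHA-256, and the `TopK` class and its method names are hypothetical.

```python
import hashlib


class TopK:
    """Heavy hitters tracker backed by a small Count-Min Sketch."""

    def __init__(self, k: int, depth: int = 5, width: int = 1000) -> None:
        self.k = k
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]
        self.top = {}  # key -> estimated count, at most k entries

    def _index(self, row: int, key: str) -> int:
        # Assumption: salted SHA-256 stands in for the per-row hash functions.
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key: str) -> None:
        # Step 2: insert into the frequency arrays and take the minimum
        # counter across rows as the estimated count.
        counts = []
        for row in range(self.depth):
            i = self._index(row, key)
            self.rows[row][i] += 1
            counts.append(self.rows[row][i])
        estimate = min(counts)

        if key in self.top or len(self.top) < self.k:
            # Step 1: keep a fixed-capacity set of heavy hitters.
            self.top[key] = estimate
        else:
            # Step 3: replace the current minimum if the new key beats it.
            smallest = min(self.top, key=self.top.get)
            if estimate > self.top[smallest]:
                del self.top[smallest]
                self.top[key] = estimate

    def heavy_hitters(self) -> list:
        return sorted(self.top.items(), key=lambda kv: kv[1], reverse=True)


tracker = TopK(k=3)
trips = ["shentclark"] * 4 + ["citybedok"] * 3 + ["yishair"] * 2 + ["clemharbor"] * 3
for trip in trips:
    tracker.add(trip)
print(tracker.heavy_hitters())
```

With the trip names and counts from the slides, the third `clemharbor` insert has an estimated count of 3, which exceeds `yishair`'s 2, so `clemharbor` takes its place among the top three.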
  • 35. Summary - Accuracy is overrated! Membership ● Bloom filter Frequency ● Count-Min Sketch More: ● HyperLogLog, Counting Bloom filter, Cuckoo filter, t-digest Probabilistic data structures give up accuracy but give us performance and space benefits. Sometimes, probabilistic data structures are used to calculate approximate real-time numbers, and a batch job at end of day runs to update the accurate number.
  • 37. Give Us Your Feedback #ISSLearningFest Day 2 Programme
  • 39. Downsides - Bloom Filter • “Vanilla” Bloom filters are not growable • Need to know the number of items beforehand • Alternative: scalable Bloom filter • Deletes aren’t possible in a “vanilla” Bloom filter • We wouldn’t know which item set the bit at an index • Alternative: counting Bloom filter • Persistence • Can’t persist only the modified bits to disk
  • 40. Downsides - CMS ● Upper bound for frequencies in each counter ● Biased counting - returns the minimum of the over-estimated frequencies ● Cannot return unique keys