Probabilistic Data Structures
and Approximate Solutions
IPython notebook with code >>

by Oleksandr Pryymak
PyData London 2014
Probabilistic||Approximate: Why?
Often:
● an approximate answer is sufficient
● need to trade accuracy for scalability or speed
● need to analyse a stream of data
Catch:
● despite typically achieving good results, there is a
chance of bad worst-case behaviour.
● use on large datasets (law of large numbers)
Code: Approximation
import random

def average(values):
    # arithmetic mean (the slide assumed `average` was already in scope)
    return sum(values) / float(len(values))

x = [random.randint(0, 80000) for _ in range(10000)]
y = [i >> 8 for i in x]  # trim 8 bits off of integers
z = x[:500]              # 5% sample (x is uniform)

avx = average(x)
avy = average(y) * 2**8  # add 8 bits
avz = average(z)
print(avx)
print(avy, 'error %.06f%%' % (100 * abs(avx - avy) / avx))
print(avz, 'error %.06f%%' % (100 * abs(avx - avz) / avx))
39547.8816
39420.7744 error 0.321401%
39591.424 error 0.110100%
Code: Sampling Data

Interview question:
Get K samples from an infinite stream
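One common answer to this question is reservoir sampling (often called Algorithm R). The sketch below is an illustration, not the notebook's code; the name reservoir_sample and its rng parameter are assumptions made for the example. It keeps exactly K items in memory, and after i items have been seen, each of them is in the reservoir with probability K/i.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item number i+1 replaces a random slot with probability k/(i+1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# Usage: sample 5 values from a stream we only read once.
sample = reservoir_sample(iter(range(100000)), 5)
print(sample)
```

For a truly unbounded stream the loop never ends, but the reservoir is a valid uniform sample of everything seen so far at any point.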
Probabilistic Data Structures
Generally they:
● use less space than a full dataset
● require higher CPU load
● are stream-friendly
● can be parallelized
● have a controlled error rate
Hash functions
One-way function:
key of arbitrary length ->
message of fixed length

message = hash(key)
However, collisions are possible:

hash(key1) = hash(key2)
Code: Hashing
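A minimal sketch of both points above: a hash maps keys of any length to a fixed-size value, and collisions are unavoidable once there are more keys than possible values. It uses 32-bit FNV-1a (a close relative of FNV-1) purely as an illustration. The helper find_collision and the 8-bit truncation are assumptions made to force a collision quickly.

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5
FNV_PRIME_32 = 0x01000193

def fnv1a_32(data):
    """32-bit FNV-1a hash of a bytes object."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # keep the result in 32 bits
    return h

def find_collision(keys, bits):
    """Return the first pair of keys whose hashes agree on the low `bits` bits."""
    seen = {}
    mask = (1 << bits) - 1
    for key in keys:
        bucket = fnv1a_32(key.encode("utf-8")) & mask
        if bucket in seen:
            return seen[bucket], key
        seen[bucket] = key
    return None

# Keys of any length map to a fixed-size value.
print(hex(fnv1a_32(b"a")), hex(fnv1a_32(b"a much longer key than the first one")))

# 300 distinct keys into 2**8 = 256 buckets: a collision is guaranteed.
keys = ["key%d" % i for i in range(300)]
collision = find_collision(keys, bits=8)
print(collision)
```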
Hash collisions and performance
● Cryptographic hashes not ideal for our use (like bcrypt)
● Need a fast algorithm with the lowest number of collisions:

Hash            Lowercase               Random UUID         Numbers
=============   =====================   =================   =====================
Murmur          145 ns       6 collis   259 ns   5 collis    92 ns       0 collis
FNV-1           184 ns       1 collis   730 ns   5 collis    92 ns       0 collis
DJB2            156 ns       7 collis   437 ns   6 collis    93 ns       0 collis
SDBM            148 ns       4 collis   484 ns   6 collis    90 ns       0 collis
SuperFastHash   164 ns      85 collis   344 ns   4 collis   118 ns   18742 collis
CRC32           250 ns       2 collis   946 ns   0 collis   130 ns       0 collis
LoseLose        338 ns  215178 collis   -                   -

Murmur2 collisions
● cataract collides with periti
● roquette collides with skivie
● shawl collides with stormbound
● dowlases collides with tramontane
● cricketings collides with twanger
● longans collides with whigs

by Ian Boyd: http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
Hash randomness visualised (hashmap)
[Figure: Great: murmur2 on a sequence of numbers. Not so great: DJB2 on a sequence of numbers.]
Comparison: Locality Sensitive Hashing (LSH)
Image hashes

Kernelized locality-sensitive hashing for scalable image search
B Kulis, K Grauman - Computer Vision, 2009 IEEE 12th …, 2009 - ieeexplore.ieee.org
Abstract: Fast retrieval methods are critical for large-scale and data-driven vision applications. Recent work has explored ways to embed high-dimensional features or complex distance functions into a low-dimensional Hamming space where items can be ...
Membership test: Bloom filter
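A Bloom filter is probabilistic but only yields false positives, never false negatives.
Hash each item k times to get k indices into a bit field (positions 1..m).
Adding an item sets the bits at its k indices to 1.

To test an item w, look at the bits at its k indices:
● At least one 0 means w definitely isn’t in the set.
● All 1s mean w probably is in the set.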
Use Bloom filter to serve requests
Code: bloom filter
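A minimal Bloom filter sketch, assuming a bit field of m positions and k indices per item as described above. This is an illustration rather than the notebook's code. The class name, the defaults for m and k, and deriving the k indices from two halves of one SHA-256 digest (double hashing) are choices made for the example.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m = m                  # size of the bit field
        self.k = k                  # number of indices per item
        self.bits = bytearray(m)    # one byte per bit, for readability

    def _indices(self, item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step, never zero
        # Double hashing: k indices derived from two base hashes.
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def __contains__(self, item):
        # Any 0 bit: definitely absent. All 1s: probably present.
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter()
for word in ["cataract", "roquette", "shawl"]:
    bf.add(word)

print("shawl" in bf)     # True: added items are always found
print("twanger" in bf)   # usually False, but a false positive is possible
```

A real implementation would pack the bits and size m and k from the expected number of items and the target false positive rate.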
Use Bloom filter to store graphs
Graphs only gain nodes because of Bloom
filter false positives.

Pell et al., PNAS 2012
Counting Distinct Elements
In: infinite stream of data
Question: how many distinct elements are there?
is similar to:
In: coin flips
Question: how many times has the coin been flipped?
Coin flips: intuition
● Long runs of HEADs in random series are rare.
● The longer you look, the more likely you see a long one.
● Long runs are very rare and are correlated with how
many coins you’ve flipped.
Code: Cardinality estimation
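The coin flip intuition can be checked with a small simulation. This is a sketch under assumed parameters (a fair coin, a fixed seed of 2014, prefixes of 2^4 to 2^16 flips), not the notebook's code. It prints the longest run of heads seen in ever longer prefixes of one random series.

```python
import random

def longest_head_run(flips):
    """Length of the longest run of consecutive heads (truthy values)."""
    best = current = 0
    for is_head in flips:
        current = current + 1 if is_head else 0
        best = max(best, current)
    return best

rng = random.Random(2014)  # fixed seed so the run is repeatable
flips = [rng.random() < 0.5 for _ in range(1 << 16)]

# The longest run tends to grow by about one each time the number of flips doubles.
for exponent in (4, 8, 12, 16):
    n = 1 << exponent
    print(n, longest_head_run(flips[:n]))
```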
Cardinality estimation
Basic algorithm:
● n = 0
● For each input item:
  ○ Hash item into bit string
  ○ Count trailing zeroes in bit string
  ○ If this count > n:
    ■ Let n = count
● Estimated cardinality (“count distinct”) = 2^n
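The basic algorithm above can be sketched directly. The 32-bit hash taken from a SHA-256 digest and the function names are assumptions for the example. The estimate is always a power of two and has high variance, which is what HyperLogLog improves on.

```python
import hashlib

def hash32(item):
    """Hash an item into a 32-bit integer (first 4 bytes of SHA-256)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def trailing_zeros(value, width=32):
    """Number of trailing zero bits; `width` if the value is zero."""
    if value == 0:
        return width
    return (value & -value).bit_length() - 1

def estimate_cardinality(stream):
    n = 0
    for item in stream:
        count = trailing_zeros(hash32(item))
        if count > n:
            n = count
    return 2 ** n

items = ["user%d" % i for i in range(5000)]
print(estimate_cardinality(items))
# Repeating items does not change the estimate: only distinct hashes matter.
print(estimate_cardinality(items * 3))
```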
Cardinality estimation: HyperLogLog

Demo by: http://www.aggregateknowledge.com/science/blog/hll.html
Billions of distinct values in 1.5KB of RAM with 2% relative error
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
P. Flajolet, É. Fusy, O. Gandouet, F. Meunier; 2007
Code: HyperLogLog
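A simplified HyperLogLog sketch, assuming a 64-bit hash taken from SHA-256 and 2^p registers. It is an illustration, not the notebook's code and not a production implementation. It includes the small-range (linear counting) correction but omits the large-range one, and the alpha constant used is the one for 128 or more registers.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=12):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers (keep p >= 7 here)
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant for m >= 128

    def _hash64(self, item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        h = self._hash64(item)
        width = 64 - self.p
        index = h >> width                  # first p bits choose the register
        rest = h & ((1 << width) - 1)       # remaining bits
        # Rank: 1-based position of the leftmost 1-bit in `rest`.
        rank = width - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        harmonic = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / harmonic
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small range: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate

hll = HyperLogLog()
for i in range(50000):
    hll.add("user%d" % i)
print(round(hll.count()))
```

With m registers the typical relative error is about 1.04 / sqrt(m), so more registers buy accuracy at the cost of memory.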
Count-min sketch
Frequency histogram
estimation with chance
of over-counting

count(value) = min{w1[h1(value)], ... wd[hd(value)]}
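The formula above can be sketched as d rows of w counters, one hash per row. The class name, the default sizes, and deriving each row's hash by prefixing the row number before hashing with SHA-256 are assumptions for the example. Collisions can only inflate a counter, so taking the minimum across rows never under-counts.

```python
import hashlib
from collections import Counter

class CountMinSketch:
    def __init__(self, width=1000, depth=5):
        self.width = width      # w: counters per row
        self.depth = depth      # d: number of rows / hash functions
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, value):
        # One hash function per row, derived by salting with the row number.
        data = ("%d:%s" % (row, value)).encode("utf-8")
        digest = hashlib.sha256(data).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, value, amount=1):
        for row in range(self.depth):
            self.rows[row][self._index(row, value)] += amount

    def count(self, value):
        # count(value) = min over rows of w_i[h_i(value)]
        return min(self.rows[row][self._index(row, value)]
                   for row in range(self.depth))

stream = ["a"] * 50 + ["b"] * 20 + ["c"] * 5 + ["item%d" % i for i in range(2000)]
cms = CountMinSketch()
for value in stream:
    cms.add(value)

exact = Counter(stream)
for value in ("a", "b", "c"):
    print(value, exact[value], cms.count(value))
```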
Code: Frequent Itemsets
Machine Learning: Feature hashing
High-dimensional
machine learning without
feature dictionary

by Andrew Clegg “Approximate methods for
scalable data mining”
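A minimal sketch of the hashing trick: each token is hashed straight to a column index, so no feature dictionary has to be built or stored. The function name, the 16-column vector, and the use of SHA-256 are assumptions for the example. Practical implementations typically use a faster hash and a much larger vector.

```python
import hashlib

def hash_features(tokens, n_features=16):
    """Map a list of tokens to a fixed-length count vector without a dictionary."""
    vector = [0] * n_features
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_features
        vector[index] += 1   # colliding tokens share a column
    return vector

doc = "the quick brown fox jumps over the lazy dog".split()
vec = hash_features(doc)
print(vec)
```

The vector length stays fixed however many distinct tokens appear; the price is that colliding tokens become indistinguishable.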
Locality-sensitive hashing
To approximate nearest
neighbours

by Andrew Clegg “Approximate methods for
scalable data mining”
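One well-known LSH family for cosine similarity uses random hyperplanes: each bit of the signature records which side of a hyperplane a vector falls on, so vectors at a small angle agree on most bits. The sketch below is an illustration under assumed parameters (16 hyperplanes, 3 dimensions, a fixed seed); the slide does not say which LSH scheme it refers to.

```python
import random

def make_hyperplanes(n_planes, dim, seed=0):
    """Random Gaussian normal vectors, one per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector is on its non-negative side."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)

def hamming(a, b):
    """Number of positions where two signatures differ."""
    return sum(x != y for x, y in zip(a, b))

planes = make_hyperplanes(16, 3)
a = [1.0, 2.0, 3.0]
near = [1.1, 2.0, 2.9]      # small angle to a
far = [-1.0, -2.0, -3.0]    # opposite direction

print(hamming(signature(a, planes), signature(near, planes)))
print(hamming(signature(a, planes), signature(far, planes)))
```

Candidates for nearest neighbours are then the items whose signatures fall in the same bucket or differ in only a few bits.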
Probabilistic Databases
● PrDB (University of Maryland)
● Orion (Purdue University)
● MayBMS (Cornell University)

● BlinkDB v0.1alpha (UC Berkeley and MIT)
BlinkDB: queries
Queries with Bounded Errors
and Bounded Response Times
on Very Large Data
BlinkDB: architecture
References
Mining of Massive Datasets
by Jure Leskovec, Anand Rajaraman, and Jeff Ullman
http://infolab.stanford.edu/~ullman/mmds.html
Summary

● know the data structures
● know what you sacrifice
● control errors

http://nbviewer.ipython.org/gist/235/d3ee622926b5f77f03df
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/ by Ilya Katsov

