Streaming Algorithms
Joe Kelley
Data Engineer
July 2013
CONFIDENTIAL | 2
Accelerating Your Time to Value
Leading Provider of Data Science & Engineering for Big Analytics
• IMAGINE: Strategy and Roadmap
• ILLUMINATE: Training and Education
• IMPLEMENT: Hands-On Data Science and Data Engineering
What is a Streaming Algorithm?
• Operates on a continuous stream of data
• Unknown or infinite size
• Only one pass; options:
  • Store it
  • Lose it
  • Store an approximation
• Limited processing time per item
• Limited total memory
[Diagram labels: Algorithm, Standing Query, Ad-hoc Query, Input, Output, Memory, Disk]
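As a minimal illustration of the constraints above (one pass, bounded work per item, bounded memory), here is a sketch of a running mean. The function name `running_mean` and the sample readings are hypothetical and not from the slides.

```python
from typing import Iterable, Iterator, Tuple


def running_mean(stream: Iterable[float]) -> Iterator[Tuple[int, float]]:
    """Yield (count, mean) after every item, keeping only two numbers in memory."""
    count = 0
    mean = 0.0
    for x in stream:
        count += 1
        # Incremental update: the new mean is the old mean nudged toward x.
        mean += (x - mean) / count
        yield count, mean


# Hypothetical readings; any iterable works, including an unbounded generator.
readings = [4.0, 8.0, 6.0]
results = list(running_mean(readings))
print(results[-1])
```

Each item is seen once and then dropped; the state never grows with the length of the stream.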
Why use a Streaming Algorithm?
• Compare to typical “Big Data” approach: store everything, analyze later, scale linearly
• Streaming Pros:
  • Lower latency
  • Lower storage cost
• Streaming Cons:
  • Less flexibility
  • Lower precision (sometimes)
• Answer?
  • Why not both?
[Diagram: Streaming Algorithm → Result (Initial Answer); Long-term Storage → Batch Algorithm → Result (Authoritative Answer)]
General Techniques
1. Tunable Approximation
2. Sampling
  • Sliding window
  • Fixed number
  • Fixed percentage
3. Hashing: useful randomness
Example 1: Sampling device error rates
• Stream of (device_id, event, timestamp)
• Scenario:
  • Not enough space to store everything
  • Simple queries → storing 1% is good enough
Per-device event streams:
Device-1
  (Device-1, event-1, 10001123)
  (Device-1, event-3, 10001126)
  (Device-1, event-1, 10001129)
  ...
Device-2
  (Device-2, event-2, 10001124)
  (Device-2, ERROR, 10001130)
  (Device-2, event-4, 10001132)
  ...
Device-3
  (Device-3, event-3, 10001122)
  (Device-3, event-1, 10001127)
  (Device-3, ERROR, 10001135)
  ...
Input (all devices interleaved by timestamp):
  (Device-3, event-3, 10001122)
  (Device-1, event-1, 10001123)
  (Device-2, event-2, 10001124)
  (Device-1, event-3, 10001126)
  (Device-3, event-1, 10001127)
  (Device-1, event-1, 10001129)
  (Device-2, ERROR, 10001130)
  (Device-2, event-4, 10001132)
  (Device-3, ERROR, 10001135)
  ...
Example 1: Sampling device error rates
Algorithm:
    for each element e:
        with probability 0.01:
            store e
        else:
            throw out e
Can lead to some insidious statistical “bugs”…
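The pseudocode above might be written in Python roughly as follows. This is a sketch only: the names `sample_events`, `rate` and `rng`, and the generated events, are illustrative assumptions.

```python
import random


def sample_events(stream, rate=0.01, rng=None):
    """Keep each event independently with probability `rate`; drop the rest."""
    rng = rng or random.Random()
    for event in stream:
        if rng.random() < rate:
            yield event  # "store e"
        # otherwise the event is thrown out


# Hypothetical input in the (device_id, event, timestamp) shape used on the slides.
events = [(f"Device-{i % 3 + 1}", "event-1", 10001122 + i) for i in range(100_000)]
kept = list(sample_events(events, rate=0.01, rng=random.Random(7)))
print(len(kept), "of", len(events), "events kept")
```

The decision is made per event, so any single device keeps only a random 1% of its own history.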
Example 1: Sampling device error rates
Query:
How many errors has the average device encountered?
Answer:
SELECT AVG(n) FROM (
    SELECT COUNT(*) AS n FROM events
    WHERE event = 'ERROR'
    GROUP BY device_id
) AS per_device
Simple… but off by up to 100x. Each device had only 1% of its events sampled.
Can we just multiply by 100?
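One way to see why multiplying by 100 does not rescue this query: a device whose errors all fell outside the sample has no ERROR rows, so the GROUP BY never sees it and it drops out of the average. The sketch below simulates that effect. The workload (1000 devices, 200 events each, 2 errors per device) and the helper `avg_errors_per_device` are made up for illustration.

```python
import random
from collections import Counter


def avg_errors_per_device(events):
    """Python mirror of the SQL above: AVG of per-device ERROR counts.

    As with GROUP BY, a device with no ERROR rows contributes nothing at all.
    """
    counts = Counter(device for device, event, _ts in events if event == "ERROR")
    return sum(counts.values()) / len(counts) if counts else 0.0


# Hypothetical workload: every device has exactly 2 errors among 200 events.
events = [
    (f"Device-{d}", "ERROR" if i < 2 else "event-1", d * 200 + i)
    for d in range(1000)
    for i in range(200)
]

rng = random.Random(42)  # fixed seed so the run is repeatable
sample = [e for e in events if rng.random() < 0.01]  # per-event 1% sample

true_avg = avg_errors_per_device(events)
scaled_avg = 100 * avg_errors_per_device(sample)
print(f"true average: {true_avg}, scaled sample average: {scaled_avg}")
```

Every device that survives into the sampled result has at least one sampled error, so the scaled figure cannot fall below 100 once any error row is sampled, while the true average in this setup is 2.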
Example 1: Sampling device error rates
Better Algorithm:
    for each element e:
        if (hash(e.device_id) mod 100) == 0:
            store e
        else:
            throw out e
Choose how to hash carefully... or hash every different way
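A possible Python rendering of the hash-based sampler is sketched below. It uses SHA-256 from `hashlib` rather than Python's built-in `hash()`, because string hashes from the built-in are randomized per process by default and would select different devices on every run. The choice of SHA-256 and the names `keep_device` and `sample_by_device` are assumptions made for this example.

```python
import hashlib


def keep_device(device_id: str, buckets: int = 100) -> bool:
    """True for roughly 1 in `buckets` device ids, always the same ones."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets == 0


def sample_by_device(stream, buckets: int = 100):
    """Keep every event of a selected device, and nothing from the others."""
    for event in stream:
        if keep_device(event[0], buckets):
            yield event


# Hypothetical input: 2000 devices with 10 events each.
events = [
    (f"Device-{d}", "ERROR" if i < 2 else "event-1", d * 10 + i)
    for d in range(2000)
    for i in range(10)
]
sample = list(sample_by_device(events))
print(len({e[0] for e in sample}), "devices kept, each with its full history")
```

Because a sampled device keeps all of its events, per-device queries such as the error count run unchanged on the sample.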
Example 2: Sampling fixed number
Want to sample a fixed count (k), not a fixed percentage.
Algorithm:
    Let arr = array of size k
    for each element e:
        if arr is not yet full:
            add e to arr
        else:
            with probability p:
                replace a random element of arr with e
            else:
                throw out e
Choice of p is crucial (n = number of elements seen so far):
• p = constant → prefer more recent elements. Higher p = more recent
• p = k/n → sample uniformly from entire stream
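The fixed-count sampler can be sketched in Python as below. With p = k/n this is the scheme usually called reservoir sampling. The function name, the optional `p` argument and the `rng` parameter are illustrative choices, not part of the slides.

```python
import random


def reservoir_sample(stream, k, p=None, rng=None):
    """Return a sample of at most k items from a stream of unknown length.

    p=None uses p = k/n (n = items seen so far), giving a uniform sample.
    A constant p instead biases the sample toward recent items.
    """
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if len(reservoir) < k:
            reservoir.append(item)  # array not yet full
        else:
            prob = k / n if p is None else p
            if rng.random() < prob:
                # Replace a uniformly chosen slot with the new item.
                reservoir[rng.randrange(k)] = item
            # otherwise the item is thrown out
    return reservoir


picked = reservoir_sample(range(1_000_000), k=5, rng=random.Random(1))
print(picked)
```

Memory stays at k items however long the stream runs, and no second pass is needed.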
Example 3: Counting unique users
• Input: stream of (user_id, action, timestamp)
• Want to know how many distinct users are seen over a time period
• Naïve approach:
  • Store all user_ids in a list/tree/hashtable
  • Millions of users = a lot of memory
• Better approach:
  • Store all user_ids in a database
  • Good, but maybe it’s not fast enough…
• What if an approximate count is ok?
Example 3: Counting unique users
• Flajolet-Martin Idea:
  • Hash each user_id into a bit string
  • Count the trailing zeros
  • Remember maximum number of trailing zeros seen

user_id      H(user_id)   trailing zeros   max(trailing zeros)
john_doe     0111001001   0                0
jane_doe     1011011100   2                2
alan_t       0010111000   3                3
EWDijkstra   1101011110   1                3
jane_doe     1011011100   2                3
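The walk-through in the table can be reproduced with a few lines of Python. The bit strings are copied from the table; the helper name `trailing_zeros` is chosen here for illustration.

```python
def trailing_zeros(bits: str) -> int:
    """Number of '0' characters at the right-hand end of a bit string."""
    return len(bits) - len(bits.rstrip("0"))


# (user_id, H(user_id)) pairs exactly as listed in the table.
rows = [
    ("john_doe", "0111001001"),
    ("jane_doe", "1011011100"),
    ("alan_t", "0010111000"),
    ("EWDijkstra", "1101011110"),
    ("jane_doe", "1011011100"),
]

max_seen = 0
history = []
for user_id, hashed in rows:
    z = trailing_zeros(hashed)
    max_seen = max(max_seen, z)  # the only state that has to be remembered
    history.append((user_id, z, max_seen))

for row in history:
    print(row)
```

The repeated `jane_doe` hashes to the same bit string, so seeing the same user again cannot raise the maximum.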
Example 3: Counting unique users
• Intuition:
  • If we had seen 2 distinct users, we would expect 1 trailing zero
  • If we had seen 4, we would expect 2 trailing zeros
  • If we had seen $2^n$, we would expect $n$
  • In general, if there has been a maximum of $n$ trailing zeros, $2^n$ is a reasonable estimation of distinct users
• Want more precision? Use more independent hash functions, and combine the results
  • Median = only get powers of two
  • Mean = subject to skew
  • Median of means of groups works well in practice
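A rough justification for the powers-of-two intuition, under the idealized assumption (not stated on the slide) that every hash bit is uniform and independent. The symbols $m$ (number of distinct users) and $n$ (a number of trailing zeros) are introduced here for the argument.

```latex
\Pr[\text{one hash ends in at least } n \text{ zeros}] = 2^{-n}
\qquad\Longrightarrow\qquad
\Pr[\text{none of } m \text{ distinct hashes does}]
  = \left(1 - 2^{-n}\right)^{m} \approx e^{-m/2^{n}}
```

When $m$ is much larger than $2^n$ the right-hand side is close to 0, so a hash with $n$ trailing zeros has almost certainly appeared; when $m$ is much smaller it is close to 1. The maximum number of trailing zeros therefore sits near $\log_2 m$, which is why 2 raised to that maximum is a sensible estimate of $m$.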
Example 3: Counting unique users
Flajolet-Martin, all together:
    arr = int[k]
    for each item e:
        for i in 0...k-1:
            z = trailing_zeros(hash_i(e))
            if z > arr[i]:
                arr[i] = z
    means = group_means(arr)
    median = median(means)
    return pow(2, median)
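A runnable sketch of the pseudocode above follows. The slides do not say how the k hash functions are built, so `hash_i` here salts BLAKE2b with the index i, which is an assumption. The defaults `k=64` and `group_size=8` are arbitrary illustrative values, and the class name is hypothetical.

```python
import hashlib
import statistics

HASH_BITS = 64


def hash_i(i: int, item: str) -> int:
    """i-th hash function: BLAKE2b salted with i (an assumed construction)."""
    h = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                        salt=i.to_bytes(8, "big"))
    return int.from_bytes(h.digest(), "big")


def trailing_zeros(x: int) -> int:
    """Count the zero bits at the low end of x (HASH_BITS if x is 0)."""
    if x == 0:
        return HASH_BITS
    return (x & -x).bit_length() - 1


class FlajoletMartin:
    def __init__(self, k: int = 64, group_size: int = 8):
        if k % group_size:
            raise ValueError("k must be a multiple of group_size")
        self.k = k
        self.group_size = group_size
        self.max_zeros = [0] * k  # one running maximum per hash function

    def add(self, item: str) -> None:
        for i in range(self.k):
            z = trailing_zeros(hash_i(i, item))
            if z > self.max_zeros[i]:
                self.max_zeros[i] = z

    def estimate(self) -> float:
        # Mean within each group, then the median across groups.
        groups = [self.max_zeros[j:j + self.group_size]
                  for j in range(0, self.k, self.group_size)]
        means = [statistics.mean(g) for g in groups]
        return 2 ** statistics.median(means)


fm = FlajoletMartin()
for n in range(2000):
    fm.add(f"user-{n}")  # 2000 hypothetical distinct users
estimate = fm.estimate()
print(f"estimated distinct users: {estimate:.0f}")
```

The state is k small integers regardless of how many users pass through, and adding a user a second time leaves it unchanged.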
Example 3: Counting unique users
Flajolet-Martin in practice
• Devil is in the details
• Tunable precision
  • more hash functions = more precise
  • See the paper for bounds on precision
• Tunable latency
  • more hash functions = higher latency
  • faster hash functions = lower latency
  • faster hash functions = more possibility of correlation = less precision
Remember: streaming algorithm for quick, imprecise answer. Back-end batch algorithm for slower, exact answer
Example 4: Counting Individual Item Frequencies
Want to keep track of how many times each item has
appeared in the stream
Many applications:
• How popular is each search term?
• How many times has this hashtag been tweeted?
• Which IP addresses are DDoS’ing me?
Again, two obvious approaches:
• In-memory hashmap of item → count
• Database
But can we be more clever?
Example 4: Counting Individual Item Frequencies
Idea:
• Maintain array of counts
• Hash each item, increment array at that index
• To check the count of an item, hash again and check array at that index
• Over-estimates because of hash “collisions”
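The single-array idea can be sketched as follows. The class name `HashedCounter`, the use of SHA-256 and the tiny `width=8` are assumptions made to keep the example small.

```python
import hashlib


class HashedCounter:
    """One array of counts, indexed by a hash of the item."""

    def __init__(self, width: int):
        self.width = width
        self.counts = [0] * width

    def _index(self, item: str) -> int:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str) -> None:
        self.counts[self._index(item)] += 1

    def estimate(self, item: str) -> int:
        # Never below the true count; higher when other items share the slot.
        return self.counts[self._index(item)]


counter = HashedCounter(width=8)
stream = ["cats", "dogs", "cats", "birds", "cats", "dogs"]  # hypothetical search terms
for term in stream:
    counter.add(term)
print({term: counter.estimate(term) for term in set(stream)})
```

Shrinking the array makes collisions more likely, so the estimates drift further above the true counts.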
Example 4: Counting Individual Item Frequencies
Count-Min Sketch algorithm:
• Maintain a 2-d array of d rows by w columns
• Choose d different hash functions; each row in the array corresponds to one hash function
• Hash each item with every hash function, increment the appropriate position in each row
• To query an item, hash it d times again, take the minimum value from all rows
Example 4: Counting Individual Item Frequencies
Count-Min Sketch, all together:
    arr = int[d][w]
    for each item e:
        for i in 0...d-1:
            j = hash_i(e) mod w
            arr[i][j]++
    def frequency(q):
        min = +infinity
        for i in 0...d-1:
            j = hash_i(q) mod w
            if arr[i][j] < min:
                min = arr[i][j]
        return min
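A compact Python sketch of the Count-Min Sketch pseudocode above. The per-row hash functions are built by salting BLAKE2b with the row number, which is an assumption (the slides leave the hash family open), and the sizes `width=50`, `depth=4` are arbitrary.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width: int = 1000, depth: int = 5):
        self.width = width  # w: counters per row
        self.depth = depth  # d: number of rows / hash functions
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, item: str) -> int:
        """Column for `item` in `row`, using BLAKE2b salted with the row number."""
        h = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                            salt=row.to_bytes(8, "big"))
        return int.from_bytes(h.digest(), "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def frequency(self, item: str) -> int:
        # Each row can only over-count, so the smallest row value is the best bound.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))


cms = CountMinSketch(width=50, depth=4)
hashtags = ["#streaming"] * 30 + ["#bigdata"] * 12 + ["#storm"] * 3  # hypothetical stream
for tag in hashtags:
    cms.add(tag)
print({tag: cms.frequency(tag) for tag in ("#streaming", "#bigdata", "#storm")})
```

An item is over-counted only if it collides with other items in every one of the d rows, which is why taking the minimum tightens the single-array estimate.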
Example 4: Counting Individual Item Frequencies
Count-Min Sketch in practice
• Devil is in the details
• Tunable precision
  • Bigger array = more precise
  • See the paper for bounds on precision
• Tunable latency
  • more hash functions = higher latency
• Better at estimating more frequent items
• Can subtract out estimation of collisions
Remember: streaming algorithm for quick, imprecise answer. Back-end batch algorithm for slower, exact answer
Questions?
• Feel free to reach out
• www.thinkbiganalytics.com
• joe.kelley@thinkbiganalytics.com
• www.slideshare.net/jfkelley1
• References:
• http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
• http://infolab.stanford.edu/~ullman/mmds.html
We’re hiring! Engineers and Data Scientists

TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Streaming Algorithms

  • 2. CONFIDENTIAL | 2 Accelerating Your Time to Value Strategy and Roadmap IMAGINE Training and Education ILLUMINATE Hands-On Data Science and Data Engineering IMPLEMENT Leading Provider of Data Science & Engineering for Big Analytics
  • 3. What is a Streaming Algorithm?
      • Operates on a continuous stream of data
      • Unknown or infinite size
      • Only one pass; options:
          • Store it
          • Lose it
          • Store an approximation
      • Limited processing time per item
      • Limited total memory
      Diagram labels: Algorithm, Standing Query, Ad-hoc Query, Input, Output, Memory, Disk
  • 4. Why use a Streaming Algorithm?
      • Compare to typical “Big Data” approach: store everything, analyze later, scale linearly
      • Streaming Pros:
          • Lower latency
          • Lower storage cost
      • Streaming Cons:
          • Less flexibility
          • Lower precision (sometimes)
      • Answer?
          • Why not both?
      Diagram labels: Streaming Algorithm, Result (Initial Answer); Long-term Storage, Batch Algorithm, Result (Authoritative Answer)
  • 5. General Techniques
      1. Tunable Approximation
      2. Sampling
          • Sliding window
          • Fixed number
          • Fixed percentage
      3. Hashing: useful randomness
  • 6. Example 1: Sampling device error rates
      • Stream of (device_id, event, timestamp)
      • Scenario:
          • Not enough space to store everything
          • Simple queries → storing 1% is good enough
      Per-device streams:
          Device-1: (Device-1, event-1, 10001123), (Device-1, event-3, 10001126), (Device-1, event-1, 10001129), ...
          Device-2: (Device-2, event-2, 10001124), (Device-2, ERROR, 10001130), (Device-2, event-4, 10001132), ...
          Device-3: (Device-3, event-3, 10001122), (Device-3, event-1, 10001127), (Device-3, ERROR, 10001135), ...
      Input (the per-device streams interleaved by timestamp):
          (Device-3, event-3, 10001122)
          (Device-1, event-1, 10001123)
          (Device-2, event-2, 10001124)
          (Device-1, event-3, 10001126)
          (Device-3, event-1, 10001127)
          (Device-1, event-1, 10001129)
          (Device-2, ERROR, 10001130)
          (Device-2, event-4, 10001132)
          (Device-3, ERROR, 10001135)
          ...
  • 7. Example 1: Sampling device error rates
      Algorithm:
          for each element e:
              with probability 0.01:
                  store e
              else:
                  throw out e
      Can lead to some insidious statistical “bugs”…
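The per-element sampling rule above can be sketched in Python. This is a minimal illustration, not the presenter's code: the function name `bernoulli_sample` and the `rng` parameter are assumptions introduced here so the sample can be made reproducible.

```python
import random


def bernoulli_sample(stream, p=0.01, rng=None):
    """Keep each element independently with probability p (hypothetical helper)."""
    if rng is None:
        rng = random.Random()
    kept = []
    for e in stream:
        # rng.random() is uniform on [0, 1), so this branch fires with probability p.
        if rng.random() < p:
            kept.append(e)  # store e
        # otherwise e is thrown out
    return kept
```

The sample size is itself random: on a stream of 100,000 events it will be close to 1,000 but rarely exactly 1,000.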
  • 8. Example 1: Sampling device error rates
      Query: How many errors has the average device encountered?
      Answer:
          SELECT AVG(n) FROM (
              SELECT COUNT(*) AS n
              FROM events
              WHERE event = 'ERROR'
              GROUP BY device_id
          ) AS per_device
      Simple… but off by up to 100x. Each device had only 1% of its events sampled. Can we just multiply by 100?
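A small experiment suggests why multiplying by 100 is not a safe fix. The data below is entirely hypothetical (10,000 devices with exactly one error each), and `avg_errors_per_device` is an illustrative stand-in for the SQL query, including its behaviour of skipping devices that have no ERROR rows.

```python
import random
from collections import Counter


def avg_errors_per_device(events):
    """Average of per-device ERROR counts, like AVG over GROUP BY device_id.

    Devices with no ERROR rows never appear in the grouping, so they are
    not part of the average.
    """
    counts = Counter(device for device, event in events if event == "ERROR")
    return sum(counts.values()) / len(counts) if counts else 0.0


rng = random.Random(0)
# Hypothetical stream: every device reports exactly one error.
events = [(f"device-{i}", "ERROR") for i in range(10_000)]
# Per-element 1% sample, as in the simple algorithm.
sample = [e for e in events if rng.random() < 0.01]

true_avg = avg_errors_per_device(events)      # 1.0 on the full stream
sampled_avg = avg_errors_per_device(sample)   # still 1.0: each surviving device has one error
scaled = sampled_avg * 100                    # 100.0, which is 100x too high
```

Here the unscaled answer happens to be right and the scaled one is wrong by 100x. For devices with thousands of errors each, the opposite holds, so the correction depends on the data and cannot be a fixed factor.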
  • 9. Example 1: Sampling device error rates
      Better Algorithm:
          for each element e:
              if (hash(e.device_id) mod 100) == 0:
                  store e
              else:
                  throw out e
      Choose how to hash carefully... or hash every different way
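The hash-based rule can be sketched as follows. The choice of SHA-256 is an assumption made here so that the decision is stable across processes (Python's built-in `hash` for strings is randomized per process); the helper names `keep_device` and `sample_by_device` are illustrative.

```python
import hashlib


def keep_device(device_id, buckets=100):
    """Return True for roughly 1 in `buckets` device ids, always the same ones."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets == 0


def sample_by_device(stream):
    """Keep every event of a sampled device and no event of any other device.

    Each element is assumed to be a (device_id, event, timestamp) tuple.
    """
    return [e for e in stream if keep_device(e[0])]
```

Because the decision depends only on `device_id`, a sampled device keeps its complete event history, so per-device counts in the sample need no rescaling.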
  • 10. Example 2: Sampling fixed number
      Want to sample a fixed count (k), not a fixed percentage.
      Algorithm:
          let arr = array of size k
          for each element e:
              if arr is not yet full:
                  add e to arr
              else:
                  with probability p:
                      replace a random element of arr with e
                  else:
                      throw out e
      Choice of p is crucial:
          • p = constant → prefer more recent elements. Higher p = more recent
          • p = k/n (n = number of elements seen so far) → sample uniformly from entire stream
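A minimal Python sketch of the fixed-count sampler with p = k/n, commonly known as reservoir sampling. The function name and the optional `rng` parameter are assumptions added for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k elements from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    arr = []
    # n counts how many elements have been seen so far, starting at 1.
    for n, e in enumerate(stream, start=1):
        if len(arr) < k:
            arr.append(e)                # arr is not yet full
        elif rng.random() < k / n:       # with probability p = k/n
            arr[rng.randrange(k)] = e    # replace a random element of arr
        # otherwise e is thrown out
    return arr
```

Replacing `k / n` with a constant turns this into the recency-biased variant: older elements are overwritten at a steady rate, so the sample drifts toward recent data.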
  • 11. Example 2: Sampling fixed number
  • 12. Example 3: Counting unique users
      • Input: stream of (user_id, action, timestamp)
      • Want to know how many distinct users are seen over a time period
      • Naïve approach:
          • Store all user_ids in a list/tree/hashtable
          • Millions of users = a lot of memory
      • Better approach:
          • Store all user_ids in a database
          • Good, but maybe it’s not fast enough…
      • What if an approximate count is ok?
  • 13. Example 3: Counting unique users
      • Approximate count is ok
      • Flajolet-Martin Idea:
          • Hash each user_id into a bit string
          • Count the trailing zeros
          • Remember maximum number of trailing zeros seen
      user_id      H(user_id)   trailing zeros   max(trailing zeros)
      john_doe     0111001001   0                0
      jane_doe     1011011100   2                2
      alan_t       0010111000   3                3
      EWDijkstra   1101011110   1                3
      jane_doe     1011011100   2                3
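The table can be reproduced with a few lines of Python. The bit strings are copied from the slide rather than computed by a real hash function, and the helper `trailing_zeros` is an illustrative name.

```python
def trailing_zeros(bits):
    """Count the '0' characters at the end of a bit string."""
    return len(bits) - len(bits.rstrip("0"))


# (user_id, H(user_id)) pairs taken from the slide's table.
rows = [
    ("john_doe", "0111001001"),
    ("jane_doe", "1011011100"),
    ("alan_t", "0010111000"),
    ("EWDijkstra", "1101011110"),
    ("jane_doe", "1011011100"),
]

running_max = 0
maxima = []
for user_id, bits in rows:
    running_max = max(running_max, trailing_zeros(bits))
    maxima.append(running_max)
```

The repeated `jane_doe` row hashes to the same bit string, so it cannot raise the maximum. That is what makes the statistic insensitive to duplicates.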
  • 14. Example 3: Counting unique users
      • Intuition:
          • If we had seen 2 distinct users, we would expect 1 trailing zero
          • If we had seen 4, we would expect 2 trailing zeros
          • If we had seen 2^r, we would expect r
          • In general, if there has been a maximum of r trailing zeros, 2^r is a reasonable estimate of the number of distinct users
      • Want more precision? Use more independent hash functions, and combine the results
          • Median = only get powers of two
          • Mean = subject to skew
          • Median of means of groups works well in practice
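The intuition can be made slightly more precise under the assumption that the hash output behaves like uniformly random bits. The symbols r (number of trailing zeros) and m (number of distinct users) are introduced here for illustration.

```latex
\Pr[\text{a given hash ends in at least } r \text{ zeros}] = 2^{-r},
\qquad
\Pr[\text{max trailing zeros} \ge r \text{ after } m \text{ distinct users}]
  = 1 - \left(1 - 2^{-r}\right)^{m} \approx 1 - e^{-m/2^{r}}
```

When m is much larger than 2^r this probability is close to 1, and when m is much smaller than 2^r it is close to 0, which is why the largest r observed tends to sit near log2(m).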
  • 15. Example 3: Counting unique users
      Flajolet-Martin, all together:
          arr = int[k]
          for each item e:
              for i in 0...k-1:
                  z = trailing_zeros(hash_i(e))
                  if z > arr[i]:
                      arr[i] = z
          means = group_means(arr)
          median = median(means)
          return pow(2, median)
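A runnable sketch of the pseudocode above. Several details are assumptions made for illustration and are not from the slides: the k hash functions are simulated by prefixing the item with the index i before applying SHA-256, hashes are truncated to 64 bits, and the defaults k = 32 and group size 8 are arbitrary. As on the slide, the group means are taken over the trailing-zero counts themselves.

```python
import hashlib
from statistics import mean, median


def hash_i(i, item):
    """i-th hash function, simulated by salting SHA-256 with the index i."""
    digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # 64-bit integer


def trailing_zeros(x, width=64):
    """Number of trailing zero bits in x; an all-zero value counts as `width`."""
    return width if x == 0 else (x & -x).bit_length() - 1


def fm_estimate(stream, k=32, group_size=8):
    """Estimate the number of distinct items in the stream."""
    arr = [0] * k
    for e in stream:
        for i in range(k):
            z = trailing_zeros(hash_i(i, e))
            if z > arr[i]:
                arr[i] = z
    # Mean within each group of hash functions, then median across groups.
    means = [mean(arr[g:g + group_size]) for g in range(0, k, group_size)]
    return 2 ** median(means)
```

Memory use is k small integers regardless of how many users appear, and feeding the same users again leaves the estimate unchanged.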
  • 16. Example 3: Counting unique users
      Flajolet-Martin in practice
      • Devil is in the details
      • Tunable precision
          • more hash functions = more precise
          • See the paper for bounds on precision
      • Tunable latency
          • more hash functions = higher latency
          • faster hash functions = lower latency
          • faster hash functions = more possibility of correlation = less precision
      Remember: streaming algorithm for quick, imprecise answer. Back-end batch algorithm for slower, exact answer
  • 17. Example 4: Counting Individual Item Frequencies
      Want to keep track of how many times each item has appeared in the stream
      Many applications:
          • How popular is each search term?
          • How many times has this hashtag been tweeted?
          • Which IP addresses are DDoS’ing me?
      Again, two obvious approaches:
          • In-memory hashmap of item → count
          • Database
      But can we be more clever?
  • 18. Example 4: Counting Individual Item Frequencies
      Idea:
          • Maintain array of counts
          • Hash each item, increment array at that index
          • To check the count of an item, hash again and check array at that index
          • Over-estimates because of hash “collisions”
  • 19. Example 4: Counting Individual Item Frequencies
      Count-Min Sketch algorithm:
          • Maintain 2-d array of size w x d
          • Choose d different hash functions; each row in array corresponds to one hash function
          • Hash each item with every hash function, increment the appropriate position in each row
          • To query an item, hash it d times again, take the minimum value from all rows
  • 20. Example 4: Counting Individual Item Frequencies
      Count-Min Sketch, all together:
          arr = int[d][w]
          for each item e:
              for i in 0...d-1:
                  j = hash_i(e) mod w
                  arr[i][j]++

          def frequency(q):
              min = +infinity
              for i in 0...d-1:
                  j = hash_i(q) mod w
                  if arr[i][j] < min:
                      min = arr[i][j]
              return min
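A compact Python sketch of the Count-Min Sketch pseudocode. The class name, the default sizes, and the use of index-salted SHA-256 to stand in for d different hash functions are assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate per-item counts in a fixed d x w table of integers."""

    def __init__(self, w=1024, d=4):
        self.w = w
        self.d = d
        self.arr = [[0] * w for _ in range(d)]

    def _index(self, i, item):
        """Column for `item` in row i, using the i-th (salted) hash function."""
        digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def add(self, item, count=1):
        """Record `count` occurrences of `item` in every row."""
        for i in range(self.d):
            self.arr[i][self._index(i, item)] += count

    def frequency(self, item):
        """Minimum over rows: never below the true count, possibly above it."""
        return min(self.arr[i][self._index(i, item)] for i in range(self.d))
```

A short usage example:

```python
cms = CountMinSketch(w=64, d=4)
for tag in ["#a", "#b", "#a", "#a"]:
    cms.add(tag)
estimate = cms.frequency("#a")  # at least 3, and at most 4 (the total number of adds)
```

Taking the minimum works because collisions can only add to a cell, so the row with the fewest collisions gives the tightest upper bound.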
  • 21. Example 4: Counting Individual Item Frequencies
      Count-Min Sketch in practice
      • Devil is in the details
      • Tunable precision
          • Bigger array = more precise
          • See the paper for bounds on precision
      • Tunable latency
          • more hash functions = higher latency
      • Better at estimating more frequent items
      • Can subtract out estimation of collisions
      Remember: streaming algorithm for quick, imprecise answer. Back-end batch algorithm for slower, exact answer
  • 22. Questions?
      • Feel free to reach out
          • www.thinkbiganalytics.com
          • joe.kelley@thinkbiganalytics.com
          • www.slideshare.net/jfkelley1
      • References:
          • http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
          • http://infolab.stanford.edu/~ullman/mmds.html
      We’re hiring! Engineers and Data Scientists