SlideShare una empresa de Scribd logo
1 de 31
Descargar para leer sin conexión
Scalable real-time 
processing techniques 
How to almost count 
Lars Albertsson, Schibsted
“We promised to count live... 
...but since you can’t do that, we used historical 
numbers and this cool math to extrapolate.” 
?!?
Stream counting is simple 
You already have the building blocks 
Yet many wait for batch execution 
Or go through estimation hoops
Accurate counting 
Server Bus 
Bucketiser 
Bucketiser 
Bucketiser 
Aggregator 
Server 
Server 
Server 
● Straightforward, with some plumbing. 
● Heavier than you need.
Now or later? Exact or rough? 
Approximation now >> accurate later
Basic scenarios 
● How many distinct items in last x minutes? 
● What are the top k items in last x minutes? 
● How many Ys in last x minutes? 
These base techniques are sufficient for 
implementing e.g. personalisation and 
recommendation algorithms.
Cardinality - distinct stream count 
● Naive: Set of hashes. X bits per item.
Cardinality - distinct stream count 
● Naive: Set of hashes. X bits per item. 
● Naive 2: Set approximation with Bloom filter 
+ counter.
Counting in context 
● Look backward, different time windows, 
compare. 
● Count for a small time quantum, keep 
history. 
● Aggregate old windows. 
● Monoid representations are desirable.
Cardinality - distinct stream count 
● Naive: Set of hashes. X bits per item. 
● Naive 2: Set approximation with Bloom filter 
+ counter. 
● Naive 3: Hash to bitmap. Count bits.
Cardinality - distinct stream count 
● Naive: Set of hashes. X bits per item. 
● Naive 2: Set approximation with Bloom filter 
+ counter. 
● Naive 3: Hash to bitmap. Count bits. 
● Attempt 4: Hash, bitmap, count + collision 
compensation. Linear Probabilistic Counter.
Cardinality - distinct stream count 
● Naive: Set of hashes. X bits per item. 
● Naive 2: Set approximation with Bloom filter 
+ counter. 
● Naive 3: Hash to bitmap. Count bits. 
● Attempt 4: Hash, bitmap, count + collision 
compensation. Linear Probabilistic Counter. 
● Read papers… -> HyperLogLog counter
Cardinality - distinct stream count 
Source: Shakespeare, highscalability.com
Top K counting 
U2 65 
Gaga 46 
Avicii 23 
Eminem 21 
Dolly 18 
U2 65 
Gaga 46 
Avicii 23 
Eminem 21 
Peps 19 
U2 65 
Gaga 46 
Avicii 23 
Eminem 21 
Dolly 20 
● Keep k items, assume absentees have 
lowest value 
● Accurate at top, overcounting in bottom
Approx counting - Count-Min Sketch 
● Compute n hashes for key. 
● Increment once on each row, col by mod 
(hash) 
● Retrieve by min() over rows 
3 7 20 3 11 6 3+1 4 1 1 
3 8 6 2+1 17 13 1 0 4 5 
12 7 6 14 2 0 2 3 6+1 7 
3 2 12 8+1 10 2 7 2 11 2
Top K with Count-Min Sketch 
U2 65 
Gaga 46 
Avicii 23 
Eminem 21 
Dolly 18 
U2 65 
Gaga 46 
Avicii 23 
Eminem 21 
Peps 2 
U2 65 
Gaga 46 
Avicii 23 
Eminem 21 
Dolly 19 
● Keep Heavy Hitters list. 
● Lookup absentees in CMS. 
● Risk of overcount is smaller and spread out.
Cubic CMS 
● Decorate song with geo, age, etc. Pour into 
CMS. 
● Keep heavy hitters per geo, age group. 
*:*:<U2> 
SE:*:<U2> 
*:31-40:<U2> 
SE:31-40:<U2> 
+1 
+1 
+1 
+1
Machinery 
O(104) messages / s per machine. 
You probably only need one. If not, use Storm. 
Read and write to pub/sub channel, e.g. Kafka 
or ZeroMQ.
Brute force alternative 
Dump every single message into 
ElasticSearch. 
Suitable for high dimensionality cubes.
Recommendations, you said? 
● Collaborative filtering - similarity matrix 
Users 
2 4 1 1 5 2 
0 1 7 1 0 6 
5 2 9 0 3 0 
3 8 0 6 0 7 
Items
Shave the matrix 
Users 
Items 
0,0 3 
0,1 5 
0,2 0 
0,3 2 
1,0 8 
... ... 
2,1 9 
1,0 8 
2,2 7 
5,0 7 
5,2 6 
... ... 
Flip Sort 
2,1 9 
1,0 8 
2,2 7 
5,0 7 
5,2 6 
Cut 
0 0 0 0 0 0 
0 0 7 0 0 6 
0 0 9 0 0 0 
0 8 0 0 0 7 
Noise removed - fine for 
recommendations 
2 4 1 1 5 2 
0 1 7 1 0 6 
5 2 9 0 3 0 
3 8 0 6 0 7
Hungry for more? 
Mikio Braun: http://www.berlinbuzzwords.de/session/real-time-personalization-and- 
recommendation-stream-mining 
Ted Dunning on deep learning for real-time anomaly detection: http://www. 
berlinbuzzwords.de/session/deep-learning-high-performance-time-series-databases 
Ted Dunning on Storm: http://www.youtube.com/watch?v=7PcmbI5aC20 
Open source: stream-lib, Algebird
Want to work in this area? 
lalle@schibsted.com
Scalable real-time processing techniques
Scalable real-time processing techniques
Scalable real-time processing techniques
Scalable real-time processing techniques
Scalable real-time processing techniques
Scalable real-time processing techniques
Scalable real-time processing techniques
Scalable real-time processing techniques

Más contenido relacionado

La actualidad más candente

(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
 
ClickHouse Paris Meetup. Pragma Analytics Software Suite w/ClickHouse, by Mat...
ClickHouse Paris Meetup. Pragma Analytics Software Suite w/ClickHouse, by Mat...ClickHouse Paris Meetup. Pragma Analytics Software Suite w/ClickHouse, by Mat...
ClickHouse Paris Meetup. Pragma Analytics Software Suite w/ClickHouse, by Mat...Altinity Ltd
 
Introduction to InfluxDB
Introduction to InfluxDBIntroduction to InfluxDB
Introduction to InfluxDBJorn Jambers
 
Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...
Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...
Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...InfluxData
 
Time series database, InfluxDB & PHP
Time series database, InfluxDB & PHPTime series database, InfluxDB & PHP
Time series database, InfluxDB & PHPCorley S.r.l.
 
Apache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseApache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseFlorian Lautenschlager
 
Time Series Data in a Time Series World
Time Series Data in a Time Series WorldTime Series Data in a Time Series World
Time Series Data in a Time Series WorldMapR Technologies
 
Devoxx france 2015 influxdb
Devoxx france 2015 influxdbDevoxx france 2015 influxdb
Devoxx france 2015 influxdbNicolas Muller
 
Statsd introduction
Statsd introductionStatsd introduction
Statsd introductionRick Chang
 
Developing Ansible Dynamic Inventory Script - Nov 2017
Developing Ansible Dynamic Inventory Script - Nov 2017Developing Ansible Dynamic Inventory Script - Nov 2017
Developing Ansible Dynamic Inventory Script - Nov 2017Ahmed AbouZaid
 
ClickHouse Paris Meetup. ClickHouse Analytical DBMS, Introduction. By Alexand...
ClickHouse Paris Meetup. ClickHouse Analytical DBMS, Introduction. By Alexand...ClickHouse Paris Meetup. ClickHouse Analytical DBMS, Introduction. By Alexand...
ClickHouse Paris Meetup. ClickHouse Analytical DBMS, Introduction. By Alexand...Altinity Ltd
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkDatabricks
 
Paul Dix (Founder InfluxDB) - Organising Metrics at #DOXLON
Paul Dix (Founder InfluxDB) - Organising Metrics at #DOXLONPaul Dix (Founder InfluxDB) - Organising Metrics at #DOXLON
Paul Dix (Founder InfluxDB) - Organising Metrics at #DOXLONOutlyer
 
Monitoring in a scalable world
Monitoring in a scalable worldMonitoring in a scalable world
Monitoring in a scalable worldTechExeter
 
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...InfluxData
 
RBea: Scalable Real-Time Analytics at King
RBea: Scalable Real-Time Analytics at KingRBea: Scalable Real-Time Analytics at King
RBea: Scalable Real-Time Analytics at KingGyula Fóra
 
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016DataStax
 

La actualidad más candente (20)

(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
Tick
TickTick
Tick
 
ClickHouse Paris Meetup. Pragma Analytics Software Suite w/ClickHouse, by Mat...
ClickHouse Paris Meetup. Pragma Analytics Software Suite w/ClickHouse, by Mat...ClickHouse Paris Meetup. Pragma Analytics Software Suite w/ClickHouse, by Mat...
ClickHouse Paris Meetup. Pragma Analytics Software Suite w/ClickHouse, by Mat...
 
Introduction to InfluxDB
Introduction to InfluxDBIntroduction to InfluxDB
Introduction to InfluxDB
 
Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...
Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...
Meet the Experts: Visualize Your Time-Stamped Data Using the React-Based Gira...
 
Time series database, InfluxDB & PHP
Time series database, InfluxDB & PHPTime series database, InfluxDB & PHP
Time series database, InfluxDB & PHP
 
Apache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series databaseApache Solr as a compressed, scalable, and high performance time series database
Apache Solr as a compressed, scalable, and high performance time series database
 
Time Series Data in a Time Series World
Time Series Data in a Time Series WorldTime Series Data in a Time Series World
Time Series Data in a Time Series World
 
Devoxx france 2015 influxdb
Devoxx france 2015 influxdbDevoxx france 2015 influxdb
Devoxx france 2015 influxdb
 
Statsd introduction
Statsd introductionStatsd introduction
Statsd introduction
 
Developing Ansible Dynamic Inventory Script - Nov 2017
Developing Ansible Dynamic Inventory Script - Nov 2017Developing Ansible Dynamic Inventory Script - Nov 2017
Developing Ansible Dynamic Inventory Script - Nov 2017
 
ClickHouse Paris Meetup. ClickHouse Analytical DBMS, Introduction. By Alexand...
ClickHouse Paris Meetup. ClickHouse Analytical DBMS, Introduction. By Alexand...ClickHouse Paris Meetup. ClickHouse Analytical DBMS, Introduction. By Alexand...
ClickHouse Paris Meetup. ClickHouse Analytical DBMS, Introduction. By Alexand...
 
InfluxDb
InfluxDbInfluxDb
InfluxDb
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
 
Paul Dix (Founder InfluxDB) - Organising Metrics at #DOXLON
Paul Dix (Founder InfluxDB) - Organising Metrics at #DOXLONPaul Dix (Founder InfluxDB) - Organising Metrics at #DOXLON
Paul Dix (Founder InfluxDB) - Organising Metrics at #DOXLON
 
Monitoring in a scalable world
Monitoring in a scalable worldMonitoring in a scalable world
Monitoring in a scalable world
 
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
 
RBea: Scalable Real-Time Analytics at King
RBea: Scalable Real-Time Analytics at KingRBea: Scalable Real-Time Analytics at King
RBea: Scalable Real-Time Analytics at King
 
Graphite
GraphiteGraphite
Graphite
 
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016
Lambda Architecture with Cassandra (Vaibhav Puranik, GumGum) | C* Summit 2016
 

Destacado

Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven productsLars Albertsson
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelinesLars Albertsson
 
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...confluent
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereGwen (Chen) Shapira
 
Effective testing for spark programs Strata NY 2015
Effective testing for spark programs   Strata NY 2015Effective testing for spark programs   Strata NY 2015
Effective testing for spark programs Strata NY 2015Holden Karau
 
(CMP310) Data Processing Pipelines Using Containers & Spot Instances
(CMP310) Data Processing Pipelines Using Containers & Spot Instances(CMP310) Data Processing Pipelines Using Containers & Spot Instances
(CMP310) Data Processing Pipelines Using Containers & Spot InstancesAmazon Web Services
 
Interactive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyInteractive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyChris Johnson
 
10 more lessons learned from building Machine Learning systems
10 more lessons learned from building Machine Learning systems10 more lessons learned from building Machine Learning systems
10 more lessons learned from building Machine Learning systemsXavier Amatriain
 

Destacado (9)

Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelines
 
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...
When it Absolutely, Positively, Has to be There: Reliability Guarantees in Ka...
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be there
 
Effective testing for spark programs Strata NY 2015
Effective testing for spark programs   Strata NY 2015Effective testing for spark programs   Strata NY 2015
Effective testing for spark programs Strata NY 2015
 
(CMP310) Data Processing Pipelines Using Containers & Spot Instances
(CMP310) Data Processing Pipelines Using Containers & Spot Instances(CMP310) Data Processing Pipelines Using Containers & Spot Instances
(CMP310) Data Processing Pipelines Using Containers & Spot Instances
 
Interactive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and SpotifyInteractive Recommender Systems with Netflix and Spotify
Interactive Recommender Systems with Netflix and Spotify
 
10 more lessons learned from building Machine Learning systems
10 more lessons learned from building Machine Learning systems10 more lessons learned from building Machine Learning systems
10 more lessons learned from building Machine Learning systems
 

Similar a Scalable real-time processing techniques

FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...Rob Skillington
 
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon
 
Approximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsApproximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsDebasish Ghosh
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Codemotion
 
Reactor, Reactive streams and MicroServices
Reactor, Reactive streams and MicroServicesReactor, Reactive streams and MicroServices
Reactor, Reactive streams and MicroServicesStéphane Maldini
 
Webinar: Using Control Theory to Keep Compactions Under Control
Webinar: Using Control Theory to Keep Compactions Under ControlWebinar: Using Control Theory to Keep Compactions Under Control
Webinar: Using Control Theory to Keep Compactions Under ControlScyllaDB
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasFlink Forward
 
streamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptxstreamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptxGopiNathVelivela
 
[db analytics showcase Sapporo 2018] B33 H2O4GPU and GoAI: harnessing the pow...
[db analytics showcase Sapporo 2018] B33 H2O4GPU and GoAI: harnessing the pow...[db analytics showcase Sapporo 2018] B33 H2O4GPU and GoAI: harnessing the pow...
[db analytics showcase Sapporo 2018] B33 H2O4GPU and GoAI: harnessing the pow...Insight Technology, Inc.
 
Sketch algorithms
Sketch algorithmsSketch algorithms
Sketch algorithmsSimon Belak
 
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"NUS-ISS
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithmsSandeep Joshi
 
Counting (Using Computer)
Counting (Using Computer)Counting (Using Computer)
Counting (Using Computer)roshmat
 
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScyllaDB
 
Interview questions slide deck
Interview questions slide deckInterview questions slide deck
Interview questions slide deckMikeBegley
 
#GDC15 Code Clinic
#GDC15 Code Clinic#GDC15 Code Clinic
#GDC15 Code ClinicMike Acton
 
Amazon Redshift
Amazon RedshiftAmazon Redshift
Amazon RedshiftJeff Patti
 
Probabilistic data structure
Probabilistic data structureProbabilistic data structure
Probabilistic data structureThinh Dang
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature EngineeringHJ van Veen
 

Similar a Scalable real-time processing techniques (20)

FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
FOSDEM 2019: M3, Prometheus and Graphite with metrics and monitoring in an in...
 
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
 
14 query processing-sorting
14 query processing-sorting14 query processing-sorting
14 query processing-sorting
 
Approximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsApproximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming Applications
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
 
Reactor, Reactive streams and MicroServices
Reactor, Reactive streams and MicroServicesReactor, Reactive streams and MicroServices
Reactor, Reactive streams and MicroServices
 
Webinar: Using Control Theory to Keep Compactions Under Control
Webinar: Using Control Theory to Keep Compactions Under ControlWebinar: Using Control Theory to Keep Compactions Under Control
Webinar: Using Control Theory to Keep Compactions Under Control
 
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
 
streamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptxstreamingalgo88585858585858585pppppp.pptx
streamingalgo88585858585858585pppppp.pptx
 
[db analytics showcase Sapporo 2018] B33 H2O4GPU and GoAI: harnessing the pow...
[db analytics showcase Sapporo 2018] B33 H2O4GPU and GoAI: harnessing the pow...[db analytics showcase Sapporo 2018] B33 H2O4GPU and GoAI: harnessing the pow...
[db analytics showcase Sapporo 2018] B33 H2O4GPU and GoAI: harnessing the pow...
 
Sketch algorithms
Sketch algorithmsSketch algorithms
Sketch algorithms
 
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithms
 
Counting (Using Computer)
Counting (Using Computer)Counting (Using Computer)
Counting (Using Computer)
 
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion RecordsScylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
 
Interview questions slide deck
Interview questions slide deckInterview questions slide deck
Interview questions slide deck
 
#GDC15 Code Clinic
#GDC15 Code Clinic#GDC15 Code Clinic
#GDC15 Code Clinic
 
Amazon Redshift
Amazon RedshiftAmazon Redshift
Amazon Redshift
 
Probabilistic data structure
Probabilistic data structureProbabilistic data structure
Probabilistic data structure
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 

Más de Lars Albertsson

Crossing the data divide
Crossing the data divideCrossing the data divide
Crossing the data divideLars Albertsson
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with ScalametaLars Albertsson
 
How to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfLars Albertsson
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdfLars Albertsson
 
The 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfLars Albertsson
 
Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application qualityLars Albertsson
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetLars Albertsson
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesLars Albertsson
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift leftLars Albertsson
 
Mortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityMortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityLars Albertsson
 
Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish styleLars Albertsson
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data opsLars Albertsson
 
Engineering data quality
Engineering data qualityEngineering data quality
Engineering data qualityLars Albertsson
 
Eventually, time will kill your data processing
Eventually, time will kill your data processingEventually, time will kill your data processing
Eventually, time will kill your data processingLars Albertsson
 
Taming the reproducibility crisis
Taming the reproducibility crisisTaming the reproducibility crisis
Taming the reproducibility crisisLars Albertsson
 
Eventually, time will kill your data pipeline
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipelineLars Albertsson
 
Kubernetes as data platform
Kubernetes as data platformKubernetes as data platform
Kubernetes as data platformLars Albertsson
 

Más de Lars Albertsson (20)

Crossing the data divide
Crossing the data divideCrossing the data divide
Crossing the data divide
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with Scalameta
 
How to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdf
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
 
The 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdf
 
Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application quality
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practices
 
Ai legal and ethics
Ai   legal and ethicsAi   legal and ethics
Ai legal and ethics
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift left
 
Mortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data qualityMortal analytics - Covid-19 and the problem of data quality
Mortal analytics - Covid-19 and the problem of data quality
 
Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish style
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data ops
 
Data democratised
Data democratisedData democratised
Data democratised
 
Engineering data quality
Engineering data qualityEngineering data quality
Engineering data quality
 
Eventually, time will kill your data processing
Eventually, time will kill your data processingEventually, time will kill your data processing
Eventually, time will kill your data processing
 
Taming the reproducibility crisis
Taming the reproducibility crisisTaming the reproducibility crisis
Taming the reproducibility crisis
 
Eventually, time will kill your data pipeline
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipeline
 
Data ops in practice
Data ops in practiceData ops in practice
Data ops in practice
 
Kubernetes as data platform
Kubernetes as data platformKubernetes as data platform
Kubernetes as data platform
 

Último

About .NET 8 and a first glimpse into .NET9
About .NET 8 and a first glimpse into .NET9About .NET 8 and a first glimpse into .NET9
About .NET 8 and a first glimpse into .NET9Jürgen Gutsch
 
20240330_고급진 코드를 위한 exception 다루기
20240330_고급진 코드를 위한 exception 다루기20240330_고급진 코드를 위한 exception 다루기
20240330_고급진 코드를 위한 exception 다루기Chiwon Song
 
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdf
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdfARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdf
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdfTobias Schneck
 
AI Embracing Every Shade of Human Beauty
AI Embracing Every Shade of Human BeautyAI Embracing Every Shade of Human Beauty
AI Embracing Every Shade of Human BeautyRaymond Okyere-Forson
 
Kubernetes go-live checklist for your microservices.pptx
Kubernetes go-live checklist for your microservices.pptxKubernetes go-live checklist for your microservices.pptx
Kubernetes go-live checklist for your microservices.pptxPrakarsh -
 
ERP For Electrical and Electronics manufecturing.pptx
ERP For Electrical and Electronics manufecturing.pptxERP For Electrical and Electronics manufecturing.pptx
ERP For Electrical and Electronics manufecturing.pptxAutus Cyber Tech
 
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...Jaydeep Chhasatia
 
Introduction-to-Software-Development-Outsourcing.pptx
Introduction-to-Software-Development-Outsourcing.pptxIntroduction-to-Software-Development-Outsourcing.pptx
Introduction-to-Software-Development-Outsourcing.pptxIntelliSource Technologies
 
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine HarmonyLeveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmonyelliciumsolutionspun
 
online pdf editor software solutions.pdf
online pdf editor software solutions.pdfonline pdf editor software solutions.pdf
online pdf editor software solutions.pdfMeon Technology
 
IA Generativa y Grafos de Neo4j: RAG time
IA Generativa y Grafos de Neo4j: RAG timeIA Generativa y Grafos de Neo4j: RAG time
IA Generativa y Grafos de Neo4j: RAG timeNeo4j
 
Deep Learning for Images with PyTorch - Datacamp
Deep Learning for Images with PyTorch - DatacampDeep Learning for Images with PyTorch - Datacamp
Deep Learning for Images with PyTorch - DatacampVICTOR MAESTRE RAMIREZ
 
Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...
Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...
Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...OnePlan Solutions
 
Cybersecurity Challenges with Generative AI - for Good and Bad
Cybersecurity Challenges with Generative AI - for Good and BadCybersecurity Challenges with Generative AI - for Good and Bad
Cybersecurity Challenges with Generative AI - for Good and BadIvo Andreev
 
Growing Oxen: channel operators and retries
Growing Oxen: channel operators and retriesGrowing Oxen: channel operators and retries
Growing Oxen: channel operators and retriesSoftwareMill
 
Your Vision, Our Expertise: TECUNIQUE's Tailored Software Teams
Your Vision, Our Expertise: TECUNIQUE's Tailored Software TeamsYour Vision, Our Expertise: TECUNIQUE's Tailored Software Teams
Your Vision, Our Expertise: TECUNIQUE's Tailored Software TeamsJaydeep Chhasatia
 
Why Choose Brain Inventory For Ecommerce Development.pdf
Why Choose Brain Inventory For Ecommerce Development.pdfWhy Choose Brain Inventory For Ecommerce Development.pdf
Why Choose Brain Inventory For Ecommerce Development.pdfBrain Inventory
 
Generative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-CouncilGenerative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-CouncilVICTOR MAESTRE RAMIREZ
 
JS-Experts - Cybersecurity for Generative AI
JS-Experts - Cybersecurity for Generative AIJS-Experts - Cybersecurity for Generative AI
JS-Experts - Cybersecurity for Generative AIIvo Andreev
 

Último (20)

About .NET 8 and a first glimpse into .NET9
About .NET 8 and a first glimpse into .NET9About .NET 8 and a first glimpse into .NET9
About .NET 8 and a first glimpse into .NET9
 
20240330_고급진 코드를 위한 exception 다루기
20240330_고급진 코드를 위한 exception 다루기20240330_고급진 코드를 위한 exception 다루기
20240330_고급진 코드를 위한 exception 다루기
 
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdf
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdfARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdf
ARM Talk @ Rejekts - Will ARM be the new Mainstream in our Data Centers_.pdf
 
AI Embracing Every Shade of Human Beauty
AI Embracing Every Shade of Human BeautyAI Embracing Every Shade of Human Beauty
AI Embracing Every Shade of Human Beauty
 
Kubernetes go-live checklist for your microservices.pptx
Kubernetes go-live checklist for your microservices.pptxKubernetes go-live checklist for your microservices.pptx
Kubernetes go-live checklist for your microservices.pptx
 
ERP For Electrical and Electronics manufecturing.pptx
ERP For Electrical and Electronics manufecturing.pptxERP For Electrical and Electronics manufecturing.pptx
ERP For Electrical and Electronics manufecturing.pptx
 
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
Optimizing Business Potential: A Guide to Outsourcing Engineering Services in...
 
Sustainable Web Design - Claire Thornewill
Sustainable Web Design - Claire ThornewillSustainable Web Design - Claire Thornewill
Sustainable Web Design - Claire Thornewill
 
Introduction-to-Software-Development-Outsourcing.pptx
Introduction-to-Software-Development-Outsourcing.pptxIntroduction-to-Software-Development-Outsourcing.pptx
Introduction-to-Software-Development-Outsourcing.pptx
 
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine HarmonyLeveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
 
online pdf editor software solutions.pdf
online pdf editor software solutions.pdfonline pdf editor software solutions.pdf
online pdf editor software solutions.pdf
 
IA Generativa y Grafos de Neo4j: RAG time
IA Generativa y Grafos de Neo4j: RAG timeIA Generativa y Grafos de Neo4j: RAG time
IA Generativa y Grafos de Neo4j: RAG time
 
Deep Learning for Images with PyTorch - Datacamp
Deep Learning for Images with PyTorch - DatacampDeep Learning for Images with PyTorch - Datacamp
Deep Learning for Images with PyTorch - Datacamp
 
Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...
Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...
Transforming PMO Success with AI - Discover OnePlan Strategic Portfolio Work ...
 
Cybersecurity Challenges with Generative AI - for Good and Bad
Cybersecurity Challenges with Generative AI - for Good and BadCybersecurity Challenges with Generative AI - for Good and Bad
Cybersecurity Challenges with Generative AI - for Good and Bad
 
Growing Oxen: channel operators and retries
Growing Oxen: channel operators and retriesGrowing Oxen: channel operators and retries
Growing Oxen: channel operators and retries
 
Your Vision, Our Expertise: TECUNIQUE's Tailored Software Teams
Your Vision, Our Expertise: TECUNIQUE's Tailored Software TeamsYour Vision, Our Expertise: TECUNIQUE's Tailored Software Teams
Your Vision, Our Expertise: TECUNIQUE's Tailored Software Teams
 
Why Choose Brain Inventory For Ecommerce Development.pdf
Why Choose Brain Inventory For Ecommerce Development.pdfWhy Choose Brain Inventory For Ecommerce Development.pdf
Why Choose Brain Inventory For Ecommerce Development.pdf
 
Generative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-CouncilGenerative AI for Cybersecurity - EC-Council
Generative AI for Cybersecurity - EC-Council
 
JS-Experts - Cybersecurity for Generative AI
JS-Experts - Cybersecurity for Generative AIJS-Experts - Cybersecurity for Generative AI
JS-Experts - Cybersecurity for Generative AI
 

Scalable real-time processing techniques

  • 1. Scalable real-time processing techniques How to almost count Lars Albertsson, Schibsted
  • 2. “We promised to count live... ...but since you can’t do that, we used historical numbers and this cool math to extrapolate.” ?!?
  • 3. Stream counting is simple You already have the building blocks Yet many wait for batch execution Or go through estimation hoops
  • 4. Accurate counting Server Bus Bucketiser Bucketiser Bucketiser Aggregator Server Server Server ● Straightforward, with some plumbing. ● Heavier than you need.
  • 5. Now or later? Exact or rough? Approximation now >> accurate later
  • 6. Basic scenarios ● How many distinct items in last x minutes? ● What are the top k items in last x minutes? ● How many Ys in last x minutes? These base techniques are sufficient for implementing e.g. personalisation and recommendation algorithms.
  • 7. Cardinality - distinct stream count ● Naive: Set of hashes. X bits per item.
  • 8. Cardinality - distinct stream count ● Naive: Set of hashes. X bits per item. ● Naive 2: Set approximation with Bloom filter + counter.
  • 9. Counting in context ● Look backward, different time windows, compare. ● Count for a small time quantum, keep history. ● Aggregate old windows. ● Monoid representations are desirable.
  • 10. Cardinality - distinct stream count ● Naive: Set of hashes. X bits per item. ● Naive 2: Set approximation with Bloom filter + counter. ● Naive 3: Hash to bitmap. Count bits.
  • 11. Cardinality - distinct stream count ● Naive: Set of hashes. X bits per item. ● Naive 2: Set approximation with Bloom filter + counter. ● Naive 3: Hash to bitmap. Count bits. ● Attempt 4: Hash, bitmap, count + collision compensation. Linear Probabilistic Counter.
  • 12. Cardinality - distinct stream count ● Naive: Set of hashes. X bits per item. ● Naive 2: Set approximation with Bloom filter + counter. ● Naive 3: Hash to bitmap. Count bits. ● Attempt 4: Hash, bitmap, count + collision compensation. Linear Probabilistic Counter. ● Read papers… -> HyperLogLog counter
  • 13. Cardinality - distinct stream count Source: Shakespeare, highscalability.com
  • 14. Top K counting U2 65 Gaga 46 Avicii 23 Eminem 21 Dolly 18 U2 65 Gaga 46 Avicii 23 Eminem 21 Peps 19 U2 65 Gaga 46 Avicii 23 Eminem 21 Dolly 20 ● Keep k items, assume absentees have lowest value ● Accurate at top, overcounting in bottom
  • 15. Approx counting - Count-Min Sketch ● Compute n hashes for key. ● Increment once on each row, col by mod (hash) ● Retrieve by min() over rows 3 7 20 3 11 6 3+1 4 1 1 3 8 6 2+1 17 13 1 0 4 5 12 7 6 14 2 0 2 3 6+1 7 3 2 12 8+1 10 2 7 2 11 2
  • 16. Top K with Count-Min Sketch U2 65 Gaga 46 Avicii 23 Eminem 21 Dolly 18 U2 65 Gaga 46 Avicii 23 Eminem 21 Peps 2 U2 65 Gaga 46 Avicii 23 Eminem 21 Dolly 19 ● Keep Heavy Hitters list. ● Lookup absentees in CMS. ● Risk of overcount is smaller and spread out.
  • 17. Cubic CMS ● Decorate song with geo, age, etc. Pour into CMS. ● Keep heavy hitters per geo, age group. *:*:<U2> SE:*:<U2> *:31-40:<U2> SE:31-40:<U2> +1 +1 +1 +1
  • 18. Machinery O(104) messages / s per machine. You probably only need one. If not, use Storm. Read and write to pub/sub channel, e.g. Kafka or ZeroMQ.
  • 19. Brute force alternative Dump every single message into ElasticSearch. Suitable for high dimensionality cubes.
  • 20. Recommendations, you said? ● Collaborative filtering - similarity matrix Users 2 4 1 1 5 2 0 1 7 1 0 6 5 2 9 0 3 0 3 8 0 6 0 7 Items
  • 21. Shave the matrix Users Items 0,0 3 0,1 5 0,2 0 0,3 2 1,0 8 ... ... 2,1 9 1,0 8 2,2 7 5,0 7 5,2 6 ... ... Flip Sort 2,1 9 1,0 8 2,2 7 5,0 7 5,2 6 Cut 0 0 0 0 0 0 0 0 7 0 0 6 0 0 9 0 0 0 0 8 0 0 0 7 Noise removed - fine for recommendations 2 4 1 1 5 2 0 1 7 1 0 6 5 2 9 0 3 0 3 8 0 6 0 7
  • 22. Hungry for more? Mikio Braun: http://www.berlinbuzzwords.de/session/real-time-personalization-and- recommendation-stream-mining Ted Dunning on deep learning for real-time anomaly detection: http://www. berlinbuzzwords.de/session/deep-learning-high-performance-time-series-databases Ted Dunning on Storm: http://www.youtube.com/watch?v=7PcmbI5aC20 Open source: stream-lib, Algebird
  • 23. Want to work in this area? lalle@schibsted.com