Stateful Stream
Processing at In-Memory
Speed
Jamie Grier
@jamiegrier
jamie@data-artisans.com
Who am I?
• Director of Applications Engineering at data
Artisans
• Previously working on streaming computation at
Twitter, Gnip and Boulder Imaging
• Involved in various kinds of stream processing for
about a decade
• High-speed video, social media streaming, general
frameworks for stream processing
Overview
• In stateful stream processing the bottleneck has often
been the key-value store
• Accuracy has been sacrificed for speed
• Lambda Architecture was developed to address
shortcomings of stream processors
• Can we remove the key-value store bottleneck and
enable processing at in-memory speeds?
• Can we do this accurately without Lambda Architecture?
Problem statement
• Incoming message rate: 1.5 million/sec
• Group by several dimensions and aggregate
over 1 hour event-time windows
• Write hourly time series data to database
• Respond to queries both over historical data and
the live in-flight aggregates
Input and Queries
Stream
tweet-id: 1, event: url-
click, time: 01:01:01
tweet-id: 2, event: url-
click, time: 01:01:02
tweet-id: 1, event:
impression, time:
01:01:03
tweet-id: 2, event: url-
click, time: 02:01:01
tweet-id: 1, event:
impression, time:
02:02:02
Query Result
tweet-id: 1, event: url-
click, time: 01:00:00 1
tweet-id: 1,
event: *,
time: 01:00:00
2
tweet-id: *,
event: *,
time: 01:00:00
3
tweet-id: *,
event: impression,
time: 02:00:00
1
tweet-id: 2,
event: *,
time: 02:00:00
1
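The query results on this slide can be reproduced with a small sketch. It assumes that `*` means "any value" for that dimension and that each event is counted once under every combination of exact and wildcarded dimensions; the function names and the tuple layout are illustrative, not taken from the talk.

```python
from collections import Counter
from itertools import product

# Events from the slide: (tweet_id, event_type, event_time as HH:MM:SS).
EVENTS = [
    (1, "url-click", "01:01:01"),
    (2, "url-click", "01:01:02"),
    (1, "impression", "01:01:03"),
    (2, "url-click", "02:01:01"),
    (1, "impression", "02:02:02"),
]

WILDCARD = "*"


def hour_bucket(event_time: str) -> str:
    """Truncate an HH:MM:SS event time to the start of its one-hour window."""
    return event_time[:2] + ":00:00"


def aggregate(events):
    """Count events per (tweet_id, event_type, hour), including wildcard rollups."""
    counts = Counter()
    for tweet_id, event_type, event_time in events:
        hour = hour_bucket(event_time)
        # Each event contributes to four keys: the exact key plus every
        # combination with a dimension replaced by the wildcard.
        for tid, ev in product((tweet_id, WILDCARD), (event_type, WILDCARD)):
            counts[(tid, ev, hour)] += 1
    return counts


counts = aggregate(EVENTS)

# The five queries shown on the slide.
QUERIES = [
    (1, "url-click", "01:00:00"),
    (1, WILDCARD, "01:00:00"),
    (WILDCARD, WILDCARD, "01:00:00"),
    (WILDCARD, "impression", "02:00:00"),
    (2, WILDCARD, "02:00:00"),
]
for query in QUERIES:
    print(query, counts[query])
```

Pre-computing the wildcard rollups trades memory for query speed: every query becomes a single key lookup.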
Input and Queries
Stream
tweet-id: 1, event: url-
click, time: 01:01:03
tweet-id: 2, event: url-
click, time: 01:01:02
tweet-id: 1, event:
impression, time:
01:01:01
tweet-id: 2, event: url-
click, time: 02:02:01
tweet-id: 1, event:
impression, time:
02:01:02
Query Result
tweet-id: 1, event: url-
click, time: 01:00:00 1
tweet-id: 1,
event: *,
time: 01:00:00
2
tweet-id: *,
event: *,
time: 01:00:00
3
tweet-id: *,
event: impression,
time: 02:00:00
1
tweet-id: 2,
event: *,
time: 02:00:00
1
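On this slide the timestamps are not in ascending order, yet the query results are the same as before. A minimal sketch of why: when each event is assigned to a window by its own event time, the arrival order does not affect the counts. The helper name is hypothetical, and late data and window closing are deliberately left out.

```python
from collections import Counter

# Events as listed on this slide, in arrival order (timestamps not ascending).
ARRIVAL_ORDER = [
    (1, "url-click", "01:01:03"),
    (2, "url-click", "01:01:02"),
    (1, "impression", "01:01:01"),
    (2, "url-click", "02:02:01"),
    (1, "impression", "02:01:02"),
]


def hourly_counts(events):
    """Count events per (tweet_id, event_type, window start), keyed by event time."""
    counts = Counter()
    for tweet_id, event_type, event_time in events:
        window_start = event_time[:2] + ":00:00"
        counts[(tweet_id, event_type, window_start)] += 1
    return counts


in_arrival_order = hourly_counts(ARRIVAL_ORDER)
in_time_order = hourly_counts(sorted(ARRIVAL_ORDER, key=lambda e: e[2]))

# Both orderings produce identical per-window counts.
print(in_arrival_order == in_time_order)
```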
Input and Queries
Time Series Data
[Chart: Tweet Impressions per hour for Tweet 1 and Tweet 2; x-axis 01:00:00 to 04:00:00, y-axis 0 to 125]
Any questions so far?
Legacy System
Stream Processor
Hadoop
Lambda Architecture
Streaming
Batch
Legacy System
(Lambda Architecture)
• Aggregates built directly in
key/value store
• Read/modify/write for every
message
• Inaccurate: double-counting,
lost pre-aggregated data
• Hadoop job improves results
after 24 hours
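The double-counting problem can be illustrated with a toy model. This is a sketch under assumptions: the dictionary stands in for the external key/value store, and the failure is modeled as a full replay of unacknowledged messages (at-least-once delivery). None of the names come from the legacy system itself.

```python
# Hypothetical stand-in for an external key/value store.
kv_store = {}


def read_modify_write(key):
    """Update one aggregate: in a real system this is a read and a write over the network."""
    current = kv_store.get(key, 0)  # read
    kv_store[key] = current + 1     # modify and write back


messages = ["tweet-1", "tweet-1", "tweet-2"]

# First attempt: the processor applies all three messages, then crashes
# before acknowledging them to the source.
for key in messages:
    read_modify_write(key)

# At-least-once delivery: the unacknowledged messages are replayed on restart.
for key in messages:
    read_modify_write(key)

# The true counts are tweet-1: 2 and tweet-2: 1, but the store now holds double.
print(kv_store)  # {'tweet-1': 4, 'tweet-2': 2}
```

Because the store's contents and the source position are not updated atomically, a replay cannot tell which increments were already applied.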
Any questions so far?
Goals for Prototype
System
• Feature parity with existing system
• Attempt to reduce hardware footprint by 100x
• Exactly once semantics: compute correct results in real-
time with or without failures. Failures should not lead to
missing data or double counting
• Satisfy realtime queries with low latency
• One system: No Lambda Architecture!
• Eliminate the key/value store bottleneck (big win)
My road to
Apache Flink
• Interested in Google Cloud Dataflow
• Google nailed the semantics for stream processing
• Unified batch and stream processing with one model
• Dataflow didn’t exist in open source at the time (or so I
thought) and I wanted to build it.
• My wife wouldn’t let me quit my job!
• Dataflow SDK is now open source as Apache Beam and
Flink is the most complete runner.
Why Apache Flink?
• Basically identical semantics to Google Cloud Dataflow
• Flink is a true fault-tolerant stateful stream processor
• Exactly once guarantees for state updates
• The state management features might allow us to eliminate the key-value
store
• Windowing is built-in which makes time series easy
• Native event time support / correct time based aggregations
• Very fast data shuffling in benchmarks: 83 million msgs/sec on 30 machines
• Flink “just works” with no tuning - even at scale!
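The idea behind exactly-once state updates can be sketched as follows: the operator state and the source position are snapshotted together and restored together. This is a simplified single-process model with hypothetical class and method names, not Flink's API or its actual checkpointing mechanism.

```python
import copy


class CountingOperator:
    """Keeps per-key counts and the offset of the next message to read."""

    def __init__(self):
        self.counts = {}
        self.offset = 0
        self._checkpoint = ({}, 0)

    def process(self, log, upto):
        """Consume messages from the replayable log until the offset reaches `upto`."""
        while self.offset < upto:
            key = log[self.offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1

    def checkpoint(self):
        """Snapshot state and source offset as one consistent unit."""
        self._checkpoint = (copy.deepcopy(self.counts), self.offset)

    def restore(self):
        """Roll state and offset back together to the last checkpoint."""
        counts, offset = self._checkpoint
        self.counts = copy.deepcopy(counts)
        self.offset = offset


log = ["tweet-1", "tweet-1", "tweet-2", "tweet-1"]
op = CountingOperator()

op.process(log, upto=2)
op.checkpoint()
op.process(log, upto=3)         # work done after the checkpoint...
op.restore()                    # ...is discarded when a failure forces a restore
op.process(log, upto=len(log))  # the source is re-read from the restored offset

print(op.counts)  # {'tweet-1': 3, 'tweet-2': 1}
```

Each message affects the final state exactly once, even though the third message was physically processed twice.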
Prototype System
Apache Flink
Streaming
Prototype System
Apache Flink
We now have a sharded key/value store
inside the stream processor
Why not just query that!
Query Service
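A minimal sketch of the idea on this slide: state is partitioned by key across shards inside the processor, and a query service routes each lookup to the shard that owns the key. The shard count, the CRC32-based routing, and all names are assumptions for illustration; the talk does not describe how the prototype routes keys or queries.

```python
import zlib

NUM_SHARDS = 4  # assumed for illustration


def shard_for(key: str) -> int:
    """Route a key to a shard with a stable hash, so updates and queries agree."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


class Shard:
    """In-memory state owned by one parallel instance of the operator."""

    def __init__(self):
        self.state = {}

    def update(self, key):
        self.state[key] = self.state.get(key, 0) + 1

    def get(self, key):
        return self.state.get(key, 0)


shards = [Shard() for _ in range(NUM_SHARDS)]


def process(key):
    """Stream side: update the in-flight aggregate on the owning shard."""
    shards[shard_for(key)].update(key)


def query(key):
    """Query-service side: read the in-flight aggregate from the owning shard."""
    return shards[shard_for(key)].get(key)


for key in ["tweet-1", "tweet-2", "tweet-1"]:
    process(key)

print(query("tweet-1"), query("tweet-2"), query("tweet-3"))  # 2 1 0
```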
Prototype System
• Eliminates the key-value store
bottleneck
• Eliminates the batch layer
• No more Lambda Architecture!
• Realtime queries over in-flight
aggregates
• Hourly aggregates written to
database
The Results
• Uses 0.5% of the resources of the legacy system:
An improvement of 200x with zero tuning!
• Exactly once analytics in realtime
• Complete elimination of batch layer and Lambda
Architecture
• Successfully eliminated the key-value store
bottleneck
How is a 200x improvement
possible?
• The key is making use of fault-tolerant state inside the
stream processor
• Computation proceeds at in-memory speeds
• No need to make requests over the network to update
values in external store
• Dramatically less load on the database because only the
completed window aggregates are written there.
• Flink is extremely efficient at network I/O and data shuffling,
and has highly optimized serialization architecture
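The reduced database load can be made concrete with a toy comparison. The message rate comes from the problem statement; the sample events and function names are invented for illustration and say nothing about the real key cardinality.

```python
from collections import Counter

MESSAGES_PER_SECOND = 1_500_000  # from the problem statement
SECONDS_PER_HOUR = 3600

# Updates the external store would see per hour with one write per message.
print(MESSAGES_PER_SECOND * SECONDS_PER_HOUR)  # 5400000000


def per_message_writes(events):
    """Legacy approach: one external read/modify/write per message."""
    return len(events)


def per_window_writes(events):
    """Stateful approach: aggregate in memory, write each completed aggregate once."""
    in_memory = Counter()
    for key, window_start in events:
        in_memory[(key, window_start)] += 1
    return len(in_memory)


# Toy input: (key, window start) pairs; made up for illustration.
events = (
    [("tweet-1", "01:00:00")] * 5
    + [("tweet-2", "01:00:00")] * 3
    + [("tweet-1", "02:00:00")] * 2
)

print(per_message_writes(events))  # 10
print(per_window_writes(events))   # 3
```

The more messages share a key within a window, the larger the gap between the two write counts becomes.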
Does this matter
at smaller scale?
• YES it does!
• Much larger problems on the same hardware
investment
• Exactly-once semantics and state management
are important at any scale!
• Engineering time invested can be expensive at
any scale if things don’t “just work”.
Summary
• Used stateful operator features in Flink to remove
the key/value store bottleneck
• Dramatic reduction in hardware costs (200x)
• Maintained feature parity by providing low-latency
queries for in-flight aggregates as well as long-term
storage of hourly time series data
• Actually improved accuracy of aggregations:
exactly-once vs. at-least-once semantics
Questions?
Thanks!

Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 

Último (20)

Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 

Stateful Stream Processing at In-Memory Speed

  • 1. Stateful Stream Processing at In-Memory Speed Jamie Grier @jamiegrier jamie@data-artisans.com
  • 2. Who am I? • Director of Applications Engineering at data Artisans • Previously working on streaming computation at Twitter, Gnip and Boulder Imaging • Involved in various kinds of stream processing for about a decade • High-speed video, social media streaming, general frameworks for stream processing
  • 3. Overview • In stateful stream processing the bottleneck has often been the key-value store • Accuracy has been sacrificed for speed • Lambda Architecture was developed to address shortcomings of stream processors • Can we remove the key-value store bottleneck and enable processing at in-memory speeds? • Can we do this accurately without Lambda Architecture?
  • 4. Problem statement • Incoming message rate: 1.5 million/sec • Group by several dimensions and aggregate over 1 hour event-time windows • Write hourly time series data to database • Respond to queries both over historical data and the live in-flight aggregates
  • 5. Input and Queries • Stream: (tweet-id: 1, event: url-click, time: 01:01:01); (tweet-id: 2, event: url-click, time: 01:01:02); (tweet-id: 1, event: impression, time: 01:01:03); (tweet-id: 2, event: url-click, time: 02:01:01); (tweet-id: 1, event: impression, time: 02:02:02) • Queries and results: (tweet-id: 1, event: url-click, time: 01:00:00) = 1; (tweet-id: 1, event: *, time: 01:00:00) = 2; (tweet-id: *, event: *, time: 01:00:00) = 3; (tweet-id: *, event: impression, time: 02:00:00) = 1; (tweet-id: 2, event: *, time: 02:00:00) = 1
  • 6. Input and Queries • Stream: (tweet-id: 1, event: url-click, time: 01:01:03); (tweet-id: 2, event: url-click, time: 01:01:02); (tweet-id: 1, event: impression, time: 01:01:01); (tweet-id: 2, event: url-click, time: 02:02:01); (tweet-id: 1, event: impression, time: 02:01:02) • Queries and results: (tweet-id: 1, event: url-click, time: 01:00:00) = 1; (tweet-id: 1, event: *, time: 01:00:00) = 2; (tweet-id: *, event: *, time: 01:00:00) = 3; (tweet-id: *, event: impression, time: 02:00:00) = 1; (tweet-id: 2, event: *, time: 02:00:00) = 1
  • 11. Time Series Data • Chart: Tweet Impressions for Tweet 1 and Tweet 2, hourly from 01:00:00 to 04:00:00 (y-axis 0 to 125)
  • 13. Legacy System • Diagram: Lambda Architecture with a streaming path (Stream Processor) and a batch path (Hadoop)
  • 22. Legacy System (Lambda Architecture) • Aggregates built directly in key/value store • Read/modify/write for every message • Inaccurate: double-counting, lost pre-aggregated data • Hadoop job improves results after 24 hours
  • 24. Goals for Prototype System • Feature parity with existing system • Attempt to reduce hardware footprint by 100x • Exactly once semantics: compute correct results in real-time with or without failures. Failures should not lead to missing data or double counting • Satisfy realtime queries with low latency • One system: No Lambda Architecture! • Eliminate the key/value store bottleneck (big win)
  • 25. My road to Apache Flink • Interested in Google Cloud Dataflow • Google nailed the semantics for stream processing • Unified batch and stream processing with one model • Dataflow didn’t exist in open source at the time (or so I thought) and I wanted to build it. • My wife wouldn’t let me quit my job! • Dataflow SDK is now open source as Apache Beam and Flink is the most complete runner.
  • 26. Why Apache Flink? • Basically identical semantics to Google Cloud Dataflow • Flink is a true fault-tolerant stateful stream processor • Exactly once guarantees for state updates • The state management features might allow us to eliminate the key-value store • Windowing is built-in which makes time series easy • Native event time support / correct time based aggregations • Very fast data shuffling in benchmarks: 83 million msgs/sec on 30 machines • Flink “just works” with no tuning - even at scale!
  • 36. Prototype System • Apache Flink (Streaming) • We now have a sharded key/value store inside the stream processor
  • 37. Prototype System • Apache Flink (Streaming) • We now have a sharded key/value store inside the stream processor • Why not just query that!
  • 38. Prototype System • Apache Flink • Query Service • We now have a sharded key/value store inside the stream processor • Why not just query that!
  • 39. Prototype System • Eliminates the key-value store bottleneck • Eliminates the batch layer • No more Lambda Architecture! • Realtime queries over in-flight aggregates • Hourly aggregates written to database
  • 40. The Results • Uses 0.5% of the resources of the legacy system: An improvement of 200x with zero tuning! • Exactly once analytics in realtime • Complete elimination of batch layer and Lambda Architecture • Successfully eliminated the key-value store bottleneck
  • 41. How is 200x improvement possible? • The key is making use of fault-tolerant state inside the stream processor • Computation proceeds at in-memory speeds • No need to make requests over the network to update values in an external store • Dramatically less load on the database because only the completed window aggregates are written there • Flink is extremely efficient at network I/O and data shuffling, and has a highly optimized serialization architecture
  • 42. Does this matter at smaller scale? • YES it does! • Much larger problems on the same hardware investment • Exactly-once semantics and state management are important at any scale! • Engineering time invested can be expensive at any scale if things don’t “just work”.
  • 43. Summary • Used stateful operator features in Flink to remove the key/value store bottleneck • Dramatic reduction in hardware costs (200x) • Maintained feature parity by providing low-latency queries for in-flight aggregates as well as long-term storage of hourly time series data • Actually improved accuracy of aggregations: exactly-once vs. at-least-once semantics
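
The "Input and Queries" slides can be reproduced with a small sketch in plain Python. It is not Flink code and not the speaker's implementation: it assumes that wildcard queries are answered by pre-aggregating every combination of concrete value and `*` per one-hour event-time window, which the slides do not specify. The names `aggregate`, `window_start`, `IN_ORDER` and `OUT_OF_ORDER` are hypothetical.

```python
from collections import Counter
from itertools import product

WINDOW_SECONDS = 3600  # 1 hour event-time windows, as in the problem statement


def window_start(hhmmss):
    """Round an HH:MM:SS event time down to the start of its one-hour window."""
    h, m, s = (int(part) for part in hhmmss.split(":"))
    start = (h * 3600 + m * 60 + s) // WINDOW_SECONDS * WINDOW_SECONDS
    return f"{start // 3600:02d}:{start % 3600 // 60:02d}:{start % 60:02d}"


def aggregate(events):
    """Count events per (tweet-id, event, window), including '*' rollups.

    Assumption: wildcards are served from pre-computed rollup keys.
    The window is derived from the event's own timestamp, so the
    arrival order of the events does not change the result.
    """
    counts = Counter()
    for tweet_id, event, time in events:
        window = window_start(time)
        for t, e in product((tweet_id, "*"), (event, "*")):
            counts[(t, e, window)] += 1
    return counts


# The two streams shown on the slides; the second lists timestamps out of order.
IN_ORDER = [
    (1, "url-click", "01:01:01"),
    (2, "url-click", "01:01:02"),
    (1, "impression", "01:01:03"),
    (2, "url-click", "02:01:01"),
    (1, "impression", "02:02:02"),
]
OUT_OF_ORDER = [
    (1, "url-click", "01:01:03"),
    (2, "url-click", "01:01:02"),
    (1, "impression", "01:01:01"),
    (2, "url-click", "02:02:01"),
    (1, "impression", "02:01:02"),
]

QUERIES = [
    (1, "url-click", "01:00:00"),
    (1, "*", "01:00:00"),
    ("*", "*", "01:00:00"),
    ("*", "impression", "02:00:00"),
    (2, "*", "02:00:00"),
]

for stream in (IN_ORDER, OUT_OF_ORDER):
    counts = aggregate(stream)
    print([counts[query] for query in QUERIES])
```

Both streams yield the same counts as the slides because each event is assigned to a window by its event time rather than by when it arrives. A real job would also need a rule for when a window is considered complete, which this sketch leaves out.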

Editor's notes

  1. Aggregates built directly in key/value store • Inaccurate: double-counting, lost aggregates • Hadoop batch job “fixes” later (Lambda Architecture) • Hadoop job runs every 24 hours
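
The double counting described in the notes above can be illustrated with a toy simulation in plain Python. It is a simplified model, not the legacy system or Flink's actual checkpointing: it assumes a single process, a crash after a fixed number of messages, and a replay from an earlier input offset. The function names and the message list are hypothetical.

```python
def run_external_store(messages, crash_after, committed_offset):
    """Legacy style: read/modify/write against an external store per message.

    The store survives the crash, but the input is replayed from the last
    committed offset, so messages between that offset and the crash are
    applied twice (at-least-once).
    """
    store = {}
    for key in messages[:crash_after]:        # first attempt, up to the crash
        store[key] = store.get(key, 0) + 1
    for key in messages[committed_offset:]:   # replay after restart
        store[key] = store.get(key, 0) + 1
    return store


def run_checkpointed_state(messages, crash_after, checkpoint_offset):
    """State kept inside the processor and snapshotted with the input offset.

    After the crash, state and offset are restored together, so every
    message is reflected in the state exactly once.
    """
    state = {}
    snapshot = (0, {})                        # (input offset, copy of state)
    for offset, key in enumerate(messages[:crash_after]):
        if offset == checkpoint_offset:
            snapshot = (offset, dict(state))
        state[key] = state.get(key, 0) + 1
    offset, state = snapshot[0], dict(snapshot[1])  # in-memory state was lost
    for key in messages[offset:]:
        state[key] = state.get(key, 0) + 1
    return state


MESSAGES = ["tweet-1", "tweet-2", "tweet-1", "tweet-2", "tweet-1"]

legacy = run_external_store(MESSAGES, crash_after=4, committed_offset=2)
checkpointed = run_checkpointed_state(MESSAGES, crash_after=4, checkpoint_offset=2)
print("external store:", legacy)
print("checkpointed state:", checkpointed)
```

In this run the external store over-counts both keys, while the checkpointed variant ends with the true totals (3 for `tweet-1`, 2 for `tweet-2`). The point of the sketch is only that restoring state and input position together avoids the over-count; it does not cover writes to the downstream database.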