Stateful processing of massive
out-of-order streams in Apache Beam
Kenneth Knowles
Apache Beam PMC
Software Engineer @ Google
kenn@apache.org / @kennknowles
https://s.apache.org/stateful-beam-dataworks-sjc-2017
Dataworks Summit SJC 2017
1
Agenda
1. Massive out-of-order streams
2. Apache Beam for streams
3. Portable stateful processing with Beam
2
Massive out-of-order
streams
3
[Diagram, slides 4-8: Massive / Out-of-order / Streams / Computation]
Use cases for massive out-of-order streams
● Operations and manufacturing
● Mobile gaming
● Web analytics
● Wearables
● Automotive
● Power grid
● Network monitoring
● (Mobile) banking
… anything processing "events that happen"
(you can also process things that aren't events; just use fewer features)
9
Apache Beam for
Streams
10
Are you building
one of these?
11
Are you building
one of these?
12
Filter
Join
[Timeline, 2004 to 2016: MapReduce (paper), Apache Hadoop, Dataflow Model (paper), MillWheel (paper), Heron, Apache Spark, Apache Storm, Apache Gearpump (incubating), Apache Apex, Apache Flink, Cloud Dataflow, FlumeJava (paper), Apache Beam, Apache Samza]
Which one?
13
The Beam Vision
Sum Per Key
input.apply(
Sum.integersPerKey())
Java
input | Sum.PerKey()
Python
⋮
Cloud Dataflow:
fully managed
Apache Spark
local, on-prem,
cloud
Apache Flink
local, on-prem,
cloud
⋮
Apache Apex
local, on-prem,
cloud
Apache
Gearpump
(incubating)
14
The Beam Vision
KafkaIO
KafkaIO.read()
Python
⋮
class KafkaIO extends
UnboundedSource { … }
Java
Cloud Dataflow:
fully managed
Apache Spark
local, on-prem,
cloud
Apache Flink
local, on-prem,
cloud
⋮
Apache Apex
local, on-prem,
cloud
Apache
Gearpump
(incubating)
15
The Beam Model
What are you computing? (read, map, reduce)
16
Where in event time? (event time windowing)
When in processing time are results produced? (triggers)
How do refinements relate? (accumulation mode)
The focus of today
Per element ParDo (Map, etc)
17
Every item
processed
independently
Stateless
implementation
Per key Combine (Reduce, etc)
18
Items grouped by
some key and
combined
Stateful streaming
implementation
(buffering until trigger)
But your code doesn't
work with state, just an
associative &
commutative function
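A minimal sketch of why an associative & commutative function is enough, written in plain Python rather than the Beam API. The helper names `combine_per_key` and `merge` are hypothetical. It shows that combining a shuffled stream in separate bundles and then merging gives the same per-key result as combining in order.

```python
import random

def combine_per_key(events, combine_fn):
    """Fold (key, value) pairs into one value per key, in arrival order."""
    acc = {}
    for key, value in events:
        acc[key] = combine_fn(acc[key], value) if key in acc else value
    return acc

def merge(partials, combine_fn):
    """Merge per-key partial results that were computed on separate bundles."""
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = combine_fn(merged[key], value) if key in merged else value
    return merged

def add(a, b):
    return a + b

events = [("kenn", 3), ("tgroh", 5), ("kenn", 4), ("tgroh", 1), ("kenn", 2)]
in_order = combine_per_key(events, add)

# Simulate out-of-order arrival, split into two bundles, combine each, merge.
shuffled = events[:]
random.Random(7).shuffle(shuffled)
bundled = merge(
    [combine_per_key(shuffled[:2], add), combine_per_key(shuffled[2:], add)],
    add,
)
```

Because addition is associative and commutative, the engine is free to buffer, reorder, and pre-combine without the user code ever touching state.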
It "just works" with massive out-of-order streams
19
ParDo, Map, etc. Combine, Reduce, etc.
"Parse incoming events
and filter out bad data"
"Sum per hour and output when
you have the whole hour"
"Put events in 10 minute windows
sliding every 2 minutes"
"Group into sessions and
emit as fast as possible"
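As an illustration of the "10 minute windows sliding every 2 minutes" request, here is a plain-Python sketch of sliding-window assignment. It is not the Beam implementation; `sliding_windows` is a hypothetical helper and timestamps are assumed to be seconds. Each element lands in size / period windows, and arrival order does not affect which windows it belongs to.

```python
def sliding_windows(timestamp, size, period):
    """Return the [start, end) windows of length `size`, starting every
    `period` seconds, that contain `timestamp`."""
    start = timestamp - (timestamp % period)  # latest window start <= timestamp
    windows = []
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)

# Arrival order differs from event-time order.
events = [(725, "c"), (130, "a"), (610, "b")]

by_window = {}
for ts, value in events:
    for window in sliding_windows(ts, size=600, period=120):
        by_window.setdefault(window, []).append(value)
```

With `size=600` and `period=120`, every event is assigned to five overlapping windows, so the event at 725 seconds appears in the windows starting at 240, 360, 480, 600, and 720.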
But what if you need more control?
20
ParDo, Map, etc. Combine, Reduce, etc.
"I need some state on
the side to tweak my
FlatMap's behavior"
"My aggregation is not an
associative & commutative
operator"
"Triggers aren't specific
enough for my use case"
"I need to output even when
data isn't coming in"
Portable Stateful
Processing With Beam
21
What if you need more control?
22
ParDo, Map, etc.
Combine, Reduce, etc.
ProcessFunction
MapWithState
Operator
… that "just works" with out-of-order events
… is portable across engines
Timers
State
State & timers
for ParDo!
Example: time-batched requests
output return value
of batched RPC
buffer request
batched
requests
On Timer
23
"call me back in
500ms"
On Element
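The time-batched request pattern on this slide can be modeled roughly as follows. This is a plain-Python simulation of one key's state and timer, not the Beam state and timer API; the class `TimeBatcher`, the explicit `now` argument, and `fake_rpc` are all assumptions made for illustration.

```python
class TimeBatcher:
    """Model of one key's buffered state plus a single pending timer."""

    def __init__(self, rpc, delay=0.5):
        self.rpc = rpc      # hypothetical batched RPC: list of requests -> list of results
        self.delay = delay  # "call me back in 500ms"
        self.buffer = []    # state: requests buffered so far
        self.timer = None   # time at which the timer is due, or None if unset

    def on_element(self, request, now):
        """On Element: buffer the request and set the timer if none is pending."""
        self.buffer.append(request)
        if self.timer is None:
            self.timer = now + self.delay

    def advance_to(self, now):
        """On Timer: if the timer is due, send one batched RPC and output its results."""
        if self.timer is None or now < self.timer:
            return []
        batch, self.buffer, self.timer = self.buffer, [], None
        return self.rpc(batch)

calls = []

def fake_rpc(batch):
    """Stand-in for a real service; records each call so batching is visible."""
    calls.append(list(batch))
    return [request.upper() for request in batch]

batcher = TimeBatcher(fake_rpc)
batcher.on_element("a", now=0.0)
batcher.on_element("b", now=0.2)
early = batcher.advance_to(0.4)  # timer not due yet, nothing is sent
out = batcher.advance_to(0.5)    # timer fires: one RPC covers both requests
```

Only the first buffered element sets the timer, so the batch is flushed at most `delay` after the oldest request arrived.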
User's view of your transform
On Timer
On Element
24
Some requests
(try to contain costs)
Events come in
(out of order, windowing specified)
Correct windowed output
(don't care how you got them)
input
.apply(Window.into( hours ))
.apply(new EnrichEvents())
Event time windowing still "just works"
25
Window into
Fixed windows of 1 hour
Window into
30 min sliding by 10 min
Key      Window        MEDIAN_IDLE  MAIN_ACTIVITY  ...
"kenn"   9am - 10am    10m          "hack"
         12pm - 1pm    25m          "eat"
         11pm - 12am   60m          "sleep"
"tgroh"  8am - 9am     20m          "bike"
         11am - 12pm   3m           "hack"
...      ...
State is per key and window
Bonus: automatically garbage collected when a window expires
(vs manual clearing of per-key state)
26
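A small plain-Python model of state that is scoped per key and window and dropped once the window expires. This is a sketch, not how any Beam runner stores state: `WindowedState` is a hypothetical class, windows are (start, end) hour pairs, and the model ignores allowed lateness by expiring a window as soon as the watermark reaches its end.

```python
class WindowedState:
    """State cells addressed by (key, window)."""

    def __init__(self):
        self.cells = {}  # (key, (start, end)) -> value

    def update(self, key, window, fn, default):
        """Apply `fn` to the cell for this key and window, creating it if needed."""
        cell = (key, window)
        self.cells[cell] = fn(self.cells.get(cell, default))
        return self.cells[cell]

    def expire(self, watermark):
        """Drop every cell whose window ended at or before the watermark."""
        expired = [cell for cell in self.cells if cell[1][1] <= watermark]
        for cell in expired:
            del self.cells[cell]
        return len(expired)

def bump(count):
    return count + 1

state = WindowedState()
state.update("kenn", (9, 10), bump, 0)
state.update("kenn", (9, 10), bump, 0)
state.update("kenn", (12, 13), bump, 0)
state.update("tgroh", (8, 9), bump, 0)
dropped = state.expire(watermark=10)
```

After the watermark passes 10am, the 8am-9am and 9am-10am cells are gone without any user code clearing them, while the 12pm-1pm cell for "kenn" remains.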
Unified present & historical processing
27
Same
input
data
Equivalent
results
What else can you do with state & timers
● Domain-specific triggering ("output when five people who live in Seattle have checked in")
● Slowly changing dimensions ("update FX rates for currency ABC")
● Stream joins ("join-matrix" / "join-biclique")
● Fine-grained aggregation ("add odd elements to accumulator A and even elements to accumulator B")
● Per-key workflows (like user sign up flow w/ reminders & expiration)
28
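The domain-specific triggering idea ("output when five people who live in Seattle have checked in") might be sketched like this in plain Python. The function `checkin_trigger`, the `(key, person, home)` event shape, and keying by venue are assumptions for illustration, not part of the talk or the Beam API.

```python
def checkin_trigger(events, city="Seattle", threshold=5):
    """Yield (key, names) each time `threshold` people from `city`
    have checked in under the same key."""
    state = {}  # per-key state: names buffered since the last output
    for key, person, home in events:
        if home != city:
            continue
        names = state.setdefault(key, [])
        names.append(person)
        if len(names) == threshold:
            yield key, list(names)
            state[key] = []  # clear state after emitting

events = (
    [("cafe", f"p{i}", "Seattle") for i in range(3)]
    + [("cafe", "q", "Portland")]
    + [("cafe", f"p{i}", "Seattle") for i in range(3, 6)]
)
outputs = list(checkin_trigger(events))
```

The output condition depends on the contents of the elements, not on time or element count alone, which is the kind of rule that is awkward to express with generic triggers.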
Summary
Stateful processing in Beam...
● … unlocks new use cases
● … is portable across data processing engines
● … works with event time windowing
● … works for present and historical data
29
Thank you for listening!
This talk:
● Me - @KennKnowles / kenn@apache.org
● These Slides - https://s.apache.org/stateful-beam-dataworks-sjc-2017
Go Deeper
● Design - https://s.apache.org/beam-state
● Blog - https://beam.apache.org/blog/2017/02/13/stateful-processing.html
Join the Beam community:
● User discussions - user@beam.apache.org
● Development discussions - dev@beam.apache.org
● Follow @ApacheBeam on Twitter
https://beam.apache.org
30
