With Apache Beam, you can process massive out-of-order streams (or standard batch use cases too) by defining high-level transformation pipelines that you can then run on a variety of backends, including Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.
This talk introduces a new feature of the Beam programming model: stateful processing with processing-time and event-time timers. This enhancement unlocks new use cases and efficiencies, such as:
- Microservice-like workflows ("register this user, remind them after a day, and expire their sign-up after a week")
- Customized output control ("only output when the signal has changed by more than 0.3")
- Carefully batched RPCs ("write as many items as possible at the same time, but no more than 500")
- Stream joins with custom output triggering ("join these two streams on an arbitrary join predicate with correct, exactly-once results")
In this talk, you will learn how to use Beam to develop complex, stateful pipelines to easily implement scenarios like the above, which you can finely tailor to your precise use case.
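To make the "customized output control" use case concrete, here is a minimal plain-Java sketch of the logic only. It is not the Beam API: the class `ChangeThresholdFilter`, the method `offer`, and the in-memory map standing in for per-key state are all hypothetical names chosen for illustration. The 0.3 threshold comes from the example above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch (not the Beam API): emit a value only when it moved by more than a threshold. */
final class ChangeThresholdFilter {
    private final double threshold;
    // Hypothetical stand-in for per-key state: the last value emitted for each key.
    private final Map<String, Double> lastEmitted = new HashMap<>();

    ChangeThresholdFilter(double threshold) {
        this.threshold = threshold;
    }

    /** Returns true (and remembers the value) if this value should be output. */
    boolean offer(String key, double value) {
        Double previous = lastEmitted.get(key);
        if (previous == null || Math.abs(value - previous) > threshold) {
            lastEmitted.put(key, value);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ChangeThresholdFilter filter = new ChangeThresholdFilter(0.3);
        double[] signal = {1.0, 1.1, 1.5, 1.6, 1.0};
        List<Double> emitted = new ArrayList<>();
        for (double value : signal) {
            if (filter.offer("sensor-1", value)) {
                emitted.add(value);
            }
        }
        System.out.println(emitted); // prints [1.0, 1.5, 1.0]
    }
}
```

The comparison is against the last emitted value rather than the last seen value, so slow drift still produces output once it accumulates past the threshold. That is one possible reading of "has changed by more than 0.3"; the other reading would store every value instead.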
9. Use cases for massive out-of-order streams
● Operations and manufacturing
● Mobile gaming
● Web analytics
● Wearables
● Automotive
● Power grid
● Network monitoring
● (Mobile) banking
… anything processing "events that happen"
(you can also process things that aren't events; just use fewer features)
16. The Beam Model
What are you computing? (read, map, reduce)
Where in event time? (event time windowing)
When in processing time are results produced? (triggers)
How do refinements relate? (accumulation mode)
The focus of today
17. Per element ParDo (Map, etc)
Every item processed independently
Stateless implementation
18. Per key Combine (Reduce, etc)
Items grouped by some key and combined
Stateful streaming implementation (buffering until trigger)
But your code doesn't work with state, just an associative & commutative function
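Why associativity and commutativity matter can be shown with a small plain-Java sketch (not Beam API; `CombineSketch` and its methods are hypothetical names). A sum can be computed from partial sums merged in any grouping, so the engine is free to buffer and combine however it likes. A median, used here as an illustrative counterexample, cannot be rebuilt from partial medians.

```java
import java.util.Arrays;

/** Plain-Java illustration (not Beam API): which aggregations can be merged from partial results. */
final class CombineSketch {
    static double sum(double... values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    static double median(double... values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("median of an empty input is undefined");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        double[] partA = {1, 2, 9};
        double[] partB = {3, 4};
        double[] all = {1, 2, 9, 3, 4};

        // Sum of partial sums equals the sum of everything.
        System.out.println(sum(sum(partA), sum(partB)) == sum(all)); // true

        // Median of partial medians (2.75) differs from the true median (3.0).
        System.out.println(median(median(partA), median(partB)) == median(all)); // false
    }
}
```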
19. It "just works" with massive out-of-order streams
ParDo, Map, etc. Combine, Reduce, etc.
"Parse incoming events and filter out bad data"
"Sum per hour and output when you have the whole hour"
"Put events in 10 minute windows sliding every 2 minutes"
"Group into sessions and emit as fast as possible"
20. But what if you need more control?
ParDo, Map, etc. Combine, Reduce, etc.
"I need some state on the side to tweak my FlatMap's behavior"
"My aggregation is not an associative & commutative operator"
"Triggers aren't specific enough for my use case"
"I need to output even when data isn't coming in"
22. What if you need more control?
ParDo, Map, etc.
Combine, Reduce, etc.
ProcessFunction
MapWithState
Operator
… that "just works" with out-of-order events
… is portable across engines
Timers
State
State & timers for ParDo!
24. User's view of your transform
On Timer
On Element
Some requests
(try to contain costs)
Events come in
(out of order, windowing specified)
Correct windowed output
(don't care how you got them)
input
.apply(Window.into( hours ))
.apply(new EnrichEvents())
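The "On Element" / "On Timer" split inside a transform like this can be sketched in plain Java. This is not the Beam state and timer API: `BatchingBuffer`, the `rpc` callback, and the in-memory list standing in for state are hypothetical, and the limit of 500 is borrowed from the batched-RPC example in the abstract. The idea is that elements are buffered as they arrive, a full buffer triggers a request immediately, and a timer flushes whatever is left.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Plain-Java sketch (not the Beam API) of buffering elements to keep the number of requests low. */
final class BatchingBuffer<T> {
    private final int maxBatchSize;
    // Hypothetical stand-in for the external service call.
    private final Consumer<List<T>> rpc;
    // Hypothetical stand-in for per-key-and-window state.
    private final List<T> buffer = new ArrayList<>();

    BatchingBuffer(int maxBatchSize, Consumer<List<T>> rpc) {
        this.maxBatchSize = maxBatchSize;
        this.rpc = rpc;
    }

    /** "On Element": buffer the element and send a request once the batch is full. */
    void onElement(T element) {
        buffer.add(element);
        if (buffer.size() >= maxBatchSize) {
            flush();
        }
    }

    /** "On Timer": send whatever is buffered, for example when a deadline passes. */
    void onTimer() {
        flush();
    }

    private void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        rpc.accept(new ArrayList<>(buffer)); // pass a copy so the callback never sees later mutations
        buffer.clear();
    }

    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        BatchingBuffer<Integer> batcher = new BatchingBuffer<>(500, batch -> batchSizes.add(batch.size()));
        for (int i = 0; i < 1200; i++) {
            batcher.onElement(i);
        }
        batcher.onTimer();
        System.out.println(batchSizes); // prints [500, 500, 200]
    }
}
```

Callers of the transform see only the enriched output; how many requests were made, and when, stays an internal detail.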
25. Event time windowing still "just works"
Window into fixed windows of 1 hour
Window into 30 min sliding by 10 min
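The arithmetic behind these two windowing choices can be sketched in plain Java. This is an illustration under assumptions, not Beam's implementation: `WindowAssignmentSketch` and its methods are hypothetical names, windows are assumed to be half-open intervals [start, start + size) aligned to the Unix epoch, and the timestamp is arbitrary.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch (not Beam's implementation) of assigning an event timestamp to windows. */
final class WindowAssignmentSketch {
    /** Start of the single fixed window containing the timestamp. */
    static Instant fixedWindowStart(Instant timestamp, Duration size) {
        long t = timestamp.toEpochMilli();
        return Instant.ofEpochMilli(t - Math.floorMod(t, size.toMillis()));
    }

    /** Starts of every sliding window containing the timestamp, latest window first. */
    static List<Instant> slidingWindowStarts(Instant timestamp, Duration size, Duration period) {
        long t = timestamp.toEpochMilli();
        long sizeMs = size.toMillis();
        long periodMs = period.toMillis();
        long latestStart = t - Math.floorMod(t, periodMs);
        List<Instant> starts = new ArrayList<>();
        // A window [start, start + size) contains t exactly when start <= t and start > t - size.
        for (long start = latestStart; start > t - sizeMs; start -= periodMs) {
            starts.add(Instant.ofEpochMilli(start));
        }
        return starts;
    }

    public static void main(String[] args) {
        Instant event = Instant.parse("2017-01-01T09:25:00Z");
        // One fixed window: the one starting at 09:00.
        System.out.println(fixedWindowStart(event, Duration.ofHours(1)));
        // Three sliding windows: those starting at 09:20, 09:10, and 09:00.
        System.out.println(slidingWindowStarts(event, Duration.ofMinutes(30), Duration.ofMinutes(10)));
    }
}
```

With 30-minute windows sliding every 10 minutes, each element lands in three windows, which is why state kept per key and window is replicated across overlapping windows.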
26. State is per key and window
Key      Window        MEDIAN_IDLE  MAIN_ACTIVITY  ...
"kenn"   9am - 10am    10m          "hack"
         12pm - 1pm    25m          "eat"
         11pm - 12am   60m          "sleep"
"tgroh"  8am - 9am     20m          "bike"
         11am - 12pm   3m           "hack"
...      ...
Bonus: automatically garbage collected when a window expires
(vs manual clearing of per-key state)
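A rough plain-Java model of "state is per key and window" follows. It is a simplification for illustration, not how any Beam runner stores state: `KeyedWindowState`, `Cell`, and `expireUpTo` are hypothetical names, a window is identified only by its end timestamp, and expiry is modeled as "window end at or before a given watermark" with no allowance for late data.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Plain-Java model (not a Beam runner): one state cell per (key, window), dropped on window expiry. */
final class KeyedWindowState {
    /** Address of a state cell: the user key plus the end of the window, in epoch millis. */
    static final class Cell {
        final String key;
        final long windowEndMillis;

        Cell(String key, long windowEndMillis) {
            this.key = key;
            this.windowEndMillis = windowEndMillis;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Cell)) {
                return false;
            }
            Cell that = (Cell) other;
            return key.equals(that.key) && windowEndMillis == that.windowEndMillis;
        }

        @Override
        public int hashCode() {
            return Objects.hash(key, windowEndMillis);
        }
    }

    private final Map<Cell, Long> counts = new HashMap<>();

    /** Increments the counter held in the cell for this key and window. */
    void increment(String key, long windowEndMillis) {
        counts.merge(new Cell(key, windowEndMillis), 1L, Long::sum);
    }

    long get(String key, long windowEndMillis) {
        return counts.getOrDefault(new Cell(key, windowEndMillis), 0L);
    }

    /** Garbage-collects every cell whose window ended at or before the watermark. */
    void expireUpTo(long watermarkMillis) {
        counts.keySet().removeIf(cell -> cell.windowEndMillis <= watermarkMillis);
    }

    int size() {
        return counts.size();
    }

    public static void main(String[] args) {
        KeyedWindowState state = new KeyedWindowState();
        state.increment("kenn", 1000L);
        state.increment("kenn", 2000L);
        state.increment("tgroh", 1000L);
        state.expireUpTo(1000L);
        System.out.println(state.size()); // prints 1
    }
}
```

The user code never calls anything like `expireUpTo` itself; in the model described on the slide, that cleanup is the system's job once a window expires.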
27. Unified present & historical processing
Same input data
Equivalent results
28. What else can you do with state & timers?
● Domain-specific triggering ("output when five people who live in Seattle have checked in")
● Slowly changing dimensions ("update FX rates for currency ABC")
● Stream joins ("join-matrix" / "join-biclique")
● Fine-grained aggregation ("add odd elements to accumulator A and even elements to accumulator B")
● Per-key workflows (like user sign up flow w/ reminders & expiration)
29. Summary
Stateful processing in Beam...
● … unlocks new use cases
● … is portable across data processing engines
● … works with event time windowing
● … works for present and historical data
30. Thank you for listening!
This talk:
● Me - @KennKnowles / kenn@apache.org
● These Slides - https://s.apache.org/stateful-beam-dataworks-sjc-2017
Go Deeper
● Design - https://s.apache.org/beam-state
● Blog - https://beam.apache.org/blog/2017/02/13/stateful-processing.html
Join the Beam community:
● User discussions - user@beam.apache.org
● Development discussions - dev@beam.apache.org
● Follow @ApacheBeam on Twitter
https://beam.apache.org