In this talk I will present the architecture that allows runners to execute a Beam pipeline. I will explain what needs to happen in order for a compatible runner to know which transforms to run, how to pass data from one step to the next, and how Beam allows runners to be SDK-agnostic when running pipelines.
How a Beam runner executes a pipeline. Apache Beam Summit London 2018
1. How a Beam runner executes a pipeline
Javier Ramirez (@supercoco9)
Head of Engineering @teamdatatonic
2018-10-02
2. ▪ Why do I care?
▪ Pipeline basics
▪ Runner Overview
▪ Exploring the graph
▪ Implementing PCollections
▪ Watermark Propagation
▪ Implementing PTransforms: Read, ParDo, GroupByKey, Window, and Flatten
▪ Optimising the execution plan
▪ Persisting state (coders and snapshots)
▪ FnApi and RunnerApi for SDK independence
3. Why
▪ I started using the Dataflow private alpha when it was “just” a serverless runner
▪ Then Beam was born as a common layer on top of multiple runners
▪ I wanted to understand what is part of Beam and what’s part of the runner
▪ It might help you choose the right runner for the job
4. Pipeline overview
■ Write pipeline code in Java, Python, or Go
■ The abstraction is a Directed Acyclic Graph (DAG) where nodes are transforms and edges are data flowing as
PCollections.
■ Both PTransforms and PCollections can be distributed and parallelised, and the model is fault-tolerant, so
they need to be serializable to be sent across workers
■ Read data from one or more inputs, bounded or unbounded
■ Apply transforms, stateless or stateful
■ Write data to one or more outputs
■ Optionally, keep track of metrics
5. Runner overview
“Beam-compatible” is a very flexible claim
▪ Can choose to support only some languages (The portability API will change this)
▪ Can choose to support only batch or streaming processing
▪ Can choose to what extent to support early triggers and late data, refinements, state…
▪ Needs to translate from Beam code to runner-native code
▪ Is responsible for submitting and monitoring the pipeline
▪ Must serialize/deserialize data and functions across workers and stages
▪ Is responsible for performance, scalability, optimisations, and enforcing the Beam
model guarantees (some methods will be called exactly once, a transform will not be
executed by more than one thread at a time within a worker, if a bundle of data is
processed by a transform more than once, it will not generate duplicates…)
6. Runner entrypoint: exploring the DAG
■ Beam provides a method to traverse (visit) the graph. Runners need to walk the graph to:
■ Validate the pipeline
■ Get insights to choose the best execution strategy
❏ Example: Spark Runner
❏ Chooses between the batch and streaming engines by visiting the graph and checking whether any
PCollection is unbounded
❏ Detects PCollections that are used in more than one transform, and creates internal caches to store
those collections
■ Translate the Beam transforms into native transforms
■ Optimise the graph execution (to minimize serialization and shuffling)
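That first pass can be sketched in plain Python. This is a toy graph model, not the real Beam visitor API: `PColl`, `Transform`, and `visit` are made-up names, and a real runner would walk the pipeline with the SDK's own traversal hooks.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PColl:
    """Toy stand-in for a PCollection: just a name and a boundedness flag."""
    name: str
    bounded: bool = True


@dataclass
class Transform:
    """Toy stand-in for a PTransform node in the DAG."""
    name: str
    inputs: list
    outputs: list


def visit(transforms):
    """One pass over the graph, mimicking what a runner's visitor collects:
    whether any input is unbounded (batch vs. streaming) and which
    PCollections are consumed more than once (candidates for caching)."""
    consumers = Counter()
    any_unbounded = False
    for t in transforms:
        for pc in t.inputs:
            consumers[pc.name] += 1
            any_unbounded |= not pc.bounded
    mode = "streaming" if any_unbounded else "batch"
    to_cache = {name for name, n in consumers.items() if n > 1}
    return mode, to_cache
```

A PCollection read by two transforms ends up in `to_cache`, and a single unbounded input flips the whole pipeline to streaming mode, mirroring the Spark runner behaviour described above.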
7. Implementing PCollections
■ Unordered bags of elements
■ Might be bounded or unbounded
■ All the elements are of the same type and the PCollection has a coder to serialize/deserialize
elements
■ Every element will always have
■ A Timestamp (might be negative infinity if not important)
■ A Window, which is initially the global window, but can be changed via transforms
■ Every PCollection has a watermark estimating how complete it is
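The per-element invariants above can be illustrated with a toy value wrapper. The Beam SDKs do have a `WindowedValue` type for this purpose, but the sketch below is a simplified stand-in, not its real API:

```python
from dataclasses import dataclass
from typing import Any, Tuple

# "negative infinity if not important" for the timestamp
MIN_TIMESTAMP = float("-inf")
# a single window spanning all time; elements start here
GLOBAL_WINDOW = ("GLOBAL",)


@dataclass(frozen=True)
class WindowedValue:
    """Toy element representation: every element always carries a
    timestamp and at least one window, in addition to its value."""
    value: Any
    timestamp: float = MIN_TIMESTAMP
    windows: Tuple = (GLOBAL_WINDOW,)


# a freshly read element defaults to -inf and the global window
wv = WindowedValue("hello")
```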
9. Implementing PTransforms
■ Beam can do pretty complex things with just a few primitives
■ Read
■ Flatten
■ Window
■ GroupByKey
■ ParDo
10. Implementing Read
■ Read can be bounded or unbounded.
■ The runner calls split on the source to break it into bundles
■ The runner gets a reader object per bundle and then drives the read loop
■ If supported, the runner can call splitAtFraction to enable dynamic work rebalancing
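Bundle splitting can be shown with a toy function. This is not the real `BoundedSource.split` signature; it just illustrates cutting a record range into contiguous bundles that workers can read in parallel:

```python
def split_into_bundles(num_records, desired_bundle_size):
    """Toy analogue of splitting a bounded source: cut the range
    [0, num_records) into contiguous (start, end) bundles."""
    bundles = []
    start = 0
    while start < num_records:
        end = min(start + desired_bundle_size, num_records)
        bundles.append((start, end))
        start = end
    return bundles
```

Each bundle then gets its own reader; `splitAtFraction` would further cut a bundle that is still in flight.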
11. Implementing unbounded Read
■ The general pattern is the same for both bounded and unbounded, but unbounded sources also
■ Report a watermark that the runner needs to associate with the elements and propagate downstream
■ Provide record IDs in case we need to use deduplication to enforce exactly-once processing
■ Support checkpointing. The runner can get information about the current checkpoint on the stream,
and can call finalizeCheckpoint to tell the unbounded source the elements are safe in the pipeline
and can be acknowledged from the stream if needed
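The record-ID point can be made concrete with a toy deduplication step (hypothetical names, not a Beam API). A source with at-least-once delivery may redeliver elements after a retry, so the runner keeps the IDs it has already seen:

```python
def deduplicate(records, seen=None):
    """Toy record-ID dedup a runner might apply after an unbounded Read.
    `records` is an iterable of (record_id, value); redelivered IDs are
    dropped so downstream transforms see each element once."""
    seen = set() if seen is None else seen
    out = []
    for record_id, value in records:
        if record_id not in seen:
            seen.add(record_id)
            out.append(value)
    return out
```

Passing the same `seen` set across bundles is what makes the dedup survive retries; a real runner would keep it in fault-tolerant state.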
12. Implementing Flatten
■ The runner only needs to verify that the windowing strategies of all the PCollections to be
flattened are compatible
■ The result is a single PCollection containing all the elements and windows of the input
PCollections without any changes
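Both halves of that contract fit in a few lines of toy Python. Here a PCollection is modelled as a `(windowing_strategy, elements)` pair; the names are made up for illustration:

```python
def flatten(pcollections):
    """Toy Flatten: check all inputs share a windowing strategy, then
    concatenate their elements unchanged."""
    strategies = {strategy for strategy, _ in pcollections}
    if len(strategies) > 1:
        raise ValueError("incompatible windowing strategies: %s" % strategies)
    out = []
    for _, elements in pcollections:
        out.extend(elements)
    return out
```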
13. Implementing Window
■ Window is just a grouping key with a maximum timestamp
■ One element can be conceptually in one window only. If you need to assign an element to
multiple windows, it counts as multiple elements from Beam’s point of view.
■ The runner may choose to use a physical representation where one element appears to be
assigned to multiple windows for storage efficiency, but it maps conceptually to multiple
elements
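Sliding windows are the classic case where one logical element lands in several windows and therefore counts as several elements. A toy assignment function (hypothetical, not the Beam `WindowFn` API) makes this visible:

```python
def assign_sliding_windows(timestamp, size, period):
    """Toy sliding-window assignment: return every (start, end) window
    containing `timestamp`. An element assigned to n windows behaves as
    n elements in the Beam model."""
    windows = []
    # latest window whose start is aligned to `period` and <= timestamp
    start = (timestamp // period) * period
    # walk backwards while the window [start, start + size) still covers it
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return windows
```

With `size=10` and `period=5`, every timestamp falls into two windows, i.e. each element is duplicated into two grouping keys.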
14. Implementing GroupByKey
■ GroupByKey groups a PCollection of key-value pairs by Key and Window
■ GroupByKey will emit results only when window triggers allow it, and should automatically drop
expired elements
■ Since GroupByKey is closely related to windows, it needs to be able to merge windows when
requested, for example to support session windows
■ GroupByKey needs to choose the timestamp to emit with the results
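The core of the transform is grouping by the (key, window) pair, not by key alone. A toy sketch (made-up names; the timestamp choice here is end-of-window, one common timestamp-combiner setting, stated as an assumption):

```python
from collections import defaultdict


def group_by_key_and_window(elements):
    """Toy GroupByKey: `elements` are (key, window, value) triples with
    window = (start, end). Values are grouped per (key, window), and the
    output timestamp is chosen as the window's end."""
    groups = defaultdict(list)
    for key, window, value in elements:
        groups[(key, window)].append(value)
    return {
        (key, window): (sorted(values), window[1])  # (grouped values, emit ts)
        for (key, window), values in groups.items()
    }
```

Note that the same key in two different windows produces two separate groups, which is exactly why triggers and window expiry apply per window.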
15. Implementing ParDo
■ Conceptually simple:
■ Setup is called once per instance of the ParDo
■ The runner decides on the bundle size (some runners allow user control)
■ It calls startBundle once per bundle
■ It calls processElement once per element
■ If we are using timely processing, it calls onTimer for each timer activation
■ It finishes by calling finishBundle
■ If an element fails, the whole bundle is retried
■ Teardown is called to release ParDo resources
■ Under the hood, the runner needs to take into account that ParDos can be stateful and can have side
inputs. In those cases the runner is responsible for keeping and propagating state, and for
materialising the side inputs
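The lifecycle above (minus timers and side inputs) can be written down as a toy driver loop. `ToyDoFnRunner` is a made-up name, not Beam's real DoFn runner:

```python
class ToyDoFnRunner:
    """Toy sketch of the ParDo lifecycle a runner drives: setup once per
    instance, startBundle/finishBundle per bundle, processElement per
    element, teardown at the end. `calls` records the order of calls."""

    def __init__(self, process):
        self.process = process  # the user's per-element function
        self.calls = []

    def run(self, bundles):
        self.calls.append("setup")
        results = []
        for bundle in bundles:
            self.calls.append("startBundle")
            for element in bundle:
                self.calls.append("processElement")
                results.append(self.process(element))
            self.calls.append("finishBundle")
        self.calls.append("teardown")
        return results
```

The bundle being the retry unit also falls out of this shape: if `processElement` throws, the runner discards and replays the whole bundle, not just the failing element.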
16. Optimising the DAG execution
■ Two levels of optimisation
■ Execution plan (supported by Beam)
■ Intermediate data materialisation (Depends on the Runner)
17. Optimising the Execution Plan
■ Fusion and combine aggregation are core concepts behind the Dataflow/Beam model
■ The Java SDK provides core helpers to deal with this
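Producer-consumer fusion can be reduced to a toy example over element-wise functions (real runners fuse whole stages of transforms, but the idea is the same): composing adjacent steps means elements never have to be materialised or shuffled between them.

```python
def fuse(stages):
    """Toy producer-consumer fusion: compose a list of element-wise
    functions into one fused stage, so intermediate results stay
    in-process instead of being serialised between steps."""
    if not stages:
        return lambda x: x  # empty pipeline: identity
    def fused(x, fns=tuple(stages)):
        for fn in fns:
            x = fn(x)
        return x
    return fused
```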
18. Intermediate data materialisation
■ What to do if one transform fails downstream and we need to reprocess the data?
■ With some sources (like a static file) we could potentially replay the whole data and retry. Not
deterministic or fast, but it would work
■ With unbounded sources (or with bounded sources with changing data), we might not be
able to replay the whole data, so we need to have some way of
“materialising/checkpointing/snapshotting” the data
❏ Flink and IBM Streams: distributed snapshots
❏ Samza: incremental checkpoints
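Coders are what make any of these schemes possible: state must be turned into bytes before it can be snapshotted and restored. A toy sketch, with `pickle` standing in for a real Beam coder (all names hypothetical):

```python
import pickle


class PickleCoder:
    """Toy coder: encodes in-flight state to bytes so a runner can
    snapshot it and restore it after a failure. Real Beam coders are
    deterministic, versioned, and language-portable; pickle is neither,
    it just illustrates the encode/decode contract."""

    def encode(self, value):
        return pickle.dumps(value)

    def decode(self, data):
        return pickle.loads(data)


def snapshot(state, coder):
    """Materialise state as an opaque blob (e.g. to durable storage)."""
    return coder.encode(state)


def restore(blob, coder):
    """Rebuild state from a previously taken snapshot."""
    return coder.decode(blob)
```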
20. SDK independent runners
■ Recap: Executing a user's pipeline can be broken up into the following categories:
■ Perform any grouping and triggering logic including keeping track of a watermark, window merging,
propagating status...
■ Pipeline execution management such as polling for counter information, job status, …
■ Execute user definable functions (UDFs) including custom sources/sinks, regular DoFns...
■ Executing UDFs is the only category which requires a specific language-specific SDK context to execute in.
Moving the execution of UDFs to language-specific SDK harnesses and using an RPC service between the
two allows for a cross language/cross Runner portability story.
21. SDK independent runners: RunnerAPI & FnAPI
■ The harness is a Docker container able to run the language-specific parts of the pipeline. The
Runner is responsible for launching and managing the container. Communication between
Runner and harness is via the FnApi, implemented over gRPC