Flink currently features different APIs for bounded/batch (DataSet) and streaming (DataStream) programs. While the DataStream API can handle batch use cases, it is much less efficient at them than the DataSet API. The Table API was built as a unified API on top of both, covering batch and streaming with the same API and delegating under the hood to either DataSet or DataStream.
In this talk, we present the latest on the Flink community's efforts to rework the APIs and the stack for a better unified batch & streaming experience. We will discuss:
- The future roles and interplay of DataSet, DataStream, and Table API
- The new Flink stack and the abstractions on which these APIs will build
- The new unified batch/streaming sources
- How batch and streaming optimizations differ in the runtime, and what the future interplay of batch and streaming execution could look like.
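To make the unification concrete, here is a minimal sketch (not from the talk) of how a single Table API query can run in either batch or streaming mode simply by switching the environment settings; the "clicks" table is a hypothetical placeholder:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class UnifiedTableJob {
        public static void main(String[] args) {
            // Flip inStreamingMode() to inBatchMode(); the query below stays the same.
            EnvironmentSettings settings =
                    EnvironmentSettings.newInstance().inStreamingMode().build();
            TableEnvironment tEnv = TableEnvironment.create(settings);

            // "clicks" stands for any registered table, bounded or unbounded.
            tEnv.sqlQuery("SELECT user_id, COUNT(*) AS cnt FROM clicks GROUP BY user_id")
                .execute()
                .print();
        }
    }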
Virtual Flink Forward 2020: Integrate Flink with Kubernetes natively - Yang Wang (Flink Forward)
Currently Flink supports the resource management systems YARN and Mesos. However, these were not designed for fast-moving cloud-native architectures and do not support mixed workloads (e.g. batch, streaming, deep learning, web services) particularly well. At the same time, Kubernetes is evolving very fast to fill those gaps and has become the de-facto orchestration framework, so running Flink on Kubernetes is a basic requirement for many users. In this talk, we will first quickly go through the Kubernetes architecture and the efforts we have made to run Flink on Kubernetes. Then we will deep-dive into the technical details of how to make Flink run natively on Kubernetes. Native means that Flink's KubernetesResourceManager calls the Kubernetes APIs directly to allocate and release TaskManager pods. Next we will share some practices for application lifecycle management and production optimizations (e.g. high availability, storage, networking). Finally, we will conclude the talk with the advantages of running Flink on Kubernetes and a simple demo. This talk is aimed at users and companies who are looking to run Flink on a Kubernetes cluster. We assume that listeners have some basic knowledge of cluster orchestration and containers.
At Yelp we run hundreds of Flink jobs to power a wide range of applications: push notifications, data replication, ETL, sessionizing, and more. Routine operations like deploys, restarts, and savepointing for so many jobs would take quite a bit of developers’ time without the right degree of automation. The latest addition to our toolshed is a Kubernetes operator managing the deployment and the lifetime of Flink clusters on PaaSTA, Yelp’s Platform As A Service.
We replaced our deployment framework, which launched Flink clusters on top of AWS EMR, with a Kubernetes operator managing fully Docker-ized Flink clusters. Compared to EMR, this architecture has allowed us both to drastically reduce the deployment time of our Flink clusters and to share our hardware resources more efficiently. In addition, we now offer our developers the same interface they are used to for running REST services, batch jobs, and many other workloads on PaaSTA.
This talk will give a brief overview of Yelp’s PaaSTA before diving into the details of how the Kubernetes operator has been implemented and how it has been integrated with Yelp developers’ workflow (deploys, logs, savepoints, upgrades, etc.), ending with a glimpse of the future features we are planning for the operator (Flink as a library, autoscaling, etc.).
We present FLOW, a web service that lets users do FLink On Web. FLOW aims to minimize the effort of hand-writing streaming applications; similar in spirit to Hortonworks Stream Analytics Manager, StreamAnalytix, and Nussknacker, it lets users drag and drop graphical icons representing streaming operators on a GUI.
FLOW builds on the Flink Table API and lets users assemble graphical icons associated with not only basic SQL operations but also advanced ones like window aggregation, temporal join, and pattern recognition (the MATCH_RECOGNIZE clause). Its data preview function lets users observe on screen how sample data changes before and after applying each operation. In addition, FLOW shows the sample data as time-series charts and geographical maps by interacting with Elasticsearch and Kibana. Therefore, domain experts with basic knowledge of SQL can design their streaming applications easily on a GUI, without an understanding of the Flink DataStream API or the Flink CEP library.
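As a rough illustration of the kind of Table API window aggregation FLOW could generate behind the scenes (a sketch under assumptions, not FLOW's actual output; the "rides" table and its event-time attribute "rideTime" are hypothetical):

    import static org.apache.flink.table.api.Expressions.$;
    import static org.apache.flink.table.api.Expressions.lit;

    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.api.Tumble;

    public class PopularPlaces {
        // Counts rides per area over 15-minute tumbling event-time windows.
        public static Table popularPlaces(TableEnvironment tEnv) {
            return tEnv.from("rides")
                    .window(Tumble.over(lit(15).minutes()).on($("rideTime")).as("w"))
                    .groupBy($("area"), $("w"))
                    .select($("area"), $("w").end().as("windowEnd"), $("area").count().as("cnt"));
        }
    }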
In this talk, we first present what motivated the development of FLOW, then show how FLOW can be used to solve the "Popular Places" exercise in its own style, and lastly explain how FLOW leverages the Flink Table API.
Flink Connector Development Tips & Tricks (Eron Wright)
A look at some of the challenges and techniques for developing a connector for Apache Flink, covering the different types of connectors, lifecycle, metrics, event-time support, and fault tolerance.
Presentation video: https://www.youtube.com/watch?v=ZkbYO5S4z18
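As a small taste of the lifecycle and fault-tolerance topics, here is a minimal checkpointed source, closely modeled on the counter example in the Flink documentation:

    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.runtime.state.FunctionInitializationContext;
    import org.apache.flink.runtime.state.FunctionSnapshotContext;
    import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
    import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

    // Emits an increasing counter and checkpoints its offset so a restart
    // resumes where it left off.
    public class CounterSource extends RichSourceFunction<Long> implements CheckpointedFunction {
        private volatile boolean running = true;
        private long offset = 0L;
        private ListState<Long> offsetState;

        @Override
        public void run(SourceContext<Long> ctx) throws Exception {
            while (running) {
                // Emit and advance the offset under the checkpoint lock for consistency.
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(offset);
                    offset++;
                }
            }
        }

        @Override
        public void cancel() { running = false; }

        @Override
        public void snapshotState(FunctionSnapshotContext context) throws Exception {
            offsetState.clear();
            offsetState.add(offset);
        }

        @Override
        public void initializeState(FunctionInitializationContext context) throws Exception {
            offsetState = context.getOperatorStateStore()
                    .getListState(new ListStateDescriptor<>("offset", Long.class));
            for (Long l : offsetState.get()) {
                offset = l;
            }
        }
    }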
Flink Forward San Francisco 2018: Andrew Gao & Jeff Sharpe - "Finding Bad Ac..." (Flink Forward)
Within fintech, catching fraudsters is one of the primary opportunities for us to use streaming applications to apply ML models in real time. This talk is a review of our journey to bring fraud decisioning to our tellers at Capital One using Kafka, Flink, and AWS Lambda. We will share our learnings and experiences with common problems such as custom windowing, breaking a monolithic app down into small queryable-state apps, feature engineering with Jython, dealing with back pressure from combining two disparate streams, model/feature validation in a regulatory environment, and running Flink jobs on Kubernetes.
Flink Forward San Francisco 2018: Dave Torok & Sameer Wadkar - "Embedding Fl...Flink Forward
Operationalizing machine learning models is never easy. Our team at Comcast has been challenged with operationalizing predictive ML models to improve customer care experiences. Using Apache Flink, we have been able to apply real-time streaming to all aspects of the machine learning lifecycle. This includes data feature exploration and preparation by data scientists, deploying live models to serve near-real-time predictions, and validating results for model retraining and iteration. We will share best practices and lessons learned from Flink’s role in our operationalized lifecycle, including:
• Executing as the “Prediction Pipeline” – a model container environment for near-real-time streaming and batch predictions
• Preparing streaming features and data sets for model training, as input for production model predictions, and for a continually-updated customer context
• Using connected streams and savepoints for “Live in the Dark”, multi-variant testing, and validation scenarios
• Incorporating Flink’s Queryable State as an approach to the online “Feature Store” – a data catalog for reuse by multiple models and use cases (a client-side sketch follows this list)
• Enabling versioned models, versioned feature sets, and versioned data through DevOps approaches.
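A minimal sketch of what a client-side lookup against such a Queryable State "Feature Store" might look like; the host, job id, state names, and types are hypothetical placeholders:

    import java.util.concurrent.CompletableFuture;

    import org.apache.flink.api.common.JobID;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.queryablestate.client.QueryableStateClient;

    public class FeatureStoreLookup {
        public static Double lookup(String jobIdHex, String customerId) throws Exception {
            // 9069 is the default port of Flink's queryable-state proxy.
            QueryableStateClient client = new QueryableStateClient("proxy-host", 9069);

            ValueStateDescriptor<Double> descriptor =
                    new ValueStateDescriptor<>("feature", Types.DOUBLE);

            // "feature-store" must match the name under which the job exposed its state.
            CompletableFuture<ValueState<Double>> future = client.getKvState(
                    JobID.fromHexString(jobIdHex), "feature-store",
                    customerId, Types.STRING, descriptor);

            return future.get().value();
        }
    }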
Virtual Flink Forward 2020: Keynote: The Evolution of Data Infrastructure at ... (Flink Forward)
Over the past few years almost all data processing has moved from batch to stream processing. This isn’t simply driven by a desire for lower latency, but by a fundamental understanding that streams are a more effective primitive for data processing, providing a better impedance match to varied downstream systems and services. Splunk, like many others, has been evolving its core data infrastructure to better provide a simpler and more consistent programming model, address correctness and latency of data, and allow for a more open integration model with our data platform. Throughout this process, we’ve come to view Apache Flink as a critical backbone in our core data infrastructure. Join us to learn more about how our data infrastructure - and how we think about it - has fundamentally changed.
Virtual Flink Forward 2020: Everything is connected: How watermarking, scalin... (Flink Forward)
Stream storage enables ingestion and permanent storage of continuously generated data. Implementing such a system correctly is an engineering endeavor. In this presentation, we share our experience designing and implementing core features of our stream store, Pravega, necessary to implement a complete and effective connector for Apache Flink. Pravega is a system for storing unbounded amounts of data per stream, while guaranteeing strong delivery semantics. It supports features including watermarking, transactions, scaling of streams in response to workload changes, and recent and historical data via a common API. While designing and implementing such features of Pravega, we soon realized that they look orthogonal on the surface, but actually interrelate with each other in intricate ways. For example, when using transactions to guarantee exactly-once semantics, time used for watermarks can only advance upon commit. As another example, scaling streams changes the degree of parallelism of streams dynamically, which directly impacts the context of transactions, the coordination of checkpoints, and the time bounds exposed to support watermarks. All these mechanisms need to be aware of dynamic changes to streams. The lessons we learned from Pravega and share in this presentation are fundamental, and we expect them to be broadly applicable and of interest to a stream processing audience. Pravega is a fully open-source project and welcomes external contributions.
Flink Forward San Francisco 2018: Jörg Schad and Biswajit Das - "Operating Fl..." (Flink Forward)
Flink has officially supported Apache Mesos since the 1.2 release, and many users were using them together even before that. The latest releases, 1.4 and 1.5 (not yet released at the time of writing), add a deeper integration for resource schedulers such as Mesos, which has also resulted in many new features around this integration. But what does that mean in practice for operating a large cluster? In this talk, we will discuss operational best practices (alongside some pitfalls) for operating large Flink clusters on top of Apache Mesos, including topics such as:
- Deployments
- Monitoring
- Scaling
- Upgrades
- Debugging
Distributed stream processing is evolving from a technology in the sidelines of Big Data to a key enabler for businesses to provide more scalable, real-time services to their customers. We at Ververica, the company founded by the original creators of Apache Flink, and other prominent players in the Flink community have witnessed this development from the driver’s seat. Working with our customers and the wider community, we have seen great success stories and we have seen things go wrong. In this talk, I would like to share anecdotes and hard-learned lessons of adopting distributed stream processing – Apache Flink-specific as well as across frameworks. Afterwards, you will know how not to model your use cases as a stream processing application, which data structures not to use, how not to deal with failure, how not to approach the topic of monitoring, and much more.
Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ... (Flink Forward)
In this session, we will look at how Apache Flink can be used to stream anonymized API request and response data from a production environment to make sure staging environments are up-to-date and reflect the most recent features (and bugs) that comprise a service. The talk will also examine how to deal with issues of data retention, throttling, and persistence, finishing with recommendations for how to use these sandbox environments to rapidly prototype and test new features and fixes.
Flink Forward San Francisco 2018: David Reniz & Dahyr Vergara - "Real-time m..." (Flink Forward)
“Customer experience is the next big battleground for telcos,” Amit Akhelikar, Global Director of Lynx Analytics, recently proclaimed at TM Forum Live! Asia in Singapore. But how to fight in this battle? A common approach has been to keep “under control” some well-known network quality indicators, like dropped calls, radio access congestion, availability, and so on; but this has proven not to be enough to keep customers happy, just as a siege weapon is not enough to conquer a city. But what if it were possible to know how customers perceive services, at least the most demanded ones, like web browsing or video streaming? That would be like a squad of archers ready for battle. And even with that, how to extract value from it and take action in no time, giving our skilled archers the right targets? Meet CANVAS (Customer And Network Visualization and AnalyticS), one of the first LATAM implementations of a Flink-based stream processing use case for a telco, which successfully combines leading and innovative technologies like Apache Hadoop, YARN, Kafka, NiFi, Druid, and advanced visualizations with Flink core features like non-trivial stateful stream processing (joins, windows, and aggregations on event time) and CEP capabilities for alarm generation, delivering a next-generation tool for SOC (Service Operation Center) teams.
Deploying Flink on Kubernetes - David Anderson (Ververica)
Kubernetes has rapidly established itself as the de facto standard for orchestrating containerized infrastructures. And with the recent completion of FLIP-6, the refactoring of Flink's deployment and process model, Kubernetes has become a natural choice for Flink deployments. In this talk we will walk through how to get Flink running on Kubernetes.
In the last decade, many distributed stream processing engines (SPEs) were developed to perform continuous queries on massive online data. The central design principle of these engines is to handle queries that potentially run forever on data streams with a query-at-a-time model, i.e., each query is optimized and executed separately. In many real applications, streams are not only processed with long-running queries, but also thousands of short-running ad-hoc queries. To support this efficiently, it is essential to share resources and computation for stream ad-hoc queries in a multi-user environment.
The goal of this talk is to bridge the gap between stream processing and ad-hoc queries in SPEs by sharing computation and resources. We define three main requirements for ad-hoc shared stream processing: (1) Integration: ad-hoc query processing should be a composable layer that can extend stream operators, such as join, aggregation, and window operators; (2) Consistency: ad-hoc query creation and deletion must be performed in a consistent manner and ensure exactly-once semantics and correctness; (3) Performance: in contrast to state-of-the-art SPEs, an ad-hoc SPE should maximize not only data throughput but also query throughput, via incremental computation and resource sharing. Based on these requirements, we have developed AStream, an ad-hoc, shared-computation stream processing framework.
To the best of our knowledge, AStream is the first system that supports distributed ad-hoc stream processing. AStream is built on top of Apache Flink. Our experiments show that AStream delivers results comparable to Flink for single-query deployments and outperforms it by orders of magnitude with multiple queries.
Streaming your Lyft Ride Prices - Flink Forward SF 2019 (Thomas Weise)
At Lyft we dynamically price our rides using a combination of data sources, machine learning models, and streaming infrastructure for low latency, reliability, and scalability. Dynamic pricing allows us to quickly adapt to real-world changes and be fair to drivers (say, by raising rates when there is a lot of demand) and to passengers (say, by offering a cheaper rate for returning 10 minutes later). The streaming platform powers pricing by bringing together the best of both worlds using Apache Beam: ML algorithms in Python and Apache Flink as the streaming engine.
https://sf-2019.flink-forward.org/conference-program#streaming-your-lyft-ride-prices
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach..." (Flink Forward)
Apache Flink is a popular framework for real-time stream computing. Many stream-compute algorithms require trailing data in order to compute the intended result. One example is computing the number of user logins in the last 7 days. This creates a dilemma where the results of the stream program are incomplete until the runtime of the program exceeds 7 days. The alternative is to bootstrap the program, using historic data to seed the state before shifting to real-time data. This talk will discuss alternatives for bootstrapping programs in Flink. Some alternatives rely on technologies exogenous to the stream program, such as enhancements to the pub/sub layer, that are more generally applicable to other stream-compute engines. Other alternatives include enhancements to Flink source implementations. Lyft is exploring another alternative using orchestration of multiple Flink programs. The talk will cover why Lyft pursued this alternative and future directions to further enhance bootstrapping support in Flink.
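One hedged sketch of the bootstrapping idea: union a bounded historic source with the live feed so keyed state is seeded with history before real-time results are emitted. HistoricLoginSource, LiveLoginSource, and LoginCountLast7Days are hypothetical placeholders, not Lyft's implementation:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class BootstrappedLoginCounts {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Hypothetical sources: one replays the last 7 days of login events,
            // the other tails the live feed (e.g. a Kafka topic). Both emit user ids.
            DataStream<String> historic = env.addSource(new HistoricLoginSource());
            DataStream<String> live = env.addSource(new LiveLoginSource());

            // The union seeds the keyed state with history before live results flow.
            historic.union(live)
                    .keyBy(userId -> userId)
                    .process(new LoginCountLast7Days()) // hypothetical KeyedProcessFunction
                    .print();

            env.execute("bootstrapped-login-counts");
        }
    }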
Flink Forward San Francisco 2018 keynote: Srikanth Satya - "Stream Processin..." (Flink Forward)
Stream processing in conjunction with consistent, durable, reliable stream storage is kicking the revolution up a notch in Big Data processing. This modern paradigm is enabling a new generation of data middleware that delivers on the streaming promise of a simplified and unified programming model. From data ingest, transformation, and messaging to search, time series, and more, a robust streaming data ecosystem means we’ll all be able to more quickly build applications that solve problems we could not solve before.
Virtual Flink Forward 2020: Data driven matchmaking streaming at Hyperconnect... (Flink Forward)
HyperConnect's 1-to-1 video matchmaking system combines various machine learning techniques to maximize user satisfaction. Our matchmaking system manages a large user context containing actions from just a few seconds ago, and reacts in milliseconds to produce meaningful new results in each user session. This is difficult to do the traditional way, so distributed streaming is essential for these cases. Topics include:
- Why our team chose Apache Flink in comparison with alternatives
- Matchmaking streaming architecture, with detailed abstraction levels based on Flink operators
- Pairwise scoring microservice management with Flink
- Stateful matchmaking computation with low latency, fault tolerance, and scalability
- How to manage large-scale events: classifying feature types, collecting with a multi-window stream
- Applications: personalization, multi-armed bandits on streams
Scaling stream data pipelines with Pravega and Apache Flink (Till Rohrmann)
Extracting insights out of continuously generated data requires a stream processor with powerful data analytics features such as Apache Flink. A stream data pipeline with Flink typically includes a storage component to ingest and serve the data. Pravega is a stream store that ingests and stores stream data permanently, making the data available for tail, catch-up, and historical reads. One important challenge for such stream data pipelines is coping with variations in the workload. Daily cycles and seasonal spikes might require the provisioning of the application to adapt accordingly. Pravega has a feature called stream scaling, which enables the capacity offered for the ingestion of events of a stream to grow and shrink over time according to workload. Such a feature is useful when the downstream application has the ability to accommodate such changes and scale its provisioning accordingly. In this presentation, we introduce stream scaling in Pravega and show how Flink jobs leverage this feature to rescale stateful jobs according to variations in the workload.
Flink Forward San Francisco 2019: Using Flink to inspect live data as it flow... (Flink Forward)
Using Flink to inspect live data as it flows through a data pipeline
One of the hardest challenges in authoring a data pipeline in Flink is understanding what your data looks like at each stage of the pipeline. Pipeline authors would love to answer questions like "why is no data coming through my filter?", "why did my regex not extract any fields?", or "is my pipeline even reading anything from Kafka?" Unit and integration testing of pipeline logic goes a long way, and metrics are another great tool for understanding what a pipeline is doing, but sometimes you need the data itself to answer why a pipeline is behaving the way it is.
To answer these questions for ourselves and our customers, at Splunk we created a simple yet robust architecture for extracting data as it moves through a pipeline. You'll learn about our implementation of this architecture, including the lessons learned while creating it, and how you can apply this architecture yourself. You'll hear about how to rewrite your Flink job graph at job submission time, how to retrieve data from all the nodes in the job graph, and how to expose this information to a user interface through a REST API.
One of the biggest challenges in streaming application development is making sure your pipeline does exactly what it is supposed to do. The combination of different data sources, sinks, and complex application behavior, such as time-based functionality or interaction with external systems, doesn’t make the problem of proper testing any easier. In this talk we show you some of the excellent testing utilities built into Flink that can be used to unit-test parts of your application and to integration-test complex data pipelines. We will also look at some external libraries developed by the community that can be used to further improve the testing experience and reduce time to production. Last but not least, we will share some tools and best practices that can help debug problems that managed to fall through the cracks. By the end of this talk you will be familiar with some of the best tools to test and debug your streaming pipelines, to give you extra confidence in your applications.
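For instance, keeping pipeline logic in small, named functions makes it testable without a running cluster, as in this minimal sketch modeled on the Flink testing documentation:

    import static org.junit.Assert.assertEquals;

    import org.apache.flink.api.common.functions.MapFunction;
    import org.junit.Test;

    public class IncrementMapFunctionTest {

        // Factoring pipeline logic into its own class makes it testable in isolation.
        static class IncrementMapFunction implements MapFunction<Long, Long> {
            @Override
            public Long map(Long record) {
                return record + 1;
            }
        }

        @Test
        public void testIncrement() throws Exception {
            IncrementMapFunction incrementer = new IncrementMapFunction();
            assertEquals(Long.valueOf(3L), incrementer.map(2L));
        }
    }

For stateful operators and timers, Flink additionally ships operator test harnesses (e.g. KeyedOneInputStreamOperatorTestHarness) that drive elements, watermarks, and checkpoints through a single operator.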
The streaming space is evolving at an ever-increasing pace. This trend is also reflected in Apache Flink, whose latest major release again included many new features. For streaming practitioners it is essential to learn about Flink's newest capabilities, because they often enable completely new use cases and applications.
In this talk, I want to give a brief overview of Apache Flink and its latest feature additions, including the integration of CEP with streaming SQL, proper support for state evolution, temporal joins, and many more. Furthermore, I want to put them in perspective with respect to Flink's future direction by giving some insights into ongoing development threads in the community. Thereby, I intend to give attendees a better picture of Flink's current and future capabilities.
dA Platform is a production-ready platform for stream processing with Apache Flink®. The Platform includes open source Apache Flink, a stateful stream processing and event-driven application framework, and dA Application Manager, a central deployment and management component. dA Platform schedules clusters on Kubernetes, deploys stateful Flink applications, and controls these applications and their state.
Flink Forward Berlin 2017: Patrick Lucas - Flink in Containerland (Flink Forward)
Apache Flink, a powerful distributed stateful stream processing framework, is an especially good fit for deployment on a containerization platform: its storage requirement is primarily external (e.g. HDFS or S3), clusters often share the lifetime of the jobs that run on them, and the flexibility of allocating resources on such a platform allows for scaling jobs up and down as necessary. In this talk I will give a brief introduction to Apache Flink, then describe the journey to making it a first-class citizen of the container world. I will cover my experience preparing to publish the “official repository” of Flink images on Docker Hub, the challenges of fitting a Flink deployment in a Kubernetes-shaped box, and the rough edges of Flink itself that were exposed by this process.
Video: https://www.youtube.com/watch?v=F7HQd3KX2TQ&list=PLDX4T_cnKjD207Aa8b5CsZjc7Z_KRezGz&index=48&t=6s
This is a talk that I gave at the Data Council Berlin Meetup on May 16th, 2019
Abstract:
Stream processing is being rapidly adopted by the enterprise. While in the past stream processing frameworks mostly provided Java- or Scala-based APIs, stream processing with SQL is growing increasingly popular because it makes stream processing accessible to non-programmers and significantly reduces the effort required to solve common tasks.
About three years ago, the Apache Flink community started adding SQL support to process static and streaming data in a unified fashion. Today, Flink SQL powers production systems at Alibaba, Huawei, Lyft, and Uber. Fabian Hueske discusses the current state of Flink’s SQL support and explains the importance of Flink’s unified approach to process static and streaming data. After covering the basics, he shares common real-world use cases ranging from low-latency ETL to pattern detection and demonstrates how easily they can be addressed with Flink SQL.
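To give a flavor of such queries, here is a sketch of a low-latency windowed aggregation in Flink SQL; the "clicks" table is a hypothetical registered table with an event-time attribute:

    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;

    public class HourlyClicks {
        // Hourly click counts per user; the same query works on static and streaming input.
        public static Table hourlyCounts(TableEnvironment tableEnv) {
            return tableEnv.sqlQuery(
                    "SELECT user_id, " +
                    "       TUMBLE_END(event_time, INTERVAL '1' HOUR) AS window_end, " +
                    "       COUNT(*) AS click_cnt " +
                    "FROM clicks " +
                    "GROUP BY user_id, TUMBLE(event_time, INTERVAL '1' HOUR)");
        }
    }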
What's new for Apache Flink's Table & SQL APIs? (Timo Walther)
About three years ago, the Apache Flink community started adding a Table & SQL API to process static and streaming data in a unified fashion. It makes data processing accessible to non-programmers and significantly reduces the effort to solve common tasks. Today, Flink SQL already powers production systems at Alibaba, Huawei, Lyft, and Uber. But we are only getting started! The community is currently re-shaping the future of data processing.
Even for followers of the Flink mailing lists, it can be quite difficult to keep track of all the developments happening in Flink's relational APIs. In this talk, we give an overview of recent contributions, such as pluggable optimizers, the new type system with consistent type inference, SQL DDL support, and the Python Table API. We elaborate on how all these efforts interact and discuss the future roadmap.
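As an illustration of the SQL DDL support mentioned above, a hedged sketch using the executeSql entry point of more recent Flink releases; the connector options are illustrative and vary by Flink version and connector:

    import org.apache.flink.table.api.TableEnvironment;

    public class ClicksDdl {
        public static void register(TableEnvironment tableEnv) {
            // Defines a Kafka-backed source table, including a watermark, entirely in SQL.
            tableEnv.executeSql(
                    "CREATE TABLE clicks (" +
                    "  user_id STRING," +
                    "  url STRING," +
                    "  event_time TIMESTAMP(3)," +
                    "  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND" +
                    ") WITH (" +
                    "  'connector' = 'kafka'," +
                    "  'topic' = 'clicks'," +
                    "  'properties.bootstrap.servers' = 'localhost:9092'," +
                    "  'format' = 'json'" +
                    ")");
        }
    }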
Apache Kafka’s Transactions in the Wild! Developing an exactly-once KafkaSink... (HostedbyConfluent)
Apache Kafka is one of the connectors most commonly used with Apache Flink for exactly-once streaming use cases. The combination of both systems allows you to build mission-critical systems that require low end-to-end latency and exactly-once processing, e.g. banks processing transactions. In Apache Flink 1.14, we released a new KafkaSink based on Apache Flink’s unified Sink interface that natively supports streaming and batch execution.
However, we needed to stretch Kafka’s transactions API to fully support exactly-once processing in Flink. In this talk, we will start with a quick recap of Apache Kafka’s transactions and Flink’s checkpointing mechanism. Then, we describe in depth the two-phase commit protocol implemented in KafkaSink and emphasize the difficulties we overcame when applying Kafka’s transaction API to longer-lasting transactions.
We explain how we ensure performant writing to Apache Kafka and how the KafkaSink recovery works.
In summary, this talk should give users a deep dive into how Apache Flink leverages Apache Kafka’s transactions and developers an overview of what they have to consider when using Apache Kafka’s transactions.
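For reference, a minimal sketch of configuring the Flink 1.14 KafkaSink for exactly-once delivery; the broker address, topic, and id prefix are placeholders, and builder method names vary slightly across Flink releases:

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.base.DeliveryGuarantee;
    import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
    import org.apache.flink.connector.kafka.sink.KafkaSink;

    public class ExactlyOnceSink {
        public static KafkaSink<String> build() {
            return KafkaSink.<String>builder()
                    .setBootstrapServers("broker:9092")
                    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                            .setTopic("events")
                            .setValueSerializationSchema(new SimpleStringSchema())
                            .build())
                    // EXACTLY_ONCE writes inside Kafka transactions that are committed
                    // as part of Flink's two-phase commit on checkpoints.
                    .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                    // Must be unique per application; Kafka's transaction.timeout.ms
                    // must exceed the checkpoint interval.
                    .setTransactionalIdPrefix("my-app")
                    .build();
        }
    }

The sink is then attached to a stream with stream.sinkTo(ExactlyOnceSink.build()).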
Unified Data Processing with Apache Flink and Apache Pulsar - Seth Wiesman (StreamNative)
Come learn how the combination of Apache Pulsar and Apache Flink is making stateful stream processing even more expressive and flexible, supporting streaming applications that were previously not considered streamable. The new world of applications and fast data architectures has broken up the database: raw data persistence comes in the form of event logs, and the state of the world is computed by a stream processor. Apache Pulsar provides a strong solution for the event log, while Apache Flink forms a powerful foundation for computation over event streams.
We will discuss the key concepts behind Apache Flink's approach to stream processing and how it serves as a powerful abstraction for stateful event-driven applications. We will then see how to use Flink in conjunction with Apache Pulsar to create a unified data processing platform.
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee... (Timo Walther)
Apache Flink is a distributed, stateful stream processor. It features exactly-once state consistency, sophisticated event-time support, high-throughput and low-latency processing, and APIs at different levels of abstraction (Java, Scala, SQL). In my talk, I'll give an introduction to Apache Flink and its features and discuss the use cases it solves. I'll explain why batch is just a special case of stream processing, how the community is evolving Flink into a truly unified stream and batch processor, and what this means for its users.
https://www.meetup.com/de-DE/Bangalore-Apache-Kafka-Group/events/265285812/
https://www.youtube.com/watch?v=Ych5bbmDIoA&list=PLvkUPePDi9sa27SG9eGNXH25cfUeo_WY9&index=2
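As a taste of the API levels mentioned above, a minimal DataStream word count in Java (the input values are illustrative):

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class WordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("to be or not to be")
           // split each line into (word, 1) pairs
           .flatMap((FlatMapFunction<String, Tuple2<String, Integer>>) (line, out) -> {
               for (String word : line.split("\\s+")) {
                   out.collect(Tuple2.of(word, 1));
               }
           })
           // lambdas lose generic type information, so declare it explicitly
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(t -> t.f0)
           .sum(1)
           .print();

        env.execute("word-count");
    }
}
```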
OSMC 2019 | The Telegraf Toolbelt: It Can Do That, Really? by David McKayNETWAYS
Telegraf is an agent for collecting, processing, aggregating, and writing metrics.
With over 200 plugins, Telegraf can fetch metrics from a variety of sources, allowing you to build aggregations and write those metrics to InfluxDB, Prometheus, Kafka, and more.
In this talk, we will take a look at some of the lesser-known, but awesome, plugins that are often overlooked, as well as how to use Telegraf to monitor Cloud Native systems.
Lessons Learned Building a Connector Using Kafka Connect (Katherine Stanley &...confluent
While many companies are embracing Apache Kafka as their core event streaming platform they may still have events they want to unlock in other systems. Kafka Connect provides a common API for developers to do just that and the number of open-source connectors available is growing rapidly. The IBM MQ sink and source connectors allow you to flow messages between your Apache Kafka cluster and your IBM MQ queues. In this session I will share our lessons learned and top tips for building a Kafka Connect connector. I’ll explain how a connector is structured, how the framework calls it and some of the things to consider when providing configuration options. The more Kafka Connect connectors the community creates the better, as it will enable everyone to unlock the events in their existing systems.
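To make the connector structure concrete, here is a minimal skeleton against the Kafka Connect source API. The config key and class names are illustrative; a real connector would partition its work across tasks and emit records from poll():

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

/** Skeleton of a source connector; the "queue.name" option is illustrative. */
public class ExampleSourceConnector extends SourceConnector {
    private Map<String, String> settings;

    @Override public void start(Map<String, String> props) { this.settings = props; }

    @Override public Class<? extends Task> taskClass() { return ExampleSourceTask.class; }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        // The framework asks the connector to split its work into at most
        // maxTasks configs; here every task simply gets a copy of the settings.
        List<Map<String, String>> configs = new ArrayList<>();
        for (int i = 0; i < maxTasks; i++) {
            configs.add(new HashMap<>(settings));
        }
        return configs;
    }

    @Override public void stop() { }

    @Override
    public ConfigDef config() {
        // Declares the options users may set, with types and documentation.
        return new ConfigDef().define("queue.name", ConfigDef.Type.STRING,
                ConfigDef.Importance.HIGH, "The queue to read messages from");
    }

    @Override public String version() { return "0.1.0"; }

    /** Tasks do the actual polling; this one emits nothing. */
    public static class ExampleSourceTask extends SourceTask {
        @Override public String version() { return "0.1.0"; }
        @Override public void start(Map<String, String> props) { }
        @Override public List<SourceRecord> poll() { return null; } // null = nothing to emit yet
        @Override public void stop() { }
    }
}
```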
Webinar: Flink SQL in Action - Fabian HueskeVerverica
Stream processing is rapidly adopted by the enterprise. While in the past, stream processing frameworks mostly provided Java or Scala-based APIs, stream processing with SQL is recently gaining a lot of attention because it makes stream processing accessible to non-programmers and significantly reduces the effort to solve common tasks.
About three years ago, the Apache Flink community started adding SQL support to process static and streaming data in a unified fashion. Today, Flink SQL powers production systems at Alibaba, Huawei, Lyft, and Uber. In this talk, I will discuss the current state of Flink’s SQL support and explain the importance of Flink’s unified approach to process static and streaming data. Once the basics are covered, I will present common real-world use cases ranging from low-latency ETL to pattern detection and demonstrate how easily they can be addressed by Flink SQL.
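As a flavor of what such SQL pipelines look like, a minimal sketch written against a recent Flink Table API; the datagen table is a stand-in for a real source:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SqlInAction {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // A throwaway source backed by Flink's built-in data generator.
        tEnv.executeSql(
                "CREATE TABLE clicks (" +
                "  user_name STRING," +
                "  url STRING" +
                ") WITH (" +
                "  'connector' = 'datagen'," +
                "  'rows-per-second' = '5'" +
                ")");

        // A continuous query: per-user click counts, updated as rows arrive.
        tEnv.executeSql(
                "SELECT user_name, COUNT(url) AS cnt FROM clicks GROUP BY user_name")
            .print();
    }
}
```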
We discuss existing and new hardware virtualization features. First, we review the existing hardware features that are not used by Xen today, showing example use cases. 1) The "descriptor-table exiting" control should be useful for guest kernels or a security agent to enhance security features. 2) The VMX-preemption timer allows the hypervisor to preempt guest VM execution after a specified amount of time, which is useful for implementing fair scheduling. The hardware can save the timer value on each successive VM exit, after setting the initial VM quantum. 3) VMFUNC is an operation provided by the processor that can be invoked from VMX non-root operation without a VM exit. Today, EPTP switching is available, and we discuss how we can use the feature. Second, we talk about new hardware features, especially interrupt optimizations.
NETCONF & YANG Enablement of Network DevicesCisco DevNet
A technical discussion and a demo showing how Tail-f's ConfD management agent can be used to implement NETCONF and YANG, the industry-leading solution for providing a programmable management interface in a network element. ConfD is recognized as the best-in-breed embedded software for implementing management functions in network elements, including physical devices and virtualized network functions (VNF) for NFV.
This Workshop is a best fit for engineers who are involved in the design and development of embedded software for network devices. Attendees will gain a basic understanding of what NETCONF and YANG are and how ConfD provides a solution for embedding this technology in the network devices. More information about ConfD can be found at: https://developer.cisco.com/site/confD/
Watch the DevNet 1216 replay from the Cisco Live On-Demand Library at: https://www.ciscolive.com/online/connect/sessionDetail.ww?SESSION_ID=92703&backBtn=true
Check out more and register for Cisco DevNet: http://ow.ly/jCNV3030OfS
Stream processing still evolves and changes at a speed that can make it hard to keep up with the developments. Being at the forefront of stream processing technology, the evolution of Apache Flink has mirrored many of these developments and continues to do so.
We will take you on a journey through the major milestones of stream processing technology in past years, diving into the latest additions that Apache Flink and other communities introduced to the stream processing landscape, such as Streaming SQL, Time-Versioned Tables, cluster-library duality, language portability, etc.
We will take a sneak peek into our crystal ball and present what the Flink community is working on next.
Similar to Towards Flink 2.0: Unified Batch & Stream Processing - Aljoscha Krettek, Ververica
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real time. At LinkedIn, we developed a fully managed stream processing platform on Flink running on K8s to power hundreds of stream processing pipelines in production. This platform is the backbone for other infra systems like Search, Espresso (our internal document store), and feature management. We provide a rich authoring and testing environment which allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery, and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions that it brings to stream processing at LinkedIn, and share the experiences and lessons we have learned.
Evening out the uneven: dealing with skew in FlinkFlink Forward
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by Jun Qin & Karl Friedrich
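One widely used mitigation for key skew, not necessarily the speakers' solution, is two-stage aggregation over salted keys; a minimal sketch (window sizes and salt fan-out are arbitrary, and with this toy bounded input the processing-time windows may not fire — the pattern is what matters):

```java
import java.util.concurrent.ThreadLocalRandom;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SaltedAggregation {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> events = env.fromElements("hot", "hot", "hot", "cold");

        // Stage 1: salt the key so one hot key is spread over 8 parallel sub-keys.
        DataStream<Tuple2<String, Long>> partials = events
                .map(k -> Tuple2.of(k + "#" + ThreadLocalRandom.current().nextInt(8), 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .sum(1);

        // Stage 2: strip the salt and merge the partial counts per original key.
        partials
                .map(t -> Tuple2.of(t.f0.substring(0, t.f0.indexOf('#')), t.f1))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .sum(1)
                .print();

        env.execute("salted-aggregation");
    }
}
```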
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
Flink Forward San Francisco 2022.
To improve Amazon Alexa experiences and support machine learning inference at scale, we built an automated end-to-end solution for incremental model building or fine-tuning machine learning models through continuous learning, continual learning, and/or semi-supervised active learning. Customer privacy is our top concern at Alexa, and as we build solutions, we face unique challenges when operating at scale such as supporting multiple applications with tens of thousands of transactions per second with several dependencies including near-real time inference endpoints at low latencies. Apache Flink helps us transform and discover metrics in near-real time in our solution. In this talk, we will cover the challenges that we faced, how we scale the infrastructure to meet the needs of ML teams across Alexa, and go into how we enable specific use cases that use Apache Flink on Amazon Kinesis Data Analytics to improve Alexa experiences to delight our customers while preserving their privacy.
by Aansh Shah
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
Flink Forward San Francisco 2022.
Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, these costs may not be avoidable with the current API, e.g., for efficient event-time stream-sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state lets you (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink.
by Nico Kruber
Introducing the Apache Flink Kubernetes OperatorFlink Forward
Flink Forward San Francisco 2022.
The Apache Flink Kubernetes Operator provides a consistent approach to manage Flink applications automatically, without any human interaction, by extending the Kubernetes API. Given the increasing adoption of Kubernetes based Flink deployments the community has been working on a Kubernetes native solution as part of Flink that can benefit from the rich experience of community members and ultimately make Flink easier to adopt. In this talk we give a technical introduction to the Flink Kubernetes Operator and demonstrate the core features and use-cases through in-depth examples.
by Thomas Weise
Flink Forward San Francisco 2022.
Resource Elasticity is a frequently requested feature in Apache Flink: Users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced, later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by Robert Metzger
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by Mason Chen
One sink to rule them all: Introducing the new Async SinkFlink Forward
Flink Forward San Francisco 2022.
Next time you want to integrate with a new destination for a demo, concept or production application, the Async Sink framework will bootstrap development, allowing you to move quickly without compromise. In Flink 1.15 we introduced the Async Sink base (FLIP-171), with the goal to encapsulate common logic and allow developers to focus on the key integration code. The new framework handles things like request batching, buffering records, applying backpressure, retry strategies, and at least once semantics. It allows you to focus on your business logic, rather than spending time integrating with your downstream consumers. During the session we will dive deep into the internals to uncover how it works, why it was designed this way, and how to use it. We will code up a new sink from scratch and demonstrate how to quickly push data to a destination. At the end of this talk you will be ready to start implementing your own Flink sink using the new Async Sink framework.
by Steffen Hausmann & Danny Cranmer
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
Flink Forward San Francisco 2022.
In normal situations, the default Kafka consumer and producer configuration options work well. But we all know life is not all roses and rainbows, and in this session we'll explore a few knobs that can save the day in atypical scenarios. First, we'll take a detailed look at the parameters available when reading from Kafka. We'll inspect the params that help us quickly spot an application lock or crash, the ones that can significantly improve performance, and the ones to touch with gloves since they could cause more harm than benefit. Moreover, we'll explore the partitioning options and discuss when diverging from the default strategy is needed. Next, we'll discuss the Kafka sink. After browsing the available options, we'll dive deep into understanding how to approach use cases like sinking enormous records, managing spikes, and handling small but frequent updates. If you want to understand how to make your application survive when the sky is dark, this session is for you!
by Olena Babenko
Flink powered stream processing platform at PinterestFlink Forward
Flink Forward San Francisco 2022.
Pinterest is a visual discovery engine that serves over 433MM users. Stream processing allows us to unlock value from realtime data for pinners. At Pinterest, we adopted Flink as our unified stream processing engine. In this talk, we will share our journey in building a stream processing platform with Flink and how we onboarded critical use cases to the platform. Pinterest now supports 90+ near-realtime streaming applications. We will cover the problem statement, how we evaluated potential solutions, and our decision to build the framework.
by Rainie Li & Kanchi Masalia
Flink Forward San Francisco 2022.
This talk will take you on the long journey of Apache Flink into the cloud-native era. The journey started back when Hadoop and YARN were the standard way of deploying and operating data applications.
We're going to deep dive into the cloud-native set of principles and how they map to the Apache Flink internals and recent improvements. We'll cover fast checkpointing, fault tolerance, resource elasticity, minimal infrastructure dependencies, industry-standard tooling, ease of deployment and declarative APIs.
After this talk you'll get a broader understanding of the operational requirements for a modern streaming application and where the current limits are.
by David Moravek
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
Flink Forward San Francisco 2022.
In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We'll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as some tips and Flink features that can speed up checkpointing and recovery times.
by Piotr Nowojski
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
Flink Forward San Francisco 2022.
Running natively on Kubernetes, the new Apache Flink Kubernetes Operator is a great way to deploy and manage Flink application and session deployments. In this presentation, we provide:
- A brief overview of Kubernetes operators and their benefits
- An introduction to the five levels of the operator maturity model
- An introduction to the newly released Apache Flink Kubernetes Operator and FlinkDeployment CRs
- Dockerfile modifications you can make to swap out the UBI images and Java of the underlying Flink Operator container
- Enhancements we're making in versioning/upgradeability/stability and security
- A demo of the Apache Flink Operator in action, with a technical preview of an upcoming product using the Flink Kubernetes Operator
- Lessons learned
- Q&A
by James Busche & Ted Chang
Flink Forward San Francisco 2022.
The Table API is one of the most actively developed components of Flink in recent times. Inspired by databases and SQL, it encapsulates concepts many developers are familiar with. It can be used with both bounded and unbounded streams in a unified way. But from afar it can be difficult to keep track of what this API is capable of and how it relates to Flink's other APIs. In this talk, we will explore the current state of Table API. We will show how it can be used as a batch processor, a changelog processor, or a streaming ETL tool with many built-in functions and operators for deduplicating, joining, and aggregating data. By comparing it to the DataStream API we will highlight differences and elaborate on when to use which API. We will demonstrate hybrid pipelines in which both APIs interact with one another and contribute their unique strengths. Finally, we will take a look at some of the most recent additions as a first step toward stateful upgrades.
by David Anderson
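A minimal sketch of such a hybrid pipeline, written against a recent Flink release (1.13+ for toChangelogStream); the data is illustrative:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class HybridPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Start in the DataStream API...
        DataStream<String> words = env.fromElements("flink", "table", "flink");
        tEnv.createTemporaryView("words", words);

        // ...switch to SQL for the declarative part (atomic values become column f0)...
        Table counts = tEnv.sqlQuery(
                "SELECT f0 AS word, COUNT(*) AS cnt FROM words GROUP BY f0");

        // ...and come back as a changelog stream carrying inserts and updates.
        DataStream<Row> changelog = tEnv.toChangelogStream(counts);
        changelog.print();

        env.execute("hybrid-pipeline");
    }
}
```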
Flink Forward San Francisco 2022.
Based on the new Flink-Pulsar connector, we implemented Flink's Table API and Catalog to help users interact with the Pulsar cluster via Flink SQL easily. We would like to go through the design and implementation of the SQL connector in the following aspects:
1. Two different modes of using Pulsar as a metadata store
2. Data format transformation and management
3. SQL semantics support within Pulsar context
by Sijie Guo & Neng Lu
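A hedged sketch of what the described SQL connector usage could look like; the WITH option keys are assumptions modeled on StreamNative's connector documentation, so check your connector version for the exact names:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PulsarSqlExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Option keys below are assumptions; verify against your connector docs.
        tEnv.executeSql(
                "CREATE TABLE pulsar_events (" +
                "  user_name STRING," +
                "  action STRING" +
                ") WITH (" +
                "  'connector' = 'pulsar'," +
                "  'topics' = 'persistent://public/default/events'," +
                "  'service-url' = 'pulsar://localhost:6650'," +
                "  'admin-url' = 'http://localhost:8080'," +
                "  'format' = 'json'" +
                ")");

        // A continuous aggregation over the Pulsar-backed table.
        tEnv.executeSql("SELECT action, COUNT(*) FROM pulsar_events GROUP BY action").print();
    }
}
```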
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
Flink Forward San Francisco 2022.
At Bloomberg, we deal with high volumes of real-time market data. Our clients expect to be notified of any anomalies in this market data, which may indicate volatile movements in the markets, notable trades, forthcoming events, or system failures. The parameters for these alerts are always evolving and our clients can update them dynamically. In this talk, we'll cover how we utilized the open source Apache Flink and Siddhi SQL projects to build a distributed, scalable, low-latency and dynamic rule-based, real-time alerting system to solve our clients' needs. We'll also cover the lessons we learned along our journey.
by Ajay Vyasapeetam & Madhuri Jain
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
Flink Forward San Francisco 2022.
At Stripe we have created a complete end to end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power from Flink, Kafka, and Pinot together. The pipeline provides exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset with trillion level rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by Xiang Zhang & Pratyush Sharma & Xiaoman Dong
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
Flink Forward San Francisco 2022.
What if my data is already in order? Stream Processing has given us an elegant and powerful solution for running analytic queries and logic over high volumes of continuously arriving data. However, in both Apache Flink and Apache Beam, the notion of time-ordering is baked in at a very low level, making it difficult to express computations that are interested in a semantic-, rather than time-ordering of the data. In financial services, what often matters the most about the data moving between systems is not when the data was created, but in what order, to the extent that many institutions engineer a global sequencing over all data entering and produced by their systems to achieve complete determinism. How, then, can financial institutions and others best employ Stream Processing on streams of data that are already ordered? I will cover various techniques that can make this work, as well as seek input from the community on how Flink might be improved to better support these use-cases.
by Patrick Lucas
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion to Iceberg tables can suffer from two problems: (1) the small files problem, which can hurt read performance, and (2) poor data clustering, which can make file pruning less effective. To address those two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partition. This can reduce the number of concurrent files that every task writes. It can also improve data clustering. In this talk, we will explain the motivations in detail and dive into the design of the shuffling stage. We will also share the evaluation results that demonstrate the effectiveness of smart shuffling.
by Gang Ye & Steven Wu
Batch Processing at Scale with Flink & IcebergFlink Forward
Flink Forward San Francisco 2022.
Goldman Sachs's Data Lake platform serves as the firm's centralized data platform, ingesting 140K (and growing!) batches per day of Datasets of varying shape and size. Powered by Flink and using metadata configured by platform users, ingestion applications are generated dynamically at runtime to extract, transform, and load data into centralized storage where it is then exported to warehousing solutions such as Sybase IQ, Snowflake, and Amazon Redshift. Data Latency is one of many key considerations as producers and consumers have their own commitments to satisfy. Consumers range from people/systems issuing queries, to applications using engines like Spark, Hive, and Presto to transform data into refined Datasets. Apache Iceberg allows our applications to not only benefit from consistency guarantees important when running on eventually consistent storage like S3, but also allows us the opportunity to improve our batch processing patterns with its scalability-focused features.
by Andreas Hailu
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for technology and making things work, along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features available on those devices, but many of those features provide convenience and capability at the expense of security. This best practices guide outlines steps users can take to better protect personal devices and information.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren't traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company's observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to the purview of ops, infra, and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and I will share these foundational concepts to build on.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Streaming:
Time in the data stream must be quasi-monotonic to produce time progress (watermarks; see the sketch after these notes)
Always have close-to-latest incremental results
Resource requirements change over time
Recovery must catch up very fast
Batch:
Order of time in the data does not matter (parallel unordered reads)
Bulk operations (two-phase hash/sort)
Longer time for recovery (no low-latency SLA)
Resource requirements change fast throughout the execution of a single job
Understanding this difference will help later, when we discuss scheduling changes.
Different requirements
Optimization potential for batch and streaming
Also: historic developments and slow-changing organizations
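To make the watermark point above concrete: in Flink's later (1.11+) WatermarkStrategy API, bounded out-of-orderness is exactly this quasi-monotonic assumption. A minimal sketch; the Event type is a stand-in:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;

public class Watermarks {
    /** Minimal event type for the example. */
    public static class Event {
        public long timestamp;
        public String payload;
    }

    public static WatermarkStrategy<Event> strategy() {
        // Tolerate up to 5 seconds of out-of-orderness; beyond that, records
        // count as late. This is what quasi-monotonic time buys you: watermarks
        // can advance while still allowing bounded disorder in the stream.
        return WatermarkStrategy
                .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((event, recordTs) -> event.timestamp);
    }
}
```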
You have to decide between DataSet and DataStream when writing a job
Two (slightly) different APIs, with different capabilities
Different set of supported connectors: no Kafka DataSet connector, no HBase DataStream connector
Different performance characteristics
Different fault-tolerance behavior
Different scheduling logic
With Table API, you only have to learn one API
Still, the set of supported connectors depends on the underlying execution API
Feature set depends on whether there is an implementation for your underlying API
You cannot combine more batch-y with more stream-y sources/sinks
A “soft problem”: with two stacks of everything, less developer power goes into each individual stack, which means fewer features, worse performance, and more bugs that get fixed more slowly. (The sketch below shows the same job in both APIs.)
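Since the split shows up in the very first line of code a user writes, here is a minimal side-by-side sketch of the same aggregation in both APIs (the data is illustrative):

```java
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TwoStacks {
    public static void main(String[] args) throws Exception {
        // DataSet (batch) flavor: bounded input, groupBy on a field position.
        ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
        batchEnv.fromElements(Tuple2.of("a", 1), Tuple2.of("a", 1))
                .groupBy(0)
                .sum(1)
                .print();  // print() triggers execution in the DataSet API

        // DataStream flavor of the same logic: keyBy instead of groupBy,
        // and an explicit execute() call to run the job.
        StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        streamEnv.fromElements(Tuple2.of("a", 1), Tuple2.of("a", 1))
                .keyBy(t -> t.f0)
                .sum(1)
                .print();
        streamEnv.execute("stream-flavor");
    }
}
```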
Recall the earlier processing-styles slide:
batch wants step by step
streaming is all at once
This has been mentioned a lot.
Lyft has given a talk about this at last FF
* FLINK-10886: Event-time alignment for sources; Jamie Grier (Lyft) contributed the first parts of this
Batch:
random reads
coordinated by the JobManager
Streaming:
sequential reads
no coordination between sources
The new source interface must support both batch and streaming use cases, allow Flink to be clever, be able to deal with event time, watermarks, and source idiosyncrasies, and enable snapshotting
This should enable new features: generic idleness detection, event-time alignment*
* FLINK-10886: Event-time alignment for sources; Jamie Grier (Lyft) contributed the first parts of this
Talk about how this will enable event-time alignment for sources in a generic way (a usage sketch follows below)
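This unified source design eventually shipped as FLIP-27. The sketch below shows it from the user side with the FLIP-27-based KafkaSource from a later Flink release (broker and topic are placeholders); note how the same builder serves streaming and, with the commented-out bound, batch:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnifiedSourceJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("broker-1:9092")       // placeholder address
                .setTopics("events")                        // placeholder topic
                .setStartingOffsets(OffsetsInitializer.earliest())
                // Same source, batch semantics: stop at the current end offsets.
                // .setBounded(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
           .print();

        env.execute("unified-source");
    }
}
```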
Mention here that you can basically build a job jar that includes flink-runtime and execute it any way you want: put it in Docker, Spring Boot, or just start multiple instances of it.
As-a-library mode
Note that this nicely jibes with the pull-based model. Enables the things we need for batch.
Mention the dog with the hose. Sources just keep spitting out records as fast as they can.
Possibly put these on separate slides, with fewer words. Or even some graphics.
There are some quirks when you use DataStream for batch (a sketch of how this eventually landed follows below):
a groupReduce would be a window with a GlobalWindow
MapPartition would have to finalize things in close()
Joins would have to specify a global window
Of course, state requirements are bad for the naïve approach, i.e., large state and inefficient access patterns
Joins and grouping can be a lot faster with specific algorithms
Hash Join, Merge join, etc…
For example:
a different window operator
different join implementations
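For context on where these notes ended up: with FLIP-140, later Flink releases let a DataStream program opt into batch execution, which swaps in the specialized operators and algorithms discussed here. A minimal sketch against that later API (not part of this talk's codebase):

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchOnDataStream {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // BATCH mode swaps in blocking shuffles, staged scheduling, and
        // sort-based grouping instead of keeping everything in keyed state.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements("a", "b", "a")
           .map(w -> Tuple2.of(w, 1))
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(t -> t.f0)
           .sum(1)
           .print();

        env.execute("batch-on-datastream");
    }
}
```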
The scheduling stuff and networking would be a whole talk on their own. Memory management is another issue.
The pull-based operator model is how most databases were/are implemented.
Note how the pull model enables hash join, merge join, …
Side inputs benefit from a pull-based model
Bring the dog-drinking-from-hose example, also for Join operator
This will allow porting batch operators/algorithms to StreamOperator
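To illustrate why the pull model suits batch algorithms like hash joins, here is a generic Volcano-style sketch; this is not Flink code, and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Minimal Volcano-style pull model: each operator pulls from its children. */
interface PullOperator<T> {
    /** Returns the next record, or null when the input is exhausted. */
    T next();
}

/** A hash join is natural in the pull model: drain the build side fully, then probe. */
class HashJoin implements PullOperator<String[]> {
    private final Map<String, List<String[]>> buildTable = new HashMap<>();
    private final PullOperator<String[]> probeSide;
    private final int probeKey;
    private Iterator<String[]> matches = Collections.emptyIterator();
    private String[] currentProbe;

    HashJoin(PullOperator<String[]> buildSide, PullOperator<String[]> probeSide,
             int buildKey, int probeKey) {
        this.probeSide = probeSide;
        this.probeKey = probeKey;
        // Build phase: pull the build side to completion before emitting anything.
        for (String[] row; (row = buildSide.next()) != null; ) {
            buildTable.computeIfAbsent(row[buildKey], k -> new ArrayList<>()).add(row);
        }
    }

    @Override
    public String[] next() {
        while (true) {
            if (matches.hasNext()) {
                // Emit the concatenation of the matched build row and probe row.
                String[] buildRow = matches.next();
                String[] out = new String[buildRow.length + currentProbe.length];
                System.arraycopy(buildRow, 0, out, 0, buildRow.length);
                System.arraycopy(currentProbe, 0, out, buildRow.length, currentProbe.length);
                return out;
            }
            currentProbe = probeSide.next();
            if (currentProbe == null) return null;  // probe side exhausted, join done
            matches = buildTable
                    .getOrDefault(currentProbe[probeKey], Collections.emptyList())
                    .iterator();
        }
    }
}
```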