With its superior state management and savepoint mechanism, Apache Flink is unique among modern stream processors in supporting minimal-effort job reconfiguration. Savepoints are used extensively to enable dynamic scaling, bug fixing, upgrades, and numerous other reconfiguration use cases, all while preserving exactly-once semantics. However, when it comes to dynamic scaling, the burden of reconfiguration decisions (when and how much to scale) is currently placed on the user.
In this talk, I share our recent work at ETH Zurich on providing support for self-managed and automatically reconfigurable stream processing. I present SnailTrail (NSDI ’18), an online critical path analysis module that detects bottlenecks and provides insights on streaming application performance, and DS2 (OSDI ’18), an automatic scaling controller that identifies optimal backpressure-free configurations and operates reactively online. Both SnailTrail and DS2 are integrated with Apache Flink and publicly available. I conclude with evaluation results, ongoing work, and future challenges in this area.
41. CRITICAL PARTICIPATION (CP METRIC)
An estimation of the activity’s participation in the critical path
▸ $N$: total number of paths in the snapshot
▸ $a_w$: activity duration (edge weight)
▸ $TPC(a)$: centrality, the number of paths this activity appears on
Definition 8. Transient Path Centrality: Let $P = \{\vec{p}_1, \vec{p}_2, \ldots, \vec{p}_N\}$ be the set of $N$ transient paths of snapshot $G_{[t_s,t_e]}$. The transient path centrality of an edge $e \in G_{[t_s,t_e]}$ is defined as
$$c(e) = \sum_{i=1}^{N} c_i(e), \quad \text{where } c_i(e) = \begin{cases} 0 & : e \notin \vec{p}_i \\ 1 & : e \in \vec{p}_i \end{cases}$$
The following holds:
$$CP_a = \frac{TPC(a) \cdot a_w}{N\,(t_e - t_s)} \qquad (3)$$
42. CRITICAL PARTICIPATION (CP METRIC)
Can be computed without path enumeration!
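The claim that the CP metric needs no path enumeration can be illustrated with a small sketch. It assumes, for illustration only, that the snapshot is a DAG without parallel edges and that the transient paths are all of its source-to-sink paths; the function names are hypothetical and are not SnailTrail's API. Under that assumption, the number of paths through an edge (u, v) is the number of paths reaching u times the number of paths leaving v, so two passes in topological order are enough.

```python
from collections import defaultdict


def path_centrality(edges):
    """Count, for every edge of a DAG, the source-to-sink paths that use it.

    edges: iterable of (u, v) pairs without duplicates (assumption).
    Returns (centrality per edge, total number of paths N).
    """
    edges = list(edges)
    succ, pred = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes.update((u, v))

    # Kahn's algorithm gives a topological order of the nodes.
    indegree = {n: len(pred[n]) for n in nodes}
    ready = [n for n in nodes if indegree[n] == 0]
    order = []
    while ready:
        n = ready.pop()
        order.append(n)
        for m in succ[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)

    # Forward pass: number of paths from any source to each node.
    from_source = {n: 1 if not pred[n] else 0 for n in nodes}
    for n in order:
        for m in succ[n]:
            from_source[m] += from_source[n]

    # Backward pass: number of paths from each node to any sink.
    to_sink = {n: 1 if not succ[n] else 0 for n in nodes}
    for n in reversed(order):
        for m in succ[n]:
            to_sink[n] += to_sink[m]

    n_paths = sum(from_source[n] for n in nodes if not succ[n])
    centrality = {(u, v): from_source[u] * to_sink[v] for u, v in edges}
    return centrality, n_paths


def critical_participation(weighted_edges, ts, te):
    """CP_a = TPC(a) * a_w / (N * (te - ts)) for every activity (edge)."""
    weighted_edges = list(weighted_edges)
    centrality, n_paths = path_centrality((u, v) for u, v, _ in weighted_edges)
    return {
        (u, v): centrality[(u, v)] * w / (n_paths * (te - ts))
        for u, v, w in weighted_edges
    }


# Toy snapshot over [0, 4]: two paths, both ending in the edge ("t", "z").
snapshot = [
    ("s", "a", 2.0), ("s", "b", 1.0),
    ("a", "t", 1.0), ("b", "t", 2.0),
    ("t", "z", 1.0),
]
cp = critical_participation(snapshot, ts=0.0, te=4.0)
```

In this toy snapshot the edge shared by both paths gets twice the centrality of the others, and the CP values of all edges sum to 1.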
50. Streaming systems must be capable of adapting the level of parallelism when conditions change at runtime
[Figure: three plots of input rate and throughput (events/s) over time, labeled "Data loss", "SLO violations", and "Idle resources".]
52. HEURISTIC SCALING APPROACHES
metrics → policy → scaling action
▸ metrics: CPU utilization; backlog, tuples/s; backpressure signal
▸ policy: threshold and rule-based, e.g. if CPU > 80% => scale
▸ scaling action: small changes, one operator at a time
Systems: Borealis, StreamCloud, Seep, IBM Streams, Spark Streaming, Google Dataflow, Dhalion
53. HEURISTIC SCALING APPROACHES
▸ Problematic under interference, multi-tenancy
▸ Sensitive to noise, manual, hard to tune
▸ Non-predictive, speculative steps
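To make the "if CPU > 80% => scale" rule concrete, here is a minimal sketch of a threshold policy applied to a single operator. Everything in it is illustrative: the function names, the one-instance step, the utilization model, and the numbers (10 rec/s per instance, 40 rec/s target) are assumptions, not the behavior of any of the systems listed.

```python
def threshold_policy(cpu_utilization, parallelism, threshold=0.8, step=1):
    """Rule-based policy: if CPU > threshold, add `step` instances; otherwise keep."""
    if cpu_utilization > threshold:
        return parallelism + step
    return parallelism


def simulate(target_rate, capacity_per_instance, parallelism=1, max_steps=20):
    """Apply the rule repeatedly; return the parallelism after each scaling action."""
    history = [parallelism]
    for _ in range(max_steps):
        # Toy utilization model (assumption): load over capacity, capped at 100%.
        utilization = min(1.0, target_rate / (parallelism * capacity_per_instance))
        new_parallelism = threshold_policy(utilization, parallelism)
        if new_parallelism == parallelism:
            break
        parallelism = new_parallelism
        history.append(parallelism)
    return history


history = simulate(target_rate=40, capacity_per_instance=10)
print(history)
```

Because the rule only knows that a threshold was crossed, not by how much, it takes one small speculative step per action and needs four actions here before the metric drops to the threshold.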
54. Effect of Dhalion’s scaling actions in an initially under-provisioned wordcount dataflow
Dataflow: src → o1 → o2
back-pressure!
target: 40 rec/s
rates: 10 rec/s, 100 rec/s
Which operator is the bottleneck?
What if we scale o1 x 4?
How much to scale o2?
59. Which operator is the bottleneck?
[Figure: activity timelines for src, o1, and o2, with periods marked "waiting for output" and "waiting for input", shown for two cases: "o1 cannot keep up" and "o2 cannot keep up".]
62. Intuition: use the dataflow graph to extract operator dependencies and system instrumentation to collect accurate, representative metrics.
[Figure: timelines for src, o1, and o2 over intervals 1 to 4, labeled "10 recs" per interval for o1 and "100 recs" per interval for o2, with a 0.5s span marked.]
target: 40 rec/s
▸ o1: x4 instances to keep up with src rate
▸ o2: true rate = 200 recs/s; x2 instances to keep up with x4 o1 instances
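The arithmetic behind these annotations can be sketched as follows. The reading of the figure is an assumption: o1 is taken to be busy for the whole 1 s interval while handling 10 records, o2 to spend 0.5 s of useful time on 100 records, and o1 to emit 100 records for every 10 it consumes. The helper names are hypothetical and not DS2's API.

```python
import math


def true_rate(records_processed, useful_time_s):
    """Records per second of useful (non-waiting) time."""
    return records_processed / useful_time_s


def required_instances(target_input_rate, true_rate_per_instance):
    """Smallest instance count whose combined true rate covers the input rate."""
    return math.ceil(target_input_rate / true_rate_per_instance)


source_rate = 40.0  # target: 40 rec/s

# o1: 10 records per 1 s interval, assumed busy for the whole interval.
o1_true_rate = true_rate(10, 1.0)
o1_instances = required_instances(source_rate, o1_true_rate)

# o1 emits 100 records for 10 consumed, so o2 must absorb 10x the source rate.
o1_selectivity = 100 / 10
o2_input_rate = source_rate * o1_selectivity

# o2: 100 records in 0.5 s of useful time.
o2_true_rate = true_rate(100, 0.5)
o2_instances = required_instances(o2_input_rate, o2_true_rate)

print(o1_instances, o2_true_rate, o2_instances)
```

Working downstream along the dataflow graph is what lets both operators be sized in one pass: the requirement for o2 is computed from the output that the scaled o1 will produce, not from what o2 currently observes.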
66. DS2 MAKES LINEAR PREDICTIONS
If operator scaling is linear, then:
▸ no overshoot when scaling up
▸ no undershoot when scaling down
Ideal rates act as an upper bound when scaling up and as a lower bound when scaling down:
▸ DS2 will converge monotonically to the target rate
[Figure: two plots of rate against parallelism, one for scaling up and one for scaling down, with labels: initial rate, target, prediction, actual, p0, p1, p’.]
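Why a linear prediction cannot overshoot when scaling up can be written out in a few lines. The symbols are assumptions introduced for illustration: $r(p)$ is the actual aggregate rate at parallelism $p$, $r_0 = r(p_0)$ is the measured rate, $r_t$ is the target rate, and scaling is assumed to be at most linear.

```latex
\begin{align*}
p' &= \left\lceil p_0 \cdot \frac{r_t}{r_0} \right\rceil
   && \text{(linear prediction)} \\
r(p) &\le \frac{r_0}{p_0}\, p \quad \text{for } p \ge p_0
   && \text{(assumption: at most linear scaling)} \\
r(p^*) \ge r_t
   &\;\Rightarrow\; \frac{r_0}{p_0}\, p^* \ge r_t
   \;\Rightarrow\; p^* \ge p_0 \cdot \frac{r_t}{r_0}
   \;\Rightarrow\; p^* \ge p'
   && \text{(since } p^* \text{ is an integer)}
\end{align*}
```

Any parallelism $p^*$ that reaches the target is therefore at least $p'$, so the prediction never overshoots; when scaling is exactly linear, $p'$ is itself the smallest such $p^*$. The scaling-down case is symmetric, with the inequalities reversed.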
70. DS2 MINIMIZES THE ERROR UNTIL CONVERGENCE
[Figure: rate against parallelism, with labels: initial rate, target, prediction, actual, error, new prediction, p0, p1, p1’.]
Gradually minimizes error
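The repeated prediction step can be sketched as a small loop for the scaling-up case. This is a simplification under stated assumptions: a single operator, a hypothetical sublinear rate function `sublinear`, and a prediction that rescales the current parallelism by target over measured rate. It is not DS2's implementation.

```python
import math


def iterate_to_target(actual_rate, p0, target, max_steps=10):
    """Repeat the linear prediction until the measured rate reaches the target.

    actual_rate: function mapping parallelism to the measured aggregate rate.
    Returns the list of parallelism values tried, starting with p0.
    """
    history = [p0]
    p = p0
    for _ in range(max_steps):
        rate = actual_rate(p)
        if rate >= target:
            break
        # Linear prediction from the latest measurement.
        p = math.ceil(p * target / rate)
        history.append(p)
    return history


def sublinear(p):
    """Hypothetical operator: 10 rec/s per instance, with diminishing returns."""
    return 10.0 * p ** 0.9


history = iterate_to_target(sublinear, p0=1, target=100.0)
print(history)
```

In this toy run the first prediction falls short because scaling is sublinear, and the second prediction, made from the new measurement, closes the remaining gap without exceeding the smallest parallelism that meets the target.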
75. DS2 VS. STATE-OF-THE-ART ON HERON
Initially under-provisioned wordcount dataflow
Target rate: 16,700 rec/s
▸ DS2 converges in a single step for both operators, and converges in 60s, as soon as it receives the Heron metrics
▸ Dhalion scales one operator at a time, needs six steps in total, and converges in 2000s
+10 counts, +12 mappers
[Figure: Dhalion's scaling steps are numbered 1 to 6.]
83. DS2 ON APACHE FLINK
Initially under-provisioned wordcount
Target rate: 2,000,000 rec/s, drops to half at 800s
▸ DS2 converges in 2 steps for both operators
▸ DS2 reacts within 3s when the target rate drops
▸ Transient underprovisioning by 1 instance
[Figure: the two scaling steps are numbered 1 and 2.]
87. Publications and code
Kalavri V, Liagouris J, Hoffmann M, Dimitrova D, Forshaw M, Roscoe T. Three steps is all you need: fast, accurate, automatic scaling decisions for distributed streaming dataflows. OSDI ’18.
github.com/strymon-system
Hoffmann M, Lattuada A, Liagouris J, Kalavri V, Dimitrova D, Wicki S, Chothia Z, Roscoe T. Snailtrail: Generalizing critical paths for online analysis of distributed dataflows. NSDI ’18.
github.com/li1/snailtrail
88. Zaheer Chothia, Andrea Lattuada, Timothy Roscoe, Moritz Hoffmann, Desislava Dimitrova, John Liagouris, Malte Sandstede, Matthew Forshaw, Sebastian Wicki
strymon.systems.ethz.ch