Stateful stream processing with exactly-once guarantees is one of Apache Flink's distinctive features, and the scale of state managed by Flink in production is constantly growing. This leads to a couple of interesting challenges for state handling in Flink. In this talk, we present current and future developments to improve the handling of large state and recovery in Apache Flink. We show how to keep snapshots of large state swift and how to minimize negative effects on job performance through incremental and asynchronous checkpointing. Furthermore, we discuss how to greatly accelerate recovery after failures and for rescaling. In this context, we go into detail about improved execution graph recovery, caching state on task managers, and exploiting features of modern storage architectures in our state backends.
2. State in Streaming Programs
case class Event(producer: String, evtType: Int, msg: String)
case class Alert(msg: String, count: Long)

env.addSource(…)
  .map(bytes => Event.parse(bytes))
  .keyBy("producer")
  .mapWithState { (event: Event, state: Option[Int]) =>
    // pattern rules: inspect event and per-key state,
    // return (alert, updated state)
  }
  .filter(alert => alert.msg.contains("CRITICAL"))
  .keyBy("msg")
  .timeWindow(Time.seconds(10))
  .sum("count")
[Dataflow diagram: Source → map() → keyBy → mapWithState() → filter() → keyBy → window() → sum()]
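The slide elides the "pattern rules" body. A minimal sketch of what such a per-key rule could look like, written as a standalone function using the slide's case classes; the threshold of 3 and the alert message are made up for illustration, and the `(output, new state)` return shape mirrors `mapWithState`'s contract:

```scala
case class Event(producer: String, evtType: Int, msg: String)
case class Alert(msg: String, count: Long)

// Hypothetical pattern rule: count events per producer and emit a
// CRITICAL alert once a (made-up) threshold is reached.
def patternRule(event: Event, state: Option[Int]): (Option[Alert], Option[Int]) = {
  val count = state.getOrElse(0) + 1
  val alert =
    if (count >= 3) Some(Alert(s"CRITICAL: ${event.producer} fired $count events", count.toLong))
    else None
  (alert, Some(count))
}
```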
3. State in Streaming Programs
[Same pipeline as on the previous slide, annotated: map() and filter() are stateless operators; mapWithState() and window()/sum() are stateful.]
4. Internal vs External State
External State
• State lives in a separate data store
• Can scale "state capacity" independently of the stream processor
• Usually much slower than internal state
• Hard to get "exactly-once" guarantees

Internal State
• State lives in the stream processor
• Faster than external state
• Working area is local to the computation
• Checkpointed to a stable store (DFS)
• Always exactly-once consistent
• Stream processor has to handle scalability
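Internal keyed state in Flink is accessed through the state API rather than an external store. A sketch using `ValueState` to keep a per-key count, assuming the `Event` case class from the earlier pipeline; the state lives in the configured backend and is checkpointed together with the job:

```scala
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

// Counts events per key using internal keyed state (ValueState).
class CountPerKey extends RichFlatMapFunction[Event, (String, Long)] {
  private var count: ValueState[java.lang.Long] = _

  override def open(parameters: Configuration): Unit = {
    // Registers the state with the backend; restored automatically on recovery.
    count = getRuntimeContext.getState(
      new ValueStateDescriptor("count", classOf[java.lang.Long]))
  }

  override def flatMap(event: Event, out: Collector[(String, Long)]): Unit = {
    val current = Option(count.value()).map(_.longValue).getOrElse(0L) + 1
    count.update(current)
    out.collect((event.producer, current))
  }
}
```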
5. Keyed State Backends
5
HeapKeyedStateBackend RocksDBKeyedStateBackend
-State lives in memory, on Java heap
-Operates on objects
-Think of a hash map {key obj -> state obj}
-Async snapshots supported
-State lives in off-heap memory and on disk
-Operates on bytes, uses serialization
-Think of K/V store {key bytes -> state bytes}
-Log-structured-merge (LSM) tree
-Async snapshots
-Incremental snapshots
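Choosing a backend is a configuration call on the execution environment. A sketch (the checkpoint path and interval are placeholders) that selects RocksDB and enables incremental snapshots via the `RocksDBStateBackend` constructor flag:

```scala
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment

// RocksDB keyed state backend, checkpointing to a DFS path; the second
// argument enables incremental checkpoints.
env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true))
env.enableCheckpointing(10000) // checkpoint every 10 s
```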
7. Synchronous Checkpointing
[Diagram: the checkpoint coordinator thread on the Job Manager triggers a checkpoint; on the Task Manager, the event-processing thread (loop: processElement) hands over to the checkpointing thread, which writes the state to DFS and acknowledges the checkpoint back to the coordinator.]
Why is async checkpointing so essential for large state?
8. Synchronous Checkpointing
[Same diagram as on the previous slide.]
Problem: all event processing is on hold while the snapshot is written, to avoid concurrent modifications to the state that is being written.
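The core idea behind asynchronous checkpointing can be sketched independently of Flink: capture a consistent view of the state at trigger time, then write it on a separate thread while processing continues. This toy version uses an immutable map as the consistent view; Flink's backends achieve the same effect with copy-on-write data structures (heap backend) or RocksDB snapshots:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object AsyncSnapshotSketch {
  // "Live" state, mutated by the (single) event-processing thread.
  var state: Map[Int, String] = Map(2 -> "B", 4 -> "W", 6 -> "N")

  def triggerCheckpoint(write: Map[Int, String] => Unit): Future[Unit] = {
    val snapshot = state       // cheap: immutable map, no entry copies needed
    Future { write(snapshot) } // "write state to DFS" runs off-thread
  }
}
```

Because `snapshot` is captured synchronously at trigger time, later mutations of `state` cannot leak into the written checkpoint, yet processing never blocks on the write.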
12. What we will discuss
▪ What are incremental checkpoints?
▪ Why is RocksDB so well suited for this?
▪ How do we integrate this with Flink’s
checkpointing?
13. Full Checkpointing
Example state over time (K = key, S = state):

State at checkpoint 1: {2→B, 4→W, 6→N}
State at checkpoint 2: {2→B, 3→K, 4→L, 6→N}
State at checkpoint 3: {2→Q, 3→K, 6→N, 9→S}

A full checkpoint writes the complete state every time, so each checkpoint contains everything, including entries that did not change.
14. Incremental Checkpointing
The same state over time, but each incremental checkpoint only writes the changes (Δ) since the previous one; "-" marks a deleted key:

State at c1: {2→B, 4→W, 6→N}        written Δ(-,c1)  = {2→B, 4→W, 6→N}
State at c2: {2→B, 3→K, 4→L, 6→N}   written Δ(c1,c2) = {3→K, 4→L}
State at c3: {2→Q, 3→K, 6→N, 9→S}   written Δ(c2,c3) = {2→Q, 4→-, 9→S}
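The Δs above amount to a map diff with tombstones. A toy sketch of that definition (RocksDB does not actually diff maps — the increments fall out of its immutable SSTable files — but the resulting contents match):

```scala
// Delta between two consecutive snapshots:
// Some(v) = key added or changed, None = key deleted (the "-" on the slide).
def delta[K, V](prev: Map[K, V], cur: Map[K, V]): Map[K, Option[V]] = {
  val changed: Map[K, Option[V]] =
    cur.collect { case (k, v) if !prev.get(k).contains(v) => k -> Option(v) }
  val deleted: Map[K, Option[V]] =
    (prev.keySet -- cur.keySet).map(k => k -> (None: Option[V])).toMap
  changed ++ deleted
}
```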
15. Incremental Recovery
The state of iCheckpoint 3 is rebuilt by combining the chain of deltas; for each key, the newest entry wins and "-" deletes the key:

Δ(-,c1)  = {2→B, 4→W, 6→N}
Δ(c1,c2) = {3→K, 4→L}
Δ(c2,c3) = {2→Q, 4→-, 9→S}

Recovery: Δ(-,c1) + Δ(c1,c2) + Δ(c2,c3) = {2→Q, 3→K, 6→N, 9→S}
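Recovery then replays the chain of deltas in order; a sketch:

```scala
// Rebuild the latest state from a chain of deltas (oldest first):
// later entries win per key, and None (a tombstone) removes the key.
def restore[K, V](deltas: Seq[Map[K, Option[V]]]): Map[K, V] =
  deltas.foldLeft(Map.empty[K, V]) { (state, d) =>
    d.foldLeft(state) {
      case (s, (k, Some(v))) => s + (k -> v)
      case (s, (k, None))    => s - k
    }
  }
```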
17. RocksDB Compaction
▪ A background thread merges SSTable files
▪ Removes older copies of the same key (the latest version survives)
▪ Performs the actual deletion of keys

Example merge (7→- is a deletion marker):
SSTable-1 = {2→C, 7→N, 9→Q}, SSTable-2 = {1→V, 7→-, 9→S}
merge → SSTable-3 = {1→V, 2→C, 9→S}

Compaction consolidates our Δs!
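The merge step can be sketched as "latest version per key wins, then drop tombstones":

```scala
// Toy LSM merge: tables ordered oldest-to-newest; newer tables shadow older
// ones per key, and tombstones (None) are dropped from the merged output,
// which is where keys actually get deleted.
def compact[K, V](oldToNew: Seq[Map[K, Option[V]]]): Map[K, V] =
  oldToNew
    .foldLeft(Map.empty[K, Option[V]])(_ ++ _) // latest version per key wins
    .collect { case (k, Some(v)) => k -> v }   // drop tombstones
```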
37. Incremental checkpointing benefits
▪ Incremental checkpoints can dramatically reduce the checkpointing overhead for large state.
▪ Incremental checkpoints are asynchronous.
▪ RocksDB's compaction consolidates the increments, which keeps the overhead for recovery low.
38. Incremental checkpointing limitations
▪ Breaks the unification of checkpoints and savepoints (CP: low overhead, SP: features)
▪ RocksDB-specific format.
▪ Currently no support for rescaling from an incremental checkpoint.