SlideShare una empresa de Scribd logo
1 de 40
1
Stefan Richter

@stefanrrichter



April 11, 2017
Improvements for
large state in Apache Flink
State in Streaming Programs
2
case class Event(producer: String, evtType: Int, msg: String)
case class Alert(msg: String, count: Long)
env.addSource(…)

.map(bytes => Event.parse(bytes) )
.keyBy("producer")
.mapWithState { (event: Event, state: Option[Int]) => {
// pattern rules
}
.filter(alert => alert.msg.contains("CRITICAL"))
.keyBy("msg")
.timeWindow(Time.seconds(10))
.sum("count")
Source map()
mapWith

State()
filter()
window()

sum()keyBy keyBy
State in Streaming Programs
3
case class Event(producer: String, evtType: Int, msg: String)
case class Alert(msg: String, count: Long)
env.addSource(…)

.map(bytes => Event.parse(bytes) )
.keyBy("producer")
.mapWithState { (event: Event, state: Option[Int]) => {
// pattern rules
}
.filter(alert => alert.msg.contains("CRITICAL"))
.keyBy("msg")
.timeWindow(Time.seconds(10))
.sum("count")
Source map()
mapWith

State()
filter()
window()

sum()keyBy keyBy
Stateless
Stateful
Internal vs External State
4
External State
Internal State
• State in a separate data store
• Can store "state capacity" independent
• Usually much slower than internal state
• Hard to get "exactly-once" guarantees
• State in the stream processor
• Faster than external state
• Working area local to computation
• Checkpoints to stable store (DFS)
• Always exactly-once consistent
• Stream processor has to handle scalability
Keyed State Backends
5
HeapKeyedStateBackend RocksDBKeyedStateBackend
-State lives in memory, on Java heap
-Operates on objects
-Think of a hash map {key obj -> state obj}
-Async snapshots supported
-State lives in off-heap memory and on disk
-Operates on bytes, uses serialization
-Think of K/V store {key bytes -> state bytes}
-Log-structured-merge (LSM) tree
-Async snapshots
-Incremental snapshots
Asynchronous Checkpoints
6
Synchronous Checkpointing
7
Checkpoint
Coordinator thread
Event processing
thread
Checkpointing
thread
(loop: processElement)
(trigger checkpoint) (acknowledge checkpoint)
(write state to DFS)
Task Manager
Job Manager
Why is async checkpointing so essential for large state?
Synchronous Checkpointing
8
Checkpoint
Coordinator thread
Event processing
thread
Checkpointing
thread
(loop: processElement)
(trigger checkpoint) (acknowledge checkpoint)
(write state to DFS)
Task Manager
Job Manager
Problem: All event processing is on hold here to avoid
concurrent modifications to the state that is written
Asynchronous Checkpointing
9
Checkpoint
Coordinator thread
Event processing
thread
Checkpointing
thread
(loop: processElement) (snapshot state)
(trigger checkpoint) (acknowledge checkpoint)(write state to DFS)
Task Manager
Job Manager
Asynchronous Checkpointing
10
Checkpoint
Coordinator thread
Event processing
thread
Checkpointing
thread
(loop: processElement) (snapshot state)
(trigger checkpoint) (acknowledge checkpoint)(write state to DFS)
Task Manager
Job Manager
Problem: How to deal with concurrent modifications?
Incremental Checkpoints
11
What we will discuss
▪ What are incremental checkpoints?
▪ Why is RocksDB so well suited for this?
▪ How do we integrate this with Flink’s
checkpointing?
12
Driven by and
Full Checkpointing
13
K S
2 B
4 W
6 N
K S
2 B
3 K
4 L
6 N
K S
2 Q
3 K
6 N
9 S
K S
2 B
4 W
6 N
K S
2 B
3 K
4 L
6 N
K S
2 Q
3 K
6 N
9 S
Checkpoint 1 Checkpoint 2 Checkpoint 3
time
Incremental Checkpointing
14
K S
2 B
4 W
6 N
K S
2 B
3 K
4 L
6 N
K S
2 Q
3 K
6 N
9 S
K S
2 B
4 W
6 N
K S
3 K
4 L
K S
2 Q
4 -
9 S
iCheckpoint 1 iCheckpoint 2 iCheckpoint 3
Δ(-,c1)
Δ(c1,c2)
Δ(c2,c3)
time
Incremental Recovery
15
K S
2 B
4 W
6 N
K S
2 B
3 K
4 L
6 N
K S
2 Q
3 K
6 N
9 S
K S
2 B
4 W
6 N
K S
3 K
4 L
K S
2 Q
4 -
9 S
iCheckpoint 1 iCheckpoint 2 iCheckpoint 3
Δ(-,c1)
Δ(c1,c2)
Δ(c2,c3)
time
Recovery?
+ +
RocksDB Architecture (simplified)
16
Memtable
SSTable-7
SSTable-6
Memory
…
Storage
key_1 val_1 key_2 val_2 …Index +
sorted by key
- All writes go against Memtable
- Mutable Buffer (couple MB)
- Unique keys
- Reads consider Memtable first, then SSTables
- Immutable
- We can consider newly created SSTables as Δs!
periodic
flush
periodic
merge
RocksDB Compaction
▪ Background Thread
merges SSTable files
▪ Removes copies of
the same key (latest
version survives)
▪ Actually deletion of
keys
17
2 C 7 N 9 Q 1 V 7 - 9 S
1 V 2 C 9 S
SSTable-1 SSTable-2
SSTable-3
merge
Compaction consolidates our Δs!
Flink’s Incremental Checkpointing
18
Checkpoint
Coordinator
StatefulMap
(1/3)
StatefulMap
(2/3)
StatefulMap
(3/3)
DFS
Network
SharedStateRegistry
Flink’s Incremental Checkpointing
19
Checkpoint
Coordinator
StatefulMap
(1/3)
StatefulMap
(2/3)
StatefulMap
(3/3)
DFS
Network
Step 1:
Checkpoint Coordinator sends
checkpoint barrier that triggers
a snapshot on each instance
SharedStateRegistry
Flink’s Incremental Checkpointing
20
StatefulMap
(1/3)
Checkpoint
Coordinator
StatefulMap
(2/3)
StatefulMap
(3/3)
DFS
Δ3
Δ2
Δ1
Network
Step 2:
Each instance writes its
incremental snapshot to
distributed storage
SharedStateRegistry
Flink’s Incremental Checkpointing
21
StatefulMap
(1/3)
Checkpoint
Coordinator
StatefulMap
(2/3)
StatefulMap
(3/3)
DFS
Δ3
Δ2
Δ1
Network
Step 2:
Each instance writes its
incremental snapshot to
distributed storage
SharedStateRegistry
Incremental Snapshot of Operator
22
data
manifest
01010101
00110011
10101010
11001100
01010101
00225.sst
share
Local FS
01010101
00110011
10101010
11001100
01010101
00226.sst
SharedState
Registry
Distributed FS://sfmap/1/
Incremental Snapshot of Operator
23
data
chk-1
manifest
01010101
00110011
10101010
11001100
01010101
00225.sst
manifest
01010101
00110011
10101010
11001100
01010101
00225.sst 00226.sst
01010101
00110011
10101010
11001100
01010101
++
share
Local FS
01010101
00110011
10101010
11001100
01010101
00226.sst
SharedState
Registry
copy / hardlink
Distributed FS://sfmap/1/
Incremental Snapshot of Operator
24
data
chk-1
manifest
01010101
00110011
10101010
11001100
01010101
00225.sst
manifest
01010101
00110011
10101010
11001100
01010101
00225.sst 00226.sst
01010101
00110011
10101010
11001100
01010101
++
share
01010101
00110011
10101010
11001100
01010101
00226.sst
01010101
00110011
10101010
11001100
01010101
00225.sst
manifest sst.list
Local FS
SharedState
Registry
01010101
00110011
10101010
11001100
01010101
00226.sst
chk-1
Distributed FS://sfmap/1/
List of SSTables
referenced by snapshot
async upload to DFS
Flink’s Incremental Checkpointing
25
StatefulMap
(1/3)
Checkpoint
Coordinator
StatefulMap
(2/3)
StatefulMap
(3/3)
DFS
H1
H2
H3Network
Step 3:
Each instance acknowledges and sends
a handle (e.g. file path in DFS) to the
Checkpoint Coordinator.
SharedStateRegistry
Δ3
Δ2
Δ1
Incremental Snapshot of Operator
26
data
chk-1
manifest
01010101
00110011
10101010
11001100
01010101
00225.sst
manifest
01010101
00110011
10101010
11001100
01010101
00225.sst 00226.sst
01010101
00110011
10101010
11001100
01010101
++
share
01010101
00110011
10101010
11001100
01010101
00226.sst
01010101
00110011
10101010
11001100
01010101
00225.sst
manifest sst.list
Local FS
SharedState
Registry
01010101
00110011
10101010
11001100
01010101
00226.sst
chk-1
{00225.sst = 1}
{00226.sst = 1}
Distributed FS://sfmap/1/
Flink’s Incremental Checkpointing
27
StatefulMap
(1/3)
Checkpoint
Coordinator
StatefulMap
(2/3)
StatefulMap
(3/3)
DFS
H1
H3
CP 1
H2
Network
SharedStateRegistry
Δ3
Δ2
Δ1Step 4:
Checkpoint Coordinator signals CP1 success
to all instances.
Flink’s Incremental Checkpointing
28
Checkpoint
Coordinator
StatefulMap
(1/3)
StatefulMap
(2/3)
StatefulMap
(3/3)
DFS
Network
SharedStateRegistry
H1
H3
CP 1
H2
What happens when another CP is triggered?
Δ3
Δ2
Δ1
Incremental Snapshot of Operator
29
data
chk-1
manifest
01010101
00110011
10101010
11001100
01010101
00226.sst 00228.sst 00229.sst
01010101
00110011
10101010
11001100
01010101
01010101
00110011
10101010
11001100
01010101
share
01010101
00110011
10101010
11001100
01010101
00226.sst
01010101
00110011
10101010
11001100
01010101
00225.sst
manifest sst.list
Local FS
{00225.sst = 1}
{00226.sst = 1}
SharedState
Registry
Distributed FS://sfmap/1/
Incremental Snapshot of Operator
30
data
chk-2 chk-1
manifest
01010101
00110011
10101010
11001100
01010101
00226.sst 00228.sst 00229.sst
01010101
00110011
10101010
11001100
01010101
01010101
00110011
10101010
11001100
01010101
manifest
01010101
00110011
10101010
11001100
01010101
00226.sst 00228.sst
01010101
00110011
10101010
11001100
01010101
01010101
00110011
10101010
11001100
01010101
+ +
share
01010101
00110011
10101010
11001100
01010101
00226.sst
01010101
00110011
10101010
11001100
01010101
00225.sst
manifest sst.list
Local FS
{00225.sst = 1}
{00226.sst = 1}
SharedState
Registry
Distributed FS://sfmap/1/
00229.sst
Incremental Snapshot of Operator
31
data
chk-1
manifest
01010101
00110011
10101010
11001100
01010101
00226.sst 00228.sst 00229.sst
01010101
00110011
10101010
11001100
01010101
01010101
00110011
10101010
11001100
01010101
share
01010101
00110011
10101010
11001100
01010101
00226.sst 00228.sst 00229.sst
01010101
00110011
10101010
11001100
01010101
01010101
00110011
10101010
11001100
01010101
01010101
00110011
10101010
11001100
01010101
00225.sst
manifest sst.list
chk-2
manifest sst.list
Local FS
{00225.sst = 1}
{00226.sst = 2}
{00228.sst = 1}
{00229.sst = 1}
SharedState
Registry
Distributed FS://sfmap/1/
chk-2
manifest
01010101
00110011
10101010
11001100
01010101
00226.sst 00228.sst 00229.sst
01010101
00110011
10101010
11001100
01010101
01010101
00110011
10101010
11001100
01010101
+ +
upload missing SSTable files
Deleting Incremental Checkpoints
32
StatefulMap
(1/3)
Checkpoint
Coordinator
StatefulMap
(2/3)
StatefulMap
(3/3)
DFS
H1
H3
CP 1
H1
H2H3
CP 2
H2
Network
Deleting an outdated checkpoint
SharedStateRegistry
Δ3
Δ2
Δ1
Δ3
Δ2
Δ1
Deleting Incremental Snapshot
33
data
chk-1
manifest
01010101
00110011
10101010
11001100
01010101
00226.sst 00228.sst 00229.sst
01010101
00110011
10101010
11001100
01010101
01010101
00110011
10101010
11001100
01010101
share
01010101
00110011
10101010
11001100
01010101
00226.sst 00228.sst 00229.sst
01010101
00110011
10101010
11001100
01010101
01010101
00110011
10101010
11001100
01010101
01010101
00110011
10101010
11001100
01010101
00225.sst
manifest sst.list
chk-2
manifest sst.list
Local FS
{00225.sst = 1}
{00226.sst = 2}
{00228.sst = 1}
{00229.sst = 1}
SharedState
Registry
Distributed FS://sfmap/1/
Deleting Incremental Snapshot
34
data
manifest
01010101
00110011
10101010
11001100
01010101
00226.sst 00228.sst 00229.sst
01010101
00110011
10101010
11001100
01010101
01010101
00110011
10101010
11001100
01010101
share
01010101
00110011
10101010
11001100
01010101
00226.sst 00228.sst 00229.sst
01010101
00110011
10101010
11001100
01010101
01010101
00110011
10101010
11001100
01010101
01010101
00110011
10101010
11001100
01010101
00225.sst
chk-2
manifest sst.list
Local FS
{00225.sst = 0}
{00226.sst = 1}
{00228.sst = 1}
{00229.sst = 1}
SharedState
Registry
Distributed FS://sfmap/1/
Deleting Incremental Snapshot
35
data
manifest
01010101
00110011
10101010
11001100
01010101
00226.sst 00228.sst 00229.sst
01010101
00110011
10101010
11001100
01010101
01010101
00110011
10101010
11001100
01010101
share
01010101
00110011
10101010
11001100
01010101
00226.sst 00228.sst 00229.sst
01010101
00110011
10101010
11001100
01010101
01010101
00110011
10101010
11001100
01010101
chk-2
manifest sst.list
Local FS
{00226.sst = 1}
{00228.sst = 1}
{00229.sst = 1}
SharedState
Registry
Distributed FS://sfmap/1/
Wrapping up
36
Incremental checkpointing benefits
▪ Incremental checkpoints can dramatically
reduce CP overhead for large state.
▪ Incremental checkpoints are async.
▪ RocksDB’s compaction consolidates the
increments. Keeps overhead low for
recovery.
37
Incremental checkpointing limitations
▪ Breaks the unification of checkpoints and
savepoints (CP: low overhead, SP: features)
▪ RocksDB specific format.
▪ Currently no support for rescaling from
incremental checkpoint.
38
Further improvements in Flink 1.3/4
▪ AsyncHeapKeyedStateBackend (merged)
▪ AsyncHeapOperatorStateBackend (PR)
▪ MapState (merged)
▪ RocksDBInternalTimerService (PR)
▪ AsyncHeapInternalTimerService
39
Questions?
40

Más contenido relacionado

La actualidad más candente

Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward
 
Fabian Hueske - Stream Analytics with SQL on Apache Flink
Fabian Hueske - Stream Analytics with SQL on Apache FlinkFabian Hueske - Stream Analytics with SQL on Apache Flink
Fabian Hueske - Stream Analytics with SQL on Apache Flink
Ververica
 

La actualidad más candente (20)

Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy FarkasVirtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas
 
Flink forward SF 2017: Ufuk Celebi - The Stream Processor as a Database: Buil...
Flink forward SF 2017: Ufuk Celebi - The Stream Processor as a Database: Buil...Flink forward SF 2017: Ufuk Celebi - The Stream Processor as a Database: Buil...
Flink forward SF 2017: Ufuk Celebi - The Stream Processor as a Database: Buil...
 
Flink Forward Berlin 2017: Kostas Kloudas - Complex Event Processing with Fli...
Flink Forward Berlin 2017: Kostas Kloudas - Complex Event Processing with Fli...Flink Forward Berlin 2017: Kostas Kloudas - Complex Event Processing with Fli...
Flink Forward Berlin 2017: Kostas Kloudas - Complex Event Processing with Fli...
 
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
Stephan Ewen - Stream Processing as a Foundational Paradigm and Apache Flink'...
 
Flink Forward SF 2017: Konstantinos Kloudas - Extending Flink’s Streaming APIs
Flink Forward SF 2017: Konstantinos Kloudas -  Extending Flink’s Streaming APIsFlink Forward SF 2017: Konstantinos Kloudas -  Extending Flink’s Streaming APIs
Flink Forward SF 2017: Konstantinos Kloudas - Extending Flink’s Streaming APIs
 
Kostas Kloudas - Complex Event Processing with Flink: the state of FlinkCEP
Kostas Kloudas - Complex Event Processing with Flink: the state of FlinkCEP Kostas Kloudas - Complex Event Processing with Flink: the state of FlinkCEP
Kostas Kloudas - Complex Event Processing with Flink: the state of FlinkCEP
 
Flink Forward Berlin 2017: Stephan Ewen - The State of Flink and how to adopt...
Flink Forward Berlin 2017: Stephan Ewen - The State of Flink and how to adopt...Flink Forward Berlin 2017: Stephan Ewen - The State of Flink and how to adopt...
Flink Forward Berlin 2017: Stephan Ewen - The State of Flink and how to adopt...
 
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
 
Fabian Hueske - Stream Analytics with SQL on Apache Flink
Fabian Hueske - Stream Analytics with SQL on Apache FlinkFabian Hueske - Stream Analytics with SQL on Apache Flink
Fabian Hueske - Stream Analytics with SQL on Apache Flink
 
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache FlinkTzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
 
Kostas Kloudas - Extending Flink's Streaming APIs
Kostas Kloudas - Extending Flink's Streaming APIsKostas Kloudas - Extending Flink's Streaming APIs
Kostas Kloudas - Extending Flink's Streaming APIs
 
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache BeamAljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
 
Apache Flink's Table & SQL API - unified APIs for batch and stream processing
Apache Flink's Table & SQL API - unified APIs for batch and stream processingApache Flink's Table & SQL API - unified APIs for batch and stream processing
Apache Flink's Table & SQL API - unified APIs for batch and stream processing
 
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
What's new in 1.9.0 blink planner - Kurt Young, AlibabaWhat's new in 1.9.0 blink planner - Kurt Young, Alibaba
What's new in 1.9.0 blink planner - Kurt Young, Alibaba
 
Flink Forward Berlin 2017: Patrick Gunia - Migration of a realtime stats prod...
Flink Forward Berlin 2017: Patrick Gunia - Migration of a realtime stats prod...Flink Forward Berlin 2017: Patrick Gunia - Migration of a realtime stats prod...
Flink Forward Berlin 2017: Patrick Gunia - Migration of a realtime stats prod...
 
Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016Continuous Processing with Apache Flink - Strata London 2016
Continuous Processing with Apache Flink - Strata London 2016
 
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
 
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...
 
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
 
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015
 

Similar a Flink Forward SF 2017: Stefan Richter - Improvements for large state and recovery in Flink

Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
confluent
 

Similar a Flink Forward SF 2017: Stefan Richter - Improvements for large state and recovery in Flink (20)

Stephan Ewen - Scaling to large State
Stephan Ewen - Scaling to large StateStephan Ewen - Scaling to large State
Stephan Ewen - Scaling to large State
 
The Stream Processor as a Database Apache Flink
The Stream Processor as a Database Apache FlinkThe Stream Processor as a Database Apache Flink
The Stream Processor as a Database Apache Flink
 
The Stream Processor as the Database - Apache Flink @ Berlin buzzwords
The Stream Processor as the Database - Apache Flink @ Berlin buzzwords   The Stream Processor as the Database - Apache Flink @ Berlin buzzwords
The Stream Processor as the Database - Apache Flink @ Berlin buzzwords
 
Building Applications with Streams and Snapshots
Building Applications with Streams and SnapshotsBuilding Applications with Streams and Snapshots
Building Applications with Streams and Snapshots
 
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data ArtisansApache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
 
Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
 
Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink
 
State Management in Apache Flink : Consistent Stateful Distributed Stream Pro...
State Management in Apache Flink : Consistent Stateful Distributed Stream Pro...State Management in Apache Flink : Consistent Stateful Distributed Stream Pro...
State Management in Apache Flink : Consistent Stateful Distributed Stream Pro...
 
Real-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop SummitReal-time Stream Processing with Apache Flink @ Hadoop Summit
Real-time Stream Processing with Apache Flink @ Hadoop Summit
 
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
 
The Nuts and Bolts of Kafka Streams---An Architectural Deep Dive
The Nuts and Bolts of Kafka Streams---An Architectural Deep DiveThe Nuts and Bolts of Kafka Streams---An Architectural Deep Dive
The Nuts and Bolts of Kafka Streams---An Architectural Deep Dive
 
Going Reactive with Relational Databases
Going Reactive with Relational DatabasesGoing Reactive with Relational Databases
Going Reactive with Relational Databases
 
Apache Flink Stream Processing
Apache Flink Stream ProcessingApache Flink Stream Processing
Apache Flink Stream Processing
 
Do flink on web with flow - Dongwon Kim & Haemee park, SK Telecom)
Do flink on web with flow - Dongwon Kim & Haemee park, SK Telecom)Do flink on web with flow - Dongwon Kim & Haemee park, SK Telecom)
Do flink on web with flow - Dongwon Kim & Haemee park, SK Telecom)
 
Do Flink on Web with FLOW
Do Flink on Web with FLOWDo Flink on Web with FLOW
Do Flink on Web with FLOW
 
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...
Flink Forward Berlin 2018: Steven Wu - "Failure is not fatal: what is your re...
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
 
Productionizing your Streaming Jobs
Productionizing your Streaming JobsProductionizing your Streaming Jobs
Productionizing your Streaming Jobs
 

Más de Flink Forward

Más de Flink Forward (20)

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 

Último

Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Último (20)

Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 

Flink Forward SF 2017: Stefan Richter - Improvements for large state and recovery in Flink

  • 1. 1 Stefan Richter
 @stefanrrichter
 
 April 11, 2017 Improvements for large state in Apache Flink
  • 2. State in Streaming Programs 2 case class Event(producer: String, evtType: Int, msg: String) case class Alert(msg: String, count: Long) env.addSource(…)
 .map(bytes => Event.parse(bytes) ) .keyBy("producer") .mapWithState { (event: Event, state: Option[Int]) => { // pattern rules } .filter(alert => alert.msg.contains("CRITICAL")) .keyBy("msg") .timeWindow(Time.seconds(10)) .sum("count") Source map() mapWith
 State() filter() window()
 sum()keyBy keyBy
  • 3. State in Streaming Programs 3 case class Event(producer: String, evtType: Int, msg: String) case class Alert(msg: String, count: Long) env.addSource(…)
 .map(bytes => Event.parse(bytes) ) .keyBy("producer") .mapWithState { (event: Event, state: Option[Int]) => { // pattern rules } .filter(alert => alert.msg.contains("CRITICAL")) .keyBy("msg") .timeWindow(Time.seconds(10)) .sum("count") Source map() mapWith
 State() filter() window()
 sum()keyBy keyBy Stateless Stateful
  • 4. Internal vs External State 4 External State Internal State • State in a separate data store • Can store "state capacity" independent • Usually much slower than internal state • Hard to get "exactly-once" guarantees • State in the stream processor • Faster than external state • Working area local to computation • Checkpoints to stable store (DFS) • Always exactly-once consistent • Stream processor has to handle scalability
  • 5. Keyed State Backends 5 HeapKeyedStateBackend RocksDBKeyedStateBackend -State lives in memory, on Java heap -Operates on objects -Think of a hash map {key obj -> state obj} -Async snapshots supported -State lives in off-heap memory and on disk -Operates on bytes, uses serialization -Think of K/V store {key bytes -> state bytes} -Log-structured-merge (LSM) tree -Async snapshots -Incremental snapshots
  • 7. Synchronous Checkpointing 7 Checkpoint Coordinator thread Event processing thread Checkpointing thread (loop: processElement) (trigger checkpoint) (acknowledge checkpoint) (write state to DFS) Task Manager Job Manager Why is async checkpointing so essential for large state?
  • 8. Synchronous Checkpointing 8 Checkpoint Coordinator thread Event processing thread Checkpointing thread (loop: processElement) (trigger checkpoint) (acknowledge checkpoint) (write state to DFS) Task Manager Job Manager Problem: All event processing is on hold here to avoid concurrent modifications to the state that is written
  • 9. Asynchronous Checkpointing 9 Checkpoint Coordinator thread Event processing thread Checkpointing thread (loop: processElement) (snapshot state) (trigger checkpoint) (acknowledge checkpoint)(write state to DFS) Task Manager Job Manager
  • 10. Asynchronous Checkpointing 10 Checkpoint Coordinator thread Event processing thread Checkpointing thread (loop: processElement) (snapshot state) (trigger checkpoint) (acknowledge checkpoint)(write state to DFS) Task Manager Job Manager Problem: How to deal with concurrent modifications?
  • 12. What we will discuss ▪ What are incremental checkpoints? ▪ Why is RocksDB so well suited for this? ▪ How do we integrate this with Flink’s checkpointing? 12 Driven by and
  • 13. Full Checkpointing 13 K S 2 B 4 W 6 N K S 2 B 3 K 4 L 6 N K S 2 Q 3 K 6 N 9 S K S 2 B 4 W 6 N K S 2 B 3 K 4 L 6 N K S 2 Q 3 K 6 N 9 S Checkpoint 1 Checkpoint 2 Checkpoint 3 time
  • 14. Incremental Checkpointing 14 K S 2 B 4 W 6 N K S 2 B 3 K 4 L 6 N K S 2 Q 3 K 6 N 9 S K S 2 B 4 W 6 N K S 3 K 4 L K S 2 Q 4 - 9 S iCheckpoint 1 iCheckpoint 2 iCheckpoint 3 Δ(-,c1) Δ(c1,c2) Δ(c2,c3) time
  • 15. Incremental Recovery 15 K S 2 B 4 W 6 N K S 2 B 3 K 4 L 6 N K S 2 Q 3 K 6 N 9 S K S 2 B 4 W 6 N K S 3 K 4 L K S 2 Q 4 - 9 S iCheckpoint 1 iCheckpoint 2 iCheckpoint 3 Δ(-,c1) Δ(c1,c2) Δ(c2,c3) time Recovery? + +
  • 16. RocksDB Architecture (simplified) 16 Memtable SSTable-7 SSTable-6 Memory … Storage key_1 val_1 key_2 val_2 …Index + sorted by key - All writes go against Memtable - Mutable Buffer (couple MB) - Unique keys - Reads consider Memtable first, then SSTables - Immutable - We can consider newly created SSTables as Δs! periodic flush periodic merge
  • 17. RocksDB Compaction ▪ Background Thread merges SSTable files ▪ Removes copies of the same key (latest version survives) ▪ Actually deletion of keys 17 2 C 7 N 9 Q 1 V 7 - 9 S 1 V 2 C 9 S SSTable-1 SSTable-2 SSTable-3 merge Compaction consolidates our Δs!
  • 19. Flink’s Incremental Checkpointing 19 Checkpoint Coordinator StatefulMap (1/3) StatefulMap (2/3) StatefulMap (3/3) DFS Network Step 1: Checkpoint Coordinator sends checkpoint barrier that triggers a snapshot on each instance SharedStateRegistry
  • 22. Incremental Snapshot of Operator 22 data manifest 01010101 00110011 10101010 11001100 01010101 00225.sst share Local FS 01010101 00110011 10101010 11001100 01010101 00226.sst SharedState Registry Distributed FS://sfmap/1/
  • 23. Incremental Snapshot of Operator 23 data chk-1 manifest 01010101 00110011 10101010 11001100 01010101 00225.sst manifest 01010101 00110011 10101010 11001100 01010101 00225.sst 00226.sst 01010101 00110011 10101010 11001100 01010101 ++ share Local FS 01010101 00110011 10101010 11001100 01010101 00226.sst SharedState Registry copy / hardlink Distributed FS://sfmap/1/
  • 24. Incremental Snapshot of Operator 24 data chk-1 manifest 01010101 00110011 10101010 11001100 01010101 00225.sst manifest 01010101 00110011 10101010 11001100 01010101 00225.sst 00226.sst 01010101 00110011 10101010 11001100 01010101 ++ share 01010101 00110011 10101010 11001100 01010101 00226.sst 01010101 00110011 10101010 11001100 01010101 00225.sst manifest sst.list Local FS SharedState Registry 01010101 00110011 10101010 11001100 01010101 00226.sst chk-1 Distributed FS://sfmap/1/ List of SSTables referenced by snapshot async upload to DFS
  • 25. Flink’s Incremental Checkpointing 25 StatefulMap (1/3) Checkpoint Coordinator StatefulMap (2/3) StatefulMap (3/3) DFS H1 H2 H3Network Step 3: Each instance acknowledges and sends a handle (e.g. file path in DFS) to the Checkpoint Coordinator. SharedStateRegistry Δ3 Δ2 Δ1
  • 26. Incremental Snapshot of Operator 26 data chk-1 manifest 01010101 00110011 10101010 11001100 01010101 00225.sst manifest 01010101 00110011 10101010 11001100 01010101 00225.sst 00226.sst 01010101 00110011 10101010 11001100 01010101 ++ share 01010101 00110011 10101010 11001100 01010101 00226.sst 01010101 00110011 10101010 11001100 01010101 00225.sst manifest sst.list Local FS SharedState Registry 01010101 00110011 10101010 11001100 01010101 00226.sst chk-1 {00225.sst = 1} {00226.sst = 1} Distributed FS://sfmap/1/
  • 27. Flink’s Incremental Checkpointing 27 StatefulMap (1/3) Checkpoint Coordinator StatefulMap (2/3) StatefulMap (3/3) DFS H1 H3 CP 1 H2 Network SharedStateRegistry Δ3 Δ2 Δ1Step 4: Checkpoint Coordinator signals CP1 success to all instances.
  • 29. Incremental Snapshot of Operator 29 data chk-1 manifest 01010101 00110011 10101010 11001100 01010101 00226.sst 00228.sst 00229.sst 01010101 00110011 10101010 11001100 01010101 01010101 00110011 10101010 11001100 01010101 share 01010101 00110011 10101010 11001100 01010101 00226.sst 01010101 00110011 10101010 11001100 01010101 00225.sst manifest sst.list Local FS {00225.sst = 1} {00226.sst = 1} SharedState Registry Distributed FS://sfmap/1/
  • 30. Incremental Snapshot of Operator 30 data chk-2 chk-1 manifest 01010101 00110011 10101010 11001100 01010101 00226.sst 00228.sst 00229.sst 01010101 00110011 10101010 11001100 01010101 01010101 00110011 10101010 11001100 01010101 manifest 01010101 00110011 10101010 11001100 01010101 00226.sst 00228.sst 01010101 00110011 10101010 11001100 01010101 01010101 00110011 10101010 11001100 01010101 + + share 01010101 00110011 10101010 11001100 01010101 00226.sst 01010101 00110011 10101010 11001100 01010101 00225.sst manifest sst.list Local FS {00225.sst = 1} {00226.sst = 1} SharedState Registry Distributed FS://sfmap/1/ 00229.sst
  • 31. Incremental Snapshot of Operator 31 data chk-1 manifest 01010101 00110011 10101010 11001100 01010101 00226.sst 00228.sst 00229.sst 01010101 00110011 10101010 11001100 01010101 01010101 00110011 10101010 11001100 01010101 share 01010101 00110011 10101010 11001100 01010101 00226.sst 00228.sst 00229.sst 01010101 00110011 10101010 11001100 01010101 01010101 00110011 10101010 11001100 01010101 01010101 00110011 10101010 11001100 01010101 00225.sst manifest sst.list chk-2 manifest sst.list Local FS {00225.sst = 1} {00226.sst = 2} {00228.sst = 1} {00229.sst = 1} SharedState Registry Distributed FS://sfmap/1/ chk-2 manifest 01010101 00110011 10101010 11001100 01010101 00226.sst 00228.sst 00229.sst 01010101 00110011 10101010 11001100 01010101 01010101 00110011 10101010 11001100 01010101 + + upload missing SSTable files
  • 32. Deleting Incremental Checkpoints 32 StatefulMap (1/3) Checkpoint Coordinator StatefulMap (2/3) StatefulMap (3/3) DFS H1 H3 CP 1 H1 H2H3 CP 2 H2 Network Deleting an outdated checkpoint SharedStateRegistry Δ3 Δ2 Δ1 Δ3 Δ2 Δ1
  • 33. Deleting Incremental Snapshot 33 data chk-1 manifest 01010101 00110011 10101010 11001100 01010101 00226.sst 00228.sst 00229.sst 01010101 00110011 10101010 11001100 01010101 01010101 00110011 10101010 11001100 01010101 share 01010101 00110011 10101010 11001100 01010101 00226.sst 00228.sst 00229.sst 01010101 00110011 10101010 11001100 01010101 01010101 00110011 10101010 11001100 01010101 01010101 00110011 10101010 11001100 01010101 00225.sst manifest sst.list chk-2 manifest sst.list Local FS {00225.sst = 1} {00226.sst = 2} {00228.sst = 1} {00229.sst = 1} SharedState Registry Distributed FS://sfmap/1/
  • 34. Deleting Incremental Snapshot 34 data manifest 01010101 00110011 10101010 11001100 01010101 00226.sst 00228.sst 00229.sst 01010101 00110011 10101010 11001100 01010101 01010101 00110011 10101010 11001100 01010101 share 01010101 00110011 10101010 11001100 01010101 00226.sst 00228.sst 00229.sst 01010101 00110011 10101010 11001100 01010101 01010101 00110011 10101010 11001100 01010101 01010101 00110011 10101010 11001100 01010101 00225.sst chk-2 manifest sst.list Local FS {00225.sst = 0} {00226.sst = 1} {00228.sst = 1} {00229.sst = 1} SharedState Registry Distributed FS://sfmap/1/
  • 35. Deleting Incremental Snapshot 35 data manifest 01010101 00110011 10101010 11001100 01010101 00226.sst 00228.sst 00229.sst 01010101 00110011 10101010 11001100 01010101 01010101 00110011 10101010 11001100 01010101 share 01010101 00110011 10101010 11001100 01010101 00226.sst 00228.sst 00229.sst 01010101 00110011 10101010 11001100 01010101 01010101 00110011 10101010 11001100 01010101 chk-2 manifest sst.list Local FS {00226.sst = 1} {00228.sst = 1} {00229.sst = 1} SharedState Registry Distributed FS://sfmap/1/
  • 37. Incremental checkpointing benefits ▪ Incremental checkpoints can dramatically reduce CP overhead for large state. ▪ Incremental checkpoints are async. ▪ RocksDB’s compaction consolidates the increments. Keeps overhead low for recovery. 37
  • 38. Incremental checkpointing limitations ▪ Breaks the unification of checkpoints and savepoints (CP: low overhead, SP: features) ▪ RocksDB specific format. ▪ Currently no support for rescaling from incremental checkpoint. 38
  • 39. Further improvements in Flink 1.3/4 ▪ AsyncHeapKeyedStateBackend (merged) ▪ AsyncHeapOperatorStateBackend (PR) ▪ MapState (merged) ▪ RocksDBInternalTimerService (PR) ▪ AsyncHeapInternalTimerService 39