Flink Forward Berlin 2017: Stefan Richter - A look at Flink's internal data structures and algorithms for efficient checkpointing

Stateful stream processing with exactly-once guarantees is one of Apache Flink's distinctive features, and we have observed that the scale of state managed by Flink in production is constantly growing. This development created new challenges for state management in Flink, in particular for state checkpointing, which is the core of Flink's fault tolerance mechanism. Two of the most important problems we had to solve were the following: (i) how can we limit the duration and size of checkpoints to something that does not grow linearly with the size of the state, and (ii) how can we take checkpoints without blocking the processing pipeline in the meantime? We have implemented incremental checkpoints to solve the first problem by checkpointing only the changes between checkpoints, instead of always recording the whole state. Asynchronous checkpoints address the second problem and enable Flink to continue processing concurrently with running checkpoints. In this talk, we will take a deep dive into the details of Flink's new checkpointing features. In particular, we will talk about the underlying data structures, log-structured merge trees and copy-on-write hash tables, and how those building blocks are assembled and orchestrated to advance Flink's checkpointing.

1. Stefan Richter (@StefanRRichter), September 13, 2017: A Look at Flink’s Internal Data Structures for Efficient Checkpointing
2. Flink State and Distributed Snapshots: [diagram: events flow from a Source into a Stateful Operation, which keeps its state in a State Backend]
3. Flink State and Distributed Snapshots, „Asynchronous Barrier Snapshotting“: [diagram: a checkpoint is triggered and a checkpoint barrier is injected at the Source, flowing towards the Stateful Operation]
4. Flink State and Distributed Snapshots, „Asynchronous Barrier Snapshotting“: [diagram: when the barrier reaches the Stateful Operation, a state snapshot is triggered synchronously, e.g. via copy-on-write]
5. Flink State and Distributed Snapshots: [diagram: the processing pipeline continues while full snapshots are durably persisted to DFS asynchronously; for each state backend: one writer, n readers]
6. Challenges for Data Structures
   ▪ Asynchronous Checkpoints:
      ▪ Minimize pipeline stall time while taking the snapshot.
      ▪ Keep overhead (memory, CPU, …) as low as possible while writing the snapshot.
      ▪ Support multiple parallel checkpoints.
7. Flink State and Distributed Snapshots: [diagram: the processing pipeline continues while the incremental snapshots ΔΔ (differences from previous snapshots) are durably persisted to DFS asynchronously]
8. Full Snapshots: [diagram: over time, the task-local State Backend table is written in full to the Checkpoint Directory (DFS) at every checkpoint as updates arrive: Checkpoint 1 = {2:B, 4:W, 6:N}, Checkpoint 2 = {2:B, 3:K, 4:L, 6:N}, Checkpoint 3 = {2:Q, 3:K, 6:N, 9:S}]
9. Incremental Snapshots: [diagram: only the differences reach the Checkpoint Directory (DFS): iCheckpoint 1 = Δ(-, icp1) = {2:B, 4:W, 6:N}, iCheckpoint 2 = Δ(icp1, icp2) = {3:K, 4:L}, iCheckpoint 3 = Δ(icp2, icp3) = {2:Q, 4:-, 9:S}]
10. Incremental vs Full Snapshots: [chart: checkpoint duration plotted against state size (5 TB, 10 TB, 15 TB) for incremental and full snapshots]
11. Incremental Recovery: [diagram: recovery of the task-local State Backend combines the retained deltas Δ(-, icp1) + Δ(icp1, icp2) + Δ(icp2, icp3) from the Checkpoint Directory (DFS)]
12. Challenges for Data Structures
   ▪ Incremental checkpoints:
      ▪ Efficiently detect the (minimal) set of state changes between two checkpoints.
      ▪ Avoid unbounded checkpoint history.
13. Two Keyed State Backends
   HeapKeyedStateBackend:
      • State lives in memory, on the Java heap.
      • Operates on objects.
      • Think of a hash map {key obj -> state obj}.
      • Goes through ser/de during snapshot/restore.
      • Async snapshots supported.
   RocksDBKeyedStateBackend:
      • State lives in off-heap memory and on disk.
      • Operates on serialized bytes.
      • Think of a K/V store {key bytes -> state bytes}.
      • Based on a log-structured merge (LSM) tree.
      • Goes through ser/de for each state access.
      • Async and incremental snapshots.
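Not on the slides, but as context for the comparison above: a minimal sketch of how these two backends were typically selected in a job around the time of this talk (Flink 1.3/1.4 APIs). The checkpoint URI is a placeholder; the boolean flags enable incremental checkpoints for RocksDB and asynchronous snapshots for the heap backend.

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class BackendSelection {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(10_000);  // trigger a checkpoint every 10 seconds

            // RocksDBKeyedStateBackend: the second argument enables incremental checkpoints
            env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

            // Alternative, HeapKeyedStateBackend via FsStateBackend:
            // the second argument enables asynchronous snapshots
            // env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints", true));
        }
    }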
14. Asynchronous and Incremental Checkpoints with the RocksDB Keyed-State Backend
15. What is RocksDB?
   ▪ Ordered Key/Value store (bytes).
   ▪ Based on LSM trees.
   ▪ Embedded, written in C++, used via JNI.
   ▪ Write-optimized (all bulk sequential), with acceptable read performance.
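To make the bullet points above concrete, a small standalone example of using RocksDB from Java through its JNI binding; the path and keys are made up for illustration.

    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    public class RocksDbBasics {
        public static void main(String[] args) throws RocksDBException {
            RocksDB.loadLibrary();  // loads the embedded C++ library via JNI
            try (Options options = new Options().setCreateIfMissing(true);
                 RocksDB db = RocksDB.open(options, "/tmp/rocksdb-demo")) {
                // an ordered byte[] -> byte[] store: keys are kept sorted
                db.put("2".getBytes(), "T".getBytes());
                db.put("8".getBytes(), "W".getBytes());
                System.out.println(new String(db.get("2".getBytes())));  // prints "T"
                db.delete("8".getBytes());  // deletes are written as explicit tombstone entries
            }
        }
    }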
16. RocksDB Architecture (simplified): [diagram: in memory, the Memtable holds K:2 V:T and K:8 V:W; the local disk is still empty]
   Memtable:
      - Mutable memory buffer for K/V pairs (bytes).
      - All reads / writes go here first.
      - Aggregating (overrides same unique keys).
      - Asynchronously flushed to disk when full.
17. RocksDB Architecture (simplified): [diagram: the Memtable now holds K:4 V:Q; the flushed pairs K:2 V:T and K:8 V:W live on local disk as SSTable-(1) with an index and Bloom filter BF-1]
   - A flushed Memtable becomes immutable.
   - Sorted by key (unique).
   - Read: first check the Memtable, then the SSTables.
   - Bloom filter & index as optimisations.
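A toy sketch of the read path just described, in plain Java collections rather than RocksDB code: check the Memtable first, then the SSTables from newest to oldest, and skip files whose Bloom-filter-like membership check rules the key out. The HashSet below merely stands in for the real Bloom filter and index.

    import java.util.*;

    public class ReadPathSketch {
        static class SsTable {
            final NavigableMap<Integer, String> data;   // sorted by key, immutable after flush
            final Set<Integer> mayContain;              // stand-in for the Bloom filter
            SsTable(NavigableMap<Integer, String> data) {
                this.data = data;
                this.mayContain = new HashSet<>(data.keySet());
            }
        }

        // Memtable first, then SSTables newest-to-oldest; the first hit wins.
        static String get(int key, NavigableMap<Integer, String> memtable, List<SsTable> newestFirst) {
            if (memtable.containsKey(key)) {
                return memtable.get(key);               // may be null, i.e. a tombstone
            }
            for (SsTable sst : newestFirst) {
                if (sst.mayContain.contains(key) && sst.data.containsKey(key)) {
                    return sst.data.get(key);
                }
            }
            return null;                                // key not present
        }
    }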
18. RocksDB Architecture (simplified): [diagram: the Memtable holds K:2 V:A and K:5 V:N; local disk holds SSTable-(1) = {2:T, 8:W}, SSTable-(2) = {4:Q, 7:S} and SSTable-(3) = {2:C, 7:-}, each with an index and Bloom filter BF-1/BF-2/BF-3]
   - Deletes are explicit entries.
   - Natural support for snapshots.
   - Iteration: online merge.
   - SSTables can accumulate.
19. RocksDB Compaction (simplified): [diagram: „Log Compaction“: a multiway sorted merge of SSTable-(3) = {2:C, 7:-}, SSTable-(2) = {4:Q, 7:S} and SSTable-(1) = {2:T, 8:W} produces SSTable-(1, 2, 3) = {2:C, 4:Q, 8:W}; the tombstoned key 7 is removed (rm) and the input SSTables are dropped]
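A sketch of the merge semantics behind compaction, again in plain Java rather than RocksDB's streaming heap merge: the newest occurrence of a key wins and tombstones drop deleted keys. Fed with the SSTables from the slide, newest first, SSTable-(3) = {2:C, 7:tombstone}, SSTable-(2) = {4:Q, 7:S}, SSTable-(1) = {2:T, 8:W}, it produces {2:C, 4:Q, 8:W}.

    import java.util.*;

    public class CompactionSketch {
        // Inputs are ordered newest first; a null value marks a tombstone (delete).
        static NavigableMap<Integer, String> compact(List<NavigableMap<Integer, String>> newestFirst) {
            NavigableMap<Integer, String> merged = new TreeMap<>();
            Set<Integer> seen = new HashSet<>();
            for (NavigableMap<Integer, String> sstable : newestFirst) {
                for (Map.Entry<Integer, String> e : sstable.entrySet()) {
                    // the first (newest) occurrence of a key wins; tombstones suppress the key
                    if (seen.add(e.getKey()) && e.getValue() != null) {
                        merged.put(e.getKey(), e.getValue());
                    }
                }
            }
            return merged;
        }
    }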
20. RocksDB Asynchronous Checkpoint (simplified)
   ▪ Flush the Memtable.
   ▪ Then create an iterator over all current SSTables (they are immutable).
   ▪ New changes go to the Memtable and future SSTables and are not considered by the iterator.
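The slide simplifies this to "flush, then iterate over the immutable SSTables". In the RocksDB Java API, a consistent, immutable view of this kind can be obtained with a snapshot plus an iterator, roughly as below (an illustrative standalone example, not Flink's backend code; the path is a placeholder).

    import org.rocksdb.*;

    public class SnapshotIteration {
        public static void main(String[] args) throws RocksDBException {
            RocksDB.loadLibrary();
            try (Options options = new Options().setCreateIfMissing(true);
                 RocksDB db = RocksDB.open(options, "/tmp/rocksdb-snapshot-demo")) {
                db.put("a".getBytes(), "1".getBytes());
                Snapshot snapshot = db.getSnapshot();        // consistent point-in-time view
                db.put("b".getBytes(), "2".getBytes());      // not visible through the snapshot
                try (ReadOptions readOptions = new ReadOptions().setSnapshot(snapshot);
                     RocksIterator it = db.newIterator(readOptions)) {
                    for (it.seekToFirst(); it.isValid(); it.next()) {
                        System.out.println(new String(it.key()) + " -> " + new String(it.value()));
                    }
                } finally {
                    db.releaseSnapshot(snapshot);
                }
            }
        }
    }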
21. RocksDB Incremental Checkpoint
   ▪ Flush the Memtable.
   ▪ For all SSTable files in the local working directory: upload files that are new since the last checkpoint to DFS, re-reference the other files.
   ▪ Do reference counting on the JobManager for uploaded files.
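A hedged sketch of that bookkeeping with invented class and method names (not Flink's actual code), and retaining only the single latest checkpoint instead of the two retained in the example that follows: upload only SSTable files that the previous checkpoint did not contain, re-reference the rest, and reference-count every uploaded file so it can be deleted once no retained checkpoint uses it.

    import java.util.*;

    public class IncrementalCheckpointSketch {
        private final Map<String, Integer> refCounts = new HashMap<>();   // "JobManager" side
        private Set<String> previousCheckpoint = new HashSet<>();         // files of the last checkpoint

        void checkpoint(Set<String> localSstFiles) {
            for (String file : localSstFiles) {
                if (!previousCheckpoint.contains(file)) {
                    uploadToDfs(file);                        // new since the last checkpoint
                }
                refCounts.merge(file, 1, Integer::sum);       // register / re-reference
            }
            // release the previous checkpoint (only the latest one is retained here)
            for (String file : previousCheckpoint) {
                if (refCounts.merge(file, -1, Integer::sum) == 0) {
                    refCounts.remove(file);
                    deleteFromDfs(file);                      // no retained checkpoint uses it anymore
                }
            }
            previousCheckpoint = new HashSet<>(localSstFiles);
        }

        void uploadToDfs(String file)   { System.out.println("upload " + file); }
        void deleteFromDfs(String file) { System.out.println("delete " + file); }
    }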
22. Incremental Checkpoints - Example
   ▪ Illustrates the interactions (simplified) between:
      ▪ the local RocksDB instance,
      ▪ the remote Checkpoint Directory (DFS),
      ▪ the Job Manager (Checkpoint Coordinator).
   ▪ In this example: the 2 latest checkpoints are retained.
23.-30. Incremental Checkpoints - Example: [diagram, built up over slides 23-30, showing the TaskManager's local RocksDB working directory, the DFS upload area, and the Job Manager's per-checkpoint file references:
   CP 1: local sstable-(1), sstable-(2); both are uploaded and referenced by CP 1.
   CP 2: local sstable-(1), sstable-(2), sstable-(3), sstable-(4); sstable-(3) and sstable-(4) are uploaded, sstable-(1) and sstable-(2) are re-referenced.
   CP 3: a compaction merge has produced sstable-(1,2,3); local sstable-(1,2,3), sstable-(4), sstable-(5); sstable-(1,2,3) and sstable-(5) are uploaded, sstable-(4) is re-referenced.
   CP 4: a further merge (including the new sstable-(6)) has produced sstable-(4,5,6); local sstable-(1,2,3), sstable-(4,5,6); sstable-(4,5,6) is uploaded, sstable-(1,2,3) is re-referenced; since only the 2 latest checkpoints are retained, files referenced solely by older checkpoints can be released.]
31. Summary RocksDB
   ▪ Asynchronous snapshots come „for free“: the immutable copy is already there.
   ▪ Trivially supports multiple concurrent snapshots.
   ▪ Detection of incremental changes is easy: observe creation and deletion of SSTables.
   ▪ Bounded checkpoint history through compaction.
32. Asynchronous Checkpoints with the Heap Keyed-State Backend
33.-37. Heap Backend: Chained Hash Map: [diagram, built up over slides 33-37: a bucket array whose entries (K1, S1), (K2, S2) are chained via next pointers]
   Map<K, S> {
      Entry<K, S>[] table;
   }
   Entry<K, S> {
      final K key;
      S state;
      Entry next;
   }
   „Structural changes to map“ (easy to detect) vs. „User modifies state objects“ (hard to detect)
38. Copy-on-Write Hash Map: [diagram: the same chained hash map, now with per-map and per-entry version fields]
   Map<K, S> {
      Entry<K, S>[] table;
      int mapVersion;
      int requiredVersion;
      OrderedSet<Integer> snapshots;
   }
   Entry<K, S> {
      final K key;
      S state;
      Entry next;
      int stateVersion;
      int entryVersion;
   }
   „Structural changes to map“ (easy to detect), „User modifies state objects“ (hard to detect)
39. Copy-on-Write Hash Map - Snapshots
   Create Snapshot:
      1. Flat array-copy of the table array.
      2. snapshots.add(mapVersion);
      3. ++mapVersion;
      4. requiredVersion = mapVersion;
   Release Snapshot:
      1. snapshots.remove(snapVersion);
      2. requiredVersion = snapshots.getMax();
40. Copy-on-Write Hash Map - 2 Golden Rules
   1. Whenever a map entry e is modified and e.entryVersion < map.requiredVersion, first copy the entry and redirect pointers that are reachable through snapshot to the copy. Set e.entryVersion = map.mapVersion. Pointer redirection can trigger recursive application of rule 1 to other entries.
   2. Before returning the state object s of entry e to a caller, if e.stateVersion < map.requiredVersion, create a deep copy of s and redirect the pointer in e to the copy. Set e.stateVersion = map.mapVersion. Then return the copy to the caller. Applying rule 2 can trigger rule 1.
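Before the worked example on the following slides, a compact sketch of how the fields from slide 38, the snapshot steps from slide 39, and the two golden rules fit together. This is a simplification for illustration, not Flink's actual CopyOnWriteStateTable: chaining details, incremental rehashing, and the recursive pointer redirection of rule 1 are reduced to the simplest case, and the copyState function stands in for the deep copy that the real implementation performs through the state serializer.

    import java.util.TreeSet;
    import java.util.function.UnaryOperator;

    public class CowStateMapSketch<K, S> {

        static final class Entry<K, S> {
            final K key;
            S state;
            Entry<K, S> next;
            int stateVersion;
            int entryVersion;
            Entry(K key, S state, Entry<K, S> next, int version) {
                this.key = key; this.state = state; this.next = next;
                this.stateVersion = version; this.entryVersion = version;
            }
        }

        @SuppressWarnings("unchecked")
        private Entry<K, S>[] table = new Entry[128];
        private int mapVersion;
        private int requiredVersion;
        private final TreeSet<Integer> snapshots = new TreeSet<>();

        // Create Snapshot (slide 39): flat copy of the table array, then bump versions.
        Entry<K, S>[] createSnapshot() {
            Entry<K, S>[] snapshotTable = table.clone();
            snapshots.add(mapVersion);
            ++mapVersion;
            requiredVersion = mapVersion;
            return snapshotTable;
        }

        // Release Snapshot (slide 39).
        void releaseSnapshot(int snapshotVersion) {
            snapshots.remove(snapshotVersion);
            requiredVersion = snapshots.isEmpty() ? 0 : snapshots.last();
        }

        // get() applying the two golden rules from slide 40.
        S get(K key, UnaryOperator<S> copyState) {
            int bucket = (key.hashCode() & 0x7fffffff) % table.length;
            Entry<K, S> prev = null;
            for (Entry<K, S> e = table[bucket]; e != null; prev = e, e = e.next) {
                if (!e.key.equals(key)) {
                    continue;
                }
                if (e.stateVersion < requiredVersion) {
                    // Rule 2: the state object may still be reachable from a snapshot, so the
                    // caller must get a deep copy. Replacing the state pointer modifies the
                    // entry, which triggers rule 1 first:
                    if (e.entryVersion < requiredVersion) {
                        // Rule 1, simplified to redirecting a single pointer (the full rule
                        // walks the chain recursively): copy the entry so the snapshot's own
                        // table array keeps reaching an unmodified version.
                        Entry<K, S> copy = new Entry<>(e.key, e.state, e.next, mapVersion);
                        if (prev == null) {
                            table[bucket] = copy;
                        } else {
                            prev.next = copy;
                        }
                        e = copy;
                    }
                    e.state = copyState.apply(e.state);
                    e.stateVersion = mapVersion;
                }
                return e.state;
            }
            return null;
        }
    }

The example on the following slides traces exactly this kind of interplay of snapshot(), put(), get(), remove() and release() with the version counters.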
41.-49. Copy-on-Write Hash Map - Example: [diagram sequence:
   41. Initial map: entries K:42 S:7 and K:23 S:3, both with sv:0 ev:0; mapVersion: 0, requiredVersion: 0.
   42. snapshot() -> 0: snapshotVersion: 0, mapVersion: 1, requiredVersion: 1; the entries are untouched.
   43. put(13, 2): a new entry K:13 S:2 with sv:1 ev:1 is inserted; the old entries are untouched.
   44.-45. get(23): the caller could modify the returned state object, so the version still referenced by the snapshot must not be handed out.
   46. get(23): trigger a copy of the state object S:3 (rule 2). How to connect it?
   47. get(23): copy the entry K:23 as well (rule 1, sv:1 ev:1). How to connect it?
   48. get(23): copy entries recursively and redirect pointers (rule 1) until the chain is consistent again.
   49. get(23) returns the copied state object; the copied entry carries sv:1 ev:1, while the sv:0 ev:0 originals remain for the snapshot.]
50.-58. Copy-on-Write Hash Map - Example (continued): [diagram sequence:
   50.-51. get(42): the entry still has sv:0 < requiredVersion: 1, so a copy is required.
   52. get(42): the state object S:7 is copied; the K:42 entry now carries sv:1 ev:1.
   53. release(0): snapshot 0 is released; mapVersion: 1, requiredVersion: 0.
   54. snapshot() -> 1: snapshotVersion: 1, mapVersion: 2, requiredVersion: 2.
   55.-56. remove(23): can the next-pointer simply be changed in place?
   57. remove(23): the affected neighbouring entry is copied/updated (ev:2) so that snapshot 1 still sees the old chain including K:23.
   58. release(1): snapshot 1 is released; the map now contains K:42 S:7 and K:13 S:2; mapVersion: 2, requiredVersion: 0.]
59. Copy-on-Write Hash Map Benefits
   ▪ Fine-granular, lazy copy-on-write.
   ▪ At most one copy per object, per snapshot.
   ▪ No thread synchronisation between readers and writer after the array-copy step.
   ▪ Supports multiple concurrent snapshots.
   ▪ Also implements efficient incremental rehashing.
   ▪ Future: could serve as a basis for incremental snapshots.
60. Questions?
