(Jason Gustafson, Confluent) Kafka Summit SF 2018
Kafka has a well-designed replication protocol, but over the years, we have found some extremely subtle edge cases which can, in the worst case, lead to data loss. We fixed the cases we were aware of in version 0.11.0.0, but shortly after that, another edge case popped up and then another. Clearly we needed a better approach to verify the correctness of the protocol. What we found is Leslie Lamport’s specification language TLA+.
In this talk I will discuss how we have stepped up our testing methodology in Apache Kafka to include formal specification and model checking using TLA+. I will cover the following:
1. How Kafka replication works
2. What weaknesses we have found over the years
3. How these problems have been fixed
4. How we have used TLA+ to verify the fixed protocol.
This talk will give you a deeper understanding of Kafka replication internals and its semantics. The replication protocol is a great case study in the complex behavior of distributed systems. By studying the faults and how they were fixed, you will have more insight into the kinds of problems that may lurk in your own designs. You will also learn a little bit of TLA+ and how it can be used to verify distributed algorithms.
2. ● At the heart of Kafka is the log
● Log replication provides high availability
● Kafka has a solid replication protocol
● 99.999% of the time it does the right thing
● This talk is about the remaining 0.001%
Overview
25. r0 r1
r0 r1 r2
A
B
C
Leader
Follower
Follower
Followers fetch from the
leader.
26. r0 r1
r0 r1 r2
r0
A
B
C
Leader
Follower
Follower
Followers fetch from the
leader.
27. r0 r1
r0 r1 r2
r0
A
B
C
Leader
Follower
Follower
Leader election is handled by
a separate component known
as the controller
28. r0 r1
r0 r1 r2
r0
A
B
C
Leader
Follower
Follower
Leader Epoch ISR
B 0 A, B, C
In order to enable election by
the controller, we maintain
state in Zookeeper about the
in-sync replicas (ISR).
29. r0 r1
r0 r1 r2
r0
A
B
C
Leader
Follower
Follower
When there is a state change
(e.g. a new leader), the
controller sends the updated
state to all the replicas.
Leader Epoch ISR
B 0 A, B, C
30. r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
When there is a state change
(e.g. a new leader), the
controller sends the updated
state to all the replicas.
31. r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
32. r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
High Watermark
The high watermark is
the largest offset known
to be replicated to all
members of the ISR.
33. r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
The high watermark is
the largest offset known
to be replicated to all
members of the ISR.
34. r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
Records below the high
watermark are considered
“committed” and are visible
to consumers.
Committed
35. r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
Records above the high
watermark are considered
uncommitted.
Committed Uncommitted
36. r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
As records are replicated,
the high watermark moves
forward.
37. r0 r1
r0 r1 r2
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
As records are replicated,
the high watermark moves
forward.
38. r0 r1
r0 r1 r2
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
As records are replicated,
the high watermark moves
forward.
39. r0 r1
r0 r1 r2
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
If a replica falls behind, it
can be removed from the
ISR by the leader.
40. r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
If a replica falls behind, it
can be removed from the
ISR by the leader.
41. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
If a replica falls behind, it
can be removed from the
ISR by the leader.
42. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B
Follower
(epoch=0)
If a replica falls behind, it
can be removed from the
ISR by the leader.
43. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B
Follower
(epoch=0)
If a replica falls behind, it
can be removed from the
ISR by the leader.
44. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B
Follower
(epoch=0)
If a replica falls behind, it
can be removed from the
ISR by the leader.
45. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B
Follower
(epoch=0)
An out-of-sync replica that
catches up to the high
watermark is added back
to the ISR.
46. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B
Follower
(epoch=0)
An out-of-sync replica that
catches up to the high
watermark is added back
to the ISR.
47. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
An out-of-sync replica that
catches up to the high
watermark is added back
to the ISR.
48. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
49. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
Only replicas in the ISR are
eligible to become leader
50. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
When a leader fails, the
controller will take it out of
the ISR and elect a new
leader from the remaining
ISR.
51. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=0)
When a leader fails, the
controller will take it out of
the ISR and elect a new
leader from the remaining
ISR.
52. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=0)
The new leader/ISR state is
propagated to the
remaining replicas
53. r0 r1 r2 r3 r4 r7 r8
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=0)
The leader can begin
accepting writes
immediately.
54. r0 r1 r2 r3 r4 r7 r8
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=1)
Upon becoming a follower,
the replica may have
uncommitted data which
needs to be truncated.
55. r0 r1 r2 r3 r4 r7 r8
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=1)
Upon becoming a follower,
the replica may have
uncommitted data which
needs to be truncated.
56. r0 r1 r2 r3 r4 r7 r8
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r7
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=1)
Upon becoming a follower,
the replica may have
uncommitted data which
needs to be truncated.
63. r0 r1
r0 r1 r2
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
Every replica tracks the
high watermark separately
64. r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
Every replica tracks the
high watermark separately
65. r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
The leader advances its
high watermark based on
the fetch offsets of replicas
66. r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
The leader advances its
high watermark based on
the fetch offsets of replicas
67. r0 r1
r0 r1 r2
r0
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
The leader piggybacks its
high watermark onto fetch
responses
68. r0 r1
r0 r1 r2
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
The leader piggybacks its
high watermark onto fetch
responses
69. r0 r1
r0 r1 r2
r0 r1
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
At any point in time, the
follower high watermarks
may be a little behind the
leader’s.
71. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
72. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
Replica B fails.
73. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
C 1 A, C
Follower
(epoch=0)
Replica B is removed from
the ISR and C is elected as
the new leader.
74. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
Replica B is removed from
the ISR and C is elected as
the new leader.
75. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
Replica A finds the new
leader and truncates its log
to the local high watermark
76. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
Replica A finds the new
leader and truncates its log
to the local high watermark
77. r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
Replica A finds the new
leader and truncates its log
to the local high watermark
78. r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
Before replica A begins
fetching, the new leader
fails.
79. r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
Before replica A begins
fetching, the new leader
fails.
80. r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
Before replica A begins
fetching, the new leader
fails.
81. r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Leader
(epoch=2)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
Before replica A begins
fetching, the new leader
fails.
82. r0 r1 r7 r8 r9
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Leader
(epoch=2)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
Leader A then begins
accepting writes.
83. r0 r1 r7 r8 r9
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Leader
(epoch=2)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
But r2 and r3 had already
been committed to the ISR!
84. r0 r1 r7 r8 r9
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Leader
(epoch=2)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
Suppose that B eventually
gets restarted.
85. r0 r1 r7 r8 r9
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Leader
(epoch=2)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
Suppose that B eventually
gets restarted.
86. r0 r1 r7 r8 r9
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Follower
(epoch=2)
Leader
(epoch=2)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
Suppose that B eventually
gets restarted.
87. r0 r1 r7 r8 r9
r0 r1 r2 r3
r0 r1 r2 r3 r4 r5
A
B
C
Follower
(epoch=2)
Leader
(epoch=2)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
Suppose that B eventually
gets restarted.
88. r0 r1 r7 r8 r9
r0 r1 r2 r3 r9
r0 r1 r2 r3 r4 r5
A
B
C
Follower
(epoch=2)
Leader
(epoch=2)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
The logs have now
diverged.
90. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
Replica B has failed and
replica A needs to truncate
its log.
91. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
A -> C: What is the end offset
for epoch=0?
92. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
A -> C: What is the end offset
for epoch=0?
C -> A: The end offset is 6
Offset 6
93. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
A -> C: What is the end offset
for epoch=0?
C -> A: The end offset is 6
C: Cool, no truncation needed!
94. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
95. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
96. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
97. r0 r1 r2 r3 r7 r8
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Leader
(epoch=2)
Leader Epoch ISR
A 2 A
Leader
(epoch=1)
99. r0 r1 r2
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=0)
Replica B has failed and
replica A has been elected
as the new leader
100. r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=0)
Replica B has failed and
replica A has been elected
as the new leader
101. r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=0)
Replica B has failed and
replica A has been elected
as the new leader
epoch=0
offset=0
epoch=1
offset=3
102. r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=0)
epoch=0
offset=0
epoch=1
offset=3
103. r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
A 1 A, C
Follower
(epoch=0)
Before replica C can
truncate its log, it becomes
the new leader.
epoch=0
offset=0
epoch=1
offset=3
104. r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
C 2 A, C
Follower
(epoch=0)
Before replica C can
truncate its log, it becomes
the new leader.
epoch=0
offset=0
epoch=1
offset=3
105. r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
C 2 A, C
Leader
(epoch=2)
Before replica C can
truncate its log, it becomes
the new leader.
epoch=0
offset=0
epoch=1
offset=3
106. r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r9
A
B
C
Leader
(epoch=0)
Leader
(epoch=1)
Leader Epoch ISR
C 2 A, C
Leader
(epoch=2)
Before replica C can
truncate its log, it becomes
the new leader.
epoch=0
offset=0
epoch=1
offset=3
epoch=2
offset=5
107. r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r9
A
B
C
Leader
(epoch=0)
Follower
(epoch=2)
Leader Epoch ISR
C 2 A, C
Leader
(epoch=2)
epoch=0
offset=0
epoch=1
offset=3
epoch=2
offset=5
A -> C: What is the end offset
for epoch=1?
108. r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r9
A
B
C
Leader
(epoch=0)
Follower
(epoch=2)
Leader Epoch ISR
C 2 A, C
Leader
(epoch=2)
epoch=0
offset=0
epoch=1
offset=3
epoch=2
offset=5
A -> C: What is the end offset
for epoch=1?
C -> A: The end offset is 5
109. r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r9
A
B
C
Leader
(epoch=0)
Follower
(epoch=2)
Leader Epoch ISR
C 2 A, C
Leader
(epoch=2)
epoch=0
offset=0
epoch=1
offset=3
epoch=2
offset=5
A -> C: What is the end offset
for epoch=1?
C -> A: The end offset is 5
C: Cool, no truncation needed!
110. r0 r1 r2 r7 r8
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r9
A
B
C
Leader
(epoch=0)
Follower
(epoch=2)
Leader Epoch ISR
C 2 A, C
Leader
(epoch=2)
111. r0 r1 r2 r7 r8 r9
r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r9
A
B
C
Leader
(epoch=0)
Follower
(epoch=2)
Leader Epoch ISR
C 2 A, C
Leader
(epoch=2)
113. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
A 0 A, B, C
Follower
(epoch=0)
114. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
A 0 A, B, C
Follower
(epoch=0)
Follower A fails and is
removed from the ISR.
115. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
A 0 B, C
Follower
(epoch=0)
Follower A fails and is
removed from the ISR.
116. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
A 0 B, C
Follower
(epoch=0)
117. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
A 0 B, C
Follower
(epoch=0)
Replica A could not re-register
in order to get the latest
leader/ISR state and continued
fetching from the current
leader.
118. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
A 0 B, C
Follower
(epoch=0)
Replica A could not re-register
in order to get the latest
leader/ISR state and continued
fetching from the current
leader.
119. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
A 0 B, C
Follower
(epoch=0)
Leader
(epoch=0)
120. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Follower
(epoch=0)
Leader
(epoch=0)
121. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Leader
(epoch=1)
Leader
(epoch=0)
122. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Leader
(epoch=1)
Leader
(epoch=0)
123. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Leader
(epoch=1)
Leader
(epoch=0)
124. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Leader
(epoch=1)
Leader
(epoch=0)
Meanwhile, replica A still
thought B was the leader and
was still trying to make
progress
125. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Leader
(epoch=1)
Leader
(epoch=0)
126. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Leader
(epoch=1)
Follower
(epoch=1)
127. r0 r1 r2 r3 r4
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Leader
(epoch=1)
Follower
(epoch=1)
128. r0 r1 r2 r3 r4
r0 r1 r2
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Leader
(epoch=1)
Follower
(epoch=1)
129. r0 r1 r2 r3 r4
r0 r1 r2
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Leader
(epoch=1)
Follower
(epoch=1)
130. r0 r1 r2 r3 r4
r0 r1 r2 r7 r8 r9
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 C
Leader
(epoch=1)
Follower
(epoch=1)
131. r0 r1 r2 r3 r4
r0 r1 r2 r7 r8 r9
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
Follower
(epoch=1)
132. r0 r1 r2 r3 r4
r0 r1 r2 r7 r8 r9
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
B 2 B, C
Leader
(epoch=1)
Follower
(epoch=1)
Once back in the ISR, the
controller elected it as leader
133. r0 r1 r2 r3 r4
r0 r1 r2 r7 r8 r9
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
B 2 B, C
Leader
(epoch=1)
Leader
(epoch=2)
Once back in the ISR, the
controller elected it as leader
134. r0 r1 r2 r3 r4
r0 r1 r2 r7 r8 r9
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
B 2 B, C
Leader
(epoch=1)
Leader
(epoch=2)
Suddenly, replica A was able to
make progress again!
135. r0 r1 r2 r3 r4 r9
r0 r1 r2 r7 r8 r9
r0 r1 r2 r7 r8 r9
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
B 2 B, C
Leader
(epoch=1)
Leader
(epoch=2)
Suddenly, replica A was able to
make progress again!
136. Reflection
● Our mushy brains are not equipped to thinking
about edge cases in distributed systems
● How do we know that our fixes are not just
trading one edge case for another?
● How do we know there are not more edge
cases?
138. TLA+/TLC
● TLA+ is a specification language
created by Leslie Lamport
● TLC is a model checker
● Think “brute force proof by
mathematical induction”
139. TLA+/TLCUsing LaTeX syntax makes
model checking just as much
fun as writing research papers!● TLA+ is a specification language
created by Leslie Lamport
● TLC is a model checker
● Think “brute force proof by
mathematical induction”
141. ● Define the state and how to initialize it
● Define the valid state transitions
● Define expected state invariants
● Run model to check invariants
Model
Checklist
142. ● Define the state and how to initialize it
● Define the valid state transitions
● Define expected state invariants
● Run model to check invariants
Model
Checklist
143. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
144. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
1. Records and the log
147. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
1. Records and the log
148. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
1. Records and the log
2. Replica State
153. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
1. Records and the log
2. Replica State
154. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
1. Records and the log
2. Replica State
3. Quorum State
156. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
1. Records and the log
2. Replica State
3. Quorum State
157. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 A, B, C
Follower
(epoch=0)
1. Records and the log
2. Replica State
3. Quorum State
4. LeaderAndIsr Propagation
162. ● Define the state and how to initialize it
● Define the valid state transitions
● Define expected state invariants
● Run model to check invariants
Model
Checklist
179. ● Define the state and how to initialize it
● Define the valid state transitions
● Define expected state invariants
● Run model to check invariants
Model
Checklist
180. Replication
Invariant StrongIsr == A r1 in Replicas:
/ ~ ReplicaPresumesLeadership(r1)
/ LET hw == replicaState[r1].hw
IN A r2 in quorumState.isr:
HasMatchingLogsUpTo(r1, r2, hw)
181. Replication
Invariant StrongIsr == A r1 in Replicas:
/ ~ ReplicaPresumesLeadership(r1)
/ LET hw == replicaState[r1].hw
IN A r2 in quorumState.isr:
HasMatchingLogsUpTo(r1, r2, hw)
“If any replica is eligible to return data, then that data
must be replicated to all members of the current ISR”
182. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
Leader A had failed and
replica C was being elected
as the new leader.
183. r0 r1 r2 r3
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
Upon becoming a follower
of C, replica A would
truncate its log to the local
high watermark.
184. r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
185. r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, C
Leader
(epoch=1)
This state violates the
StrongIsr property because
leader C is eligible to return
records r2 and r3, though
they are not present on A.
186. ● Define the state and how to initialize it
● Define the valid state transitions
● Define expected state invariants
● Run model to check invariants
Model
Checklist
188. r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
B 0 B, C
Follower
(epoch=0)
The leader is B and replica
A is trying to catch up to
rejoin the ISR.
189. r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
C 1 B, C
Follower
(epoch=0)
The leader changes to C.
190. r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=0)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
The leader changes to C.
191. r0 r1
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
Follower A catches up and
rejoins the ISR.
192. r0 r1 r2
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
Follower A catches up and
rejoins the ISR.
193. r0 r1 r2
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, B, C
Leader
(epoch=1)
Follower A catches up and
rejoins the ISR.
194. r0 r1 r2
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, B, C
Leader
(epoch=1)
This violates StrongIsr
because replica B may
have returned records r3,
r4, and r5 which A does not
yet have.
196. r0 r1 r2
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
After becoming leader, C
only knows that the true
high watermark is between
its own high watermark and
the end of the log.
True high
watermark
197. r0 r1 r2
r0 r1 r2 r3 r4 r5 r6
r0 r1 r2 r3 r4 r5
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
So we wait until the
follower has reached the
starting offset of this
leader’s own epoch before
allowing it into the ISR.
True high
watermark
198. r0 r1 r2
r0 r1 r2 r3 r4 r5
r0 r1 r2 r3 r4 r5 r7 r8
A
B
C
Follower
(epoch=1)
Follower
(epoch=1)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
So we wait until the
follower has reached the
starting offset of this
leader’s own epoch before
allowing it into the ISR.
True high
watermark
199. r0 r1 r2 r3 r4 r5 r7
r0 r1 r2 r3 r4 r5
r0 r1 r2 r3 r4 r5 r7 r8
A
B
C
Follower
(epoch=1)
Follower
(epoch=1)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
So we wait until the
follower has reached the
starting offset of this
leader’s own epoch before
allowing it into the ISR.
True high
watermark
200. r0 r1 r2 r3 r4 r5 r7
r0 r1 r2 r3 r4 r5
r0 r1 r2 r3 r4 r5 r7 r8
A
B
C
Follower
(epoch=1)
Follower
(epoch=1)
Leader Epoch ISR
C 1 A, B, C
Leader
(epoch=1)
So we wait until the
follower has reached the
starting offset of this
leader’s own epoch before
allowing it into the ISR.
True high
watermark
202. r0 r1 r2 r3
r0 r1 r2 r5 r6
r0 r1 r2 r5 r6
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
B 2 B, C
Leader
(epoch=1)
Leader
(epoch=2)
Replica A was a zombie
which was still fetching
from B. After a couple
leader elections, replica B
became the leader again.
203. r0 r1 r2 r3
r0 r1 r2 r5 r6
r0 r1 r2 r5 r6
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
B 2 B, C
Leader
(epoch=1)
Leader
(epoch=2)
A -> B:
Fetch(offset=4, epoch=0)
204. r0 r1 r2 r3
r0 r1 r2 r5 r6
r0 r1 r2 r5 r6
A
B
C
Follower
(epoch=0)
Leader Epoch ISR
B 2 B, C
Leader
(epoch=1)
Leader
(epoch=2)
A -> B:
Fetch(offset=4, epoch=0)
B -> A:
You are fenced!
207. Summary
● Distributed systems are subtle and we are
poorly equipped to reason about edge cases.
● Model checking is a systematic approach to
finding these edge cases and verifying our
fixes address them.
● All of the replication fixes we know of will be
available in Apache Kafka 2.1.0.
208. Note of
Caution ● The model is not the implementation.
● The implementation will have complexity that
the model cannot capture.
212. r0 r1 r2
r0 r1 r2 r3
r0 r1 r2 r3
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
B became a zombie while it
was the leader for epoch 0.
213. r0 r1 r2
r0 r1 r2 r3
r0 r1 r2 r3 r7 r8
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
The new leader will be
accepting writes.
214. r0 r1 r2
r0 r1 r2 r3 r9 r10
r0 r1 r2 r3 r7 r8
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
The old leader may accept
writes as well!
215. r0 r1 r2
r0 r1 r2 r3 r9 r10
r0 r1 r2 r3 r7 r8
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR
C 1 B, C
Leader
(epoch=1)
As long as the leader
cannot advance its high
watermark, there is no
semantic violation.
216. r0 r1 r2
r0 r1 r2 r3 r9 r10
r0 r1 r2 r3 r7 r8
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR Ver
C 1 B, C 1
Leader
(epoch=1)
As long as the leader
cannot advance its high
watermark, there is no
semantic violation.
217. r0 r1 r2
r0 r1 r2 r3 r9 r10
r0 r1 r2 r3 r7 r8
A
B
C
Leader
(epoch=0)
Follower
(epoch=1)
Leader Epoch ISR Ver
C 1 B, C 1
Leader
(epoch=1)
The controller sends the
latest version of the leader
and ISR state to replicas in
the LeaderAndIsr request
218. r0 r1 r2
r0 r1 r2 r3 r9 r10
r0 r1 r2 r3 r7 r8
A
B
C
Leader
(epoch=0,
version=0)
Follower
(epoch=1)
Leader Epoch ISR Ver
C 1 B, C 1
Leader
(epoch=1,
version=1)
The controller sends the
latest version of the leader
and ISR state to replicas in
the LeaderAndIsr request
219. r0 r1 r2
r0 r1 r2 r3 r9 r10
r0 r1 r2 r3 r7 r8
A
B
C
Leader
(epoch=0,
version=0)
Follower
(epoch=1)
Leader Epoch ISR Ver
C 1 B, C 1
Leader
(epoch=1,
version=1)
This allows for CAS
updates, which effectively
fences replicas which have
old state.
228. VARIABLES var1, var2, …
Init ==
/ var1 = 1
/ …
Action1 ==
/ var1 leq 10
/ var1’ = var + 1
…
Next ==
/ Action1
/ Action2
/ …
Spec == Init / []Next
Invariant ==
/ var1 geq 1
/ …
TLA+
Overview
Specify the set of valid
state transitions
229. VARIABLES var1, var2, …
Init ==
/ var1 = 1
/ …
Action1 ==
/ var1 leq 10
/ var1’ = var + 1
…
Next ==
/ Action1
/ Action2
/ …
Spec == Init / []Next
Invariant ==
/ var1 geq 1
/ …
TLA+
Overview
Specify the set of valid
state transitions
230. VARIABLES var1, var2, …
Init ==
/ var1 = 1
/ …
Action1 ==
/ var1 leq 10
/ var1’ = var + 1
…
Next ==
/ Action1
/ Action2
/ …
Spec == Init / []Next
Invariant ==
/ var1 geq 1
/ …
TLA+
Overview
The specification is the
conjunction of the initial state
and all the states reachable
by repeatedly applying the
`Next` state transition
231. VARIABLES var1, var2, …
Init ==
/ var1 = 1
/ …
Action1 ==
/ var1 leq 10
/ var1’ = var + 1
…
Next ==
/ Action1
/ Action2
/ …
Spec == Init / []Next
Invariant ==
/ var1 geq 1
/ …
TLA+
Overview
Define the model invariants
that should hold after every
state transition