Nikhil Bhatia, Confluent, Engineering Leader, Global Kafka + Sanjana Kaundinya, Confluent, Software Engineer + Luke Knepper, Confluent, Product Manager
Kafka is the backbone for many of the Global 2000’s real-time streaming data architectures. It can create a “global event mesh” to power real-time applications and analytics at scale. Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. In this talk, we’ll cover the state-of-the-art patterns for efficient, resilient, highly available global Kafka. We’ll show how Confluent Platform and Confluent Cloud enable multi-region stretch clusters, global data sharing, hybrid cloud, and edge computing.
https://www.meetup.com/KafkaBayArea/events/275694619/
2. Nikhil Bhatia
Engineering Leader, Global Kafka, Confluent
About Myself
• Previous project at Confluent -
Infinite Storage for Kafka
• Principal Engineer at Microsoft
@nikhilbhatia
linkedin.com/in/nikhil-bhatia-a2a8115
4. Kafka Overview
● Broker - Stores messages in partitions
● Topic - A named group of one or more partitions
● Partition - An append-only log file on disk; Kafka guarantees message ordering within a partition
[Diagram: producers (P) writing to, and consumers (C) reading from, partitions P1 and P2 of topic T1 on a broker]
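The broker/topic/partition model above can be sketched as a minimal in-memory model (illustrative only; the `Partition` and `Topic` classes here are my own, not real Kafka code):

```python
# Minimal in-memory model of Kafka's storage concepts (illustrative, not real Kafka code).

class Partition:
    """An append-only log; each message gets the next sequential offset."""
    def __init__(self):
        self.log = []  # messages in write order

    def append(self, message):
        offset = len(self.log)  # offsets are assigned sequentially
        self.log.append(message)
        return offset

class Topic:
    """A named group of one or more partitions."""
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [Partition() for _ in range(num_partitions)]

t1 = Topic("T1", num_partitions=2)
# Ordering is guaranteed only within a partition, not across the whole topic.
assert t1.partitions[0].append("a") == 0
assert t1.partitions[0].append("b") == 1
assert t1.partitions[1].append("c") == 0
```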
5. Kafka Log Offsets
[Diagram: Partition 1 holding messages at offsets 0-9; consumer groups CG1 and CG2 positioned at different committed offsets (tracked in the __consumer_offsets topic); startOffset, high watermark (HW), and endOffset marked, with a producer appending at the end]
● The log end offset is the offset of the last message written to a log.
● The high watermark offset is the offset of the last message that was successfully copied to all of the log's in-sync replicas.
● The consumer offset tracks a consumer group's position in a partition: the offset of the next message the group will read.
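One rough way to see how these offsets relate (a sketch, not Kafka's implementation; here the high watermark is simply the minimum log end offset across replicas):

```python
# Sketch of log end offset, high watermark, and consumer offset (illustrative only).

def log_end_offset(replica_log):
    """Offset at which the next message will be written."""
    return len(replica_log)

def high_watermark(replica_logs):
    """Highest offset replicated to ALL replicas: the min of the log end offsets."""
    return min(log_end_offset(log) for log in replica_logs)

# Leader has 10 messages (offsets 0-9); one follower has only caught up to offset 7.
leader    = list(range(10))
follower1 = list(range(10))
follower2 = list(range(8))

hw = high_watermark([leader, follower1, follower2])
print(hw)  # 8 -> consumers may only read offsets below the high watermark

# A consumer group's committed offset is the next offset it will read.
committed = {"CG1": 4, "CG2": 8}
assert committed["CG2"] <= hw  # consumers never read past the high watermark
```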
6. Why Do We Need Replication?
Brokers can go down
● Controlled - software/config update
● Uncontrolled - compute/disk fault, bugs
When a broker goes down, durability and availability suffer
● Data loss
● Some partitions on the cluster become unavailable
From a major cloud provider's SLA:
“For any Single Instance Virtual Machine using Standard HDD Managed Disks for Operating System Disks and Data Disks, we guarantee you will have Virtual Machine Connectivity of at least 95%.”
8. How Are Messages Committed?
● The leader maintains the set of In-Sync Replicas (ISR)
● The leader waits for followers to replicate each message
● If a follower fails, the leader shrinks the ISR and remains available
● Leader and follower failure cases are handled by including the leader epoch in each message and truncating the follower's log on recovery (refer to KIP-101)
[Diagram: partition P1's replicas R1-R4 spread across brokers; the leader (L) role moves between replicas as brokers fail and recover]
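The commit rule above can be sketched as: a write is committed once every replica currently in the ISR has it, and a lagging follower is dropped from the ISR rather than blocking availability. This is a simplification (real Kafka tracks lag by time via replica.lag.time.max.ms, not by a fixed message count; `MAX_LAG` here is a hypothetical threshold):

```python
# Sketch of ISR-based commits (illustrative; real Kafka tracks lag by time, not count).

MAX_LAG = 2  # hypothetical lag threshold, in messages

class PartitionLeader:
    def __init__(self, replica_ids):
        self.log = []
        self.isr = set(replica_ids)                 # in-sync replicas
        self.fetched = {r: 0 for r in replica_ids}  # how far each replica has fetched

    def produce(self, msg):
        self.log.append(msg)

    def follower_fetch(self, replica, upto):
        self.fetched[replica] = upto

    def maybe_shrink_isr(self):
        end = len(self.log)
        for r in list(self.isr):
            if end - self.fetched[r] > MAX_LAG:
                self.isr.discard(r)  # drop the laggard, stay available

    def high_watermark(self):
        # Committed = replicated to every replica still in the ISR.
        return min(self.fetched[r] for r in self.isr)

leader = PartitionLeader(["r1", "r2", "r3"])
for m in ["a", "b", "c", "d"]:
    leader.produce(m)
leader.follower_fetch("r1", 4)
leader.follower_fetch("r2", 4)
leader.follower_fetch("r3", 1)   # r3 is 3 messages behind
leader.maybe_shrink_isr()
print(sorted(leader.isr))        # ['r1', 'r2'] -- r3 was dropped
print(leader.high_watermark())   # 4 -- all four messages are now committed
```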
9. Salient Points for Replication
● Intra-cluster replication improves durability and availability under node-level failures.
● Offsets are a core piece of the Kafka producer and consumer ecosystem.
● Kafka's replication protocol ensures strong consistency through byte-by-byte replication and message ordering guarantees.
10. Multi-Zone (MZ) HA Kafka Cluster
[Diagram: brokers (B) and ZooKeeper nodes (zk) spread across availability zones AZ1-AZ3, with a producer (P) and consumer (C) attached]
Inter-zone latency: <10 ms, typically ~3 ms
11. Why Globally Replicate?
● Disaster Recovery
○ A DC can go down; regional failures happen even in major clouds
■ AWS us-east-1 region failure last November
■ Azure East US outage last March
○ Planned failovers (expected hurricane: prepare and prevent an outage)
○ Passing DR audits is a requirement
● Fan-out - topic sharing - data needs to be near the consumer
● Fan-in - aggregate clusters, e.g. IoT use cases
● Cluster/Cloud Migration
12. Disaster Recovery - Metrics
Recovery Point Objective (RPO): the maximum amount of data – as measured by time – that can be lost after a recovery
Recovery Time Objective (RTO): the targeted duration of time and a service level within which a business process must be restored after a disaster
*Source: Wikipedia
14. Luke Knepper
Product Manager, Global Kafka, Confluent
About Myself
● Stanford CS ⇒
● Lead Software Engineer ⇒
● Stanford MBA ⇒
● Product Manager
@knep
linkedin.com/in/knepper
16. Stretched Clusters
● Fast Disaster Recovery
● Offset Preserving
● Automated Client Failover
with No Custom Code
● Sync or Async Replication per
Topic with Confluent’s
Multi-Region Clusters
17. Replica Placement
● Ensure your partition’s
replicas are spread
throughout your data centers
placement.json
{
  "version": 1,
  "replicas": [
    { "count": 2, "constraints": { "rack": "east" } },
    { "count": 2, "constraints": { "rack": "west" } }
  ]
}
kafka-topics --create \
  --bootstrap-server localhost:9091 \
  --topic annas_topic \
  --partitions 1 \
  --config min.insync.replicas=3 \
  --replica-placement placement.json
19. Observers: Asynchronous Replicas
● Observers replicate partitions from the leader, just like followers
● Not considered when incrementing the high watermark
● Improved durability without sacrificing write throughput
● Replicate across slower/higher-latency links without falling in and out of sync (also known as ISR thrashing)
● Available in Confluent Server
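Observers are declared in the same replica placement file shown on slide 17, via an "observers" section. A sketch (the rack names and counts are examples; check your Confluent Server version's documentation for the exact schema):

```json
{
  "version": 1,
  "replicas": [
    { "count": 2, "constraints": { "rack": "east" } },
    { "count": 2, "constraints": { "rack": "west" } }
  ],
  "observers": [
    { "count": 1, "constraints": { "rack": "east" } },
    { "count": 1, "constraints": { "rack": "west" } }
  ]
}
```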
20. Example: Datacenter Failover for 2.5 DCs
2.5 datacenters, 4 replicas (“R”) + 2 observers (“O”), min ISR: 3, acks=all
[Diagram: DC A and DC B each host two replicas (R), one observer (O), and two ZooKeeper nodes; DC 0.5 hosts a single tie-breaker ZooKeeper node; the ISR spans the four replicas across both DCs]
Steady State
Observers stay out of the ISR. A min ISR of 3 forces writes to go to both datacenters.
Note: the “half-DC” is needed to prevent a “split brain” between the two datacenters.
21. Example: Datacenter Failover for 2.5 DCs
2.5 datacenters, 4 replicas (“R”) + 2 observers (“O”), min ISR: 3, acks=all
[Diagram: same layout as the previous slide, but DC B is offline; the observer in DC A has joined the ISR]
DC Failure
The in-sync replica count falls below min ISR, so an observer automatically joins the ISR. Min ISR is satisfied again and writes can go to the remaining datacenter. Availability is automatically maintained.
22. Sanjana Kaundinya
Software Engineer, Global Kafka, Confluent
About Myself
- Working at Confluent for the past 1.5
years supporting and developing
global replication technologies
@skaundinya15
linkedin.com/in/sanjanakaundinya
27. Offset Translation in MirrorMaker 2.0
● The offset_sync topic stores: topic, partition, source offset, matching destination offset
● The checkpoints topic stores: topic, partition, consumer group name, consumer group source offset, matching destination offset
● A consumer calls translateOffsets to find its committed positions on the destination cluster
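Conceptually, translation is a lookup: given a consumer group, find its checkpoint records and return the matching destination offsets. A minimal sketch of the idea (the data structures and `translate_offsets` function are my own, not MirrorMaker's actual API):

```python
# Sketch of MirrorMaker 2.0-style offset translation (illustrative data, not the real API).

# Records from the checkpoints topic: one per (group, source topic, partition),
# mapping a consumer group's source offset to the matching destination offset.
checkpoints = [
    {"group": "CG1", "topic": "orders", "partition": 0, "src_offset": 42, "dest_offset": 40},
    {"group": "CG1", "topic": "orders", "partition": 1, "src_offset": 17, "dest_offset": 17},
    {"group": "CG2", "topic": "orders", "partition": 0, "src_offset": 99, "dest_offset": 95},
]

def translate_offsets(group):
    """Return {(topic, partition): dest_offset} for a group from its checkpoints."""
    return {
        (c["topic"], c["partition"]): c["dest_offset"]
        for c in checkpoints
        if c["group"] == group
    }

# A consumer failing over to the destination cluster seeks to the translated offsets.
# Note the destination offsets can differ from the source offsets, since the two logs
# need not start at the same offset.
print(translate_offsets("CG1"))  # {('orders', 0): 40, ('orders', 1): 17}
```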
28. Active-Active Replication in MirrorMaker 2.0
● Two clusters can be configured to replicate to each other
○ Known as an “Active-Active” replication scenario
● Records are produced to both clusters and can be seen by clients in both clusters
○ The problem that comes from this is cyclic replication
● MirrorMaker 2.0 prefixes replicated topics with the source cluster's alias and uses that prefix for cycle detection
○ Example: topics with “us-west” in their prefix won't be replicated back to the “us-west” cluster
● This holds true regardless of cluster topology
● A cluster can be replicated to several downstream clusters
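The alias check above can be sketched as follows (a simplification of MirrorMaker 2.0's default replication policy; the function names are my own):

```python
# Sketch of MirrorMaker 2.0's prefix-based cycle detection (simplified).

SEPARATOR = "."  # MM2's default cluster-alias separator

def remote_topic_name(source_alias, topic):
    """MM2 renames replicated topics with the source cluster alias as a prefix."""
    return f"{source_alias}{SEPARATOR}{topic}"

def should_replicate(topic, target_alias):
    """Skip topics whose prefix chain already contains the target cluster's alias."""
    aliases = topic.split(SEPARATOR)[:-1]  # everything before the base topic name
    return target_alias not in aliases

# us-east replicates "clicks" to us-west; on us-west it becomes "us-east.clicks":
t = remote_topic_name("us-east", "clicks")
print(t)  # us-east.clicks

# us-west would try to replicate it back -- the alias check stops the cycle:
print(should_replicate(t, target_alias="us-east"))         # False
print(should_replicate("clicks", target_alias="us-east"))  # True
```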
30. Active-Active Replication in Replicator
● Replicator adds a provenance header to each record with the following information:
○ ID of the origin cluster where the message was first produced
○ Name of the topic to which the message was first produced
○ Timestamp when Replicator first copied the record
● By default, Replicator skips any record whose destination cluster ID matches the origin cluster ID in the provenance header
● By adding a little overhead to each record, cyclic replication is prevented
● Works with more than two clusters
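The skip rule can be sketched like this (illustrative: real Replicator stores this in a Kafka record header, and the field names below are my own):

```python
# Sketch of Replicator-style provenance filtering (illustrative only).

def make_provenance(origin_cluster_id, origin_topic, timestamp_ms):
    """Provenance attached when a record is first copied between clusters."""
    return {"cluster": origin_cluster_id, "topic": origin_topic, "ts": timestamp_ms}

def should_copy(record, destination_cluster_id):
    """Skip records that originated in the destination cluster -- breaks the cycle."""
    prov = record.get("provenance")
    if prov is None:
        return True  # never replicated before, safe to copy
    return prov["cluster"] != destination_cluster_id

rec = {"value": b"order-1",
       "provenance": make_provenance("us-west", "orders", 1700000000000)}

print(should_copy(rec, destination_cluster_id="us-east"))  # True
print(should_copy(rec, destination_cluster_id="us-west"))  # False: would be a cycle
print(should_copy({"value": b"fresh"}, "us-west"))         # True: no provenance yet
```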
31. Cluster Linking: Connecting Clusters Sans Kafka Connect
● Multi-continent replication without the need for an external system
● Offset preserving, thereby eliminating the need for offset translation
● Use cases include data sharing, cluster migration, and hybrid cloud architectures
32. Cluster Linking Architecture Overview
● Extends the existing replica fetching protocol
○ Uses the same protocol to fetch across clusters
● The cluster link holds the information the destination needs to talk to the source
● The destination cluster can create mirror topics
○ Mirror topics fetch from the source and share the source topic's configs
● Mirror topics are immutable on the destination
○ The destination recognizes the partition as a mirror and fetches it over the cluster link
● This makes a mirror topic a byte-for-byte replica of the source topic
○ Maintains offset consistency across clusters, eliminating the need for offset translation
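Operationally, setting up a link looks roughly like the following sketch. Treat it as illustrative: the exact CLI tool names and flags vary by Confluent Platform version, and the link name, hostnames, and topic are placeholders.

```properties
# link.properties -- tells the destination cluster how to reach the source cluster
bootstrap.servers=source-cluster:9092

# On the destination cluster (commands are illustrative; check your version's docs):
#   kafka-cluster-links --bootstrap-server dest:9092 --create \
#       --link my-link --config-file link.properties
#   kafka-mirrors --bootstrap-server dest:9092 --create \
#       --mirror-topic orders --link my-link
```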