"Nobody likes dealing with duplicated orders or missing payments, so it’s crucial that the data pipelines we build to process these events ensure each message is processed exactly once.
While Kafka has supported exactly-once semantics for years in its clients and stream processing libraries, Kafka Connect lacked this support for source connectors until version 3.3.
In this talk, learn how KIP-618 made exactly-once source connectors possible. Topics covered will include an overview of exactly-once support in Kafka’s client libraries, a brief refresher on the source connector API, a deep dive into some of the internal workings of Kafka Connect, and the design challenges of EOS support.
This talk assumes basic familiarity with Kafka, its client libraries, and Kafka Connect. Audience members should expect to come away with better knowledge of how to implement exactly-once source connectors and how to run Kafka Connect clusters with exactly-once support."
4. ● “Exactly-once semantics”
● “Semantics” instead of “delivery”, “guarantees”, “delivery
guarantees”, etc. (see Two Generals’ Problem)
● Levels:
○ Probably-once
○ At-least-once
○ At-most-once
○ Exactly-once
● With all else equal, exactly-once is best
● But of course, it’s the hardest to implement
5. EOS
6. Source Connectors
● Kafka stores and transmits events. Where do these events
come from, and where do they go?
● DIY producer/consumer application? Nah 👎
● Connectors: no-code (or low-code) applications to integrate
Kafka with other systems
● Sink connectors write data from Kafka to the external system
● Source connectors read data from the external system into
Kafka
8. Kafka Connect
● Distributed, horizontally-scalable, fault-tolerant ingest/export tool for Kafka
● Developers implement connectors
against the Kafka Connect API
● Cluster administrators install connectors
onto one or more Kafka Connect workers,
which combine to form a cluster
● Users can then create and manage
connectors on that cluster by submitting
JSON configurations via a REST API
● (For users) No code required!
{
  "name": "local-file-source",
  "config": {
    "connector.class": "FileStreamSource",
    "tasks.max": "1",
    "file": "test.txt",
    "topic": "connect-test"
  }
}
9. We’re going to talk about designing support for exactly-once
semantics (EOS) with source connectors developed for Kafka
Connect.
In summary…
18. Zombie fencing: actually pretty easy?
● Give each task a transactional ID derived from the name of
the connector and the task ID
○ E.g., “reddit-source-0” or “chris-ksl-3”
● Let tasks fence out older instances on startup
○ Fencing: disabling a producer from writing to Kafka
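The ID derivation above can be sketched in plain Java. The helper names below are hypothetical, and the config map is a minimal sketch of the standard transactional-producer settings; in real code it would be passed to a KafkaProducer, whose initTransactions() call bumps the epoch for the transactional ID and fences out any older producer instance using the same ID.

```java
import java.util.HashMap;
import java.util.Map;

public class FencingSketch {
    // Hypothetical helper: derive the transactional ID from the connector
    // name and the task ID, e.g. "reddit-source" + 0 -> "reddit-source-0".
    static String transactionalId(String connectorName, int taskId) {
        return connectorName + "-" + taskId;
    }

    // Producer configuration for a task; a KafkaProducer created with this
    // config that calls initTransactions() fences out older producers
    // registered under the same transactional ID.
    static Map<String, Object> producerConfig(String connectorName, int taskId) {
        Map<String, Object> config = new HashMap<>();
        config.put("transactional.id", transactionalId(connectorName, taskId));
        config.put("enable.idempotence", true);
        return config;
    }
}
```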
36. That was not a good idea
● Poor UX
○ Causes tasks to fail in between zombie fencing and end
of rebalance
○ Forcibly kills them, no chance to commit pending offsets
○ Looks like a bug to users
● Correctness issue
○ Users can manually restart failed tasks
○ Even in between zombie fencing and publishing new
task configs
○ Uh oh, a zombie task made it to the other end of the
rebalance!
37. Zombie fencing: durable task counts
● Forget the “fence then write” logic
● Instead, we explicitly track the number of to-be-fenced tasks
in the config topic with a task count record
● These serve two purposes:
○ Explicitly: if fencing is necessary, how many tasks have
to be fenced out
○ Implicitly: determine whether zombie fencing is
necessary at all
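The two purposes can be illustrated with a minimal sketch, assuming a hypothetical in-memory view of the task count record (the real record lives in the config topic, and the exact conditions KIP-618 checks are more involved):

```java
import java.util.ArrayList;
import java.util.List;

public class TaskCountSketch {
    // Implicit purpose (simplified): a round of zombie fencing is only
    // needed if a task count record exists, i.e. a previous generation of
    // tasks may still hold transactional producers. null = no record.
    static boolean fencingRequired(Integer taskCountRecord) {
        return taskCountRecord != null;
    }

    // Explicit purpose: the record says how many tasks the previous
    // generation had, so we fence exactly that many transactional IDs.
    static List<String> idsToFence(String connectorName, int taskCount) {
        List<String> ids = new ArrayList<>();
        for (int taskId = 0; taskId < taskCount; taskId++) {
            ids.add(connectorName + "-" + taskId);
        }
        return ids;
    }
}
```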
41. Laggy task startup
● Zombie fencing disables all initialized task producers from
writing to Kafka
● What if a zombie task lags and hasn’t initialized its producer
by the time zombie fencing for a new generation of tasks
takes place?
● Or, what if a task is restarted on a zombie worker after
zombie fencing takes place?
45. Caveats
● Fencing during rebalancing is not a good idea
○ Makes rebalances more brittle
○ Requires a new rebalance any time we want to restart a
task that failed due to failed zombie fencing
● Instead, we fence outside of rebalances
○ During task startup, workers issue a REST request to the
leader to perform zombie fencing for the connector
○ The leader will perform that round (if necessary), then
send back a 2XX response
○ If a non-2XX response is received, the task is marked
failed
○ Tasks can easily be restarted
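The worker-side check described above can be sketched as follows; taskMayStart is a hypothetical helper name, and the REST round trip itself is elided:

```java
public class RestFencingSketch {
    // Hypothetical worker-side check after asking the leader to perform a
    // round of zombie fencing for the connector: any 2XX status means the
    // round succeeded (or was unnecessary) and the task may start; anything
    // else marks the task failed, from where it can simply be restarted.
    static boolean taskMayStart(int httpStatus) {
        return httpStatus >= 200 && httpStatus < 300;
    }
}
```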
49. In practice (downstream readers)
● Have to filter out records from aborted transactions
● If using the Java consumer, configure with isolation.level
= read_committed
● For sink connectors, do at least one of the following:
○ Configure worker with consumer.isolation.level =
read_committed
○ Configure connector with
consumer.override.isolation.level =
read_committed (3.0.0 or later, with default
worker configuration)
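As a concrete illustration, the consumer property and the per-connector override can be written out as plain config maps; the keys are the real Kafka/Connect ones, while the helper class and method names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class ReadCommittedConfig {
    // Settings for a downstream Java consumer: with
    // isolation.level=read_committed the consumer skips records from
    // aborted (and still-open) transactions.
    static Map<String, Object> consumerConfig() {
        Map<String, Object> config = new HashMap<>();
        config.put("isolation.level", "read_committed"); // default: read_uncommitted
        return config;
    }

    // The same setting applied to a single sink connector via the
    // consumer.override. prefix (Kafka Connect 3.0.0 or later).
    static Map<String, String> sinkConnectorOverride() {
        Map<String, String> config = new HashMap<>();
        config.put("consumer.override.isolation.level", "read_committed");
        return config;
    }
}
```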
50. In practice (writing connectors)
Have to define source offsets correctly
public abstract class SourceTask {
public abstract List<SourceRecord> poll();
}
public class SourceRecord {
public SourceRecord(Map<String, ?> sourcePartition,
Map<String, ?> sourceOffset, ...)
}
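For instance, a file-reading task might define its partition and offset maps like this; the "filename" and "position" keys are hypothetical choices for illustration, since any serializable map works. The partition identifies what is being read, the offset how far the task has read, and both are handed to the SourceRecord constructor.

```java
import java.util.Map;

public class OffsetSketch {
    // Source partition: identifies the thing being read (here, one file).
    static Map<String, ?> sourcePartition(String filename) {
        return Map.of("filename", filename);
    }

    // Source offset: identifies progress within that partition (here, a
    // byte position). Stored by the framework alongside the record.
    static Map<String, ?> sourceOffset(long position) {
        return Map.of("position", position);
    }
}
```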
51. In practice (writing connectors)
public abstract class SourceTask {
protected SourceTaskContext context;
public abstract void start(Map<String, String> props);
}
public interface SourceTaskContext {
OffsetStorageReader offsetStorageReader();
}
public interface OffsetStorageReader {
<T> Map<Map<String, T>, Map<String, Object>>
offsets(Collection<Map<String, T>> partitions);
}