Apache Kafka is a distributed messaging system for publishing and subscribing to streams of records, organized into topics, in a fault-tolerant and scalable way. It is used for building real-time data pipelines and streaming applications. Producers write records to topics; each topic is split into partitions that are persisted to disk and replicated for fault tolerance. Consumers read from topics independently of producers, tracking their position by offset. Kafka handles large volumes of streaming data in real time with low latency and high throughput.
2. What is Apache Kafka?
Messaging System
Distributed
Persistent and Replicable
Very fast - low latency - and scalable
Simple but highly configurable
Created at LinkedIn, open sourced as an Apache project
3. Data Streaming
New kind of data ...
● User or application data (events) streams
● Monitoring - App, System
● App Logging
● High volume
4. Data Streaming Cont’d
… you want to process
● Using various components
● Into a target form
● Map, reduce, shuffle
● Real time or batch
5. HP Service Virtualization Use Cases
Processing of client message streams
Real-time performance modeling
Log aggregation
6. How To Solve It?
Producers and Consumers
● Distributed
● Decoupled
● Configurable
● Dynamic
10. What Can I Do?
producer.write(topic_id, message);
consumer.read(topic_id, offset);
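The two calls above can be modeled with a tiny in-memory log. This is only a sketch of the idea (append returns an offset, reads are addressed by offset and do not remove anything); `ToyLog` and its methods are hypothetical names, not the Kafka client API.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory stand-in for one topic partition; not the Kafka client API. */
class ToyLog {
    private final List<String> messages = new ArrayList<>();

    /** Appends a message and returns the offset it was stored at. */
    long write(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    /** Returns the message at the given offset; reading never modifies the log. */
    String read(long offset) {
        return messages.get((int) offset);
    }

    /** Offset that the next written message will receive. */
    long endOffset() {
        return messages.size();
    }
}
```

Because reads do not consume anything, a reader can call `read(0)` as often as it likes, which is what makes re-consuming possible.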
11. I Want To Produce
● Java/Scala client
● address of one or more brokers
● choose a topic where to produce
● highly configurable and tunable:
○ partitioner
○ number of acks (0 = async, no wait; 1 = leader only; all = every in-sync replica)
○ batching, buffer size, timeouts, retries, ...
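As a rough illustration of these knobs, the sketch below collects a few settings in a `java.util.Properties` object. The property names are the ones used by the newer Java producer client, which is an assumption here: the client version the slides describe may use different names. The broker addresses and all values are hypothetical, and a real producer would also need `key.serializer` and `value.serializer`. The `partitionFor` helper is a toy partitioner, not Kafka's default algorithm.

```java
import java.util.Properties;

class ProducerSettingsSketch {
    /** Builds example settings; names follow the newer Java producer client (assumption). */
    static Properties producerSettings() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // hypothetical brokers
        props.put("acks", "1");                  // "0" = no wait, "1" = leader only, "all" = in-sync replicas
        props.put("retries", "3");               // resend attempts after a transient failure
        props.put("batch.size", "16384");        // upper bound in bytes for one per-partition batch
        props.put("linger.ms", "5");             // wait up to 5 ms for a batch to fill
        props.put("buffer.memory", "33554432");  // total bytes the producer may buffer
        props.put("request.timeout.ms", "30000");
        return props;
    }

    /** Toy partitioner: the same key always maps to the same partition. */
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

Sending all messages with one key to one partition keeps them in order relative to each other, which is the usual reason to customize the partitioner.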
12. I Want To Consume
High Level API
● Groups abstraction
○ To All, To One
○ To Some
● Stream API
● Stores positions to support fault tolerance
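The group abstraction can be pictured as follows: inside one group the partitions are divided among the members, while every group sees all partitions. The sketch below is a simplified model with a round-robin split; `GroupAssignmentSketch` is a hypothetical name and the real assignment strategy may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupAssignmentSketch {
    /** Spreads partitions 0..numPartitions-1 round-robin over the members of ONE group. */
    static Map<String, List<Integer>> assign(List<String> members, int numPartitions) {
        Map<String, List<Integer>> owned = new TreeMap<>();
        for (String member : members) {
            owned.put(member, new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            // Each partition goes to exactly one member of the group.
            owned.get(members.get(p % members.size())).add(p);
        }
        return owned;
    }
}
```

"To One" corresponds to putting all consumers into a single group, so each message is handled by one of them. "To All" corresponds to giving every consumer its own group, so `assign` is called once per consumer and each receives every partition. "To Some" is the mix: several groups with several members each.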
13. I Want To Consume Cont’d
Low Level
● Java/Scala client
● Find the leader for a topic partition
● Calculate an offset
● Fetch messages
○ Re-consume if needed
14. I Want To Consume Cont’d
Delivery Semantics:
● At most once
● At least once
● Exactly once
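The difference between the first two semantics comes down to when the consumer saves its position relative to processing. The sketch below simulates a consumer that crashes once and restarts from its last saved offset. It is a toy model under that assumption, with hypothetical names, not client code.

```java
import java.util.ArrayList;
import java.util.List;

class DeliverySketch {
    /**
     * Replays a consumer that crashes once while handling the message at crashAt,
     * then restarts from its last committed offset. Returns what was processed.
     */
    static List<String> run(List<String> log, int crashAt, boolean commitBeforeProcessing) {
        List<String> processed = new ArrayList<>();
        int committed = 0;       // offset to resume from after a restart
        boolean crashed = false;
        int offset = 0;
        while (offset < log.size()) {
            boolean crashNow = (offset == crashAt && !crashed);
            if (commitBeforeProcessing) {
                committed = offset + 1;               // position saved first
                if (crashNow) { crashed = true; offset = committed; continue; }
                processed.add(log.get(offset));
            } else {
                processed.add(log.get(offset));       // work done first
                if (crashNow) { crashed = true; offset = committed; continue; }
                committed = offset + 1;
            }
            offset++;
        }
        return processed;
    }
}
```

With the log [a, b, c] and a crash at offset 1, committing first yields [a, c] (b is lost: at most once), while committing afterwards yields [a, b, b, c] (b is seen twice: at least once). Exactly once needs something extra on top of at least once, for example processing that tolerates duplicates or storing the offset together with the output.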
16. Kafka Internals - Disks Cont’d
Disks are fast ...
… when properly used
● sequential access - read ahead, write behind
● rely on operating system
○ avoid heap, materialization and GC
● it’s more like file copy over network
It’s easy … with immutable topics
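The "file copy over network" point can be illustrated with `FileChannel.transferTo`, which lets the operating system move bytes from a file to another channel without staging them in a Java heap buffer where the platform supports it. The sketch below is a minimal example under that assumption; `ZeroCopySketch` and the `segment` parameter are hypothetical names.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ZeroCopySketch {
    /** Sends the whole file to the target channel and returns the number of bytes sent. */
    static long send(Path segment, WritableByteChannel target) throws IOException {
        try (FileChannel in = FileChannel.open(segment, StandardOpenOption.READ)) {
            long size = in.size();
            long sent = 0;
            while (sent < size) {
                // transferTo may move fewer bytes than requested, so loop until done.
                sent += in.transferTo(sent, size - sent, target);
            }
            return sent;
        }
    }
}
```

In a broker-like setting the target would be a socket channel; any `WritableByteChannel` works for trying it out.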
17. Kafka Internals - Replication
“In Sync” Replicas
● Replication factor on partition basis
● One leader + 0..n follower replicas
● Follower replicas are consumers of the leader
○ “In Sync” if they are not “too far” behind the leader
○ Batch sync
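One way to read "not too far behind" is as a lag threshold. The sketch below assumes lag is measured in messages and compared against a hypothetical `maxLag` value; the broker's real criterion is configurable and depends on the Kafka version, so treat this only as a model of the idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class IsrSketch {
    /** Returns the followers whose end offset is within maxLag messages of the leader's. */
    static List<String> inSync(long leaderEndOffset, Map<String, Long> followerEndOffsets, long maxLag) {
        List<String> inSync = new ArrayList<>();
        for (Map.Entry<String, Long> follower : followerEndOffsets.entrySet()) {
            long lag = leaderEndOffset - follower.getValue();
            if (lag <= maxLag) {
                inSync.add(follower.getKey());
            }
        }
        return inSync;
    }
}
```

A follower that falls out of this set stops counting toward acknowledgements until it catches up again.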
18. Kafka Internals - Replication Cont’d
Tunable Trade-Offs
● Producer’s write method:
○ Not blocked, async
○ Waits for leader ACK
○ Waits for all in-sync replicas
● Consumer pulls only committed messages
● Server’s minimum in-sync replicas (min.insync.replicas)
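These trade-offs can be sketched as two small rules: whether a write is acknowledged for a given acks setting, and up to which offset consumers may read. The model below assumes the minimum in-sync replica count is only checked when the producer waits for all in-sync replicas; the class and method names are hypothetical.

```java
class AckSketch {
    /**
     * Decides whether a write is acknowledged to the producer.
     * acks: 0 = do not wait, 1 = leader only, -1 = all in-sync replicas.
     * inSyncCount includes the leader.
     */
    static boolean acknowledged(int acks, int inSyncCount, int minInSync) {
        if (acks == -1) {
            return inSyncCount >= minInSync;  // too few in-sync replicas: write is rejected
        }
        return true;  // acks 0 or 1 do not consult the minimum in this model
    }

    /** First offset NOT yet visible to consumers: the smallest end offset among in-sync replicas. */
    static long committedUpTo(long[] inSyncEndOffsets) {
        if (inSyncEndOffsets.length == 0) {
            throw new IllegalArgumentException("the leader is always in sync");
        }
        long min = Long.MAX_VALUE;
        for (long end : inSyncEndOffsets) {
            min = Math.min(min, end);
        }
        return min;
    }
}
```

For example, with in-sync end offsets 10, 8 and 9, consumers can read offsets 0 to 7; the messages at 8 and 9 exist on the leader but are not committed yet.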