Nowadays, IT systems have become complex. Do I need yet another messaging system in my already long tech stack? We will see how Kafka compares to traditional messaging systems, what's inside it, and how to use it in a Java stack.
2. Who is this person?
Saulius Tvarijonas - saulius.tvarijonas@gmail.com
● VP of Software Development @ CUJO
● 20 years of Java experience
● Area of Interest
○ Distributed Systems
○ Big Data
○ Performance optimizations
○ DevOps
3. What is CUJO?
● CUJO is a smart firewall that keeps your connected home safe
● Security
● Parental Control
● Big Data & Machine Learning based protection
www.getcujo.com
5. History
● Originally developed by LinkedIn
● Open sourced in early 2011
● Its authors went on to found Confluent
● Latest version 0.10.1.1
● Scala + Java
7. Concepts
● Kafka is run as a cluster on one or more servers
● The Kafka cluster stores streams of records in categories called topics
● Each topic is divided into one or more partitions
● Each record consists of a key, a value, and a timestamp
8. Topic
Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log.
Logs are rotated and deleted based on policy configuration.
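The commit-log idea can be sketched in a few lines of plain Java. This is only an illustrative in-memory model, not Kafka's storage code; the class name PartitionLogSketch and its methods are hypothetical, and retention (rotation and deletion) is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative in-memory model of one partition: an append-only list of records.
// Not Kafka's storage code; all names here are hypothetical.
final class PartitionLogSketch {
    private final List<String> records = new ArrayList<>();

    // Appends a record and returns its offset (its position in the log).
    long append(String value) {
        records.add(value);
        return records.size() - 1;
    }

    // Reads the record stored at the given offset; existing records never change.
    String read(long offset) {
        return records.get((int) offset);
    }

    // Offset that the next appended record will receive.
    long endOffset() {
        return records.size();
    }
}
```

Appending "a" and then "b" to a fresh log returns offsets 0 and 1, and read(0) keeps returning "a" no matter how many records follow.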
10. Partition
● Allow the log to scale beyond a size that will fit on a single server
● Act as the unit of parallelism
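One common way records are spread over partitions is by hashing the record key. A minimal sketch, assuming a simple array hash as a stand-in for the hash function the real client uses; KeyPartitionerSketch and partitionFor are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class KeyPartitionerSketch {
    // Maps a key to a partition in [0, numPartitions).
    // Assumption: Arrays.hashCode stands in for the real client's hash function.
    static int partitionFor(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        // floorMod keeps the result non-negative even for negative hashes.
        return Math.floorMod(hash, numPartitions);
    }
}
```

Because the same key always lands in the same partition, records for one key stay in order, while different partitions can be consumed in parallel.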
11. Replication
● Topics can (and should) be replicated
● Unit of replication is partition
● Each partition has 1 leader and 0 or more follower replicas
● ISR = In-Sync Replica
13. Durability Guarantees
● Options
○ Do not wait for ACK
○ Wait for ACK from leader
○ Wait for ACK from all ISRs
● Disable unclean leader election
● Specify a minimum ISR size
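The options above map onto a handful of configuration keys. A sketch using plain java.util.Properties; the helper class and method names are hypothetical, and which values suit a given cluster is left as an assumption for the reader to decide.

```java
import java.util.Properties;

final class DurabilitySettingsSketch {
    // Producer side: "acks" selects which acknowledgement the producer waits for.
    // "0" = do not wait, "1" = leader only, "all" = all in-sync replicas.
    static Properties producerProps(String acks) {
        Properties p = new Properties();
        p.setProperty("acks", acks);
        return p;
    }

    // Broker or topic side: settings matching the last two bullets.
    static Properties brokerProps(int minIsr) {
        Properties p = new Properties();
        p.setProperty("unclean.leader.election.enable", "false");
        p.setProperty("min.insync.replicas", Integer.toString(minIsr));
        return p;
    }
}
```

With acks set to "all" and a minimum ISR size of 2, a write is acknowledged only after at least two in-sync replicas have it.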
14. Replica Management
● One broker elected as controller
● ZooKeeper used for metadata and coordination
● Rebalancing, rebalancing, rebalancing...
15. Message Delivery Semantics
● At most once: messages may be lost but are never redelivered.
● At least once: messages are never lost but may be redelivered.
● Exactly once: this is what people actually want; each message is delivered once and only once.
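On the consumer side, the difference between the first two semantics comes down to whether the offset is committed before or after the message is processed. The simulation below is a sketch with hypothetical names; it models a single crash and restart in memory and does not use the Kafka client.

```java
import java.util.ArrayList;
import java.util.List;

final class DeliverySemanticsSketch {
    // Simulates a consumer that crashes once while handling the message at crashAt,
    // then restarts from the last committed offset. Returns the messages processed.
    static List<String> run(List<String> log, int crashAt, boolean commitBeforeProcessing) {
        List<String> processed = new ArrayList<>();
        int committed = 0;        // next offset to read after a restart
        boolean crashed = false;
        int offset = committed;
        while (offset < log.size()) {
            if (commitBeforeProcessing) {
                committed = offset + 1;               // at most once: commit first
                if (offset == crashAt && !crashed) {  // crash before processing
                    crashed = true;
                    offset = committed;               // restart: the message is skipped
                    continue;
                }
                processed.add(log.get(offset));
            } else {
                processed.add(log.get(offset));       // at least once: process first
                if (offset == crashAt && !crashed) {  // crash before committing
                    crashed = true;
                    offset = committed;               // restart: the message is read again
                    continue;
                }
                committed = offset + 1;
            }
            offset++;
        }
        return processed;
    }
}
```

For the log [a, b, c] with a crash at offset 1, committing first yields [a, c] (b is lost), while processing first yields [a, b, b, c] (b is redelivered).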
17. Streams
● Simple and lightweight client library
● Transparently handles the load balancing of multiple instances
● Fault-tolerant local state
● Time based window operations
● Map, Filter, Join, ...
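To give a feel for a time-based window operation, here is a plain-Java sketch of a tumbling (fixed-size, non-overlapping) window count. It does not use the Kafka Streams API; TumblingWindowSketch and countPerWindow are hypothetical names chosen for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

final class TumblingWindowSketch {
    // Counts events per fixed-size, non-overlapping time window.
    // Returns a map from window start (ms) to the number of events in that window.
    static Map<Long, Integer> countPerWindow(long[] timestampsMs, long windowSizeMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestampsMs) {
            // Round the timestamp down to the start of its window.
            long windowStart = ts - Math.floorMod(ts, windowSizeMs);
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }
}
```

With a 1000 ms window, timestamps 0, 999, 1000 and 2500 fall into windows starting at 0 (two events), 1000 (one event) and 2000 (one event).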
20. Message Format
1. 4 byte CRC32 of the message
2. 1 byte "magic" identifier to allow format changes, value is 0 or 1
3. 1 byte "attributes" identifier to allow annotations on the message independent of the version
   bit 0 ~ 2 : Compression codec.
      0 : no compression
      1 : gzip
      2 : snappy
      3 : lz4
   bit 3 : Timestamp type
      0 : create time
      1 : log append time
   bit 4 ~ 7 : reserved
4. (Optional) 8 byte timestamp only if "magic" identifier is greater than 0
5. 4 byte key length, containing length K
6. K byte key
7. 4 byte payload length, containing length V
8. V byte payload
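The layout above can be written out with a ByteBuffer. This is a sketch for an uncompressed message with magic = 1 and a non-null key; it assumes big-endian integers and that the CRC covers every byte after the CRC field itself. MessageFormatSketch is a hypothetical name, not part of the Kafka client.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

final class MessageFormatSketch {
    // Encodes an uncompressed message with magic = 1, following the field order listed above.
    static byte[] encode(long timestampMs, byte[] key, byte[] value) {
        int bodyLen = 1 + 1 + 8 + 4 + key.length + 4 + value.length;
        ByteBuffer buf = ByteBuffer.allocate(4 + bodyLen); // big-endian by default
        buf.position(4);            // leave room for the CRC
        buf.put((byte) 1);          // magic
        buf.put((byte) 0);          // attributes: no compression, create time
        buf.putLong(timestampMs);   // present because magic > 0
        buf.putInt(key.length);     // K
        buf.put(key);
        buf.putInt(value.length);   // V
        buf.put(value);

        CRC32 crc = new CRC32();
        // Assumption: the CRC is computed over everything after the CRC field.
        crc.update(buf.array(), 4, bodyLen);
        buf.putInt(0, (int) crc.getValue());
        return buf.array();
    }
}
```

A message with a 1-byte key and a 5-byte payload comes out at 28 bytes: 4 (CRC) + 1 + 1 + 8 + 4 + 1 + 4 + 5.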
21. Use Cases
● Messaging System (ActiveMQ, RabbitMQ)
● Storage System
● Stream Processing (Storm, Samza, Spark)
● Log Aggregation (Scribe, Flume)
● Metrics
22. Efficiency
● Main reasons for high throughput and low latency
○ Batching of individual messages
○ Zero copy I/O using sendfile()
○ Heavy reliance on the Linux page cache
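In Java, the zero-copy path is reachable through FileChannel.transferTo, which on Linux can be backed by sendfile() when the target is a socket. A minimal sketch with hypothetical names; whether a true zero-copy transfer happens depends on the platform and the target channel.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ZeroCopySketch {
    // Sends a whole file to the target channel, avoiding a copy through a
    // user-space buffer where the platform supports it.
    // Returns the number of bytes transferred.
    static long sendFile(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
            long position = 0;
            long size = source.size();
            while (position < size) {
                // transferTo may move fewer bytes than requested, so loop until done.
                position += source.transferTo(position, size - position, target);
            }
            return position;
        }
    }
}
```

The target can be any WritableByteChannel, for example a SocketChannel connected to a consumer.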
27. Mirroring data between clusters
● Provide a replica in another datacenter
● Different number of partitions
● Ordering per key is preserved, but offsets differ
● Do not use as a fault-tolerance mechanism