This talk was presented at the Hotstar Scale Meetup in Bangalore by Jayesh Sidhwani.
In this talk, the presenter introduces Apache Kafka and the Apache Kafka Streams library, covering the motivation for building streaming applications, how to frame a use-case as a streaming job, and the technical details along the way.
It ends with a short description of how Kafka is deployed and used at Hotstar.
2. Agenda
● What is Apache Kafka?
● Why do we need stream processing?
● Stream processing using Apache Kafka
● Kafka @ Hotstar
Feel free to stop me for questions
3. $ whoami
● Personalisation lead at Hotstar
● Led Data Infrastructure team at Grofers and TinyOwl
● Kafka fanboy
● Usually rant on twitter @jayeshsidhwani
4. What is Kafka?
● Kafka is a scalable,
fault-tolerant, distributed queue
● Producers and Consumers
● Uses
○ Asynchronous communication in
event-driven architectures
○ Message broadcast for database
replication
Diagram credits: http://kafka.apache.org
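The producer/consumer decoupling can be sketched with a toy in-memory topic (plain Java, not the Kafka client API; class and method names here are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a Kafka topic: producers append messages, and each
// consumer tracks its own read position, so the same message can be
// delivered to many consumers independently (broadcast).
public class ToyTopic {
    private final List<String> log = new ArrayList<>();

    public void produce(String message) {
        log.add(message);
    }

    // position is a 1-element array acting as a mutable offset per consumer.
    public String consume(int[] position) {
        if (position[0] >= log.size()) return null; // nothing new yet
        return log.get(position[0]++);
    }

    public static void main(String[] args) {
        ToyTopic topic = new ToyTopic();
        topic.produce("order-created");
        topic.produce("order-paid");

        int[] billing = {0};   // one consumer's offset
        int[] shipping = {0};  // another consumer's offset
        System.out.println(topic.consume(billing));  // order-created
        System.out.println(topic.consume(shipping)); // order-created (same message, own offset)
    }
}
```

Because producers never wait for consumers, this is the asynchronous, event-driven communication pattern the slide describes.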
5. Inside Kafka
● Brokers
○ Heart of Kafka
○ Store the data
○ Data is stored in topics
● Zookeeper
○ Manages cluster state information
○ Leader election
[Diagram: three brokers holding topics, with producers (P) writing in and consumers (C) reading out, coordinated by a Zookeeper ensemble]
6. Inside a topic
● Topics are partitioned
○ A partition is an append-only commit-log file
○ Achieves horizontal scalability
● Messages written to a partition are ordered
● Each message gets an auto-incrementing offset #
○ {“user_id”: 1, “term”: “GoT”} is a message in the topic searched
Diagram credits: http://kafka.apache.org
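The append-only log with auto-incrementing offsets can be sketched in a few lines (toy code, not the broker's actual storage format; class names are invented):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a single topic partition: an append-only commit log where
// every message is assigned the next auto-incrementing offset.
public class ToyPartition {
    private final List<String> log = new ArrayList<>();

    // Append-only: returns the offset assigned to this message.
    public long append(String message) {
        log.add(message);
        return log.size() - 1;
    }

    public String read(long offset) {
        return log.get((int) offset);
    }

    public static void main(String[] args) {
        ToyPartition searched = new ToyPartition();
        long o1 = searched.append("{\"user_id\": 1, \"term\": \"GoT\"}");
        long o2 = searched.append("{\"user_id\": 2, \"term\": \"F1\"}");
        System.out.println(o1 + " " + o2); // 0 1 — offsets are ordered within a partition
    }
}
```

Ordering holds per partition only; two messages in different partitions of the same topic have no relative order.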
7. How do consumers read?
● A consumer subscribes to a topic
● Consumers read from the head of the queue
● Multiple consumers can read from a single topic
Diagram credits: http://kafka.apache.org
8. Kafka consumers scale horizontally
● Consumers can be grouped
● Consumer Groups
○ Horizontally scalable
○ Fault tolerant
○ Delivery guaranteed
Diagram credits: http://kafka.apache.org
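The scaling idea behind consumer groups is that a topic's partitions are divided among the group's members, so each partition is read by exactly one consumer in the group. A minimal sketch of that idea (simple round-robin assignment, not Kafka's actual group-rebalance protocol):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: assign N partitions across consumer-group members round-robin.
// Each partition gets exactly one owner, so adding consumers (up to the
// partition count) spreads the read load horizontally.
public class ToyGroupAssignor {
    public static Map<String, List<Integer>> assign(List<String> members, int partitions) {
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        for (String m : members) out.put(m, new ArrayList<>());
        for (int p = 0; p < partitions; p++) {
            out.get(members.get(p % members.size())).add(p);
        }
        return out;
    }

    public static void main(String[] args) {
        // 4 partitions, 2 consumers: each consumer owns 2 partitions.
        System.out.println(assign(List.of("c1", "c2"), 4)); // {c1=[0, 2], c2=[1, 3]}
    }
}
```

Fault tolerance follows from the same mechanism: when a member dies, its partitions are reassigned to the surviving members.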
10. Discrete data processing models
[Diagram: apps communicating directly via request/response]
● Request / Response processing mode
○ Processing time: <1 second
○ Clients can use this data
11. Discrete data processing models
● Request / Response processing mode
○ Processing time: <1 second
○ Clients can use this data
● Batch processing mode (DWH / Hadoop)
○ Processing time: a few hours to a day
○ Analysts can use this data
12. Discrete data processing models
● As the system grows, such a synchronous processing model leads to a spaghetti-like, unmaintainable design
[Diagram: apps wired directly into search, monitoring and cache systems]
13. Promise of stream processing
● Untangle movement of data
○ Single source of truth
○ No duplicate writes
○ Anyone can consume anything
○ Decouples data generation from data computation
[Diagram: apps publishing into a stream processing framework that feeds search, monitoring and cache]
14. Promise of stream processing
● Untangle movement of data
○ Single source of truth
○ No duplicate writes
○ Anyone can consume anything
● Process, transform and react to the data as it happens
○ Sub-second latencies
○ Anomaly detection on bad stream quality
○ Timely notifications to users who dropped off in a live match
[Diagram: apps feeding a stream processing framework with filter, window, join and anomaly-action stages that produce intelligence]
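One of the operators named above, windowing, can be sketched in a few lines. This toy example counts events per tumbling time window, assuming epoch-millisecond timestamps (not the windowing API of any real framework):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of a tumbling window: bucket event timestamps into fixed-size
// windows and count events per window. A stream job could use such counts
// for, say, per-second playback-error rates to flag bad stream quality.
public class ToyWindowCount {
    public static Map<Long, Integer> countPerWindow(long[] timestampsMs, long windowMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestampsMs) {
            long windowStart = (ts / windowMs) * windowMs; // align to window boundary
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] events = {100, 950, 1200, 1800, 2500};
        // 1-second tumbling windows:
        System.out.println(countPerWindow(events, 1000)); // {0=2, 1000=2, 2000=1}
    }
}
```

Real frameworks add what this sketch omits: late-arriving events, state that survives restarts, and windows that emit results continuously rather than at the end.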
16. Stream processing frameworks
● Write your own?
○ Windowing
○ State management
○ Fault tolerance
○ Scalability
● Use frameworks such as Apache Spark, Samza or Storm
○ Batteries included
○ Cluster manager to coordinate resources
○ High memory / CPU footprint
17. Kafka Streams
● Kafka Streams is a simple, low-latency stream processing library with no dependency on an external processing framework
● Simple DSL
● Same principles as the Kafka consumer (minus the operational overhead)
● No cluster manager! yay!
18. Writing Kafka Streams
● Define a processing topology
○ Source nodes
○ Processor nodes
■ One or more
■ Filtering, windowing, joins, etc.
○ Sink nodes
● Compile it and run it like any other Java application
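The source → processor → sink shape can be sketched with plain java.util.stream as a stand-in for the Kafka Streams DSL (a toy pipeline over an in-memory list; the topic contents and the filter predicate are invented for illustration):

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of a processing topology using plain Java streams as a stand-in:
// source node (read messages) -> processor nodes (filter, transform)
// -> sink node (write results out).
public class ToyTopology {
    public static List<String> run(List<String> sourceTopic) {
        return sourceTopic.stream()
                .filter(msg -> msg.contains("GoT"))   // processor: filter
                .map(String::toUpperCase)             // processor: transform
                .collect(Collectors.toList());        // sink: materialize output
    }

    public static void main(String[] args) {
        List<String> searched = List.of("user:1 term:GoT", "user:2 term:F1");
        System.out.println(run(searched)); // [USER:1 TERM:GOT]
    }
}
```

The real DSL has the same declarative feel, except the source and sink are Kafka topics and the pipeline runs continuously on unbounded data instead of once over a finite list.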
27. Stream <> Table duality
● Heart of Kafka Streams
● A stream is a changelog of events
Diagram credits: http://confluent.io
28. Stream <> Table duality
● Heart of Kafka Streams
● A stream is a changelog of events
● A table is a compacted stream
Diagram credits: http://confluent.io
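The duality can be shown directly: replaying a changelog of (key, value) events and keeping only the latest value per key yields the table, which is what compaction does to a topic (toy code, not the KTable API; the keys and values are invented):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of stream <> table duality: a table is what you get by replaying
// a changelog stream and keeping only the latest value per key (compaction).
public class ToyCompaction {
    public static Map<String, String> toTable(String[][] changelog) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String[] event : changelog) {
            table.put(event[0], event[1]); // later events overwrite earlier ones
        }
        return table;
    }

    public static void main(String[] args) {
        String[][] stream = {
            {"alice", "watching:GoT"},
            {"bob",   "watching:F1"},
            {"alice", "watching:IPL"}, // update for an existing key
        };
        System.out.println(toTable(stream)); // {alice=watching:IPL, bob=watching:F1}
    }
}
```

The duality runs both ways: the sequence of put() calls above is itself a changelog stream, so a table can always be turned back into a stream of its updates.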