This talk was presented at the Hotstar Scale Meetup in Bangalore by Jayesh Sidhwani.
In this talk, the presenter introduces Apache Kafka and the Apache Kafka Streams library, covering the motivation for building streaming applications, how to frame a use-case as a streaming job, and the technical details along the way.
It ends with a short description of how Kafka is deployed and used at Hotstar.
2. Agenda
● What is Apache Kafka?
● Why do we need stream processing?
● Stream processing using Apache Kafka
● Kafka @ Hotstar
Feel free to stop me for questions
3. $ whoami
● Personalisation lead at Hotstar
● Led Data Infrastructure team at Grofers and TinyOwl
● Kafka fanboy
● Usually rant on twitter @jayeshsidhwani
4. What is Kafka?
● Kafka is a scalable,
fault-tolerant, distributed queue
● Producers and Consumers
● Uses
○ Asynchronous communication in
event-driven architectures
○ Message broadcast for database
replication
Diagram credits: http://kafka.apache.org
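The producer/consumer decoupling can be sketched with a toy in-memory topic (plain Java, not the Kafka client API; class and method names here are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a Kafka topic: producers append messages, and each
// consumer tracks its own read position, so the same message can be
// delivered to many consumers independently (broadcast).
public class ToyTopic {
    private final List<String> log = new ArrayList<>();

    public void produce(String message) {
        log.add(message);
    }

    // position is a 1-element array acting as a mutable offset per consumer.
    public String consume(int[] position) {
        if (position[0] >= log.size()) return null; // nothing new yet
        return log.get(position[0]++);
    }

    public static void main(String[] args) {
        ToyTopic topic = new ToyTopic();
        topic.produce("order-created");
        topic.produce("order-paid");

        int[] billing = {0};   // one consumer's offset
        int[] shipping = {0};  // another consumer's offset
        System.out.println(topic.consume(billing));  // order-created
        System.out.println(topic.consume(shipping)); // order-created (same message, own offset)
    }
}
```

Because producers never wait for consumers, this is the asynchronous, event-driven communication pattern the slide describes.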
5. Inside Kafka
● Brokers
○ Heart of Kafka
○ Store the data
○ Data is stored in topics
● Zookeeper
○ Manages cluster state information
○ Leader election
[Diagram: three brokers holding topics, with producers (P) writing in and consumers (C) reading out, coordinated by a Zookeeper ensemble]
6. Inside a topic
● Topics are partitioned
○ A partition is an append-only commit-log file
○ Achieves horizontal scalability
● Messages written to a partition are ordered
● Each message gets an auto-incrementing offset #
○ {“user_id”: 1, “term”: “GoT”} is a message in the topic searched
Diagram credits: http://kafka.apache.org
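The append-only log with auto-incrementing offsets can be sketched in a few lines (toy code, not the broker's actual storage format; class names are invented):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a single topic partition: an append-only commit log where
// every message is assigned the next auto-incrementing offset.
public class ToyPartition {
    private final List<String> log = new ArrayList<>();

    // Append-only: returns the offset assigned to this message.
    public long append(String message) {
        log.add(message);
        return log.size() - 1;
    }

    public String read(long offset) {
        return log.get((int) offset);
    }

    public static void main(String[] args) {
        ToyPartition searched = new ToyPartition();
        long o1 = searched.append("{\"user_id\": 1, \"term\": \"GoT\"}");
        long o2 = searched.append("{\"user_id\": 2, \"term\": \"F1\"}");
        System.out.println(o1 + " " + o2); // 0 1 — offsets are ordered within a partition
    }
}
```

Ordering holds per partition only; two messages in different partitions of the same topic have no relative order.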
7. How do consumers read?
● A consumer subscribes to a topic
● Consumers read from the head of the queue
● Multiple consumers can read from a single topic
Diagram credits: http://kafka.apache.org
8. Kafka consumers scale horizontally
● Consumers can be grouped
● Consumer Groups
○ Horizontally scalable
○ Fault tolerant
○ Delivery guaranteed
Diagram credits: http://kafka.apache.org
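The scaling idea behind consumer groups is that a topic's partitions are divided among the group's members, so each partition is read by exactly one consumer in the group. A minimal sketch of that idea (simple round-robin assignment, not Kafka's actual group-rebalance protocol):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: assign N partitions across consumer-group members round-robin.
// Each partition gets exactly one owner, so adding consumers (up to the
// partition count) spreads the read load horizontally.
public class ToyGroupAssignor {
    public static Map<String, List<Integer>> assign(List<String> members, int partitions) {
        Map<String, List<Integer>> out = new LinkedHashMap<>();
        for (String m : members) out.put(m, new ArrayList<>());
        for (int p = 0; p < partitions; p++) {
            out.get(members.get(p % members.size())).add(p);
        }
        return out;
    }

    public static void main(String[] args) {
        // 4 partitions, 2 consumers: each consumer owns 2 partitions.
        System.out.println(assign(List.of("c1", "c2"), 4)); // {c1=[0, 2], c2=[1, 3]}
    }
}
```

Fault tolerance follows from the same mechanism: when a member dies, its partitions are reassigned to the surviving members.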
10. Discrete data processing models
[Diagram: apps communicating directly via request/response]
● Request / Response processing mode
○ Processing time: <1 second
○ Clients can use this data
11. Discrete data processing models
● Request / Response processing mode
○ Processing time: <1 second
○ Clients can use this data
● Batch processing mode (DWH / Hadoop)
○ Processing time: a few hours to a day
○ Analysts can use this data
12. Discrete data processing models
● As the system grows, such a synchronous processing model leads to a spaghetti-like, unmaintainable design
[Diagram: apps wired directly into search, monitoring and cache systems]
13. Promise of stream processing
● Untangle movement of data
○ Single source of truth
○ No duplicate writes
○ Anyone can consume anything
○ Decouples data generation from data computation
[Diagram: apps publishing into a stream processing framework that feeds search, monitoring and cache]
14. Promise of stream processing
● Untangle movement of data
○ Single source of truth
○ No duplicate writes
○ Anyone can consume anything
● Process, transform and react to the data as it happens
○ Sub-second latencies
○ Anomaly detection on bad stream quality
○ Timely notifications to users who dropped off in a live match
[Diagram: apps feeding a stream processing framework with filter, window, join and anomaly-action stages that produce intelligence]
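One of the operators named above, windowing, can be sketched in a few lines. This toy example counts events per tumbling time window, assuming epoch-millisecond timestamps (not the windowing API of any real framework):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of a tumbling window: bucket event timestamps into fixed-size
// windows and count events per window. A stream job could use such counts
// for, say, per-second playback-error rates to flag bad stream quality.
public class ToyWindowCount {
    public static Map<Long, Integer> countPerWindow(long[] timestampsMs, long windowMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestampsMs) {
            long windowStart = (ts / windowMs) * windowMs; // align to window boundary
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] events = {100, 950, 1200, 1800, 2500};
        // 1-second tumbling windows:
        System.out.println(countPerWindow(events, 1000)); // {0=2, 1000=2, 2000=1}
    }
}
```

Real frameworks add what this sketch omits: late-arriving events, state that survives restarts, and windows that emit results continuously rather than at the end.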
16. Stream processing frameworks
● Write your own?
○ Windowing
○ State management
○ Fault tolerance
○ Scalability
● Use frameworks such as Apache Spark, Samza or Storm
○ Batteries included
○ Cluster manager to coordinate resources
○ High memory / CPU footprint
17. Kafka Streams
● Kafka Streams is a simple, low-latency stream processing library with no dependency on an external processing framework
● Simple DSL
● Same principles as the Kafka consumer (minus the operational overhead)
● No cluster manager! yay!
18. Writing Kafka Streams
● Define a processing topology
○ Source nodes
○ Processor nodes
■ One or more
■ Filtering, windowing, joins, etc.
○ Sink nodes
● Compile it and run it like any other Java application
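The source → processor → sink shape can be sketched with plain java.util.stream as a stand-in for the Kafka Streams DSL (a toy pipeline over an in-memory list; the topic contents and the filter predicate are invented for illustration):

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of a processing topology using plain Java streams as a stand-in:
// source node (read messages) -> processor nodes (filter, transform)
// -> sink node (write results out).
public class ToyTopology {
    public static List<String> run(List<String> sourceTopic) {
        return sourceTopic.stream()
                .filter(msg -> msg.contains("GoT"))   // processor: filter
                .map(String::toUpperCase)             // processor: transform
                .collect(Collectors.toList());        // sink: materialize output
    }

    public static void main(String[] args) {
        List<String> searched = List.of("user:1 term:GoT", "user:2 term:F1");
        System.out.println(run(searched)); // [USER:1 TERM:GOT]
    }
}
```

The real DSL has the same declarative feel, except the source and sink are Kafka topics and the pipeline runs continuously on unbounded data instead of once over a finite list.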
27. Stream <> Table duality
● Heart of Kafka Streams
● A stream is a changelog of events
Diagram credits: http://confluent.io
28. Stream <> Table duality
● Heart of Kafka Streams
● A stream is a changelog of events
● A table is a compacted stream
Diagram credits: http://confluent.io
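The duality can be shown directly: replaying a changelog of (key, value) events and keeping only the latest value per key yields the table, which is what compaction does to a topic (toy code, not the KTable API; the keys and values are invented):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of stream <> table duality: a table is what you get by replaying
// a changelog stream and keeping only the latest value per key (compaction).
public class ToyCompaction {
    public static Map<String, String> toTable(String[][] changelog) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String[] event : changelog) {
            table.put(event[0], event[1]); // later events overwrite earlier ones
        }
        return table;
    }

    public static void main(String[] args) {
        String[][] stream = {
            {"alice", "watching:GoT"},
            {"bob",   "watching:F1"},
            {"alice", "watching:IPL"}, // update for an existing key
        };
        System.out.println(toTable(stream)); // {alice=watching:IPL, bob=watching:F1}
    }
}
```

The duality runs both ways: the sequence of put() calls above is itself a changelog stream, so a table can always be turned back into a stream of its updates.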