SlideShare una empresa de Scribd logo
1 de 28
Descargar para leer sin conexión
Real time stream processing using
Apache Kafka
Agenda
● What is Apache Kafka?
● Why do we need stream processing?
● Stream processing using Apache Kafka
● Kafka @ Hotstar
Feel free to stop me for questions
2
$ whoami
● Personalisation lead at Hotstar
● Led Data Infrastructure team at Grofers and TinyOwl
● Kafka fanboy
● Usually rant on twitter @jayeshsidhwani
3
What is Kafka?
4
● Kafka is a scalable,
fault-tolerant, distributed queue
● Producers and Consumers
● Uses
○ Asynchronous communication in
event-driven architectures
○ Message broadcast for database
replication
Diagram credits: http://kafka.apache.org
● Brokers
○ Heart of Kafka
○ Stores data
○ Data stored into topics
● Zookeeper
○ Manages cluster state information
○ Leader election
Inside Kafka
5
BROKER
ZOOKEEPER
BROKER BROKER
ZOOKEEPER
TOPIC
TOPIC
TOPIC
P P P
C C C
● Topics are partitioned
○ A partition is a append-only
commit-log file
○ Achieves horizontal scalability
● Messages written in a
partitions are ordered
● Each message gets an
auto-incrementing offset #
○ {“user_id”: 1, “term”: “GoT”} is
a message in the topic searched
Inside a topic
6Diagram credits: http://kafka.apache.org
How do consumers read?
● Consumer subscribes to a topic
● Consumers read from the head
of the queue
● Multiple consumers can read
from a single topic
7Diagram credits: http://kafka.apache.org
Kafka consumer scales horizontally
● Consumers can be grouped
● Consumer Groups
○ Horizontally scalable
○ Fault tolerant
○ Delivery guaranteed
8Diagram credits: http://kafka.apache.org
Stream processing and its use-cases
9
Discrete data processing models
10
APP APP APP
● Request / Response
processing mode
○ Processing time <1
second
○ Clients can use this
data
Discrete data processing models
11
APP APP APP
● Request / Response
processing mode
○ Processing time <1
second
○ Clients can use this
data
DWH HADOOP
● Batch processing
mode
○ Processing time few
hours to a day
○ Analysts can use this
data
Discrete data processing models
12
● As the system grows, such
synchronous processing
model leads to a spaghetti
and unmaintainable design
APP APP APPAPP
SEARCH
MONIT
CACHE
Promise of stream processing
13
● Untangle movement of data
○ Single source of truth
○ No duplicate writes
○ Anyone can consume anything
○ Decouples data generation
from data computation
APPAPP APP APP
SEARCH
MONIT
CACHE
STREAM PROCESSING FRAMEWORK
Promise of stream processing
14
● Untangle movement of data
○ Single source of truth
○ No duplicate writes
○ Anyone can consume anything
● Process, transform and react
on the data as it happens
○ Sub-second latencies
○ Anomaly detection on bad stream
quality
○ Timely notification to users who
dropped off in a live match Intelligence
APPAPP APP APP
STREAM PROCESSING FRAMEWORK
Filter
Window
Join
Anomaly Action
Stream processing using Kafka
15
Stream processing frameworks
● Write your own?
○ Windowing
○ State management
○ Fault tolerance
○ Scalability
● Use frameworks such as Apache Spark, Samza, Storm
○ Batteries attached
○ Cluster manager to coordinate resources
○ High memory / cpu footprint
16
Kafka Streams
● Kafka Streams is a simple, low-latency, framework
independent stream processing framework
● Simple DSL
● Same principles as Kafka consumer (minus operations
overhead)
● No cluster manager! yay!
17
Writing Kafka Streams
● Define a processing topology
○ Source nodes
○ Processor nodes
■ One or more
■ Filtering, windowing, joins etc
○ Sink nodes
● Compile it and run like any other java
application
18
Demo
Simple Kafka Stream
19
Kafka Streams architecture and operations
● Kafka manages
○ Parallelism
○ Fault tolerance
○ Ordering
○ State Management
20Diagram credits: http://confluent.io
Streaming joins and state-stores
● Beyond filtering and windowing
● Streaming joins are hard to scale
○ Kafka scales at 800k writes/sec*
○ How about your database?
● Solution: Cache a static stream
in-memory
○ Join with running stream
○ Stream<>table duality
● Kafka supports in-memory cache OOB
○ RocksDB
○ In-memory hash
○ Persistent / Transient
21Diagram credits: http://confluent.io
*
achieved using librdkafka c++ library
Demo
● Inputs:
○ Incoming stream of benchmark stream
quality from CDN provider
○ Incoming stream quality reported by
Hotstar clients
● Output:
○ Calculate the locations reporting bad
QoS in real-time
22Diagram credits: http://confluent.io
*
achieved using librdkafka c++ library
Demo
● Inputs:
○ Incoming stream of benchmark stream
quality from CDN provider
○ Incoming stream quality reported by
Hotstar clients
● Output:
○ Calculate the locations reporting bad
QoS in real-time
23Diagram credits: http://confluent.io
*
achieved using librdkafka c++ library
CDN
benchmarks
Client
reports
Alerts
KSQL - Kafka Streams ++
24
Kafka @ Hotstar
25
26
Stream <> Table duality
● Heart of Kafka Stream
● A stream is a changelog of
events
27Diagram credits: http://confluent.io
Stream <> Table duality
● Heart of Kafka Stream
● A stream is a changelog of
events
● A table is a compacted stream
28Diagram credits: http://confluent.io

Más contenido relacionado

La actualidad más candente

Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per DayHadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per DayAnkur Bansal
 
How to build an event driven architecture with kafka and kafka connect
How to build an event driven architecture with kafka and kafka connectHow to build an event driven architecture with kafka and kafka connect
How to build an event driven architecture with kafka and kafka connectLoi Nguyen
 
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at UberDisaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at Uberconfluent
 
Data pipeline with kafka
Data pipeline with kafkaData pipeline with kafka
Data pipeline with kafkaMole Wong
 
SAIS2018 - Fact Store At Netflix Scale
SAIS2018 - Fact Store At Netflix ScaleSAIS2018 - Fact Store At Netflix Scale
SAIS2018 - Fact Store At Netflix ScaleNitin S
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Sparkdatamantra
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Monal Daxini
 
PHP At 5000 Requests Per Second: Hootsuite’s Scaling Story
PHP At 5000 Requests Per Second: Hootsuite’s Scaling StoryPHP At 5000 Requests Per Second: Hootsuite’s Scaling Story
PHP At 5000 Requests Per Second: Hootsuite’s Scaling Storyvanphp
 
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin KumarSiphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumarconfluent
 
Low latency stream processing with jet
Low latency stream processing with jetLow latency stream processing with jet
Low latency stream processing with jetStreamNative
 
How Much Can You Connect? | Bhavesh Raheja, Disney + Hotstar
How Much Can You Connect? | Bhavesh Raheja, Disney + HotstarHow Much Can You Connect? | Bhavesh Raheja, Disney + Hotstar
How Much Can You Connect? | Bhavesh Raheja, Disney + HotstarHostedbyConfluent
 
Streaming datasets for personalization
Streaming datasets for personalizationStreaming datasets for personalization
Streaming datasets for personalizationShriya Arora
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streamingdatamantra
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecPeter Bakas
 
Storing State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your AnalyticsStoring State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your AnalyticsYaroslav Tkachenko
 
Scaling a Core Banking Engine Using Apache Kafka | Peter Dudbridge, Thought M...
Scaling a Core Banking Engine Using Apache Kafka | Peter Dudbridge, Thought M...Scaling a Core Banking Engine Using Apache Kafka | Peter Dudbridge, Thought M...
Scaling a Core Banking Engine Using Apache Kafka | Peter Dudbridge, Thought M...HostedbyConfluent
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLdatamantra
 
Billions of Messages in Real Time: Why Paypal & LinkedIn Trust an Engagement ...
Billions of Messages in Real Time: Why Paypal & LinkedIn Trust an Engagement ...Billions of Messages in Real Time: Why Paypal & LinkedIn Trust an Engagement ...
Billions of Messages in Real Time: Why Paypal & LinkedIn Trust an Engagement ...confluent
 
Stream Processing in Uber
Stream Processing in UberStream Processing in Uber
Stream Processing in UberC4Media
 
Hw09 Next Steps For Hadoop
Hw09   Next Steps For HadoopHw09   Next Steps For Hadoop
Hw09 Next Steps For HadoopCloudera, Inc.
 

La actualidad más candente (20)

Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per DayHadoop summit - Scaling Uber’s Real-Time Infra for  Trillion Events per Day
Hadoop summit - Scaling Uber’s Real-Time Infra for Trillion Events per Day
 
How to build an event driven architecture with kafka and kafka connect
How to build an event driven architecture with kafka and kafka connectHow to build an event driven architecture with kafka and kafka connect
How to build an event driven architecture with kafka and kafka connect
 
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at UberDisaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber
 
Data pipeline with kafka
Data pipeline with kafkaData pipeline with kafka
Data pipeline with kafka
 
SAIS2018 - Fact Store At Netflix Scale
SAIS2018 - Fact Store At Netflix ScaleSAIS2018 - Fact Store At Netflix Scale
SAIS2018 - Fact Store At Netflix Scale
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Spark
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
 
PHP At 5000 Requests Per Second: Hootsuite’s Scaling Story
PHP At 5000 Requests Per Second: Hootsuite’s Scaling StoryPHP At 5000 Requests Per Second: Hootsuite’s Scaling Story
PHP At 5000 Requests Per Second: Hootsuite’s Scaling Story
 
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin KumarSiphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumar
 
Low latency stream processing with jet
Low latency stream processing with jetLow latency stream processing with jet
Low latency stream processing with jet
 
How Much Can You Connect? | Bhavesh Raheja, Disney + Hotstar
How Much Can You Connect? | Bhavesh Raheja, Disney + HotstarHow Much Can You Connect? | Bhavesh Raheja, Disney + Hotstar
How Much Can You Connect? | Bhavesh Raheja, Disney + Hotstar
 
Streaming datasets for personalization
Streaming datasets for personalizationStreaming datasets for personalization
Streaming datasets for personalization
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
 
Storing State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your AnalyticsStoring State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your Analytics
 
Scaling a Core Banking Engine Using Apache Kafka | Peter Dudbridge, Thought M...
Scaling a Core Banking Engine Using Apache Kafka | Peter Dudbridge, Thought M...Scaling a Core Banking Engine Using Apache Kafka | Peter Dudbridge, Thought M...
Scaling a Core Banking Engine Using Apache Kafka | Peter Dudbridge, Thought M...
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQL
 
Billions of Messages in Real Time: Why Paypal & LinkedIn Trust an Engagement ...
Billions of Messages in Real Time: Why Paypal & LinkedIn Trust an Engagement ...Billions of Messages in Real Time: Why Paypal & LinkedIn Trust an Engagement ...
Billions of Messages in Real Time: Why Paypal & LinkedIn Trust an Engagement ...
 
Stream Processing in Uber
Stream Processing in UberStream Processing in Uber
Stream Processing in Uber
 
Hw09 Next Steps For Hadoop
Hw09   Next Steps For HadoopHw09   Next Steps For Hadoop
Hw09 Next Steps For Hadoop
 

Similar a Build real time stream processing applications using Apache Kafka

Apache Kafka - Martin Podval
Apache Kafka - Martin PodvalApache Kafka - Martin Podval
Apache Kafka - Martin PodvalMartin Podval
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayDataWorks Summit
 
Streamsets and spark in Retail
Streamsets and spark in RetailStreamsets and spark in Retail
Streamsets and spark in RetailHari Shreedharan
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpStrimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpJosé Román Martín Gil
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafkadatamantra
 
Kafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetupKafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetupMingmin Chen
 
Tips & Tricks for Apache Kafka®
Tips & Tricks for Apache Kafka®Tips & Tricks for Apache Kafka®
Tips & Tricks for Apache Kafka®confluent
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uberconfluent
 
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...StampedeCon
 
Clusternaut: Orchestrating  Percona XtraDB Cluster with Kubernetes
Clusternaut:  Orchestrating  Percona XtraDB Cluster with KubernetesClusternaut:  Orchestrating  Percona XtraDB Cluster with Kubernetes
Clusternaut: Orchestrating  Percona XtraDB Cluster with KubernetesRaghavendra Prabhu
 
Friends don't let friends do dual writes: Outbox pattern with OpenShift Strea...
Friends don't let friends do dual writes: Outbox pattern with OpenShift Strea...Friends don't let friends do dual writes: Outbox pattern with OpenShift Strea...
Friends don't let friends do dual writes: Outbox pattern with OpenShift Strea...Red Hat Developers
 
Blackray @ SAPO CodeBits 2009
Blackray @ SAPO CodeBits 2009Blackray @ SAPO CodeBits 2009
Blackray @ SAPO CodeBits 2009fschupp
 
Insta clustr seattle kafka meetup presentation bb
Insta clustr seattle kafka meetup presentation   bbInsta clustr seattle kafka meetup presentation   bb
Insta clustr seattle kafka meetup presentation bbNitin Kumar
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at UberDatabricks
 
Activity feeds (and more) at mate1
Activity feeds (and more) at mate1Activity feeds (and more) at mate1
Activity feeds (and more) at mate1Hisham Mardam-Bey
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
Upleveling Analytics with Kafka with Amy Chen
Upleveling Analytics with Kafka with Amy ChenUpleveling Analytics with Kafka with Amy Chen
Upleveling Analytics with Kafka with Amy ChenHostedbyConfluent
 
Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingApache Apex
 

Similar a Build real time stream processing applications using Apache Kafka (20)

Apache Kafka - Martin Podval
Apache Kafka - Martin PodvalApache Kafka - Martin Podval
Apache Kafka - Martin Podval
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
 
Streamsets and spark in Retail
Streamsets and spark in RetailStreamsets and spark in Retail
Streamsets and spark in Retail
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpStrimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
 
Kafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetupKafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetup
 
Tips & Tricks for Apache Kafka®
Tips & Tricks for Apache Kafka®Tips & Tricks for Apache Kafka®
Tips & Tricks for Apache Kafka®
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
 
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W...
 
Clusternaut: Orchestrating  Percona XtraDB Cluster with Kubernetes
Clusternaut:  Orchestrating  Percona XtraDB Cluster with KubernetesClusternaut:  Orchestrating  Percona XtraDB Cluster with Kubernetes
Clusternaut: Orchestrating  Percona XtraDB Cluster with Kubernetes
 
Friends don't let friends do dual writes: Outbox pattern with OpenShift Strea...
Friends don't let friends do dual writes: Outbox pattern with OpenShift Strea...Friends don't let friends do dual writes: Outbox pattern with OpenShift Strea...
Friends don't let friends do dual writes: Outbox pattern with OpenShift Strea...
 
Blackray @ SAPO CodeBits 2009
Blackray @ SAPO CodeBits 2009Blackray @ SAPO CodeBits 2009
Blackray @ SAPO CodeBits 2009
 
Insta clustr seattle kafka meetup presentation bb
Insta clustr seattle kafka meetup presentation   bbInsta clustr seattle kafka meetup presentation   bb
Insta clustr seattle kafka meetup presentation bb
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
Netty training
Netty trainingNetty training
Netty training
 
Activity feeds (and more) at mate1
Activity feeds (and more) at mate1Activity feeds (and more) at mate1
Activity feeds (and more) at mate1
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Upleveling Analytics with Kafka with Amy Chen
Upleveling Analytics with Kafka with Amy ChenUpleveling Analytics with Kafka with Amy Chen
Upleveling Analytics with Kafka with Amy Chen
 
Architectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark StreamingArchitectual Comparison of Apache Apex and Spark Streaming
Architectual Comparison of Apache Apex and Spark Streaming
 

Más de Hotstar

Airflow based Video Encoding Platform
Airflow based Video Encoding PlatformAirflow based Video Encoding Platform
Airflow based Video Encoding PlatformHotstar
 
How Chaos Engineering is practiced at Hotstar
How Chaos Engineering is practiced at HotstarHow Chaos Engineering is practiced at Hotstar
How Chaos Engineering is practiced at HotstarHotstar
 
WebSDK - Switching between service providers
WebSDK - Switching between service providersWebSDK - Switching between service providers
WebSDK - Switching between service providersHotstar
 
Remote Config and Beyond
Remote Config and BeyondRemote Config and Beyond
Remote Config and BeyondHotstar
 
Analysing high throughput data in real time
Analysing high throughput data in real timeAnalysing high throughput data in real time
Analysing high throughput data in real timeHotstar
 
Scaling Hotstar.com for 10Mn concurrency
Scaling Hotstar.com for 10Mn concurrencyScaling Hotstar.com for 10Mn concurrency
Scaling Hotstar.com for 10Mn concurrencyHotstar
 
Amazon AI Conclave, Bangalore 2017
Amazon AI Conclave, Bangalore 2017 Amazon AI Conclave, Bangalore 2017
Amazon AI Conclave, Bangalore 2017 Hotstar
 

Más de Hotstar (7)

Airflow based Video Encoding Platform
Airflow based Video Encoding PlatformAirflow based Video Encoding Platform
Airflow based Video Encoding Platform
 
How Chaos Engineering is practiced at Hotstar
How Chaos Engineering is practiced at HotstarHow Chaos Engineering is practiced at Hotstar
How Chaos Engineering is practiced at Hotstar
 
WebSDK - Switching between service providers
WebSDK - Switching between service providersWebSDK - Switching between service providers
WebSDK - Switching between service providers
 
Remote Config and Beyond
Remote Config and BeyondRemote Config and Beyond
Remote Config and Beyond
 
Analysing high throughput data in real time
Analysing high throughput data in real timeAnalysing high throughput data in real time
Analysing high throughput data in real time
 
Scaling Hotstar.com for 10Mn concurrency
Scaling Hotstar.com for 10Mn concurrencyScaling Hotstar.com for 10Mn concurrency
Scaling Hotstar.com for 10Mn concurrency
 
Amazon AI Conclave, Bangalore 2017
Amazon AI Conclave, Bangalore 2017 Amazon AI Conclave, Bangalore 2017
Amazon AI Conclave, Bangalore 2017
 

Último

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 

Último (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

Build real time stream processing applications using Apache Kafka

  • 1. Real time stream processing using Apache Kafka
  • 2. Agenda ● What is Apache Kafka? ● Why do we need stream processing? ● Stream processing using Apache Kafka ● Kafka @ Hotstar Feel free to stop me for questions 2
  • 3. $ whoami ● Personalisation lead at Hotstar ● Led Data Infrastructure team at Grofers and TinyOwl ● Kafka fanboy ● Usually rant on twitter @jayeshsidhwani 3
  • 4. What is Kafka? 4 ● Kafka is a scalable, fault-tolerant, distributed queue ● Producers and Consumers ● Uses ○ Asynchronous communication in event-driven architectures ○ Message broadcast for database replication Diagram credits: http://kafka.apache.org
  • 5. ● Brokers ○ Heart of Kafka ○ Stores data ○ Data stored into topics ● Zookeeper ○ Manages cluster state information ○ Leader election Inside Kafka 5 BROKER ZOOKEEPER BROKER BROKER ZOOKEEPER TOPIC TOPIC TOPIC P P P C C C
  • 6. ● Topics are partitioned ○ A partition is a append-only commit-log file ○ Achieves horizontal scalability ● Messages written in a partitions are ordered ● Each message gets an auto-incrementing offset # ○ {“user_id”: 1, “term”: “GoT”} is a message in the topic searched Inside a topic 6Diagram credits: http://kafka.apache.org
  • 7. How do consumers read? ● Consumer subscribes to a topic ● Consumers read from the head of the queue ● Multiple consumers can read from a single topic 7Diagram credits: http://kafka.apache.org
  • 8. Kafka consumer scales horizontally ● Consumers can be grouped ● Consumer Groups ○ Horizontally scalable ○ Fault tolerant ○ Delivery guaranteed 8Diagram credits: http://kafka.apache.org
  • 9. Stream processing and its use-cases 9
  • 10. Discrete data processing models 10 APP APP APP ● Request / Response processing mode ○ Processing time <1 second ○ Clients can use this data
  • 11. Discrete data processing models 11 APP APP APP ● Request / Response processing mode ○ Processing time <1 second ○ Clients can use this data DWH HADOOP ● Batch processing mode ○ Processing time few hours to a day ○ Analysts can use this data
  • 12. Discrete data processing models 12 ● As the system grows, such synchronous processing model leads to a spaghetti and unmaintainable design APP APP APPAPP SEARCH MONIT CACHE
  • 13. Promise of stream processing 13 ● Untangle movement of data ○ Single source of truth ○ No duplicate writes ○ Anyone can consume anything ○ Decouples data generation from data computation APPAPP APP APP SEARCH MONIT CACHE STREAM PROCESSING FRAMEWORK
  • 14. Promise of stream processing 14 ● Untangle movement of data ○ Single source of truth ○ No duplicate writes ○ Anyone can consume anything ● Process, transform and react on the data as it happens ○ Sub-second latencies ○ Anomaly detection on bad stream quality ○ Timely notification to users who dropped off in a live match Intelligence APPAPP APP APP STREAM PROCESSING FRAMEWORK Filter Window Join Anomaly Action
  • 16. Stream processing frameworks ● Write your own? ○ Windowing ○ State management ○ Fault tolerance ○ Scalability ● Use frameworks such as Apache Spark, Samza, Storm ○ Batteries attached ○ Cluster manager to coordinate resources ○ High memory / cpu footprint 16
  • 17. Kafka Streams ● Kafka Streams is a simple, low-latency, framework independent stream processing framework ● Simple DSL ● Same principles as Kafka consumer (minus operations overhead) ● No cluster manager! yay! 17
  • 18. Writing Kafka Streams ● Define a processing topology ○ Source nodes ○ Processor nodes ■ One or more ■ Filtering, windowing, joins etc ○ Sink nodes ● Compile it and run like any other java application 18
  • 20. Kafka Streams architecture and operations ● Kafka manages ○ Parallelism ○ Fault tolerance ○ Ordering ○ State Management 20Diagram credits: http://confluent.io
  • 21. Streaming joins and state-stores ● Beyond filtering and windowing ● Streaming joins are hard to scale ○ Kafka scales at 800k writes/sec* ○ How about your database? ● Solution: Cache a static stream in-memory ○ Join with running stream ○ Stream<>table duality ● Kafka supports in-memory cache OOB ○ RocksDB ○ In-memory hash ○ Persistent / Transient 21Diagram credits: http://confluent.io * achieved using librdkafka c++ library
  • 22. Demo ● Inputs: ○ Incoming stream of benchmark stream quality from CDN provider ○ Incoming stream quality reported by Hotstar clients ● Output: ○ Calculate the locations reporting bad QoS in real-time 22Diagram credits: http://confluent.io * achieved using librdkafka c++ library
  • 23. Demo ● Inputs: ○ Incoming stream of benchmark stream quality from CDN provider ○ Incoming stream quality reported by Hotstar clients ● Output: ○ Calculate the locations reporting bad QoS in real-time 23Diagram credits: http://confluent.io * achieved using librdkafka c++ library CDN benchmarks Client reports Alerts
  • 24. KSQL - Kafka Streams ++ 24
  • 26. 26
  • 27. Stream <> Table duality ● Heart of Kafka Stream ● A stream is a changelog of events 27Diagram credits: http://confluent.io
  • 28. Stream <> Table duality ● Heart of Kafka Stream ● A stream is a changelog of events ● A table is a compacted stream 28Diagram credits: http://confluent.io