Bad/Bed time Stories
Today's Menu
●
Quick Kafka Overview
●
Kafka Usage At AppsFlyer
●
AppsFlyer First Cluster
●
Designing The Next Cluster: Requirements And Changes
●
Problems With The New Cluster
●
Changes To The New Cluster
●
Traffic Boost, Then More Issues
●
More Solutions
●
And More Failures
●
Splitting The Cluster, More Changes And The Current Configuration
●
Lessons Learned
●
Testing The Cluster
●
Collecting Metrics And Alerting
"A first sign of the beginning of
understanding is the wish to die."
Franz Kafka
Quick Kafka
“An open source, distributed,
partitioned and replicated commit-log
based publish-subscribe messaging
system”
Kafka Overview
●
Topic: A category to which producers publish messages
●
Broker: Kafka server process (usually one per node)
●
Partitions: Topics are partitioned; each partition is an ordered,
immutable sequence of messages. Each message in a partition
is assigned a unique ID called an offset
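The topic, partition, and offset terms above can be illustrated with a minimal in-memory sketch. This is not Kafka's implementation: the class names are hypothetical, and the CRC32-based partitioner is only an assumed stand-in for whatever partitioner a real producer uses.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Partition:
    """An append-only, ordered sequence of messages."""
    messages: list = field(default_factory=list)

    def append(self, message: bytes) -> int:
        # The offset is the position of the message within this partition.
        self.messages.append(message)
        return len(self.messages) - 1

    def read(self, offset: int) -> bytes:
        return self.messages[offset]


@dataclass
class Topic:
    name: str
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        self.partitions = [Partition() for _ in range(self.num_partitions)]

    def publish(self, key: bytes, message: bytes) -> tuple:
        # Assumed partitioner: a stable hash of the key selects the partition,
        # so messages with the same key keep their relative order.
        index = zlib.crc32(key) % self.num_partitions
        return index, self.partitions[index].append(message)


topic = Topic("example-topic", num_partitions=8)
first = topic.publish(b"app-1", b"install")
second = topic.publish(b"app-1", b"launch")
```

Both messages land in the same partition, and the second one gets the next offset in that partition. Offsets are only meaningful within a single partition.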
Kafka Usage in AppsFlyer
AppsFlyer First cluster
●
Traffic: Up to a few hundred million messages
●
Size: 4 m1.xlarge brokers
●
~8 topics
●
Replication factor 1
●
Retention 8-12 hours
●
Default number of partitions 8
●
Vanilla configuration
Main reasons for migration: lack of storage capacity,
limited parallelism due to the low partition count, and
forecast future needs.
Requirements for the Next Cluster
●
More capacity to support
billions of messages
●
Message replication to
prevent data loss
●
Tolerate the loss of brokers, up
to an entire AZ
●
Much higher parallelism to
support more consumers
●
Longer retention period –
48 hours on most topics
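The link between partition count and consumer parallelism can be sketched as follows. Within one consumer group, each partition is read by at most one consumer, so consumers beyond the partition count sit idle. The round-robin assignment and the consumer names here are illustrative assumptions, not the assignment strategy used at AppsFlyer.

```python
def assign_partitions(num_partitions: int, consumers: list) -> dict:
    """Give every partition exactly one owner, round-robin over the consumers."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


consumers = [f"consumer-{i}" for i in range(12)]

# 8 partitions (the first cluster's default): only 8 of 12 consumers get work.
small = assign_partitions(8, consumers)
idle = [c for c, owned in small.items() if not owned]

# 120 partitions (upper end of the new cluster's range): all 12 consumers are busy.
large = assign_partitions(120, consumers)
```

With 8 partitions four of the twelve consumers own nothing, while with 120 partitions each consumer owns ten.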
The new Cluster changes
●
18 m1.xlarge brokers, 6 per AZ
●
Replication factor of 3
●
All partitions are distributed across AZs
●
Number of partitions per topic increased (between 12 and 120, depending on
parallelism needs)
●
4 Network and IO threads
●
Default log retention 48 hours
●
Auto Leader rebalance enabled
●
Imbalance ratio set to default 15%
Glossary
* Leader: Each partition has a leader broker that serves all reads and writes; the other replicas replicate from it
* Imbalance ratio: The leader imbalance allowed per broker; above this threshold an automatic leader rebalance is initiated
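Spreading every partition across AZs, as in the layout above (18 brokers, 6 per AZ, replication factor 3), can be sketched like this. It is a simplified illustration and not Kafka's own assignment algorithm: the AZ names and broker IDs are hypothetical, and the function assumes the replication factor equals the number of AZs.

```python
from collections import Counter


def place_replicas(num_partitions: int, brokers_by_az: dict) -> dict:
    """Return {partition: [replica broker ids]}, one replica per AZ.

    The first replica in each list is treated as the preferred leader.
    """
    azs = sorted(brokers_by_az)
    placement = {}
    for partition in range(num_partitions):
        slot = partition // len(azs)
        replicas = []
        for i in range(len(azs)):
            # Rotate the AZ that hosts the first (preferred leader) replica.
            az = azs[(partition + i) % len(azs)]
            brokers = brokers_by_az[az]
            replicas.append(brokers[slot % len(brokers)])
        placement[partition] = replicas
    return placement


BROKERS_BY_AZ = {
    "az-a": list(range(0, 6)),
    "az-b": list(range(6, 12)),
    "az-c": list(range(12, 18)),
}
placement = place_replicas(120, BROKERS_BY_AZ)
preferred_leaders = Counter(replicas[0] for replicas in placement.values())
```

Because each partition has one replica in every AZ, losing a whole AZ still leaves two replicas. Rotating the first replica keeps the preferred-leader count per broker within one of each other.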
And After a few Months
Problems
●
Uneven distribution of
leaders, which caused
high load on specific
brokers and eventually
consumer lag and
broker failures
●
Constant rebalancing
of partition leaders, which
caused failures in the
Python producers
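An uneven leader distribution like the one described above can be spotted with a small check. This is a sketch under assumptions: the input dictionaries are hypothetical, and a broker's imbalance is taken to be the share of partitions where it is the preferred (first-listed) replica but is not currently the leader.

```python
from collections import Counter


def leader_counts(current_leaders: dict) -> Counter:
    """Count how many partitions each broker currently leads."""
    return Counter(current_leaders.values())


def imbalance_ratio(preferred: dict, current_leaders: dict) -> dict:
    """Per broker: share of its preferred-leader partitions led by another broker."""
    total = Counter(preferred.values())
    displaced = Counter(
        broker for partition, broker in preferred.items()
        if current_leaders[partition] != broker
    )
    return {broker: displaced[broker] / total[broker] for broker in total}


# Hypothetical state: broker 1 has taken over leadership of partitions 1 and 2.
preferred = {0: 1, 1: 2, 2: 3, 3: 1, 4: 2, 5: 3}
current = {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 3}

counts = leader_counts(current)
ratios = imbalance_ratio(preferred, current)
overloaded = [broker for broker, n in counts.items() if n > len(current) / len(counts)]
```

Here broker 1 leads four of six partitions, and brokers 2 and 3 each show an imbalance of 0.5, well above a 15% threshold.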
Solutions
●
Increase the number of
brokers to 24 to improve
leadership
distribution
●
Rewrite Python
producers in Clojure
●
Decrease number of
partitions where high
parallelism is not
needed
Traffic increasing and...
Problems
●
High IOWait in the brokers
●
Missing ISRs due to overloaded leaders
●
Network bandwidth close to thresholds
●
Lag in consumers
Glossary
* ISR: In-Sync Replicas
More Solutions
●
Split into 2 clusters: one for launches, which account for
80% of messages, and one for all the rest
●
Move launches cluster to i2.2xlarge with local
SSD
●
Finer tuning of leaders
●
Increase number of IO and Network Threads
●
Enable AWS enhanced networking
And a few more...
●
Decrease the replication
factor to 2 in the launches
cluster to reduce load
on leaders, disk
usage, and cross-AZ traffic
costs
●
Move 2nd cluster to
i2.2xlarge as well
●
Upgrade ZK due to
performance issues
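The effect of lowering the replication factor can be put into rough numbers. The sketch below assumes followers fetch every produced byte from the leader and that each replica lives in a different AZ; the 100 MB/s ingress figure is hypothetical, not a measurement from the talk.

```python
def replication_load(ingress_mb_s: float, replication_factor: int) -> dict:
    """Estimate cluster-wide load caused by replication for a given ingress rate."""
    followers = replication_factor - 1
    return {
        # Bytes the leaders must serve to follower replicas (cross-AZ if
        # every replica sits in a different AZ).
        "follower_fetch_mb_s": ingress_mb_s * followers,
        # Bytes written to disk across all replicas.
        "disk_write_mb_s": ingress_mb_s * replication_factor,
    }


rf3 = replication_load(100.0, 3)
rf2 = replication_load(100.0, 2)
saved_fetch = rf3["follower_fetch_mb_s"] - rf2["follower_fetch_mb_s"]
```

Under these assumptions, going from 3 to 2 halves the follower traffic served by leaders and cuts disk writes and stored data by a third, at the cost of tolerating one fewer replica failure.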
Lessons learned
●
Keep the replication factor as low as possible to avoid
extra load on the leaders
●
Make sure the leader count is well balanced
across brokers
●
Balance the partition count: enough to support the required parallelism, but no more
●
Split cluster logically considering traffic and
business importance
●
Retention (time based) should be long enough
to recover from failures
●
In AWS, spread the cluster across AZs
●
Make sure clients support dynamic cluster changes
●
Automate partition reassignment
●
Save the cluster-reassignment.json of each topic for
future needs!
●
Don't be too cheap on the ZooKeepers
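Automating reassignment and keeping a per-topic file could start from a small generator like the one below. It emits the JSON layout accepted by kafka-reassign-partitions.sh through its --reassignment-json-file option; the topic name and broker IDs are hypothetical, and how the replica lists are chosen is left to the caller.

```python
import json


def build_reassignment(topic: str, replicas_by_partition: dict) -> dict:
    """Build a reassignment plan: {partition: [broker ids]} for one topic."""
    return {
        "version": 1,
        "partitions": [
            {"topic": topic, "partition": partition, "replicas": list(replicas)}
            for partition, replicas in sorted(replicas_by_partition.items())
        ],
    }


def render_reassignment(plan: dict) -> str:
    """Serialize the plan so it can be saved next to the topic's other configs."""
    return json.dumps(plan, indent=2)


plan = build_reassignment("example-topic", {1: [6, 12], 0: [0, 6]})
text = render_reassignment(plan)
```

Writing `text` to a file per topic gives a record of the layout that can be re-applied or diffed later.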
Testing the cluster
●
Load test using kafka-producer-perf-test.sh &
kafka-consumer-perf-test.sh
●
Broker failure while running
●
Entire AZ failure while running
●
Reassign partitions on the fly
●
Kafka dashboard contains: Leader election rate,
ISR status, offline partitions count, Log Flush time,
All Topics Bytes in per broker, IOWait, LoadAvg,
Disk Capacity and more
●
Set appropriate alerts
Collecting metrics & Alerting
●
Using the Airbnb plugin for
Kafka, sending metrics to
Graphite
●
Internal application that
collects lag for each topic
and sends the values to Graphite
●
Alerts are set on: lag (for
each topic), under-replicated
partitions, broker topic
metrics below threshold,
leader re-election
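The lag collection and alerting described above might look roughly like this. It is a sketch, not the internal AppsFlyer application: the metric path, the threshold, and the choice to treat a missing committed offset as 0 are all assumptions. The output line follows Graphite's plaintext protocol (`path value timestamp`).

```python
import time


def partition_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Lag per partition: newest offset in the log minus the consumer's committed offset."""
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in log_end_offsets.items()
    }


def graphite_line(path: str, value: int, timestamp=None) -> str:
    """Format one metric in Graphite's plaintext protocol."""
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{path} {value} {ts}\n"


def lag_alerts(topic_lag: dict, threshold: int) -> list:
    """Return the topics whose total lag exceeds the threshold."""
    return sorted(topic for topic, lag in topic_lag.items() if lag > threshold)


lag = partition_lag({0: 100, 1: 250}, {0: 90, 1: 250})
line = graphite_line("kafka.lag.example-topic", sum(lag.values()), 1700000000)
alerts = lag_alerts({"example-topic": 5000, "other-topic": 10}, threshold=1000)
```

Summing the per-partition values gives the per-topic lag that is sent to Graphite and compared against the alert threshold.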