SlideShare una empresa de Scribd logo
1 de 42
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Kafka at Peak Performance
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Todd Palino
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Who Am I?
3
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Kafka At LinkedIn
 1100+ Kafka brokers
 Over 32,000 topics
 350,000+ Partitions
 875 Billion messages per day
 185 Terabytes In
 675 Terabytes Out
 Peak Load (whole site)
– 10.5 Million messages/sec
– 18.5 Gigabits/sec Inbound
– 70.5 Gigabits/sec Outbound
4
 1800+ Kafka brokers
 Over 79,000 topics
 1,130,000+ Partitions
 1.3 Trillion messages per day
 330 Terabytes In
 1.2 Petabytes Out
 Peak Load (single cluster)
– 2 Million messages/sec
– 4.7 Gigabits/sec Inbound
– 15 Gigabits/sec Outbound
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
What Will We Talk About?
 Picking Your Hardware
 Monitoring the Cluster
 Triaging Broker Performance Problems
 Conclusion
5
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Hardware Selection
6
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
What’s Important To You?
 Message Retention - Disk size
 Message Throughput - Network capacity
 Producer Performance - Disk I/O
 Consumer Performance - Memory
7
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Go Wide
 Kafka is well-suited to horizontal scaling
 RAIS - Redundant Array of Inexpensive Servers
 Also helps with CPU utilization
– Kafka needs to decompress and recompress every message batch
– KIP-31 will help with this by eliminating recompression
 Don’t co-locate Kafka
8
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Disk Layout
 RAID
– Can survive a single disk failure (not RAID 0)
– Provides the broker with a single log directory
– Eats up disk I/O
 JBOD
– Gives Kafka all the disk I/O available
– Broker is not smart about balancing partitions
– If one disk fails, the entire broker stops
 Amazon EBS performance works!
9
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Operating System Tuning
 Filesystem Options
– EXT or XFS
– Using unsafe mount options
 Virtual Memory
– Swappiness
– Dirty Pages
 Networking
10
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Java
 Only use JDK 8 now
 Keep heap size small
– Even our largest brokers use a 6 GB heap
– Save the rest for page cache
 Garbage Collection - G1 all the way
– Basic tuning only
– Watch for humongous allocations
11
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
How Much Do You Need?
12
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Buy The Book!
13
Early Access available now.
Covers all aspects of Kafka,
from setup to client
development to ongoing
administration and
troubleshooting.
Also discusses stream
processing and other use
cases.
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Kafka Cluster Sizing
 How big for your local cluster?
– How much disk space do you have?
– How much network bandwidth do you have?
– CPU, memory, disk I/O
 How big for your aggregate cluster?
– In general, multiple the number of brokers by the number of local clusters
– May have additional concerns with lots of consumers
14
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Topic Configuration
 Partition Counts for Local
– Many theories on how to do this correctly, but the answer is “it depends”
– How many consumers do you have?
– Do you have specific partition requirements?
– Keeping partition sizes manageable
 Partition Counts for Aggregate
– Multiply the number of partitions in a local cluster by the number of local clusters
– Periodically review partition counts in all clusters
 Message Retention
– If aggregate is where you really need the messages, only retain it in local for long
enough to cover mirror maker problems
15
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Possible Broker Improvements
 Namespaces
– Namespace topics by datacenter
– Eliminate local clusters and just have aggregate
– Significant hardware savings
 JBOD Fixes
– Intelligent partition assignment
– Admin tools to move partitions between mount points
– Broker should not fail completely with a single disk failure
16
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Administrative Improvements
 Multiple cluster management
– Topic management across clusters
– Visualization of mirror maker paths
 Better client monitoring
– Burrow for consumer monitoring
– No open source solution for producer monitoring (audit)
 End-to-end availability monitoring
17
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Keeping An Eye On Things
18
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Monitoring The Foundation
 CPU Load
 Network inbound and outbound
 Filehandle usage for Kafka
 Disk
– Free space - where you write logs, and where Kafka stores messages
– Free inodes
– I/O performance - at least average wait and percent utilization
 Garbage Collection
19
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Broker Ground Rules
 Tuning
– Stick (mostly) with the defaults
– Set default cluster retention as appropriate
– Default partition count should be at least the number of brokers
 Monitoring
– Watch the right things
– Don’t try to alert on everything
 Triage and Resolution
– Solve problems, don’t mask them
20
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Too Much Information!
 Monitoring teams hate Kafka
– Per-Topic metrics
– Per-Partition metrics
– Per-Client metrics
 Capture as much as you can
– Many metrics are useful while triaging an issue
 Clients want metrics on their own topics
 Only alert on what is needed to signal a problem
21
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Broker Monitoring
 Bytes In and Out, Messages In
– Why not messages out?
 Partitions
– Count and Leader Count
– Under Replicated and Offline
 Threads
– Network pool, Request pool
– Max Dirty Percent
 Requests
– Rates and times - total, queue, local, and send
22
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Topic Monitoring
 Bytes In, Bytes Out
 Messages In, Produce Rate, Produce Failure Rate
 Fetch Rate, Fetch Failure Rate
 Partition Bytes
 Log End Offset
– Why bother?
– KIP-32 will make this unnecessary
 Quota Throttling
 Provide this to your customers for them to alert on
23
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Client Monitoring
 For consumers, use Burrow
– Monitor all partitions for all consumers
– Provides an easy to digest “good, warning, bad” state, with detail available
– Fast and free
 Producers are a little harder
– Several internal implementations of message auditing
– The community needs a good open source standard
 Cluster availability monitoring
– kafka-monitoring is coming soon from LinkedIn!
24
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
It’s Broken! Now What?
25
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
All The Best Ops People…
 Know more of what is happening than their customers
 Are proactive
 Fix bugs, not work around them
 This applies to our developers too!
26
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Anticipating Trouble
 Trend cluster utilization and growth over time
 Use default configurations for quotas and retention to require customers to
talk to you
 Monitor request times
– If you are able to develop a consistent baseline, this is early warning
27
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Under Replicated Partitions
 Count of number of partitions which are not fully replicated within the
cluster
 Also referred to as “replica lag”
 Primary indicator of problems within the cluster
28
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Broker Performance Checks
 Are you still running 0.8?
 Are all the brokers in the cluster working?
 Are the network interfaces saturated?
– Reelect partition leaders
– Rebalance partitions in the cluster
– Spread out traffic more (increase partitions or brokers)
 Is the CPU utilization high? (especially iowait)
– Is another process competing for resources?
– Look for a bad disk
 Do you have really big messages?
29
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Kafka’s OK, Now What?
 If Kafka is working properly, it’s probably a client issue
– Don’t throw it over the fence. Help your customers understand
 Common producer issues
– Batch size and linger time
– Receive and send buffers
– Sync vs. async, and acknowledgements
 Common consumer issues
– Garbage collection problems
– Min fetch bytes and max wait time
– Not enough partitions
30
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Conclusion
31
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
One Ecosystem
 Kafka can scale to millions of messages per second, and more
– Operations must scale the cluster appropriately
– Developers must use the right tuning and go parallel
 Few problems are owned by only one side
– Expanding partitions often requires coordination
– Applications that need higher reliability drive cluster configurations
 Either we work together, or we fail separately
32
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Would You Like To Know More?
 Presentations: http://www.slideshare.net/toddpalino
– More Datacenters, More Problems
– Kafka As A Service
– Always download the originals for slide notes!
 Blog Posts: https://engineering.linkedin.com/blog
– Development and SRE blogs on Kafka and other topics
 LinkedIn Open Source: https://github.com/linkedin/streaming
– Burrow Consumer Monitoring - https://github.com/linkedin/Burrow
– Kafka Admin Tools - https://github.com/linkedin/kafka-tools
33
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Getting Involved With Kafka
 http://kafka.apache.org
 Join the mailing lists
– users@kafka.apache.org
– dev@kafka.apache.org
 irc.freenode.net - #apache-kafka
 Meetups
– Apache Kafka - http://www.meetup.com/http-kafka-apache-org
– Bay Area Samza - http://www.meetup.com/Bay-Area-Samza-Meetup/
 Contribute code
34
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Data @ LinkedIn is Hiring!
 Streams Infrastructure
– Kafka pub/sub ecosystem
– Stream Processing Platform built on Apache Samza
– Next Generation change capture technology (incubating)
 LinkedIn
– Strong commitment to open source
– Do cool things and work with awesome people
 Join us in working on cutting edge stream processing infrastructures
– Please contact kparamasivam@linkedin.com
– Software developers and Site Reliability Engineers at all levels
35
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Appendix
37
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
JDK Options
Heap Size -Xmx6g -Xms6g
Metaspace -XX:MetaspaceSize=96m -XX:MinMetaspaceFreeRatio=50
-XX:MaxMetaspaceFreeRatio=80
G1 Tuning -XX:+UseG1GC -XX:MaxGCPauseMillis=20
-XX:InitiatingHeapOccupancyPercent=35
-XX:G1HeapRegionSize=16M
GC Logging -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution
-XX:+PrintGCDetails -XX:+PrintGCDateStamps
-XX:+PrintTenuringDistribution
-Xloggc:/path/to/logs/gc.log -verbose:gc
Error Handling -XX:-HeapDumpOnOutOfMemoryError
-XX:ErrorFile=/path/to/logs/hs_err.log
38
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
OS Tuning Parameters
 Networking:
net.core.rmem_default = 124928
net.core.rmem_max = 2048000
net.core.wmem_default = 124928
net.core.wmem_max = 2048000
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 16384 4194304
net.ipv4.tcp_max_tw_buckets = 262144
net.ipv4.tcp_max_syn_backlog = 1024
39
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
OS Tuning Parameters (cont.)
 Virtual Memory
vm.oom_kill_allocating_task = 1
vm.max_map_count = 200000
vm.swappiness = 1
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 500
vm.dirty_ratio = 60
vm.dirty_background_ratio = 5
40
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Kafka Broker Sensors
kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics
kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics
kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics
kafka.server:name=PartitionCount,type=ReplicaManager
kafka.server:name=LeaderCount,type=ReplicaManager
kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager
kafka.server:name=RequestHandlerAvgIdlePercent,type=KafkaRequestHandlerPool
kafka.controller:name=ActiveControllerCount,type=KafkaController
kafka.controller:name=OfflinePartitionsCount,type=KafkaController
kafka.log:name=max-dirty-percent,type=LogCleanerManager
kafka.network:name=NetworkProcessorAvgIdlePercent,type=SocketServer
kafka.network:name=RequestsPerSec=*,type=RequestMetrics
kafka.network:name=RequestQueueTimeMs,request=*,type=RequestMetrics
kafka.network:name=LocalTimeMs,request=*,type=RequestMetrics
kafka.network:name=RemoteTimeMs,request=*,type=RequestMetrics
kafka.network:name=ResponseQueueTimeMs,request=*,type=RequestMetrics
kafka.network:name=ResponseSendTimeMs,request=*,type=RequestMetrics
kafka.network:name=TotalTimeMs,request=*,type=RequestMetrics
41
SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Kafka Broker Sensors - Topics
kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics,topics=*
kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics,topics=*
kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics,topics=*
kafka.server:name=TotalProduceRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=FailedProduceRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=TotalFetchRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.server:name=FailedFetchRequestsPerSec,type=BrokerTopicMetrics,topic=*
kafka.log:type=Log,name=LogEndOffset,topic=*,partition=*
42

Más contenido relacionado

La actualidad más candente

Kafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around KafkaKafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around Kafka
Guido Schmutz
 

La actualidad más candente (20)

Apache kafka performance(throughput) - without data loss and guaranteeing dat...
Apache kafka performance(throughput) - without data loss and guaranteeing dat...Apache kafka performance(throughput) - without data loss and guaranteeing dat...
Apache kafka performance(throughput) - without data loss and guaranteeing dat...
 
No data loss pipeline with apache kafka
No data loss pipeline with apache kafkaNo data loss pipeline with apache kafka
No data loss pipeline with apache kafka
 
Kafka 101 and Developer Best Practices
Kafka 101 and Developer Best PracticesKafka 101 and Developer Best Practices
Kafka 101 and Developer Best Practices
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
Putting Kafka Into Overdrive
Putting Kafka Into OverdrivePutting Kafka Into Overdrive
Putting Kafka Into Overdrive
 
Kafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around KafkaKafka Connect & Streams - the ecosystem around Kafka
Kafka Connect & Streams - the ecosystem around Kafka
 
Handle Large Messages In Apache Kafka
Handle Large Messages In Apache KafkaHandle Large Messages In Apache Kafka
Handle Large Messages In Apache Kafka
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
 
Kafka Tutorial: Advanced Producers
Kafka Tutorial: Advanced ProducersKafka Tutorial: Advanced Producers
Kafka Tutorial: Advanced Producers
 
Citi Tech Talk Disaster Recovery Solutions Deep Dive
Citi Tech Talk  Disaster Recovery Solutions Deep DiveCiti Tech Talk  Disaster Recovery Solutions Deep Dive
Citi Tech Talk Disaster Recovery Solutions Deep Dive
 
Hardening Kafka Replication
Hardening Kafka Replication Hardening Kafka Replication
Hardening Kafka Replication
 
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
 A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ... A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
A Hitchhiker's Guide to Apache Kafka Geo-Replication with Sanjana Kaundinya ...
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Getting Started with Confluent Schema Registry
Getting Started with Confluent Schema RegistryGetting Started with Confluent Schema Registry
Getting Started with Confluent Schema Registry
 
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur,...
 
Scylla Summit 2022: Scylla 5.0 New Features, Part 1
Scylla Summit 2022: Scylla 5.0 New Features, Part 1Scylla Summit 2022: Scylla 5.0 New Features, Part 1
Scylla Summit 2022: Scylla 5.0 New Features, Part 1
 
kafka
kafkakafka
kafka
 
Improving Kafka at-least-once performance at Uber
Improving Kafka at-least-once performance at UberImproving Kafka at-least-once performance at Uber
Improving Kafka at-least-once performance at Uber
 
Envoy and Kafka
Envoy and KafkaEnvoy and Kafka
Envoy and Kafka
 
Evolution of MongoDB Replicaset and Its Best Practices
Evolution of MongoDB Replicaset and Its Best PracticesEvolution of MongoDB Replicaset and Its Best Practices
Evolution of MongoDB Replicaset and Its Best Practices
 

Similar a Kafka at Peak Performance

Linked in multi tier, multi-tenant, multi-problem kafka
Linked in multi tier, multi-tenant, multi-problem kafkaLinked in multi tier, multi-tenant, multi-problem kafka
Linked in multi tier, multi-tenant, multi-problem kafka
Nitin Kumar
 
InfiniBand for the enterprise
InfiniBand for the enterpriseInfiniBand for the enterprise
InfiniBand for the enterprise
Anas Kanzoua
 

Similar a Kafka at Peak Performance (20)

Linked in multi tier, multi-tenant, multi-problem kafka
Linked in multi tier, multi-tenant, multi-problem kafkaLinked in multi tier, multi-tenant, multi-problem kafka
Linked in multi tier, multi-tenant, multi-problem kafka
 
Multi tier, multi-tenant, multi-problem kafka
Multi tier, multi-tenant, multi-problem kafkaMulti tier, multi-tenant, multi-problem kafka
Multi tier, multi-tenant, multi-problem kafka
 
Kafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier Architectures
 
More Datacenters, More Problems
More Datacenters, More ProblemsMore Datacenters, More Problems
More Datacenters, More Problems
 
Data stream with cruise control
Data stream with cruise controlData stream with cruise control
Data stream with cruise control
 
ARIN 34 IPv6 IAB/IETF Activities Report
ARIN 34 IPv6 IAB/IETF Activities ReportARIN 34 IPv6 IAB/IETF Activities Report
ARIN 34 IPv6 IAB/IETF Activities Report
 
InfiniBand for the enterprise
InfiniBand for the enterpriseInfiniBand for the enterprise
InfiniBand for the enterprise
 
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...
 
Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8
 
Adobe Ask the AEM Community Expert Session Oct 2016
Adobe Ask the AEM Community Expert Session Oct 2016Adobe Ask the AEM Community Expert Session Oct 2016
Adobe Ask the AEM Community Expert Session Oct 2016
 
MySQL for Software-as-a-Service (SaaS)
MySQL for Software-as-a-Service (SaaS)MySQL for Software-as-a-Service (SaaS)
MySQL for Software-as-a-Service (SaaS)
 
Spark Streaming the Industrial IoT
Spark Streaming the Industrial IoTSpark Streaming the Industrial IoT
Spark Streaming the Industrial IoT
 
The role of NoSQL in the Next Generation of Financial Informatics
The role of NoSQL in the Next Generation of Financial InformaticsThe role of NoSQL in the Next Generation of Financial Informatics
The role of NoSQL in the Next Generation of Financial Informatics
 
Kafka 0.9, Things you should know
Kafka 0.9, Things you should knowKafka 0.9, Things you should know
Kafka 0.9, Things you should know
 
Management and Automation of MongoDB Clusters - Slides
Management and Automation of MongoDB Clusters - SlidesManagement and Automation of MongoDB Clusters - Slides
Management and Automation of MongoDB Clusters - Slides
 
Fast Online Access to Massive Offline Data - SECR 2016
Fast Online Access to Massive Offline Data - SECR 2016Fast Online Access to Massive Offline Data - SECR 2016
Fast Online Access to Massive Offline Data - SECR 2016
 
#IBMEdge: Flash Storage Session
#IBMEdge: Flash Storage Session#IBMEdge: Flash Storage Session
#IBMEdge: Flash Storage Session
 
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
 
Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splun...
Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splun...Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splun...
Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splun...
 
CisCon 2018 - Analytics per Storage Area Networks
CisCon 2018 - Analytics per Storage Area NetworksCisCon 2018 - Analytics per Storage Area Networks
CisCon 2018 - Analytics per Storage Area Networks
 

Más de Todd Palino

Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...
Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...
Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...
Todd Palino
 
I'm No Hero: Full Stack Reliability at LinkedIn
I'm No Hero: Full Stack Reliability at LinkedInI'm No Hero: Full Stack Reliability at LinkedIn
I'm No Hero: Full Stack Reliability at LinkedIn
Todd Palino
 

Más de Todd Palino (10)

Leading Without Managing: Becoming an SRE Technical Leader
Leading Without Managing: Becoming an SRE Technical LeaderLeading Without Managing: Becoming an SRE Technical Leader
Leading Without Managing: Becoming an SRE Technical Leader
 
From Operations to Site Reliability in Five Easy Steps
From Operations to Site Reliability in Five Easy StepsFrom Operations to Site Reliability in Five Easy Steps
From Operations to Site Reliability in Five Easy Steps
 
Code Yellow: Helping Operations Top-Heavy Teams the Smart Way
Code Yellow: Helping Operations Top-Heavy Teams the Smart WayCode Yellow: Helping Operations Top-Heavy Teams the Smart Way
Code Yellow: Helping Operations Top-Heavy Teams the Smart Way
 
Why Does (My) Monitoring Suck?
Why Does (My) Monitoring Suck?Why Does (My) Monitoring Suck?
Why Does (My) Monitoring Suck?
 
URP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to KnowURP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to Know
 
Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...
Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...
Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...
 
Running Kafka for Maximum Pain
Running Kafka for Maximum PainRunning Kafka for Maximum Pain
Running Kafka for Maximum Pain
 
I'm No Hero: Full Stack Reliability at LinkedIn
I'm No Hero: Full Stack Reliability at LinkedInI'm No Hero: Full Stack Reliability at LinkedIn
I'm No Hero: Full Stack Reliability at LinkedIn
 
Tuning Kafka for Fun and Profit
Tuning Kafka for Fun and ProfitTuning Kafka for Fun and Profit
Tuning Kafka for Fun and Profit
 
Enterprise Kafka: Kafka as a Service
Enterprise Kafka: Kafka as a ServiceEnterprise Kafka: Kafka as a Service
Enterprise Kafka: Kafka as a Service
 

Último

Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Christo Ananth
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Último (20)

Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 

Kafka at Peak Performance

  • 1. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Kafka at Peak Performance
  • 2. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Todd Palino
  • 3. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Who Am I? 3
  • 4. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Kafka At LinkedIn  1100+ Kafka brokers  Over 32,000 topics  350,000+ Partitions  875 Billion messages per day  185 Terabytes In  675 Terabytes Out  Peak Load (whole site) – 10.5 Million messages/sec – 18.5 Gigabits/sec Inbound – 70.5 Gigabits/sec Outbound 4  1800+ Kafka brokers  Over 79,000 topics  1,130,000+ Partitions  1.3 Trillion messages per day  330 Terabytes In  1.2 Petabytes Out  Peak Load (single cluster) – 2 Million messages/sec – 4.7 Gigabits/sec Inbound – 15 Gigabits/sec Outbound
  • 5. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. What Will We Talk About?  Picking Your Hardware  Monitoring the Cluster  Triaging Broker Performance Problems  Conclusion 5
  • 6. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Hardware Selection 6
  • 7. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. What’s Important To You?  Message Retention - Disk size  Message Throughput - Network capacity  Producer Performance - Disk I/O  Consumer Performance - Memory 7
  • 8. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Go Wide  Kafka is well-suited to horizontal scaling  RAIS - Redundant Array of Inexpensive Servers  Also helps with CPU utilization – Kafka needs to decompress and recompress every message batch – KIP-31 will help with this by eliminating recompression  Don’t co-locate Kafka 8
  • 9. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Disk Layout  RAID – Can survive a single disk failure (not RAID 0) – Provides the broker with a single log directory – Eats up disk I/O  JBOD – Gives Kafka all the disk I/O available – Broker is not smart about balancing partitions – If one disk fails, the entire broker stops  Amazon EBS performance works! 9
  • 10. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Operating System Tuning  Filesystem Options – EXT or XFS – Using unsafe mount options  Virtual Memory – Swappiness – Dirty Pages  Networking 10
  • 11. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Java  Only use JDK 8 now  Keep heap size small – Even our largest brokers use a 6 GB heap – Save the rest for page cache  Garbage Collection - G1 all the way – Basic tuning only – Watch for humongous allocations 11
  • 12. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. How Much Do You Need? 12
  • 13. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Buy The Book! 13 Early Access available now. Covers all aspects of Kafka, from setup to client development to ongoing administration and troubleshooting. Also discusses stream processing and other use cases.
  • 14. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Kafka Cluster Sizing  How big for your local cluster? – How much disk space do you have? – How much network bandwidth do you have? – CPU, memory, disk I/O  How big for your aggregate cluster? – In general, multiple the number of brokers by the number of local clusters – May have additional concerns with lots of consumers 14
  • 15. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Topic Configuration  Partition Counts for Local – Many theories on how to do this correctly, but the answer is “it depends” – How many consumers do you have? – Do you have specific partition requirements? – Keeping partition sizes manageable  Partition Counts for Aggregate – Multiply the number of partitions in a local cluster by the number of local clusters – Periodically review partition counts in all clusters  Message Retention – If aggregate is where you really need the messages, only retain it in local for long enough to cover mirror maker problems 15
  • 16. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Possible Broker Improvements  Namespaces – Namespace topics by datacenter – Eliminate local clusters and just have aggregate – Significant hardware savings  JBOD Fixes – Intelligent partition assignment – Admin tools to move partitions between mount points – Broker should not fail completely with a single disk failure 16
  • 17. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Administrative Improvements  Multiple cluster management – Topic management across clusters – Visualization of mirror maker paths  Better client monitoring – Burrow for consumer monitoring – No open source solution for producer monitoring (audit)  End-to-end availability monitoring 17
  • 18. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Keeping An Eye On Things 18
  • 19. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Monitoring The Foundation  CPU Load  Network inbound and outbound  Filehandle usage for Kafka  Disk – Free space - where you write logs, and where Kafka stores messages – Free inodes – I/O performance - at least average wait and percent utilization  Garbage Collection 19
  • 20. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Broker Ground Rules  Tuning – Stick (mostly) with the defaults – Set default cluster retention as appropriate – Default partition count should be at least the number of brokers  Monitoring – Watch the right things – Don’t try to alert on everything  Triage and Resolution – Solve problems, don’t mask them 20
  • 21. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Too Much Information!  Monitoring teams hate Kafka – Per-Topic metrics – Per-Partition metrics – Per-Client metrics  Capture as much as you can – Many metrics are useful while triaging an issue  Clients want metrics on their own topics  Only alert on what is needed to signal a problem 21
  • 22. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Broker Monitoring  Bytes In and Out, Messages In – Why not messages out?  Partitions – Count and Leader Count – Under Replicated and Offline  Threads – Network pool, Request pool – Max Dirty Percent  Requests – Rates and times - total, queue, local, and send 22
  • 23. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Topic Monitoring  Bytes In, Bytes Out  Messages In, Produce Rate, Produce Failure Rate  Fetch Rate, Fetch Failure Rate  Partition Bytes  Log End Offset – Why bother? – KIP-32 will make this unnecessary  Quota Throttling  Provide this to your customers for them to alert on 23
  • 24. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Client Monitoring  For consumers, use Burrow – Monitor all partitions for all consumers – Provides an easy to digest “good, warning, bad” state, with detail available – Fast and free  Producers are a little harder – Several internal implementations of message auditing – The community needs a good open source standard  Cluster availability monitoring – kafka-monitoring is coming soon from LinkedIn! 24
  • 25. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. It’s Broken! Now What? 25
  • 26. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. All The Best Ops People…  Know more of what is happening than their customers  Are proactive  Fix bugs, not work around them  This applies to our developers too! 26
  • 27. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Anticipating Trouble  Trend cluster utilization and growth over time  Use default configurations for quotas and retention to require customers to talk to you  Monitor request times – If you are able to develop a consistent baseline, this is early warning 27
  • 28. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Under Replicated Partitions  Count of number of partitions which are not fully replicated within the cluster  Also referred to as “replica lag”  Primary indicator of problems within the cluster 28
  • 29. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Broker Performance Checks  Are you still running 0.8?  Are all the brokers in the cluster working?  Are the network interfaces saturated? – Reelect partition leaders – Rebalance partitions in the cluster – Spread out traffic more (increase partitions or brokers)  Is the CPU utilization high? (especially iowait) – Is another process competing for resources? – Look for a bad disk  Do you have really big messages? 29
  • 30. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Kafka’s OK, Now What?  If Kafka is working properly, it’s probably a client issue – Don’t throw it over the fence. Help your customers understand  Common producer issues – Batch size and linger time – Receive and send buffers – Sync vs. async, and acknowledgements  Common consumer issues – Garbage collection problems – Min fetch bytes and max wait time – Not enough partitions 30
  • 31. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Conclusion 31
  • 32. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. One Ecosystem  Kafka can scale to millions of messages per second, and more – Operations must scale the cluster appropriately – Developers must use the right tuning and go parallel  Few problems are owned by only one side – Expanding partitions often requires coordination – Applications that need higher reliability drive cluster configurations  Either we work together, or we fail separately 32
  • 33. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Would You Like To Know More?  Presentations: http://www.slideshare.net/toddpalino – More Datacenters, More Problems – Kafka As A Service – Always download the originals for slide notes!  Blog Posts: https://engineering.linkedin.com/blog – Development and SRE blogs on Kafka and other topics  LinkedIn Open Source: https://github.com/linkedin/streaming – Burrow Consumer Monitoring - https://github.com/linkedin/Burrow – Kafka Admin Tools - https://github.com/linkedin/kafka-tools 33
  • 34. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Getting Involved With Kafka  http://kafka.apache.org  Join the mailing lists – users@kafka.apache.org – dev@kafka.apache.org  irc.freenode.net - #apache-kafka  Meetups – Apache Kafka - http://www.meetup.com/http-kafka-apache-org – Bay Area Samza - http://www.meetup.com/Bay-Area-Samza-Meetup/  Contribute code 34
  • 35. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Data @ LinkedIn is Hiring!  Streams Infrastructure – Kafka pub/sub ecosystem – Stream Processing Platform built on Apache Samza – Next Generation change capture technology (incubating)  LinkedIn – Strong commitment to open source – Do cool things and work with awesome people  Join us in working on cutting edge stream processing infrastructures – Please contact kparamasivam@linkedin.com – Software developers and Site Reliability Engineers at all levels 35
  • 36.
  • 37. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Appendix 37
  • 38. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. JDK Options Heap Size -Xmx6g -Xms6g Metaspace -XX:MetaspaceSize=96m -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80 G1 Tuning -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M GC Logging -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/path/to/logs/gc.log -verbose:gc Error Handling -XX:-HeapDumpOnOutOfMemoryError -XX:ErrorFile=/path/to/logs/hs_err.log 38
  • 39. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. OS Tuning Parameters  Networking: net.core.rmem_default = 124928 net.core.rmem_max = 2048000 net.core.wmem_default = 124928 net.core.wmem_max = 2048000 net.ipv4.tcp_rmem = 4096 87380 4194304 net.ipv4.tcp_wmem = 4096 16384 4194304 net.ipv4.tcp_max_tw_buckets = 262144 net.ipv4.tcp_max_syn_backlog = 1024 39
  • 40. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. OS Tuning Parameters (cont.)  Virtual Memory vm.oom_kill_allocating_task = 1 vm.max_map_count = 200000 vm.swappiness = 1 vm.dirty_writeback_centisecs = 500 vm.dirty_expire_centisecs = 500 vm.dirty_ratio = 60 vm.dirty_background_ratio = 5 40
  • 41. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Kafka Broker Sensors kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics kafka.server:name=PartitionCount,type=ReplicaManager kafka.server:name=LeaderCount,type=ReplicaManager kafka.server:name=UnderReplicatedPartitions,type=ReplicaManager kafka.server:name=RequestHandlerAvgIdlePercent,type=KafkaRequestHandlerPool kafka.controller:name=ActiveControllerCount,type=KafkaController kafka.controller:name=OfflinePartitionsCount,type=KafkaController kafka.log:name=max-dirty-percent,type=LogCleanerManager kafka.network:name=NetworkProcessorAvgIdlePercent,type=SocketServer kafka.network:name=RequestsPerSec=*,type=RequestMetrics kafka.network:name=RequestQueueTimeMs,request=*,type=RequestMetrics kafka.network:name=LocalTimeMs,request=*,type=RequestMetrics kafka.network:name=RemoteTimeMs,request=*,type=RequestMetrics kafka.network:name=ResponseQueueTimeMs,request=*,type=RequestMetrics kafka.network:name=ResponseSendTimeMs,request=*,type=RequestMetrics kafka.network:name=TotalTimeMs,request=*,type=RequestMetrics 41
  • 42. SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved. Kafka Broker Sensors - Topics kafka.server:name=BytesInPerSec,type=BrokerTopicMetrics,topics=* kafka.server:name=BytesOutPerSec,type=BrokerTopicMetrics,topics=* kafka.server:name=MessagesInPerSec,type=BrokerTopicMetrics,topics=* kafka.server:name=TotalProduceRequestsPerSec,type=BrokerTopicMetrics,topic=* kafka.server:name=FailedProduceRequestsPerSec,type=BrokerTopicMetrics,topic=* kafka.server:name=TotalFetchRequestsPerSec,type=BrokerTopicMetrics,topic=* kafka.server:name=FailedFetchRequestsPerSec,type=BrokerTopicMetrics,topic=* kafka.log:type=Log,name=LogEndOffset,topic=*,partition=* 42