Kafka operators need to provide guarantees to the business that Kafka is working properly and delivering data in real time, and they need to identify and triage problems so they can solve them before end users notice. This elevates Kafka monitoring from a nice-to-have to an operational necessity. In this talk, Kafka operations experts Xavier Léauté and Gwen Shapira share their best practices for monitoring Kafka and the streams of events flowing through it: how to detect duplicates, catch buggy clients, and triage performance issues – in short, how to keep the business’s central nervous system healthy and humming along, like a Kafka pro.
Speakers: Gwen Shapira, Xavier Léauté (Confluent)
Gwen is a software engineer at Confluent working on core Apache Kafka. She has 15 years of experience working with code and customers to build scalable data architectures. She currently specializes in building real-time, reliable data processing pipelines using Apache Kafka. Gwen is a co-author of “Kafka: The Definitive Guide” and “Hadoop Application Architectures”, and a frequent presenter at industry conferences. She is also a committer on the Apache Kafka and Apache Sqoop projects.
One of the first engineers on the Confluent team, Xavier is responsible for analytics infrastructure, including real-time analytics in Kafka Streams. He was previously a quantitative researcher at BlackRock. Prior to that, he held various research and analytics roles at Barclays Global Investors and MSCI.
SFBigAnalytics_20190724: Monitor Kafka like a Pro
1. Monitoring Kafka like a Pro
Xavier Léauté, Software Engineer
Gwen Shapira, Software Engineer
2. In which we’ll review:
- The basics of monitoring Kafka brokers
- The basics of monitoring Kafka clients
- Advanced techniques for monitoring Kafka clients
5. Partitions
• Kafka organizes messages into topics
• Each topic has a set of partitions
• Each partition is a replicated log of messages, referenced by sequential offsets
[Diagram: partitions 0, 1, and 2, each a sequence of messages indexed by offset starting at 0]
6. Replication
• Each partition is replicated 3 times
• Each replica lives on a separate broker
• The leader handles all reads and writes
• Followers replicate events from the leader
[Diagram: a producer writes to the leader; replicas 1, 2, and 3 each hold offsets 0–7]
10. Canary
● Lead partition on every broker
● Produce and consume
● Every 15 seconds
● Yell if 4 consecutive misses
● Do this as close as possible to the users
● Advanced: measure latency
[Diagram: brokers 100 and 101; broker 100 leads partition 1 and broker 101 leads partition 2, each with a follower replica on the other broker]
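The alerting rule above (produce and consume every 15 seconds, yell after 4 consecutive misses) can be sketched without any Kafka dependency. The produce/consume probe itself is elided here, and all names are hypothetical, not the speakers' code:

```java
/** Minimal canary alert logic: trip after 4 consecutive missed probes. */
public class CanaryAlert {
    static final int MISS_THRESHOLD = 4; // "yell if 4 consecutive misses"
    private int consecutiveMisses = 0;

    /** Record one produce/consume round-trip result; returns true if we should alert. */
    public boolean record(boolean probeSucceeded) {
        if (probeSucceeded) {
            consecutiveMisses = 0; // any success resets the streak
            return false;
        }
        consecutiveMisses++;
        return consecutiveMisses >= MISS_THRESHOLD;
    }

    public static void main(String[] args) {
        CanaryAlert alert = new CanaryAlert();
        // Simulated probe results; in a real canary one probe fires every 15 seconds
        boolean[] probes = {true, false, false, false, false};
        for (boolean ok : probes) {
            if (alert.record(ok)) {
                System.out.println("ALERT: canary missed 4 probes in a row");
            }
        }
    }
}
```

In a real deployment the probe would produce to and consume from a canary topic with a lead partition on every broker, so a single broker failing its probes is caught directly.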
30. Why Profile Streaming Applications?
Understand your bottlenecks
Metrics are sometimes hard to come by
Metrics don’t tell the full picture – they assume you already know what to look for
Production is the only environment that matters
31. Profiling Streaming Applications
Most of the time is probably not spent in your application code
Deserializing / Serializing payloads becomes significant
I/O matters a lot more (state management + network)
32. My tool of choice: Async Profiler
Can profile an application online – attaches to any running JVM
Supports older JDKs (7 and above)
No need for special JVM flags
Merges Linux perf events with JVM profiling
Low overhead
33. There are other good tools as well: Java Flight Recorder
JFR is now open source with OpenJDK 11
Mission Control provides powerful tools to analyze JFR dumps
Less useful for native code profiling (no perf event data)
Requires upgrading to JDK 11 for most users
37. Flamegraph 101 – here’s where your CPU cycles went
[Flamegraph: width = % CPU time, vertical axis = stack; highlighted region: RocksDB]
38. [Same flamegraph, highlighting the Kafka poll() loop]
39. [Same flamegraph, highlighting the actual processing time]
40. Understanding where you spend your time
Streaming applications are complex
Analyze the proportion of time spent:
• Fetching the data (including de-serializing)
• Processing the data
• Sending the data (including serializing)
CPU usage / load are poor capacity utilization metrics
Use the ratio of time spent to wall clock for capacity planning
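The wall-clock ratio idea above can be sketched as follows (hypothetical class, not a Kafka Streams API): accumulate the nanoseconds spent in each phase and divide the total by elapsed wall-clock time to get a utilization figure for capacity planning.

```java
/** Sketch: track time per phase and compare the total against wall-clock time. */
public class PhaseTimer {
    private long fetchNanos, processNanos, sendNanos;

    public void addFetch(long nanos)   { fetchNanos += nanos; }   // incl. deserializing
    public void addProcess(long nanos) { processNanos += nanos; }
    public void addSend(long nanos)    { sendNanos += nanos; }    // incl. serializing

    /** Fraction of wall-clock time spent doing work; close to 1.0 means no headroom. */
    public double utilization(long wallClockNanos) {
        return (double) (fetchNanos + processNanos + sendNanos) / wallClockNanos;
    }

    public static void main(String[] args) {
        PhaseTimer t = new PhaseTimer();
        t.addFetch(30);
        t.addProcess(50);
        t.addSend(20);
        // 100 ns of work over 200 ns of wall clock -> 0.5, i.e. 50% utilized
        System.out.println("utilization = " + t.utilization(200));
    }
}
```

Unlike raw CPU usage, this ratio also counts time blocked on I/O and state management, which is why it is the better capacity signal for streaming workloads.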
43. Detecting Unwanted Side-effects
Cache flushes were calling inefficient methods
Cache flushes were never a problem in testing
Production load caused the Streams cache to fill up instantly (always flushing)
Increasing the cache size gave us an order-of-magnitude performance improvement
46. Helps understand I/O bottlenecks
[Flamegraph annotations: time spent generating SSL random data vs. time spent writing to the socket]
47. Per-Thread Profiling
Stream threads process heterogeneous workloads
Important to understand which tasks may be the bottleneck
./profiler.sh -d <time> -t -f out.svg <pid>
49. Detecting lock contention
Time spent waiting on locks / timeouts does not show up in CPU usage
./profiler.sh -d <time> -e lock -o svg=total -f out.svg <pid>
53. Detecting lock contention
Detecting locking issues has helped us fix timing bugs:
code was calling consumer.poll() with a timeout in a critical state-restore loop
this becomes a problem when data comes in at a low rate (e.g. 1 message per second)
-> state restore went from minutes to seconds
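The timing bug above can be illustrated without Kafka, using a BlockingQueue as a stand-in for consumer.poll(): a blocking poll with a timeout stalls a tight restore loop whenever data arrives slowly, while a non-blocking poll returns immediately. This is a sketch of the failure mode, not the speakers' code.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Demo: polling with a timeout in a hot loop wastes the full timeout on every empty poll. */
public class PollTimeoutDemo {
    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();

        long start = System.nanoTime();
        // Bad: blocks for the full timeout whenever no data is buffered
        queue.poll(100, TimeUnit.MILLISECONDS);
        long blockedMs = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        // Better during restore: return immediately when nothing is buffered
        queue.poll();
        long immediateMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("blocking poll waited ~" + blockedMs
                + " ms, non-blocking poll took ~" + immediateMs + " ms");
    }
}
```

At one message per second, a 100 ms timeout per iteration turns a restore loop that should be CPU-bound into one that is mostly asleep, which is exactly the kind of wait that a lock/timeout profile surfaces and a CPU profile hides.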
58. Allocation profiling
Addressing unnecessary memory allocation helps:
• reduce GC pressure – especially long lived objects
• improve performance – large byte arrays can be expensive to initialize
Often seen at serialization / deserialization + compression stages
-> Helped us improve performance for small messages by an order of magnitude
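One common fix for allocation churn at the serialization stage is reusing a buffer across calls instead of allocating a fresh byte array per message. A minimal sketch (hypothetical class, assuming single-threaded use per serializer instance, as with Kafka serializers):

```java
/** Sketch: reuse one growable buffer across serializations to cut per-message allocation. */
public class ReusableSerializer {
    private byte[] buf = new byte[1024]; // grown on demand, never shrunk

    /**
     * Copies the payload into the reused buffer, growing it only when needed.
     * Returns the number of bytes written; the caller reads buf[0..len).
     */
    public int serialize(byte[] payload) {
        if (payload.length > buf.length) {
            // Grow geometrically so repeated large messages don't reallocate every time
            buf = new byte[Math.max(payload.length, buf.length * 2)];
        }
        System.arraycopy(payload, 0, buf, 0, payload.length);
        return payload.length;
    }

    public byte[] buffer() { return buf; }
}
```

For small messages this removes one short-lived array allocation (and its zero-initialization cost) per message, which is where the order-of-magnitude improvement mentioned above tends to come from.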
60. Profiling Container Workloads
Run from inside the container
• may still require host level access
• requires shipping profiling tools with your image
Run outside container
• requires copying agent library into the container
61. Profiling on Google Kubernetes Engine
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
my-streams-app 1/1 Running 0 72m 12.42.1.2 gke-bc5762429427-default-pool-c91f8be8-cbkq
gcloud compute ssh <node-id>
Problems
• Home is mounted non-executable
• Cannot install packages directly on host
62. GKE Toolbox
Everything needs to run inside a container
Enter the toolbox container
toolbox --bind $HOME:/host
apt-get install anything you need…
Compile async-profiler or download a pre-compiled build into the toolbox home
Note: libstdc++ in the target container must be compatible with the version compiled against
Copy the agent library to the host
cp ~/async-profiler/build/libasyncProfiler.so /host/libasyncProfiler.so
63. Setting up async-profiler in GKE
Outside the toolbox
Copy the agent library into the target docker container
The path in the container must match the exact path in the toolbox (e.g. /root/async-profiler/build)
docker exec -u0 <container-id> mkdir -p /root/async-profiler/build
docker cp libasyncProfiler.so <container-id>:/root/async-profiler/build/libasyncProfiler.so
Turn on the necessary kernel options (kernel 4.6 and above)
echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid
echo 0 | sudo tee /proc/sys/kernel/kptr_restrict
64. Running Async-Profiler in GKE
Inside toolbox
~/async-profiler/profiler.sh -d 30 -f /tmp/flamegraph.svg <pid>
can’t find /tmp/flamegraph.svg ?
Output path is resolved inside the application container
Outside the toolbox
docker cp <container-id>:/tmp/flamegraph.svg ~/
65. Do Try This At Home
https://github.com/jvm-profiling-tools/async-profiler
https://www.confluent.io/download/
https://slackpass.io/confluentcommunity
https://www.confluent.io/blog
66. 25% off
Kafka Summit SF 2019
code ks19meetup
Thank you!
@gwenshap
gwen@confluent.io
@xvrl
xavier@confluent.io