2. Agenda
2
01
Confluent Control Center
Everything about Confluent monitoring
solution
02
JMX Metrics and Monitoring Stacks
Overview of JMX metrics and 3rd party
monitoring stacks
03
Monitor Consumer Lag
All the different ways to monitor consumer
lag
04
Key Alerts
Important metrics and alerts to set up
05
Audit Logs
Confluent Platform Audit Logs overview
06
Enabling and using Proactive
Support
Step by Step walkthrough on how to
enable Proactive Support
3. • Confluent Platform is the central nervous system for a business, and potentially a Kafka-based single
source of truth.
• Kafka operators need to provide guarantees to the business that Kafka is working properly and
delivering data in real time. They need to identify and triage problems in order to solve them before it
affects end users. As a result, monitoring your Kafka deployments is an operational must-have.
• Monitoring helps provide assurance that all your services are working properly, meeting SLAs and
addressing business needs.
• Here are some common business-level questions:
1. Are applications receiving all data?
2. Are my business applications showing the latest data?
3. Why are the applications running slowly?
4. Do we need to scale up?
5. Can any data get lost?
6. Will there be service interruptions?
7. Are there assurances in case of a disaster event?
• We will see how Control Center can help to answer all those questions
Why monitoring?
3
5. 5
You can deploy Confluent Control Center for
out-of-the-box Kafka cluster monitoring so you
don’t have to build your own monitoring system.
Control Center makes it easy to manage the
entire Confluent Platform.
Control Center is a web-based application that
allows you to manage your cluster, to monitor
Kafka clusters in predefined dashboards and to
alert on triggers.
6. • Kafka exposes hundreds of JMX metrics. Some of them are per broker,
per client, per topic and per partition, and so the number of metrics
scales up as the cluster grows. For an average-size Kafka cluster, the
number of metrics can very quickly grow into the thousands!
• A common pitfall of generic monitoring tools is to import pretty much all
available metrics. But even with a comprehensive list of metrics, there is
a limit to what can be achieved with no Kafka context or Kafka expertise
to determine which metrics are important and which ones are not.
• People end up referring to just the two or three charts that they
actually understand.
• Meanwhile, they ignore all the other charts because they don’t
understand them
• It can generate a lot of noise as people spend time chasing
“issues” that aren’t impactful to the services, or worse, obscure
real problems.
• Control Center was designed to help operators identify the most
important things to monitor in Kafka, including the cluster and the
client applications producing messages to and consuming messages
from the cluster
The metrics swamp
6
8. 8
• Cluster Overview provides insight into the
well-being of the Kafka cluster from the
cluster perspective, and allows you to drill
down to the broker level, topic level,
connect cluster level and KSQL level
perspectives
• Multiple clusters can be monitored with a
single Control Center and starting from
CP 5.4.1, it also supports Multi-Cluster
Schema Registry
• Requires Confluent Metrics Reporter to be
installed and enabled
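The Metrics Reporter requirement above maps to a short broker configuration; a minimal sketch for `server.properties` (the bootstrap server is a placeholder):

```
# Enable Confluent Metrics Reporter so Control Center receives cluster metrics
metric.reporters=io.confluent.metrics.reporter.ConfluentMetricsReporter
# Cluster that receives the metrics (can be the monitored cluster itself
# or a dedicated metrics cluster); kafka1:9092 is a placeholder
confluent.metrics.reporter.bootstrap.servers=kafka1:9092
confluent.metrics.reporter.topic.replicas=3
```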
Cluster Overview
9. 9
• Brokers Overview provides a succinct view
of essential Kafka metrics for brokers in a
cluster:
• Throughput for production and
consumption
• Broker uptime
• Partition replica status (including
URP)
• Apache ZooKeeper status
• Active Controller
• Disk usage and distribution
• System metrics for network and
request pool usage
• Clicking on the panels gives you a
historical view of the metrics
Brokers Overview
10. 10
• Brokers Metrics page provides historical
data for the following panels:
• Production metrics
• Consumption metrics
• Broker uptime metrics
• Partition replicas metrics
• System usage
• Disk usage
Brokers Metrics page
11. 11
• You can add, view, edit, and delete topics
using the Control Center topic management
interface
• Message Browser
• Manage Schemas for Topics
• Avro, JSON-Schema and Protobuf
• ⚠ Options to view and edit schemas
through the user interface are available
only for schemas that use the default
TopicNameStrategy
• Multi-Cluster Schema Registry
• Metrics:
• Production Throughput and Failed
production requests
• Consumption Throughput and Failed
consumption requests, % messages
consumed (requires Monitoring
Interceptors) and End-to-end latency
(requires Monitoring Interceptors)
• Availability (URP and Out of Sync
followers and observers)
• Consumer Lag
Topics
12. 12
• Provides the convenience of managing
connectors for multiple Kafka Connect
clusters.
• Use Control Center to:
• Add a connector by completing UI
fields. Note: a specific procedure
applies when RBAC is used.
• Add a connector by uploading a
connector configuration file
• Download connector configuration files
to reuse in another connector or cluster,
or to use as a template.
• Edit a connector configuration and
relaunch it.
• Pause a running connector; resume a
paused connector.
• Delete a connector.
• View the status of connectors in
Connect clusters.
Connect
13. 13
• Control Center provides the convenience of
running streaming queries on one or more
ksqlDB clusters within its graphical user
interface
• Use ksqlDB to:
• View a summary of all ksqlDB
applications connected to Control
Center.
• Search for a ksqlDB application being
managed by the Control Center
instance.
• Browse topic messages.
• View the number of running queries,
registered streams, and registered
tables for each ksqlDB application.
• Navigate to the ksqlDB Editor, Streams,
Tables, Flow View and Running Queries
for each ksqlDB application.
ksqlDB
14. 14
• View all consumer groups for all topics in a
cluster
• Use Consumers menu to:
• View all consumer groups for a cluster
in the All consumer groups page
• View consumer lag across all topics in a
cluster
• View consumption metrics for a
consumer group (only available if
monitoring interceptors are set)
• Set up consumer group alerts
Consumers
15. 15
• Available since 5.4.0
• From the Replicators pages on Control Center,
you can:
• Monitor tasks, message throughput,
and Connect workers running as
replicators.
• Monitor metrics on source topics on the
origin cluster.
• Monitor metrics on replicated topics on
the destination cluster.
• Drill down on source and replicated
topics to monitor and configure them
through the Control Center Topics
pages
• To enable it, follow these steps:
• Add replicator-rest-extension-<version>.jar to your CLASSPATH
• Add rest.extension.classes=io.confluent.connect.replicator.monitoring.ReplicatorMonitoringExtension to the Connect worker configuration
Replicators
16. 16
• You can set up alerts in Control Center based on
triggers for 4 component types:
• Broker
• Bytes in
• Bytes out
• Fetch request latency
• Production request count
• Production request latency
• Cluster
• Cluster down
• Leader election rate
• Offline topic partitions
• Unclean election count
• Under replicated topic partitions
• ZooKeeper status
• ZooKeeper expiration rate
• Consumer Group
• Average latency (ms)
• Consumer lag
• Consumer lead
• Consumption difference
• Maximum latency (ms)
• Topic
• Bytes in
• Bytes out
• Out of sync replica count
• Production request count
• Under-replicated topic partitions
• Notifications are possible via email, PagerDuty or
Slack
Alerts
17. 17
• Cluster settings
• Change cluster name (also possible
using configuration file)
• Update dynamic settings without any
restart required
• Download broker configuration
• Status and License menu
• Processing status: status of Control
Center (Running or Not Running).
Consumption data and Broker data
(message throughput) are shown in
real time for the last 30 minutes
• Set or update license
And more...
19. Let’s walk through the important questions we asked at the start of the session that you need to be
able to answer.
1. Are applications receiving all data?
2. Are my business applications showing the latest data?
3. Why are the applications running slowly?
4. Do we need to scale up?
5. Can any data get lost?
6. Will there be service interruptions?
7. Are there assurances in case of a disaster event?
Control Center - Important questions
19
20. • When every single message running through your Kafka cluster is critical to your business, you may
need to demonstrate that every single message that a Kafka client application produces to Kafka
is consumed on the other side
• Examples where Kafka is working as designed, but messages may unintentionally not be consumed.
1. A consumer is offline for a period of time that is longer than the consumer offsets retention
period (offsets.retention.minutes) which is 7 days by default. When the application restarts,
by default it sets consumer offsets to latest (auto.offset.reset). Therefore, we will miss
consuming some messages.
2. A consumer is offline for a period of time that is longer than the log retention period
(log.retention.hours), which is 7 days by default. By the time the application restarts, messages will
have been deleted from the data log files and never consumed.
3. A producer is configured with weak durability guarantees, for example it does not wait for
multiple broker acknowledgements (e.g., acks=0 or acks=1). If a broker suddenly fails before
having a chance to replicate a message to other brokers, that message will be lost.
• This is why it is important to know if your applications were able to receive all data
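The third failure mode above is avoided with stronger client settings; a minimal sketch of client properties (values are illustrative, not prescriptive):

```
# Producer: wait for all in-sync replicas before considering a write successful
acks=all
enable.idempotence=true
# Consumer: fail fast instead of silently jumping to the latest offset
# when the committed offset has expired (see failure modes 1 and 2 above)
auto.offset.reset=none
```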
1- Are applications receiving all data?
20
21. • Confluent Monitoring Interceptors allow Control Center to report message delivery statistics as
messages are produced and consumed.
• If you start a new consumer group for a topic (assuming you configured Confluent Monitoring
Interceptors for both producer and consumer), Control Center processes the message delivery
statistics and reports on actual message consumption versus expected message consumption.
• Within a time bucket, Control Center compares the number of messages produced and number of
messages consumed. If all messages produced were consumed, then it shows actual consumption
equals the expected consumption; otherwise, Control Center can alert (Consumption difference
trigger) that there is a difference.
• See % of messages consumed in Consumptions tab
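Monitoring Interceptors are enabled purely through client configuration; a minimal sketch (these are the standard Confluent interceptor class names):

```
# Producer configuration
interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor
# Consumer configuration
interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor
```

The monitoring interceptors jar must also be on the client classpath.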
1- Are applications receiving all data?
21
22. • You also need to know if messages are being processed with low latency, and you likely have SLAs for
real-time applications that give you specific latency targets you need to hit
• Detecting high latency and comparing actual consumption against expected consumption is difficult
without end-to-end stream monitoring capabilities. You could instrument your application code, but
that carries the risk that the instrumentation code itself could degrade performance.
• Control Center monitors stream latency from producer to consumer, and can catch a spike in latency as
well as a trend in gradually increasing latency.
• As a Kafka operator, you want to identify spikes in latency and slow consumers trending toward higher
latency, before customers start complaining.
• See End-to-end latency in Consumptions tab
2- Are my business applications showing the
latest data?
22
23. • You absolutely need to baseline and monitor the performance of
your Kafka cluster to be able to answer questions like this
• If you want to optimize your Kafka deployment for high
throughput or low latency, follow the recommendations in
white paper Optimizing Your Apache Kafka Deployment:
Levers for Throughput, Latency, Durability, and Availability
• To identify performance bottlenecks, it is important to break
down the Kafka request lifecycle into phases and determine
how much time a request spends in each phase
• You can get this info in Production metrics panel and
Consumption metrics panel
3- Why are the applications running slowly?
23
24. 1. Client sends request to broker
2. Network thread gets request and puts it on queue
3. IO thread/handler picks up request and processes
4. read/write from/to local “disk”
5. Wait for other brokers to ack messages
6. Put response on queue
7. Network thread sends response to client
• Refer to Confluent Support Knowledge base article Kafka Broker
Performance Diagnostics for more details
• The phase in which the broker is spending the most time in the
request lifecycle may indicate one of several problems: slow CPU,
slow disk, network congestion, not enough I/O threads, or not
enough network threads.
• Or perhaps there just is not enough incoming data to return in a
fetch response, so the broker is simply waiting.
• Combine this information with other important indicators like the
network or request pool usage (in Brokers Overview). Then focus
your investigations, figure out the bottlenecks, and consider tuning
settings, e.g., increasing the number of I/O threads
(num.io.threads) or network threads (num.network.threads), or
take some other action
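Both thread pool sizes are dynamic broker configs, so they can be changed without a restart; a sketch using the kafka-configs CLI (broker id, bootstrap server, and values are placeholders):

```
kafka-configs --bootstrap-server kafka1:9092 \
  --entity-type brokers --entity-name 0 \
  --alter --add-config num.io.threads=16,num.network.threads=8
```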
3- Why are the applications running slowly?
24
25. • Topic partition leadership also impacts broker performance: the
more topic partitions that a given broker is a leader for, the more
requests the broker may have to process. Therefore, if some brokers
are have more leaders than others, you may have an unbalanced
cluster when it comes to performance and disk utilization
• Control Center provides certain indicators like:
• Number of requests each broker is processing
• Disk skew warning: shown if the relative mean absolute difference
of all broker disk sizes exceeds 10% (fixed value, not configurable)
and the difference exceeds
confluent.controlcenter.disk.skew.warning.min.bytes (default 1 GB)
• Cluster-wide balance status
• You can take action by rebalancing the Kafka cluster using the
Auto Data Balancer (or using self balancing feature starting from
CP 6.0). This rebalances the number of leaders and disk usage
evenly across brokers and racks on a per topic and cluster level, and
can be throttled so that it doesn’t kill your performance while
moving topic partitions between brokers.
3- Why are the applications running slowly?
25
26. • Capacity planning for your Kafka cluster is important to ensure that you are always able to meet business demands.
• Look beyond generic CPU, network, and disk utilization; use Control Center to monitor Kafka-specific indicators that may
indicate the Kafka cluster is at or near capacity:
• Broker processing (details in previous section “Why are the applications running slowly?”)
• Network pool usage
• Thread pool usage
• Produce request latencies
• Fetch request latencies
• Network utilization
• Disk Space
4- Do we need to scale up?
26
• Take corrective actions on capacity issues before it is too late!
• Example: if network pool utilization is high and CPU on the brokers is high, you may decide to add a
new broker to the cluster and then move partitions to the new broker. But it takes additional CPU and
network capacity, and so if you try to add brokers when you are already close to 100% utilization, the
process will be slow and painful!
• Monitor capacity in your Kafka applications to decide if you need to linearly scale their throughput. If
Control Center shows consumer group latency trending higher and higher over time, it may be an
indication that the existing consumers cannot keep up with the current data volume. To address this,
consider adding consumers to the consumer groups (assuming there are enough partitions to
distribute across all the consumers)
27. • Operationally, you want assurance that data produced to Kafka will not get lost
• The most important feature that enables durability is replication. This ensures that messages are copied to multiple brokers,
and so if a broker has a failure, the data is still available from at least one other broker. Topics with high durability
requirements should have the configuration parameter replication.factor set to at least 3, which will ensure that the
cluster can handle the loss of two brokers without losing the data.
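For example, a topic with these durability requirements could be created like this (topic name, partition count, and bootstrap server are placeholders):

```
kafka-topics --bootstrap-server kafka1:9092 --create \
  --topic payments --partitions 12 \
  --replication-factor 3 \
  --config min.insync.replicas=2
```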
• The number of under-replicated partitions is the best indicator for cluster health: there should be no under-replicated
partitions in steady state. This provides assurance that data replication is working between Kafka brokers and the replicas
are in sync with topic partition leaders, such that if a leader fails, the replicas will not lose any data.
• Check the built-in dashboard in the system health view landing page to instantly see topic partition status across your entire
cluster in a single view.
• Under-replicated topic partitions is one of the most important issues you should always investigate and you can set an alert
for this in Control Center!
5- Can any data get lost?
27
28. • Software upgrades, broker configuration updates, and cluster maintenance are all expected parts of Kafka operations. If you
need to schedule any of these activities, you may need to do a rolling restart through the cluster, restarting one broker at a
time. A rolling restart executed the right way can provide high availability by avoiding downtime for end users. Doing it
wrong may result in downtime for users, or worse: lost data.
• Some important things to note before doing a rolling restart:
• Because at least one replica is unavailable while a broker is restarting, clients will not experience downtime if
the number of remaining in-sync replicas for that topic partition is greater than or equal to the configured
min.insync.replicas.
• Run brokers with controlled.shutdown.enable=true to migrate topic partition leadership before the broker
is stopped.
• The active controller should be the last broker you restart. This is to ensure that the active controller is not
moved on each broker restart, which would slow down the restart.
• Use Control Center for your rolling restarts to monitor broker status. Make sure each broker is fully online before restarting
the next one
• During a single broker restart, the number of under replicated topic partitions or offline partitions may increase, and then
once your broker is back online, the numbers should recover to zero. This indicates that data replication and ISRs are caught
up. Now restart the next broker
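The per-broker check described above can also be scripted; a sketch using the kafka-topics CLI (the bootstrap server is a placeholder):

```
# Block until no partition is under-replicated, then restart the next broker
while kafka-topics --bootstrap-server kafka1:9092 --describe \
        --under-replicated-partitions | grep -q .; do
  echo "under-replicated partitions remain, waiting..."
  sleep 10
done
```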
6- Will there be service interruptions?
28
29. • Control Center can help you with your multi-data-center deployment and prepare for disaster recovery in case disaster
strikes.
• It can manage multiple Kafka deployments in different data centers, even between Confluent Cloud and on-prem
deployments.
• Below is a diagram for an active-active multi-data-center deployment, and Confluent Replicator is the key to synchronize
data and metadata between the sites. Control Center enables you to configure and run Confluent Replicator to copy data
between your data centers, and then you can use stream monitoring to ensure end-to-end message delivery.
7- Are there assurances in case of a disaster
event?
29
30. • Once Replicator is deployed, monitor Replicator performance and tune it to minimize replication lag: the delay
between a message being written to the origin cluster and that message being copied to the destination cluster. This is important because
processing at the destination cluster will be delayed by this lag, and secondly, in case of a disaster event, you may lose
messages that were produced at the origin cluster but weren’t replicated yet to the destination cluster.
• Replicator is easy to monitor in Control Center because it is a connector and can be monitored like any other Kafka connector.
And since 5.4, you can use the dedicated Replicators section. Run Replicator with Confluent Monitoring Interceptors to get
detailed message delivery statistics.
7- Are there assurances in case of a disaster
event?
30
31. 31
Control Center has great value:
• Not just a monitoring tool, but also a
management tool
• Provides a highly opinionated view of
metrics for helping make administrative
decisions
• 100% made for Kafka and CP ecosystem
(Schema Management, data inspection,
Kafka Connect, Replicator & KSQL
integrations..)
Cannot be used alone:
• No system monitoring (e.g. CPU, Network,
etc…)
• Does not expose all JMX (e.g. Request Q
size)
• Does not monitor all clients (only those
with Monitoring Interceptors)
• Does not monitor all components (e.g. ZK)
Summary
33. • Enable auto-updates (available since 5.4)
• Dedicated metric data cluster is recommended: you can send monitoring data to a dedicated Kafka
Cluster:
• Independent of availability of the cluster(s) being monitored
• Ease of upgrade
• Metric data cluster can have reduced security requirements
• Guarantee that Control Center workload will never interfere with production traffic
• Troubleshooting tips and common issues
• Many customers have data that is not accurate because the Control Center server is not powerful
enough. Make sure to follow the system requirements!
• You can disable Control Center Usage Data Collection
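A sketch of the relevant properties for a dedicated metrics cluster (host names are placeholders, and the usage-data property name should be checked against your CP version's documentation):

```
# On each monitored broker: ship metrics to the dedicated cluster
confluent.metrics.reporter.bootstrap.servers=metrics-kafka:9092

# control-center.properties: run Control Center against that same cluster
bootstrap.servers=metrics-kafka:9092
# Disable usage data collection
confluent.controlcenter.usage.data.collection.enable=false
```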
Control Center - Tips and Tricks
33
34. 02 JMX Metrics and Monitoring
Stacks
Overview of JMX metrics and 3rd party monitoring stacks
35. • Kafka brokers and Java client applications (Kafka Connect, Kafka Streams, Producer/Consumer, etc..)
expose hundreds of internal JMX (Java Management Extensions) metrics
• Important JMX metrics to monitor:
• Broker metrics
• ZooKeeper metrics
• Producer metrics
• Consumer metrics
• ksqlDB & Kafka Streams metrics
• Kafka Connect metrics
• It’s key to have a dashboard that lets you know “everything is OK” at a glance
• Multiple monitoring stacks are available. Choose the one that is already used in your company
JMX metrics
35
37. 37
Prometheus/Grafana
• Prometheus is a popular open-source
monitoring solution which uses
JMX-Exporter to extract the metrics. The
exporter can be configured to extract and
forward only the metrics desired.
• An example of
JMX-Exporter/Prometheus/Grafana
monitoring stack deployed on top of
Confluent cp-demo is available here
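JMX-Exporter is typically attached to each broker as a Java agent; a sketch (jar path, port, and rules file are placeholders):

```
# Expose selected JMX metrics on :8080/metrics for Prometheus to scrape
export KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent.jar=8080:/opt/kafka-broker-rules.yml"
```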
Prometheus exporter
(JMX-Exporter)
• JMX metrics are only available for Java-based clients.
• librdkafka applications can be configured to emit internal metrics at a fixed
interval (disabled by default) by setting the statistics.interval.ms configuration property to a value > 0 and
registering a stats_cb (or similar, depending on the language)
• All statistics described here
• Emits JSON object string:
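A heavily truncated sketch of the emitted statistics object and a crude way to pull consumer lag out of it on the command line; field names follow librdkafka's STATISTICS.md, and the values are illustrative:

```shell
# Truncated, illustrative librdkafka stats JSON as delivered to stats_cb
stats='{"name":"rdkafka#consumer-1","type":"consumer","topics":{"orders":{"partitions":{"0":{"consumer_lag":42}}}}}'
# Crude extraction of the per-partition consumer_lag field
echo "$stats" | grep -o '"consumer_lag":[0-9]*' | cut -d: -f2
```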
Librdkafka: Client statistics
41
44. 44
Jolokia/Elastic/Kibana
• ELK stack is a popular open-source
monitoring solution which uses Jolokia
(JSON over HTTP) to extract the metrics.
Metrics are then exported to Elasticsearch
and displayed in a Kibana dashboard
• An example of
Jolokia/Elasticsearch/Kibana monitoring
stack deployed on top of Confluent
cp-demo is available here
46. 46
Datadog
• Datadog has had an Apache Kafka integration
for monitoring self-managed broker
installations with their Datadog Agent for
several years.
• The new Confluent Platform integration (see
May 2020 blog post) adds several capabilities:
- Monitoring for Kafka Connect, ksqlDB,
Confluent Schema Registry, and
Confluent REST Proxy
- Monitoring for Java-based Kafka
clients
- Default Confluent Platform dashboard
with the most critical metrics
- Optionally configured log collection
• See Datadog documentation
48. 48
TICK
The TICK stack comprises:
• Telegraf, a component that gathers metrics
• InfluxDB, a time series database
• Chronograf, a visualization tool
• Kapacitor, a real time alerting platform
50. • It is important to monitor your application’s consumer
lag, which is the number of records for any partition that
the consumer is behind in the log
• For "real-time" consumer applications, where the
consumer is meant to be processing the newest
messages with as little latency as possible, consumer lag
should be monitored closely.
• Most "real-time" applications will want little-to-no
consumer lag, because lag introduces end-to-end
latency.
Monitor Consumer Lag
50
51. Consumer lag is available in Consumers section from navigation bar:
#1: Using Control Center
51
52. • If you use Java consumers, you can capture JMX metrics and monitor records-lag-max
• Note: the consumer’s records-lag-max JMX metric calculates lag by comparing the offset most
recently seen by the consumer to the most recent offset in the log, which is a more real-time
measurement.
#2: Using JMX (Java client only)
52
Metric Description
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),records-lag-max
The maximum lag in terms of number of
records for any partition in this window. An
increasing value over time is your best
indication that the consumer group is not
keeping up with the producers.
53. • Refer to this Knowledge Base article for full details
• Create a properties file containing your security details
• Example:
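A sketch of the CLI invocation (file name, group name, bootstrap server, and security settings are placeholders):

```
# client.properties -- your security details, e.g.:
#   security.protocol=SASL_SSL
#   sasl.mechanism=PLAIN
#   sasl.jaas.config=...
kafka-consumer-groups --bootstrap-server kafka1:9092 \
  --command-config client.properties \
  --describe --group my-consumer-group
```

The output includes CURRENT-OFFSET, LOG-END-OFFSET and LAG columns per partition.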
#3: Using kafka-consumer-groups CLI
53
54. 54
#4: Using
kafka-lag-exporter and
Prometheus/Grafana
• lightbend/kafka-lag-exporter is a 3rd party
tool (not supported by Confluent) that is
using Kafka's Admin API
describeConsumerGroups() method to get
consumer lags and export them to
Prometheus.
• Example of how to set it up is available
here
• An out-of-the-box Grafana dashboard is
available
57. 57
Alerts
• As seen earlier, setting up alerts can be done
through Control Center, but also using your
monitoring stack based on JMX metrics (for
example Prometheus AlertManager plugin)
• Alert on what’s important: Under-replicated
partitions is a good start
• Alerting on SLAs is even better: especially
when measured from a client point of view
58. Key Alerts
58
Cluster/Broker:
• UnderReplicatedPartitions > 0 *
• OfflinePartitionsCount > 0 *
• UnderMinIsrPartitionCount > 0
• ActiveControllerCount != 1
• AtMinIsrPartitionCount > 0
• RequestHandlerAvgIdlePercent < 40%
• NetworkProcessorAvgIdlePercent < 40%
• RequestQueueSize (establish the baseline during normal/peak production load and alert if a deviation occurs)
• TotalTimeMs,request=(Produce|FetchConsumer|FetchFollower)
OS:
• Disk usage > 60% (minor) >
80-90% (major)
• CPU usage > 60% over 5
minutes (generally caused
by SSL connections or old
clients causing down
conversions)
• Network IO usage > 60%
• File handle usage > 60%
JVM Monitoring:
• G1 YoungGeneration
CollectionTime
• G1 OldGeneration
CollectionTime
• GC time > 30%
Connect:
• connector=(*)
status
• connector=(*),task=(.*)
status
ZooKeeper:
• AvgRequestLatency > 10ms over 30 seconds (disk latency is high; check await time with `iostat -x` and I/O wait in `top`)
• NumAliveConnections - make sure you are not close to the maximum as set with maxClientCnxns
• OutstandingRequests - should be below 10 in general
The Four Letter Words: mntr and ruok (need to be enabled starting from 5.4 with -Dzookeeper.4lw.commands.whitelist=*)
$ echo ruok | nc localhost 2181
imok
* alert can also be set with Control Center
64. • Audit logs provide a way to capture, protect, and preserve authorization activity into
topics in Kafka clusters on Confluent Platform using Confluent Server Authorizer.
• Record the runtime decisions of the permission checks that occur as users attempt to
take actions that are protected by ACLs and RBAC.
• Each auditable event includes information about who tried to do what, when they tried,
and whether or not the system gave permission to proceed.
• By default, audit logs are enabled and are managed by the inter-broker principal
(typically, the user kafka), who has expansive permissions (this can be changed).
• The primary value of audit logs is that they provide data you can use to assess security
risks in your local and remote Kafka clusters. They contain all of the information
necessary to follow a user’s interaction with your local or remote Kafka clusters, and
provide a way to:
• Track user and application access across the platform
• Identify abnormal behavior and anomalies
• Proactively monitor and resolve security risks
• You can use Splunk, S3, or other sink connectors to move your audit log data to a target
platform for analysis.
Audit Logs
64
65. • List of auditable events is available here
• Audit logs are enabled by default (they can be disabled)
• Default topic is confluent-audit-log-events.
• Default audit logs capture MANAGEMENT and AUTHORIZE categories of authorization
events only.
• More advanced configuration allows you to specify:
• Which event categories you want to capture (including categories like produce,
consume, and inter-broker, which are disabled by default)
• Multiple topics to capture logs of differing importance
• Topic destination routes optimized for security and performance
• Retention periods that serve the needs of your organization
• Excluded principals, which ensures performance is not compromised by
excessively high message volumes
• The Kafka port over which to communicate with your audit log cluster
• ℹ When enabling audit logging for produce and consume, be very selective about
which events you want logged, and configure logging for only the most sensitive
topics.
• Refer to Audit log configuration examples
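Routing is configured as a JSON document in the broker configuration (property confluent.security.event.router.config); a rough, hypothetical sketch of its shape, to be validated against the linked configuration examples:

```json
{
  "destinations": {
    "topics": {
      "confluent-audit-log-events": { "retention_ms": 7776000000 }
    }
  },
  "default_topics": {
    "allowed": "confluent-audit-log-events",
    "denied": "confluent-audit-log-events"
  },
  "excluded_principals": ["User:healthcheck-agent"]
}
```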
Audit Logs - Configuration
65
66. 66
Deployment
• The destinations option identifies the
audit log cluster, which is specified by its
bootstrap servers.
• Security settings for connecting to the audit
log cluster are prefixed with
confluent.security.event.logger.exporter.kafka:
• Secure your audit logs by following these
recommendations
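A sketch of the prefixed exporter settings in the broker configuration (host and security values are placeholders):

```
confluent.security.event.logger.exporter.kafka.bootstrap.servers=audit-kafka:9092
confluent.security.event.logger.exporter.kafka.security.protocol=SASL_SSL
confluent.security.event.logger.exporter.kafka.sasl.mechanism=PLAIN
```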
67. 06 Enabling and using
Proactive Support
Step by Step walkthrough on how to enable Proactive Support
69. What is Proactive Support?
● Proactive Support provides ongoing,
real time analysis of performance and
configuration data from the experts
● Cloud UI managed Proactive Support
alerts
● Two rules at launch:
○ RequestHandlerAvgIdlePercent < 0.3 -> WARN alert
○ NetworkProcessorAvgIdlePercent < 0.3 -> WARN alert
70. How does it work?
● Confluent Telemetry Reporter is a plugin
that runs inside each Confluent
Platform service to push metadata
about the service to Confluent
● Data is sent over HTTP using an
encrypted connection, once per minute
by default
● Installed as part of the full Confluent
Platform installation
● What data is sent:
○ Runtime performance metrics
○ Kafka version
○ Confluent Platform version
○ Unique identifiers for the CP
component, Kafka cluster and
Customer organization.
● Note: you can stop sending data at any
time by removing the configuration
parameters.
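A sketch of the Telemetry Reporter properties added to each CP component's configuration (the API key pair comes from the Cloud API key step; values are placeholders):

```
confluent.telemetry.enabled=true
confluent.telemetry.api.key=<CLOUD_API_KEY>
confluent.telemetry.api.secret=<CLOUD_API_SECRET>
```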
72. 72
Prerequisites
• Access to Confluent Cloud
• Internet connectivity either directly
or through a proxy
• Confluent Platform 6.0 or higher
• Create a Cloud API key to
authenticate with Confluent Cloud
using UI or CLI
ccloud api-key create --resource cloud
74. 74
Step by Step
Select Join the waitlist or log in if
you’re already a Proactive Support
customer
75. 75
Step by Step
• Configure Confluent Telemetry
Reporter as described in
instructions.
76. 76
Step by Step
• Configure with Ansible by adding
the configuration overrides to all
Confluent Platform roles
77. 77
Step by Step
• Configure with Operator
• Note: k8S-SECRET-NAME is a
Kubernetes secret named secretRef.
It must contain the key telemetry
with base64-encoded values.
78. 78
Step by Step
• Custom Deployments Configuration
• Note: For Confluent Server, the
metric.reporters configuration is
not needed
• If restarting Confluent Server is
undesirable, you can add these
configurations by using dynamic
configuration and the
kafka-configs CLI.
79. 79
Step by Step
• Once all the configuration is set up
and enabled, the data should be
successfully received by Telemetry
Reporter(s)
80. 80
Step by Step
• Configure your first notification:
Slack
Webhook
Email
83. • White Papers:
• Monitoring Your Apache Kafka® Deployment End-to-End
• Github:
• confluentinc/jmx-monitoring-stacks: run Confluent cp-demo with open source monitoring
stacks (see blog post here)
• jeanlouisboudart/kafka-platform-prometheus: Simple demo of how to monitor Kafka
Platform using Prometheus and Grafana
• framiere/monitoring-demo: a Docker based walkthrough of the open-source ecosystem to
do metrics/logs/alerting
• Support Knowledge Base articles:
• Monitoring Kafka
• Monitoring Zookeeper
• Monitoring Connect
• Kafka Broker Performance Diagnostics
• Top 5 Broker JMX metrics you should be watching
Interesting links
83