OnPrem: Monitoring Overview
Customer Success Engineering
Agenda
2
01
Confluent Control Center
Everything about Confluent monitoring
solution
02
JMX Metrics and Monitoring Stacks
Overview of JMX metrics and 3rd party
monitoring stacks
03
Monitor Consumer Lag
All the different ways to monitor consumer
lag
04
Key Alerts
Important metrics and alerts to set up
05
Audit Logs
Confluent Platform Audit Logs overview
06
Enabling and using Proactive
Support
Step by Step walkthrough on how to
enable Proactive Support
• Confluent Platform is the central nervous system for a business, and potentially a Kafka-based single
source of truth.
• Kafka operators need to provide guarantees to the business that Kafka is working properly and
delivering data in real time. They need to identify and triage problems in order to solve them before it
affects end users. As a result, monitoring your Kafka deployments is an operational must-have.
• Monitoring helps provide assurance that all your services are working properly, meeting SLAs and
addressing business needs.
• Here are some common business-level questions:
1. Are applications receiving all data?
2. Are my business applications showing the latest data?
3. Why are the applications running slowly?
4. Do we need to scale up?
5. Can any data get lost?
6. Will there be service interruptions?
7. Are there assurances in case of a disaster event?
• We will see how Control Center can help answer all of those questions
Why monitoring?
3
01 Control Center
Everything about Confluent monitoring solution
5
You can deploy Confluent Control Center for
out-of-the-box Kafka cluster monitoring so you
don’t have to build your own monitoring system.
Control Center makes it easy to manage the
entire Confluent Platform.
Control Center is a web-based application that
allows you to manage your cluster, to monitor
Kafka clusters in predefined dashboards and to
alert on triggers.
• Kafka exposes hundreds of JMX metrics. Some of them are per broker,
per client, per topic and per partition, and so the number of metrics
scales up as the cluster grows. For an average-size Kafka cluster, the
number of metrics can very quickly grow into the thousands!
• A common pitfall of generic monitoring tools is to import pretty much all
available metrics. But even with a comprehensive list of metrics, there is
a limit to what can be achieved with no Kafka context or Kafka expertise
to determine which metrics are important and which ones are not.
• People end up referring to just the two or three charts that they
actually understand.
• Meanwhile, they ignore all the other charts because they don’t
understand them
• This can generate a lot of noise as people spend time chasing
“issues” that aren’t impactful to the services, or, worse, it can
obscure real problems.
• Control Center was designed to help operators identify the most
important things to monitor in Kafka, including the cluster and the
client applications producing messages to and consuming messages
from the cluster
The metrics swamp
6
Control Center
A walkthrough of the features
8
• Cluster Overview provides insight into the
well-being of the Kafka cluster from the
cluster perspective, and allows you to drill
down to the broker level, topic level,
connect cluster level and KSQL level
perspectives
• Multiple clusters can be monitored with a
single Control Center and starting from
CP 5.4.1, it also supports Multi-Cluster
Schema Registry
• Requires Confluent Metrics Reporter to be
installed and enabled (a configuration
sketch follows below)
Cluster Overview
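As a minimal sketch (broker address and replica count are illustrative), enabling the Confluent Metrics Reporter means adding the following to each broker's server.properties:

# make the broker load the Confluent Metrics Reporter plugin
metric.reporters=io.confluent.metrics.reporter.ConfluentMetricsReporter
# cluster that receives the metrics data (often the same cluster, or a dedicated one)
confluent.metrics.reporter.bootstrap.servers=kafka1:9092
confluent.metrics.reporter.topic.replicas=3

The reporter publishes to the _confluent-metrics topic, which Control Center consumes.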
9
• Brokers Overview provides a succinct view
of essential Kafka metrics for brokers in a
cluster:
• Throughput for production and
consumption
• Broker uptime
• Partition replica status (including
URP)
• Apache ZooKeeper status
• Active Controller
• Disk usage and distribution
• System metrics for network and
request pool usage
• Clicking on a panel gives you a historical
view of the metrics
Brokers Overview
10
• The Brokers Metrics page provides historical
data for the following panels:
• Production metrics
• Consumption metrics
• Broker uptime metrics
• Partition replicas metrics
• System usage
• Disk usage
Brokers Metrics page
11
• You can add, view, edit, and delete topics
using the Control Center topic management
interface
• Message Browser
• Manage Schemas for Topics
• Avro, JSON-Schema and Protobuf
• ⚠ Options to view and edit schemas
through the user interface are available
only for schemas that use the default
TopicNameStrategy
• Multi-Cluster Schema Registry
• Metrics:
• Production Throughput and Failed
production requests
• Consumption Throughput and Failed
consumption requests, % messages
consumed (requires Monitoring
Interceptors) and End-to-end latency
(requires Monitoring Interceptors)
• Availability (URP and Out of Sync
followers and observers)
• Consumer Lag
Topics
12
• Provides the convenience of managing
connectors for multiple Kafka Connect
clusters.
• Use Control Center to:
• Add a connector by completing UI
fields. Note: a specific procedure
applies when RBAC is used.
• Add a connector by uploading a
connector configuration file
• Download connector configuration files
to reuse in another connector or cluster,
or to use as a template.
• Edit a connector configuration and
relaunch it.
• Pause a running connector; resume a
paused connector.
• Delete a connector.
• View the status of connectors in
Connect clusters.
Connect
13
• Control Center provides the convenience of
running streaming queries on one or more
ksqlDB clusters within its graphical user
interface
• Use ksqlDB to:
• View a summary of all ksqlDB
applications connected to Control
Center.
• Search for a ksqlDB application being
managed by the Control Center
instance.
• Browse topic messages.
• View the number of running queries,
registered streams, and registered
tables for each ksqlDB application.
• Navigate to the ksqlDB Editor, Streams,
Tables, Flow View and Running Queries
for each ksqlDB application.
ksqlDB
14
• View all consumer groups for all topics in a
cluster
• Use Consumers menu to:
• View all consumer groups for a cluster
in the All consumer groups page
• View consumer lag across all topics in a
cluster
• View consumption metrics for a
consumer group (only available if
Monitoring Interceptors are set up)
• Set up consumer group alerts
Consumers
15
• Available since 5.4.0
• From the Replicators pages on Control Center,
you can:
• Monitor tasks, message throughput,
and Connect workers running as
replicators.
• Monitor metrics on source topics on the
origin cluster.
• Monitor metrics on replicated topics on
the destination cluster.
• Drill down on source and replicated
topics to monitor and configure them
through the Control Center Topics
pages
• To enable it, follow these steps:
• Add replicator-rest-extension-<version>.jar to your CLASSPATH
• In the Connect worker properties, set
rest.extension.classes=io.confluent.connect.replicator.monitoring.ReplicatorMonitoringExtension
Replicators
16
• You can set up alerts in Control Center based on 4
component triggers:
• Broker
• Bytes in
• Bytes out
• Fetch request latency
• Production request count
• Production request latency
• Cluster
• Cluster down
• Leader election rate
• Offline topic partitions
• Unclean election count
• Under replicated topic partitions
• ZooKeeper status
• ZooKeeper expiration rate
• Consumer Group
• Average latency (ms)
• Consumer lag
• Consumer lead
• Consumption difference
• Maximum latency (ms)
• Topic
• Bytes in
• Bytes out
• Out of sync replica count
• Production request count
• Under-replicated topic partitions
• Notifications are possible via email, PagerDuty or
Slack
Alerts
17
• Cluster settings
• Change cluster name (also possible
using configuration file)
• Update dynamic settings without any
restart required
• Download broker configuration
• Status and License menu
• Processing status: status of Control
Center (Running or Not Running).
Consumption data and broker data
(message throughput) are shown in
real time for the last 30 minutes
• Set or update license
And more...
Control Center
Helps answer important questions
Let’s walk through the important questions we asked at the start of the session, which you need to be
able to answer.
1. Are applications receiving all data?
2. Are my business applications showing the latest data?
3. Why are the applications running slowly?
4. Do we need to scale up?
5. Can any data get lost?
6. Will there be service interruptions?
7. Are there assurances in case of a disaster event?
Control Center - Important questions
19
• When every single message running through your Kafka cluster is critical to your business, you may
need to demonstrate that every single message that a Kafka client application produces to Kafka
is consumed on the other side
• Examples where Kafka is working as designed, but messages may unintentionally not be consumed.
1. A consumer is offline for a period of time that is longer than the consumer offsets retention
period (offsets.retention.minutes) which is 7 days by default. When the application restarts,
by default it sets consumer offsets to latest (auto.offset.reset). Therefore, we will miss
consuming some messages.
2. A consumer is offline for a period of time that is longer than the log retention period, which is 7
days by default (log.retention.hours). By the time the application restarts, messages will have
been deleted from the data log files and never consumed.
3. A producer is configured with weak durability guarantees, for example it does not wait for
multiple broker acknowledgements (e.g., acks=0 or acks=1). If a broker suddenly fails before
having a chance to replicate a message to other brokers, that message will be lost.
• This is why it is important to know if your applications were able to receive all data (a sketch of
safer client settings follows below)
1- Are applications receiving all data?
20
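As a sketch of client settings that avoid these three pitfalls (example values, not universal recommendations):

# consumer properties: resume from the earliest retained offset instead of skipping ahead
auto.offset.reset=earliest
# producer properties: wait for all in-sync replicas to acknowledge each write
acks=all
enable.idempotence=true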
• Confluent Monitoring Interceptors allow Control Center to report message delivery statistics as
messages are produced and consumed.
• If you start a new consumer group for a topic (assuming you configured Confluent Monitoring
Interceptors for both producer and consumer), Control Center processes the message delivery
statistics and reports on actual message consumption versus expected message consumption.
• Within a time bucket, Control Center compares the number of messages produced and number of
messages consumed. If all messages produced were consumed, then it shows actual consumption
equals the expected consumption; otherwise, Control Center can alert (Consumption difference
trigger) that there is a difference.
• See % of messages consumed in the Consumptions tab (an interceptor configuration sketch follows below)
1- Are applications receiving all data?
21
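A minimal sketch of enabling the interceptors in client properties (the classes ship with Confluent Platform's monitoring interceptors package):

# producer properties
interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor
# consumer properties
interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor

The interceptors report delivery statistics to the _confluent-monitoring topic, which Control Center reads.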
• You also need to know if they are being processed with low latency, and likely you even have SLAs for
real-time applications that give you specific latency targets you need to hit
• Detecting high latency and comparing actual consumption against expected consumption is difficult
without end-to-end stream monitoring capabilities. You could instrument your application code, but
that carries the risk that the instrumentation code itself could degrade performance.
• Control Center monitors stream latency from producer to consumer, and can catch a spike in latency as
well as a trend in gradually increasing latency.
• As a Kafka operator, you want to identify spikes in latency and slow consumers trending higher latency,
before customers start complaining.
• See End-to-end latency in Consumptions tab
2- Are my business applications showing the
latest data?
22
• You absolutely need to baseline and monitor performance of
your Kafka cluster to be able to answer questions like this
• If you want to optimize your Kafka deployment for high
throughput or low latency, follow the recommendations in
white paper Optimizing Your Apache Kafka Deployment:
Levers for Throughput, Latency, Durability, and Availability
• To identify performance bottlenecks, it is important to break
down the Kafka request lifecycle into phases and determine
how much time a request spends in each phase
• You can get this info in the Production metrics panel and
the Consumption metrics panel
3- Why are the applications running slowly?
23
1. Client sends request to broker
2. Network thread gets request and puts it on queue
3. IO thread/handler picks up request and processes
4. Read/write from/to local “disk”
5. Wait for other brokers to ack messages
6. Put response on queue
7. Network thread sends response to client
• Refer to Confluent Support Knowledge base article Kafka Broker
Performance Diagnostics for more details
• The phase in which the broker is spending the most time in the
request lifecycle may indicate one of several problems: slow CPU,
slow disk, network congestion, not enough I/O threads, or not
enough network threads.
• Or perhaps there just is not enough incoming data to return in a
fetch response, so the broker is simply waiting.
• Combine this information with other important indicators like the
network or request pool usage (in Brokers Overview). Then focus
your investigations, figure out the bottlenecks, and consider tuning
settings, e.g., increasing the number of I/O threads
(num.io.threads) or network threads (num.network.threads) as
sketched below, or take some other action
3- Why are the applications running slowly?
24
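As a hypothetical server.properties change after diagnosing exhausted pools (the defaults are num.io.threads=8 and num.network.threads=3; the values below are illustrative only):

num.io.threads=16
num.network.threads=8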
• Topic partition leadership also impacts broker performance: the
more topic partitions that a given broker is a leader for, the more
requests the broker may have to process. Therefore, if some brokers
have more leaders than others, you may have an unbalanced
cluster when it comes to performance and disk utilization
• Control Center provides certain indicators like:
• Number of requests each broker is processing
• Disk skew: flagged if the relative mean absolute difference of all
broker disk sizes exceeds 10% (a fixed value, not configurable) and
exceeds confluent.controlcenter.disk.skew.warning.min.bytes
(default 1 GB)
• Cluster-wide balance status
• You can take action by rebalancing the Kafka cluster using the
Auto Data Balancer (or the Self-Balancing Clusters feature, starting
from CP 6.0; see the sketch below). This rebalances the number of
leaders and disk usage evenly across brokers and racks on a per
topic and cluster level, and can be throttled so that it doesn’t kill
your performance while moving topic partitions between brokers.
3- Why are the applications running slowly?
25
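On CP 6.0 and later, Self-Balancing Clusters can be enabled with a broker setting; a sketch (the throttle value is illustrative):

confluent.balancer.enable=true
# cap the bandwidth used for moving replicas (bytes/sec)
confluent.balancer.throttle.bytes.per.second=10485760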
• Capacity planning for your Kafka cluster is important to ensure that you are always able to meet business demands.
• Look beyond generic CPU, network, and disk utilization; use Control Center to monitor Kafka-specific indicators that may
indicate the Kafka cluster is at or near capacity:
• Broker processing (details in previous section “Why are the applications running slowly?”)
• Network pool usage
• Thread pool usage
• Produce request latencies
• Fetch request latencies
• Network utilization
• Disk Space
4- Do we need to scale up?
26
• Take corrective action on capacity issues before it is too late!
• Example: if network pool utilization is high and CPU on the brokers is high, you may decide to add a
new broker to the cluster and then move partitions to the new broker. But moving partitions takes
additional CPU and network capacity, so if you try to add brokers when you are already close to 100%
utilization, the process will be slow and painful!
• Monitor capacity in your Kafka applications to decide if you need to linearly scale their throughput. If
Control Center shows consumer group latency trending higher and higher over time, it may be an
indication that the existing consumers cannot keep up with the current data volume. To address this,
consider adding consumers to the consumer groups, (assuming there are enough partitions to
distribute across all the consumers)
• Operationally, you want assurance that data produced to Kafka will not get lost
• The most important feature that enables durability is replication. This ensures that messages are copied to multiple brokers,
and so if a broker has a failure, the data is still available from at least one other broker. Topics with high durability
requirements should have the configuration parameter replication.factor set to at least 3 (see the example below), which
will ensure that the cluster can handle a loss of two brokers without losing the data.
• The number of under-replicated partitions is the best indicator for cluster health: there should be no under-replicated
partitions in steady state. This provides assurance that data replication is working between Kafka brokers and the replicas
are in sync with topic partition leaders, such that if a leader fails, the replicas will not lose any data.
• Check the built-in dashboard in the system health view landing page to instantly see topic partition status across your entire
cluster in a single view.
• Under-replicated topic partitions is one of the most important issues you should always investigate and you can set an alert
for this in Control Center!
5- Can any data get lost ?
27
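For example, creating a topic with durable settings (broker address and topic name are placeholders):

kafka-topics --bootstrap-server kafka1:9092 --create --topic payments \
  --partitions 6 --replication-factor 3 \
  --config min.insync.replicas=2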
• Software upgrades, broker configuration updates, and cluster maintenance are all expected parts of Kafka operations. If you
need to schedule any of these activities, you may need to do a rolling restart through the cluster, restarting one broker at a
time. A rolling restart executed the right way can provide high availability by avoiding downtime for end users. Doing it
wrong may result in downtime for users, or worse: lost data.
• Some important things to note before doing a rolling restart:
• Because at least one replica is unavailable while a broker is restarting, clients will not experience
downtime if the number of remaining in-sync replicas for that topic partition is at least the configured
min.insync.replicas.
• Run brokers with controlled.shutdown.enable=true to migrate topic partition leadership before the
broker is stopped.
• The active controller should be the last broker you restart. This is to ensure that the active controller is not
moved on each broker restart, which would slow down the restart.
• Use Control Center for your rolling restarts to monitor broker status. Make sure each broker is fully online before restarting
the next one
• During a single broker restart, the number of under-replicated topic partitions or offline partitions may increase, and then
once your broker is back online, the numbers should recover to zero. This indicates that data replication and ISRs are caught
up. Now restart the next broker
6- Will there be service interruptions?
28
• Control Center can help you with your multi-data-center deployment and prepare for disaster recovery in case disaster
strikes.
• It can manage multiple Kafka deployments in different data centers, even between Confluent Cloud and on-prem
deployments.
• In an active-active multi-data-center deployment, Confluent Replicator is the key to synchronizing data and metadata
between the sites. Control Center enables you to configure and run Confluent Replicator to copy data
between your data centers, and then you can use stream monitoring to ensure end-to-end message delivery.
7- Are there assurances in case of a disaster
event?
29
• Once Replicator is deployed, monitor Replicator performance and tune it to minimize replication lag: the delay between a
message being written to the origin cluster and it being copied to the destination cluster. This is important because
processing at the destination cluster will be delayed by this lag, and secondly, in case of a disaster event, you may lose
messages that were produced at the origin cluster but weren’t replicated yet to the destination cluster.
• Replicator is easy to monitor in Control Center because it is a connector and can be monitored like any other Kafka
connector. And since 5.4, you can use the dedicated Replicators section. Run Replicator with Confluent Monitoring
Interceptors to get detailed message delivery statistics.
7- Are there assurances in case of a disaster
event?
30
31
Control Center has great value:
• Not just a monitoring tool, but also a
management tool
• Provides a highly opinionated view of
metrics that helps you make administrative
decisions
• 100% made for Kafka and CP ecosystem
(Schema Management, data inspection,
Kafka Connect, Replicator & KSQL
integrations..)
Cannot be used alone:
• No system monitoring (e.g. CPU, Network,
etc…)
• Does not expose all JMX metrics (e.g. request
queue size)
• Does not monitor all clients (only those
with Monitoring Interceptors)
• Does not monitor all components (e.g. ZK)
Summary
Control Center Demo
Going through all the UI sections
• Enable auto-updates (available since 5.4)
• A dedicated metric data cluster is recommended: you can send monitoring data to a dedicated Kafka
cluster (a configuration sketch follows below):
• Independent of availability of the cluster(s) being monitored
• Ease of upgrade
• Metric data cluster can have reduced security requirements
• Guarantee that Control Center workload will never interfere with production traffic
• Troubleshooting tips and common issues
• Many customers have inaccurate data because the Control Center server is not powerful
enough. Make sure to follow the system requirements!
• You can disable Control Center Usage Data Collection
Control Center - Tips and Tricks
33
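A sketch of the relevant control-center.properties entries (cluster addresses are placeholders, and the exact property names should be checked against the Control Center docs for your version):

# where Control Center stores its own state and metrics data (dedicated cluster)
bootstrap.servers=metrics-kafka:9092
# a production cluster registered for monitoring
confluent.controlcenter.kafka.production.bootstrap.servers=kafka1:9092
# opt out of usage data collection
confluent.controlcenter.usage.data.collection.enable=false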
02 JMX Metrics and Monitoring
Stacks
Overview of JMX metrics and 3rd party monitoring stacks
• Kafka brokers and Java client applications (Kafka Connect, Kafka Streams, Producer/Consumer, etc.)
expose hundreds of internal JMX (Java Management Extensions) metrics (a sketch for exposing them
on a broker follows below)
• Important JMX metrics to monitor:
• Broker metrics
• ZooKeeper metrics
• Producer metrics
• Consumer metrics
• ksqlDB & Kafka Streams metrics
• Kafka Connect metrics
• It’s key to have a dashboard that lets you know “is everything OK?” at a glance
• Multiple monitoring stacks are available. Choose the one that is already used in your company
JMX metrics
35
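A common way to expose JMX on a broker before starting it (the port is illustrative; do not leave JMX unauthenticated outside a trusted network):

export JMX_PORT=9999
export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"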
Popular Monitoring stacks
36
Prometheus/Grafana TICK
Telegraf - InfluxDB - Chronograf -
Kapacitor
ELK
ElasticSearch - LogStash - Kibana
Datadog
37
Prometheus/Grafana
• Prometheus is a popular open-source
monitoring solution which uses
JMX-Exporter to extract the metrics. The
exporter can be configured to extract and
forward only the desired metrics (an agent
configuration sketch follows below).
• An example of
JMX-Exporter/Prometheus/Grafana
monitoring stack deployed on top of
Confluent cp-demo is available here
Prometheus exporter
(JMX-Exporter)
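JMX-Exporter is typically attached to the broker as a Java agent; a sketch (jar path, port and rules file are illustrative):

export KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent.jar=8080:/opt/kafka-broker-rules.yml"

The YAML rules file whitelists and renames the MBeans to scrape, which is how you forward only the metrics you want.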
Prometheus/Grafana: Broker (with cp-demo)
38
Prometheus/Grafana: JAVA producer demo
39
Prometheus/Grafana: JAVA consumer demo
40
• JMX metrics are only available for Java-based clients.
• Librdkafka-based applications can be configured (disabled by default) to emit internal metrics at a fixed
interval by setting the statistics.interval.ms configuration property to a value > 0 and registering a
stats_cb (or similar, depending on language)
• All statistics are described here
• The client emits a JSON-formatted string (a Python callback sketch follows below):
Librdkafka: Client statistics
41
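A minimal sketch with the Python librdkafka binding (confluent-kafka-python); broker, group and interval are placeholders:

from confluent_kafka import Consumer
import json

def stats_cb(stats_json_str):
    # invoked every statistics.interval.ms with the full JSON stats blob
    stats = json.loads(stats_json_str)
    for topic, t in stats.get("topics", {}).items():
        for pid, p in t.get("partitions", {}).items():
            print(topic, pid, "consumer_lag:", p.get("consumer_lag"))

consumer = Consumer({
    "bootstrap.servers": "kafka1:9092",
    "group.id": "demo-group",
    "statistics.interval.ms": 60000,  # > 0 enables statistics emission
    "stats_cb": stats_cb,
})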
• Using prometheus-net/prometheus-net, starting up a MetricServer to export metrics to Prometheus
Prometheus/Grafana: Librdkafka: .NET example
42
Prometheus/Grafana: .NET Client demo
43
44
Jolokia/Elastic/Kibana
• ELK stack is a popular open-source
monitoring solution which uses Jolokia
(JSON over HTTP) to extract the metrics.
Metrics are then exported to Elasticsearch
and displayed in a Kibana dashboard
• An example of
Jolokia/Elasticsearch/Kibana monitoring
stack deployed on top of Confluent
cp-demo is available here
Elasticsearch/Kibana: cp-demo
45
46
Datadog
• Datadog has had an Apache Kafka integration
for monitoring self-managed broker
installations with their Datadog Agent for
several years.
• The new Confluent Platform integration (see
May 2020 blog post) adds several capabilities:
- Monitoring for Kafka Connect, ksqlDB,
Confluent Schema Registry, and
Confluent REST Proxy
- Monitoring for Java-based Kafka
clients
- Default Confluent Platform dashboard
with the most critical metrics
- Optionally configured log collection
• See Datadog documentation
Datadog: Confluent Platform overview
47
48
TICK
The TICK stack comprises:
• Telegraf, a component that gathers metrics
• InfluxDB, a time-series database
• Chronograf, a visualization tool
• Kapacitor, a real-time alerting platform
03 Monitor Consumer Lag
All the different ways to monitor consumer lag
• It is important to monitor your application’s consumer
lag, which is the number of records for any partition that
the consumer is behind in the log
• For "real-time" consumer applications, where the
consumer is meant to be processing the newest
messages with as little latency as possible, consumer lag
should be monitored closely.
• Most "real-time" applications will want little-to-no
consumer lag, because lag introduces end-to-end
latency.
Monitor Consumer Lag
50
Consumer lag is available in the Consumers section of the navigation bar:
#1: Using Control Center
51
• If you use Java consumers, you can capture JMX metrics and monitor records-lag-max
• Note: the consumer’s records-lag-max JMX metric calculates lag by comparing the offset most
recently seen by the consumer to the most recent offset in the log, which is a more real-time
measurement.
#2: Using JMX (Java client only)
52
Metric: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),records-lag-max
Description: The maximum lag in terms of number of
records for any partition in this window. An
increasing value over time is your best
indication that the consumer group is not
keeping up with the producers.
• Refer to this Knowledge Base article for full details
• Create a properties file containing your security details
• Example (a sketch follows below):
#3: Using kafka-consumer-groups CLI
53
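A sketch (broker address, credentials and group name are placeholders):

# client.properties
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="user" password="secret";

kafka-consumer-groups --bootstrap-server kafka1:9092 \
  --command-config client.properties \
  --describe --group my-group

The LAG column shows, per partition, the log end offset minus the group's committed offset.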
54
#4: Using
kafka-lag-exporter and
Prometheus/Grafana
• lightbend/kafka-lag-exporter is a 3rd party
tool (not supported by Confluent) that uses
Kafka's Admin API
describeConsumerGroups() method to get
consumer lag and export it to
Prometheus.
• An example of how to set it up is available
here (a minimal configuration sketch
follows below)
• An out-of-the-box Grafana dashboard is
available
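A minimal application.conf sketch for kafka-lag-exporter (cluster name and address are placeholders; see the project README for the full schema):

kafka-lag-exporter {
  port = 8000
  clusters = [
    {
      name = "dev-cluster"
      bootstrap-brokers = "kafka1:9092"
    }
  ]
}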
#4: Using kafka-lag-exporter and Prometheus -
Demo
55
04 Key Alerts
Important metrics and alerts to set up
57
Alerts
• As seen earlier, alerts can be set up
through Control Center, but also using your
monitoring stack based on JMX metrics (for
example the Prometheus Alertmanager; a
sample alerting rule follows below)
• Alert on what’s important: under-replicated
partitions is a good start
• Alerting on SLAs is even better, especially
when measured from a client point of view
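For example, a Prometheus alerting rule on under-replicated partitions; the metric name depends entirely on your JMX-Exporter rename rules, so treat it as an assumption:

groups:
  - name: kafka
    rules:
      - alert: UnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Under-replicated partitions on {{ $labels.instance }}"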
Key Alerts
58
Cluster/Broker:
• UnderReplicatedPartitions > 0 *
• OfflinePartitionsCount > 0 *
• UnderMinIsrPartitionCount > 0
• ActiveControllerCount != 1
• AtMinIsrPartitionCount > 0
• RequestHandlerAvgIdlePercent < 40%
• NetworkProcessorAvgIdlePercent < 40%
• RequestQueueSize (establish the baseline during normal/peak production load and alert if a deviation occurs)
• TotalTimeMs,request=(Produce|FetchConsumer|FetchFollower)
OS:
• Disk usage > 60% (minor) >
80-90% (major)
• CPU usage > 60% over 5
minutes (generally caused
by SSL connections or old
clients causing down
conversions)
• Network IO usage > 60%
• File handle usage > 60%
JVM Monitoring:
• G1 YoungGeneration
CollectionTime
• G1 OldGeneration
CollectionTime
• GC time > 30%
Connect:
• connector=(*) status
• connector=(*),task=(.*) status
ZooKeeper:
• AvgRequestLatency > 10ms over 30 seconds (disk latency is high; check await time with `iostat -x` and look at `top`)
• NumAliveConnections - make sure you are not close to the maximum as set with maxClientCnxns
• OutstandingRequests - should be below 10 in general
The Four Letter Words mntr and ruok (starting from 5.4, they need to be enabled with -Dzookeeper.4lw.commands.whitelist=*):
$ echo ruok | nc localhost 2181
imok
* alert can also be set with Control Center
(Charts: Under-Replicated Partitions, Offline Partitions, Controller Count)
05 Audit Logs
Confluent Platform Audit Logs overview
• Audit logs provide a way to capture, protect, and preserve authorization activity into
topics in Kafka clusters on Confluent Platform using Confluent Server Authorizer.
• Record the runtime decisions of the permission checks that occur as users attempt to
take actions that are protected by ACLs and RBAC.
• Each auditable event includes information about who tried to do what, when they tried,
and whether or not the system gave permission to proceed.
• By default, audit logs are enabled and are managed by the inter-broker principal
(typically, the user kafka), who has expansive permissions (this can be changed).
• The primary value of audit logs is that they provide data you can use to assess security
risks in your local and remote Kafka clusters. They contain all of the information
necessary to follow a user’s interaction with your local or remote Kafka clusters, and
provide a way to:
• Track user and application access across the platform
• Identify abnormal behavior and anomalies
• Proactively monitor and resolve security risks
• You can use Splunk, S3, or other sink connectors to move your audit log data to a target
platform for analysis.
Audit Logs
64
• List of auditable events is available here
• Audit logs are enabled by default (they can be disabled)
• The default topic is confluent-audit-log-events (an inspection example follows below).
• Default audit logs capture only the MANAGEMENT and AUTHORIZE categories of authorization
events.
• More advanced configuration allows you to specify:
• Which event categories you want to capture (including categories like produce,
consume, and inter-broker, which are disabled by default)
• Multiple topics to capture logs of differing importance
• Topic destination routes optimized for security and performance
• Retention periods that serve the needs of your organization
• Excluded principals, which ensures performance is not compromised by
excessively high message volumes
• The Kafka port over which to communicate with your audit log cluster
• ℹ When enabling audit logging for produce and consume, be very selective about
which events you want logged, and configure logging for only the most sensitive
topics.
• Refer to Audit log configuration examples
Audit Logs - Configuration
65
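To quickly inspect audit events, you can read the default topic with a console consumer (broker address is a placeholder; add --consumer.config with your security settings as needed):

kafka-console-consumer --bootstrap-server kafka1:9092 \
  --topic confluent-audit-log-events --from-beginning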
66
Deployment
• The destinations option identifies the
audit log cluster, specified by its
bootstrap servers.
• Security settings for connecting to the audit
log cluster are prefixed with
confluent.security.event.logger.exporter.kafka
(a sketch follows below):
• Secure your audit logs by following these
recommendations
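A sketch of the prefixed security settings in server.properties (mechanism and credentials are illustrative; the destination bootstrap servers themselves are set in the audit log router configuration):

confluent.security.event.logger.exporter.kafka.security.protocol=SASL_SSL
confluent.security.event.logger.exporter.kafka.sasl.mechanism=PLAIN
confluent.security.event.logger.exporter.kafka.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="audit-writer" password="secret";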
06 Enabling and using
Proactive Support
Step by step walkthrough on how to enable Proactive Support
Feature Overview
What is Proactive Support?
● Proactive Support provides ongoing,
real-time analysis of performance and
configuration data from the experts
● Cloud UI managed Proactive Support
alerts
● Two rules at launch:
○ RequestHandlerAvgIdlePercent < 0.3 -> WARN alert
○ NetworkProcessorAvgIdlePercent < 0.3 -> WARN alert
How does it work ?
● Confluent Telemetry Reporter is a plugin
that runs inside each Confluent
Platform service to push metadata
about the service to Confluent
● Data is sent over HTTP using an
encrypted connection, once per minute
by default
● Installed as part of the full Confluent
Platform installation
● What data is sent:
○ Runtime performance metrics
○ Kafka version
○ Confluent Platform version
○ Unique identifiers for the CP
component, Kafka cluster and
Customer organization.
● Note: you can stop sending data at any
time by removing the configuration
parameters.
How to setup
72
Prerequisites
• Access to Confluent Cloud
• Internet connectivity either directly
or through a proxy
• Confluent Platform 6.0 or higher
• Create a Cloud API key to
authenticate with Confluent Cloud
using UI or CLI
ccloud api-key create --resource cloud
73
Step by Step
Click on Proactive support in
Control Center
74
Step by Step
Select Join the waitlist, or log in if
you’re already a Proactive Support
customer
75
Step by Step
• Configure Confluent Telemetry
Reporter as described in the
instructions (a properties sketch follows below).
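As a sketch, the core server.properties entries for the Telemetry Reporter (CP 6.0+; key and secret come from the Cloud API key created earlier):

confluent.telemetry.enabled=true
confluent.telemetry.api.key=<CLOUD_API_KEY>
confluent.telemetry.api.secret=<CLOUD_API_SECRET>
# proxy settings are also available if brokers lack direct internet access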
76
Step by Step
• Configure with Ansible by adding
the configuration overrides to all
Confluent Platform roles
77
Step by Step
• Configure with Operator
• Note: K8S-SECRET-NAME is the name of a
Kubernetes secret referenced by secretRef.
It must contain the key telemetry
with base64-encoded values.
78
Step by Step
• Custom Deployments Configuration
• Note: For Confluent Server, the
metric.reporters configuration is
not needed
• If restarting Confluent Server is
undesirable, you can add these
configurations by using dynamic
configuration and the
kafka-configs CLI.
79
Step by Step
• Once all the configuration is set up
and enabled, the data should be
successfully received by Telemetry
Reporter(s)
80
Step by Step
• Configure your first notification:
Slack
Webhook
Email
81
Step by Step
• Check Status and Notifications by
going to
Appendix
Interesting links
• White Papers:
• Monitoring Your Apache Kafka® Deployment End-to-End
• Github:
• confluentinc/jmx-monitoring-stacks: run Confluent cp-demo with open source monitoring
stacks (see blog post here)
• jeanlouisboudart/kafka-platform-prometheus: Simple demo of how to monitor Kafka
Platform using Prometheus and Grafana
• framiere/monitoring-demo: a Docker based walkthrough of the open-source ecosystem to
do metrics/logs/alerting
• Support Knowledge Base articles:
• Monitoring Kafka
• Monitoring Zookeeper
• Monitoring Connect
• Kafka Broker Performance Diagnostics
• Top 5 Broker JMX metrics you should be watching
Interesting links
83
Thank you!
cnfl.io/meetups cnfl.io/slack
cnfl.io/blog
OnPrem Monitoring.pdf

Más contenido relacionado

Similar a OnPrem Monitoring.pdf

PayPal Resilient System Design
PayPal Resilient System DesignPayPal Resilient System Design
PayPal Resilient System Design
Pradeep Ballal
 

Similar a OnPrem Monitoring.pdf (20)

Resilience Planning & How the Empire Strikes Back
Resilience Planning & How the Empire Strikes BackResilience Planning & How the Empire Strikes Back
Resilience Planning & How the Empire Strikes Back
 
OMEGAMON XE for Messaging V730 Long client presentation
OMEGAMON XE for Messaging V730 Long client presentationOMEGAMON XE for Messaging V730 Long client presentation
OMEGAMON XE for Messaging V730 Long client presentation
 
IBM MQ - better application performance
IBM MQ - better application performanceIBM MQ - better application performance
IBM MQ - better application performance
 
Sql server 2019 New Features by Yevhen Nedaskivskyi
Sql server 2019 New Features by Yevhen NedaskivskyiSql server 2019 New Features by Yevhen Nedaskivskyi
Sql server 2019 New Features by Yevhen Nedaskivskyi
 
Microservices deck
Microservices deckMicroservices deck
Microservices deck
 
Service quality monitoring system architecture
Service quality monitoring system architectureService quality monitoring system architecture
Service quality monitoring system architecture
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Server and application monitoring webinars [Applications Manager] - Part 4
Server and application monitoring webinars [Applications Manager] - Part 4Server and application monitoring webinars [Applications Manager] - Part 4
Server and application monitoring webinars [Applications Manager] - Part 4
 
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...
Modern Cloud-Native Streaming Platforms: Event Streaming Microservices with A...
 
Large scale, distributed access management deployment with aruba clear pass
Large scale, distributed access management deployment with aruba clear passLarge scale, distributed access management deployment with aruba clear pass
Large scale, distributed access management deployment with aruba clear pass
 
IBM IMPACT 2014 - AMC-1882 Building a Scalable & Continuously Available IBM M...
IBM IMPACT 2014 - AMC-1882 Building a Scalable & Continuously Available IBM M...IBM IMPACT 2014 - AMC-1882 Building a Scalable & Continuously Available IBM M...
IBM IMPACT 2014 - AMC-1882 Building a Scalable & Continuously Available IBM M...
 
Whats new in Enterprise 5.0 Product Suite
Whats new in Enterprise 5.0 Product SuiteWhats new in Enterprise 5.0 Product Suite
Whats new in Enterprise 5.0 Product Suite
 
The Overview of Microservices Architecture
The Overview of Microservices ArchitectureThe Overview of Microservices Architecture
The Overview of Microservices Architecture
 
PayPal Resilient System Design
PayPal Resilient System DesignPayPal Resilient System Design
PayPal Resilient System Design
 
Software for Oil, Gas & Marine Sector by Labsols
Software for Oil, Gas & Marine Sector by LabsolsSoftware for Oil, Gas & Marine Sector by Labsols
Software for Oil, Gas & Marine Sector by Labsols
 
(ATS4-PLAT03) Balancing Security with access for Development
(ATS4-PLAT03) Balancing Security with access for Development(ATS4-PLAT03) Balancing Security with access for Development
(ATS4-PLAT03) Balancing Security with access for Development
 
Microservices.pdf
Microservices.pdfMicroservices.pdf
Microservices.pdf
 
Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014Performance tuning Grails applications SpringOne 2GX 2014
Performance tuning Grails applications SpringOne 2GX 2014
 
ADF Performance Monitor
ADF Performance MonitorADF Performance Monitor
ADF Performance Monitor
 
Troubleshooting and Best Practices with WSO2 Enterprise Integrator
Troubleshooting and Best Practices with WSO2 Enterprise IntegratorTroubleshooting and Best Practices with WSO2 Enterprise Integrator
Troubleshooting and Best Practices with WSO2 Enterprise Integrator
 

Último

Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
chumtiyababu
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
Kamal Acharya
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
MayuraD1
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Kandungan 087776558899
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
AldoGarca30
 

Último (20)

A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLEGEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
 
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
 
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxS1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
 
Moment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilMoment Distribution Method For Btech Civil
Moment Distribution Method For Btech Civil
 
PE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and propertiesPE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and properties
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to Computers
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal load
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best ServiceTamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
 

OnPrem Monitoring.pdf

  • 2. Agenda 2 01 Confluent Control Center Everything about Confluent monitoring solution 02 JMX Metrics and Monitoring Stacks Overview of JMX metrics and 3rd party monitoring stacks 03 Monitor Consumer Lag All the different ways to monitor consumer lag 04 Key Alerts Important metrics and alerts to set up 05 Audit Logs Confluent Platform Audit Logs overview 06 Enabling and using Proactive Support Step by Step walkthrough on how to enable Proactive Support
  • 3. • Confluent Platform is the central nervous system for a business, and potentially a Kafka-based single source of truth. • Kafka operators need to provide guarantees to the business that Kafka is working properly and delivering data in real time. They need to identify and triage problems in order to solve them before it affects end users. As a result, monitoring your Kafka deployments is an operational must-have. • Monitoring help provides assurance that all your services are working properly, meeting SLAs and addressing business needs. • Here are some common business-level questions: 1. Are applications receiving all data? 2. Are my business applications showing the latest data? 3. Why are the applications running slowly? 4. Do we need to scale up? 5. Can any data get lost? 6. Will there be service interruptions? 7. Are there assurances in case of a disaster event? • We will see how Control Center can help to answer all those questions Why monitoring ? 3
  • 4. 01 Control Center Everything about Confluent monitoring solution
  • 5. 5 You can deploy Confluent Control Center for out-of-the-box Kafka cluster monitoring so you don’t have to build your own monitoring system. Control Center makes it easy to manage the entire Confluent Platform. Control Center is a web-based application that allows you to manage your cluster, to monitor Kafka clusters in predefined dashboards and to alert on triggers.
  • 6. • Kafka exposes hundreds of JMX metrics. Some of them are per broker, per client, per topic and per partition, and so the number of metrics scales up as the cluster grows. For an average-size Kafka cluster, the number of metrics can very quickly grow into thousands ! • A common pitfall of generic monitoring tools is to import pretty much all available metrics. But even with a comprehensive list of metrics, there is a limit to what can be achieved with no Kafka context or Kafka expertise to determine which metrics are important and which ones are not. • People end up referring to just the two or three charts that they actually understand. • Meanwhile, they ignore all the other charts because they don’t understand them • It can generate a lot of noise as people spend time chasing “issues” that aren’t impactful to the services, or worse, obscures real problems. • Control Center was designed to help operators identify the most important things to monitor in Kafka, including the cluster and the client applications producing messages to and consuming messages from the cluster The metrics swamp 6
  • 8. 8 • Cluster Overview provides insight into the well-being of the Kafka cluster from the cluster perspective, and allows you to drill down to the broker level, topic level, connect cluster level and KSQL level perspectives • Multiple clusters can be monitored with a single Control Center and starting from CP 5.4.1, it also supports Multi-Cluster Schema Registry • Requires Confluent Metrics Reporter to be installed and enabled Cluster Overview
  • 9. 9 • Brokers Overview provides a succinct view of essential Kafka metrics for brokers in a cluster: • Throughput for production and consumption • Broker uptime • Partitions replicas status (including URP) • Apache ZooKeeper status • Active Controller • Disk usage and distribution • System metrics for network and request pool usage • Clicking on panels, you get an historical view of the metrics 👇👇👇 Brokers Overview
  • 10. 10 • Brokers Metrics page provides historical data for following panels: • Production metrics • Consumption metrics • Broker uptime metrics • Partition replicas metrics • System usage • Disk usage Brokers Metrics page
  • 11. 11 • You can add, view, edit, and delete topics using the Control Center topic management interface • Message Browser • Manage Schemas for Topics • Avro, JSON-Schema and Protobuf • ⚠ Options to view and edit schemas through the user interface are available only for schemas that use the default TopicNameStrategy • Multi-Cluster Schema Registry • Metrics: • Production Throughput and Failed production requests • Consumption Throughput and Failed consumptions requests, % messages consumed (require Monitoring Interceptors) and End-to-end latency (require Monitoring Interceptors) • Availability (URP and Out of Sync followers and observers) • Consumer Lag Topics
  • 12. 12 • Provides the convenience of managing connectors for multiple Kafka Connect clusters. • Use Control Center to: • Add a connector by completing UI fields. Note: specific procedure when RBAC is used. • Add a connector by uploading a connector configuration file • Download connector configuration files to reuse in another connector or cluster, or to use as a template. • Edit a connector configuration and relaunch it. • Pause a running connector; resume a paused connector. • Delete a connector. • View the status of connectors in Connect clusters. Connect
  • 13. 13 • Control Center provides the convenience of running streaming queries on one or more ksqlDB clusters within its graphical user interface • Use ksqlDB to: • View a summary of all ksqlDB applications connected to Control Center. • Search for a ksqlDB application being managed by the Control Center instance. • Browse topic messages. • View the number of running queries, registered streams, and registered tables for each ksqlDB application. • Navigate to the ksqlDB Editor, Streams, Tables, Flow View and Running Queries for each ksqlDB application. ksqlDB
  • 14. 14 • View all consumer groups for all topics in a cluster • Use Consumers menu to: • View all consumer groups for a cluster in the All consumer groups page • View consumer lag across all topics in a cluster • View consumption metric for a consumer group (only available if monitoring interceptors are set) • Set up consumer group alerts Consumers
  • 15. 15 • Available since 5.4.0 • From the Replicators pages on Control Center, you can: • Monitor tasks, message throughput, and Connect workers running as replicators. • Monitor metrics on source topics on the origin cluster. • Monitor metrics on replicated topics on the destination cluster. • Drill down on source and replicated topics to monitor and configure them through the Control Center Topics pages • To enable it, follow those steps: • Add replicator-rest-extension-<versi on>.jar to your CLASSPATH • rest.extension.classes=io.conflu ent.connect.replicator.monitorin g.ReplicatorMonitoringExtension Replicators
  • 16. 16 • You can set up alerts in Control Center based on 4 component triggers: • Broker • Bytes in • Bytes out • Fetch request latency • Production request count • Production request latency • Cluster • Cluster down • Leader election rate • Offline topic partitions • Unclean election count • Under replicated topic partitions • ZooKeeper status • ZooKeeper expiration rate • Consumer Group • Average latency (ms) • Consumer lag • Consumer lead • Consumption difference • Maximum latency (ms) • Topic • Bytes in • Bytes out • Out of sync replica count • Production request count • Under-replicated topic partitions • Notifications are possible via email, PagerDuty or Slack Alerts
  • 17. 17 • Cluster settings • Change cluster name (also possible using configuration file) • Update dynamic settings without any restart required • Download broker configuration • Status and License menu • Processing status: status of Control Center (Running or Not Running). Consumption data and Broker data (message throughput are shown real-time for the last 30 minutes) • Set or update license And more...
  • 18. Control Center Helps answering important questions
  • 19. Let’s walk through the important questions we asked at the start of the session, that you need to be able to answer. 1. Are applications receiving all data? 2. Are my business applications showing the latest data? 3. Why are the applications running slowly? 4. Do we need to scale up? 5. Can any data get lost? 6. Will there be service interruptions? 7. Are there assurances in case of a disaster event? Control Center - Important questions 19
  • 20. • When every single message running through your Kafka cluster is critical to your business, you may need to demonstrate that every single message that a Kafka client application produces to Kafka is consumed on the other side • Examples where Kafka is working as designed, but messages may unintentionally not be consumed. 1. A consumer is offline for a period of time that is longer than the consumer offsets retention period (offsets.retention.minutes) which is 7 days by default. When the application restarts, by default it sets consumer offsets to latest (auto.offset.reset). Therefore, we will miss consuming some messages. 2. A consumer is offline for a period of time that is longer than the log retention period which is 7 days by default(log.retention.hours). By the time the application restarts, messages will have been deleted from the data log files and never consumed. 3. A producer is configured with weak durability guarantees, for example it does not wait for multiple broker acknowledgements (e.g., acks=0 or acks=1). If a broker suddenly fails before having a chance to replicate a message to other brokers, that message will be lost. • This is why it is important to know if your applications were able to receive all data 1- Are applications receiving all data? 20
  • 21. • Confluent Monitoring Interceptors allow Control Center to report message delivery statistics as messages are produced and consumed. • If you start a new consumer group for a topic (assuming you configured Confluent Monitoring Interceptors for both producer and consumer), Control Center processes the message delivery statistics and reports on actual message consumption versus expected message consumption. • Within a time bucket, Control Center compares the number of messages produced and number of messages consumed. If all messages produced were consumed, then it shows actual consumption equals the expected consumption; otherwise, Control Center can alert (Consumption difference trigger) that there is a difference. • See % of messages consumed in Consumptions tab 1- Are applications receiving all data? 21
  • 22. • You also need to know if they are being processed with low latency, and likely you even have SLAs for real-time applications that give you specific latency targets you need to hit • Detecting high latency and comparing actual consumption against expected consumption is difficult without end-to-end stream monitoring capabilities. You could instrument your application code, but that carries the risk that the instrumentation code itself could degrade performance. • Control Center monitors stream latency from producer to consumer, and can catch a spike in latency as well as a trend in gradually increasing latency. • As a Kafka operator, you want to identify spikes in latency and slow consumers trending higher latency, before customers start complaining. • See End-to-end latency in Consumptions tab 2- Are my business applications showing the latest data? 22
  • 23. • You absolutely need to baseline and monitor performance of your Kafka cluster to be able to answer question like this • If you want to optimize your Kafka deployment for high throughput or low latency, follow the recommendations in white paper Optimizing Your Apache Kafka Deployment: Levers for Throughput, Latency, Durability, and Availability • To identify performance bottlenecks, it is important to break down the Kafka request lifecycle into phases and determine how much time a request spends in each phase • You can get this info in Production metrics panel and Consumption metrics panel 3- Why are the applications running slowly? 23
  • 24. 1. Client sends request to broker 2. Network thread gets request and puts it on queue 3. IO thread/handler picks up request and processes 4. read/write from/to local “disk” 5. Wait for other brokers to ack messages 6. Put response on queue 7. Network thread sends response to client • Refer to Confluent Support Knowledge base article Kafka Broker Performance Diagnostics for more details • The phase in which the broker is spending the most time in the request lifecycle may indicate one of several problems: slow CPU, slow disk, network congestion, not enough I/O threads, or not enough network threads. • Or perhaps there just is not enough incoming data to return in a fetch response, so the broker is simply waiting. • Combine this information with other important indicators like the network or request pool usage(in Brokers Overview) Then focus your investigations, figure out the bottlenecks, consider tuning the settings, e.g., increasing the number of I/O threads (num.io.threads ) or network threads (num.network.threads ), or take some other action 3- Why are the applications running slowly? 24
  • 25. • Topic partition leadership also impacts broker performance: the more topic partitions that a given broker is a leader for, the more requests the broker may have to process. Therefore, if some brokers are have more leaders than others, you may have an unbalanced cluster when it comes to performance and disk utilization • Control Center provides certain indicators like: • Number of requests each broker is processing • Disk skewed: if relative mean absolute difference of all broker sizes exceeds 10%(fixed value, not configurable) and confluent.controlcenter.disk.skew.warning.min.byt es (default 1G) exceeds the configured value) • Cluster-wide balance status • You can take action by rebalancing the Kafka cluster using the Auto Data Balancer (or using self balancing feature starting from CP 6.0). This rebalances the number of leaders and disk usage evenly across brokers and racks on a per topic and cluster level, and can be throttled so that it doesn’t kill your performance while moving topic partitions between brokers. 3- Why are the applications running slowly? 25
  • 26. • Capacity planning for your Kafka cluster is important to ensure that you are always able to meet business demands. • Look beyond generic CPU, network, and disk utilization; use Control Center to monitor Kafka-specific indicators that may show the Kafka cluster is at or near capacity: • Broker processing (details in previous section “Why are the applications running slowly?”) • Network pool usage • Thread pool usage • Produce request latencies • Fetch request latencies • Network utilization • Disk space • Take corrective action on capacity issues before it is too late! • Example: if network pool utilization is high and CPU on the brokers is high, you may decide to add a new broker to the cluster and then move partitions to the new broker. But moving partitions takes additional CPU and network capacity, so if you try to add brokers when you are already close to 100% utilization, the process will be slow and painful! • Monitor capacity in your Kafka applications to decide if you need to linearly scale their throughput. If Control Center shows consumer group latency trending higher and higher over time, it may be an indication that the existing consumers cannot keep up with the current data volume. To address this, consider adding consumers to the consumer groups (assuming there are enough partitions to distribute across all the consumers) 4- Do we need to scale up? 26
  • 27. • Operationally, you want assurance that data produced to Kafka will not get lost • The most important feature that enables durability is replication. This ensures that messages are copied to multiple brokers, so if a broker has a failure, the data is still available from at least one other broker. Topics with high durability requirements should have the configuration parameter replication.factor set to at least 3, which ensures that the cluster can handle a loss of two brokers without losing the data (see the topic creation sketch below). • The number of under-replicated partitions is the best indicator for cluster health: there should be no under-replicated partitions in steady state. This provides assurance that data replication is working between Kafka brokers and the replicas are in sync with topic partition leaders, such that if a leader fails, the replicas will not lose any data. • Check the built-in dashboard in the system health view landing page to instantly see topic partition status across your entire cluster in a single view. • Under-replicated topic partitions are one of the most important issues you should always investigate, and you can set an alert for this in Control Center! 5- Can any data get lost? 27
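As a sketch of the durability settings discussed above (topic name, partition count, and broker address are illustrative):

$ kafka-topics --bootstrap-server broker:9092 --create \
    --topic payments --partitions 6 \
    --replication-factor 3 \
    --config min.insync.replicas=2

Combined with producers using acks=all, min.insync.replicas=2 means a write is acknowledged only after at least two replicas have it, so a single broker can fail without data loss or produce downtime.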
  • 28. • Software upgrades, broker configuration updates, and cluster maintenance are all expected parts of Kafka operations. If you need to schedule any of these activities, you may need to do a rolling restart through the cluster, restarting one broker at a time. A rolling restart executed the right way can provide high availability by avoiding downtime for end users. Doing it wrong may result in downtime for users, or worse: lost data. • Some important things to note before doing a rolling restart: • Because at least one replica is unavailable while a broker is restarting, clients will not experience downtime as long as the number of remaining in-sync replicas for each affected topic partition is greater than or equal to the configured min.insync.replicas. • Run brokers with controlled.shutdown.enable=true to migrate topic partition leadership before the broker is stopped. • The active controller should be the last broker you restart. This ensures that the active controller is not moved on each broker restart, which would slow down the restart. • Use Control Center for your rolling restarts to monitor broker status. Make sure each broker is fully online before restarting the next one (a CLI sketch for this check follows below). • During a single broker restart, the number of under-replicated or offline topic partitions may increase; once the broker is back online, the numbers should recover to zero. This indicates that data replication and ISRs are caught up. Now restart the next broker 6- Will there be service interruptions? 28
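A minimal sketch of the health check between restarts (broker address and sleep interval are illustrative); kafka-topics can list the partitions that are still under-replicated:

$ # after restarting a broker, wait until no partitions are under-replicated
$ while [ -n "$(kafka-topics --bootstrap-server broker:9092 \
      --describe --under-replicated-partitions)" ]; do sleep 10; done
$ # output is empty: replication has caught up, safe to restart the next broker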
  • 29. • Control Center can help you with your multi-data-center deployment and prepare for disaster recovery in case disaster strikes. • It can manage multiple Kafka deployments in different data centers, even between Confluent Cloud and on-prem deployments. • In an active-active multi-data-center deployment, Confluent Replicator is the key to synchronizing data and metadata between the sites. Control Center enables you to configure and run Confluent Replicator to copy data between your data centers, and then you can use stream monitoring to ensure end-to-end message delivery. 7- Are there assurances in case of a disaster event? 29
  • 30. • Once Replicator is deployed, monitor its performance and tune it to minimize replication lag: the delay between a message being written to the origin cluster and that message being copied to the destination cluster. This matters for two reasons: processing at the destination cluster is delayed by this lag, and in case of a disaster event, you may lose messages that were produced at the origin cluster but not yet replicated to the destination cluster. • Replicator is easy to monitor in Control Center because it is a connector and can be monitored like any other Kafka connector; since 5.4 there is also a dedicated Replicators section. Run Replicator with Confluent Monitoring Interceptors to get detailed message delivery statistics. 7- Are there assurances in case of a disaster event? 30
  • 31. 31 Control Center has great value: • Not just a monitoring tool, but also a management tool • Provides a highly opinionated view of metrics to help make administrative decisions • 100% made for Kafka and the CP ecosystem (Schema Management, data inspection, Kafka Connect, Replicator & KSQL integrations...) Cannot be used alone: • No system monitoring (e.g. CPU, network, etc.) • Does not expose all JMX metrics (e.g. request queue size) • Does not monitor all clients (only those with Monitoring Interceptors) • Does not monitor all components (e.g. ZooKeeper) Summary
  • 32. Control Center Demo Going through all the UI sections
  • 33. • Enable auto-updates (available since 5.4) • A dedicated metric data cluster is recommended: you can send monitoring data to a dedicated Kafka cluster that is: • Independent of the availability of the cluster(s) being monitored • Easier to upgrade • Able to run with reduced security requirements • A guarantee that the Control Center workload will never interfere with production traffic (see the configuration sketch below) • Troubleshooting tips and common issues: • Many customers see inaccurate data because the Control Center server is not powerful enough. Make sure to follow the system requirements! • You can disable Control Center Usage Data Collection Control Center - Tips and Tricks 33
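A hedged sketch of the dedicated metric data cluster wiring (hostnames and the cluster name “production” are illustrative; check the Control Center documentation for your version):

# server.properties on each monitored broker: ship metrics to the dedicated cluster
metric.reporters=io.confluent.metrics.reporter.ConfluentMetricsReporter
confluent.metrics.reporter.bootstrap.servers=metrics-kafka:9092

# control-center.properties: Control Center state lives on the dedicated cluster...
bootstrap.servers=metrics-kafka:9092
# ...while the production cluster is registered separately for management/monitoring
confluent.controlcenter.kafka.production.bootstrap.servers=prod-kafka:9092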
  • 34. 02 JMX Metrics and Monitoring Stacks Overview of JMX metrics and 3rd party monitoring stacks
  • 35. • Kafka brokers and Java client applications (Kafka Connect, Kafka Streams, Producer/Consumer, etc.) expose hundreds of internal JMX (Java Management Extensions) metrics • Important JMX metrics to monitor: • Broker metrics • ZooKeeper metrics • Producer metrics • Consumer metrics • ksqlDB & Kafka Streams metrics • Kafka Connect metrics • It’s key to have a dashboard that lets you see at a glance whether everything is OK • Multiple monitoring stacks are available: choose the one already used in your company (a sketch of enabling remote JMX follows below) JMX metrics 35
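As a reminder of how these metrics become reachable in the first place, Kafka's startup scripts honor the JMX_PORT environment variable; a minimal sketch (the port and the disabled authentication are illustrative and only acceptable in a lab environment):

$ export JMX_PORT=9999
$ export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
    -Dcom.sun.management.jmxremote.authenticate=false \
    -Dcom.sun.management.jmxremote.ssl=false"
$ kafka-server-start /etc/kafka/server.properties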
  • 36. Popular Monitoring stacks 36 Prometheus/Grafana TICK Telegraf - InfluxDB - Chronograf - Kapacitor ELK ElasticSearch - LogStash - Kibana Datadog
  • 37. 37 Prometheus/Grafana • Prometheus is a popular open-source monitoring solution which uses JMX-Exporter to extract the metrics. The exporter can be configured to extract and forward only the metrics desired. • An example of JMX-Exporter/Prometheus/Grafana monitoring stack deployed on top of Confluent cp-demo is available here Prometheus exporter (JMX-Exporter)
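A minimal sketch of attaching JMX-Exporter as a Java agent to a broker (jar path, HTTP port, and rules file are illustrative):

$ export KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent.jar=7071:/opt/kafka-broker-rules.yml"
$ kafka-server-start /etc/kafka/server.properties
$ curl -s localhost:7071/metrics | head   # the endpoint Prometheus will scrape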
  • 41. • JMX metrics are only available for Java-based clients. • librdkafka-based applications can be configured to emit internal metrics at a fixed interval (disabled by default) by setting the statistics.interval.ms configuration property to a value > 0 and registering a stats_cb (or similar, depending on the language) • All statistics are described here • Statistics are emitted as a JSON object string: Librdkafka: Client statistics 41
  • 42. • Using the prometheus-net library, start up a MetricServer to export metrics to Prometheus Prometheus/Grafana: Librdkafka: .NET example 42
  • 44. 44 Jolokia/Elastic/Kibana • The ELK stack is a popular open-source monitoring solution that uses Jolokia (JSON over HTTP) to extract the metrics. Metrics are then exported to Elasticsearch and displayed in a Kibana dashboard • An example of a Jolokia/Elasticsearch/Kibana monitoring stack deployed on top of Confluent cp-demo is available here
  • 46. 46 Datadog • Datadog has had an Apache Kafka integration for monitoring self-managed broker installations with their Datadog Agent for several years. • The new Confluent Platform integration (see May 2020 blog post) adds several capabilities: - Monitoring for Kafka Connect, ksqlDB, Confluent Schema Registry, and Confluent REST Proxy - Monitoring for Java-based Kafka clients - Default Confluent Platform dashboard with the most critical metrics - Optionally configured log collection • See Datadog documentation
  • 48. 48 TICK The TICK stack is composed of: • Telegraf, a component that gathers metrics • InfluxDB, a time series database • Chronograf, a visualization tool • Kapacitor, a real-time alerting platform
  • 49. 03 Monitor Consumer Lag All the different ways to monitor consumer lag
  • 50. • It is important to monitor your application’s consumer lag, which is the number of records for any partition that the consumer is behind in the log • For "real-time" consumer applications, where the consumer is meant to be processing the newest messages with as little latency as possible, consumer lag should be monitored closely. • Most "real-time" applications will want little-to-no consumer lag, because lag introduces end-to-end latency. Monitor Consumer Lag 50
  • 51. Consumer lag is available in the Consumers section of the navigation bar: #1: Using Control Center 51
  • 52. • If you use Java consumers, you can capture JMX metrics and monitor records-lag-max • Note: the consumer’s records-lag-max JMX metric calculates lag by comparing the offset most recently seen by the consumer to the most recent offset in the log, which is a more real-time measurement. #2: Using JMX (Java client only) 52 Metric: kafka.consumer:type=consumer-fetch-manager-metrics,client-id=([-.\w]+),records-lag-max Description: The maximum lag in terms of number of records for any partition in this window. An increasing value over time is your best indication that the consumer group is not keeping up with the producers.
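If you prefer to poll this metric programmatically instead of through a monitoring agent, here is a minimal Java sketch (JMX port and client-id are illustrative; the consumer JVM must be started with remote JMX enabled):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RecordsLagMaxProbe {
    public static void main(String[] args) throws Exception {
        // assumes the consumer JVM exposes remote JMX on localhost:9999 (illustrative)
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = jmxc.getMBeanServerConnection();
            // client-id is illustrative: match your consumer's client.id setting
            ObjectName mbean = new ObjectName(
                    "kafka.consumer:type=consumer-fetch-manager-metrics,client-id=my-consumer");
            Double lagMax = (Double) conn.getAttribute(mbean, "records-lag-max");
            System.out.println("records-lag-max = " + lagMax);
        }
    }
}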
  • 53. • Refer to this Knowledge Base article for full details • Create a properties file containing your security details • Example: #3: Using kafka-consumer-groups CLI 53
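A hedged sketch of the invocation (broker address, group name, and SASL credentials are illustrative):

$ cat > client.properties <<EOF
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="myuser" password="mypassword";
EOF
$ kafka-consumer-groups --bootstrap-server broker:9092 \
    --command-config client.properties \
    --describe --group my-consumer-group

The LAG column in the output is, per partition, the difference between the log-end offset and the group's committed offset.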
  • 54. 54 #4: Using kafka-lag-exporter and Prometheus/Grafana • lightbend/kafka-lag-exporter is a 3rd-party tool (not supported by Confluent) that uses Kafka's Admin API (e.g., the describeConsumerGroups() method) to get consumer lag and export it to Prometheus (the underlying idea is sketched below). • An example of how to set it up is available here • An out-of-the-box Grafana dashboard is available
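The underlying idea can be reproduced in a few lines with the Java Admin API; a minimal sketch (bootstrap address and group id are illustrative, null checks omitted for brevity) that computes lag as log-end offset minus committed offset:

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative
        try (AdminClient admin = AdminClient.create(props)) {
            // committed offsets for the group (group id is illustrative)
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                    .listConsumerGroupOffsets("my-consumer-group")
                    .partitionsToOffsetAndMetadata().get();
            // log-end offsets for the same partitions
            Map<TopicPartition, OffsetSpec> query = new HashMap<>();
            committed.keySet().forEach(tp -> query.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResultInfo> latest =
                    admin.listOffsets(query).all().get();
            // lag = log-end offset minus committed offset, per partition
            committed.forEach((tp, meta) -> System.out.printf("%s lag=%d%n",
                    tp, latest.get(tp).offset() - meta.offset()));
        }
    }
}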
  • 55. #4: Using kafka-lag-exporter and Prometheus - Demo 55
  • 56. 04 Key Alerts Important metrics and alerts to set up
  • 57. 57 Alerts • As seen earlier, setting up alerts can be done through Control Center, but also using your monitoring stack based on JMX metrics (for example Prometheus AlertManager plugin) • Alert on what’s important: Under-replicated partitions is a good start • Alerting on SLAs is even better: especially when measured from a client point of view
  • 58. Key Alerts 58
Cluster/Broker:
• UnderReplicatedPartitions > 0 *
• OfflinePartitionsCount > 0 *
• UnderMinIsrPartitionCount > 0
• AtMinIsrPartitionCount > 0
• ActiveControllerCount != 1
• RequestHandlerAvgIdlePercent < 40%
• NetworkProcessorAvgIdlePercent < 40%
• RequestQueueSize (establish the baseline during normal/peak production load and alert if a deviation occurs)
• TotalTimeMs for request=(Produce|FetchConsumer|FetchFollower)
OS:
• Disk usage > 60% (minor), > 80-90% (major)
• CPU usage > 60% over 5 minutes (generally caused by SSL connections or old clients causing down-conversions)
• Network IO usage > 60%
• File handle usage > 60%
JVM Monitoring:
• G1 YoungGeneration CollectionTime
• G1 OldGeneration CollectionTime
• GC time > 30%
Connect:
• connector=(*) status
• connector=(*),task=(.*) status
ZooKeeper:
• AvgRequestLatency > 10 ms over 30 seconds (indicates high disk latency; check the await column in `iostat -x`)
• NumAliveConnections (make sure you are not close to the maximum set with maxClientCnxns)
• OutstandingRequests (should be below 10 in general)
• The Four Letter Words mntr and ruok (starting from 5.4 they need to be enabled with -Dzookeeper.4lw.commands.whitelist=*):
$ echo ruok | nc localhost 2181
imok
* these alerts can also be set with Control Center (a Prometheus rule sketch follows below)
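For Prometheus-based stacks, alerts like these translate into rule files; a hedged sketch for the first alert (the metric name depends entirely on your JMX-Exporter rename rules, so treat it as illustrative):

groups:
  - name: kafka-broker
    rules:
      - alert: UnderReplicatedPartitions
        expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $value }} under-replicated partitions in the cluster"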
  • 63. 05 Audit Logs Confluent Platform Audit Logs overview
  • 64. • Audit logs provide a way to capture, protect, and preserve authorization activity into topics in Kafka clusters on Confluent Platform using Confluent Server Authorizer. • Record the runtime decisions of the permission checks that occur as users attempt to take actions that are protected by ACLs and RBAC. • Each auditable event includes information about who tried to do what, when they tried, and whether or not the system gave permission to proceed. • By default, audit logs are enabled and are managed by the inter-broker principal (typically, the user kafka), who has expansive permissions (this can be changed). • The primary value of audit logs is that they provide data you can use to assess security risks in your local and remote Kafka clusters. They contain all of the information necessary to follow a user’s interaction with your local or remote Kafka clusters, and provide a way to: • Track user and application access across the platform • Identify abnormal behavior and anomalies • Proactively monitor and resolve security risks • You can use Splunk, S3, or other sink connectors to move your audit log data to a target platform for analysis. Audit Logs 64
  • 65. • The list of auditable events is available here • Audit logs are enabled by default (they can be disabled) • The default topic is confluent-audit-log-events • By default, audit logs capture only the MANAGEMENT and AUTHORIZE categories of authorization events • More advanced configuration lets you specify: • Which event categories to capture (including categories like produce, consume, and inter-broker, which are disabled by default) • Multiple topics to capture logs of differing importance • Topic destination routes optimized for security and performance • Retention periods that serve the needs of your organization • Excluded principals, to ensure performance is not compromised by excessively high message volumes • The Kafka port over which to communicate with your audit log cluster • ℹ When enabling audit logging for produce and consume, be very selective about which events you want logged, and configure logging for only the most sensitive topics • Refer to Audit log configuration examples Audit Logs - Configuration 65
  • 66. 66 Deployment • The destinations option identifies the audit log cluster via its bootstrap servers • Security settings for connecting to the audit log cluster are prefixed with confluent.security.event.logger.exporter.kafka (a sketch follows below) • Secure your audit logs by following these recommendations
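A hedged sketch of those exporter security settings on the emitting cluster (protocol and mechanism are illustrative; the destination cluster itself is defined in the audit log router configuration):

# server.properties -- client settings for the connection to the audit log cluster
confluent.security.event.logger.exporter.kafka.security.protocol=SASL_SSL
confluent.security.event.logger.exporter.kafka.sasl.mechanism=PLAIN
confluent.security.event.logger.exporter.kafka.sasl.jaas.config=...   # credentials elided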
  • 67. 06 Enabling and using Proactive Support Step by step walkthrough on how to enable Proactive Support
  • 69. What is Proactive Support? ● Proactive Support provides ongoing, real-time analysis of performance and configuration data from the experts ● Cloud UI managed Proactive Support alerts ● Two rules at launch: ○ RequestHandlerAvgIdlePercent < 0.3 -> WARN alert ○ NetworkProcessorAvgIdlePercent < 0.3 -> WARN alert
  • 70. How does it work? ● Confluent Telemetry Reporter is a plugin that runs inside each Confluent Platform service to push metadata about the service to Confluent ● Data is sent over HTTP using an encrypted connection, once per minute by default ● Installed as part of the full Confluent Platform installation ● What data is sent: ○ Runtime performance metrics ○ Kafka version ○ Confluent Platform version ○ Unique identifiers for the CP component, Kafka cluster, and customer organization ● Note: you can stop sending data at any time by removing the configuration parameters.
  • 72. 72 Prerequisites • Access to Confluent Cloud • Internet connectivity either directly or through a proxy • Confluent Platform 6.0 or higher • Create a Cloud API key to authenticate with Confluent Cloud using UI or CLI ccloud api-key create --resource cloud
  • 73. 73 Step by Step Click on Proactive support in Control Center
  • 74. 74 Step by Step Select Join the waitlist, or log in if you’re already a Proactive Support customer
  • 75. 75 Step by Step • Configure Confluent Telemetry Reporter as described in instructions.
  • 76. 76 Step by Step • Configure with Ansible by adding the configuration overrides to all Confluent Platform roles
  • 77. 77 Step by Step • Configure with Operator • Note: K8S-SECRET-NAME is the name of a Kubernetes secret referenced via secretRef. It must contain a key named telemetry with base64-encoded values.
  • 78. 78 Step by Step • Custom Deployments Configuration • Note: For Confluent Server, the metric.reporters configuration is not needed • If restarting Confluent Server is undesirable, you can add these configurations by using dynamic configuration and the kafka-configs CLI (a sketch follows below).
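A hedged sketch of the dynamic approach (broker address and credential placeholders are illustrative; the config key names follow the Proactive Support documentation):

$ kafka-configs --bootstrap-server localhost:9092 \
    --alter --entity-type brokers --entity-default \
    --add-config confluent.telemetry.enabled=true,\
confluent.telemetry.api.key=<CLOUD_API_KEY>,\
confluent.telemetry.api.secret=<CLOUD_API_SECRET>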
  • 79. 79 Step by Step • Once all the configuration is set up and enabled, the data should be successfully received by Telemetry Reporter(s)
  • 80. 80 Step by Step • Configure your first notification: Slack Webhook Email
  • 81. 81 Step by Step • Check Status and Notifications by going to:
  • 83. • White Papers: • Monitoring Your Apache Kafka® Deployment End-to-End • Github: • confluentinc/jmx-monitoring-stacks: run Confluent cp-demo with open source monitoring stacks (see blog post here) • jeanlouisboudart/kafka-platform-prometheus: Simple demo of how to monitor Kafka Platform using Prometheus and Grafana • framiere/monitoring-demo: a Docker based walkthrough of the open-source ecosystem to do metrics/logs/alerting • Support Knowledge Base articles: • Monitoring Kafka • Monitoring Zookeeper • Monitoring Connect • Kafka Broker Performance Diagnostics • Top 5 Broker JMX metrics you should be watching Interesting links 83