You use InfluxData to monitor the performance of your infrastructure and apps, so it is equally important to keep your InfluxEnterprise instance up and running. Tim Hall, InfluxData VP of Products, outlines why and how you can monitor InfluxEnterprise with InfluxDB.
3. From development to production
• Change is required
• Establish monitoring baselines
• Ensure visibility into health of the system
• Notifications for most common issues, before they become outages
4. From OSS to Enterprise
[Diagram: a single InfluxDB OSS instance compared with an InfluxDB Enterprise cluster of three meta nodes (Meta 1, Meta 2, Meta 3) and two data nodes (Data Node 1, Data Node 2)]
7. Deploy Telegraf on all nodes (meta and data)
By enabling these plugins, the KPIs routinely associated with infrastructure and database performance can be measured, which serves as a good starting point for monitoring. (A sketch of the optional inputs follows the lists below.)
Minimum Recommendation:
1. CPU: collects standard CPU metrics
2. System: gathers general stats on system load
3. Processes: gathers counts of processes grouped by state
4. DiskIO: gathers metrics about disk traffic and timing
5. Disk: gathers metrics about disk usage
6. Mem: collects system memory metrics
7. NetStat: gathers network connection metrics
8. http_response: sets up a local ping check
9. filestat: gathers stats about specific files (meta nodes only)
10. InfluxDB: gathers stats from the InfluxDB instance (data nodes only)
Optional:
1. Logs: requires syslog
2. Swap: collects system swap metrics
3. Internal: gathers Telegraf-related stats
4. Docker: if deployed in containers
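The optional inputs do not appear in the later config slides; a minimal sketch of enabling them in telegraf.conf (the Docker endpoint shown is the plugin's default socket, assuming Docker runs locally):

# Optional inputs -- enable only where they apply
[[inputs.swap]]
[[inputs.internal]]
[[inputs.docker]]
  endpoint = "unix:///var/run/docker.sock"  # default local Docker socket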
8. But where should these metrics land?
• You’ve got lots of options
– Typical recommendation: use an open source InfluxDB instance as the “watcher of the watchers”
• If only a small number of clusters need monitoring, this is the simplest way to go
– Other options that can be considered:
• Two instances that monitor each other
• Separate by environment, and eliminate the environment global tag in the Telegraf config
• Unleash your creativity…
9. Key Point
– Production InfluxDB instances should not monitor themselves
– WHY?
• Because visibility is lost if the database is unreachable, for any reason.
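# In influxdb.conf on the production nodes: stop storing self-monitoring
# stats in the local _internal database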
[monitor]
store-enabled = false
10. Telegraf Configuration: Global
[global_tags]
  cluster_id = "$CLUSTER_ID"
  environment = "$ENVIRONMENT"
[agent]
  interval = "10s"
  round_interval = true
  metric_buffer_limit = 10000
  metric_batch_size = 1000
  collection_jitter = "0s"
  flush_interval = "30s"
  flush_jitter = "30s"
  debug = false
  hostname = ""
All plugins are controlled by the telegraf.conf file. Administrators can easily enable or disable plugins and their options by uncommenting or commenting them out.
Global tags can be specified in the [global_tags] section of the config file in key="value" format. Use a GUID that uniquely identifies each “cluster” and ensure that the environment variable exists consistently on all hosts (meta and data). Optionally, add other tags if desired. Example: dev and prod for environment. (One way to supply these variables is sketched below.)
Agent configuration: the recommended settings for InfluxDB data collection. Adjust interval and flush_interval based on:
● the desired “speed of observability”
● the retention policy for the data
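One way to make those variables available consistently (an assumption, not shown in the deck): on packaged Linux installs the Telegraf systemd unit reads /etc/default/telegraf, so the values can be set there on every host:

# /etc/default/telegraf -- environment for the Telegraf service
# (the GUID below is a hypothetical example)
CLUSTER_ID=6f3c9d2a-1b4e-4c8f-9a7d-2e5b8c1f0a3d
ENVIRONMENT=prod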
11. Telegraf Configuration: Inputs (common)
# INPUTS
[[inputs.cpu]]
  percpu = false
  totalcpu = true
  fieldpass = ["usage_idle", "usage_user", "usage_system", "usage_steal"]
[[inputs.mem]]
[[inputs.netstat]]
[[inputs.system]]
[[inputs.diskio]]
Input configuration covers grabbing metrics from the various infrastructure, database, and system components in play. For the other plug-ins, the default config is sufficient.
12. Telegraf Configuration: Inputs Data Nodes
# INPUTS
[[inputs.influxdb]]
  interval = "15s"
  urls = ["http://<localhost>:8086/debug/vars"]
  timeout = "15s"
[[inputs.http_response]] #DATA
  address = "http://<localhost>:8086/ping"
[[inputs.disk]]
  mount_points = ["/var/lib/influxdb/data", "/var/lib/influxdb/wal", "/var/lib/influxdb/hh", "/"]
influxdb grabs all metrics from the exposed endpoint.
http_response allows you to ping individual data nodes and track the response output. You can also set up a separate Telegraf agent elsewhere within your infrastructure to ping the available cluster(s) through the load balancer (see the sketch below).
disk allows you to configure the various volumes/mount points on disk -- the locations of data, wal, and hinted handoff -- and root. (Default config options shown.)
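A minimal sketch of that separate watcher agent’s input, assuming a hypothetical load-balancer hostname:

# telegraf.conf on a watcher host outside the cluster
[[inputs.http_response]]
  address = "http://influx-lb.example.com:8086/ping"  # hypothetical LB endpoint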
13. Telegraf Configuration: Inputs Meta Nodes
# INPUTS
[[inputs.http_response]] #META
  address = "http://<localhost>:8091/ping"
[[inputs.filestat]]
  files = ["/var/lib/influxdb/meta/snapshots/*/state.bin"]
  md5 = false
[[inputs.disk]]
  mount_points = ["/var/lib/influxdb/meta", "/"]
http_response allows you to ping individual meta nodes and track the response output.
filestat allows you to monitor metadata snapshots.
disk allows you to configure the various volumes/mount points on disk -- the location of the meta store -- and root. (Default config options shown.)
14. Telegraf Configuration: Outputs
# OUTPUTS
[[outputs.influxdb]]
  urls = [ "<target URL of DB>" ]
  database = "telegraf"
  retention_policy = "autogen"
  timeout = "10s"
  username = "<uname>"
  password = "<pword>"
  content_encoding = "gzip"
Output configuration tells Telegraf which output sink to send the data to. Multiple output sinks can be specified in the configuration file (an example follows).
** NOTE: This should point to the load balancer if you are storing the metrics in a cluster.
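As an illustration of a second sink (not from the slides), the file output plugin can mirror metrics to local disk while you validate the pipeline:

[[outputs.file]]
  files = ["/tmp/telegraf-metrics.out"]  # hypothetical debug sink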
15. Telegraf Configuration: Gathering Logs
# INPUT
[[inputs.syslog]]
# OUTPUTS
[[outputs.influxdb]]
  urls = [ "http://localhost:8086" ]
  database = "telegraf"
  # Drop all measurements that start with "syslog"
  namedrop = [ "syslog*" ]
[[outputs.influxdb]]
  urls = [ "http://localhost:8086" ]
  database = "telegraf"
  retention_policy = "14days"
  # Only accept syslog data:
  namepass = [ "syslog*" ]
Output configuration: use namepass/namedrop to direct metrics and logs to different db.rp targets.
** NOTE: This should point to the load balancer if you are storing the metrics in a cluster.
Input configuration: add the syslog input plug-in and review its settings for your environment.
InfluxDB can be used to capture both metrics and events. The syslog protocol is used to gather the logs.
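The "14days" retention policy must exist on the telegraf database before data lands in it; a sketch in InfluxQL (the replication factor is an assumption for a two-data-node cluster):

CREATE RETENTION POLICY "14days" ON "telegraf" DURATION 14d REPLICATION 2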
17. We’ve gathered a wide variety of metrics...so now what?
• Dashboards!
18. Alerting: Common Metrics to Watch
• Disk Usage
• Hinted Handoff Queue
• No metrics… a.k.a. Deadman
19. Disk Usage Batch Task: TICKscript
// Monitor disk usage for all hosts
var data = batch
  |query('''
    SELECT last(used_percent) AS used_percent
    FROM "telegraf"."autogen"."disk"
    WHERE ("host" =~ /prod-.*/)
      AND ("path" = '/var/lib/influxdb/data'
        OR "path" = '/var/lib/influxdb/wal'
        OR "path" = '/var/lib/influxdb/hh'
        OR "path" = '/')
  ''')
    // alias last() back to used_percent so the alert lambda can reference it
    .period(5m)
    .every(10m)
    .groupBy('host', 'role', 'environment', 'device')
20. Disk Usage Alert: TICKscript
var warn_threshold = 85
var critical_threshold = 95
data
  |alert()
    .id('Host: {{ index .Tags "host" }}, Environment: {{ index .Tags "environment" }}')
    .message('Alert: Disk Usage, Level: {{ .Level }}, Device: {{ index .Tags "device" }}, {{ .ID }}, Usage: {{ index .Fields "used_percent" }}%')
    .warn(lambda: "used_percent" > warn_threshold)
    .crit(lambda: "used_percent" > critical_threshold)
    .slack()
    .channel('#monitoring')
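To run the combined script (the batch query from slide 19 plus this alert node) under Kapacitor, the standard define/enable flow applies; the task and file names here are assumptions:

kapacitor define disk_usage_alert -type batch -tick disk_usage_alert.tick -dbrp telegraf.autogen
kapacitor enable disk_usage_alert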
21. Hinted Handoff Queue Batch Task: TICKscript
// This generates alerts for high hinted-handoff queues for InfluxEnterprise
var queue_size = batch
  |query('''
    SELECT max(queueBytes) AS "max"
    FROM "telegraf"."autogen"."influxdb_hh_processor"
    WHERE ("host" =~ /prod-.*/)
  ''')
    .groupBy('host', 'cluster_id')
    .period(5m)
    .every(10m)
  // convert the queue size from bytes to megabytes
  |eval(lambda: "max" / 1048576.0)
    .as('queue_size_mb')
22. Hinted Handoff Queue Alert: TICKscript
var warn_threshold = 3500
var crit_threshold = 5000
queue_size
  |alert()
    .id('InfluxEnterprise/{{ .TaskName }}/{{ index .Tags "cluster_id" }}/{{ index .Tags "host" }}')
    .message('Host {{ index .Tags "host" }} (cluster {{ index .Tags "cluster_id" }}) has a hinted-handoff queue size of {{ index .Fields "queue_size_mb" }}MB')
    .details('')
    .warn(lambda: "queue_size_mb" > warn_threshold)
    .crit(lambda: "queue_size_mb" > crit_threshold)
    .stateChangesOnly()
    .slack()
    .pagerDuty()
23. Deadman Batch Task: TICKscript
// Ensure hosts are running. If no CPU usage statistics can be retrieved,
// we assume the host has locked up, disappeared, or is otherwise unreachable.
var cpu_stats = batch
  |barrier().idle(5m)
  |query('''
    SELECT count(usage_system)
    FROM "telegraf"."autogen"."cpu"
    WHERE ("host" =~ /prod-.*/)
  ''')
    .period(5m)
    .every(10m)
    .groupBy('cluster_id', 'host')
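The transcript jumps from slide 23 to 29, so the matching deadman alert is missing; a sketch of the alert node that would typically follow, with the ID format and Slack target as assumptions:

cpu_stats
  |alert()
    .id('Deadman/{{ index .Tags "cluster_id" }}/{{ index .Tags "host" }}')
    .message('{{ .ID }} has reported no CPU metrics for the last 5m')
    // count(usage_system) comes back as field "count"; zero means the host is silent
    .crit(lambda: "count" <= 0)
    .stateChangesOnly()
    .slack()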
29. Common Troubleshooting Scenarios
Workload Type
• Which type are you?
– Read heavy
– Write heavy
– Mixed?
– Establish baselines and understand “normal” using metrics and visualization
– Baselines allow you to understand change over time and help determine when it is time to scale up
Log Analysis
• Metrics first!
– Highlights where you should look within the log files
• Logs allow for pinpointing the root cause of issues observed by metrics
– Cache max memory size
– Hinted Handoff Queue “Blocked”
IOPS & Disk Throughput
• Understand the capabilities of your hardware
– We recommend SSD-based deployments
• Deploying in an IaaS environment?
– Understand max read and write limits based on machine class and drive types -- these can change as you scale!
30. Recap
• Gather Metrics...and Logs
• Visualize, Monitor, and Alert… tune based on your environment
• Review Common Troubleshooting Scenarios
https://community.influxdata.com https://docs.influxdata.com