Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | InfluxData

Lessons from Cloud
Scaling Prometheus metrics in Kubernetes with Telegraf

The curious case of the missing metrics
One Label too far...

© 2019 InfluxData. All rights reserved.
The Suspects
● Prometheus
● Kubernetes
● Gateway
● Queryd

Prometheus
http://gateway.twodotoh.svc.cluster.local:9999/metrics

Prometheus
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: prod_twodotoh
    kubernetes_sd_configs:
      - role: service

Kubernetes

[Diagram: InfluxCloud components: Ingress, multiple Gateway instances, multiple Queryd instances]

Problem: Prometheus Debugging is Hard
prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.01"} 0.012562015
prometheus_target_sync_length_seconds_sum{scrape_job="prod_twodotoh"} 0.012562015
prometheus_target_sync_length_seconds_count{scrape_job="prod_twodotoh"} 1
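
The summary above exposes `_sum` and `_count` series, so the average target sync time is their ratio. A minimal Python sketch, assuming only the three sample lines shown above; `parse_samples` is a hypothetical helper and a deliberate simplification, not a full exposition-format parser:

```python
import re

# Sample lines copied from the slide above.
EXPOSITION = """\
prometheus_target_sync_length_seconds{scrape_job="prod_twodotoh",quantile="0.01"} 0.012562015
prometheus_target_sync_length_seconds_sum{scrape_job="prod_twodotoh"} 0.012562015
prometheus_target_sync_length_seconds_count{scrape_job="prod_twodotoh"} 1
"""

# Simplified pattern: metric name, optional {labels}, then a value.
# It does not handle "}" inside label values or trailing timestamps.
LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)

def parse_samples(text):
    """Hypothetical helper: map (metric name, raw label string) -> float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        m = LINE.match(line)
        if m:
            samples[(m["name"], m["labels"] or "")] = float(m["value"])
    return samples

samples = parse_samples(EXPOSITION)
job = 'scrape_job="prod_twodotoh"'
total = samples[("prometheus_target_sync_length_seconds_sum", job)]
count = samples[("prometheus_target_sync_length_seconds_count", job)]

# Mean sync duration in seconds; guard against a zero count.
mean_sync_seconds = total / count if count else float("nan")
print(mean_sync_seconds)
```

With a single observation the mean equals the sum, which says little about why metrics go missing; that is part of what makes this kind of debugging hard.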

Problem: Prometheus Scaling is Hard
global:
scrape_configs:
  - job_name: prod_twodotoh_ns_a
    kubernetes_sd_configs:
      - role: service
        namespaces:
          names:
            - a
global:
scrape_configs:
  - job_name: prod_twodotoh_ns_b
    kubernetes_sd_configs:
      - role: service
        namespaces:
          names:
            - b

Solution: Isolation with Telegraf Sidecar

Solution: Isolation with Telegraf Sidecar
apiVersion: apps/v1
kind: Deployment
metadata:
  name: "gateway"
  labels:
spec:
  serviceName: "gateway"
  replicas: 100
  template:
    metadata:
      name: "gateway"
      labels:
        app: "gateway"
    spec:
      containers:
        - name: "telegraf"
          image: "docker.io/library/telegraf:1.12"
        - name: "gateway"
          image: "quay.io/influxdb/gateway:latest"
[[inputs.internal]]
[[inputs.prometheus]]
  urls = ["http://127.0.0.1:9999/metrics"]
[[outputs.influxdb]]
  urls = ["$MONITOR_HOST"]
  database = "$MONITOR_DATABASE"
  timeout = "5s"
[[outputs.influxdb_v2]]
  urls=["http://us-west-2-1.aws.cloud2.influxdata.c
  token = "$TOKEN"
  organization = "$ORG"
  bucket = "$BUCKET"
  timeout = "5s"
  namepass = ["internal"]

Solution: Isolation with Telegraf Sidecar

Problem: Prom has 1 and only 1 value
global:
scrape_configs:
  - job_name: prod_twodotoh
    kubernetes_sd_configs:
      - role: service
    metric_relabel_configs:
      - regex: user_agent
        action: labeldrop

Solution: Influx for more context
[[inputs.internal]]
[[processors.converter]]
  [processors.converter.tags]
    string = ["user_agent"]
[[outputs.influxdb]]
  timeout = "5s"
[[outputs.influxdb_v2]]
  urls = ["http://us-west-2-1.aws.cloud2.influxdata.com"]
  token = "$TOKEN"
  bucket = "$BUCKET"
  timeout = "5s"
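
Why moving `user_agent` out of the tag set helps: each distinct tag combination is its own series, while a field is just another value on an existing series. A small Python sketch, using made-up request data (the handler and user-agent values are assumptions for illustration) and a hypothetical `tags_to_fields` helper that mimics the effect of the converter rather than the plugin's implementation:

```python
# Made-up points: each has a tag set and a field set, like an Influx point.
points = [
    {"tags": {"handler": "/query", "user_agent": f"client-{i}"},
     "fields": {"requests": 1}}
    for i in range(1000)
]

def series_count(pts):
    """Count distinct tag sets; each one is a separate series."""
    return len({tuple(sorted(p["tags"].items())) for p in pts})

def tags_to_fields(pts, keys):
    """Hypothetical helper: move the named tags into string fields."""
    out = []
    for p in pts:
        tags = {k: v for k, v in p["tags"].items() if k not in keys}
        fields = dict(p["fields"])
        fields.update({k: str(p["tags"][k]) for k in keys if k in p["tags"]})
        out.append({"tags": tags, "fields": fields})
    return out

before = series_count(points)
converted = tags_to_fields(points, ["user_agent"])
after = series_count(converted)
print(before, after)
```

In this toy data the series count drops from one per user agent to a single series, and the user agent is still stored as a field on every point instead of being dropped.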

Problem: Is there a way to prevent?
global:
scrape_configs:
  - job_name: prod_twodotoh
    kubernetes_sd_configs:
      - role: service
    metric_relabel_configs:
      - regex: user_agent
        action: labeldrop

Solution: Telegraf Guard Rails
[[inputs.internal]]
[[processors.tag_limit]]
  limit = 4
  ## List of tags to preferentially preserve
  keep = ["handler", "method", "status"]
[[outputs.influxdb]]
  timeout = "5s"
[[outputs.influxdb_v2]]
  token = "$TOKEN"
  bucket = "$BUCKET"
  timeout = "5s"
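
The guard rail can be pictured as a per-point cap on tags. The Python sketch below is an approximation for illustration, not the plugin's code: the `limit_tags` function and the example tag values are assumptions, and the order in which surplus tags are dropped (sorted by key here) should not be relied on.

```python
def limit_tags(tags, limit, keep):
    """Drop tags not in `keep` until at most `limit` remain.

    Approximation for illustration: tags are visited in sorted key order,
    and tags listed in `keep` are never dropped.
    """
    result = dict(tags)
    for key in sorted(tags):
        if len(result) <= limit:
            break  # already within the cap
        if key not in keep:
            del result[key]
    return result

# Hypothetical tag set with one tag more than the limit allows.
tags = {
    "handler": "/query",
    "method": "POST",
    "status": "200",
    "user_agent": "curl/7.64",
    "zone": "us-west-2a",
}
limited = limit_tags(tags, limit=4, keep={"handler", "method", "status"})
print(sorted(limited))
```

The point of the cap is predictability: a developer who adds one label too many loses that label on their own pod instead of inflating series counts for everyone.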

Problem: Hard to Rotate Prom Passwords
global:
scrape_configs:
  - job_name: prod_twodotoh
    kubernetes_sd_configs:
      - role: service
    bearer_token_file: /etc/hunter2

Solution: Per Pod Credentials
[[inputs.internal]]
[[inputs.prometheus]]
  urls = ["http://127.0.0.1:9999/metrics"]
  bearer_token = "/etc/telegraf/hunter2"

Lessons
Scaling is NOT More Manual Processes
Scaling is NOT saying “You’re Doing it Wrong”
Scaling IS Empowering Developers
Scaling IS Predictability of Failure Modes

The time when we were watching the watchers...

Problem: Am I scraping all the pods?
global:
scrape_configs:
  - job_name: prod_twodotoh
    kubernetes_sd_configs:
      - role: service

Solution: Telegraf K8s Inventory
[[inputs.internal]]
[[inputs.kube_inventory]]
  url = "http://1.1.1.1:10255"
[[outputs.influxdb]]
  timeout = "5s"
[[outputs.influxdb_v2]]
  token = "$TOKEN"
  bucket = "$BUCKET"
  timeout = "5s"
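
One way to answer the question is to compare the pods Kubernetes reports with the pods that actually delivered metrics. A Python sketch with made-up pod names; in practice both lists would come from queries against the stored inventory and scrape data, which this sketch does not attempt:

```python
def missing_pods(inventory, scraped):
    """Return pods known to Kubernetes that reported no metrics, sorted."""
    return sorted(set(inventory) - set(scraped))

# Hypothetical pod names for illustration only.
inventory = ["gateway-0", "gateway-1", "gateway-2", "queryd-0"]
scraped = ["gateway-0", "gateway-2", "queryd-0"]

unscraped = missing_pods(inventory, scraped)

# Fraction of inventoried pods that delivered metrics.
coverage = 1 - len(unscraped) / len(inventory)
print(unscraped, f"{coverage:.0%}")
```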

Scaling even more

Scaling even more with Influx Enterprise
[Diagram: Load Balancer]

Scaling even more with Kafka and Influx Enterprise
[Diagram: Kafka]

Core Idea
● Measure and test metrics scaling
○ Are you missing metrics?
● Decentralize metrics gathering
○ Consider metrics as part of the program
● Empower Developers
○ They know their metrics the best. Allow them local tooling control

First Order Conclusion
● Too easy to shoot yourself in the foot with Prometheus metrics.
● Too much in Prometheus needs operational heroes.
● Too difficult to express vital information about your program in Prometheus without a ton of centralized control.
● One mistake can impact everyone.

Second Order Conclusion
● Prometheus is not descriptive enough.
● Extremely difficult to change over time.
● The metrics game is not a solved problem.
○ OpenTelemetry?
○ SNMP?
● Probably not one answer to everything.

Future
● Flux into Telegraf
○ Processor for transformation
○ Moving the program near the data
○ Flux Output
○ Monitoring and alerting at edge
● Telegraf Flux scripts hosted in InfluxDB API
○ Runtime plugins without re-compiling
○ Sampling rules from server-side
■ Aggregation on server with input to client
● What else?

Thank You!

The time when collecting metrics impacted storage...
Measure, measure, measure

Problem: Prometheus metrics are heavyweight
