Observing the HashiCorp Ecosystem From Prometheus

Kris Buytaert & Julien Pivotto
June 21, 2022
O11y

Kris Buytaert
• I used to be a developer
• Then I became an Ops person
• Chief Trolling/Travel/Technical Officer @ Inuits.eu
• Chief Yak Shaver @ o11y.eu
• Organiser of #devopsdays, #cfgmgmtcamp, #loadays, ...
• Cofounder of all of the above
• Everything is a Freaking DNS Problem
• DNS : devops needs sushi
• @krisbuytaert on twitter/github

Julien Pivotto
• Prometheus maintainer
• Open Source Observability Expert
• Principal Software Architect & CoFounder @ o11y.eu
• DevOps believer
• @roidelapluie on twitter/github

O11y
• Inuits.eu Spinoff
• Open Source Observability
• Currently supporting the Prometheus Ecosystem
• Professional Services & Support (now)
• Long Term Enterprise Support (next month)
• Prometheus Distribution (soon)

Introduction, a brief history of Open Source Monitoring

July 2008 Ottawa Linux Symposium Paper
• Bloated Java Tools
• Dysfunctional Open Core Software
• DBA Required
• Nagios was king in the Open Source world

June 2011 #monitoringsucks
• John Vincent (@lusis), June 2011
• A #devops sub-movement
• (manual configuration, not in sync with reality, hosts only, services sometimes, applications never)

October 2011 #monitoringlove
• Ulf Mansson, #devopsdays Rome 2011
• A new found love for monitoring
• Triggered by { New Open Source Tools * Automation }

November 2012 Prometheus

What is monitoring?
• High level overview of the state of a service/component
• Availability
• Technical components
• Performance ?
What is going on?

Pitfalls of traditional monitoring
• Drift from reality
• Total lack of automation
• Partial automation
• Lots of work to maintain
• Binary states: it works - it does not work
• Alert fatigue
• Alert fatigue
• Alert fatigue
• Alert fatigue

What is observability?
• Understand how your services behave
• As if you were in their place
• Without incident-specific code
Why is this going on?

How do monitoring and observability connect?
• Monitoring is required
• If lucky, monitoring is enough
• Observability is removing luck <- @roidelapluie

What is observability - in Practice?
Three pillars:
• Metrics
• Logs
• Traces

Metrics
https://play.grafana.org/

Logs
https://play.grafana.org/

Traces
https://www.jaegertracing.io/

Prometheus
• Prometheus is an Open Source CNCF Project
• Collects and stores metrics
• Pull-based
• Service discovery (including Consul)
• Alerting

The Prometheus ecosystem
• Exporters for every piece of the infra
• Maintained by multiple companies
• Long-Term Support release coming Q3 2022

Prometheus data model
• Metrics have labels
• Labels differentiate metrics, e.g.:
  • HTTP response code
  • Datacenter name

PromQL
• Prometheus Query Language
• Powerful yet simple query language
rate(http_requests_total[5m])
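Labels and PromQL work together: the same rate() query can be aggregated along any label. A minimal sketch as a Prometheus recording-rule file, assuming the metric carries `code` and `datacenter` labels (the label names, rule names, and group name are illustrative, not taken from the talk):

```yaml
# Hypothetical rule file; label names "code" and "datacenter" are assumptions.
groups:
  - name: http_example
    rules:
      # Per-second request rate over 5 minutes, split by response code and datacenter.
      - record: code_datacenter:http_requests:rate5m
        expr: sum by (code, datacenter) (rate(http_requests_total[5m]))
      # Share of requests answered with a 5xx code, per datacenter.
      - record: datacenter:http_requests_errors:ratio_rate5m
        expr: |
          sum by (datacenter) (rate(http_requests_total{code=~"5.."}[5m]))
          /
          sum by (datacenter) (rate(http_requests_total[5m]))
```

The second rule turns the raw counter into an error ratio, which is usually easier to alert on than an absolute rate.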

Observing your services
• consul_sd_configs
• Stream consul services list to Prometheus
• Up-to-date service list
• Use the flexibility of labels
• Add relevant labels
• Filter targets

consul_sd_configs labels
• __meta_consul_service
• __meta_consul_tags
• __meta_consul_node
• __meta_consul_service_metadata_
• __meta_consul_dc
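These discovery labels are only available during relabeling, so they have to be copied or filtered there. A possible scrape job, as a sketch only: the job name, the Consul agent address, and the `prometheus` tag used for filtering are assumptions.

```yaml
scrape_configs:
  - job_name: consul-services          # hypothetical job name
    consul_sd_configs:
      - server: 'localhost:8500'       # assumed address of a local Consul agent
    relabel_configs:
      # Filter targets: keep only services carrying the (assumed) "prometheus" tag.
      # __meta_consul_tags is wrapped in the tag separator, hence the commas.
      - source_labels: [__meta_consul_tags]
        regex: '.*,prometheus,.*'
        action: keep
      # Add relevant labels: copy discovery metadata onto the scraped series.
      - source_labels: [__meta_consul_service]
        target_label: service
      - source_labels: [__meta_consul_dc]
        target_label: datacenter
      - source_labels: [__meta_consul_node]
        target_label: node
```

With this in place, a service registered in Consul with the chosen tag is scraped without any further change to the Prometheus configuration.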

Alerting philosophy
• Page on actionable critical failure
• Avoid paging on Consul Health Check failure
• Keep “ambiance” alerts to get the atmosphere and quickly find the cause
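One way to separate paging alerts from “ambiance” alerts is Alertmanager routing. The sketch below assumes alerts carry a `severity` label and uses two hypothetical receivers, `pager` and `ambiance`; the actual integrations are left out.

```yaml
route:
  receiver: ambiance                   # default: low-noise channel, nobody is paged
  group_by: [alertname, datacenter]
  routes:
    # Only alerts explicitly marked critical reach the on-call person.
    - matchers:
        - severity = "critical"
      receiver: pager
receivers:
  - name: ambiance
    # a chat or e-mail integration would be configured here
  - name: pager
    # an on-call paging integration would be configured here
```

Everything that is not critical still arrives somewhere visible, so it can be consulted when a page does come in.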

consul_exporter
• Exporter maintained by the Prometheus team
• Exposes Consul cluster health
• Optionally exposes key/values
• e.g. store desired state in KV for graphing
• Connect to a single instance

Consul telemetry
• Built-in
• Runtime metrics (memory, CPU, ...)
• Autopilot, raft metrics
• Calls (rate, errors, latency)

Configure Consul telemetry
Consul configuration:
telemetry {
  disable_hostname          = true
  prometheus_retention_time = "1h"
}

Prometheus configuration:
scrape_configs:
  - job_name: consul
    metrics_path: '/v1/agent/metrics'
    params:
      format: ["prometheus"]
    static_configs:
      - targets:
          - '<consulserver1>:8500'
          - '<consulserver2>:8500'

Consul alerts (consul_exporter)
Is consul running?
up{job="consul_exporter"} == 0
consul_up{job="consul_exporter"} == 0
Is there a leader?
consul_raft_leader != 1
Are peers in raft?
sum(consul_raft_peers) != count(up{job="consul"})
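The expressions above could be wrapped into a Prometheus alerting-rule file along these lines. This is a sketch: the alert names, `for` durations, and `severity` values are assumptions, and only the expressions come from the slide.

```yaml
groups:
  - name: consul_exporter_example
    rules:
      # The exporter itself cannot be scraped.
      - alert: ConsulExporterDown
        expr: up{job="consul_exporter"} == 0
        for: 5m
        labels:
          severity: warning
      # The exporter is up but cannot reach Consul.
      - alert: ConsulDown
        expr: consul_up{job="consul_exporter"} == 0
        for: 2m
        labels:
          severity: critical
      # No raft leader is known.
      - alert: ConsulNoLeader
        expr: consul_raft_leader != 1
        for: 1m
        labels:
          severity: critical
      # Number of raft peers differs from the number of scraped Consul servers.
      - alert: ConsulRaftPeersMismatch
        expr: sum(consul_raft_peers) != count(up{job="consul"})
        for: 10m
        labels:
          severity: warning
```

The `for` clauses keep short blips, such as a leader election, from firing immediately.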

Consul alerts (Consul telemetry)
Is consul running?
up{job="consul"} == 0
Is my cluster healthy?
consul_autopilot_healthy == 0

Configure Vault telemetry
Vault configuration:
telemetry {
  disable_hostname          = true
  prometheus_retention_time = "1h"
}

Prometheus configuration:
scrape_configs:
  - job_name: vault
    metrics_path: '/v1/sys/metrics'
    params:
      format: ["prometheus"]
    static_configs:
      - targets:
          - '<vaultserver1>:8200'
          - '<vaultserver2>:8200'

Vault alerting
Is Vault up?
up{job="vault"} == 0
Is Vault sealed?
vault_core_unsealed == 0
Is audit log working?
rate(vault_audit_log_request_failure[5m]) > 0
rate(vault_audit_log_response_failure[5m]) > 0
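As a rule file, these checks might look like the sketch below. The alert names, durations, and `severity` values are assumptions; `VaultIsSealed` is chosen to match the alert name used in the inhibition example of this talk.

```yaml
groups:
  - name: vault_example
    rules:
      # A Vault server cannot be scraped.
      - alert: VaultDown
        expr: up{job="vault"} == 0
        for: 2m
        labels:
          severity: critical
      # Vault is running but sealed, so it cannot serve secrets.
      - alert: VaultIsSealed
        expr: vault_core_unsealed == 0
        for: 1m
        labels:
          severity: critical
      # Writing audit log entries fails for requests or responses.
      - alert: VaultAuditLogFailure
        expr: |
          rate(vault_audit_log_request_failure[5m]) > 0
          or
          rate(vault_audit_log_response_failure[5m]) > 0
        labels:
          severity: critical
```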

Alert inhibition
• Suppress notifications for some alerts while other alerts are firing.
• Reduces alert noise, e.g. when Vault is sealed.

Configuring inhibition
Alertmanager configuration:
inhibit_rules:
  - source_match:
      alertname: VaultIsSealed
    target_match:
      alertname: ErrorRateTooHigh
    equal: [ datacenter ]
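Newer Alertmanager releases (0.22 and later, to the best of our knowledge) deprecate `source_match`/`target_match` in favour of matcher lists. Assuming such a version, the same rule could be written as:

```yaml
inhibit_rules:
  # While VaultIsSealed fires in a datacenter, mute ErrorRateTooHigh
  # notifications coming from that same datacenter.
  - source_matchers:
      - alertname = VaultIsSealed
    target_matchers:
      - alertname = ErrorRateTooHigh
    equal: [datacenter]
```

The `equal` list matters: without it, a sealed Vault in one datacenter would silence error-rate alerts everywhere.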

Conclusion
• Alerting should come from your end services
• Consul & Vault focused alerts will pinpoint causes
• Specific Vault & Consul alerts can page you (e.g. sealed)
• Draft dashboards based on your needs (response times, errors, etc)

Contact
O11y
https://o11y.eu
info@o11y.eu
