3. Kris Buytaert
• I used to be a developer
• Then I became an Ops person
• Chief Trolling/Travel/Technical Officer @ Inuits.eu
• Chief Yak Shaver @ o11y.eu
• Organiser of #devopsdays, #cfgmgmtcamp, #loadays, ...
• Cofounder of all of the above
• Everything is a Freaking DNS Problem
• DNS : devops needs sushi
• @krisbuytaert on twitter/github
O11y 1
4. Julien Pivotto
• Prometheus maintainer
• Open Source Observability Expert
• Principal Software Architect & CoFounder @ o11y.eu
• DevOps believer
• @roidelapluie on twitter/github
5. O11y
• Inuits.eu Spinoff
• Open Source Observability
• Currently supporting the Prometheus Ecosystem
• Professional Services & Support (now)
• Long Term Enterprise Support (next month)
• Prometheus Distribution (soon)
7. July 2008 Ottawa Linux Symposium Paper
• Bloated Java Tools
• Dysfunctional Open Core Software
• DBA Required
• Nagios was king in the Open Source world
8. June 2011 #monitoringsucks
• John Vincent (@lusis), June 2011
• A #devops sub-movement
• (manual configuration, not in sync with reality, hosts only, services sometimes, applications never)
9. October 2011 #monitoringlove
• Ulf Mansson, #devopsdays Rome 2011
• A newfound love for monitoring
• Triggered by { New Open Source Tools * Automation }
11. What is monitoring?
• High level overview of the state of a service/component
• Availability
• Technical components
• Performance ?
What is going on?
12. Pitfalls of traditional monitoring
• Drift from reality
• Total lack of automation
• Total lack of automation
• Total lack of automation
• Total lack of automation
• Partial automation
• Lots of work to maintain
• Binary states: it works - it does not work
• Alert fatigue
• Alert fatigue
• Alert fatigue
• Alert fatigue
13. What is observability?
• Understand how your services behave
• As if you were in their place
• Without incident-specific code
Why is this going on?
14. How do monitoring and observability connect?
• Monitoring is required
• If lucky, monitoring is enough
• Observability is removing luck <- @roidelapluie
15. What is observability - in Practice?
Three pillars:
• Metrics
• Logs
• Traces
20. Prometheus
• Prometheus is an Open Source CNCF Project
• Collects and stores metrics
• Pull-based
• Service discovery (including Consul)
• Alerting
21. The Prometheus ecosystem
• Exporters for every piece of the infra
• Maintained by multiple companies
• Long-Term Support release coming Q3 2022
22. Prometheus data model
• Metrics have labels
• Labels differentiate metrics, e.g.:
• HTTP response code
• Datacenter name
23. PromQL
• Prometheus Query Language
• Powerful yet simple query language
rate(http_requests_total[5m])
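A query like the one above can be combined with labels and stored as a recording rule. The following is a minimal sketch, not taken from the talk: it assumes that http_requests_total carries the labels `code` and `datacenter` (hypothetical names for the HTTP response code and datacenter name), and the group and rule names are made up for illustration.

```yaml
# Prometheus rule file (sketch). Label names "code" and "datacenter"
# are assumptions; adjust them to the labels your metrics really carry.
groups:
  - name: http_requests            # hypothetical group name
    rules:
      # Per-second request rate over 5 minutes, aggregated so that
      # only the datacenter and response code labels remain.
      - record: datacenter_code:http_requests:rate5m
        expr: sum by (datacenter, code) (rate(http_requests_total[5m]))
```

Dashboards and alerts can then query the precomputed series instead of re-evaluating the rate each time.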
25. Observing your services
• consul_sd_configs
• Stream consul services list to Prometheus
• Up-to-date service list
• Use the flexibility of labels
• Add relevant labels
• Filter targets
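The points above could translate into a scrape configuration along these lines. This is a sketch under assumptions: the Consul agent address `localhost:8500`, the job name, and the `prometheus` service tag used for filtering are illustrative choices, not values from the talk.

```yaml
# prometheus.yml fragment (sketch)
scrape_configs:
  - job_name: consul-services          # hypothetical job name
    consul_sd_configs:
      - server: 'localhost:8500'       # assumed local Consul agent
    relabel_configs:
      # Filter targets: keep only services tagged "prometheus".
      # __meta_consul_tags is the tag list joined by commas, with a
      # leading and trailing comma, e.g. ",web,prometheus,".
      - source_labels: [__meta_consul_tags]
        regex: '.*,prometheus,.*'
        action: keep
      # Add relevant labels: Consul service name and datacenter.
      - source_labels: [__meta_consul_service]
        target_label: service
      - source_labels: [__meta_consul_dc]
        target_label: datacenter
```

Because Prometheus follows the Consul catalog, services registered or deregistered in Consul appear in or drop out of the target list without a configuration reload.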
27. Alerting philosophy
• Page on actionable critical failure
• Avoid paging on Consul Health Check failure
• Keep “ambiance” alerts to get the atmosphere and quickly find the cause
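One way to express this split is in the Alertmanager routing tree. The sketch below assumes alerts carry a `severity` label and uses two hypothetical receivers, `pager` and `chat`, whose integrations are left out.

```yaml
# alertmanager.yml fragment (sketch). Receiver names and the
# "severity" label convention are assumptions for illustration.
route:
  receiver: chat                 # default: "ambiance" alerts, nobody is paged
  routes:
    - matchers:
        - severity = "critical"  # actionable critical failures only
      receiver: pager
receivers:
  - name: chat                   # add a chat integration here
  - name: pager                  # add a paging integration here
```

Alerts based on Consul health checks would then keep a non-critical severity, so they stay visible for diagnosis without paging anyone.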
29. consul_exporter
• Exporter maintained by the Prometheus team
• Exposes Consul cluster health
• Optionally exposes key/values
• e.g. store desired state in KV for graphing
• Connects to a single Consul instance
33. Consul alerts (consul_exporter)
Is Consul running?
up{job="consul_exporter"} == 0
consul_up{job="consul_exporter"} == 0
Is there a leader?
consul_raft_leader != 1
Are all peers in Raft?
sum(consul_raft_peers) != count(up{job="consul"})
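The expressions above can be wrapped into a Prometheus rule file. This is a sketch: the alert names, `for` durations, and severity values are assumptions chosen for illustration, and only the expressions come from the slide.

```yaml
# Rule file for the consul_exporter checks (sketch)
groups:
  - name: consul_exporter
    rules:
      - alert: ConsulExporterDown       # the exporter itself cannot be scraped
        expr: up{job="consul_exporter"} == 0
        for: 5m                         # assumed duration
        labels:
          severity: warning             # assumed severity convention
      - alert: ConsulDown               # the exporter cannot reach Consul
        expr: consul_up{job="consul_exporter"} == 0
        for: 5m
        labels:
          severity: warning
      - alert: ConsulNoRaftLeader
        expr: consul_raft_leader != 1
        for: 2m
        labels:
          severity: critical
      - alert: ConsulRaftPeersMismatch  # fewer or more peers than scraped servers
        expr: sum(consul_raft_peers) != count(up{job="consul"})
        for: 10m
        labels:
          severity: warning
```

The `for` clause keeps a single failed scrape from firing an alert, which helps against alert fatigue.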
34. Consul alerts (Consul telemetry)
Is Consul running?
up{job="consul"} == 0
Is my cluster healthy?
consul_autopilot_healthy == 0
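These queries rely on Prometheus scraping Consul's own telemetry under the job name `consul`. A possible scrape job is sketched below; the target address is a placeholder, and it assumes Prometheus-format telemetry is enabled on the Consul agents (the `prometheus_retention_time` telemetry setting).

```yaml
# prometheus.yml fragment (sketch)
scrape_configs:
  - job_name: consul                     # matches job="consul" in the queries
    metrics_path: /v1/agent/metrics      # Consul agent metrics endpoint
    params:
      format: ['prometheus']             # request the Prometheus text format
    static_configs:
      - targets:
          - 'consul-server-1.example.org:8500'   # placeholder address
```

If ACLs are enabled on the cluster, the scrape job also needs a token with permission to read agent metrics.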
43. Conclusion
• Alerting should come from your end services
• Consul & Vault focused alerts will pinpoint causes
• Specific Vault & Consul alerts can page you (e.g. sealed)
• Draft dashboards based on your needs (response times, errors, etc.)