Se ha denunciado esta presentación.
Utilizamos tu perfil de LinkedIn y tus datos de actividad para personalizar los anuncios y mostrarte publicidad más relevante. Puedes cambiar tus preferencias de publicidad en cualquier momento.
@snehainguva
prometheus everything,
observing kubernetes in the
cloud
digitalocean.com
digitalocean.com
about me
software engineer @DigitalOcean
former delivery, currently observability
kubernetes, prometheus
digitalocean.com
Some stats
digitalocean.com
15 kubernetes clusters
12 data centers
300+ production applications
digitalocean.com
2 promethei +
1 alertmanager per cluster
1.5 million+ timeseries
99218 samples/sec
(note: data-center wid...
digitalocean.com
the plan:
● the pre-kubernetes days
● kubernetes at DigitalOcean (aka docc)
● prometheus + alertmanager a...
digitalocean.com
pre-kubernetes:
service owners write an application
provision a server with chef or ansible
use a CI/CD p...
digitalocean.com
pre-kubernetes:
use nagios + various plugins to monitor
use collectd + application metrics + statsd +
gra...
digitalocean.com
pre-kubernetes:
longer to provision host than write actual service
blackbox monitoring NOT insightful
whi...
digitalocean.com
docc: Digital Ocean Command Center
A tool for deploying containerized,
stateless applications
digitalocean.com
What is kubernetes?
Container orchestration tool from Google
digitalocean.com
What is docc?
An abstraction layer on top of kubernetes
CLI DOCCSERVER
deployment → pods
service
digitalocean.com
post-docc:
service owners write an application
service owner dockerizes application
describe application ...
digitalocean.com
post-docc:
deployments and updates take minutes, not hours
view running applications
get application logs...
digitalocean.com
But what about monitoring?
digitalocean.com
Let’s use
prometheus + alertmanager
digitalocean.com
service
promconfig
alertconfigalertmanager
docc
deployment → pods
digitalocean.com
instrument your application
use prometheus golang client
expose metrics endpoint
1
digitalocean.com
specify metrics, ports, alerts in your manifest file
Which metrics endpoint should be scraped?
Which cont...
digitalocean.com
use docc CLI to deploy your application
service
CLI doccserver
$ docc deploy manifest.json
3
annotations ...
digitalocean.com
prometheus talks to the kubernetes api and grabs
the metrics endpoint and port information
promconfigserv...
digitalocean.com
promconfig grabs alert information and
rewrites prometheus rules file
promconfigservice
5
digitalocean.com
alertconfig grabs alert routes and
rewrites alertmanager configuration file
service alertmanager alertcon...
digitalocean.com
What should we monitor?
digitalocean.com
latency
traffic
error
saturation
Request
Errors
Duration
4 Golden Signals
request-based system metrics
digitalocean.com
Brendan Gregg’s USE-ful metrics
Utilization
Saturation
Error
“Solves 80% of server issues with 5% of the
...
digitalocean.com
counters: cumulative, increasing metric
gauges: single metric that goes up or down
histograms: samples an...
digitalocean.com
Putting it all together...
digitalocean.com
service metric: traffic
how much demand is placed on the system
loadbalancer backend traffic
sum(rate(hap...
digitalocean.com
cluster metric: utilization
average time resource is busy servicing work
cluster CPU utilization
(sum(rat...
digitalocean.com
How should we alert?
digitalocean.com
Threshold alerts
Do any of the aforementioned metrics exceed a
lower or upper bound?
digitalocean.com
Threshold alerts
Are more than 80% of cluster CPU cores being utilized?
(sum(rate(container_cpu_usage_sec...
digitalocean.com
State-based alerts
Is there a divergence between expected
state and actual state of a service?
digitalocean.com
State-based alerts
Is my service up and/or scrape-able?
absent(up{kubernetes_name="doccserver"}) or
sum(u...
digitalocean.com
Common pitfalls
digitalocean.com
Pitfall #1: Alerting fatigue
digitalocean.com
Solution: Slack and/or Pagerduty
send only the most urgent, production alerts to pagerduty
try out differ...
digitalocean.com
Pitfall #2: Who owns what?
digitalocean.com
Solution: opinionated manifest file
services owner must include maintainer information
alerts themselves ...
digitalocean.com
Pitfall #3: Meta-monitoring
digitalocean.com
Solution: Duplicate promethei and HA
alertmanager
alertmanager
alertmanager
alertmanager
digitalocean.com
Solution: Deadman’s switch
ALERT JustKeepSwimming
IF vector(1)
elastalert
digitalocean.com
digitalocean.com
#1: Automated alerts
utilize user-defined memory and cpu limits for
threshold alerts
automatic state-base...
digitalocean.com
#2: Leverage metrics for autopilot
user trusts in our custom controllers and
schedulers
collect metrics a...
digitalocean.com
#3: Leverage metrics for autoscaling
services based on resource usage, #
connections, etc.
loadbalancers ...
digitalocean.com
a brave new world of container
orchestration
prometheus + alertmanager are
awesome!
extensibility
thanks!
@snehainguva
● The best prometheus tutorials you will ever
read, Julius Volz
● Actual Prometheus Website
● Kuberne...
Próxima SlideShare
Cargando en…5
×

Prometheus Everything, Observing Kubernetes in the Cloud

1.532 visualizaciones

Publicado el

Promcon 2017 Slides!

Publicado en: Ingeniería

Prometheus Everything, Observing Kubernetes in the Cloud

  1. 1. @snehainguva
  2. 2. prometheus everything, observing kubernetes in the cloud digitalocean.com
  3. 3. digitalocean.com about me software engineer @DigitalOcean former delivery, currently observability kubernetes, prometheus
  4. 4. digitalocean.com Some stats
  5. 5. digitalocean.com 15 kubernetes clusters 12 data centers 300+ production applications
  6. 6. digitalocean.com 2 promethei + 1 alertmanager per cluster 1.5 million+ timeseries 99218 samples/sec (note: data-center wide scraping is at 550k samples/sec) +
  7. 7. digitalocean.com the plan: ● the pre-kubernetes days ● kubernetes at DigitalOcean (aka docc) ● prometheus + alertmanager and kubernetes ● alerting in action: examples ● potential pitfalls ● next steps
  8. 8. digitalocean.com pre-kubernetes: service owners write an application provision a server with chef or ansible use a CI/CD pipeline, bash scripts, or other tools to deploy and update application on a VM
  9. 9. digitalocean.com pre-kubernetes: use nagios + various plugins to monitor use collectd + application metrics + statsd + graphite push data to openTSDB
  10. 10. digitalocean.com pre-kubernetes: longer to provision host than write actual service blackbox monitoring NOT insightful whitebox monitoring services NOT easily queryable
  11. 11. digitalocean.com docc: Digital Ocean Command Center A tool for deploying containerized, stateless applications
  12. 12. digitalocean.com What is kubernetes? Container orchestration tool from Google
  13. 13. digitalocean.com What is docc? An abstraction layer on top of kubernetes CLI DOCCSERVER deployment → pods service
  14. 14. digitalocean.com post-docc: service owners write an application service owner dockerizes application describe application in json manifest file deploy!
  15. 15. digitalocean.com post-docc: deployments and updates take minutes, not hours view running applications get application logs easily scale, update, or restart applications
  16. 16. digitalocean.com But what about monitoring?
  17. 17. digitalocean.com Let’s use prometheus + alertmanager
  18. 18. digitalocean.com service promconfig alertconfigalertmanager docc deployment → pods
  19. 19. digitalocean.com instrument your application use prometheus golang client expose metrics endpoint 1
  20. 20. digitalocean.com specify metrics, ports, alerts in your manifest file Which metrics endpoint should be scraped? Which container port needs to be exposed? Specify alerting rule, duration interval, and channel. 2
  21. 21. digitalocean.com use docc CLI to deploy your application service CLI doccserver $ docc deploy manifest.json 3 annotations contain rules and receiver info deployment → pods
  22. 22. digitalocean.com prometheus talks to the kubernetes api and grabs the metrics endpoint and port information promconfigservice 4
  23. 23. digitalocean.com promconfig grabs alert information and rewrites prometheus rules file promconfigservice 5
  24. 24. digitalocean.com alertconfig grabs alert routes and rewrites alertmanager configuration file service alertmanager alertconfig 6
  25. 25. digitalocean.com What should we monitor?
  26. 26. digitalocean.com latency traffic error saturation Request Errors Duration 4 Golden Signals request-based system metrics
  27. 27. digitalocean.com Brendan Gregg’s USE-ful metrics Utilization Saturation Error “Solves 80% of server issues with 5% of the effort.”
  28. 28. digitalocean.com counters: cumulative, increasing metric gauges: single metric that goes up or down histograms: samples and buckets observations summaries: samples observations, specify quantile prom metrics types
  29. 29. digitalocean.com Putting it all together...
  30. 30. digitalocean.com service metric: traffic how much demand is placed on the system loadbalancer backend traffic sum(rate(haproxy_backend_bytes_out_total{ kubernetes_name="loadbalancer", backend="tls_default_neptune_nyc3_internal_digitalocean_com"} [1m])) BY (backend) fxn: rate() and sum() metric type: counter labels
  31. 31. digitalocean.com cluster metric: utilization average time resource is busy servicing work cluster CPU utilization (sum(rate(container_cpu_usage_seconds_total{id="/"}[5m])) / sum(machine_cpu_cores)) fxn: sum() and rate() metric type: counter
  32. 32. digitalocean.com How should we alert?
  33. 33. digitalocean.com Threshold alerts Do any of the aforementioned metrics exceed a lower or upper bound?
  34. 34. digitalocean.com Threshold alerts Are more than 80% of cluster CPU cores being utilized? (sum(rate(container_cpu_usage_seconds_total{id="/"}[5m])) / sum(machine_cpu_cores))* 100 > 80
  35. 35. digitalocean.com State-based alerts Is there a divergence between expected state and actual state of a service?
  36. 36. digitalocean.com State-based alerts Is my service up and/or scrape-able? absent(up{kubernetes_name="doccserver"}) or sum(up{kubernetes_name="doccserver"}) == 0
  37. 37. digitalocean.com Common pitfalls
  38. 38. digitalocean.com Pitfall #1: Alerting fatigue
  39. 39. digitalocean.com Solution: Slack and/or Pagerduty send only the most urgent, production alerts to pagerduty try out different promQL queries to have less spikey metrics
  40. 40. digitalocean.com Pitfall #2: Who owns what?
  41. 41. digitalocean.com Solution: opinionated manifest file services owner must include maintainer information alerts themselves include descriptions and summaries with several labels alerts must include team-specific receivers
  42. 42. digitalocean.com Pitfall #3: Meta-monitoring
  43. 43. digitalocean.com Solution: Duplicate promethei and HA alertmanager alertmanager alertmanager alertmanager
  44. 44. digitalocean.com Solution: Deadman’s switch ALERT JustKeepSwimming IF vector(1) elastalert
  45. 45. digitalocean.com
  46. 46. digitalocean.com #1: Automated alerts utilize user-defined memory and cpu limits for threshold alerts automatic state-based alerts
  47. 47. digitalocean.com #2: Leverage metrics for autopilot user trusts in our custom controllers and schedulers collect metrics and build model about resource usage over time accordingly adjust limits and alerts
  48. 48. digitalocean.com #3: Leverage metrics for autoscaling services based on resource usage, # connections, etc. loadbalancers based on # of frontend and backend connections # of worker nodes based on memory and cpu capacity metrics
  49. 49. digitalocean.com a brave new world of container orchestration prometheus + alertmanager are awesome! extensibility
  50. 50. thanks! @snehainguva ● The best prometheus tutorials you will ever read, Julius Volz ● Actual Prometheus Website ● Kubernetes Project

×