Se ha denunciado esta presentación.
Utilizamos tu perfil de LinkedIn y tus datos de actividad para personalizar los anuncios y mostrarte publicidad más relevante. Puedes cambiar tus preferencias de publicidad en cualquier momento.

Monitoring, the Prometheus Way - Julius Voltz, Prometheus

3.508 visualizaciones

Publicado el

Prometheus is an opinionated metrics collection and monitoring system that is particularly well suited to accommodate modern workloads like containers and micro-services. To achieve these goals, it radically breaks away from existing systems and follows very different design principles. In this talk, Prometheus founder Julius Volz will explain these design principles and how they apply to dockerized applications. This will provide insight useful to newcomers wanting to start on the right foot in the land of container monitoring, but also to veterans wanting to quickly map their existing knowledge to Prometheus concepts. In particular, a demo will show Prometheus in action together with a Docker Swarm cluster.

Publicado en: Tecnología
  • Sé el primero en comentar

Monitoring, the Prometheus Way - Julius Voltz, Prometheus

  1. 1. Monitoring, the Prometheus Way Julius Volz Co-founder, Prometheus
  2. 2. Monitoring system and TSDB: • Instrumentation • Metrics collection and storage • Querying, alerting, dashboarding • For all levels of the stack! Made for dynamic cloud environments. What is Prometheus
  3. 3. We don’t do: • Logging or tracing • Automatic anomaly detection • Scalable or durable storage What is it not?
  4. 4. • Started 2012 at SoundCloud • Fully publicised in 2015 • Now part of CNCF Origin
  5. 5. Motivation SoundCloud in 2012: • Early dynamic cluster scheduler • Hundreds of microservices • Thousands of service instances → Hard to monitor with StatsD/Graphite and other existing tools → Finally decided to build new solution
  6. 6. Architecture Prometheus web app API server node exporter mysqld exporter cAdvisor Targets Service Discovery (DNS, Kubernetes, AWS, Consul, custom...) Grafana HTTP API UI Alertmanager
  7. 7. • Dimensional data model • Query language • Simplicity + efficiency • Service discovery integration Selling Points
  8. 8. Data Model What is a time series? <identifier> → [ (t0, v0), (t1, v1), … ]
  9. 9. <identifier> → [ (t0, v0), (t1, v1), … ] Data Model What is a time series? What is this? int64 float64
  10. 10. Data Model nginx.ip-1-2-3-4-80.home.200.http_requests_total nginx.ip-1-2-3-5-80.settings.500.http_requests_total nginx.ip-1-2-3-5-80.settings.400.http_requests_total nginx.ip-1-2-3-5-80.home.200.http_requests_total • Implies hierarchy that doesn’t exist • User-level encoding of semantics • Hard to extend Graphite / StatsD:
  11. 11. Data Model nginx.ip-1-2-3-4-80.home.200.http_requests_total nginx.ip-1-2-3-5-80.settings.500.http_requests_total nginx.ip-1-2-3-5-80.settings.400.http_requests_total nginx.ip-1-2-3-5-80.home.200.http_requests_total http_requests_total{job="nginx",instance="1.2.3.4:80",path="/home",status="200"} http_requests_total{job="nginx",instance="1.2.3.5:80",path="/settings",status="500"} http_requests_total{job="nginx",instance="1.2.3.5:80",path="/settings",status="400"} http_requests_total{job="nginx",instance="1.2.3.4:80",path="/home",status="200"} Graphite / StatsD: Prometheus:
  12. 12. Selecting Series *.nginx.*.*.500.*.http_requests_total http_requests_total{job="nginx",status="500"} → Want label dimensions as first-class citizens.
  13. 13. PromQL ● New query language ● Great for time series computations ● Not SQL-style, but functional Querying
  14. 14. All partitions in my entire infrastructure with more than 100GB capacity that are not mounted on root? Querying node_filesystem_bytes_total{mountpoint!=”/”} / 1e9 > 100 {device="sda1", mountpoint="/home”, instance=”10.0.0.1”} 118.8 {device="sda1", mountpoint="/home”, instance=”10.0.0.2”} 118.8 {device="sdb1", mountpoint="/data”, instance=”10.0.0.2”} 451.2 {device="xdvc", mountpoint="/mnt”, instance=”10.0.0.3”} 320.0
  15. 15. What’s the ratio of request errors across all service instances? Querying sum(rate(http_requests_total{status="500"}[5m])) / sum(rate(http_requests_total[5m])) {} 0.029
  16. 16. What’s the ratio of request errors across all service instances? Querying sum by(path) (rate(http_requests_total{status="500"}[5m])) / sum by(path) (rate(http_requests_total[5m])) {path="/status"} 0.0039 {path="/"} 0.0011 {path="/api/v1/topics/:topic"} 0.087 {path="/api/v1/topics} 0.0342
  17. 17. 99th percentile request latency across all instances? Querying histogram_quantile(0.99, sum without(instance) (rate(request_latency_seconds_bucket[5m])) ) {path="/status", method="GET"} 0.012 {path="/", method="GET"} 0.43 {path="/api/v1/topics/:topic", method="POST"} 1.31 {path="/api/v1/topics, method="GET"} 0.192
  18. 18. Expression browser
  19. 19. Built-in graphing
  20. 20. Dashboarding
  21. 21. • Local storage, no clustering • HA by running two • Go: static binary Operational Simplicity Prometheus data
  22. 22. Local storage is scalable enough for many orgs: • 1 million+ samples/s • Millions of series • 1.3 bytes per sample Good for keeping a few weeks or months of data. Efficiency
  23. 23. Decoupled Remote Storage Prometheus data Adapter Remote Storage custom protocol generic remote read/write protocol → Prometheus 1.6! For scalable, durable, long-term storage.
  24. 24. • On-demand VMs (EC2, Azure, GCP, ...) • Dynamically scheduled service instances (Docker Swarm, Kubernetes, ...) • Microservices → many services, dynamic hosts, and ports How to make sense of this all? Dynamic Environments ...pose new challenges:
  25. 25. • ...know what should be there • ...decide where to pull from • ...add dimensional metadata to series Service Discovery Use service discovery to:
  26. 26. • VM providers (AWS, Azure, Google, ...) • Cluster managers (Kubernetes, Marathon, …) • Generic mechanisms (DNS, Consul, Zookeeper, custom, ...) What about Docker? Service Discovery Prometheus has built-in support for:
  27. 27. Docker Service Discovery Not settled yet, some approaches:
  28. 28. • Works today • Port info not included • No task / service metadata • Doesn’t solve cross-Docker-network pulling → Already good for host and container metrics. Docker Service Discovery DNS A records (tasks.<service>)
  29. 29. • Requires Docker socket access / run on manager • Handles Docker network attaching • Provides more metadata Docker Service Discovery Swarm API (https://github.com/ContainerSolutions/prometheus-swarm-discovery )
  30. 30. • Access SD / Swarm API from any host • Allow Prometheus to scrape any networks (somehow) Discuss: docker/docker - Issue #27307 Docker Service Discovery Ideal would be:
  31. 31. Add engine flags: --experimental=true --metrics-addr=0.0.0.0:4999 $ curl localhost:4999/metrics | more # HELP engine_daemon_container_actions_seconds The number of seconds [...] # TYPE engine_daemon_container_actions_seconds histogram engine_daemon_container_actions_seconds_bucket{action="changes",le="0.005"} 1 engine_daemon_container_actions_seconds_bucket{action="changes",le="0.01"} 1 engine_daemon_container_actions_seconds_bucket{action="changes",le="0.025"} 1 engine_daemon_container_actions_seconds_bucket{action="changes",le="0.05"} 1 engine_daemon_container_actions_seconds_bucket{action="changes",le="0.1"} 1 engine_daemon_container_actions_seconds_bucket{action="changes",le="0.25"} 1 ... Docker Metrics Native Prometheus metrics in Docker engine since 1.13:
  32. 32. cAdvisor node exporter Docker engine cAdvisor node exporter Docker engine DEMO Prometheus Docker engine node exporter cAdvisor Targets Service Discovery DNS A records: “tasks.<service>” Grafana

×