Monitoring Microservices
with Prometheus
Tobias Schmidt - MicroCPH May 17, 2017
github.com/grobie @dagrobie tobidt@gmail.com
Monitoring
● Ability to observe and understand systems and their behavior.
○ Know when things go wrong
○ Understand and debug service misbehavior
○ Detect trends and act in advance
● Blackbox vs. Whitebox monitoring
○ Blackbox: Observes systems externally with periodic checks
○ Whitebox: Provides internally observed metrics
● Whitebox: Different levels of granularity
○ Logging
○ Tracing
○ Metrics
Monitoring
● Metrics monitoring system and time series database
○ Instrumentation (client libraries and exporters)
○ Metrics collection, processing and storage
○ Querying, alerting and dashboards
○ Analysis, trending, capacity planning
○ Focused on infrastructure, not business metrics
● Key features
○ Powerful query language for metrics with label dimensions
○ Stable and simple operation
○ Built for modern dynamic deploy environments
○ Easy setup
● What it’s not
○ Logging system
○ Designed for perfect answers
Prometheus
Instrumentation case study
Gusta: a simple like service
● Service to handle everything around liking a resource
○ List all likes on a resource
○ Create a like on a resource
○ Delete a like on a resource
● Implementation
○ Written in Go
○ Uses the gokit.io toolkit
Gusta overview
// Like represents all information of a single like.
type Like struct {
	ResourceID string    `json:"resourceID"`
	UserID     string    `json:"userID"`
	CreatedAt  time.Time `json:"createdAt"`
}
// Service describes all methods provided by the gusta service.
type Service interface {
	ListResourceLikes(resourceID string) ([]Like, error)
	LikeResource(resourceID, userID string) error
	UnlikeResource(resourceID, userID string) error
}
Gusta core
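For illustration, a minimal in-memory implementation of the `Service` interface could look like the sketch below. The type `memoryService`, the constructor `NewMemoryService`, and the error `ErrAlreadyLiked` are hypothetical names; the real gusta code separates a store from the service and may differ. The "not found" error text is taken from the service log shown in the deck.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type Like struct {
	ResourceID string    `json:"resourceID"`
	UserID     string    `json:"userID"`
	CreatedAt  time.Time `json:"createdAt"`
}

type Service interface {
	ListResourceLikes(resourceID string) ([]Like, error)
	LikeResource(resourceID, userID string) error
	UnlikeResource(resourceID, userID string) error
}

var (
	ErrNotFound     = errors.New("not found")
	ErrAlreadyLiked = errors.New("already liked") // hypothetical error
)

// memoryService keeps likes in a nested map: resourceID -> userID -> Like.
type memoryService struct {
	mu    sync.Mutex
	likes map[string]map[string]Like
}

func NewMemoryService() Service {
	return &memoryService{likes: make(map[string]map[string]Like)}
}

func (s *memoryService) ListResourceLikes(resourceID string) ([]Like, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make([]Like, 0, len(s.likes[resourceID]))
	for _, l := range s.likes[resourceID] {
		out = append(out, l)
	}
	return out, nil
}

func (s *memoryService) LikeResource(resourceID, userID string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.likes[resourceID][userID]; ok {
		return ErrAlreadyLiked
	}
	if s.likes[resourceID] == nil {
		s.likes[resourceID] = make(map[string]Like)
	}
	s.likes[resourceID][userID] = Like{ResourceID: resourceID, UserID: userID, CreatedAt: time.Now()}
	return nil
}

func (s *memoryService) UnlikeResource(resourceID, userID string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.likes[resourceID][userID]; !ok {
		return ErrNotFound
	}
	delete(s.likes[resourceID], userID)
	return nil
}

func main() {
	s := NewMemoryService()
	fmt.Println(s.LikeResource("r1", "u1"))   // <nil>
	fmt.Println(s.UnlikeResource("r1", "u2")) // not found
	likes, _ := s.ListResourceLikes("r1")
	fmt.Println(len(likes)) // 1
}
```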
// main.go
var store gusta.Store
store = gusta.NewMemoryStore()
var s gusta.Service
s = gusta.NewService(store)
s = gusta.LoggingMiddleware(logger)(s)
var h http.Handler
h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP"))
http.Handle("/", h)
if err := http.ListenAndServe(*httpAddr, nil); err != nil {
	logger.Log("exit error", err)
}
Gusta server
./gusta
ts=2017-05-16T19:39:34.938108068Z transport=HTTP addr=:8080
ts=2017-05-16T19:38:24.203071341Z method=LikeResource ResourceID=r1ee85512 UserID=ue86d7a01 took=10.466µs err=null
ts=2017-05-16T19:38:24.323002316Z method=ListResourceLikes ResourceID=r8669fd29 took=17.812µs err=null
ts=2017-05-16T19:38:24.343061775Z method=ListResourceLikes ResourceID=rd4ac47c6 took=30.986µs err=null
ts=2017-05-16T19:38:24.363022818Z method=LikeResource ResourceID=r1ee85512 UserID=u19597d1e took=10.757µs err=null
ts=2017-05-16T19:38:24.38303722Z method=ListResourceLikes ResourceID=rfc9a393a took=41.554µs err=null
ts=2017-05-16T19:38:24.40303802Z method=ListResourceLikes ResourceID=r8669fd29 took=28.115µs err=null
ts=2017-05-16T19:38:24.423045585Z method=ListResourceLikes ResourceID=r8669fd29 took=23.842µs err=null
ts=2017-05-16T19:38:20.843121594Z method=UnlikeResource ResourceID=r1ee85512 UserID=ub5e42f43 took=8.57µs err="not found"
ts=2017-05-16T19:38:20.863037026Z method=ListResourceLikes ResourceID=rfc9a393a took=27.839µs err=null
ts=2017-05-16T19:38:20.883081162Z method=ListResourceLikes ResourceID=r8669fd29 took=16.999µs err=null
Gusta server
Basic Instrumentation
Providing operational insight
● “Four golden signals” cover the essentials
○ Latency
○ Traffic
○ Errors
○ Saturation
● Similar concepts: RED and USE methods
○ RED (for requests): Rate, Errors, Duration
○ USE (for resources): Utilization, Saturation, Errors
● Information about the service itself
● Interaction with dependencies (other services, databases, etc.)
What information should be provided?
● Direct instrumentation
○ Traffic, Latency, Errors, Saturation
○ Service specific metrics (and interaction with dependencies)
○ Prometheus client libraries provide packages to instrument HTTP requests out of the box
● Exporters
○ Utilization, Saturation
○ node_exporter provides CPU, memory, and IO utilization per host
○ wmi_exporter does the same for Windows
○ cAdvisor (Container advisor) provides similar metrics for each container
Where to get the information from?
// main.go
import (
	"os"

	"github.com/prometheus/client_golang/prometheus"
)

var registry = prometheus.NewRegistry()
// Note: newer client_golang releases changed these constructors
// (NewProcessCollector takes a ProcessCollectorOpts struct there).
registry.MustRegister(
	prometheus.NewGoCollector(),
	prometheus.NewProcessCollector(os.Getpid(), ""),
)
// Pass down registry when creating HTTP handlers.
h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP"), registry)
Initializing Prometheus client library
var h http.Handler = listResourceLikesHandler
var method, path string = "GET", "/api/v1/likes/{id}"
requests := prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name:        "gusta_http_server_requests_total",
		Help:        "Total number of requests handled by the gusta HTTP server.",
		ConstLabels: prometheus.Labels{"method": method, "path": path},
	},
	// The "code" label is filled in by the promhttp middleware.
	[]string{"code"},
)
registry.MustRegister(requests)
h = promhttp.InstrumentHandlerCounter(requests, h)
Counting HTTP requests
var h http.Handler = listResourceLikesHandler
var method, path string = "GET", "/api/v1/likes/{id}"
requestDuration := prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "gusta_http_server_request_duration_seconds",
		Help: "A histogram of latencies for requests.",
		// Bucket upper bounds in seconds (250µs to 10ms).
		Buckets:     []float64{.00025, .0005, .001, .0025, .005, .01},
		ConstLabels: prometheus.Labels{"method": method, "path": path},
	},
	[]string{},
)
registry.MustRegister(requestDuration)
h = promhttp.InstrumentHandlerDuration(requestDuration, h)
Observing HTTP request latency
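A histogram stores a count per bucket, and Prometheus exposes those counts cumulatively: each `le` bucket contains every observation less than or equal to its upper bound. The sketch below shows that bookkeeping with the standard library only. The `histogram` type and the three bucket bounds are illustrative assumptions, not the client library's implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// histogram is a simplified stand-in for a Prometheus histogram.
type histogram struct {
	upperBounds []float64 // sorted bucket upper bounds ("le"), in seconds
	counts      []uint64  // per-bucket counts; the last entry is the +Inf bucket
	sum         float64   // exposed as <name>_sum
	count       uint64    // exposed as <name>_count
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{upperBounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// observe records one value in the first bucket whose bound is >= v.
func (h *histogram) observe(v float64) {
	i := sort.SearchFloat64s(h.upperBounds, v) // len(upperBounds) means +Inf
	h.counts[i]++
	h.sum += v
	h.count++
}

// cumulative returns the counts as they appear in the exposition format.
func (h *histogram) cumulative() []uint64 {
	out := make([]uint64, len(h.counts))
	var running uint64
	for i, c := range h.counts {
		running += c
		out[i] = running
	}
	return out
}

func main() {
	h := newHistogram([]float64{0.001, 0.01, 0.1})
	for _, v := range []float64{0.0005, 0.004, 0.004, 0.03, 0.5} {
		h.observe(v)
	}
	fmt.Println(h.cumulative()) // [1 3 4 5]
}
```

Choosing bucket bounds close to the latencies you expect matters: observations that all land in one bucket carry very little information.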
Exposing metrics
Observing the current state
● Prometheus is a pull based monitoring system
○ Instances expose their metrics on an HTTP endpoint
○ Prometheus uses service discovery or static target lists to collect the state periodically
● Centralized management
○ Prometheus decides how often to scrape instances
● Prometheus stores the data on local disk
○ In a big outage, you could run Prometheus on your laptop!
How to collect the metrics?
// main.go
// ...
http.Handle("/metrics", promhttp.HandlerFor(
	registry,
	promhttp.HandlerOpts{},
))
Exposing the metrics via HTTP
curl -s http://localhost:8080/metrics | grep requests
# HELP gusta_http_server_requests_total Total number of requests handled by the gusta HTTP server.
# TYPE gusta_http_server_requests_total counter
gusta_http_server_requests_total{code="200",method="DELETE",path="/api/v1/likes"} 3
gusta_http_server_requests_total{code="200",method="GET",path="/api/v1/likes/{id}"} 429
gusta_http_server_requests_total{code="200",method="POST",path="/api/v1/likes"} 51
gusta_http_server_requests_total{code="404",method="DELETE",path="/api/v1/likes"} 14
gusta_http_server_requests_total{code="409",method="POST",path="/api/v1/likes"} 3
Request metrics
curl -s http://localhost:8080/metrics | grep request_duration
# HELP gusta_http_server_request_duration_seconds A histogram of latencies for requests.
# TYPE gusta_http_server_request_duration_seconds histogram
...
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.00025"} 414
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.0005"} 423
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.001"} 429
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.0025"} 429
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.005"} 429
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.01"} 429
gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="+Inf"} 429
gusta_http_server_request_duration_seconds_sum{method="GET",path="/api/v1/likes/{id}"} 0.047897984
gusta_http_server_request_duration_seconds_count{method="GET",path="/api/v1/likes/{id}"} 429
...
Latency metrics
curl -s http://localhost:8080/metrics | grep process
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 892.78
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 23
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 9.3446144e+07
...
Out-of-the-box process metrics
Collecting metrics
Scraping all service instances
# Scrape all targets every 5 seconds by default.
global:
  scrape_interval: 5s
  evaluation_interval: 5s
scrape_configs:
  # Scrape the Prometheus server itself.
  - job_name: prometheus
    static_configs:
      - targets: [localhost:9090]
  # Scrape the Gusta service.
  - job_name: gusta
    static_configs:
      - targets: [localhost:8080]
Static configuration
scrape_configs:
  # Scrape the Gusta service using Consul.
  - job_name: consul
    consul_sd_configs:
      - server: localhost:8500
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*,prod,.*
        action: keep
      - source_labels: [__meta_consul_service]
        target_label: job
Consul service discovery
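The `keep` rule relies on two details: Prometheus anchors relabeling regexes at both ends, and the Consul tag list is joined with commas and also wrapped in them, so every tag is surrounded by separators. The sketch below imitates that matching with Go's `regexp` package. The `keep` helper is hypothetical and only approximates the behavior.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// keep reports whether a target with the given Consul tags would survive
// the "keep" relabel rule with the given regex.
func keep(pattern string, tags []string) bool {
	// Relabel regexes are fully anchored.
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	// __meta_consul_tags looks like ",prod,web," (separators on both ends).
	value := "," + strings.Join(tags, ",") + ","
	return re.MatchString(value)
}

func main() {
	pattern := `.*,prod,.*`
	fmt.Println(keep(pattern, []string{"prod", "web"})) // true
	fmt.Println(keep(pattern, []string{"web", "prod"})) // true
	fmt.Println(keep(pattern, []string{"preprod"}))     // false
	fmt.Println(keep(pattern, nil))                     // false
}
```

Thanks to the surrounding commas, the tag position does not matter and `preprod` is not mistaken for `prod`.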
Target overview
Simple Graph UI
Simple Graph UI
Dashboards
Human-readable metrics
Grafana example
Alerts
Actionable metrics
ALERT InstanceDown
  IF up == 0
  FOR 5m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Instance down for more than 5 minutes.",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.",
  }
ALERT RunningOutOfFileDescriptors
  IF process_open_fds / process_max_fds * 100 > 95
  FOR 2m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Instance has many open file descriptors.",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has {{ $value }}% open descriptors.",
  }
Alert examples
ALERT GustaHighErrorRate
  IF sum without(code, instance) (rate(gusta_http_server_requests_total{code=~"5.."}[1m]))
     / sum without(code, instance) (rate(gusta_http_server_requests_total[1m]))
     * 100 > 0.1
  FOR 2m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "Gusta service endpoints have a high error rate.",
    description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }} returns {{ $value }}% errors.",
  }
ALERT GustaHighLatency
  IF histogram_quantile(0.95, rate(gusta_http_server_request_duration_seconds_bucket[1m])) > 0.1
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "Gusta service endpoints have a high latency.",
    description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }} has a 95th percentile latency of {{ $value }} seconds.",
  }
Alert examples
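The error-rate expression divides two per-second rates of the same counter. The sketch below shows the arithmetic on raw counter samples, including the usual handling of counter resets. `ratePerSecond` is a simplified stand-in: PromQL's `rate()` additionally extrapolates to the window boundaries, which this version does not. The sample values are invented.

```go
package main

import "fmt"

// sample is one scraped counter value.
type sample struct {
	t float64 // timestamp in seconds
	v float64 // counter value
}

// ratePerSecond returns the average per-second increase over the samples.
// A drop in value is treated as a counter reset (restart from zero).
func ratePerSecond(samples []sample) float64 {
	if len(samples) < 2 {
		return 0
	}
	var increase float64
	for i := 1; i < len(samples); i++ {
		delta := samples[i].v - samples[i-1].v
		if delta < 0 {
			delta = samples[i].v // counter was reset between the two scrapes
		}
		increase += delta
	}
	return increase / (samples[len(samples)-1].t - samples[0].t)
}

func main() {
	// Invented samples covering a one-minute window.
	all := []sample{{0, 1000}, {30, 1300}, {60, 1600}} // every status code
	errs := []sample{{0, 4}, {30, 5}, {60, 7}}         // code=~"5.." only

	total := ratePerSecond(all)
	failed := ratePerSecond(errs)
	pct := failed / total * 100
	fmt.Printf("requests/s=%.2f errors/s=%.2f error%%=%.2f alert=%t\n",
		total, failed, pct, pct > 0.1)
}
```

Here 600 requests and 3 errors in 60 seconds give an error percentage of about 0.5, which is above the 0.1 threshold used in the rule.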
ALERT FilesystemRunningFull
  IF predict_linear(node_filesystem_avail{mountpoint!="/var/lib/docker/aufs"}[6h], 24 * 60 * 60) < 0
  FOR 1h
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Filesystem space is filling up.",
    description = "Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of space within the next 24 hours.",
  }
Alert examples
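`predict_linear` fits a straight line through the samples in the range and extends it into the future. A least-squares version of that idea is sketched below. The `predictLinear` function and the sample data are assumptions for illustration, and the function needs at least two samples with different timestamps.

```go
package main

import "fmt"

// point is one sample of free filesystem space.
type point struct {
	t float64 // timestamp in seconds
	v float64 // available bytes
}

// predictLinear fits v = intercept + slope*(t-now) by least squares and
// returns the predicted value `ahead` seconds after `now`.
func predictLinear(points []point, now, ahead float64) float64 {
	n := float64(len(points))
	var sumT, sumV, sumTV, sumTT float64
	for _, p := range points {
		x := p.t - now // shift so the intercept is the fitted value at `now`
		sumT += x
		sumV += p.v
		sumTV += x * p.v
		sumTT += x * x
	}
	slope := (n*sumTV - sumT*sumV) / (n*sumTT - sumT*sumT)
	intercept := (sumV - slope*sumT) / n
	return intercept + slope*ahead
}

func main() {
	// Invented data: one sample per hour for 6 hours, losing 5 GB per hour.
	var points []point
	for h := 0; h <= 6; h++ {
		points = append(points, point{t: float64(h) * 3600, v: 100e9 - 5e9*float64(h)})
	}
	now := 6.0 * 3600
	predicted := predictLinear(points, now, 24*60*60)
	fmt.Printf("predicted free bytes in 24h: %.3g (alert=%t)\n", predicted, predicted < 0)
}
```

With 70 GB free now and a steady loss of 5 GB per hour, the line crosses zero after 14 hours, so the 24-hour prediction is negative and the alert condition holds.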
Summary
● Monitoring is essential to run, understand and operate services.
● Prometheus
○ Client instrumentation
○ Scrape configuration
○ Querying
○ Dashboards
○ Alert rules
● Important Metrics
○ Four golden signals: Latency, Traffic, Errors, Saturation
● Best practices
Recap
● https://prometheus.io
● Talks, Articles, Videos https://www.reddit.com/r/PrometheusMonitoring/
● Our “StackOverflow” https://www.robustperception.io/blog/
● Ask the community https://prometheus.io/community/
● Google’s SRE book https://landing.google.com/sre/book/index.html
● USE method http://www.brendangregg.com/usemethod.html
● My philosophy on alerting https://goo.gl/UnvYhQ
Sources
Thank you
Tobias Schmidt - MicroCPH May 17, 2017
github.com/grobie - @dagrobie
● High availability
○ Run two identical servers
● Scaling
○ Shard by datacenter / team / service ( / instance )
● Aggregation across Prometheus servers
○ Federation
● Retention time
○ Generic remote storage support available.
● Pull vs. Push
○ Doesn’t matter in practice. Advantages depend on use case.
● Security
○ Prometheus focuses on being a monitoring system; securing it is left to the user.
FAQ

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)
 
Introduction to Prometheus
Introduction to PrometheusIntroduction to Prometheus
Introduction to Prometheus
 
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
 
Monitoring With Prometheus
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With Prometheus
 
Monitoring with prometheus
Monitoring with prometheusMonitoring with prometheus
Monitoring with prometheus
 
An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)An Introduction to Prometheus (GrafanaCon 2016)
An Introduction to Prometheus (GrafanaCon 2016)
 
PromQL Deep Dive - The Prometheus Query Language
PromQL Deep Dive - The Prometheus Query Language PromQL Deep Dive - The Prometheus Query Language
PromQL Deep Dive - The Prometheus Query Language
 
Quarkus k8s
Quarkus   k8sQuarkus   k8s
Quarkus k8s
 
Cloud Monitoring with Prometheus
Cloud Monitoring with PrometheusCloud Monitoring with Prometheus
Cloud Monitoring with Prometheus
 
Getting Started Monitoring with Prometheus and Grafana
Getting Started Monitoring with Prometheus and GrafanaGetting Started Monitoring with Prometheus and Grafana
Getting Started Monitoring with Prometheus and Grafana
 
Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)
 
Continuous Lifecycle London 2018 Event Keynote
Continuous Lifecycle London 2018 Event KeynoteContinuous Lifecycle London 2018 Event Keynote
Continuous Lifecycle London 2018 Event Keynote
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with Prometheus
 
Introduction to GitHub Actions
Introduction to GitHub ActionsIntroduction to GitHub Actions
Introduction to GitHub Actions
 
Demystifying observability
Demystifying observability Demystifying observability
Demystifying observability
 
An Introduction to Prometheus
An Introduction to PrometheusAn Introduction to Prometheus
An Introduction to Prometheus
 
Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...
 
How to monitor your micro-service with Prometheus?
How to monitor your micro-service with Prometheus?How to monitor your micro-service with Prometheus?
How to monitor your micro-service with Prometheus?
 
Kafka High Availability in multi data center setup with floating Observers wi...
Kafka High Availability in multi data center setup with floating Observers wi...Kafka High Availability in multi data center setup with floating Observers wi...
Kafka High Availability in multi data center setup with floating Observers wi...
 
[Outdated] Secrets of Performance Tuning Java on Kubernetes
[Outdated] Secrets of Performance Tuning Java on Kubernetes[Outdated] Secrets of Performance Tuning Java on Kubernetes
[Outdated] Secrets of Performance Tuning Java on Kubernetes
 

Similar a Monitoring microservices with Prometheus

Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic SystemTimely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Accumulo Summit
 

Similar a Monitoring microservices with Prometheus (20)

Using NGINX as an Effective and Highly Available Content Cache
Using NGINX as an Effective and Highly Available Content CacheUsing NGINX as an Effective and Highly Available Content Cache
Using NGINX as an Effective and Highly Available Content Cache
 
ITB2017 - Nginx Effective High Availability Content Caching
ITB2017 - Nginx Effective High Availability Content CachingITB2017 - Nginx Effective High Availability Content Caching
ITB2017 - Nginx Effective High Availability Content Caching
 
OpenTSDB 2.0
OpenTSDB 2.0OpenTSDB 2.0
OpenTSDB 2.0
 
The RED Method: How To Instrument Your Services
The RED Method: How To Instrument Your ServicesThe RED Method: How To Instrument Your Services
The RED Method: How To Instrument Your Services
 
Tracking the Performance of the Web Over Time with the HTTP Archive
Tracking the Performance of the Web Over Time with the HTTP ArchiveTracking the Performance of the Web Over Time with the HTTP Archive
Tracking the Performance of the Web Over Time with the HTTP Archive
 
Akamai Edge: Tracking the Performance of the Web with HTTP Archive
Akamai Edge: Tracking the Performance of the Web with HTTP ArchiveAkamai Edge: Tracking the Performance of the Web with HTTP Archive
Akamai Edge: Tracking the Performance of the Web with HTTP Archive
 
PostgreSQL Performance Problems: Monitoring and Alerting
PostgreSQL Performance Problems: Monitoring and AlertingPostgreSQL Performance Problems: Monitoring and Alerting
PostgreSQL Performance Problems: Monitoring and Alerting
 
Dynamic Infrastructure and Container Monitoring with Prometheus
Dynamic Infrastructure and Container Monitoring with PrometheusDynamic Infrastructure and Container Monitoring with Prometheus
Dynamic Infrastructure and Container Monitoring with Prometheus
 
Log aggregation and analysis
Log aggregation and analysisLog aggregation and analysis
Log aggregation and analysis
 
Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek
 
Docker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic StackDocker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic Stack
 
Metrics with Ganglia
Metrics with GangliaMetrics with Ganglia
Metrics with Ganglia
 
Improving go-git performance
Improving go-git performanceImproving go-git performance
Improving go-git performance
 
observability pre-release: using prometheus to test and fix new software
observability pre-release: using prometheus to test and fix new softwareobservability pre-release: using prometheus to test and fix new software
observability pre-release: using prometheus to test and fix new software
 
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic SystemTimely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
Timely Year Two: Lessons Learned Building a Scalable Metrics Analytic System
 
Native container monitoring
Native container monitoringNative container monitoring
Native container monitoring
 
Native Container Monitoring
Native Container MonitoringNative Container Monitoring
Native Container Monitoring
 
Improving the performance of Odoo deployments
Improving the performance of Odoo deploymentsImproving the performance of Odoo deployments
Improving the performance of Odoo deployments
 
Monitoring with Prometheus
Monitoring with PrometheusMonitoring with Prometheus
Monitoring with Prometheus
 
Redis
RedisRedis
Redis
 

Más de Tobias Schmidt

Más de Tobias Schmidt (7)

Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with Prometheus
 
The history of Prometheus at SoundCloud
The history of Prometheus at SoundCloudThe history of Prometheus at SoundCloud
The history of Prometheus at SoundCloud
 
Efficient monitoring and alerting
Efficient monitoring and alertingEfficient monitoring and alerting
Efficient monitoring and alerting
 
Moving to Kubernetes - Tales from SoundCloud
Moving to Kubernetes - Tales from SoundCloudMoving to Kubernetes - Tales from SoundCloud
Moving to Kubernetes - Tales from SoundCloud
 
Prometheus loves Grafana
Prometheus loves GrafanaPrometheus loves Grafana
Prometheus loves Grafana
 
16 months @ SoundCloud
16 months @ SoundCloud16 months @ SoundCloud
16 months @ SoundCloud
 
Two database findings
Two database findingsTwo database findings
Two database findings
 

Último

Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Christo Ananth
 

Último (20)

BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Vivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design SpainVivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design Spain
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 

Monitoring microservices with Prometheus

  • 1. Monitoring Microservices with Prometheus Tobias Schmidt - MicroCPH May 17, 2017 github.com/grobie @dagrobie tobidt@gmail.com
  • 3. ● Ability to observe and understand systems and their behavior. ○ Know when things go wrong ○ Understand and debug service misbehavior ○ Detect trends and act in advance ● Blackbox vs. Whitebox monitoring ○ Blackbox: Observes systems externally with periodic checks ○ Whitebox: Provides internally observed metrics ● Whitebox: Different levels of granularity ○ Logging ○ Tracing ○ Metrics Monitoring
  • 4. ● Metrics monitoring system and time series database ○ Instrumentation (client libraries and exporters) ○ Metrics collection, processing and storage ○ Querying, alerting and dashboards ○ Analysis, trending, capacity planning ○ Focused on infrastructure, not business metrics ● Key features ○ Powerful query language for metrics with label dimensions ○ Stable and simple operation ○ Built for modern dynamic deploy environments ○ Easy setup ● What it’s not ○ Logging system ○ Designed for perfect answers Prometheus
  • 5. Instrumentation case study Gusta: a simple like service
  • 6. ● Service to handle everything around liking a resource ○ List all liked likes on a resource ○ Create a like on a resource ○ Delete a like on a resource ● Implementation ○ Written in golang ○ Uses the gokit.io toolkit Gusta overview
  • 7. // Like represents all information of a single like. type Like struct { ResourceID string `json:"resourceID"` UserID string `json:"userID"` CreatedAt time.Time `json:"createdAt"` } // Service describes all methods provided by the gusta service. type Service interface { ListResourceLikes(resourceID string) ([]Like, error) LikeResource(resourceID, userID string) error UnlikeResource(resourceID, userID string) error } Gusta core
  • 8. // main.go var store gusta.Store store = gusta.NewMemoryStore() var s gusta.Service s = gusta.NewService(store) s = gusta.LoggingMiddleware(logger)(s) var h http.Handler h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP")) http.Handle("/", h) if err := http.ListenAndServe(*httpAddr, nil); err != nil { logger.Log("exit error", err) } Gusta server
  • 9. ./gusta ts=2017-05-16T19:39:34.938108068Z transport=HTTP addr=:8080 ts=2017-05-16T19:38:24.203071341Z method=LikeResource ResourceID=r1ee85512 UserID=ue86d7a01 took=10.466µs err=null ts=2017-05-16T19:38:24.323002316Z method=ListResourceLikes ResourceID=r8669fd29 took=17.812µs err=null ts=2017-05-16T19:38:24.343061775Z method=ListResourceLikes ResourceID=rd4ac47c6 took=30.986µs err=null ts=2017-05-16T19:38:24.363022818Z method=LikeResource ResourceID=r1ee85512 UserID=u19597d1e took=10.757µs err=null ts=2017-05-16T19:38:24.38303722Z method=ListResourceLikes ResourceID=rfc9a393a took=41.554µs err=null ts=2017-05-16T19:38:24.40303802Z method=ListResourceLikes ResourceID=r8669fd29 took=28.115µs err=null ts=2017-05-16T19:38:24.423045585Z method=ListResourceLikes ResourceID=r8669fd29 took=23.842µs err=null ts=2017-05-16T19:38:20.843121594Z method=UnlikeResource ResourceID=r1ee85512 UserID=ub5e42f43 took=8.57µs err="not found" ts=2017-05-16T19:38:20.863037026Z method=ListResourceLikes ResourceID=rfc9a393a took=27.839µs err=null ts=2017-05-16T19:38:20.883081162Z method=ListResourceLikes ResourceID=r8669fd29 took=16.999µs err=null Gusta server
  • 11. ● “Four golden signals” cover the essentials ○ Latency ○ Traffic ○ Errors ○ Saturation ● Similar concepts: RED and USE methods ○ Request: Rate, Errors, Duration ○ Utilization, Saturation, Errors ● Information about the service itself ● Interaction with dependencies (other services, databases, etc.) What information should be provided?
  • 12. ● Direct instrumentation ○ Traffic, Latency, Errors, Saturation ○ Service specific metrics (and interaction with dependencies) ○ Prometheus client libraries provide packages to instrument HTTP requests out of the box ● Exporters ○ Utilization, Saturation ○ node_exporter CPU, memory, IO utilization per host ○ wmi_exporter does the same for Windows ○ cAdvisor (Container advisor) provides similar metrics for each container Where to get the information from?
  • 13. // main.go import "github.com/prometheus/client_golang/prometheus" var registry = prometheus.NewRegistry() registry.MustRegister( prometheus.NewGoCollector(), prometheus.NewProcessCollector(os.Getpid(), ""), ) // Pass down registry when creating HTTP handlers. h = gusta.MakeHTTPHandler(s, log.With(logger, "component", "HTTP"), registry) Initializing Prometheus client library
  • 14. var h http.Handler = listResourceLikesHandler var method, path string = "GET", "/api/v1/likes/{id}" requests := prometheus.NewCounterVec( prometheus.CounterOpts{ Name: "gusta_http_server_requests_total", Help: "Total number of requests handled by the HTTP server.", ConstLabels: prometheus.Labels{"method": method, "path": path}, }, []string{"code"}, ) registry.MustRegister(requests) h = promhttp.InstrumentHandlerCounter(requests, h) Counting HTTP requests
  • 15. var h http.Handler = listResourceLikesHandler var method, path string = "GET", "/api/v1/likes/{id}" requestDuration := prometheus.NewHistogramVec( prometheus.HistogramOpts{ Name: "gusta_http_server_request_duration_seconds", Help: "A histogram of latencies for requests.", Buckets: []float64{.0025, .005, 0.01, 0.025, 0.05, 0.1}, ConstLabels: prometheus.Labels{"method": method, "path": path}, }, []string{}, ) registry.MustRegister(requestDuration) h = promhttp.InstrumentHandlerDuration(requestDuration, h) Observing HTTP request latency
  • 17. ● Prometheus is a pull-based monitoring system ○ Instances expose their metrics on an HTTP endpoint ○ Prometheus uses service discovery or static target lists to find instances and scrapes their current state periodically ● Centralized management ○ Prometheus decides how often to scrape instances ● Prometheus stores the data on local disk ○ In a big outage, you could run Prometheus on your laptop! How to collect the metrics?
  • 18. // main.go
        // ...
        http.Handle("/metrics", promhttp.HandlerFor(
            registry,
            promhttp.HandlerOpts{},
        ))
        Exposing the metrics via HTTP
  • 19. curl -s http://localhost:8080/metrics | grep requests
        # HELP gusta_http_server_requests_total Total number of requests handled by the gusta HTTP server.
        # TYPE gusta_http_server_requests_total counter
        gusta_http_server_requests_total{code="200",method="DELETE",path="/api/v1/likes"} 3
        gusta_http_server_requests_total{code="200",method="GET",path="/api/v1/likes/{id}"} 429
        gusta_http_server_requests_total{code="200",method="POST",path="/api/v1/likes"} 51
        gusta_http_server_requests_total{code="404",method="DELETE",path="/api/v1/likes"} 14
        gusta_http_server_requests_total{code="409",method="POST",path="/api/v1/likes"} 3
        Request metrics
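The text format shown here is simple enough to read with a few lines of code. The sketch below uses only the Go standard library to sum all series of one metric name; `sumSeries` and the `sample` constant are hypothetical names, and the parser assumes lines of the form `name{labels} value` or `name value` without the optional trailing timestamp. For real use, a proper exposition-format parser is the safer choice.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sample holds the counter lines from the slide.
const sample = `# HELP gusta_http_server_requests_total Total number of requests handled by the gusta HTTP server.
# TYPE gusta_http_server_requests_total counter
gusta_http_server_requests_total{code="200",method="DELETE",path="/api/v1/likes"} 3
gusta_http_server_requests_total{code="200",method="GET",path="/api/v1/likes/{id}"} 429
gusta_http_server_requests_total{code="200",method="POST",path="/api/v1/likes"} 51
gusta_http_server_requests_total{code="404",method="DELETE",path="/api/v1/likes"} 14
gusta_http_server_requests_total{code="409",method="POST",path="/api/v1/likes"} 3
`

// sumSeries adds up the values of all series with the given metric name.
func sumSeries(exposition, name string) (float64, error) {
	var total float64
	scanner := bufio.NewScanner(strings.NewReader(exposition))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		// Skip blank lines and HELP/TYPE comments.
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// Keep only series of exactly this metric name.
		if !strings.HasPrefix(line, name+"{") && !strings.HasPrefix(line, name+" ") {
			continue
		}
		// Assumption: the value is the last space-separated field.
		i := strings.LastIndex(line, " ")
		v, err := strconv.ParseFloat(line[i+1:], 64)
		if err != nil {
			return 0, err
		}
		total += v
	}
	return total, scanner.Err()
}

func main() {
	total, err := sumSeries(sample, "gusta_http_server_requests_total")
	fmt.Println(total, err)
}
```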
  • 20. curl -s http://localhost:8080/metrics | grep request_duration
        # HELP gusta_http_server_request_duration_seconds A histogram of latencies for requests.
        # TYPE gusta_http_server_request_duration_seconds histogram
        ...
        gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.00025"} 414
        gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.0005"} 423
        gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.001"} 429
        gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.0025"} 429
        gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.005"} 429
        gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="0.01"} 429
        gusta_http_server_request_duration_seconds_bucket{method="GET",path="/api/v1/likes/{id}",le="+Inf"} 429
        gusta_http_server_request_duration_seconds_sum{method="GET",path="/api/v1/likes/{id}"} 0.047897984
        gusta_http_server_request_duration_seconds_count{method="GET",path="/api/v1/likes/{id}"} 429
        ...
        Latency metrics
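Cumulative buckets like these are what quantile estimation works on. The sketch below estimates a quantile by linear interpolation inside the bucket that contains the target rank, in the spirit of PromQL's `histogram_quantile`. It is a simplified illustration, not the Prometheus implementation: it works on raw cumulative counts rather than rates, and `bucket`, `quantile` and `slideBuckets` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket is one cumulative histogram bucket.
type bucket struct {
	upperBound float64
	count      float64 // number of observations <= upperBound
}

// slideBuckets returns the bucket values shown on the slide.
func slideBuckets() []bucket {
	return []bucket{
		{0.00025, 414}, {0.0005, 423}, {0.001, 429}, {0.0025, 429},
		{0.005, 429}, {0.01, 429}, {math.Inf(1), 429},
	}
}

// quantile estimates the q-quantile (0 <= q <= 1) from cumulative buckets
// sorted by upper bound, with +Inf as the last bucket.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 || buckets[len(buckets)-1].count == 0 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].count
	rank := q * total
	// Find the first finite bucket that contains the rank.
	i := sort.Search(len(buckets)-1, func(i int) bool { return buckets[i].count >= rank })
	if i == len(buckets)-1 {
		// The rank falls into the +Inf bucket: the best available answer
		// is the largest finite upper bound.
		return buckets[len(buckets)-2].upperBound
	}
	start, below := 0.0, 0.0
	if i > 0 {
		start = buckets[i-1].upperBound
		below = buckets[i-1].count
	}
	end := buckets[i].upperBound
	// Assume observations are spread evenly within the bucket.
	return start + (end-start)*(rank-below)/(buckets[i].count-below)
}

func main() {
	fmt.Printf("p95 ~ %.6fs, mean ~ %.6fs\n",
		quantile(0.95, slideBuckets()), 0.047897984/429)
}
```

For the slide's data the 95th percentile estimate falls inside the first bucket (about 0.246 ms), and the mean derived from `_sum / _count` is about 0.112 ms. The estimate can never be more precise than the bucket boundaries allow.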
  • 21. curl -s http://localhost:8080/metrics | grep process
        # HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
        # TYPE process_cpu_seconds_total counter
        process_cpu_seconds_total 892.78
        # HELP process_max_fds Maximum number of open file descriptors.
        # TYPE process_max_fds gauge
        process_max_fds 1024
        # HELP process_open_fds Number of open file descriptors.
        # TYPE process_open_fds gauge
        process_open_fds 23
        # HELP process_resident_memory_bytes Resident memory size in bytes.
        # TYPE process_resident_memory_bytes gauge
        process_resident_memory_bytes 9.3446144e+07
        ...
        Out-of-the-box process metrics
  • 22. Collecting metrics Scraping all service instances
  • 23. # Scrape all targets every 5 seconds by default.
        global:
          scrape_interval: 5s
          evaluation_interval: 5s

        scrape_configs:
          # Scrape the Prometheus server itself.
          - job_name: prometheus
            static_configs:
              - targets: [localhost:9090]

          # Scrape the Gusta service.
          - job_name: gusta
            static_configs:
              - targets: [localhost:8080]
        Static configuration
  • 24. scrape_configs:
          # Scrape the Gusta service using Consul.
          - job_name: consul
            consul_sd_configs:
              - server: localhost:8500
            relabel_configs:
              - source_labels: [__meta_consul_tags]
                regex: .*,prod,.*
                action: keep
              - source_labels: [__meta_consul_service]
                target_label: job
        Consul service discovery
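The `keep` rule in this configuration can be reasoned about with a small standard-library sketch. It rests on two assumptions that should be checked against the Prometheus documentation for the version in use: relabel regexes are anchored at both ends, and the `__meta_consul_tags` value is the tag list joined by commas with a leading and trailing comma. The `keep` function is a hypothetical helper, not part of Prometheus.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// keep reports whether a target with the given Consul tags would survive
// a "keep" relabel rule using the given regex.
func keep(pattern string, tags []string) bool {
	// Assumption: relabel regexes must match the whole value,
	// emulated here by wrapping the pattern in ^(?:...)$.
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	// Assumption: tags are comma-joined and wrapped in commas, so that
	// ",prod," matches a whole tag and never a substring such as "preprod".
	joined := "," + strings.Join(tags, ",") + ","
	return re.MatchString(joined)
}

func main() {
	fmt.Println(keep(".*,prod,.*", []string{"web", "prod"}))
	fmt.Println(keep(".*,prod,.*", []string{"web", "preprod"}))
}
```

The surrounding commas in the pattern are what make the match exact per tag; without them, `.*prod.*` would also keep targets tagged `preprod`.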
  • 31. ALERT InstanceDown
          IF up == 0
          FOR 5m
          LABELS { severity = "warning" }
          ANNOTATIONS {
            summary = "Instance down for more than 5 minutes.",
            description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for >= 5 minutes.",
          }

        ALERT RunningOutOfFileDescriptors
          IF process_open_fds / process_max_fds * 100 > 95
          FOR 2m
          LABELS { severity = "warning" }
          ANNOTATIONS {
            summary = "Instance has many open file descriptors.",
            description = "{{ $labels.instance }} of job {{ $labels.job }} has {{ $value }}% open descriptors.",
          }
        Alert examples
  • 32. ALERT GustaHighErrorRate
          IF sum without(code, instance) (rate(gusta_http_server_requests_total{code=~"5.."}[1m]))
             / sum without(code, instance) (rate(gusta_http_server_requests_total[1m])) * 100 > 0.1
          FOR 2m
          LABELS { severity = "critical" }
          ANNOTATIONS {
            summary = "Gusta service endpoints have a high error rate.",
            description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }} returns {{ $value }}% errors.",
          }

        ALERT GustaHighLatency
          IF histogram_quantile(0.95, rate(gusta_http_server_request_duration_seconds_bucket[1m])) > 0.1
          LABELS { severity = "critical" }
          ANNOTATIONS {
            summary = "Gusta service endpoints have a high latency.",
            description = "Gusta endpoint {{ $labels.method }} {{ $labels.path }} has a 95th percentile latency of {{ $value }} seconds.",
          }
        Alert examples
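The arithmetic behind the error-rate expression can be sketched in a few lines of standard-library Go. This is a simplification under stated assumptions: it takes the first and last counter value in a window and ignores counter resets and the extrapolation that PromQL's `rate` performs. The `counterWindow` type, `errorPercent` and the sample numbers are hypothetical.

```go
package main

import "fmt"

// counterWindow holds the first and last value of one counter series
// (identified by its HTTP status code) within the rate window.
type counterWindow struct {
	code        string
	first, last float64
}

// errorPercent returns the share of 5xx requests per second relative to
// all requests per second, in percent. With no traffic at all the result
// is NaN (0/0), which is also what PromQL yields for such a division.
func errorPercent(series []counterWindow, windowSeconds float64) float64 {
	var errors, total float64
	for _, s := range series {
		perSecond := (s.last - s.first) / windowSeconds
		total += perSecond
		// Equivalent of the matcher code=~"5..".
		if len(s.code) == 3 && s.code[0] == '5' {
			errors += perSecond
		}
	}
	return errors / total * 100
}

func main() {
	series := []counterWindow{
		{code: "200", first: 1000, last: 1594},
		{code: "404", first: 10, last: 14},
		{code: "500", first: 0, last: 2},
	}
	pct := errorPercent(series, 60)
	fmt.Printf("%.3f%% errors, alert condition met: %v\n", pct, pct > 0.1)
}
```

With 2 errors out of 600 requests in the window, the ratio is one third of a percent, which is above the 0.1 threshold used in the rule.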
  • 33. ALERT FilesystemRunningFull
          IF predict_linear(node_filesystem_avail{mountpoint!="/var/lib/docker/aufs"}[6h], 24 * 60 * 60) < 0
          FOR 1h
          LABELS { severity = "warning" }
          ANNOTATIONS {
            summary = "Filesystem space is filling up.",
            description = "Filesystem on {{ $labels.device }} at {{ $labels.instance }} is predicted to run out of space within the next 24 hours.",
          }
        Alert examples
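The idea behind a `predict_linear` based alert is a least-squares line fitted through the recent samples and extended into the future. The following standard-library sketch shows that calculation; it illustrates the technique and is not Prometheus code. The `point` type, `predictLinear` and the sample values (free space in GB, one sample per hour) are hypothetical.

```go
package main

import "fmt"

// point is one sample: a timestamp in seconds and a value.
type point struct {
	t, v float64
}

// predictLinear fits a least-squares line through the points and returns
// the value predicted for horizon seconds after now.
// Assumption: at least two points with distinct timestamps.
func predictLinear(points []point, now, horizon float64) float64 {
	n := float64(len(points))
	var sumT, sumV, sumTV, sumTT float64
	for _, p := range points {
		x := p.t - now // time relative to the evaluation time
		sumT += x
		sumV += p.v
		sumTV += x * p.v
		sumTT += x * x
	}
	slope := (n*sumTV - sumT*sumV) / (n*sumTT - sumT*sumT)
	intercept := (sumV - slope*sumT) / n // fitted value at the evaluation time
	return intercept + slope*horizon
}

func main() {
	// Free space drops by 5 GB per hour over the last 6 hours, from 100 GB to 70 GB.
	var points []point
	for h := 0.0; h <= 6; h++ {
		points = append(points, point{t: h * 3600, v: 100 - 5*h})
	}
	predicted := predictLinear(points, 6*3600, 24*60*60)
	fmt.Printf("predicted free space in 24h: %.1f GB, alert condition met: %v\n",
		predicted, predicted < 0)
}
```

A negative prediction means the fitted trend crosses zero within the horizon, which is exactly the condition the rule tests for. The `FOR 1h` clause in the rule then guards against short-lived trends.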
  • 35. ● Monitoring is essential to run, understand and operate services. ● Prometheus ○ Client instrumentation ○ Scrape configuration ○ Querying ○ Dashboards ○ Alert rules ● Important metrics ○ Four golden signals: Latency, Traffic, Errors, Saturation ● Best practices Recap
  • 36. ● https://prometheus.io ● Talks, Articles, Videos https://www.reddit.com/r/PrometheusMonitoring/ ● Our “StackOverflow” https://www.robustperception.io/blog/ ● Ask the community https://prometheus.io/community/ ● Google’s SRE book https://landing.google.com/sre/book/index.html ● USE method http://www.brendangregg.com/usemethod.html ● My philosophy on alerting https://goo.gl/UnvYhQ Sources
  • 37. Thank you Tobias Schmidt - MicroCPH May 17, 2017 github.com/grobie - @dagrobie
  • 38. ● High availability ○ Run two identical servers ● Scaling ○ Shard by datacenter / team / service ( / instance ) ● Aggregation across Prometheus servers ○ Federation ● Retention time ○ Generic remote storage support available. ● Pull vs. Push ○ Doesn’t matter in practice. Advantages depend on use case. ● Security ○ Left to the user; the project focuses on being a monitoring system. FAQ