Efficient monitoring and alerting

Container environments make it easy to deploy hundreds of microservices in today’s infrastructures. Monitoring thousands of metrics efficiently brings new challenges: keeping insight, avoiding alert fatigue, and maintaining a high development velocity. In this talk I’ll present an overview of important metrics, including the four golden signals; discuss strategies to organize alerting efficiently; give insight into SoundCloud’s monitoring history; and highlight a few success and failure stories.

Published in: Engineering

  1. Efficient monitoring in modern environments. Tobias Schmidt, ContainerDays Hamburg 2016. @dagrobie, github.com/grobie
  2. Introduction. About myself: Production Engineer for 5+ years; container orchestration (in-house, Kubernetes); service discovery; monitoring (Prometheus); production readiness.
  3. Monitoring
  4. Monitoring: "Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, processing times, and server lifetimes." (Site Reliability Engineering, O’Reilly 2016)
  5. Monitoring
  6. Monitoring: why monitor? Enable automatic alerting; analyze long-term trends; validate new features, experiments, and implementations; debug.
  7. Monitoring: blackbox vs. whitebox. Blackbox: externally observed; what the user sees. Whitebox: data exposed by the system; makes it possible to act on imminent issues.
  8. Metrics
  9. Metrics: instrument everything. Host (CPU, memory, I/O, network, filesystem, …); container (CPU, memory, restarts, OOM, throttling, …); applications (throughput, latency, queues, …).
  10. Metrics: export detailed metrics. Attach all relevant information; use aggregations later in alerts and dashboards.
  11. Metrics: four golden signals. Minimum set of metrics every service should have; coined by Google SRE.
  12. Four golden signals: latency. Time to serve user requests; the median doesn’t reflect user experience.
  13. Four golden signals: traffic. Demand placed on a system (HTTP requests, network throughput, transactions, …).
  14. Four golden signals: errors. Failure responses to user requests.
  15. Four golden signals: saturation and utilization. Consumption of constrained resources (memory, I/O, CPU slices, …).
  16. Alerting
  17. Alerting: use symptom-based alerting. Monitor for your users; four golden signals (traffic is tricky); only page if something needs immediate human intervention.
  18. Alerting: prevent alert fatigue. Alert grouping; provide easy silencing; dependencies; avoid static thresholds.
  19. Alerting: use a ticketing system. Avoid email spam; warnings are tasks like new features.
  20. Alerting: provide runbooks (playbooks). Keep them concise; explanation, hints, links; dynamic (include recent observations); discuss with non-experts.
  21. Alerting: practice outages. “Game days”; repeat regularly.
  22. Acknowledgements: Matt T. Proud, Julius Volz, Björn Rabenstein, Matthias Rampke. "Philosophy on Alerting" by Rob Ewaschuk.
  23. Thank you. May the queries flow, and your pagers be quiet. Tobias Schmidt, ContainerDays Hamburg 2016. @dagrobie, github.com/grobie
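
As a rough illustration of the four golden signals listed in the slides, the following Python sketch summarizes one observation window of request records. The `Request` record, the `golden_signals` and `percentile` helpers, the rule that a 5xx status counts as an error, and the CPU figures are all assumptions made for this example; none of them come from the talk.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One served request; a hypothetical record shape for this sketch."""
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]; values must be non-empty."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


def golden_signals(requests, window_s, cpu_used_s, cpu_capacity_s):
    """Summarize one observation window into the four golden signals."""
    durations = [r.duration_s for r in requests]
    # Assumption: only 5xx responses count as failures towards users.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "traffic_rps": len(requests) / window_s,         # traffic
        "error_ratio": errors / len(requests),           # errors
        "latency_p50_s": percentile(durations, 50),      # latency (median)
        "latency_p99_s": percentile(durations, 99),      # latency (tail)
        "cpu_utilization": cpu_used_s / cpu_capacity_s,  # utilization
    }


# 95 fast requests plus 5 slow, failing ones within a 10-second window.
sample = [Request(0.05, 200)] * 95 + [Request(2.0, 503)] * 5
signals = golden_signals(sample, window_s=10, cpu_used_s=6.0, cpu_capacity_s=8.0)
```

In this sample the median latency is 0.05 s while the 99th percentile is 2 s, which is the kind of gap the latency slide points at when it says the median does not reflect user experience.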
