Monitoring With Prometheus

Prometheus: Monitoring, by Pravin Magdum of Crevise. The presentation was given at the #doppa17 DevOps++ Global Summit 2017. All copyrights are reserved by the author.

Published in: Technology

  1. #DOPPA17 Prometheus: Monitoring. Pravin Magdum, 9th September 2017
  2. Who am I?
     • DevOps evangelist @ Crevise Technology
     • Developer turned DevOps evangelist
     • 9+ years of development and project management experience across various technologies
     • Love to resolve tech problems and debug issues
     Copyright notice: As the author of this presentation, I/we own the copyright and confirm the originality of the content. I/we allow Agile Testing Alliance to use the content for social media marketing and for publishing on the ATA blog or ATA social media channels, provided due credit is given to me/us.
  3. What is monitoring, and why do it?
     • Continuously keep track of the status of the system
     • Continuously keep track of deployed applications
     • Get the earliest possible warning of failures, defects, or problems so they can be fixed
     • See trends over time, which helps when deciding to upgrade or downgrade infrastructure resources
     • Know when things go wrong
     • If an issue persists, analyse the collected data to debug it and prevent it in future
     • Black box monitoring
     • White box monitoring
  4. Black box monitoring
     • Just like smoke testing
     • Examples: ping, HTTP requests
     • Checks whether the server is up and working
     • When to use: when the system is broken, and to test from outside the network
     • Gives no information about what is going on inside the machine
  5. White box monitoring
     • Complementary to black box monitoring
     • Gives information about what is going on inside the system
     • Examples: CPU usage, network usage
  6. What to monitor?
     • First, understand what holds business value for you and your customers
     • CPU, memory, I/O, storage: the typical metrics
     • Application monitoring, so the application can be run in a cluster depending on these metrics
     • Predict resource utilization to avoid downtime
  7. Prometheus
     • Inspired by Google's Borgmon monitoring system
     • Mainly written in Go; publicly launched in 2015
     • Open-source monitoring and alerting system with an active ecosystem
     • Used by Docker, DigitalOcean, and CoreOS, to name a few
  8. Prometheus offers
     • Multi-dimensional data model (time series data)
       – No dotted strings like "doppa.pune"
       – Key-value pairs: {event="Doppa", city="pune"}
     • Powerful queries that leverage this dimensionality
     • Precise alerting
     • Pull model over HTTP
     • Scalable
     • Dashboards
     • Efficient
       – A single server can handle millions of metrics
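To make the label-based data model concrete, here is a minimal Python sketch (not Prometheus code) in which each series is a metric name plus key-value labels. The metric name `talk_attendees`, the sample values, and the helpers `select` and `sum_by` are hypothetical and exist only for illustration.

```python
# Each time series is identified by a metric name plus a set of key-value labels.
# The metric name and the values below are made up for illustration.
series = [
    {"name": "talk_attendees", "labels": {"event": "Doppa", "city": "pune"}, "value": 120},
    {"name": "talk_attendees", "labels": {"event": "Doppa", "city": "mumbai"}, "value": 80},
    {"name": "talk_attendees", "labels": {"event": "Other", "city": "pune"}, "value": 40},
]


def select(all_series, name, **matchers):
    """Return the series with this name whose labels equal every matcher."""
    return [
        s for s in all_series
        if s["name"] == name
        and all(s["labels"].get(k) == v for k, v in matchers.items())
    ]


def sum_by(selected, label):
    """Sum values grouped by one label, in the spirit of `sum by (label) (...)`."""
    totals = {}
    for s in selected:
        key = s["labels"].get(label)
        totals[key] = totals.get(key, 0) + s["value"]
    return totals


pune_only = select(series, "talk_attendees", city="pune")
per_event = sum_by(select(series, "talk_attendees"), "event")
```

Because the dimensions are separate labels rather than parts of one dotted name, the same data can be sliced by city or grouped by event without renaming anything.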
  9. Let's understand with a simple diagram (diagram not included in this transcript)
  10. Components
     • Prometheus server: scrapes and stores time series data
     • Exporters: expose metrics from resources
     • Alert rules: define the conditions under which alerts fire
     • Alertmanager: notifies about alerts on different communication channels
  11. Powerful queries
     • Can multiply, join, add, aggregate, and predict in the same query
     • Can evaluate current as well as historical data
     • E.g.
       – Which are the top 3 services consuming the most CPU, or more than 80%?
       – Will my storage be full in the next 4 hours?
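The "will my storage be full in the next 4 hours?" question comes down to fitting a line through recent samples and extrapolating it. The Python sketch below illustrates that idea with an ordinary least-squares fit. It is an illustration only, not the Prometheus implementation: the function name mirrors PromQL's `predict_linear`, but this version extrapolates from the last sample's timestamp, and the sample data is invented.

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    return the value it predicts seconds_ahead after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


# Hypothetical free-space samples (seconds, bytes): shrinking by 1 byte per second.
free_bytes = [(0, 1000.0), (60, 940.0), (120, 880.0)]
predicted = predict_linear(free_bytes, 4 * 3600)
disk_will_fill = predicted < 0
```

A predicted free-space value below zero means the disk is on course to fill within the window, which is the kind of condition an alert rule can test.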
  12. Some query examples
     • CPU usage (%):
         100 - (avg by (instance) (irate(node_cpu{instance="node1:9100",job="node",mode="idle"}[1m])) * 100)
     • Memory used (bytes):
         node_memory_MemTotal{job="node",instance="node1:9100"}
           - node_memory_MemFree{job="node",instance="node1:9100"}
           - node_memory_Buffers{job="node",instance="node1:9100"}
           - node_memory_Cached{job="node",instance="node1:9100"}
     • Disk writes (KiB/s):
         irate(node_disk_bytes_written[60s]) / 1024
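The CPU query reads more easily once the arithmetic is spelled out: take the per-second rate of the idle counter for each CPU from its last two samples, average those rates, and subtract the idle percentage from 100. The Python sketch below mimics that calculation on invented samples. The helper names are hypothetical, and the counter-reset handling is a simplification.

```python
def irate(samples):
    """Per-second rate computed from the last two (timestamp, counter) samples."""
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    if v2 < v1:
        # Counter reset: assume the counter restarted from zero.
        return v2 / (t2 - t1)
    return (v2 - v1) / (t2 - t1)


def cpu_usage_percent(idle_samples_per_cpu):
    """100 minus the average idle rate across CPUs, expressed as a percentage."""
    idle_rates = [irate(samples) for samples in idle_samples_per_cpu]
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100


# Invented idle-seconds counters for two CPUs, sampled 16 seconds apart.
cpu0_idle = [(0, 100.0), (16, 112.0)]  # idle for 12 of 16 seconds
cpu1_idle = [(0, 200.0), (16, 204.0)]  # idle for 4 of 16 seconds
usage = cpu_usage_percent([cpu0_idle, cpu1_idle])
```

With one CPU idle 75% of the time and the other 25%, the average idle share is 50%, so the reported usage is 50%.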
  13. Out-of-the-box feature: the textfile collector
     • The textfile collector is similar to the Pushgateway in that it allows exporting statistics from batch jobs and shell scripts
     • Some metrics are not exported by node-exporter
     • You can still get such metrics into Prometheus with the help of the textfile collector
     • Produce output that is compatible with the Prometheus text output format
     • Write your own exporters to feed Prometheus
     • ./node_exporter --collector.textfile.directory=Metrics
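As a rough illustration of feeding the textfile collector from a batch job, the Python sketch below renders one gauge in the Prometheus text format and writes it to a `.prom` file in the directory passed to node_exporter. The metric name, help text, label, and file name are hypothetical, and label values are not escaped here, so this suits simple values only.

```python
import os
import tempfile


def render_gauge(name, value, help_text, labels=None):
    """Render one gauge sample in the Prometheus text format."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{label_str} {value}\n"
    )


def write_prom_file(directory, filename, content):
    """Write to a temporary file, then rename it into place, so the
    collector never reads a half-written file."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(content)
    os.chmod(tmp_path, 0o644)  # let a node_exporter running as another user read it
    os.replace(tmp_path, os.path.join(directory, filename))


# Hypothetical metric recorded at the end of a nightly backup job.
text = render_gauge(
    "backup_last_success_timestamp_seconds",
    1504915200,
    "Time of the last successful backup.",
    {"job": "nightly"},
)
```

A cron job could then call `write_prom_file("Metrics", "backup.prom", text)`, assuming `Metrics` is the directory given to `--collector.textfile.directory`.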
  14. Running Prometheus
     • Download Prometheus from https://prometheus.io/download/#prometheus
     • Extract and run: done
     • Open http://localhost:9090
     • Let's see it in action
  15. Node exporter installation
     • Again, two steps
     • Go to https://prometheus.io/download/#node_exporter
     • Download node exporter, extract, and run: done
     • It exports metrics on port 9100
     • Open http://localhost:9100
  16. Configuration
     • It's time to tell Prometheus to pull metrics from node exporter
     • Edit prometheus.yml, the configuration file for Prometheus
     • scrape_interval: 15 seconds
     • scrape_configs: what we are scraping
     • targets: IP address or hostname of each node to monitor
     • labels: logical grouping of hosts
  17. Sample configuration file
       global:
         scrape_interval: 15s
       rule_files:
         - 'CriticalAlert.Rules'
       scrape_configs:
         - job_name: node
           static_configs:
             - targets:
                 - "IP:9100"
               labels:
                 group: 'QA-Env'
  18. Reload Prometheus
       curl -X POST http://localhost:9090/-/reload
     • The curl command above makes the Prometheus server load the new configuration without a restart
  19. Alert rules and Alertmanager
     • Define alert rules with powerful Prometheus queries
     • Predict linear changes on nodes
     • Send alerts on the communication channel of your choice
     • E.g. Slack, PagerDuty, email, SMS
  20. Setting up Alertmanager
     • Alertmanager can be configured to send Prometheus alerts to your mailbox or Slack, trigger automated calls in critical situations, and so on
     • Download Alertmanager from https://prometheus.io/download/#alertmanager
     • Extract and configure
     • Run
  21. Alert rule: is the instance up and running?
       ALERT InstanceDown
         IF up == 0
         FOR 10m
         LABELS { severity = "CRITICAL" }
         ANNOTATIONS {
           summary = "Instance down",
           description = "{{ $labels.group }}-{{ $labels.instance }} - instance has been down for more than 10 minutes."
         }
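The `FOR 10m` clause means the condition must hold continuously for ten minutes before the alert fires. Until then the alert is only pending, and a single healthy evaluation resets the timer. The Python sketch below models that behaviour on a made-up evaluation timeline. The function name and the evaluation interval are assumptions for illustration, not Prometheus internals.

```python
def alert_states(evaluations, for_seconds):
    """Given (timestamp, condition_true) evaluations in time order, return the
    alert state at each one: "inactive", "pending", or "firing"."""
    states = []
    active_since = None
    for timestamp, condition_true in evaluations:
        if not condition_true:
            active_since = None  # any healthy evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Hypothetical timeline: `up == 0` evaluated every 5 minutes, with FOR 10m.
timeline = [(0, True), (300, True), (600, True), (900, False), (1200, True)]
states = alert_states(timeline, for_seconds=600)
```

Here the alert fires only at the third evaluation, drops back to inactive when the instance recovers, and starts over as pending on the next failure.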
  22. Alert rule: is my CPU usage going beyond 75%?
       ALERT NodeCPUUsage
         IF (100 - (avg by (instance) (irate(node_cpu{job="node",mode="idle"}[1m])) * 100)) > 75
         FOR 2m
         LABELS { severity = "CRITICAL" }
         ANNOTATIONS {
           summary = "{{ $labels.group }}-{{ $labels.instance }}: High CPU usage detected",
           description = "CPU usage is above 75% (current value is: {{ $value }})"
         }
  23. Alert rule: is storage more than 80% full?
       ALERT filesystem_threshold_exceeded
         IF 100 * (1 - (node_filesystem_free{mountpoint="/"} / node_filesystem_size{mountpoint="/"})) > 80
         LABELS { severity = "CRITICAL" }
         ANNOTATIONS {
           summary = "{{ $labels.group }}-{{ $labels.instance }} High filesystem usage is detected",
           description = "This device's filesystem usage has exceeded the threshold with a value of {{ $value }}."
         }
  24. AlertManager.yml: Alertmanager configuration file
       global:
       route:
         repeat_interval: 4h
         routes:
           - receiver: email-QA
             match:
               group: 'QA-trad'
           - receiver: email-prod
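The routing block can be read as an ordered list: an alert goes to the first child route whose `match` labels all equal the alert's labels, and a route with no `match` accepts everything. The Python sketch below models only that first-match behaviour for the two routes on this slide. It is a simplification, not Alertmanager code, and the `default_receiver` parameter is a hypothetical stand-in for the parent route's receiver.

```python
def pick_receiver(alert_labels, routes, default_receiver=None):
    """Return the receiver of the first route whose match labels all equal
    the alert's labels; a route without `match` accepts any alert."""
    for route in routes:
        match = route.get("match", {})
        if all(alert_labels.get(k) == v for k, v in match.items()):
            return route["receiver"]
    return default_receiver


# The two child routes from the slide, in order.
routes = [
    {"receiver": "email-QA", "match": {"group": "QA-trad"}},
    {"receiver": "email-prod"},
]

qa_receiver = pick_receiver({"group": "QA-trad", "severity": "CRITICAL"}, routes)
other_receiver = pick_receiver({"group": "Prod", "severity": "CRITICAL"}, routes)
```

Order matters in this model: placing the catch-all route first would send every alert, including the QA ones, to email-prod.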
  25. Receivers
       receivers:
         - name: "email-prod"
           email_configs:
             - to: 'Prodsupport@crevise.com'
               from: 'no-reply@crevise.com'
               smarthost: 'smtp.office365.com:587'
               auth_username: 'no-reply@crevise.com'
               auth_identity: 'no-reply@crevise.com'
               auth_password: 'fXXXX'
  26. Slack Alerts
  27. Email Alerts
  28. Set up Grafana
     • Download the tarball from the official site, extract it, and run the binary
     • Check the URL http://<Server-IP>:3000
     • Add Prometheus as a data source:
       – Go to http://<Server-IP>:3000
       – Enter the username admin and password admin, then click "Log In"
       – Click "Data Sources" in the left menu
       – Click "Add new" in the top menu
  29. Grafana Dashboard
  30. Thank you! Questions? Reachable at pravin.magdum@crevise.com, Twitter: @pravin_magdum