SlideShare una empresa de Scribd logo
1 de 74
Descargar para leer sin conexión
PromCon Munich
Julien Pivotto
@roidelapluie
Improved alerting with
Prometheus and
Alertmanager
November 8th, 2019
• This talk contains PromQL.
• This talk contains YAML.
• What you will see was built over time.
Important notes!
@roidelapluie
Context
@roidelapluie
Message Broker in the Belgian healthcare sector
• High visibility
• Sync & Async
• Legacy & New
• Lots of partners
• Multiple customers
Message Broker
@roidelapluie
Technical
Business
Monitoring
@roidelapluie Font Awesome CC-BY-4.0
Alerts are not only for incidents.
Some alerts carry business information about ongoing
events (good or bad).
Some alerts go outside of our org.
Some alerts are not for humans.
Alerting
@roidelapluie
Channels
@roidelapluie Font Awesome CC-BY-4.0
Repeat every 15m, 1h, 4h, 24h, 2d
24x7, 10x5, 12x6, 10x7, never
Legal holidays
Time frames
@roidelapluie
Updated annotations & value
Updated graphs
15m/1h repeat interval?
@roidelapluie
• Alertmanager owns the noti cations
• Webhook receivers have no logic
• Take decisions at time of alert writing
Constraints
@roidelapluie
• Avoid Alertmanager recon gurations
• Safe and easy way to write alerts
• Only send relevant alerts
• Alert on staging environments
Challenges
@roidelapluie
PromQL
@roidelapluie
- alert: a target is down
expr: up == 0
for: 5m
Gauges
@roidelapluie
Gauges
@roidelapluie
Instead of:
- alert: a target is down
expr: up == 0
for: 5m
Do:
- alert: a target is down
expr: avg_over_time(up[5m]) < .9
for: 5m
Gauges
@roidelapluie
Alert me if temperature is above 27°C
Hysteresis
@roidelapluie
- alert: temperature is above threshold
expr: temperature_celcius > 27
for: 5m
labels:
priority: high
Hysteresis
@roidelapluie
Hysteresis
@roidelapluie
Hysteresis is the dependence of the state of a system
on its history.
Hysteresis
@roidelapluie Wikipedia CC-BY-SA-3.0
- alert: temperature is above threshold
expr: |
avg_over_time(temperature_celcius[5m])
> 27
for: 5m
labels:
priority: high
alternative: max_over_time
5m might be too short
if > 5m: when is it resolved?
Hysteresis
@roidelapluie
Alert me
• if temperature is above 27°C
• only stop when it gets below 25°C
Hysteresis
@roidelapluie
(avg_over_time(temperature_celcius[5m]) > 27)
or (temperature_celcius > 25 and
count without (alertstate, alertname, priority)
ALERTS{
alertstate="firing",
alertname="temperature is above threshold"
})
Hysteresis
@roidelapluie
temperature_celcius > 27
but...
Computed threshold
@roidelapluie
- record: temperature_threshold_celcius
expr: |
27+0*temperature_celcius{
location=~".*ambiant"
}
or 25+0*temperature_celcius
Bonus: temperature_threshold_celcius can be
used in grafana!
Computed threshold
@roidelapluie
- alert: temperature is above threshold
expr: |
temperature_celcius >
temperature_threshold_celcius
Note: put threshold & alert
in the same alert group
Computed threshold
@roidelapluie
- alert: no more sms
expr: sms_available < 39000
Absence
@roidelapluie
Absence
@roidelapluie
No metric = No alert!
Metric is back = New alert!
Absence
@roidelapluie
- record: sms_available_last
expr: |
sms_available or
sms_available_last
- alert: no more sms
record: sms_available_last < 39000
- alert: no more sms data
record: absent(sms_available)
for: 1h
Absence
@roidelapluie
Con guration
@roidelapluie
recipients:
name/channel
jpivotto/mail
opsteam/ticket
appteam/message
customer/sms
dc1/jenkins
Recipients
@roidelapluie
Alertmanager receivers
- name: "opsteam/mail"
email_configs:
- to: 'ops@inuits.eu'
send_resolved: yes
html: "{{ template "inuits.html.tmpl" . }}"
text: "{{ template "inuits.txt.tmpl" . }}"
headers:
Subject: "{{ template "title.tmpl" . }}"
Hint: Subject can be a template.
Receivers
@roidelapluie
Alertmanager receivers
- name: "opsteam/mail/noresolved"
email_configs:
- to: 'ops@inuits.eu'
send_resolved: no
html: "{{ template "inuits.html.tmpl" . }}"
text: "{{ template "inuits.txt.tmpl" . }}"
headers:
Subject: "{{ template "title.tmpl" . }}"
Same, but with send_resolved: no
Receivers
@roidelapluie
Alertmanager receivers
- name: "lotsOfPeople/mail"
email_configs:
- to: 'a@inuits.eu,b@inuits.eu,c@inuits.eu'
headers:
To: a@inuits.eu
CC: b@inuits.eu
Reply-To: support@inuits.eu
c@inuits.eu is now BCC.
Email: CC, BCC
@roidelapluie
Prometheus alert
- alert: Not enough traffic
expr: ...
for: 5m
labels:
recipients: customer1/sms,opsteam/ticket
annotations:
summary: ...
resolved_summary: ...
Who gets the alert?
@roidelapluie
Alertmanager routing
- receiver: "customer1/sms"
match_re:
recipient: "(.*,)?customer1/sms(,.*)?"
continue: true
routes: [...]
- receiver: "opsteam/ticket"
match_re:
recipient: "(.*,)?opsteam/ticket(,.*)?"
continue: true
routes: [...]
Who gets the alert?
@roidelapluie
Prometheus alert
- alert: Not enough traffic
expr: ...
for: 5m
labels:
recipients: customer1/sms,opsteam/ticket
send_resolved: "no"
Resolved
@roidelapluie
Alertmanager routing
- receiver: "customer1/sms"
match_re:
recipient: "(.*,)?customer1/sms(,.*)?"
continue: true
routes:
- receiver: customer1/sms/noresolved
match:
send_resolved: "no"
Resolved
@roidelapluie
Prometheus alert
- alert: Not enough traffic
expr: ...
for: 5m
labels:
recipients: customer1/sms,opsteam/ticket
repeat_interval: 1h
Repeat interval
@roidelapluie
Alertmanager routing
- receiver: "customer1/sms"
match_re:
recipient: "(.*,)?customer1/sms(,.*)?"
continue: true
routes:
- receiver: customer1/sms
repeat_interval: 1h
match:
repeat_interval: 1h
Repeat interval
@roidelapluie
Some channels have speci c group_interval: 0s.
Some channels always send_resolved: no.
Some recipients have aliases (ticket+chat).
Extra con gurations
@roidelapluie
Extract of amtool con g routes show
─ {recipient=~"^(?:(.*,)?jpivotto/mail(,.*)?)$"} continue: true receiver: jpivotto/mail
   ├── {repeat_interval="15m",send_resolved="no"} receiver: jpivotto/mail/noresolved
   ├── {repeat_interval="15m"} receiver: jpivotto/mail
   ├── {repeat_interval="30m",send_resolved="no"} receiver: jpivotto/mail/noresolved
   ├── {repeat_interval="30m"} receiver: jpivotto/mail
   ├── {repeat_interval="1h",send_resolved="no"} receiver: jpivotto/mail/noresolved
   ├── {repeat_interval="1h"} receiver: jpivotto/mail
   ├── {repeat_interval="2h",send_resolved="no"} receiver: jpivotto/mail/noresolved
   ├── {repeat_interval="2h"} receiver: jpivotto/mail
   ├── {repeat_interval="4h",send_resolved="no"} receiver: jpivotto/mail/noresolved
   ├── {repeat_interval="4h"} receiver: jpivotto/mail
   ├── {repeat_interval="6h",send_resolved="no"} receiver: jpivotto/mail/noresolved
   ├── {repeat_interval="6h"} receiver: jpivotto/mail
   ├── {repeat_interval="12h",send_resolved="no"} receiver: jpivotto/mail/noresolved
   ├── {repeat_interval="12h"} receiver: jpivotto/mail
   ├── {repeat_interval="24h",send_resolved="no"} receiver:jpivotto/mail/noresolved
   ├── {repeat_interval="24h"} receiver: jpivotto/mail
   ├── {send_resolved="no"} receiver: jpivotto/mail/noresolved
   └── {repeat_interval=""} receiver: jpivotto/mail
Routes tree
@roidelapluie
Con g Management!
Our input
receivers:
customer:
email:
to: [customer@example.com]
cc: [service-management@inuits.eu]
bcc: [ops@inuits.eu]
sms: [+1234567890, +2345678901]
chat:
room: "#customer"
How do we achieve it?
@roidelapluie
• Script that is deployed with AM
• Knows all the recipients
• Will validate alerts yaml
• promtool
• mandatory labels
• validate receivers label
• validate repeat_interval label
Not possible to write alerts that go nowhere by
accident.
Con guration management
@roidelapluie
Time frame
@roidelapluie
Prometheus alert
- alert: a target is down
expr: up == 0
for: 5m
labels:
recipients: customer1/sms,opsteam/ticket
time_window: 13x5
Time frame
@roidelapluie
- record: daily_saving_time_belgium
expr: |
(vector(0) and (month() < 3 or month() > 10))
or
(vector(1) and (month() > 3 and month() < 10))
or
(
(
(month() %2 and (day_of_month() - day_of_week()
> (30 + +month() % 2 - 7)) and day_of_week() > 0)
or
-1*month()%2+1 and (day_of_month() -
day_of_week() <= (30 + month() % 2 - 7))
)
)
or
(vector(1) and ((month()==10 and hour() < 1) or (month()==3 and hour() > 0
or
vector(0)
- record: belgium_localtime
expr: |
time() + 3600 + 3600 * daily_saving_time_belgium
Timezone
@roidelapluie
hour(belgium_localtime)
hour() and other time-functions can take a
timestamp as argument.
Belgian hour
@roidelapluie
- record: public_holiday
expr: |
vector(1) and
day_of_month(belgium_localtime) == 25
and month(belgium_localtime) == 12
labels:
name: Xmas
Holidays
@roidelapluie
groups:
- name: Easter Meeus/Jones/Butcher Algorithm
interval: 60s
rules:
- record: easter_y
expr: year(belgium_localtime)
- record: easter_a
expr: easter_y % 19
- record: easter_b
expr: floor(easter_y / 100)
- record: easter_c
expr: easter_y % 100
- record: easter_d
expr: floor(easter_b / 4)
- record: easter_e
expr: easter_b % 4
- record: easter_f
expr: floor((easter_b +8 ) / 25)
- record: easter_g
expr: floor((easter_b - easter_f + 1 ) / 3)
- record: easter_h
expr: (19*easter_a + easter_b - easter_d - easter_g + 15 ) % 30
- record: easter_i
expr: floor(easter_c/4)
- record: easter_k
expr: easter_c%4
- record: easter_l
expr: (32 + 2*easter_e + 2*easter_i - easter_h - easter_k) % 7
- record: easter_m
expr: floor((easter_a + 11*easter_h + 22*easter_l) / 451)
- record: easter_month
expr: floor((easter_h + easter_l - 7*easter_m + 114) / 31)
- record: easter_day
expr: ((easter_h + easter_l - 7*easter_m + 114) %31) + 1
Easter
@roidelapluie Wikipedia CC-BY-SA-3.0
- record: public_holiday
expr: |
vector(1) and
day_of_month(belgium_localtime-86400) == easter_day
and month(belgium_localtime-86400) == easter_month
labels:
name: Easter Monday
- record: public_holiday
expr: |
vector(1) and
day_of_month(belgium_localtime-40*86400) == easter_day
and month(belgium_localtime-40*86400) == easter_month
labels:
name: Feast of the Ascension
Easter
@roidelapluie
- record: business_day
expr: |
vector(1) and day_of_week(belgium_localtime) > 0
and day_of_week(belgium_localtime) < 6
unless count(public_holiday)
- record: belgium_hour
expr: |
hour(belgium_localtime)
- record: business_hour
expr: |
vector(1) and belgium_hour >= 8 < 18
and business_day
Business hour
@roidelapluie
- record: extended_business_hour
expr: |
(vector(1) and belgium_hour >= 7 < 20
and business_day)
- record: extended_business_hour_sat
expr: |
extended_business_hour
or (vector(1) and belgium_hour >= 7 < 14
and day_of_week(belgium_localtime) == 6
unless count(public_holiday))
Extended business hours
@roidelapluie
(sum(rate(http_requests_total{code=~"5.."}[5m])) by (vhost)
> 10 and on () business_hour)
or
(sum(rate(http_requests_total{code=~"5.."}[5m])) by (vhost) > 1
and sum(rate(http_requests_total{code=~"2.."}[5m])) by (vhost) < 1)
and on () business_hour
Thresholds depending on time
@roidelapluie
- record: daylight
expr: |
vector(1) and belgium_hour >= 8 < 18
- record: extended_daylight
expr: |
vector(1) and belgium_hour >= 7 < 20
Day and night
@roidelapluie
- alert: Time Window - Night
expr: absent(daylight)
labels:
recipient: none
- alert: Time Window - OBH
expr: absent(business_hour)
labels:
recipient: none
- alert: Time Window - Extended Night
expr: absent(extended_daylight)
labels:
recipient: none
Alerts
@roidelapluie
- alert: Time Window - Extended OBH with Saturday
expr: absent(extended_business_hour_sat)
labels:
recipient: none
- alert: Time Window - Extended OBH
expr: absent(extended_business_hour)
labels:
recipient: none
Alerts
@roidelapluie
At this point, we will have "meaningless" alerts at night
and during business holidays.
Alerts
@roidelapluie
Inhibition is a concept of suppressing noti cations for
certain alerts if certain other alerts are already ring.
Inhibition
@roidelapluie Alertmanager documentation CC-BY-4.0
Alertmanager inhibition
inhibit_rules:
- source_match:
alertname: "Time Window - Night"
target_match:
time_window: 10x7
- source_match:
alertname: "Time Window - OBH"
target_match:
time_window: 10x5
Inhibition
@roidelapluie
Alertmanager inhibition
- source_match:
alertname: "Time Window - Extended Night"
target_match:
time_window: 13x7
- source_match:
alertname: "Time Window - Extended OBH"
target_match:
time_window: 13x5
- source_match:
alertname: "Time Window - Extended OBH with Saturday"
target_match:
time_window: 13x6
Inhibition
@roidelapluie
Alerts Relabeling
@roidelapluie
Prometheus alert
- alert: a target is down
expr: up == 0
for: 5m
labels:
recipients_prod: customer1/sms,opsteam/ticket
time_window_prod: 24x7
recipients: opsteam/chat
time_window: 8x5
Per env recipients
@roidelapluie
Prometheus con g
alerting:
alert_relabel_configs:
- source_labels: [time_window_prod,env]
regex: "(.+);prod"
target_label: time_window
replacement: '$1'
- source_labels: [time_window_dev,env]
regex: "(.+);dev"
target_label: time_window
replacement: '$1'
Repeat for other env, other labels.
Per env recipients
@roidelapluie
Prometheus con g
alerting:
alert_relabel_configs:
- source_labels: [time_window]
regex: "never"
action: drop
Be careful about the order (time_window can be
mutated by relabeling).
Drop alert
@roidelapluie
Conclusion
@roidelapluie
• Alert-Writing experience is great with this
• Prometheus and PromQL can do a lot
• Con g management lls the "gaps"
• With some e ort, we have everything we wanted
Conclusion
@roidelapluie
- name: normal rate
interval: 120s
rules:
- record: request_rate_history
expr: |
sum(rate(http_requests_total[5m])) by (env)
labels:
when: 0w
Bonus
@roidelapluie
- record: request_rate_history
expr: |
(
sum(rate(http_requests_total[5m] offset 168h)) by (env)
and on () (
daily_saving_time_belgium offset 1w
== daily_saving_time_belgium)
) or
(
sum(rate(http_requests_total[5m] offset 167h)) by (env)
and on () (
daily_saving_time_belgium offset 1w
< daily_saving_time_belgium)
) or
(
sum(rate(http_requests_total[5m] offset 169h)) by (env)
and on () (
daily_saving_time_belgium offset 1w
> daily_saving_time_belgium)
)
labels:
when: 1w
Bonus
@roidelapluie
- record: request_rate_normal
expr: |
max(bottomk(1,
topk(4, request_rate_history) by(env)
) by(env)) by(env)
Bonus
@roidelapluie
Bonus
@roidelapluie
Bonus
@roidelapluie
Really hope we will get rid of DST in 2021!
Bonus
@roidelapluie
Julien Pivotto
@roidelapluie
roidelapluie@inuits.eu
Essensteenweg 31
2930 Brasschaat
Belgium
Contact:
info@inuits.eu
+32-3-8082105

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)MeetUp Monitoring with Prometheus and Grafana (September 2018)
MeetUp Monitoring with Prometheus and Grafana (September 2018)
 
Prometheus - basics
Prometheus - basicsPrometheus - basics
Prometheus - basics
 
Monitoring with prometheus
Monitoring with prometheusMonitoring with prometheus
Monitoring with prometheus
 
Monitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with Prometheus
 
Grafana
GrafanaGrafana
Grafana
 
Cloud Monitoring tool Grafana
Cloud Monitoring  tool Grafana Cloud Monitoring  tool Grafana
Cloud Monitoring tool Grafana
 
How to monitor your micro-service with Prometheus?
How to monitor your micro-service with Prometheus?How to monitor your micro-service with Prometheus?
How to monitor your micro-service with Prometheus?
 
Terraform
TerraformTerraform
Terraform
 
Grafana introduction
Grafana introductionGrafana introduction
Grafana introduction
 
An introduction to terraform
An introduction to terraformAn introduction to terraform
An introduction to terraform
 
Monitoring With Prometheus
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With Prometheus
 
Prometheus design and philosophy
Prometheus design and philosophy   Prometheus design and philosophy
Prometheus design and philosophy
 
End to-end monitoring with the prometheus operator - Max Inden
End to-end monitoring with the prometheus operator - Max IndenEnd to-end monitoring with the prometheus operator - Max Inden
End to-end monitoring with the prometheus operator - Max Inden
 
Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...
 
Monitoring using Prometheus and Grafana
Monitoring using Prometheus and GrafanaMonitoring using Prometheus and Grafana
Monitoring using Prometheus and Grafana
 
Introduction to Ansible
Introduction to AnsibleIntroduction to Ansible
Introduction to Ansible
 
Ansible presentation
Ansible presentationAnsible presentation
Ansible presentation
 
Prometheus and Grafana
Prometheus and GrafanaPrometheus and Grafana
Prometheus and Grafana
 
MySQL Monitoring using Prometheus & Grafana
MySQL Monitoring using Prometheus & GrafanaMySQL Monitoring using Prometheus & Grafana
MySQL Monitoring using Prometheus & Grafana
 
Terraform introduction
Terraform introductionTerraform introduction
Terraform introduction
 

Similar a Improved alerting with Prometheus and Alertmanager

#include customer.h#include heap.h#include iostream.docx
#include customer.h#include heap.h#include iostream.docx#include customer.h#include heap.h#include iostream.docx
#include customer.h#include heap.h#include iostream.docx
AASTHA76
 
ch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.pptch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.ppt
ghoitsun
 

Similar a Improved alerting with Prometheus and Alertmanager (20)

#include customer.h#include heap.h#include iostream.docx
#include customer.h#include heap.h#include iostream.docx#include customer.h#include heap.h#include iostream.docx
#include customer.h#include heap.h#include iostream.docx
 
AI&ML Conference 2019 - Deep Learning from zero to hero
AI&ML Conference 2019 - Deep Learning from zero to heroAI&ML Conference 2019 - Deep Learning from zero to hero
AI&ML Conference 2019 - Deep Learning from zero to hero
 
Accelerating OpenERP accounting: Precalculated period sums
Accelerating OpenERP accounting: Precalculated period sumsAccelerating OpenERP accounting: Precalculated period sums
Accelerating OpenERP accounting: Precalculated period sums
 
Proposed pricing model for cloud computing
Proposed pricing model for cloud computingProposed pricing model for cloud computing
Proposed pricing model for cloud computing
 
Code Tuning
Code TuningCode Tuning
Code Tuning
 
C# Loops
C# LoopsC# Loops
C# Loops
 
Performance
PerformancePerformance
Performance
 
14 queuing
14 queuing14 queuing
14 queuing
 
Ch02 primitive-data-definite-loops
Ch02 primitive-data-definite-loopsCh02 primitive-data-definite-loops
Ch02 primitive-data-definite-loops
 
ScalaMeter 2012
ScalaMeter 2012ScalaMeter 2012
ScalaMeter 2012
 
ch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.pptch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.ppt
 
ch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.pptch02-primitive-data-definite-loops.ppt
ch02-primitive-data-definite-loops.ppt
 
Objects? No thanks!
Objects? No thanks!Objects? No thanks!
Objects? No thanks!
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part II
 
Observability in a Dynamically Scheduled World
Observability in a Dynamically Scheduled WorldObservability in a Dynamically Scheduled World
Observability in a Dynamically Scheduled World
 
Ch3
Ch3Ch3
Ch3
 
C++ process new
C++ process newC++ process new
C++ process new
 
Cyrille Martraire: Monoids, Monoids Everywhere! at I T.A.K.E. Unconference 2015
Cyrille Martraire: Monoids, Monoids Everywhere! at I T.A.K.E. Unconference 2015Cyrille Martraire: Monoids, Monoids Everywhere! at I T.A.K.E. Unconference 2015
Cyrille Martraire: Monoids, Monoids Everywhere! at I T.A.K.E. Unconference 2015
 
Control Statement.ppt
Control Statement.pptControl Statement.ppt
Control Statement.ppt
 
The hitchhiker’s guide to Prometheus
The hitchhiker’s guide to PrometheusThe hitchhiker’s guide to Prometheus
The hitchhiker’s guide to Prometheus
 

Más de Julien Pivotto

Más de Julien Pivotto (20)

The O11y Toolkit
The O11y ToolkitThe O11y Toolkit
The O11y Toolkit
 
What's New in Prometheus and Its Ecosystem
What's New in Prometheus and Its EcosystemWhat's New in Prometheus and Its Ecosystem
What's New in Prometheus and Its Ecosystem
 
Prometheus: What is is, what is new, what is coming
Prometheus: What is is, what is new, what is comingPrometheus: What is is, what is new, what is coming
Prometheus: What is is, what is new, what is coming
 
What's new in Prometheus?
What's new in Prometheus?What's new in Prometheus?
What's new in Prometheus?
 
Introduction to Grafana Loki
Introduction to Grafana LokiIntroduction to Grafana Loki
Introduction to Grafana Loki
 
Why you should revisit mgmt
Why you should revisit mgmtWhy you should revisit mgmt
Why you should revisit mgmt
 
Observing the HashiCorp Ecosystem From Prometheus
Observing the HashiCorp Ecosystem From PrometheusObserving the HashiCorp Ecosystem From Prometheus
Observing the HashiCorp Ecosystem From Prometheus
 
Monitoring in a fast-changing world with Prometheus
Monitoring in a fast-changing world with PrometheusMonitoring in a fast-changing world with Prometheus
Monitoring in a fast-changing world with Prometheus
 
5 tips for Prometheus Service Discovery
5 tips for Prometheus Service Discovery5 tips for Prometheus Service Discovery
5 tips for Prometheus Service Discovery
 
Prometheus and TLS - an Introduction
Prometheus and TLS - an IntroductionPrometheus and TLS - an Introduction
Prometheus and TLS - an Introduction
 
Powerful graphs in Grafana
Powerful graphs in GrafanaPowerful graphs in Grafana
Powerful graphs in Grafana
 
YAML Magic
YAML MagicYAML Magic
YAML Magic
 
HAProxy as Egress Controller
HAProxy as Egress ControllerHAProxy as Egress Controller
HAProxy as Egress Controller
 
SIngle Sign On with Keycloak
SIngle Sign On with KeycloakSIngle Sign On with Keycloak
SIngle Sign On with Keycloak
 
Monitoring as an entry point for collaboration
Monitoring as an entry point for collaborationMonitoring as an entry point for collaboration
Monitoring as an entry point for collaboration
 
Incident Resolution as Code
Incident Resolution as CodeIncident Resolution as Code
Incident Resolution as Code
 
Monitor your CentOS stack with Prometheus
Monitor your CentOS stack with PrometheusMonitor your CentOS stack with Prometheus
Monitor your CentOS stack with Prometheus
 
Monitor your CentOS stack with Prometheus
Monitor your CentOS stack with PrometheusMonitor your CentOS stack with Prometheus
Monitor your CentOS stack with Prometheus
 
An introduction to Ansible
An introduction to AnsibleAn introduction to Ansible
An introduction to Ansible
 
Jsonnet
JsonnetJsonnet
Jsonnet
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 

Improved alerting with Prometheus and Alertmanager

  • 1. PromCon Munich Julien Pivotto @roidelapluie Improved alerting with Prometheus and Alertmanager November 8th, 2019
  • 2. • This talk contains PromQL. • This talk contains YAML. • What you will see was built over time. Important notes! @roidelapluie
  • 4. Message Broker in the Belgian healthcare sector • High visibility • Sync & Async • Legacy & New • Lots of partners • Multiple customers Message Broker @roidelapluie
  • 6. Alerts are not only for incidents. Some alerts carry business information about ongoing events (good or bad). Some alerts go outside of our org. Some alerts are not for humans. Alerting @roidelapluie
  • 8. Repeat every 15m, 1h, 4h, 24h, 2d 24x7, 10x5, 12x6, 10x7, never Legal holidays Time frames @roidelapluie
  • 9. Updated annotations & value Updated graphs 15m/1h repeat interval? @roidelapluie
  • 10. • Alertmanager owns the noti cations • Webhook receivers have no logic • Take decisions at time of alert writing Constraints @roidelapluie
  • 11. • Avoid Alertmanager recon gurations • Safe and easy way to write alerts • Only send relevant alerts • Alert on staging environments Challenges @roidelapluie
  • 13. - alert: a target is down expr: up == 0 for: 5m Gauges @roidelapluie
  • 15. Instead of: - alert: a target is down expr: up == 0 for: 5m Do: - alert: a target is down expr: avg_over_time(up[5m]) < .9 for: 5m Gauges @roidelapluie
  • 16. Alert me if temperature is above 27°C Hysteresis @roidelapluie
  • 17. - alert: temperature is above threshold expr: temperature_celcius > 27 for: 5m labels: priority: high Hysteresis @roidelapluie
  • 19. Hysteresis is the dependence of the state of a system on its history. Hysteresis @roidelapluie Wikipedia CC-BY-SA-3.0
  • 20. - alert: temperature is above threshold expr: | avg_over_time(temperature_celcius[5m]) > 27 for: 5m labels: priority: high alternative: max_over_time 5m might be too short if > 5m: when is it resolved? Hysteresis @roidelapluie
  • 21. Alert me • if temperature is above 27°C • only stop when it gets below 25°C Hysteresis @roidelapluie
  • 22. (avg_over_time(temperature_celcius[5m]) > 27) or (temperature_celcius > 25 and count without (alertstate, alertname, priority) ALERTS{ alertstate="firing", alertname="temperature is above threshold" }) Hysteresis @roidelapluie
  • 23. temperature_celcius > 27 but... Computed threshold @roidelapluie
  • 24. - record: temperature_threshold_celcius expr: | 27+0*temperature_celcius{ location=~".*ambiant" } or 25+0*temperature_celcius Bonus: temperature_threshold_celcius can be used in grafana! Computed threshold @roidelapluie
  • 25. - alert: temperature is above threshold expr: | temperature_celcius > temperature_threshold_celcius Note: put threshold & alert in the same alert group Computed threshold @roidelapluie
  • 26. - alert: no more sms expr: sms_available < 39000 Absence @roidelapluie
  • 28. No metric = No alert! Metric is back = New alert! Absence @roidelapluie
  • 29. - record: sms_available_last expr: | sms_available or sms_available_last - alert: no more sms record: sms_available_last < 39000 - alert: no more sms data record: absent(sms_available) for: 1h Absence @roidelapluie
  • 32. Alertmanager receivers - name: "opsteam/mail" email_configs: - to: 'ops@inuits.eu' send_resolved: yes html: "{{ template "inuits.html.tmpl" . }}" text: "{{ template "inuits.txt.tmpl" . }}" headers: Subject: "{{ template "title.tmpl" . }}" Hint: Subject can be a template. Receivers @roidelapluie
  • 33. Alertmanager receivers - name: "opsteam/mail/noresolved" email_configs: - to: 'ops@inuits.eu' send_resolved: no html: "{{ template "inuits.html.tmpl" . }}" text: "{{ template "inuits.txt.tmpl" . }}" headers: Subject: "{{ template "title.tmpl" . }}" Same, but with send_resolved: no Receivers @roidelapluie
  • 34. Alertmanager receivers - name: "lotsOfPeople/mail" email_configs: - to: 'a@inuits.eu,b@inuits.eu,c@inuits.eu' headers: To: a@inuits.eu CC: b@inuits.eu Reply-To: support@inuits.eu c@inuits.eu is now BCC. Email: CC, BCC @roidelapluie
  • 35. Prometheus alert - alert: Not enough traffic expr: ... for: 5m labels: recipients: customer1/sms,opsteam/ticket annotations: summary: ... resolved_summary: ... Who gets the alert? @roidelapluie
  • 36. Alertmanager routing - receiver: "customer1/sms" match_re: recipient: "(.*,)?customer1/sms(,.*)?" continue: true routes: [...] - receiver: "opsteam/ticket" match_re: recipient: "(.*,)?opsteam/ticket(,.*)?" continue: true routes: [...] Who gets the alert? @roidelapluie
  • 37. Prometheus alert - alert: Not enough traffic expr: ... for: 5m labels: recipients: customer1/sms,opsteam/ticket send_resolved: "no" Resolved @roidelapluie
  • 38. Alertmanager routing - receiver: "customer1/sms" match_re: recipient: "(.*,)?customer1/sms(,.*)?" continue: true routes: - receiver: customer1/sms/noresolved match: send_resolved: "no" Resolved @roidelapluie
  • 39. Prometheus alert - alert: Not enough traffic expr: ... for: 5m labels: recipients: customer1/sms,opsteam/ticket repeat_interval: 1h Repeat interval @roidelapluie
  • 40. Alertmanager routing - receiver: "customer1/sms" match_re: recipient: "(.*,)?customer1/sms(,.*)?" continue: true routes: - receiver: customer1/sms repeat_interval: 1h match: repeat_interval: 1h Repeat interval @roidelapluie
  • 41. Some channels have speci c group_interval: 0s. Some channels always send_resolved: no. Some recipients have aliases (ticket+chat). Extra con gurations @roidelapluie
  • 42. Extract of amtool con g routes show ─ {recipient=~"^(?:(.*,)?jpivotto/mail(,.*)?)$"} continue: true receiver: jpivotto/mail    ├── {repeat_interval="15m",send_resolved="no"} receiver: jpivotto/mail/noresolved    ├── {repeat_interval="15m"} receiver: jpivotto/mail    ├── {repeat_interval="30m",send_resolved="no"} receiver: jpivotto/mail/noresolved    ├── {repeat_interval="30m"} receiver: jpivotto/mail    ├── {repeat_interval="1h",send_resolved="no"} receiver: jpivotto/mail/noresolved    ├── {repeat_interval="1h"} receiver: jpivotto/mail    ├── {repeat_interval="2h",send_resolved="no"} receiver: jpivotto/mail/noresolved    ├── {repeat_interval="2h"} receiver: jpivotto/mail    ├── {repeat_interval="4h",send_resolved="no"} receiver: jpivotto/mail/noresolved    ├── {repeat_interval="4h"} receiver: jpivotto/mail    ├── {repeat_interval="6h",send_resolved="no"} receiver: jpivotto/mail/noresolved    ├── {repeat_interval="6h"} receiver: jpivotto/mail    ├── {repeat_interval="12h",send_resolved="no"} receiver: jpivotto/mail/noresolved    ├── {repeat_interval="12h"} receiver: jpivotto/mail    ├── {repeat_interval="24h",send_resolved="no"} receiver:jpivotto/mail/noresolved    ├── {repeat_interval="24h"} receiver: jpivotto/mail    ├── {send_resolved="no"} receiver: jpivotto/mail/noresolved    └── {repeat_interval=""} receiver: jpivotto/mail Routes tree @roidelapluie
  • 43. Con g Management! Our input receivers: customer: email: to: [customer@example.com] cc: [service-management@inuits.eu] bcc: [ops@inuits.eu] sms: [+1234567890, +2345678901] chat: room: "#customer" How do we achieve it? @roidelapluie
  • 44. • Script that is deployed with AM • Knows all the recipients • Will validate alerts yaml • promtool • mandatory labels • validate receivers label • validate repeat_interval label Not possible to write alerts that go nowhere by accident. Con guration management @roidelapluie
  • 46. Prometheus alert - alert: a target is down expr: up == 0 for: 5m labels: recipients: customer1/sms,opsteam/ticket time_window: 13x5 Time frame @roidelapluie
  • 47. - record: daily_saving_time_belgium expr: | (vector(0) and (month() < 3 or month() > 10)) or (vector(1) and (month() > 3 and month() < 10)) or ( ( (month() %2 and (day_of_month() - day_of_week() > (30 + +month() % 2 - 7)) and day_of_week() > 0) or -1*month()%2+1 and (day_of_month() - day_of_week() <= (30 + month() % 2 - 7)) ) ) or (vector(1) and ((month()==10 and hour() < 1) or (month()==3 and hour() > 0 or vector(0) - record: belgium_localtime expr: | time() + 3600 + 3600 * daily_saving_time_belgium Timezone @roidelapluie
  • 48. hour(belgium_localtime) hour() and other time-functions can take a timestamp as argument. Belgian hour @roidelapluie
  • 49. - record: public_holiday expr: | vector(1) and day_of_month(belgium_localtime) == 25 and month(belgium_localtime) == 12 labels: name: Xmas Holidays @roidelapluie
  • 50. groups: - name: Easter Meeus/Jones/Butcher Algorithm interval: 60s rules: - record: easter_y expr: year(belgium_localtime) - record: easter_a expr: easter_y % 19 - record: easter_b expr: floor(easter_y / 100) - record: easter_c expr: easter_y % 100 - record: easter_d expr: floor(easter_b / 4) - record: easter_e expr: easter_b % 4 - record: easter_f expr: floor((easter_b +8 ) / 25) - record: easter_g expr: floor((easter_b - easter_f + 1 ) / 3) - record: easter_h expr: (19*easter_a + easter_b - easter_d - easter_g + 15 ) % 30 - record: easter_i expr: floor(easter_c/4) - record: easter_k expr: easter_c%4 - record: easter_l expr: (32 + 2*easter_e + 2*easter_i - easter_h - easter_k) % 7 - record: easter_m expr: floor((easter_a + 11*easter_h + 22*easter_l) / 451) - record: easter_month expr: floor((easter_h + easter_l - 7*easter_m + 114) / 31) - record: easter_day expr: ((easter_h + easter_l - 7*easter_m + 114) %31) + 1 Easter @roidelapluie Wikipedia CC-BY-SA-3.0
  • 51. - record: public_holiday expr: | vector(1) and day_of_month(belgium_localtime-86400) == easter_day and month(belgium_localtime-86400) == easter_month labels: name: Easter Monday - record: public_holiday expr: | vector(1) and day_of_month(belgium_localtime-40*86400) == easter_day and month(belgium_localtime-40*86400) == easter_month labels: name: Feast of the Ascension Easter @roidelapluie
  • 52. - record: business_day expr: | vector(1) and day_of_week(belgium_localtime) > 0 and day_of_week(belgium_localtime) < 6 unless count(public_holiday) - record: belgium_hour expr: | hour(belgium_localtime) - record: business_hour expr: | vector(1) and belgium_hour >= 8 < 18 and business_day Business hour @roidelapluie
  • 53. - record: extended_business_hour expr: | (vector(1) and belgium_hour >= 7 < 20 and business_day) - record: extended_business_hour_sat expr: | extended_business_hour or (vector(1) and belgium_hour >= 7 < 14 and day_of_week(belgium_localtime) == 6 unless count(public_holiday)) Extended business hours @roidelapluie
  • 54. (sum(rate(http_requests_total{code=~"5.."}[5m])) by (vhost) > 10 and on () business_hour) or (sum(rate(http_requests_total{code=~"5.."}[5m])) by (vhost) > 1 and sum(rate(http_requests_total{code=~"2.."}[5m])) by (vhost) < 1) and on () business_hour Thresholds depending on time @roidelapluie
  • 55. - record: daylight expr: | vector(1) and belgium_hour >= 8 < 18 - record: extended_daylight expr: | vector(1) and belgium_hour >= 7 < 20 Day and night @roidelapluie
  • 56. - alert: Time Window - Night expr: absent(daylight) labels: recipient: none - alert: Time Window - OBH expr: absent(business_hour) labels: recipient: none - alert: Time Window - Extended Night expr: absent(extended_daylight) labels: recipient: none Alerts @roidelapluie
  • 57. - alert: Time Window - Extended OBH with Saturday expr: absent(extended_business_hour_sat) labels: recipient: none - alert: Time Window - Extended OBH expr: absent(extended_business_hour) labels: recipient: none Alerts @roidelapluie
  • 58. At this point, we will have "meaningless" alerts at night and during business holidays. Alerts @roidelapluie
  • 59. Inhibition is a concept of suppressing noti cations for certain alerts if certain other alerts are already ring. Inhibition @roidelapluie Alertmanager documentation CC-BY-4.0
  • 60. Alertmanager inhibition inhibit_rules: - source_match: alertname: "Time Window - Night" target_match: time_window: 10x7 - source_match: alertname: "Time Window - OBH" target_match: time_window: 10x5 Inhibition @roidelapluie
  • 61. Alertmanager inhibition - source_match: alertname: "Time Window - Extended Night" target_match: time_window: 13x7 - source_match: alertname: "Time Window - Extended OBH" target_match: time_window: 13x5 - source_match: alertname: "Time Window - Extended OBH with Saturday" target_match: time_window: 13x6 Inhibition @roidelapluie
  • 63. Prometheus alert - alert: a target is down expr: up == 0 for: 5m labels: recipients_prod: customer1/sms,opsteam/ticket time_window_prod: 24x7 recipients: opsteam/chat time_window: 8x5 Per env recipients @roidelapluie
  • 64. Prometheus con g alerting: alert_relabel_configs: - source_labels: [time_window_prod,env] regex: "(.+);prod" target_label: time_window replacement: '$1' - source_labels: [time_window_dev,env] regex: "(.+);dev" target_label: time_window replacement: '$1' Repeat for other env, other labels. Per env recipients @roidelapluie
  • 65. Prometheus con g alerting: alert_relabel_configs: - source_labels: [time_window] regex: "never" action: drop Be careful about the order (time_window can be mutated by relabeling). Drop alert @roidelapluie
  • 67. • Alert-Writing experience is great with this • Prometheus and PromQL can do a lot • Con g management lls the "gaps" • With some e ort, we have everything we wanted Conclusion @roidelapluie
  • 68. - name: normal rate interval: 120s rules: - record: request_rate_history expr: | sum(rate(http_requests_total[5m])) by (env) labels: when: 0w Bonus @roidelapluie
  • 69. - record: request_rate_history expr: | ( sum(rate(http_requests_total[5m] offset 168h)) by (env) and on () ( daily_saving_time_belgium offset 1w == daily_saving_time_belgium) ) or ( sum(rate(http_requests_total[5m] offset 167h)) by (env) and on () ( daily_saving_time_belgium offset 1w < daily_saving_time_belgium) ) or ( sum(rate(http_requests_total[5m] offset 169h)) by (env) and on () ( daily_saving_time_belgium offset 1w > daily_saving_time_belgium) ) labels: when: 1w Bonus @roidelapluie
  • 70. - record: request_rate_normal expr: | max(bottomk(1, topk(4, request_rate_history) by(env) ) by(env)) by(env) Bonus @roidelapluie
  • 73. Really hope we will get rid of DST in 2021! Bonus @roidelapluie
  • 74. Julien Pivotto @roidelapluie roidelapluie@inuits.eu Essensteenweg 31 2930 Brasschaat Belgium Contact: info@inuits.eu +32-3-8082105