4. Current Tools
●
Metrics collection: collectd, StatsD, CloudWatch
●
Monitoring: Sensu, NewRelic
●
Alert channels: PagerDuty, email, Slack
●
Dashboards: Grafana, CloudWatch, NewRelic
●
Application testing: E2E Testing System
●
Internal tools: Sensu mobile, events system, Sensu bar and more
5. Increasing Coverage
●
Automatic collection of basic system and 3rd-party metrics for new instances
●
Add alerts automatically for each new instance of an existing subscriber
●
Each developer / DevOps engineer is responsible for monitoring their own application / infrastructure
●
Easy method to add new alerts and dashboards
●
Automatic events flow
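As a rough illustration of the "alerts added automatically for a new instance" idea above, the sketch below derives the checks a new instance inherits from its subscriptions. The `BASE_CHECKS` table, the check names, and the `Check` type are hypothetical and are not taken from Sensu or any other tool named in this deck.

```python
from dataclasses import dataclass

# Hypothetical baseline checks per subscription (illustrative names only).
BASE_CHECKS = {
    "base": ["cpu", "memory", "disk"],
    "web": ["http_5xx_rate", "request_latency"],
    "couchbase": ["couchbase_node_health"],
}


@dataclass(frozen=True)
class Check:
    instance: str
    name: str
    owner: str  # team or person responsible for this alert


def checks_for_instance(instance, subscriptions, owner):
    """Return the checks a new instance inherits from its subscriptions.

    Every instance gets the "base" checks first; duplicates are skipped.
    """
    checks = []
    seen = set()
    for sub in ["base", *subscriptions]:
        for name in BASE_CHECKS.get(sub, []):
            if name not in seen:
                seen.add(name)
                checks.append(Check(instance, name, owner))
    return checks
```

With this shape, bringing up a new instance of an existing subscriber needs no manual alert setup: the instance only has to declare its subscriptions and an owner.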
6. Pager Schedules
●
Divided into logical groups of ownership
●
Each schedule has an escalation point
●
On call should be able to connect and respond to issues in their area
●
Easy method to override schedule
●
Ability to contact relevant on call
●
Ability to page relevant on call
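The schedule properties listed above (ownership groups, an escalation point, an easy override) can be sketched as a small data model. This is a minimal sketch under assumed names: `Schedule`, `page_order`, and `next_responder` are hypothetical and do not reflect the PagerDuty API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Schedule:
    group: str                      # logical ownership group, e.g. "db"
    primary: str                    # current on call
    escalation: list = field(default_factory=list)  # who to page next, in order
    override: Optional[str] = None  # temporary replacement for the primary

    def page_order(self):
        """Override (if set) replaces the primary; escalation follows."""
        first = self.override or self.primary
        return [first] + [p for p in self.escalation if p != first]


def next_responder(schedule, unacked_pages):
    """Pick who to page after `unacked_pages` pages went unacknowledged.

    Stays on the last escalation point once the list is exhausted.
    """
    order = schedule.page_order()
    return order[min(unacked_pages, len(order) - 1)]
```

For example, setting `override` on a schedule changes only the first person paged and leaves the escalation chain untouched, which keeps overrides cheap to apply and to revert.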
7. Automatic Self Healing
●
Better MTTR
●
Avoid waking the on call if possible
●
Log healing activity to surface recurring issues
●
Limit the healing to avoid restart loops
●
Make sure to keep Healer ↔ Alert in sync
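One possible way to limit healing and log it, as the points above suggest, is a sliding-window cap on restarts per service. The sketch below is an assumption-laden illustration: the `Healer` class, its parameters, and the default limits are hypothetical, and the `restart` callable stands in for whatever remediation action is actually used.

```python
import time
from collections import defaultdict, deque


class Healer:
    """Restart a service on alert, but give up after too many recent attempts."""

    def __init__(self, restart, max_attempts=3, window_seconds=600,
                 clock=time.monotonic):
        self.restart = restart            # callable taking a service name
        self.max_attempts = max_attempts  # allowed restarts per window
        self.window = window_seconds
        self.clock = clock                # injectable for testing
        self.attempts = defaultdict(deque)
        self.log = []                     # (time, service, outcome) records

    def handle(self, service):
        """Return "healed" or "escalate"; the alert side acts on the result."""
        now = self.clock()
        recent = self.attempts[service]
        # Forget attempts that fell out of the sliding window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        if len(recent) >= self.max_attempts:
            # Too many restarts recently: stop looping and page a human.
            self.log.append((now, service, "escalated"))
            return "escalate"
        recent.append(now)
        self.restart(service)
        self.log.append((now, service, "healed"))
        return "healed"
```

Returning an explicit outcome is one way to keep the healer and the alert in sync: the alert is resolved or escalated based on what the healer reports, and the log makes recurring issues visible later.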
8. Bots, Integrations and Alert Channels
●
Alert channels: email, Slack, PD mobile, SMS, calls
●
Integrations: Sensu to PD/Slack, CloudWatch to PD, 3rd party (e.g. Couchbase, NewRelic, etc.) to PD
●
Slack Bot:
9. Events Dashboard
●
Simple REST API for sending events
●
Clean timeline view to spot production events
●
Connections between events (“depends on” and “dependents”)
●
Detailed view for each event
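The event model described above (a timeline plus "depends on" and "dependents" links) can be sketched in memory as follows. The `Event` and `EventStore` names and fields are hypothetical; the HTTP layer of the REST API is left out, and the deck does not describe the real system's schema.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    id: str
    title: str
    timestamp: float
    depends_on: list = field(default_factory=list)  # ids of upstream events


class EventStore:
    """In-memory stand-in for the events backend."""

    def __init__(self):
        self.events = {}

    def add(self, event):
        self.events[event.id] = event

    def timeline(self):
        """All events, oldest first, as a timeline view would show them."""
        return sorted(self.events.values(), key=lambda e: e.timestamp)

    def dependents(self, event_id):
        """Ids of events that declare a dependency on `event_id`."""
        return [e.id for e in self.timeline() if event_id in e.depends_on]
```

Storing only the "depends on" direction and deriving "dependents" on read keeps the payload a sender has to post small, at the cost of a scan per lookup.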
10. Accessibility
●
Available from everywhere by mobile
●
Easy to ack, resolve, mute alerts
●
Slack bots to reach help
●
Automatically get graph with the alert
●
Ability to search, edit, copy, etc. alerts
●
Treat alert management as code (SVC, DB, backups, etc.)
11. Best Practices Summary
●
Share the pain
●
Automate base metrics
●
Automate healing
●
Make help reachable
●
Make it easy to add alerts and dashboards
●
Use warning levels as soft events to avoid phone calls at night
●
Automate graphs in alerts
●
Run a positive check of the alerting system every day
●
Dependencies between alerts
●
Postmortems
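The "warning levels as soft events" practice above could be expressed as a severity-to-channel routing table. This is a minimal sketch: the severity names, the channel names, and the choice to treat unknown severities as critical are assumptions, not the configuration used in the deck.

```python
# Hypothetical routing table: only "critical" reaches the paging channel.
ROUTES = {
    "warning": ["slack", "email"],
    "critical": ["slack", "email", "pagerduty"],
}


def channels_for(severity):
    """Channels for a severity; unknown levels fail loud and page."""
    return ROUTES.get(severity.lower(), ROUTES["critical"])


def wakes_on_call(severity):
    """True if an alert of this severity would trigger a phone call or page."""
    return "pagerduty" in channels_for(severity)
```

Under this routing, a warning still leaves a trace in Slack and email for the morning, while only critical alerts interrupt the on call at night.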