This document discusses moving beyond the limitations of Nagios for infrastructure monitoring. Nagios is an industry standard, but its configuration can be daunting and is not user-friendly. The document describes common problems with Nagios: teams become overwhelmed by alerts, enter a "spiral of death" in which adding more checks leads to more alerts, and end up in the "trough of despair" when they try to improve coverage through still more checks. It recommends ways to improve the situation: measuring data, looking for patterns in alerts, putting alerts in context visually, and focusing on business impact.
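As a rough illustration of "looking for patterns in alerts", the sketch below counts hard WARNING and CRITICAL service alerts per host and service. It assumes alert lines shaped like the usual Nagios log entries (`[epoch] SERVICE ALERT: host;service;state;state_type;attempt;output`). The function name `count_alerts`, the sample hosts, and the sample lines are all hypothetical and not taken from the slides.

```python
import re
from collections import Counter

# Assumed line shape (not confirmed by the slides):
# "[epoch] SERVICE ALERT: host;service;state;state_type;attempt;output"
SERVICE_ALERT = re.compile(
    r"^\[(?P<ts>\d+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>[^;]+);(?P<attempt>\d+);(?P<output>.*)$"
)


def count_alerts(lines, states=("WARNING", "CRITICAL"), state_type="HARD"):
    """Count matching service alerts per (host, service) pair."""
    counts = Counter()
    for line in lines:
        match = SERVICE_ALERT.match(line.strip())
        if match is None:
            continue  # not a service alert line (e.g. a host alert)
        if match["state"] in states and match["state_type"] == state_type:
            counts[(match["host"], match["service"])] += 1
    return counts


# Hypothetical sample data, for illustration only.
SAMPLE_LOG = [
    "[1310682467] SERVICE ALERT: web01;HTTP;CRITICAL;HARD;3;Connection refused",
    "[1310682527] SERVICE ALERT: web01;HTTP;OK;HARD;3;HTTP OK",
    "[1310682900] SERVICE ALERT: web01;HTTP;CRITICAL;HARD;3;Connection refused",
    "[1310683000] SERVICE ALERT: db01;Disk;WARNING;SOFT;1;DISK WARNING",
    "[1310683100] SERVICE ALERT: db01;Disk;WARNING;HARD;3;DISK WARNING",
    "[1310683200] HOST ALERT: db02;DOWN;HARD;3;CRITICAL - Host unreachable",
]

if __name__ == "__main__":
    # Noisiest host/service pairs first.
    for (host, service), n in count_alerts(SAMPLE_LOG).most_common():
        print(f"{host}/{service}: {n}")
```

Sorting by count puts the noisiest checks first, which is one simple way to decide which alerts to tune or remove before adding more.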
11. FIGHT OR FLIGHT
Figure: quality of life plotted against # of alerts. Few checks, few alerts. More checks, too many alerts.
12. THE TROUGH OF DESPAIR
Figure labels: Effective checks; n^2; Coverage; Fault-tolerant; Less urgency; Few checks, few alerts, every host counts; More checks, too many alerts, every host still counts; Scale; Complexity.
19. PUT ALERTS IN CONTEXT
https://app.datad0g.com/dash/dash/1000#/date_range/1310682467000.0-1310684267000.0
20. Ultimate (hard) question
‣Does this alert impact the business?
‣If so by how much?
‣Assumes that you track business metrics...
‣And they can be accessed programmatically
FOCUS ON THE BUSINESS
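One way to approach "by how much?" is to compare a business metric during the alert window with a baseline taken just before it. The sketch below is a minimal illustration under stated assumptions: the metric is already available as a list of `(epoch_seconds, value)` samples, and the function name `business_impact`, the 30-minute baseline, and the sample numbers are hypothetical. How the samples are fetched depends on wherever the business metrics are tracked and is not shown.

```python
from statistics import mean


def business_impact(metric, alert_start, alert_end, baseline_seconds=1800):
    """Relative change of a metric during an alert versus the period before it.

    metric: list of (epoch_seconds, value) samples.
    Returns a fraction (e.g. -0.2 for a 20% drop), or None when there is
    not enough data to compare.
    """
    baseline = [
        value for ts, value in metric
        if alert_start - baseline_seconds <= ts < alert_start
    ]
    during = [value for ts, value in metric if alert_start <= ts <= alert_end]
    if not baseline or not during:
        return None
    base = mean(baseline)
    if base == 0:
        return None  # relative change is undefined against a zero baseline
    return (mean(during) - base) / base


# Hypothetical data: orders per minute, one sample every 60 seconds.
ALERT_START = 1_310_682_467
ALERT_END = ALERT_START + 600
ORDERS_PER_MINUTE = (
    [(ALERT_START - 60 * i, 100.0) for i in range(1, 31)]   # before the alert
    + [(ALERT_START + 60 * i, 80.0) for i in range(0, 10)]  # during the alert
)

if __name__ == "__main__":
    change = business_impact(ORDERS_PER_MINUTE, ALERT_START, ALERT_END)
    print(f"Change in orders per minute during the alert: {change:+.0%}")
```

A change near zero suggests the alert had little business impact, while a large drop is a reason to treat it as urgent. A simple before/during comparison ignores seasonality, so comparing against the same time window on a previous day or week may be a better baseline in practice.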
21. What applies to Nagios...
Applies to other sources too
etc...