This document describes how Sensu, an open source monitoring tool, can be used to build dynamic monitoring systems. It outlines some of the limitations of traditional tools like Nagios, and how Sensu addresses them by allowing clients to define their own checks and sending check results to a server. The document demonstrates how Sensu can integrate with service metadata from tools like PaaSTA to generate alerts specific to a team or service. It encourages building monitoring that adapts to the exact deployment configuration to catch issues, like checking replication across latency zones. The purpose is to inspire building custom monitoring based on Sensu's flexible event handling and integration with existing service data.
2. Outline
● Let’s visit the dark ages
● How Sensu Works
● Special (open source) Yelp + Sensu Sauce
● Mini-Demo
● How PaaSTA Uses Sensu
● Second Demo
3. The Dark Ages
● One Word: Nagios
● Monitoring for Services: “Also Nagios”
● Probably alerts go to OPS anyway
● Probably just making sure the LB is up
● Very little developer visibility
● Hard to articulate to Nagios what you want
4. An Aside: Map Versus Territory
● Territory: The actual things in production running right now
● Map: What your monitoring system *thinks* is running right now
Who/What keeps these in sync?????
5. How Sensu Works
[Diagram: a Sensu client on some host publishes check results to RabbitMQ; the Sensu server asks the queue, “Any events for me to handle?”]
● Clients execute checks
● Servers don’t know what checks exist beforehand, they just operate on events
6. How Sensu Works - In Words
● Clients can Schedule and Execute checks,
but just put the results on the queue
● Servers handle results off the queue,
route them to things like email, pagerduty,
JIRA, etc.
● Also API, CLI, check history, silencing,
dashboard, etc.
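The flow above can be sketched in a few lines of Python. This is only an illustration of the idea, not Sensu's actual code: an in-process `queue.Queue` stands in for RabbitMQ, and the names `run_check`, `handle_results`, and `email_handler`, as well as the shape of the result message, are assumptions made for the example.

```python
import json
import queue


def run_check(name, command_fn, client="some-host"):
    """Client side: run a check and wrap its outcome as a result message."""
    status, output = command_fn()
    return json.dumps(
        {"client": client, "check": {"name": name, "status": status, "output": output}}
    )


def handle_results(results, handlers):
    """Server side: drain the queue and route non-OK results to every handler.

    The server has no list of checks; it only reacts to what arrives.
    """
    handled = []
    while True:
        try:
            message = results.get_nowait()
        except queue.Empty:
            break
        result = json.loads(message)
        if result["check"]["status"] != 0:
            for handler in handlers:
                handled.append(handler(result))
    return handled


def email_handler(result):
    """Hypothetical handler: returns a description instead of sending mail."""
    return "email: {} on {}".format(result["check"]["name"], result["client"])


results = queue.Queue()  # stand-in for RabbitMQ
results.put(run_check("check_disk", lambda: (2, "CRITICAL: disk 97% full")))
results.put(run_check("check_load", lambda: (0, "OK: load 0.3")))

notifications = handle_results(results, [email_handler])
```

The point of the design is visible in `handle_results`: adding a new check on a client requires no change on the server side.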
7. Special (Open Source) Yelp-Sensu Sauce
● https://github.com/Yelp/sensu_handlers
● “Smart” handlers that respond to Sensu
events based on the event data
● Team is the “primary key” when
determining what to do
8. Declare Your Teams
sensu_handlers::teams:
  dev:
    pagerduty_api_key: 1234
    pages_irc_channel: 'dev1-pages'
    notifications_irc_channel: 'devs'
  ops:
    pagerduty_api_key: 78923
    pages_irc_channel: 'ops-pages'
    notifications_irc_channel: 'operations-notifications'
    notification_email: 'operations@localhost'
    project: OPS
  hardware:
    # Uses the ops Pagerduty service for page-worthy events,
    # but otherwise just jira tickets
    pagerduty_api_key: 78923
    project: METAL
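To make "team is the primary key" concrete, here is a rough Python sketch of how a handler might turn an event plus the teams hash into a list of actions. The routing rules below are assumptions chosen for illustration, not the actual logic of sensu_handlers (which is written in Ruby); `plan_actions` is a hypothetical name.

```python
# The teams hash from the slide, mirrored as a Python dict.
TEAMS = {
    "dev": {
        "pagerduty_api_key": "1234",
        "pages_irc_channel": "dev1-pages",
        "notifications_irc_channel": "devs",
    },
    "ops": {
        "pagerduty_api_key": "78923",
        "pages_irc_channel": "ops-pages",
        "notifications_irc_channel": "operations-notifications",
        "notification_email": "operations@localhost",
        "project": "OPS",
    },
    "hardware": {
        "pagerduty_api_key": "78923",
        "project": "METAL",
    },
}


def plan_actions(event, teams=TEAMS):
    """Look up the event's team and decide where the event should go.

    Assumed rules: page only if the event asks for it, ticket only if the
    event asks for it and the team has a project, and always notify any
    IRC channel or email address the team has declared.
    """
    team = teams.get(event["team"])
    if team is None:
        raise KeyError("unknown team: {}".format(event["team"]))

    actions = []
    if event.get("page") and "pagerduty_api_key" in team:
        actions.append(("pagerduty", team["pagerduty_api_key"]))
        if "pages_irc_channel" in team:
            actions.append(("irc", team["pages_irc_channel"]))
    if event.get("ticket") and "project" in team:
        actions.append(("jira", team["project"]))
    if "notifications_irc_channel" in team:
        actions.append(("irc", team["notifications_irc_channel"]))
    if "notification_email" in team:
        actions.append(("email", team["notification_email"]))
    return actions


hardware_actions = plan_actions({"team": "hardware", "page": False, "ticket": True})
dev_actions = plan_actions({"team": "dev", "page": True, "ticket": False})
```

Under these assumed rules, a non-paging hardware event only produces a ticket in the METAL project, which matches the intent of the comment in the config above.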
9. Mini - Demo
What does it look like when you can
dynamically define checks on Sensu clients in
a team-centric way?
11.
{
  "name": "test_alert_for_kwa",
  "team": "kwa",
  "irc_channels": [],
  "notification_email": "kwa@dev.yelpcorp.com",
  "ticket": false,
  "project": false,
  "page": false,
  "output": "Test output from send-test-sensu-alert",
  "status": 2,
  "command": "send-test-sensu-alert"
}
What just happened?
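One way an event like this can reach Sensu is through the local Sensu client's input socket, which by default listens on 127.0.0.1 port 3030 and accepts JSON check results. The sketch below assumes that mechanism; the slides do not say how `send-test-sensu-alert` is implemented, and `build_check_result` and `send_to_local_client` are hypothetical helper names.

```python
import json
import socket


def build_check_result(name, team, output, status, **extra):
    """Build a JSON check result carrying the team-centric metadata."""
    result = {"name": name, "team": team, "output": output, "status": status}
    result.update(extra)
    return json.dumps(result)


def send_to_local_client(payload, host="127.0.0.1", port=3030):
    """Write the payload to the local Sensu client socket (assumed defaults)."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(payload.encode("utf-8"))


payload = build_check_result(
    "test_alert_for_kwa",
    "kwa",
    "Test output from send-test-sensu-alert",
    2,  # 2 is the conventional "critical" exit status
    page=False,
    ticket=False,
    notification_email="kwa@dev.yelpcorp.com",
)

# Only meaningful on a host that is running a Sensu client:
# send_to_local_client(payload)
```

Because the event names its own team and notification settings, nothing about this check has to be configured on the server in advance.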
12. How PaaSTA Uses Sensu
● Take advantage of Sensu’s ability to
receive arbitrary events
● We already know which team owns each
service (started documenting that with the
soa-configs)
● We already know where services are
deployed and what latency zones they are
in
13. Sensu + PaaSTA Demo
What if your monitoring system knew all
about your services and how they are
supposed to be deployed?
15. What just happened?
● We “went behind PaaSTA’s back” to simulate a failure
of an AZ
● We got a replication alert because one of the latency
zones didn’t meet our expected replication count. (0
out of 3)
● We decided to “remediate” it by expanding our
latency zone to “region”
● PaaSTA “made it so”, and our alert resolved and the
status command reflected the fact that we are
expecting 6 in that one region
16. How Did Sensu “Know”?
Is this a Problem?
What should I do About it?
17. How Did Sensu “Know”?
● Sensu doesn’t “Know” anything except for
the “Teams” metadata hash
● PaaSTA checks HAProxy in each latency
zone because it can read the same SOA
configs that SmartStack does!
● PaaSTA “Knows” which team owns each
service because we told it in SOA configs!
● Sensu just processes the event like normal
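The division of labor described above can be sketched as a small Python function: the check itself knows the expected replication and the owning team, and Sensu only receives the resulting event. This is a simplified illustration, not PaaSTA's actual implementation; `check_replication`, the zone names, and the way available counts are supplied (rather than queried from HAProxy) are all assumptions.

```python
def check_replication(service, team, expected_per_zone, available_by_zone):
    """Compare available instances per latency zone against the expectation.

    Returns an event dict that a Sensu client could forward as-is; the
    "team" field is what the handlers later use for routing.
    """
    under = {
        zone: count
        for zone, count in sorted(available_by_zone.items())
        if count < expected_per_zone
    }
    if under:
        status = 2
        details = ", ".join(
            "{}: {} out of {}".format(zone, count, expected_per_zone)
            for zone, count in under.items()
        )
        output = "CRITICAL: {} is under-replicated ({})".format(service, details)
    else:
        status = 0
        output = "OK: {} has at least {} instances in every zone".format(
            service, expected_per_zone
        )
    return {
        "name": "check_replication_{}".format(service),
        "team": team,
        "status": status,
        "output": output,
    }


# Hypothetical numbers echoing the demo: one zone drops to 0 out of 3.
failing = check_replication("example_service", "dev", 3, {"zone-a": 3, "zone-b": 0})
healthy = check_replication("example_service", "dev", 3, {"zone-a": 3, "zone-b": 3})
```

In a real setup the expected count and the team would come from the SOA configs, and the available counts from HAProxy in each zone.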
18. Conclusion
● Use a monitoring system that can receive
and process arbitrary events for easy
integration (Sensu)
● Keep service metadata in an easy-to-access
place for pieces to integrate easily (SOA
configs)
● Monitor the exact thing you care about
(replication in each latency zone)
19. Reading Comprehension Question:
(What was the purpose of this talk?)
A. To Describe how cool Sensu is
B. To Make viewers feel inadequate about their own Nagios
installation
C. To tease viewers about Sensu glue that is not open
source yet
D. To Inspire viewers to build their own dynamic
Monitoring based on some of these ideas!
E. Other?