As the Yelp infrastructure and engineering team grew, so did the pain of managing Nagios. Problems like splitting alerting across multiple teams, providing high availability, and managing Nagios systems in multiple environments had become pressing. As we moved towards a service-oriented architecture and pushed some services out into the cloud, we rapidly needed more automated monitoring configuration.
An evolutionary solution wasn’t going to solve all of our problems; we needed to revolutionize our monitoring. Sensu is built from the ground up to solve many of our issues and to be easy to extend.
This talk covers our Puppet ‘monitoring_check’ API (which sets up monitoring for our services within Puppet), how and why we deploy Sensu, and our custom handlers and escalations. It also covers how we provide automatic ‘self service’ monitoring for dynamic services and how we deal with the challenges posed by the more ephemeral nature of cloud architectures.
7. Cycle of failure and disappointment
• Manually edited and deployed monitoring
• Changes require two teams
• Low developer visibility about production
• Escalation of issues is hard
• Ops ignore alerts from services
• Postmortems
• High friction, low trust, low visibility.
12. The Robbins-Gioia Survey (2001)
“51% viewed their ERP implementation as unsuccessful”
13. The Conference Board Survey (2001)
“40% of the projects failed to achieve their business case within one year of going live”
14. McKinsey & Company in conjunction with the University of Oxford (2012)
• “17 percent of large IT projects go so badly that they can threaten the very existence of the company”
• “On average, large IT projects run 45 percent over budget and 7 percent over time, while delivering 56 percent less value than predicted”
15. Failure is an option
- blog.parasoft.com/single-greatest-barrier-with-sw-delivery
18. Why Sensu?
• Designed to be pluggable / extensible
• Arbitrary check metadata
• Simple model
• Components do exactly one thing
• Ruby
• Not afraid to extend (or fork!)
26. How we use Sensu
• Don’t use all of this!
• ‘Standalone’ checks only
• Default in the Puppet module
27. Sensu data flow
• Sensu client runs checks on each machine
• Pushes results to RabbitMQ
• Clustered, clients/messages will fail over.
• Sensu server (multiple, HA)
• Processes check results, invokes handlers
• Writes state to Redis
• Redis + Sentinel
• Read by API (2 instances)
• All layers behind HAProxy
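The "invokes handlers" step above can be illustrated with a small routing function. This is a minimal sketch, not the code from Yelp's sensu_handlers: it assumes the usual Sensu event shape (a JSON object with a `check` section containing `status`), while the `page` and `team` metadata keys, the route strings, and the function names are hypothetical.

```python
import json


def route_event(event):
    """Pick a notification route from an event's check metadata.

    The `page` and `team` keys are hypothetical custom metadata.
    """
    check = event.get("check", {})
    if check.get("status", 0) == 0:
        # Status 0 means the check is passing again.
        return "resolve"
    team = check.get("team", "unknown")
    if check.get("page", False):
        return "page:" + team
    return "ticket:" + team


def handle(stream):
    """Read one JSON event from a file-like object and return its route."""
    return route_event(json.load(stream))
```

A pipe-style handler would call `handle(sys.stdin)`, since the server passes the event to the handler process as JSON on standard input.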
28. Quis custodiet ipsos custodes?
“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”
42. Check scripts
• Same as Nagios checks
• Simple (text) output
• Exit code
• Result sent to server, along with check definition
• Including all the custom metadata
• Our handlers use the extra data.
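The check contract described above (one line of text output plus a Nagios-style exit code) can be sketched in Python. This is an illustration only, not one of the checks from the talk; the function names and the 80%/90% thresholds are assumptions chosen for the example.

```python
import shutil

# Conventional Nagios plugin exit codes, which Sensu checks also use.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to (exit_code, message). Thresholds are illustrative."""
    if percent_used >= crit:
        return CRITICAL, f"CRITICAL: disk {percent_used:.1f}% used"
    if percent_used >= warn:
        return WARNING, f"WARNING: disk {percent_used:.1f}% used"
    return OK, f"OK: disk {percent_used:.1f}% used"


def check_disk_usage(path="/"):
    """Measure disk usage for `path` and evaluate it."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN: cannot stat {path}: {exc}"
    return evaluate(100.0 * usage.used / usage.total)


def main():
    """Print the text output and return the exit code."""
    code, message = check_disk_usage()
    print(message)
    return code

# As a standalone check script, finish with:
#     import sys; sys.exit(main())
```

The exit code is what the server treats as the check status; the printed line is the human-readable output that accompanies it.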
44. How do checks get run?
• Every machine runs the client.
• Client managed by Puppet
• Client has a TCP socket you can send JSON to
• Custom checks + pysensu-yelp
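The client socket mentioned above can be exercised with a few lines of Python. This sketch assumes the Sensu client's default local socket at 127.0.0.1:3030 and a payload with `name`, `status`, and `output` keys; the function names are hypothetical, and extra keyword arguments stand in for custom metadata. It is not the pysensu-yelp API.

```python
import json
import socket


def build_result(name, status, output, **metadata):
    """Build a check result dict; extra keyword arguments become custom metadata."""
    result = {"name": name, "status": status, "output": output}
    result.update(metadata)
    return result


def send_result(result, host="127.0.0.1", port=3030, timeout=5.0):
    """Send one JSON-encoded result to the local Sensu client socket.

    The host and port defaults are assumptions about the client's configuration.
    """
    payload = json.dumps(result).encode("utf-8")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
```

For example, `send_result(build_result("my_batch_job", 2, "CRITICAL: job failed", team="search"))` would report a failing check from inside an application, with `team` as an illustrative metadata key.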
55. User specified monitoring
• Data lives in the service config
• Next to the code to emit metrics
• Next to metadata about SLAs and LB timeouts
• Developers can push changes without Ops
56. Cluster checks
• We’re working on this currently
• Assert some % of machines are healthy.
• Use to reduce alert noise.
• If a service becomes fully unavailable to clients, you want to page someone.
• If one machine goes belly up, you don’t (make a JIRA ticket for handling later!)
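The cluster-check idea above can be sketched as a small decision function. The 50% threshold, the action names, and the function name are assumptions for illustration; the talk describes this feature as still in progress and does not specify how it is implemented.

```python
def cluster_action(healthy, total, min_healthy_pct=50.0):
    """Decide how to react given how many machines pass their check.

    Returns "ok" (all healthy), "ticket" (some machines down but enough
    remain healthy), or "page" (fewer than min_healthy_pct are healthy).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= healthy <= total:
        raise ValueError("healthy must be between 0 and total")
    healthy_pct = 100.0 * healthy / total
    if healthy_pct < min_healthy_pct:
        return "page"
    if healthy < total:
        return "ticket"
    return "ok"
```

With these assumed defaults, one failed machine out of ten yields a ticket, while a fully unavailable service yields a page.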
57. WIP
• This is all still a work in progress.
• We’ve not 100% migrated off of Nagios
• Open sourcing the pieces
58. Thanks!
• Slides will be online shortly:
• slideshare.net/bobtfish
• @bobtfish
• Some (most?) of our code is open source:
• https://github.com/Yelp/sensu/commit/aa5c43c2fdfde5e8739952c0b8082000934f3ad2
• https://github.com/Yelp/puppet-monitoring_check
• https://github.com/Yelp/puppet-netstdlib
• https://github.com/Yelp/sensu_handlers
• https://github.com/Yelp/pysensu-yelp