2. Who the hell am I?
Dave Tibbs
@LowlySysadm1n
● Systems Administrator at Brightpearl Inc
● Started at Brightpearl UK in October 2010
● Back then, only about 20 people in the company
  – I was the only Systems Administrator/General IT Dogsbody
● ~7 years experience as Sysadmin working with various flavours of Linux
3. Monitoring – who needs it anyway?
● Basically everyone – if you're running production software that people depend on, you need to know what's going on with your servers
● You can't rely on screaming users to let you know when things go wrong
● Certain metrics can be a very good indicator of failures before they happen – think disk space, memory consumption, failed backups, web requests/sec, etc.
5. Right, better get some monitoring. Nagios, then?
● Reputation of being the default, safe choice
● Claims to be “Industry Standard” on their website
● Historically people were put off by extortionate costs of enterprise software (e.g. HP Openview) – now cloud-based software still requires a subscription.
● Hey, Nagios is free.
● Neckbeards rejoice – it's open source.
6. In the beginning, it was joyous.
● MONITOR ALL TEH THINGZ
● (Relatively) low server count means it was still manageable. Easy to tune alerts to specific servers.
● All the plugins you can imagine means we could monitor RDS instances, internal office servers, UPS, etc.
● Email alerts for warnings keep us abreast of things that might happen
● Pagerduty integration for critical alerts
● Configuration assisted with Chef.
7. But then...
● As the number of servers increases, so does the configuration required
● ...and so do the spurious alerts, where the thresholds aren't so simple to set. Hosting cost constraints mean sometimes running close to the wire on some servers but not others.
● Because of this, NAGIOSAGEDDON in your email inbox. Soon enough, everyone's ignoring the alerts, especially the warnings. And especially if stuff is still working.
8. A quick note on Nagios checks.
● Monitoring host sends check command over NRPE and waits for a response
● The queue of checks is processed one by one – if networking to certain hosts is slow, the whole list takes longer to process.
● If the list of checks doesn't get processed before the next check is due...
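To make the queueing problem concrete, here is a toy model of the behaviour described on this slide. It is only a sketch, not Nagios's real scheduler: the host names, latencies and the 10-second interval are made up for illustration.

```python
def simulate_check_cycle(checks, interval):
    """Run checks strictly one after another and split them into
    those that finish inside the interval and those that overrun it."""
    elapsed = 0.0
    on_time, late = [], []
    for host, latency in checks:
        elapsed += latency  # each check blocks the ones queued behind it
        (on_time if elapsed <= interval else late).append(host)
    return elapsed, on_time, late


# Hypothetical hosts: db1 sits behind a slow network link.
checks = [("web1", 0.5), ("web2", 0.5), ("db1", 20.0), ("web3", 0.5)]
total, on_time, late = simulate_check_cycle(checks, interval=10.0)

print(total)    # 21.5 seconds for a cycle that should fit in 10
print(on_time)  # ['web1', 'web2']
print(late)     # ['db1', 'web3']
```

One slow host is enough to push every check queued behind it past the deadline, even though web3 itself answers quickly.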
9. So Nagios sucks then?
● Well, Nagios gets some things right:
  ● The plugin model is simple (4 exit codes!) and reasonably well-designed
  ● It's pretty reliable
  ● SSL Support = secure
● If you're running a small office/datacentre with servers and requirements that rarely or never change, it works – but still with a lot of painful setting up
● But as soon as you deviate from this, it all goes wrong.
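The four exit codes mentioned above are 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN). As a rough illustration of how little a plugin needs to do, here is a minimal disk-usage check sketched in Python. The thresholds, message wording and function names are assumptions for this example, not taken from any shipped plugin.

```python
import shutil

# The four plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(percent_used, warn=80.0, crit=90.0):
    """Map a disk usage percentage to (exit_code, status_line)."""
    if percent_used >= crit:
        return CRITICAL, f"DISK CRITICAL - {percent_used:.1f}% used"
    if percent_used >= warn:
        return WARNING, f"DISK WARNING - {percent_used:.1f}% used"
    return OK, f"DISK OK - {percent_used:.1f}% used"


def check_disk(path="/"):
    """Check one filesystem; anything we cannot measure is UNKNOWN."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    return evaluate(100.0 * usage.used / usage.total)


code, message = check_disk("/")
print(message)
# A real plugin would finish with sys.exit(code) so the
# monitoring system can read the status from the exit code.
```

The same script should work unchanged under anything that understands this exit-code convention, which is why existing Nagios plugins can be reused elsewhere.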
10. Yes, basically Nagios sucks.
● A lot has changed in the IT world in 15 years – Nagios hasn't.
● It's completely unscalable. There is no such thing as a Nagios cluster. More checks = more server load on master
● The configuration format is horrible – chef/puppet only slightly dulls the pain
● It has a horrendous interface – even if you pay for Nagios XI, which isn't cheap
● It assumes a static infrastructure, which in the days of Cloud is almost never the case.
● Configuration has to be duplicated in two places
11. So what to do?
● Reached the limit of Nagios pain – determined to shake the Stockholm Syndrome we all appear to have
● Alerts are pretty much ignored by all; once the flood gets large enough they WILL end up filtered. Nagios has been stopped for days without anybody noticing.
● A monitoring system that people ignore is utterly pointless.
● Started to investigate alternatives.
12. Alternatives to Nagios
● Nagios XI – $$$ and apparently not much better.
● Zabbix – Not as much support as Nagios, lots of people seem to think it's worse. Configuration possibly even more complex
● ZenOSS – Confusing config, issues with false positives and massive numbers of alerts
● Then I found Sensu.
13. What is this Sensu then?
● Much, much better model (queue-subscriber)
● Purpose-built for this, best tool for the job. Think Graphite for graphing, PagerDuty for alerting.
● Supports existing Nagios plugins
● Integrates with Graphite, PagerDuty
● Easy to scale – automatically handles clustering.
● Great REST API – you can do most things with it
14. No really, what is it?
● Often described as a “monitoring router”
● Results of “check” scripts are passed onto one or more handlers, depending on certain conditions
● Written in Ruby (yay!)
● Configuration is all in JSON
● Four main components:
  ● Server
  ● Client
  ● API
  ● Dashboard
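The “monitoring router” idea can be sketched in a few lines. This is an illustration only, written in Python rather than Ruby, and it is not Sensu's actual code: the `route` function, the result fields and the handler names are all assumptions made up for the example.

```python
def route(result, handlers):
    """Pass a check result to every handler it names, but only
    when the check reported a problem (non-zero status)."""
    if result["status"] == 0:  # 0 = OK, nothing to route
        return []
    delivered = []
    for name in result.get("handlers", ["default"]):
        handler = handlers.get(name)
        if handler is not None:
            handler(result)
            delivered.append(name)
    return delivered


sent = []  # stands in for real emails and pages
handlers = {
    "email": lambda r: sent.append(("email", r["check"])),
    "pagerduty": lambda r: sent.append(("pagerduty", r["check"])),
}

route({"check": "disk", "status": 1, "handlers": ["email"]}, handlers)
route({"check": "http", "status": 2, "handlers": ["email", "pagerduty"]}, handlers)
route({"check": "load", "status": 0, "handlers": ["email"]}, handlers)

print(sent)
# [('email', 'disk'), ('email', 'http'), ('pagerduty', 'http')]
```

The check itself knows nothing about email or paging; it reports a status and the router decides where that result goes.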
15. Compared to Nagios, this is good
● Hosting our infrastructure in the cloud, we need our monitoring solution to be
  ● able to cope with changing instances/infrastructure
  ● aware of new servers without us having to remember to tell it
  ● able to cope with possibly rapid expansion
● Sensu fulfills these objectives reasonably well.
16. So is Sensu perfect?
● No, nothing is.
● The dashboard is immature – basically still a bit rubbish
● Current release is only version 0.12 – so the whole software itself is fairly immature.
● Fairly complicated install process, with dependencies on quite a bit more than Nagios. It's been Chef'd (and Chef'd well) but seems easy for these dependencies to break with version inconsistencies.
17. But it's still immeasurably better.
● It'll scale well when our infrastructure expands
● Has performed great in a test environment
● Looking forward to rolling it out to production!
Editor's Notes
EXPLAIN WHY NAGIOS CHECKS ARE BAD – an NRPE check is fired to each server; the more checks, the more they queue up. A check can fire off on a server before the previous one has completed – never get a result back.
Chef kind of helps with configuration, but not by a lot. As there are more servers, there are more exceptions not covered so easily by configuration management.
What follows NAGIOSAGEDDON? Mail queue overload and eventual crash. Alerts stop altogether, which nobody notices, because they're ignoring them.
If the list of checks doesn't get processed before the next check is due... we may never get results back for the later checks in the list.
Or, consider that the server is able to process the checks required within the time “window” (e.g. 1 minute for checks that are made every minute) – what if the number of checks is doubled? Tripled?
Reliability – when was the last time you saw the Nagios daemon crash? It's usually things external to Nagios that are the problem.
Painful setting up – there are bolt-ons like Groundworks to improve setting up but they're not that much better than arsing about with configuration files
Deviation = non-static hostnames in the cloud. Generally in a datacentre most is static.
A lot has changed in 15 years – the biggest changes are a) everyone's running more servers and more services, and b) most people are relying on the cloud = many, many non-static IP addresses.
Nagios is 15 years old, give or take – released in 1999 and the design hasn't changed much in years. It's not fair to expect them to predict the changes back then, but neither has the software moved with the times.
Configuration duplication – the server has to be aware of what checks it wants clients to make, the client has to be aware of what checks it's going to be expected to run. Absolutely crazy setup.
Stockholm syndrome not just in our company or even with me – everyone seems to have it. Reference everyone defending Nagios when it's basically shit.
“Sensu” from the Japanese word for “fan” - relates to the “fanout exchange”, one of the exchange types used by RabbitMQ.
Server – orchestrates check executions, processes the results, and handles events from results to handlers. You can run more than one server and tasks are distributed amongst them
Client – Receives check execution requests, executes the checks, and publishes the results.
API – Provides a REST-like interface to Sensu data, such as registered clients and current events. You can run more than one.
Dashboard – UI for Sensu. Not great.
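A rough sketch of how the server and client roles above fit together through subscriptions. Everything here is a stand-in: the in-memory `Broker` takes the place of the real message queue, and the class names, subscription names and check name are hypothetical rather than Sensu's own.

```python
from collections import defaultdict


class Client:
    """Receives a check request, runs it, and returns a result."""

    def __init__(self, name):
        self.name = name

    def run(self, check_name):
        # A real client would execute a plugin here; we pretend it passed.
        return {"client": self.name, "check": check_name, "status": 0}


class Broker:
    """In-memory stand-in for the message queue (assumption: not RabbitMQ)."""

    def __init__(self):
        self.subscribers = defaultdict(list)
        self.results = []

    def subscribe(self, subscription, client):
        self.subscribers[subscription].append(client)

    def publish_check(self, subscription, check_name):
        # Fan the request out to every client on this subscription.
        for client in self.subscribers[subscription]:
            self.results.append(client.run(check_name))


broker = Broker()
broker.subscribe("webservers", Client("web1"))
broker.subscribe("webservers", Client("web2"))
broker.subscribe("databases", Client("db1"))

broker.publish_check("webservers", "check_http")
print([r["client"] for r in broker.results])  # ['web1', 'web2']
```

The point of the model is that a new web server only has to subscribe to "webservers" to start receiving checks; nothing on the publishing side needs to be told about it.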