Sensu at Brightpearl
Turning a hatred of Nagios into a love of Sensu
www.brightpearl.com

Who the hell am I?
Dave Tibbs
@LowlySysadm1n
● Systems Administrator at Brightpearl Inc
● Started at Brightpearl UK in October 2010
● Back then, there were only about 20 people in the company, and I was the only Systems Administrator/General IT Dogsbody
● ~7 years' experience as a sysadmin working with various flavours of Linux

Monitoring – who needs it anyway?
● Basically everyone: if you're running production software that people depend on, you need to know what's going on with your servers
● You can't rely on screaming users to let you know when things go wrong
● Certain metrics can be a very good indicator of failures before they happen: think disk space, memory consumption, failed backups, web requests/sec, etc.

Monitoring in place when I started

Right, better get some monitoring.
Nagios, then?
● Has a reputation for being the default, safe choice
● Claims to be the “Industry Standard” on its website
● Historically, people were put off by the extortionate costs of enterprise software (e.g. HP OpenView), and cloud-based software now still requires a subscription.
● Hey, Nagios is free.
● Neckbeards rejoice: it's open source.

In the beginning, it was joyous.
● MONITOR ALL TEH THINGZ
● (Relatively) low server count meant it was still manageable. Easy to tune alerts to specific servers.
● All the plugins you can imagine meant we could monitor RDS instances, internal office servers, UPS, etc.
● Email alerts for warnings kept us abreast of things that might happen
● PagerDuty integration for critical alerts
● Configuration assisted with Chef.

But then...
● As the number of servers increases, so does the configuration required
● ...and so do the spurious alerts, where the thresholds aren't so simple to set. Hosting cost constraints mean sometimes running close to the wire on some servers but not others.
● Because of this, NAGIOSAGEDDON in your email inbox. Soon enough, everyone's ignoring the alerts, especially the warnings. And especially if stuff is still working.

A quick note on Nagios checks.
● The monitoring host sends a check command over NRPE and waits for a response
● The queue of checks is processed one by one: if networking to certain hosts is slow, the whole list takes longer to process.
● If the list of checks doesn't get processed before the next check is due.....
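
A back-of-the-envelope model of the serial queue problem. This is only a sketch, not how Nagios is actually implemented: the check counts, latencies and the 60-second interval are made-up numbers chosen to show how a handful of slow hosts can push one pass over the queue past the point where the next pass is due.

```python
# Toy model of a serial check queue (not Nagios's real scheduler).
# All latencies and the interval below are hypothetical illustration values.

def time_to_process(latencies_s):
    """Total wall-clock time when checks run strictly one after another."""
    return sum(latencies_s)

def falls_behind(latencies_s, interval_s):
    """True if one pass over the queue takes longer than the check interval."""
    return time_to_process(latencies_s) > interval_s

CHECK_INTERVAL_S = 60  # hypothetical: every check is due once a minute

healthy = [0.2] * 200                  # 200 checks at 0.2 s each
degraded = [0.2] * 190 + [5.0] * 10    # 10 of them hit a slow network path

for name, latencies in (("healthy", healthy), ("degraded", degraded)):
    total = time_to_process(latencies)
    behind = falls_behind(latencies, CHECK_INTERVAL_S)
    print(f"{name}: {total:.0f}s per pass, falls behind: {behind}")
```

In this model only 5% of the checks are slow, yet the pass goes from roughly 40 seconds to roughly 88 seconds, which is longer than the interval.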

So Nagios sucks then?
● Well, Nagios gets some things right:
  ● The plugin model is simple (4 exit codes!) and reasonably well-designed
  ● It's pretty reliable
  ● SSL support = secure
● If you're running a small office/datacentre with servers and requirements that rarely or never change, it works, but still with a lot of painful setting up
● But as soon as you deviate from this, it all goes wrong.
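
To show how simple the plugin model is, here is a minimal sketch of a disk check in that style. The four exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard plugin convention; the 80%/90% thresholds, the function names and the output wording are assumptions made up for this example.

```python
"""Minimal Nagios-style check: one line of output plus an exit code."""
import shutil

# The four plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def classify(percent_used, warn=80.0, crit=90.0):
    """Map a disk usage percentage to an exit code (thresholds are hypothetical)."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK

def check_disk(path="/"):
    """Return (exit_code, message) for the filesystem containing `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    percent = 100.0 * usage.used / usage.total
    code = classify(percent)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[code]
    return code, f"DISK {label} - {percent:.1f}% used on {path}"

def main():
    code, message = check_disk()
    print(message)
    return code

# As a real plugin, the script would end with:
#   import sys; sys.exit(main())
```

The monitoring system only looks at the exit code and the printed line, which is why plugins written for Nagios can be reused elsewhere.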

Yes, basically Nagios sucks.
● A lot has changed in the IT world in 15 years. Nagios hasn't.
● It's completely unscalable. There is no such thing as a Nagios cluster. More checks = more server load on the master
● The configuration format is horrible: Chef/Puppet only slightly dulls the pain
● It has a horrendous interface, even if you pay for Nagios XI, which isn't cheap
● It assumes a static infrastructure, which in the days of Cloud is almost never the case.
● Configuration has to be duplicated in two places

So what to do?
● Reached the limit of Nagios pain: determined to shake the Stockholm Syndrome we all appear to have
● Alerts are pretty much ignored by all; once the flood gets large enough they WILL end up filtered. Nagios has been stopped for days without anybody noticing.
● A monitoring system that people ignore is utterly pointless.
● Started to investigate alternatives.

Alternatives to Nagios
● Nagios XI: $$$ and apparently not much better.
● Zabbix: Not as much support as Nagios, and lots of people seem to think it's worse. Configuration is possibly even more complex
● Zenoss: Confusing config, issues with false positives and massive numbers of alerts
● Then I found Sensu.

What is this Sensu then?
● Much, much better model (queue-subscriber)
● Purpose-built for this, best tool for the job. Think Graphite for graphing, PagerDuty for alerting.
● Supports existing Nagios plugins
● Integrates with Graphite, PagerDuty
● Easy to scale: automatically handles clustering.
● Great REST API: you can do most things with it
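
A rough sketch of the queue-subscriber idea, using an in-memory stand-in for the message queue. None of this is Sensu's code or API: the `Broker` class, the "webservers" topic and the client names are hypothetical. The point it illustrates is that the publisher never holds a list of hosts, so a new server only has to subscribe.

```python
from collections import defaultdict

class Broker:
    """In-memory stand-in for a message queue (hypothetical, for illustration)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver message to every subscriber of topic; return how many got it."""
        callbacks = self._subscribers[topic]
        for callback in callbacks:
            callback(message)
        return len(callbacks)

broker = Broker()
ran = []  # records (client, check) pairs as clients "run" a check

def make_client(name):
    """Build a client callback that records the check it was asked to run."""
    return lambda check: ran.append((name, check))

broker.subscribe("webservers", make_client("web-1"))
broker.publish("webservers", "check_http")   # only web-1 exists so far

# A new server announces itself; nothing on the publishing side changes.
broker.subscribe("webservers", make_client("web-2"))
broker.publish("webservers", "check_http")
print(ran)
```

Compare this with a central list of hosts that has to be edited every time an instance appears or disappears.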

No really, what is it?
● Often described as a “monitoring router”
● Results of “check” scripts are passed on to one or more handlers, depending on certain conditions
● Written in Ruby (yay!)
● Configuration is all in JSON
● Four main components:
  ● Server
  ● Client
  ● API
  ● Dashboard
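
The “monitoring router” idea can be sketched in a few lines. This is an illustration only, not Sensu's implementation or configuration format: the event fields, the handler names and the rule that OK results are not handled are all assumptions made for the example.

```python
# Toy "monitoring router": not Sensu's code or API, only the idea.
# Event fields, handler names and the routing rule are hypothetical.

def make_router():
    handlers = {}  # handler name -> callable taking an event dict

    def register(name, func):
        handlers[name] = func

    def route(event):
        """Pass a check result to each handler it names; return those that ran."""
        ran = []
        if event["status"] == 0:  # assumed rule: OK results need no handling
            return ran
        for name in event.get("handlers", []):
            if name in handlers:
                handlers[name](event)
                ran.append(name)
        return ran

    return register, route

sent = []  # stands in for emails and pages actually being sent
register, route = make_router()
register("email", lambda e: sent.append(("email", e["check"])))
register("pager", lambda e: sent.append(("pager", e["check"])))

# status follows the plugin exit codes: 0 OK, 1 WARNING, 2 CRITICAL
route({"client": "web-1", "check": "disk", "status": 1, "handlers": ["email"]})
route({"client": "web-1", "check": "disk", "status": 2, "handlers": ["email", "pager"]})
route({"client": "web-1", "check": "disk", "status": 0, "handlers": ["email"]})
print(sent)
```

The check script only produces a result; which handlers see it is decided separately, so warnings can go to email while criticals also page someone.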

Compared to Nagios, this is good
● Hosting our infrastructure in the cloud, we need our monitoring solution to be
  ● able to cope with changing instances/infrastructure
  ● aware of new servers without us having to remember to tell it
  ● able to cope with possibly rapid expansion
● Sensu fulfills these objectives reasonably well.

So is Sensu perfect?
● No, nothing is.
● The dashboard is immature: basically still a bit rubbish
● Current release is only version 0.12, so the software as a whole is fairly immature.
● Fairly complicated install process, with dependencies on quite a bit more than Nagios. It's been Chef'd (and Chef'd well), but it seems easy for these dependencies to break with version inconsistencies.

But it's still immeasurably better.
● It'll scale well when our infrastructure expands
● Has performed great in a test environment
● Looking forward to rolling it out to production!
