Shinken is a full rewrite of Nagios in Python that aims to solve scaling and high-availability issues and to simplify administration for modern IT infrastructures. Key features include built-in high availability, multi-level load balancing, multi-platform support, faster performance, and advanced business rules. The Shinken web interface focuses on aggregating related elements and showing dependencies, to help both technical and non-technical users understand business impacts. Advanced modules provide discovery, triggers for passive data, and templating to reduce configuration complexity.
18. Classic IT monitoring difficulties
● Too much load (plugins, notification latency, ...)
● Hard-to-maintain configuration
● What if a distant site is lost?
● High availability
25. With Shinken, by design:
● RAID-like high availability
● Multi-level load balancing (DMZ, LAN, inter-datacenter)
● Multi-platform (yes, that also means Windows, and even Android)
● Good speed
● Business rules in the core (& | Xof:)
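The (& | Xof:) operators are what the in-core business rules are built on. A minimal sketch of such a rule, assuming the bp_rule check command from the Shinken documentation (host and service names here are purely illustrative):

```cfg
define service {
    use                 generic-service
    host_name           app-dashboard
    service_description Production-Site
    ; the business service is OK as long as at least
    ; 2 of the 3 web front-ends are OK (the Xof: operator)
    check_command       bp_rule!2 of: srv-web-1,Http & srv-web-2,Http & srv-web-3,Http
}
```

The state of this service then follows the rule, so one glance tells you whether the application as a whole is still serving users.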
26. Setting up the monitoring is no longer a problem, so what if we look at 2012+ admin problems?
54. ● Strong separation between problems & impacts
● Focus on (huge) business impacts
● Dependencies are the key: show them all!
● Aggregate all load-balanced elements
● HA by design
55. ● Very « visual » (dependencies, alerts, graphs)
● HTML5 everywhere (sorry, IE6...)
● Only useful info is shown; the rest is hidden by default
● Linkable to other UIs (PNP, Graphite) as modules
● Even your boss will understand it
● And so will night-shift operators!
56. Two main (incompatible) user types
● Boss: wants to see the impacts on end-user apps (and why it's down...)
● Admins: want to see which IT elements are the problems
57. ● Root-problems view VS impacts view
● No one wants to see both
● Everything is sorted by business impact, of course
68. ● Shinken extends the Nagios configuration logic
● Services on hostgroups were good, but why add a server to the linux hostgroup if you already « link » it with the linux template?
69. It can be great to have a complex expression like « Linux&Prod » for service linking
70. We can simply « tag » our hosts instead of multiplying our hostgroups (linux and production tags instead of linux, production, and linuxproduction groups)
● O(n) data versus O(n²)
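A sketch of what tag-based linking can look like, following the deck's own template convention (names are illustrative, and the boolean host_name expression is an assumption based on the « Linux&Prod » example above):

```cfg
define host {
    host_name srv-lin-1
    use       linux,production  ; « tags » via templates, no extra hostgroups
}
define service {
    ; one definition, linked to every host tagged both linux AND production
    host_name           linux&production
    register            0
    service_description Load
    check_command       check_load
}
```

Each host carries n tags; there is no need to maintain every combination of them as a separate group, which is where the O(n) versus O(n²) gain comes from.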
71. ● Too many service definitions
● You can't avoid host definitions, but you can try to reduce your number of services
● Let's move from service-centric data to host-centric data
72. ● Which disk volumes to check is host data, not service data
● Which databases to check is host data, not service data
73. ● Move configuration data back from the services to the hosts
● Fewer services defined, more template usage
● More host custom macros
74. ● Key: the duplicate_foreach keyword in Shinken
● Generates a service for each « value » of a custom macro
75. define host {
    host_name srv-lin-1
    use linux
    _disks /, /var, /data
}
define service {
    host_name linux
    register 0
    service_description Disk $KEY$
    check_command check_disk!$KEY$
    duplicate_foreach _disks
}
76. define host {
    host_name big-switch-stack
    use switch
    _ports Unit [1-6] Port [1-48]
}
define service {
    host_name switch
    register 0
    service_description Port $KEY$
    check_command check_port!$KEY$
    duplicate_foreach _ports
}
You will have 6*48 = 288 services from one definition!
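The bracket-range expansion above can be sketched in plain Python. This is an illustrative helper mimicking the idea, not Shinken's internal code:

```python
import itertools
import re

def expand_ranges(pattern):
    """Expand bracket ranges like 'Unit [1-6] Port [1-48]' into all
    concrete combinations, mimicking the duplicate_foreach expansion."""
    # Split keeps the captured low/high bounds between the text pieces
    parts = re.split(r'\[(\d+)-(\d+)\]', pattern)
    texts = parts[0::3]  # literal text around each [lo-hi] range
    ranges = [range(int(lo), int(hi) + 1)
              for lo, hi in zip(parts[1::3], parts[2::3])]
    # Cartesian product of all ranges -> one service name per combination
    for combo in itertools.product(*ranges):
        yield ''.join(text + (str(n) if n is not None else '')
                      for text, n in itertools.zip_longest(texts, combo))

ports = list(expand_ranges('Unit [1-6] Port [1-48]'))
print(len(ports))   # 288
print(ports[0])     # Unit 1 Port 1
```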
79. So an admin doesn't want to:
● Write plugins from scratch
● Manually tag their hosts
● Write the .cfg files for yet another new server flavor
● (in fact all they want is systems that run by themselves so they can go get coffee)
80. Why manually fill in tags or custom macros for your hosts, when you can write rules for it?
81. Example: an IP-range-based rule module. If the host is in an IP range, you can automatically add a property to it:
● If in the DMZ: it will be checked by a DMZ poller
● If in the testing LAN: no notifications
● If behind a router: add the router as its parent
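The IP-range rule idea can be sketched in a few lines of Python. The rule table, property names, and ranges below are hypothetical examples, not the module's actual API:

```python
import ipaddress

# Hypothetical rules: each maps an IP range to properties added to the host.
RULES = [
    (ipaddress.ip_network('192.168.100.0/24'), {'poller_tag': 'DMZ'}),
    (ipaddress.ip_network('10.0.0.0/16'),      {'notifications_enabled': '0'}),
]

def apply_ip_rules(host):
    """Add properties to a host dict based on which range its address is in."""
    addr = ipaddress.ip_address(host['address'])
    for network, props in RULES:
        if addr in network:
            host.update(props)
    return host

host = apply_ip_rules({'host_name': 'web-1', 'address': '192.168.100.42'})
print(host['poller_tag'])   # DMZ
```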
84. ● Runners: scripts that « scan » and output 'data'
● Rules: read the data and generate hosts/services from it
85. Ex: the nmap runner scans a host and outputs 'data'
$ nmap_discovery_runner.py -t localhost
localhost::isup=1
localhost::os=linux
localhost::osversion=2.6.x
localhost::osvendor=linux
localhost::macvendor=hp
localhost::openports=22,443,3306
localhost::fqdn=localhost
localhost::ip=127.0.0.1
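Parsing that 'hostname::key=value' runner output is straightforward; a minimal sketch (the function name is illustrative, not Shinken's):

```python
def parse_discovery_output(text):
    """Parse 'hostname::key=value' lines, as emitted by a discovery
    runner, into one properties dict per discovered host."""
    hosts = {}
    for line in text.strip().splitlines():
        target, _, keyval = line.partition('::')
        key, _, value = keyval.partition('=')
        hosts.setdefault(target, {})[key] = value
    return hosts

sample = """localhost::isup=1
localhost::os=linux
localhost::openports=22,443,3306"""
data = parse_discovery_output(sample)
print(data['localhost']['os'])   # linux
```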
86. Sample rule for the Linux tag
define discoveryrule {
    discoveryrule_name Linux
    creation_type host
    os linux ; what we match
    +use linux ; what we write in the object: here,
    ; append the linux template
}
87. Sample rule for the Https tag
define discoveryrule {
discoveryrule_name Https
creation_type host
openports 443 ; if we got the 443 port ...
+use Https ; … add the Https template
}
90. Ex : Windows shares discovery
define discoveryrun {
discoveryrun_name WindowsShares
discoveryrun_command discovery_windows_share
# And scan only windows detected hosts!
os windows
}
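The rule-matching idea behind these definitions can be sketched in Python. This mimics the behavior of the rules above under simplifying assumptions (comma-separated property values, '+' meaning append); it is not Shinken's internal implementation:

```python
# Each rule matches on discovered properties and adds configuration to the host.
RULES = [
    {'match': {'os': 'linux'},      'add': {'use': '+linux'}},
    {'match': {'openports': '443'}, 'add': {'use': '+Https'}},
]

def matches(rule, data):
    """A property matches if the expected token appears in its (comma-split) value."""
    return all(want in data.get(key, '').split(',')
               for key, want in rule['match'].items())

def apply_rules(data):
    host = {'host_name': data['fqdn']}
    for rule in RULES:
        if matches(rule, data):
            for key, value in rule['add'].items():
                if value.startswith('+'):  # '+' means append, as in the rules above
                    host[key] = (host.get(key, '') + ',' + value[1:]).strip(',')
                else:
                    host[key] = value
    return host

data = {'fqdn': 'localhost', 'os': 'linux', 'openports': '22,443,3306'}
print(apply_rules(data))   # {'host_name': 'localhost', 'use': 'linux,Https'}
```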
94. SKonf:
● A UI for easy configuration management
● Can use discovery or a more « classic » way
● Manages Shinken-specific properties
● A (good) beta version as of now
99. Let's get back from configuration to more monitoring logic
Sometimes external check plugins can't help you (for example: a server with a collectd daemon)
102. Solution: triggers (yes, like in Zabbix)
● .trig files (in fact Python source)
● A trigger is linked to hosts/services in the configuration
● It will « run » after a check (or new passive data)
● It can do whatever it wants in the core!
103. Sample:
# self = the "number of users" collectd service for the host
nb_users = perf(self, 'users')
warn = int(get_custom(self.host, '_users_warn'))
crit = int(get_custom(self.host, '_users_crit'))
return_code = 0
output = 'Check OK'
if nb_users > warn:
    output = 'Warning : users are too high %s' % nb_users
    return_code = 1
if nb_users > crit:
    output = 'Critical : users are too high %s' % nb_users
    return_code = 2
set_value(self, output=output, return_code=return_code)
104. Ok, that won't replace NRPE or check_mk, but it can be useful for log parsing with a syslog listener module or an SNMP trap parser one, for example...
105. … or for more advanced things like KPI
computation, or even advanced correlations
106. Sample: compute the average time of N web servers
times = perfs("srv-web-*/Http", 'time')
avg_time = sum(times)/len(times)
set_value(self, output='OK', perfdata='avgtime=%dms' % avg_time, return_code=0)
107. Sample: an advanced correlation rule
bd_state = state("srv-bdd", "Oracle")
avg_time = perf("srv-web/AvgTime", 'avgtime')
return_code = 0
output = 'Check OK'
if bd_state == 'WARNING' or avg_time > 5:
    output = 'Warning : the application is in degraded mode'
    return_code = 1
if bd_state == 'CRITICAL' or avg_time > 10:
    output = 'Critical : the application is down!'
    return_code = 2
set_value(self, output=output, return_code=return_code)
108. How to install Shinken?
Quite easy :
# curl -L http://install.shinken-monitoring.org | /bin/bash
109. Conclusion?
● Be lazy and go take a coffee
● The Shinken architecture is done and powerful
● Lots of improvements in the monitoring logic compared to Nagios™
● The WebUI is great; sKonf will be great soon
● Professional support coming soon from a « Shinken Enterprise »