Shinken is a full rewrite of Nagios in Python that aims to solve scaling and high-availability issues and to simplify administration for modern IT infrastructures. Key features include built-in high availability, multi-level load balancing, multi-platform support, faster performance, and advanced business rules. The Shinken web interface focuses on aggregating related elements and showing dependencies, to help both technical and non-technical users understand business impacts. Advanced modules provide discovery, triggers for passive data, and templating to reduce configuration complexity.
18. Classic IT monitoring difficulties
● Too much load (plugins, notification latency, ...)
● Hard-to-maintain configuration
● What if a distant site is lost?
● High availability
25. With Shinken, by design:
● RAID-like high availability
● Multi-level load balancing (DMZ, LAN, inter-datacenter)
● Multi-platform (yes, that also means Windows, and even Android)
● Good speed
● Business rules in the core (& | Xof:)
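The (& | Xof:) operators are what the in-core business rules are built on. A minimal sketch of such a rule, assuming the bp_rule check command from the Shinken documentation (host and service names here are purely illustrative):

```cfg
define service {
    use                 generic-service
    host_name           app-dashboard
    service_description Production-Site
    ; the business service is OK as long as at least
    ; 2 of the 3 web front-ends are OK (the Xof: operator)
    check_command       bp_rule!2 of: srv-web-1,Http & srv-web-2,Http & srv-web-3,Http
}
```

The state of this service then follows the rule, so one glance tells you whether the application as a whole is still serving users.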
26. Setting up the monitoring is no longer a problem, so what if we look at 2012+ admin problems?
54. ● Strong separation between problems & impacts
● Focus on (huge) business impacts
● Dependencies are the key: show them all!
● Aggregate all load-balanced elements
● HA by design
55. ● Very « visual » (dependencies, alerts, graphs)
● HTML5 everywhere (sorry, IE6...)
● Only useful info is shown; the rest is hidden by default
● Linkable to other UIs (PNP, Graphite) as modules
● Even your boss will understand it
● And so will night-shift operators!
56. Two main (incompatible) user types
● Boss: wants to see the impacts on end-user apps (and why it's down...)
● Admins: want to see which IT elements are the problems
57. ● Root-problems view VS impacts view
● No one wants to see both
● Everything is sorted by business impact, of course
68. ● Shinken extends the Nagios configuration logic
● Services on hostgroups were good, but why add a server to the linux hostgroup if you already « link » it with the linux template?
69. It can be great to have a complex expression like « Linux&Prod » for service linking
70. We can simply « tag » our hosts instead of multiplying our hostgroups (linux and production tags instead of linux, production, and linuxproduction groups)
● O(n) data versus O(n²)
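A sketch of what tag-based linking can look like, following the deck's own template convention (names are illustrative, and the boolean host_name expression is an assumption based on the « Linux&Prod » example above):

```cfg
define host {
    host_name srv-lin-1
    use       linux,production  ; « tags » via templates, no extra hostgroups
}
define service {
    ; one definition, linked to every host tagged both linux AND production
    host_name           linux&production
    register            0
    service_description Load
    check_command       check_load
}
```

Each host carries n tags; there is no need to maintain every combination of them as a separate group, which is where the O(n) versus O(n²) gain comes from.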
71. ● Too many service definitions
● You can't avoid host definitions, but you can try to reduce your number of services
● Let's move from service-centric data to host-centric data
72. ● Which disk volumes to check is host data, not service data
● Which databases to check is host data, not service data
73. ● Move configuration data back from the services to the hosts
● Fewer services defined, more template usage
● More host custom macros
74. ● Key: the duplicate_foreach keyword in Shinken
● Generates a service for each « value » of a custom macro
75. define host {
    host_name srv-lin-1
    use linux
    _disks /, /var, /data
}
define service {
    host_name linux
    register 0
    service_description Disk $KEY$
    check_command check_disk!$KEY$
    duplicate_foreach _disks
}
76. define host {
    host_name big-switch-stack
    use switch
    _ports Unit [1-6] Port [1-48]
}
define service {
    host_name switch
    register 0
    service_description Port $KEY$
    check_command check_port!$KEY$
    duplicate_foreach _ports
}
You will have 6*48 = 288 services from one definition!
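The bracket-range expansion above can be sketched in plain Python. This is an illustrative helper mimicking the idea, not Shinken's internal code:

```python
import itertools
import re

def expand_ranges(pattern):
    """Expand bracket ranges like 'Unit [1-6] Port [1-48]' into all
    concrete combinations, mimicking the duplicate_foreach expansion."""
    # Split keeps the captured low/high bounds between the text pieces
    parts = re.split(r'\[(\d+)-(\d+)\]', pattern)
    texts = parts[0::3]  # literal text around each [lo-hi] range
    ranges = [range(int(lo), int(hi) + 1)
              for lo, hi in zip(parts[1::3], parts[2::3])]
    # Cartesian product of all ranges -> one service name per combination
    for combo in itertools.product(*ranges):
        yield ''.join(text + (str(n) if n is not None else '')
                      for text, n in itertools.zip_longest(texts, combo))

ports = list(expand_ranges('Unit [1-6] Port [1-48]'))
print(len(ports))   # 288
print(ports[0])     # Unit 1 Port 1
```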
79. So an admin doesn't want to:
● Write plugins from scratch
● Manually tag their hosts
● Write the .cfg files for yet another new server flavor
● (in fact all they want is systems that run by themselves so they can go get coffee)
80. Why manually fill in tags or custom macros for your hosts, when you can write rules for it?
81. Example: an IP-range-based rule module. If the host is in an IP range, you can automatically add a property to it:
● If in the DMZ: it will be checked by a DMZ poller
● If in the testing LAN: no notifications
● If behind a router: add the router as its parent
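The IP-range rule idea can be sketched in a few lines of Python. The rule table, property names, and ranges below are hypothetical examples, not the module's actual API:

```python
import ipaddress

# Hypothetical rules: each maps an IP range to properties added to the host.
RULES = [
    (ipaddress.ip_network('192.168.100.0/24'), {'poller_tag': 'DMZ'}),
    (ipaddress.ip_network('10.0.0.0/16'),      {'notifications_enabled': '0'}),
]

def apply_ip_rules(host):
    """Add properties to a host dict based on which range its address is in."""
    addr = ipaddress.ip_address(host['address'])
    for network, props in RULES:
        if addr in network:
            host.update(props)
    return host

host = apply_ip_rules({'host_name': 'web-1', 'address': '192.168.100.42'})
print(host['poller_tag'])   # DMZ
```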
84. ● Runners: scripts that « scan » and output 'data'
● Rules: read the data and generate hosts/services from it
85. Ex: the nmap runner scans a host and outputs 'data'
$ nmap_discovery_runner.py -t localhost
localhost::isup=1
localhost::os=linux
localhost::osversion=2.6.x
localhost::osvendor=linux
localhost::macvendor=hp
localhost::openports=22,443,3306
localhost::fqdn=localhost
localhost::ip=127.0.0.1
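Parsing that 'hostname::key=value' runner output is straightforward; a minimal sketch (the function name is illustrative, not Shinken's):

```python
def parse_discovery_output(text):
    """Parse 'hostname::key=value' lines, as emitted by a discovery
    runner, into one properties dict per discovered host."""
    hosts = {}
    for line in text.strip().splitlines():
        target, _, keyval = line.partition('::')
        key, _, value = keyval.partition('=')
        hosts.setdefault(target, {})[key] = value
    return hosts

sample = """localhost::isup=1
localhost::os=linux
localhost::openports=22,443,3306"""
data = parse_discovery_output(sample)
print(data['localhost']['os'])   # linux
```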
86. Sample rule for the Linux tag
define discoveryrule {
    discoveryrule_name Linux
    creation_type host
    os linux ; what we match
    +use linux ; what we write in the object: here,
    ; append the linux template
}
87. Sample rule for the Https tag
define discoveryrule {
discoveryrule_name Https
creation_type host
openports 443 ; if we got the 443 port ...
+use Https ; … add the Https template
}
90. Ex : Windows shares discovery
define discoveryrun {
discoveryrun_name WindowsShares
discoveryrun_command discovery_windows_share
# And scan only windows detected hosts!
os windows
}
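The rule-matching idea behind these definitions can be sketched in Python. This mimics the behavior of the rules above under simplifying assumptions (comma-separated property values, '+' meaning append); it is not Shinken's internal implementation:

```python
# Each rule matches on discovered properties and adds configuration to the host.
RULES = [
    {'match': {'os': 'linux'},      'add': {'use': '+linux'}},
    {'match': {'openports': '443'}, 'add': {'use': '+Https'}},
]

def matches(rule, data):
    """A property matches if the expected token appears in its (comma-split) value."""
    return all(want in data.get(key, '').split(',')
               for key, want in rule['match'].items())

def apply_rules(data):
    host = {'host_name': data['fqdn']}
    for rule in RULES:
        if matches(rule, data):
            for key, value in rule['add'].items():
                if value.startswith('+'):  # '+' means append, as in the rules above
                    host[key] = (host.get(key, '') + ',' + value[1:]).strip(',')
                else:
                    host[key] = value
    return host

data = {'fqdn': 'localhost', 'os': 'linux', 'openports': '22,443,3306'}
print(apply_rules(data))   # {'host_name': 'localhost', 'use': 'linux,Https'}
```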
94. SKonf:
● A UI for easy configuration management
● Can use discovery or a more « classic » way
● Manages Shinken-specific properties
● A (good) beta version as of now
99. Let's get back from configuration to more monitoring logic
Sometimes external check plugins can't help you (for example: a server with a collectd daemon)
102. Solution: triggers (yes, like in Zabbix)
● .trig files (in fact Python source)
● A trigger is linked to hosts/services in the configuration
● It will « run » after a check (or new passive data)
● It can do whatever it wants in the core!
103. Sample:
# self = the "number of users" collectd service for the host
nb_users = perf(self, 'users')
warn = int(get_custom(self.host, '_users_warn'))
crit = int(get_custom(self.host, '_users_crit'))
return_code = 0
output = 'Check OK'
if nb_users > warn:
    output = 'Warning : users are too high %s' % nb_users
    return_code = 1
if nb_users > crit:
    output = 'Critical : users are too high %s' % nb_users
    return_code = 2
set_value(self, output=output, return_code=return_code)
104. Ok, that won't replace NRPE or check_mk, but it can be useful for log parsing with a syslog listener module or an SNMP trap parser one, for example...
105. … or for more advanced things like KPI
computation, or even advanced correlations
106. Sample: compute the average time of N web servers
times = perfs("srv-web-*/Http", 'time')
avg_time = sum(times)/len(times)
set_value(self, output='OK', perfdata='avgtime=%dms' % avg_time, return_code=0)
107. Sample: an advanced correlation rule
bd_state = state("srv-bdd", "Oracle")
avg_time = perf("srv-web/AvgTime", 'avgtime')
return_code = 0
output = 'Check OK'
if bd_state == 'WARNING' or avg_time > 5:
    output = 'Warning : the application is in degraded mode'
    return_code = 1
if bd_state == 'CRITICAL' or avg_time > 10:
    output = 'Critical : the application is down!'
    return_code = 2
set_value(self, output=output, return_code=return_code)
108. How to install Shinken?
Quite easy :
# curl -L http://install.shinken-monitoring.org | /bin/bash
109. Conclusion?
● Be lazy and go take a coffee
● The Shinken architecture is done and powerful
● Lots of improvements in the monitoring logic compared to Nagios™
● The WebUI is great; sKonf will be great soon
● Professional support coming soon from a « Shinken Enterprise »