Server Hardware Monitoring 
done right! 
Werner Fischer, Thomas-Krenn.AG
2 
Status quo 
_ Do you monitor your server hardware?
Yes No
After this talk
your monitoring will be more secure
and more comprehensive
(at least I hope so... ;-)
4
5 
Status quo 
_ Which technologies do you use?
IPMI / SNMP NRPE CAM CAT
6 
CAMera
7 
CATinspection → satisfaction?
8 
Agenda 
_ IPMI (20') 
_ RAM (5') 
_ RAID (10') 
_ SMART (5') 
_ GPU (5')
9 
monitor your 
IPMI-Sensors!
10
11
12 
Intelligent Platform Management 
Interface
13
Functions
_ Monitoring (temp, fans, ...)
_ Recovery Control (on/off/reset)
_ Logging (System Event Log)
_ Inventory (FRU information)
14
Architecture
[Block diagram: IPMI architecture on the motherboard]
_ Baseboard Management Controller (BMC) with non-volatile storage (NVS): SDR, SEL, FRU
_ Sensors & controls: fan sensor, temp. sensor, power control, reset control, …
_ IPMB (private mgmt. busses) to chassis mgmt. (satellite controller), processor board, memory board, redundant power board and chassis board (each with FRU / temp. sensor); auxiliary IPMB connector
_ ICMB bridge to ICMB
_ LAN interface: network (LAN) controller, LAN connector, PCI mgmt. bus, remote mgmt. card (KVM over IP, ...)
_ Serial/modem interface: serial port sharing between M/B serial controller and BMC serial controller, serial connector
_ System interface via the system bus
_ Annotations: "access with username & password" (remote, via LAN/serial) and "access with root privileges" (local, via the system interface)
15
IPMI sensor classes
Discrete (true/false)
Multiple states possible:
● up to 15 states
● each state = 1 bit
● several state bits can be active at once
Provides:
● generic states
● sensor-specific states
Similar class: OEM
● the meaning of the states is defined by the OEM
Threshold
State depends on:
● comparison of the analog reading with the thresholds
Provides:
● analog reading
● discrete status
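The two sensor classes can be sketched in a few lines of Python. This is only an illustration: the function names are made up, the bit order of the power supply states is an assumption, and hysteresis is ignored. The fan values are the ones from the "Fan 1" example in this deck.

```python
from typing import List

def decode_discrete(state_mask: int, state_names: List[str]) -> List[str]:
    """Return the names of all asserted state bits (bit 0 = first name)."""
    if not 0 <= state_mask < (1 << 15):
        raise ValueError("a discrete sensor has at most 15 state bits")
    return [name for bit, name in enumerate(state_names)
            if state_mask & (1 << bit)]

def classify_lower_thresholds(reading: float,
                              lower_non_critical: float,
                              lower_critical: float) -> str:
    """Compare an analog reading with its lower thresholds (no hysteresis)."""
    if reading <= lower_critical:
        return "critical"
    if reading <= lower_non_critical:
        return "non-critical"
    return "ok"

# Assumed bit order, for illustration only.
PSU_STATES = ["Presence detected", "Failure detected",
              "Predictive failure", "Power Supply AC lost"]

if __name__ == "__main__":
    # Several state bits can be active at the same time.
    print(decode_discrete(0b1001, PSU_STATES))
    # Fan 1: reading 5719 RPM, lnc 1978 RPM, lcr 1720 RPM.
    print(classify_lower_thresholds(5719, 1978, 1720))
```

A discrete sensor reports a set of state bits, while a threshold sensor reports a reading plus a status derived from that reading.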
16 
IPMI sensor classes
Discrete:
[root@test ~]# ipmitool sdr get "PS2 Status"
Sensor ID : PS2 Status (0x71)
Entity ID : 10.2 (Power Supply)
Sensor Type (Discrete): Power Supply
States Asserted : Power Supply
[Presence detected]
[Power Supply AC lost]
Assertion Events : Power Supply
[Presence detected]
[Power Supply AC lost]
Assertions Enabled : Power Supply
[Presence detected]
[Failure detected]
[Predictive failure]
[Power Supply AC lost]
[...]
Deassertions Enabled : Power Supply
[...]
Threshold:
[root@test ~]# ipmitool sdr get "Fan 1"
Sensor ID : Fan 1 (0x50)
Entity ID : 29.1 (Fan Device)
Sensor Type (Analog) : Fan
Sensor Reading : 5719 (+/- 0) RPM
Status : ok
Nominal Reading : 6708.000
Normal Minimum : 2451.000
Normal Maximum : 10965.000
Lower critical : 1720.000
Lower non-critical : 1978.000
Positive Hysteresis : 86.000
Negative Hysteresis : 86.000
Minimum sensor range : Unspecified
Maximum sensor range : Unspecified
Event Message Control : Per-threshold
Readable Thresholds : lcr lnc
Settable Thresholds : lcr lnc
Threshold Read Mask : lcr lnc
Assertion Events :
Assertions Enabled : lnc- lcr-
Deassertions Enabled : lnc- lcr-
17
IPMI sensors: OK vs. Critical
$ sudo ipmi-sensors --output-sensor-state --interpret-oem-data
Password:
ID | Name | Type | State | Reading | Units | Event
4 | System Temp | Temperature | Nominal | 27.00 | C | 'OK'
71 | Peripheral Temp | Temperature | Nominal | 35.00 | C | 'OK'
138 | CPU Temp | OEM Reserved | Nominal | N/A | N/A | 'Low'
205 | FAN 1 | Fan | Nominal | 1800.00 | RPM | 'OK'
…
942 | VBAT | Voltage | Nominal | 3.15 | V | 'OK'
1009 | VSB | Voltage | Nominal | 3.34 | V | 'OK'
1076 | AVCC | Voltage | Nominal | 3.38 | V | 'OK'
1143 | Chassis Intru | Physical Security | Critical | N/A | N/A | 'Gen...'
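As a rough sketch, the pipe-separated ipmi-sensors table can be turned into an overall Nagios-style result like this. The function names and the state-to-exit-code mapping are assumptions for illustration. This is not how check_ipmi_sensor itself is implemented.

```python
from typing import Dict, List, Optional

# Assumed mapping of sensor states to Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL).
STATE_TO_EXIT = {"Nominal": 0, "Warning": 1, "Critical": 2}

def parse_ipmi_sensors(output: str) -> List[Dict[str, object]]:
    """Parse 'ID | Name | Type | State | Reading | Units | Event' rows."""
    rows = []
    for line in output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Skip the header, the password prompt and anything else that is not a data row.
        if len(fields) != 7 or not fields[0].isdigit():
            continue
        reading: Optional[float] = None if fields[4] == "N/A" else float(fields[4])
        rows.append({"id": int(fields[0]), "name": fields[1], "type": fields[2],
                     "state": fields[3], "reading": reading,
                     "units": fields[5], "event": fields[6].strip("'")})
    return rows

def worst_exit_code(rows: List[Dict[str, object]]) -> int:
    """Return the worst exit code; unknown states (or no rows) count as 3 (UNKNOWN)."""
    return max((STATE_TO_EXIT.get(str(row["state"]), 3) for row in rows), default=3)

SAMPLE = """ID | Name | Type | State | Reading | Units | Event
4 | System Temp | Temperature | Nominal | 27.00 | C | 'OK'
205 | FAN 1 | Fan | Nominal | 1800.00 | RPM | 'OK'
1143 | Chassis Intru | Physical Security | Critical | N/A | N/A | 'Gen...'"""

if __name__ == "__main__":
    sensors = parse_ipmi_sensors(SAMPLE)
    for sensor in sensors:
        print(sensor["name"], sensor["state"], sensor["reading"])
    print("exit code:", worst_exit_code(sensors))
```

With the sample rows, the chassis intrusion sensor alone makes the overall result critical.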
18 
IPMI sensors (discrete)
$ cat /etc/freeipmi/freeipmi_interpret_sensor.conf 
[…] 
## IPMI_Physical_Security 
# 
# IPMI_Physical_Security_No_Event Nominal 
# IPMI_Physical_Security_General_Chassis_Intrusion Critical 
# IPMI_Physical_Security_Drive_Bay_Intrusion Critical 
[…] 
# IPMI_Power_Supply_No_Event Nominal 
# IPMI_Power_Supply_Presence_Detected Nominal 
# IPMI_Power_Supply_Power_Supply_Failure_Detected Critical 
# IPMI_Power_Supply_Predictive_Failure Critical 
# IPMI_Power_Supply_Power_Supply_Input_Lost_AC_DC Critical 
[…]
19
IPMI Plugin
$ ./check_ipmi_sensor -H 192.168.255.5 -f ipmi.cfg -vv
IPMI Status: OK | 'System Temp'=27.00 'Peripheral Temp'=35.00 'FAN 1'=1800.00 'Vcore'=0.98 '3.3VCC'=3.36 '12V'=11.93 'VDIMM'=1.53 '5VCC'=5.09 '-12V'=-12.09 'VBAT'=3.15 'VSB'=3.34 'AVCC'=3.38
System Temp = 27.00 (Status: Nominal)
Peripheral Temp = 35.00 (Status: Nominal)
CPU Temp = 'Low' (Status: Nominal)
FAN 1 = 1800.00 (Status: Nominal)
Vcore = 0.98 (Status: Nominal)
3.3VCC = 3.36 (Status: Nominal)
12V = 11.93 (Status: Nominal)
VDIMM = 1.53 (Status: Nominal)
5VCC = 5.09 (Status: Nominal)
-12V = -12.09 (Status: Nominal)
VBAT = 3.15 (Status: Nominal)
VSB = 3.34 (Status: Nominal)
AVCC = 3.38 (Status: Nominal)
Chassis Intru = 'OK' (Status: Nominal)
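The part after the `|` in the plugin output is Nagios performance data. The following minimal sketch shows how such a line could be split into labelled values. The regular expression and the function name are illustrative assumptions and cover only the `label=value[UOM][;warn[;crit]]` subset of the format.

```python
import re
from typing import Dict

# 'quoted label' or bare_label, '=', numeric value, optional unit, optional ;warn;crit...
PERF_RE = re.compile(
    r"(?:'([^']+)'|([^\s=']+))=(-?\d+(?:\.\d+)?)([^\s;]*)((?:;[^\s;]*)*)"
)

def parse_perfdata(plugin_output: str) -> Dict[str, dict]:
    """Return {label: {value, uom, warn, crit}} for the perfdata of one output line."""
    _, separator, perf = plugin_output.partition("|")
    if not separator:
        return {}
    metrics = {}
    for match in PERF_RE.finditer(perf):
        label = match.group(1) or match.group(2)
        extra = match.group(5).split(";")[1:]  # drop the empty item before the first ';'
        warn = extra[0] if len(extra) > 0 and extra[0] else None
        crit = extra[1] if len(extra) > 1 and extra[1] else None
        metrics[label] = {"value": float(match.group(3)), "uom": match.group(4),
                          "warn": warn, "crit": crit}
    return metrics

if __name__ == "__main__":
    line = "IPMI Status: OK | 'System Temp'=27.00 'FAN 1'=1800.00 '-12V'=-12.09"
    for label, metric in parse_perfdata(line).items():
        print(label, metric["value"])
```

Thresholds are kept as strings because they may be ranges rather than plain numbers.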
20 
IPMI Plugin
#!/usr/bin/perl
# check_ipmi_sensor: Nagios/Icinga plugin to check IPMI sensors
#
# Copyright (C) 2009-2014 Thomas-Krenn.AG,
# additional contributors see changelog.txt
#
# This program is free software; you can redistribute it and/or modify it under
[…]
Version 3.5 20141031
* Fix LAN Driver if called on localhost
Version 3.4 20140929
* Fix implicit array warning with split
* Add option to disable LAN protocol version 2.0
Version 3.3 20140606
* Print a warning if ipmi-sensors only returned a single output row
* Ignore sudo errors and warnings in IPMI command output
(Thanks to Robert Heinzmann for contributing)
* Use LAN protocol version 2.0 per default
* Print empty output error only if return code was 0
* Exit the plugin with return code 3 if fru command fails
* Added an include list option to only include specific sensors
Version 3.2 20131028
* Added FRU serial number to output
21 
So far, so good?
22
Intelligent? Platform Management Interface
23
24
The Eavesdropping System in Your Computer
(Bruce Schneier, Schneier on Security blog, 31.01.2013)
25
26
230,000 1U servers
→ a stack 10,223.5 m high
(Mount Everest: 8,848 m)
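The height follows from the size of one rack unit (1U = 1.75 inch = 44.45 mm). A quick check of the arithmetic, with a helper name chosen only for this example:

```python
RACK_UNIT_MM = 44.45  # 1U = 1.75 inch = 44.45 mm

def stack_height_m(servers: int, units_per_server: int = 1) -> float:
    """Height in metres of a stack of servers, each 'units_per_server' rack units tall."""
    return servers * units_per_server * RACK_UNIT_MM / 1000

if __name__ == "__main__":
    height = stack_height_m(230_000)
    print(f"{height:.1f} m, {height / 8848:.2f} x Mount Everest")
```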
27
28
29 
IPMI firmware by ATEN / AMI
_ Mainboard manufacturers customize the firmware
_ OS = embedded Linux
_ Parts of the IPMI firmware are closed source
30
We recommend not exposing administrative interfaces such as IPMI, or SSH services for that matter, openly on the Internet. Instead, use a firewall/VPN so that only authorized persons can access such services.
31 
What if you do it anyway?
Enable
& DROP
32 
IPMI top 3
security tips
33
#1 - Network
34
#1 - Network
35 
#2 – User Management 
sjfaiklaz afjhuijoh 
Administrator 
User
36
#2 – User Management
"In short, the authentication process for IPMI 2.0 mandates that the server send a salted SHA1 or MD5 hash of the requested user's password to the client, prior to the client authenticating."
A Penetration Tester's Guide to IPMI and BMCs (rapid7.com)
msf > use auxiliary/scanner/ipmi/ipmi_dumphashes 
msf auxiliary(ipmi_dumphashes) > set RHOSTS 10.1.102.141 
RHOSTS => 10.1.102.141 
msf auxiliary(ipmi_dumphashes) > set THREADS 128 
THREADS => 128 
msf auxiliary(ipmi_dumphashes) > run 
[+] 10.1.102.141:623 - IPMI - Hash found: 
admin:14667523250000004ec525d3852f4fa73c93b674788217fe00000000000000 
00000000000000000000000000000000000000000000000000140561646d696e:2c7 
6e372d89ac7cd4e3bfecb423962f708d0741c
37 
#2 – User Management 
$ ./cudaHashcat64.bin --outfile=ipmi.out -m 7300 hash.txt -a 3 ?lu?lu?lu?lu?lu?lu
[...] 
Session.Name...: cudaHashcat 
Status.........: Exhausted 
Input.Mode.....: Mask (?lu?lu?lu?lu?lu?lu) [12] 
Hash.Target....: 
54414378fb2db5ff365e4bc5856adaf4c1b8a2f2153efd1b81fb54dfe1bf56478788 
ea7ba154375b40167e34f026e1020010d21d1ea31625040561646d696e:0a0b16023 
1e204a6d0bd086e26718002409b35b7 
Hash.Type......: IPMI2 RAKP HMAC-SHA1 
Time.Started...: Thu Sep 18 10:11:17 2014 (6 secs) 
Time.Estimated.: 0 secs 
Speed.GPU.#1...: 52732.3 kH/s 
Recovered......: 0/1 (0.00%) Digests, 0/1 (0.00%) Salts 
Progress.......: 308915776/308915776 (100.00%) 
Skipped........: 0/308915776 (0.00%) 
Rejected.......: 0/308915776 (0.00%) 
HWMon.GPU.#1...: -1% Util, 41c Temp, 31% Fan
38 
#2 – User Management 
20 
Complex
& long
passwords
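Why length matters can be estimated from the hashcat run: the mask has six `?l` positions (26 lowercase candidates each), and the GPU tested 52732.3 kH/s. The sketch below reproduces that keyspace. It then extrapolates to a 20-character password drawn from 62 characters, which is purely an illustrative assumption, as is the idea that the hash rate stays constant.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
HASH_RATE = 52_732_300  # 52732.3 kH/s, taken from the hashcat status output

def keyspace(alphabet_size: int, length: int) -> int:
    """Number of candidates for a fixed-length password over the given alphabet."""
    return alphabet_size ** length

def exhaust_seconds(candidates: int, hashes_per_second: float) -> float:
    """Time needed to try every candidate at a constant hash rate."""
    return candidates / hashes_per_second

if __name__ == "__main__":
    short = keyspace(26, 6)
    print(short, "candidates,", round(exhaust_seconds(short, HASH_RATE), 1), "seconds")
    # Assumption: 20 characters from a-z, A-Z, 0-9 (62 symbols).
    long_years = exhaust_seconds(keyspace(62, 20), HASH_RATE) / SECONDS_PER_YEAR
    print(f"{long_years:.1e} years")
```

The first result matches the `Progress` line (308915776 candidates) and the roughly six seconds reported by hashcat.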
39 
#3 – Limit services
42 
monitor your RAM! 
(it's ECC, isn't it?)
44 
3%
at least 1 CE (correctable error) per year (DDR2)
Google 2009, Jaguar cluster 2012
45
70%
CEs before UEs (uncorrectable errors)
Google 2009
46
1.3%
servers with UEs per year
Google 2009
47
Linux EDAC
root@debian-test:/sys/devices/system/edac/mc/mc0/csrow0# ls -l
total 0
-r--r--r-- 1 root root 4096 Nov 12 09:02 ce_count
-r--r--r-- 1 root root 4096 Nov 12 09:02 ch0_ce_count
-rw-r--r-- 1 root root 4096 Nov 12 09:02 ch0_dimm_label
-r--r--r-- 1 root root 4096 Nov 12 09:02 ch1_ce_count
-rw-r--r-- 1 root root 4096 Nov 12 09:02 ch1_dimm_label
-r--r--r-- 1 root root 4096 Nov 12 09:02 dev_type
-r--r--r-- 1 root root 4096 Nov 12 09:02 edac_mode
-r--r--r-- 1 root root 4096 Nov 12 09:02 mem_type
drwxr-xr-x 2 root root 0 Nov 12 09:02 power
-r--r--r-- 1 root root 4096 Nov 12 09:02 size_mb
lrwxrwxrwx 1 root root 0 Nov 12 09:02 subsystem -> ../../../../../../bus/mc0
-r--r--r-- 1 root root 4096 Nov 12 09:02 ue_count
-rw-r--r-- 1 root root 4096 Nov 12 09:02 uevent
root@debian-test:/sys/devices/system/edac/mc/mc0/csrow0# cat ce_count
0
root@debian-test:/sys/devices/system/edac/mc/mc0/csrow0# cat ue_count
0
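The counters shown above can be collected for all memory controllers with a few lines of Python. This is a minimal sketch that assumes the `mc*/csrow*` layout from the listing. The function names and the rule "any CE is a warning, any UE is critical" are illustrative choices, not part of an existing plugin.

```python
from pathlib import Path
from typing import Dict

def read_edac_counts(edac_root: str = "/sys/devices/system/edac/mc") -> Dict[str, Dict[str, int]]:
    """Read ce_count and ue_count of every csrow below the EDAC sysfs directory."""
    root = Path(edac_root)
    if not root.is_dir():  # EDAC driver not loaded, or not a Linux system
        return {}
    totals = {}
    for csrow in sorted(root.glob("mc*/csrow*")):
        counts = {}
        for name in ("ce_count", "ue_count"):
            counter = csrow / name
            counts[name] = int(counter.read_text().strip()) if counter.is_file() else 0
        totals[f"{csrow.parent.name}/{csrow.name}"] = counts
    return totals

def edac_exit_code(totals: Dict[str, Dict[str, int]]) -> int:
    """Nagios-style result: 2 on uncorrectable errors, 1 on correctable errors, else 0."""
    if any(counts["ue_count"] > 0 for counts in totals.values()):
        return 2
    if any(counts["ce_count"] > 0 for counts in totals.values()):
        return 1
    return 0

if __name__ == "__main__":
    counts = read_edac_counts()
    print(counts)
    print("exit code:", edac_exit_code(counts))
```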
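48
Linux EDAC support matrix
Driver module | CPUs | Kernel | Supported architectures
amd64_edac.c | AMD | 2.6.31 | K8 and F10
amd64_edac.c | AMD | 2.6.39 | F15
amd64_edac.c | AMD | 3.10 | F16
amd64_edac.c | AMD | 3.13 | F15_M30H
amd64_edac.c | AMD | 3.15 | F16_M30H
i7core_edac.c | Intel Single/Dual | 2.6.35 | Nehalem/Westmere
ie31200_edac.c | Intel Single-CPU | 3.17 | Sandy & Ivy Bridge, Haswell
sb_edac.c | Intel Dual-CPU | 3.2 | Sandy Bridge
sb_edac.c | Intel Dual-CPU | 3.13 | Ivy Bridge
sb_edac.c | Intel Dual-CPU | 3.17 | Haswell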
49
IPMI SEL (System Event Log)
$ ipmi-sel
ID | Date | Time | Name | State | Event
1 | Feb-03-2012 | 10:31:58 | CPU0 DIMM0 | Warning | Correctable memory error
2 | Feb-13-2012 | 22:28:58 | CPU0 DIMM0 | Warning | Correctable memory error
3 | Feb-14-2012 | 00:29:03 | CPU0 DIMM0 | Warning | Correctable memory error
4 | Feb-14-2012 | 01:29:06 | CPU0 DIMM0 | Warning | Correctable memory error
...
Supported as of check_ipmi_sensor v3.6 (planned for 12/2014)
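Counting correctable memory errors per DIMM from the SEL output could look roughly like this. The function name is hypothetical, and the sketch only understands the six-column ipmi-sel format shown on the slide.

```python
from collections import Counter

def count_correctable_errors(sel_output: str) -> Counter:
    """Count 'Correctable memory error' SEL entries per sensor name (e.g. per DIMM)."""
    counts: Counter = Counter()
    for line in sel_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Expected columns: ID | Date | Time | Name | State | Event
        if len(fields) == 6 and fields[0].isdigit() \
                and fields[5] == "Correctable memory error":
            counts[fields[3]] += 1
    return counts

SEL_SAMPLE = """ID | Date | Time | Name | State | Event
1 | Feb-03-2012 | 10:31:58 | CPU0 DIMM0 | Warning | Correctable memory error
2 | Feb-13-2012 | 22:28:58 | CPU0 DIMM0 | Warning | Correctable memory error
3 | Feb-14-2012 | 00:29:03 | CPU0 DIMM0 | Warning | Correctable memory error
4 | Feb-14-2012 | 01:29:06 | CPU0 DIMM0 | Warning | Correctable memory error"""

if __name__ == "__main__":
    for dimm, errors in count_correctable_errors(SEL_SAMPLE).items():
        print(dimm, errors)
```

A DIMM whose count keeps growing, like CPU0 DIMM0 here, is a candidate for replacement before uncorrectable errors appear.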
50
IPMI SEL (System Event Log)
OS independent
51 
monitor your RAID!
53 
Linux 
Software 
RAID 
LSI / Adaptec 
Hardware 
RAID
54 
Avago MegaRAID (LSI)
55 
root@debian-test:~# storcli64
Storage Command Line Tool Ver 1.13.06 Sep 03, 2014
(c)Copyright 2014, LSI Corporation, All Rights Reserved.
help - lists all the commands with their usage. E.g. storcli help
<command> help - gives details about a particular command. E.g. storcli add help
List of commands:
Commands Description
-------------------------------------------------------------------
add Adds/creates a new element to controller like VD,Spare..etc
delete Deletes an element like VD,Spare
show Displays information about an element
set Set a particular value to a property
get Get a particular value to a property
compare Compares particular value to a property
start Start background operation
stop Stop background operation
pause Pause background operation
resume Resume background operation
download Downloads file to given device
expand expands size of given drive
insert inserts new drive for missing
transform downgrades the controller
/cx Controller specific commands
/ex Enclosure specific commands
/sx Slot/PD specific commands
/vx Virtual drive specific commands
/dx Disk group specific commands
/fall Foreign configuration specific commands
/px Phy specific commands
/[bbu|cv] Battery Backup Unit, Cachevault commands
56
check_lsi_raid
$ /usr/lib/nagios/plugins/check_lsi_raid -vv
Warning (LD Warn) [c0/v0_Consist = Warning (No)]|CV_Temperature=22;70;85 ROC_Temperature=57;80;90 c0/e252/s0_Drive_Temperature=21;40;45 c0/e252/s1_Drive_Temperature=21;40;45
Used storcli commands:
- /usr/bin/sudo /usr/sbin/storcli64 /c0 /cv show status
- /usr/bin/sudo /usr/sbin/storcli64 adpallinfo a0
- /usr/bin/sudo /usr/sbin/storcli64 /c0/vall show all
- /usr/bin/sudo /usr/sbin/storcli64 /c0/vall show init
- /usr/bin/sudo /usr/sbin/storcli64 /c0/eall/sall show all
- /usr/bin/sudo /usr/sbin/storcli64 /c0/eall/sall show initialization
- /usr/bin/sudo /usr/sbin/storcli64 /c0/eall/sall show rebuild
Warning sensors:
- c0/v0_Consist (No)
57
Why adpallinfo a0?
"storcli /0 show all … blocks the whole raid card i/o for … upto ~4 seconds"
58
59 
check_lsi_raid 
$ /usr/lib/nagios/plugins/check_lsi_raid -h
check_lsi_raid: Nagios/Icinga plugin to check LSI Raid Controller status
Pulgin version: 2.0
Copyright (C) 2013-2014 Thomas-Krenn.AG
Current updates available at http://git.thomas-krenn.com/check_lsi_raid.git
This Nagios/Icinga Plugin checks LSI RAID controllers for controller,
physical device, logical device, BBU and CV warnings and errors.
In order for this plugin to work properly you need to add the nagios
user to your sudoers file (or create a new one in /etc/sudoers.d/).
Usage:
[ -h | --help ]
Display this help page
[ -v | -vv | -vvv | --verbose ]
Sets the verbosity level.
No -v is the normal single line output for Nagios/Icinga, -v is a
more detailed version but still usable in Nagios. -vv is a
multiline output for debugging configuration errors or more
detailed information. -vvv is for plugin problem diagnosis.
For further information please visit:
http://nagiosplug.sourceforge.net/developer-guidelines.html#AEN39
[ -V --version ]
Displays the plugin and, if available, the version if StorCLI.
[ -C <num> | --controller <num> ]
Specifies a controller number, defaults to 0.
...
60 
VMware? → CIM Provider
61 
VMware? → Plugin
check_esxi_hardware.py | check_vmware_esx.pl
Hardware | VMware in general
python-pywbem | VMware Perl SDK
Claudio Kuenzler et al. | Martin Fürstenau
Infos:
62
VMware? check_esxi_hardware.py
#!/usr/bin/python
# -*- coding: UTF-8 -*-
#
# Script for checking global health of host running VMware ESX/ESXi
#
# Licence : GNU General Public Licence (GPL) http://www.gnu.org/
# This program is free software; you can redistribute it and/or
...
# Copyright (c) 2008 David Ligeret
# Copyright (c) 2009 Joshua Daniel Franklin
# Copyright (c) 2010 Branden Schneider
# Copyright (c) 2010-2014 Claudio Kuenzler
# Copyright (c) 2010 Samir Ibradzic
# Copyright (c) 2010 Aaron Rogers
# Copyright (c) 2011 Ludovic Hutin
# Copyright (c) 2011 Carsten Schoene
# Copyright (c) 2011-2012 Phil Randal
# Copyright (c) 2011 Fredrik Aslund
# Copyright (c) 2011 Bertrand Jomin
# Copyright (c) 2011 Ian Chard
# Copyright (c) 2012 Craig Hart
# Copyright (c) 2013 Carl R. Friend
63 
Adaptec by PMC
64 
$ sudo arcconf 
| UCLI | Adaptec by PMC uniform command line interface 
| UCLI | Version 1.6 (B21062) 
| UCLI | (C) Adaptec by PMC 2003-2014
| UCLI | All Rights Reserved 
ATAPASSWORD | setting password on a physical drive 
COPYBACK | toggles controller copy back mode 
CREATE | creates a logical device 
CONSISTENCYCHECK | toggles the controller background consistency check mode 
DELETE | deletes one or more logical devices 
ERRORTUNABLE | sets error tunable profiles on the controller 
EXPANDERLIST | Lists the Expanders Connected to the Controller 
EXPANDERUPGRADE | updates expander firmware 
FAILOVER | toggles the controller automatic failover mode 
GETCONFIG | prints controller information 
GETLOGS | gets controller log information 
GETPERFORM | gets the parameters for a performance mode 
GETSMARTSTATS | gets the SMART statistics 
GETSTATUS | displays the status of running tasks 
GETVERSION | prints version information for all controllers 
IDENTIFY | blinks LEDS on device(s) connected to a controller 
IMAGEUPDATE | update physical device firmware 
KEY | installs a Feature Key onto a controller 
MODIFY | performs RAID Level Migration or Online Capacity Expansion 
PHYERRORLOG | displays PHY error logs for controller or device or an 
| expander PHY 
PRESERVECACHE | changes the cache preservation settings on the controller 
RESCAN | checks for new or removed drives 
RESETSTATISTICSCOUNTERS | resets the controller statistics counters 
ROMUPDATE | updates controller firmware 
SAVESUPPORTARCHIVE | saves the support archive 
SETALARM | controls the controller alarm, if present 
...
65
check_adaptec_raid (update planned for 2015)
$ ./check_adaptec_raid -p /usr/sbin/arcconf
AACRAID CRITICAL (Ctrl #1): [ZMM critical]
$ ./check_adaptec_raid -h
Thomas-Krenn Adaptec Raid Controller Nagios/Icinga Plugin Version: 1.0
Copyright (C) 2009-2013 Thomas-Krenn.AG
Current updates available via git at:
http://git.thomas-krenn.com/check_adaptec_raid.git
This Nagios/Icinga Plugin checks ADAPTEC RAID-Controllers for Controller,
Physical-Device and Logical Device warnings and errors.
In order for this plugin to work properly you need to add the nagios-user
to your sudoers file (or create a new one in /etc/sudoers.d/).
This is required as arcconf must be called with sudo permissions.
Usage:
[ -C <Controller number> ] [ -LD <Logical device number> ]
[ -PD <Physical device number> ] [ -T <Warning Temp., Crit. Temp.> ]
[ -h | --help ]
Display this help page
[ -v | -vv | -vvv | --verbose ]
Sets the verbosity level
no -v single line output for Nagios/Icinga
-v single line with more details
...
66
VMware? → CIM provider expected
_ currently:
_ "CIM provider" for remote arcconf
_ Adaptec MSM in a VM
_ in the future:
_ "real" CIM provider
67 
be smart, 
use SMART ;-)
68 
Self- 
Monitoring, 
Analysis & 
Reporting 
Technology
69 
Standardized:
_ data format
_ commands
_ error logs
_ tests
NOT standardized:
_ attributes
Vendor documentation required (often not public, except Intel/Samsung)
70 
check_smart_attributes 
$ /usr/lib/nagios/plugins/check_smart_attributes \
> -d /dev/sda \
> -dbj /etc/nagios-plugins/config/check_smartdb.json
OK (sda) |sda_Media_Wearout_Indicator=098;16;6 sda_Host_Writes_32MiB=575272 sda_Host_Reads_32MiB=723527
71
/etc/nagios-plugins/config/check_smartdb.json
...
"Intel DC S3700" : {
"Device" : ["Intel DC S3700 Series SSDs","INTEL SSDSC2BA100G3",
"ID#" : {
"5" : "RAW_VALUE", # Re-allocated Sector Count
...
"194" : "RAW_VALUE", # Temperature - Device Internal Te
...
"232" : "VALUE", # Available Reserved Space
"233" : "VALUE", # Media Wearout Indicator
"234" : "VALUE", # Thermal Throttle Status
"241" : "RAW_VALUE", # Total LBAs Written (32MiB)
"242" : "RAW_VALUE", # Total LBAs Read (32MiB)
"1024" : "VALUE" # ATA error count (custom)
},
"Threshs" : {
"5" : ["20","40"],
...
"232" : ["16:","11:"],
"233" : ["16:","6:"],
"1024" : ["0","10"]
},
"Perfs" : ["194","233","241","242"]
},
...
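The entries under "Threshs" look like Nagios threshold ranges given as [warning, critical]. The sketch below evaluates them under the assumption that check_smart_attributes follows the range syntax of the Nagios Plugin Development Guidelines, which the slides do not confirm. In that syntax "16:" alerts below 16 and a plain "20" alerts outside 0..20. The helper names are made up for this example.

```python
from typing import Tuple

def parse_range(spec: str) -> Tuple[float, float, bool]:
    """Parse a Nagios range such as '20', '16:', '~:10', '10:20' or '@10:20'."""
    inside = spec.startswith("@")  # '@' inverts the check: alert when inside the range
    if inside:
        spec = spec[1:]
    if ":" in spec:
        low_text, high_text = spec.split(":", 1)
        low = float("-inf") if low_text == "~" else float(low_text or 0)
        high = float("inf") if high_text == "" else float(high_text)
    else:
        low, high = 0.0, float(spec)
    return low, high, inside

def violates(value: float, spec: str) -> bool:
    """True if the value should raise an alert for this range."""
    low, high, inside = parse_range(spec)
    in_range = low <= value <= high
    return in_range if inside else not in_range

def check(value: float, warn: str, crit: str) -> int:
    """Nagios-style exit code: 0 OK, 1 WARNING, 2 CRITICAL."""
    if violates(value, crit):
        return 2
    if violates(value, warn):
        return 1
    return 0

if __name__ == "__main__":
    # Attribute 233 (Media Wearout Indicator) with thresholds ["16:", "6:"]
    for wearout in (98, 10, 5):
        print(wearout, check(wearout, "16:", "6:"))
```

Under this reading, a wearout indicator of 98 is fine, a value below 16 is a warning and a value below 6 is critical.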
72
/etc/nagios-plugins/config/check_smartdb.json
...
73
New SSDs & HDDs all the time
74
Updates?
75
Thank Git ;-)
76
Cool, but what about RAID controllers?
...
[-d|--device <path to device being checked>]
Specify the device being monitored. If multiple devices should be
checked provide the '-d' option multiple times.
E.g. '-d /dev/sda -d /dev/sdb'
For devices behind LSI RAID controllers specify 'megaraid' and then the
device number, e.g. '-d megaraid6'. Use storcli to find out the
corresponding device numbers.
For devices behind Adaptec RAID controllers specify '/dev/sg<X>' where
<X> is the number for your device. Use e.g. sg_scan to find the device.
You must also use '-O sat' or '-O scsi' according to the device
interface. This are extra options only necessary for '/dev/sg<X>'
devices.
...
77
Cool, but what about RAID controllers?
$ /usr/lib/nagios/plugins/check_smart_attributes \
> -d megaraid6 \
> -dbj /etc/nagios-plugins/config/check_smartdb.json
OK (megaraid6) |megaraid6_Temperature_Internal=26 megaraid6_Media_Wearout_Indicator=100;16;6 megaraid6_Host_Writes_32MiB=70283 megaraid6_Host_Reads_32MiB=1650800
$ /usr/lib/nagios/plugins/check_smart_attributes \
> -d megaraid7 \
> -dbj /etc/nagios-plugins/config/check_smartdb.json
Warning (megaraid7) [megaraid7_CRC_Error_Count = Warning]|megaraid7_Temperature_Internal=34 megaraid7_Media_Wearout_Indicator=098;16;6 megaraid7_Host_Writes_32MiB=189904 megaraid7_Host_Reads_32MiB=29658
78 
monitor your GPU!
79 
check_gpu_sensor 
$ /usr/lib/nagios/plugins/check_gpu_sensor -db 0000:83:00.0
OK - Tesla K20 |ECCL2AggSgl=0;1;2; ECCTexAggSgl=0;1;2; memUtilRate=0 PWRUsage=49.81;150;200; ECCRegAggSgl=0;1;2; SMClock=705 ECCL1AggSgl=0;1;2; GPUTemperature=38;85;100; memClock=2600 usedMemory=0.24;95;99; fanSpeed=30;80;95; graphicsClock=705 GPUUtilRate=0 ECCMemAggSgl=0;1;2;
80
NVIDIA: "The reported fan speed does not indicate whether the fan is actually spinning."
"It is the speed at which the fan algorithm is trying to run the fan."
We recommend: use the temperature sensor
Plugins - Future 
_ Monitoring of firmware versions
_ RAID consistency checks
_ Temperature of 10 Gbit NICs (see Intel X540 FAQs)
82 
So, what now?
83
Relax ...
_ all plugins at git.thomas-krenn.com
_ all plugins comply with the Plugin Developer Guidelines (-h for help)
_ "Plugin Entwicklung für Einsteiger" (plugin development for beginners) by Alexander Wirt, today at 14:15
84
Relax, start ...
_ create a server list
_ configure IPMI securely
_ set up the relevant plugins
85 
Relax, start and have fun at

OSMC 2014: Server Hardware Monitoring done right | Werner Fischer

Information Gathering 2
Information Gathering 2Information Gathering 2
Information Gathering 2Aero Plane
 
OSMC 2014 | Server Hardware Monitoring done right by Werner Fischer
OSMC 2014 | Server Hardware Monitoring done right by Werner FischerOSMC 2014 | Server Hardware Monitoring done right by Werner Fischer
OSMC 2014 | Server Hardware Monitoring done right by Werner FischerNETWAYS
 
Icinga Camp Berlin 2017 - 10 Tips for better Hardware Monitoring
Icinga Camp Berlin 2017 - 10 Tips for better Hardware MonitoringIcinga Camp Berlin 2017 - 10 Tips for better Hardware Monitoring
Icinga Camp Berlin 2017 - 10 Tips for better Hardware MonitoringIcinga
 
Track c-High speed transaction-based hw-sw coverification -eve
Track c-High speed transaction-based hw-sw coverification -eveTrack c-High speed transaction-based hw-sw coverification -eve
Track c-High speed transaction-based hw-sw coverification -evechiportal
 
Lost in Translation: When Industrial Protocol Translation goes Wrong [CONFide...
Lost in Translation: When Industrial Protocol Translation goes Wrong [CONFide...Lost in Translation: When Industrial Protocol Translation goes Wrong [CONFide...
Lost in Translation: When Industrial Protocol Translation goes Wrong [CONFide...Marco Balduzzi
 
managing your network environment
managing your network environmentmanaging your network environment
managing your network environmentscooby_doo
 
emips_overview_apr08
emips_overview_apr08emips_overview_apr08
emips_overview_apr08Neil Pittman
 
Nvidia tegra K1 Presentation
Nvidia tegra K1 PresentationNvidia tegra K1 Presentation
Nvidia tegra K1 PresentationANURAG SEKHSARIA
 
Android Things in action
Android Things in actionAndroid Things in action
Android Things in actionStefano Sanna
 
20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Joris20081114 Friday Food iLabt Bart Joris
20081114 Friday Food iLabt Bart Jorisimec.archive
 
05 module managing your network enviornment
05  module managing your network enviornment05  module managing your network enviornment
05 module managing your network enviornmentAsif
 
Important cisco-chow-commands
Important cisco-chow-commandsImportant cisco-chow-commands
Important cisco-chow-commandsssusere31b5c
 
LCA13: CPUIDLE: One driver to rule them all?
LCA13: CPUIDLE: One driver to rule them all?LCA13: CPUIDLE: One driver to rule them all?
LCA13: CPUIDLE: One driver to rule them all?Linaro
 
44CON 2014 - Stupid PCIe Tricks, Joe Fitzpatrick
44CON 2014 - Stupid PCIe Tricks, Joe Fitzpatrick44CON 2014 - Stupid PCIe Tricks, Joe Fitzpatrick
44CON 2014 - Stupid PCIe Tricks, Joe Fitzpatrick44CON
 
Positive Hack Days. Pavlov. Network Infrastructure Security Assessment
Positive Hack Days. Pavlov. Network Infrastructure Security AssessmentPositive Hack Days. Pavlov. Network Infrastructure Security Assessment
Positive Hack Days. Pavlov. Network Infrastructure Security AssessmentPositive Hack Days
 
CCA security answers chapter 2 test
CCA security answers chapter 2 testCCA security answers chapter 2 test
CCA security answers chapter 2 testSoporte Yottatec
 

OSMC 2014: Server Hardware Monitoring done right | Werner Fischer

  • 1. Server Hardware Monitoring done right! Werner Fischer, Thomas-Krenn.AG
  • 2. Status quo: Do you monitor your server hardware? Yes / No
  • 3. After this talk, you will monitor more securely and more comprehensively
  • 4. After this talk, you will monitor more securely and more comprehensively (at least I hope so... ;-)
  • 5. Status quo: Which technologies do you use? IPMI / SNMP, NRPE, CAM, CAT
  • 7. CATinspection → satisfaction?
  • 8. Agenda: IPMI (20 min), RAM (5 min), RAID (10 min), SMART (5 min), GPU (5 min)
  • 9. Monitor your IPMI sensors!
  • 12. Intelligent Platform Management Interface
  • 13. Functions: Monitoring (temp, fans, ...), Recovery Control (on/off/reset), Logging (System Event Log), Inventory (FRU information)
  • 14. Architecture (block diagram). Components shown: Baseboard Management Controller (BMC) on the motherboard; sensors & controls (fan sensor, temp. sensor, power control, reset control, ...); non-volatile storage (SDR, SEL, FRU); IPMB with chassis management (satellite controller), chassis board, processor board, memory board and redundant power board (each with FRU, some with temp. sensors) on private management busses; auxiliary IPMB connector; ICMB bridge; system interface to the system bus; LAN interface with network (LAN) controller and LAN connector; serial/modem interface with serial port sharing between the M/B serial controller and the BMC serial controller; PCI management bus to a remote management card (KVM over IP, ...). Access labels in the diagram: "access with user name & password" and "access with root privileges".
  • 15. IPMI sensor classes. Discrete (true/false): several states possible (up to 15 states, each state = 1 bit, several status bits can be active at once); delivers generic states and sensor-specific states. Similar class: OEM (the meaning of the states is defined by the OEM). Threshold: the state depends on comparing the analog reading with the thresholds; delivers an analog reading and a discrete status.
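The two sensor classes can be illustrated with a minimal Python sketch. It is not taken from the talk: the function names and the status strings are hypothetical, and the state list only mirrors the power supply states shown in the ipmitool output on the next slide. The threshold values in the usage note are the "Fan 1" values from that same output.

```python
from typing import List, Optional

# Illustrative state names only (assumption): bit 0 is the first entry, and so on.
POWER_SUPPLY_STATES = [
    "Presence detected",
    "Failure detected",
    "Predictive failure",
    "Power Supply AC lost",
]


def decode_discrete(mask: int, names: List[str]) -> List[str]:
    """Discrete sensor: every state is one bit, several bits may be set at once."""
    return [name for bit, name in enumerate(names[:15]) if mask & (1 << bit)]


def classify_threshold(reading: float,
                       lower_nc: Optional[float] = None,
                       lower_cr: Optional[float] = None,
                       upper_nc: Optional[float] = None,
                       upper_cr: Optional[float] = None) -> str:
    """Threshold sensor: derive a status by comparing the analog reading
    with whichever thresholds the sensor provides."""
    if lower_cr is not None and reading <= lower_cr:
        return "critical"
    if upper_cr is not None and reading >= upper_cr:
        return "critical"
    if lower_nc is not None and reading <= lower_nc:
        return "warning"
    if upper_nc is not None and reading >= upper_nc:
        return "warning"
    return "ok"
```

For example, `decode_discrete(0b1001, POWER_SUPPLY_STATES)` yields the two states "Presence detected" and "Power Supply AC lost", and `classify_threshold(5719, lower_nc=1978, lower_cr=1720)` yields "ok".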
  • 16. IPMI sensor classes: discrete and threshold
      [root@test ~]# ipmitool sdr get "PS2 Status"
      Sensor ID              : PS2 Status (0x71)
      Entity ID              : 10.2 (Power Supply)
      Sensor Type (Discrete) : Power Supply
      States Asserted        : Power Supply
                               [Presence detected]
                               [Power Supply AC lost]
      Assertion Events       : Power Supply
                               [Presence detected]
                               [Power Supply AC lost]
      Assertions Enabled     : Power Supply
                               [Presence detected]
                               [Failure detected]
                               [Predictive failure]
                               [Power Supply AC lost]
                               [...]
      Deassertions Enabled   : Power Supply
                               [...]
      [root@test ~]# ipmitool sdr get "Fan 1"
      Sensor ID              : Fan 1 (0x50)
      Entity ID              : 29.1 (Fan Device)
      Sensor Type (Analog)   : Fan
      Sensor Reading         : 5719 (+/- 0) RPM
      Status                 : ok
      Nominal Reading        : 6708.000
      Normal Minimum         : 2451.000
      Normal Maximum         : 10965.000
      Lower critical         : 1720.000
      Lower non-critical     : 1978.000
      Positive Hysteresis    : 86.000
      Negative Hysteresis    : 86.000
      Minimum sensor range   : Unspecified
      Maximum sensor range   : Unspecified
      Event Message Control  : Per-threshold
      Readable Thresholds    : lcr lnc
      Settable Thresholds    : lcr lnc
      Threshold Read Mask    : lcr lnc
      Assertion Events       :
      Assertions Enabled     : lnc- lcr-
      Deassertions Enabled   : lnc- lcr-
  • 17. IPMI sensors (states highlighted on the slide: OK, Critical)
      $ sudo ipmi-sensors --output-sensor-state --interpret-oem-data
      Password:
      ID   | Name            | Type              | State    | Reading | Units | Event
      4    | System Temp     | Temperature       | Nominal  | 27.00   | C     | 'OK'
      71   | Peripheral Temp | Temperature       | Nominal  | 35.00   | C     | 'OK'
      138  | CPU Temp        | OEM Reserved      | Nominal  | N/A     | N/A   | 'Low'
      205  | FAN 1           | Fan               | Nominal  | 1800.00 | RPM   | 'OK'
      …
      942  | VBAT            | Voltage           | Nominal  | 3.15    | V     | 'OK'
      1009 | VSB             | Voltage           | Nominal  | 3.34    | V     | 'OK'
      1076 | AVCC            | Voltage           | Nominal  | 3.38    | V     | 'OK'
      1143 | Chassis Intru   | Physical Security | Critical | N/A     | N/A   | 'Gen...'
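The pipe-separated output above lends itself to simple parsing. The following Python sketch is an illustration only, not how check_ipmi_sensor works internally; the function name `parse_ipmi_sensors` and the embedded sample are assumptions made for this example, with the sample rows copied from the slide.

```python
from typing import Dict, List

# Sample rows copied from the slide; a real call would capture the tool's stdout.
SAMPLE = """\
ID   | Name            | Type              | State    | Reading | Units | Event
4    | System Temp     | Temperature       | Nominal  | 27.00   | C     | 'OK'
205  | FAN 1           | Fan               | Nominal  | 1800.00 | RPM   | 'OK'
1143 | Chassis Intru   | Physical Security | Critical | N/A     | N/A   | 'Gen...'
"""


def parse_ipmi_sensors(text: str) -> List[Dict[str, str]]:
    """Turn the table into one dict per sensor, keyed by the header names."""
    header: List[str] = []
    rows: List[Dict[str, str]] = []
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip prompts, blank lines and elisions
        cells = [cell.strip() for cell in line.split("|")]
        if not header:
            header = cells
            continue
        rows.append(dict(zip(header, cells)))
    return rows


# Sensors whose interpreted state is anything other than Nominal.
problems = [row["Name"] for row in parse_ipmi_sensors(SAMPLE)
            if row["State"] != "Nominal"]
print(problems)
```

Filtering on the State column rather than on the raw reading relies on the interpretation that `--output-sensor-state` already provides.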
  • 18. IPMI sensors (discrete)
      $ cat /etc/freeipmi/freeipmi_interpret_sensor.conf
      […]
      ## IPMI_Physical_Security
      #
      # IPMI_Physical_Security_No_Event Nominal
      # IPMI_Physical_Security_General_Chassis_Intrusion Critical
      # IPMI_Physical_Security_Drive_Bay_Intrusion Critical
      […]
      # IPMI_Power_Supply_No_Event Nominal
      # IPMI_Power_Supply_Presence_Detected Nominal
      # IPMI_Power_Supply_Power_Supply_Failure_Detected Critical
      # IPMI_Power_Supply_Predictive_Failure Critical
      # IPMI_Power_Supply_Power_Supply_Input_Lost_AC_DC Critical
      […]
  • 19. IPMI plugin
      $ ./check_ipmi_sensor -H 192.168.255.5 -f ipmi.cfg -vv
      IPMI Status: OK | 'System Temp'=27.00 'Peripheral Temp'=35.00 'FAN 1'=1800.00 'Vcore'=0.98 '3.3VCC'=3.36 '12V'=11.93 'VDIMM'=1.53 '5VCC'=5.09 '-12V'=-12.09 'VBAT'=3.15 'VSB'=3.34 'AVCC'=3.38
      System Temp = 27.00 (Status: Nominal)
      Peripheral Temp = 35.00 (Status: Nominal)
      CPU Temp = 'Low' (Status: Nominal)
      FAN 1 = 1800.00 (Status: Nominal)
      Vcore = 0.98 (Status: Nominal)
      3.3VCC = 3.36 (Status: Nominal)
      12V = 11.93 (Status: Nominal)
      VDIMM = 1.53 (Status: Nominal)
      5VCC = 5.09 (Status: Nominal)
      -12V = -12.09 (Status: Nominal)
      VBAT = 3.15 (Status: Nominal)
      VSB = 3.34 (Status: Nominal)
      AVCC = 3.38 (Status: Nominal)
      Chassis Intru = 'OK' (Status: Nominal)
  • 20. IPMI plugin
      #!/usr/bin/perl
      # check_ipmi_sensor: Nagios/Icinga plugin to check IPMI sensors
      #
      # Copyright (C) 2009-2014 Thomas-Krenn.AG,
      # additional contributors see changelog.txt
      #
      # This program is free software; you can redistribute it and/or modify it under
      […]
      Version 3.5 20141031
      * Fix LAN Driver if called on localhost
      Version 3.4 20140929
      * Fix implicit array warning with split
      * Add option to disable LAN protocol version 2.0
      Version 3.3 20140606
      * Print a warning if ipmi-sensors only returned a single output row
      * Ignore sudo errors and warnings in IPMI command output (Thanks to Robert Heinzmann for contributing)
      * Use LAN protocol version 2.0 per default
      * Print empty output error only if return code was 0
      * Exit the plugin with return code 3 if fru command fails
      * Added an include list option to only include specific sensors
      Version 3.2 20131028
      * Added FRU serial number to output
  • 21. So far, so good?
  • 24. "The Eavesdropping System in Your Computer" (Bruce Schneier, Schneier on Security blog, 31 January 2013)
  • 27. 230,000 1U servers → 10,223.5 m in height (Mount Everest: 8,848 m)
  • 29. IPMI firmware by ATEN / AMI: mainboard manufacturers adapt the firmware; OS = embedded Linux; parts of the IPMI firmware are closed source
  • 30. "We recommend not running administrative interfaces such as IPMI, or SSH services for that matter, openly on the Internet, but using a firewall/VPN so that only authorized persons can access such services."
  • 31. What if you do anyway? Enable & DROP
  • 32. IPMI: top 3 security tips
  • 33. #1 - Network
  • 34. #1 - Network
  • 35. #2 – User management: sjfaiklaz, afjhuijoh; Administrator, User
  • 36. #2 – User management
      "In short, the authentication process for IPMI 2.0 mandates that the server send a salted SHA1 or MD5 hash of the requested user's password to the client, prior to the client authenticating." (A Penetration Tester's Guide to IPMI and BMCs, rapid7.com)
      msf > use auxiliary/scanner/ipmi/ipmi_dumphashes
      msf auxiliary(ipmi_dumphashes) > set RHOSTS 10.1.102.141
      RHOSTS => 10.1.102.141
      msf auxiliary(ipmi_dumphashes) > set THREADS 128
      THREADS => 128
      msf auxiliary(ipmi_dumphashes) > run
      [+] 10.1.102.141:623 - IPMI - Hash found:
      admin:14667523250000004ec525d3852f4fa73c93b674788217fe00000000000000
      00000000000000000000000000000000000000000000000000140561646d696e:2c7
      6e372d89ac7cd4e3bfecb423962f708d0741c
  • 37. #2 – User management
      $ ./cudaHashcat64.bin --outfile=ipmi.out -m 7300 hash.txt -a 3 ?lu?lu?lu?lu?lu?lu
      [...]
      Session.Name...: cudaHashcat
      Status.........: Exhausted
      Input.Mode.....: Mask (?lu?lu?lu?lu?lu?lu) [12]
      Hash.Target....: 54414378fb2db5ff365e4bc5856adaf4c1b8a2f2153efd1b81fb54dfe1bf56478788
                       ea7ba154375b40167e34f026e1020010d21d1ea31625040561646d696e:0a0b16023
                       1e204a6d0bd086e26718002409b35b7
      Hash.Type......: IPMI2 RAKP HMAC-SHA1
      Time.Started...: Thu Sep 18 10:11:17 2014 (6 secs)
      Time.Estimated.: 0 secs
      Speed.GPU.#1...: 52732.3 kH/s
      Recovered......: 0/1 (0.00%) Digests, 0/1 (0.00%) Salts
      Progress.......: 308915776/308915776 (100.00%)
      Skipped........: 0/308915776 (0.00%)
      Rejected.......: 0/308915776 (0.00%)
      HWMon.GPU.#1...: -1% Util, 41c Temp, 31% Fan
  • 38. #2 – User management: complex & long passwords (20)
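Why length matters can be checked with a little arithmetic on the hashcat figures above: a progress of 308915776 candidates equals 26^6, and at the reported 52732.3 kH/s that keyspace is exhausted in about six seconds. The Python sketch below redoes this calculation. It assumes that the "20" on the slide means a 20-character password and that such a password draws on the 94 printable ASCII characters; both are assumptions for illustration, not statements from the talk.

```python
# Rate reported on the slide: 52732.3 kH/s on one GPU.
RATE_HASHES_PER_SECOND = 52_732.3 * 1000
SECONDS_PER_YEAR = 365 * 24 * 3600


def keyspace(alphabet_size: int, length: int) -> int:
    """Number of candidates for a password of fixed length."""
    return alphabet_size ** length


def seconds_to_exhaust(candidates: int, rate: float = RATE_HASHES_PER_SECOND) -> float:
    """Time to try every candidate once at the given hash rate."""
    return candidates / rate


# The mask on the slide: six positions with 26 choices each.
small = keyspace(26, 6)
print(small, round(seconds_to_exhaust(small), 1), "seconds")

# Assumption: 20 characters from the 94 printable ASCII characters.
large = keyspace(94, 20)
print(f"{seconds_to_exhaust(large) / SECONDS_PER_YEAR:.1e} years")
```

The exact rate depends on the GPU; the point of the sketch is only the exponential growth with password length.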
  • 39. #3 – Limit services
  • 42. Monitor your RAM! (it's ECC, isn't it?)
  • 44. 3%: at least 1 CE (correctable error) per year (DDR2). Sources: Google 2009, Jaguar cluster 2012
  • 45. 70%: CEs before UEs (uncorrectable errors). Source: Google 2009
  • 46. 1.3%: servers with UEs per year. Source: Google 2009
  • 47. Linux EDAC
      root@debian-test:/sys/devices/system/edac/mc/mc0/csrow0# ls -l
      total 0
      -r--r--r-- 1 root root 4096 Nov 12 09:02 ce_count
      -r--r--r-- 1 root root 4096 Nov 12 09:02 ch0_ce_count
      -rw-r--r-- 1 root root 4096 Nov 12 09:02 ch0_dimm_label
      -r--r--r-- 1 root root 4096 Nov 12 09:02 ch1_ce_count
      -rw-r--r-- 1 root root 4096 Nov 12 09:02 ch1_dimm_label
      -r--r--r-- 1 root root 4096 Nov 12 09:02 dev_type
      -r--r--r-- 1 root root 4096 Nov 12 09:02 edac_mode
      -r--r--r-- 1 root root 4096 Nov 12 09:02 mem_type
      drwxr-xr-x 2 root root    0 Nov 12 09:02 power
      -r--r--r-- 1 root root 4096 Nov 12 09:02 size_mb
      lrwxrwxrwx 1 root root    0 Nov 12 09:02 subsystem -> ../../../../../../bus/mc0
      -r--r--r-- 1 root root 4096 Nov 12 09:02 ue_count
      -rw-r--r-- 1 root root 4096 Nov 12 09:02 uevent
      root@debian-test:/sys/devices/system/edac/mc/mc0/csrow0# cat ce_count
      0
      root@debian-test:/sys/devices/system/edac/mc/mc0/csrow0# cat ue_count
      0
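The `ce_count` and `ue_count` files shown above can be read by a small script. The following is a minimal Python sketch, not a plugin from the talk: the function names are hypothetical, and mapping the counters to the usual Nagios return codes (0 = OK, 1 = WARNING, 2 = CRITICAL) is an assumption about how one might alert on them.

```python
from pathlib import Path
from typing import Dict

EDAC_ROOT = "/sys/devices/system/edac/mc"  # sysfs path as shown on the slide


def read_edac_counts(root: str = EDAC_ROOT) -> Dict[str, int]:
    """Sum correctable (ce) and uncorrectable (ue) error counters
    over all csrow directories of all memory controllers."""
    totals = {"ce": 0, "ue": 0}
    for csrow in sorted(Path(root).glob("mc*/csrow*")):
        for kind in totals:
            counter = csrow / f"{kind}_count"
            if counter.is_file():
                totals[kind] += int(counter.read_text().strip())
    return totals


def edac_status(totals: Dict[str, int]) -> int:
    """Assumed policy: any UE is critical, any CE is a warning."""
    if totals["ue"] > 0:
        return 2
    if totals["ce"] > 0:
        return 1
    return 0
```

On a machine without a loaded EDAC driver the glob simply matches nothing and both totals stay at zero, so a real check would probably want to report that case separately.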
  • 48. Linux EDAC support matrix
      Driver module  | CPUs              | Kernel | Supported architectures
      amd64_edac.c   | AMD               | 2.6.31 | K8 and F10
                     |                   | 2.6.39 | F15
                     |                   | 3.10   | F16
                     |                   | 3.13   | F15_M30H
                     |                   | 3.15   | F16_M30H
      i7core_edac.c  | Intel single/dual | 2.6.35 | Nehalem/Westmere
      ie31200_edac.c | Intel single-CPU  | 3.17   | Sandy & Ivy Bridge, Haswell
      sb_edac.c      | Intel dual-CPU    | 3.2    | Sandy Bridge
                     |                   | 3.13   | Ivy Bridge
                     |                   | 3.17   | Haswell
  • 49. IPMI SEL (System Event Log): supported from check_ipmi_sensor v3.6 on (planned for 12/2014)
      $ ipmi-sel
      ID | Date        | Time     | Name       | State   | Event
      1  | Feb-03-2012 | 10:31:58 | CPU0 DIMM0 | Warning | Correctable memory error
      2  | Feb-13-2012 | 22:28:58 | CPU0 DIMM0 | Warning | Correctable memory error
      3  | Feb-14-2012 | 00:29:03 | CPU0 DIMM0 | Warning | Correctable memory error
      4  | Feb-14-2012 | 01:29:06 | CPU0 DIMM0 | Warning | Correctable memory error
      ...
  • 50. IPMI SEL (System Event Log), same ipmi-sel output as on slide 49: OS-independent
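Because the SEL is kept by the BMC, the ipmi-sel listing can be evaluated on any operating system. As a rough illustration (not the planned check_ipmi_sensor feature), this Python sketch counts correctable memory errors per DIMM; the function name and the embedded sample are assumptions, with the sample rows copied from the slide.

```python
from collections import Counter

# Sample rows copied from the slide.
SAMPLE_SEL = """\
ID | Date        | Time     | Name       | State   | Event
1  | Feb-03-2012 | 10:31:58 | CPU0 DIMM0 | Warning | Correctable memory error
2  | Feb-13-2012 | 22:28:58 | CPU0 DIMM0 | Warning | Correctable memory error
3  | Feb-14-2012 | 00:29:03 | CPU0 DIMM0 | Warning | Correctable memory error
4  | Feb-14-2012 | 01:29:06 | CPU0 DIMM0 | Warning | Correctable memory error
"""


def count_correctable_errors(sel_text: str) -> Counter:
    """Count 'Correctable memory error' events per sensor name (DIMM)."""
    counts: Counter = Counter()
    for line in sel_text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) == 6 and cells[5] == "Correctable memory error":
            counts[cells[3]] += 1
    return counts


print(count_correctable_errors(SAMPLE_SEL))
```

A DIMM that keeps accumulating correctable errors is a natural candidate for replacement before an uncorrectable error occurs.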
  • 53. Linux software RAID; LSI / Adaptec hardware RAID
  • 55. root@debian-test:~# storcli64
      Storage Command Line Tool Ver 1.13.06 Sep 03, 2014
      (c)Copyright 2014, LSI Corporation, All Rights Reserved.
      help - lists all the commands with their usage. E.g. storcli help
      <command> help - gives details about a particular command. E.g. storcli add help
      List of commands:
      Commands   Description
      -------------------------------------------------------------------
      add        Adds/creates a new element to controller like VD,Spare..etc
      delete     Deletes an element like VD,Spare
      show       Displays information about an element
      set        Set a particular value to a property
      get        Get a particular value to a property
      compare    Compares particular value to a property
      start      Start background operation
      stop       Stop background operation
      pause      Pause background operation
      resume     Resume background operation
      download   Downloads file to given device
      expand     expands size of given drive
      insert     inserts new drive for missing
      transform  downgrades the controller
      /cx        Controller specific commands
      /ex        Enclosure specific commands
      /sx        Slot/PD specific commands
      /vx        Virtual drive specific commands
      /dx        Disk group specific commands
      /fall      Foreign configuration specific commands
      /px        Phy specific commands
      /[bbu|cv]  Battery Backup Unit, Cachevault commands
  • 56. check_lsi_raid
      $ /usr/lib/nagios/plugins/check_lsi_raid -vv
      Warning (LD Warn)
      [c0/v0_Consist = Warning (No)]|
      CV_Temperature=22;70;85 ROC_Temperature=57;80;90
      c0/e252/s0_Drive_Temperature=21;40;45 c0/e252/s1_Drive_Temperature=21;40;45
      Used storcli commands:
      - /usr/bin/sudo /usr/sbin/storcli64 /c0 /cv show status
      - /usr/bin/sudo /usr/sbin/storcli64 adpallinfo a0
      - /usr/bin/sudo /usr/sbin/storcli64 /c0/vall show all
      - /usr/bin/sudo /usr/sbin/storcli64 /c0/vall show init
      - /usr/bin/sudo /usr/sbin/storcli64 /c0/eall/sall show all
      - /usr/bin/sudo /usr/sbin/storcli64 /c0/eall/sall show initialization
      - /usr/bin/sudo /usr/sbin/storcli64 /c0/eall/sall show rebuild
      Warning sensors:
      - c0/v0_Consist (No)
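The part after the pipe in the plugin output is Nagios performance data in the form `label=value;warn;crit`. The Python sketch below shows how such a string can be split into numbers. It is an illustration with a hypothetical function name and handles only unquoted labels without units, as in the check_lsi_raid output above; quoted labels such as 'System Temp' in the check_ipmi_sensor output would need extra handling.

```python
from typing import Dict, Optional, Tuple

PerfValue = Tuple[float, Optional[float], Optional[float]]


def parse_perfdata(perfdata: str) -> Dict[str, PerfValue]:
    """Split 'label=value;warn;crit' items into (value, warn, crit) tuples.
    Missing warn/crit fields become None."""
    result: Dict[str, PerfValue] = {}
    for item in perfdata.split():
        label, _, data = item.partition("=")
        fields = data.split(";")
        value = float(fields[0])
        warn = float(fields[1]) if len(fields) > 1 and fields[1] else None
        crit = float(fields[2]) if len(fields) > 2 and fields[2] else None
        result[label] = (value, warn, crit)
    return result


perf = parse_perfdata("CV_Temperature=22;70;85 ROC_Temperature=57;80;90 "
                      "c0/e252/s0_Drive_Temperature=21;40;45")
print(perf["ROC_Temperature"])
```

Here the RAID-on-chip temperature of 57 is below its warning value of 80, so the temperature values themselves are not what triggers the Warning state in the example output.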
  • 57. Why adpallinfo a0? "storcli /0 show all … blocks the whole raid card i/o for … upto ~4 seconds"
  • 59. check_lsi_raid
      $ /usr/lib/nagios/plugins/check_lsi_raid -h
      check_lsi_raid: Nagios/Icinga plugin to check LSI Raid Controller status
      Pulgin version: 2.0
      Copyright (C) 2013-2014 Thomas-Krenn.AG
      Current updates available at http://git.thomas-krenn.com/check_lsi_raid.git
      This Nagios/Icinga Plugin checks LSI RAID controllers for controller, physical device, logical device, BBU and CV warnings and errors.
      In order for this plugin to work properly you need to add the nagios user to your sudoers file (or create a new one in /etc/sudoers.d/).
      Usage:
      [ -h | --help ] Display this help page
      [ -v | -vv | -vvv | --verbose ] Sets the verbosity level. No -v is the normal single line output for Nagios/Icinga, -v is a more detailed version but still usable in Nagios. -vv is a multiline output for debugging configuration errors or more detailed information. -vvv is for plugin problem diagnosis. For further information please visit: http://nagiosplug.sourceforge.net/developer-guidelines.html#AEN39
      [ -V --version ] Displays the plugin and, if available, the version if StorCLI.
      [ -C <num> | --controller <num> ] Specifies a controller number, defaults to 0.
      ...
  • 60. VMware? → CIM provider
  • 61. VMware? → Plugins: check_esxi_hardware.py (hardware; python-pywbem; Claudio Kuenzler et al.) and check_vmware_esx.pl (VMware in general; VMware Perl SDK; info: Martin Fürstenau)
  • 62. VMware? check_esxi_hardware.py
      #!/usr/bin/python
      # -*- coding: UTF-8 -*-
      #
      # Script for checking global health of host running VMware ESX/ESXi
      #
      # Licence : GNU General Public Licence (GPL) http://www.gnu.org/
      # This program is free software; you can redistribute it and/or
      ...
      # Copyright (c) 2008 David Ligeret
      # Copyright (c) 2009 Joshua Daniel Franklin
      # Copyright (c) 2010 Branden Schneider
      # Copyright (c) 2010-2014 Claudio Kuenzler
      # Copyright (c) 2010 Samir Ibradzic
      # Copyright (c) 2010 Aaron Rogers
      # Copyright (c) 2011 Ludovic Hutin
      # Copyright (c) 2011 Carsten Schoene
      # Copyright (c) 2011-2012 Phil Randal
      # Copyright (c) 2011 Fredrik Aslund
      # Copyright (c) 2011 Bertrand Jomin
      # Copyright (c) 2011 Ian Chard
      # Copyright (c) 2012 Craig Hart
      # Copyright (c) 2013 Carl R. Friend
  • 64. $ sudo arcconf
      | UCLI | Adaptec by PMC uniform command line interface
      | UCLI | Version 1.6 (B21062)
      | UCLI | (C) Adaptec by PMC 2003-2014
      | UCLI | All Rights Reserved
      ATAPASSWORD             | setting password on a physical drive
      COPYBACK                | toggles controller copy back mode
      CREATE                  | creates a logical device
      CONSISTENCYCHECK        | toggles the controller background consistency check mode
      DELETE                  | deletes one or more logical devices
      ERRORTUNABLE            | sets error tunable profiles on the controller
      EXPANDERLIST            | Lists the Expanders Connected to the Controller
      EXPANDERUPGRADE         | updates expander firmware
      FAILOVER                | toggles the controller automatic failover mode
      GETCONFIG               | prints controller information
      GETLOGS                 | gets controller log information
      GETPERFORM              | gets the parameters for a performance mode
      GETSMARTSTATS           | gets the SMART statistics
      GETSTATUS               | displays the status of running tasks
      GETVERSION              | prints version information for all controllers
      IDENTIFY                | blinks LEDS on device(s) connected to a controller
      IMAGEUPDATE             | update physical device firmware
      KEY                     | installs a Feature Key onto a controller
      MODIFY                  | performs RAID Level Migration or Online Capacity Expansion
      PHYERRORLOG             | displays PHY error logs for controller or device or an
                              | expander PHY
      PRESERVECACHE           | changes the cache preservation settings on the controller
      RESCAN                  | checks for new or removed drives
      RESETSTATISTICSCOUNTERS | resets the controller statistics counters
      ROMUPDATE               | updates controller firmware
      SAVESUPPORTARCHIVE      | saves the support archive
      SETALARM                | controls the controller alarm, if present
      ...
  • 65. check_adaptec_raid (update planned, 2015)
      $ ./check_adaptec_raid -p /usr/sbin/arcconf
      AACRAID CRITICAL (Ctrl #1): [ZMM critical]
      $ ./check_adaptec_raid -h
      Thomas-Krenn Adaptec Raid Controller Nagios/Icinga Plugin
      Version: 1.0
      Copyright (C) 2009-2013 Thomas-Krenn.AG
      Current updates available via git at:
      http://git.thomas-krenn.com/check_adaptec_raid.git
      This Nagios/Icinga Plugin checks ADAPTEC RAID-Controllers for Controller, Physical-Device and Logical Device warnings and errors.
      In order for this plugin to work properly you need to add the nagios-user to your sudoers file (or create a new one in /etc/sudoers.d/). This is required as arcconf must be called with sudo permissions.
      Usage:
      [ -C <Controller number> ]
      [ -LD <Logical device number> ]
      [ -PD <Physical device number> ]
      [ -T <Warning Temp., Crit. Temp.> ]
      [ -h | --help ] Display this help page
      [ -v | -vv | -vvv | --verbose ] Sets the verbosity level
      no -v single line output for Nagios/Icinga
      -v single line with more details
      ...
  • 66. VMware? → CIM provider expected _ currently: _ "CIM provider" for remote arcconf _ Adaptec MSM in a VM _ in future: _ a "real" CIM provider
  • 67. be smart, use SMART ;-)
  • 68. Self-Monitoring, Analysis & Reporting Technology
  • 69. Standardized: data format, commands, error logs, tests. NOT standardized: attributes (vendor documentation required; often not public, except Intel/Samsung)
  • 70. check_smart_attributes
      $ /usr/lib/nagios/plugins/check_smart_attributes -d /dev/sda -dbj /etc/nagios-plugins/config/check_smartdb.json
      OK (sda) |sda_Media_Wearout_Indicator=098;16;6 sda_Host_Writes_32MiB=575272 sda_Host_Reads_32MiB=723527
  • 71. /etc/nagios-plugins/config/check_smartdb.json
      ...
      "Intel DC S3700" : {
        "Device" : ["Intel DC S3700 Series SSDs","INTEL SSDSC2BA100G3",
        "ID#" : {
          "5" : "RAW_VALUE", # Re-allocated Sector Count
          ...
          "194" : "RAW_VALUE", # Temperature - Device Internal Te ...
          "232" : "VALUE", # Available Reserved Space
          "233" : "VALUE", # Media Wearout Indicator
          "234" : "VALUE", # Thermal Throttle Status
          "241" : "RAW_VALUE", # Total LBAs Written (32MiB)
          "242" : "RAW_VALUE", # Total LBAs Read (32MiB)
          "1024" : "VALUE" # ATA error count (custom)
        },
        "Threshs" : {
          "5" : ["20","40"],
          ...
          "232" : ["16:","11:"],
          "233" : ["16:","6:"],
          "1024" : ["0","10"]
        },
        "Perfs" : ["194","233","241","242"]
      },
      ...
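The "Threshs" entries above look like threshold ranges in the style of the Nagios Plugin Developer Guidelines: "20" would alert above 20, while "16:" would alert below 16. The following is a minimal Python sketch of that range logic, assuming the plugin follows the guidelines here; the function names `outside_range` and `check` are hypothetical and are not taken from check_smart_attributes.

```python
import math


def outside_range(value, spec):
    """Return True if `value` should alert for a Nagios-style range spec.

    Supported forms (per the Nagios Plugin Developer Guidelines):
    "10" (alert if < 0 or > 10), "10:" (alert if < 10),
    "~:10" (alert if > 10), "10:20" (alert if outside),
    "@10:20" (alert if inside).
    """
    invert = spec.startswith("@")
    if invert:
        spec = spec[1:]
    if ":" in spec:
        low_s, high_s = spec.split(":", 1)
    else:
        low_s, high_s = "0", spec
    low = -math.inf if low_s == "~" else float(low_s or 0)
    high = math.inf if high_s == "" else float(high_s)
    inside = low <= value <= high
    return inside if invert else not inside


def check(value, warn, crit):
    """Map a value and a [warn, crit] pair to a status string (hypothetical helper)."""
    if outside_range(value, crit):
        return "CRITICAL"
    if outside_range(value, warn):
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Media Wearout Indicator (ID 233) with thresholds ["16:", "6:"]
    print(check(98, "16:", "6:"))
    # Re-allocated Sector Count (ID 5) with thresholds ["20", "40"]
    print(check(25, "20", "40"))
```

With this reading, wear-level attributes that count down use the "N:" form, and error counters that count up use the plain "N" form.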
  • 74. /etc/nagios-plugins/config/check_smartdb.json New SSDs & HDDs all the time: updates?
  • 76. OK, cool, but what about RAID controllers?
      ...
      [-d| --device <path to device being checked>]
      Specify the device being monitored. If multiple devices should be checked provide the '-d' option multiple times. E.g. '-d /dev/sda -d /dev/sdb'
      For devices behind LSI RAID controllers specify 'megaraid' and then the device number, e.g. '-d megaraid6'. Use storcli to find out the corresponding device numbers.
      For devices behind Adaptec RAID controllers specify '/dev/sg<X>' where <X> is the number for your device. Use e.g. sg_scan to find the device. You must also use '-O sat' or '-O scsi' according to the device interface. This are extra options only necessary for '/dev/sg<X>' devices.
      ...
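The device naming in this help text maps naturally onto smartctl's own device types (`-d megaraid,N`, `-d sat`, `-d scsi`). The Python sketch below shows one way such a mapping could look. It is an illustration only: how check_smart_attributes builds its smartctl call internally is not shown on the slides, the function name `smartctl_args` is hypothetical, and the `controller_dev` default of `/dev/sda` is an assumption (smartctl needs some device node on the MegaRAID controller).

```python
import re


def smartctl_args(device, interface=None, controller_dev="/dev/sda"):
    """Build a smartctl argument list for a plugin-style device name.

    device:         "/dev/sda", "megaraid6" or "/dev/sg2"
    interface:      "sat" or "scsi", required for /dev/sg<X> devices
    controller_dev: device node used to reach the MegaRAID controller
                    (assumption, adjust to the system at hand)
    """
    megaraid = re.fullmatch(r"megaraid(\d+)", device)
    if megaraid:
        # Disk behind an LSI MegaRAID controller: -d megaraid,N
        return ["smartctl", "-A", "-d", "megaraid," + megaraid.group(1),
                controller_dev]
    if re.fullmatch(r"/dev/sg\d+", device):
        # Disk behind an Adaptec controller: the interface type must be given
        if interface not in ("sat", "scsi"):
            raise ValueError("interface must be 'sat' or 'scsi' for " + device)
        return ["smartctl", "-A", "-d", interface, device]
    # Directly attached disk: let smartctl detect the type itself
    return ["smartctl", "-A", device]


if __name__ == "__main__":
    for dev, iface in (("/dev/sda", None), ("megaraid6", None), ("/dev/sg2", "sat")):
        print(" ".join(smartctl_args(dev, iface)))
```

Returning an argument list rather than a shell string keeps the result safe to pass to `subprocess.run` without shell quoting.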
  • 77. OK, cool, but what about RAID controllers?
      $ /usr/lib/nagios/plugins/check_smart_attributes -d megaraid6 -dbj /etc/nagios-plugins/config/check_smartdb.json
      OK (megaraid6) | megaraid6_Temperature_Internal=26 megaraid6_Media_Wearout_Indicator=100;16;6 megaraid6_Host_Writes_32MiB=70283 megaraid6_Host_Reads_32MiB=1650800
      $ /usr/lib/nagios/plugins/check_smart_attributes -d megaraid7 -dbj /etc/nagios-plugins/config/check_smartdb.json
      Warning (megaraid7) [megaraid7_CRC_Error_Count = Warning]| megaraid7_Temperature_Internal=34 megaraid7_Media_Wearout_Indicator=098;16;6 megaraid7_Host_Writes_32MiB=189904 megaraid7_Host_Reads_32MiB=29658
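Everything after the `|` in the plugin output above is Nagios performance data in the form `label=value;warn;crit`. As a rough illustration of how a monitoring system or a small script might read it, here is a minimal Python sketch; `parse_perfdata` is a hypothetical helper, and it deliberately ignores quoted labels containing spaces as well as the optional min/max fields.

```python
import re

# Leading numeric part of a perfdata value; a trailing unit such as "%" is ignored
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_perfdata(plugin_output):
    """Parse the perfdata part of a plugin output line into a dict.

    Returns {label: {"value": float, "warn": str or None, "crit": str or None}}.
    """
    _, _, perf = plugin_output.partition("|")
    metrics = {}
    for token in perf.split():
        label, sep, rest = token.partition("=")
        if not sep:
            continue  # not a label=value token
        fields = rest.split(";")
        match = NUMBER.match(fields[0])
        if match is None:
            continue  # value is not numeric
        warn = fields[1] if len(fields) > 1 and fields[1] else None
        crit = fields[2] if len(fields) > 2 and fields[2] else None
        metrics[label] = {"value": float(match.group()),
                          "warn": warn, "crit": crit}
    return metrics


if __name__ == "__main__":
    line = ("Warning (megaraid7) [megaraid7_CRC_Error_Count = Warning]| "
            "megaraid7_Temperature_Internal=34 "
            "megaraid7_Media_Wearout_Indicator=098;16;6 "
            "megaraid7_Host_Writes_32MiB=189904 "
            "megaraid7_Host_Reads_32MiB=29658")
    for name, data in parse_perfdata(line).items():
        print(name, data)
```

The warn and crit fields are kept as strings because they are range specifications, not plain numbers.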
  • 79. check_gpu_sensor
      $ /usr/lib/nagios/plugins/check_gpu_sensor -db 0000:83:00.0
      OK - Tesla K20 |ECCL2AggSgl=0;1;2; ECCTexAggSgl=0;1;2; memUtilRate=0 PWRUsage=49.81;150;200; ECCRegAggSgl=0;1;2; SMClock=705 ECCL1AggSgl=0;1;2; GPUTemperature=38;85;100; memClock=2600 usedMemory=0.24;95;99; fanSpeed=30;80;95; graphicsClock=705 GPUUtilRate=0 ECCMemAggSgl=0;1;2;
  • 80. NVIDIA: "The displayed fan speed does not indicate whether the fan is actually spinning." "It is the speed at which the fan algorithm is trying to drive the fan." We recommend: "temperature sensor"
  • 81. Plugins - Future _ Monitoring of firmware versions _ RAID consistency checks _ Temperature of 10 Gbit NICs (see Intel X540 FAQs)
  • 82. So, what now?
  • 83. Relax ... _ all plugins are available at git.thomas-krenn.com _ all plugins comply with the Plugin Developer Guidelines (-h for help) _ "Plugin Entwicklung für Einsteiger" (plugin development for beginners) by Alexander Wirt, today at 14:15
  • 84. Relax, start ... _ create a server list _ configure IPMI securely _ set up the relevant plugins
  • 85. Relax, start and have fun at