SlideShare una empresa de Scribd logo
1 de 37
Descargar para leer sin conexión
Please stop using Nagios
(so it can die peacefully)
Andy Sykes
Devops @ Forward3D
@supersheep
andy@forward3d.com
Do you use Nagios?
Tell me why you picked it.
Go on.
If you don't, why don't you?
Reasons for choosing Nagios

•  stupid simple plugin system
•  billions* of existing plugins
•  years of development behind it
•  you can hire people who know it
"Everybody uses it."**

* may not actually be true
** except me. and maybe you. and that guy at the back, who really likes Zabbix. you know
who you are.
Reasons for choosing Nagios

•  stupid simple plugin system
•  billions* of existing plugins
•  years of development behind it
•  you can hire people who know it
"Everybody uses it."**

* may not actually be true
** except me. and maybe you. and that guy at the back, who really likes Zabbix. you know
who you are.
So why did you pick Nagios?
Because it's the "safe", default choice.
Because we've grown accustomed to the things
that really, really suck about it.
It's a little like we've all got Stockholm
Syndrome.
What Nagios gets right
Incredibly simple plugin model.
Fairly secure (SSL between agents + master).
Very simple conceptually.
Reliable.
Nagios, I hate thee; let me count thy ways
Doesn't scale. At all.
World's second most horrible configuration*.
Horrendous interface**.
Assumes a static infrastructure.
No decent programmatic interfaces***.
Throws away perfdata.
Stupid wire format for clients (NRPE/NSCA).
* the world's most horrible configuration is, obviously, Sendmail.
** even the paid Nagios XI one is ugly as sin and unusable.
*** if I catch you parsing status.dat, I will beat your ass.
Expansion about config
Configuration has to be in two places:
Server has to know what checks to invoke
via NRPE.
Client has to know what checks it will be
asked to invoke with NRPE.
THIS IS MADNESS.
Scaling, or lack of it
No such thing as a Nagios cluster.
More checks = more work = longer before you
know something's happened!
Every check increases your master's load
average.
Okay, yes, there’s mod_gearman
But it’s a hack at best.
No redundancy for the machine that distributes
the checks, so it’s not a real cluster.
API poverty
Can't easily integrate with other systems.
Can't easily write custom dashboards.
Can't get information out again!

Assumes a static infra
Master has to be told about a client before
things can happen.
The bandaids we make
Interface:
Opsview, Icinga, Shinken, others

API:
Parsing status.dat, NDO

Client wire format:
Opsview's NRPE, NRD

Config management:
Puppet types, Chef cookbooks
None of it is good enough.
The take-home point:

"If we keep using Nagios,
we'll never get anything
better."
(Writing monitoring systems is hard, and needs community involvement and
real world adoption. Nagios steals mindshare by being just good enough. It's
the monitoring system we deserve, but not the one we need right now.)
So, smart guy. What do we do?

Steal all the things that are great about Nagios.
(existing plugin investment, simplicity, security, reliability)

Strap them to something more awesome.
(scalable, API-ready, config management friendly, modern!)
THIS DOESN’T MEAN WRITING
YOUR OWN MONITORING SYSTEM
Points for thought:

●  What else are people using?
●  Should we greenfield or lift existing tools?
●  What tools could we go with?
My suggestion:

Like OMD, but better.
Wrap up a series of “best in breed” tools to
make one kickass monitoring tool.
What we need:
Core
Agent
Graphing
Anomaly detection
Alerting
UI
Core:
Holds configuration about hosts / services
Distributed across X masters
Check execution (poke)
Results queue (poke response)
There’s something we can use for this.
Sensu!
Sensu is often described as the “monitoring router”.
{
"checks": {
"chef_client": {
"command": "check-chef-client.rb",
"subscribers": [
"production" ],
"interval": 60,
"handlers": [
"pagerduty",
"irc"
]
}
}
}

Only on the server
Client requires no registration for the server
to know about it
Uses Nagios status return codes
Doesn’t talk to the server - talks to
RabbitMQ
Core:
Holds configuration about hosts / services
Distributed across X masters
Check execution (poke)
Results queue (poke response)
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing
Anomaly detection
Alerting
UI
Graphing is easy now.
If you’re not using Graphite, you should be.
Sensu “metric” checks can pump data to it.
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing - Graphite
Anomaly detection
Alerting
UI
Anomaly detection is hard.
We’ve got all this metric data, but how do we check it?
- Skyline/Oculus (Etsy)
- Grok (very early days)
- ???
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing - Graphite
Anomaly detection - ???
Alerting
UI
Alerting is tricky, but mostly solved.
Flapjack! - flapjack.io
Alerting is not the concern of your monitoring tool.
Push all alerts at Flapjack
- define gateways (PagerDuty, email)
- create relationships between checks and gateways
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing - Graphite
Anomaly detection - ???
Alerting - Flapjack
UI
User interfaces are hard.
What do we need from it?
- What’s broken
- When it broke, when it broke in the past
- Say “OK, I know it’s broken”
- View graphs to see how quickly it broke
- See every check everywhere, and filter the list
The Sensu Dashboard sucks.
No history!
Acknowledgements aren’t easy to do.
No graphing.
Can’t see anything that’s reporting an OK status.
This won’t do.
I’m going to have to write a UI. Sigh.
What we need:
Core
- Sensu-server
Agent
- Sensu-client
Graphing - Graphite
Anomaly detection - ???
Alerting - Flapjack
UI
- ???
In Summary

Nagios sucks.
There are good tools for each concern
of monitoring.
If we can package them together, we
can have something that rocks.
Thank You.

Contact
andy@forward3d.com (@supersheep)

Más contenido relacionado

La actualidad más candente

Law of Demeter & Objective Sense of Style
Law of Demeter & Objective Sense of StyleLaw of Demeter & Objective Sense of Style
Law of Demeter & Objective Sense of StyleVladimir Tsukur
 
Railway Oriented Programming
Railway Oriented ProgrammingRailway Oriented Programming
Railway Oriented ProgrammingScott Wlaschin
 
유니티 + Nodejs를 활용한 멀티플레이어 게임 개발하기
유니티 + Nodejs를 활용한 멀티플레이어 게임 개발하기유니티 + Nodejs를 활용한 멀티플레이어 게임 개발하기
유니티 + Nodejs를 활용한 멀티플레이어 게임 개발하기Kiyoung Moon
 
Sprinkle your Devops platform with product thinking.pdf
Sprinkle your Devops platform with product thinking.pdfSprinkle your Devops platform with product thinking.pdf
Sprinkle your Devops platform with product thinking.pdfJavier Turégano Molina
 
CSS Grid Layout for Topconf, Linz
CSS Grid Layout for Topconf, LinzCSS Grid Layout for Topconf, Linz
CSS Grid Layout for Topconf, LinzRachel Andrew
 
이클립스 플랫폼
이클립스 플랫폼이클립스 플랫폼
이클립스 플랫폼Kenu, GwangNam Heo
 
Getting started with Docker
Getting started with DockerGetting started with Docker
Getting started with DockerRavindu Fernando
 
쉽게 쓰여진 Django
쉽게 쓰여진 Django쉽게 쓰여진 Django
쉽게 쓰여진 DjangoTaehoon Kim
 
Container based CI/CD on GitHub Actions
Container based CI/CD on GitHub ActionsContainer based CI/CD on GitHub Actions
Container based CI/CD on GitHub ActionsCasey Lee
 
Introduce to Rust-A Powerful System Language
Introduce to Rust-A Powerful System LanguageIntroduce to Rust-A Powerful System Language
Introduce to Rust-A Powerful System Language安齊 劉
 
Monitoramento de serviços com Zabbix + Grafana + Python - Marcelo Santoto - D...
Monitoramento de serviços com Zabbix + Grafana + Python - Marcelo Santoto - D...Monitoramento de serviços com Zabbix + Grafana + Python - Marcelo Santoto - D...
Monitoramento de serviços com Zabbix + Grafana + Python - Marcelo Santoto - D...Felipe Blini
 
The lazy programmer's guide to writing thousands of tests
The lazy programmer's guide to writing thousands of testsThe lazy programmer's guide to writing thousands of tests
The lazy programmer's guide to writing thousands of testsScott Wlaschin
 
The Power of Composition (NDC Oslo 2020)
The Power of Composition (NDC Oslo 2020)The Power of Composition (NDC Oslo 2020)
The Power of Composition (NDC Oslo 2020)Scott Wlaschin
 
Rust: Unlocking Systems Programming
Rust: Unlocking Systems ProgrammingRust: Unlocking Systems Programming
Rust: Unlocking Systems ProgrammingC4Media
 
Functional Patterns with Java8 @Bucharest Java User Group
Functional Patterns with Java8 @Bucharest Java User GroupFunctional Patterns with Java8 @Bucharest Java User Group
Functional Patterns with Java8 @Bucharest Java User GroupVictor Rentea
 
NDC14 범용 게임 서버 프레임워크 디자인 및 테크닉
NDC14 범용 게임 서버 프레임워크 디자인 및 테크닉NDC14 범용 게임 서버 프레임워크 디자인 및 테크닉
NDC14 범용 게임 서버 프레임워크 디자인 및 테크닉iFunFactory Inc.
 
[AWS Community Day 2021] AWS와 함께하는 무중단 배포 파이프라인 개선기
[AWS Community Day 2021] AWS와 함께하는 무중단 배포 파이프라인 개선기[AWS Community Day 2021] AWS와 함께하는 무중단 배포 파이프라인 개선기
[AWS Community Day 2021] AWS와 함께하는 무중단 배포 파이프라인 개선기SungChanHwang
 

La actualidad más candente (20)

Law of Demeter & Objective Sense of Style
Law of Demeter & Objective Sense of StyleLaw of Demeter & Objective Sense of Style
Law of Demeter & Objective Sense of Style
 
Railway Oriented Programming
Railway Oriented ProgrammingRailway Oriented Programming
Railway Oriented Programming
 
유니티 + Nodejs를 활용한 멀티플레이어 게임 개발하기
유니티 + Nodejs를 활용한 멀티플레이어 게임 개발하기유니티 + Nodejs를 활용한 멀티플레이어 게임 개발하기
유니티 + Nodejs를 활용한 멀티플레이어 게임 개발하기
 
Sprinkle your Devops platform with product thinking.pdf
Sprinkle your Devops platform with product thinking.pdfSprinkle your Devops platform with product thinking.pdf
Sprinkle your Devops platform with product thinking.pdf
 
CSS Grid Layout for Topconf, Linz
CSS Grid Layout for Topconf, LinzCSS Grid Layout for Topconf, Linz
CSS Grid Layout for Topconf, Linz
 
Intro to Terraform
Intro to TerraformIntro to Terraform
Intro to Terraform
 
이클립스 플랫폼
이클립스 플랫폼이클립스 플랫폼
이클립스 플랫폼
 
Getting started with Docker
Getting started with DockerGetting started with Docker
Getting started with Docker
 
쉽게 쓰여진 Django
쉽게 쓰여진 Django쉽게 쓰여진 Django
쉽게 쓰여진 Django
 
Container based CI/CD on GitHub Actions
Container based CI/CD on GitHub ActionsContainer based CI/CD on GitHub Actions
Container based CI/CD on GitHub Actions
 
Introduce to Rust-A Powerful System Language
Introduce to Rust-A Powerful System LanguageIntroduce to Rust-A Powerful System Language
Introduce to Rust-A Powerful System Language
 
RxSwift to Combine
RxSwift to CombineRxSwift to Combine
RxSwift to Combine
 
Monitoramento de serviços com Zabbix + Grafana + Python - Marcelo Santoto - D...
Monitoramento de serviços com Zabbix + Grafana + Python - Marcelo Santoto - D...Monitoramento de serviços com Zabbix + Grafana + Python - Marcelo Santoto - D...
Monitoramento de serviços com Zabbix + Grafana + Python - Marcelo Santoto - D...
 
Jenkins CI presentation
Jenkins CI presentationJenkins CI presentation
Jenkins CI presentation
 
The lazy programmer's guide to writing thousands of tests
The lazy programmer's guide to writing thousands of testsThe lazy programmer's guide to writing thousands of tests
The lazy programmer's guide to writing thousands of tests
 
The Power of Composition (NDC Oslo 2020)
The Power of Composition (NDC Oslo 2020)The Power of Composition (NDC Oslo 2020)
The Power of Composition (NDC Oslo 2020)
 
Rust: Unlocking Systems Programming
Rust: Unlocking Systems ProgrammingRust: Unlocking Systems Programming
Rust: Unlocking Systems Programming
 
Functional Patterns with Java8 @Bucharest Java User Group
Functional Patterns with Java8 @Bucharest Java User GroupFunctional Patterns with Java8 @Bucharest Java User Group
Functional Patterns with Java8 @Bucharest Java User Group
 
NDC14 범용 게임 서버 프레임워크 디자인 및 테크닉
NDC14 범용 게임 서버 프레임워크 디자인 및 테크닉NDC14 범용 게임 서버 프레임워크 디자인 및 테크닉
NDC14 범용 게임 서버 프레임워크 디자인 및 테크닉
 
[AWS Community Day 2021] AWS와 함께하는 무중단 배포 파이프라인 개선기
[AWS Community Day 2021] AWS와 함께하는 무중단 배포 파이프라인 개선기[AWS Community Day 2021] AWS와 함께하는 무중단 배포 파이프라인 개선기
[AWS Community Day 2021] AWS와 함께하는 무중단 배포 파이프라인 개선기
 

Destacado

Zabbix 3.0 and beyond - FISL 2015
Zabbix 3.0 and beyond - FISL 2015Zabbix 3.0 and beyond - FISL 2015
Zabbix 3.0 and beyond - FISL 2015Zabbix
 
Grafana and MySQL - Benefits and Challenges
Grafana and MySQL - Benefits and ChallengesGrafana and MySQL - Benefits and Challenges
Grafana and MySQL - Benefits and ChallengesPhilip Wernersbach
 
Andrew Nelson - Zabbix and SNMP on Linux
Andrew Nelson - Zabbix and SNMP on LinuxAndrew Nelson - Zabbix and SNMP on Linux
Andrew Nelson - Zabbix and SNMP on LinuxZabbix
 
Icinga Camp Barcelona - Current State of Icinga
Icinga Camp Barcelona - Current State of IcingaIcinga Camp Barcelona - Current State of Icinga
Icinga Camp Barcelona - Current State of IcingaIcinga
 
Alexei Vladishev - Opening Speech
Alexei Vladishev - Opening SpeechAlexei Vladishev - Opening Speech
Alexei Vladishev - Opening SpeechZabbix
 
Fall in Love with Graphs and Metrics using Grafana
Fall in Love with Graphs and Metrics using GrafanaFall in Love with Graphs and Metrics using Grafana
Fall in Love with Graphs and Metrics using Grafanatorkelo
 

Destacado (7)

Zabbix 3.0 and beyond - FISL 2015
Zabbix 3.0 and beyond - FISL 2015Zabbix 3.0 and beyond - FISL 2015
Zabbix 3.0 and beyond - FISL 2015
 
Grafana and MySQL - Benefits and Challenges
Grafana and MySQL - Benefits and ChallengesGrafana and MySQL - Benefits and Challenges
Grafana and MySQL - Benefits and Challenges
 
Andrew Nelson - Zabbix and SNMP on Linux
Andrew Nelson - Zabbix and SNMP on LinuxAndrew Nelson - Zabbix and SNMP on Linux
Andrew Nelson - Zabbix and SNMP on Linux
 
Icinga Camp Barcelona - Current State of Icinga
Icinga Camp Barcelona - Current State of IcingaIcinga Camp Barcelona - Current State of Icinga
Icinga Camp Barcelona - Current State of Icinga
 
Monitoring the #DevOps way
Monitoring the #DevOps wayMonitoring the #DevOps way
Monitoring the #DevOps way
 
Alexei Vladishev - Opening Speech
Alexei Vladishev - Opening SpeechAlexei Vladishev - Opening Speech
Alexei Vladishev - Opening Speech
 
Fall in Love with Graphs and Metrics using Grafana
Fall in Love with Graphs and Metrics using GrafanaFall in Love with Graphs and Metrics using Grafana
Fall in Love with Graphs and Metrics using Grafana
 

Similar a Stop using Nagios (so it can die peacefully)

How Yelp Uses Sensu to Monitor Services in a SOA World
How Yelp Uses Sensu to Monitor Services in a SOA WorldHow Yelp Uses Sensu to Monitor Services in a SOA World
How Yelp Uses Sensu to Monitor Services in a SOA WorldKyle Anderson
 
Monitoring with sensu
Monitoring with sensuMonitoring with sensu
Monitoring with sensumiquelruizm
 
Automating Monitoring with Puppet
Automating Monitoring with PuppetAutomating Monitoring with Puppet
Automating Monitoring with PuppetChristian Mague
 
“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...
“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...
“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...Puppet
 
Django: Beyond Basics
Django: Beyond BasicsDjango: Beyond Basics
Django: Beyond Basicsarunvr
 
Sensu @ Yelp!: A Guided Tour
Sensu @ Yelp!: A Guided TourSensu @ Yelp!: A Guided Tour
Sensu @ Yelp!: A Guided TourKyle Anderson
 
Making operations visible - devopsdays tokyo 2013
Making operations visible  - devopsdays tokyo 2013Making operations visible  - devopsdays tokyo 2013
Making operations visible - devopsdays tokyo 2013Nick Galbreath
 
Making operations visible - Nick Gallbreath
Making operations visible - Nick GallbreathMaking operations visible - Nick Gallbreath
Making operations visible - Nick GallbreathDevopsdays
 
Move out from AppEngine, and Python PaaS alternatives
Move out from AppEngine, and Python PaaS alternativesMove out from AppEngine, and Python PaaS alternatives
Move out from AppEngine, and Python PaaS alternativestzang ms
 
Advanced googling
Advanced googlingAdvanced googling
Advanced googlingsonuagain
 
OSMC 2012 | Shinken by Jean Gabès
OSMC 2012 | Shinken by Jean GabèsOSMC 2012 | Shinken by Jean Gabès
OSMC 2012 | Shinken by Jean GabèsNETWAYS
 
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl OpenNebula Project
 
Monitoring of OpenNebula installations
Monitoring of OpenNebula installationsMonitoring of OpenNebula installations
Monitoring of OpenNebula installationsNETWAYS
 
Abusing bleeding edge web standards for appsec glory
Abusing bleeding edge web standards for appsec gloryAbusing bleeding edge web standards for appsec glory
Abusing bleeding edge web standards for appsec gloryPriyanka Aash
 
Von JavaEE auf Microservice in 6 Monaten - The Good, the Bad, and the wtfs...
Von JavaEE auf Microservice in 6 Monaten - The Good, the Bad, and the wtfs...Von JavaEE auf Microservice in 6 Monaten - The Good, the Bad, and the wtfs...
Von JavaEE auf Microservice in 6 Monaten - The Good, the Bad, and the wtfs...André Goliath
 
Hacklu2011 tricaud
Hacklu2011 tricaudHacklu2011 tricaud
Hacklu2011 tricaudstricaud
 
Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Skynet project: Monitor, analyze, scale, and maintain a system in the CloudSkynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Skynet project: Monitor, analyze, scale, and maintain a system in the CloudSylvain Kalache
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Demi Ben-Ari
 

Similar a Stop using Nagios (so it can die peacefully) (20)

How Yelp Uses Sensu to Monitor Services in a SOA World
How Yelp Uses Sensu to Monitor Services in a SOA WorldHow Yelp Uses Sensu to Monitor Services in a SOA World
How Yelp Uses Sensu to Monitor Services in a SOA World
 
Monitoring with sensu
Monitoring with sensuMonitoring with sensu
Monitoring with sensu
 
Automating Monitoring with Puppet
Automating Monitoring with PuppetAutomating Monitoring with Puppet
Automating Monitoring with Puppet
 
“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...
“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...
“Sensu and Sensibility” - The Story of a Journey From #monitoringsucks to #mo...
 
Django: Beyond Basics
Django: Beyond BasicsDjango: Beyond Basics
Django: Beyond Basics
 
Sensu @ Yelp!: A Guided Tour
Sensu @ Yelp!: A Guided TourSensu @ Yelp!: A Guided Tour
Sensu @ Yelp!: A Guided Tour
 
Making operations visible - devopsdays tokyo 2013
Making operations visible  - devopsdays tokyo 2013Making operations visible  - devopsdays tokyo 2013
Making operations visible - devopsdays tokyo 2013
 
Making operations visible - Nick Gallbreath
Making operations visible - Nick GallbreathMaking operations visible - Nick Gallbreath
Making operations visible - Nick Gallbreath
 
Move out from AppEngine, and Python PaaS alternatives
Move out from AppEngine, and Python PaaS alternativesMove out from AppEngine, and Python PaaS alternatives
Move out from AppEngine, and Python PaaS alternatives
 
Advanced googling
Advanced googlingAdvanced googling
Advanced googling
 
Google Hacking
Google HackingGoogle Hacking
Google Hacking
 
OSMC 2012 | Shinken by Jean Gabès
OSMC 2012 | Shinken by Jean GabèsOSMC 2012 | Shinken by Jean Gabès
OSMC 2012 | Shinken by Jean Gabès
 
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
 
Monitoring of OpenNebula installations
Monitoring of OpenNebula installationsMonitoring of OpenNebula installations
Monitoring of OpenNebula installations
 
Abusing bleeding edge web standards for appsec glory
Abusing bleeding edge web standards for appsec gloryAbusing bleeding edge web standards for appsec glory
Abusing bleeding edge web standards for appsec glory
 
Django Girls Tutorial
Django Girls TutorialDjango Girls Tutorial
Django Girls Tutorial
 
Von JavaEE auf Microservice in 6 Monaten - The Good, the Bad, and the wtfs...
Von JavaEE auf Microservice in 6 Monaten - The Good, the Bad, and the wtfs...Von JavaEE auf Microservice in 6 Monaten - The Good, the Bad, and the wtfs...
Von JavaEE auf Microservice in 6 Monaten - The Good, the Bad, and the wtfs...
 
Hacklu2011 tricaud
Hacklu2011 tricaudHacklu2011 tricaud
Hacklu2011 tricaud
 
Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Skynet project: Monitor, analyze, scale, and maintain a system in the CloudSkynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
 

Último

Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 

Último (20)

Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 

Stop using Nagios (so it can die peacefully)

  • 1. Please stop using Nagios (so it can die peacefully) Andy Sykes Devops @ Forward3D @supersheep andy@forward3d.com
  • 2. Do you use Nagios? Tell me why you picked it. Go on. If you don't, why don't you?
  • 3. Reasons for choosing Nagios •  stupid simple plugin system •  billions* of existing plugins •  years of development behind it •  you can hire people who know it "Everybody uses it."** * may not actually be true ** except me. and maybe you. and that guy at the back, who really likes Zabbix. you know who you are.
  • 4. Reasons for choosing Nagios •  stupid simple plugin system •  billions* of existing plugins •  years of development behind it •  you can hire people who know it "Everybody uses it."** * may not actually be true ** except me. and maybe you. and that guy at the back, who really likes Zabbix. you know who you are.
  • 5. So why did you pick Nagios? Because it's the "safe", default choice. Because we've grown accustomed to the things that really, really suck about it. It's a little like we've all got Stockholm Syndrome.
  • 6. What Nagios gets right Incredibly simple plugin model. Fairly secure (SSL between agents + master). Very simple conceptually. Reliable.
  • 7. Nagios, I hate thee; let me count thy ways Doesn't scale. At all. World's second most horrible configuration*. Horrendous interface**. Assumes a static infrastructure. No decent programmatic interfaces***. Throws away perfdata. Stupid wire format for clients (NRPE/NSCA). * the world's most horrible configuration is, obviously, Sendmail. ** even the paid Nagios XI one is ugly as sin and unusable. *** if I catch you parsing status.dat, I will beat your ass.
  • 8. Expansion about config Configuration has to be in two places: Server has to know what checks to invoke via NRPE. Client has to know what checks it will be asked to invoke with NRPE. THIS IS MADNESS.
  • 9. Scaling, or lack of it No such thing as a Nagios cluster. More checks = more work = longer before you know something's happened! Every check increases your master's load average.
  • 10. Okay, yes, there’s mod_gearman But it’s a hack at best. No redundancy for the machine that distributes the checks, so it’s not a real cluster.
  • 11. API poverty Can't easily integrate with other systems. Can't easily write custom dashboards. Can't get information out again! Assumes a static infra Master has to be told about a client before things can happen.
  • 12. The bandaids we make Interface: Opsview, Icinga, Shinken, others API: Parsing status.dat, NDO Client wire format: Opsview's NRPE, NRD Config management: Puppet types, Chef cookbooks None of it is good enough.
  • 13. The take-home point: "If we keep using Nagios, we'll never get anything better." (Writing monitoring systems is hard, and needs community involvement and real world adoption. Nagios steals mindshare by being just good enough. It's the monitoring system we deserve, but not the one we need right now.)
  • 14. So, smart guy. What do we do? Steal all the things that are great about Nagios. (existing plugin investment, simplicity, security, reliability) Strap them to something more awesome. (scalable, API-ready, config management friendly, modern!)
  • 15. THIS DOESN’T MEAN WRITING YOUR OWN MONITORING SYSTEM
  • 16. Points for thought: ●  What else are people using? ●  Should we greenfield or lift existing tools? ●  What tools could we go with?
  • 17. My suggestion: Like OMD, but better. Wrap up a series of “best in breed” tools to make one kickass monitoring tool.
  • 19. Core: Holds configuration about hosts / services Distributed across X masters Check execution (poke) Results queue (poke response)
  • 20. There’s something we can use for this. Sensu! Sensu is often described as the “monitoring router”.
  • 21.
  • 22. { "checks": { "chef_client": { "command": "check-chef-client.rb", "subscribers": [ "production" ], "interval": 60, "handlers": [ "pagerduty", "irc" ] } } } Only on the server
  • 23. Client requires no registration for the server to know about it Uses Nagios status return codes Doesn’t talk to the server - talks to RabbitMQ
  • 24. Core: Holds configuration about hosts / services Distributed across X masters Check execution (poke) Results queue (poke response)
  • 25. What we need: Core - Sensu-server Agent - Sensu-client Graphing Anomaly detection Alerting UI
  • 26. Graphing is easy now. If you’re not using Graphite, you should be. Sensu “metric” checks can pump data to it.
  • 27. What we need: Core - Sensu-server Agent - Sensu-client Graphing - Graphite Anomaly detection Alerting UI
  • 28. Anomaly detection is hard. We’ve got all this metric data, but how do we check it? - Skyline/Oculus (Etsy) - Grok (very early days) - ???
  • 29. What we need: Core - Sensu-server Agent - Sensu-client Graphing - Graphite Anomaly detection - ??? Alerting UI
  • 30. Alerting is tricky, but mostly solved. Flapjack! - flapjack.io Alerting is not the concern of your monitoring tool. Push all alerts at Flapjack - define gateways (PagerDuty, email) - create relationships between checks and gateways
  • 31. What we need: Core - Sensu-server Agent - Sensu-client Graphing - Graphite Anomaly detection - ??? Alerting - Flapjack UI
  • 32. User interfaces are hard. What do we need from it? - What’s broken - When it broke, when it broke in the past - Say “OK, I know it’s broken” - View graphs to see how quickly it broke - See every check everywhere, and filter the list
  • 33. The Sensu Dashboard sucks. No history! Acknowledgements aren’t easy to do. No graphing. Can’t see anything that’s reporting an OK status. This won’t do.
  • 34. I’m going to have to write a UI. Sigh.
  • 35. What we need: Core - Sensu-server Agent - Sensu-client Graphing - Graphite Anomaly detection - ??? Alerting - Flapjack UI - ???
  • 36. In Summary Nagios sucks. There are good tools for each concern of monitoring. If we can package them together, we can have something that rocks.