Beyond Nagios


This talk discusses moving beyond the limitations of Nagios for infrastructure monitoring. Nagios is an industry standard with a robust, mature code base, but its configuration can be daunting and it is not human-friendly. The talk describes common alerting pathologies: being overwhelmed by alerts, the "spiral of death" in which missed alerts lead to more checks and more checks lead to ignored alerts, and the "trough of despair" in which adding checks to improve coverage at scale makes things worse. It recommends a way out: measure alert data, look for patterns in alerts, put alerts in context visually, and focus on business impact.


Beyond Nagios

NYC DevOps 2011/07/21
Alexis Lê-Quôc - alq@datadoghq.com

What I’m Going To Talk About

• Super-quick Nagios summary

• Monitoring/Alerting Pathologies

• How to fix it

What Is Nagios

• “Industry Standard in IT Infrastructure Monitoring”

• For once it’s true...

• Scheduler & Notification server

(+) Robust, Mature code-base

(-) Configuration can be daunting

(-) Not human-friendly

[Cycle diagram with three steps: "Receive alerts", "Process alerts & fix things", "Add more checks"]

THE HAPPY START

[Cycle diagram with three steps: "Missed alerts", "Ignore alerts", "Add more checks"]

THE SPIRAL OF DEATH

[Chart. Axes: "Quality of life", "# of alerts". Points: "Few checks, few alerts"; "More checks, too many alerts"]

FIGHT OR FLIGHT

[Chart. Axes: "Effective Coverage", "Scale". Labels: "Checks n^2", "Fault-tolerant", "Less urgency", "Complexity". Points: "Few checks, few alerts, every host counts"; "More checks, too many alerts, every host still counts"]

THE TROUGH OF DESPAIR

[Chart. Axes: "Effective Coverage", "Scale"]

IF ONLY I ADDED MORE CHECKS...

Way Out
‣Breathe!
‣Measure
‣Look for Patterns
‣Put Alerts in Context
‣Focus on the Business

Turn Nagios logs into structured data

Analyze

day                 | success_pct | warning_pct | error_pct | events
--------------------+-------------+-------------+-----------+--------
2011-07-12 00:00:00 |          89 |           0 |         2 |   9628
2011-07-13 00:00:00 |          90 |           0 |         2 |   9210
2011-07-14 00:00:00 |          90 |           0 |         2 |   9735
2011-07-15 00:00:00 |          89 |           0 |         2 |   9531

MEASURE
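
A minimal sketch of the "turn Nagios logs into structured data" step, assuming log lines in the standard `[epoch] SERVICE ALERT: host;service;STATE;TYPE;attempt;output` shape. The talk does not say which log lines or tooling were used, so the function names, the sample log, and the floored integer percentages are illustrative assumptions.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime, timezone

# Assumed input shape: [epoch] SERVICE ALERT: host;service;STATE;TYPE;attempt;output
ALERT_RE = re.compile(
    r"^\[(?P<ts>\d+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>[^;]+);(?P<attempt>\d+);(?P<output>.*)$"
)


def parse_line(line):
    """Return a dict for a SERVICE ALERT line, or None for any other line."""
    m = ALERT_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.fromtimestamp(int(rec["ts"]), tz=timezone.utc)
    rec["attempt"] = int(rec["attempt"])
    return rec


def daily_breakdown(lines):
    """Aggregate parsed alerts into one row per UTC day."""
    per_day = defaultdict(Counter)
    for line in lines:
        rec = parse_line(line)
        if rec is not None:
            per_day[rec["ts"].date()][rec["state"]] += 1
    rows = []
    for day in sorted(per_day):
        counts = per_day[day]
        total = sum(counts.values())
        rows.append({
            "day": day.isoformat(),
            # Integer percentages, floored (an assumption about rounding).
            "success_pct": 100 * counts["OK"] // total,
            "warning_pct": 100 * counts["WARNING"] // total,
            "error_pct": 100 * counts["CRITICAL"] // total,
            "events": total,
        })
    return rows


# Hypothetical sample log, not data from the talk.
SAMPLE_LOG = [
    "[1310428800] LOG ROTATION: DAILY",
    "[1310428860] SERVICE ALERT: web1;HTTP;CRITICAL;HARD;3;Connection refused",
    "[1310429160] SERVICE ALERT: web1;HTTP;OK;HARD;1;HTTP OK",
    "[1310432400] SERVICE ALERT: db1;Disk;OK;HARD;1;DISK OK",
    "[1310436000] SERVICE ALERT: db1;Load;OK;HARD;1;OK - load average: 0.10",
    "[1310515260] SERVICE ALERT: db1;Disk;WARNING;SOFT;1;DISK WARNING - 12% free",
    "[1310518800] SERVICE ALERT: db1;Disk;OK;SOFT;2;DISK OK",
]

if __name__ == "__main__":
    for row in daily_breakdown(SAMPLE_LOG):
        print(row)
```

Once the events are plain records like these, the same rows can be loaded into a database or a dataframe and grouped by host, service, or hour instead of by day.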

VISUALIZATION MATTERS

PUT ALERTS IN CONTEXT
https://app.datad0g.com/dash/dash/1000#/date_range/1310682467000.0-1310684267000.0

Ultimate (hard) question
‣Does this alert impact the business?
‣If so by how much?
‣Assumes that you track business metrics...
‣And they can be accessed programmatically

FOCUS ON THE BUSINESS
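
One way to approach "does this alert impact the business, and by how much" is to compare a business metric during the alert with the same metric just before it. The sketch below assumes the metric is available as `(timestamp, value)` pairs, for example orders per minute. The function name, the choice of an equal-length baseline window, and the sample numbers are all hypothetical.

```python
from statistics import mean


def business_impact(metric, alert_start, alert_end):
    """Relative change of a business metric during an alert window.

    metric: iterable of (epoch_seconds, value) pairs.
    The baseline is the window of equal length immediately before the alert
    (an assumption; a same-time-last-week baseline may suit seasonal data).
    Returns e.g. -0.2 for a 20% drop, or None if it cannot be computed.
    """
    points = list(metric)
    duration = alert_end - alert_start
    baseline = [v for t, v in points if alert_start - duration <= t < alert_start]
    during = [v for t, v in points if alert_start <= t < alert_end]
    if not baseline or not during:
        return None
    base = mean(baseline)
    if base == 0:
        return None
    return (mean(during) - base) / base


# Hypothetical orders-per-minute series: 100 before the alert, 80 during it.
SAMPLE_METRIC = [(t, 100 if t < 300 else 80) for t in range(0, 600, 60)]

if __name__ == "__main__":
    print(business_impact(SAMPLE_METRIC, alert_start=300, alert_end=600))
```

An alert whose window shows no measurable change in the business metric is a candidate for lower urgency. This only works if the metric really is tracked and can be fetched programmatically, which is the assumption the slide calls out.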

What applies to Nagios...
Applies to other sources too

etc...

6. “OVERWHELMING”

7. A “NORMAL” HOUR

8. THE “OTHER” NAGIOS UI

14. Reset!

18. In Time Flapping LOOK FOR PATTERNS
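
The two patterns named on this slide, alerts clustering "in time" and services "flapping" between states, can both be read off structured alert events. The sketch below assumes events are `(epoch_seconds, host, service, state)` tuples. It counts state transitions per service, which is a simplification and not Nagios's own flap-detection algorithm. The threshold of 4 transitions and the sample events are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone


def alerts_by_hour(events):
    """Count events per UTC hour of day to expose time-of-day clustering."""
    return Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).hour for ts, _, _, _ in events
    )


def count_transitions(events):
    """Number of state changes per (host, service), in timestamp order."""
    history = defaultdict(list)
    for ts, host, service, state in sorted(events):
        history[(host, service)].append(state)
    return {
        key: sum(1 for a, b in zip(states, states[1:]) if a != b)
        for key, states in history.items()
    }


def flapping(events, min_transitions=4):
    """Services whose state changed at least min_transitions times."""
    counts = count_transitions(events)
    return sorted(key for key, n in counts.items() if n >= min_transitions)


# Hypothetical events: web1/HTTP alternates states, db1/Disk recovers once.
T0 = 1310428800  # 2011-07-12 00:00:00 UTC
SAMPLE_EVENTS = [
    (T0 + 0, "web1", "HTTP", "CRITICAL"),
    (T0 + 60, "web1", "HTTP", "OK"),
    (T0 + 120, "web1", "HTTP", "CRITICAL"),
    (T0 + 180, "web1", "HTTP", "OK"),
    (T0 + 240, "web1", "HTTP", "CRITICAL"),
    (T0 + 10800, "db1", "Disk", "WARNING"),
    (T0 + 10860, "db1", "Disk", "WARNING"),
    (T0 + 11100, "db1", "Disk", "OK"),
]

if __name__ == "__main__":
    print(alerts_by_hour(SAMPLE_EVENTS))
    print(flapping(SAMPLE_EVENTS))
```

A flapping service generates many notifications for what is usually a single underlying problem, so it is a natural first target when trimming alert volume.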

22. Thanks http://datadoghq.com
