Nagios and Mod-Gearman
In a Large-Scale
Environment


 Jason Cook <jcook@verisign.com>
 8/28/2012
A Brief History of Nagios
   at Verisign




Verisign Public                2
Legacy Nagios Setup

   • Whitepaper NSCA configuration
           • Typical 3-Tier setup
                  • Remote System
                  • Distributed Nagios Servers
                  • Central Nagios Servers
           • Architecture in-place for several years
           • Reasonably stable, though high-maintenance
           • Very heterogeneous environment.
                  • Many OS and Nagios versions
   • All notifications sent to an Event Management System
   • Offloaded graphing/trending to a custom solution.



Simplified Passive Architecture Diagram




Challenges with our passive setup

   • Scaling the Nagios server layers
           • Requires changes to all NSCA instances using the servers
           • Load-Balancing solutions mostly require removing freshness
             checks…
   • Freshness checking is a challenge
           • More freshness checking means more Nagios forking.
                  • More Nagios forking is more operational sadness in a large
                    environment.
                  • With Freshness, you end up having an active environment, even
                    if it wasn’t your intention.
           • Freshness errors do not tell the whole story
                  • Where is the problem?
                      • Even if you know where the problem is, it can be difficult to track
                        down what’s causing it. Nagios? Plugin? System busy? NSCA?
                        Network? Many questions, few obvious answers.
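The freshness problem described above can be sketched in a few lines. This is a minimal illustration, not Nagios source code: the `PassiveService` class and its field names are hypothetical, and the only idea taken from the slide is that a passive result older than its threshold forces the server to act on its own.

```python
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PassiveService:
    name: str
    last_result_time: float      # Unix timestamp of the last passive result
    freshness_threshold: float   # seconds a result may age before it is stale


def is_stale(service: PassiveService, now: Optional[float] = None) -> bool:
    """Return True when the newest passive result is older than the threshold."""
    now = time.time() if now is None else now
    return (now - service.last_result_time) > service.freshness_threshold


def stale_services(services: List[PassiveService], now: Optional[float] = None) -> List[str]:
    """Names of the services for which the server would have to run a check itself."""
    return [s.name for s in services if is_stale(s, now)]
```

Every name returned by `stale_services` stands for a check the server has to fork and run, which is how a nominally passive setup drifts toward an active one. The sketch also shows why a freshness error says little about the cause: all it knows is that a timestamp is old.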


Challenges with our passive setup (continued)

   • Lack of centralized scheduling
           • Adjusting schedules can be difficult for those without in-depth
             knowledge of Nagios and how it all works.
           • Inability to have a user run a check immediately without
             having even more in-depth knowledge about Nagios.
   • Lots of Nagios builds for various platforms.
           • Since we were using NSCA, we needed libmcrypt for
             encryption.
                  • libmcrypt is not a standard library on many systems, so it was
                    yet another package to maintain.
   • All of this needed quite a bit of custom code for
     intelligent result queuing/sending so as to gracefully
     handle network outages and minimize send_nsca
     forking (especially on the distributed servers).
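The custom queuing code mentioned above is not shown in the slides. The following is only a sketch of the general idea, assuming results are buffered locally and handed to a sender in batches so that one `send_nsca` process can carry many results. The `ResultQueue` class, its `send` callback, and the batch size are hypothetical; the tab-separated line layout is the one `send_nsca` reads on standard input.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Result = Tuple[str, str, int, str]  # host, service, return code, plugin output


def format_batch(results: List[Result]) -> str:
    """Render results as tab-separated lines, one result per line."""
    return "".join(f"{h}\t{s}\t{rc}\t{out}\n" for h, s, rc, out in results)


class ResultQueue:
    """Buffer check results and hand them to a sender in batches."""

    def __init__(self, send: Callable[[str], bool], batch_size: int = 100):
        self.send = send              # must return True when the batch was delivered
        self.batch_size = batch_size
        self.pending: Deque[Result] = deque()

    def add(self, result: Result) -> None:
        self.pending.append(result)

    def flush(self) -> int:
        """Send queued results batch by batch; stop and keep the rest on failure."""
        sent = 0
        while self.pending:
            size = min(self.batch_size, len(self.pending))
            batch = [self.pending[i] for i in range(size)]
            if not self.send(format_batch(batch)):
                break                 # e.g. network outage: retry on the next flush
            for _ in batch:
                self.pending.popleft()
            sent += len(batch)
        return sent
```

In practice `send` would be a small wrapper that pipes the payload to a single `send_nsca` process and reports whether it exited cleanly. Results are removed from the queue only after a successful send, so an outage delays them instead of losing them.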
A Move to Active
   Monitoring




An alternative arises…

   • Gearman
           • Provides a generic application framework to farm out work to
             other machines or processes that are better suited to do the
             work.
           • Integrates with Nagios via the Mod-Gearman NEB module.
   • NRPE
           • Nagios Remote Plugin Executor
   • Merlin
           • Module for Effortless Redundancy and Loadbalancing In
             Nagios
           • Allows our Nagios instances to share scheduling (and
             therefore check results) between one another.
                  • Great for load sharing and redundancy
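The "farm out work" idea behind Gearman can be pictured with a toy, in-memory model. This sketch does not use the Gearman protocol or any Gearman client library; `ToyJobServer`, `ToyWorker`, and the `"service"` queue name are illustrative assumptions that only show how a job server decouples the process that submits a check from the process that runs it.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Optional


class ToyJobServer:
    """In-memory stand-in for a job server: one FIFO queue per function name."""

    def __init__(self) -> None:
        self.queues: Dict[str, Deque[str]] = defaultdict(deque)

    def submit(self, function: str, payload: str) -> None:
        self.queues[function].append(payload)

    def grab(self, function: str) -> Optional[str]:
        queue = self.queues[function]
        return queue.popleft() if queue else None


class ToyWorker:
    """Pulls jobs for the functions it registered and runs them locally."""

    def __init__(self, name: str, server: ToyJobServer) -> None:
        self.name = name
        self.server = server
        self.handlers: Dict[str, Callable[[str], str]] = {}

    def register(self, function: str, handler: Callable[[str], str]) -> None:
        self.handlers[function] = handler

    def work_once(self) -> Optional[str]:
        """Run at most one queued job and return its labelled result."""
        for function, handler in self.handlers.items():
            payload = self.server.grab(function)
            if payload is not None:
                return f"{self.name}: {handler(payload)}"
        return None
```

The submitter never needs to know which worker picks up a job. That is why adding workers adds capacity without any change on the Nagios side.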


Simplified Active Architecture Diagram




Some details about the setup

   • All components run in VMs
   • Nagios 3.4.1 (with nanosleep)
   • Merlin (1.1.15)
   • Mod-Gearman 1.2.6
   • MK Livestatus (perhaps the greatest NEB module of all
     time)
   • Merlin setup is a simple peer<->peer configuration
   • Mod-Gearman NEB modules are configured to talk to
     multiple Gearman servers (Gearman server preference
     is alternated on each system, so that Gearman server
     failures are easily handled)
   • One Mod-Gearman worker process for each Gearman
     server on each worker VM.
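One way to picture the alternating server preference is to rotate the server list by each system's position, so that the first choice differs from host to host. This is a sketch of that idea only, not the configuration used in the talk; the host names and the `alternate_server_order` helper are hypothetical, and 4730 is simply Gearman's default port.

```python
from typing import Dict, List


def alternate_server_order(hosts: List[str], servers: List[str]) -> Dict[str, List[str]]:
    """Rotate the server list by host index so first choices are spread evenly."""
    orders: Dict[str, List[str]] = {}
    for index, host in enumerate(hosts):
        shift = index % len(servers)
        orders[host] = servers[shift:] + servers[:shift]
    return orders


if __name__ == "__main__":
    orders = alternate_server_order(
        ["nagios1", "nagios2"],                 # hypothetical Nagios hosts
        ["gearman1:4730", "gearman2:4730"],     # hypothetical Gearman servers
    )
    for host, order in orders.items():
        print(host, ",".join(order))
```

Each host still lists every server, so losing one Gearman server leaves a fallback everywhere while normal load stays split between the two.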
VM Configuration & Performance

   • VM Configuration:
           • 4 vCPUs
           • 2 GB RAM
           • Linux 2.6.32
   • Performance Considerations
           • Very CPU Bound
           • RAM usage is very low
   • VM Usage
           • 2 Nagios servers
           • 2 Gearman servers
           • 2 Mod-Gearman workers




Application Configurations

   • Nagios
           • 100000 services @ 5 minute interval
           • sleep_time=0.01
           • host_inter_check_delay_method=n
           • service_inter_check_delay_method=0.01
           • max_concurrent_checks=0
           • 5 Gearman collector threads
   • Gearman
           • 10 I/O Threads
   • Mod-Gearman Workers
           • 1000 worker processes per system
           • 50 per second max spawn rate


Performance Results




Observations

   • These 6 VMs can easily handle 20000 active service
     checks per minute.
           • Additional capacity can be had easily
                  • Add Merlin peers
                  • Add more workers
           • Scales up very well
   • Running renice on critical processes ensures they get
     the priority they need.
   • The environment can be a bit fragile.
           • Less fragile than before, but still has several components
             which all must be working correctly.




Benefits

   • Much less hardware
   • Centralized view and control over all monitoring
   • Opportunity to leverage the Gearman architecture for
     other services
   • Higher confidence in monitoring accuracy
   • More flexibility in scheduling logic.
   • Event handlers become very useful, since there is a
     broader view of the infrastructure via MK Livestatus.
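As an illustration of the broader view an event handler could draw on, the sketch below assembles a Livestatus query for all services in the CRITICAL state. It only builds the request text; the `build_livestatus_query` helper is hypothetical, and the chosen table, columns, and filter are examples, not queries taken from the talk.

```python
from typing import List, Optional


def build_livestatus_query(table: str, columns: List[str],
                           filters: Optional[List[str]] = None) -> str:
    """Assemble a Livestatus (LQL) request; a blank line ends the request."""
    lines = [f"GET {table}", "Columns: " + " ".join(columns)]
    lines += [f"Filter: {f}" for f in (filters or [])]
    lines.append("OutputFormat: json")
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    # State 2 is CRITICAL in Nagios service states.
    print(build_livestatus_query(
        "services", ["host_name", "description", "state"], ["state = 2"]
    ))
```

The query would be written to the Livestatus Unix socket, whose path depends on the installation. An event handler could then decide, for example, to hold back a restart when many hosts are critical at the same time.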




Final Thoughts

   • Tested several methodologies before arriving at the
     Nagios+Gearman conclusion.
           • Multisite
           • DNX
           • NRDP
   • The current design is still a work in progress, but will
     be easier to change and grow (Nagios 4?).
   • Move anything possible off Nagios and onto external
     processes.




Credits

   • Verisign System Administrators for helping me test
   • Gearman (http://www.gearman.org)
   • ConSol Labs (http://labs.consol.de)
           • Thruk
           • Mod-Gearman
   • Mathias Kettner (http://mathias-kettner.de)
           • MK Livestatus
   • op5 (http://www.op5.org)
           • Merlin
   • Nagios (http://nagios.org)
           • Nagios Core
           • NRPE


Thank You




© 2012 VeriSign, Inc. All rights reserved. VERISIGN and other trademarks, service marks, and
designs are registered or unregistered trademarks of VeriSign, Inc. and its subsidiaries in the United
States and in foreign countries. All other trademarks are property of their respective owners.

Nagios Conference 2012 - Jason Cook - Nagios and Mod-Gearman
