SlideShare una empresa de Scribd logo
1 de 28
Descargar para leer sin conexión
Optimizing your Monitoring and Trending tools for the Cloud
Nagios World Conference 2012

Nicolas Brousse
Lead Operations Engineer
September 28th 2012
• 
• 
• 
• 
• 
• 
• 
• 
• 

About TubeMogul
What are some of our challenges?
Our environment
Amazon Cloud Environment
Automated Monitoring
Efficient on-call rotation
Efficient monitoring
What’s next?
Q&A
About TubeMogul
•  Founded in 2006
•  Formerly a video distribution and analytics
platform
•  TubeMogul is a Brand-Focused Video Marketing
Company
–  Build for Branding
–  Integrate real-time media buying, ad serving,
targeting, optimization and brand measurement

TubeMogul simplifies the delivery of video ads
and maximizes the impact of every dollar spent by
brand marketers
http://www.tubemogul.com
What are some of our challenges?
•  Monitoring between 700 to 1000 servers
•  Servers spread across 6 different locations
–  4 Amazon EC2 Regions (our public cloud provider)
–  1 Hosted (Liquidweb) & 1 VPS (Linode)

•  Little monitoring resources
–  Collecting over 115,000 metrics
–  Monitoring over 20,000 services with Nagios

•  Multiple billions of HTTP requests a day
–  Most of it must be served in less than 100ms
–  Lost of traffic could mean lost of business opportunity
–  Or worst, over-spending…
Our environment
•  Over 80 different server profiles
•  Our stack:
–  Java (Embedded Jetty, Tomcat)
–  PHP, RoR
–  Hadoop: HDFS, M/R, Hbase, Hive
–  Couchbase
–  MySQL

•  Monitoring: Nagios, NSCA
•  Graphing: Ganglia, sFlow, Graphite
•  Configuration Management: Puppet
Amazon Cloud Environment
Amazon Cloud Environment
•  We use EC2, SDB, SQS, EMR, S3, etc.
•  We don’t use ELB
•  We heavily use EC2 Tags
ec2-describe-instances -F tag:hostname=dev-build01
Automated Monitoring
Automated Monitoring
Configuring Ganglia using Puppet templates
globals {
…
override_hostname = <%= scope.lookupvar('hostname') %>
…
}
udp_send_channel {
host = <%= scope.lookupvar('ec2_tag_nagios_host') %>
port = 8649
ttl = 1
}
sflow {
udp_port = 6343
accept_jvm_metrics = yes
multiple_jvm_instances = yes
}
Automated Monitoring
Or configuring Host sFlow using Puppet
templates
sflow{
DNSSD = off
polling = 20
sampling = 512
collector{
ip = <%= ec2_tag_nagios_host %>
udpport = 6343
}
}
Automated Monitoring
•  Puppet configure our monitoring instances
–  We use Nagios regex : use_regexp_matching=1
–  But we don’t use true regex : use_true_regexp_matching=0
–  We use NSCA with Upstart

–  We don’t use the perfdata
–  We use pre-cached objects
–  We includes our configurations from 3 directories
•  objects => templates, contacts, commands, event_handlers
•  servers => contain a configuration file for each server
•  clusters => contain a configuration file for each cluster
Automated Monitoring
Process of event when starting a new host and add it to our monitoring:
1. 

We start a new instance using Cerveza and Cloud-init

2. 

Puppet configure Gmond or Host sFlow on the instance

3. 

Our monitoring server running Gmond and Gmetad get data from
the new instance

4. 

A Nagios check run every minute and check for new hosts
• 
• 
• 

5. 

Look for new hosts using EC2 API
Look for EC2 tag “hostname” to confirm it’s a legit host, not a zombie / fail start
Look for EC2 tag “nagios_host” to see if the host belong to this monitoring instance

If a new host is found:
• 
• 

We build a config for the host based on a template file and doing some string replace
Once all config have been generated, we rebuild pre-cache objects and reload
Nagios

6. 

If we find “Zombie” host, we generate a Warning alert

7. 

If the config is corrupt, we send a Critical alert
Automated Monitoring
Automated Monitoring
Efficient on-call rotation
•  Follow the sun
–  Some of our team is in Ukraine, no more Tier 1 night
on-call for us

•  Nagios timeperiod and escalation are a pain to
maintain
–  Nagios notification plugged to Google Calendar
•  Using our own notification script for email and paging
•  Google Clendar make it easy for each team to manage their
own on-call calendar
•  Support for multiple Tier and complex schedules
•  Caching Google Calendar info locally every hour

–  Simpler definitions and rules in Nagios contacts
–  Notify only people on-call, unless they asked for “off
call” emails
Efficient on-call rotation
Using Google Calendar…
Efficient on-call rotation
• 
• 
• 
• 

Simple contact definitions
Google Calendar info
Tier Filter (Regex)
Tier Interval (time to wait before escalating
alert since last tier)
•  Off call email
Efficient on-call rotation
Efficient on-call rotation
On-call contact fetched from Google Calendar at the bottom of the alert
makes our life easier!
Efficient monitoring
We disable most notification and only care of a
cluster status
Efficient monitoring
Most of our checks are based on Ganglia RRD
files
Efficient monitoring
It become really easy to monitor any metrics returned
by Ganglia
Efficient monitoring

We can check cluster status by hosts/services but also
per returned messages !
Efficient monitoring
Efficient monitoring
Using Graphite Federated Storage
–  One place to see all our metrics from all the world
–  No delay due to rsync of RRD files
–  Graph close to real time, delay only due to
rrdcached flushing interval
What’s next?
Some hot topics…
•  Do trending alert with Nagios based on
Graphite/Ganglia data
•  Better automation for non-cloud servers
•  Ensure we can scale our monitoring when
using hybrid cloud (Eucalyptus) or multiple
public cloud provider
•  Get a better centralized view of our different
Nagios
Ops @ TubeMogul
All this wouldn’t be possible without a strong
System Operation team
Andrey Shestakov
Eamon Bisson Donahue
Justin Francisconi
Marylene Tanfin
Nicolas Brousse
Stan Rudenko
Thank you…

TubeMogul is Hiring !
http://www.tubemogul.com/jobs
jobs@tubemogul.com
Follow us on Twitter
@TubeMogul

@orieg

Más contenido relacionado

La actualidad más candente

Auditing data and answering the life long question, is it the end of the day ...
Auditing data and answering the life long question, is it the end of the day ...Auditing data and answering the life long question, is it the end of the day ...
Auditing data and answering the life long question, is it the end of the day ...Simona Meriam
 
Meetup elastic centralised security monitoring with decentralized clusters
Meetup elastic   centralised security monitoring with decentralized clustersMeetup elastic   centralised security monitoring with decentralized clusters
Meetup elastic centralised security monitoring with decentralized clustersAstridLiu1
 
Whats the Time
Whats the TimeWhats the Time
Whats the TimeAPNIC
 
Netflix viewing data architecture evolution - EBJUG Nov 2014
Netflix viewing data architecture evolution - EBJUG Nov 2014Netflix viewing data architecture evolution - EBJUG Nov 2014
Netflix viewing data architecture evolution - EBJUG Nov 2014Philip Fisher-Ogden
 
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017Monal Daxini
 
Open Source Monitoring Tools Shootout
Open Source Monitoring Tools ShootoutOpen Source Monitoring Tools Shootout
Open Source Monitoring Tools Shootouttomdc
 
Nielsen Presents: Fun with Kafka, Spark and Offset Management
Nielsen Presents: Fun with Kafka, Spark and Offset ManagementNielsen Presents: Fun with Kafka, Spark and Offset Management
Nielsen Presents: Fun with Kafka, Spark and Offset ManagementSimona Meriam
 
How we built an event-time merge of two kafka-streams with spark-streaming
How we built an event-time merge of two kafka-streams with spark-streamingHow we built an event-time merge of two kafka-streams with spark-streaming
How we built an event-time merge of two kafka-streams with spark-streamingRalf Sigmund
 
The ultimate container monitoring bake-off - Rancher Online Meetup October 2016
The ultimate container monitoring bake-off - Rancher Online Meetup October 2016The ultimate container monitoring bake-off - Rancher Online Meetup October 2016
The ultimate container monitoring bake-off - Rancher Online Meetup October 2016Shannon Williams
 
Spark logs made easy
Spark logs made easySpark logs made easy
Spark logs made easySimona Meriam
 
(DVO204) Monitoring Strategies: Finding Signal in the Noise
(DVO204) Monitoring Strategies: Finding Signal in the Noise(DVO204) Monitoring Strategies: Finding Signal in the Noise
(DVO204) Monitoring Strategies: Finding Signal in the NoiseAmazon Web Services
 
Beaming flink to the cloud @ netflix ff 2016-monal-daxini
Beaming flink to the cloud @ netflix   ff 2016-monal-daxiniBeaming flink to the cloud @ netflix   ff 2016-monal-daxini
Beaming flink to the cloud @ netflix ff 2016-monal-daxiniMonal Daxini
 
Dr. Elephant – Achieving Quicker, Easier, and Cost-Effective Big Data Analyti...
Dr. Elephant – Achieving Quicker, Easier, and Cost-Effective Big Data Analyti...Dr. Elephant – Achieving Quicker, Easier, and Cost-Effective Big Data Analyti...
Dr. Elephant – Achieving Quicker, Easier, and Cost-Effective Big Data Analyti...Akshay Rai
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 
New Relic .NET Agent Overview
New Relic .NET Agent OverviewNew Relic .NET Agent Overview
New Relic .NET Agent OverviewBrian Doll
 
Scalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestScalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestKrishna Gade
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogC4Media
 
Securing your AWS Deployments with Spinnaker and Armory Enterprise
Securing your AWS Deployments with Spinnaker and Armory EnterpriseSecuring your AWS Deployments with Spinnaker and Armory Enterprise
Securing your AWS Deployments with Spinnaker and Armory EnterpriseDevOps.com
 
Rapid Prototyping with AWS IoT and Mongoose OS on ESP8266, ESP32, and CC3200 ...
Rapid Prototyping with AWS IoT and Mongoose OS on ESP8266, ESP32, and CC3200 ...Rapid Prototyping with AWS IoT and Mongoose OS on ESP8266, ESP32, and CC3200 ...
Rapid Prototyping with AWS IoT and Mongoose OS on ESP8266, ESP32, and CC3200 ...Amazon Web Services
 

La actualidad más candente (20)

sensu
sensusensu
sensu
 
Auditing data and answering the life long question, is it the end of the day ...
Auditing data and answering the life long question, is it the end of the day ...Auditing data and answering the life long question, is it the end of the day ...
Auditing data and answering the life long question, is it the end of the day ...
 
Meetup elastic centralised security monitoring with decentralized clusters
Meetup elastic   centralised security monitoring with decentralized clustersMeetup elastic   centralised security monitoring with decentralized clusters
Meetup elastic centralised security monitoring with decentralized clusters
 
Whats the Time
Whats the TimeWhats the Time
Whats the Time
 
Netflix viewing data architecture evolution - EBJUG Nov 2014
Netflix viewing data architecture evolution - EBJUG Nov 2014Netflix viewing data architecture evolution - EBJUG Nov 2014
Netflix viewing data architecture evolution - EBJUG Nov 2014
 
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
 
Open Source Monitoring Tools Shootout
Open Source Monitoring Tools ShootoutOpen Source Monitoring Tools Shootout
Open Source Monitoring Tools Shootout
 
Nielsen Presents: Fun with Kafka, Spark and Offset Management
Nielsen Presents: Fun with Kafka, Spark and Offset ManagementNielsen Presents: Fun with Kafka, Spark and Offset Management
Nielsen Presents: Fun with Kafka, Spark and Offset Management
 
How we built an event-time merge of two kafka-streams with spark-streaming
How we built an event-time merge of two kafka-streams with spark-streamingHow we built an event-time merge of two kafka-streams with spark-streaming
How we built an event-time merge of two kafka-streams with spark-streaming
 
The ultimate container monitoring bake-off - Rancher Online Meetup October 2016
The ultimate container monitoring bake-off - Rancher Online Meetup October 2016The ultimate container monitoring bake-off - Rancher Online Meetup October 2016
The ultimate container monitoring bake-off - Rancher Online Meetup October 2016
 
Spark logs made easy
Spark logs made easySpark logs made easy
Spark logs made easy
 
(DVO204) Monitoring Strategies: Finding Signal in the Noise
(DVO204) Monitoring Strategies: Finding Signal in the Noise(DVO204) Monitoring Strategies: Finding Signal in the Noise
(DVO204) Monitoring Strategies: Finding Signal in the Noise
 
Beaming flink to the cloud @ netflix ff 2016-monal-daxini
Beaming flink to the cloud @ netflix   ff 2016-monal-daxiniBeaming flink to the cloud @ netflix   ff 2016-monal-daxini
Beaming flink to the cloud @ netflix ff 2016-monal-daxini
 
Dr. Elephant – Achieving Quicker, Easier, and Cost-Effective Big Data Analyti...
Dr. Elephant – Achieving Quicker, Easier, and Cost-Effective Big Data Analyti...Dr. Elephant – Achieving Quicker, Easier, and Cost-Effective Big Data Analyti...
Dr. Elephant – Achieving Quicker, Easier, and Cost-Effective Big Data Analyti...
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
New Relic .NET Agent Overview
New Relic .NET Agent OverviewNew Relic .NET Agent Overview
New Relic .NET Agent Overview
 
Scalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestScalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at Pinterest
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
 
Securing your AWS Deployments with Spinnaker and Armory Enterprise
Securing your AWS Deployments with Spinnaker and Armory EnterpriseSecuring your AWS Deployments with Spinnaker and Armory Enterprise
Securing your AWS Deployments with Spinnaker and Armory Enterprise
 
Rapid Prototyping with AWS IoT and Mongoose OS on ESP8266, ESP32, and CC3200 ...
Rapid Prototyping with AWS IoT and Mongoose OS on ESP8266, ESP32, and CC3200 ...Rapid Prototyping with AWS IoT and Mongoose OS on ESP8266, ESP32, and CC3200 ...
Rapid Prototyping with AWS IoT and Mongoose OS on ESP8266, ESP32, and CC3200 ...
 

Destacado

Scaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced EnvironmentScaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced EnvironmentNicolas Brousse
 
Gut feelings always gives better result than sophisticated
Gut feelings always gives better result than sophisticatedGut feelings always gives better result than sophisticated
Gut feelings always gives better result than sophisticatedKNAV International Ltd
 
Digital dannelse
Digital dannelseDigital dannelse
Digital dannelsemlysemose
 
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)Nicolas Brousse
 
Internet, bidaiak eta sare sozialak
Internet, bidaiak eta sare sozialakInternet, bidaiak eta sare sozialak
Internet, bidaiak eta sare sozialaktokitan
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing DataWorks Summit
 

Destacado (7)

Powerpoint
PowerpointPowerpoint
Powerpoint
 
Scaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced EnvironmentScaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced Environment
 
Gut feelings always gives better result than sophisticated
Gut feelings always gives better result than sophisticatedGut feelings always gives better result than sophisticated
Gut feelings always gives better result than sophisticated
 
Digital dannelse
Digital dannelseDigital dannelse
Digital dannelse
 
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
 
Internet, bidaiak eta sare sozialak
Internet, bidaiak eta sare sozialakInternet, bidaiak eta sare sozialak
Internet, bidaiak eta sare sozialak
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 

Similar a Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Conference 2012)

Nagios Conference 2012 - Nicolas Brousse - Optimizing your Monitoring and Tre...
Nagios Conference 2012 - Nicolas Brousse - Optimizing your Monitoring and Tre...Nagios Conference 2012 - Nicolas Brousse - Optimizing your Monitoring and Tre...
Nagios Conference 2012 - Nicolas Brousse - Optimizing your Monitoring and Tre...Nagios
 
Measuring IPv6 using ad-based measurement
Measuring IPv6 using ad-based measurementMeasuring IPv6 using ad-based measurement
Measuring IPv6 using ad-based measurementAPNIC
 
Itsummit2015 blizzard
Itsummit2015 blizzardItsummit2015 blizzard
Itsummit2015 blizzardkevin_donovan
 
Regain Control Thanks To Prometheus
Regain Control Thanks To PrometheusRegain Control Thanks To Prometheus
Regain Control Thanks To PrometheusEtienne Coutaud
 
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...Nicolas Brousse
 
Nagios Conference 2013 - Nicolas Brousse - Bringing Business Awareness
Nagios Conference 2013 - Nicolas Brousse - Bringing Business AwarenessNagios Conference 2013 - Nicolas Brousse - Bringing Business Awareness
Nagios Conference 2013 - Nicolas Brousse - Bringing Business AwarenessNagios
 
PyCon India 2012: Celery Talk
PyCon India 2012: Celery TalkPyCon India 2012: Celery Talk
PyCon India 2012: Celery TalkPiyush Kumar
 
Kubernetes Clusters At Scale: Managing Hundreds Apache Pinot Kubernetes Clust...
Kubernetes Clusters At Scale: Managing Hundreds Apache Pinot Kubernetes Clust...Kubernetes Clusters At Scale: Managing Hundreds Apache Pinot Kubernetes Clust...
Kubernetes Clusters At Scale: Managing Hundreds Apache Pinot Kubernetes Clust...Xiaoman DONG
 
Sumo Logic: Optimizing Scheduled Searches
Sumo Logic: Optimizing Scheduled SearchesSumo Logic: Optimizing Scheduled Searches
Sumo Logic: Optimizing Scheduled SearchesSumo Logic
 
Monitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudMonitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudDatadog
 
AWS re:Invent 2016: Automated Governance of Your AWS Resources (DEV302)
AWS re:Invent 2016: Automated Governance of Your AWS Resources (DEV302)AWS re:Invent 2016: Automated Governance of Your AWS Resources (DEV302)
AWS re:Invent 2016: Automated Governance of Your AWS Resources (DEV302)Amazon Web Services
 
Lesson_08_Continuous_Monitoring.pdf
Lesson_08_Continuous_Monitoring.pdfLesson_08_Continuous_Monitoring.pdf
Lesson_08_Continuous_Monitoring.pdfMinh Quân Đoàn
 
How Greenhouse Software Unlocked the Power of Machine Data Analytics with Sum...
How Greenhouse Software Unlocked the Power of Machine Data Analytics with Sum...How Greenhouse Software Unlocked the Power of Machine Data Analytics with Sum...
How Greenhouse Software Unlocked the Power of Machine Data Analytics with Sum...Amazon Web Services
 
Nagios Conference 2007 | Enterprise Application Monitoring with Nagios by Jam...
Nagios Conference 2007 | Enterprise Application Monitoring with Nagios by Jam...Nagios Conference 2007 | Enterprise Application Monitoring with Nagios by Jam...
Nagios Conference 2007 | Enterprise Application Monitoring with Nagios by Jam...NETWAYS
 
Simply Business - Near Real Time Event Processing
Simply Business - Near Real Time Event ProcessingSimply Business - Near Real Time Event Processing
Simply Business - Near Real Time Event Processingidan_by
 
Automated Governance of Your AWS Resources
Automated Governance of Your AWS ResourcesAutomated Governance of Your AWS Resources
Automated Governance of Your AWS ResourcesAmazon Web Services
 
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemonsaspyker
 
Google Cloud Next '22 Recap: Serverless & Data edition
Google Cloud Next '22 Recap: Serverless & Data editionGoogle Cloud Next '22 Recap: Serverless & Data edition
Google Cloud Next '22 Recap: Serverless & Data editionDaniel Zivkovic
 
Docker, Monitoring and SLURM Specific Visualisations
Docker, Monitoring and SLURM Specific VisualisationsDocker, Monitoring and SLURM Specific Visualisations
Docker, Monitoring and SLURM Specific Visualisationsalherca1
 

Similar a Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Conference 2012) (20)

Nagios Conference 2012 - Nicolas Brousse - Optimizing your Monitoring and Tre...
Nagios Conference 2012 - Nicolas Brousse - Optimizing your Monitoring and Tre...Nagios Conference 2012 - Nicolas Brousse - Optimizing your Monitoring and Tre...
Nagios Conference 2012 - Nicolas Brousse - Optimizing your Monitoring and Tre...
 
Measuring IPv6 using ad-based measurement
Measuring IPv6 using ad-based measurementMeasuring IPv6 using ad-based measurement
Measuring IPv6 using ad-based measurement
 
Itsummit2015 blizzard
Itsummit2015 blizzardItsummit2015 blizzard
Itsummit2015 blizzard
 
Regain Control Thanks To Prometheus
Regain Control Thanks To PrometheusRegain Control Thanks To Prometheus
Regain Control Thanks To Prometheus
 
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
 
Nagios Conference 2013 - Nicolas Brousse - Bringing Business Awareness
Nagios Conference 2013 - Nicolas Brousse - Bringing Business AwarenessNagios Conference 2013 - Nicolas Brousse - Bringing Business Awareness
Nagios Conference 2013 - Nicolas Brousse - Bringing Business Awareness
 
PyCon India 2012: Celery Talk
PyCon India 2012: Celery TalkPyCon India 2012: Celery Talk
PyCon India 2012: Celery Talk
 
Kubernetes Clusters At Scale: Managing Hundreds Apache Pinot Kubernetes Clust...
Kubernetes Clusters At Scale: Managing Hundreds Apache Pinot Kubernetes Clust...Kubernetes Clusters At Scale: Managing Hundreds Apache Pinot Kubernetes Clust...
Kubernetes Clusters At Scale: Managing Hundreds Apache Pinot Kubernetes Clust...
 
Zentral london mac_ad_uk_2017
Zentral london mac_ad_uk_2017Zentral london mac_ad_uk_2017
Zentral london mac_ad_uk_2017
 
Sumo Logic: Optimizing Scheduled Searches
Sumo Logic: Optimizing Scheduled SearchesSumo Logic: Optimizing Scheduled Searches
Sumo Logic: Optimizing Scheduled Searches
 
Monitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloudMonitoring kubernetes across data center and cloud
Monitoring kubernetes across data center and cloud
 
AWS re:Invent 2016: Automated Governance of Your AWS Resources (DEV302)
AWS re:Invent 2016: Automated Governance of Your AWS Resources (DEV302)AWS re:Invent 2016: Automated Governance of Your AWS Resources (DEV302)
AWS re:Invent 2016: Automated Governance of Your AWS Resources (DEV302)
 
Lesson_08_Continuous_Monitoring.pdf
Lesson_08_Continuous_Monitoring.pdfLesson_08_Continuous_Monitoring.pdf
Lesson_08_Continuous_Monitoring.pdf
 
How Greenhouse Software Unlocked the Power of Machine Data Analytics with Sum...
How Greenhouse Software Unlocked the Power of Machine Data Analytics with Sum...How Greenhouse Software Unlocked the Power of Machine Data Analytics with Sum...
How Greenhouse Software Unlocked the Power of Machine Data Analytics with Sum...
 
Nagios Conference 2007 | Enterprise Application Monitoring with Nagios by Jam...
Nagios Conference 2007 | Enterprise Application Monitoring with Nagios by Jam...Nagios Conference 2007 | Enterprise Application Monitoring with Nagios by Jam...
Nagios Conference 2007 | Enterprise Application Monitoring with Nagios by Jam...
 
Simply Business - Near Real Time Event Processing
Simply Business - Near Real Time Event ProcessingSimply Business - Near Real Time Event Processing
Simply Business - Near Real Time Event Processing
 
Automated Governance of Your AWS Resources
Automated Governance of Your AWS ResourcesAutomated Governance of Your AWS Resources
Automated Governance of Your AWS Resources
 
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
 
Google Cloud Next '22 Recap: Serverless & Data edition
Google Cloud Next '22 Recap: Serverless & Data editionGoogle Cloud Next '22 Recap: Serverless & Data edition
Google Cloud Next '22 Recap: Serverless & Data edition
 
Docker, Monitoring and SLURM Specific Visualisations
Docker, Monitoring and SLURM Specific VisualisationsDocker, Monitoring and SLURM Specific Visualisations
Docker, Monitoring and SLURM Specific Visualisations
 

Más de Nicolas Brousse

<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...
<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...
<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...Nicolas Brousse
 
Improving Adobe Experience Cloud Services Dependability with Machine Learning
Improving Adobe Experience Cloud Services Dependability with Machine LearningImproving Adobe Experience Cloud Services Dependability with Machine Learning
Improving Adobe Experience Cloud Services Dependability with Machine LearningNicolas Brousse
 
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...Nicolas Brousse
 
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...Nicolas Brousse
 
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStack
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStackAdobe Advertising Cloud: The Reality of Cloud Bursting with OpenStack
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStackNicolas Brousse
 
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuiteSuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuiteNicolas Brousse
 
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...Nicolas Brousse
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthNicolas Brousse
 
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...Nicolas Brousse
 
Improving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetImproving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetNicolas Brousse
 

Más de Nicolas Brousse (10)

<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...
<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...
<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...
 
Improving Adobe Experience Cloud Services Dependability with Machine Learning
Improving Adobe Experience Cloud Services Dependability with Machine LearningImproving Adobe Experience Cloud Services Dependability with Machine Learning
Improving Adobe Experience Cloud Services Dependability with Machine Learning
 
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...
 
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...
 
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStack
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStackAdobe Advertising Cloud: The Reality of Cloud Bursting with OpenStack
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStack
 
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuiteSuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
 
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
 
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
 
Improving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetImproving Operations Efficiency with Puppet
Improving Operations Efficiency with Puppet
 

Último

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 

Último (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Conference 2012)

  • 1. Optimizing your Monitoring and Trending tools for the Cloud Nagios World Conference 2012 Nicolas Brousse Lead Operations Engineer September 28th 2012
  • 2. •  •  •  •  •  •  •  •  •  About TubeMogul What are some of our challenges? Our environment Amazon Cloud Environment Automated Monitoring Efficient on-call rotation Efficient monitoring What’s next? Q&A
  • 3. About TubeMogul •  Founded in 2006 •  Formerly a video distribution and analytics platform •  TubeMogul is a Brand-Focused Video Marketing Company –  Build for Branding –  Integrate real-time media buying, ad serving, targeting, optimization and brand measurement TubeMogul simplifies the delivery of video ads and maximizes the impact of every dollar spent by brand marketers http://www.tubemogul.com
  • 4. What are some of our challenges? •  Monitoring between 700 to 1000 servers •  Servers spread across 6 different locations –  4 Amazon EC2 Regions (our public cloud provider) –  1 Hosted (Liquidweb) & 1 VPS (Linode) •  Little monitoring resources –  Collecting over 115,000 metrics –  Monitoring over 20,000 services with Nagios •  Multiple billions of HTTP requests a day –  Most of it must be served in less than 100ms –  Lost of traffic could mean lost of business opportunity –  Or worst, over-spending…
  • 5. Our environment •  Over 80 different server profiles •  Our stack: –  Java (Embedded Jetty, Tomcat) –  PHP, RoR –  Hadoop: HDFS, M/R, Hbase, Hive –  Couchbase –  MySQL •  Monitoring: Nagios, NSCA •  Graphing: Ganglia, sFlow, Graphite •  Configuration Management: Puppet
  • 7. Amazon Cloud Environment •  We use EC2, SDB, SQS, EMR, S3, etc. •  We don’t use ELB •  We heavily use EC2 Tags ec2-describe-instances -F tag:hostname=dev-build01
  • 9. Automated Monitoring Configuring Ganglia using Puppet templates globals { … override_hostname = <%= scope.lookupvar('hostname') %> … } udp_send_channel { host = <%= scope.lookupvar('ec2_tag_nagios_host') %> port = 8649 ttl = 1 } sflow { udp_port = 6343 accept_jvm_metrics = yes multiple_jvm_instances = yes }
  • 10. Automated Monitoring Or configuring Host sFlow using Puppet templates sflow{ DNSSD = off polling = 20 sampling = 512 collector{ ip = <%= ec2_tag_nagios_host %> udpport = 6343 } }
  • 11. Automated Monitoring •  Puppet configure our monitoring instances –  We use Nagios regex : use_regexp_matching=1 –  But we don’t use true regex : use_true_regexp_matching=0 –  We use NSCA with Upstart –  We don’t use the perfdata –  We use pre-cached objects –  We includes our configurations from 3 directories •  objects => templates, contacts, commands, event_handlers •  servers => contain a configuration file for each server •  clusters => contain a configuration file for each cluster
  • 12. Automated Monitoring Process of event when starting a new host and add it to our monitoring: 1.  We start a new instance using Cerveza and Cloud-init 2.  Puppet configure Gmond or Host sFlow on the instance 3.  Our monitoring server running Gmond and Gmetad get data from the new instance 4.  A Nagios check run every minute and check for new hosts •  •  •  5.  Look for new hosts using EC2 API Look for EC2 tag “hostname” to confirm it’s a legit host, not a zombie / fail start Look for EC2 tag “nagios_host” to see if the host belong to this monitoring instance If a new host is found: •  •  We build a config for the host based on a template file and doing some string replace Once all config have been generated, we rebuild pre-cache objects and reload Nagios 6.  If we find “Zombie” host, we generate a Warning alert 7.  If the config is corrupt, we send a Critical alert
  • 15. Efficient on-call rotation •  Follow the sun –  Some of our team is in Ukraine, no more Tier 1 night on-call for us •  Nagios timeperiod and escalation are a pain to maintain –  Nagios notification plugged to Google Calendar •  Using our own notification script for email and paging •  Google Clendar make it easy for each team to manage their own on-call calendar •  Support for multiple Tier and complex schedules •  Caching Google Calendar info locally every hour –  Simpler definitions and rules in Nagios contacts –  Notify only people on-call, unless they asked for “off call” emails
  • 16. Efficient on-call rotation Using Google Calendar…
  • 17. Efficient on-call rotation •  •  •  •  Simple contact definitions Google Calendar info Tier Filter (Regex) Tier Interval (time to wait before escalating alert since last tier) •  Off call email
  • 19. Efficient on-call rotation On-call contact fetched from Google Calendar at the bottom of the alert makes our life easier!
  • 20. Efficient monitoring We disable most notification and only care of a cluster status
  • 21. Efficient monitoring Most of our checks are based on Ganglia RRD files
  • 22. Efficient monitoring It become really easy to monitor any metrics returned by Ganglia
  • 23. Efficient monitoring We can check cluster status by hosts/services but also per returned messages !
  • 25. Efficient monitoring Using Graphite Federated Storage –  One place to see all our metrics from all the world –  No delay due to rsync of RRD files –  Graph close to real time, delay only due to rrdcached flushing interval
  • 26. What’s next? Some hot topics… •  Do trending alert with Nagios based on Graphite/Ganglia data •  Better automation for non-cloud servers •  Ensure we can scale our monitoring when using hybrid cloud (Eucalyptus) or multiple public cloud provider •  Get a better centralized view of our different Nagios
  • 27. Ops @ TubeMogul All this wouldn’t be possible without a strong System Operation team Andrey Shestakov Eamon Bisson Donahue Justin Francisconi Marylene Tanfin Nicolas Brousse Stan Rudenko
  • 28. Thank you… TubeMogul is Hiring ! http://www.tubemogul.com/jobs jobs@tubemogul.com Follow us on Twitter @TubeMogul @orieg