NWC 2011

Monitoring a Cloud Infrastructure in a Multi-Region Topology
Nicolas Brousse
nicolas@tubemogul.com
September 29th 2011

2011 TubeMogul Incorporated All rights reserved.

Introduction - About the speaker
• My name is Nicolas Brousse
• I previously worked for several industry-leading companies in France
– From Web Hosting to Online Video services (Lycos, MultiMania, Kewego, MediaPlazza...)
– Heavy traffic environments and large user databases
• I have worked as a Lead Operations Engineer at TubeMogul.com since 2008
• I help TubeMogul scale its infrastructure
– From 20 servers to 500+ servers
– Using 4 Amazon EC2 Regions + 1 Colo
– Monitoring over 6,000 active services and 1,000 passive services with Nagios
– Collecting over 80,000 metrics with Ganglia
– Managing over 300 TB of data in Hadoop HDFS
– Billions of HTTP queries a day
• Occasional contributor to open source projects
– Ganglia (PHP and Perl modules)
– PHP Judy

Introduction - About TubeMogul
• Created in November 2006 by John Hughes and Brett Wilson
• Formerly a video distribution and analytics platform
• Acquired Illuminex, a Flash analytics firm, in October 2008
• New platform called PlayTime™:
– TubeMogul is a Video Marketing Company
– Built for Branding
– Integrates real-time media buying, ad serving, targeting, optimization and brand measurement

TubeMogul simplifies the delivery of video ads and maximizes the impact of every dollar spent by brand marketers
http://www.tubemogul.com/company/about_us

Our Environment
• 10+ servers hosted at LiquidWeb
• A few VPS instances on Linode
• 500+ instances on Amazon EC2
– Over 50 different server configurations

• Our technology stack:
– Java, PHP
– Hadoop : HDFS, MapReduce, HBase, Hive
– Membase
– Memcache
– MySQL
– And more...

• Monitoring with Nagios
– Using NSCA when possible

• Graphing and Trending using Ganglia with Python plugins
– Some legacy servers using Munin

• Configuration Management using Puppet

Amazon Cloud Environment
• We like it because...
– We can quickly start new servers/clusters
– We can quickly start new servers/clusters in many regions
• US East (Virginia)
• US West (Northern California)
• Europe (Dublin)
• Asia Pacific (Tokyo & Singapore)
– We can use different types of instances (RAM, CPU, Disks, etc.)
– It’s easy to automate with the EC2 API
– It’s easy to plug into a configuration management tool

• But...
– It can be hard to troubleshoot some failures or network problems
– We are occasionally notified of hardware failures only after the fact
– No Multicast (though it is possible with Amazon VPC)
– Bandwidth between regions can get expensive

What’s the plan?
• Our monitoring must be able to scale
• We need a better Graphing/Trending solution
• Our monitoring configuration must be automated
– How do we monitor a cluster whose number of servers changes every hour?
– How do we change configuration in multiple regions without missing something?

• A failure in one region shouldn’t impact other regions
• We want to be woken up only when it really matters
• We have limited resources
– Can’t spend big bucks on monitoring
– Small operations team

Graphing, Trending...
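[Diagram: Munin compared with Ganglia. Munin: munin-update and munin-graph pull from the munin-nodes by sequential polling. Ganglia: Gmond nodes push metrics, Gmetad pulls from Gmond, and Gmetad instances can themselves be pulled by another Gmetad.]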

Graphing, Trending...
• Why did we switch from Munin to Ganglia?
– Pretty much: Pull vs Push
• The Munin server fetches data from Munin clients (munin-nodes)
– Can quickly overload the Munin server in disk I/O and CPU
– Data is collected sequentially, so each run is affected by the previous run time and server load
• Ganglia clients send data to representative cluster nodes. Data is federated periodically by a Gmetad process.
– Lighter on the aggregation side
– Clients push data at a defined interval
– Can use thresholds to send data only when it makes sense
» using time_threshold and value_threshold in the metric
– Ganglia is designed for Clusters and Grids
• You can use multiple layers of gmond/gmetad processes
• You don’t need to manually add servers to your configuration
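
The effect of time_threshold and value_threshold can be sketched in a few lines of Python. This only illustrates the idea (send when the value has moved enough, or when nothing has been sent for too long); it is not gmond's code, and the class name and exact comparison rules are assumptions.

```python
class ThresholdSender:
    """Decide when a metric sample is worth sending (illustrative only)."""

    def __init__(self, time_threshold, value_threshold):
        self.time_threshold = time_threshold    # max seconds between two sends
        self.value_threshold = value_threshold  # min change that forces a send
        self.last_sent_at = None
        self.last_sent_value = None

    def should_send(self, now, value):
        # Always send the very first sample.
        if self.last_sent_at is None:
            return True
        # Send if we have been silent for too long.
        if now - self.last_sent_at >= self.time_threshold:
            return True
        # Send if the value moved enough since the last send.
        return abs(value - self.last_sent_value) >= self.value_threshold

    def offer(self, now, value):
        """Record the sample and return True if it would be sent."""
        if self.should_send(now, value):
            self.last_sent_at = now
            self.last_sent_value = value
            return True
        return False


if __name__ == "__main__":
    sender = ThresholdSender(time_threshold=60, value_threshold=5.0)
    for now, value in [(0, 10.0), (10, 11.0), (20, 17.0), (30, 18.0), (85, 18.5)]:
        print(now, value, sender.offer(now, value))
```

With these settings a stable metric costs one packet per minute, while a sudden change is still reported right away.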

Monitoring with Nagios

Automating Nagios configuration
• Puppet will configure our monitoring instance in each Region
– We use Nagios regex matching: use_regexp_matching=1
– But we don’t use true regex matching: use_true_regexp_matching=0
– We use NSCA with Upstart
– We don’t use perfdata
– We include our configuration from 3 directories
- objects => templates, contacts, commands, event_handlers
- servers => contains a configuration file for each server
- clusters => contains a configuration file for each cluster
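
As a rough sketch, the main-config settings listed above could be rendered like this. The base directory is hypothetical, and mapping "no perfdata" to process_performance_data=0 is an assumption; the slides do not show the actual Puppet template.

```python
def render_main_config(base_dir):
    """Return nagios.cfg lines for the options described on the slide."""
    lines = [
        "use_regexp_matching=1",
        "use_true_regexp_matching=0",
        # Assumption: "we don't use perfdata" means this directive is off.
        "process_performance_data=0",
    ]
    # One cfg_dir directive per configuration directory.
    for name in ("objects", "servers", "clusters"):
        lines.append("cfg_dir={}/{}".format(base_dir.rstrip("/"), name))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "/etc/nagios" is a placeholder path, not taken from the talk.
    print(render_main_config("/etc/nagios"), end="")
```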

Automating Nagios configuration
Sequence of events when a new host starts and is added to our monitoring:
1. We start a new instance using Cerveza and Cloud-init
2. Puppet configures Gmond on the instance
3. Our monitoring server running Gmetad gets data from the new instance
4. A Nagios check runs every minute and looks for new hosts in Ganglia
5. If a new host is found, the check script rebuilds the Nagios config and reloads Nagios
6. If the config is corrupt, the check script sends a critical alert
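
Steps 4 to 6 can be sketched as a small function. This is a minimal illustration, not the actual check script: the function name and the three hooks (write_config, verify, reload) are hypothetical, and verify could for instance wrap `nagios -v` on the main config file.

```python
OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def sync_new_hosts(ganglia_hosts, configured_hosts, write_config, verify, reload):
    """Add hosts seen in Ganglia but not yet known to Nagios.

    write_config(host), verify() and reload() are caller-supplied hooks.
    Returns (exit_code, message) in the usual plugin style.
    """
    new_hosts = sorted(set(ganglia_hosts) - set(configured_hosts))
    if not new_hosts:
        return OK, "OK - no new hosts"
    for host in new_hosts:
        write_config(host)
    if not verify():
        # Never reload a broken configuration; raise an alert instead.
        return CRITICAL, "CRITICAL - config check failed after adding " + ", ".join(new_hosts)
    reload()
    return OK, "OK - added " + ", ".join(new_hosts)


if __name__ == "__main__":
    written = []
    code, message = sync_new_hosts(
        ["web1", "web2", "db1"], ["web1"],
        write_config=written.append,
        verify=lambda: True,
        reload=lambda: None,
    )
    print(code, message, written)
```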

Automating Nagios configuration
• Each server configuration is generated from a template
• Our Nagios plugin, “check_tm_clusters”, goes over the RRD files generated by Ganglia
• If a new host is found, it simply copies the template to the servers config directory and replaces the variables with values reported by Ganglia and found in DNS entries
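
A minimal sketch of this discovery-and-template step, assuming Gmetad's usual on-disk layout of one directory per cluster containing one directory per host. The template text, the "generic-server" parent template, and the function names are hypothetical; the address is passed in by the caller, who could resolve it with `socket.gethostbyname`.

```python
import os
from string import Template

# Hypothetical host template; the real one is not shown in the slides.
HOST_TEMPLATE = Template(
    "define host {\n"
    "    use        $parent\n"
    "    host_name  $host\n"
    "    address    $address\n"
    "    hostgroups $cluster\n"
    "}\n"
)


def discover_hosts(rrd_root):
    """Return {host: cluster} from a <root>/<cluster>/<host>/ directory tree."""
    hosts = {}
    for cluster in sorted(os.listdir(rrd_root)):
        cluster_dir = os.path.join(rrd_root, cluster)
        if cluster.startswith("__") or not os.path.isdir(cluster_dir):
            continue  # skip plain files and summary directories
        for host in sorted(os.listdir(cluster_dir)):
            if host.startswith("__"):
                continue
            if os.path.isdir(os.path.join(cluster_dir, host)):
                hosts[host] = cluster
    return hosts


def render_host(host, cluster, address, parent="generic-server"):
    """Fill the template variables for one server."""
    return HOST_TEMPLATE.substitute(
        parent=parent, host=host, cluster=cluster, address=address
    )


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only example address.
    print(render_host("web1.example.com", "web", "192.0.2.10"), end="")
```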

Reducing noise and false positives
• We disable most notifications and only care about cluster status

• Most of our checks are based on Ganglia RRD files
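
Caring only about cluster status can be sketched as follows: per-host states are folded into a single cluster state using percentage thresholds. The function name and the threshold semantics are assumptions for illustration, not the behaviour of the plugin used in the talk.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios state codes


def cluster_status(host_states, warn_pct, crit_pct):
    """Map per-host states to one cluster state.

    host_states: dict of host -> Nagios state (0 means OK).
    warn_pct / crit_pct: percentage of non-OK hosts that triggers each level.
    """
    total = len(host_states)
    if total == 0:
        return CRITICAL, "CRITICAL - cluster has no hosts"
    failed = [h for h, state in host_states.items() if state != OK]
    pct = 100.0 * len(failed) / total
    summary = "{}/{} hosts not OK ({:.0f}%)".format(len(failed), total, pct)
    if pct >= crit_pct:
        return CRITICAL, "CRITICAL - " + summary
    if pct >= warn_pct:
        return WARNING, "WARNING - " + summary
    return OK, "OK - " + summary


if __name__ == "__main__":
    states = {"web%d" % i: (CRITICAL if i < 3 else OK) for i in range(10)}
    print(cluster_status(states, warn_pct=20, crit_pct=50))
```

A single failed host in a ten-node cluster then stays below the warning threshold and wakes nobody up.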

Reducing noise and false positives
• It becomes really easy to monitor any metric returned by Ganglia

Reducing noise and false positives
• We can check cluster status by host/service, but also by returned message!

Reducing noise and false positives
• We extensively use our “check_cluster” plugin
• We limit email notifications as much as possible
• We use a custom variable _PAGING to identify pageable services
• Paging ONLY on Critical alerts for services/hosts with _PAGING=yes
• Use different contacts and time periods to send alerts to the right person
• We use Nagios Checker for Firefox and Chrome
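
The paging rule can be expressed as a tiny predicate. This is a sketch: the function name and the dictionary of custom variables are hypothetical, and in a real notification command the value would more likely arrive through a Nagios custom-variable macro.

```python
def should_page(state, custom_vars):
    """Page only for CRITICAL problems on objects tagged _PAGING=yes."""
    paging = str(custom_vars.get("_PAGING", "no")).strip().lower()
    return state == "CRITICAL" and paging == "yes"


if __name__ == "__main__":
    print(should_page("CRITICAL", {"_PAGING": "yes"}))
    print(should_page("WARNING", {"_PAGING": "yes"}))
    print(should_page("CRITICAL", {}))
```

Defaulting to "no" when the variable is missing keeps new services from paging anyone until they are explicitly tagged.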

Thank You...
TubeMogul is hiring!
http://www.tubemogul.com/company/careers
jobs@tubemogul.com

Follow us on Twitter
@tubemogul

@orieg
18

Más contenido relacionado

Destacado

Welcome - Keynote - AWSome Day Helsinki 2017
Welcome - Keynote - AWSome Day Helsinki 2017Welcome - Keynote - AWSome Day Helsinki 2017
Welcome - Keynote - AWSome Day Helsinki 2017Amazon Web Services
 
AWS 101: Introduction to AWS
AWS 101: Introduction to AWSAWS 101: Introduction to AWS
AWS 101: Introduction to AWSIan Massingham
 
Penetrating the Cloud: Opportunities & Challenges for Businesses
Penetrating the Cloud: Opportunities & Challenges for BusinessesPenetrating the Cloud: Opportunities & Challenges for Businesses
Penetrating the Cloud: Opportunities & Challenges for BusinessesCompTIA
 
Intro to cloud computing — MegaCOMM 2013, Jerusalem
Intro to cloud computing — MegaCOMM 2013, JerusalemIntro to cloud computing — MegaCOMM 2013, Jerusalem
Intro to cloud computing — MegaCOMM 2013, JerusalemReuven Lerner
 
Avoiding Cloud Outage
Avoiding Cloud OutageAvoiding Cloud Outage
Avoiding Cloud OutageNati Shalom
 
2013 State of Cloud Survey SMB Results
2013 State of Cloud Survey SMB Results2013 State of Cloud Survey SMB Results
2013 State of Cloud Survey SMB ResultsSymantec
 
Best Practices for Architecting in the Cloud - Jeff Barr
Best Practices for Architecting in the Cloud - Jeff BarrBest Practices for Architecting in the Cloud - Jeff Barr
Best Practices for Architecting in the Cloud - Jeff BarrAmazon Web Services
 
Delivering IaaS with Open Source Software
Delivering IaaS with Open Source SoftwareDelivering IaaS with Open Source Software
Delivering IaaS with Open Source SoftwareMark Hinkle
 
Summer School Scale Cloud Across the Enterprise
Summer School   Scale Cloud Across the EnterpriseSummer School   Scale Cloud Across the Enterprise
Summer School Scale Cloud Across the EnterpriseWSO2
 
Can we hack open source #cloud platforms to help reduce emissions?
Can we hack open source #cloud platforms to help reduce emissions?Can we hack open source #cloud platforms to help reduce emissions?
Can we hack open source #cloud platforms to help reduce emissions?Tom Raftery
 
The Total Cost of Ownership (TCO) of Web Applications in the AWS Cloud - Jine...
The Total Cost of Ownership (TCO) of Web Applications in the AWS Cloud - Jine...The Total Cost of Ownership (TCO) of Web Applications in the AWS Cloud - Jine...
The Total Cost of Ownership (TCO) of Web Applications in the AWS Cloud - Jine...Amazon Web Services
 
LinuxFest NW 2013: Hitchhiker's Guide to Open Source Cloud Computing
LinuxFest NW 2013: Hitchhiker's Guide to Open Source Cloud ComputingLinuxFest NW 2013: Hitchhiker's Guide to Open Source Cloud Computing
LinuxFest NW 2013: Hitchhiker's Guide to Open Source Cloud ComputingMark Hinkle
 
Linthicum what is-the-true-future-of-cloud-computing
Linthicum what is-the-true-future-of-cloud-computingLinthicum what is-the-true-future-of-cloud-computing
Linthicum what is-the-true-future-of-cloud-computingDavid Linthicum
 
The Inevitable Cloud Outage
The Inevitable Cloud OutageThe Inevitable Cloud Outage
The Inevitable Cloud OutageNewvewm
 
Simplifying The Cloud Top 10 Questions By SMBs
Simplifying The Cloud Top 10 Questions By SMBsSimplifying The Cloud Top 10 Questions By SMBs
Simplifying The Cloud Top 10 Questions By SMBsSun Digital, Inc.
 
Breaking through the Clouds
Breaking through the CloudsBreaking through the Clouds
Breaking through the CloudsAndy Piper
 
Module 1: AWS Introduction and History - AWSome Day Online Conference - APAC
Module 1: AWS Introduction and History - AWSome Day Online Conference - APACModule 1: AWS Introduction and History - AWSome Day Online Conference - APAC
Module 1: AWS Introduction and History - AWSome Day Online Conference - APACAmazon Web Services
 
Introduction to Amazon Web Services
Introduction to Amazon Web ServicesIntroduction to Amazon Web Services
Introduction to Amazon Web ServicesRobert Greiner
 

Destacado (19)

Technical Track
Technical TrackTechnical Track
Technical Track
 
Welcome - Keynote - AWSome Day Helsinki 2017
Welcome - Keynote - AWSome Day Helsinki 2017Welcome - Keynote - AWSome Day Helsinki 2017
Welcome - Keynote - AWSome Day Helsinki 2017
 
AWS 101: Introduction to AWS
AWS 101: Introduction to AWSAWS 101: Introduction to AWS
AWS 101: Introduction to AWS
 
Penetrating the Cloud: Opportunities & Challenges for Businesses
Penetrating the Cloud: Opportunities & Challenges for BusinessesPenetrating the Cloud: Opportunities & Challenges for Businesses
Penetrating the Cloud: Opportunities & Challenges for Businesses
 
Intro to cloud computing — MegaCOMM 2013, Jerusalem
Intro to cloud computing — MegaCOMM 2013, JerusalemIntro to cloud computing — MegaCOMM 2013, Jerusalem
Intro to cloud computing — MegaCOMM 2013, Jerusalem
 
Avoiding Cloud Outage
Avoiding Cloud OutageAvoiding Cloud Outage
Avoiding Cloud Outage
 
2013 State of Cloud Survey SMB Results
2013 State of Cloud Survey SMB Results2013 State of Cloud Survey SMB Results
2013 State of Cloud Survey SMB Results
 
Best Practices for Architecting in the Cloud - Jeff Barr
Best Practices for Architecting in the Cloud - Jeff BarrBest Practices for Architecting in the Cloud - Jeff Barr
Best Practices for Architecting in the Cloud - Jeff Barr
 
Delivering IaaS with Open Source Software
Delivering IaaS with Open Source SoftwareDelivering IaaS with Open Source Software
Delivering IaaS with Open Source Software
 
Summer School Scale Cloud Across the Enterprise
Summer School   Scale Cloud Across the EnterpriseSummer School   Scale Cloud Across the Enterprise
Summer School Scale Cloud Across the Enterprise
 
Can we hack open source #cloud platforms to help reduce emissions?
Can we hack open source #cloud platforms to help reduce emissions?Can we hack open source #cloud platforms to help reduce emissions?
Can we hack open source #cloud platforms to help reduce emissions?
 
The Total Cost of Ownership (TCO) of Web Applications in the AWS Cloud - Jine...
The Total Cost of Ownership (TCO) of Web Applications in the AWS Cloud - Jine...The Total Cost of Ownership (TCO) of Web Applications in the AWS Cloud - Jine...
The Total Cost of Ownership (TCO) of Web Applications in the AWS Cloud - Jine...
 
LinuxFest NW 2013: Hitchhiker's Guide to Open Source Cloud Computing
LinuxFest NW 2013: Hitchhiker's Guide to Open Source Cloud ComputingLinuxFest NW 2013: Hitchhiker's Guide to Open Source Cloud Computing
LinuxFest NW 2013: Hitchhiker's Guide to Open Source Cloud Computing
 
Linthicum what is-the-true-future-of-cloud-computing
Linthicum what is-the-true-future-of-cloud-computingLinthicum what is-the-true-future-of-cloud-computing
Linthicum what is-the-true-future-of-cloud-computing
 
The Inevitable Cloud Outage
The Inevitable Cloud OutageThe Inevitable Cloud Outage
The Inevitable Cloud Outage
 
Simplifying The Cloud Top 10 Questions By SMBs
Simplifying The Cloud Top 10 Questions By SMBsSimplifying The Cloud Top 10 Questions By SMBs
Simplifying The Cloud Top 10 Questions By SMBs
 
Breaking through the Clouds
Breaking through the CloudsBreaking through the Clouds
Breaking through the Clouds
 
Module 1: AWS Introduction and History - AWSome Day Online Conference - APAC
Module 1: AWS Introduction and History - AWSome Day Online Conference - APACModule 1: AWS Introduction and History - AWSome Day Online Conference - APAC
Module 1: AWS Introduction and History - AWSome Day Online Conference - APAC
 
Introduction to Amazon Web Services
Introduction to Amazon Web ServicesIntroduction to Amazon Web Services
Introduction to Amazon Web Services
 

Más de Nicolas Brousse

<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...
<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...
<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...Nicolas Brousse
 
Improving Adobe Experience Cloud Services Dependability with Machine Learning
Improving Adobe Experience Cloud Services Dependability with Machine LearningImproving Adobe Experience Cloud Services Dependability with Machine Learning
Improving Adobe Experience Cloud Services Dependability with Machine LearningNicolas Brousse
 
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...Nicolas Brousse
 
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...Nicolas Brousse
 
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStack
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStackAdobe Advertising Cloud: The Reality of Cloud Bursting with OpenStack
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStackNicolas Brousse
 
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuiteSuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuiteNicolas Brousse
 
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...Nicolas Brousse
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthNicolas Brousse
 
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...Nicolas Brousse
 
Improving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetImproving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetNicolas Brousse
 
Scaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced EnvironmentScaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced EnvironmentNicolas Brousse
 
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)Nicolas Brousse
 
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...Nicolas Brousse
 
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...Nicolas Brousse
 

Más de Nicolas Brousse (14)

<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...
<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...
<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...
 
Improving Adobe Experience Cloud Services Dependability with Machine Learning
Improving Adobe Experience Cloud Services Dependability with Machine LearningImproving Adobe Experience Cloud Services Dependability with Machine Learning
Improving Adobe Experience Cloud Services Dependability with Machine Learning
 
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...
 
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...
 
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStack
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStackAdobe Advertising Cloud: The Reality of Cloud Bursting with OpenStack
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStack
 
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuiteSuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
 
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
 
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
 
Improving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetImproving Operations Efficiency with Puppet
Improving Operations Efficiency with Puppet
 
Scaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced EnvironmentScaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced Environment
 
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
 
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
 
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
 

Último

Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 

Último (20)

Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 

Monitoring a Cloud Infrastructure in a Multi-Region Topology

  • 1. NWC 2011: Monitoring a Cloud Infrastructure in a Multi-Region Topology
    Nicolas Brousse, nicolas@tubemogul.com
    September 29th 2011
    2011 TubeMogul Incorporated. All rights reserved.
  • 2. Introduction - About the speaker
    • My name is Nicolas Brousse
    • I previously worked for several industry-leading companies in France
      – From web hosting to online video services (Lycos, MultiMania, Kewego, MediaPlazza...)
      – Heavy-traffic environments and large user databases
    • I have worked as Lead Operations Engineer at TubeMogul.com since 2008
    • I help TubeMogul scale its infrastructure
      – From 20 servers to 500+ servers
      – Using 4 Amazon EC2 regions + 1 colo
      – Monitoring over 6,000 active services and 1,000 passive services with Nagios
      – Collecting over 80,000 metrics with Ganglia
      – Managing over 300 TB of data in Hadoop HDFS
      – Billions of HTTP queries a day
    • I occasionally contribute to open source projects
      – Ganglia (PHP and Perl modules)
      – PHP Judy
  • 3. Introduction - About TubeMogul
    • Created in November 2006 by John Hughes and Brett Wilson
    • Formerly a video distribution and analytics platform
    • Acquired Illuminex, a Flash analytics firm, in October 2008
    • New platform called PlayTime™:
      – TubeMogul is a video marketing company
      – Built for branding
      – Integrates real-time media buying, ad serving, targeting, optimization and brand measurement
    • TubeMogul simplifies the delivery of video ads and maximizes the impact of every dollar spent by brand marketers
      http://www.tubemogul.com/company/about_us
  • 4. Our Environment
    • 10+ servers hosted at LiquidWeb
    • A few VPS on Linode
    • 500+ instances on Amazon EC2
      – Over 50 different server configurations
    • Our technology stack:
      – Java, PHP
      – Hadoop: HDFS, MapReduce, HBase, Hive
      – Membase
      – Memcache
      – MySQL
      – And more...
    • Monitoring with Nagios
      – Using NSCA when possible
    • Graphing and trending using Ganglia with Python plugins
      – Some legacy servers using Munin
    • Configuration management using Puppet
  • 5. Amazon Cloud Environment
  • 6. Amazon Cloud Environment
    • We like it because...
      – We can quickly start new servers/clusters
      – We can quickly start new servers/clusters in many regions
        • US East (Virginia)
        • US West (Northern California)
        • Europe (Dublin)
        • Asia Pacific (Tokyo & Singapore)
      – We can use different types of instances (RAM, CPU, disks, etc.)
      – It's easy to automate with the EC2 API
      – It's easy to plug into a configuration management tool
    • But...
      – It can be hard to troubleshoot some failures or network problems
      – We are occasionally notified of hardware failures after the fact
      – No multicast (though possible with Amazon VPC)
      – Bandwidth between regions can get expensive
  • 7. What's the plan?
    • Our monitoring must be able to scale
    • We need a better graphing/trending solution
    • Our monitoring configuration must be automated
      – How do we monitor a cluster whose number of servers changes every hour?
      – How do we change configuration in multiple regions without missing something?
    • A failure in one region shouldn't impact other regions
    • We want to be woken up only when it really matters
    • We have limited resources
      – Can't spend big bucks on monitoring
      – Small operations team
  • 9. Graphing, Trending...
    • Why did we switch from Munin to Ganglia?
      – Pretty much: pull vs. push
    • The Munin server fetches data from the Munin clients (munin-node)
      – This can quickly overload the Munin server in disk I/O and CPU
      – Data is collected sequentially, so each fetch is affected by the run time of the previous ones and by server load
    • Ganglia clients send data to representative cluster nodes. Data is federated periodically by a gmetad process.
      – Lighter on the aggregation side
      – Clients push data at a defined interval
      – Thresholds can be used to send data only when it makes sense
        » using time_threshold and value_threshold in the metric
      – Ganglia is designed for clusters and grids
        • You can use multiple layers of gmond/gmetad processes
        • You don't need to manually add servers to your configuration
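The threshold behaviour mentioned on this slide can be illustrated with a small Python sketch. This is a simplified model of a pushing client, not gmond's actual implementation: the function names `should_send` and `filter_samples` and the sample data are hypothetical, and only the `time_threshold` and `value_threshold` names come from the slide. The assumption is that a sample is pushed when enough time has passed since the last send, or when the value has moved by at least the value threshold.

```python
def should_send(value, now, last_sent_value, last_sent_time,
                time_threshold, value_threshold):
    """Decide whether a freshly collected sample should be pushed.

    Simplified model (an assumption, not gmond's code):
    - always send the first sample,
    - send when time_threshold seconds have passed since the last send,
    - send when the value moved by at least value_threshold.
    """
    if last_sent_time is None:
        return True
    if now - last_sent_time >= time_threshold:
        return True
    return abs(value - last_sent_value) >= value_threshold


def filter_samples(samples, time_threshold, value_threshold):
    """Return the (timestamp, value) samples that would actually be pushed."""
    sent = []
    last_time, last_value = None, None
    for now, value in samples:
        if should_send(value, now, last_value, last_time,
                       time_threshold, value_threshold):
            sent.append((now, value))
            last_time, last_value = now, value
    return sent


if __name__ == "__main__":
    # Hypothetical samples collected every 15 seconds or so.
    samples = [(0, 1.0), (15, 1.1), (30, 3.0), (45, 3.1), (120, 3.1)]
    for timestamp, value in filter_samples(samples, 90, 1.0):
        print(timestamp, value)
```

With these settings the small fluctuations are never sent on their own, which is what keeps the aggregation side light.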
  • 10. Monitoring with Nagios
  • 11. Automating Nagios configuration
    • Puppet configures our monitoring instance in each region
      – We use Nagios regex matching: use_regexp_matching=1
      – But we don't use true regex matching: use_true_regexp_matching=0
      – We use NSCA with Upstart
      – We don't use perfdata
      – We include our configuration from 3 directories
        - objects => templates, contacts, commands, event_handlers
        - servers => one configuration file per server
        - clusters => one configuration file per cluster
  • 12. Automating Nagios configuration
    Sequence of events when a new host is started and added to our monitoring:
    1. We start a new instance using Cerveza and Cloud-init
    2. Puppet configures gmond on the instance
    3. Our monitoring server running gmetad gets data from the new instance
    4. A Nagios check runs every minute and looks for new hosts in Ganglia
    5. If a new host is found, the check script rebuilds the Nagios config and reloads Nagios
    6. If the config is corrupt, the check script sends a critical alert
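Steps 4 to 6 above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the check script used at TubeMogul: the function `sync_new_hosts` and its callback arguments (`rebuild`, `validate`, `reload`) are hypothetical. The return codes 0 and 2 are the standard Nagios plugin exit codes for OK and CRITICAL.

```python
OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def sync_new_hosts(ganglia_hosts, configured_hosts, rebuild, validate, reload):
    """Detect hosts known to Ganglia but not yet to Nagios, and add them.

    rebuild(new_hosts): hypothetical callback that regenerates the config
    validate():         hypothetical callback, True if the config is valid
    reload():           hypothetical callback that reloads Nagios
    Returns a (exit_code, message) pair in the Nagios plugin style.
    """
    new_hosts = sorted(set(ganglia_hosts) - set(configured_hosts))
    if not new_hosts:
        return OK, "OK - no new hosts"

    rebuild(new_hosts)
    if not validate():
        # Never reload on a broken config; raise a critical alert instead.
        return CRITICAL, "CRITICAL - generated Nagios config is invalid"

    reload()
    return OK, "OK - added %d host(s): %s" % (len(new_hosts),
                                              ", ".join(new_hosts))


if __name__ == "__main__":
    code, message = sync_new_hosts(
        ["web-1", "web-2"], ["web-1"],
        rebuild=lambda hosts: None,
        validate=lambda: True,
        reload=lambda: None,
    )
    print(code, message)
```

In a real plugin, `validate` could wrap Nagios's own configuration check (`nagios -v` on the main configuration file), so that a corrupt config is reported without ever being loaded.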
  • 13. Automating Nagios configuration
    • Each server configuration is generated from a template
    • Our Nagios plugin, "check_tm_clusters", goes over the RRD files generated by Ganglia
    • If a new host is found, it simply copies the template to the servers config directory and replaces the variables using what Ganglia reports and the DNS entries
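The template approach can be sketched like this in Python. It is not the "check_tm_clusters" plugin itself, whose code is not shown in the slides. Everything here is an assumption made for illustration: the RRD layout `<rrd_root>/<cluster>/<host>/`, the skipping of directories whose names start with `__` (summary data), the template contents, the `generic-host` template name, and the `resolve` callback standing in for the DNS lookup.

```python
import os
from string import Template

# Hypothetical per-server template; the real one is not shown in the slides.
HOST_TEMPLATE = Template("""define host {
    use        $template
    host_name  $host_name
    address    $address
    hostgroups $cluster
}
""")


def hosts_from_rrd_tree(rrd_root):
    """Map host name -> cluster name, assuming <rrd_root>/<cluster>/<host>/."""
    found = {}
    for cluster in sorted(os.listdir(rrd_root)):
        cluster_dir = os.path.join(rrd_root, cluster)
        if cluster.startswith("__") or not os.path.isdir(cluster_dir):
            continue  # skip summary directories and stray files
        for host in sorted(os.listdir(cluster_dir)):
            if host.startswith("__"):
                continue
            if os.path.isdir(os.path.join(cluster_dir, host)):
                found[host] = cluster
    return found


def write_missing_configs(rrd_root, servers_dir, resolve,
                          template="generic-host"):
    """Create a config file for every host that does not have one yet.

    resolve(host) is a hypothetical callback returning the host's address,
    for example socket.gethostbyname. Returns the list of hosts added.
    """
    created = []
    for host, cluster in hosts_from_rrd_tree(rrd_root).items():
        path = os.path.join(servers_dir, host + ".cfg")
        if os.path.exists(path):
            continue
        with open(path, "w") as handle:
            handle.write(HOST_TEMPLATE.substitute(
                template=template, host_name=host,
                address=resolve(host), cluster=cluster))
        created.append(host)
    return created
```

Because existing files are left untouched, running the function every minute is harmless: it only writes something when a host appears for the first time.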
  • 14. Reducing noise and false positives
    • We disable most notifications and only care about cluster status
    • Most of our checks are based on Ganglia RRD files
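Alerting on cluster status rather than on individual hosts can be sketched as below. This is an illustrative Python model, not the plugin used in the talk: the function `cluster_status` and its `warn`/`crit` arguments (counts of non-OK hosts) are hypothetical. The state codes 0, 1 and 2 are the standard Nagios OK, WARNING and CRITICAL values.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios state codes


def cluster_status(host_states, warn, crit):
    """Aggregate per-host states into a single cluster state.

    host_states: mapping of host name -> Nagios state code
    warn, crit:  number of non-OK hosts that triggers each level
    Returns a (state_code, message) pair.
    """
    not_ok = sorted(host for host, state in host_states.items()
                    if state != OK)
    if len(not_ok) >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif len(not_ok) >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    return code, "%s - %d/%d hosts not OK" % (label, len(not_ok),
                                              len(host_states))


if __name__ == "__main__":
    states = {"web-1": OK, "web-2": CRITICAL, "web-3": OK, "web-4": OK}
    print(cluster_status(states, warn=1, crit=2))
```

A single failed host in a four-node cluster then produces a warning on one cluster service instead of a page per host.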
  • 15. Reducing noise and false positives
    • It becomes really easy to monitor any metric returned by Ganglia
  • 16. Reducing noise and false positives
    • We can check cluster status per host/service, but also per returned message!
  • 17. Reducing noise and false positives
    • We make extensive use of our "check_cluster" plugin
    • We limit email notifications as much as possible
    • We use a custom variable, _PAGING, to identify pageable services
    • We page ONLY on Critical alerts for services/hosts with _PAGING=yes
    • We use different contacts and time periods to send alerts to the right person
    • We use Nagios Checker for Firefox and Chrome
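The paging rule can be expressed in a few lines of Python. This is a sketch of the decision only, with a hypothetical function name `should_page`; how the state and the flag reach the script is an assumption. In Nagios, a custom service variable named `_PAGING` is normally available to notification commands as the macro `$_SERVICEPAGING$`, which could be passed in as the second argument.

```python
def should_page(state, paging_flag):
    """Page only for CRITICAL alerts on objects marked _PAGING=yes.

    state:       Nagios state name, e.g. "OK", "WARNING", "CRITICAL"
    paging_flag: value of the _PAGING custom variable (may be missing)
    """
    flag = (paging_flag or "").strip().lower()
    return state == "CRITICAL" and flag == "yes"


if __name__ == "__main__":
    # Hypothetical alerts: (service, state, _PAGING value)
    alerts = [
        ("mysql-cluster", "CRITICAL", "yes"),
        ("mysql-cluster", "WARNING", "yes"),
        ("dev-sandbox", "CRITICAL", None),
    ]
    for service, state, flag in alerts:
        print(service, state, "page" if should_page(state, flag) else "no page")
```

Keeping the rule this strict means warnings and unflagged services never wake anyone up, whatever their state.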
  • 18. Thank You...
    TubeMogul is hiring!
    http://www.tubemogul.com/company/careers
    jobs@tubemogul.com
    Follow us on Twitter @tubemogul
    @orieg