SlideShare una empresa de Scribd logo
1 de 14
Descargar para leer sin conexión
LISA 2011

Practice and Experience Report
Scaling on EC2 in a fast-paced environment
Nicolas Brousse
nicolas@tubemogul.com
December 8th 2011

2011 TubeMogul Incorporated All rights reserved.

1
Introduction - About the speaker
• My name is Nicolas Brousse
• I previously worked for many industry leading company in France
– From Web Hosting to Online Video services

(Lycos, MultiMania, Kewego, MediaPlazza...)

– Heavy traffic environment and large user databases
• I work as a Lead Operations Engineer at TubeMogul.com since 2008
• I help TubeMogul to scale its infrastructure
– From 20 servers to +500 servers
– Using 4 Amazon EC2 Regions + 1 Colo
– Monitoring with Nagios over 6,000 actives services and 1,000 passives services
– Collecting over 80,000 metrics with Ganglia
– Managing over 300 TB of data in Hadoop HDFS
– Billions HTTP queries a day
• Occasionally contribute to OpenSource projects
– Ganglia (PHP and PERL module)
– PHP Judy
2011 TubeMogul Incorporated All rights reserved.

2
Introduction - About TubeMogul
• Created in November 2006 by John Hughes and Brett Wilson
• Formerly a video distribution and analytics platform
• Acquire Illuminex - a flash analytics firm - in October 2008
• New platform call PlayTime™ :
– TubeMogul is a Video Marketing Company
– Built for Branding
– Integrate real-time media buying, ad serving, targeting, optimization and brand
measurement

TubeMogul simplifies the delivery of video ads and maximizes the impact of
every dollar spent by brand marketers
http://www.tubemogul.com/company/about_us

2011 TubeMogul Incorporated All rights reserved.

3
Amazon Clound Environment

2011 TubeMogul Incorporated All rights reserved.

4
Amazon Clound Environment
• We like it because....
– We can quickly start new servers/clusters
– We can quickly start new servers/clusters in many regions
• US East (Virginia)
• US West (North California)
• Europe (Dublin)
• Asia Pacific (Tokyo & Singapore)
– We can use different type of instances (RAM, CPU, Disks, etc.)
– It’s easy to automate with EC2 API
– It’s easy to plug to a configuration management tool

• But...
– It can be hard to troubleshoot some failures or network problems
– Occasionally being notified of hardware failures after the facts
– No Multicast (Though, possible with Amazon VPC)
– Bandwidth cost between regions can get expensive

2011 TubeMogul Incorporated All rights reserved.

5
Auto scale a cloud application
Crawling partners’ API to
aggregate Analytics data:
1. Call AWS API to start
instances
2. AWS Start instances
3. We push our code to the
instances
4. Open SSH tunnel to DB
5. Crawl Partners API and
aggregate the data
6. Push the data to our DB thru
SSH Tunnel
7. Shutdown EC2 instance

2011 TubeMogul Incorporated All rights reserved.

6
First approach to manage “always on” instances
•

Instances provisioning with a home made script called “Cerveza” in Tcl/Tk
–
–

•

Used to configure instance profile at boot
Let us run commands on multiple instances at once

Using LDAP and SQLite to track hostname and instance profiles
–
–

•

DNS Bind plugged to LDAP
SQLite store EBS volume, Instances id, keypair, AMI, etc.

Instance configuration with shell scripts from NFS mount
–

EBS Raid setup

–

Software deployment

–

Configuration changes

2011 TubeMogul Incorporated All rights reserved.

7
First approach to manage “always on” instances
•

Access Control
–
–

SSH Access limited by VPN

–

•

Security Group for each Cluster type (Hadoop, MySQL, etc.)
SSH and VPN plugged to LDAP

Easy way to identify instances
–
–

•

DNS plugged to LDAP
Each instance configured with human readable hostname
example: dev-mysql01, hadoop-namenode01, etc.

Easy working process for devs
–

•

Users’ home directory using NFS auto-mount

Automate Instances Monitoring
–

Trending with Ganglia, Alerting with Nagios

–

Using trending data for most Nagios checks

–

NSCA

2011 TubeMogul Incorporated All rights reserved.

8
Learning the hard way - the story
• SPOF in our infrastructure design lead us to a long outage
– prevent us to login into servers for couple of days
– couldn’t start new instances

1. Main server with NFS/LDAP went down
• corrupted EBS lead to fsck at boot time but no KVM on EC2...
• starting new instance still get stuck on fsck at boot
• end up using old AMI which we rebuild changing fstab
2. New instance took time to get ready to use because of old AMI
• need to recover from ldap backup first
• lost lot of configuration setup, slowing down our ability to manage instances
• need to re-install and configure our instance management tool “Cerveza”
3. /etc/resolv.conf need to be changed on every instances
• ssh backdoor didn’t work all the time because of LDAP/SSH/NFS timeout
4. Security Group rules with private IP made the recovery process more
complex
2011 TubeMogul Incorporated All rights reserved.

9
Learning the hard way - quick fixes

2011 TubeMogul Incorporated All rights reserved.

10
Going worldwide
• Need to handle multiple EC2 region for improved response time
– use of gateway server in each Availability Zone
– DNS Caching, LDAP Syncrepl, NFS with FS-Cache

• “Cerveza” rewrite in Python
– Replace SQLite by SimpleDB
– can be run from our laptop, need to run on a specific server
– handle full provisioning (start/stop/reboot) instances
– easily define instances profiles with YAML
– use EC2 tags to identify instances and EBS volumes

• Stop building our own AMI
– use of public ubuntu AMI server to reduce maintenance and support burden
– use of cloud-init to easy start and preconfigure new instances
– use of Puppet as our configuration management tool

2011 TubeMogul Incorporated All rights reserved.

11
Going worldwide

2011 TubeMogul Incorporated All rights reserved.

12
Lesson Learned
• Cloud environment doesn’t mean you should by-pass basic infrastructure
management rules
• Make sure the evolution of your infrastructure doesn’t introduce SPOF
• Keep it simple, stupid
• Don’t forget about backup and recovery process
• Use a configuration management tool early to prevent headache later

2011 TubeMogul Incorporated All rights reserved.

13
Thank You...
TubeMogul is Hiring !
http://www.tubemogul.com/company/careers
jobs@tubemogul.com
Follow us on Twitter
@tubemogul
2011 TubeMogul Incorporated All rights reserved.

@orieg

14

Más contenido relacionado

Más de Nicolas Brousse

SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuiteSuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuiteNicolas Brousse
 
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...Nicolas Brousse
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthNicolas Brousse
 
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...Nicolas Brousse
 
Improving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetImproving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetNicolas Brousse
 
Scaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced EnvironmentScaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced EnvironmentNicolas Brousse
 
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)Nicolas Brousse
 
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...Nicolas Brousse
 
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...Nicolas Brousse
 

Más de Nicolas Brousse (9)

SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuiteSuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
 
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
SRECon16: Moving Large Workloads from a Public Cloud to an OpenStack Private ...
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
 
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
Puppet Camp Silicon Valley 2015: How TubeMogul reached 10,000 Puppet Deployme...
 
Improving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetImproving Operations Efficiency with Puppet
Improving Operations Efficiency with Puppet
 
Scaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced EnvironmentScaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced Environment
 
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
 
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
 
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
 

Último

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Último (20)

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Scaling on EC2 in a Fast-Paced Environment (LISA'11)

  • 1. LISA 2011 Practice and Experience Report Scaling on EC2 in a fast-paced environment Nicolas Brousse nicolas@tubemogul.com December 8th 2011 2011 TubeMogul Incorporated All rights reserved. 1
  • 2. Introduction - About the speaker • My name is Nicolas Brousse • I previously worked for many industry leading company in France – From Web Hosting to Online Video services (Lycos, MultiMania, Kewego, MediaPlazza...) – Heavy traffic environment and large user databases • I work as a Lead Operations Engineer at TubeMogul.com since 2008 • I help TubeMogul to scale its infrastructure – From 20 servers to +500 servers – Using 4 Amazon EC2 Regions + 1 Colo – Monitoring with Nagios over 6,000 actives services and 1,000 passives services – Collecting over 80,000 metrics with Ganglia – Managing over 300 TB of data in Hadoop HDFS – Billions HTTP queries a day • Occasionally contribute to OpenSource projects – Ganglia (PHP and PERL module) – PHP Judy 2011 TubeMogul Incorporated All rights reserved. 2
  • 3. Introduction - About TubeMogul • Created in November 2006 by John Hughes and Brett Wilson • Formerly a video distribution and analytics platform • Acquire Illuminex - a flash analytics firm - in October 2008 • New platform call PlayTime™ : – TubeMogul is a Video Marketing Company – Built for Branding – Integrate real-time media buying, ad serving, targeting, optimization and brand measurement TubeMogul simplifies the delivery of video ads and maximizes the impact of every dollar spent by brand marketers http://www.tubemogul.com/company/about_us 2011 TubeMogul Incorporated All rights reserved. 3
  • 4. Amazon Clound Environment 2011 TubeMogul Incorporated All rights reserved. 4
  • 5. Amazon Clound Environment • We like it because.... – We can quickly start new servers/clusters – We can quickly start new servers/clusters in many regions • US East (Virginia) • US West (North California) • Europe (Dublin) • Asia Pacific (Tokyo & Singapore) – We can use different type of instances (RAM, CPU, Disks, etc.) – It’s easy to automate with EC2 API – It’s easy to plug to a configuration management tool • But... – It can be hard to troubleshoot some failures or network problems – Occasionally being notified of hardware failures after the facts – No Multicast (Though, possible with Amazon VPC) – Bandwidth cost between regions can get expensive 2011 TubeMogul Incorporated All rights reserved. 5
  • 6. Auto scale a cloud application Crawling partners’ API to aggregate Analytics data: 1. Call AWS API to start instances 2. AWS Start instances 3. We push our code to the instances 4. Open SSH tunnel to DB 5. Crawl Partners API and aggregate the data 6. Push the data to our DB thru SSH Tunnel 7. Shutdown EC2 instance 2011 TubeMogul Incorporated All rights reserved. 6
  • 7. First approach to manage “always on” instances • Instances provisioning with a home made script called “Cerveza” in Tcl/Tk – – • Used to configure instance profile at boot Let us run commands on multiple instances at once Using LDAP and SQLite to track hostname and instance profiles – – • DNS Bind plugged to LDAP SQLite store EBS volume, Instances id, keypair, AMI, etc. Instance configuration with shell scripts from NFS mount – EBS Raid setup – Software deployment – Configuration changes 2011 TubeMogul Incorporated All rights reserved. 7
  • 8. First approach to manage “always on” instances • Access Control – – SSH Access limited by VPN – • Security Group for each Cluster type (Hadoop, MySQL, etc.) SSH and VPN plugged to LDAP Easy way to identify instances – – • DNS plugged to LDAP Each instance configured with human readable hostname example: dev-mysql01, hadoop-namenode01, etc. Easy working process for devs – • Users’ home directory using NFS auto-mount Automate Instances Monitoring – Trending with Ganglia, Alerting with Nagios – Using trending data for most Nagios checks – NSCA 2011 TubeMogul Incorporated All rights reserved. 8
  • 9. Learning the hard way - the story • SPOF in our infrastructure design lead us to a long outage – prevent us to login into servers for couple of days – couldn’t start new instances 1. Main server with NFS/LDAP went down • corrupted EBS lead to fsck at boot time but no KVM on EC2... • starting new instance still get stuck on fsck at boot • end up using old AMI which we rebuild changing fstab 2. New instance took time to get ready to use because of old AMI • need to recover from ldap backup first • lost lot of configuration setup, slowing down our ability to manage instances • need to re-install and configure our instance management tool “Cerveza” 3. /etc/resolv.conf need to be changed on every instances • ssh backdoor didn’t work all the time because of LDAP/SSH/NFS timeout 4. Security Group rules with private IP made the recovery process more complex 2011 TubeMogul Incorporated All rights reserved. 9
  • 10. Learning the hard way - quick fixes 2011 TubeMogul Incorporated All rights reserved. 10
  • 11. Going worldwide • Need to handle multiple EC2 region for improved response time – use of gateway server in each Availability Zone – DNS Caching, LDAP Syncrepl, NFS with FS-Cache • “Cerveza” rewrite in Python – Replace SQLite by SimpleDB – can be run from our laptop, need to run on a specific server – handle full provisioning (start/stop/reboot) instances – easily define instances profiles with YAML – use EC2 tags to identify instances and EBS volumes • Stop building our own AMI – use of public ubuntu AMI server to reduce maintenance and support burden – use of cloud-init to easy start and preconfigure new instances – use of Puppet as our configuration management tool 2011 TubeMogul Incorporated All rights reserved. 11
  • 12. Going worldwide 2011 TubeMogul Incorporated All rights reserved. 12
  • 13. Lesson Learned • Cloud environment doesn’t mean you should by-pass basic infrastructure management rules • Make sure the evolution of your infrastructure doesn’t introduce SPOF • Keep it simple, stupid • Don’t forget about backup and recovery process • Use a configuration management tool early to prevent headache later 2011 TubeMogul Incorporated All rights reserved. 13
  • 14. Thank You... TubeMogul is Hiring ! http://www.tubemogul.com/company/careers jobs@tubemogul.com Follow us on Twitter @tubemogul 2011 TubeMogul Incorporated All rights reserved. @orieg 14