SlideShare una empresa de Scribd logo
How TubeMogul Handles over
One Trillion HTTP Requests a
Month
November 12th
, 2015
Nicolas Brousse | Sr. Director Of Operations Engineering | nicolas@tubemogul.com
Who are we?
TubeMogul
● Enterprise software company for digital branding
● Over 27 Billions Ads served in 2014
● Over 30 Billions Ad Auctions per day
● Bid processed in less than 50 ms
● Bid served in less than 80 ms (include network round trip)
● 5 PB of monthly video traffic served
Who are we?
Operations Engineering
● Ensure the smooth day to day operation of the platform
infrastructure
● Provide a cost effective and cutting edge infrastructure
● Team composed of SREs, SEs and DBAs
● Managing over 2,500 servers (virtual and physical)
Our Infrastructure
Multiple locations with a mix of Public Cloud and On Premises
● Java (a lot!)
● MySQL
● Couchbase
● Vertica
● Kafka
● Storm
● Zookeeper, Exhibitor
● Hadoop, HBase, Hive
● Terracotta
● ElasticSearch, Logstash, Kibana
● Varnish
● PHP, Python, Ruby, Go...
● Apache httpd
● Nagios
● Ganglia
Technology Hoarders
● Graphite
● Memcached
● Puppet
● HAproxy
● OpenStack
● Git and Gerrit
● Gor
● ActiveMQ
● OpenLDAP
● Redis
● Blackbox
● Jenkins, Sonar
● Tomcat
● Jetty (embedded)
● AWS DynamoDB, EC2, S3...
High Level Technical Overview
Technical Challenges
● Tight day to day operations
● Configuration Management and Automation
● Change Management with Peer Review and CI
● Measure and Monitor a lot
How do we manage all this?
OnCall Team Process
Request based on
Dashboards, Monitoring,
Paging or Engineers.
Ticket categorized in two swimlanes:
● Production Support
○ High Priority: Top to Bottom
○ On-call 24/7 (follow the sun)
○ Incident are handled 1st
○ Maintenance are handled 2nd
● Developer Support
○ Best Effort: Top to Bottom
○ Long effort request moved to
Infrastructure pipeline
● Large Nagios installation
● Introducing Sensu for scalability and as an easy Monitoring API for Developers
● Centralized OnCall Dashboard
Alerting
● TL;DR doesn't matter as long as you keep the flexibility for your dev team
● We leverage AWS for many different workload and scenarios
○ Using EC2, DynamoDB, SQS, SES, SNS, RDS, SWF, etc.
○ Workload varies from ephemeral computes to always on
● We moved part of our low latency dependent workload out of AWS to our On
Premises Cloud
○ Data Center proximity to key partners
○ Performance Investigation and Tuning
○ Network Visibility
○ Business Accountability
Which Cloud Provider? Private or Public?
CloudMogul with OpenStack
Challenge of Low Latency Globally
● Geo based DNS isn't based on network performance
● Proximity to user is key
○ Reduce Latency of standard TCP Handshake
○ Reduce Latency of SSL Handshake
● Mobile Networks...
● Require a global footprint
● Large footprint means unlikely to benefit from TLS session resumption
How to ensure pixel delivery at 50ms globally on the 50th
percentile while keeping a small server footprint?
Leverage CDN for Fast Pixel Collection at the Edge
Further Improvement
● Leverage CDN capabilities even further
○ First layer of protection against DDoS
○ Fast.ly VCL is very powerful
● Evaluate routing solution based on RUM (Cedexis)
● Evaluate smarter DNS routing (NS1)
● Round Robin DNS is great
○ Until your DNS entries are too large and client start using DNS thru TCP
● In US West, we went from 31 EC2 instances (c3.2xlarge) to two SuperMicro
servers
○ 32 Cores E5-2667 v3 @ 3.20GHz and 128 GB RAM
○ Use baremetal and leveraging VLAN to access OpenStack Tenant
● Managing SSL session is the most consuming in our workload (CPU and RAM)
○ A TLS connection can use up to 64Kb RAM
● CPU Pinning for network interrupts (4 Core), HAproxy (28 Core)
○ Disable irqbalance
○ Various sysctl config tuning (TCP, VM)
● One frontend for HTTP and HTTPS
● Crossdomain.xml files are served directly by HAproxy (no call to backend)
● All logs sent directly in json to ELK
● Home made process (HAvOC) to generate config and scaling of backend
Load Balancing with HAproxy
● Ganglia / Graphite / Grafana / ELK
Graphing and Logging As A Service
Network Visibility: Catchpoint
Monitor from multiple location
globally, complex test, trace routes,
alerts, etc.
Network Visibility: Dyn Internet Intelligence
Alerts on bgp route changes, prefix
changes, latency variation, internet
disruptions, etc.
● 2008 - 2010: Use SVN, Bash scripts and custom templates.
● 2010: Managing about 250 instances. Start looking at Puppet.
● 2011: Puppet 0.25 then 2.7 by EOY on 400 servers with 2 contributors.
● 2012: 800 servers managed by Puppet. 4 contributors.
● 2013: 1,000 servers managed by Puppet. 6 contributors.
● 2014: 1,500 servers managed by Puppet. Introduced Continuous Delivery
Workflow. 9 contributors. Start 3.7 migration.
● 2015: 2,000 servers managed by Puppet. 13 contributors.
Five Years Of Puppet!
● 2000 nodes
● 225 unique nodes definition
● 1 puppetmaster
● 112 Puppet modules
Puppet Stats
● Virtual and Physical Servers Configuration : Master mode
● Building AWS AMI with Packer : Master mode
● Local development environment with Vagrant : Master mode
● OpenStack deployment : Masterless mode
Where and how do we use Puppet ?
Infrastructure As Code: Code Review?
● Gerrit, an industry standard : Eclipse, Google, Chromium, OpenStack,
WikiMedia, LibreOffice, Spotify, GlusterFS, etc...
● Fine Grained Permissions Rules
● Plugged to LDAP
● Code Review per commit
● Stream Events
● Use GitBlit
● Integrated with Jenkins and Jira
● Managing about 600 Git repositories
A Powerful Gerrit Integration
● 1 job per module
● 1 job for the manifests and hiera data
● 1 job for the Puppet fileserver
● 1 job to deploy
Continuous Delivery with Jenkins
Global Jenkins stats for the past year
● ~10,000 Puppet deployment
● Over 8,500 Production App Deployment
Plugin : github.com/jenkinsci/job-dsl-plugin
● Automate the jobs creation
● Ensure a standard across all the jobs
● Versioned the configuration
● Apply changes to all your jobs without pain
● Test your configuration changes
Jenkins job DSL : code your Jenkins jobs
Team Awareness: HipChat Integration with Hubot
Team Awareness: HipChat Integration with more bots!
The Workflow
All This Wouldn't Be Possible Without a Strong Team.
Thank You.
OPS @ TubeMogul
SRE
Aleksey Mykhailov
Oleg Galitskiy
Brandon Rochon
Stan Rudenko
Julien Fabre
Joseph Herlant
SE
Alan Barnes
Aleksander Stepanov
Matt Cupples
Yurii Rochniak
Yurii Varvynets
Manasi Limbachiya
DBA
Alina Alexeeva
Cloud Engineer
Mykola Mogylenko
Pierre Gohon
Pierre Grandin
Nicolas Brousse @orieg

Más contenido relacionado

La actualidad más candente

What's New in NGINX Plus R12?
What's New in NGINX Plus R12? What's New in NGINX Plus R12?
What's New in NGINX Plus R12?
NGINX, Inc.
 
The 3 Models in the NGINX Microservices Reference Architecture
The 3 Models in the NGINX Microservices Reference ArchitectureThe 3 Models in the NGINX Microservices Reference Architecture
The 3 Models in the NGINX Microservices Reference Architecture
NGINX, Inc.
 
RedisConf17 - Operationalizing Redis at Scale
RedisConf17 - Operationalizing Redis at ScaleRedisConf17 - Operationalizing Redis at Scale
RedisConf17 - Operationalizing Redis at Scale
Redis Labs
 
Lcu14 Lightning Talk- NGINX
Lcu14 Lightning Talk- NGINXLcu14 Lightning Talk- NGINX
Lcu14 Lightning Talk- NGINX
Linaro
 
Dave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical ExperienceDave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical Experience
Nagios
 
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Nagios
 
Architecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructureArchitecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructure
mattlieber
 
PGConf APAC 2018 - PostgreSQL performance comparison in various clouds
PGConf APAC 2018 - PostgreSQL performance comparison in various cloudsPGConf APAC 2018 - PostgreSQL performance comparison in various clouds
PGConf APAC 2018 - PostgreSQL performance comparison in various clouds
PGConf APAC
 
Honest Performance Testing with "NDBench" (Vinay Chella, Netflix) | Cassandra...
Honest Performance Testing with "NDBench" (Vinay Chella, Netflix) | Cassandra...Honest Performance Testing with "NDBench" (Vinay Chella, Netflix) | Cassandra...
Honest Performance Testing with "NDBench" (Vinay Chella, Netflix) | Cassandra...
DataStax
 
What’s New in NGINX Plus R16?
What’s New in NGINX Plus R16?What’s New in NGINX Plus R16?
What’s New in NGINX Plus R16?
NGINX, Inc.
 
Mike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios InstallationMike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios Installation
Nagios
 
Webcast - Making kubernetes production ready
Webcast - Making kubernetes production readyWebcast - Making kubernetes production ready
Webcast - Making kubernetes production ready
Applatix
 
Nick Fisk - low latency Ceph
Nick Fisk - low latency CephNick Fisk - low latency Ceph
Nick Fisk - low latency Ceph
ShapeBlue
 
MRA AMA Part 10: Kubernetes and the Microservices Reference Architecture
MRA AMA Part 10: Kubernetes and the Microservices Reference ArchitectureMRA AMA Part 10: Kubernetes and the Microservices Reference Architecture
MRA AMA Part 10: Kubernetes and the Microservices Reference Architecture
NGINX, Inc.
 
Supercharge Application Delivery to Satisfy Users
Supercharge Application Delivery to Satisfy UsersSupercharge Application Delivery to Satisfy Users
Supercharge Application Delivery to Satisfy Users
NGINX, Inc.
 
Paul Angus - what's new in ACS 4.11
Paul Angus - what's new in ACS 4.11Paul Angus - what's new in ACS 4.11
Paul Angus - what's new in ACS 4.11
ShapeBlue
 
Serverless and Servicefull Applications - Where Microservices complements Ser...
Serverless and Servicefull Applications - Where Microservices complements Ser...Serverless and Servicefull Applications - Where Microservices complements Ser...
Serverless and Servicefull Applications - Where Microservices complements Ser...
Red Hat Developers
 
Benchmarking NGINX for Accuracy and Results
Benchmarking NGINX for Accuracy and ResultsBenchmarking NGINX for Accuracy and Results
Benchmarking NGINX for Accuracy and Results
NGINX, Inc.
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Amazon Web Services
 
Analyzing NGINX Logs with Datadog
Analyzing NGINX Logs with DatadogAnalyzing NGINX Logs with Datadog
Analyzing NGINX Logs with Datadog
NGINX, Inc.
 

La actualidad más candente (20)

What's New in NGINX Plus R12?
What's New in NGINX Plus R12? What's New in NGINX Plus R12?
What's New in NGINX Plus R12?
 
The 3 Models in the NGINX Microservices Reference Architecture
The 3 Models in the NGINX Microservices Reference ArchitectureThe 3 Models in the NGINX Microservices Reference Architecture
The 3 Models in the NGINX Microservices Reference Architecture
 
RedisConf17 - Operationalizing Redis at Scale
RedisConf17 - Operationalizing Redis at ScaleRedisConf17 - Operationalizing Redis at Scale
RedisConf17 - Operationalizing Redis at Scale
 
Lcu14 Lightning Talk- NGINX
Lcu14 Lightning Talk- NGINXLcu14 Lightning Talk- NGINX
Lcu14 Lightning Talk- NGINX
 
Dave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical ExperienceDave Williams - Nagios Log Server - Practical Experience
Dave Williams - Nagios Log Server - Practical Experience
 
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
Bryan Heden - Agile Networks - Using Nagios XI as the platform for Monitoring...
 
Architecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructureArchitecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructure
 
PGConf APAC 2018 - PostgreSQL performance comparison in various clouds
PGConf APAC 2018 - PostgreSQL performance comparison in various cloudsPGConf APAC 2018 - PostgreSQL performance comparison in various clouds
PGConf APAC 2018 - PostgreSQL performance comparison in various clouds
 
Honest Performance Testing with "NDBench" (Vinay Chella, Netflix) | Cassandra...
Honest Performance Testing with "NDBench" (Vinay Chella, Netflix) | Cassandra...Honest Performance Testing with "NDBench" (Vinay Chella, Netflix) | Cassandra...
Honest Performance Testing with "NDBench" (Vinay Chella, Netflix) | Cassandra...
 
What’s New in NGINX Plus R16?
What’s New in NGINX Plus R16?What’s New in NGINX Plus R16?
What’s New in NGINX Plus R16?
 
Mike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios InstallationMike Guthrie - Revamping Your 10 Year Old Nagios Installation
Mike Guthrie - Revamping Your 10 Year Old Nagios Installation
 
Webcast - Making kubernetes production ready
Webcast - Making kubernetes production readyWebcast - Making kubernetes production ready
Webcast - Making kubernetes production ready
 
Nick Fisk - low latency Ceph
Nick Fisk - low latency CephNick Fisk - low latency Ceph
Nick Fisk - low latency Ceph
 
MRA AMA Part 10: Kubernetes and the Microservices Reference Architecture
MRA AMA Part 10: Kubernetes and the Microservices Reference ArchitectureMRA AMA Part 10: Kubernetes and the Microservices Reference Architecture
MRA AMA Part 10: Kubernetes and the Microservices Reference Architecture
 
Supercharge Application Delivery to Satisfy Users
Supercharge Application Delivery to Satisfy UsersSupercharge Application Delivery to Satisfy Users
Supercharge Application Delivery to Satisfy Users
 
Paul Angus - what's new in ACS 4.11
Paul Angus - what's new in ACS 4.11Paul Angus - what's new in ACS 4.11
Paul Angus - what's new in ACS 4.11
 
Serverless and Servicefull Applications - Where Microservices complements Ser...
Serverless and Servicefull Applications - Where Microservices complements Ser...Serverless and Servicefull Applications - Where Microservices complements Ser...
Serverless and Servicefull Applications - Where Microservices complements Ser...
 
Benchmarking NGINX for Accuracy and Results
Benchmarking NGINX for Accuracy and ResultsBenchmarking NGINX for Accuracy and Results
Benchmarking NGINX for Accuracy and Results
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
 
Analyzing NGINX Logs with Datadog
Analyzing NGINX Logs with DatadogAnalyzing NGINX Logs with Datadog
Analyzing NGINX Logs with Datadog
 

Destacado

Load balancing at tuenti
Load balancing at tuentiLoad balancing at tuenti
Load balancing at tuenti
Ricardo Bartolomé
 
Velocity 2011 - Our first DDoS attack
Velocity 2011 - Our first DDoS attackVelocity 2011 - Our first DDoS attack
Velocity 2011 - Our first DDoS attack
Cosimo Streppone
 
ChinaNetCloud Training - HAProxy Intro
ChinaNetCloud Training - HAProxy IntroChinaNetCloud Training - HAProxy Intro
ChinaNetCloud Training - HAProxy Intro
ChinaNetCloud
 
HAProxy
HAProxy HAProxy
HAProxy
Arindam Nayak
 
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuiteSuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
Nicolas Brousse
 
HAProxy tech talk
HAProxy tech talkHAProxy tech talk
HAProxy tech talk
icebourg
 
AWS Customer Presentation - How TubeMogul uses AWS
AWS Customer Presentation - How TubeMogul uses AWSAWS Customer Presentation - How TubeMogul uses AWS
AWS Customer Presentation - How TubeMogul uses AWS
Amazon Web Services
 
Web Server Load Balancer
Web Server Load BalancerWeb Server Load Balancer
Web Server Load Balancer
MobME Technical
 
HAProxy scale out using open source
HAProxy scale out using open sourceHAProxy scale out using open source
HAProxy scale out using open source
Ingo Walz
 
Adform webinar: New Features
Adform webinar: New FeaturesAdform webinar: New Features
Adform webinar: New Features
AdformMarketing
 
Mobile DSPs Landscape
Mobile DSPs LandscapeMobile DSPs Landscape
Mobile DSPs Landscape
Soko Media
 
2015Media kit-E
2015Media kit-E2015Media kit-E
2015Media kit-E
George Zhang
 
Artem Makarov, Business Development Russia, Trademob
Artem Makarov, Business Development Russia, TrademobArtem Makarov, Business Development Russia, Trademob
Artem Makarov, Business Development Russia, Trademob
anastasiaalikova
 
Matomy Mobile
Matomy MobileMatomy Mobile
Matomy Mobile
Simcha1234
 
DSPs, RTB and Programmatic Buying: What Is It and Why Should I Care?
 DSPs, RTB and Programmatic Buying: What Is It and Why Should I Care? DSPs, RTB and Programmatic Buying: What Is It and Why Should I Care?
DSPs, RTB and Programmatic Buying: What Is It and Why Should I Care?
Ionova
 

Destacado (15)

Load balancing at tuenti
Load balancing at tuentiLoad balancing at tuenti
Load balancing at tuenti
 
Velocity 2011 - Our first DDoS attack
Velocity 2011 - Our first DDoS attackVelocity 2011 - Our first DDoS attack
Velocity 2011 - Our first DDoS attack
 
ChinaNetCloud Training - HAProxy Intro
ChinaNetCloud Training - HAProxy IntroChinaNetCloud Training - HAProxy Intro
ChinaNetCloud Training - HAProxy Intro
 
HAProxy
HAProxy HAProxy
HAProxy
 
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuiteSuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
SuiteWorld16: Mega Volume - How TubeMogul Leverages NetSuite
 
HAProxy tech talk
HAProxy tech talkHAProxy tech talk
HAProxy tech talk
 
AWS Customer Presentation - How TubeMogul uses AWS
AWS Customer Presentation - How TubeMogul uses AWSAWS Customer Presentation - How TubeMogul uses AWS
AWS Customer Presentation - How TubeMogul uses AWS
 
Web Server Load Balancer
Web Server Load BalancerWeb Server Load Balancer
Web Server Load Balancer
 
HAProxy scale out using open source
HAProxy scale out using open sourceHAProxy scale out using open source
HAProxy scale out using open source
 
Adform webinar: New Features
Adform webinar: New FeaturesAdform webinar: New Features
Adform webinar: New Features
 
Mobile DSPs Landscape
Mobile DSPs LandscapeMobile DSPs Landscape
Mobile DSPs Landscape
 
2015Media kit-E
2015Media kit-E2015Media kit-E
2015Media kit-E
 
Artem Makarov, Business Development Russia, Trademob
Artem Makarov, Business Development Russia, TrademobArtem Makarov, Business Development Russia, Trademob
Artem Makarov, Business Development Russia, Trademob
 
Matomy Mobile
Matomy MobileMatomy Mobile
Matomy Mobile
 
DSPs, RTB and Programmatic Buying: What Is It and Why Should I Care?
 DSPs, RTB and Programmatic Buying: What Is It and Why Should I Care? DSPs, RTB and Programmatic Buying: What Is It and Why Should I Care?
DSPs, RTB and Programmatic Buying: What Is It and Why Should I Care?
 

Similar a USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month

Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
javier ramirez
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
Ruslan Meshenberg
 
Netty training
Netty trainingNetty training
Netty training
Marcelo Serpa
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE Architectures
Alexander Penev
 
Netty training
Netty trainingNetty training
Boyan Krosnov - Building a software-defined cloud - our experience
Boyan Krosnov - Building a software-defined cloud - our experienceBoyan Krosnov - Building a software-defined cloud - our experience
Boyan Krosnov - Building a software-defined cloud - our experience
ShapeBlue
 
introduction to micro services
introduction to micro servicesintroduction to micro services
introduction to micro services
Spyros Lambrinidis
 
Web後端技術的演變
Web後端技術的演變Web後端技術的演變
Web後端技術的演變
inwin stack
 
20141111_SOS3_Gallo
20141111_SOS3_Gallo20141111_SOS3_Gallo
20141111_SOS3_Gallo
Andrea Gallo
 
Accelerated SDN in Azure
Accelerated SDN in AzureAccelerated SDN in Azure
Accelerated SDN in Azure
Open Networking Summit
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016
aspyker
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016
Sharma Podila
 
The architecture of SkySQL
The architecture of SkySQLThe architecture of SkySQL
The architecture of SkySQL
MariaDB plc
 
Summer 2017 undergraduate research powerpoint
Summer 2017 undergraduate research powerpointSummer 2017 undergraduate research powerpoint
Summer 2017 undergraduate research powerpoint
Christopher Dubois
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015
aspyker
 
How Sysbee Manages Infrastructures and Provides Advanced Monitoring by Using ...
How Sysbee Manages Infrastructures and Provides Advanced Monitoring by Using ...How Sysbee Manages Infrastructures and Provides Advanced Monitoring by Using ...
How Sysbee Manages Infrastructures and Provides Advanced Monitoring by Using ...
InfluxData
 
NetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & ContainersNetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & Containers
aspyker
 
OpenStack Best Practices and Considerations - terasky tech day
OpenStack Best Practices and Considerations  - terasky tech dayOpenStack Best Practices and Considerations  - terasky tech day
OpenStack Best Practices and Considerations - terasky tech day
Arthur Berezin
 
Easy Microservices with JHipster - Devoxx BE 2017
Easy Microservices with JHipster - Devoxx BE 2017Easy Microservices with JHipster - Devoxx BE 2017
Easy Microservices with JHipster - Devoxx BE 2017
Deepu K Sasidharan
 
Devoxx Belgium 2017 - easy microservices with JHipster
Devoxx Belgium 2017 - easy microservices with JHipsterDevoxx Belgium 2017 - easy microservices with JHipster
Devoxx Belgium 2017 - easy microservices with JHipster
Julien Dubois
 

Similar a USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month (20)

Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Netty training
Netty trainingNetty training
Netty training
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE Architectures
 
Netty training
Netty trainingNetty training
Netty training
 
Boyan Krosnov - Building a software-defined cloud - our experience
Boyan Krosnov - Building a software-defined cloud - our experienceBoyan Krosnov - Building a software-defined cloud - our experience
Boyan Krosnov - Building a software-defined cloud - our experience
 
introduction to micro services
introduction to micro servicesintroduction to micro services
introduction to micro services
 
Web後端技術的演變
Web後端技術的演變Web後端技術的演變
Web後端技術的演變
 
20141111_SOS3_Gallo
20141111_SOS3_Gallo20141111_SOS3_Gallo
20141111_SOS3_Gallo
 
Accelerated SDN in Azure
Accelerated SDN in AzureAccelerated SDN in Azure
Accelerated SDN in Azure
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016
 
The architecture of SkySQL
The architecture of SkySQLThe architecture of SkySQL
The architecture of SkySQL
 
Summer 2017 undergraduate research powerpoint
Summer 2017 undergraduate research powerpointSummer 2017 undergraduate research powerpoint
Summer 2017 undergraduate research powerpoint
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015
 
How Sysbee Manages Infrastructures and Provides Advanced Monitoring by Using ...
How Sysbee Manages Infrastructures and Provides Advanced Monitoring by Using ...How Sysbee Manages Infrastructures and Provides Advanced Monitoring by Using ...
How Sysbee Manages Infrastructures and Provides Advanced Monitoring by Using ...
 
NetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & ContainersNetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & Containers
 
OpenStack Best Practices and Considerations - terasky tech day
OpenStack Best Practices and Considerations  - terasky tech dayOpenStack Best Practices and Considerations  - terasky tech day
OpenStack Best Practices and Considerations - terasky tech day
 
Easy Microservices with JHipster - Devoxx BE 2017
Easy Microservices with JHipster - Devoxx BE 2017Easy Microservices with JHipster - Devoxx BE 2017
Easy Microservices with JHipster - Devoxx BE 2017
 
Devoxx Belgium 2017 - easy microservices with JHipster
Devoxx Belgium 2017 - easy microservices with JHipsterDevoxx Belgium 2017 - easy microservices with JHipster
Devoxx Belgium 2017 - easy microservices with JHipster
 

Más de Nicolas Brousse

<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...
<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...
<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...
Nicolas Brousse
 
Improving Adobe Experience Cloud Services Dependability with Machine Learning
Improving Adobe Experience Cloud Services Dependability with Machine LearningImproving Adobe Experience Cloud Services Dependability with Machine Learning
Improving Adobe Experience Cloud Services Dependability with Machine Learning
Nicolas Brousse
 
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...
Nicolas Brousse
 
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...
Nicolas Brousse
 
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStack
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStackAdobe Advertising Cloud: The Reality of Cloud Bursting with OpenStack
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStack
Nicolas Brousse
 
Improving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetImproving Operations Efficiency with Puppet
Improving Operations Efficiency with Puppet
Nicolas Brousse
 
Scaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced EnvironmentScaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced Environment
Nicolas Brousse
 
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Nicolas Brousse
 
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
Nicolas Brousse
 
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Nicolas Brousse
 

Más de Nicolas Brousse (10)

<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...
<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...
<Programming> 2019 - ICW'19: The Issue of Monorepo and Polyrepo In Large Ente...
 
Improving Adobe Experience Cloud Services Dependability with Machine Learning
Improving Adobe Experience Cloud Services Dependability with Machine LearningImproving Adobe Experience Cloud Services Dependability with Machine Learning
Improving Adobe Experience Cloud Services Dependability with Machine Learning
 
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...
IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability o...
 
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...
PuppetConf 2017 | Adobe Advertising Cloud: A Lean Puppet Workflow to Support ...
 
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStack
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStackAdobe Advertising Cloud: The Reality of Cloud Bursting with OpenStack
Adobe Advertising Cloud: The Reality of Cloud Bursting with OpenStack
 
Improving Operations Efficiency with Puppet
Improving Operations Efficiency with PuppetImproving Operations Efficiency with Puppet
Improving Operations Efficiency with Puppet
 
Scaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced EnvironmentScaling Bleeding Edge Technology in a Fast-paced Environment
Scaling Bleeding Edge Technology in a Fast-paced Environment
 
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
Scaling on EC2 in a fast-paced environment (LISA'11 - Full Paper)
 
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
Bringing Business Awareness to Your Operation Team (Nagios World Conference 2...
 
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
Optimizing your Monitoring and Trending tools for the Cloud (Nagios World Con...
 

Último

Embedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoringEmbedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoring
IJECEIAES
 
artificial intelligence and data science contents.pptx
artificial intelligence and data science contents.pptxartificial intelligence and data science contents.pptx
artificial intelligence and data science contents.pptx
GauravCar
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
IJECEIAES
 
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
Yasser Mahgoub
 
Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...
Prakhyath Rai
 
Material for memory and display system h
Material for memory and display system hMaterial for memory and display system h
Material for memory and display system h
gowrishankartb2005
 
CEC 352 - SATELLITE COMMUNICATION UNIT 1
CEC 352 - SATELLITE COMMUNICATION UNIT 1CEC 352 - SATELLITE COMMUNICATION UNIT 1
CEC 352 - SATELLITE COMMUNICATION UNIT 1
PKavitha10
 
Certificates - Mahmoud Mohamed Moursi Ahmed
Certificates - Mahmoud Mohamed Moursi AhmedCertificates - Mahmoud Mohamed Moursi Ahmed
Certificates - Mahmoud Mohamed Moursi Ahmed
Mahmoud Morsy
 
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by AnantLLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
Anant Corporation
 
Curve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods RegressionCurve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods Regression
Nada Hikmah
 
The Python for beginners. This is an advance computer language.
The Python for beginners. This is an advance computer language.The Python for beginners. This is an advance computer language.
The Python for beginners. This is an advance computer language.
sachin chaurasia
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
bijceesjournal
 
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have oneISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
Las Vegas Warehouse
 
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
171ticu
 
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
IJECEIAES
 
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
ydzowc
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
Madan Karki
 
Introduction to AI Safety (public presentation).pptx
Introduction to AI Safety (public presentation).pptxIntroduction to AI Safety (public presentation).pptx
Introduction to AI Safety (public presentation).pptx
MiscAnnoy1
 
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
171ticu
 
Computational Engineering IITH Presentation
Computational Engineering IITH PresentationComputational Engineering IITH Presentation
Computational Engineering IITH Presentation
co23btech11018
 

Último (20)

Embedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoringEmbedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoring
 
artificial intelligence and data science contents.pptx
artificial intelligence and data science contents.pptxartificial intelligence and data science contents.pptx
artificial intelligence and data science contents.pptx
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
 
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
 
Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...
 
Material for memory and display system h
Material for memory and display system hMaterial for memory and display system h
Material for memory and display system h
 
CEC 352 - SATELLITE COMMUNICATION UNIT 1
CEC 352 - SATELLITE COMMUNICATION UNIT 1CEC 352 - SATELLITE COMMUNICATION UNIT 1
CEC 352 - SATELLITE COMMUNICATION UNIT 1
 
Certificates - Mahmoud Mohamed Moursi Ahmed
Certificates - Mahmoud Mohamed Moursi AhmedCertificates - Mahmoud Mohamed Moursi Ahmed
Certificates - Mahmoud Mohamed Moursi Ahmed
 
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by AnantLLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
 
Curve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods RegressionCurve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods Regression
 
The Python for beginners. This is an advance computer language.
The Python for beginners. This is an advance computer language.The Python for beginners. This is an advance computer language.
The Python for beginners. This is an advance computer language.
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
 
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have oneISPM 15 Heat Treated Wood Stamps and why your shipping must have one
ISPM 15 Heat Treated Wood Stamps and why your shipping must have one
 
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
 
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
 
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
 
spirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptxspirit beverages ppt without graphics.pptx
spirit beverages ppt without graphics.pptx
 
Introduction to AI Safety (public presentation).pptx
Introduction to AI Safety (public presentation).pptxIntroduction to AI Safety (public presentation).pptx
Introduction to AI Safety (public presentation).pptx
 
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
 
Computational Engineering IITH Presentation
Computational Engineering IITH PresentationComputational Engineering IITH Presentation
Computational Engineering IITH Presentation
 

USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month

  • 1. How TubeMogul Handles over One Trillion HTTP Requests a Month November 12th , 2015 Nicolas Brousse | Sr. Director Of Operations Engineering | nicolas@tubemogul.com
  • 2. Who are we? TubeMogul ● Enterprise software company for digital branding ● Over 27 Billions Ads served in 2014 ● Over 30 Billions Ad Auctions per day ● Bid processed in less than 50 ms ● Bid served in less than 80 ms (include network round trip) ● 5 PB of monthly video traffic served
  • 3. Who are we? Operations Engineering ● Ensure the smooth day to day operation of the platform infrastructure ● Provide a cost effective and cutting edge infrastructure ● Team composed of SREs, SEs and DBAs ● Managing over 2,500 servers (virtual and physical)
  • 4. Our Infrastructure Multiple locations with a mix of Public Cloud and On Premises
  • 5. ● Java (a lot!) ● MySQL ● Couchbase ● Vertica ● Kafka ● Storm ● Zookeeper, Exhibitor ● Hadoop, HBase, Hive ● Terracotta ● ElasticSearch, Logstash, Kibana ● Varnish ● PHP, Python, Ruby, Go... ● Apache httpd ● Nagios ● Ganglia Technology Hoarders ● Graphite ● Memcached ● Puppet ● HAproxy ● OpenStack ● Git and Gerrit ● Gor ● ActiveMQ ● OpenLDAP ● Redis ● Blackbox ● Jenkins, Sonar ● Tomcat ● Jetty (embedded) ● AWS DynamoDB, EC2, S3...
  • 8. ● Tight day to day operations ● Configuration Management and Automation ● Change Management with Peer Review and CI ● Measure and Monitor a lot How do we manage all this?
  • 9. OnCall Team Process Request based on Dashboards, Monitoring, Paging or Engineers. Ticket categorized in two swimlanes: ● Production Support ○ High Priority: Top to Bottom ○ On-call 24/7 (follow the sun) ○ Incident are handled 1st ○ Maintenance are handled 2nd ● Developer Support ○ Best Effort: Top to Bottom ○ Long effort request moved to Infrastructure pipeline
  • 10. ● Large Nagios installation ● Introducing Sensu for scalability and as an easy Monitoring API for Developers ● Centralized OnCall Dashboard Alerting
  • 11. ● TL;DR doesn't matter as long as you keep the flexibility for your dev team ● We leverage AWS for many different workload and scenarios ○ Using EC2, DynamoDB, SQS, SES, SNS, RDS, SWF, etc. ○ Workload varies from ephemeral computes to always on ● We moved part of our low latency dependent workload out of AWS to our On Premises Cloud ○ Data Center proximity to key partners ○ Performance Investigation and Tuning ○ Network Visibility ○ Business Accountability Which Cloud Provider? Private or Public?
  • 13. Challenge of Low Latency Globally ● Geo based DNS isn't based on network performance ● Proximity to user is key ○ Reduce Latency of standard TCP Handshake ○ Reduce Latency of SSL Handshake ● Mobile Networks... ● Require a global footprint ● Large footprint means unlikely to benefit from TLS session resumption How to ensure pixel delivery at 50ms globally on the 50th percentile while keeping a small server footprint?
  • 14. Leverage CDN for Fast Pixel Collection at the Edge
  • 15. Further Improvement ● Leverage CDN capabilities even further ○ First layer of protection against DDoS ○ Fast.ly VCL is very powerful ● Evaluate routing solution based on RUM (Cedexis) ● Evaluate smarter DNS routing (NS1)
  • 16. ● Round Robin DNS is great ○ Until your DNS entries are too large and client start using DNS thru TCP ● In US West, we went from 31 EC2 instances (c3.2xlarge) to two SuperMicro servers ○ 32 Cores E5-2667 v3 @ 3.20GHz and 128 GB RAM ○ Use baremetal and leveraging VLAN to access OpenStack Tenant ● Managing SSL session is the most consuming in our workload (CPU and RAM) ○ A TLS connection can use up to 64Kb RAM ● CPU Pinning for network interrupts (4 Core), HAproxy (28 Core) ○ Disable irqbalance ○ Various sysctl config tuning (TCP, VM) ● One frontend for HTTP and HTTPS ● Crossdomain.xml files are served directly by HAproxy (no call to backend) ● All logs sent directly in json to ELK ● Home made process (HAvOC) to generate config and scaling of backend Load Balancing with HAproxy
  • 17. ● Ganglia / Graphite / Grafana / ELK Graphing and Logging As A Service
  • 18. Network Visibility: Catchpoint Monitor from multiple location globally, complex test, trace routes, alerts, etc.
  • 19. Network Visibility: Dyn Internet Intelligence Alerts on bgp route changes, prefix changes, latency variation, internet disruptions, etc.
  • 20. ● 2008 - 2010: Use SVN, Bash scripts and custom templates. ● 2010: Managing about 250 instances. Start looking at Puppet. ● 2011: Puppet 0.25 then 2.7 by EOY on 400 servers with 2 contributors. ● 2012: 800 servers managed by Puppet. 4 contributors. ● 2013: 1,000 servers managed by Puppet. 6 contributors. ● 2014: 1,500 servers managed by Puppet. Introduced Continuous Delivery Workflow. 9 contributors. Start 3.7 migration. ● 2015: 2,000 servers managed by Puppet. 13 contributors. Five Years Of Puppet!
  • 21. ● 2000 nodes ● 225 unique nodes definition ● 1 puppetmaster ● 112 Puppet modules Puppet Stats
  • 22. ● Virtual and Physical Servers Configuration : Master mode ● Building AWS AMI with Packer : Master mode ● Local development environment with Vagrant : Master mode ● OpenStack deployment : Masterless mode Where and how do we use Puppet ?
  • 23. Infrastructure As Code: Code Review?
  • 24. ● Gerrit, an industry standard : Eclipse, Google, Chromium, OpenStack, WikiMedia, LibreOffice, Spotify, GlusterFS, etc... ● Fine Grained Permissions Rules ● Plugged to LDAP ● Code Review per commit ● Stream Events ● Use GitBlit ● Integrated with Jenkins and Jira ● Managing about 600 Git repositories A Powerful Gerrit Integration
  • 25. ● 1 job per module ● 1 job for the manifests and hiera data ● 1 job for the Puppet fileserver ● 1 job to deploy Continuous Delivery with Jenkins Global Jenkins stats for the past year ● ~10,000 Puppet deployment ● Over 8,500 Production App Deployment
  • 26. Plugin : github.com/jenkinsci/job-dsl-plugin ● Automate the jobs creation ● Ensure a standard across all the jobs ● Versioned the configuration ● Apply changes to all your jobs without pain ● Test your configuration changes Jenkins job DSL : code your Jenkins jobs
  • 27. Team Awareness: HipChat Integration with Hubot
  • 28. Team Awareness: HipChat Integration with more bots!
  • 30. All This Wouldn't Be Possible Without a Strong Team. Thank You. OPS @ TubeMogul SRE Aleksey Mykhailov Oleg Galitskiy Brandon Rochon Stan Rudenko Julien Fabre Joseph Herlant SE Alan Barnes Aleksander Stepanov Matt Cupples Yurii Rochniak Yurii Varvynets Manasi Limbachiya DBA Alina Alexeeva Cloud Engineer Mykola Mogylenko Pierre Gohon Pierre Grandin