Docker makes it easy to package and launch code onto a virtual machine. But once you scale your container across multiple machines, or even multiple AWS Regions, how do you efficiently manage container traffic, resource utilization, security, and code changes? In this session, we feature best practices and real-world examples of customers who deployed containerized apps at scale. We include strategies for maximizing cost efficiency across various traffic patterns and implementing a granular access control mechanism for your container infrastructure.
Going Big with Containers: Customer Case Studies of Large-Scale Deployments - ENT209 - re:Invent 2017
1. Going Big With Containers
Customer Case Studies of Large-Scale Deployments
Matt Callanan – Engineering Manager - Expedia
mcallanan@expedia.com
linkedin.com/in/matthewcallanan
@mcallana
ENT209
November 29, 2017
2. Going Big With Containers
Large-Scale Deployments with Amazon ECS
3. Matt Callanan
Engineering Manager / Tech Lead
“Cloud Acceleration Team”
Expedia
Brisbane, Australia
• mcallanan@expedia.com
• linkedin.com/in/matthewcallanan
• @mcallana
26. Time Value of Information
• “A piece of information is worth
more now than it is tomorrow”
• If every commit is a
hypothesis, how much is
verifying that commit worth
now as opposed to later?
30. User enters the details
of their app into
Primer Web App for
new app creation
Creates Dockerfile
and repo in private
docker registry
31. Primer Application Creation
• Within 10 minutes:
• Application code repository created
• Continuous Delivery pipeline created
• Docker repository created
• Application built as a Docker image
• Application deployed to a prod-like environment
33. How Long Does Feedback Take In a Monolith?
[Figure panels: Monolith with 10x release cycle; Microservice]
34. Why is Fast Feedback Important?
• Most Likely to Fail
o 68% Industry Failure Rate
• 10x cycle time = 1/10th success rate
o Monolith: 0.32/1 Feature
o Microservices: 3.2/10 Features
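The arithmetic behind these figures can be sketched as follows. This is a minimal illustration, assuming each feature succeeds independently with probability 1 - 0.68 = 0.32 and that the microservice team ships ten features in the time the monolith ships one; the function name is hypothetical.

```python
FAILURE_RATE = 0.68  # industry failure rate quoted on the slide


def expected_successes(features_shipped: int, failure_rate: float = FAILURE_RATE) -> float:
    """Expected number of successful features, assuming independent outcomes."""
    return features_shipped * (1.0 - failure_rate)


# Same time window: the monolith ships 1 feature, microservices ship 10.
monolith = expected_successes(1)
microservices = expected_successes(10)
print(f"Monolith: {monolith:.2f}/1 feature")
print(f"Microservices: {microservices:.1f}/10 features")
```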
43. Typical Deployment Pipeline
[Diagram, labels recovered from the slide:]
• Source Code (Git)
• Commit Build: Compile; Build artifacts (jar, zip, etc.); Build Docker Image (based on Primer template base image)
• Base Docker image: DropWizard, Springboot, Scalatra, Sinatra, ExpressJS, Go, etc.
• App Config: Env-specific configuration, Metadata
• Docker Registry; Application Docker image
• Environments, each with Deploy, Smoke Tests and Release steps: Test Deployment, Integration, Stress, Production Region 1, Production Region 2, Production Region 3, …
44. Blue-Green Deploys
• Split releases into “deploy” and “release” steps
• Allows for testing between deploy and release
1. Deploy a “Canary”
2. Release live upgrade using ECS implicit blue-green
replacement
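The split between "deploy" and "release" can be sketched as a small control flow. This is an illustration only, not Expedia's tooling: the four callables are hypothetical stand-ins for whatever actually deploys the canary, tests it, triggers the ECS replacement, and cleans up.

```python
from typing import Callable, List


def deploy_then_release(
    deploy_canary: Callable[[str], None],
    canary_passes_tests: Callable[[str], bool],
    release: Callable[[str], None],
    remove_canary: Callable[[str], None],
    version: str,
) -> bool:
    """Deploy a canary, test it, and release only if the tests pass."""
    deploy_canary(version)  # step 1: canary runs beside the live service
    try:
        if not canary_passes_tests(version):
            return False  # the live service is never touched
        release(version)  # step 2: live tasks are replaced
        return True
    finally:
        remove_canary(version)  # the canary is cleaned up in both outcomes


# Usage with stand-in callables that only record what happened.
events: List[str] = []
ok = deploy_then_release(
    deploy_canary=lambda v: events.append(f"deploy canary {v}"),
    canary_passes_tests=lambda v: True,
    release=lambda v: events.append(f"release {v}"),
    remove_canary=lambda v: events.append(f"remove canary {v}"),
    version="v2",
)
```

Keeping the test step between the two calls is what lets a failed canary stop the rollout before any live task is replaced.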
45. Blue-Green Deploys with Canary
[Diagram: Live Traffic → Amazon Route 53 CNAME → Load Balancer → Amazon ECS: Live Service - v1]
46. Blue-Green Deploys with Canary
[Diagram: two stacks on Amazon ECS, each behind its own Amazon Route 53 CNAME and Load Balancer. Live Traffic reaches Live Service - v1; Testing reaches Canary - v2]
47. Blue-Green Deploys with Canary
[Diagram: the same two stacks on Amazon ECS, Canary - v2 beside Live Service - v1, with Live Traffic; the Testing label is gone]
48. Blue-Green Deploys with Canary
[Diagram: Canary - v2 still deployed; Live Service shows v1 and v2 together, with Live Traffic]
49. Blue-Green Deploys with Canary
[Diagram: Canary - v2 and Live Service - v2, with Live Traffic]
50. Blue-Green Deploys with Canary
[Diagram: Live Traffic → Amazon Route 53 CNAME → Load Balancer → Amazon ECS: Live Service - v2; the canary stack is no longer shown]
55. Application Stack - Single Region
[Diagram: Amazon Route 53 CNAME → Classic Load Balancer → Amazon ECS Service]
56. Multi-Region Traffic Management
[Diagram: Internet traffic passes through Traffic Rules (Geo, Fixed) to App A in Region 1, Region 2, … Region N; in each region App A is an Amazon Route 53 CNAME → Classic Load Balancer → Amazon ECS Service stack]
57. Intra-Region Service Discovery
[Diagram: Region 1 contains App A, App B and App C, each an Amazon Route 53 CNAME → Classic Load Balancer → Amazon ECS Service stack. Other labels: Internet, Public Apps, Private Apps]
58. Platform: Deployment Automation
BENEFITS:
• Speed: Many manual steps reduced to the click of a button
• Safety: Repeatable, reliable
Application Creation
• Microservice Generation
Deployment Automation
• Deployment Pipeline / Blue-Green Deploys
• Auto-Scaling
• Security
• Logging
• Traffic Management / Service Discovery
Cluster Management
• ECS Cluster Creation / Immutable Servers / Auto-Scaling
• Zero-Downtime Upgrades
• Monitoring
• Right-Sizing
AWS
• ECS, EC2, VPC, IAM, CloudWatch, CloudFormation, AutoScaling, Route 53, ELB, Lambda, SNS, Support
62. Immutable Servers
[Diagram: layers of the Golden AMI]
• ecs-optimized AMI (docker, ecs-agent): Amazon-provided Base AMI
• Expedia standard image: Standard Chef cookbook
• Docker Config, Daemon containers: Custom setup baked into AMI
63. Immutable Servers
[Diagram: Golden AMI → Cluster Instance]
• Golden AMI: ecs-optimized AMI (docker, ecs-agent), Expedia standard image, Docker Config, Daemon containers
• Cluster Instance: the same layers as the Golden AMI, plus custom bootstrap:
  • ECS Cluster Config
  • Start ECS Agent, Docker
  • Cron: Restart ECS agent
  • Cron: Custom Metrics
65. “PRISM” Goals
• Safety: Zero-downtime for applications as their workloads get relocated onto new instances
• Speed: Complete as fast as possible
• Rollbackable: Quickly retreat back to known-good state if anything goes wrong
• Idempotent: Resumable if anything goes wrong
• Avoid “thundering herd” scenario:
  • Drain in batches to prevent burden on Docker registry and network
  • Avoid having tasks relocated to instances about to be drained
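Two of these goals, batching and resumability, can be sketched with a small planning function. This is an illustration under stated assumptions, not the PRISM implementation: the function name, the instance IDs, and the idea of tracking a set of already-drained instances are all hypothetical, and the sketch does not model task placement.

```python
from typing import Iterable, List, Sequence


def plan_drain(
    old_instances: Sequence[str],
    already_drained: Iterable[str],
    batch_size: int,
) -> List[List[str]]:
    """Split the instances still to be drained into ordered batches.

    Skipping instances that were already drained makes the plan safe to
    recompute after an interruption (resumable), and batching limits how many
    relocated tasks pull images from the Docker registry at the same time.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    done = set(already_drained)
    remaining = [i for i in old_instances if i not in done]
    return [remaining[s:s + batch_size] for s in range(0, len(remaining), batch_size)]


# Example: five old instances, one drained before an interruption.
batches = plan_drain(["i-1", "i-2", "i-3", "i-4", "i-5"], {"i-1"}, batch_size=2)
print(batches)  # [['i-2', 'i-3'], ['i-4', 'i-5']]
```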
74. Things to Monitor
• ECS Instances - Memory, CPU, Disk
• ECS Clusters - Memory, CPU Reservation
• Auto-Scaling Groups - Current vs Maximum
• Build & Deployment (Jenkins) Servers & Nodes -
Memory, CPU, Disk
• Logging Servers - Memory, CPU, Disk
• Docker Registry - Memory, CPU, Disk
75. Monitoring: Flow of Metrics
[Diagram, labels recovered from the slide:]
• Sources: EC2 Instances, Auto Scaling group
• Metrics: ECS agent metrics, Extended CloudWatch metrics, Cron job custom metrics
• Amazon CloudWatch
• Jenkins job pulls metrics periodically
• Grafana pulls from CloudWatch
• Grafana pulls from Graphite
77. Right-Sizing Instances
Aim: Balance the CPU and Memory reservation for applications along the ratios of CPU-to-Memory resources available on the instance
• c4.4xlarge: 30GiB RAM, 16 CPU Cores
• r4.2xlarge: 61GiB RAM, 8 CPU Cores
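One way to reason about these ratios is to compare memory per CPU core. The sketch below uses the two instance sizes from the slide; the selection function and the workload figures in the usage lines are hypothetical and only illustrate the idea of matching a cluster's reserved ratio to an instance type.

```python
# Instance sizes as listed on the slide.
INSTANCE_TYPES = {
    "c4.4xlarge": {"memory_gib": 30, "cpu_cores": 16},
    "r4.2xlarge": {"memory_gib": 61, "cpu_cores": 8},
}


def memory_per_core(memory_gib: float, cpu_cores: float) -> float:
    """GiB of memory available (or reserved) per CPU core."""
    return memory_gib / cpu_cores


def closest_instance_type(reserved_memory_gib: float, reserved_cpu_cores: float) -> str:
    """Pick the instance type whose memory-per-core ratio best matches the workload."""
    target = memory_per_core(reserved_memory_gib, reserved_cpu_cores)
    return min(
        INSTANCE_TYPES,
        key=lambda name: abs(memory_per_core(**INSTANCE_TYPES[name]) - target),
    )


for name, spec in INSTANCE_TYPES.items():
    print(name, memory_per_core(**spec), "GiB per core")

# Hypothetical workloads: total reservations across all services in a cluster.
print(closest_instance_type(reserved_memory_gib=40, reserved_cpu_cores=20))
print(closest_instance_type(reserved_memory_gib=70, reserved_cpu_cores=10))
```

A workload whose ratio sits far from every available instance type leaves one resource stranded, which is the imbalance the slide aims to avoid.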
78. Largest Production Cluster – CPU Reservation
230 Instances 480 Services 3,200 Containers
12% CPU Utilization
64% CPU Reservation
c3vis Open Source: https://github.com/ExpediaDotCom/c3vis
79. Largest Production Cluster – Memory Reservation
230 Instances 480 Services 3,200 Containers
13% Memory Utilization
29% Memory Reservation
c3vis Open Source: https://github.com/ExpediaDotCom/c3vis
81. Platform Building Blocks
BENEFITS:
• Speed: Pre-built ECS clusters means no EC2 instance startup time at deploy & autoscaling time
• Speed: Docker only pulls the layers it needs for images with common hierarchy
• Safety: Immutable Servers gives confidence no configuration drift on production infrastructure
• Scale: Clusters automatically scale horizontally to match workload
Application Creation
• Microservice Generation
Deployment Automation
• Deployment Pipeline / Blue-Green Deploys
• Auto-Scaling
• Security
• Logging
• Traffic Management / Service Discovery
Cluster Management
• ECS Cluster Creation / Immutable Servers / Auto-Scaling
• Zero-Downtime Upgrades
• Monitoring
• Right-Sizing
AWS
• ECS, EC2, VPC, IAM, CloudWatch, CloudFormation, AutoScaling, Route 53, ELB, Lambda, SNS, Support
97. Lesson: True Blue-Green Deploys
ECS simulates blue-green deploys for each service behind load-balancer
• Benefit: Don’t need to warm up the load-balancer for each release
• Downside: Need to recreate load-balancer if modifying active listener - involves downtime
• Downside: Can’t send some traffic to old tasks and some to new tasks for load testing
Some aspects of ELBs are immutable:
• ELB Scheme (e.g. “internet-facing”)
Some aspects of ECS-ELB integration are immutable:
• Once ECS service created, can’t assign different load-balancer
• ELB Listeners associated with containers can’t be removed
Recreating ELB with different configuration necessitates recreating ECS service
102. Desired Blue-Green Deploys
• Bleed Traffic at 10% intervals using weighted CNAMEs
• Load Testing with live traffic
• Allows: Rollback to known good (v1)
• Allows: New ELB settings
• Requires: Warming up ELB
[Diagram: Live Traffic → Amazon Route 53 CNAME → two Load Balancers on Amazon ECS, one in front of Live Service - v1 and one in front of Live Service - v2]
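The weight schedule for bleeding traffic at 10% intervals can be sketched as below. It only computes the pairs of weights; applying them to DNS records is left out, and the function name is hypothetical. The weights are expressed as percentages here, whereas Route 53 weighted records treat weights as relative values.

```python
from typing import List, Tuple


def bleed_schedule(step_percent: int = 10) -> List[Tuple[int, int]]:
    """Return (v1_weight, v2_weight) pairs that shift traffic from v1 to v2."""
    if step_percent <= 0 or 100 % step_percent != 0:
        raise ValueError("step_percent must be a positive divisor of 100")
    return [(100 - v2, v2) for v2 in range(0, 101, step_percent)]


schedule = bleed_schedule()
for v1_weight, v2_weight in schedule:
    print(f"v1={v1_weight:3d}%  v2={v2_weight:3d}%")

# Rolling back to known good (v1) means returning to the first entry.
rollback_weights = schedule[0]
```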
109. Lesson: Know Your Resource Limits
Ask nicely :)
Start ECS agent with
exponential back off
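Exponential back off can be sketched as follows. This is a generic illustration, not the script used on the cluster instances: `start` is a hypothetical callable that returns True once the agent has started, and the base delay, cap and attempt count are assumed values.

```python
import time
from typing import Callable, List


def backoff_delays(
    base_seconds: float = 1.0,
    cap_seconds: float = 60.0,
    attempts: int = 6,
) -> List[float]:
    """Delays that double after each failed attempt, capped at cap_seconds."""
    return [min(cap_seconds, base_seconds * (2 ** n)) for n in range(attempts)]


def start_with_backoff(
    start: Callable[[], bool],
    delays: List[float],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call start() until it succeeds, sleeping for the next delay after each failure."""
    for delay in delays:
        if start():
            return True
        sleep(delay)
    return False
```

Passing `sleep` as a parameter keeps the retry loop testable without real waiting.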
110. Lesson: Beware of Rate Limits
API Rate Limits
• The more ELBs and ECS services you have the more
ECS ELB traffic your account will have
• DescribeInstanceHealth API call
• Workaround: Shard your Cloud presence into Smaller
Accounts
112. Lesson: Avoid Auto-Scale Thrashing
Problem
1. ASG scales up due to high Memory Reservation
2. 5mins later ASG scales down due to low CPU Reservation
3. Repeat from #1
Solution #1: Fix scaling dimensions
• Scale Down only when both are low
Solution #2: Fix Ratios
• Match service resource ratios to instance type resource ratio
For now: Set scale down policies low
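Solution #1 can be sketched as a decision rule that scales up when either reservation is high but scales down only when both are low. The thresholds are hypothetical placeholders, and the function stands in for whatever alarms or policies drive the Auto Scaling group.

```python
def scaling_decision(
    cpu_reservation: float,
    memory_reservation: float,
    scale_up_above: float = 75.0,    # hypothetical threshold, percent
    scale_down_below: float = 40.0,  # hypothetical threshold, percent
) -> str:
    """Return 'scale_up', 'scale_down' or 'hold' for the given reservations."""
    if cpu_reservation > scale_up_above or memory_reservation > scale_up_above:
        return "scale_up"
    if cpu_reservation < scale_down_below and memory_reservation < scale_down_below:
        return "scale_down"
    return "hold"


# High memory, low CPU: scale up, as in step 1 of the problem.
print(scaling_decision(cpu_reservation=20, memory_reservation=80))
# Memory back in the middle, CPU still low: hold instead of scaling down again.
print(scaling_decision(cpu_reservation=20, memory_reservation=60))
# Both low: now it is safe to scale down.
print(scaling_decision(cpu_reservation=20, memory_reservation=30))
```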
115. Benefits of Microservice Platform on ECS
Application Creation - Microservice Generation
• Cost: Reduced cost of experimentation
• Speed: Fast feedback
Deployment Automation
• Speed: Many manual steps reduced to the click of a button
• Safety: Repeatable, reliable
Cluster Management
• Speed: Pre-built ECS clusters mean no EC2 instance startup time at deploy & auto-scaling time
• Speed: Docker only pulls the layers it needs for images with common hierarchy
• Safety: Immutable Servers give confidence that there is no configuration drift on production infrastructure
• Scale: Clusters automatically scale horizontally to match workload
120. Time Savings Per Deploy
• Primer: Dedicated EC2 instances with Chef-built AMIs:
• 30 minutes per deploy
• Primer 2.0: Docker on ECS:
• 3 minutes per deploy
• Receive feedback 27 minutes faster
122. 30 Developer Days Saved Every Business Day
• Support team = 8 people
• EC2 Deploy with AMI = 30mins
• ECS Deploy with Container = 3mins
• 27min saving per deploy × 524 builds = 29.5 dev days saved every day
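The 29.5 figure can be reproduced with the numbers on the slide, assuming the 524 builds are per day and a developer day is 8 working hours. The 8-hour day is an assumption made here to match the result; the slide does not state it.

```python
EC2_DEPLOY_MINUTES = 30  # deploy with AMI
ECS_DEPLOY_MINUTES = 3   # deploy with container
BUILDS_PER_DAY = 524     # build count from the slide, read as per day
HOURS_PER_DEV_DAY = 8    # assumption: not stated on the slide

saved_minutes = (EC2_DEPLOY_MINUTES - ECS_DEPLOY_MINUTES) * BUILDS_PER_DAY
dev_days_saved = saved_minutes / 60 / HOURS_PER_DEV_DAY

print(f"{saved_minutes} minutes saved per day")
print(f"{dev_days_saved:.1f} developer days saved per day")
```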
123. Opportunity Cost Savings
ECS EC2
• Pre-built fleet of clusters
• Quickly and safely run software as Docker images
• Reduced opportunity cost by 30x
124. Thanks! Questions?
Matt Callanan
mcallanan@expedia.com
linkedin.com/in/matthewcallanan
@mcallana
125. Image Attribution
Image
“Pipelines descending to Inveruglas Power Station” (http://www.geograph.org.uk/photo/2214366) is licensed under CC BY SA 2.0 (http://creativecommons.org/licenses/by-sa/2.0/) /
Desaturated and cropped from original
“The Future” (https://flic.kr/p/26YCn1) by Kristian Bjornard is licensed under CC BY SA 2.0 (https://creativecommons.org/licenses/by-sa/2.0/)
“CTA Loop Junction” (https://commons.wikimedia.org/wiki/File:CTA_loop_junction.jpg) by Daniel Schwen is licensed under CC BY SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0/)
“Logging operations at Millmoor Rig” (http://bit.ly/1Nb20LS) by Walter Baxter is licensed under CC BY SA 2.0 (https://creativecommons.org/licenses/by-sa/2.0/)
“Traffic Monitoring” (https://commons.wikimedia.org/wiki/File:Traffic_Monitoring.JPG) by Suryasuharman is licensed under CC BY SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0/)
“DNS logo” (https://commons.wikimedia.org/wiki/File:DNS_logo.jpg) by I laramide I is licensed under CC BY SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0/)
“Matrix-code-computer-pc-data” (https://pixabay.com/en/matrix-code-computer-pc-data-356024/) by Comfreak is licensed under CC ZERO
(https://creativecommons.org/publicdomain/zero/1.0/)
“Sample-color-blue-green” (https://pixabay.com/en/sample-color-blue-green-rubber-815141/ ) by LyraBelacqua-Sally is licensed under CC ZERO
(https://creativecommons.org/publicdomain/zero/1.0/)
“Fashion-wristwatch-time” (https://www.pexels.com/photo/fashion-wristwatch-time-watch-1252/) by SplitShire.com is licensed under CC ZERO
(https://creativecommons.org/publicdomain/zero/1.0/)
“Chat” (https://openclipart.org/detail/129049/chat) by Merlin2525 is licensed under unlimited-commercial-use (https://openclipart.org/unlimited-commercial-use-clipart)
“scales” (https://openclipart.org/detail/24101/scales) by scott_kirkwood is licensed under unlimited-commercial-use (https://openclipart.org/unlimited-commercial-use-clipart)
“Compiz GIT Repository” (https://flic.kr/p/Ssras) by -= Treviño =- is licensed under CC BY NC SA 2.0 (https://creativecommons.org/licenses/by-nc-sa/2.0)
“logs” (https://flic.kr/p/9F8tjX) by Rick Payette is licensed under CC BY NC ND 2.0 (https://creativecommons.org/licenses/by-nc-nd/2.0)
Docker logo used according to https://www.docker.com/brand-guidelines
Shipping Container Clip Art: https://pixabay.com/en/container-shipping-trucking-307872/ by Clker-Free-Vector-Images is licensed under CC ZERO
Computer Code: https://pixabay.com/en/binary-1-0-computer-code-zero-1066983/ by HypnoArt is licensed under CC ZERO