SmugMug spent six years split between its datacenters and AWS. Find out how and why SmugMug went 100% AWS, migrating 30 TB of databases, hundreds of frontends, load balancing, and caches across the US in one night with zero downtime.

We show you the specific techniques and processes that made our large-scale migration a resounding success: moving massive MySQL databases, testing and sizing a new AWS infrastructure, automating AWS operations, managing the risks involved in wholesale infrastructure change, and architecting for reliability across multiple AWS Availability Zones. We talk about the performance, scalability, operational, and business benefits and challenges we've seen since moving 100% to AWS. Finally, we share secrets about our favorite AWS products.
3. The early days of SmugMug
• Gradual bootstrapped growth
• Multiple self-managed datacenter cages
• Too many servers of varying types
• Too many disks
• Tons of valuable skilled employee hours spent in cages
Friday, November 15, 13
7. SmugMug <3 AWS
• Early adopter of Amazon S3
• Over the years, moved rendering, upload, archiving, payments, permissions, email, and more compute to AWS
• Before mid-2012, AWS offered no ultra-high-performance I/O
13. Our database I/O evolution: always cutting edge
• Started with MySQL on spinning-disk RAID, max RAM
• Moved to ZFS: SSD + SSD cache + spinning disks
• Moved to custom 24-SSD arrays
14. hi1.4xlarge FTW
• Our custom, obscure hardware => difficult to resolve problems, difficult to upgrade
• hi1 overall DB I/O performance comparable to 8 x SSD RAID 10
• < 3%/yr hi1 instance failure rate!
15. Amazon VPC - also a big win
• Easy mapping of internal/external network security model to AWS
19. Zero Downtime Move
• Flexibility of the AWS cloud makes a zero-downtime move inexpensive: pay only for what you use, and provision fast
• Plan
• Test
• Plan and test again
21. Major changes post-move
• Database storage goes from SSD to hi1.4xlarge ephemeral
• Hardware load balancers become Elastic Load Balancing (ELB)
• haproxy layer 7 load/traffic directing goes from static to dynamic config
• Web servers autoscale for each cluster
• Membase to ElastiCache (later to Amazon EC2)
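The "static to dynamic config" change for haproxy can be sketched roughly like this: a small script (hypothetical, not SmugMug's actual tooling) renders an haproxy backend stanza from the current set of autoscaled web servers, which in practice would be discovered via the EC2 API.

```python
# Hypothetical sketch: render an haproxy backend stanza from a list of
# web-server addresses. In practice the address list would come from the
# EC2 API (e.g. instances tagged for a given cluster), and haproxy would
# be reloaded after the rendered config file changes.

def render_backend(name, servers, port=80):
    """Render one haproxy 'backend' block.

    servers: iterable of (server_name, ip_address) pairs.
    """
    lines = [f"backend {name}", "    balance roundrobin"]
    for server_name, ip in servers:
        lines.append(f"    server {server_name} {ip}:{port} check")
    return "\n".join(lines)


if __name__ == "__main__":
    web_servers = [("web1", "10.0.1.10"), ("web2", "10.0.2.11")]
    print(render_backend("web_cluster", web_servers))
```

Regenerating this stanza whenever the autoscaled fleet changes is what turns a static config into a dynamic one.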
23. Zero Downtime Move Requirements
• Read-only site mode
• Traffic control: shadow load
• Cross-country MySQL replication + sufficient bandwidth
• Bot testing
• Read-only live-site testing w/ QA
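A cutover like this hinges on cross-country replication being fully caught up before the switch. A minimal sketch of that gate (a hypothetical helper, not SmugMug's actual code) checks the fields MySQL returns from `SHOW SLAVE STATUS` on the replica (MySQL 5.5-era naming):

```python
# Hypothetical cutover gate based on MySQL replication state. In
# practice `status` would be the row returned by `SHOW SLAVE STATUS`
# on the replica, fetched with any MySQL client.

def safe_to_cut_over(status, max_lag_seconds=0):
    """Return True only if both replication threads run and lag is low."""
    if status.get("Slave_IO_Running") != "Yes":
        return False
    if status.get("Slave_SQL_Running") != "Yes":
        return False
    lag = status.get("Seconds_Behind_Master")
    # Seconds_Behind_Master is NULL (None) when replication is broken.
    return lag is not None and lag <= max_lag_seconds
```

With the site in read-only mode, lag drains to zero and this check can be required to pass before masters are reassigned.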
24. More on moving
• Full-scale read-write testing is difficult
• Be aware of AWS limits
• Talk to support for big growth
• Rollback plan: manage risky change
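"Be aware of AWS limits" can be made concrete with a pre-flight headroom check. This sketch is hypothetical (the limit values are illustrative; real limits come from your account, via AWS support or the relevant describe APIs):

```python
# Hypothetical pre-flight check: compare planned resource counts against
# known account limits and report anything that needs a support ticket
# before a big migration. The limits dict is illustrative only.

def limits_needing_increase(planned, limits, headroom=1.2):
    """Return resources where planned usage * headroom exceeds the limit."""
    flagged = {}
    for resource, count in planned.items():
        limit = limits.get(resource)
        if limit is not None and count * headroom > limit:
            flagged[resource] = {"planned": count, "limit": limit}
    return flagged
```

Running this well before cutover leaves time for limit increases, which is exactly when you want to be talking to support.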
25. Flipping the switch to AWS
• “The biggest, scariest engineering change we've made in the company's history” - Don, SmugMug Chief Geek
• Go read-only (1 min)
• Pre-scale up big
• MHA to reassign MySQL masters and their replication (30 min)
• Point DNS + CDN to Elastic Load Balancing (5-30 min)
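The cutover sequence above is essentially a gated runbook: each step runs only if a verification check passes, otherwise the cutover halts and the rollback plan takes over. A minimal sketch of that structure (hypothetical, not SmugMug's tooling):

```python
# Hypothetical gated runbook: run steps in order, verifying each before
# moving on. A failed check halts the cutover so the rollback plan can
# be executed instead of pushing forward into a bad state.

def run_cutover(steps):
    """steps: list of (name, action, check), where action() performs the
    step and check() returns True if it is safe to continue.
    Returns (completed_step_names, succeeded)."""
    completed = []
    for name, action, check in steps:
        action()
        if not check():
            return completed, False  # halt; roll back from here
        completed.append(name)
    return completed, True
```

The slide's steps map onto this directly: go read-only, pre-scale, reassign masters with MHA, repoint DNS, test, then go read-write; each with its own check.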
26. Flipping the switch to AWS
• Test! (60 min)
• When read-only is all good, go to read-write (5 min)
• Test! Inevitable bugs at this step (hours)
27. MHA?
• Facebook, DeNA
• Helps to reliably reassign MySQL masters and replication, maintaining consistency
28. MHA?
• Manual failover in MySQL 5.5 and earlier is painful and time-consuming
• Be careful with automation for rare events: it can bite
29. Problems?
• Completely redundant network links can fail
• Bugs related to IP address changes
• ElastiCache performance
• New Relic! Use it or a similar APM product
32. Results
• Datacenter: performance fluctuated through the day
• AWS w/ scaling: flat performance throughout the day; significant scalability limits removed
• Networking was a key improvement
• Success!
33. Lessons Learned
• We love AWS even more than before
• Automate everything
• Understand Amazon EBS, and understand the underlying details of AWS services
• Unpredictable Ops schedules vs. large projects
35. We made more changes, because we could
• As long as we’re moving our infrastructure, why not rebuild most of it too?
• Linux, MySQL, package versions upgraded
• New monitoring tools
• NFS dependencies eliminated, moved to Amazon S3 or DynamoDB
• Code pushes managed by nice distributed tools utilizing Amazon S3 + internal torrent
36. One last thing...
• Go multi-Availability-Zone!
• Load balancers send traffic to multiple haproxy instances per AZ, with AZ-specific web clusters and DB replicas
• Backed up w/ cross-AZ capacity
• Keep SPOFs in one AZ
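A quick way to act on the multi-AZ advice is to audit where each role's instances actually live. This is a hypothetical helper (in practice the (role, AZ) pairs would come from the EC2 API):

```python
from collections import defaultdict

# Hypothetical multi-AZ audit: group instances by Availability Zone per
# role and flag any role that would go down with a single AZ. Roles you
# deliberately keep as SPOFs should all land in the same, known AZ.

def single_az_roles(instances):
    """instances: iterable of (role, availability_zone) pairs.
    Return the set of roles running in only one AZ."""
    azs_by_role = defaultdict(set)
    for role, az in instances:
        azs_by_role[role].add(az)
    return {role for role, azs in azs_by_role.items() if len(azs) == 1}
```

Anything this flags is either an intentional SPOF to pin in one AZ, or a cluster that needs instances in a second AZ.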
37. Questions?
Andrew Shieh, Sunnyvale, CA
shandrew@smugmug.com
@shandrew
http://www.smugmug.com/
http://pics.shieh.info/
Thank you!
38. Please give us your feedback on this presentation
ARC312 - SmugMug’s Zero Downtime Migration to AWS
As a thank you, we will select prize winners daily for completed surveys!