Designing a massively scalable, highly available persistence layer has been one of the great challenges we’ve faced building out Twilio’s cloud communications infrastructure. Robust Voice and SMS APIs have strict consistency, latency, and availability requirements that cannot be met with traditional sharding or scaling approaches. In this talk we first examine the challenges of running high-availability services in the cloud, and then describe how we’ve architected “in-flight” and “post-flight” data into separate datastores that can be implemented using a range of technologies.
5. High-Availability
Sounds good, we need that!
Availability             Downtime/yr     Downtime/mo
99.9%   (“three nines”)  8.76 hours      43.2 minutes
99.99%  (“four nines”)   52.56 minutes   4.32 minutes
99.999% (“five nines”)   5.26 minutes    25.9 seconds
99.9999% (“six nines”)   31.5 seconds    2.59 seconds
Can’t rely on a human to respond in a 5-minute window!
Must use automation.
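Where those numbers come from: the downtime budget falls straight out of the availability percentage. A quick Python sketch (Python being one of the stacks this talk runs on); the per-month figures assume a 30-day month, matching the table:

# Downtime budget implied by an availability target.
def downtime_seconds(availability_pct, period_hours):
    return (1 - availability_pct / 100) * period_hours * 3600

for nines in (99.9, 99.99, 99.999, 99.9999):
    per_year = downtime_seconds(nines, 365 * 24)
    per_month = downtime_seconds(nines, 30 * 24)
    print(f"{nines}%: {per_year / 3600:.2f} h/yr, {per_month / 60:.2f} min/mo")
# 99.9%: 8.76 h/yr, 43.20 min/mo -- matches the first row above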
6. Happens to the best
2.5 Hours Down (September 23, 2010)
“...we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.”

11 Hours Down (October 4, 2010)
“...At 6:30pm EST, we determined the most effective course of action was to re-index the [database] shard, which would address the memory fragmentation and usage issues. The whole process, including extensive testing against data loss and data corruption, took about five hours.”

Hours Down (November 14, 2010)
“...Before every run of our test suite we destroy then re-create the database... Due to the configuration error GitHub's production database was destroyed then re-created. Not good.”
7. Causes of Downtime
Lack of best practice change control
Lack of best practice monitoring of the relevant components
Lack of best practice requirements and procurement
Lack of best practice operations
Lack of best practice avoidance of network failures
Lack of best practice avoidance of internal application failures
Lack of best practice avoidance of external services that fail
Lack of best practice physical environment
Lack of best practice network redundancy
Lack of best practice technical solution of backup
Lack of best practice process solution of backup
Lack of best practice physical location
Lack of best practice infrastructure redundancy
Lack of best practice storage architecture redundancy
E. Marcus and H. Stern, Blueprints for High Availability, 2nd ed. Indianapolis, IN, USA: John Wiley & Sons, 2003.
8. Cloud | Non-Cloud
(The downtime causes above sort into four buckets; in the cloud, the Datacenter bucket is the provider’s problem.)
Data Persistence: storage architecture redundancy; technical solution of backup; process solution of backup
Change Control: change control
Operations: monitoring of the relevant components; requirements and procurement; operations; avoidance of internal app failures; avoidance of external services that fail
Datacenter (Non-Cloud): avoidance of network failures; physical environment; network redundancy; physical location; infrastructure redundancy
9. Happens to the best
(The same three outages, now labeled by cause: the September 23 and October 4 incidents were Database failures; the November 14 GitHub incident was a Change Control failure.)
10. Data Persistence | Change Control | Operations | Datacenter
(The same four-bucket table, with Data Persistence and Change Control highlighted. Today: lessons learned @twilio in those two areas.)
12. Twilio provides web service APIs to
automate Voice and SMS communications
(Diagram: Carriers <-> Twilio <-> Developer and End User.)
Voice: inbound calls, outbound calls, mobile/browser VoIP
SMS: send to/from phone numbers, short codes
Phone Numbers: dynamically buy phone numbers
15. 2009: 10 servers -> 2010: 10’s of servers -> 2011: 100’s of servers
16. 2011
• 100’s of prod hosts in continuous operation
• 80+ service types running in prod
• 50+ prod database servers
• Prod deployments several times/day across 7 engineering teams
17. 2011
• Frameworks
- PHP for frontend components
- Python Twisted & gevent for async network services
- Java for backend services
• Storage technology
- MySQL for core DB services
- Redis for queuing and messaging
19. Data persistence is hard
Data persistence is the hardest technical problem most scalable SaaS businesses face.
20. What is data persistence?
Stuff that looks like this
21. What is data persistence?
Databases
Queues
Files
22. Incoming Requests
(Architecture diagram: incoming requests -> LB -> Tier 1 (A, A) -> queues (Q, Q) -> Tier 2 (B, B, B, B) -> Tier 3 (C, C, D, D), backed by SQL, files, and a K/V store. Callout: “Data Persistence!” on the stateful pieces.)
23. Why is persistence so hard?
• Difficult to change structure
- Huge inertia e.g., large schema migrations
• Painful to recover from disk/node failures
- “just boot a new node” doesn’t work
• Woeful performance/scalability
- I/O is a huge bottleneck in modern servers (e.g., EC2)
• Freak’in complex!!!
- Atomic transactions/rollback, ACID, blah blah blah
24. Difficult to Change Structure
ALTER TABLE names
DROP COLUMN Value
Before:                 After:
Id  Name   Value        Id  Name
1   Bob    12           1   Bob
2   Jane   78           2   Jane
3   Steve  56           3   Steve
...                     ...
500 million rows
HOURS later...
‣ You live with data decisions for a long time
25. Painful to Recover from Failures
(Diagram: writes (W) go to the Primary DB; reads (R) hit both Primary and Secondary. Open questions when the primary fails: Is the data on the secondary? How much data? What R/W consistency?)
‣ Because of complexity, failover is a human process
28. @!#$%^&* Complex
• Incredibly complex configuration
- Billion knobs and buttons
- Whole companies exist just to tune DB’s
• Lots of consistency/transactional models
• Multi-region data is unsolved - Facebook and Google struggle
(Background screenshot: MySQL’s SHOW ENGINE INNODB STATUS “BUFFER POOL AND MEMORY” output: buffer pool sizes, free buffers, page read/write rates, hit rate, etc.)
29. Deep breath, step back
Think about each problem
(use @twilio examples)
• Software that runs in the cloud
• Open source
30. 1: Difficult to Change Structure
• Don’t have structure (sketch below)
- key/value databases (SimpleDB, Cassandra)
- document-oriented databases (CouchDB, MongoDB)
• Don’t store a lot of data...
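To make “don’t have structure” concrete, here is a minimal sketch using Redis hashes as a stand-in for a schema-less store (the slide names SimpleDB, Cassandra, CouchDB, and MongoDB; the key names here are illustrative). Retiring a “column” is just ceasing to write the field: no ALTER TABLE over 500 million rows.

import redis

r = redis.Redis()

# Older writers stored a "value" field alongside the name.
r.hset("names:1", mapping={"name": "Bob", "value": "12"})

# Newer writers simply omit it; no migration, old rows coexist.
r.hset("names:2", mapping={"name": "Jane"})

def read_name(row_id):
    record = r.hgetall(f"names:{row_id}")
    record.pop(b"value", None)  # tolerate the retired field on legacy rows
    return record[b"name"].decode()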
31. 1: Don’t Store Stuff
• Outsource data as much as possible
• But NOT to your customers
32. 1: Don’t Store Stuff
• Aggressively archive and move data offline (sketch below)
(Diagram: keep ~500M rows in the live DB, with indices in memory; archive older data to S3/SimpleDB.)
• Build UX that supports longer/restricted access times to older data
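A hedged sketch of that archiving loop, assuming boto3 and PyMySQL; the host, table, and bucket names are hypothetical. Rows older than a cutoff are copied to S3 and then deleted from the hot table, keeping the live working set small enough that its indices fit in memory.

import json
import boto3
import pymysql

s3 = boto3.client("s3")
db = pymysql.connect(host="db.internal", user="app",
                     password="***", database="prod")

def archive_old_rows(cutoff, batch=1000):
    """Move one batch of cold rows from MySQL to S3; returns rows moved."""
    with db.cursor() as cur:
        cur.execute(
            "SELECT id, payload FROM events WHERE created < %s LIMIT %s",
            (cutoff, batch))
        rows = cur.fetchall()
        for row_id, payload in rows:
            # Upload the archive copy before deleting, so a crash
            # mid-loop never loses data (worst case: re-upload later).
            s3.put_object(Bucket="event-archive",
                          Key=f"events/{row_id}.json",
                          Body=json.dumps({"id": row_id, "payload": payload}))
            cur.execute("DELETE FROM events WHERE id = %s", (row_id,))
        db.commit()
    return len(rows)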
33. 1: Don’t Store Stuff
• Avoid stateful systems/architectures where possible
(Diagram: Browser with “Cookie: SessionID” -> Web servers -> shared Session DB.)
34. 1: Don’t Store Stuff
• Avoid stateful systems/architectures where possible
(Diagram: store state in the client browser instead; Browser with “Cookie: enc($session)” -> Web servers, and no Session DB. A sketch follows below.)
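A minimal sketch of the cookie approach, assuming the Python cryptography package; everything here (key handling included) is illustrative. The session is encrypted into the cookie itself, so any web node can serve any request and there is no session DB to fail over.

import json
from cryptography.fernet import Fernet

SECRET_KEY = Fernet.generate_key()  # in practice: a fixed secret shared by all web nodes
fernet = Fernet(SECRET_KEY)

def encode_session(session):
    """Serialize and encrypt session state for a Set-Cookie header."""
    return fernet.encrypt(json.dumps(session).encode()).decode()

def decode_session(cookie_value):
    """Decrypt and deserialize session state from the request cookie."""
    return json.loads(fernet.decrypt(cookie_value.encode()))

cookie = encode_session({"user_id": 42, "cart": ["sku-1"]})
print(decode_session(cookie))  # {'user_id': 42, 'cart': ['sku-1']}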
35. 2: Painful to Recover from Failures
• Avoid single points of failure
- E.g., master-master (active/active)
- Complex to set up, complex failure modes
- Sometimes it’s the only solution
- Lots of great docs on the web
• Minimize the number of stateful nodes; separate stateful & stateless components...
36. 2: Separate Stateful and Stateless Components
(Diagram: Req -> App A -> App B -> App C. On failure of App B, even if we boot a replacement, we lose data.)
37. 2: Separate Stateful and Stateless Components
(Diagram: the same path with queues inserted between each tier: Req -> App A -> Queue -> App B -> Queue -> App C. Caption: on failure, even if we boot a replacement App B, we lose data.)
38. 2: Separate Stateful and Stateless Components
Keep the connection open for the whole app path! (hint: use an evented framework)
(Diagram: Req -> App A -> App B -> App C, with the request held open end to end. On failure, we don’t lose a single request. Twilio’s SMS stack uses this approach; a sketch follows below.)
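A minimal sketch of that pattern, assuming gevent (named earlier in this talk for async network services); the downstream URL is hypothetical. The client’s connection is held open until the next tier acknowledges, so a mid-path failure surfaces as a retryable error rather than a silently lost request.

from gevent import monkey; monkey.patch_all()
from gevent.pywsgi import WSGIServer
import urllib.request

def forward_downstream(body):
    # Blocking per-greenlet call; gevent multiplexes thousands of these.
    try:
        resp = urllib.request.urlopen(
            "http://app-b.internal/enqueue", data=body, timeout=5)
        return resp.status == 200
    except OSError:
        return False

def app(environ, start_response):
    body = environ["wsgi.input"].read()
    if forward_downstream(body):
        # Only ack the client once the whole path has the request.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"accepted"]
    start_response("503 Service Unavailable", [("Content-Type", "text/plain")])
    return [b"retry"]

if __name__ == "__main__":
    WSGIServer(("0.0.0.0", 8000), app).serve_forever()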
39. 2: Painful to Recover from Failures
• Avoid single points of failure
- E.g., master-master (active/active)
- Complex to set up, complex failure modes
- Sometimes it’s the only solution
- Lots of great docs on the web
• Minimize the number of stateful nodes; separate stateful & stateless components
• Build a data change control process to avoid mistakes and errors...
40. • 100’s of prod hosts in continuous operation
• 80+ service types running in prod
• 50+ prod database servers
• Prod deployments several times/day across 7 engineering teams
Components deployed at different frequencies: Partially Continuous Deployment
41. Deployment Frequency (Risk)
(Chart: 4 buckets on a log-scale frequency axis.)
Website Content (CMS etc.): ~1000x
Website Code (PHP/Ruby etc.): ~100x
REST API (Python/Java): ~10x
Big DB Schema (SQL): ~1x
42. Processes
Website Content: One Click
Website Code: CI Tests -> One Click
REST API: CI Tests -> Human Sign-off -> One Click
Big DB Schema: CI Tests -> Human Sign-off -> Human-Assisted Click
43. 3: Woeful Performance/Scalability
• If disk I/O is poor, avoid disk
- Tune tune tune. Keep your indices in memory
- Use an in-memory datastore, e.g., Redis, and configure replication such that on a master failure you can always promote a slave (sketch below)
• When disk I/O saturates, shard
- LOTs of sharding info on the web
- Method of last resort: a single point of failure becomes multiple single points of failure
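A hedged sketch of the Redis advice, using redis-py; the host names are hypothetical, and a real deployment would lean on Redis Sentinel or an orchestrator rather than this hand-rolled check.

import redis

master = redis.Redis(host="redis-master.internal")
replica = redis.Redis(host="redis-replica.internal")

def master_is_down():
    try:
        return not master.ping()
    except redis.ConnectionError:
        return True

if master_is_down():
    # SLAVEOF NO ONE detaches the replica and makes it writable.
    replica.slaveof()
    # ...then repoint clients (DNS / VIP / config push) at the new master.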
44. 4: @#$%^&* Complex
• Bring the simplest tool to the job
- Use a strictly consistent store only if you need it
- If you don’t need HA, don’t add the complexity
• There is no magic database. Decompose requirements, mix-and-match datastores as needed...
(Joke ad: “Magic Database. Does it all. Consistency, Availability, Partition-tolerance: it’s got all three.”)
53. Why is persistence so hard?
• Difficult to change structure -> Go schema-less. Don’t store stuff!
- Huge inertia e.g., large schema migrations
• Painful to recover from disk/node failures -> Separate stateful/stateless. Change control processes
- “just boot a new node” doesn’t work
• Woeful performance/scalability -> Memory FTW. Shard
- I/O is a huge bottleneck in modern servers (e.g., EC2)
• Freak’in complex!!! -> Decompose data lifecycle. Minimize complexity
- Atomic transactions/rollback, ACID, blah blah blah
54. Incoming Requests
(The same architecture diagram as before: incoming requests -> LB -> Tier 1 (A, A) -> queues (Q, Q) -> Tier 2 (B, B, B, B) -> Tier 3 (C, C, D, D), with SQL, files, and a K/V store.)
55. Incoming Requests
(The same diagram, annotated with the fixes:)
• Idempotent request path (sketch below)
• Aggregate into HA queues
• Master-Master MySQL for the SQL tier
• Move the file store to S3
• Move K/V to SimpleDB with a local cache
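A minimal sketch of the idempotent request path, using Redis (named earlier for queuing and messaging); key names are illustrative. Callers supply a unique request ID, and a set-if-absent guard makes retries safe: replaying a request enqueues the work at most once.

import redis

r = redis.Redis()

def handle_request(request_id, payload):
    # NX = set only if absent; EX = let the dedupe key expire after a day.
    first_time = r.set(f"req:{request_id}", b"seen", nx=True, ex=86400)
    if not first_time:
        return "duplicate: already processed"
    r.rpush("work:queue", payload)  # enqueued exactly once per request ID
    return "accepted"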
56. Data Persistence | Change Control | Operations | Datacenter
(The same four-bucket table once more, overlaid with the closing message: HA is Hard.)