Designing a massively scalable, highly available persistence layer has been one of the great challenges we’ve faced building out Twilio’s cloud communications infrastructure. Robust Voice and SMS APIs have strict consistency, latency, and availability requirements that cannot be met with traditional sharding or scaling approaches. In this talk we first examine the challenges of running high-availability services in the cloud, and then describe how we’ve architected “in-flight” and “post-flight” data into separate datastores that can be implemented using a range of technologies.
5. High-Availability
Sounds good, we need that!
Availability             Downtime/yr     Downtime/mo
99.9%   (“three nines”)  8.76 hours      43.2 minutes
99.99%  (“four nines”)   52.56 minutes   4.32 minutes
99.999% (“five nines”)   5.26 minutes    25.9 seconds
99.9999% (“six nines”)   31.5 seconds    2.59 seconds
Can’t rely on a human to respond in a 5-minute window!
Must use automation.
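Where those numbers come from: the downtime budget falls straight out of the availability percentage. A quick Python sketch (Python being one of the stacks this talk runs on); the per-month figures assume a 30-day month, matching the table:

# Downtime budget implied by an availability target.
def downtime_seconds(availability_pct, period_hours):
    return (1 - availability_pct / 100) * period_hours * 3600

for nines in (99.9, 99.99, 99.999, 99.9999):
    per_year = downtime_seconds(nines, 365 * 24)
    per_month = downtime_seconds(nines, 30 * 24)
    print(f"{nines}%: {per_year / 3600:.2f} h/yr, {per_month / 60:.2f} min/mo")
# 99.9%: 8.76 h/yr, 43.20 min/mo -- matches the first row above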
6. Happens to the best
2.5 Hours Down (September 23, 2010)
“...we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.”

11 Hours Down (October 4, 2010)
“...At 6:30pm EST, we determined the most effective course of action was to re-index the [database] shard, which would address the memory fragmentation and usage issues. The whole process, including extensive testing against data loss and data corruption, took about five hours.”

Hours Down (November 14, 2010)
“...Before every run of our test suite we destroy then re-create the database... Due to the configuration error GitHub's production database was destroyed then re-created. Not good.”
7. Causes of Downtime
Lack of best practice change control
Lack of best practice monitoring of the relevant components
Lack of best practice requirements and procurement
Lack of best practice operations
Lack of best practice avoidance of network failures
Lack of best practice avoidance of internal application failures
Lack of best practice avoidance of external services that fail
Lack of best practice physical environment
Lack of best practice network redundancy
Lack of best practice technical solution of backup
Lack of best practice process solution of backup
Lack of best practice physical location
Lack of best practice infrastructure redundancy
Lack of best practice storage architecture redundancy
E. Marcus and H. Stern, Blueprints for High Availability, 2nd ed. Indianapolis, IN, USA: John Wiley & Sons, 2003.
8. Cloud | Non-Cloud
(The downtime causes above sort into four buckets; in the cloud, the Datacenter bucket is the provider’s problem.)
Data Persistence: storage architecture redundancy; technical solution of backup; process solution of backup
Change Control: change control
Operations: monitoring of the relevant components; requirements and procurement; operations; avoidance of internal app failures; avoidance of external services that fail
Datacenter (Non-Cloud): avoidance of network failures; physical environment; network redundancy; physical location; infrastructure redundancy
9. Happens to the best
(The same three outages, now labeled by cause: the September 23 and October 4 incidents were Database failures; the November 14 GitHub incident was a Change Control failure.)
10. Data Persistence | Change Control | Operations | Datacenter
(The same four-bucket table, with Data Persistence and Change Control highlighted. Today: lessons learned @twilio in those two areas.)
12. Twilio provides web service APIs to
automate Voice and SMS communications
(Diagram: Carriers <-> Twilio <-> Developer and End User.)
Voice: inbound calls, outbound calls, mobile/browser VoIP
SMS: send to/from phone numbers, short codes
Phone Numbers: dynamically buy phone numbers
15. 2009: 10 servers -> 2010: 10’s of servers -> 2011: 100’s of servers
16. 2011
• 100’s of prod hosts in continuous operation
• 80+ service types running in prod
• 50+ prod database servers
• Prod deployments several times/day across 7 engineering teams
17. 2011
• Frameworks
- PHP for frontend components
- Python Twisted & gevent for async network services
- Java for backend services
• Storage technology
- MySQL for core DB services
- Redis for queuing and messaging
19. Data persistence is hard
Data persistence is the hardest technical problem most scalable SaaS businesses face.
20. What is data persistence?
Stuff that looks like this
21. What is data persistence?
Databases
Queues
Files
22. Incoming Requests
(Architecture diagram: incoming requests -> LB -> Tier 1 (A, A) -> queues (Q, Q) -> Tier 2 (B, B, B, B) -> Tier 3 (C, C, D, D), backed by SQL, files, and a K/V store. Callout: “Data Persistence!” on the stateful pieces.)
23. Why is persistence so hard?
• Difficult to change structure
- Huge inertia e.g., large schema migrations
• Painful to recover from disk/node failures
- “just boot a new node” doesn’t work
• Woeful performance/scalability
- I/O is a huge bottleneck in modern servers (e.g., EC2)
• Freak’in complex!!!
- Atomic transactions/rollback, ACID, blah blah blah
24. Difficult to Change Structure
ALTER TABLE names
DROP COLUMN Value
Before:                 After:
Id  Name   Value        Id  Name
1   Bob    12           1   Bob
2   Jane   78           2   Jane
3   Steve  56           3   Steve
...                     ...
500 million rows
HOURS later...
‣ You live with data decisions for a long time
25. Painful to Recover from Failures
(Diagram: writes (W) go to the Primary DB; reads (R) hit both Primary and Secondary. Open questions when the primary fails: Is the data on the secondary? How much data? What R/W consistency?)
‣ Because of complexity, failover is a human process
28. @!#$%^&* Complex
• Incredibly complex configuration
- Billion knobs and buttons
- Whole companies exist just to tune DB’s
• Lots of consistency/transactional models
• Multi-region data is unsolved - Facebook and Google struggle
(Background screenshot: MySQL’s SHOW ENGINE INNODB STATUS “BUFFER POOL AND MEMORY” output: buffer pool sizes, free buffers, page read/write rates, hit rate, etc.)
29. Deep breath, step back
Think about each problem
(use @twilio examples)
• Software that runs in the cloud
• Open source
30. 1: Difficult to Change Structure
• Don’t have structure (sketch below)
- key/value databases (SimpleDB, Cassandra)
- document-oriented databases (CouchDB, MongoDB)
• Don’t store a lot of data...
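To make “don’t have structure” concrete, here is a minimal sketch using Redis hashes as a stand-in for a schema-less store (the slide names SimpleDB, Cassandra, CouchDB, and MongoDB; the key names here are illustrative). Retiring a “column” is just ceasing to write the field: no ALTER TABLE over 500 million rows.

import redis

r = redis.Redis()

# Older writers stored a "value" field alongside the name.
r.hset("names:1", mapping={"name": "Bob", "value": "12"})

# Newer writers simply omit it; no migration, old rows coexist.
r.hset("names:2", mapping={"name": "Jane"})

def read_name(row_id):
    record = r.hgetall(f"names:{row_id}")
    record.pop(b"value", None)  # tolerate the retired field on legacy rows
    return record[b"name"].decode()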
31. 1: Don’t Store Stuff
• Outsource data as much as possible
• But NOT to your customers
32. 1: Don’t Store Stuff
• Aggressively archive and move data offline (sketch below)
(Diagram: keep ~500M rows in the live DB, with indices in memory; archive older data to S3/SimpleDB.)
• Build UX that supports longer/restricted access times to older data
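A hedged sketch of that archiving loop, assuming boto3 and PyMySQL; the host, table, and bucket names are hypothetical. Rows older than a cutoff are copied to S3 and then deleted from the hot table, keeping the live working set small enough that its indices fit in memory.

import json
import boto3
import pymysql

s3 = boto3.client("s3")
db = pymysql.connect(host="db.internal", user="app",
                     password="***", database="prod")

def archive_old_rows(cutoff, batch=1000):
    """Move one batch of cold rows from MySQL to S3; returns rows moved."""
    with db.cursor() as cur:
        cur.execute(
            "SELECT id, payload FROM events WHERE created < %s LIMIT %s",
            (cutoff, batch))
        rows = cur.fetchall()
        for row_id, payload in rows:
            # Upload the archive copy before deleting, so a crash
            # mid-loop never loses data (worst case: re-upload later).
            s3.put_object(Bucket="event-archive",
                          Key=f"events/{row_id}.json",
                          Body=json.dumps({"id": row_id, "payload": payload}))
            cur.execute("DELETE FROM events WHERE id = %s", (row_id,))
        db.commit()
    return len(rows)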
33. 1: Don’t Store Stuff
• Avoid stateful systems/architectures where possible
(Diagram: Browser with “Cookie: SessionID” -> Web servers -> shared Session DB.)
34. 1: Don’t Store Stuff
• Avoid stateful systems/architectures where possible
(Diagram: store state in the client browser instead; Browser with “Cookie: enc($session)” -> Web servers, and no Session DB. A sketch follows below.)
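A minimal sketch of the cookie approach, assuming the Python cryptography package; everything here (key handling included) is illustrative. The session is encrypted into the cookie itself, so any web node can serve any request and there is no session DB to fail over.

import json
from cryptography.fernet import Fernet

SECRET_KEY = Fernet.generate_key()  # in practice: a fixed secret shared by all web nodes
fernet = Fernet(SECRET_KEY)

def encode_session(session):
    """Serialize and encrypt session state for a Set-Cookie header."""
    return fernet.encrypt(json.dumps(session).encode()).decode()

def decode_session(cookie_value):
    """Decrypt and deserialize session state from the request cookie."""
    return json.loads(fernet.decrypt(cookie_value.encode()))

cookie = encode_session({"user_id": 42, "cart": ["sku-1"]})
print(decode_session(cookie))  # {'user_id': 42, 'cart': ['sku-1']}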
35. 2: Painful to Recover from Failures
• Avoid single points of failure
- E.g., master-master (active/active)
- Complex to set up, complex failure modes
- Sometimes it’s the only solution
- Lots of great docs on the web
• Minimize the number of stateful nodes; separate stateful & stateless components...
36. 2: Separate Stateful and Stateless Components
(Diagram: Req -> App A -> App B -> App C. On failure of App B, even if we boot a replacement, we lose data.)
37. 2: Separate Stateful and Stateless Components
(Diagram: the same path with queues inserted between each tier: Req -> App A -> Queue -> App B -> Queue -> App C. Caption: on failure, even if we boot a replacement App B, we lose data.)
38. 2: Separate Stateful and Stateless Components
Keep the connection open for the whole app path! (hint: use an evented framework)
(Diagram: Req -> App A -> App B -> App C, with the request held open end to end. On failure, we don’t lose a single request. Twilio’s SMS stack uses this approach; a sketch follows below.)
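A minimal sketch of that pattern, assuming gevent (named earlier in this talk for async network services); the downstream URL is hypothetical. The client’s connection is held open until the next tier acknowledges, so a mid-path failure surfaces as a retryable error rather than a silently lost request.

from gevent import monkey; monkey.patch_all()
from gevent.pywsgi import WSGIServer
import urllib.request

def forward_downstream(body):
    # Blocking per-greenlet call; gevent multiplexes thousands of these.
    try:
        resp = urllib.request.urlopen(
            "http://app-b.internal/enqueue", data=body, timeout=5)
        return resp.status == 200
    except OSError:
        return False

def app(environ, start_response):
    body = environ["wsgi.input"].read()
    if forward_downstream(body):
        # Only ack the client once the whole path has the request.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"accepted"]
    start_response("503 Service Unavailable", [("Content-Type", "text/plain")])
    return [b"retry"]

if __name__ == "__main__":
    WSGIServer(("0.0.0.0", 8000), app).serve_forever()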
39. 2: Painful to Recover from Failures
• Avoid single points of failure
- E.g., master-master (active/active)
- Complex to set up, complex failure modes
- Sometimes it’s the only solution
- Lots of great docs on the web
• Minimize the number of stateful nodes; separate stateful & stateless components
• Build a data change control process to avoid mistakes and errors...
40. • 100’s of prod hosts in continuous operation
• 80+ service types running in prod
• 50+ prod database servers
• Prod deployments several times/day across 7 engineering teams
Components deployed at different frequencies: Partially Continuous Deployment
41. Deployment Frequency (Risk)
(Chart: 4 buckets on a log-scale frequency axis.)
Website Content (CMS etc.): ~1000x
Website Code (PHP/Ruby etc.): ~100x
REST API (Python/Java): ~10x
Big DB Schema (SQL): ~1x
42. Processes
Website Content: One Click
Website Code: CI Tests -> One Click
REST API: CI Tests -> Human Sign-off -> One Click
Big DB Schema: CI Tests -> Human Sign-off -> Human-Assisted Click
43. 3: Woeful Performance/Scalability
• If disk I/O is poor, avoid disk
- Tune tune tune. Keep your indices in memory
- Use an in-memory datastore, e.g., Redis, and configure replication such that on a master failure you can always promote a slave (sketch below)
• When disk I/O saturates, shard
- LOTs of sharding info on the web
- Method of last resort: a single point of failure becomes multiple single points of failure
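A hedged sketch of the Redis advice, using redis-py; the host names are hypothetical, and a real deployment would lean on Redis Sentinel or an orchestrator rather than this hand-rolled check.

import redis

master = redis.Redis(host="redis-master.internal")
replica = redis.Redis(host="redis-replica.internal")

def master_is_down():
    try:
        return not master.ping()
    except redis.ConnectionError:
        return True

if master_is_down():
    # SLAVEOF NO ONE detaches the replica and makes it writable.
    replica.slaveof()
    # ...then repoint clients (DNS / VIP / config push) at the new master.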
44. 4: @#$%^&* Complex
• Bring the simplest tool to the job
- Use a strictly consistent store only if you need it
- If you don’t need HA, don’t add the complexity
• There is no magic database. Decompose requirements, mix-and-match datastores as needed...
(Joke ad: “Magic Database. Does it all. Consistency, Availability, Partition-tolerance: it’s got all three.”)
53. Why is persistence so hard?
• Difficult to change structure -> Go schema-less. Don’t store stuff!
- Huge inertia e.g., large schema migrations
• Painful to recover from disk/node failures -> Separate stateful/stateless. Change control processes
- “just boot a new node” doesn’t work
• Woeful performance/scalability -> Memory FTW. Shard
- I/O is a huge bottleneck in modern servers (e.g., EC2)
• Freak’in complex!!! -> Decompose data lifecycle. Minimize complexity
- Atomic transactions/rollback, ACID, blah blah blah
54. Incoming Requests
(The same architecture diagram as before: incoming requests -> LB -> Tier 1 (A, A) -> queues (Q, Q) -> Tier 2 (B, B, B, B) -> Tier 3 (C, C, D, D), with SQL, files, and a K/V store.)
55. Incoming Requests
(The same diagram, annotated with the fixes:)
• Idempotent request path (sketch below)
• Aggregate into HA queues
• Master-Master MySQL for the SQL tier
• Move the file store to S3
• Move K/V to SimpleDB with a local cache
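A minimal sketch of the idempotent request path, using Redis (named earlier for queuing and messaging); key names are illustrative. Callers supply a unique request ID, and a set-if-absent guard makes retries safe: replaying a request enqueues the work at most once.

import redis

r = redis.Redis()

def handle_request(request_id, payload):
    # NX = set only if absent; EX = let the dedupe key expire after a day.
    first_time = r.set(f"req:{request_id}", b"seen", nx=True, ex=86400)
    if not first_time:
        return "duplicate: already processed"
    r.rpush("work:queue", payload)  # enqueued exactly once per request ID
    return "accepted"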
56. Data Persistence | Change Control | Operations | Datacenter
(The same four-bucket table once more, overlaid with the closing message: HA is Hard.)