Online games pose a few interesting backend challenges: a single user generates one HTTP call every few seconds, and the balance between data reads and writes is close to 50/50, which makes write-through caches and other common scaling approaches less effective. Starting from a rather classic Ruby on Rails application, we gradually changed it as traffic grew in order to meet the required performance. And when small changes were no longer enough, we turned parts of our data persistence layer inside out, migrating from SQL to NoSQL without taking downtimes longer than a few minutes. Follow the problems we hit, how we diagnosed them, and how we worked around limitations. See which tools we found useful, and which other lessons we learned by running the system with a team of just two developers, without a sysadmin or operations team as support.
17. We added a few application servers over time
lb
app app app app app app app app app
db db
Tuesday, June 5, 2012
18. 250K daily users and no problems
Life was good
<chart: daily active users over time, around 250K>
19. Life was good and I went on a nice vacation
TO DO
<picture: Jesper in clot canyon>
24. A journey to 1,000,000 daily users
Start of the journey
6 weeks of pain
Paradise
Conclusion
27. SQL queries generated by Rubyamf gem
AMF responses to Flash client
Wrong config...
... so associated data was included, too
=> Easy to fix
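The fix can be illustrated with a small, hypothetical serializer sketch (this is not the actual RubyAMF configuration, and all names are illustrative): without an attribute whitelist, every reader is serialized and associated objects get dragged along; with one, only the scalar fields reach the Flash client.

```ruby
# Hypothetical model; in the real app this was an ActiveRecord model.
class TileRecord
  attr_reader :id, :state, :owner   # owner stands in for an association
  def initialize(id, state, owner)
    @id, @state, @owner = id, state, owner
  end
end

# Buggy behaviour: serialize every reader, associations included.
def serialize_all(tile)
  { id: tile.id, state: tile.state, owner: tile.owner }
end

# The fix: whitelist the scalar attributes the client actually needs.
SERIALIZED_ATTRS = %i[id state]
def serialize(tile)
  SERIALIZED_ATTRS.to_h { |attr| [attr, tile.public_send(attr)] }
end

tile = TileRecord.new(42, "planted", Object.new)
serialize(tile)   # only :id and :state, no associated owner data
```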
28. More traffic using the same cluster
lb
app app app app app app app app app
db db
29. Config tweaks brought us to 300K DAU
Config fixes
<chart: daily active users over time, reaching 300K>
31. ActiveRecord’s checks caused 20% extra DB load
Checking connection state
MySQL process list full of ‘status’ calls
=> Fixed by 1 line of code
32. I/O on MySQL masters still was the bottleneck
New Relic: 60% of all UPDATEs on ‘tiles’ table
33. Tiles are part of the core game loop
Core game loop
1) plant
2) wait
3) harvest
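The loop above can be sketched in a few lines of Ruby (the names and the grow time are illustrative); every plant and every harvest translates to one UPDATE on the tiles table, which is why this loop dominated write I/O.

```ruby
# Toy plant/wait/harvest cycle; GROW_TIME is an illustrative value.
Tile = Struct.new(:plant, :ready_at)

GROW_TIME = 60  # seconds

def plant(tile, plant_name, now)
  tile.plant    = plant_name        # 1) plant   -> one UPDATE on tiles
  tile.ready_at = now + GROW_TIME   # 2) wait
end

def harvest(tile, now)
  return nil if tile.plant.nil? || now < tile.ready_at
  crop = tile.plant                 # 3) harvest -> another UPDATE on tiles
  tile.plant = tile.ready_at = nil
  crop
end

tile = Tile.new
plant(tile, "carrot", 0)
harvest(tile, 30)   # => nil, still growing
harvest(tile, 61)   # => "carrot"
```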
39. We started to shard by model, too
Adding new shards
1) Set up new masters as slaves of the old ones
2) Start using the new masters
3) Cut replication
4) Truncate
old master | old slave | new master | new slave
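The model-based split can be sketched like this, with plain hashes standing in for database connections (the shard names and the routing rule are illustrative, not the actual gem's API):

```ruby
# Plain hashes stand in for database connections.
SHARDS = {
  default: {},   # original master for everything else
  tiles_1: {},   # new master A for the tiles model
  tiles_2: {},   # new master B for the tiles model
}

# Tiles are spread across their own masters by user id;
# all other models keep using the default connection.
def shard_for(model, user_id)
  return SHARDS[:default] unless model == :tile
  user_id.even? ? SHARDS[:tiles_1] : SHARDS[:tiles_2]
end

shard_for(:tile, 42)[:tiles_count] = 16   # goes to tiles_1
shard_for(:user, 42)[:level]       = 7    # stays on default
```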
40. 4 DB masters and a few more servers
lb
app app app app app app app app
app app app app app app app app
tiles tiles
db db
db db
41. Sharding by model brought us to 400K DAU
Shard by model
<chart: daily active users over time, reaching 400K>
44. We improved our MySQL setup
RAID-0 of EBS volumes
Using XtraDB
Tweaking my.cnf
47. Sharding gem circumvented AR’s internal cache
ActiveRecord caches SQL queries...
... only in our development environment!
=> Fixed by 2 lines of code
49. I/O still was not fast enough
If 2 + 2 is not enough, ...
… perhaps 4 + 4 masters will do?
51. It’s no fun to handle 8+8 MySQL DBs
lb
app app app app app app app app app
app app app app app app app app app
tiles tiles tiles tiles
db db db db
db db db db
52. At 500K DAU we were at a dead end
<chart: daily active users over time, stuck at 500K>
55. I/O remained the bottleneck for MySQL UPDATEs
Each DB master could do
about 1,000 DB writes/s.
That’s not enough!
59. Redis is fast but goes beyond simple key/value
Redis is a key-value store
Hashes, Sets, Sorted Sets, Lists
Atomic operations like set, get, increment
50,000 transactions/s on EC2
Writes are as fast as reads
60. We could learn from another team using Redis
62. Shelf tiles: An ideal candidate for Redis
Shelf tiles:
{ plant1 => 184,
plant2 => 141,
plant3 => 130,
plant4 => 112,
… }
63. Shelf tiles: An ideal candidate for using Redis
Redis Hash
HGETALL
HGET / HSET
HINCRBY
…
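A minimal in-memory stand-in for the hash commands above (real code would issue them through a Redis client) shows how naturally shelf tiles map onto a Redis hash:

```ruby
class ShelfTiles
  def initialize
    @hash = Hash.new(0)   # field => count, like a Redis hash of integers
  end

  def hgetall
    @hash.dup
  end

  def hset(field, value)
    @hash[field] = Integer(value)
  end

  # HINCRBY is atomic in Redis; single-threaded Ruby stands in here.
  def hincrby(field, by)
    @hash[field] += by
  end
end

shelf = ShelfTiles.new
shelf.hset("plant1", 184)
shelf.hincrby("plant1", -1)   # one plant1 taken from the shelf
shelf.hincrby("plant2", 1)
shelf.hgetall                 # plant1 => 183, plant2 => 1
```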
77. Migrate on the fly - and clean up later
1. Let migration run until everything cools down
2. Migrate the rest manually
3. Remove migration code
4. Wait until no fallback is necessary
5. Remove the SQL table
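Step 1, the on-the-fly migration itself, can be sketched with two in-memory stores standing in for MySQL and Redis (all names are illustrative): reads try the new store first and fall back to SQL, migrating each record on first touch.

```ruby
SQL_STORE   = { 1 => { "plant1" => 184 } }   # legacy MySQL rows
REDIS_STORE = {}                             # new Redis hashes

def shelf_tiles_for(user_id)
  # Fast path: this user was already migrated.
  return REDIS_STORE[user_id] if REDIS_STORE.key?(user_id)

  # Fallback: copy from SQL on first access, then serve from Redis only.
  REDIS_STORE[user_id] = SQL_STORE.fetch(user_id, {})
end

shelf_tiles_for(1)    # first access migrates user 1 on the fly
REDIS_STORE.key?(1)   # => true
```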
78. A journey to 1,000,000 daily users
Start of the journey
6 weeks of pain
Paradise (or not?)
Conclusion
79. Again: Tiles are part of the core game loop
Core game loop
1) plant
2) wait
3) harvest
85. Size matters for migrations
Migration check overload
Migration only on startup
Overlooked an edge case
Only migrate 1% of users
Continue if everything is ok
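The 1% rollout can be sketched as a simple bucketing rule (the threshold constant and the modulo rule are illustrative, not the actual code):

```ruby
MIGRATION_PERCENT = 1   # raise this once everything looks healthy

def migrate_now?(user_id)
  # Bucket users by id; with dense ids each bucket holds ~1% of users.
  user_id % 100 < MIGRATION_PERCENT
end

cohort = (1..10_000).count { |id| migrate_now?(id) }
cohort   # => 100, i.e. 1% of users
```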
90. In-memory DBs don’t like dumping to disk
Dumping to disk
SAVE is blocking
BGSAVE needs free RAM
Latency increases by 100%
=> BGSAVE on slaves every 15 minutes
94. Redis replication starts with a BGSAVE
Starting up a new slave by replication
BGSAVE on master
Slave imports the dumped file
=> No RAM means no new slaves
95. Redis had a memory fragmentation problem
44 GB
in 8 days
24 GB
96. Redis had a memory fragmentation problem
38 GB
in 3 days
24 GB
97. Redis had a memory fragmentation problem
Fixed in v2.2
38 GB
in 3 days
24 GB
98. If MySQL is a truck
Fast enough
Disk based
Robust
99. If MySQL is a truck, Redis is a race car
Super fast
RAM based
Fragile
100. Big and static data in MySQL, the rest goes to Redis
MySQL: 256 GB data, 10% writes
Redis: 60 GB data, 50% writes
h"p://www.flickr.com/photos/erix/245657047/
101. Lots of boxes, but automation helps a lot!
lb lb
app app app app app app app app app app app app app
app app app app app app app app app app app app app
app app app app app app app app app app app app app
db db db db db redis redis redis redis redis
102. We reached 1 million daily users!
1,000,000 - Big party!
<chart: daily active users over time, reaching 1,000,000>
103. We started archiving inactive users
50% DB reduction
<chart: daily active users over time>
104. We even survived a complete data center loss
EBS no more!
<chart: daily active users over time>
105. We improved our MySQL schema on-the-fly
30% DB reduction
<chart: daily active users over time>
106. Meanwhile we have more than 2M daily users
<chart: daily active users over time, beyond 2,000,000>
107. A journey to 1,000,000 daily users
Start of the journey
6 weeks of pain
Paradise (or not?)
Conclusion
108. Evolution every week
EVOLUTION
of software
<chart: daily active users over time>
110. Evolution every week
REVOLUTION
of software
<chart: daily active users over time>