12. ARCHITECTURE
Entirely cloud based
Nodes come and go - frequently!
Automatic deployment direct from Github via Chef
Hosted on Amazon AWS eu-west-1
Logs, backups: Amazon S3
Diagram components: The Internet, Static assets, HAProxy layer, Web layer, Cache, Cassandra Cluster, RDS MySQL, Task Server, Monitor, Chef
13. CACHING
Cache as speedup or Cache as mission-critical?
Use Django cache framework
Pylibmc - consistent hashing and server death patches
Problems as you scale up...
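Why consistent hashing matters when cache nodes come and go: a minimal pure-Python sketch of the general technique, not pylibmc's actual implementation. The class name `HashRing`, the helper `moved_keys` and the `points_per_server` parameter are hypothetical names chosen for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring: each server owns many points on a circle."""

    def __init__(self, servers, points_per_server=100):
        ring = []
        for server in servers:
            for i in range(points_per_server):
                ring.append((self._hash(f"{server}#{i}"), server))
        ring.sort()
        self._ring = ring  # sorted list of (hash, server)
        self._hashes = [h for h, _ in ring]

    @staticmethod
    def _hash(text):
        return int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)

    def server_for(self, key):
        # First point clockwise from the key's hash, wrapping around the ring.
        index = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[index][1]


def moved_keys(before, after, keys):
    """Return the keys whose server differs between two rings."""
    return [k for k in keys if before.server_for(k) != after.server_for(k)]
```

When a server drops out of the ring, only the keys that server owned are remapped; keys on the surviving servers stay put, so one dead cache node does not invalidate the whole cache.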
14. CACHE PROBLEMS
Cache miss behaviour
Thundering herds are bad
Key overload
Server overload
Dualcache - https://gist.github.com/953524

    value = cache.get(key)
    if value is None:
        lock = False
        try:
            # add() only succeeds for the first caller, so it acts as a lock
            lock = cache.add(lock_key(key), 1)
            if lock:
                # Do something expensive
                new_value = calculate_new_value()
                cache.set(key, new_value)
                return new_value
        finally:
            if lock:
                cache.delete(lock_key(key))
    return value
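One possible way to soften the "return None" problem is to keep a long-lived stale copy next to the short-lived fresh one, so callers that lose the lock still get an answer. The sketch below is an assumption about how such a scheme could look, not the contents of the Dualcache gist. `LocalCache` is a hypothetical in-memory stand-in for a memcache client, and `get_or_compute`, `fresh_for`, `lock_for` and the `fresh:`/`stale:`/`lock:` key prefixes are illustrative names.

```python
import time


class LocalCache:
    """In-memory stand-in for a memcache client (hypothetical, for illustration)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and expires_at <= time.monotonic():
            del self._data[key]
            return None
        return value

    def set(self, key, value, timeout=None):
        expires_at = None if timeout is None else time.monotonic() + timeout
        self._data[key] = (value, expires_at)

    def add(self, key, value, timeout=None):
        # Only succeeds if the key is absent, which makes it usable as a lock.
        if self.get(key) is not None:
            return False
        self.set(key, value, timeout)
        return True

    def delete(self, key):
        self._data.pop(key, None)


def get_or_compute(cache, key, compute, fresh_for=60, lock_for=30):
    """Return the fresh value if present; otherwise let one caller recompute
    while every other caller falls back to the stale copy."""
    value = cache.get("fresh:" + key)
    if value is not None:
        return value
    # The lock carries a timeout so a crashed worker cannot hold it forever.
    if cache.add("lock:" + key, 1, lock_for):
        try:
            value = compute()
            cache.set("fresh:" + key, value, fresh_for)
            cache.set("stale:" + key, value)  # no expiry: fallback copy
            return value
        finally:
            cache.delete("lock:" + key)
    # Another caller holds the lock: serve the stale copy. This is still
    # None on a completely cold cache, so callers must handle that case.
    return cache.get("stale:" + key)
```

The trade-off is twice the cache memory per key in exchange for never blocking readers on `compute()`.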
15. COUNTING
Hard to count a few things very fast
And have real-time access to the latest result
Things we tried:
memcache
Cassandra counters
Final solution: Sharded counters
16. SHARDED COUNTERS
Implemented in about 350 lines of Python
To provide two basic operations!
incr()
get()
Uses a combination of two layers of memcache and
Cassandra to provide real-time, scalable counters
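The general idea behind a sharded counter can be sketched in a few lines. This is a toy illustration of the technique, not the 350-line implementation described above: the two memcache layers and Cassandra persistence are replaced by a plain dict, and the names `ShardedCounter`, `num_shards` and `_shard_key` are assumptions made for the example.

```python
import random


class ShardedCounter:
    """Toy sharded counter: writes are spread over N shard keys and reads
    sum all shards, so no single key absorbs every increment."""

    def __init__(self, store=None, num_shards=16):
        # `store` stands in for the real storage layers; a plain dict here.
        self.store = store if store is not None else {}
        self.num_shards = num_shards

    def _shard_key(self, name, shard):
        return f"{name}:{shard}"

    def incr(self, name, delta=1):
        # Pick a random shard so concurrent writers rarely hit the same key.
        key = self._shard_key(name, random.randrange(self.num_shards))
        self.store[key] = self.store.get(key, 0) + delta

    def get(self, name):
        return sum(
            self.store.get(self._shard_key(name, shard), 0)
            for shard in range(self.num_shards)
        )
```

More shards means less write contention per key but more keys to read on every `get()`, which is presumably where a caching layer in front of the totals becomes useful.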
17. CASSANDRA
Core piece of our infrastructure
Highly write-scalable
Reads scaled from cache
Using Acunu Cassandra for virtual nodes
“Fake” Django ORM classes to make it feel more natural
But no automatic join support
20. Q&A
AND YES, WE’RE HIRING SO IF YOU’RE INTERESTED IN BUILDING EXTREMELY LARGE
DJANGO SITES THEN GET IN TOUCH
MALCOLM@TELLYBUG.COM
Editor's notes
XFactor 2012 app. Also Switch, BGT, Arab Voice, Unzipped...
Questions for audience:
- Technical?
- Running Django in production
- Scale - 10 ... 100 .... 1000 .... 10000 .... 100000 req/s
XFactor - over 1M installs, 260 Million boos/claps
BGT - 250K simultaneous users
cf. Google serving 34K searches/s worldwide
Cache is either a speedup for your site, or it is mission critical. The deciding factor is whether your DB can handle the load if the cache fails.
At > 500 req/s, MySQL on AWS can’t keep up - hence cache is critical
Discuss the code:
- what happens if you return None? How does that affect upstream bits of code?
- occasional latency problems if the value expires - everything fails for as long as calculate_new_value() takes to return
Ghetto locking - if using to protect e.g. DB writes, the key itself can end up as a problem
Describe how sharded counters work
- and the very interesting challenge of debugging!
Used for write performance rather than data size - still more data in MySQL than Cassandra
Mini rant - trouble finding any monitoring tool that copes with infrastructure that scales up and down rapidly
Tried: Zabbix, Nagios, Cloudwatch, New Relic, Sensu, librato ... and probably some others
Now building our own :(