This talk provides an overview of the large platform architecture used at State51 to deliver independent music and media content. It discusses using technologies like Varnish, Nginx, MogileFS, and Memcached to serve content efficiently from cheap hardware. Jobs and queues are used to process workflows like encoding files. The architecture is designed to be redundant and handle failures across the web, application, storage, and processing layers. Logging and monitoring are important for debugging in these complex systems.
Large platform architecture in (mostly) perl - an illustrated tour
1. Large platform architecture in (mostly) perl - an illustrated tour
Tomas (t0m) Doran
São Paulo.pm perl workshop 2010
YAPC::EU Pisa 2010
2. This talk
• Is mostly a ramble
• About what I do for a living
• Good bits
• and bad bits (probably mostly bad bits)
• And when I say ‘illustrated’, I’m not very good at diagrams, sorry...
3. Making money from independent music
• IMPOSSIBLE
• No, no it isn’t. But we’re very lucky to have people who know the music industry.
• A startup would tank
• Last.fm guys “keep losing less money”
4. The state51 conspiracy
Consolidated Independent Media Service Provider
• Several (largely profitable) businesses based on the same technology platform
• East London (Brick Lane), a warehouse.
• > 60% of UK independent content goes through us somewhere
5. Being S3 on the cheap
• WAV files are big. Videos are bigger.
• Transcodes aren’t small, especially when you have 15 of them.
• My music collection is several hundred terabytes.
• Need to be able to serve this stuff fast and concurrently.
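Here is a back-of-envelope calculation for a single track, to put rough numbers on “big”. The figures are illustrative assumptions, not state51 numbers:

- a CD-quality master (44.1 kHz, 16-bit, stereo)
- a 4-minute track
- 15 transcodes averaging 256 kbit/s

```python
# Back-of-envelope storage for one track.
# Assumptions (illustrative, not from the talk): CD-quality master,
# a 4-minute track, 15 transcodes averaging 256 kbit/s.
SAMPLE_RATE_HZ = 44_100
BYTES_PER_SAMPLE = 2        # 16-bit PCM
CHANNELS = 2                # stereo
TRACK_SECONDS = 4 * 60
TRANSCODES = 15
TRANSCODE_KBPS = 256        # assumed average across all formats


def wav_bytes(seconds: int) -> int:
    """Size of the uncompressed PCM payload (ignores the small RIFF header)."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * seconds


def transcode_bytes(seconds: int, kbps: int) -> int:
    """Size of a constant-bitrate encode of the same duration."""
    return kbps * 1000 // 8 * seconds


master = wav_bytes(TRACK_SECONDS)
derived = TRANSCODES * transcode_bytes(TRACK_SECONDS, TRANSCODE_KBPS)
print(f"master:     {master / 1e6:.1f} MB")
print(f"transcodes: {derived / 1e6:.1f} MB")
```

Under those assumptions the 15 transcodes together take more space than the WAV master itself (about 115 MB against 42 MB), before any video is involved.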
6. MogileFS
• Is free.
• Runs on cheap hardware
• Cheaper than S3.
• Not so awesome if you aren’t LiveJournal
7. Data center design
• 8 amp racks. Seriously, WTF!?!?!
• Electricity is more expensive than servers, ergo rolling hardware upgrades trivially pay for themselves.
• Transit is really, really expensive.
• Worth buying fiber to other locations to peer if you need lots of bandwidth.
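A quick sketch of why an 8 amp rack hurts. The assumptions are ours:

- UK mains at 230 V
- a server drawing around 250 W under load
- only 80% of the breaker rating used, as a safety margin

```python
# How much kit fits in an 8 A rack?
# Assumptions (illustrative): 230 V mains, 250 W per loaded server,
# 80% of the breaker rating usable as a safety margin.
MAINS_VOLTS = 230
RACK_AMPS = 8
SAFETY_FACTOR = 0.8
SERVER_WATTS = 250

rack_watts = MAINS_VOLTS * RACK_AMPS              # nameplate power budget
usable_watts = rack_watts * SAFETY_FACTOR         # what you can actually draw
servers_per_rack = int(usable_watts // SERVER_WATTS)

print(rack_watts, usable_watts, servers_per_rack)
```

Under these assumptions that is about five busy servers per rack. This is why hardware that draws less power per unit of work pays for itself quickly.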
9. Web architecture
• App servers: Apache, apps under FastCGI, on port 81
• Varnish + ESI for caching, on port 80
• 1 Varnish per host, talking to all the Apaches
• 1 VIP per host
• Host fail: VIP transfers to another host
• Apache/app fail (or overload): Varnish rebalances/retries.
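The retry behaviour on this slide can be modelled in a few lines of Python: one cache talks to every Apache and retries elsewhere when one dies or is overloaded. This is a sketch of the idea, not Varnish configuration. The backend names and the fetch function are hypothetical.

```python
# Round-robin with retry across backends, modelled in plain Python.
# Not Varnish config: backend names and fake_fetch are hypothetical.
import itertools


class BackendDown(Exception):
    pass


class RoundRobinDirector:
    def __init__(self, backends):
        self.backends = list(backends)
        self._next = itertools.cycle(range(len(self.backends)))

    def fetch(self, path, fetch_fn):
        """Try each backend at most once, starting from the next in rotation."""
        start = next(self._next)
        errors = []
        for offset in range(len(self.backends)):
            backend = self.backends[(start + offset) % len(self.backends)]
            try:
                return fetch_fn(backend, path)
            except BackendDown as exc:
                errors.append(f"{backend}: {exc}")  # fall through to the next one
        raise BackendDown("all backends failed: " + "; ".join(errors))


def fake_fetch(backend, path):
    # Simulate one dead Apache.
    if backend == "app2:81":
        raise BackendDown("connection refused")
    return f"{backend} served {path}"


director = RoundRobinDirector(["app1:81", "app2:81", "app3:81"])
responses = [director.fetch("/", fake_fetch) for _ in range(3)]
print(responses)
```

Each request tries every backend at most once. A dead Apache therefore costs an extra attempt rather than a failed page.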
10. Web architecture (cont)
• Varnish doesn’t cache media, it just provides failover.
• nginx sends the hit to the FastCGI app.
• The app returns an X-Accel-Redirect header.
• nginx talks to MogileFS and handles delivery.
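Here is a minimal sketch of the app side of this pattern, as a Python WSGI application. The app only decides what to serve; nginx moves the bytes. It rests on two assumptions:

- nginx has an internal location mounted at /internal/ that fetches from storage.
- lookup_paths() is a hypothetical stand-in for a real MogileFS tracker query.

```python
# App-side half of X-Accel-Redirect delivery, as a WSGI app.
# Assumptions: an nginx internal location at /internal/; lookup_paths()
# is a hypothetical stand-in for a MogileFS tracker lookup.


def lookup_paths(key):
    """Hypothetical tracker lookup: map a file key to a storage path."""
    fake_index = {"track-123.mp3": "dev12/0/000/123/0000123456.fid"}
    return fake_index.get(key)


def app(environ, start_response):
    key = environ.get("PATH_INFO", "").lstrip("/")
    # Entitlement/authorisation checks would go here; the app never reads file bytes.
    path = lookup_paths(key)
    if path is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found\n"]
    # nginx sees this header and performs an internal redirect to that
    # location, so this (heavy) app process is free for the next request.
    start_response("200 OK", [("X-Accel-Redirect", "/internal/" + path)])
    return [b""]


# Usage: call the app directly with a fake start_response.
captured = {}


def start_response(status, headers):
    captured["status"], captured["headers"] = status, dict(headers)


app({"PATH_INFO": "/track-123.mp3"}, start_response)
print(captured)
```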
12. Storage architecture
• Lots of boxes with lots of disk.
• Storage boxes take on many additional roles (MogileFS tracker, memcache node, metal encoding, VMware, SOAP service).
• Not all the boxes do all the roles.
• All the roles can safely fall over and die.
• Which is good, as they do. Or the box falls over. Or a, then b.
14. WAV files
• WAV is a container format.
• Loosely defined.
• You can stuff XML documents in WAV files
• Some encoders (oh hai, flac) are very picky.
• ‘dirty’ and ‘clean’ WAV files.
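Because WAV is a RIFF container, ‘clean’ vs ‘dirty’ can be checked by walking its chunks. In this sketch, ‘clean’ means only the fmt and data chunks are present; that definition is our assumption for illustration. The example builds a small WAV with Python’s standard wave module, then appends an iXML chunk to make a dirty copy.

```python
# Walk RIFF chunks to tell "clean" WAVs from "dirty" ones.
# Assumption: "clean" = only 'fmt ' and 'data' chunks present.
import io
import struct
import wave


def riff_chunks(blob: bytes):
    """Yield (chunk_id, size) for every chunk inside a RIFF/WAVE container."""
    riff, _total, form = struct.unpack("<4sI4s", blob[:12])
    if riff != b"RIFF" or form != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(blob):
        chunk_id, size = struct.unpack("<4sI", blob[pos:pos + 8])
        yield chunk_id.decode("ascii", "replace"), size
        pos += 8 + size + (size & 1)  # chunks are padded to an even length


def is_clean(blob: bytes) -> bool:
    return {cid for cid, _ in riff_chunks(blob)} <= {"fmt ", "data"}


# Build a tiny clean WAV in memory...
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(b"\x00\x00" * 10)
clean = buf.getvalue()

# ...then a dirty copy with an XML chunk appended (odd length, so padded).
xml = b"<meta/>"
dirty_body = clean[12:] + b"iXML" + struct.pack("<I", len(xml)) + xml + b"\x00"
dirty = b"RIFF" + struct.pack("<I", 4 + len(dirty_body)) + b"WAVE" + dirty_body

print(is_clean(clean), is_clean(dirty))
```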
16. Win32
• We’re running ActiveState Perl for hysterical raisins.
• No XS modules
• Keep it as thin as possible
17. Encoding
[Diagram: HTTP nodes, an Encoding Service and an Uploading Service; media moves by GET & PUT, and the services talk SOAP. Components shown: Downloader, Encoder (mp3), Encoder (wma), Uploader; labels “Win32 & Local Disk” and “Unix”.]
18. Snakes On A Plane
• SOAP actually works OK here, as we control both ends.
• Old version of SOAP::Lite
• Wouldn’t recommend it for interoperability
19. Logging
• Used to be terribly hard to debug
• Push logs into syslog
• Aggregate in Splunk, time-correlated across encoding machines, web service machines, etc.
• Much easier to work out what happened.
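One way to push logs into syslog from Python is the standard library’s SysLogHandler. This sketch assumes a local syslog daemon on UDP port 514 that forwards to the aggregator. The service name and message fields are hypothetical. Putting host and service into every line is what makes time-correlating across machines practical.

```python
# Send app logs to syslog so a central aggregator can correlate them.
# Assumptions: syslog listening on UDP localhost:514, forwarding onward;
# the "encoder" service name is hypothetical.
import logging
import logging.handlers
import socket


def make_logger(service: str, address=("localhost", 514)) -> logging.Logger:
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(
        address=address, facility=logging.handlers.SysLogHandler.LOG_LOCAL0
    )
    # Tag every line with service, pid and host so aggregated logs stay attributable.
    handler.setFormatter(logging.Formatter(
        f"{service}[%(process)d] host={socket.gethostname()} %(levelname)s %(message)s"
    ))
    logger.addHandler(handler)
    return logger


log = make_logger("encoder")
log.info("transform started file_id=%s target=%s", 42, "wma")
```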
20. Hardware is shit
• When you have several hundred TB, the undetected bit error rate of magnetic media is actually significant.
• See also networks, memory, etc.
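A back-of-envelope estimate shows why this matters at scale. The error rate used here (1 in 10^15 bits read) is an assumption for illustration. Real figures vary by drive, controller and transport.

```python
# Expected silent bit errors when reading a large archive once.
# Assumption (illustrative only): 1 undetected error per 10**15 bits read.
ARCHIVE_TB = 300                   # "several hundred TB"
BITS = ARCHIVE_TB * 10**12 * 8     # decimal terabytes -> bits
ERROR_RATE = 1e-15                 # assumed errors per bit read

expected_errors = BITS * ERROR_RATE
print(f"{expected_errors:.1f} expected bad bits per full read")
```

So one full read of a few hundred TB should expect to hit a couple of bad bits, not zero.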
21. Things will always fail
• If you need reliability, you have to design it in from the start.
• Not only will you have (a lot of) hardware failures, all the software will break in unexpected ways. Let’s not talk about networks...
• Maybe you don’t need this...
22. Queueing
• We have work queues for different types of media (e.g. mp3/wma/aac etc.)
• In the database.
• Don’t do this.
23. MySQL sucks
• 1 type of JOIN (nested loop)
• No query rewriting
• Not enough stats for the planner to be sane
24. This can hurt
• File Transform table:
• Master (File)
• Result (File)
• Status (pending/complete/failed/running)
• TransformStep (from/to)
• Leads to bad join order, massive fail
26. How to fail
• SELECT all file transforms that lead to wma (millions).
• JOIN all files, ever (millions). Filter to find those in state ‘pending’.
• Starting from ‘all pending’ looks like a bad bet to the planner: the cardinality of ‘all wmas’ looks better than the cardinality of ‘all pending’.
• JOIN in the wrong order, nested loop, screwed...
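The fix is to drive the join from the small set: find pending wma work through an index first, then join to the files. This sketch uses SQLite (via Python’s sqlite3) rather than MySQL so it runs anywhere. The table and column names are hypothetical, loosely modelled on the File Transform table. The sketch does not reproduce MySQL’s planner behaviour.

```python
# Drive the join from the small "pending" set via a composite index.
# SQLite stands in for MySQL; schema names are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE file (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE file_transform (
    id        INTEGER PRIMARY KEY,
    master_id INTEGER REFERENCES file(id),
    result_id INTEGER REFERENCES file(id),
    status    TEXT CHECK (status IN ('pending','running','complete','failed')),
    target    TEXT  -- stands in for the TransformStep "to" format
);
-- Composite index: find pending wma work without scanning every wma transform.
CREATE INDEX ft_status_target ON file_transform (status, target);
""")

db.executemany("INSERT INTO file (id, name) VALUES (?, ?)",
               [(i, f"track{i}.wav") for i in range(1, 1001)])
db.executemany(
    "INSERT INTO file_transform (master_id, status, target) VALUES (?, ?, ?)",
    [(i, "pending" if i % 100 == 0 else "complete", "wma") for i in range(1, 1001)],
)

# Filter to the (small) pending set first, then join to file by primary key.
query = """
SELECT f.name
FROM file_transform ft
JOIN file f ON f.id = ft.master_id
WHERE ft.status = 'pending' AND ft.target = 'wma'
"""
pending = [row[0] for row in db.execute(query)]
plan = [row[-1] for row in db.execute("EXPLAIN QUERY PLAN " + query)]
print(len(pending), plan)
```

Printing the query plan shows which access path the planner actually chose. On MySQL, EXPLAIN plays the same role and is the first thing to check when a queue query goes slow.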
27. Queueing
• Did I mention queues in the DB suck?
• Even if you’re not screwing it up.
• Get a message queue (or at least an async job server)
• If your problem is simple - Gearman. If it’s harder, or you need interop - RabbitMQ.
28. Mutable state
• Mutable state is the enemy
• Too many things have read-write access.
• No idea how an object got into this state
29. Anemic domain model
Object-oriented programming (OOP) is a programming paradigm that uses "objects" – data structures consisting of data fields and methods together with their interactions – to design applications and computer programs. Programming techniques may include features such as data abstraction, encapsulation, modularity, polymorphism, and inheritance.
30. Anemic domain model
• Superset of too much mutable state
• Able to create invalid objects
• Able to make previously valid objects invalid
• Violation of the encapsulation and information hiding principles.
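The opposite of an anemic model is an object that owns its state and only allows legal transitions, so nothing can make a valid object invalid. The statuses come from the File Transform table. The class and method names are assumptions, as is the rule that failed work may be retried.

```python
# A transform that enforces its own state machine instead of exposing
# a freely writable status field. Names and retry rule are hypothetical.
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETE = "complete"
    FAILED = "failed"


_ALLOWED = {
    Status.PENDING: {Status.RUNNING},
    Status.RUNNING: {Status.COMPLETE, Status.FAILED},
    Status.FAILED: {Status.PENDING},   # assumed: failed work may be retried
    Status.COMPLETE: set(),
}


class FileTransform:
    def __init__(self, master_id: int, target: str):
        self._master_id = master_id
        self._target = target
        self._status = Status.PENDING
        self._result_id = None
        self.history = [Status.PENDING]   # records how we got here

    @property
    def status(self) -> Status:
        return self._status              # read-only: there is no setter

    def _move(self, new: Status) -> None:
        if new not in _ALLOWED[self._status]:
            raise ValueError(f"illegal transition {self._status.value} -> {new.value}")
        self._status = new
        self.history.append(new)

    def start(self) -> None:
        self._move(Status.RUNNING)

    def complete(self, result_id: int) -> None:
        self._move(Status.COMPLETE)      # validate first, then record the result
        self._result_id = result_id

    def fail(self) -> None:
        self._move(Status.FAILED)


t = FileTransform(master_id=1, target="wma")
t.start()
t.complete(result_id=2)
try:
    t.start()                            # complete -> running is not allowed
except ValueError as exc:
    error = str(exc)
print(t.status.value, error)
```

The history list also answers “how did this object get into this state?”, the complaint from the mutable state slide.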
31. scripts
• Lots of our business logic was in scripts that manipulated objects
• You need people to run scripts (in screen sessions)
• Ewwww, ewwwww.
32. Jobs
• Moved to a job based approach
• Jobs started by file creation, or by changing the state of something in a web app
• Jobs sent via message queueing
• Results go via message queueing
• Jobs trigger other jobs
33. Jobs Example
• Validate XLS file supplied with order.
• Valid files trigger another job to create objects for each thing in the XLS
• This then triggers another job to create transforms, which are then done...
• ... etc ...
• Can’t do this workflow in a web request.
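A sketch of the job chain above, with an in-process deque standing in for a real message queue such as Gearman or RabbitMQ. Job names, payloads and the validity rule are hypothetical. Each handler does one small step and enqueues the next, rather than one script (or web request) doing everything.

```python
# "Jobs trigger other jobs", with a deque standing in for a message queue.
# Job names, payloads and the validity rule are hypothetical.
from collections import deque

queue = deque()
log = []


def enqueue(job, **payload):
    queue.append((job, payload))


def validate_xls(order_id, rows):
    if all("track" in row for row in rows):        # assumed validity rule
        enqueue("create_objects", order_id=order_id, rows=rows)
    else:
        log.append(f"order {order_id}: invalid XLS, nothing triggered")


def create_objects(order_id, rows):
    for row in rows:
        enqueue("create_transforms", order_id=order_id, track=row["track"])


def create_transforms(order_id, track):
    for fmt in ("mp3", "wma"):
        log.append(f"order {order_id}: queued {fmt} transform for {track}")


HANDLERS = {f.__name__: f for f in (validate_xls, create_objects, create_transforms)}


def run_worker():
    """Drain the queue; each handler may enqueue follow-up jobs."""
    while queue:
        job, payload = queue.popleft()
        HANDLERS[job](**payload)


enqueue("validate_xls", order_id=7, rows=[{"track": "A"}, {"track": "B"}])
run_worker()
print("\n".join(log))
```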
34. Jobs Future
• More automation of things people run scripts for.
• Automatic job regeneration (you will lose messages).
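Since messages do get lost, automatic job regeneration can be a periodic sweep that re-enqueues work which has sat in ‘pending’ for too long. The record layout, the one-hour timeout and the enqueue callback are assumptions. The sweep also relies on jobs being idempotent, so that a duplicate message is harmless.

```python
# Periodic sweep that regenerates jobs whose messages were probably lost.
# Assumptions: record layout, timeout, enqueue callback; jobs are idempotent.
import time
from dataclasses import dataclass


@dataclass
class Transform:
    id: int
    status: str
    updated_at: float   # unix timestamp of last state change


def regenerate_stale_jobs(transforms, enqueue, max_age_s=3600, now=None):
    """Re-enqueue pending transforms untouched for max_age_s; return their ids."""
    now = time.time() if now is None else now
    resent = []
    for t in transforms:
        if t.status == "pending" and now - t.updated_at > max_age_s:
            enqueue(t.id)          # safe only because jobs are idempotent
            t.updated_at = now     # don't resend again on the very next sweep
            resent.append(t.id)
    return resent


sent = []
rows = [
    Transform(1, "pending", updated_at=0),      # stale: message probably lost
    Transform(2, "pending", updated_at=5000),   # recent: leave alone
    Transform(3, "complete", updated_at=0),     # done: nothing to do
]
resent = regenerate_stale_jobs(rows, sent.append, max_age_s=3600, now=6000)
print(resent)
```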
35. Lava flow
• Old (possibly unclean/invalid) data
• Old (unused/unmaintained) code
• “What harm does it do”
37. Data consistency
• This should theoretically be the same thing as relational integrity.
• In practice...
38. Mumble View Crap
• Too much logic in templates
• Copy & paste
• Business objects viewed as unchangeable
• Deleted 3000 lines from 2 simple workflows. This fixed a dozen bugs.
39. Tangram
• No LEFT JOIN
• Displaying a product list becomes an N+1 query problem.
• OUCH
• Keep it stupid - put the entire DB hot in memcache!
40. Don’t do web design
• You are a programmer
• Make people pay for a design/CSS/HTML person
• Work with them
• Be happy
41. Love your sysadmins
• Help them out.
• Build packages, or local::lib setups, or something
• Keep everything in revision control
• Allow things to be sensibly configured.
• DOCUMENT THE POSSIBLE SETTINGS
• Use systems management - Puppet?
42. Love your logs
• Active feedback
• Aggregate in Splunk
• Actively prune useless stuff
• Actively add useful stuff after a production incident
43. ESI
• Is really awesome
• Makes the pain go away
• PURGE requests
• Keep everything hot all the time
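Here is a toy model of ESI plus PURGE:

- A page template contains esi:include tags.
- Each fragment is cached on its own.
- Purging one URL invalidates only that fragment, while the rest of the page stays hot.

This illustrates the idea in Python; it is not how Varnish is implemented. The fragment URLs and render function are hypothetical, and the regex is nowhere near a full ESI parser.

```python
# Toy ESI assembler with per-fragment caching and PURGE.
# Fragment URLs and the render function are hypothetical.
import re

ESI_INCLUDE = re.compile(r'<esi:include\s+src="([^"]+)"\s*/>')


class FragmentCache:
    def __init__(self, render_fragment):
        self.render_fragment = render_fragment   # backend call on a miss
        self.store = {}
        self.misses = 0

    def get(self, url):
        if url not in self.store:
            self.misses += 1
            self.store[url] = self.render_fragment(url)
        return self.store[url]

    def purge(self, url):
        """Equivalent of an HTTP PURGE for one cached object."""
        self.store.pop(url, None)

    def assemble(self, template):
        return ESI_INCLUDE.sub(lambda m: self.get(m.group(1)), template)


versions = {"/fragments/basket": "basket: 0 items", "/fragments/chart": "top 10"}
cache = FragmentCache(lambda url: versions[url])

page = '<div><esi:include src="/fragments/chart"/> | <esi:include src="/fragments/basket"/></div>'
first = cache.assemble(page)
versions["/fragments/basket"] = "basket: 1 item"   # data changes in the app...
cache.purge("/fragments/basket")                   # ...so the app PURGEs that fragment
second = cache.assemble(page)
print(first, second, cache.misses, sep="\n")
```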
44. memcache everything
• Keep the entire database hot in memcache
• We mostly ask trivial questions, so just cache those paths.
• 30 GB of RAM isn’t actually much (3 boxes...)
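The “trivial questions” are mostly fetch-by-key, which suits the cache-aside pattern. This is also what takes the sting out of the Tangram product-list problem. In this sketch a dict stands in for memcached, and the key format and loader are hypothetical. Most memcached clients expose a similar get/set pair.

```python
# Cache-aside for fetch-by-key lookups; a dict stands in for memcached.
# Key format and loader are hypothetical.
import json


class DictCache:
    """Stand-in for a memcached client: get() returns None on a miss."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value


db_hits = 0


def load_product_from_db(product_id):
    global db_hits
    db_hits += 1
    return {"id": product_id, "title": f"Product {product_id}"}


def get_product(cache, product_id):
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    row = load_product_from_db(product_id)
    cache.set(key, json.dumps(row))   # serialise: memcached stores strings/bytes
    return row


cache = DictCache()
results = [get_product(cache, 42) for _ in range(3)]
print(results[0], "db hits:", db_hits)
```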
45. memcache
• IS A CACHE
• Use sequential port numbers and CNAMEs
• E.g. cache0:11210, cache1:11211, cache2:11212 etc.
• Run several per machine
• Allows you to scale capacity and rebalance without flushing the entire cache.
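Numbered instances help because clients hash keys over a fixed list of logical instances (cache0, cache1, ...). A CNAME then maps each instance to whichever box currently hosts it. Moving an instance to a new box only changes DNS, so no key moves to a different instance and only the moved instances start cold. In this sketch a dict stands in for DNS; the CRC32 hash, instance count and host names are assumptions.

```python
# Fixed logical cache instances + CNAME-style indirection to physical boxes.
# Hash function, instance count and host names are assumptions.
import zlib

INSTANCES = [(f"cache{i}", 11210 + i) for i in range(6)]   # cache0:11210 ...

# Stand-in for DNS: instance CNAME -> physical host. Several per machine.
cname = {
    "cache0": "boxA", "cache1": "boxA", "cache2": "boxA",
    "cache3": "boxB", "cache4": "boxB", "cache5": "boxB",
}


def instance_for(key: str):
    """Map a key to a logical instance; depends only on the fixed list."""
    return INSTANCES[zlib.crc32(key.encode()) % len(INSTANCES)]


def address_for(key: str) -> str:
    name, port = instance_for(key)
    return f"{cname[name]}:{port}"    # resolved via "DNS" at connect time


keys = [f"product:{i}" for i in range(1000)]
before = {k: instance_for(k) for k in keys}

# Add capacity: a new box takes over cache2 and cache5 by repointing CNAMEs.
cname["cache2"] = cname["cache5"] = "boxC"
after = {k: instance_for(k) for k in keys}

moved_hosts = sum(address_for(k).startswith("boxC") for k in keys)
print("instances changed:", sum(before[k] != after[k] for k in keys))
print("keys now served by boxC:", moved_hosts)
```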
46. Don’t push bytes
• X-Sendfile and X-Accel-Redirect
• I already talked about file delivery like this
• Using 100 MB of RAM to proxy web requests does not scale.
47. Test everything
• Redundant systems need testing
• You’ll still die unexpectedly in production
• If you can manage it, make responsibility for deployment SEP (Somebody Else’s Problem).