2. Background
• WAS:
– Key participant on the design team (10/2009–04/2010)
– Vocal proponent of the final consensus architecture
• WAS NOT:
– Part of the actual implementation
– Involved in the FB Chat backend
• IS NOT: a Facebook employee (since 08/2011)
3. Problem
• 1 billion users
• Volume:
– 25 messages/day × 4 KB (excluding attachments)
– 10 TB per day
• Indexes/Summaries:
– Keyword/ThreadId/Label index
– Label/Thread message counts
4. Problem – Cont.
• Must have: cross-continent copies
– Ideally, concurrent updates across regions
– At least disaster-recoverable
• No single point of failure for the entire service
– FB has downtime on a few MySQL databases per day
– No one cares
• Cannot, cannot, cannot lose data
5. Solved Problem
• Attachment store
– Haystack
– Stores FB photos
– Optimized for immutable data
• Hiring the best programmers available
– Choose the best design, not the best current implementation
– But get things done fast
6. Write Throughput
• Disk:
– Need a log-structured container
– Can store small messages inline
– Can store the keyword index as well
– What about read performance?
• Flash/Memory:
– Expensive
– Metadata only
7. LSM Trees
• High write throughput (see the sketch below)
• Recent data is clustered
– Nice! Fits the mailbox access pattern
• Inherently snapshotted
– Backups/DR should be easy
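A minimal Java sketch of the LSM idea, assuming a toy in-memory memtable and a fixed flush threshold (both hypothetical; this is illustrative, not HBase code). Writes stay fast and sequential, and recent keys sit at the top of the read path, which is the clustering property above:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.TreeMap;

/**
 * Toy LSM tree: writes go to a sorted in-memory memtable; when it fills, it
 * is frozen into an immutable segment. Reads check the memtable first, then
 * segments newest-first, so recent data is found quickly.
 */
public class TinyLsm {
    private static final int MEMTABLE_LIMIT = 4; // tiny, for demonstration

    private TreeMap<String, String> memtable = new TreeMap<>();
    // Newest segment first; each segment is immutable once frozen.
    private final Deque<TreeMap<String, String>> segments = new ArrayDeque<>();

    public void put(String key, String value) {
        memtable.put(key, value);
        if (memtable.size() >= MEMTABLE_LIMIT) {
            segments.addFirst(memtable);   // freeze: a sequential "flush"
            memtable = new TreeMap<>();
        }
    }

    public String get(String key) {
        String v = memtable.get(key);
        if (v != null) return v;
        for (TreeMap<String, String> seg : segments) { // newest first
            v = seg.get(key);
            if (v != null) return v;  // read penalty grows with segment count
        }
        return null;
    }

    public static void main(String[] args) {
        TinyLsm lsm = new TinyLsm();
        for (int i = 0; i < 10; i++) lsm.put("msg:" + i, "body-" + i);
        System.out.println(lsm.get("msg:9")); // recent: memtable hit
        System.out.println(lsm.get("msg:0")); // old: scans older segments
    }
}
```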
8. Reads?
• Write-optimized ⇒ read penalty
• Cache the working set in the App Server (see the sketch below)
– At most one App Server per user
– All mailbox updates go via the App Server
– Serve reads directly from the cache
• Cold start
– LSM-tree clustering should make retrieving recent messages/threads fast
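A hedged Java sketch of the per-user App Server cache. MailboxStore, loadRecent, and append are hypothetical names standing in for the HBase-backed mailbox; the point is that single ownership lets the cache stay exact without any invalidation protocol:

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Because every mailbox update flows through the one server that owns the
 * user, the in-memory copy never goes stale and reads skip the store.
 * loadRecent is assumed to return a mutable list of recent messages.
 */
public class AppServerCache {
    interface MailboxStore {                    // assumed storage interface
        List<String> loadRecent(String user);   // cold start: recent threads
        void append(String user, String msg);   // durable write
    }

    private final MailboxStore store;
    private final ConcurrentHashMap<String, List<String>> cache =
            new ConcurrentHashMap<>();

    AppServerCache(MailboxStore store) { this.store = store; }

    /** All updates for a user come through here, keeping the cache exact. */
    void deliver(String user, String msg) {
        store.append(user, msg);                // write-through to the store
        cache.computeIfAbsent(user, store::loadRecent).add(msg);
    }

    /** Reads are served from memory; a cold start falls back to the store. */
    List<String> readMailbox(String user) {
        return cache.computeIfAbsent(user, store::loadRecent);
    }
}
```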
10. Cassandra vs. HBase (abridged)
• Tested them out (c. 2010)
– HBase held up; (FB-internal) Cassandra didn't
• Tried to understand the internals
– HBase held up; Cassandra didn't
• Really, really trusted HDFS
– It had stored PBs of data for years with no loss
• Missing features in HBase/HDFS can be added
11. Disaster Recovery (HBase)
1. Ship the HLog to the remote data center in real time
2. Update the remote snapshot every day
3. Reset the remote HLog
• No need to synchronize #2 and #3 perfectly
– HLog replay is idempotent (see the sketch below)
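An illustrative Java sketch (not HBase's actual replay code) of why replay is idempotent: edits are overwrites applied in sequence order, so replaying a shipped log twice, or replaying edits already captured in the snapshot, converges to the same state:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * Each edit is an overwrite tagged with a sequence id, and replay applies
 * edits in sequence order. Duplicate or overlapping replay leaves the table
 * unchanged -- so the daily snapshot (#2) and the HLog reset (#3) need no
 * precise synchronization.
 */
public class LogReplay {
    record Edit(long seqId, String row, String value) {}

    static void replay(Map<String, String> table, List<Edit> log) {
        log.stream()
           .sorted(Comparator.comparingLong(Edit::seqId)) // deterministic order
           .forEach(e -> table.put(e.row(), e.value()));  // put = overwrite
    }

    public static void main(String[] args) {
        Map<String, String> remote = new TreeMap<>();
        List<Edit> shipped = List.of(
                new Edit(1, "u1:t9", "hello"),
                new Edit(2, "u1:t9", "hello, world"));
        replay(remote, shipped);
        replay(remote, shipped);           // duplicate replay: same state
        System.out.println(remote);        // {u1:t9=hello, world}
    }
}
```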
13. What about Flash?
• In HBase:
– Store recent LSM-tree segments in flash
– Store the HBase block cache there
– Inefficient in Cassandra! (each of the 3 replicas keeps its own LSM tree and cache, tripling the hot set)
• In the App Server:
– Page the user cache in/out of flash
14. Lingering Doubts
• Small components vs. big systems
– Small components are better
– Is HDFS too big?
• Separate the DataNode, BlockManager, and NameNode
• HBase doesn't need the NameNode
• Gave up on cross-DC concurrency
– Partition users if required
– A global user→DC registry needs to deal with partitions and conflict resolution
– TBD
16. Cassandra: Flat Earth
• The world is hierarchical
– PCI bus, rack, data center, region, continent, …
– The odds of a partition differ at each level
vs.
• A symmetric hash ring spanning continents
– The odds of a partition are treated as constant
17. Cassandra – No Centralization
• The world has central (but HA) tiers:
– DNS servers, core switches, the memcache tier, …
• Cassandra: all servers are independent
– No authoritative commit log or snapshot
– "Do Repeat Your Reads" (DRYR) paradigm
18. Philosophies have Consequences
• Consistent reads are expensive (see the sketch below)
– N=3, R=2, W=2
– Ugh: why are reads expensive in a write-optimized system?
• Is consistency foolproof?
– Edge cases with failed writes
– The Internet is still debating them
– If science has bugs, imagine code!
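A generic Dynamo-style quorum sketch in Java (the arithmetic is standard textbook material; nothing here is Cassandra's code). It shows the R + W > N overlap rule and why a consistent read must contact R replicas and reconcile versions:

```java
import java.util.Comparator;
import java.util.List;

/**
 * With N=3 replicas, W=2 and R=2 satisfy R + W > N, so every read quorum
 * overlaps every write quorum -- but the read must now contact two replicas
 * and reconcile by timestamp, which is why reads get expensive.
 */
public class Quorum {
    record Version(long timestamp, String value) {}

    static boolean readsSeeLatestWrite(int n, int r, int w) {
        return r + w > n;   // overlap condition for consistent quorum reads
    }

    /** A read touches R replicas and keeps the newest version it sees. */
    static Version read(List<Version> repliesFromRReplicas) {
        return repliesFromRReplicas.stream()
                .max(Comparator.comparingLong(Version::timestamp))
                .orElseThrow();
    }

    public static void main(String[] args) {
        System.out.println(readsSeeLatestWrite(3, 2, 2)); // true
        System.out.println(readsSeeLatestWrite(3, 1, 2)); // false: stale reads
        Version latest = read(List.of(
                new Version(10, "old"), new Version(42, "new")));
        System.out.println(latest.value());               // "new"
    }
}
```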
19. Distributed Storage vs. Database
• How do you recover a failed block or disk?
• Distributed storage (HDFS):
– Simple: find the other replicas of each block (see the sketch below)
• Distributed database (Cassandra):
– A ton of my databases lived on that drive
– Hard: let's merge all the affected databases
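A hypothetical Java sketch of the block-store recovery model (the replica map and names are invented for illustration): a dead disk decomposes into independent blocks, each recoverable by copying any surviving identical replica, with no merge step:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * A failed disk is just a set of blocks, and every block still has
 * byte-identical replicas on other nodes, so recovery reduces to
 * independent copies -- no merging of database state, as a per-node
 * LSM database would require.
 */
public class BlockRecovery {
    /** Plan one copy per lost block: blockId -> source node to copy from. */
    static Map<String, String> recoveryPlan(Set<String> blocksOnDeadDisk,
                                            Map<String, List<String>> replicas) {
        Map<String, String> plan = new HashMap<>();
        for (String block : blocksOnDeadDisk) {
            List<String> survivors = replicas.getOrDefault(block, List.of());
            if (!survivors.isEmpty()) {
                plan.put(block, survivors.get(0)); // any replica is identical
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<String, String> plan = recoveryPlan(
                Set.of("blk_1", "blk_2"),
                Map.of("blk_1", List.of("nodeB", "nodeC"),
                       "blk_2", List.of("nodeC")));
        System.out.println(plan); // e.g. {blk_1=nodeB, blk_2=nodeC}
    }
}
```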
20. Eventual Consistency
• The read-modify-write pattern is problematic (see the sketch below)
1. Read the value
2. Apply business logic
3. Write the value
– A stale read leads to junk
• What about atomic increments?
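A generic Java sketch of the hazard (no Cassandra API involved): if step 1 returns a stale value, step 3 overwrites a concurrent update. A compare-and-set loop catches the staleness; last-write-wins storage cannot:

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Blind read-modify-write loses updates when the read is stale; a CAS loop
 * retries until the value it based its logic on is still current.
 */
public class ReadModifyWrite {
    public static void main(String[] args) {
        AtomicInteger unreadCount = new AtomicInteger(5);

        // Blind RMW: two writers both read 5 and both write 6 -- one lost.
        int stale = unreadCount.get();
        unreadCount.set(stale + 1);            // writer A
        unreadCount.set(stale + 1);            // writer B, same stale read
        System.out.println(unreadCount.get()); // 6, not the expected 7

        // CAS loop: the write succeeds only if the read is still current.
        unreadCount.set(5);
        for (int writer = 0; writer < 2; writer++) {
            int cur;
            do {
                cur = unreadCount.get();                         // step 1
            } while (!unreadCount.compareAndSet(cur, cur + 1));  // steps 2-3
        }
        System.out.println(unreadCount.get()); // 7: no update lost
    }
}
```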
21. Conflict Resolution
• Conflicts between increments are easy to resolve (see the sketch below)
• Imagine multi-row transactions
– Pointless to resolve conflicts at the row level
• Solve conflicts at the highest possible layer
– e.g., a transaction monitor
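A Java sketch of why increments are the easy case, using a counter kept as per-replica totals that merge by taking the maximum of each entry (a G-counter-style structure; the slide names no mechanism, so this is an assumption). Multi-row transactions admit no such mechanical merge, hence the transaction-monitor suggestion:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Per-replica counts only ever grow, so two divergent copies merge by
 * taking the max of each replica's entry -- no increment is lost and no
 * application logic is consulted.
 */
public class CounterMerge {
    /** Merge two divergent copies of the same counter. */
    static Map<String, Long> merge(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> out = new HashMap<>(a);
        b.forEach((replica, count) -> out.merge(replica, count, Math::max));
        return out;
    }

    static long value(Map<String, Long> counter) {
        return counter.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        Map<String, Long> dc1 = new HashMap<>(Map.of("dc1", 3L));
        Map<String, Long> dc2 = new HashMap<>(Map.of("dc1", 1L, "dc2", 2L));
        System.out.println(value(merge(dc1, dc2))); // 5: both sides kept
    }
}
```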
22. How did it work out?
• A ton of missing HBase/HDFS features were added
– Bloom filters, NameNode HA
– Remote HLog shipping
– A modified block-placement policy
– Sticky regions
– An improved block cache
– …
• User → App Server mapping via ZooKeeper (see the sketch below)
• The App Server design worked out
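The slides confirm ZooKeeper carried the user → App Server mapping but not how; the Java sketch below is one plausible scheme, with the /users path and the ephemeral-node claim protocol as assumptions. An ephemeral znode per user records the owner and enforces "at most one App Server per user", and the claim disappears automatically if the owner's session dies:

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

/**
 * Hypothetical user -> App Server registry. Assumes the persistent parent
 * znode /users already exists. A second claim for the same user fails with
 * NODEEXISTS, so ownership is exclusive.
 */
public class UserRegistry {
    private final ZooKeeper zk;

    UserRegistry(ZooKeeper zk) { this.zk = zk; }

    /** Try to claim a user for this server; false if another server owns it. */
    boolean claim(String user, String serverAddr)
            throws KeeperException, InterruptedException {
        try {
            zk.create("/users/" + user,
                    serverAddr.getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;
        } catch (KeeperException.NodeExistsException alreadyOwned) {
            return false;
        }
    }

    /** Look up which App Server currently owns the user, or null if none. */
    String lookup(String user) throws KeeperException, InterruptedException {
        try {
            byte[] data = zk.getData("/users/" + user, false, null);
            return new String(data, StandardCharsets.UTF_8);
        } catch (KeeperException.NoNodeException unassigned) {
            return null;
        }
    }
}
```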