2. Background
• WAS:
– Key participant on the design team (10/2009–04/2010)
– Vocal proponent of the final consensus architecture
• WAS NOT:
– Part of the actual implementation
– Involved in the FB Chat backend
• IS NOT: a Facebook employee (since 08/2011)
3. Problem
• 1 billion users
• Volume:
– 25 messages/day × 4 KB (excluding attachments)
– 10 TB per day
• Indexes/Summaries:
– Keyword/ThreadId/Label index
– Label/Thread message counts
4. Problem – Cont.
• Must have: cross-continent copies
– Ideally, concurrent updates across regions
– At least disaster-recoverable
• No single point of failure for the entire service
– FB has downtime on a few MySQL databases per day
– No one cares
• Cannot, cannot, cannot lose data
5. Solved Problem
• Attachment store
– Haystack
– Stores FB photos
– Optimized for immutable data
• Hiring the best programmers available
– Choose the best design, not the best current implementation
– But get things done fast
6. Write Throughput
• Disk:
– Need a log-structured container
– Can store small messages inline
– Can store the keyword index as well
– What about read performance?
• Flash/Memory:
– Expensive
– Metadata only
7. LSM Trees
• High write throughput (see the sketch below)
• Recent data is clustered
– Nice! Fits the mailbox access pattern
• Inherently snapshotted
– Backups/DR should be easy
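A minimal Java sketch of the LSM idea, assuming a toy in-memory memtable and a fixed flush threshold (both hypothetical; this is illustrative, not HBase code). Writes stay fast and sequential, and recent keys sit at the top of the read path, which is the clustering property above:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.TreeMap;

/**
 * Toy LSM tree: writes go to a sorted in-memory memtable; when it fills, it
 * is frozen into an immutable segment. Reads check the memtable first, then
 * segments newest-first, so recent data is found quickly.
 */
public class TinyLsm {
    private static final int MEMTABLE_LIMIT = 4; // tiny, for demonstration

    private TreeMap<String, String> memtable = new TreeMap<>();
    // Newest segment first; each segment is immutable once frozen.
    private final Deque<TreeMap<String, String>> segments = new ArrayDeque<>();

    public void put(String key, String value) {
        memtable.put(key, value);
        if (memtable.size() >= MEMTABLE_LIMIT) {
            segments.addFirst(memtable);   // freeze: a sequential "flush"
            memtable = new TreeMap<>();
        }
    }

    public String get(String key) {
        String v = memtable.get(key);
        if (v != null) return v;
        for (TreeMap<String, String> seg : segments) { // newest first
            v = seg.get(key);
            if (v != null) return v;  // read penalty grows with segment count
        }
        return null;
    }

    public static void main(String[] args) {
        TinyLsm lsm = new TinyLsm();
        for (int i = 0; i < 10; i++) lsm.put("msg:" + i, "body-" + i);
        System.out.println(lsm.get("msg:9")); // recent: memtable hit
        System.out.println(lsm.get("msg:0")); // old: scans older segments
    }
}
```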
8. Reads?
• Write-optimized ⇒ read penalty
• Cache the working set in the App Server (see the sketch below)
– At most one App Server per user
– All mailbox updates go via the App Server
– Serve reads directly from the cache
• Cold start
– LSM-tree clustering should make retrieving recent messages/threads fast
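A hedged Java sketch of the per-user App Server cache. MailboxStore, loadRecent, and append are hypothetical names standing in for the HBase-backed mailbox; the point is that single ownership lets the cache stay exact without any invalidation protocol:

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Because every mailbox update flows through the one server that owns the
 * user, the in-memory copy never goes stale and reads skip the store.
 * loadRecent is assumed to return a mutable list of recent messages.
 */
public class AppServerCache {
    interface MailboxStore {                    // assumed storage interface
        List<String> loadRecent(String user);   // cold start: recent threads
        void append(String user, String msg);   // durable write
    }

    private final MailboxStore store;
    private final ConcurrentHashMap<String, List<String>> cache =
            new ConcurrentHashMap<>();

    AppServerCache(MailboxStore store) { this.store = store; }

    /** All updates for a user come through here, keeping the cache exact. */
    void deliver(String user, String msg) {
        store.append(user, msg);                // write-through to the store
        cache.computeIfAbsent(user, store::loadRecent).add(msg);
    }

    /** Reads are served from memory; a cold start falls back to the store. */
    List<String> readMailbox(String user) {
        return cache.computeIfAbsent(user, store::loadRecent);
    }
}
```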
10. Cassandra vs. HBase (abridged)
• Tested them out (c. 2010)
– HBase held up; (FB-internal) Cassandra didn't
• Tried to understand the internals
– HBase held up; Cassandra didn't
• Really, really trusted HDFS
– It had stored PBs of data for years with no loss
• Missing features in HBase/HDFS can be added
11. Disaster Recovery (HBase)
1. Ship the HLog to the remote data center in real time
2. Update the remote snapshot every day
3. Reset the remote HLog
• No need to synchronize #2 and #3 perfectly
– HLog replay is idempotent (see the sketch below)
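An illustrative Java sketch (not HBase's actual replay code) of why replay is idempotent: edits are overwrites applied in sequence order, so replaying a shipped log twice, or replaying edits already captured in the snapshot, converges to the same state:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * Each edit is an overwrite tagged with a sequence id, and replay applies
 * edits in sequence order. Duplicate or overlapping replay leaves the table
 * unchanged -- so the daily snapshot (#2) and the HLog reset (#3) need no
 * precise synchronization.
 */
public class LogReplay {
    record Edit(long seqId, String row, String value) {}

    static void replay(Map<String, String> table, List<Edit> log) {
        log.stream()
           .sorted(Comparator.comparingLong(Edit::seqId)) // deterministic order
           .forEach(e -> table.put(e.row(), e.value()));  // put = overwrite
    }

    public static void main(String[] args) {
        Map<String, String> remote = new TreeMap<>();
        List<Edit> shipped = List.of(
                new Edit(1, "u1:t9", "hello"),
                new Edit(2, "u1:t9", "hello, world"));
        replay(remote, shipped);
        replay(remote, shipped);           // duplicate replay: same state
        System.out.println(remote);        // {u1:t9=hello, world}
    }
}
```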
13. What about Flash?
• In HBase:
– Store recent LSM-tree segments in flash
– Store the HBase block cache there
– Inefficient in Cassandra! (each of the 3 replicas keeps its own LSM tree and cache, tripling the hot set)
• In the App Server:
– Page the user cache in/out of flash
14. Lingering Doubts
• Small components vs. big systems
– Small components are better
– Is HDFS too big?
• Separate the DataNode, BlockManager, and NameNode
• HBase doesn't need the NameNode
• Gave up on cross-DC concurrency
– Partition users if required
– A global user→DC registry needs to deal with partitions and conflict resolution
– TBD
16. Cassandra: Flat Earth
• The world is hierarchical
– PCI bus, rack, data center, region, continent, …
– The odds of a partition differ at each level
vs.
• A symmetric hash ring spanning continents
– The odds of a partition are treated as constant
17. Cassandra – No Centralization
• The world has central (but HA) tiers:
– DNS servers, core switches, the memcache tier, …
• Cassandra: all servers are independent
– No authoritative commit log or snapshot
– "Do Repeat Your Reads" (DRYR) paradigm
18. Philosophies have Consequences
• Consistent reads are expensive (see the sketch below)
– N=3, R=2, W=2
– Ugh: why are reads expensive in a write-optimized system?
• Is consistency foolproof?
– Edge cases with failed writes
– The Internet is still debating them
– If science has bugs, imagine code!
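A generic Dynamo-style quorum sketch in Java (the arithmetic is standard textbook material; nothing here is Cassandra's code). It shows the R + W > N overlap rule and why a consistent read must contact R replicas and reconcile versions:

```java
import java.util.Comparator;
import java.util.List;

/**
 * With N=3 replicas, W=2 and R=2 satisfy R + W > N, so every read quorum
 * overlaps every write quorum -- but the read must now contact two replicas
 * and reconcile by timestamp, which is why reads get expensive.
 */
public class Quorum {
    record Version(long timestamp, String value) {}

    static boolean readsSeeLatestWrite(int n, int r, int w) {
        return r + w > n;   // overlap condition for consistent quorum reads
    }

    /** A read touches R replicas and keeps the newest version it sees. */
    static Version read(List<Version> repliesFromRReplicas) {
        return repliesFromRReplicas.stream()
                .max(Comparator.comparingLong(Version::timestamp))
                .orElseThrow();
    }

    public static void main(String[] args) {
        System.out.println(readsSeeLatestWrite(3, 2, 2)); // true
        System.out.println(readsSeeLatestWrite(3, 1, 2)); // false: stale reads
        Version latest = read(List.of(
                new Version(10, "old"), new Version(42, "new")));
        System.out.println(latest.value());               // "new"
    }
}
```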
19. Distributed Storage vs. Database
• How do you recover a failed block or disk?
• Distributed storage (HDFS):
– Simple: find the other replicas of each block (see the sketch below)
• Distributed database (Cassandra):
– A ton of my databases lived on that drive
– Hard: let's merge all the affected databases
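A hypothetical Java sketch of the block-store recovery model (the replica map and names are invented for illustration): a dead disk decomposes into independent blocks, each recoverable by copying any surviving identical replica, with no merge step:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * A failed disk is just a set of blocks, and every block still has
 * byte-identical replicas on other nodes, so recovery reduces to
 * independent copies -- no merging of database state, as a per-node
 * LSM database would require.
 */
public class BlockRecovery {
    /** Plan one copy per lost block: blockId -> source node to copy from. */
    static Map<String, String> recoveryPlan(Set<String> blocksOnDeadDisk,
                                            Map<String, List<String>> replicas) {
        Map<String, String> plan = new HashMap<>();
        for (String block : blocksOnDeadDisk) {
            List<String> survivors = replicas.getOrDefault(block, List.of());
            if (!survivors.isEmpty()) {
                plan.put(block, survivors.get(0)); // any replica is identical
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<String, String> plan = recoveryPlan(
                Set.of("blk_1", "blk_2"),
                Map.of("blk_1", List.of("nodeB", "nodeC"),
                       "blk_2", List.of("nodeC")));
        System.out.println(plan); // e.g. {blk_1=nodeB, blk_2=nodeC}
    }
}
```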
20. Eventual Consistency
• The read-modify-write pattern is problematic (see the sketch below)
1. Read the value
2. Apply business logic
3. Write the value
– A stale read leads to junk
• What about atomic increments?
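A generic Java sketch of the hazard (no Cassandra API involved): if step 1 returns a stale value, step 3 overwrites a concurrent update. A compare-and-set loop catches the staleness; last-write-wins storage cannot:

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Blind read-modify-write loses updates when the read is stale; a CAS loop
 * retries until the value it based its logic on is still current.
 */
public class ReadModifyWrite {
    public static void main(String[] args) {
        AtomicInteger unreadCount = new AtomicInteger(5);

        // Blind RMW: two writers both read 5 and both write 6 -- one lost.
        int stale = unreadCount.get();
        unreadCount.set(stale + 1);            // writer A
        unreadCount.set(stale + 1);            // writer B, same stale read
        System.out.println(unreadCount.get()); // 6, not the expected 7

        // CAS loop: the write succeeds only if the read is still current.
        unreadCount.set(5);
        for (int writer = 0; writer < 2; writer++) {
            int cur;
            do {
                cur = unreadCount.get();                         // step 1
            } while (!unreadCount.compareAndSet(cur, cur + 1));  // steps 2-3
        }
        System.out.println(unreadCount.get()); // 7: no update lost
    }
}
```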
21. Conflict Resolution
• Conflicts between increments are easy to resolve (see the sketch below)
• Imagine multi-row transactions
– Pointless to resolve conflicts at the row level
• Solve conflicts at the highest possible layer
– e.g., a transaction monitor
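A Java sketch of why increments are the easy case, using a counter kept as per-replica totals that merge by taking the maximum of each entry (a G-counter-style structure; the slide names no mechanism, so this is an assumption). Multi-row transactions admit no such mechanical merge, hence the transaction-monitor suggestion:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Per-replica counts only ever grow, so two divergent copies merge by
 * taking the max of each replica's entry -- no increment is lost and no
 * application logic is consulted.
 */
public class CounterMerge {
    /** Merge two divergent copies of the same counter. */
    static Map<String, Long> merge(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> out = new HashMap<>(a);
        b.forEach((replica, count) -> out.merge(replica, count, Math::max));
        return out;
    }

    static long value(Map<String, Long> counter) {
        return counter.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        Map<String, Long> dc1 = new HashMap<>(Map.of("dc1", 3L));
        Map<String, Long> dc2 = new HashMap<>(Map.of("dc1", 1L, "dc2", 2L));
        System.out.println(value(merge(dc1, dc2))); // 5: both sides kept
    }
}
```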
22. How did it work out?
• A ton of missing HBase/HDFS features were added
– Bloom filters, NameNode HA
– Remote HLog shipping
– A modified block-placement policy
– Sticky regions
– An improved block cache
– …
• User → App Server mapping via ZooKeeper (see the sketch below)
• The App Server design worked out
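The slides confirm ZooKeeper carried the user → App Server mapping but not how; the Java sketch below is one plausible scheme, with the /users path and the ephemeral-node claim protocol as assumptions. An ephemeral znode per user records the owner and enforces "at most one App Server per user", and the claim disappears automatically if the owner's session dies:

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

/**
 * Hypothetical user -> App Server registry. Assumes the persistent parent
 * znode /users already exists. A second claim for the same user fails with
 * NODEEXISTS, so ownership is exclusive.
 */
public class UserRegistry {
    private final ZooKeeper zk;

    UserRegistry(ZooKeeper zk) { this.zk = zk; }

    /** Try to claim a user for this server; false if another server owns it. */
    boolean claim(String user, String serverAddr)
            throws KeeperException, InterruptedException {
        try {
            zk.create("/users/" + user,
                    serverAddr.getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;
        } catch (KeeperException.NodeExistsException alreadyOwned) {
            return false;
        }
    }

    /** Look up which App Server currently owns the user, or null if none. */
    String lookup(String user) throws KeeperException, InterruptedException {
        try {
            byte[] data = zk.getData("/users/" + user, false, null);
            return new String(data, StandardCharsets.UTF_8);
        } catch (KeeperException.NoNodeException unassigned) {
            return null;
        }
    }
}
```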