2. Data Management at Facebook
(Back in the Day)
Jeff Hammerbacher
VP Product and Chief Scientist, Cloudera
October 22, 2008
3. My Background
Thanks for Asking
▪ hammer@cloudera.com
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Came to Facebook in early 2006 as a Research Scientist
▪ Managed the Facebook Data Team through September 2008
▪ Over 25 amazing engineers and data scientists
▪ Now a cofounder of Cloudera
▪ Hadoop support and optimization
4. Common Themes
1. Simplicity
▪ Do one thing well ...
2. Scalability
▪ ... a lot
3. Manageability
▪ Remove the humans
4. Open Source
▪ Build a community
5. Serving Facebook.com
Data Retrieval and Hardware
GET /index.php HTTP/1.1
Host: www.facebook.com
▪ Three main server profiles:
▪ Web Tier (more than 10,000 servers)
▪ Memcached Tier (around 1,000 servers)
▪ MySQL Tier (around 2,000 servers)
▪ Simplified away:
▪ AJAX
▪ Photo and Video
▪ Services
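The tier layout above implies the classic cache-aside read path: the web tier checks memcached first and falls back to MySQL on a miss. A minimal sketch, assuming pre-configured python-memcached and MySQLdb client objects (not Facebook's actual code):

# Minimal cache-aside sketch (illustrative only; clients are assumed
# to be python-memcached and MySQLdb connections).
def get_user(cache, db, user_id):
    key = "user:%d" % user_id
    user = cache.get(key)              # 1. try the memcached tier
    if user is not None:
        return user
    cur = db.cursor()                  # 2. fall back to the MySQL tier
    cur.execute("SELECT id, name FROM users WHERE id = %s", (user_id,))
    row = cur.fetchone()
    if row is None:
        return None
    user = {"id": row[0], "name": row[1]}
    cache.set(key, user, time=300)     # 3. repopulate the cache (5 min TTL)
    return user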
6. Services Infrastructure
What’s an SOA?
▪ Almost all services written in Thrift
▪ Network Type-ahead, Search, Ads, SMS Gateway, Chat, Notes Import, Scribe
▪ Batteries included
▪ Network transport libraries
▪ Serialization libraries
▪ Code generation
▪ Robust server implementations (multithreaded, nonblocking, etc.)
▪ Now an Apache Incubator project
▪ For more information, read the whitepaper
7. Services Infrastructure
Thrift, Mainly
▪ Developing a Thrift service:
▪ Define your data structures
▪ JSON-like data model
▪ Define your service endpoints
▪ Select your languages
▪ Generate stub code
▪ Write service logic
▪ Write client
▪ Configure and deploy
▪ Monitor, provision, and upgrade
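A sketch of that workflow end to end, assuming a hypothetical search.thrift IDL and the stub module it would generate; only the thrift library calls are the real Apache Thrift Python API:

# Hypothetical IDL (search.thrift) that `thrift --gen py search.thrift`
# would turn into stub code:
#
#   struct Query { 1: string text, 2: i32 limit }
#   service SearchService { list<string> search(1: Query q) }
#
# Python client against the generated stubs (module names are assumptions):
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from search import SearchService          # generated stub (hypothetical)
from search.ttypes import Query           # generated type (hypothetical)

transport = TTransport.TBufferedTransport(TSocket.TSocket("localhost", 9090))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = SearchService.Client(protocol)

transport.open()
results = client.search(Query(text="hadoop", limit=10))
transport.close()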
8. Data Infrastructure
Offline Batch Processing
▪ “Data Warehousing”
▪ Began with Oracle database
▪ Schedule data collection via cron
▪ Collect data every 24 hours
▪ “ETL” scripts: hand-coded Python (a toy version is sketched below)
▪ [Diagram: Scribe Tier and MySQL Tier → Data Collection Server → Oracle Database Server]
▪ Data volumes quickly grew
▪ Started at tens of GB in early 2006
▪ Up to about 1 TB per day by mid-2007
▪ Log files largest source of data growth
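A toy version of the kind of hand-coded ETL script described above; the paths, log format, and output schema are all assumptions, not the original code:

# Toy nightly ETL sketch. A cron entry such as
# "0 2 * * * /usr/bin/python etl_daily.py" would run it once per day,
# matching the 24-hour collection cycle described above.
import csv
import gzip
from collections import defaultdict

def run(log_path="/data/logs/actions-20081021.gz", out_path="/data/summary.csv"):
    counts = defaultdict(int)
    with gzip.open(log_path, "rt") as f:          # 1. extract: read raw log lines
        for line in f:
            user_id, action, _ts = line.rstrip("\n").split("\t")
            counts[(user_id, action)] += 1        # 2. transform: aggregate per user/action
    with open(out_path, "w", newline="") as out:  # 3. load: write a file the DB can bulk-load
        w = csv.writer(out)
        w.writerow(["user_id", "action", "count"])
        for (user_id, action), n in sorted(counts.items()):
            w.writerow([user_id, action, n])

if __name__ == "__main__":
    run()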
9. Data Infrastructure
Distributed Processing with Cheetah
▪ Goal: summarize log files outside of the database
▪ Solution: Cheetah, a distributed log file processing system
▪ Distributor.pl: distribute binaries to processing nodes
▪ C++ Binaries: parse, agg, load
▪ [Diagram: Cheetah Master dispatches work over partitioned log files on the Filer to the Processing Tier]
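A single-machine approximation of the Cheetah flow, purely for illustration: a "master" hands partitioned log files to worker processes, each worker parses and aggregates its partition, and the partial results are merged before loading. The real system shipped C++ binaries to a processing tier via distributor.pl; here a process pool stands in for that tier and the log format is assumed.

from collections import Counter
from multiprocessing import Pool

def parse_and_agg(partition_path):
    """Worker step: parse one log partition and return partial counts."""
    counts = Counter()
    with open(partition_path) as f:
        for line in f:
            action = line.split("\t")[1]     # assumed tab-separated log format
            counts[action] += 1
    return counts

def master(partitions):
    """Master step: fan out partitions, then merge; the load step would write to the DB."""
    with Pool() as pool:
        partials = pool.map(parse_and_agg, partitions)
    return sum(partials, Counter())

if __name__ == "__main__":
    print(master(["/data/logs/part-0000", "/data/logs/part-0001"]))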
10. Data Infrastructure
Moving from Cheetah to Hadoop
▪ Cheetah limitations
▪ Limited filer bandwidth
▪ No centralized logfile metadata
▪ Writing a new Cheetah job requires writing C++ binaries
▪ Jobs are difficult to monitor and debug
▪ No support for ad hoc querying
▪ Not open source
12. Initial Hadoop Applications
Unstructured text analysis
▪ Intern asked to understand brand sentiment and influence
▪ Many tools for supporting his project had to be built
▪ Understanding serialization format of wall post logs
▪ Common data operations: project, filter, join, group by
▪ Developed using Hadoop Streaming for rapid prototyping in Python (see the mapper/reducer sketch below)
▪ Scheduling regular processing and recovering from failures
▪ Making it easy to regularly load new data
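A minimal Hadoop Streaming pair of the kind used for this prototyping; the record layout, field positions, and brand terms are invented for illustration. The mapper projects and filters wall-post records, and the reducer, which sees its input sorted by key, implements the group-by count.

#!/usr/bin/env python
# mapper.py -- projects the text field out of each wall-post record and
# filters to brand terms. Invoked roughly as:
#   hadoop jar hadoop-streaming.jar -input /logs/wall_posts -output /tmp/brand_counts \
#     -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
import sys

BRANDS = {"nike", "starbucks"}               # example terms (assumed)

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    for word in fields[2].lower().split():   # assume column 2 holds the post text
        if word in BRANDS:
            print("%s\t1" % word)

#!/usr/bin/env python
# reducer.py -- input arrives sorted by key, so a running count per key
# implements the "group by" step.
import sys

current, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = key, 0
    count += int(value)
if current is not None:
    print("%s\t%d" % (current, count))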
14. Initial Hadoop Applications
Ensemble Learning
▪ Build a lot of Decision Trees and average them
▪ Random Forests are a combination of tree predictors such that
each tree depends on the values of a random vector sampled
independently and with the same distribution for all trees in the
forest
▪ Can be used for regression or classification
▪ See “Random Forests” by Leo Breiman
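The idea on this slide in a few lines, using scikit-learn as a stand-in (not the tooling used at Facebook): train many decision trees on bootstrap samples with random feature subsets, then average their votes.

# Random forest via scikit-learn (a stand-in, not the original tooling).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each of the 100 trees sees a bootstrap sample and a random feature subset;
# the forest majority-votes (or averages, for regression) over the trees.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_tr, y_tr)
print("held-out accuracy:", forest.score(X_te, y_te))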
15. More Hadoop Applications
Insights
▪ Monitor performance of your Facebook Ad, Page, Application
▪ Regular aggregation of high volumes of log file data
▪ First hourly pipelines
▪ Publish data back to a MySQL tier
▪ System currently only running partially on Hadoop
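A sketch of the "publish data back to a MySQL tier" step; the table, columns, and credentials are assumptions, and the rows stand in for the output of the hourly Hadoop aggregation job:

# Sketch of publishing an hourly rollup back to MySQL (schema is assumed).
import MySQLdb

def publish(rows, hour):
    # placeholder credentials; the real values would come from configuration
    db = MySQLdb.connect(host="insights-db", user="etl", passwd="...", db="insights")
    cur = db.cursor()
    cur.executemany(
        "REPLACE INTO ad_stats_hourly (ad_id, hour, impressions, clicks) "
        "VALUES (%s, %s, %s, %s)",
        [(ad_id, hour, imp, clk) for ad_id, imp, clk in rows])
    db.commit()
    db.close()

# hypothetical output of one hourly aggregation
publish([(1001, 52340, 87), (1002, 18761, 45)], "2008-10-21 14:00:00")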
17. More Hadoop Applications
Platform Application Reputation Scoring
▪ Users complaining about being spammed by Platform
applications
▪ Now, every Platform Application has a set of quotas
▪ Notifications
▪ News Feed story insertion
▪ Invitations
▪ Emails
▪ Quotas determined by calculating a “reputation score” for the
application
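One way to read the quota mechanism; all signals, weights, and base limits below are invented for illustration, not the actual scoring model. Feedback signals fold into a reputation score, which then scales each application's daily quotas per channel.

# Illustrative quota calculation (signals, weights, and limits are invented).
BASE_QUOTAS = {"notifications": 20, "feed_stories": 10, "invitations": 30, "emails": 5}

def reputation_score(accept_rate, block_rate, report_rate):
    # Higher acceptance raises the score; blocks and spam reports lower it.
    score = 0.6 * accept_rate + 0.4 * (1.0 - block_rate) - 0.5 * report_rate
    return max(0.0, min(1.0, score))

def quotas_for(app_stats):
    score = reputation_score(**app_stats)
    return {channel: int(base * score) for channel, base in BASE_QUOTAS.items()}

print(quotas_for({"accept_rate": 0.42, "block_rate": 0.10, "report_rate": 0.03}))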
18. Hive
Structured Data Management with Hadoop
▪ Hadoop:
▪ HDFS
▪ MapReduce
▪ Resource Manager
▪ Job Scheduler
▪ Hive:
▪ Logical data partitioning
▪ Metadata store (command line and web interfaces)
▪ Query Operators
▪ Query Language
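What the query layer looks like from the outside, as a sketch: a HiveQL statement handed to the hive command-line interface, which compiles it into MapReduce jobs over the logically partitioned data. The table, columns, and partition layout here are assumptions.

# Run a HiveQL query over a date-partitioned table via the Hive CLI.
import subprocess

QUERY = """
SELECT action, COUNT(1) AS cnt
FROM action_log
WHERE ds = '2008-10-21'          -- logical partition: one directory per day
GROUP BY action;
"""

# `hive -e` executes a query string; Hive plans it as MapReduce jobs on HDFS.
subprocess.run(["hive", "-e", QUERY], check=True)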
20. Hive
The Team
▪ Joydeep Sen Sarma
▪ Ashish Thusoo
▪ Pete Wyckoff
▪ Suresh Anthony
▪ Zheng Shao
▪ Venky Iyer
▪ Dhruba Borthakur
▪ Namit Jain
▪ Raghu Murthy
▪ Prasad Chakka
21. Hive
Some Stats
▪ Cluster size - 320 nodes, 2560 cores, 1.3 PB capacity
▪ Total data (compressed, deduplicated) - 180 TB
▪ Net data per day
▪ 10 TB uncompressed - 4 TB from databases, 6 TB from logs
▪ Over 2 TB compressed
▪ Data Processing Statistics
▪ 3,200 Jobs and 800,000 Tasks per day
▪ 55 TB of compressed data processed per day
▪ 15 TB of compressed data produced per day
▪ 80 M minutes of compute time per day
22. Cassandra
Structured Storage over a P2P Network
▪ Conceptually: BigTable data model on Dynamo infrastructure
▪ Design Goals:
▪ High availability
▪ Incremental scalability
▪ Eventual consistency (trade consistency for availability)
▪ Optimistic replication
▪ Low total cost of ownership
▪ Minimal administrative overhead
▪ Tunable tradeoffs between consistency, durability, and latency
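The "tunable tradeoffs" bullet in miniature, as a conceptual sketch rather than Cassandra's actual API: with N replicas, a write waits for W acknowledgements and a read consults R replicas, so lowering W or R cuts latency and raises availability at the cost of possibly stale reads, while R + W > N guarantees the read set overlaps the latest write.

# Conceptual quorum sketch (not Cassandra's API): values carry timestamps,
# a write returns after W replica acks, a read takes the newest of R replies.
import time

class Replica:
    def __init__(self):
        self.store = {}
    def put(self, key, value, ts):
        self.store[key] = (value, ts)
    def get(self, key):
        return self.store.get(key)

class QuorumClient:
    def __init__(self, replicas, w, r):
        self.replicas, self.w, self.r = replicas, w, r

    def write(self, key, value):
        ts = time.time()
        for replica in self.replicas[: self.w]:   # wait for only W acks;
            replica.put(key, value, ts)           # the rest would be repaired later

    def read(self, key):
        replies = [rep.get(key) for rep in self.replicas[: self.r]]
        replies = [x for x in replies if x is not None]
        return max(replies, key=lambda x: x[1])[0] if replies else None

# With N=3, W=2, R=2 (R + W > N) every read overlaps the latest write.
replicas = [Replica(), Replica(), Replica()]
client = QuorumClient(replicas, w=2, r=2)
client.write("user:42:status", "at HackerDojo")
print(client.read("user:42:status"))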
25. Cassandra
The Team
▪ Avinash Lakshman
▪ Prashant Malik
▪ Karthik Ranganathan
▪ Kannan Muthukkaruppan
26. Cassandra
Some Stats
▪ Cluster size - 120 nodes
▪ Single instance across two data centers
▪ Total data stored - 36 TB
▪ Writes - 300 million writes per day
▪ Reads - 1 million reads per day
▪ Read Latencies
▪ Min - 6.03 ms
▪ Mean - 90.6 ms
▪ Median - 18.24 ms
27. (c) 2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0