This document summarizes Cassandra, an open-source distributed database management system designed to handle large amounts of data across many commodity servers. It covers Cassandra's history and key features such as tunable consistency levels and support for structured and indexed columns. Case studies describe how Digg, Twitter, Facebook and Mahalo use Cassandra to handle terabytes of data and high transaction volumes. The roadmap outlines upcoming releases that will improve compaction, management tools, and support for dynamic schema changes.
8. • 7 new committers added
• Dozens of contributors
• 100+ people on IRC
• Hundreds of closed issues (bugs, features, etc.)
• 3 major releases, 2 point releases
• Graduation to TLP?
15. Querying
• get(): retrieve by column name
• multiget(): by column name for a set of keys
• get_slice(): by column name, or a range of names
  • returning columns
  • returning super columns
• multiget_slice(): a subset of columns for a set of keys
• get_count(): number of columns or sub-columns
• get_range_slice(): subset of columns for a range of keys
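As a rough illustration of how these calls relate to each other, the sketch below models a column family as a plain Python dict and implements simplified stand-ins for each operation. The class name `ToyColumnFamily`, the method signatures and the sample data are assumptions made for this example. The real calls take further arguments (such as a consistency level) that are omitted here, and super columns are not modeled.

```python
class ToyColumnFamily:
    """In-memory stand-in for one column family (not the real client API)."""

    def __init__(self):
        self.rows = {}  # row key -> {column name: value}

    def insert(self, key, column, value):
        self.rows.setdefault(key, {})[column] = value

    def get(self, key, column):
        # Retrieve a single column by name.
        return self.rows[key][column]

    def multiget(self, keys, column):
        # Same column name, looked up for a set of keys.
        return {k: self.rows[k][column]
                for k in keys if column in self.rows.get(k, {})}

    def get_slice(self, key, names=None, start="", finish=""):
        # Either an explicit list of names, or a sorted range of names.
        row = self.rows.get(key, {})
        if names is not None:
            return [(n, row[n]) for n in names if n in row]
        return [(n, row[n]) for n in sorted(row)
                if n >= start and (finish == "" or n <= finish)]

    def multiget_slice(self, keys, **slice_args):
        return {k: self.get_slice(k, **slice_args) for k in keys}

    def get_count(self, key):
        return len(self.rows.get(key, {}))

    def get_range_slice(self, start_key, finish_key, **slice_args):
        # A key range only makes sense if keys are kept in sorted order,
        # which this toy model assumes.
        return {k: self.get_slice(k, **slice_args)
                for k in sorted(self.rows) if start_key <= k <= finish_key}


users = ToyColumnFamily()
users.insert("alice", "email", "alice@example.com")
users.insert("alice", "name", "Alice")
users.insert("bob", "name", "Bob")

print(users.get("alice", "name"))
print(users.get_slice("alice", start="a", finish="f"))
print(users.get_count("alice"))
```

The slice calls share one code path here, which mirrors how the multi-key and range variants on the slide are the single-key slice applied to more than one row.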
22. About writes...
• No reads
• No seeks
• Sequential disk access
• Atomic within a column family
• Fast
• Any node
• Always writeable (hinted hand-off)
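One way to picture these properties is the toy model below. Each node appends to a commit log and updates an in-memory table without reading existing data, and a coordinator keeps a hint for any replica that is down so the write is still accepted. The names `ToyNode`, `ToyCoordinator` and `replay_hints` are hypothetical, and the model ignores real disk I/O, replica selection and consistency levels. It is a sketch of the idea, not Cassandra's implementation.

```python
class ToyNode:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.commit_log = []  # append-only list, standing in for sequential disk writes
        self.memtable = {}    # in-memory view, updated without reading old values

    def write(self, key, column, value):
        self.commit_log.append((key, column, value))  # append only, no seek
        self.memtable[(key, column)] = value          # blind overwrite, no read


class ToyCoordinator:
    """Any node could coordinate; a separate object keeps the example short."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.hints = []  # (target node, key, column, value) for replicas that were down

    def write(self, key, column, value):
        for node in self.replicas:
            if node.up:
                node.write(key, column, value)
            else:
                # Hinted hand-off: remember the write for later delivery.
                self.hints.append((node, key, column, value))

    def replay_hints(self):
        remaining = []
        for node, key, column, value in self.hints:
            if node.up:
                node.write(key, column, value)
            else:
                remaining.append((node, key, column, value))
        self.hints = remaining


a, b = ToyNode("a"), ToyNode("b")
coordinator = ToyCoordinator([a, b])

b.up = False
coordinator.write("story:1", "title", "Hello")  # accepted although b is down
b.up = True
coordinator.replay_hints()                      # b catches up from the hint
```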
26. Case 1: Digg
Digg is a social news site that allows people to discover and share
content from anywhere on the Internet by submitting stories and
links, and voting and commenting on submitted stories and links.
Ranked 98th by Alexa.com.
28. Problem
• Terabytes of data; high transaction rate (reads dominated)
• Multiple clusters; heavily sharded
• Management nightmare (high effort, error prone)
• Unsatisfied availability requirements (geographic isolation)
29. Solution
• Currently in production on "Green Badges"
• Cassandra as primary data store RSN (real soon now)
• Datacenter and rack-aware replication
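To make the last bullet concrete, here is a minimal sketch of one possible placement rule: walk the ring from the key's position and prefer nodes whose (datacenter, rack) pair has not been used yet. The node list, the `pick_replicas` function and the rule itself are assumptions for illustration. They are not taken from the talk or from Cassandra's own replication strategies.

```python
from collections import namedtuple

Node = namedtuple("Node", "name datacenter rack")


def pick_replicas(ring, start_index, replication_factor):
    """Walk the ring clockwise from start_index, preferring unused racks.

    Falls back to any remaining node when there are fewer distinct racks
    than requested replicas.
    """
    ordered = ring[start_index:] + ring[:start_index]
    chosen, used_racks = [], set()
    for node in ordered:  # first pass: one node per (datacenter, rack)
        rack_id = (node.datacenter, node.rack)
        if rack_id not in used_racks:
            chosen.append(node)
            used_racks.add(rack_id)
        if len(chosen) == replication_factor:
            return chosen
    for node in ordered:  # second pass: fill up with whatever is left
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == replication_factor:
            break
    return chosen


# Hypothetical four-node ring spread over two datacenters.
ring = [
    Node("n1", "dc1", "r1"),
    Node("n2", "dc1", "r1"),
    Node("n3", "dc1", "r2"),
    Node("n4", "dc2", "r1"),
]
replicas = pick_replicas(ring, start_index=0, replication_factor=3)
print([n.name for n in replicas])  # ['n1', 'n3', 'n4']
```

In this example n2 is skipped because it shares a rack with n1, so losing a single rack or a single datacenter still leaves at least one replica reachable.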
30. Case 2: Twitter
Twitter is a social networking and microblogging service that
enables its users to send and read tweets, text-based posts of up to
140 characters.
Ranked 12th by Alexa.com.
32. MySQL
• Terabytes of data, ~1,000,000 ops/s
• Calls for heavy sharding, light replication
• Schema changes are very difficult (if possible at all)
• Manual sharding is very high effort
• Automated sharding and replication is Hard
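The effort behind manual sharding can be illustrated with a small calculation. The sketch below compares how many keys change owner when capacity is added under simple modulo placement and under a token ring of the kind Cassandra uses to partition data. The functions, the key set and the token values are assumptions made for this example, not figures from the talk.

```python
import bisect


def modulo_shard(key, shard_count):
    """Manual sharding: the shard is the key modulo the number of shards."""
    return key % shard_count


def ring_owner(key, tokens):
    """Token ring: the owner is the first token at or after the key, wrapping."""
    index = bisect.bisect_left(tokens, key)
    return tokens[index % len(tokens)]


def moved_fraction(keys, before, after):
    """Fraction of keys whose owner differs between two layouts."""
    moved = sum(1 for key in keys if before(key) != after(key))
    return moved / len(keys)


keys = list(range(100))  # hypothetical integer keys, used directly as tokens

# Going from 4 to 5 shards with modulo placement.
modulo_moved = moved_fraction(
    keys, lambda k: modulo_shard(k, 4), lambda k: modulo_shard(k, 5))

# Adding one node (token 11) to a four-node ring.
old_tokens = [24, 49, 74, 99]
new_tokens = [11, 24, 49, 74, 99]
ring_moved = moved_fraction(
    keys, lambda k: ring_owner(k, old_tokens), lambda k: ring_owner(k, new_tokens))

print(modulo_moved)  # 0.8
print(ring_moved)    # 0.12
```

With modulo placement most of the data has to be copied between machines when a shard is added, whereas on the ring only the keys in the range taken over by the new node move.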
33. Case 3: Facebook
Facebook is a social networking site where users can create a
profile, add friends, and send them messages. Users can also join
groups organized by location or other points of common interest.
Ranked 2nd by Alexa.com.
34. Inbox Search
• 100 TB
• 160 nodes
• 1/2 billion writes per day (2-year-old number?)
35. Case 4: Mahalo
Mahalo.com is a web directory and knowledge exchange. It
differentiates itself by tracking and building hand-crafted result
sets for many of the popular search terms.
(it also means "thank you" in Hawaiian)
36. MySQL
• Partial deployment; 16 million video records (and growing)
• Writes (and storage) rapidly exceeding single box limitations
• Manageability suffering (clustering is painful)
• Concerns over availability