In this talk we describe the features that set Cassandra apart from the pack, and how to get the most out of them depending on your application. In particular, we'll describe de-normalisation, and detail how the algorithms behind Cassandra leverage awesome write speed to accelerate reads; and we'll explain how Cassandra achieves multi-data-centre support, tuneable consistency and no single point of failure, making it a great fit for highly available systems.
4. History
• 2007: Started at Facebook for inbox search
• July 2008: Open sourced by Facebook
• March 2009: Apache Incubator
• February 2010: Apache top-level project
• May 2011: Version 0.8
Monday, 15 August 2011
5. What it’s good for
• Horizontal scalability
• No single-point of failure
• Multi-data centre support
• Very high write workloads
• Tuneable consistency
6. What it’s not so good for
• Transactions
• Read-heavy workloads
• Low-latency applications
  • compared to in-memory DBs
8. Keyspaces and Column Families
SQL        Cassandra
Database   Keyspace
Table      Column Family
(figure: a Column Family holds rows; each row is identified by a row key and holds its own columns, e.g. row/key: col_1, col_2, ...)
Keyspaces & CFs have different sets of configuration settings
9. Column Family
key: {
column: value,
column: value,
...
}
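The key/column sketch above can be modelled as a nested dictionary. This is a toy in-memory sketch with invented data, not Cassandra's actual storage engine (which keeps columns sorted on disk with per-column timestamps):

```python
# Toy model of a column family: row key -> {column name -> value}.
# Mirrors the key: {column: value, ...} sketch on the slide above.
column_family = {
    "user1": {"name": "Mary", "email": "mary@example.com"},
    "user2": {"name": "Sam"},
}

# Rows are sparse: each row key holds only the columns written to it.
print(sorted(column_family["user1"]))  # ['email', 'name']
print(sorted(column_family["user2"]))  # ['name']
```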
10. Rows and columns
(figure: a grid of rows 1-7 against columns 1-7; each row holds its own sparse subset of columns, from a single column in row6 up to five columns in rows 2 and 3)
11. Reads
• get: one column of one row
• get_slice: one row, some columns
  • name predicate
  • slice range
• multiget_slice: multiple rows
• get_range_slices: a contiguous range of rows
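The four read operations can be sketched against a toy in-memory column family. Function signatures and data here are illustrative, not the real Thrift API, but the selection semantics match the slides:

```python
# Toy column family: row key -> {column name -> value}, columns kept sorted.
cf = {
    "row1": {"col1": "a", "col3": "b", "col6": "c"},
    "row2": {"col2": "d", "col4": "e"},
    "row3": {"col1": "f", "col5": "g"},
}

def get(row, col):
    """get: one column of one row."""
    return cf[row][col]

def get_slice(row, names=None, start=None, finish=None):
    """get_slice: some columns of one row, by name predicate or slice range."""
    cols = sorted(cf.get(row, {}).items())
    if names is not None:                       # name predicate
        return [(c, v) for c, v in cols if c in names]
    return [(c, v) for c, v in cols             # slice range over column order
            if (start is None or c >= start) and (finish is None or c <= finish)]

def multiget_slice(rows, **kw):
    """multiget_slice: the same slice applied to several named rows."""
    return {r: get_slice(r, **kw) for r in rows}

def get_range_slices(row_start, row_finish, **kw):
    """get_range_slices: a slice applied to a contiguous range of row keys."""
    return {r: get_slice(r, **kw)
            for r in sorted(cf) if row_start <= r <= row_finish}

print(get("row1", "col3"))                            # b
print(get_slice("row1", start="col2", finish="col6"))  # [('col3', 'b'), ('col6', 'c')]
```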
12. get
(figure: the grid from slide 10, with a single column of a single row highlighted)
13. get_slice: name predicate
(figure: the grid from slide 10, with specific named columns of one row highlighted)
14. get_slice: slice range
(figure: the grid from slide 10, with a contiguous range of columns in one row highlighted)
15. multiget_slice: name predicate
(figure: the grid from slide 10, with the same named columns across several rows highlighted)
16. get_range_slices: slice range
(figure: the grid from slide 10, with a column slice across a contiguous range of row keys highlighted)
24. Partitioning + Replication
• Partitioning data onto nodes
  • load balancing
  • row-based
• Replication
  • to protect against failure
  • better availability
25. Partitioning
• Random: take hash of row key
• good for load balancing
• bad for range queries
• Ordered: subdivide key space
• bad for load balancing
• good for range queries
• Or build your own...
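The trade-off above can be sketched in a few lines. This is a toy model: the node names and key-space boundaries are invented, and real Cassandra partitioners work over a token ring rather than a small node list:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]

def random_partition(row_key):
    """Random: hash the row key. Rows spread evenly, but key order is lost,
    so a range query must touch every node."""
    token = int(hashlib.md5(row_key.encode()).hexdigest(), 16)
    return NODES[token % len(NODES)]

def ordered_partition(row_key):
    """Ordered: subdivide the key space. Key order is preserved (good for
    range queries) at the cost of potential hot spots."""
    boundaries = ["g", "n", "t"]        # keys < "g" -> node-a, < "n" -> node-b, ...
    for node, upper in zip(NODES, boundaries):
        if row_key < upper:
            return node
    return NODES[-1]

# Adjacent keys land on the same node under ordered partitioning...
assert ordered_partition("apple") == ordered_partition("apricot")
# ...but under random partitioning they may land anywhere.
assert random_partition("apple") in NODES
```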
26. Simple Replication
• Nodes arranged on a ‘ring’
• A key (k, v) maps to a primary location on the ring
• Extra copies are stored on the successor nodes around the ring
29. Topology-aware Replication
• Snitch: maps a node’s IP to (data centre, rack)
• EC2Snitch
  • region → DC; availability zone → rack
• PropertyFileSnitch
  • configured from a file
30. Topology-aware Replication
(figure, over slides 30-33: two data centres, DC 1 and DC 2, each with racks r1 and r2; a key (k, v) is written to a node in one data centre, extra copies are sent to the other data centre, and within each data centre the copies are spread across racks)
35. Consistency Level
• How many replicas must respond in order to declare success
• W of the N replicas must succeed for a write to succeed
  • writes carry a client-generated timestamp
• R of the N replicas must succeed for a read to succeed
  • the most recent value, by timestamp, is returned
36. Consistency Level
• 1, 2, 3 responses
• Quorum (more than half)
• Quorum in local data center
• Quorum in each data center
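The arithmetic behind these levels can be sketched: a read is guaranteed to overlap the latest successful write whenever R + W > N, which is why quorum reads plus quorum writes give consistent results. A toy sketch (the function names are ours, not Cassandra's):

```python
# With N replicas, a write acknowledged by W of them and a read answered by
# R of them must overlap on at least one replica whenever R + W > N.
def quorum(n):
    """More than half of n replicas."""
    return n // 2 + 1

def read_sees_latest_write(n, w, r):
    return r + w > n

N = 3
assert read_sees_latest_write(N, w=quorum(N), r=quorum(N))  # QUORUM writes + reads
assert not read_sees_latest_write(N, w=1, r=1)              # ONE/ONE: may read stale data
```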
38. Read repair
• If the replicas disagree on a read, send the most recent data back
(figure, over slides 38-41: a read of key k goes to replicas n1, n2, n3; n1 returns (v, t1), n2 returns ‘not found!’, n3 returns (v’, t2); the most recent value v’ wins by timestamp, and write (k, v’, t2) is sent back to the out-of-date replicas)
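The read-repair flow can be sketched as: collect (value, timestamp) responses, return the newest value, and push it back to any replica that answered with older data or nothing at all. A toy sketch with invented node names and timestamps:

```python
# Replica responses for key k: (value, timestamp) or None for "not found".
replicas = {
    "n1": ("v", 1),     # stale
    "n2": None,         # not found
    "n3": ("v2", 2),    # most recent
}

def read_with_repair(replicas):
    answered = {n: r for n, r in replicas.items() if r is not None}
    winner_value, winner_ts = max(answered.values(), key=lambda r: r[1])
    for node, resp in replicas.items():
        if resp is None or resp[1] < winner_ts:
            replicas[node] = (winner_value, winner_ts)  # repair write-back
    return winner_value

assert read_with_repair(replicas) == "v2"
# After the read, the stale replicas have been brought up to date.
assert replicas["n1"] == replicas["n2"] == ("v2", 2)
```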
42. Hinted handoff
• When a node is unavailable
  • writes for it can be stored on another node as a ‘hint’
  • hints are delivered when the node comes back online
43. Anti-entropy
• Equivalent to ‘read repair all’
• Requires reading all data (woah)
• (Although only hashes are sent to calculate diffs)
• Manual process
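The "only hashes are sent" point can be sketched: replicas exchange digests of their data per range, and only ranges whose digests differ need the actual data transferred. A toy sketch; real Cassandra compares Merkle trees built per token range:

```python
import hashlib

def digest(rows):
    """A single hash over a replica's rows for one range; cheap to send
    compared to the data itself."""
    h = hashlib.sha256()
    for key in sorted(rows):                     # sort for determinism
        h.update(f"{key}={rows[key]}".encode())
    return h.hexdigest()

a = {"k1": "v1", "k2": "v2"}
b = {"k1": "v1", "k2": "stale"}

# Digests disagree, so this range needs repair; identical ranges are skipped.
assert digest(a) != digest(b)
assert digest(a) == digest({"k2": "v2", "k1": "v1"})
```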
45. De-normalisation
• Disk space is much cheaper than disk seeks
• Read at 100 MB/s, seek at 100 IO/s
• => copy data to avoid seeks
47. Data-centric model
m1: {
sender: user1
content: “Mary had a little lamb”
recipients: user2, user3
}
• but how to do ‘recipients’ for Inbox?
• one-to-many modelled by a join table
48. To join
m1: {
  sender: user1
  subject: “A rhyme”
  content: “Mary had a little lamb”
}
m2: {
  sender: user1
  subject: “colours”
  content: “Its fleece was white as snow”
}
m3: {
  sender: user1
  subject: “loyalty”
  content: “And everywhere that Mary went”
}

user2: {
  m1: true
}
user3: {
  m1: true
  m2: true
}
user4: {
  m2: true
  m3: true
}
49. .. or not to join
• Joins are expensive, so de-normalise to trade off space for time
• We can have lots of columns, so think BIG:
  • make the message id a time-typed super-column
  • this makes get_slice an efficient way of searching for messages in a time window
51. De-normalisation + Cassandra
• have to write a copy of the record for each recipient ... but writes are very cheap
• get_slice fetches columns for a particular row, so gets received messages for a user
• on-disk column order is optimal for this query
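The de-normalised inbox can be sketched as one row per recipient, with message columns keyed by timestamp so a get_slice over a time window is a single contiguous read. Timestamps and message ids below are invented for illustration:

```python
# Toy de-normalised inbox: recipient row key -> {timestamp column -> message id}.
# Each message is copied into every recipient's row at write time.
inbox = {
    "user2": {1001: "m1"},
    "user3": {1001: "m1", 1005: "m2"},
    "user4": {1005: "m2", 1009: "m3"},
}

def messages_in_window(user, start_ts, end_ts):
    """get_slice-style range over the time-ordered columns of one row."""
    return [msg for ts, msg in sorted(inbox[user].items())
            if start_ts <= ts <= end_ts]

print(messages_in_window("user4", 1004, 1010))  # ['m2', 'm3']
```

Because columns are stored sorted by name (here, by timestamp), this query reads one contiguous span of one row: no joins and no extra seeks.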
53. What it’s good for
• Horizontal scalability
• No single-point of failure
• Multi-data centre support
• Very high write workloads
• Tuneable consistency