In this talk we describe the features that set Cassandra apart from the pack, and how to get the most out of them depending on your application. In particular, we'll describe de-normalisation, and detail how the algorithms behind Cassandra leverage awesome write speed to accelerate reads; and we'll explain how Cassandra achieves multi-data-centre support, tuneable consistency and no single point of failure, making it a great fit for highly available systems.
4. History
• 2007: Started at Facebook for inbox search
• July 2008: Open sourced by Facebook
• March 2009: Apache Incubator
• February 2010: Apache top-level project
• May 2011: Version 0.8
Monday, 15 August 2011
5. What it’s good for
• Horizontal scalability
• No single-point of failure
• Multi-data centre support
• Very high write workloads
• Tuneable consistency
6. What it’s not so good for
• Transactions
• Read-heavy workloads
• Low-latency applications
  • compared to in-memory DBs
8. Keyspaces and Column Families
SQL        Cassandra
Database   Keyspace
Table      Column Family
(figure: a Column Family holds rows; each row is identified by a row key and holds its own columns, e.g. row/key: col_1, col_2, ...)
Keyspaces & CFs have different sets of configuration settings
9. Column Family
key: {
column: value,
column: value,
...
}
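The key/column sketch above can be modelled as a nested dictionary. This is a toy in-memory sketch with invented data, not Cassandra's actual storage engine (which keeps columns sorted on disk with per-column timestamps):

```python
# Toy model of a column family: row key -> {column name -> value}.
# Mirrors the key: {column: value, ...} sketch on the slide above.
column_family = {
    "user1": {"name": "Mary", "email": "mary@example.com"},
    "user2": {"name": "Sam"},
}

# Rows are sparse: each row key holds only the columns written to it.
print(sorted(column_family["user1"]))  # ['email', 'name']
print(sorted(column_family["user2"]))  # ['name']
```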
10. Rows and columns
(figure: a grid of rows 1-7 against columns 1-7; each row holds its own sparse subset of columns, from a single column in row6 up to five columns in rows 2 and 3)
11. Reads
• get: one column of one row
• get_slice: one row, some columns
  • name predicate
  • slice range
• multiget_slice: multiple rows
• get_range_slices: a contiguous range of rows
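The four read operations can be sketched against a toy in-memory column family. Function signatures and data here are illustrative, not the real Thrift API, but the selection semantics match the slides:

```python
# Toy column family: row key -> {column name -> value}, columns kept sorted.
cf = {
    "row1": {"col1": "a", "col3": "b", "col6": "c"},
    "row2": {"col2": "d", "col4": "e"},
    "row3": {"col1": "f", "col5": "g"},
}

def get(row, col):
    """get: one column of one row."""
    return cf[row][col]

def get_slice(row, names=None, start=None, finish=None):
    """get_slice: some columns of one row, by name predicate or slice range."""
    cols = sorted(cf.get(row, {}).items())
    if names is not None:                       # name predicate
        return [(c, v) for c, v in cols if c in names]
    return [(c, v) for c, v in cols             # slice range over column order
            if (start is None or c >= start) and (finish is None or c <= finish)]

def multiget_slice(rows, **kw):
    """multiget_slice: the same slice applied to several named rows."""
    return {r: get_slice(r, **kw) for r in rows}

def get_range_slices(row_start, row_finish, **kw):
    """get_range_slices: a slice applied to a contiguous range of row keys."""
    return {r: get_slice(r, **kw)
            for r in sorted(cf) if row_start <= r <= row_finish}

print(get("row1", "col3"))                            # b
print(get_slice("row1", start="col2", finish="col6"))  # [('col3', 'b'), ('col6', 'c')]
```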
12. get
(figure: the grid from slide 10, with a single column of a single row highlighted)
13. get_slice: name predicate
(figure: the grid from slide 10, with specific named columns of one row highlighted)
14. get_slice: slice range
(figure: the grid from slide 10, with a contiguous range of columns in one row highlighted)
15. multiget_slice: name predicate
(figure: the grid from slide 10, with the same named columns across several rows highlighted)
16. get_range_slices: slice range
(figure: the grid from slide 10, with a column slice across a contiguous range of row keys highlighted)
24. Partitioning + Replication
• Partitioning data onto nodes
  • load balancing
  • row-based
• Replication
  • to protect against failure
  • better availability
25. Partitioning
• Random: take hash of row key
• good for load balancing
• bad for range queries
• Ordered: subdivide key space
• bad for load balancing
• good for range queries
• Or build your own...
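The trade-off above can be sketched in a few lines. This is a toy model: the node names and key-space boundaries are invented, and real Cassandra partitioners work over a token ring rather than a small node list:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]

def random_partition(row_key):
    """Random: hash the row key. Rows spread evenly, but key order is lost,
    so a range query must touch every node."""
    token = int(hashlib.md5(row_key.encode()).hexdigest(), 16)
    return NODES[token % len(NODES)]

def ordered_partition(row_key):
    """Ordered: subdivide the key space. Key order is preserved (good for
    range queries) at the cost of potential hot spots."""
    boundaries = ["g", "n", "t"]        # keys < "g" -> node-a, < "n" -> node-b, ...
    for node, upper in zip(NODES, boundaries):
        if row_key < upper:
            return node
    return NODES[-1]

# Adjacent keys land on the same node under ordered partitioning...
assert ordered_partition("apple") == ordered_partition("apricot")
# ...but under random partitioning they may land anywhere.
assert random_partition("apple") in NODES
```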
26. Simple Replication
• Nodes arranged on a ‘ring’
• A key (k, v) maps to a primary location on the ring
• Extra copies are stored on the successor nodes around the ring
29. Topology-aware Replication
• Snitch: maps a node’s IP to (data centre, rack)
• EC2Snitch
  • region → DC; availability zone → rack
• PropertyFileSnitch
  • configured from a file
30. Topology-aware Replication
(figure, over slides 30-33: two data centres, DC 1 and DC 2, each with racks r1 and r2; a key (k, v) is written to a node in one data centre, extra copies are sent to the other data centre, and within each data centre the copies are spread across racks)
35. Consistency Level
• How many replicas must respond in order to declare success
• W of the N replicas must succeed for a write to succeed
  • writes carry a client-generated timestamp
• R of the N replicas must succeed for a read to succeed
  • the most recent value, by timestamp, is returned
36. Consistency Level
• 1, 2, 3 responses
• Quorum (more than half)
• Quorum in local data center
• Quorum in each data center
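The arithmetic behind these levels can be sketched: a read is guaranteed to overlap the latest successful write whenever R + W > N, which is why quorum reads plus quorum writes give consistent results. A toy sketch (the function names are ours, not Cassandra's):

```python
# With N replicas, a write acknowledged by W of them and a read answered by
# R of them must overlap on at least one replica whenever R + W > N.
def quorum(n):
    """More than half of n replicas."""
    return n // 2 + 1

def read_sees_latest_write(n, w, r):
    return r + w > n

N = 3
assert read_sees_latest_write(N, w=quorum(N), r=quorum(N))  # QUORUM writes + reads
assert not read_sees_latest_write(N, w=1, r=1)              # ONE/ONE: may read stale data
```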
38. Read repair
• If the replicas disagree on a read, send the most recent data back
(figure, over slides 38-41: a read of key k goes to replicas n1, n2, n3; n1 returns (v, t1), n2 returns ‘not found!’, n3 returns (v’, t2); the most recent value v’ wins by timestamp, and write (k, v’, t2) is sent back to the out-of-date replicas)
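The read-repair flow can be sketched as: collect (value, timestamp) responses, return the newest value, and push it back to any replica that answered with older data or nothing at all. A toy sketch with invented node names and timestamps:

```python
# Replica responses for key k: (value, timestamp) or None for "not found".
replicas = {
    "n1": ("v", 1),     # stale
    "n2": None,         # not found
    "n3": ("v2", 2),    # most recent
}

def read_with_repair(replicas):
    answered = {n: r for n, r in replicas.items() if r is not None}
    winner_value, winner_ts = max(answered.values(), key=lambda r: r[1])
    for node, resp in replicas.items():
        if resp is None or resp[1] < winner_ts:
            replicas[node] = (winner_value, winner_ts)  # repair write-back
    return winner_value

assert read_with_repair(replicas) == "v2"
# After the read, the stale replicas have been brought up to date.
assert replicas["n1"] == replicas["n2"] == ("v2", 2)
```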
42. Hinted handoff
• When a node is unavailable
  • writes for it can be stored on another node as a ‘hint’
  • hints are delivered when the node comes back online
43. Anti-entropy
• Equivalent to ‘read repair all’
• Requires reading all data (woah)
• (Although only hashes are sent to calculate diffs)
• Manual process
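The "only hashes are sent" point can be sketched: replicas exchange digests of their data per range, and only ranges whose digests differ need the actual data transferred. A toy sketch; real Cassandra compares Merkle trees built per token range:

```python
import hashlib

def digest(rows):
    """A single hash over a replica's rows for one range; cheap to send
    compared to the data itself."""
    h = hashlib.sha256()
    for key in sorted(rows):                     # sort for determinism
        h.update(f"{key}={rows[key]}".encode())
    return h.hexdigest()

a = {"k1": "v1", "k2": "v2"}
b = {"k1": "v1", "k2": "stale"}

# Digests disagree, so this range needs repair; identical ranges are skipped.
assert digest(a) != digest(b)
assert digest(a) == digest({"k2": "v2", "k1": "v1"})
```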
45. De-normalisation
• Disk space is much cheaper than disk seeks
• Read at 100 MB/s, seek at 100 IO/s
• => copy data to avoid seeks
47. Data-centric model
m1: {
sender: user1
content: “Mary had a little lamb”
recipients: user2, user3
}
• but how to do ‘recipients’ for Inbox?
• one-to-many modelled by a join table
48. To join
m1: {
  sender: user1
  subject: “A rhyme”
  content: “Mary had a little lamb”
}
m2: {
  sender: user1
  subject: “colours”
  content: “Its fleece was white as snow”
}
m3: {
  sender: user1
  subject: “loyalty”
  content: “And everywhere that Mary went”
}

user2: {
  m1: true
}
user3: {
  m1: true
  m2: true
}
user4: {
  m2: true
  m3: true
}
49. .. or not to join
• Joins are expensive, so de-normalise to trade off space for time
• We can have lots of columns, so think BIG:
  • make the message id a time-typed super-column
  • this makes get_slice an efficient way of searching for messages in a time window
51. De-normalisation + Cassandra
• have to write a copy of the record for each recipient ... but writes are very cheap
• get_slice fetches columns for a particular row, so gets received messages for a user
• on-disk column order is optimal for this query
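The de-normalised inbox can be sketched as one row per recipient, with message columns keyed by timestamp so a get_slice over a time window is a single contiguous read. Timestamps and message ids below are invented for illustration:

```python
# Toy de-normalised inbox: recipient row key -> {timestamp column -> message id}.
# Each message is copied into every recipient's row at write time.
inbox = {
    "user2": {1001: "m1"},
    "user3": {1001: "m1", 1005: "m2"},
    "user4": {1005: "m2", 1009: "m3"},
}

def messages_in_window(user, start_ts, end_ts):
    """get_slice-style range over the time-ordered columns of one row."""
    return [msg for ts, msg in sorted(inbox[user].items())
            if start_ts <= ts <= end_ts]

print(messages_in_window("user4", 1004, 1010))  # ['m2', 'm3']
```

Because columns are stored sorted by name (here, by timestamp), this query reads one contiguous span of one row: no joins and no extra seeks.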
53. What it’s good for
• Horizontal scalability
• No single-point of failure
• Multi-data centre support
• Very high write workloads
• Tuneable consistency