Slides from my NoSQL Exchange 2011 talk introducing Apache Cassandra. This talk explained the fundamental concepts of Cassandra and then demonstrated how to build a simple ad-targeting application using PHP, with a focus on data modeling.
Video of talk: http://skillsmatter.com/podcast/home/cassandra/js-2880
4. Comparing Cassandra with X
“Can someone quickly explain the differences between the two? Other than the fact that MongoDB supports ad-hoc querying I don't know whats different. It also appears (using google trends) that MongoDB seems to be growing while Cassandra is dying off. Is this the case?”
27th July 2010
http://comments.gmane.org/gmane.comp.db.cassandra.user/7773
5. Comparing Cassandra with X
“They have approximately nothing in common. And, no, Cassandra is definitely not dying off.”
28th July 2010
http://comments.gmane.org/gmane.comp.db.cassandra.user/7773
6. Top Tip #1
To use a NoSQL solution effectively, we need to identify its sweet spot.
7. Top Tip #1
To use a NoSQL solution effectively, we need to identify its sweet spot.
This means learning about each solution: how is it designed? What algorithms does it use?
http://www.alberton.info/nosql_databases_what_when_why_phpuk2011.html
8. Comparing Cassandra with X
“they say … I can’t decide between this project and this project even though they look nothing like each other. And the fact that you can’t decide indicates that you don’t actually have a problem that requires them.”
Benjamin Black – NoSQL Tapes (at 30:15)
http://nosqltapes.com/video/benjamin-black-on-nosql-cloud-computing-and-fast_ip
9. Headline features
1. Elastic
Read and write throughput increases linearly as new machines are added
http://cassandra.apache.org/
10. Headline features
2. Decentralised
Fault tolerant with no single point of failure; no “master” node
http://cassandra.apache.org/
11. The dynamo paper
• Consistent hashing
• Vector clocks
• Gossip protocol
• Hinted handoff
• Read repair
http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
12. The dynamo paper
[Diagram: a six-node ring with RF = 3; the client sends its request to a coordinator node, which forwards it to the replica nodes around the ring.]
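The ring above can be sketched in a few lines of PHP. This is an illustrative toy only, not the phpcassa API: `crc32` and the `nodeN` names stand in for Cassandra's real MD5-based partitioner and token assignment, but the walk-the-ring replica selection is the same idea.

```php
<?php
// Toy consistent-hash ring, assuming 6 nodes and RF = 3 as in the diagram.
// crc32 stands in for Cassandra's MD5 token ranges (a hypothetical simplification).

function ringPosition(string $key, int $ringSize): int
{
    // Hash the key onto a position on the ring
    return crc32($key) % $ringSize;
}

function replicasFor(string $key, array $nodes, int $rf): array
{
    $n = count($nodes);
    $start = ringPosition($key, $n);
    $replicas = array();
    // Walk clockwise around the ring until RF distinct nodes are chosen
    for ($i = 0; $i < $rf; $i++) {
        $replicas[] = $nodes[($start + $i) % $n];
    }
    return $replicas;
}

$nodes = array('node1', 'node2', 'node3', 'node4', 'node5', 'node6');
$replicas = replicasFor('f97be9cc-5255-4578-8813-76701c0945bd', $nodes, 3);

assert(count($replicas) === 3);                // exactly RF replicas
assert(count(array_unique($replicas)) === 3);  // all distinct
```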
13. Headline features
3. Rich data model
Column based, range slices, column slices, secondary indexes, counters, expiring columns
http://cassandra.apache.org/
14. The big table paper
• Sparse "columnar" data model
• SSTable disk storage
• Append-only commit log
• Memtable (buffer and sort)
• Immutable SSTable files
• Compaction
http://labs.google.com/papers/bigtable-osdi06.pdf
http://www.slideshare.net/geminimobile/bigtable-4820829
15. The big table paper
[Diagram: a Column Family maps each Row Key to a set of Columns, each of which is a (Name, Value) pair.]
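One way to picture this sparse model is as nested PHP arrays: a Column Family maps row keys to their own name => value maps, and rows need not share columns. The row keys, column names and values below are invented for illustration.

```php
<?php
// A Column Family as nested arrays: rowKey => (columnName => value).
// Row keys and columns here are hypothetical examples.
$columnFamily = array(
    'row1' => array('name' => 'alice', 'email' => 'alice@example.com'),
    'row2' => array('name' => 'bob'),  // no email column: sparse, not NULL
);

// Different rows can hold entirely different sets of columns
assert(!isset($columnFamily['row2']['email']));
assert(count($columnFamily['row1']) === 2);
```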
16. Headline features
4. You're in control
Tunable consistency, per operation
http://cassandra.apache.org/
18. Consistency levels: write operations
Level Description
ANY One node, including hinted handoff
ONE One node
QUORUM N/2 + 1 replicas
LOCAL_QUORUM N/2 + 1 replicas in local data centre
EACH_QUORUM N/2 + 1 replicas in each data centre
ALL All replicas
http://wiki.apache.org/cassandra/API#Write
19. Consistency levels: read operations
Level Description
ONE 1st Response
QUORUM N/2 + 1 replicas
LOCAL_QUORUM N/2 + 1 replicas in local data centre
EACH_QUORUM N/2 + 1 replicas in each data centre
ALL All replicas
http://wiki.apache.org/cassandra/API#Read
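The quorum arithmetic behind these two tables can be checked with a couple of helper functions. This is a sketch; `quorum` and `isStronglyConsistent` are my names, not a Cassandra API. With N replicas, QUORUM is floor(N/2) + 1, and a read level R plus write level W gives strong consistency whenever R + W > N.

```php
<?php
// Sketch of the consistency-level arithmetic; function names are hypothetical.

function quorum(int $replicationFactor): int
{
    // N/2 + 1, rounded down: e.g. RF=3 => 2, RF=5 => 3
    return intdiv($replicationFactor, 2) + 1;
}

function isStronglyConsistent(int $readLevel, int $writeLevel, int $rf): bool
{
    // Reads overlap the latest write when R + W > N
    return ($readLevel + $writeLevel) > $rf;
}

$rf = 3;
assert(quorum($rf) === 2);
assert(isStronglyConsistent(quorum($rf), quorum($rf), $rf)); // QUORUM writes + QUORUM reads
assert(!isStronglyConsistent(1, 1, $rf));                    // ONE + ONE can return stale data
```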
20. Headline features
5. Performant
Well known for high write performance
http://www.datastax.com/docs/1.0/introduction/index#core-strengths-of-cassandra
21. Benchmark*
http://blog.cubrid.org/dev-platform/nosql-benchmarking/
* Add pinch of salt
22. Recap: headline features
1. Elastic
2. Decentralised
3. Rich data model
4. You’re in control (tunable consistency)
5. Performant
23. A simple ad-targeting application
[Diagram: from a pool of ads, the application chooses which ad to show, based on our user knowledge.]
24. A simple ad-targeting application
Allow us to capture user behaviour/data via “pixels”, placing users into segments (different buckets)
http://pixel.wehaveyourkidneys.com/add.php?add=foo
25. A simple ad-targeting application
Record clicks and impressions of each ad, storing data per-ad and per-segment
http://pixel.wehaveyourkidneys.com/adImpression.php?ad=1
http://pixel.wehaveyourkidneys.com/adClick.php?ad=1
26. A simple ad-targeting application
Real-time ad performance analytics, broken down by segment (which segments are performing well?)
http://www.wehaveyourkidneys.com/adPerformance.php?ad=1
27. A simple ad-targeting application
Recommendations based on best-performing ads
(this is left as an exercise for the reader)
28. Additional requirements
• Large number of users
• High volume of impressions
• Highly available – downtime is money
29. A good fit for Cassandra?
Yes!
Big data, high availability and lots of writes are all good signs that Cassandra will fit well.
http://www.nosqldatabases.com/main/2010/10/19/what-is-cassandra-good-for.html
30. A good fit for Cassandra?
That said, people are using Cassandra for many other things, for example highly available HTTP request routing (tiny data!)
http://blip.tv/datastax/highly-available-http-request-routing-dns-using-cassandra-5501901
31. Top Tip #2
Cassandra is an excellent fit where availability matters, where there is a lot of data, or where you have a large number of write operations.
33. Data modeling
Start from your queries, work backwards
http://www.slideshare.net/mattdennis/cassandra-data-modeling
http://blip.tv/datastax/data-modeling-workshop-5496906
34. Data model basics: conflict resolution
Per-column, timestamp-based conflict resolution
{ column: foo, value: bar, timestamp: 1000 }
{ column: foo, value: zing, timestamp: 1001 }
http://cassandra.apache.org/
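The rule above is simple last-write-wins per column. A sketch in PHP (`resolve` is a hypothetical helper for illustration, not part of Cassandra or phpcassa):

```php
<?php
// Last-write-wins per column: the version with the higher timestamp survives.
// resolve() is a hypothetical illustration, not a Cassandra API.

function resolve(array $a, array $b): array
{
    return ($a['timestamp'] >= $b['timestamp']) ? $a : $b;
}

$v1 = array('column' => 'foo', 'value' => 'bar',  'timestamp' => 1000);
$v2 = array('column' => 'foo', 'value' => 'zing', 'timestamp' => 1001);

assert(resolve($v1, $v2)['value'] === 'zing');  // the later write wins
assert(resolve($v2, $v1)['value'] === 'zing');  // regardless of argument order
```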
36. Data model basics: column ordering
Columns are ordered at write time, according to the Column Family schema
{ column: zebra, value: foo, timestamp: 1000 }
{ column: badger, value: foo, timestamp: 1001 }
http://cassandra.apache.org/
37. Data model basics: column ordering
Columns are ordered at write time, according to the Column Family schema
{
badger: foo,
zebra: foo
}
(with AsciiType column schema)
http://cassandra.apache.org/
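The effect of an AsciiType comparator is much like ksort() on a PHP array: whatever order the columns are written in, they are stored and returned ordered by column name. A rough analogy, not how Cassandra actually stores data:

```php
<?php
// Columns sort by name under an AsciiType comparator, regardless of write order;
// ksort() on an array is a rough analogy for the comparator-ordered row.

$row = array();
$row['zebra']  = 'foo';  // written first
$row['badger'] = 'foo';  // written second, but sorts first

ksort($row);  // comparator-ordered view of the row

assert(array_keys($row) === array('badger', 'zebra'));
```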
38. Data modeling: user segments
Add user to bucket X, with expiry time Y
Which buckets is user X in?
["user"][<uuid>][<bucketId>] = 1
[CF] [rowKey] [columnName] = value
39. Data modeling: user segments
user Column Family:
[f97be9cc-5255-4578-8813-76701c0945bd][bar] = 1
[f97be9cc-5255-4578-8813-76701c0945bd][foo] = 1
[06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][baz] = 1
[06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][zoo] = 1
[503778bc-246f-4041-ac5a-fd944176b26d][aaa] = 1
Q: Is user in segment X?
A: Single column fetch
40. Data modeling: user segments
user Column Family:
[f97be9cc-5255-4578-8813-76701c0945bd][bar] = 1
[f97be9cc-5255-4578-8813-76701c0945bd][foo] = 1
[06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][baz] = 1
[06a6f1b0-fcf2-41d9-8949-fe2d416bde8e][zoo] = 1
[503778bc-246f-4041-ac5a-fd944176b26d][aaa] = 1
Q: Which segments is user X in?
A: Column slice fetch
41. Top Tip #3
With column slices, we get the columns back ordered according to our schema.
We cannot do the same for rows, however, unless we use the Order Preserving Partitioner.
42. Top Tip #4
Don’t use the Order Preserving Partitioner unless you absolutely have to
http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
43. Data modeling: user segments
Add user to bucket X, with expiry time Y
Which buckets is user X in?
["user"][<uuid>][<bucketId>] = 1
[CF] [rowKey] [columnName] = value
44. Expiring columns
An expiring column will be automatically deleted after n seconds
http://cassandra.apache.org/
45. Data modeling: user segments
$pool = new ConnectionPool('whyk', array('localhost'));
$users = new ColumnFamily($pool, 'users');
$users->insert(
    $userUuid,
    array($segment => 1),
    NULL,     // default timestamp
    $expires  // TTL in seconds
);
Using phpcassa client: https://github.com/thobbs/phpcassa
46. Data modeling: user segments
UPDATE users
USING TTL = 3600
SET 'foo' = 1
WHERE KEY = 'f97be9cc-5255-4578-8813-76701c0945bd'
Using CQL
http://www.datastax.com/dev/blog/what%E2%80%99s-new-in-cassandra-0-8-part-1-cql-the-cassandra-query-language
http://www.datastax.com/docs/1.0/references/cql
47. Top Tip #5
Try to exploit Cassandra’s columnar data model; avoid read-before-write and locking by safely mutating individual columns
48. Data modeling: ad performance
Track overall ad performance: how many clicks/impressions per ad?
["ads"][<adId>][<stamp>]["click"] = #
["ads"][<adId>][<stamp>]["impression"] = #
[CF] [Row] [S.Col] [Col] = value
Using super columns
49. Top Tip #6
Friends don’t let friends use Super Columns.
http://rubyscale.com/2010/beware-the-supercolumn-its-a-trap-for-the-unwary/
50. Data modeling: ad performance
Try again using regular columns:
["ads"][<adId>][<stamp>-"click"] = #
["ads"][<adId>][<stamp>-"impression"] = #
[CF] [Row] [Col] = value
51. Data modeling: ad performance
ads Column Family:
[1][2011103015-click] = 1
[1][2011103015-impression] = 3434
[1][2011103016-click] = 12
[1][2011103016-impression] = 5411
[1][2011103017-click] = 2
[1][2011103017-impression] = 345
Q: Get performance of ad X between two date/times
A: Column slice against a single row, specifying a start stamp and end stamp + 1
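The “end stamp + 1” trick can be sketched as follows (`sliceRange` is a hypothetical helper, not the phpcassa API): because column names like 2011103015-click sort as strings, slicing from the start stamp to end stamp + 1 captures every stat column in the window.

```php
<?php
// Build the start/finish column names for an hour-bucket slice.
// sliceRange() is a hypothetical helper for illustration.

function sliceRange(string $startHour, string $endHour): array
{
    return array(
        'start'  => $startHour,
        // End stamp + 1, so columns like "<endHour>-impression" are included
        'finish' => (string)($endHour + 1),
    );
}

$range = sliceRange('2011103015', '2011103017');

assert($range['start'] === '2011103015');
assert($range['finish'] === '2011103018');
// String comparison keeps "2011103017-impression" inside the slice:
assert('2011103017-impression' < '2011103018');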
52. Think carefully about your data
This scheme works because I’m assuming each ad has a relatively short lifespan. This means that there are lots of rows, and hence the load is spread.
Other options:
http://rubyscale.com/2011/basic-time-series-with-cassandra/
53. Counters
• Distributed atomic counters
• Easy to use
• Not idempotent
http://www.datastax.com/dev/blog/whats-new-in-cassandra-0-8-part-2-counters
54. Data modeling: ad performance
$stamp = date('YmdH');
$ads->add(
    $adId,               // row key
    "$stamp-impression", // column
    1                    // increment
);
We’ll store performance metrics in hour buckets for graphing.
55. Data modeling: ad performance
UPDATE ads
SET '2011103015-impression'
= '2011103015-impression' + 1
WHERE KEY = '1'
56. Data modeling: performance/segment
We can add another dimension to our stats so we can break down by segment.
["ads"][<adId>][<stamp>-<segment>-"click"] = #
[CF] [Row] [Col] = value
57. Data modeling: performance/segment
ads Column Family:
[1][2011103015-bar-click] = 1
[1][2011103015-bar-impression] = 3434
[1][2011103015-foo-click] = 12
[1][2011103015-foo-impression] = 5411
[1][2011103016-bar-click] = 2
Q: Get performance of ad X between two date/times, split by segment
A: Column slice against a single row, specifying a start stamp and end stamp + 1
58. Data modeling: performance/segment
$stamp = date('YmdH');
$ads->add(
    "$adId-segments",             // row key
    "$stamp-$segment-impression", // column
    1                             // increment
);
We’ll store performance metrics in hour buckets for graphing.
59. Data modeling: segment stats
Track overall clicks/impressions per bucket; which buckets are most clicky?
["segments"][<adId>-"segments"][<stamp>-<segment>-"click"] = #
[CF] [Row] [Col] = value
60. Recap: Data modeling
• Think about the queries, work backwards
• Don’t overuse single rows; try to spread the load
• Don’t use super columns
• Ask on IRC! #cassandra
61. Recap: Common data modeling patterns
1. Using column names with no value
[cf][rowKey][columnName] = 1
62. Recap: Common data modeling patterns
2. Counters
[cf][rowKey][columnName]++
63. And also…
3. Serialising a whole object
[cf][rowKey][columnName] = {
foo: 3,
bar: 11
}
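Pattern 3 in PHP might look like this sketch, using JSON as the serialisation format (any format works; the field names are invented examples):

```php
<?php
// Serialise a whole object into a single column value (JSON here).
// Field names are hypothetical examples.

$stats = array('foo' => 3, 'bar' => 11);

$columnValue = json_encode($stats);          // write path: object -> string
$decoded = json_decode($columnValue, true);  // read path: string -> array

assert($decoded === array('foo' => 3, 'bar' => 11));
```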
64. There’s more: Brisk
Integrated Hadoop distribution (without HDFS installed). Run Hive and Pig queries directly against Cassandra.
DataStax now offer this functionality in their “Enterprise” product
http://www.datastax.com/products/enterprise
65. Hive
CREATE EXTERNAL TABLE tempUsers
(userUuid string, segmentId string, value string)
STORED BY
'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
"cassandra.columns.mapping" = ":key,:column,:value",
"cassandra.cf.name" = "users"
);
SELECT segmentId, count(1) AS total
FROM tempUsers
GROUP BY segmentId
ORDER BY total DESC;
66. There’s more: Supercharged Cassandra
Acunu have re-engineered the entire Unix storage stack, optimised specifically for Big Data workloads.
Includes instant snapshots of CFs.
http://www.acunu.com/products/choosing-cassandra/
70. In conclusion
Hadoop integration means we can analyse data directly from a Cassandra cluster
71. In conclusion
Cassandra’s sweet spot is highly available “big data” (especially time-series) with large numbers of writes
72. Thanks
Learn more about Cassandra
meetup.com/Cassandra-London
Check out the code: https://github.com/davegardnerisme/we-have-your-kidneys
Watch videos from Cassandra SF 2011
http://www.datastax.com/events/cassandrasf2011/presentations