2. Overview
● General guiding goals for Cassandra data models
● Interesting and/or common examples/questions
to get us started
● Should be plenty of time at the end for
questions, so bring them up if you have them!
3. Data Modeling Goals
● Keep data queried together on disk together
● In a more general sense, think about the efficiency of querying your data and work backward from there to a model in Cassandra
● Usually, you shouldn't try to normalize your data
(contrary to many use cases in relational
databases)
● Usually better to keep a record that something
happened as opposed to changing a value (not
always the best approach though)
4. Time Series Data
● Easily the most common use of Cassandra
● Financial tick data
● Click streams
● Sensor data
● Performance metrics
● GPS data
● Event logs
● etc, etc, etc ...
● All of the above are essentially the same as far as
C* is concerned
5. Time Series Thought Model
● Things happen in some timestamp-ordered
stream and consist of values associated with
the given timestamp (i.e. “data points”)
– Every 30 seconds record location, speed, heading and
engine temp
– Every 5 minutes record CPU, IO and Memory usage
● We are interested in recreating, aggregating
and/or analyzing arbitrary time slices of the
stream
– Where was agent:007 and what was he doing between
11:21am and 2:38pm yesterday?
– What are the last N actions foo did on my site?
6. Data Points Defined
● Each data point has 1-N values
● Each data point corresponds to a specific point
in time or an interval/bucket (e.g. the 5th minute of the 17th hour on some date)
7. Data Points Mapped to Cassandra
● Row Key is the id of the data point stream bucketed by time
– e.g. plane01:jan_2011 or plane01:jan_01_2011 for month or day buckets
respectively
● Column Name is TimeUUID(timestamp of data point)
● Column Value is serialized data point
– JSON, XML, pickle, msgpack, thrift, protobuf, avro, BSON, WTFe
● Bucketing
– Avoids always requiring multiple seeks when only small slices of the stream are requested (e.g. stream is 5 years old but I'm only interested in Jan 5th 3 years ago and/or yesterday between 2pm and 3pm).
– Makes it easy to lazily aggregate old stream activity
– Reduces compaction overhead since old rows will never have to be merged again
(until you “back fill” and/or delete something)
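A minimal sketch of this mapping, assuming a Thrift-era Python client (pycassa) and a column family named TimeSeries whose comparator is TimeUUIDType; the Telemetry keyspace, the names and the JSON serialization are illustrative, not prescribed by the slides:

    import json, time
    import pycassa
    from pycassa.util import convert_time_to_uuid

    pool = pycassa.ConnectionPool('Telemetry', ['localhost:9160'])  # hypothetical keyspace
    events = pycassa.ColumnFamily(pool, 'TimeSeries')               # comparator: TimeUUIDType

    def record(stream_id, data_point, ts=None):
        ts = ts or time.time()
        # Row key: stream id bucketed by month, e.g. plane01:jan_2011
        row_key = '%s:%s' % (stream_id, time.strftime('%b_%Y', time.gmtime(ts)).lower())
        # Column name: TimeUUID built from the data point's timestamp
        col_name = convert_time_to_uuid(ts, randomize=True)
        # Column value: the serialized data point (JSON here; msgpack/protobuf/... work too)
        events.insert(row_key, {col_name: json.dumps(data_point)})

    record('plane01', {'lat': 28.90, 'long': 124.30, 'alt_ft': 45000, 'wine_pct': 70})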
8. A Slightly More Concrete Example
● Sensor data from airplanes
● Every 30 seconds each plane sends
latitude+longitude, altitude and wine remaining
in mdennis' glass.
9. The Visual
plane5:jan_2011 (p5:j11)
                 TimeUUID0        TimeUUID1        TimeUUID2
  lat/long       28.90, 124.30    28.85, 124.25    28.81, 124.22
  altitude       45K feet         44K feet         44K feet
  wine level     70%              50%              95%
(middle of the ocean and half a glass of wine at 44K feet)
● Row Key is the id of the stream being recorded (e.g.
plane5:jan_2011)
● Column Name is timestamp (or TimeUUID) associated with
the data point
● Column Value is the value of the event (e.g. protobuf
serialized lat/long+alt+wine_level)
10. Querying
● When querying, construct TimeUUIDs for
the min/max of the time range in question
and use them as the start/end in your
get_slice call
● Or use an empty start and/or end along with
a count
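A sketch of such a slice query, reusing the events column family from the write sketch above; convert_time_to_uuid with lowest_val picks the smallest (or largest) TimeUUID for a boundary timestamp:

    from datetime import datetime
    from pycassa.util import convert_time_to_uuid

    # Everything plane01 reported between 2pm and 3pm on Jan 5th 2011
    start = convert_time_to_uuid(datetime(2011, 1, 5, 14, 0), lowest_val=True)
    end   = convert_time_to_uuid(datetime(2011, 1, 5, 15, 0), lowest_val=False)
    cols = events.get('plane01:jan_2011',
                      column_start=start,
                      column_finish=end,
                      column_count=10000)

    # Or an empty start/end plus a count for "the first N data points in the bucket"
    first_100 = events.get('plane01:jan_2011', column_count=100)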
11. Bucket Sizes?
● Depends greatly on
● Average size of time slice queried
● Average data point size
● Write rate of data points to a stream
● IO capacity of the nodes
12. So... Bucket Sizes?
● No bigger than a few GB per row
● Estimated row size ≈ bucket_size * write_rate * sizeof(avg_data_point)
● Bucket size >= average size of time slice queried
● No more than maybe 10M entries per row
● No more than a month if you have lots of different
streams
● NB: there are exceptions to all of the above, which
are really nothing more than guidelines
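A back-of-the-envelope check of those guidelines, with made-up numbers (one ~200 byte data point every 30 seconds, month-sized buckets):

    # Illustrative numbers only
    write_rate     = 1 / 30.0          # data points per second (one every 30s)
    avg_data_point = 200               # bytes per serialized data point
    bucket_seconds = 30 * 24 * 3600    # one month bucket

    row_bytes   = bucket_seconds * write_rate * avg_data_point
    row_columns = bucket_seconds * write_rate

    print(row_bytes / 2**20, 'MB per row')    # ~16.5 MB: well under a few GB
    print(row_columns, 'columns per row')     # 86400: well under ~10M entries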
13. Ordering
● In cases where the most recent data is the
most interesting (e.g. last N events for entity foo
or last hour of events for entity bar), you can
reverse the comparator (i.e. sort descending
instead of ascending)
● http://thelastpickle.com/2011/10/03/Reverse-Comparators/
● https://issues.apache.org/jira/browse/CASSANDRA-2355
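Reads can ask for the slice newest-first regardless of schema; the reversed comparator from the links above additionally stores the newest columns at the front of the row so those reads stay cheap. A sketch, reusing the events column family from the earlier write sketch:

    # Last 50 data points in the bucket, newest first
    last_50 = events.get('plane01:jan_2011',
                         column_reversed=True,
                         column_count=50)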
14. Spanning Buckets
● If your time slice spans buckets, you'll need to
construct all the row keys in question (i.e. number of
unique row keys = spans+1)
● If you want all the results between the dates, pass
all the row keys to multiget_slice with the start and
end of the desired time slice
● If you only want the first N results within your time
slice, lowest latency comes from multiget_slice as
above but best efficiency comes from serially paging
one row key at a time until your desired count is
reached
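A sketch of the multiget approach with the assumed pycassa client; the slice here crosses a month boundary, so it touches two row keys:

    from datetime import datetime
    from pycassa.util import convert_time_to_uuid

    start_dt = datetime(2011, 1, 31, 22, 0)
    end_dt   = datetime(2011, 2, 1, 2, 0)

    # One row key per bucket the slice touches
    row_keys = ['plane01:jan_2011', 'plane01:feb_2011']

    rows = events.multiget(row_keys,
                           column_start=convert_time_to_uuid(start_dt, lowest_val=True),
                           column_finish=convert_time_to_uuid(end_dt, lowest_val=False),
                           column_count=10000)

    # Stitch the buckets back together into one time-ordered stream
    data_points = [col for key in row_keys for col in rows.get(key, {}).items()]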
15. Expiring Streams
(e.g. “I only care about the past year”)
● Just set the TTL to the age you want to keep
● yeah, that's pretty much it ...
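A sketch of an expiring write, identical to the earlier write sketch except for the ttl argument (pycassa's insert takes the TTL in seconds):

    import json, time
    from pycassa.util import convert_time_to_uuid

    ONE_YEAR = 365 * 24 * 3600   # seconds

    def record_expiring(stream_id, data_point, ts=None):
        ts = ts or time.time()
        row_key = '%s:%s' % (stream_id, time.strftime('%b_%Y', time.gmtime(ts)).lower())
        col_name = convert_time_to_uuid(ts, randomize=True)
        # Same write as before, but the column silently disappears after a year
        events.insert(row_key, {col_name: json.dumps(data_point)}, ttl=ONE_YEAR)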
16. Counters
● Sometimes you're only interested in counting
things that happened within some time slice
● Minor adaptation to the previous content to use
counters (be aware they are not idempotent)
● Column names become buckets
● Values become counters
17. Example: Counting User Logins
user3:system5:logins:by_day (U3:S5:L:D)
                 20110107      ...      20110523
                 2             ...      7
(2 logins on Jan 7th 2011 for user 3 on system 5; 7 logins on May 23rd 2011 for user 3 on system 5)

user3:system5:logins:by_hour (U3:S5:L:H)
                 2011010710    ...      2011052316
                 1             ...      2
(one login for user 3 on system 5 on Jan 7th 2011 in the 10th hour; 2 logins for user 3 on system 5 on May 23rd 2011 in the 16th hour)
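A sketch of the login-counting example with the assumed pycassa client; LoginCounters is a hypothetical counter column family, and add() increments a counter column by one:

    import time
    import pycassa

    pool   = pycassa.ConnectionPool('Telemetry', ['localhost:9160'])
    logins = pycassa.ColumnFamily(pool, 'LoginCounters')   # counter column family

    def count_login(user_id, system_id, ts=None):
        ts = ts or time.time()
        base = 'user%d:system%d:logins' % (user_id, system_id)
        # Column names are the time buckets, values are counters
        logins.add(base + ':by_day', time.strftime('%Y%m%d', time.gmtime(ts)))
        logins.add(base + ':by_hour', time.strftime('%Y%m%d%H', time.gmtime(ts)))

    count_login(3, 5)   # one more login for user 3 on system 5, in today's and this hour's buckets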
18. Eventually Atomic
● In a legacy RDBMS atomicity is “easy”
● Attempting full ACID compliance in distributed systems is a
bad idea (and actually impossible in the strictest sense)
● However, consistency is important and can certainly be
achieved in C*
● Many approaches / alternatives
● I like a transaction log approach, especially in the context
of C*
19. Transaction Logs
(in this context)
● Records what is going to be performed before it
is actually performed
● Performs the actions that need to be atomic (in
the indivisible sense, not the all at once sense
which is usually what people mean when they
say isolation)
● Marks that the actions were performed
20. In Cassandra
● Serialize all actions that need to be performed
in a single column – JSON, XML, YAML (yuck!),
pickle, JSO, msgpack, protobuf, et cetera
● Row Key = randomly chosen C* node token
● Column Name = TimeUUID(nowish)
● Perform actions
● Delete Column
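A sketch of that flow with the assumed pycassa client; the XACT_LOG name, the token list and the JSON serialization are illustrative:

    import json, random, time
    import pycassa
    from pycassa.util import convert_time_to_uuid

    pool     = pycassa.ConnectionPool('Telemetry', ['localhost:9160'])
    xact_log = pycassa.ColumnFamily(pool, 'XACT_LOG')

    # Illustrative ring tokens; in practice, the tokens owned by the cluster's nodes
    NODE_TOKENS = ['0', '85070591730234615865843651857942052864']

    def atomically(actions, perform):
        row_key  = random.choice(NODE_TOKENS)                          # randomly chosen node token
        col_name = convert_time_to_uuid(time.time(), randomize=True)   # TimeUUID(nowish)
        # 1. Record what is about to be performed, durably (QUORUM, per the next slide)
        xact_log.insert(row_key, {col_name: json.dumps(actions)},
                        write_consistency_level=pycassa.ConsistencyLevel.QUORUM)
        # 2. Perform the (idempotent) actions themselves
        perform(actions)
        # 3. Mark them done by deleting the log column
        xact_log.remove(row_key, columns=[col_name])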
21. Configuration Details
● Short gc_grace_seconds on the XACT_LOG
Column Family (e.g. 5 minutes)
● Write to XACT_LOG at CL.QUORUM or
CL.LOCAL_QUORUM for durability
● if it fails with an unavailable exception, pick a
different node token and/or node and try again
(gives same semantics as a relational DB in terms
of knowing the state of your transaction)
22. Failures
● Before insert into the XACT_LOG
● After insert, before actions
● After insert, in middle of actions
● After insert, after actions, before delete
● After insert, after actions, after delete
23. Recovery
● Each C* node has a cron job offset from every other by some time period
● Each job runs the same code: multiget_slice for
all node tokens for all columns older than some
time period (the “recovery period”)
● Any columns found need to be replayed in their
entirety and are deleted after replay (normally
there are no columns because normally things
are working)
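A sketch of such a recovery job, reusing xact_log and NODE_TOKENS from the transaction-log sketch above; the 10 minute recovery period is illustrative:

    import json, time
    from pycassa.util import convert_time_to_uuid

    RECOVERY_PERIOD = 10 * 60   # seconds; must comfortably exceed normal transaction time

    def recover(perform):
        cutoff = convert_time_to_uuid(time.time() - RECOVERY_PERIOD, lowest_val=False)
        # All columns older than the recovery period, across every node-token row
        stale = xact_log.multiget(NODE_TOKENS, column_finish=cutoff, column_count=1000)
        for row_key, cols in stale.items():
            for col_name, payload in cols.items():
                perform(json.loads(payload))                   # replay in its entirety
                xact_log.remove(row_key, columns=[col_name])   # then delete it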
24. XACT_LOG Comments
● Idempotent writes are awesome (that's why this
works so well)
● Doesn't work so well for counters (they're not
idempotent)
● Clients must be able to deal with temporarily
inconsistent data (they have to do this anyway)