The document summarizes a workshop on Cassandra data modeling. It discusses four use cases: (1) modeling clickstream data by storing sessions and clicks in separate column families, (2) modeling a rolling time window of data points by storing each point in a column with a TTL, (3) modeling rolling counters by storing counts in columns indexed by time bucket, and (4) using transaction logs to achieve eventual consistency when modeling many-to-many relationships by serializing transactions and deleting logs after commit. The document provides recommendations and alternatives for each use case.
2. Overview
● Hopefully interactive
● Use cases submitted via Google Moderator,
email, IRC, etc
● Interesting and/or common requests in the
slides to get us started
● Bring up others if you have them!
3. Data Modeling Goals
● Keep data queried together on disk together
● In a more general sense think about the
efficiency of querying your data and work
backward from there to a model in Cassandra
● Don't try to normalize your data (contrary to
many use cases in relational databases)
● Usually better to keep a record that something
happened as opposed to changing a value (not
always advisable or possible)
4. ClickStream Data
(use case #1)
● A ClickStream (in this context) is the sequence
of actions a user of an application performs
● Usually this refers to clicking links in a WebApp
● Useful for ad selection, error recording, UI/UX
improvement, A/B testing, debugging, et cetera
● Not a lot of detail in the Google Moderator
request on what the purpose of collecting the
ClickStream data was – so I made some up
5. ClickStream Data Defined
● Record actions of a user within a session for
debugging purposes if app/browser/page/server
crashes
6. Recording Sessions
● CF for sessions a user has had
● Row Key is user name/id
● Column Name is session id (TimeUUID)
● Column Value is empty (or length of session, or some
aggregated details about the session after it ended)
● CF for actual sessions
● Row Key is TimeUUID session id
● Column Name is timestamp/TimeUUID of each click
● Column Value is details about that click (serialized)
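The two column families above can be sketched in memory as follows. This is only an illustration, not a Cassandra client: plain dicts stand in for the column families, integers stand in for TimeUUIDs, and the function names are made up for this sketch.

```python
import json
from collections import defaultdict

# In-memory stand-ins for the two column families (assumption: in Cassandra
# the columns of each row would be kept sorted by column name).
user_sessions = defaultdict(dict)   # row key: user id    -> {session id: summary}
sessions = defaultdict(dict)        # row key: session id -> {click time: click JSON}


def start_session(user_id, session_id):
    """Record that the user has a session; the column value starts out empty."""
    user_sessions[user_id][session_id] = ""


def record_click(session_id, click_time, details):
    """Store one click as a serialized column in the session's row."""
    sessions[session_id][click_time] = json.dumps(details)


def clicks_in_order(session_id, start=None, end=None):
    """Return (time, details) pairs sorted by column name, optionally sliced."""
    row = sessions[session_id]
    return [
        (t, json.loads(row[t]))
        for t in sorted(row)
        if (start is None or t >= start) and (end is None or t <= end)
    ]
```

Because session ids sort by time, the most recent session for a user is simply the largest column name in that user's row.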
7. UserSessions Column Family
           Session_01     Session_02     Session_03
           (TimeUUID)     (TimeUUID)     (TimeUUID)
userId     (empty/agg)    (empty/agg)    (empty/agg)
● Most recent session
● All sessions for a given time period
8. Sessions Column Family
             timestamp_01      timestamp_02      timestamp_03
SessionId    ClickData         ClickData         ClickData
(TimeUUID)   (json/xml/etc)    (json/xml/etc)    (json/xml/etc)
● Retrieve entire session's ClickStream (row)
● Order of clicks/events preserved
● Retrieve ClickStream for a slice of time within the session
● First action taken in a session
● Most recent action taken in a session
● Why JSON/XML/etc?
10. Of Course
(depends on what you want to do)
● Secondary Indexes
● All Sessions in one row
● Track by time of activity instead of session
11. Secondary Indexes Applied
● Drop UserSessions CF and use secondary
indexes
● Uses a “well known” column to record the user
in the row; secondary index is created on that
column
● Doesn't work so well when storing aggregates
about sessions in the UserSessions CF
● Better when you want to retrieve all sessions a
user has had
12. All Sessions In One Row Applied
● Row Key is userId
● Column Name is composite of timestamp and
sessionId
● Can efficiently request activity of a user across
all sessions within a specific time range
● Rows could potentially grow quite large, be
careful
● Reads will almost always require at least two
seeks on disk
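A minimal sketch of the time-range request over composite column names, assuming integer timestamps and a sorted Python list of (timestamp, sessionId) tuples standing in for one user's row. The function name is hypothetical.

```python
from bisect import bisect_left


def slice_activity(columns, start_ts, end_ts):
    """Return the composite column names with start_ts <= timestamp <= end_ts.

    `columns` must be a sorted list of (timestamp, session_id) tuples.
    Assumes integer timestamps so that end_ts + 1 is the next possible value.
    """
    lo = bisect_left(columns, (start_ts,))      # first column at or after start_ts
    hi = bisect_left(columns, (end_ts + 1,))    # first column after end_ts
    return columns[lo:hi]
```

Since the timestamp is the first component, one contiguous slice covers activity from every session in the range.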
13. Time Period Partitioning Applied
● Row Key is composite of userId and time “bucket”
● e.g. jan_2011 or jan_01_2011 for month or day buckets respectively
● Column Name is TimeUUID of click
● Column Value is serialized click data
● Avoids always requiring multiple seeks when the user has old
data but only recent data is requested
● Easy to lazily aggregate old activity
● Can still efficiently request activity of a user across all
sessions within a specific time range
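The bucketed row keys can be sketched like this, using month buckets. The ":" separator and the function names are assumptions made for the example; only the jan_2011 bucket format comes from the slide.

```python
from datetime import datetime

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def row_key(user_id, ts):
    """Row key for the month bucket containing ts, e.g. user3:jan_2011."""
    return f"{user_id}:{MONTHS[ts.month - 1]}_{ts.year}"


def row_keys_for_range(user_id, start, end):
    """All month-bucket row keys a query from start to end has to read."""
    keys = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        keys.append(f"{user_id}:{MONTHS[month - 1]}_{year}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return keys
```

A request for only recent data touches one or two rows, no matter how much older data the user has.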
14. Rolling Time Window Of Data Points
(use case #2)
● Something similar to RRDTool was the example given
● Essentially store a series of data points within a
rolling window
● A common request from Cassandra users (this
and/or something similar)
15. Data Points Defined
● Each data point has a value (or multiple values)
● Each data point corresponds to a specific point
in time or an interval/bucket (e.g. 5th minute of
the 17th hour on some date)
16. Time Window Model
System7:RenderTime
          TimeUUID0    TimeUUID1    TimeUUID2
s7:rt     0.051        0.014        0.173
(some request took 0.014 seconds to render)
● Row Key is the id of the time window data you are
tracking (e.g. server7:render_time)
● Column Name is timestamp (or TimeUUID) the event
occurred at
● Column Value is the value of the event (e.g. 0.051)
17. The Details
● Cassandra TTL values are key here
● When you insert each data point set the TTL to the max time
range you will ever request; there is very little overhead to
expiring columns
● When querying, construct TimeUUIDs for the min/max of
the time range in question and use them as the start/end
in your get_slice call
● Consider partitioning the rows by a known time period
(e.g. “year”) if you plan on keeping a long history of data
(NB: requires slightly more complex logic in the app if a
time range spans such a period)
● Very efficient queries for any window of time
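One way to construct the min/max TimeUUIDs for a slice is sketched below with Python's standard uuid module. The helper names are hypothetical. Filling the non-timestamp fields with their lowest and highest values is an assumption about how ties are ordered, so a client library's own helper is preferable where one exists.

```python
import uuid
from datetime import datetime, timezone

# Version 1 UUID timestamps count 100 ns intervals since 1582-10-15 (UTC).
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def uuid_timestamp(dt):
    """Number of 100 ns intervals between the UUID epoch and dt (aware datetime)."""
    delta = dt - GREGORIAN_EPOCH
    return (delta.days * 86400 + delta.seconds) * 10**7 + delta.microseconds * 10


def timeuuid_bound(dt, highest):
    """Build a version 1 UUID for dt with the lowest or highest filler fields."""
    ts = uuid_timestamp(dt)
    time_low = ts & 0xFFFFFFFF
    time_mid = (ts >> 32) & 0xFFFF
    time_hi_version = ((ts >> 48) & 0x0FFF) | 0x1000   # set version to 1
    if highest:
        clock_hi, clock_low, node = 0xBF, 0xFF, 0xFFFFFFFFFFFF
    else:
        clock_hi, clock_low, node = 0x80, 0x00, 0x000000000000
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_hi, clock_low, node))


def slice_bounds(start, end):
    """Start/end column names covering every TimeUUID from start to end."""
    return timeuuid_bound(start, highest=False), timeuuid_bound(end, highest=True)
```

The two returned values would be passed as the start and end column names of the slice.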
18. Rolling Window Of Counters
(use case #3)
● “How to model rolling time window that contains counters with time
buckets of monthly (12 months), weekly (4 weeks), daily (7 days),
hourly (24 hours)? Example would be; how many times user logged
into a system in last 24 hours, last 7 days ...”
● Timezones and the “rolling window” are what make this interesting
19. Rolling Time Window Details
● One row for every granularity you want to track
(e.g. day, hour)
● Row Key consists of the granularity, metric, user
and system
● Column Name is a “fixed” time bucket on UTC time
● Column Values are counts of the logins in that
bucket
● get_slice calls return multiple counters, which
are then summed up
20. Rolling Time Window Counter Model
user3:system5:logins:by_day
             20110107    ...    20110523
U3:S5:L:D    2           ...    7
(2 logins on Jan 7th 2011 for user 3 on system 5;
7 logins on May 23rd 2011 for user 3 on system 5)
user3:system5:logins:by_hour
             2011010710    ...    2011052316
U3:S5:L:H    1             ...    7
(one login for user 3 on system 5 on Jan 7th 2011 for the 10th hour;
2 logins for user 3 on system 5 on May 23rd 2011 for the 16th hour)
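A small sketch of how a login might be recorded under this model, with in-memory Counter objects standing in for Cassandra counter columns. The function name is hypothetical; the row key and bucket formats follow the slide.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta, timezone

# row key -> {UTC time bucket: count}; a stand-in for the counter rows.
counters = defaultdict(Counter)


def record_login(user, system, when):
    """Increment the day and hour buckets (named in UTC) for one login."""
    utc = when.astimezone(timezone.utc)
    base = f"{user}:{system}:logins"
    counters[f"{base}:by_day"][utc.strftime("%Y%m%d")] += 1
    counters[f"{base}:by_hour"][utc.strftime("%Y%m%d%H")] += 1
```

Each login costs one increment per granularity being tracked, and the bucket is always named in UTC regardless of the user's own timezone.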
21. Rolling Time Window Queries
● Time window is rolling and there are other
timezones besides UTC
● one get_slice for the “middle” counts
● one get_slice for the “left end”
● one get_slice for the “right end”
22. Example: logins for the past 7 days
● Determine date/time boundaries
● Determine UTC days that are wholly contained
within your boundaries to select and sum
● Select and sum counters for the remaining hours
on either side of the UTC days
● O(1) queries (3 in this case), can be requested
from C* in parallel
● NB: some timezones are annoying (e.g. 15- or
30-minute offsets); I try to ignore them
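The boundary calculation above can be sketched as follows. This assumes timezone-aware bounds that fall on whole UTC hours (the awkward offsets are ignored here too), and plain dicts of bucket to count in place of the get_slice results. All function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def hour_buckets(start, end):
    """Hour bucket names for every whole hour in [start, end)."""
    n = (end - start) // timedelta(hours=1)
    return [(start + timedelta(hours=i)).strftime("%Y%m%d%H") for i in range(n)]


def day_buckets(start, end):
    """Day bucket names for every whole day in [start, end)."""
    n = (end - start).days
    return [(start + timedelta(days=i)).strftime("%Y%m%d") for i in range(n)]


def plan_slices(start, end):
    """Split [start, end) into left-end hours, whole UTC days, right-end hours."""
    start = start.astimezone(timezone.utc)
    end = end.astimezone(timezone.utc)
    midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    first_day = midnight if start == midnight else midnight + timedelta(days=1)
    last_day = end.replace(hour=0, minute=0, second=0, microsecond=0)
    if first_day >= last_day:
        # No whole UTC day fits inside the window: hour buckets only.
        return hour_buckets(start, end), [], []
    return (hour_buckets(start, first_day),
            day_buckets(first_day, last_day),
            hour_buckets(last_day, end))


def rolling_count(by_day, by_hour, start, end):
    """Sum the three slices; by_day and by_hour map bucket name -> count."""
    left, middle, right = plan_slices(start, end)
    return (sum(by_hour.get(b, 0) for b in left)
            + sum(by_day.get(b, 0) for b in middle)
            + sum(by_hour.get(b, 0) for b in right))
```

For a 7 day window ending at 16:00 UTC this yields 8 hour buckets on the left, 6 whole days in the middle and 16 hour buckets on the right, which together cover all 168 hours.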
23. Alternatives?
(of course)
● If you're counting logins and each user doesn't log
in hundreds of times a day, just have one row per
user with a TimeUUID column name for the time the
login occurred
● Supports any timezone/range/granularity easily
● More expensive for large ranges (e.g. year)
regardless of granularity, so cache results (in C*)
lazily.
● NB: caching results for rolling windows is not usually
helpful (because, well, it's rolling and always changes)
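A minimal sketch of this alternative, assuming the user's row is available as a sorted list of timezone-aware datetimes (standing in for the TimeUUID column names). The function name is hypothetical.

```python
from bisect import bisect_left
from datetime import datetime, timedelta, timezone


def count_logins(login_times, start, end):
    """Count logins with start <= time < end.

    `login_times` is a sorted list of timezone-aware datetimes; the bounds
    may be given in any timezone.
    """
    return bisect_left(login_times, end) - bisect_left(login_times, start)
```

The window bounds can sit in any timezone and at any granularity because nothing is pre-bucketed, at the cost of reading one column per login in the range.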
24. Eventually Atomic
(use case #4)
● “When there are many to many or one to many relations involved how
to model that and also keep it atomic? for eg: one user can upload
many pictures and those pictures can somehow be related to other
users as well.”
● Attempting full ACID compliance in distributed systems is a bad idea
(and impossible in the general sense)
● However, consistency is important and can certainly be achieved in
C*
● Many approaches / alternatives
● I like the transaction log approach, especially in the context of C*
25. Transaction Logs
(in this context)
● Records what is going to be performed before it
is actually performed
● Performs the actions that need to be atomic (in
the indivisible sense, not the all at once sense)
● Marks that the actions were performed
26. In Cassandra
● Serialize all actions that need to be performed
in a single column – JSON, XML, YAML (yuck!),
cpickle, JSO, et cetera
● Row Key = randomly chosen C* node token
● Column Name = TimeUUID
● Perform actions
● Delete Column
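The happy path of these steps might look like the sketch below. Dicts stand in for the XACT_LOG column family and for the data being updated; the action format, function names and column family names in the usage are all assumptions made for illustration.

```python
import json
import uuid
from collections import defaultdict

xact_log = defaultdict(dict)   # row key (node token) -> {column name: payload}
data = {}                      # stand-in for the column families being updated


def apply_action(action):
    """Idempotent write: replaying it leaves the same final state."""
    data[(action["cf"], action["row"], action["column"])] = action["value"]


def run_transaction(actions, node_token):
    """Log the actions, perform them, then delete the log column."""
    column_name = uuid.uuid1()                               # TimeUUID column name
    xact_log[node_token][column_name] = json.dumps(actions)  # 1. record intent
    for action in actions:                                   # 2. perform actions
        apply_action(action)
    del xact_log[node_token][column_name]                    # 3. delete the column
```

If the process dies anywhere between steps 1 and 3, the serialized actions are still sitting in the log row and can be replayed later.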
27. Configuration Details
● Short GC_Grace on the XACT_LOG Column
Family (e.g. 1 hour)
● Write to XACT_LOG at CL.QUORUM or
CL.LOCAL_QUORUM for durability (if it fails
with an unavailable exception, pick a different
node token and/or node and try again; same
semantics as a traditional relational DB)
● 1M memtable ops, 1 hour memtable flush time
28. Failures
● Before insert into the XACT_LOG
● After insert, before actions
● After insert, in middle of actions
● After insert, after actions, before delete
● After insert, after actions, after delete
29. Recovery
● Each C* node has a cron job, offset from every
other node's job by some time period
● Each job runs the same code: multiget_slice for
all node tokens for all columns older than some
time period
● Any columns found need to be replayed in their
entirety and are deleted after replay (normally
there are no columns because normally things
are working normally)
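The recovery job could be sketched like this. It assumes the log is a dict of node token to columns, that column names are numeric timestamps (standing in for TimeUUIDs), and that each column holds a JSON list of actions; `recover` and its parameters are hypothetical names.

```python
import json


def recover(xact_log, apply_action, now, max_age):
    """Replay and delete every logged transaction older than max_age seconds.

    xact_log maps node token -> {logged_at: serialized list of actions}.
    Returns the number of transactions replayed.
    """
    replayed = 0
    for columns in xact_log.values():
        for logged_at in sorted(columns):
            if now - logged_at < max_age:
                continue                       # may still be in flight; skip
            for action in json.loads(columns[logged_at]):
                apply_action(action)           # replay in its entirety
            del columns[logged_at]             # delete only after the replay
            replayed += 1
    return replayed
```

Replaying the whole transaction is safe only because the individual writes are idempotent, so actions that already succeeded are simply written again.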
30. XACT_LOG Comments
● Idempotent writes are awesome (that's why this
works so well)
● Doesn't work so well for counters (they're not
idempotent)
● Clients must be able to deal with temporarily
inconsistent data (they have to do this anyway)
● Could use a reliable queuing service (e.g. SQS)
instead of polling – push to SQS first, then
XACT log.