18. Consistency, Availability
Consistency
– Can I read stale data?
Availability
– Can I write/read at all?
Tunable Consistency
19. Consistency
N = Total number of replicas
R = Number of replicas read from
– (before the response is returned)
W = Number of replicas written to
– (before the write is considered a success)
20. Consistency
W + R > N gives strong consistency
21. Consistency
W + R > N gives strong consistency
N=3
W=2
R=2
2 + 2 > 3 ==> strongly consistent
22. Consistency
W + R > N gives strong consistency
Only 2 of the 3 replicas must be available.
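The overlap argument behind W + R > N can be sketched in a few lines of Python (an illustrative model, not Cassandra code): each replica holds a (value, timestamp) pair, and any R-replica read set must intersect the W-replica write set, so the freshest write is always seen.

```python
# Toy model: replicas store (value, timestamp); reads return the
# newest value among the replicas contacted.

def write(replicas, indices, value, ts):
    """Apply a write to the chosen replica indices (W of them)."""
    for i in indices:
        replicas[i] = (value, ts)

def read(replicas, indices):
    """Read from the chosen replicas (R of them), keep the newest."""
    return max((replicas[i] for i in indices), key=lambda vt: vt[1])[0]

N = 3
replicas = [("old", 0)] * N
write(replicas, indices=[0, 1], value="new", ts=1)  # W = 2

# With R = 2, every possible read set overlaps the write set {0, 1},
# so each read sees "new":
assert read(replicas, [0, 1]) == "new"
assert read(replicas, [1, 2]) == "new"
assert read(replicas, [0, 2]) == "new"
```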
23. Consistency
Tunable Consistency
– Specify N (Replication Factor) per data set
– Specify R, W per operation
24. Consistency
Tunable Consistency
– Quorum: N/2 + 1
• R = W = Quorum
• Strong consistency
• Tolerate the loss of N – Quorum replicas
– R, W can also be 1 or N
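The quorum arithmetic above can be checked directly (a minimal sketch; `N/2` is integer division here):

```python
# Quorum math from the slide, for N = 3:
N = 3
quorum = N // 2 + 1            # 2

assert quorum + quorum > N     # R = W = quorum => strong consistency
assert N - quorum == 1         # can lose 1 replica and still reach quorum
```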
25. Availability
Can tolerate the loss of:
– N – R replicas for reads
– N – W replicas for writes
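These loss tolerances follow directly from the definitions; a small sketch (function names are illustrative):

```python
def read_tolerance(N, R):
    """Replicas that can fail while reads at level R still succeed."""
    return N - R

def write_tolerance(N, W):
    """Replicas that can fail while writes at level W still succeed."""
    return N - W

# Example: N=3, R=2, W=2 (the quorum settings from earlier slides)
assert read_tolerance(3, 2) == 1
assert write_tolerance(3, 2) == 1

# Example: R=1 maximizes read availability, at the cost of consistency
assert read_tolerance(3, 1) == 2
```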
26. CAP Theorem
During node or network failure:
[Figure: trade-off curve of Availability (y-axis) vs. Consistency (x-axis). 100% of both at once is not possible; points below the curve are possible.]
27. CAP Theorem
During node or network failure:
[Figure: the same Availability vs. Consistency trade-off curve, with Cassandra drawn along the curve: tunable consistency lets you choose where on the curve to sit.]
28. Clustering
No single point of failure
Replication that works
Scales linearly
– 2x nodes = 2x performance
• For both reads and writes
– Up to 100's of nodes
– See “Netflix: 1 million writes/sec on AWS”
Operationally simple
Multi-Datacenter Replication
29. Data Model
Comes from Google BigTable
Goals
– Commodity Hardware
• Spinning disks
– Handle data sets much larger than memory
• Minimize disk seeks
– High throughput
– Low latency
– Durable
30. Column Families
Static
– Object data
– Similar to a table in a relational database
Dynamic
– Precomputed query results
– Materialized views
(these are just educational classifications)
32. Dynamic Column Families
Rows
– Each row has a unique primary key
– Sorted list of (name, value) tuples
• Like an ordered hash
– The (name, value) tuple is called a “column”
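A row of this kind can be modeled as a list kept sorted by column name (an illustrative Python sketch; `Row` and its methods are made-up names, not a Cassandra API):

```python
import bisect

class Row:
    """A row: a sorted list of (name, value) columns, like an ordered hash."""

    def __init__(self):
        self.names = []    # column names, kept sorted
        self.values = []   # values, parallel to names

    def insert(self, name, value):
        i = bisect.bisect_left(self.names, name)
        if i < len(self.names) and self.names[i] == name:
            self.values[i] = value            # same name: overwrite
        else:
            self.names.insert(i, name)        # new column, in sorted position
            self.values.insert(i, value)

    def columns(self):
        return list(zip(self.names, self.values))

row = Row()
for name, value in [(3, "c"), (1, "a"), (2, "b")]:
    row.insert(name, value)

# Columns come back sorted by name regardless of insertion order:
assert row.columns() == [(1, "a"), (2, "b"), (3, "c")]
```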
34. Dynamic Column Families
Other Examples:
– Timeline of tweets by a user
– Timeline of tweets by all of the people a user is following
– List of comments sorted by score
– List of friends grouped by state
35. The Data API
RPC-based API
– github.com/twitter/cassandra
CQL (Cassandra Query Language)
– code.google.com/a/apache-extras.org/p/cassandra-ruby/
36. Inserting Data
INSERT INTO users (KEY, 'name', 'age')
VALUES ('thobbs', 'Tyler', 24);
37. Updating Data
Updates are the same as inserts:
INSERT INTO users (KEY, 'age')
VALUES ('thobbs', 34);
Or
UPDATE users SET 'age' = 34
WHERE KEY = 'thobbs';
39. Fetching Data
Explicit column select:
SELECT 'name', 'age' FROM users
WHERE KEY = 'thobbs';
40. Fetching Data
Get a slice of columns
UPDATE letters SET 1='a', 2='b', 3='c', 4='d', 5='e'
WHERE KEY = 'key';
SELECT 1..3 FROM letters WHERE KEY = 'key';
Returns [(1, a), (2, b), (3, c)]
41. Fetching Data
Get a slice of columns
SELECT FIRST 2 FROM letters WHERE KEY = 'key';
Returns [(1, a), (2, b)]
SELECT FIRST 2 REVERSED FROM letters
WHERE KEY = 'key';
Returns [(5, e), (4, d)]
42. Fetching Data
Get a slice of columns
SELECT 3..'' FROM letters WHERE KEY = 'key';
Returns [(3, c), (4, d), (5, e)]
SELECT FIRST 2 REVERSED 4..'' FROM letters
WHERE KEY = 'key';
Returns [(4, d), (3, c)]
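The slice semantics shown on the last few slides can be modeled over a row stored as a sorted list of (name, value) columns. This is an illustrative sketch; `slice_columns`, `first`, and `reversed_` are made-up names mirroring the FIRST n and REVERSED options, not a real API:

```python
def slice_columns(columns, start=None, end=None, first=None, reversed_=False):
    """Return a slice of a sorted (name, value) column list.

    With reversed_=True, iteration runs right-to-left and `start` is
    where the reversed scan begins (so it acts as an upper bound).
    """
    cols = list(reversed(columns)) if reversed_ else list(columns)

    def in_range(n):
        if reversed_:
            return (start is None or n <= start) and (end is None or n >= end)
        return (start is None or n >= start) and (end is None or n <= end)

    result = [(n, v) for n, v in cols if in_range(n)]
    if first is not None:
        result = result[:first]          # FIRST n: cap the column count
    return result

letters = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e')]

# SELECT 1..3                       -> [(1, a), (2, b), (3, c)]
assert slice_columns(letters, start=1, end=3) == [(1, 'a'), (2, 'b'), (3, 'c')]
# SELECT FIRST 2                    -> [(1, a), (2, b)]
assert slice_columns(letters, first=2) == [(1, 'a'), (2, 'b')]
# SELECT FIRST 2 REVERSED           -> [(5, e), (4, d)]
assert slice_columns(letters, first=2, reversed_=True) == [(5, 'e'), (4, 'd')]
# SELECT 3..''                      -> [(3, c), (4, d), (5, e)]
assert slice_columns(letters, start=3) == [(3, 'c'), (4, 'd'), (5, 'e')]
# SELECT FIRST 2 REVERSED 4..''     -> [(4, d), (3, c)]
assert slice_columns(letters, start=4, first=2, reversed_=True) == [(4, 'd'), (3, 'c')]
```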
43. Deleting Data
Delete a whole row:
DELETE FROM users WHERE KEY = 'thobbs';
Delete specific columns:
DELETE 'age' FROM users
WHERE KEY = 'thobbs';
44. Secondary Indexes
Built-in basic indexes
CREATE INDEX ageIndex ON users (age);
SELECT name FROM users
WHERE age = 24 AND state = 'TX';
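Structurally, a secondary index is a reverse mapping from an indexed value to the row keys holding it; a query like the one above looks up candidates through the index and filters them on the other column. An illustrative Python sketch (the data and names are invented for the example):

```python
# Primary rows, keyed by row key:
users = {
    "thobbs": {"name": "Tyler", "age": 24, "state": "TX"},
    "jsmith": {"name": "Jane",  "age": 24, "state": "CA"},
    "bdoe":   {"name": "Bob",   "age": 30, "state": "TX"},
}

# The index on "age" (what CREATE INDEX maintains incrementally):
age_index = {}
for key, row in users.items():
    age_index.setdefault(row["age"], set()).add(key)

# SELECT name FROM users WHERE age = 24 AND state = 'TX';
# -> index lookup on age, then filter candidates on state:
names = [users[k]["name"] for k in age_index.get(24, ())
         if users[k]["state"] == "TX"]
assert names == ["Tyler"]
```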
45. Performance
Writes
– 10k – 30k per second per node
– Sub-millisecond latency
Reads
– 1k – 20k per second per node (depends on data set and caching)
– 0.1 to 10ms latency
46. Other Features
Distributed Counters
– Can support millions of high-volume counters
Excellent Multi-datacenter Support
– Disaster recovery
– Locality
Hadoop Integration
– Isolation of resources
– Hive and Pig drivers
Compression
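One way distributed counters can stay cheap is to keep one shard per node: each node increments only its own shard without coordination, and the total is the sum of all shards, so increments commute across replicas. A simplified model (class and method names are illustrative, not Cassandra's implementation):

```python
class ShardedCounter:
    """A counter split into per-node shards; value() sums the shards."""

    def __init__(self, node_ids):
        self.shards = {node: 0 for node in node_ids}

    def increment(self, node, delta=1):
        self.shards[node] += delta   # local update, no coordination

    def value(self):
        return sum(self.shards.values())

counter = ShardedCounter(["node1", "node2", "node3"])
counter.increment("node1", 5)
counter.increment("node2", 3)
counter.increment("node1", 1)
assert counter.value() == 9
```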
47. What Cassandra Can't Do
Transactions
– Unless you use a distributed lock
– Atomicity, Isolation
– These aren't needed as often as you'd think
Limited support for ad-hoc queries
– Know what you want to do with the data
49. Problems you shouldn't solve with C*
Prototyping
Distributed Locking
Small datasets
– (When you don't need availability)
Complex graph processing
– Shallow graph queries work well, though
Fundamentally highly relational/transactional data
50. The sweet spot for Cassandra
Large dataset, low latency queries
Simple to medium complexity queries
– Key/value
– Time series, ordered data
– Lists, sets, maps
High Availability
51. The sweet spot for Cassandra
Social
– Texts, comments, check-ins, collaboration
Activity
– Feeds, timelines, clickstreams, logs, sensor data
Metrics
– Performance data over time
– CloudKick, DataStax OpsCenter
Text Search
– Inbox search at Facebook
52. ORMs
Poor integration
ORMs are not a natural fit for Cassandra
– In C*, we mainly care about queries, not objects
– Beyond simple key/value access, the abstraction breaks down
Suggestion: don't waste time with an ORM
– C* will only be used for a specific subset of your data/queries
– Use the C* API directly in your model