9. Client-server
All your wildest dreams will come true.
Client - server database architecture.
Obsolete.
The original “funnel-shaped” architecture
10. 3-tier
Client - application server - database server architecture.
Still suitable for small applications.
e.g. LAMP, RoR
11. 3-tier + caching
[Diagram labels: cache, master, slave, slave]
more complex
cache coherency is a hard problem
cascading failures are common
Next: Out: the funnel. In: The ring.
12. Webscale
outer ring: clients (cell phones, etc.)
middle ring: application servers
inside ring: Cassandra servers
Serving millions of clients with mere hundreds or thousands of nodes requires a different approach to applications!
15. Dynamo Paper (2007)
• How do we build a data store that is:
• Reliable
• Performant
• “Always On”
• Nothing new and shiny
Evolutionary. Real. Computer Science
Also the basis for Riak and Voldemort
20. Cassandra - Fully Replicated
• Client writes local
• Data syncs across WAN
• Replication per Data Center
21. Read-Modify-Write
UPDATE Employees
SET Rank=4, Promoted='2014-01-24'
WHERE EmployeeID=1337;
Before:
EmployeeID  1337
Name        アルトビー
StartDate   2013-05-01
Rank        3
Promoted    null
This might be what it looks like from SQL / CQL, but …
After:
EmployeeID  1337
Name        アルトビー
StartDate   2013-05-01
Rank        4
Promoted    2014-01-24
22. Read-Modify-Write
UPDATE Employees
SET Rank=4, Promoted='2014-01-24'
WHERE EmployeeID=1337;
Before:
EmployeeID  1337
Name        アルトビー
StartDate   2013-05-01
Rank        3
Promoted    null
After:
EmployeeID  1337
Name        アルトビー
StartDate   2013-05-01
Rank        4
Promoted    2014-01-24
RDBMS
TNSTAAFL
There is no such thing as a free lunch.
TNSTAAFL …
If you’re lucky, the cell is in cache.
Otherwise, it’s a disk access to read, another to write.
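A minimal sketch of that cost, assuming a toy in-place row store with a one-level cache. The class `ToyRowStore` and its counters are hypothetical and only count simulated disk accesses; no real RDBMS is modeled.

```python
class ToyRowStore:
    """Hypothetical in-place row store that counts simulated disk accesses."""

    def __init__(self, rows):
        self.disk = {key: dict(row) for key, row in rows.items()}
        self.cache = {}
        self.disk_reads = 0
        self.disk_writes = 0

    def _read(self, key):
        # A cache miss costs one simulated disk read.
        if key not in self.cache:
            self.disk_reads += 1
            self.cache[key] = dict(self.disk[key])
        return self.cache[key]

    def update(self, key, **changes):
        row = self._read(key)        # read
        row.update(changes)          # modify
        self.disk[key] = dict(row)   # write
        self.disk_writes += 1


store = ToyRowStore({1337: {"Rank": 3, "Promoted": None}})

# Cold cache: the update pays for a disk read and a disk write.
store.update(1337, Rank=4, Promoted="2014-01-24")
cold = (store.disk_reads, store.disk_writes)

# Warm cache: a second update only pays for the write.
store.update(1337, Rank=5)
warm = (store.disk_reads, store.disk_writes)
```

The second update uses a made-up value (`Rank=5`) purely to show the warm-cache path.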
23. Eventual Consistency
UPDATE Employees
SET Rank=4, Promoted='2014-01-24'
WHERE EmployeeID=1337;
Before:
EmployeeID  1337
Name        アルトビー
StartDate   2013-05-01
Rank        3
Promoted    null
After:
EmployeeID  1337
Name        アルトビー
StartDate   2013-05-01
Rank        4
Promoted    2014-01-24
Explain distributed RMW
More complicated.
Will talk about how it’s abstracted in CQL later.
Coordinator
24. Eventual Consistency
UPDATE Employees
SET Rank=4, Promoted='2014-01-24'
WHERE EmployeeID=1337;
Before:
EmployeeID  1337
Name        アルトビー
StartDate   2013-05-01
Rank        3
Promoted    null
After:
EmployeeID  1337
Name        アルトビー
StartDate   2013-05-01
Rank        4
Promoted    2014-01-24
Coordinator
read
write
Writes are replicated in memory to RF replicas; RF (replication factor) is usually 3.
Reads AND writes remain available through partitions.
Hinted handoff.
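A rough sketch of the coordinator and hinted handoff idea, assuming RF=3 and a single in-process "cluster". The `Replica` and `Coordinator` classes are hypothetical illustrations, not Cassandra internals: the coordinator writes to every replica that is up and keeps a hint for any replica that is down, then replays the hints once that replica returns.

```python
class Replica:
    """Hypothetical replica holding key -> (value, timestamp)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        # Keep the write unless a newer one is already stored.
        if current is None or timestamp >= current[1]:
            self.data[key] = (value, timestamp)


class Coordinator:
    """Hypothetical coordinator that stores hints for unreachable replicas."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.hints = []  # (replica, key, value, timestamp)

    def write(self, key, value, timestamp):
        acks = 0
        for replica in self.replicas:
            if replica.up:
                replica.write(key, value, timestamp)
                acks += 1
            else:
                self.hints.append((replica, key, value, timestamp))
        return acks

    def replay_hints(self):
        remaining = []
        for replica, key, value, timestamp in self.hints:
            if replica.up:
                replica.write(key, value, timestamp)
            else:
                remaining.append((replica, key, value, timestamp))
        self.hints = remaining


replicas = [Replica("a"), Replica("b"), Replica("c")]  # RF=3
coordinator = Coordinator(replicas)

replicas[2].up = False                       # one replica is partitioned away
acks = coordinator.write(1337, {"Rank": 4}, timestamp=1)
hints_while_down = len(coordinator.hints)

replicas[2].up = True                        # partition heals
coordinator.replay_hints()
```

The write still succeeds with two acknowledgements while one replica is unreachable, which is the availability property the slide describes. Consistency levels are left out of this sketch.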
29. Overwriting
CREATE TABLE host_lookup (
    name varchar,
    id   uuid,
    PRIMARY KEY(name)
);
INSERT INTO host_lookup (name, id) VALUES
    ('www.tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77);
INSERT INTO host_lookup (name, id) VALUES
    ('tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77);
INSERT INTO host_lookup (name, id) VALUES
    ('www.tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77);
SELECT id FROM host_lookup WHERE name='tobert.org';
Beware of expensive compaction
Best for: small indexes, lookup tables
Compaction handles RMW at storage level in the background.
Under heavy writes, clock synchronization is very important to avoid timestamp collisions. In practice, this isn’t a
problem very often and even when it goes wrong, not much harm done.
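To make the role of timestamps concrete, here is a simplified sketch of last-write-wins merging, the kind of reconciliation compaction performs in the background. The function `compact`, the dictionary layout, and the values `old-id` and `new-id` are assumptions for illustration; real SSTables and Cassandra's tie-breaking rules are more involved.

```python
def compact(sstables):
    """Merge cells from several tables, keeping the newest value per key.

    Each table maps key -> (value, write_timestamp). On an exact timestamp
    tie this sketch keeps the first value seen, which is a simplification.
    """
    merged = {}
    for table in sstables:
        for key, (value, timestamp) in table.items():
            if key not in merged or timestamp > merged[key][1]:
                merged[key] = (value, timestamp)
    return merged


# Two overwrites of the same row key land in different tables on disk.
older = {"www.tobert.org": ("old-id", 100), "tobert.org": ("old-id", 101)}
newer = {"www.tobert.org": ("new-id", 200)}

result = compact([older, newer])
same_result = compact([newer, older])  # input order does not matter here
```

If two writers with skewed clocks produce the same timestamp for different values, the merge can no longer tell which write was last. That is the collision the note above warns about.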
30. Key/Value
CREATE TABLE keyval (
key VARCHAR,
value blob,
PRIMARY KEY(key)
);
INSERT INTO keyval (key,value) VALUES (?, ?);
SELECT value FROM keyval WHERE key=?;
e.g. memcached
Don’t do this.
But it works when you really need it.
31. Journaling / Logging / Time-series
CREATE TABLE tsdb (
    time_bucket timestamp,
    time        timestamp,
    value       blob,
    PRIMARY KEY(time_bucket, time)
);
INSERT INTO tsdb (time_bucket, time, value) VALUES (
    '2014-10-24',                  -- 1-day bucket (UTC)
    '2014-10-24T12:12:12Z',        -- ALWAYS USE UTC
    textAsBlob('{"foo": "bar"}')   -- text must be converted to store it in a blob column
);
Oversimplified, use normalization over blobs whenever possible.
ALWAYS USE UTC :)
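One way the 1-day bucket could be computed on the client, sketched in Python. The helper name `day_bucket` is hypothetical; it converts a timezone-aware timestamp to UTC first and then truncates to midnight, so events recorded in different local zones land in the same bucket.

```python
from datetime import datetime, timedelta, timezone


def day_bucket(ts):
    """Return midnight UTC of the day containing ts (ts must be timezone-aware)."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    utc = ts.astimezone(timezone.utc)
    return utc.replace(hour=0, minute=0, second=0, microsecond=0)


event = datetime(2014, 10, 24, 12, 12, 12, tzinfo=timezone.utc)
bucket = day_bucket(event)

# 20:00 on the 23rd at UTC-8 is 04:00 on the 24th in UTC, so it shares the bucket.
local_event = datetime(2014, 10, 23, 20, 0, 0, tzinfo=timezone(timedelta(hours=-8)))
local_bucket = day_bucket(local_event)
```

Rejecting naive timestamps outright is a design choice in this sketch: guessing a zone is how buckets end up split across local midnights.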
32. Journaling / Logging / Time-series
time_bucket   time                    value
2014-01-24    2014-01-24T12:12:12Z    {"key": "value"}
2014-01-24    2014-01-24T21:21:21Z    {"key": "value"}
2014-01-25    2014-01-25T13:13:13Z    {"key": "value"}
{ "2014-01-24" => {
    "2014-01-24T12:12:12Z" => '{"foo": "bar"}'
  }
}
Oversimplified, use normalization over blobs whenever possible.
ALWAYS USE UTC :)
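The nested layout above can be mimicked with plain dictionaries. This is only a mental model, assuming ISO 8601 UTC strings (which sort correctly as text); the function `build_partitions` is hypothetical and says nothing about the actual on-disk format.

```python
from collections import defaultdict


def build_partitions(rows):
    """Group (time_bucket, time, value) rows into bucket -> {time: value}.

    Within each bucket the entries are ordered by time, mirroring how rows
    in a partition are ordered by the clustering column.
    """
    partitions = defaultdict(dict)
    for time_bucket, time, value in rows:
        partitions[time_bucket][time] = value
    return {
        bucket: dict(sorted(cells.items()))
        for bucket, cells in partitions.items()
    }


rows = [
    ("2014-01-24", "2014-01-24T21:21:21Z", '{"key": "value"}'),
    ("2014-01-25", "2014-01-25T13:13:13Z", '{"key": "value"}'),
    ("2014-01-24", "2014-01-24T12:12:12Z", '{"key": "value"}'),
]
partitions = build_partitions(rows)
```

Reading one day is then a single partition lookup, with the events already in time order.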
33. Cassandra Collections
CREATE TABLE posts (
    id      uuid,
    body    varchar,
    created timestamp,
    authors set<varchar>,
    tags    set<varchar>,
    PRIMARY KEY(id)
);
INSERT INTO posts (id, body, created, authors, tags) VALUES (
    ea4aba7d-9344-4d08-8ca5-873aa1214068,
    'アルトビーの犬はばかね',
    'now',
    {'アルトビー', 'ィオートビー'},
    {'dog', 'silly', '犬', 'ばか'}
);
quick story about 犬ばかね ("silly dog")
sets & maps are CRDTs, safe to modify
34. Cassandra Collections
CREATE TABLE metrics (
    bucket timestamp,
    time   timestamp,
    value  blob,
    labels map<varchar,varchar>,
    PRIMARY KEY(bucket)
);
sets & maps are CRDTs, safe to modify
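A toy illustration of why element-wise merging makes concurrent set updates safe. It assumes each element carries its own timestamp and a present/removed flag; the functions `merge_sets` and `live` and this representation are hypothetical and are not Cassandra's storage format.

```python
def merge_sets(a, b):
    """Merge two replicas of a set, element by element.

    Each replica maps element -> (timestamp, present). For every element the
    cell with the higher timestamp wins, so replicas converge regardless of
    the order in which they are merged.
    """
    merged = dict(a)
    for element, cell in b.items():
        if element not in merged or cell[0] > merged[element][0]:
            merged[element] = cell
    return merged


def live(cells):
    """Return the elements that are currently present."""
    return {element for element, (_, present) in cells.items() if present}


replica_1 = {"dog": (1, True), "silly": (2, True)}
# Another client added '犬' at t=3 and removed 'silly' at t=4.
replica_2 = {"dog": (1, True), "犬": (3, True), "silly": (4, False)}

merged_12 = merge_sets(replica_1, replica_2)
merged_21 = merge_sets(replica_2, replica_1)
```

Both merge orders give the same result, which is the property that lets two clients modify the same collection without reading it first.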
35. Lightweight Transactions
• Cassandra 2.0 and later support lightweight transactions (LWT) based on Paxos
• Paxos is a distributed consensus protocol
• Given a constraint, Cassandra ensures correct ordering
36. Lightweight Transactions
UPDATE users
SET username='tobert'
WHERE id=68021e8a-9eb0-436c-8cdd-aac629788383
IF username='renice';
INSERT INTO users (id, username)
VALUES (68021e8a-9eb0-436c-8cdd-aac629788383, 'renice')
IF NOT EXISTS;
Client error on conflict.
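The two statements above are conditional writes, which behave like compare-and-set. A single-process sketch of those semantics follows, with a lock standing in for Paxos. The class `ToyUsers` and the values `other` and `someone` are hypothetical; the sketch reports a failed condition by returning `False` and does not model how a real driver surfaces it.

```python
import threading


class ToyUsers:
    """Hypothetical users table with compare-and-set style writes."""

    def __init__(self):
        self.rows = {}
        self._lock = threading.Lock()  # stand-in for distributed consensus

    def insert_if_not_exists(self, user_id, username):
        """Like INSERT ... IF NOT EXISTS: apply only when the row is absent."""
        with self._lock:
            if user_id in self.rows:
                return False
            self.rows[user_id] = username
            return True

    def update_if(self, user_id, expected, new):
        """Like UPDATE ... IF username=expected: apply only on a match."""
        with self._lock:
            if self.rows.get(user_id) != expected:
                return False
            self.rows[user_id] = new
            return True


users = ToyUsers()
uid = "68021e8a-9eb0-436c-8cdd-aac629788383"

first = users.insert_if_not_exists(uid, "renice")    # applied
second = users.insert_if_not_exists(uid, "other")    # row exists, not applied
renamed = users.update_if(uid, "renice", "tobert")   # condition holds, applied
stale = users.update_if(uid, "renice", "someone")    # condition stale, not applied
```

The check and the write happen as one step, so a client acting on stale data is refused instead of silently overwriting a newer value.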
51. Ending discussion notes
• 2 socket, ECC memory
• 16GiB minimum, prefer 32-64GiB, over 128GiB and Linux will need serious tuning
• SSD where possible, Samsung 840 Pro is a good choice, any Intel is fine
• NO SAN/NAS, 20ms latency tops
• if you MUST use SAN/NAS (and please, don’t), dedicate spindles to C* nodes and use a separate network
• Avoid disk configurations targeted at Hadoop, disks are too slow
• http://www.datastax.com/documentation/cassandra/2.0/pdf/cassandra20.pdf
• read the sections on Repair, Tombstones & Snapshots