3. DHT 101
partitioning
[diagram: the keyspace drawn as a ring, running from A around to Z]
The keyspace, a namespace encompassing all possible keys
4. DHT 101
partitioning
[diagram: the ring divided into partitions owned by nodes A, B, C, ..., Y, Z]
The namespace is divided into N partitions (where N is the number of nodes). Partitions are
mapped to nodes and placed evenly throughout the namespace.
5. DHT 101
partitioning
[diagram: key "Aaa" placed on the ring, assigned to the next node clockwise]
A record, stored by key, is positioned on the next node (working clockwise) from where it
sorts in the namespace
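To make the clockwise placement concrete, here is a minimal Python sketch (an illustration only, not Cassandra's partitioner; the node names and the MD5-derived token function are assumptions):

import bisect
import hashlib

def token(value):
    # Hash a node name or key onto a hypothetical 64-bit ring (illustration only).
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

nodes = ["A", "B", "C", "Y", "Z"]               # hypothetical cluster members
ring = sorted((token(n), n) for n in nodes)     # node positions around the ring
tokens = [t for t, _ in ring]

def owner(key):
    # The record lands on the next node clockwise from where its token sorts.
    i = bisect.bisect_right(tokens, token(key)) % len(ring)
    return ring[i][1]

print(owner("Aaa"))   # which node owns the key depends on where the hashes fall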
6. DHT 101
replica placement
[diagram: key "Aaa" with additional replicas on the nodes that follow]
Additional copies (replicas) are stored on other nodes, commonly the next N-1 nodes clockwise (N here being the replication factor), but anything deterministic will work.
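Continuing the sketch above, "next N-1 nodes" placement can be expressed as walking the ring for the next distinct nodes. This is an illustration under the same assumed node names, not Cassandra's actual replication strategy:

import bisect
import hashlib

def token(value):
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

ring = sorted((token(n), n) for n in ["A", "B", "C", "Y", "Z"])
tokens = [t for t, _ in ring]

def replicas(key, rf=3):
    # Primary owner plus the next rf-1 distinct nodes, walking clockwise.
    start = bisect.bisect_right(tokens, token(key))
    picked = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in picked:
            picked.append(node)
        if len(picked) == rf:
            break
    return picked

print(replicas("Aaa"))   # the owner and its two clockwise successors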
7. DHT 101
consistency
Consistency
Availability
Partition tolerance
With multiple copies comes a set of trade-offs, commonly articulated using the CAP theorem: at any given point, we can only guarantee two of Consistency, Availability, and Partition tolerance.
8. DHT 101
scenario: consistency level = one
[diagram: write to the replica set at ONE; one replica acknowledges (W), the other two are unknown (?)]
Writing at consistency level ONE provides very high availability: only one of the three replica nodes needs to be up for the write to succeed.
9. DHT 101
scenario: consistency level = all
[diagram: read from the replica set at ALL]
If strong consistency is required, reads at consistency ALL can be paired with writes performed at ONE. The trade-off is availability: all three replica nodes must be up, or the read fails.
10. DHT 101
scenario: quorum write
[diagram: quorum write to the replica set; R + W > N]
Using QUORUM consistency, only floor(N/2) + 1 replicas need to respond.
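The arithmetic behind these scenarios (plain Python, nothing Cassandra-specific; N here is the replication factor):

def quorum(n):
    # floor(N/2) + 1
    return n // 2 + 1

N = 3
scenarios = [
    ("write ONE / read ONE", 1, 1),
    ("write ONE / read ALL", 1, N),
    ("write QUORUM / read QUORUM", quorum(N), quorum(N)),
]
for label, w, r in scenarios:
    # R + W > N means every read set overlaps every write set.
    print(f"{label}: W={w}, R={r}, strongly consistent: {r + w > N}")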
11. DHT 101
scenario: quorum read
[diagram: quorum read from the replica set; R + W > N]
Using QUORUM consistency, only floor(N/2) + 1 replicas need to respond.
14. Problem:
Poor load distribution
15. Distributing Load
[diagram: ring of nodes A, B, C, M, ..., Y, Z]
B and C hold replicas of A
16. Distributing Load
A and B hold replicas of Z
17. Distributing Load
Z and A hold replicas of Y
18. Distributing Load
Disaster strikes!
19. Distributing Load
Replica sets [Y,Z,A], [Z,A,B], and [A,B,C] all suffer the loss of A, which puts extra load on the neighboring nodes.
20. Distributing Load
[diagram: replacement node A1 joins the ring where A was]
Solution: Replace/repair down node
21. Distributing Load
Solution: Replace/repair down node
22. Distributing Load
Neighboring nodes must stream the missing data to the replacement, which puts even more load on them.
23. Problem:
Poor data distribution
24. Distributing Data
[diagram: ring evenly divided among nodes A, B, C, D]
Ideal distribution of keyspace
25. Distributing Data
Bootstrapping a new node bisects one partition; the distribution is no longer ideal.
26. Distributing Data
Moving existing nodes means moving the corresponding data; not ideal.
27. Distributing Data
Moving existing nodes means moving the corresponding data; not ideal.
28. Distributing Data
[diagram: cluster doubled to nodes A through H, bisecting every range]
Frequently cited alternative: Double the size of your cluster, bisecting all ranges
29. Distributing Data
Frequently cited alternative: Double the size of your cluster, bisecting all ranges
31. In a nutshell...
[diagram: many virtual nodes on the ring, mapped to a handful of physical hosts]
Basically: “nodes” on the ring are virtual, and many of them are mapped to each “real” node
(host)
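A minimal sketch of that idea (illustration only; the host names, the token count, and the random 64-bit token space are assumptions):

import random

hosts = ["host1", "host2", "host3"]   # hypothetical physical hosts
tokens_per_host = 8                   # kept small here; Cassandra's default is 256

# Each random token is a virtual node; many map back to the same physical host.
ring = sorted(
    (random.getrandbits(64), host)
    for host in hosts
    for _ in range(tokens_per_host)
)

for tok, host in ring[:6]:
    print(f"{tok:20d}  ->  {host}")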
32. Benefits
• Operationally simpler (no token management)
• Better distribution of load
• Concurrent streaming involving all hosts
• Smaller partitions mean greater reliability
• Supports heterogeneous hardware
33. Strategies
• Automatic sharding
• Fixed partition assignment
• Random token assignment
34. Strategy
Automatic Sharding
• Partitions are split when data exceeds a threshold
• Newly created partitions are relocated to a host with lower data load
• Similar to sharding performed by Bigtable, or Mongo auto-sharding
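A hedged sketch of the split-on-threshold idea (in the spirit of Bigtable/Mongo splits, not their actual code; all names, sizes, and the relocation policy here are hypothetical):

from dataclasses import dataclass

@dataclass
class Partition:
    lo: int          # range start
    hi: int          # range end
    size: int        # data held, in arbitrary units
    host: str

THRESHOLD = 100      # hypothetical split threshold

def maybe_split(part, host_load):
    # Split a partition that has grown past the threshold and hand one half
    # to the host currently carrying the least data.
    if part.size <= THRESHOLD:
        return [part]
    mid = (part.lo + part.hi) // 2
    target = min(host_load, key=host_load.get)
    left = Partition(part.lo, mid, part.size // 2, part.host)
    right = Partition(mid, part.hi, part.size - part.size // 2, target)
    host_load[part.host] -= right.size
    host_load[target] += right.size
    return [left, right]

load = {"host1": 150, "host2": 40}
print(maybe_split(Partition(0, 1000, 150, "host1"), load))
print(load)   # the split-off partition's data has moved to the lighter host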
35. Strategy
Fixed Partition Assignment
• Namespace divided into Q evenly-sized partitions
• Q/N partitions assigned per host (where N is the number of hosts)
• Joining hosts “steal” partitions evenly from existing hosts
• Used by Dynamo and Voldemort (described in Dynamo paper as “strategy 3”)
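A sketch of the fixed-partition scheme described above (an illustration of the "strategy 3" idea, not Voldemort's code; Q, the host names, and the stealing policy are assumptions):

from collections import defaultdict
from itertools import cycle

Q = 12   # number of partitions, fixed for the life of the cluster
assignment = {p: h for p, h in zip(range(Q), cycle(["host1", "host2", "host3"]))}

def join(new_host, assignment):
    # A joining host steals roughly Q/N partitions, taken evenly from the
    # hosts that currently own the most.
    hosts = set(assignment.values()) | {new_host}
    want = Q // len(hosts)
    owned = defaultdict(list)
    for p, h in assignment.items():
        owned[h].append(p)
    donors = cycle(sorted(owned, key=lambda h: -len(owned[h])))
    for _ in range(want):
        donor = next(donors)
        assignment[owned[donor].pop()] = new_host

join("host4", assignment)
counts = defaultdict(int)
for h in assignment.values():
    counts[h] += 1
print(dict(counts))   # roughly Q/4 partitions per host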
36. Strategy
Random Token Assignment
• Each host assigned T random tokens
• T random tokens generated for joining hosts; new tokens divide existing ranges
• Similar to libketama; identical to classic Cassandra when T=1
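A sketch of what a join looks like under random token assignment (illustration only; the token values are random 64-bit integers and T is an assumption):

import random

T = 4                                                             # tokens per joining host (assumption)
existing = sorted(random.getrandbits(64) for _ in range(3 * T))   # tokens already on the ring
new_tokens = sorted(random.getrandbits(64) for _ in range(T))     # tokens for the joining host

for t in new_tokens:
    # Each new token lands inside some existing range and bisects it:
    # the joining host takes over the portion ending at the new token.
    successor = next((e for e in existing if e > t), existing[0])
    print(f"new token {t} splits the range that ends at {successor}")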
37. Considerations
1. Number of partitions
2. Partition size
3. How 1 changes with more nodes and data
4. How 2 changes with more nodes and data
38. Evaluating
Strategy        No. partitions   Partition size
Random          O(N)             O(B/N)
Fixed           O(1)             O(B)
Auto-sharding   O(B)             O(1)
(B ~ total data size, N ~ number of hosts)
39. Evaluating
• Automatic sharding
  • Partition size constant (great)
  • Number of partitions scales linearly with data size (bad)
• Fixed partition assignment
• Random token assignment
40. Evaluating
• Automatic sharding
• Fixed partition assignment
  • Number of partitions is constant (good)
  • Partition size scales linearly with data size (bad)
  • Higher operational complexity (bad)
• Random token assignment
41. Evaluating
• Automatic sharding
• Fixed partition assignment
• Random token assignment
  • Number of partitions scales linearly with number of hosts (ok)
  • Partition size increases with more data; decreases with more hosts (good)
42. Evaluating
• Automatic sharding
• Fixed partition assignment
• Random token assignment
44. Configuration
conf/cassandra.yaml
# Comma separated list of tokens,
# (new installs only).
initial_token: <token>,<token>,<token>
or
# Number of tokens to generate.
num_tokens: 256
Two parameters control how tokens are assigned. The initial_token parameter now optionally accepts a comma-separated list, or (preferably) you can assign a numeric value to num_tokens.
45. Configuration
nodetool info
Token : (invoke with -T/--tokens to see all 256 tokens)
ID : 64090651-6034-41d5-bfc6-ddd24957f164
Gossip active : true
Thrift active : true
Load : 92.69 KB
Generation No : 1351030018
Uptime (seconds): 45
Heap Memory (MB): 95.16 / 1956.00
Data Center : datacenter1
Rack : rack1
Exceptions : 0
Key Cache : size 240 (bytes), capacity 101711872 (bytes ...
Row Cache : size 0 (bytes), capacity 0 (bytes), 0 hits, ...
To keep the output readable, nodetool info no longer displays tokens (if there is more than one) unless the -T/--tokens argument is passed.
46. Configuration
nodetool ring
Datacenter: datacenter1
==========
Replicas: 2
Address Rack Status State Load Owns Token
9022770486425350384
127.0.0.1 rack1 Up Normal 97.24 KB 66.03% -9182469192098976078
127.0.0.1 rack1 Up Normal 97.24 KB 66.03% -9054823614314102214
127.0.0.1 rack1 Up Normal 97.24 KB 66.03% -8970752544645156769
127.0.0.1 rack1 Up Normal 97.24 KB 66.03% -8927190060345427739
127.0.0.1 rack1 Up Normal 97.24 KB 66.03% -8880475677109843259
127.0.0.1 rack1 Up Normal 97.24 KB 66.03% -8817876497520861779
127.0.0.1 rack1 Up Normal 97.24 KB 66.03% -8810512134942064901
127.0.0.1 rack1 Up Normal 97.24 KB 66.03% -8661764562509480261
127.0.0.1 rack1 Up Normal 97.24 KB 66.03% -8641550925069186492
127.0.0.1 rack1 Up Normal 97.24 KB 66.03% -8636224350654790732
...
...
nodetool ring is still there, but the output is significantly more verbose, and it is less useful as the go-to command.
47. Configuration
nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.0.0.1 97.2 KB 256 66.0% 64090651-6034-41d5-bfc6-ddd24957f164 rack1
UN 10.0.0.2 92.7 KB 256 66.2% b3c3b03c-9202-4e7b-811a-9de89656ec4c rack1
UN 10.0.0.3 92.6 KB 256 67.7% e4eef159-cb77-4627-84c4-14efbc868082 rack1
New go-to command is nodetool status
48. Configuration
nodetool status
(same nodetool status output as on the previous slide)
Of note: since it is no longer practical to name a host by its token (because it can have many), each host has a unique ID.
49. Configuration
nodetool status
(same nodetool status output as above)
Note the per-node token count in the Tokens column.
51. Migration
edit conf/cassandra.yaml and restart
# Number of tokens to generate.
num_tokens: 256
Step 1: Set num_tokens in cassandra.yaml, and restart the node.
52. Migration
convert to T contiguous tokens in existing ranges
[diagram: node A's single contiguous range now appears as many contiguous tokens, all still owned by A]
This will cause the existing range to be split into T contiguous tokens. This results in no
change to placement
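Conceptually, the conversion step does something like the following (a sketch with a made-up range; the real token values come from the partitioner):

# A host that owned one contiguous range now presents it as T contiguous
# sub-ranges; their union is unchanged, so no data moves yet.
range_start, range_end = 0, 1_000_000   # hypothetical range owned by one host
T = 8

width = (range_end - range_start) // T
sub_tokens = [range_start + width * (i + 1) for i in range(T)]
sub_tokens[-1] = range_end              # the last token closes out the original range

print(sub_tokens)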
53. Migration
shuffle
Step 2: Initialize a shuffle operation. Nodes randomly exchange ranges.
54. Shuffle
• Range transfers are queued on each host
• Hosts initiate transfer of ranges to self
• Pay attention to the logs!
55. Shuffle
bin/shuffle
Usage: shuffle [options] <sub-command>
Sub-commands:
create Initialize a new shuffle operation
ls List pending relocations
clear Clear pending relocations
en[able] Enable shuffling
dis[able] Disable shuffling
Options:
-dc, --only-dc Apply only to named DC (create only)
-tp, --thrift-port Thrift port number (Default: 9160)
-p, --port JMX port number (Default: 7199)
-tf, --thrift-framed Enable framed transport for Thrift (Default: false)
-en, --and-enable Immediately enable shuffling (create only)
-H, --help Print help information
-h, --host JMX hostname or IP address (Default: localhost)
-th, --thrift-host Thrift hostname or IP address (Default: JMX host)