1. Chord: A Scalable Peer-
to-peer Lookup Protocol
for Internet Applications
Ion Stoica, Robert Morris, David Liben-Nowell, David R.
Karger, M. Frans Kaashoek, Frank Dabek Hari Balakrishnan
2. About Chord
• Protocol and algorithm for distributed hash
table (DHT)
• Structured-decentralized P2P overlay
network
3. About Chord
• Supports just one operation: lookup(key)
• Given a key, it maps the key onto a node
• Returns the IP-address of the node
• Associate a value with each key and store
the pair on the resulting node:
• Distributed hash table
• Uses consistent hashing for this operation
4. Chord properties (1)
• Load balancing:
• spreads keys evenly over nodes
• Decentralized:
• fully distributed, all nodes equally
important
• does not take into account heterogeneity
of nodes
5. Chord properties (2)
• Scalability:
• lookup cost: O(log(N))
• Availability:
• automatically updates lookup tables as
nodes join, leave, fail
• Flexible naming:
• Flat key-space, no constraint on structure
of keys
6. Traditional hashing
• Given an object, assign it to a bin from a set
of N bins
• Each bin: roughly same amount of objects
• Problem:
• If N changes, all objects need to be
reassigned
7. Consistent hashing
• Evenly distributes K objects across N bins,
about K/N each
• When N changes:
• Not all objects need to be moved
• Only O(K/N) need to
• Uses a base hash function, such as SHA-1
8. Chord protocol
• Assign each node and key an m-bit
identifier using SHA-1:
• node: hash the IP-address
• key: hash the key
• Identifiers are ordered on an identifier circle
module 2m, the Chord ring:
• circle of numbers from 0 to 2m-1
9. Assigning keys to nodes
• Assigning key k to a node:
• first node whose identifier is equal to or
follows the identifier of key k in the
identifier space
• successor node of k: successor(k)
• first node clockwise from k on Chord ring
11. Minimal disruption
• Minimal disruption when node joins/leaves
• n joins:
• certain keys assigned to successor(n) are
now assigned to n
• n leaves:
• all keys assigned to n are now assigned
to successor(n)
12. Simple Chord lookup
• Simplest Chord lookup algorithm:
• Each node only needs to know its
successor on Chord ring
• Query for identifier id is passed around
Chord ring until successor(id) is found
• Result returned along reverse path
• Easy but not efficient: # messages O(N)
13. Simple Chord lookup
N1
lookup(K54)
/ ask node n to fi nd the successor of id N8
K54 N56
n.find successor(id) the successor
// ask node n to find
if// of id successor])
(id ∈ (n,
n.find_successor(id)
return successor; N51
N14
else return (n, successor])
if (id ∈
successor; N48
// forward the query around the circle
else
return successor.fiquery around the
// forward the nd successor(id);
// circle
return successor.find_successor(id); N21
N42
(a)
N38
N32
(b)
seudocode to fi nd the successor node of an identifi er id. Remote procedure calls and variable lookups
a query from node 8 for key 54, using the pseudocode in Figure 3(a).
14. Finger table
• Chord maintains additional routing info:
• not essential for correctness
• but allows acceleration of lookups
• Each node maintains finger table with m
entries:
• n.finger[i] = successor(n + 2i-1)
15. Finger table
le (but slow) pseudocode to fi nd the successor node of an identifi er id. Rem
e path taken by a query from node 8 for key 54, using the pseudocode in Figu
N1
Finger table
N8
+1 N8 + 1 N14
N8 + 2 N14 K54 N
+2 N8 + 4 N14
+4 N8 + 8 N21
N51 N51
+32 +8 N14 N8 +16 N32
N8 +32 N42
N48 +16 N48
N21
N42
N
N38
N32
16. Improved lookup
• Lookup for id now works as follows:
• If id falls between n and successor(n), successor(n) is
returned
• Otherwise, lookup is performed at node n’, which is the
node in the finger table of n that most immediately
precedes id
• Since each node has finger entries at power of two
intervals around the identifier circle, each node can
forward a query at least halfway along the remaining
distance between the node and the target identifier.
• Thus O(log(N)) nodes need to be contacted
17. N32
(b)
Chord protocol
successor node of an identifi er id. Remote procedure calls and variable lookups ar
for key 54, using the pseudocode in Figure 3(a).
1 N1
Finger table lookup(54)
// ask node n to find the successor
N8
// of id +1 N8 + 1 N14 N8
n.find_successor(id) N8 + 2 N14 K54 N56
if (id ∈ (n, successor]) N14
N8 + 4
+2
return successor; + 8 N21
+4 N8
else N51
+8 N14 N8 +16 N32 N14
n′ = closest_preceding_node(id);
N8 +32 N42
16 return n′.find_successor(id); N48
// search the local table for the
// highest predecessor of id
n.closest_preceding_node(id)
N21 N21
for i = m downto 1
if (finger[i] ∈ (n, id)) N42
return finger[i];
return n;
N38
N32
(a) (b)
18. Coping with joining/leaving
• Essential that successor pointers stay up to
date
• Each node periodically runs stabilization
protocol which updates Chord’s successor
pointers and finger tables
• For node n to join, it must know a node n’ of
the Chord ring
19. Coping with joining/leaving
• To join, n performs join()
• node n asks n’ to find_successor(n), which
n sets as its sucessor
• Each node periodically performs stabilize() to
learn about newly joined nodes
• n asks his successor s for s’ predecessor p
• decides whether p should be n’s successor
instead
20. Coping with joining/leaving
• stabilize() also notifies n’s successor s of n’s
existence
• s might change it’s predecessor to n
• Each node periodically calls fix_fingers()
• makes sure fingers are up to date
• is how new nodes acquire finger table
• is how old nodes add new nodes to their
tables
21. Coping with joining/leaving
// n′ thinks it might be our
//join a Chord ring containing node n′.
// predecessor.
n.join(n′)
n.notify(n′)
predecessor = nil;
if (predecessor is nil or
successor = n′.find_successor(n);
n′ ∈ (predecessor, n))
predecessor = n′;
// called periodically. verifies n’s
// immediate successor, and tells the
// successor about n.
// called periodically. refreshes
n.stabilize()
// finger table entries. next stores
x = successor.predecessor;
// the index of the next finger to fix.
if (x ∈ (n, successor))
n.fix_fingers()
successor = x;
next = next + 1 ;
successor.notify(n);
if (next > m)
next = 1;
finger[next] = find_successor(n+2next−1);
22. Coping with joining/leaving
N21 N21 N21 N21
successor(N21)
N26 N26 N26
K24 K24
N32 N32 N32 N32
K24 K24 K24 K30
K30 K30 K30
(a) (b) (c) (d)
lustrating the join operation. Node 26 joins the system between nodes 21 and 32. The arcs represent the successor relationsh
to node 32; (b) node 26 fi nds its successor (i.e., node 32) and points to it; (c) node 26 copies all keys less than 26 from node
ates the successor of node 21 to node 26.
most immediate predecessor of id. In addition, its r successors means that it can inform the hig
23. Coping with joining/leaving
• Lookup will only fail if successor pointers are
not up to date => retry after a pause
• As soon as successor pointers are correct,
lookup will work
• over time, finger tables will be up to date
and linear search is no longer needed
• Old finger tables can still be used and in
general lookup will remain O(log(N)) even if
not all finger tables are up to date
24. Successor list
• Lookup fails if a node doesn’t know its correct
successor
• can occur when nodes fail
• Solution: each node maintains a successor
list:
• contains the node’s first r successors
• if first successor does not respond, try the
rest of the list
25. Replication
• Successor list can also be used for replication
• application on top of Chord might store
replicas of data associated with a key on k
nodes from n’s successor list
• Chord can inform application when this
successor list changes, so application can
create new replicas
26. Chord in practice
• In practice, Chord ring will never be in stable
state:
• Joins and leaves occur continuously
• Interleaved by stabilization algorithm
• Not enough time to stabilize before new
change
• If stabilization is run at a certain rate:
• Chord ring will be continuously “almost
stable” and lookup will be fast and correct
27. Virtual nodes
• Chord load balancing can be improved
through virtual nodes:
• node ids do not uniformly cover entire id
space
• to make (# keys / node) more uniform,
associate keys with virtual nodes
• map r virtual nodes, with unrelated ids, to
each real node
• Tradeoff: each real node needs r times as
much space to store finger table
28. Network latency
• Chord path length is O(log(N)), but latency
can be quite large
• Nodes close in id space can be far away in
the underlying network
• Idea:
• choose next-hop finger based on progress
in id space and latency in network
• maximize progress in id space
• minimize latency
29. Network latency
• Different approach to improving latency:
• each entry in finger table can point to list of
nodes instead of a single node
• nodes have similar ids and are equivalent
for routing purposes
• latency of the nodes could be different
• use closest node in terms of network
distance in the underlying network
30. Applications of Chord
• Chord File System (CFS):
• Chord locates storage blocks
• Distributed indexes:
• solves Napster’s central server issue
• Large-scale combinatorial search:
• candidate solutions are keys, Chord maps these to
machines which test them
• Cooperative mirroring:
• Uses Chord to provide load balancing
• Time-shared storage:
• Uses Chord to provide availability