2. What is Gossiping?
• Spread of information in a random manner
• Some examples:
– Human gossiping
– Epidemic diseases
– Physical phenomena: wildfire, diffusion, etc.
– Computer viruses and worms
3. Gossiping in Computer Science
• Term first coined by Demers et al (1987)
• Some applications of gossip protocols
– Peer Sampling
– Data Aggregation
– Clustering
– Information Dissemination (Multicast, Pub/Sub)
– Overlay/topology maintenance
– Failure detection?
5. Today’s Focus
• Theoretical angle for Gossip-based protocols
[Allavena et al PODC 2005]
– Probability of partitioning
– Time till partitioning
– Bounds on in-degree
– Essential elements of gossiping
– Simulation results
• Cyclon [Voulgaris et al]
• Scamp [Ganesh et al]
• NewsCast [Jelasity et al]
6. Membership Service
• Full Membership
– Complete knowledge at each node
– Random subset used for gossiping
– Not scalable
– Hard to maintain
• Partial Membership
– Random subset at each node
– Gossip partners chosen from local view
7. View Selection
[Diagram: node u collects the views of its neighbours (e.g. s with view {s, p, r} and t with view {t, q, r}) into a list L1, and the nodes that requested u's view (e.g. v) into a list L2; u's new view is then sampled from L1 and L2, with L2 weighted by w.]
8. Essential Elements of Gossiping
• Mixing: Construct a list L1 consisting of the local views of the nodes in node u's local view
– Guarantees non-partitioning
– “Pull” based
• Reinforcement: Construct a list L2 consisting of the nodes that requested the local view of u
– Balances the network
– Removes old, possibly dead edges and adds new edges
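To make the two elements concrete, here is a minimal Python sketch of one view update at node u, assuming a weight parameter w for the reinforcement list and simple list-based views; the names and details are illustrative, not the exact protocol of [Allavena et al PODC 2005]:

import random

def gossip_round(pulled_views, requesters, k, w=0.1):
    """One view update at node u (sketch).
    pulled_views -- views received from the nodes u pulled ("mixing")
    requesters   -- nodes that pulled u's view this round ("reinforcement")
    k            -- target view size; w -- weight of the reinforcement list (assumed parameter)
    """
    L1 = [n for view in pulled_views for n in view]   # mixing: concatenation of neighbours' views
    L2 = list(requesters)                             # reinforcement: who asked for u's view

    new_view = []
    while len(new_view) < k and (L1 or L2):
        # pick from L2 with probability w, otherwise from L1
        src = L2 if (L2 and (not L1 or random.random() < w)) else L1
        cand = src.pop(random.randrange(len(src)))
        if cand not in new_view:                      # keep the view duplicate-free
            new_view.append(cand)
    return new_view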
9. Partitioning and Size Estimate
• A and B partition iff x = 1 and y = 0 (x and y are defined on the next slide)
• Partitioning is least likely when x = y
• Goal of the protocol is to maintain this balance
10. Size Estimates
• Idea:
– x is the estimate of the size of A made by nodes in A
– y is the estimate of the size of A made by nodes in B
– Assuming edges were drawn uniformly at random, the expected value of both x and y is |A|/n
• Mixing:
– Agreeing on the estimates x and y ensures no partition (even if x and y are not accurate)
• Reinforcement:
– Brings the estimates x and y to the correct value
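As one concrete reading of these definitions, here is a small sketch (the helper and its arguments are assumptions, not from the paper) that computes x and y from the local views of a partition (A, B):

def size_estimates(views, A):
    """x = average, over nodes in A, of the fraction of their view entries lying in A;
       y = the same fraction averaged over the remaining nodes (the set B).
       views: dict mapping each node to its local view (list of neighbours)."""
    A = set(A)
    B = set(views) - A

    def frac_in_A(node):
        view = views[node]
        return sum(1 for v in view if v in A) / len(view)

    x = sum(frac_in_A(n) for n in A) / len(A)
    y = sum(frac_in_A(n) for n in B) / len(B)
    return x, y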
11. K-regularity
• View Size: k
• Number of nodes: n
• Fraction of nodes in partition: γ
• |A|= γn ≤ |B|
• #edges from A to B: (1-x)γkn
• #edges from B to A: y (1-γ)kn
• Number of edges in the A-B cut (using x = y):
– (1-x)γkn + x(1-γ)kn = (γ + x(1-2γ))kn
– ≥ γkn (since x ≥ 0 and γ ≤ ½)
12. Time Till Partitioning
• View Size: k
• Number of nodes: n
• Fraction of nodes in partition: γ
• Churn rate: μ (μn nodes leave and join)
• Claim: Expected time before a partition of size
γ happens ≈ 2γkn
– As long as μ≪γkn
13. Iterations until Partitioning
• Simulation results from [Allavena et al PODC 2005]: with 100,000 nodes, view sizes of 17, a fanout of 3 and a loosely synchronised system, the maximum in-degree was always below 4.5 times that of a random graph, and the standard deviation was not more than 3.2 times larger than that of a random graph
– These values improve with increased fanout, but even a fanout of 2 gives satisfactory performance
• To match the theoretical results about partitioning and churn, simulations evaluated the number of iterations until partitioning
[Figure 4 from the paper: number of iterations until partitioning vs log10 of the number of nodes; number of nodes n, view size k = log n, churn n/32]
14. View Size vs Time until Partition
Number of nodes: n
View size: k = log n
Churn: n/32
15. Simplified Model for Proof
– Single randomly chosen element from view is
replaced instead of whole views
– Assumption: The out-edges of nodes of A are
identically distributed and same applies to B
– a = #edges from A to A
– c = #edges from A to B
– b = #edges from B to A
– d = #edges from B to B
17. In-Degree Analysis
• Load balancing requires balance in in-degree
distribution
• In-degree is governed by the way edges are created, copied and destroyed
• Copying some edges more than others causes variability in in-degree
• A node that lives longer is expected to have a higher in-degree
• Solution: Increase reinforcement and keep track
of timestamps like in Cyclon
• Simulation: max in-degree < 4.5 times that of a random graph, standard deviation < 3.2 times
18. Discussion
• Are these theoretical guarantees practically
useful?
• Goal is not to provide failure detection
19. Cyclon
• Consists of the same elements as suggested by [Allavena et al PODC 2005]
• The analysis of [Allavena et al PODC 2005] holds for Cyclon
• Major differences:
– Timestamps
– Shuffling
20. Basic Shuffling
• Select a random subset of l neighbors (1 ≤ l ≤ c) from P’s
own cache, and a random peer, Q, within this subset,
where l is a system parameter, called shuffle length.
• Replace Q’s address with P’s address.
• Send the updated subset to Q.
• Receive from Q a subset of no more than l of Q’s neighbors.
• Discard entries pointing to P, and entries that are already in
P’s cache.
• Update P’s cache to include all remaining entries, by
– firstly using empty cache slots (if any), and
– secondly replacing entries among the ones originally sent to Q.
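A minimal sketch of the basic shuffle above, from the initiating peer P's point of view; the send/receive helpers and the exact replacement bookkeeping are assumptions for illustration:

import random

def basic_shuffle(P, cache, c, l, send, receive):
    """One basic shuffle initiated by peer P (sketch).
    cache: P's neighbour list (at most c entries); l: shuffle length, 1 <= l <= c.
    send(peer, entries) and receive(peer) are assumed network helpers."""
    subset = random.sample(cache, min(l, len(cache)))   # random subset of l neighbours
    Q = random.choice(subset)                           # random peer within that subset
    send(Q, [P if e == Q else e for e in subset])       # replace Q's address with P's, send to Q
    incoming = receive(Q)                               # no more than l of Q's neighbours

    incoming = [e for e in incoming if e != P and e not in cache]  # discard P and known entries
    replaceable = list(subset)                          # entries originally sent to Q
    for e in incoming:
        if len(cache) < c:                              # firstly: use empty cache slots
            cache.append(e)
        elif replaceable:                               # secondly: replace entries sent to Q
            cache[cache.index(replaceable.pop())] = e
    return cache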
22. Enhanced Shuffling
• Increase by one the age of all neighbors.
• Select neighbor Q with the highest age among all neighbors, and l −
1 other random neighbors.
• Replace Q’s entry with a new entry of age 0 and with P’s address.
• Send the updated subset to peer Q.
• Receive from Q a subset of no more than l of its own entries.
• Discard entries pointing at P and entries already contained in P’s
cache.
• Update P’s cache to include all remaining entries, by firstly using empty cache slots (if any), and secondly replacing entries among the ones sent to Q.
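Relative to the basic shuffle, the main changes are the age bookkeeping and the choice of Q; a sketch of just that selection step, assuming cache entries are (address, age) pairs:

import random

def select_shuffle_targets(cache, l):
    """Enhanced-shuffle neighbour selection (sketch)."""
    cache = [(addr, age + 1) for addr, age in cache]    # increase the age of all neighbours
    Q = max(cache, key=lambda entry: entry[1])          # neighbour with the highest age
    rest = [e for e in cache if e is not Q]
    others = random.sample(rest, min(l - 1, len(rest))) # l-1 other random neighbours
    return cache, Q, others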
24. Number of Clusters
• Fig. 7 from the Cyclon paper: (a) number of disjoint clusters after removing a large percentage of nodes, showing that the overlay does not break into two or more disjoint clusters unless a major percentage of the nodes is removed; (b) number of nodes not belonging to the largest cluster (log scale), showing that in the first steps of clustering only a few nodes are separated from the main cluster, which still connects the grand majority of the nodes
• The number of clusters decreases as node removal approaches 100% because the total number of surviving nodes becomes too small
• These graphs show considerable robustness to node failures, especially considering that in the early stages of clustering very few nodes are outside the largest cluster, i.e. most nodes are still connected in a single cluster
27. SCAMP
• Partial knowledge of the membership: local
view
• Fanout automatically set to the size of the local view
• Fanout evolves naturally with the size of the group
– Size of local views converges towards C·log(n)
28. Join (Subscription)
• Subscription is sent to a random member
• The subscription is then forwarded through the overlay: at each node it is kept with probability P = 1/(size of view) and forwarded on with probability 1-P
[Diagram: new node s subscribes to a random member; the subscription is forwarded along nodes 1, 2, 3, each keeping s with probability P = 1/(size of view) or forwarding it with probability 1-P]
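A rough sketch of the forwarding decision illustrated above, based only on the rule stated on this slide (keep with probability P = 1/(size of view), otherwise forward); the full SCAMP subscription algorithm has additional steps not shown here:

import random

def handle_subscription(local_view, new_node, forward):
    """On receiving a forwarded subscription for new_node (sketch of the slide's rule)."""
    P = 1.0 / len(local_view)                         # P = 1 / size of view
    if new_node not in local_view and random.random() < P:
        local_view.append(new_node)                   # keep the subscriber locally
    else:
        forward(random.choice(local_view), new_node)  # with probability 1-P, forward onwards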
30. Load Balancing
• Indirection:
– Forward the subscription instead of handling the request
• Lease associated with each subscription
• Periodically nodes have to re-subscribe
– Nodes having failed permanently will time out
– Re-balance the partial views
31. Unsubscription
[Diagram: node 0 unsubscribes by sending Unsub(0) together with its local view [1, 4, 5]; nodes x, y, z, which had 0 in their local views, replace that entry with a member of 0's view, e.g. x: (8, 9, 0) → (8, 9, 4) and y: (7, 3, 0) → (7, 3, 5)]
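A minimal sketch of the replacement rule in the diagram; the function name and arguments are assumptions:

import random

def handle_unsubscription(local_view, leaving, leaving_view):
    """On Unsub(leaving): replace the entry pointing at the leaving node
    with a member of the leaving node's own view (sketch)."""
    if leaving in local_view:
        candidates = [n for n in leaving_view if n not in local_view]
        if candidates:
            local_view[local_view.index(leaving)] = random.choice(candidates)
        else:
            local_view.remove(leaving)   # nothing new to substitute: drop the entry
    return local_view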
32. Degree
• System modelled as a random directed graph
• D(N) = average out-degree for an N-node system
• A subscription adds D(N)+1 directed arcs, so
– (N+1)·D(N+1) = N·D(N) + D(N) + 1, i.e. D(N+1) = D(N) + 1/(N+1)
• Solution of this recursion:
– D(N) = D(1) + 1/2 + 1/3 + … + 1/N ≈ log(N)
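A quick numerical check of this recursion (not from the SCAMP paper), iterating D(N+1) = D(N) + 1/(N+1) and comparing with log N:

import math

def average_degree(N, D1=1.0):
    """Iterate D(N+1) = D(N) + 1/(N+1), i.e. (N+1)*D(N+1) = N*D(N) + D(N) + 1."""
    D = D1
    for i in range(2, N + 1):
        D += 1.0 / i
    return D

for N in (1000, 200000, 500000):
    print(N, round(average_degree(N), 2), round(math.log(N), 2))
# D(N) grows like log(N) (plus the constant D(1)); log(200000) ~ 12.2 and log(500000) ~ 13.1,
# matching the view-size peaks on slide 33.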
33. Distribution of view size
[Figure: histograms of view size (number of nodes vs view size, 0 to 50) for a 200,000-node system and a 500,000-node system; the marked values log(200,000) ≈ 12.2 and log(500,000) ≈ 13.12 show the view-size distributions centred around log(N)]
34. Reliability: 5000 node system
[Figure: reliability (0.9 to 1.0) vs number of failures (0 to 2500) for SCAMP, compared with gossip using global membership knowledge at fanout 8 and fanout 9]
35. NewsCast
• Goal: aggregate information
– in a large and dynamic distributed environment
– in a robust and dependable manner
36. Idea
• Gets news from the application, timestamps it, and adds the local peer's address to the cache entry
• Picks a random peer from the addresses in the cache
– Sends all cache entries to this peer
– Receives all cache entries from that peer
• Passes cache entries (containing news items) on to the application
• Merges the old cache with the received cache
– Keeps at most the C most recent cache entries
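A minimal sketch of one such exchange, assuming cache entries are (address, timestamp, news) tuples and simple send/receive helpers; this is illustrative, not the exact NewsCast implementation:

import random
import time

def newscast_exchange(self_addr, cache, news_item, send, receive, C=20):
    """One NewsCast exchange at self_addr (sketch); C is the cache size."""
    cache.append((self_addr, time.time(), news_item))          # timestamp own news, add own address
    peer = random.choice([a for a, _, _ in cache if a != self_addr])  # random peer from the cache
    send(peer, cache)                                           # send all cache entries
    received = receive(peer)                                    # receive the peer's cache entries

    freshest = {}
    for addr, ts, news in cache + received:                     # merge, keeping the newest entry per peer
        if addr not in freshest or ts > freshest[addr][1]:
            freshest[addr] = (addr, ts, news)
    cache[:] = sorted(freshest.values(), key=lambda e: e[1], reverse=True)[:C]  # keep the C most recent
    return cache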
37. Aggregation
• Each node ni maintains a single number xi
• Every node ni selects a random node nk and sends its value xi to nk
• nk responds with the aggregate (e.g. max(xi, xk)) of the incoming value and its own value
• Aggregate values converge to the overall value “exponentially” when the aggregate is an average function, and “super-exponentially” when it is a maximum function
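A toy, synchronous simulation of this aggregation scheme with max as the aggregate (the round structure is an assumption, for illustration only):

import random

def aggregate_round(values, combine=max):
    """One synchronous gossip-aggregation round: every node contacts a random peer
    and both adopt the combined value (push-pull, sketch)."""
    nodes = list(values)
    for i in nodes:
        k = random.choice([n for n in nodes if n != i])
        values[i] = values[k] = combine(values[i], values[k])
    return values

values = {i: random.random() for i in range(1000)}
target = max(values.values())
rounds = 0
while any(v != target for v in values.values()):
    values = aggregate_round(values)
    rounds += 1
print("max reached everywhere after", rounds, "rounds")   # typically only a handful of rounds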
[Note on slide 24, from the Cyclon paper: the graph for the experiment with cache size 100 is practically a flat line; that is, for 100,000 nodes and cache size 100 the overlay created is so robust that no matter how many nodes are removed, the remaining ones stay connected in a single cluster.]
[Note from the Cyclon paper on the in-degree distribution in a converged 100,000-node overlay (basic shuffling, enhanced shuffling, and an overlay where each node has c randomly chosen outgoing links): enhanced shuffling does a significantly better job of spreading the links evenly across all nodes. For cache size 20, 80.31% of the nodes have an in-degree of 20 ± 5%; for cache size 50, 93.95% have an in-degree of 50 ± 5%. The respective percentages for basic shuffling are 36.22% and 38.47%.]
[Note on the SCAMP degree recursion of slide 32: this is an average-case analysis; in reality there are noise terms in the recurrence, since we pick a node whose degree is only approximately D(N). Proving the argument correct in the presence of this noise requires controlling the variance of that noise (and invoking the martingale convergence theorem).]
[Figure note for the aggregation experiment of slide 37: maximum-finding protocol, N = 10^5; points are averages of 50 runs; the standard deviation is not shown as it is several orders of magnitude lower than the average.]