2. Why ZooKeeper?
• Lots of servers
• Lots of processes
• High volumes of data
• Highly complex software systems
• … and mere mortal developers
3. What ZooKeeper gives you
● Simple programming model
● Coordination of distributed processes
● Fast notification of changes
● Elasticity
● Easy setup
● High availability
4. ZooKeeper Configuration
• Membership
• Role of each server
– E.g., follower or observer
• Quorum System spec
– ZooKeeper: majority or hierarchical
• Network addresses & ports
• Timeouts, directory paths, etc.
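These options typically live in ZooKeeper's zoo.cfg file. A minimal sketch with illustrative values (hostnames and paths are made up; hierarchical-quorum keys are shown commented out):

    # Minimal zoo.cfg sketch -- all values illustrative
    tickTime=2000                 # basic time unit (ms) used for timeouts
    initLimit=5                   # ticks a follower may take to sync with the leader
    syncLimit=2                   # ticks a follower may lag before being dropped
    dataDir=/var/lib/zookeeper    # directory path for snapshots and transaction logs
    clientPort=2181
    # Membership, per-server roles, network addresses & ports:
    server.1=host1.com:2888:3888            # quorum port : leader-election port
    server.2=host2.com:2888:3888
    server.3=host3.com:2888:3888:observer   # non-voting observer
    # Quorum system spec: simple majority by default; hierarchical quorums
    # would instead be configured with group.N and weight.N keys.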
5. ZooKeeper - distributed and replicated
[Figure: a ZooKeeper service of five servers, one of them the leader, with clients connected to each server]
• All servers store a copy of the data (in memory)
• A leader is elected at startup
• Reads are served by followers; all updates go through the leader
• An update is acked when a quorum of servers has persisted the change (on disk)
• ZooKeeper uses ZAB, its own atomic broadcast protocol
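A minimal client sketch using the standard ZooKeeper Java API (hostnames are hypothetical), illustrating the split above: the handle connects to any server, reads are answered by that server, and writes are forwarded to the leader and acked only after a quorum persists them:

    import org.apache.zookeeper.*;
    import org.apache.zookeeper.data.Stat;

    public class QuickDemo {
        public static void main(String[] args) throws Exception {
            // Connect to any server in the ensemble.
            ZooKeeper zk = new ZooKeeper(
                    "host1.com:2181,host2.com:2181,host3.com:2181",
                    10000,          // session timeout (ms)
                    event -> { });  // watcher for connection events

            // Writes go through the leader; acked once a quorum has persisted them.
            zk.create("/demo", "v1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Reads are served by whichever server this client is connected to.
            Stat stat = new Stat();
            byte[] data = zk.getData("/demo", false, stat);
            System.out.println(new String(data) + " @zxid " + stat.getMzxid());
            zk.close();
        }
    }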
6. Dynamic Membership Changes
• Necessary in every long-lived system!
• Examples:
– Cloud computing: adapt to changing load, don’t pre-allocate!
– Failures: replacing failed nodes with healthy ones
– Upgrades: replacing out-of-date nodes with up-to-date ones
– Free up storage space: decreasing the number of replicas
– Moving nodes: within the network or the data center
– Increase resilience by changing the set of servers
Example: asynchronous replication works as long as more than #servers/2 servers operate.
12. Hazards of Manual Reconfiguration
[Figure: servers A, B, C, each storing configuration {A, B, C}; E and D are not yet members]
• Goal: add servers E and D
17. Hazards of Manual Reconfiguration
[Figure: all five servers now store {A, B, C, D, E}; during the restart, a quorum of the new configuration can form without the latest acknowledged updates]
• Goal: add servers E and D
• Change configuration
• Restart servers
• Lost updates!
18. Just use a coordination service!
• ZooKeeper is the coordination service
– Don't want to deploy another system to coordinate it!
• Who will reconfigure that system?
– GFS has 3 levels of coordination services
• More system components → more management overhead
• Use ZooKeeper to reconfigure itself!
– Other systems store configuration information in ZooKeeper
– Can we do the same?
– Only if there are no failures
23. This doesn't work for reconfigurations!
[Figure: ensemble {A, B, C, D, E}; a client calls setData(/zookeeper/config, {A, B, F}) to remove C, D, E and add F; every server still stores {A, B, C, D, E}]
25. This doesn't work for reconfigurations!
[Figure: A and F have switched to {A, B, F} while B, C, D, E still store {A, B, C, D, E}]
• Must persist the decision to reconfigure in the old config before activating the new config!
• Once such a decision is reached, must not allow further ops to be committed in the old config
26. Our Solution
• Correct
• Fully automatic
• No external services or additional components
• Minimal changes to ZooKeeper
• Usually unnoticeable to clients
– Pauses operations only in rare circumstances
– Clients work with a single configuration
• Rebalances clients across servers in the new configuration
• Reconfigures immediately
• Speculative reconfiguration
– The reconfiguration (and the commands that follow it) is speculatively sent out by the primary, like all other updates
27. Principles
● Commit the reconfig in a quorum of the old ensemble
– Submit the reconfig op just like any other update
● Make sure the new ensemble has the latest state before becoming active
– Get a quorum of synced followers from the new config
– Get acks from both old and new ensembles before committing updates proposed between the reconfig op and activation
– Activate the new configuration when the reconfig commits
● Once the new ensemble is active, the old ensemble cannot commit or propose new updates
● Gossip activation through leader election and syncing
● Verify the configuration id of leader and follower
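A rough leader-side sketch of these principles in Java-style pseudocode (every name here is invented for exposition; this is not ZooKeeper's actual internal API):

    // Pseudocode sketch of the reconfiguration principles above.
    void reconfigure(Config oldCfg, Config newCfg) {
        Proposal r = propose(reconfigOp(newCfg)); // submitted like any other update
        // Updates proposed between r and activation need acks from quorums of
        // BOTH ensembles, so the new ensemble holds the latest state.
        requireAcksFrom(oldCfg, newCfg);
        waitForQuorumAck(r, oldCfg);    // commit the decision in the old config
        waitForSyncedQuorum(newCfg);    // quorum of synced followers from new config
        activate(newCfg);               // old ensemble can no longer commit or propose
        requireAcksFrom(newCfg);        // activation gossiped via election & syncing
    }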
29. Reconfiguration scenario 1
[Figure: A, B, C each store {A, B, C}; D and E are the servers being added]
• Goal: add servers E and D
35. Reconfiguration scenario 1
[Figure: all five servers now store {A, B, C, D, E}]
• Goal: add servers E and D
• The reconfig doesn't commit until quorums of both ensembles ack
• E and D gossip the new configuration to C
36. Example - reconfig using CLI
reconfig -add 1=host1.com:1234:1235:observer;1239 -add 2=host2.com:1236:1237:follower;1231 -remove 5
● Change follower 1 to an observer and change its ports
● Add follower 2 to the ensemble
● Remove follower 5 from the ensemble
reconfig -file myNewConfig.txt -v 234547
● Change the current config to the one in myNewConfig.txt
● But only if the current config version is 234547
getConfig -w -c
● Set a watch on /zookeeper/config
● -c means we only want the new connection string for clients
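The same operations are exposed programmatically. A hedged sketch using the Java admin API (org.apache.zookeeper.admin.ZooKeeperAdmin.reconfigure, which shipped in later 3.5.x releases; hostnames and ports reuse the hypothetical CLI values above, and the dynamic-config syntax calls voting members "participant"):

    import org.apache.zookeeper.admin.ZooKeeperAdmin;
    import org.apache.zookeeper.data.Stat;

    public class ReconfigDemo {
        public static void main(String[] args) throws Exception {
            ZooKeeperAdmin admin = new ZooKeeperAdmin(
                    "host1.com:1239,host2.com:1231", 10000, event -> { });
            Stat stat = new Stat();
            // Incremental mode: change server 1, add server 2, remove server 5.
            byte[] newConfig = admin.reconfigure(
                    "1=host1.com:1234:1235:observer;1239,"
                    + "2=host2.com:1236:1237:participant;1231", // joining servers
                    "5",   // leaving servers
                    null,  // null = incremental rather than a whole new membership
                    -1,    // -1 = "blind"; pass a config version to condition on it
                    stat);
            System.out.println(new String(newConfig));
            admin.close();
        }
    }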
37. When it will not work
● A quorum of the new ensemble must be in sync
● Another reconfig is in progress
● The version condition check fails
38. How do you know you are done?
● Write something somewhere
39. The “client side” of reconfiguration
• When system changes, clients need to stay connected
– The usual solution: directory service (e.g., DNS)
• Re-balancing load during reconfiguration is also important!
• Goal: uniform #clients per server with minimal client migration
– Migration should be proportional to change in membership
[Figure: three servers with 10 clients each]
47. Our approach - Probabilistic Load Balancing
• Example 1: grow from {A, B, C}, 10 clients each, to {A, B, C, D, E}
[Figure: after rebalancing, each of the five servers has 6 clients]
– Each client moves to a random new server with probability 1 − 3/5 = 0.4
– Exp. 40% of clients will move off of each server
● Example 2: shrink from {A, B, C, D, E}, 6 clients each, to {A, B, F}
[Figure: after rebalancing, A, B, and F have 10 clients each]
– Connected clients (those on A and B) don't move
– Disconnected clients move to each old server with prob 4/18 and to the new one with prob 10/18
– Exp. 8 clients will move from C, D, E to A and B, and 10 to F
49. Probabilistic Load Balancing
When moving from configuration S to S':

$$E(\mathrm{load}(i, S')) = \mathrm{load}(i, S) + \sum_{j \in S,\, j \neq i} \mathrm{load}(j, S) \cdot \Pr(j \to i) - \mathrm{load}(i, S) \cdot \sum_{j \in S',\, j \neq i} \Pr(i \to j)$$

where E(load(i, S')) is the expected #clients connected to i in S' (10 in the last example), load(i, S) is the #clients connected to i in S, the first sum is the expected #clients moving to i from other servers in S, and the second term is the expected #clients moving from i to other servers in S'.
Solving for Pr we get case-specific probabilities.
Input: each client answers locally:
Question 1: Are there more servers now or fewer?
Question 2: Is my server being removed?
Output: 1) disconnect or stay connected to my server;
if disconnecting, 2) Pr(connect to one of the old servers) and Pr(connect to a newly added server)
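A client-side sketch of this decision for the growing case of Example 1 (method names are illustrative; the real client logic also implements the weighted probabilities of Example 2, hinted at in the removal branch):

    import java.util.List;
    import java.util.Random;
    import java.util.stream.Collectors;

    class RebalanceSketch {
        static final Random rnd = new Random();

        /** Decide which server this client should use after a config change. */
        static String pickServer(String myServer, List<String> oldS, List<String> newS) {
            if (!newS.contains(myServer)) {
                // My server was removed: must reconnect. (In Example 2 this choice
                // is weighted, e.g. 4/18 per old server and 10/18 for the new one.)
                return newS.get(rnd.nextInt(newS.size()));
            }
            // More servers than before: move with probability 1 - |old|/|new|
            // (1 - 3/5 = 0.4 in Example 1), uniformly among the added servers.
            double pMove = 1.0 - (double) oldS.size() / newS.size();
            if (rnd.nextDouble() < pMove) {
                List<String> added = newS.stream()
                        .filter(s -> !oldS.contains(s))
                        .collect(Collectors.toList());
                return added.get(rnd.nextInt(added.size()));
            }
            return myServer; // stay connected
        }
    }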
50. Implementation
• Implemented in ZooKeeper (Java & C), integration ongoing
– 3 new ZooKeeper API calls: reconfig, getConfig, updateServerList
– Feature requested since 2008, expected in the 3.5.0 release (July 2012)
• Dynamic changes to:
– Membership
– Quorum system
– Server roles
– Addresses & ports
• Reconfiguration modes:
– Incremental (add servers E and D, remove server B)
– Non-incremental (new config = {A, C, D, E})
– Blind or conditioned (reconfig only if current config is #5)
• Subscriptions to config changes
– A client can invoke client-side re-balancing upon change (a watch sketch follows)
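For instance, a client can subscribe to configuration changes with the standard 3.5+ Java API (getConfig with a watcher); the re-balancing hook is where the decision sketch above would run:

    import org.apache.zookeeper.*;
    import org.apache.zookeeper.data.Stat;

    public class ConfigWatchDemo {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("host1.com:2181", 10000, event -> { });

            Watcher configWatcher = new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    try {
                        // Re-read /zookeeper/config and re-arm the watch.
                        byte[] cfg = zk.getConfig(this, new Stat());
                        System.out.println("New config: " + new String(cfg));
                        // ...trigger client-side re-balancing here...
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            };
            System.out.println("Config: " + new String(zk.getConfig(configWatcher, new Stat())));
            Thread.sleep(60_000); // keep the session alive to receive the watch
        }
    }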
51. Summary
• Design and implementation of reconfiguration for Apache ZooKeeper
– Being contributed into the ZooKeeper codebase
• Much simpler than the state of the art, using properties already provided by ZooKeeper
• Many nice features:
– Doesn't limit concurrency
– Reconfigures immediately
– Preserves primary order
– Doesn't stop client ops
– ZooKeeper is used by online systems, so any delay must be avoided
– Clients work with a single configuration at a time
– No external services
– Includes client-side rebalancing