1. ᔕᕈᒪᐱᓭ
Distributed Systems
Made Simple
Pascal Felber, Raluca Halalai, Lorenzo Leonini,
Etienne Rivière, Valerio Schiavoni, José Valerio
Université de Neuchâtel, Switzerland
www.splay-project.org
ᔕᕈᒪᐱᓭ
Monday, January 23, 12
2. Motivations
• Developing, testing, and tuning distributed
applications is hard
• In Computer Science research, closing the
simplicity gap between a pseudocode
description and its implementation is hard
• Using worldwide testbeds is hard
ᔕᕈᒪᐱᓭ Distributed Systems Made Simple - Pascal Felber - University of Neuchâtel
3. Testbeds
• Set of machines for testing a distributed
application/protocol
• Several different testbeds!
• Networks of idle workstations
• A cluster @UniNE
• Your machine
5. What is PlanetLab?
• Machines contributed by universities,
companies, etc.
• 1098 nodes at 531 sites (02/09/2011)
• Shared resources, no privileged access
• University-quality Internet links
• High resource contention
• Faults, churn, and packet loss are the norm
• Challenging conditions
6. Daily Job With Distributed Systems
• Step 1: Write (testbed-specific) code
• Local tests, in-house cluster, PlanetLab...
7. Daily Job With Distributed Systems
• Step 2: Debug (in this context, a nightmare)
8. Daily Job With Distributed Systems
• Step 3: Deploy, with testbed-specific scripts
9. Daily Job With Distributed Systems
• Step 4: Get logs, with testbed-specific scripts
10. Daily Job With Distributed Systems
• Step 5: Produce plots, hopefully
11. ᔕᕈᒪᐱᓭ at a Glance
code → debug → deploy → get logs → plots (gnuplot)
• Supports the development, evaluation, testing, and
tuning of distributed applications on any testbed:
• In-house cluster, shared testbeds, emulated
environments...
• Provides an easy-to-use pseudocode-like language
based on Lua
• Open-source: http://www.splay-project.org
12. The Big Picture
[diagram: the application runs on splayd daemons deployed on the testbed nodes; the Splay Controller, backed by a SQL DB, coordinates them]
15. SPLAY Architecture
• Portable daemons (ANSI C) on the testbed(s)
• Testbed-agnostic for the user
• Lightweight (memory & CPU)
• Retrieve results by remote logging
16. SPLAY Language
• Based on Lua
• High-level scripting language, simple interaction with C, bytecode-based, garbage collection, ...
• Close to pseudocode (focus on the algorithm)
• Favors natural ways of expressing algorithms (e.g., RPCs)
• SPLAY applications can be run locally without the deployment infrastructure (for debugging & testing)
• Comprehensive set of events/threads libraries (extensible): luasocket*, io (fs)*, stdlib*, crypto*, llenc, json*, misc, log, rpc, splay::app (* : third-party and Lua libraries)
[Figure 6: Overview of the main SPLAY libraries]
17. Why Lua?
• Light & Fast
• (Very) close to equivalent code in C
• Concise
• Allow developers to focus on ideas more
than implementation details
• Key for researchers and students
• Sandboxing, thanks to the ability to easily
redefine (even built-in) functions
18. Concise
• Pseudocode as published in the original paper
• Executable code using SPLAY libraries
19. http://www.splay-project.org
20. Sandbox: Motivations
• Experiments should access only
their own resources
• Required for non-dedicated testbeds
• Idle workstations in universities or companies
• Must protect the regular users from ill-behaved
code (e.g., full disk, full memory, etc.)
• Memory allocation, filesystem, and network
resources must be restricted
21. Sandboxing with Lua
• In Lua, all functions can be re-defined transparently
• Resource control done in re-defined functions
before calling the original function
• No code modification required for the user
[stack: SPLAY Application → sandboxed SPLAY Lua libraries (includes all stdlib and system calls) → Operating system]
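As a concrete illustration of this technique, here is a minimal sketch of transparently redefining io.open so that writes are denied outside an experiment directory. This is our own toy example, not SPLAY's actual sandbox code; the ALLOWED prefix is a made-up path.

```lua
-- Directory the experiment is allowed to write into (hypothetical path).
ALLOWED = "/tmp/splay/"

-- Keep a private reference to the original function...
original_open = io.open

-- ...then redefine it transparently: callers need no code changes.
io.open = function(path, mode)
  mode = mode or "r"
  -- Deny write/append outside the experiment directory.
  -- (A real sandbox would also cover "r+", os.remove, etc.)
  if mode:match("[wa]") and path:sub(1, #ALLOWED) ~= ALLOWED then
    return nil, "sandbox: write access denied: " .. path
  end
  return original_open(path, mode)
end

local f, err = io.open("/etc/passwd", "w")
-- f is nil: the write was refused before touching the filesystem
```

Resource control (here, a path check) runs in the redefined function before the original is called, exactly as the slide describes.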
22. 15
for _,module in pairs(splay)
present novelties
introduced due to dist
systems
luasocket events io
crypto llenc/benc json
misc log rpc
Modules sandboxed to prevent
access to sensible resources
23. splay.events
• Distributed protocols can use the message-passing paradigm to communicate
• Nodes react to events
• Local, incoming, outgoing messages
• The core of the Splay runtime
• Libraries splay.socket & splay.io provide a non-blocking operational mode
• Based on Lua's coroutines
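To make the coroutine-based model concrete, here is a self-contained sketch of cooperative multitasking in plain Lua. The spawn/loop scheduler below is our own toy version, not SPLAY's splay.events API.

```lua
ready = {}   -- queue of runnable coroutines

-- Wrap a function in a coroutine and enqueue it.
function spawn(fn)
  table.insert(ready, coroutine.create(fn))
end

-- Run all tasks round-robin until every coroutine has finished.
function loop()
  while #ready > 0 do
    local co = table.remove(ready, 1)
    local ok, err = coroutine.resume(co)
    assert(ok, err)
    if coroutine.status(co) ~= "dead" then
      table.insert(ready, co)   -- it yielded: re-enqueue it
    end
  end
end

trace = {}
spawn(function()
  for i = 1, 2 do table.insert(trace, "a" .. i); coroutine.yield() end
end)
spawn(function()
  for i = 1, 2 do table.insert(trace, "b" .. i); coroutine.yield() end
end)
loop()
-- trace is {"a1", "b1", "a2", "b2"}: the two tasks interleave
```

Each coroutine.yield() is a cooperative scheduling point; in SPLAY the runtime yields on blocking I/O instead, which is what makes splay.socket and splay.io non-blocking.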
24. splay.rpc
• Default mechanism to communicate between nodes
• Support for UDP/TCP
• Efficient BitTorrent-like encoding
• Experimental binary encoding
25. Life Before ᔕᕈᒪᐱᓭ
• Time spent on developing testbed-specific protocols
• Or fallback to simulations...
• The focus should be on ideas
• Researchers usually have no time to produce industrial-quality code
• ... and there is more :-(
[background: a page of verbose, heavily commented Java message-handling code]
26. Life With ᔕᕈᒪᐱᓭ
• Lines of pseudocode ~== Lines of executable code
All implementations are extremely concise in terms of lines of code (LOC):
Chord: 59 (base) + 17 (FT) + 26 (leafset) = 101
Pastry: 265
Scribe (over Pastry): 79
SplitStream (over Pastry, Scribe): 58
BitTorrent: 420
Cyclon: 93
Epidemic: 35
Trees: 47
27. Churn Management
• Churn is key for testing large-scale applications
• "Natural churn" on PlanetLab is not reproducible
• Hard to compare algorithms under the same conditions
• SPLAY allows fine-grained control of churn:
• Traces (e.g., from file-sharing systems)
• Synthetic descriptions: dedicated language
1 at 30s join 10
2 from 5m to 10m inc 10
3 from 10m to 15m const churn 50%
4 at 15m leave 50%
5 from 15m to 20m inc churn 150%
6 at 20m stop
[Figure 5: Example of a synthetic churn description; the plot shows the binned number of joins and leaves per minute and the resulting node population over 20 minutes]
28. Synthetic Churn
• The churn management module is evaluated using both traces and synthetic descriptions
• Using churn is as simple as launching a regular SPLAY application with a trace file as an extra argument
• SPLAY provides a set of tools to generate and process trace files: one can, for instance, speed up a trace, increase the churn amplitude whilst keeping its statistical properties, or generate a trace from a synthetic description
[Figure 11: Using churn management to reproduce massive churn conditions for the SPLAY Pastry implementation. Massive failure: 50% of the network fails; panels show routing delay percentiles and route failures (%) over time, with self-repair]
29. Trace-driven Churn
• Traces from the OverNet file-sharing network [IPTPS03]
• Speed-up of 2, 5, 10 (1 minute = 30, 12, 6 seconds)
• Evolution of hit-ratio & delays of Pastry on PlanetLab
[Figure: Study of the effect of churn on Pastry deployed on PlanetLab, for churn x2, x5, and x10. Churn is derived from the trace of the Overnet file-sharing system and sped up for increasing volatility; panels show route failures (%), delay (seconds), and leaves and joins per minute]
30. Live Demo
www.splay-project.org
31. Live Demo: Gossip-based Dissemination
• Also called epidemic dissemination
• A message spreads like the flu among a population
• Robust way to spread a message in a random network
• Two actions: push and pull
• Push: a node chooses f random other nodes to infect
• A newly infected node infects f others in turn
• Done for a limited number of steps (TTL)
• Pull: a node tries to fetch messages from a random node
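The push phase can be sketched as a purely local Lua simulation, where an "RPC" is just a direct function call. All names here (random_peers, infected, ...) are ours, not SPLAY's.

```lua
N, FANOUT, TTL = 20, 3, 4
math.randomseed(42)

-- Each simulated node stores the messages it has received.
nodes = {}
for i = 1, N do nodes[i] = { id = i, msgs = {} } end

-- Pick f nodes at random (duplicates possible; fine for a sketch).
function random_peers(f)
  local picked = {}
  for _ = 1, f do table.insert(picked, nodes[math.random(N)]) end
  return picked
end

-- Mirrors the protocol: store on first receipt, forward while hops > 0,
-- silently drop duplicates.
function push(node, m, hops)
  if not node.msgs[m.id] then
    node.msgs[m.id] = m
    if hops > 0 then
      for _, peer in ipairs(random_peers(FANOUT)) do
        push(peer, m, hops - 1)
      end
    end
  end
end

-- The source pushes the message to f random peers.
local msg = { id = 1, data = "hello" }
for _, peer in ipairs(random_peers(FANOUT)) do push(peer, msg, TTL) end

infected = 0
for _, n in ipairs(nodes) do
  if n.msgs[1] then infected = infected + 1 end
end
-- with f = 3 and TTL = 4, most of the 20 nodes are typically reached
```

The pull phase (not simulated here) then catches the few nodes the random pushes missed, which is why the message eventually reaches all peers with high probability.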
32. SOURCE
Every node knows a random set
of other nodes.
A node wants to send a message
to all nodes.
33.
An initial push, in which nodes that
receive the message for the first
time forward it to f random
neighbors, up to TTL times.
34.
Periodically, nodes pull for new
messages from one random
neighbor. The message
eventually reaches all peers with
high probability.
35. Algorithm 1: A simple push-pull dissemination protocol, on node P.
Constants
  f: Fanout (push: P forwards a new message to f random peers).
  TTL: Time To Live (push: a message is forwarded to f peers at most TTL times).
  T: Pull period (pull: P asks for new messages every T seconds).
Variables
  R: set of received message IDs
  N: set of node identifiers
/* Invoked when a message is pushed to node P by node Q */
function Push(Message m, hops)
  if m received for the first time then
    /* Store the message locally */
    R ← R ∪ {m}
    /* Propagate the message if required */
    if hops > 0 then
      invoke Push(m, hops-1) on f random peers from N
/* Periodic pull */
thread PeriodicPull()
  every T seconds do
    invoke Pull(R, P) on a random node from N
/* Invoked when a node Q requests messages from node P */
function Pull(RQ, Q)
  foreach m such that m ∈ R ∧ m ∉ RQ do
    invoke Push(m, 0) on Q
36.
require"splay.base"
rpc = require"splay.urpc"

--[[ SETTINGS ]]--
fanout = 3
ttl = 3
pull_period = 3
rpc_timeout = 15

--[[ VARS ]]--
msgs = {}

function push(q, m)
  if not msgs[m.id] then
    msgs[m.id] = m
    log.print("NEW: "..m.id.." (ttl:"..m.t..") from "..q.ip..":"..q.port)
    m.t = m.t - 1
    if m.t > 0 then
      for _, n in pairs(misc.random_pick(job.nodes, fanout)) do
        log.print("FORWARD: "..m.id.." (ttl:"..m.t..") to "..n.ip..":"..n.port)
        events.thread(function()
          rpc.call(n, {"push", job.me, m}, rpc_timeout)
        end)
      end
    end
  else
    log.print("DUP: "..m.id.." (ttl:"..m.t..") from "..q.ip..":"..q.port)
  end
end

Annotations:
• require loads the base and UDP RPC libraries.
• rpc_timeout sets the timeout for RPC operations (i.e., an RPC will return nil after 15 seconds).
• Logging facilities: all logs are available at the controller, where they are timestamped and stored in the DB.
• The RPC call itself is embedded in an inner anonymous function run in a new anonymous thread: we do not care about its success (epidemics are naturally robust to message loss). Testing for success would be easy: the second return value is nil in case of failure.
37.
function periodic_pull()
  while events.sleep(pull_period) do
    local q = misc.random_pick_one(job.nodes)
    log.print("PULLING: "..q.ip..":"..q.port)
    local ids = {}
    for id, _ in pairs(msgs) do
      ids[id] = true
    end
    local r = rpc.call(q, {"pull", ids}, rpc_timeout)
    if r then
      for id, m in pairs(r) do
        if not msgs[id] then
          msgs[id] = m
          log.print("NEW REPLY: "..m.id.." from "..q.ip..":"..q.port)
        else
          log.print("DUP REPLY: "..m.id.." from "..q.ip..":"..q.port)
        end
      end
    end
  end
end

function pull(ids)
  local r = {}
  for id, m in pairs(msgs) do
    if not ids[id] then
      r[id] = m
    end
  end
  return r
end

Annotations:
• SPLAY provides a library for cooperative multithreading, events, scheduling and synchronization, plus various commodity libraries.
• In Lua, arrays and maps are the same type.
• Variable r is nil if the RPC times out.
• There is no difference between a local function and one called by a distant node; even variables can be accessed by a distant node via an RPC! SPLAY does all the work of serialization, network calls, etc., while preserving a low footprint and high performance.
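A tiny example illustrating the note that Lua arrays and maps are the same type: a table's integer keys form its array part, other keys its hash part.

```lua
t = {}
t[1] = "array-style"   -- integer key: the array part, counted by #t
t.id = "map-style"     -- string key: the hash part, invisible to #t
-- #t is 1: the length operator only measures the array part
```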
38.
function source()
  local m = {id = 1, t = ttl, data = "data"}
  for _, n in pairs(misc.random_pick(job.nodes, fanout)) do
    log.print("SOURCE: "..m.id.." to "..n.ip..":"..n.port)
    events.thread(function()
      rpc.call(n, {"push", job.me, m}, rpc_timeout)
    end)
  end
end

events.loop(function()
  rpc.server(job.me)
  log.print("ME", job.me.ip, job.me.port, job.position)
  events.sleep(15)
  events.thread(periodic_pull)
  if job.position == 1 then
    source()
  end
end)

Annotations:
• rpc.call(...) calls a remote function.
• rpc.server(job.me) starts the RPC server: it is up and running.
• events.thread(...) embeds a function in a separate thread.
39. Take-away Slide
• Distributed systems raise a number of issues for their evaluation
• Hard to implement, debug, deploy, tune
• ᔕᕈᒪᐱᓭ leverages Lua and a centralized controller to produce an easy-to-use yet powerful working environment