The OMG DDS standard has seen very strong adoption as the distribution middleware of choice for a large class of mission- and business-critical systems, such as Air Traffic Control, Automated Trading, SCADA, Smart Energy, etc.
The main reasons for choosing DDS lie in its efficiency, scalability, high availability and configurability -- through its 20+ QoS policies. Yet, all of these nice properties come at the cost of a relaxed consistency model with no strong guarantees over global invariants.
As a result, many architects have to devise by themselves -- assuming the DDS primitives as a foundation -- the correct algorithms for classical problems such as fault detection, leader election, consensus, distributed mutual exclusion, atomic multicast, distributed queues, etc.
In this presentation we will explore DDS-based distributed algorithms for many classical, yet fundamental, problems in distributed systems. For simplicity, we'll start with algorithms that ignore the presence of failures. Then we will (1) demonstrate how these algorithms can be extended to deal with failures, and (2) introduce Paxos as one of the fundamental algorithms for consensus and atomic broadcast.
Finally, we'll show how these classical algorithms can be used to implement useful extensions of the DDS semantics, such as multi-writer / multi-reader distributed queues.
1. Classical Distributed Algorithms with DDS
[Developing Higher Level Abstractions on DDS]
OpenSplice DDS
Angelo CORSARO, Ph.D.
Chief Technology Officer
OMG DDS SIG Co-Chair
PrismTech
angelo.corsaro@prismtech.com
2. Context
Copyright 2011, PrismTech – All Rights Reserved.
☐ The Data Distribution Service (DDS) provides a very useful foundation for building highly dynamic, reconfigurable, dependable and high-performance systems
☐ However, in building distributed systems with DDS one is often faced with two kinds of problems:
☐ How can distributed coordination problems be solved with DDS? e.g. distributed mutual exclusion, consensus, etc.
☐ How can higher-order primitives and abstractions be supported over DDS? e.g. fault-tolerant distributed queues, total-order multicast, etc.
☐ In this presentation we will look at how DDS can be used to implement some of the classical distributed algorithms that solve these problems
4. Data Distribution Service For Real-Time Systems
☐ DDS provides a Topic-Based Publish/Subscribe abstraction based on:
☐ Topics: data distribution subjects
☐ DataWriters: data producers
☐ DataReaders: data consumers
[Figure: DataWriters and DataReaders exchanging samples over Topics A, B, C, D, ... in the DDS Global Data Space]
5. Data Distribution Service For Real-Time Systems
☐ DataWriters and DataReaders are automatically and dynamically matched by the DDS Dynamic Discovery
☐ A rich set of QoS policies allows control over existential, temporal, and spatial properties of data
[Figure: the same Global Data Space, with discovery matching DataWriters and DataReaders per Topic]
6. DDS Topics
☐ A Topic defines a class of streams, e.g. “Circle”, “Square”, “Triangle”, ...
☐ A Topic has associated a unique name, a user-defined extensible type and a set of QoS policies (e.g. DURABILITY, DEADLINE, PRIORITY, ...)
☐ QoS Policies capture the non-functional invariants
☐ Topics can be discovered or locally defined

struct ShapeType {
  @Key
  string color;
  long x;
  long y;
  long shapesize;
};
7. Topic Instances
☐ Each unique key value identifies a unique stream of data (e.g. for the keyed ShapeType above: color = “red”, color = “Green”, color = “Blue”)
☐ DDS not only demultiplexes “streams” but also provides lifecycle information
☐ A DDS DataWriter can write multiple instances
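DDS itself performs this key-based demultiplexing; the idea can be sketched in plain Scala (no DDS involved; `ShapeSample` is a hypothetical stand-in for the keyed ShapeType):

```scala
// Hypothetical stand-in for the keyed ShapeType topic type.
case class ShapeSample(color: String, x: Int, y: Int, shapesize: Int)

object InstanceDemux {
  // Demultiplex a flat sample stream into per-instance streams, keyed by
  // the @Key field (color); per-instance arrival order is preserved.
  def demux(samples: Seq[ShapeSample]): Map[String, Seq[ShapeSample]] =
    samples.groupBy(_.color)

  def main(args: Array[String]): Unit = {
    val stream = Seq(
      ShapeSample("red", 0, 0, 30),
      ShapeSample("Green", 5, 5, 30),
      ShapeSample("red", 1, 0, 30))
    // Two key values => two instances, i.e. two independent streams
    demux(stream).foreach { case (k, vs) => println(k + " -> " + vs.length) }
  }
}
```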
8. Anatomy of a DDS Application
[Figure: nesting of DDS entities: a Domain (e.g. Domain 123) contains Domain Participants; a Participant contains Publishers and Subscribers, which hold DataWriters and DataReaders bound to Topics within a Partition (e.g. “Telemetry”, “Shapes”); DataWriters write and DataReaders read Topic Instances/Samples]
9. Channel Properties
☐ We can think of a DataWriter and its matching DataReaders as connected by a logical typed communication channel
☐ The properties of this channel are controlled by means of QoS Policies
☐ At the two extremes this logical communication channel can be:
☐ Best-Effort/Reliable Last n-values Channel
☐ Best-Effort/Reliable FIFO Channel
[Figure: one DW connected through a Topic to several DRs]
10. Last n-values Channel
☐ The last n-values channel is useful when modeling distributed state
☐ When n=1 the last value channel provides a way of modeling an eventually consistent distributed state
☐ This abstraction is very useful if what matters is the current value of a given topic instance
☐ The QoS Policies that give a Last n-values Channel are:
☐ RELIABILITY = BEST_EFFORT | RELIABLE
☐ HISTORY = KEEP_LAST(n)
☐ DURABILITY = TRANSIENT | PERSISTENT [in most cases]
11. FIFO Channel
☐ The FIFO Channel is useful when we care about every single sample that was produced for a given topic -- as opposed to the “last value”
☐ This abstraction is very useful when writing distributed algorithms over DDS
☐ Depending on QoS Policies, DDS provides:
☐ Best-Effort/Reliable FIFO Channel
☐ FT-Reliable FIFO Channel (using an OpenSplice-specific extension)
☐ The QoS Policies that give a FIFO Channel are:
☐ RELIABILITY = BEST_EFFORT | RELIABLE
☐ HISTORY = KEEP_ALL
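The difference between the two channel flavours can be illustrated with a tiny simulation in plain Scala (no DDS involved; the names are illustrative): a KEEP_LAST(n) reader cache retains only the n most recent samples per instance, while KEEP_ALL retains every sample in FIFO order.

```scala
object HistorySim {
  type Key = String

  // KEEP_LAST(n) semantics: per-instance cache bounded to the n most
  // recent samples (the last-value channel when n = 1).
  def keepLast[T](n: Int)(samples: Seq[(Key, T)]): Map[Key, Seq[T]] =
    samples.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).takeRight(n)) }

  // KEEP_ALL semantics: every sample is retained, in FIFO order.
  def keepAll[T](samples: Seq[(Key, T)]): Seq[T] =
    samples.map(_._2)

  def main(args: Array[String]): Unit = {
    val samples = Seq(("red", 1), ("red", 2), ("blue", 7), ("red", 3))
    println(keepLast(1)(samples)) // only the latest value per instance
    println(keepAll(samples))     // the full FIFO history
  }
}
```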
12. Membership
☐ We can think of a DDS Topic as defining a group
☐ The members of this group are matching DataReaders and DataWriters
☐ DDS’ dynamic discovery manages this group membership, however it provides a low level interface to group management and eventual consistency of views
☐ In addition, the group view provided by DDS makes available matched readers on the writer-side and matched writers on the reader-side
☐ This is not sufficient for certain distributed algorithms.
[Figure: DataWriter Group View and DataReader Group View for a Topic]
13. Fault-Detection
☐ DDS provides a built-in mechanism for detection of DataWriter faults through the LivelinessChangedStatus
☐ A writer is considered as having lost its liveliness if it has failed to assert it within its lease period
15. System Model
☐ Partially Synchronous
☐ After a Global Stabilization Time (GST) communication latencies are bounded, yet the bound is unknown
☐ Non-Byzantine Fail/Recovery
☐ Processes can fail and restart but don’t perform malicious actions
16. Programming Environment
☐ The algorithms that will be shown next are implemented on OpenSplice using the Escalier Scala API
☐ All algorithms are available as part of the Open Source project dada

OpenSplice | DDS
• #1 OMG DDS Implementation
• Open Source
• www.opensplice.org

Scala
• Fastest growing JVM Language
• Open Source
• www.scala-lang.org

Escalier
• Scala API for OpenSplice DDS
• Open Source
• github.com/kydos/escalier

dada
• DDS-based Advanced Distributed Algorithms Toolkit
• Open Source
• github.com/kydos/dada
18. Group Management
☐ A Group Management abstraction should provide the ability to join/leave a group, provide the current view and detect failures of group members
☐ Ideally group management should also provide the ability to elect leaders
☐ A Group Member should represent a process

abstract class Group {
  // Join/Leave API
  def join(mid: Int)
  def leave(mid: Int)

  // Group View API
  def size: Int
  def view: List[Int]
  def waitForViewSize(n: Int)
  def waitForViewSize(n: Int, timeout: Int)

  // Leader Election API
  def leader: Option[Int]
  def proposeLeader(mid: Int, lid: Int)

  // Reactions handling Group Events
  val reactions: Reactions
}

case class MemberJoin(val mid: Int)
case class MemberLeave(val mid: Int)
case class MemberFailure(mid: Int)
case class EpochChange(epoch: Long)
case class NewLeader(mid: Option[Int])
19. Topic Types
☐ To implement the Group abstraction with support for Leader Election it is sufficient to rely on the following topic types:

enum TMemberStatus {
  JOINED,
  LEFT,
  FAILED,
  SUSPECTED
};

struct TMemberInfo {
  long mid; // member-id
  TMemberStatus status;
};
#pragma keylist TMemberInfo mid

struct TEventualLeaderVote {
  long long epoch;
  long mid;
  long lid; // voted leader-id
};
#pragma keylist TEventualLeaderVote mid
20. Topics
Group Management
☐ The TMemberInfo topic is used to advertise presence and manage the members’ state transitions
Leader Election
☐ The TEventualLeaderVote topic is used to cast votes for leader election

This leads us to:
☐ Topic(name = MemberInfo, type = TMemberInfo,
    QoS = {Reliability.Reliable, History.KeepLast(1), Durability.TransientLocal})
☐ Topic(name = EventualLeaderVote, type = TEventualLeaderVote,
    QoS = {Reliability.Reliable, History.KeepLast(1), Durability.TransientLocal})
21. Observation
☐ Notice that we are using two Last-Value Channels for implementing both the (eventual) group management and the (eventual) leader election
☐ This makes it possible to:
☐ Let DDS provide our latest known state automatically thanks to the TransientLocal Durability
☐ Avoid periodically asserting our liveliness, as DDS will do that through our DataWriter
22. Leader Election
[Figure: timeline over epochs 0 to 3; M1, M2 and M0 join in turn, then M1 crashes; on each epoch the leader goes from None to M1 while M1 is alive, and from None to M0 afterwards]
☐ At the beginning of each epoch the leader is None
☐ On each new epoch a leader election algorithm is run
23. Distinguishing Groups
☐ To isolate the traffic generated by different groups, we use the group-id gid to name the partition in which all the group related traffic will take place
[Figure: partitions “1”, “2”, “3” inside a DDS Domain; the partition named “2” is associated to the group with gid=2]
24. Example [1/2]
☐ Events provide notification of group membership changes
☐ These events are handled by registering partial functions with the Group reactions

object GroupMember {
  def main(args: Array[String]) {
    if (args.length < 2) {
      println("USAGE: GroupMember <gid> <mid>")
      sys.exit(1)
    }
    val gid = args(0).toInt
    val mid = args(1).toInt
    val group = Group(gid)
    group.join(mid)

    val printGroupView = () => {
      print("Group["+ gid +"] = { ")
      group.view foreach(m => print(m + " "))
      println("}")
    }

    group.reactions += {
      case MemberFailure(mid) => {
        println("Member "+ mid +" Failed.")
        printGroupView()
      }
      case MemberJoin(mid) => {
        println("Member "+ mid +" Joined")
        printGroupView()
      }
      case MemberLeave(mid) => {
        println("Member "+ mid +" Left")
        printGroupView()
      }
    }
  }
}
25. Example [2/2]
☐ An eventual leader election algorithm can be implemented by simply casting a vote each time there is a group epoch change
☐ A Group Epoch change takes place each time there is a change in the group view
☐ The leader is eventually elected only if a majority of the processes currently in the view agree
☐ Otherwise the group leader is set to “None”

object EventualLeaderElection {
  def main(args: Array[String]) {
    if (args.length < 2) {
      println("USAGE: EventualLeaderElection <gid> <mid>")
      sys.exit(1)
    }
    val gid = args(0).toInt
    val mid = args(1).toInt
    val group = Group(gid)
    group.join(mid)

    group.reactions += {
      case EpochChange(e) => {
        val lid = group.view.min
        group.proposeLeader(mid, lid)
      }
      case NewLeader(l) => println(">> NewLeader = "+ l)
    }
  }
}
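The majority rule described above can be sketched as a pure function (plain Scala, no DDS; names are illustrative): given the current view and the votes cast during an epoch, a leader is returned only when a strict majority of the view agrees, otherwise None.

```scala
object LeaderTally {
  // votes: member-id -> voted leader-id, as cast during the current epoch.
  // A leader is elected only when a strict majority of the view agrees;
  // otherwise the leader stays None (as at the start of every epoch).
  def electedLeader(view: List[Int], votes: Map[Int, Int]): Option[Int] = {
    // Only votes from members of the current view count.
    val cast = votes.filter { case (m, _) => view.contains(m) }.values.toSeq
    val majority = view.size / 2 + 1
    cast.groupBy(identity)
        .collectFirst { case (lid, vs) if vs.size >= majority => lid }
  }
}
```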
27. Lamport’s Distributed Mutex
☐ A relatively simple Distributed Mutex Algorithm was proposed by Leslie Lamport as an example application of Lamport’s Logical Clocks
☐ The basic protocol (with the Agrawala optimization) works as follows (sketched):
☐ When a process needs to enter a critical section it sends a MUTEX request tagged with its current logical clock
☐ The process obtains the Mutex only when it has received ACKs from all the other processes in the group
☐ When a process receives a Mutex request it sends an ACK only if it does not have an outstanding Mutex request timestamped with a smaller logical clock
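The total order underlying the protocol is worth making explicit: requests are ordered by logical timestamp, with the member-id breaking ties, so every process ranks concurrent requests identically. A sketch in plain Scala (names are illustrative):

```scala
object MutexOrder {
  // A mutex request carries the requester's logical clock and member-id.
  case class Request(ts: Long, mid: Int)

  // Total order: timestamp first, member-id as tie-breaker. This makes
  // "smaller logical clock" unambiguous even for concurrent requests.
  implicit val ord: Ordering[Request] = Ordering.by((r: Request) => (r.ts, r.mid))

  // The order in which processes would be granted the critical section.
  def entryOrder(reqs: Seq[Request]): Seq[Request] = reqs.sorted
}
```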
28. Mutex Abstraction
☐ A base class defines the Mutex Protocol
☐ The Mutex companion object uses dependency injection to decide which concrete mutex implementation to use

abstract class Mutex {
  def acquire()
  def release()
}
29. Foundation Abstractions
☐ The mutual exclusion algorithm essentially requires:
☐ FIFO communication channels between group members
☐ Logical Clocks
☐ MutexRequest and MutexAck Messages
These needs now have to be translated in terms of topic types, topics, readers/writers and QoS settings
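The Logical Clocks requirement can be made concrete with a minimal Lamport clock (an illustrative sketch, not dada's own LogicalClock): local events increment the counter, and receiving a remote timestamp advances the clock to max(local, remote) + 1, which is exactly the update LCMutex performs below.

```scala
// Minimal Lamport logical clock (illustrative; not dada's LogicalClock).
case class LClock(ts: Long, mid: Int) {
  // A local event (e.g. issuing a request) ticks the clock.
  def inc(): LClock = LClock(ts + 1, mid)

  // Receiving a remote timestamp: jump past both clocks, so causally
  // later events always carry larger timestamps.
  def merge(remote: LClock): LClock = LClock(math.max(ts, remote.ts) + 1, mid)
}
```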
30. Topic Types
☐ For implementing the Mutual Exclusion Algorithm it is sufficient to define the following topic types:

struct TLogicalClock {
  long ts;
  long mid;
};
#pragma keylist TLogicalClock mid

struct TAck {
  long amid; // acknowledged member-id
  TLogicalClock ts;
};
#pragma keylist TAck ts.mid
31. Topics
We essentially need two topics:
☐ One topic for representing the Mutex Requests, and
☐ Another topic for representing the Acks

This leads us to:
☐ Topic(name = MutexRequest, type = TLogicalClock,
    QoS = {Reliability.Reliable, History.KeepAll})
☐ Topic(name = MutexAck, type = TAck,
    QoS = {Reliability.Reliable, History.KeepAll})
32. Show me the Code!
☐ All the algorithms presented were implemented using DDS and Scala
☐ Specifically, we’ve used the OpenSplice Escalier language mapping for Scala
☐ The resulting library has been baptized “dada” (DDS Advanced Distributed Algorithms) and is available under LGPL-v3
33. LCMutex
☐ The LCMutex is one of the possible Mutex protocols, implementing the Agrawala variation of the classical Lamport Algorithm

class LCMutex(val mid: Int, val gid: Int, val n: Int)(implicit val logger: Logger) extends Mutex {

  private var group = Group(gid)
  private var ts = LogicalClock(0, mid)
  private var receivedAcks = new AtomicLong(0)
  private var pendingRequests = new SynchronizedPriorityQueue[LogicalClock]()
  private var myRequest = LogicalClock.Infinite

  private val reqDW =
    DataWriter[TLogicalClock](LCMutex.groupPublisher(gid), LCMutex.mutexRequestTopic, LCMutex.dwQos)
  private val reqDR =
    DataReader[TLogicalClock](LCMutex.groupSubscriber(gid), LCMutex.mutexRequestTopic, LCMutex.drQos)
  private val ackDW =
    DataWriter[TAck](LCMutex.groupPublisher(gid), LCMutex.mutexAckTopic, LCMutex.dwQos)
  private val ackDR =
    DataReader[TAck](LCMutex.groupSubscriber(gid), LCMutex.mutexAckTopic, LCMutex.drQos)

  private val ackSemaphore = new Semaphore(0)
34. LCMutex.acquire

def acquire() {
  ts = ts.inc()
  myRequest = ts
  reqDW ! myRequest
  ackSemaphore.acquire()
}

Notice that as the LCMutex is single-threaded we can’t issue concurrent acquires.
35. LCMutex.release
def release() {
Copyright
2011,
PrismTech
–
All
Rights
Reserved.
myRequest = LogicalClock.Infinite
(pendingRequests dequeueAll) foreach { req => Notice that as the LCMutex
ts = ts inc() is single-threaded we can’t
ackDW ! new TAck(req.id, ts) issue a new request before
}
}
we release.
OpenSplice DDS
36. LCMutex.onACK

ackDR.reactions += {
  case DataAvailable(dr) => {
    // Count only the ACKs addressed to us
    val acks = ((ackDR take) filter (_.amid == mid))
    val k = acks.length
    if (k > 0) {
      // Set the local clock to max(tsi, tsj) + 1
      synchronized {
        val maxTs = math.max(ts.ts, (acks map (_.ts.ts)).max) + 1
        ts = LogicalClock(maxTs, ts.id)
      }
      val ra = receivedAcks.addAndGet(k)
      val groupSize = group.size
      // If we received sufficiently many ACKs we can enter our Mutex!
      if (ra == groupSize - 1) {
        receivedAcks.set(0)
        ackSemaphore.release()
      }
    }
  }
}
37. LCMutex.onReq
reqDR.reactions += {
case DataAvailable(dr) => {
val requests = (reqDR take) filterNot (_.mid == mid)
Copyright
2011,
PrismTech
–
All
Rights
Reserved.
if (requests.isEmpty == false ) {
synchronized {
val maxTs = math.max((requests map (_.ts)).max, ts.ts) + 1
ts = LogicalClock(maxTs, ts.id)
}
requests foreach (r => {
if (r < myRequest) {
OpenSplice DDS
ts = ts inc()
val ack = new TAck(r.mid, ts)
ackDW ! ack
None
}
else {
(pendingRequests find (_ == r)).getOrElse({
pendingRequests.enqueue(r)
r})
}
})
}
}
}
39. Distributed Queue Abstraction
A distributed queue is conceptually provide with the ability of
Copyright
2011,
PrismTech
–
All
Rights
Reserved.
☐
enqueueing and dequeueing elements
☐ Depending on the invariants that are guaranteed the distributed
OpenSplice DDS
queue implementation can be more or less efficient
☐ In what follows we’ll focus on a relaxed form of distributed queue,
called Eventual Queue, which while providing a relaxed yet very
useful semantics is amenable to high performance
implementations
40. Eventual Queue Specification
☐ Invariants
☐ All enqueued elements will be eventually dequeued
☐ Each element is dequeued once
☐ If the queue is empty a dequeue returns nothing
☐ If the queue is non-empty a dequeue might return something
☐ Elements might be dequeued in a different order than they are enqueued
[Figure: several DWs enqueueing into and DRs dequeueing from the Distributed Eventual Queue]
45. Eventual Queue Abstraction
☐ A Queue can be seen as the composition of two simpler data structures, a Dequeue and an Enqueue
☐ The Enqueue simply allows to add elements
☐ The Dequeue simply allows to get elements

trait Enqueue[T] {
  def enqueue(t: T)
}

trait Dequeue[T] {
  def dequeue(): Option[T]
  def sdequeue(): Option[T]
  def length: Int
  def isEmpty: Boolean = length == 0
}

trait Queue[T] extends Enqueue[T] with Dequeue[T]
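A trivial single-process implementation of these traits makes the contract concrete (a sketch: here sdequeue simply behaves like dequeue, whereas the distributed versions differ; the traits are re-declared so the snippet is self-contained):

```scala
trait Enqueue[T] { def enqueue(t: T): Unit }

trait Dequeue[T] {
  def dequeue(): Option[T]
  def sdequeue(): Option[T]
  def length: Int
  def isEmpty: Boolean = length == 0
}

trait Queue[T] extends Enqueue[T] with Dequeue[T]

// In-memory realization of the Queue contract.
class LocalQueue[T] extends Queue[T] {
  private val q = scala.collection.mutable.Queue[T]()
  def enqueue(t: T): Unit = q.enqueue(t)
  def dequeue(): Option[T] = if (q.isEmpty) None else Some(q.dequeue())
  def sdequeue(): Option[T] = dequeue() // no distribution, so identical here
  def length: Int = q.length
}
```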
46. Eventual Queue on DDS
☐ One approach to implementing the eventual queue on DDS is to keep a local queue on each of the consumers and to run a coordination algorithm to enforce the Eventual Queue Invariants
☐ The advantage of this approach is that the latency of the dequeue is minimized and the throughput of enqueues is maximized (we’ll see later that the latter is really a property of the eventual queue)
☐ The disadvantage, for some use cases, is that the consumers need to store the whole queue locally; thus this solution is mostly applicable to symmetric environments running on LANs
47. Eventual Queue Invariants & DDS
☐ All enqueued elements will be eventually dequeued
☐ Each element is dequeued once
☐ If the queue is empty a dequeue returns nothing
☐ If the queue is non-empty a dequeue might return something
☐ These invariants require that we implement a distributed protocol for ensuring that values are eventually picked up, and picked up only once!
☐ Elements might be dequeued in a different order than they are enqueued
48. Eventual Queue Invariants & DDS
☐ All enqueued elements will be eventually dequeued
Copyright
2011,
PrismTech
–
All
Rights
Reserved.
☐ If the queue is empty a dequeue returns nothing
☐ If the queue is non-empty a dequeue might return something
☐ Elements might be dequeued in a different order than they are
OpenSplice DDS
enqueued
☐ This essentially means that we can have different local order for the queue
elements on each consumer. Which in turns means that we can distribute
enqueued elements by simple DDS writes!
☐ The implication of this is that the enqueue operation is going to be as efficient as
a DDS write
☐ Finally, to ensure eventual consistency in presence of writer faults we’ll take
advantage of OpenSplice FT-Reliability!
49. Dequeue Protocol: General Idea
☐ A possible Dequeue protocol can be derived from the Lamport/Agrawala Distributed Mutual Exclusion Algorithm
☐ The general idea is similar, as we want to order dequeues as opposed to accesses to some critical section; however there are some important details to be sorted out to ensure that we really maintain the eventual queue invariants
☐ Key issues to be dealt with:
☐ DDS provides eventual consistency, thus we might have wildly different local views of the content of the queue (not just its order but the actual elements)
☐ Once a process has gained the right to dequeue, it has to be sure that it can pick an element that nobody else has picked just before. Then it has to ensure that, before it allows anybody else to pick a value, its choice is popped from all other local queues
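The crux of this idea, that serialized dequeues plus popping the chosen element from every local copy yield exactly-once dequeues even when local orders differ, can be simulated in-process (plain Scala, no DDS; the round-robin serialization stands in for the mutex-style coordination):

```scala
object DequeueOnceSim {
  import scala.collection.mutable.Buffer

  // locals: one copy of the queue per consumer, possibly in different
  // orders (DDS only guarantees eventual consistency of the content).
  // Dequeues are serialized round-robin (standing in for the mutex-like
  // protocol); each chosen element is popped from every local copy
  // before the next dequeue is allowed.
  def run(locals: Seq[Buffer[Int]]): Seq[Int] = {
    val dequeued = Buffer[Int]()
    var turn = 0
    while (locals.exists(_.nonEmpty)) {
      val q = locals(turn % locals.size)
      if (q.nonEmpty) {
        val e = q.remove(0) // this consumer's pick
        // Pop the chosen element from all other local copies.
        locals.foreach { l =>
          val i = l.indexOf(e)
          if (i >= 0) l.remove(i)
        }
        dequeued += e
      }
      turn += 1
    }
    dequeued.toSeq
  }
}
```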
50. Topic Types
struct TLogicalClock {
long long ts;
long mid;
To implement the Eventual Queue
};
☐
Copyright
2011,
PrismTech
–
All
Rights
Reserved.
over DDS we use three different enum TCommandKind {
DEQUEUE,
Topic Types ACK,
POP
};
☐ The TQueueCommand represents all
OpenSplice DDS
struct TQueueCommand {
the commands used by the TCommandKind kind;
long mid;
protocol (more later on this) };
TLogicalClock ts;
#pragma keylist TQueueCommand
☐ TQueueElement represents a
writer time-stamped queue
typedef sequence<octet> TData;
struct TQueueElement {
element TLogicalClock ts;
TData data;
};
#pragma keylist TQueueElement
51. Topics
To implement the Eventual Queue we need only two topics:
☐ One topic for representing the queue elements
☐ Another topic for representing all the protocol messages. Notice that the choice of using a single topic for all the protocol messages was carefully made to be able to ensure FIFO ordering between protocol messages
52. Topics
This leads us to:
☐ Topic(name = QueueElement, type = TQueueElement,
    QoS = {Reliability.Reliable, History.KeepAll})
☐ Topic(name = QueueCommand, type = TQueueCommand,
    QoS = {Reliability.Reliable, History.KeepAll})
54. Example: Producer
object MessageProducer {
def main(args: Array[String]) {
if (args.length < 4) {
println("USAGE:nt MessageProducer <mid> <gid> <n> <samples>")
Copyright
2011,
PrismTech
–
All
Rights
Reserved.
sys.exit(1)
}
val mid = args(0).toInt
val gid = args(1).toInt
val n = args(2).toInt
val samples = args(3).toInt
val group = Group(gid)
group.reactions += {
OpenSplice DDS
case MemberJoin(mid) => println("Joined M["+ mid +"]")
}
group.join(mid)
group.waitForViewSize(n)
val queue = Enqueue[String]("CounterQueue", mid, gid)
for (i <- 1 to samples) {
val msg = "MSG["+ mid +", "+ i +"]"
println(msg)
queue.enqueue(msg)
// Pace the write so that you can see what's going on
Thread.sleep(300)
}
}
}
55. Example: Consumer
object MessageConsumer {
def main(args: Array[String]) {
if (args.length < 4) {
println("USAGE:nt MessageProducer <mid> <gid> <readers-num> <n>")
sys.exit(1)
}
Copyright
2011,
PrismTech
–
All
Rights
Reserved.
val mid = args(0).toInt
val gid = args(1).toInt
val rn = args(2).toInt
val n = args(3).toInt
val group = Group(gid)
group.reactions += {
case MemberJoin(mid) => println("Joined M["+ mid +"]")
OpenSplice DDS
}
group.join(mid)
group.waitForViewSize(n)
val queue = Queue[String]("CounterQueue", mid, gid, rn)
val baseSleep = 1000
while (true) {
queue.sdequeue() match {
case Some(s) => println(Console.MAGENTA_B + s + Console.RESET)
case _ => println(Console.MAGENTA_B + "None" + Console.RESET)
}
val sleepTime = baseSleep + (math.random * baseSleep).toInt
Thread.sleep(sleepTime)
}
}
}
57. Fault-Detectors
☐ The algorithms presented so far can be easily extended to deal with failures by taking advantage of the group abstraction presented earlier
☐ The main issue to consider carefully is that if a timing assumption is violated, thus leading to falsely suspecting the crash of a process, the safety of some of those algorithms might be violated!
59. Paxos in Brief
☐ Paxos is a protocol for state-machine replication proposed by Leslie Lamport in his “The Part-Time Parliament”
☐ The Paxos protocol works under asynchrony -- to be precise, it is safe under asynchrony and makes progress under partial synchrony (both are not achievable under asynchrony due to FLP) -- and admits a crash/recovery failure mode
☐ Paxos requires some form of stable storage
☐ The theoretical specification of the protocol is very simple and elegant
☐ The practical implementations of the protocol have to fill in many hairy details...
60. Paxos in Brief
☐ The Paxos protocol considers three different kinds of agents (the same process can play multiple roles):
☐ Proposers
☐ Acceptors
☐ Learners
☐ To make progress the protocol requires that a proposer acts as the leader in issuing proposals to acceptors on behalf of clients
☐ The protocol is safe even if there are multiple leaders; in that case progress might be sacrificed
☐ This implies that Paxos can use an eventual leader election algorithm to decide the distinguished proposer
61. Paxos Synod Protocol
[Pseudocode from “Ring Paxos: A High-Throughput Atomic Broadcast Protocol”, DSN 2010. Notice that the pseudocode is not correct as it suffers from progress issues in several cases; however it illustrates the key idea of the Paxos Synod protocol]
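To make the message flow of the following slides concrete, here is a minimal in-memory single-decree Synod round (a sketch under strong assumptions: one leader, no message loss, no acceptor crash; the names mirror the pseudocode's c-rnd/rnd/v-rnd/v-val):

```scala
object SynodSim {
  // One acceptor's persistent state: highest round promised (rnd),
  // and the round/value it last accepted (vRnd, vVal).
  class Acceptor {
    var rnd: Long = 0L
    var vRnd: Long = 0L
    var vVal: Option[String] = None

    // Phase 1B: promise not to accept rounds below cRnd; report last vote.
    def phase1B(cRnd: Long): Option[(Long, Option[String])] =
      if (cRnd > rnd) { rnd = cRnd; Some((vRnd, vVal)) } else None

    // Phase 2B: accept the value if the promise still holds.
    def phase2B(cRnd: Long, cVal: String): Boolean =
      if (cRnd >= rnd) { rnd = cRnd; vRnd = cRnd; vVal = Some(cVal); true }
      else false
  }

  // A single round driven by the (assumed unique) leader.
  def propose(cRnd: Long, clientVal: String, acceptors: Seq[Acceptor]): Option[String] = {
    val majority = acceptors.size / 2 + 1
    // Phase 1A/1B: collect promises from a majority.
    val promises = acceptors.flatMap(_.phase1B(cRnd))
    if (promises.size < majority) None
    else {
      // Pick the value of the highest v-rnd reported, else the client's.
      val cVal = promises.maxBy(_._1)._2.getOrElse(clientVal)
      // Phase 2A/2B: the value is chosen once a majority accepts it.
      val accepted = acceptors.count(_.phase2B(cRnd, cVal))
      if (accepted >= majority) Some(cVal) else None
    }
  }
}
```

Note how the choice of `cVal` in phase 2 is what makes later rounds preserve an already-chosen value.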
62. Paxos in Action
[Figure: clients C1..Cn, proposers P1..Pk (one of them acting as Leader), acceptors A1..Am, learners L1..Lh]
63. Paxos in Action -- Phase 1A
[Figure: the leader sends phase1A(c-rnd) to the acceptors]
64. Paxos in Action -- Phase 1B
[Figure: the acceptors reply to the leader with phase1B(rnd, v-rnd, v-val)]
65. Paxos in Action -- Phase 2A
[Figure: the leader sends phase2A(c-rnd, c-val) to the acceptors]
66. Paxos in Action -- Phase 2B
[Figure: the acceptors answer with phase2B(v-rnd, v-val)]
67. Paxos in Action -- Decision
[Figure: Decision(v-val) is delivered to the learners]
68. Eventual Queue with Paxos
☐ The Eventual Queue we specified in the previous section can be implemented using an adaptation of the Paxos protocol
☐ In this case, consumers don’t cache the queue locally but leverage a mid-tier running the Paxos protocol to serve dequeues
[Figure: clients C1..Cn (acting as learners) interacting with a mid-tier of proposers P1..Pm and acceptors Ai implementing the Eventual Queue]
70. Concluding Remarks
☐ OpenSplice DDS provides a good foundation to effectively and efficiently express some of the most important distributed algorithms, e.g. DataWriter fault-detection and OpenSplice FT-Reliable Multicast
☐ dada provides access to reference implementations of many of the most important distributed algorithms
☐ It is implemented in Scala, but that means you can also use these libraries from Java!
71. References

OpenSplice | DDS
• #1 OMG DDS Implementation
• Open Source
• www.opensplice.org

Scala
• Fastest growing JVM Language
• Open Source
• www.scala-lang.org

Escalier
• Scala API for OpenSplice DDS
• Open Source
• github.com/kydos/escalier

simd-cxx
• Simple C++ API for DDS
• Open Source
• github.com/kydos/simd-cxx

simd-java
• DDS-PSM-Java for OpenSplice DDS
• Open Source
• github.com/kydos/simd-java

dada
• DDS-based Advanced Distributed Algorithms Toolkit
• Open Source
• github.com/kydos/dada
72. :: Connect with Us ::
• opensplice.com
• opensplice.org
• forums.opensplice.org
• opensplicedds@prismtech.com
• crc@prismtech.com
• sales@prismtech.com
• @acorsaro
• @prismtech
• youtube.com/opensplicetube
• slideshare.net/angelo.corsaro