PhD Thesis
Rolando da Silva Martins

On the Integration of Real-Time and Fault-Tolerance in P2P Middleware

Departamento de Ciência de Computadores
Faculdade de Ciências da Universidade do Porto

2012
Rolando da Silva Martins

On the Integration of Real-Time and Fault-Tolerance in P2P Middleware

Thesis submitted to the Faculdade de Ciências da Universidade do Porto for the degree of Doctor in Computer Science (Doutor em Ciência de Computadores)

Advisors: Prof. Fernando Silva and Prof. Luís Lopes

Departamento de Ciência de Computadores
Faculdade de Ciências da Universidade do Porto

May 2012
To my wife Liliana, for her endless love, support, and encouragement.
Acknowledgments

–Imagination is everything. It is the preview of life's coming attractions.
Albert Einstein
To my soul-mate Liliana, for her endless support in the best and worst of times. Her unconditional love and support helped me to overcome the most daunting adversities and challenges.
I would like to thank EFACEC, in particular Cipriano Lomba, Pedro Silva and Paulo Paixão, for their vision and support, which allowed me to pursue this Ph.D.
I would like to acknowledge the financial support from EFACEC, Sistemas de Engenharia, S.A. and FCT - Fundação para a Ciência e Tecnologia, through Ph.D. grant SFRH/BDE/15644/2006.
I would especially like to thank my advisors, Professors Luís Lopes and Fernando Silva, for their endless effort and teaching over the past four years. Luís, thank you for steering me when my mind entered a code frenzy, and for teaching me how to put my thoughts into words. Fernando, your keen eye is always able to see the "big picture"; this was vital to detecting and preventing the pitfalls of building large and complex middleware systems. To both of you, thank you for opening the door of CRACS to me. I had an incredible time working with you.
A huge thank you to Professor Priya Narasimhan, for acting as an unofficial advisor.
She opened the door of CMU to me and helped to shape my work at crucial stages.
Priya, I had a fantastic time brainstorming with you; each time I managed to learn something new and exciting. Thank you for sharing with me your insights on MEAD's architecture, and your knowledge of fault-tolerance and real-time.
Luís, Fernando and Priya, I hope someday to be able to repay your generosity and friendship. It is inspirational to see your passion for your work, and your continuous effort in helping others.
I would like to thank Jiaqi Tan for taking the time to explain to me the architecture and functionalities of MapReduce, and Professor Alysson Bessani for his thoughts on my work and for his insights on Byzantine failures and consensus protocols.
I would also like to thank the CRACS members, Professors Ricardo Rocha, Eduardo Correia, Vítor Costa, and Inês Dutra, for listening and sharing their thoughts on my work.
A big thank you to Hugo Ribeiro, for his crucial help with the experimental setup.
Abstract

–All is worthwhile if the soul is not small.
Fernando Pessoa
The development and management of large-scale information systems, such as high-speed transportation networks, are pushing the limits of the current state-of-the-art in middleware frameworks. These systems are not only subject to hardware failures, but also impose stringent constraints on the software used for management, and therefore on the underlying middleware framework. In particular, fulfilling the Quality-of-Service (QoS) demands of services in such systems requires simultaneous run-time support for Fault-Tolerance (FT) and Real-Time (RT) computing, a marriage that remains a challenge for current middleware frameworks. Fault-tolerance support is usually introduced in the form of expensive high-level services arranged in a client-server architecture. This approach is inadequate if one wishes to support real-time tasks, due to the expensive cross-layer communication and resource consumption involved.
In this thesis we design and implement Stheno, a general-purpose P2P middleware architecture. Stheno innovates by integrating both FT and soft-RT in the architecture by: (a) implementing FT support at a much lower level in the middleware, on top of a suitable network abstraction; (b) using the peer-to-peer mesh services to support FT; (c) supporting real-time services through a QoS daemon that manages the underlying kernel-level resource reservation infrastructure (CPU time); and (d) providing support for multi-core computing and traffic demultiplexing. Stheno is able to minimize the resource consumption and latencies introduced by the FT mechanisms and allows RT services to perform within QoS limits.
Stheno has a service-oriented architecture that does not limit the type of service that can be deployed in the middleware. In contrast, current middleware systems do not provide a flexible service framework, as their architectures are normally designed to support a specific application domain, for example, the Remote Procedure Call (RPC) service. Stheno is able to transparently deploy a new service within the infrastructure without user assistance. Using the P2P infrastructure, Stheno searches for and selects a suitable node to deploy the service with the specified QoS limits.
We thoroughly evaluate Stheno, namely the major overlay mechanisms, such as membership, discovery and service deployment, and the impact of FT on RT, with and without resource reservation, and compare it with other closely related middleware frameworks. Results show that Stheno is able to sustain RT performance while simultaneously providing FT support. The performance of the resource reservation infrastructure enabled Stheno to maintain this behavior even under heavy load.
Acronyms
API Application Programming Interface
BFT Byzantine Fault-Tolerance
CCM CORBA Component Model
CID Cell Identifier
CORBA Common Object Request Broker Architecture
COTS Commercial Off-The-Shelf
DBMS Database Management Systems
DDS Data Distribution Service
DHT Distributed Hash Table
DOC Distributed Object Computing
DRE Distributed Real-Time and Embedded
DSMS Data Stream Management Systems
EDF Earliest Deadline First
EM/EC Execution Model/Execution Context
FT Fault-Tolerance
IDL Interface Definition Language
IID Instance Identifier
IPC Inter-Process Communication
IaaS Infrastructure as a Service
J2SE Java 2 Standard Edition
JMS Java Messaging Service
JRTS Java Real-Time System
JVM Java Virtual Machine
JeOS Just Enough Operating System
KVM Kernel-based Virtual Machine
LFU Least Frequently Used
LRU Least Recently Used
LwCCM Lightweight CORBA Component Model
MOM Message-Oriented Middleware
NSIS Next Steps in Signaling
OID Object Identifier
OMA Object Management Architecture
OS Operating System
PID Peer Identifier
POSIX Portable Operating System Interface
PoL Place of Launch
QoS Quality-of-Service
RGID Replication Group Identifier
RMI Remote Method Invocation
RPC Remote Procedure Call
RSVP Resource Reservation Protocol
RTSJ Real-Time Specification for Java
RT Real-Time
SAP Service Access Point
SID Service Identifier
SLA Service Level Agreement
SSD Solid State Disk
TDMA Time Division Multiple Access
TSS Thread-Specific Storage
UUID Universally Unique Identifier
VM Virtual Machine
VoD Video on Demand
1
Introduction

–Most of the important things in the world have been accomplished by people who have kept trying when there seemed to be no hope at all.
Dale Carnegie
1.1 Motivation
The development and management of large-scale information systems are pushing the limits of the current state-of-the-art in middleware frameworks. At EFACEC¹, we have to handle a multitude of application domains, including: information systems used to manage public, high-speed transportation networks; automated power management systems to handle smart grids; and power supply systems to monitor power supply units through embedded sensors. Such systems typically transfer large amounts of streaming data; have erratic periods of extreme network activity; are subject to relatively common hardware failures, often for comparatively long periods; and require low jitter and fast response times for safety reasons, for example, in vehicle coordination.
Target Systems

The main motivation for this PhD thesis was the need to address the requirements of the public transportation solutions at EFACEC, more specifically, the light-train systems. One such system is deployed in Oporto's light-train network and is composed of 5 lines, 70 stations, and approximately 200 sensors (partially illustrated in Figure 1.1). Each station is managed by a computational node, which we designate as a peer, responsible for managing all the local audio, video, and display panels, and low-level sensors, such as track sensors for detecting inbound and outbound trains.
¹EFACEC, the largest Portuguese group in the field of electricity, with a strong presence in systems engineering, namely in public transportation and energy systems, employs around 3000 people and has a turnover of almost 1000 million euros; it is established in more than 50 countries and exports almost half of its production (cf. http://www.efacec.com).
The system supports three types of traffic: normal, for regular operations over the system, such as playing an audio message in a station through an audio codec; critical, medium-priority traffic comprised of urgent events, such as an equipment malfunction notification; and alarms, high-priority traffic that signals critical events, such as low-level sensor events. Independently of the traffic type (e.g., event, RPC operation), the system requires that any operation complete within 2 seconds.
From the point of view of distributed architectures, the current deployments would be best matched by P2P infrastructures that are resilient, that allow resources (e.g., a sensor connected through a serial link to a peer) to be seamlessly mapped to the logical topology, the mesh, and that also provide support for real-time (RT) and fault-tolerant (FT) services. Support for both RT and FT is fundamental to meet system requirements. Moreover, the next generation of light-train solutions requires deployments across cities and regions that can be overwhelmingly large. This introduces the need for a scalable hierarchical abstraction, the cell, composed of several peers that cooperate to maintain a portion of the mesh.
Figure 1.1: Oporto’s light-train network.
1.2 Challenges and Opportunities
The requirements of our target systems pose a significant number of challenges. The presence of FT mechanisms, especially those using space redundancy [1], introduces the need for multiple copies of the same resource (replicas), and these, in turn, ultimately lead to greater resource consumption.
FT also introduces overhead in the form of latency, another important constraint when dealing with RT systems. When an operation is performed, irrespective of whether it is real-time or not, any state change it causes must be propagated among the replicas through a replication algorithm, which introduces an additional source of latency. Furthermore, the recovery time, that is, the time the system needs to recover from a fault, is an additional source of latency for real-time operations. There are well-known replication styles that offer different trade-offs between state consistency and latency.
Our target systems have different traffic types with distinct deadline requirements that must be supported while using Commercial Off-The-Shelf (COTS) hardware (e.g., Ethernet networking) and software (e.g., Linux). This requires that the RT mechanisms leverage the available resources, through resource reservation, while providing different threading strategies that allow different trade-offs between latency and throughput.
To overcome the overhead introduced by the FT mechanisms, it must be possible to employ a replication algorithm that does not compromise the RT requirements. Replication algorithms that offer a higher degree of consistency introduce higher latency [1, 2], which may be prohibitive for certain traffic types. On the other hand, certain replication algorithms exhibit lower resource consumption and latency at the expense of a longer recovery time, which may also be prohibitive.
Considering current state-of-the-art research, we see many opportunities to address these challenges. One is the use of a COTS operating system, which allows for faster implementation, and thus a smaller development cost, while offering the necessary infrastructure on which to build a new middleware system.
P2P networks can be used to provide a resilient infrastructure that mirrors the physical deployments of our target systems; furthermore, different P2P topologies offer different trade-offs between self-healing, resource consumption, and latency in end-to-end operations. Moreover, by directly implementing FT on the P2P infrastructure, we hope to lower resource usage and latency enough to allow the integration of RT. By using proven replication algorithms [1, 2] that offer well-known trade-offs regarding consistency, resource consumption and latency, we can focus on the actual problem of integrating real-time and fault-tolerance within a P2P infrastructure.
On the other hand, RT support can be achieved through the implementation of different threading strategies, resource reservation (through Linux Control Groups), and the avoidance of traffic multiplexing through the use of different access points to handle different traffic priorities. While the use of Earliest Deadline First (EDF) scheduling would provide greater RT guarantees, this goal will not be pursued due to the lack of maturity of the current EDF implementations in Linux (our reference COTS operating system). Because we are limited to priority-based scheduling and resource reservation, we can only partially support our goal of providing end-to-end guarantees; more specifically, we enhance our RT guarantees through the use of RT scheduling policies with over-provisioning to ensure that deadlines are met.
1.3 Problem Definition
The work presented in this thesis focuses on the integration of Real-Time (RT) and Fault-Tolerance (FT) in a scalable general-purpose middleware system. This goal can only be achieved if the following premises are valid: (a) the FT infrastructure cannot interfere with RT behavior, independently of the replication policy; (b) the network model must be able to scale; and (c) ultimately, the FT mechanisms need to be efficient and aware of the underlying infrastructure, i.e., network model, operating system and physical environment.
Our problem definition is a direct consequence of the requirements of our target systems, and it can be summarized in the following question: "Can we opportunistically leverage and integrate these proven strategies to simultaneously support soft-RT and FT to meet the needs of our target systems even under faulty conditions?"
In this thesis we argue that a lightweight implementation of fault-tolerance mechanisms
in a middleware is fundamental for its successful integration with soft real-time support.
Our approach is novel in that it explores peer-to-peer networking as a means to implement generic, transparent, lightweight fault-tolerance support. We do this by directly embedding fault-tolerance mechanisms into peer-to-peer overlays, taking advantage of their scalable, decentralized and resilient nature. For example, peer-to-peer networks readily provide the functionality required to maintain and locate redundant copies of resources. Given their dynamic and adaptive nature, they are promising infrastructures for developing lightweight fault-tolerant and soft real-time middleware.
Despite these a priori advantages, mainstream generic peer-to-peer middleware systems for QoS computing are, to our knowledge, unavailable. Motivated by this state of affairs, by the limitations of the current infrastructure for the information system we are managing at EFACEC (based on CORBA technology) and, last but not least, by the comparative advantages of flexible peer-to-peer network architectures, we have designed and implemented a prototype service-oriented peer-to-peer middleware framework.
The networking layer relies on a modular infrastructure that can handle multiple peer-to-peer overlays. The support for fault-tolerance and soft real-time features is provided at this level through the implementation of efficient and resilient services for, e.g., resource discovery, messaging and routing. The kernel of the middleware system (the runtime) is implemented on top of these overlays and uses the above-mentioned peer-to-peer functionalities to provide developers with APIs for the customization of QoS policies for services (e.g., bandwidth reservation, CPU/core reservation, scheduling strategy, number of replicas). This approach was inspired by that of TAO [3], which allows distinct strategies for the execution of tasks by threads to be defined.
1.4 Assumptions and Non-Goals
The distributed model used in this thesis is based on a partially asynchronous computing model, as defined in [2], extended with fault detectors.
The services and the P2P plugin implemented in this thesis only support crash failures. We consider a crash failure [1] to be characterized as a complete shutdown of a computing instance in the event of a failure, ceasing to interact any further with the remaining entities of the distributed system.
Timing faults are handled differently by services and by the P2P plugin. In our service implementations a timing fault is logged (for analysis) with no further action, whereas in the P2P layer we treat a timing fault as a crash failure, i.e., if the remote creation of a service exceeds its deadline, the peer is considered crashed. This method is also known as process controlled crash, or crash control, as defined in [4]. In this thesis, we adopt a more relaxed version: if a peer is wrongly suspected of having crashed, it is not killed and does not commit suicide; instead it is shunned, that is, expelled from the overlay and forced to rejoin it; more precisely, it must rebind using the membership service in the P2P layer.
The fault model used was motivated by the author's experience with several field deployments of light-train transportation systems, such as the Oporto, Dublin and Tenerife Light Rail solutions [5]. Due to the use of highly redundant hardware, such as redundant power supplies and redundant 10-Gbit network ring links, network failures tend to be short. The most common cause of downtime is software bugs, which mostly result in a crashing computing node. While simultaneous failures can happen, they are considered rare events.
We also assume that the resource-reservation mechanisms are always available.
In this thesis we do not address value faults and Byzantine faults, as they are not a requirement for our target systems. Furthermore, we do not provide a formal specification and verification of the system. While this would be beneficial to assess system correctness, we had to limit the scope of this thesis. Nevertheless, we provide an empirical evaluation of the system.
We also do not address hard real-time because of the lack of mature support for EDF scheduling in the Linux kernel. Furthermore, we do not provide a fully optimized implementation, but only a proof-of-concept to validate our approach. Testing the system in a production environment is left for future work.
1.5 Contributions
Before undertaking the task of building an entirely new middleware system from scratch, we explored current solutions, presented in Chapter 2, to see if any of them could support the requirements of our target system. As we did not find any suitable solution, we then assessed whether it was possible to extend an available solution to meet
those requirements. In our previous work, DAEM [6], we explored the use of JGroups [7] within a hierarchical P2P mesh, and concluded that the simultaneous support for real-time, fault-tolerance and P2P requires fine-grained control of resources that is not possible with the use of "black-box" solutions; for example, it is impossible to have out-of-the-box support for resource reservation in JGroups.
Given these assessments, we designed and implemented Stheno, which, to the best of our knowledge, is the first middleware system to seamlessly integrate fault-tolerance and real-time in a peer-to-peer infrastructure. Our approach was motivated by the lack of support in current solutions for the timing, reliability and physical deployment characteristics of our target systems.
To that end, a complete architectural design is proposed that addresses all levels of the software stack, including kernel space, network, runtime and services, to achieve a seamless integration. The list of contributions includes: (a) a full specification of a user Application Programming Interface (API); (b) a pluggable P2P network infrastructure that can better adjust to the target application; (c) support for configurable FT in the P2P layer, with the goal of providing lightweight FT mechanisms that fully enable RT behavior; and (d) the integration of resource reservation at all levels of the runtime, enabling (partial) end-to-end Quality-of-Service (QoS) guarantees.
Previous work [8, 9, 10] on resource reservation focused solely on CPU provisioning for real-time systems. In this thesis we present Euryale, a network-oriented QoS framework that features resource reservation with support for a broader range of subsystems, including CPU, memory, I/O and network bandwidth, on a general-purpose operating system such as Linux. At the heart of this infrastructure resides Medusa, a QoS daemon that handles the admission and management of QoS requests.
Well-known threading strategies, such as Leader-Followers [11], Thread-per-Connection [12] and Thread-per-Request [13], offer different trade-offs between latency and resource usage [3, 14]. However, they do not support resource reservation, namely CPU partitioning. To overcome this limitation, this thesis provides an additional contribution with the introduction of a novel design pattern (Chapter 4) that is able to integrate multi-core computing with resource reservation within a configurable framework that supports these well-known threading strategies. For example, when a client connects to a service it can specify, through the QoS real-time parameters, the particular threading strategy that best meets its requirements.
We present a full implementation that covers all the aforementioned architectural features, including a complete overlay implementation, inspired by the P3 [15] topology, that seamlessly integrates RT and FT.
To evaluate our implementation and justify our claims, we present a complete evaluation of both mechanisms. The impact of the resource reservation mechanism is also evaluated, as well as a comparative evaluation of RT performance against state-of-the-art middleware systems. The experimental results show that Stheno meets and exceeds the target system requirements for end-to-end latency and fail-over latency.
1.6 Thesis Outline
The focus of this thesis is on the design, implementation and evaluation of a scalable general-purpose middleware that provides a seamless integration of RT and FT. The remainder of this thesis is organized as follows.
Chapter 2: Overview of Related Work.
This chapter presents an overview of related middleware systems that exhibit support for RT, FT and P2P, the mandatory requirements of our target system. We started by searching for an available off-the-shelf solution that could support all of these requirements or, in its absence, by identifying a current solution that could be extended, in order to avoid creating a new middleware solution from scratch.
Chapter 3: Architecture.
Chapter 3 describes the runtime architecture of the proposed middleware. We start by providing a detailed insight into the architecture, covering all layers present in the runtime. Special attention is given to the presentation of the QoS and resource reservation infrastructure. This is followed by an overview of the programming model that describes the most important interfaces present in the runtime, as well as the interactions that occur between them. The chapter ends with a description of the fundamental runtime operations, namely: the creation of services with and without FT support, the deployment strategy, and client creation.
Chapter 4: Implementation.
Chapter 4 describes the implementation of a prototype based on the aforementioned architecture, and is divided into four parts. In the first part, we present a complete implementation of a P2P overlay inspired by the P3 [15] topology, while providing some insight into the limitations of the current prototype. The second part of this chapter focuses on the implementation of three types of user services, namely, Remote Procedure Call (RPC), Actuator, and Streaming. These services are thoroughly evaluated in Chapter 5. In the third part, we describe our support for multi-core computing, through the presentation of a novel design pattern, the Execution Model/Context. This design pattern is able to integrate resource reservation, especially CPU partitioning, with different well-known (and configurable) threading strategies. The fourth and final part of this chapter describes the most relevant parameters used in the bootstrap of the runtime.
Chapter 5: Evaluation.
The experimental results are presented in this chapter. It starts by providing details of the physical setup used throughout the evaluation. It then describes the parameters used in the testbed suite, which is composed of the three services previously described in Chapter 4. We then focus on presenting the results of the benchmarks, including an assessment of the impact of FT on RT, and of the impact of the resource reservation infrastructure on overall performance. The chapter ends with a comparative evaluation against well-known middleware systems.
Chapter 6: Conclusion and Future Work.
This last chapter presents the concluding remarks. It highlights the contributions of the proposed and implemented middleware, and provides directions for future work.
2
Overview of Related Work

–By failing to prepare, you are preparing to fail.
Benjamin Franklin
2.1 Overview
This chapter presents an overview of the state-of-the-art in related middleware systems. As illustrated in Figure 2.1, we are mostly interested in systems that exhibit support for real-time (RT), fault-tolerance (FT) and peer-to-peer (P2P), the mandatory requirements of our target system. We started by searching for an available off-the-shelf solution that could support all of these requirements or, in its absence, by identifying a current solution that could be extended, and thus avoid the creation of a new middleware solution from the ground up. For that reason, we have focused on the intersecting domains, namely, RT+FT, RT+P2P and FT+P2P, since the systems contained in these domains are closest to meeting the requirements of our target system.
From a historical perspective, the origins of modern middleware systems can be traced back to the 1980s, with the introduction of the concept of ubiquitous computing, in which computational resources are accessible and seen as ordinary commodities, such as electricity or tap water [2]. The interaction between these resources and the users was governed by the client-server model [16] and a supporting protocol called RPC [17]. The client-server model is still the most prevalent paradigm in current distributed systems.
An important architecture for client-server systems was introduced with the Common
Object Request Broker Architecture (CORBA) standard [18] in the 1990s, but it did not
address real-time or fault-tolerance. Only recently were both the real-time and fault-tolerance specifications finalized, but they remain mutually exclusive: a system supporting the real-time specification will not be able to support the fault-tolerance specification, and vice-versa.

Figure 2.1: Middleware system classes.

Nevertheless, seminal work has already addressed these limitations and offered systems supporting both features, namely, TAO [3] and MEAD [14]. At the same time, Remote Method Invocation (RMI) [19] appeared as a Java alternative capable of providing a more flexible and easy-to-use environment.
In recent years, CORBA entered a steady decline [20] in favor of web-oriented platforms, such as J2EE [21], .NET [22] and SOAP [23], and of P2P systems. The web-oriented platforms, such as the JBoss [24] application server, aim to integrate availability with scalability, but they remain unable to support real-time. Moreover, while partitioning offers a clean approach to improving scalability, it fails to support large-scale distributed systems [2]. Alternatively, P2P systems focus on providing logical organizations, i.e., meshes, that abstract the underlying physical deployment while providing a decentralized architecture for increased resiliency. These systems focused initially on resilient distributed storage solutions, such as Dynamo [25], but progressively evolved to support soft real-time systems, such as video streaming [26].
More recently, Message-Oriented Middleware (MOM) systems [27] offer a distributed message-passing infrastructure based on an asynchronous interaction model that is able to overcome the scaling issues present in RPC. A considerable number of implementations exists, including Tibco [28], Websphere MQ [29] and the Java Messaging Service (JMS) [30]. MOM systems are sometimes integrated as subsystems in application server infrastructures, such as JMS in J2EE and Websphere MQ in the Websphere Application Server.
A substantial body of research has focused on the integration of real-time within
CORBA-based middleware, such as TAO [3] (that later addressed the integration of
fault-tolerance). More recently, QoS-enabled publish-subscribe middleware systems
based on the JAIN SLEE specification [31], such as Mobicents [32], and on the Data
Distribution Service (DDS) specification, such as OpenDDS [33], Connext DDS [34]
and OpenSplice [35], appeared as a way to overcome the current lack of support for
real-time applications in SOA-based middleware systems.
The introduction of fault-tolerance in middleware systems also remains an active topic of research. CORBA-based middleware systems were a fertile ground for testing fault-tolerance techniques in a general-purpose platform, resulting in the creation of the CORBA-FT specification [36]. Nowadays, some of this focus has been redirected to SOA-based platforms, such as J2EE. One of the most popular deployments, JBoss, supports scalability and availability through partitioning. Each partition is supported by a group communication framework based on the virtual synchrony model, more specifically, the JGroups [7] group communication framework.
2.2 RT+FT Middleware Systems
This section overviews systems that provide simultaneous support for real-time and
fault-tolerance. These systems are divided into special purposed solutions, designed for
specific application domains, and CORBA-based solutions, aimed for general purposed
computing.
2.2.1 Special Purpose RT+FT Systems
Special-purpose real-time fault-tolerant systems introduced concepts and implementation strategies that are still relevant in current state-of-the-art middleware systems.
Armada
Armada [37] focused on providing middleware services and a communication infrastructure
to support FT and RT semantics for distributed real-time systems. This was
pursued in two ways, which we now describe.
The first contribution was the introduction of a communication infrastructure able
to provide end-to-end QoS guarantees, for both unicast and multicast primitives.
This was supported by control signaling and QoS-sensitive data transfer (as in the
newer Resource Reservation Protocol (RSVP) and Next Steps in Signaling (NSIS)).
CHAPTER 2. OVERVIEW OF RELATED WORK
The network infrastructure used a reservation mechanism based on an EDF scheduling
policy built on top of the Mach OS priority-based scheduling. The initial
implementation was done at user level but was subsequently migrated to kernel
level with the goal of reducing latency.
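The reservation mechanism combines EDF ordering with the OS priority scheduler. The core of EDF dispatch, always selecting the ready task with the nearest absolute deadline, can be sketched as follows (the task tuples are illustrative, not Armada's actual interface):

```python
import heapq

def edf_schedule(tasks):
    """Return task names in earliest-deadline-first order.

    tasks: list of (name, absolute_deadline) tuples; the dispatcher
    always runs the ready task whose deadline is closest.
    """
    heap = [(deadline, name) for name, deadline in tasks]
    heapq.heapify(heap)
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        order.append(name)
    return order

print(edf_schedule([("sensor", 30), ("control", 10), ("log", 100)]))
# → ['control', 'sensor', 'log']
```

Under EDF, a schedulability test can admit new reservations at runtime as long as total utilization stays below the bound, which is what makes it attractive for reservation-based infrastructures.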
Many of the architectural decisions regarding RT support were based on the operating
systems available at the time, mainly Mach OS. Despite the advantages of a micro-
kernel approach, its application remains restricted by the underlying cost associated
with message passing and context switching. Instead, a large body of research has focused
on monolithic kernels, especially Linux, which are able to offer the advantages of
the micro-kernel approach, through the introduction of kernel modules, together with the speed
of monolithic kernels.
The second contribution came in the form of a group communication infrastructure
based on a ring topology that ensured the delivery of messages reliably and in total
order within a bounded time. It also supported membership management,
offering consistent views of the group through the detection of process and communication
failures. These group communication mechanisms enabled support for FT
through the use of a passive replication scheme that allowed for some inconsistencies
between the primary and the replicas: the states of the replicas could lag behind
the state of the primary up to a bounded time window.
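The bounded-staleness idea can be made concrete with a small sketch: the primary accepts updates freely until some backup would fall behind by more than a bound, at which point it must wait for acknowledgments. The discrete update-count bound below stands in for Armada's bounded time window; the class and method names are illustrative.

```python
class BoundedLagPrimary:
    """Primary of a passive replication group whose backups may lag by at
    most `max_lag` unacknowledged updates (a discrete stand-in for a
    bounded time window)."""

    def __init__(self, backups, max_lag):
        self.state, self.version = {}, 0
        self.max_lag = max_lag
        self.acked = {b: 0 for b in backups}  # last version applied per backup

    def update(self, key, value):
        # Refuse (in a real system: block) if any backup would fall too far behind.
        if any(self.version - v >= self.max_lag for v in self.acked.values()):
            raise RuntimeError("backup lag exceeds bound")
        self.version += 1
        self.state[key] = value
        return self.version

    def ack(self, backup, version):
        # A backup reports that it has applied the state up to `version`.
        self.acked[backup] = version
```

With `max_lag=2`, the primary accepts two updates before an acknowledgment becomes mandatory; this is the essence of trading replica freshness for primary-side latency.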
Mars
Mars [38] provided support for the analysis and deployment of synchronous hard real-
time systems through a static off-line scheduler for the CPU and a Time Division Multiple
Access (TDMA) bus. Mars is able to offer FT support through the use of active
redundancy on the TDMA bus, i.e. sending multiple copies of the same message, and
self-checking mechanisms. Deterministic communication is achieved through the use
of a time-triggered protocol.
The project focused on RT process control, where all the intervening entities are
known in advance. Consequently, it does not offer any support for the dynamic admission
of new components, nor does it support on-the-fly fault recovery.
ROAFTS
The ROAFTS [39, 40] system aims to provide transparent adaptive FT support for
distributed RT applications, consisting of a network of Time-triggered Message-triggered
Objects [41] (TMOs), whose execution is managed by a TMO support manager. The
FT infrastructure consists of a set of specialized TMOs, which include: (a) a generic
fault server; and (b) a network surveillance [42] manager. Fault detection is assured
by the network surveillance TMO, and used by the generic fault server to change the
FT policy with the goal of preserving RT semantics. The system assumes that RT
applications can live with weaker reliability assurances from the middleware under
highly dynamic environments.
Maruti
Maruti [43] aimed to provide a development framework and an infrastructure for the
deployment of hard real-time applications within a reactive environment, focusing on
real-time requirements on a single-processor system. The reactive model is able to
make runtime decisions on the admission of new processing requests without producing
adverse effects on the scheduling of existing requests. Fault-tolerance is achieved by
redundant computation. A configuration language allows the deployment of replicated
modules and services.
Delta-4
Delta-4 [44] provided an in-depth characterization of fault assumptions, for both the
host and the network. It also demonstrated various techniques for handling them,
namely, passive and active replication for fail-silent hosts and byzantine agreement for
fail-uncontrolled hosts. This work was followed by the Delta-4 Extra Performance
Architecture (XPA) [45], which aimed to provide real-time support to the Delta-4 framework
through the introduction of the Leader/Follower replication model (better known as
semi-active replication) for fail-silent hosts. This work also led to the extension of the
communication system to support additional communication primitives (the original
work on Delta-4 only supported the Atomic primitive), namely, Reliable, AtLeastN, and
AtLeastTo.
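Semi-active replication sits between passive and active replication: the leader alone executes requests and makes every non-deterministic choice, while followers deterministically replay its decisions, so fail-over requires no state transfer. A minimal sketch of the idea (the request and state shapes are illustrative, not Delta-4's wire protocol):

```python
class Follower:
    """Replays the leader's decisions in order; never answers clients."""
    def __init__(self):
        self.state = 0

    def replay(self, request):
        self.state += request


class Leader:
    """Executes requests and pushes each decision to the followers
    before replying, keeping their state synchronized."""
    def __init__(self, followers):
        self.followers = followers
        self.state = 0

    def handle(self, request):
        self.state += request       # the actual computation
        for f in self.followers:    # forward the decision (synchronously here)
            f.replay(request)
        return self.state
```

Because followers track the leader step by step, any of them can take over immediately if the leader fails, which is what makes the model attractive for combining FT with RT deadlines.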
2.2.2 CORBA-based RT+FT Systems
The support for RT and FT in general-purpose distributed platforms remains mostly
restricted to CORBA. While some work was carried out by Sun to introduce RT support
for Java, with the introduction of the Real-Time Specification for Java (RTSJ) [46,
47], it was aimed at the Java 2 Standard Edition (J2SE). The most relevant implementations
are Sun's Java Real-Time System (JRTS) [48] and IBM's Websphere Real-Time
VM [49, 50]. To the best of our knowledge, only WebLogic Real-Time [51] attempted
to provide support for RT in a J2EE environment. Nevertheless, this support seems to
be confined to the introduction of a deterministic garbage collector, through the use of
the RT JRockit JVM, as a way to prevent unpredictable pause times caused by garbage
collection [51].
Previous work on the integration of RT and FT in the context of CORBA systems can be
categorized into three distinct approaches: (a) integration, where the base ORB is modified;
(b) services, where systems rely on high-level services to provide FT (and indirectly, RT);
and (c) interception, where systems intercept client requests to provide
transparent FT and RT.
Integration Approach
Past work on the integration of fault-tolerance in CORBA-like systems was done in
Electra [52], Maestro [53] and AQuA [54]. Electra [52] was one of the predecessors of
the CORBA-FT standard [55, 36], and it focused on enhancing the Object Management
Architecture (OMA) to support transparent and non-transparent fault-tolerance
capabilities. Instead of using message queues or transaction monitors [56], it relied on
object-communication groups [57, 58]. Maestro [53] is a distributed layer built on top of
the Ensemble [59] group communication toolkit, which was used by Electra [52] in the Quality
of Service for CORBA Objects (QuO) project [60]. Its main focus was to provide an
efficient, extensible and non-disruptive integration of the object layers with the low-
level QoS system properties. The AQuA [54] system uses both QuO and Maestro on
top of the Ensemble communication groups to provide a flexible and modular approach
that is able to adapt to faults and changes in the application requirements. Within its
framework, a QuO runtime accepts availability requests from the application and relays
them to a dependability manager, which is responsible for arbitrating the requests from
multiple QuO runtimes.
TAO+QuO
The work in [61] focused on the integration of QoS mechanisms, for both CPU and
network resources, supporting both priority- and reservation-based QoS semantics,
with standard COTS Distributed Real-Time and Embedded (DRE) middleware, more
precisely, TAO [3]. The underlying QoS infrastructure was provided by QuO [60]. The
priority-based approach was built on top of the RT-CORBA specification, which defined
a set of standard features in order to provide end-to-end predictability for operations
within a fixed-priority context [62]. CPU priority-based resource management is
left to the scheduler of the underlying Operating System (OS), whereas network
priority-based management is achieved through the use of the DiffServ architecture [63],
by setting the DSCP field in the IP header of GIOP requests. Based on
various factors, the QuO runtime can dynamically change this priority to adjust to
environment changes. Alternatively, the network reservation-based approach relies on
the RSVP [64] signaling protocol to guarantee the desired network bandwidth between
hosts. The QuO runtime monitors the RSVP connections and makes adjustments to
overcome abnormal conditions; for example, in a video service it can drop frames to
maintain stability. CPU reservation is made using the reservation mechanisms present
in the TimeSys Linux kernel. It is left to TAO and QuO to decide on the reservation
policies. This was done to preserve the end-to-end QoS semantics that are only available
at a higher level of the middleware.
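Marking traffic for DiffServ amounts to writing the six DSCP bits into the TOS byte of outgoing IP packets; on a POSIX socket this is a single `setsockopt` call. A minimal sketch (the DSCP value 46, Expedited Forwarding, is chosen purely for illustration):

```python
import socket

def mark_dscp(sock, dscp):
    # The DSCP occupies the upper six bits of the former IP TOS byte.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
mark_dscp(sock, 46)  # EF: low-loss, low-latency forwarding class
print(sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS))  # → 184
sock.close()
```

A middleware runtime can re-issue this call to re-prioritize an established connection, which is essentially how the QuO runtime adjusts priorities dynamically.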
CIAO+QuO
CIAO [65] is a QoS-aware CORBA Component Model (CCM) implementation built on
top of TAO [3] that aims to alleviate the complexity of integrating real-time features in
DRE systems using Distributed Object Computing (DOC) middleware. These DOC systems,
of which TAO is an example, offer configurable policies and mechanisms for QoS,
namely real-time, but lack a programming model capable of separating systemic
aspects from application logic. Furthermore, QoS provisioning must be done in an
end-to-end fashion, and thus has to be applied to several interacting components. It
is difficult, or nearly impossible, to properly configure a component without taking
into account the QoS semantics of the interacting entities. Developers using standard
DOC middleware systems are thus prone to produce misconfigurations that cause
overall system misbehavior. CIAO overcomes these limitations by applying a wide
range of aspect-oriented development techniques that support the composition of real-
time semantics without intertwining configuration concerns. The support for CIAO's
CCM architecture was done in CORFU [66] and is described below.
Work on the integration of CIAO with Quality Objects (QuO) [60] was done in [67].
The integration of QuO's infrastructure into CIAO extended its limited static QoS
provisioning into a full provisioning middleware that is also able to accommodate dynamic
and adaptive QoS provisioning. Without it, the setup of an RSVP [64] connection
would require explicit configuration by the developer, defeating the purpose of
CIAO. Nevertheless, while CIAO is able to compose QuO components, i.e. Qoskets [68], it
does not provide a solution for component cross-cutting.
DynamicTAO
DynamicTAO [69] focused on providing a reflective middleware that extends
TAO to support on-the-fly dynamic reconfiguration of its component behavior and
resource management through meta-interfaces. It allows the application to inspect
the internal state/configuration and, if necessary, to reconfigure it in order to adapt
to environment changes. Consequently, it is possible to select networking protocols,
encodings and security policies to improve the overall system performance in the presence
of unexpected events.
Service-based Approach
An alternative, high-level service approach to CORBA fault-tolerance was taken by the
Distributed Object-Oriented Reliable Service (DOORS) [70], the Object Group Service
(OGS) [71], and the Newtop Object Group Service [72]. DOORS focused on providing
replica management, fault detection and fault recovery as a CORBA high-level service.
It did not use group communication and mainly focused on passive replication, but allowed
the developer to select the desired level of reliability (number of replicas), the replication
policy, the fault-detection mechanism, e.g. SNMP-enhanced fault detection, and the recovery
strategy. OGS improved over prior approaches by using a group communication protocol
that imposes consensus semantics. Instead of adopting an integrated approach, group
communication services are kept transparent to the ORB through request-level
bridging. Newtop followed a similar approach to OGS but added support
for network partitions, allowing newly formed sub-groups to continue to operate.
TAO
TAO [3] is a CORBA middleware with support for RT and FT that is
compliant with the OMG standards for CORBA-RT [73] and CORBA-FT [36]. The
support for RT includes priority propagation, explicit binding, and RT thread pools.
FT is supported through the use of a high-level service, the Replication Manager, which
sits on top of the CORBA stack. This service is the cornerstone of the FT infrastructure,
acting as a rendezvous for all the remaining components, more precisely, monitors that
watch the status of the replicas, replica factories that allow the creation of new replicas,
and fault notifiers that inform the manager of failed replicas. TAO's architecture is
further detailed in Section 2.6.
FLARe and CORFU
FLARe [74] focuses on proactively adapting the replication group to underlying changes
in resource availability. To minimize resource usage, it only supports passive replication [75].
Its implementation is based on TAO [3]. It adds four new components to
the existing architecture: (a) a Replication Manager, a high-level service that decides on the
strategy to be employed to address changes in resource availability and faults; (b)
a client interceptor that redirects invocations to the active primary; (c) a redirection
agent that receives updates from the Replication Manager and is used by the interceptor;
and (d) a resource monitor that watches the load on nodes and periodically notifies the
Replication Manager. In the presence of faulty conditions, such as the overload of a node,
the Replication Manager adapts the replication group to the changing conditions by
activating replicas on nodes that have lower resource usage and, additionally, by changing
the location of the primary to a more suitable node.
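The interceptor/redirection pair can be pictured as a client-side loop that walks the replica list supplied by the Replication Manager, failing over when the presumed primary is unreachable. This is only a sketch: `send` is a hypothetical transport callback, whereas FLARe performs the redirection inside a CORBA portable interceptor.

```python
def invoke_with_failover(request, replicas, send):
    """Try the current primary first, then fall through to backups."""
    last_error = None
    for replica in replicas:
        try:
            return send(replica, request)
        except ConnectionError as err:
            last_error = err        # replica presumed failed; try the next
    raise last_error                # every replica failed

# A toy transport in which replica "A" has crashed.
def send(replica, request):
    if replica == "A":
        raise ConnectionError("A is down")
    return "%s handled %s" % (replica, request)

print(invoke_with_failover("ping", ["A", "B"], send))  # → B handled ping
```

Keeping the replica list fresh on the client (via the redirection agent) is what lets failover happen without a round-trip to the Replication Manager on the critical path.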
CORFU [66] extends FLARe to support real-time and fault-tolerance for the Lightweight
CORBA Component Model (LwCCM) [76] standard for DRE systems. It provides
fail-stop behavior, that is, when one component of a failover unit fails, all the
remaining components are stopped, allowing for a clean switch to a new unit. This is
achieved through a fault-mapping facility that maps an object failure to the respective
plan(s), with the subsequent component shutdown.
DeCoRAM
The DeCoRAM system [77] aims to provide RT and FT properties through a resource-
aware configuration, executed using a deployment infrastructure. The class of supported
systems is confined to closed DRE systems, where the number of tasks and their respective
execution and resource requirements are known a priori and remain invariant throughout
the system's life-cycle. As the tasks and resources are static, it is possible to optimize the
allocation of the replicas on the available nodes. The allocation algorithm is configurable,
allowing a user to choose the best approach for a particular application domain.
DeCoRAM provides a custom allocation algorithm named FERRARI (FailurE, Real-
Time, and Resource Awareness Reconciliation Intelligence) that addresses the
optimization problem while satisfying both RT and FT system constraints. Because of the
limited resources normally available on DRE systems, DeCoRAM only supports passive
replication [75], thus avoiding the high overhead associated with active replication [78].
The allocation algorithm calculates the component inter-dependencies and deploys the
execution plan using the underlying middleware infrastructure, which is provided by
FLARe [74].
Interception-based Approach
The work done in Eternal [79, 80] focused on providing transparent fault-tolerance for
CORBA, ensuring strong replica consistency through the use of a reliable totally-ordered
multicast protocol. This approach relieves the developer from having to deal with low-
level mechanisms for supporting fault-tolerance. In order to maintain compatibility with
the CORBA-FT standard, Eternal exposes the replication manager, fault detector, and
fault notifier to developers. However, the main infrastructure components are located
below the ORB for both efficiency and transparency purposes. These components
include logging-recovery mechanisms, replication mechanisms, and interceptors. The
replication mechanisms provide support for warm and cold passive replication and for
active replication. The interceptors capture the CORBA IIOP requests and replies (based on
TCP/IP) and redirect them to the fault-tolerance infrastructure. The logging-recovery
mechanisms are responsible for managing the logging and checkpointing, and for performing
the recovery protocols.
MEAD
MEAD focuses on providing fault-tolerance support in a non-intrusive way by enhancing
distributed RT systems with (a) transparent, although tunable, FT that
is (b) proactively dependable through (c) resource awareness, with (d) scalable and
fast fault-detection and fault-recovery. It uses CORBA-RT, more specifically TAO,
as a proof-of-concept. An important contribution of this work is the balancing of fault-
tolerance resource consumption against RT requirements. MEAD is detailed further
in Section 2.6.
2.3 P2P+RT Middleware Systems
While most of the focus on P2P systems has been on the support of FT, there is a
growing interest in using these systems for RT applications, namely in streaming
and QoS support. This section provides an overview of P2P systems that support RT.
2.3.1 Streaming
Streaming, and especially Video on Demand (VoD), was a natural evolution of the first
file-sharing P2P systems [81, 82]. With the steady increase of network bandwidth on the
Internet, it is now possible to offer high-quality multimedia streaming solutions to the
end-user. These systems focus on providing near soft real-time performance by splitting
streams across distributed P2P storage and redundant network channels.
PPTV
The work done in [26] provides the background for the analysis, design and behavior
of VoD systems, focusing on the PPTV system [83]. An overview of the different
replication strategies and their respective trade-offs is presented, namely Least Recently
Used (LRU) and Least Frequently Used (LFU). The latter uses a weighted estimation
based on the local cache completion and on the availability-to-demand ratio (ATD).
Each stream is divided into chunks. The size of these chunks has a direct influence on
the efficiency of the streaming: smaller pieces facilitate replication, and thus
overall system load-balancing, whereas bigger pieces decrease the resource overhead
associated with piece management and the bandwidth consumed by protocol
control. To allow for more efficient piece selection, three algorithms are proposed:
sequential, rarest-first and anchor-based. To ensure real-time behavior, the system
offers different levels of aggressiveness: simultaneously issuing requests of the
same type to neighboring peers; simultaneously sending different content requests to
multiple peers; and requesting from a single peer (making a more conservative use of
resources).
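The sequential and rarest-first policies can be contrasted in a few lines; the availability map below is illustrative, and anchor-based selection (which also weighs proximity to seek points) is omitted for brevity.

```python
def next_piece(missing, availability, strategy):
    """Pick the next chunk to request.

    missing: set of chunk ids still needed;
    availability: chunk id -> number of neighbors holding it.
    """
    if strategy == "sequential":
        # Earliest chunk first: favors smooth playback.
        return min(missing)
    if strategy == "rarest-first":
        # Least-replicated chunk first: favors swarm-wide availability.
        return min(missing, key=lambda c: (availability[c], c))
    raise ValueError("unknown strategy: %s" % strategy)

avail = {3: 4, 5: 1, 7: 2}
print(next_piece({3, 5, 7}, avail, "sequential"))    # → 3
print(next_piece({3, 5, 7}, avail, "rarest-first"))  # → 5
```

VoD systems typically blend the two: sequential near the playback position to meet deadlines, rarest-first further ahead to keep the swarm healthy.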
Thicket
Efficient data dissemination over unstructured P2P networks was addressed by Thicket [84].
The work used multiple trees to ensure efficient usage of resources while providing
redundancy in the presence of node failures. In order to improve load-balancing across
the nodes, the protocol tries to minimize the number of nodes that act as interior
nodes in several trees, thus reducing the load produced by forwarding messages.
The protocol also defines a reconfiguration algorithm for balancing load across
neighboring nodes and a tree repair procedure to handle tree partitions. Results show
that the protocol is able to quickly recover from a large number of simultaneous node
failures and to balance the load across the remaining nodes.
2.3.2 QoS-Aware P2P
Until recently, P2P systems focused on providing resiliency and throughput,
thus failing to address the increasing need for QoS in latency-sensitive applications,
such as VoD.
QRON
QRON [85] aimed to provide a general unified framework, in contrast to application-
specific overlays. Overlay brokers (OBs), present in each autonomous system of
the Internet, support QoS routing for overlay applications through resource negotiation
and allocation, and topology discovery. The main goal of QRON is to find a path that
satisfies the QoS requirements, while balancing the overlay traffic across the OBs and
overlay links. For this it proposes two distinct algorithms, a "modified shortest distance
path" (MSDP) and a "proportional bandwidth shortest path" (PBSP).
GlueQoS
GlueQoS [86] focused on the dynamic and symmetric negotiation of QoS
features between two communicating processes. It provides a declarative language that
allows the specification of the QoS feature set (and possible conflicts) and a runtime
negotiation mechanism that finds a set of QoS features that is valid at both
ends of the interacting components. Contrary to aspect-oriented programming [65], which
only enforces QoS semantics at deployment time, GlueQoS offers a runtime solution that
remains valid throughout the duration of a session between a client and a server.
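At its core, the negotiation computes the intersection of the two advertised feature sets and then removes features ruled out by declared conflicts. A deterministic sketch (pairwise conflicts only, resolved by dropping the second member of each pair; GlueQoS's actual language is richer):

```python
def negotiate(client_features, server_features, conflicts):
    """Return a feature set valid at both ends of the session."""
    agreed = set(client_features) & set(server_features)
    for a, b in conflicts:
        if a in agreed and b in agreed:
            agreed.discard(b)   # resolve the conflict deterministically
    return agreed

print(negotiate({"encrypt", "compress", "trace"},
                {"encrypt", "compress"},
                [("encrypt", "compress")]))  # → {'encrypt'}
```

Running the same rule at both peers yields the symmetric agreement the paper aims for, without either side dictating the outcome.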
2.4 P2P+FT Middleware Systems
The research on P2P systems has been largely dominated by the pursuit of fault-
tolerance, such as in distributed storage, mainly due to the resilient and decentralized
nature of P2P infrastructures.
2.4.1 Publish-subscribe
P2P publish-subscribe systems implement a messaging pattern in which the publishers
(senders) do not have a predefined set of subscribers (receivers) for their messages.
Instead, subscribers must first register their interest with the target publisher before
starting to receive published messages. This decoupling between publishers and
subscribers allows for better scalability, and ultimately, performance.
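The decoupling can be illustrated with a minimal topic-based broker: publishers address a topic, never a receiver, and the broker fans each message out to whatever subscribers registered interest. This is a generic sketch, not any particular system's API.

```python
class Broker:
    def __init__(self):
        self.subscribers = {}  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        # The publisher does not know (or care) who receives the message.
        for callback in self.subscribers.get(topic, []):
            callback(message)

broker = Broker()
inbox = []
broker.subscribe("alarms", inbox.append)
broker.publish("alarms", "overload on node 3")
broker.publish("metrics", "cpu=0.42")  # no subscribers: silently dropped
print(inbox)  # → ['overload on node 3']
```

In a P2P setting the broker role is itself distributed, e.g. over a multicast tree rooted at the topic, which is exactly what Scribe does below.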
Scribe
Scribe [87] aimed to provide a large-scale event notification infrastructure, built on
top of Pastry [88], for topic-based publish-subscribe applications. Pastry is used to
support topics and subscriptions and to build multicast trees. Fault-tolerance is provided
by the self-organizing capabilities of Pastry, through the adaptation to network failures
and subsequent multicast tree repair. Event dissemination is best-effort
and offers no delivery order guarantees. Nevertheless, it is possible
to enhance Scribe to support consistent ordering through the implementation of
sequential time-stamping at the root of the topic. To ensure strong consistency and
tolerate topic root node failures, an implementation of a consensus algorithm, such as
Paxos [89], is needed across the set of replicas (of the topic root).
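The sequential time-stamping enhancement boils down to a sequencer at the topic root: every event receives a monotonically increasing number, and subscribers order deliveries by it. A sketch (replicating the root, e.g. with Paxos, is not shown):

```python
class TopicRoot:
    """Stamps each published event with the next sequence number."""
    def __init__(self):
        self.seq = 0

    def stamp(self, event):
        self.seq += 1
        return (self.seq, event)

def deliver_in_order(stamped):
    # Subscribers sort by sequence number for a consistent total order.
    return [event for _, event in sorted(stamped)]

root = TopicRoot()
a, b, c = root.stamp("join"), root.stamp("update"), root.stamp("leave")
print(deliver_in_order([c, a, b]))  # → ['join', 'update', 'leave']
```

Since every event passes through the root, the ordering guarantee costs one extra hop and makes the root the component whose failure the consensus-based replication must mask.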
Hermes
Hermes [90] focused on providing a distributed event-based middleware with an underlying
P2P overlay for scalability and reliability. Inspired by work done in Distributed Hash
Table (DHT) overlay routing [88, 91], it also has some notions of rendezvous similar
to [81]. It bridges the gap between programming language type semantics and low-level
event primitives by introducing the concepts of event-type and event-attributes, which
have some common ground with an Interface Definition Language (IDL) in the RPC
context. In order to improve performance, it is possible during the subscription process to
attach a filter expression to the event attributes. Several algorithms are proposed for
improving availability, but they all provide weak consistency properties.
2.4.2 Resource Computing
There is a growing interest in harvesting and managing the spare computing power
of the increasing number of networked devices, both public and private, as reported
in [92, 93, 94, 95]. Some relevant examples are:
BOINC
BOINC (Berkeley Open Infrastructure for Network Computing) [96] aims to facilitate
the harvesting of public computing resources by the scientific research community.
BOINC implements a redundant computing mechanism to guard against malicious or
erroneous computational results. Each project specifies the number of results that should be
created for each "workunit", i.e. the basic unit of computation to be performed. When
a given number of results is available, an application-specific function is called to
evaluate them and possibly choose a canonical result. If no consensus is achieved,
or if results simply fail, a new set of results is computed. This process repeats until
a successful consensus is achieved or an application-defined timeout occurs.
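The validation step amounts to quorum voting over redundant results. A compact sketch of choosing a canonical result (the equality-based comparison here stands in for the project's application-specific validator, which may tolerate small numeric differences):

```python
from collections import Counter

def canonical_result(results, quorum):
    """Return the canonical result once `quorum` matching results exist,
    or None to signal that more results must be computed."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count >= quorum else None

print(canonical_result([42, 42, 17], quorum=2))  # → 42
print(canonical_result([1, 2, 3], quorum=2))     # → None
```

A `None` outcome corresponds to BOINC issuing additional copies of the workunit until consensus or timeout.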
P2P-MapReduce
Developed at Google, MapReduce [97] is a programming model that is able to parallelize
the processing of large data sets in a distributed environment. It follows a master-slave
model, where a master distributes the data set across a set of slaves, returning the
computational results (from the map or reduce tasks) at the end. MapReduce provides
fault-tolerance for slave nodes by reassigning a failed job to an alternative active slave,
but lacks support for master failures. P2P-MapReduce [98] provides this fault-tolerance by
resorting to two distinct P2P overlays, one containing the currently available masters in
the system, and the other the active slaves. When a user submits a MapReduce
job, it queries the master overlay for a list of the available masters (ordered by their
workload). It then selects a master node and the number of replicas. After this, the
master node notifies its replicas that they will participate in the current job. A master
node is responsible for periodically synchronizing the state of the job over its replica set.
In case of failure, a distributed procedure is executed to elect a new master from among
the active replicas. Finally, the master selects the set of slaves from the slave overlay,
using a performance metric based on workload and CPU performance, and starts the
computation.
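The job-submission step can be sketched as a query over the master overlay ordered by workload: the least-loaded node becomes the job's master and the next nodes form its backup set. The dictionary shape is illustrative, not the protocol's message format.

```python
def pick_master_and_replicas(masters, n_replicas):
    """masters: list of {'id': ..., 'load': ...} entries from the overlay."""
    ranked = sorted(masters, key=lambda m: m["load"])
    master = ranked[0]["id"]
    replicas = [m["id"] for m in ranked[1:1 + n_replicas]]
    return master, replicas

overlay = [{"id": "m1", "load": 5},
           {"id": "m2", "load": 1},
           {"id": "m3", "load": 3}]
print(pick_master_and_replicas(overlay, n_replicas=1))  # → ('m2', ['m3'])
```

Choosing backups from the same ranked list keeps the replica set lightly loaded, which matters because each replica must absorb the periodic job-state synchronization traffic.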
2.4.3 Storage
Storage systems were one of the most prevalent applications of first-generation P2P
systems. Evolving from early file-sharing systems, and with the help of DHT middleware,
they have now become the choice for large-scale storage systems in both industry and
academia.
openDHT
Work done in [99] aimed to provide a lightweight framework for P2P storage using
DHTs (such as in [88, 91]) in a public environment. The key challenge was to handle
mutually untrusting clients, while guaranteeing fairness in the access to and allocation of
storage. The work was able to provide fair access to the underlying storage capacity,
under the assumption that storage capacity is free. Because of its intrinsically fair
approach, the system is unable to provide any type of Service Level Agreement (SLA)
to its clients, thus reducing the domain of applications that can use it.
Dynamo
Recent research on data storage [25] and distribution at Amazon focuses on key-value
approaches using P2P overlays, more precisely DHTs, to overcome the well-explored
limitation of simultaneously providing high availability and strong consistency (through
synchronous replication) [100, 101]. The approach taken was to use an optimistic
replication scheme that relies on asynchronous replica synchronization (also known
as passive replication). Consistency conflicts between different replicas, which are
caused by network and server failures, are resolved at 'read time', as opposed to the
more traditional 'write time' strategy, in order to maximize the write
availability of the system. Such conflicts are resolved by the services themselves, allowing
for a more efficient resolution (although the system offers the services a default 'last
value holds' strategy). Dynamo offers efficient key-value storage, while maximizing the
availability of write operations. Nevertheless, the ring-based overlay hampers the scalability
of the system, and, depending on the partitioning strategy used, the membership process
may not be efficient.
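The default 'last value holds' reconciliation can be sketched with timestamped versions returned by divergent replicas. Plain timestamps stand in for Dynamo's vector clocks, which additionally let the application detect true conflicts and resolve them itself (e.g. merging shopping carts).

```python
def read_repair(replica_versions):
    """replica_versions: list of (timestamp, value) pairs, one per replica.
    The read path reconciles divergence by keeping the most recent write."""
    _, value = max(replica_versions)
    return value

# Three replicas diverged after a partition; the read reconciles them.
print(read_repair([(3, "cart:v3"), (5, "cart:v5"), (1, "cart:v1")]))  # → cart:v5
```

Pushing reconciliation to the read path is precisely what keeps writes always available: a write never waits for the replicas to agree.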
2.5 P2P+RT+FT Middleware Systems
These types of systems are a natural evolution of previous RT+FT middleware
systems. They aim to provide scalability and resilience through a P2P network
infrastructure that provides lightweight FT mechanisms, allowing them to support
soft RT semantics. We first proposed an architecture [102, 103] for a general-purpose
middleware that aimed to integrate FT into the P2P network layer, while being able
to provide RT support. The first implementation of the architecture, in Java, was done
in DAEM [6, 104]. This work used a hierarchical tree-based P2P overlay based on P3 [15]. FT
was supported at all levels of the tree, resulting in a high availability rate, but
the use of JGroups [7] for maintaining strong consistency, both for mesh and service
data, resulted in high overhead. Due to its highly coupled tree architecture, faults had a
major impact on availability when they occurred near the root node, as they produced
cascading failures. Initial support for RT was provided, but the high overhead of the
replication infrastructure limited its applicability.
2.6 A Closer Look at TAO, MEAD and ICE
This section provides a closer look at the middleware systems that provided us with
several strategies and insights used to design and implement Stheno, our
middleware solution that is able to support RT, FT and P2P.
All of these systems, namely TAO, MEAD, and ICE, share a service-oriented architecture
with a client-server network model. In terms of RT, both TAO and MEAD
support the RT-CORBA standard, while ICE only supports best-effort invocations. As
for FT support, TAO and ICE use high-level services, whereas MEAD uses a hybrid
approach that combines both low- and high-level services.
2.6.1 TAO
TAO is a classical RPC middleware and therefore only supports the client-server network
model. Name resolution is provided by a high-level service, representing a clear point
of failure and a bottleneck.
RT Support. TAO supports the RT-CORBA 1.0 specification, with the most important
features being: (a) priority propagation; (b) explicit binding, and; (c) RT thread
pools.
Priority propagation ensures that a request maintains its priority across a chain of
invocations: a client issues a request to an object A which, in turn, issues an invocation
on another object B; the request priority at object A is then used to make the invocation
at object B. There are two types of propagation: server-declared priorities and
client-propagated priorities. In the first, the server dictates the priority that will
be used when processing an incoming invocation. In the second, the priority of
the invocation is encoded within the request, so the server processes the request at the
priority specified by the client.
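The two propagation models reduce to a single choice at dispatch time. A toy rendering (the constant names echo the RT-CORBA policy names; the function itself is illustrative):

```python
def effective_priority(request_priority, servant_priority, model):
    """Priority used to process an incoming invocation."""
    if model == "SERVER_DECLARED":
        return servant_priority    # the server's configuration wins
    if model == "CLIENT_PROPAGATED":
        return request_priority    # the priority travels with the request
    raise ValueError("unknown model: %s" % model)

# Object A received the request at priority 10 and now invokes object B,
# whose servant is configured at priority 5.
print(effective_priority(10, 5, "SERVER_DECLARED"))    # → 5
print(effective_priority(10, 5, "CLIENT_PROPAGATED"))  # → 10
```

Under client propagation, applying the same rule at every hop is what carries the original priority end-to-end across the invocation chain.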
A source of unbounded priority inversion is the use of multiplexed communication
channels. To overcome this, the RT-CORBA specification mandates that network
channels be pre-established, avoiding the latency caused by their creation. This
model allows two possible policies: (a) a private connection between the client and the
server, or; (b) a priority-banded connection that can be shared but limits the priority of
the requests that can be made on it.
In CORBA, a thread pool uses a threading strategy, such as leader-followers [11], with
the support of a reactor (an object that handles network event de-multiplexing), and is
normally associated with an acceptor (an entity that handles incoming connections),
a connection cache, and a memory pool. In classic CORBA a high-priority thread can
be delayed by a low-priority one, leading to priority inversion. To avoid this unwanted
side-effect, the RT-CORBA specification defines the concept of thread pool lanes.
All the threads belonging to a thread pool lane have the same priority and therefore
only process invocations of that same priority (or of a band that contains that priority).
Because each lane has its own acceptor, memory pool, and reactor, the risk of priority
inversion is greatly reduced, at the expense of higher resource usage.
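The idea behind lanes can be sketched as follows (a plain-Python illustration; a real lane would also carry its own acceptor, reactor, and memory pool, which are omitted here):

```python
import queue
import threading

class Lane:
    """A thread-pool lane: worker threads of one priority, private queue."""
    def __init__(self, priority, n_threads=2):
        self.priority = priority
        self.jobs = queue.Queue()
        for _ in range(n_threads):
            threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            job = self.jobs.get()
            job()
            self.jobs.task_done()

class LanedPool:
    """Routes each invocation to the lane matching its priority, so a
    high-priority request never queues behind a low-priority one."""
    def __init__(self, priorities):
        self.lanes = {p: Lane(p) for p in priorities}

    def submit(self, priority, job):
        self.lanes[priority].jobs.put(job)
```

Because each priority owns a separate queue and worker set, a burst of low-priority jobs cannot delay a high-priority invocation, at the cost of idle threads in quiet lanes.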
FT Support. In an effort to combine RT and FT semantics, the proposed replication
style, semi-active, was heavily based on Delta-4 [45]. This strategy avoids the
latency associated with both warm and cold passive replication [105] and the high
overhead and non-determinism of active replication, but it represents an extension to the
FT specification.
Figure 2.2: TAO’s architectural layout (adapted from [3]).
Figure 2.2 shows the architectural overview of TAO. The support for FT is achieved
through the use of a set of high-level services built on top of TAO. These services include
a Fault Notifier, a Fault Detector and a Replication Manager.
The Replication Manager is the central component of the FT infrastructure. It acts
as a central rendezvous point for the remaining FT components, and it is responsible
for managing the life-cycle of replication groups (creation/destruction) and for performing
group maintenance, that is, electing a new primary, removing faulty replicas, and
updating group information.
It is composed of three sub-components: (a) a Group Manager, which manages group
membership operations (adding and removing elements), allows changing the primary
of a given group (for passive replication only), and allows the manipulation and retrieval
of group member locations; (b) a Property Manager, which allows the manipulation of
replication properties, such as the replication style; and (c) a Generic Factory, the entry
point for creating and destroying objects.
The Fault Detector is the most basic component of the FT infrastructure. Its role is to
monitor components, processes, and processing nodes, and to report failures to the
Fault Notifier. In turn, the Fault Notifier aggregates these failure reports and forwards
them to the Replication Manager.
The FT bootstrapping sequence is as follows: (a) the Naming Service is started; (b) the
Replication Manager is started; (c) the Fault Notifier is started and (d) finds the
Replication Manager and registers itself with it; in response, (e) the Replication
Manager connects to the Fault Notifier as a consumer. (f) Each node that is going to
participate starts a Fault Detector Factory and a Replica Factory, which in turn
register themselves with the Replication Manager. (g) A group creation request is
made to the Replication Manager (by an external entity, referred to as the Object Group
Creator), followed by a request for the list of available Fault Detector Factories and
Replica Factories; (h) this is followed by a request to create an object group through the
Generic Factory. (i) The Object Group Creator then bootstraps the desired number
of replicas using the Replica Factory at each target node; each Replica Factory creates
the actual replica and, at the same time, starts a Fault Detector at each site using the
Fault Detector Factory. Each of these detectors finds the Replication Manager,
retrieves the reference to the Fault Notifier, and connects to it as a supplier. (j) Each
replica is added to the object group by the Object Group Creator using the Group
Manager at the Replication Manager. (k) Finally, a client is started, retrieves the
object reference from the Naming Service, and makes an invocation on the group; the
invocation is then carried out by the primary of the replication group.
Proactive FT Support. An alternative approach has been proposed by FLARe [74],
which focuses on proactively adapting the replication group to the load present in the
system. The replication style is limited to semi-active replication using state transfer,
commonly referred to simply as passive replication.
Figure 2.3 shows the architectural overview of FLARe. This architecture adds three
new components to TAO’s FT infrastructure: (a) a client interceptor, which redirects
invocations to the proper server, since the initial reference may have been changed by
the proactive strategy in response to a load change; (b) a redirection agent, which
receives updates with these changes from the Replication Manager; and (c) a resource
monitor, which monitors the load on a processing node and sends periodic updates to
the Replication Manager.
Figure 2.3: FLARe’s architectural layout (adapted from [74]).
In the presence of abnormal load fluctuations, the Replication Manager adapts the
replication group to the new conditions by creating replicas on nodes with lower usage
and, if required, by switching the primary to a more suitable replica.
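The adaptation step can be sketched roughly as follows; the data shapes, the threshold value, and the function name are hypothetical, not FLARe’s actual interfaces:

```python
def rebalance(replicas, loads, threshold=0.8):
    """Sketch of a FLARe-style proactive adaptation step.

    replicas: {node: role}, where role is "primary" or "backup".
    loads:    {node: CPU load in [0, 1]}, as reported by resource monitors.
    If the primary's node is overloaded, promote the replica running on
    the least-loaded node; spare nodes could also host new replicas.
    """
    primary = next(n for n, role in replicas.items() if role == "primary")
    if loads[primary] <= threshold:
        return replicas  # load is acceptable, nothing to do
    best = min((n for n in replicas if n != primary), key=loads.__getitem__)
    replicas = dict(replicas)  # leave the input mapping untouched
    replicas[primary], replicas[best] = "backup", "primary"
    return replicas
```

A real Replication Manager would also have to transfer state to the new primary and notify the redirection agents before switching clients over.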
TAO’s fault-tolerance support relies on a centralized infrastructure, with its main
component, the Replication Manager, representing a major obstacle to the system’s
scalability and resiliency. No mechanisms are provided to replicate this entity.
2.6.2 MEAD
MEAD focuses on enhancing distributed RT systems with fault-tolerance support in a
non-intrusive way: it provides transparent, although tunable, FT that is proactively
dependable through resource awareness, with scalable and fast fault detection and
fault recovery. It uses RT-CORBA, more specifically TAO, as a proof-of-concept.
Transparent Proactive FT Support. MEAD’s architecture contains three major
components, namely, the Proactive FT Manager, the Mead Recovery Manager and the
Mead Interceptor. The underlying communication is provided by Spread, a group
communication framework that offers reliable, totally ordered multicast, guaranteeing
consistency for both component and node membership.
The Mead Interceptor provides the usual interception of system calls between the
application and the underlying operating system. This approach provides a transparent
and non-intrusive way to enhance the middleware with fault tolerance.
Figure 2.4: MEAD’s architectural layout (adapted from [14]).
Figure 2.4 shows the architectural overview of MEAD. The main component of the
MEAD system is the Proactive FT Manager, which is embedded within the interceptors
on both server and client. It is responsible for monitoring the resource usage at
each server and for initiating a proactive recovery scheme based on a two-step threshold.
When resource usage rises above the first threshold, the proactive manager
sends a request to the MEAD Recovery Manager to launch a new replica. If usage
rises above the second threshold, the proactive manager starts migrating the
replica’s clients to the next non-faulty replica server.
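The two-step threshold scheme can be sketched as a simple decision function; the threshold values and the names below are illustrative, not taken from MEAD:

```python
def proactive_step(usage, replica_launched, t1=0.6, t2=0.85):
    """Sketch of MEAD-style two-step proactive recovery.

    usage: resource usage of the monitored server, in [0, 1].
    replica_launched: whether a spare replica was already requested.
    Returns the action the Proactive FT Manager would take.
    """
    if usage > t2:
        # Second threshold: start moving clients to the next non-faulty replica.
        return "MIGRATE_CLIENTS"
    if usage > t1 and not replica_launched:
        # First threshold: ask the Recovery Manager to launch a new replica.
        return "LAUNCH_NEW_REPLICA"
    return "NONE"
```

Crossing the first threshold prepares a spare replica ahead of failure; crossing the second drains clients before the overloaded server can fail, which is what makes the recovery proactive rather than reactive.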
The Mead Recovery Manager has some similarities with the Replication Manager of
CORBA-FT, as it must also launch new replicas in the presence of failures (node or
server). In MEAD, however, the recovery manager does not follow a centralized
architecture, as in TAO or FLARe, where all components of the FT infrastructure are
connected to the replication manager; instead, the components are connected by a
reliable, totally ordered group communication framework that establishes an implicit
agreement at each communication round. These frameworks also provide a notion of
view, i.e. an instantaneous