IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 7, NO. 3, JULY-SEPTEMBER 2010 271

On the Quality of Service of
Crash-Recovery Failure Detectors
Tiejun Ma, Jane Hillston, and Stuart Anderson

Abstract—We model the probabilistic behavior of a system comprising a failure detector and a monitored crash-recovery target. We
extend failure detectors to take account of failure recovery in the target system. This involves extending QoS measures to include the
recovery detection speed and the proportion of failures detected. We also extend the estimation of the failure detector parameters needed to
achieve a required QoS to the configuration of the crash-recovery failure detector. We investigate the impact of the dependability of the
monitored process on the QoS of our failure detector. Our analysis indicates that variation in the MTTF and MTTR of the monitored
process can have a significant impact on the QoS of our failure detector. Our analysis is supported by simulations that validate our
theoretical results.

Index Terms—Failure detectors, crash recovery, quality of service, availability, dependability, performance.

1 INTRODUCTION

Fault tolerance is one of the most important issues for achieving dependable distributed systems. One of the most challenging problems in this research area is to tolerate the Byzantine failure, which is also sometimes called the arbitrary failure. This means that a process may behave in an arbitrary manner, producing arbitrary responses at arbitrary times [1]. It is the most difficult failure to detect. One possible solution to Byzantine failure detection is to adopt consensus algorithms. To achieve K fault tolerance, 3K + 1 service replications are needed [2]. In the worst case, the K faulty services may send incorrect values, or incorrectly represent the values of others, but the remaining 2K + 1 services can still return the same correct answer. Crash failure detection is one of the most important building blocks to achieve a successful consensus. However, detecting crash failures is a difficult problem. In [3], Fischer et al. show the impossibility of distinguishing a crashed process from a very slow one in a purely asynchronous system, known as the Fischer-Lynch-Paterson impossibility result. Subsequently, failure detector oracles, which give possibly erroneous information about the state of the monitored target, have been proposed. In [4], Chandra and Toueg introduce the concept of unreliable crash failure detectors to detect the eventual crash behavior of a process and classify a set of abstract failure detectors based on the failure detectors' eventual behavior to solve a certain set of membership problems. This work inspired many researchers to study the quality of service (QoS), such as the speed and accuracy, of crash failure detector implementations and failure detection algorithms, e.g., [5], [6], [7], [8], [9], [10].

It is important to note that most of this previous work on the QoS of crash failure detectors is based on the crash-stop or fail-free assumption. The fail-free assumption assumes that failures do not occur. The crash-stop assumption assumes that there is only one failure and the monitoring procedure terminates once that crash failure is detected. The algorithms based on these assumptions focus on how to estimate the probabilistic message arrival time and a suitable time-out period for a failure detector to ensure a required QoS.

However, fail-free and crash-stop can be strong assumptions. An alternative approach is to consider the crash-recovery paradigm as discussed by Guerraoui and Rodrigues [11]. A process can keep crashing and recovering infinitely often and it is eventually always up and running. In theory, a process recovery can be achieved by adopting stable storage, and the state information of the process can be stored in and retrieved from the storage. After a crash is detected, the recovery procedure can be initiated to retrieve the latest stored process information. In practice, in order to provide high availability, self-repairing and self-healing mechanisms are widely adopted in fault-tolerant systems to achieve automatic recovery after a crash occurs. Particularly, in middleware systems, many techniques and algorithms have been proposed to achieve the self-repairing or self-healing goal, e.g., [12], [13], [14], [15].

In such systems, it is assumed that the system undergoes periodic crashes. During a crash period, the system is unable to service any requests or send any messages, externally behaving as if the system is unreachable. The end of the crash period is marked by a recovery, after which the system returns to normal service and its internal state is restored to the state before the crash failure occurred.

For such systems, crash-recovery failure needs to be considered as a frequently occurring failure type to be detected. However, the crash-recovery case has been little studied, because there are more possible discrepancies between the failure detector and the monitored target, which increases the size of the state space of the monitoring process and makes the QoS analysis for such a paradigm more complicated.

• T. Ma is with the Department of Computing, Imperial College London, South Kensington Campus, 180 Queens Gate London, SW7 2AZ, UK. E-mail: tma@doc.ic.ac.uk.
• J. Hillston and S. Anderson are with the Laboratory for Foundations of Computer Science, School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, UK. E-mail: {jeh, soa}@inf.ed.ac.uk.

Manuscript received 19 Feb. 2008; revised 21 Apr. 2009; accepted 30 June 2009; published online 11 Aug. 2009.
For information on obtaining reprints of this article, please send e-mail to: tdsc@computer.org, and reference IEEECS Log Number TDSC-2008-02-0037.
Digital Object Identifier no. 10.1109/TDSC.2009.36.
1545-5971/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.

In [16], we presented an evaluation of the QoS of a crash-recovery failure detector based on a simple time-out algorithm. A crash-recovery target was modeled as an alternating renewal process. The simulation results showed that the crash-recovery behavior of the monitored target will impact the QoS of such a failure detector, which implied that the crash-recovery paradigm merited further study. Such an analysis was presented in [17]. In that paper, we outlined how to model the failure detection pair in a crash-recovery run and how to configure the failure detector to satisfy a given QoS requirement. The current paper represents a substantial expansion of [17]. We present more analytical details and support the results with further simulation studies. Analytical results, derived directly from the equations in this paper, are also plotted and compared with the simulation results. We are then able to present a detailed analysis for each of the QoS metrics, which shows the validity of our model.

1.1 Our Contribution

We show how to remove the fail-free or crash-stop assumption and model the probabilistic behavior of a failure detector with respect to a crash-recovery target, taking into consideration general dependability metrics, such as mean time to failure (MTTF) and mean time to recovery (MTTR). We outline how the QoS of a failure detector is limited by the dependability of the monitored target. Moreover, we establish that the crash-stop or fail-free models are special cases of the crash-recovery model.

In order to effectively assess the QoS of the failure detector in a crash-recovery run, we have defined new QoS metrics to measure the recovery detection speed and the proportion of the failures of the monitored target which are detected. To make an accurate estimation of the failure detector's parameters needed to achieve a required QoS, a configuration procedure for a crash-recovery failure detector is outlined. We demonstrate how to achieve the QoS from a given set of requirements based on the NFD-S algorithm (see Appendix B, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TDSC.2009.36), proposed by Chen et al. [5], with suitable modifications. To the best of our knowledge, none of these aspects of QoS of failure detectors have been presented before.

1.2 Related Work

In [5], Chen et al. propose a set of QoS metrics to measure the accuracy and speed of a failure detector. Their model contains a pair of processes: one is the monitor process, the other is the monitored process, and there is only one crash during the monitoring period. The analysis is based on two separate stages of failure detection: the precrash stage, which is a fail-free run; and the postcrash stage, which is a crash-stop run when the monitoring procedure will be terminated. In order to formally define the QoS metrics, Chen et al. [5] define state transitions of a failure detector monitoring a target process under the fail-free assumption. At any time, the failure detector's state is either Trust or Suspect with respect to the monitored process's liveness. If a failure detector moves from a Trust state to a Suspect state, then an S-transition occurs; if the failure detector moves from a Suspect state to a Trust state, then a T-transition occurs. Fig. 1 shows the state transitions of the failure detector and the QoS metrics. In terms of the transitions defined above and the fail-free assumption, Chen et al. define the following QoS metrics for a failure detector: failure detection time ($T_D$), mistake recurrence time ($T_{MR}$), mistake duration ($T_M$), good period duration ($T_G$), and query accuracy probability ($P_A$).

Fig. 1. The QoS metrics without considering false positive mistakes.

Some recent research has extended the QoS work of [5] in a number of ways. For example, the authors of [6], [9], [10], [18] refine the model with different probabilistic message delay and loss estimation methods. Meanwhile, others, such as [7], [8], [19], [20], [21], focus on the scalability and adaptivity of crash failure detection. But all of these papers are based on eventual crash-stop behavior of the monitored process or the fail-free assumption. Crash-recovery failure detectors have been considered by several groups. For example, Boichat and Guerraoui [22] implemented reliable and total order broadcast primitives, assuming a practical asynchronous crash-recovery model in which the processes and channels may crash and recover or crash and never recover; [23], [24], [25], [26] each propose failure detectors to solve consensus problems rather than focusing on the QoS of the failure detector itself. In [23], the monitored process is characterized as always-up, eventually-up, eventually-down, or unstable. A process which crashes and recovers infinitely many times is regarded as unstable. But crash-recovery looping behavior exists for most systems. From the perspective of stochastic theory, crash-recovery behavior can be regarded as a regenerative process in which the probabilistic live and recovery times are not zero. In the following sections, we will analyze such a crash-recovery paradigm and its failure detector from a QoS perspective.

This paper is organized as follows: in Section 2.1, we model a crash-recovery service with general dependability metrics. Then, we show our model of the probabilistic message communication and its QoS metrics. In Section 3, we show how to model the crash-recovery failure detector's probabilistic behavior. We refine the completeness of a crash-recovery failure detector and extend the QoS metrics to measure the completeness and the recovery detection speed of such a failure detector. Then, we show how to involve the general dependability metrics for an approximate analysis of the QoS of a failure detector and how to configure a crash-recovery failure detector to satisfy a given set of QoS requirements. Moreover, we discuss the impact of the dependability of the crash-recovery service on the QoS of failure detectors in detail. In Section 4, the estimation of the input parameters of a crash-recovery failure detector is presented. We show how to estimate the message delay, message loss, MTTF, MTTR, etc., in a crash-recovery run. In

Section 5, the analytical and simulation results are plotted and analyzed in detail. We show that the dependability of a crash-recovery target has an impact on the QoS of a failure detector and our analysis is valid. In Section 6, a brief summary of the paper is presented. Appendix A provides a notation table for the variables used in the paper. Appendix B shows the pseudocode of the NFD-S algorithm. Appendix C presents the main proofs of the lemmas and theorems presented in this paper.

2 CRASH-RECOVERY SERVICE AND QoS OF MESSAGE COMMUNICATION

In this section, we outline the assumptions underlying our framework, considering the crash-recovery behavior of the target service, its dependability characteristics, and the behavior of the communication channel which supports the failure detection process.

2.1 The Crash-Recovery Service Modeling

For a crash-recovery target service (CR-TS), we consider that the service might crash at an arbitrary time and take some time to be repaired and restarted after it fails. Let $S$ be the state space of a stochastic process $Z := \{Z(t), t \geq 0\}$, where $Z$ captures a CR-TS's lifetime. Then, $S$ can be regarded as {Alive, Crash} and the CR-TS can periodically switch between these two states. A transition occurs when the state of the CR-TS changes. Fig. 2 shows the state transitions of a CR-TS, where a C-transition occurs when the state of the CR-TS switches from the Alive state to the Crash state; an R-transition occurs when the state of the CR-TS switches from the Crash state to the Alive state.

Fig. 2. Crash-recovery service modeling.

Assumption 1. If the service's recovery is treated as a restart, then the CR-TS's lifetime $Z$ is a regenerative process.

Assumption 1 will be used in the following. It is based on the following observations. The CR-TS will periodically crash and recover, leading to a sequence of time points, $S_1, S_2, \ldots, S_n, \ldots$ ($n \geq 0$), representing the times of the CR-TS's recovery. The behavior of the system after $S_n$ ($n \geq 0$) is independent of what has occurred before, and thus, $S_n$ can be regarded as a restart. Moreover, the probability of $S_n$ occurring is 1. This makes the time points $S_1, S_2, \ldots, S_n$ regeneration points.

Since the CR-TS's lifetime $Z$ is a regenerative process and the sequence $\{S_1, S_2, \ldots, S_n, \ldots\}$ characterizes the lifetime of the service, we can give an alternative definition of the stochastic process $Z$. The stochastic process $Z$ is a set of random variables $\{X(n), n \in N\}$, where $X(n)$ is the random variable representing the time which elapses from the time of the $n$th regeneration point to the $(n+1)$th one (i.e., $X(n) = S_{n+1} - S_n$). For simplicity of presentation, we use $X$ instead of $X(n)$ in the following since it is sufficient to consider a single regeneration period. Furthermore, we can consider $X$ to be the sum of two independent random variables: $X_a$ and $X_c$. Here, $X_a$ represents the time which elapses from the time that the CR-TS starts a regeneration period to the time the CR-TS fails and $X_c$ represents the time from when the CR-TS fails until the time of the next regeneration point.

Lemma 1. In steady state, the CR-TS is an alternating renewal process and the time between any two consecutive recovery time points is one period of the crash-recovery service's lifetime.

Thus, we assert that in order to design a failure detector for the CR-TS, which is sensitive to the CR-TS's behavior, we only need to consider one period of the CR-TS since all of the other periods are independent and identically distributed.

2.2 Dependability of a Crash-Recovery Service

Dependability, one of the most important issues for computer systems, is a complex attribute. Laprie et al. [1] define the concept of dependability as the property of a computer system such that reliance can justifiably be placed on the service it delivers. Associating timing information with the behavior of a system, its dependability can be described quantitatively. Generally speaking, the dependability of a system can be measured according to a number of different aspects such as reliability, availability, consistency, usability, security, etc. In order to simplify the measurements which are related to failure detection, here, we only introduce reliability, availability, and consistency, which are strongly related to the QoS of failure detectors.

In [27], Knight and Strunk give a definition of software reliability and availability. We extend this with a definition of consistency as follows:

• Reliability: the probability that the system will operate correctly in a specified operating environment up until time $t$ ($t > 0$).
• Availability: the probability that the system will be operational at time $t$.
• Consistency: the probability that, in a specified operating environment, the system will return to normal operation correctly after a failure within time $t$.

These three metrics present different aspects of the system dependability. Generally, reliability presents how long a system will operate correctly and can be captured by MTTF, which records the likelihood of a service to persist without a failure. Availability presents the probability that a system is accessible or reachable with correct operation at an arbitrary time and can be captured by mean time to failure divided by mean time between failure (MTTF/MTBF). Consistency presents the ability of a system to recover from a failure state to the correct operation state and can be captured by MTTR, which records how quickly a system recovers.

In different scenarios, different aspects of dependability may be given greater relative importance. For example, consistency may be valued more than reliability in a system designed to be always accessible. This means that fault-tolerance mechanisms should be able to adapt to reflect differing dependability requirements.
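
As a rough numerical check of the statement that availability can be captured by MTTF/MTBF, the following sketch simulates a CR-TS as an alternating renewal process. The exponential distributions chosen for $X_a$ and $X_c$, the parameter values, and the function name `simulate_availability` are assumptions made only for illustration; the model above does not prescribe them. The sketch also takes MTBF = MTTF + MTTR, which follows from $X = X_a + X_c$ with $E(X_a)$ = MTTF and $E(X_c)$ = MTTR.

```python
import random


def simulate_availability(mttf, mttr, periods, seed=0):
    """Estimate the long-run fraction of time a crash-recovery service is alive.

    Each regeneration period X = X_a + X_c is drawn independently:
    X_a (time alive) has mean mttf and X_c (time crashed) has mean mttr.
    Exponential distributions are an illustrative assumption.
    """
    rng = random.Random(seed)
    alive_total = 0.0
    crash_total = 0.0
    for _ in range(periods):
        alive_total += rng.expovariate(1.0 / mttf)  # X_a
        crash_total += rng.expovariate(1.0 / mttr)  # X_c
    return alive_total / (alive_total + crash_total)


mttf, mttr = 100.0, 5.0
analytic = mttf / (mttf + mttr)  # MTTF / MTBF
estimate = simulate_availability(mttf, mttr, periods=200_000)
print(f"analytic={analytic:.4f} simulated={estimate:.4f}")
```

With many regeneration periods, the simulated fraction of alive time should settle close to the analytic ratio, whatever distributions are chosen, as long as their means are MTTF and MTTR.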


2.3 QoS of Message Communication
In order to measure the communication between the FDS and
target service quantitatively, we define the communication
path between the FDS and the target service as a channel.
Each communication component pair holds one or more
virtual one-way, source-to-destination channels. Messages
can only flow from the source component to the destination
component. In addition, the channel model in this paper
relies on the assumption of a basic unreliable communication
channel with fairness, no-creation, and no-duplication [28].
This has some similarities with the Stubborn channels in [28],
but they allow duplicated messages and we assume that
there are no duplicated messages in our model.
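
The channel assumptions above can be illustrated with a small simulation. This is a minimal sketch rather than part of the model: the class name `LossyChannel`, the loss probability, and the exponential delay distribution are hypothetical choices. Each sent message is delivered at most once (no duplication), only sent messages are delivered (no creation), and since every message is lost independently with probability below one, a message that is sent repeatedly is eventually delivered with probability one (fairness).

```python
import random


class LossyChannel:
    """One-way, source-to-destination channel under illustrative assumptions."""

    def __init__(self, loss_prob, mean_delay, seed=0):
        self.loss_prob = loss_prob    # probability that a message is lost
        self.mean_delay = mean_delay  # mean of the assumed exponential delay
        self.rng = random.Random(seed)

    def send(self, msg_id, send_time):
        """Return (msg_id, arrival_time), or None if the message is lost."""
        if self.rng.random() < self.loss_prob:
            return None
        delay = self.rng.expovariate(1.0 / self.mean_delay)
        return (msg_id, send_time + delay)


channel = LossyChannel(loss_prob=0.1, mean_delay=0.02)
sent = list(range(1000))  # one message per time unit
results = (channel.send(i, float(i)) for i in sent)
delivered = [r for r in results if r is not None]

ids = [msg_id for msg_id, _ in delivered]
print(len(ids), "of", len(sent), "messages delivered")
```

The delivered list contains each message identifier at most once and never an identifier that was not sent, which mirrors the no-duplication and no-creation properties.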
This channel-based communication, which maintains the interaction between the FDS and the CR-TS, can be characterized by the QoS of the communication, the adopted failure detection algorithm, and the adopted communication protocol, each of which has some associated properties. In particular, we take the message transmission behavior to be probabilistic: we describe the message delay or loss as probabilistic behaviors associated with the communication channel.

Definition 1. Let $D$ be a random variable representing the time which elapses from the time a message is sent until the time it arrives at the destination and $E(D)$ be the average message delay; let $p_L$ be the probability of a message loss during the transmission; let $X_L$ be a random variable representing the number of consecutive messages lost and $E(X_L)$ be the average number of consecutive messages lost.

From these definitions, properties such as the following can be derived:

Lemma 2. If each message's transmission and loss behavior are independent, then the probability that $x$ ($x \geq 1$) consecutive messages are lost is

$$\Pr(X_L = x) = p_L^{x} \cdot (1 - p_L).$$

Overall, the QoS of this channel-based communication between the FDS and the CR-TS can be captured by $E(D)$, $p_L$, and $E(X_L)$. In the following sections, we analyze how the FDS monitors the CR-TS and how the FDS can be configured based on the characteristics of this channel-based communication.

3 QoS OF THE CRASH-RECOVERY FDS

3.1 System Model

We consider a distributed system model with two services: one FDS and one CR-TS, distributed over a wide-area network. The FDS and the CR-TS are connected by an unreliable communication channel (see Section 2.3). Liveness (heartbeat) messages are transmitted through the channel. The communication channel does not create or duplicate liveness messages, but the messages might be lost or delayed indefinitely during transmission.¹ The CR-TS can fail by crashing but can be repaired and restart to run again after some repair time, i.e., it behaves as a crash-recovery model. The drift of the local clocks of the FDS and the CR-TS is small enough to be ignored and their local clocks are sufficiently synchronized (this can be guaranteed by some time synchronization service such as the Network Time Protocol used in [6]) to be regarded as a clock synchronized system. The failure detection algorithm we adopt is the NFD-S algorithm proposed in [5].

¹ This channel-based message transmission is the same as the probabilistic network model in [5].

3.2 Modeling a Push-Style Crash-Recovery FDS

The failure detector (FDS) has a set of suspicion levels $S^s := \{Trust, Suspect\}$ as in [5]. The FDS can either trust or suspect a CR-TS's liveness. Thus, for a fail-free run, a service only has one state: Alive. The state space of an FDS is $S^f := \{Trust\text{-}Alive, Suspect\text{-}Alive\}$, and the event space of an FDS is $F := \{S\text{-}transition, T\text{-}transition\}$ (Fig. 3a). For a fail-free run, the QoS metrics of an FDS can be measured quite straightforwardly. The average time spent in the Trust state is the mean length of the good period $E(T_G)$; the average time spent in the Suspect state is the mean time of the mistake duration $E(T_M)$; the average time between two consecutive transfers to the Suspect state (two consecutive S-transitions) is the mean time of the mistake recurrence $E(T_{MR})$.

Fig. 3. State space in a crash-recovery run. (a) Fail-free transition. (b) Crash-recovery transition.

However, precisely speaking, the state space of an FDS is $S^c := S \times S^s$, where $S$ is the state space of the target service. Therefore, for a CR-TS with failures, the state space of its FDS increases because the service has more than one state (see Fig. 3b). If the suspicion level is more than two, then $S^c$ will increase as well. The QoS metrics of an FDS are no longer as simple as for fail-free runs.

For a fail-free run ($MTTF \to +\infty$) or a crash-stop run ($MTTR \to +\infty$), the CR-TS's current state $S^{CR\text{-}TS}$ will be Alive for all time up to the crash, and it is easy to deduce the FDS's accuracy $S^A$ directly from the FDS's current state. However, for a crash-recovery run, since the CR-TS could fail or recover at arbitrary time, $S^A$ cannot be deduced solely from the state of the FDS.

Furthermore, compared with a fail-free or crash-stop run, there are more mistake types in a crash-recovery run. In previous work, such as [5], [6], [8], [9], [10], [18], [20], only the mistakes caused by the message transmission behaviors (message delay and loss) are considered. But in a crash-recovery run, a mistake starts whenever the CR-TS's and FDS's states diverge. Thus, there are also mistakes caused by the CR-TS's crash (see $T_F$ in Fig. 1 or $T_M^3$ in Fig. 4c) and recovery (see Fig. 4d) due to the delayed detection of such events. Fig. 4 shows the four types of mistake which could occur within a crash-recovery run. $T_M^1$ in Fig. 4a represents a


mistake caused by a message delay. $T_M^2$ in Fig. 4b represents a mistake caused by a message loss. $T_M^3$ in Fig. 4c represents a mistake caused by the CR-TS's crash, while the FDS still trusts the CR-TS. $T_M^4$ in Fig. 4d represents a mistake caused by the CR-TS's recovery, while the FDS still suspects the CR-TS. A message loss or delay will result in a Suspect-Alive mistake of the FDS (see Fig. 3b). A crash failure will result in a Trust-Crash mistake. A recovery event will result in a Suspect-Alive mistake. Mistakes caused by different reasons will result in different FDS parameter reconfiguration plans. For instance, the best way for the FDS to tolerate more message losses or a longer message delay is to increase the time-out duration; the best way for the FDS to minimize the mistake duration caused by a crash event is to decrease the time-out duration; and the best way to minimize the mistake duration caused by a recovery event is to increase the liveness message sending frequency. Thus, we can see that an inaccurate mistake type identification might reduce the QoS of an FDS and should be avoided.

Fig. 4. The analysis of possible $T_M$ in a crash-recovery run. (a) $T_M^1$. (b) $T_M^2$. (c) $T_M^3$. (d) $T_M^4$.

From the above analysis, we can see that due to the increasing mistake types in a crash-recovery run, the definition of the QoS metrics in [5] using transitions is not valid in a crash-recovery run. Thus, we redefine them as below:

• Detection time ($T_D$): The elapsed time from when the monitored target crashes until the failure detector correctly suspects the monitored target.
• Mistake recurrence time ($T_{MR}$): The time between the occurrence of two consecutive mistakes.
• Mistake duration ($T_M$): The time to correct a mistaken suspect or trust.
• Good period duration ($T_G$): The duration for which the failure detector maintains the correct state information.
• Query accuracy probability ($P_A$): The probability that the state information from the failure detector is correct at an arbitrary time.

The above QoS metrics can measure some QoS aspects of a failure detector in a crash-recovery run. However, they cannot measure how fast a recovery can be detected, the proportion of the detected failures over the occurred failures (completeness), etc. In the following section, we extend the QoS metrics to measure the recovery detection speed and the completeness of a failure detector.

3.3 Extended QoS Metrics for a Crash-Recovery FDS

For an FDS in a crash-recovery run, in addition to the QoS metrics introduced above, we propose some new QoS metrics.

First, in order to measure the speed with which an FDS can discover a recovery of the CR-TS, we define the recovery detection time ($T_{DR}$), a random variable which represents the time that elapses from the CR-TS's recovery time (an R-transition) to the time when the FDS discovers the recovery.

Second, in a crash-recovery run, a CR-TS has no eventual behavior, and a fast recovery could make a failure undetectable by the FDS. Under such circumstances, the completeness property of a failure detector defined in [4] can no longer be satisfied. In order to reflect this situation, we refine the definition of the completeness as follows:

• Strong completeness: Every crash failure of a recoverable process will be detected.
• Weak completeness: A specified proportion of the crash failures of a recoverable process will be detected.

Therefore, in order to measure the completeness property of a crash-recovery FDS, we propose a new QoS metric. The detected failure proportion ($R_{DF}$) is a random variable capturing the ratio of the detected crashes over the occurred crashes ($0 \leq R_{DF} \leq 1$). When no crash failures are detected, $R_{DF} = 0$. When all of the occurring crashes are detected, $R_{DF} = 1$. The strong completeness property of an FDS requires that $E(R_{DF}) = 1$ (where $E$ denotes expectation). The weak completeness property requires that $E(R_{DF}) \geq R_{DF}^{L}$,


where $R_{DF}^{L}$ is the specified lower bound of the detected failure proportion and $0 \leq R_{DF}^{L} \leq 1$.

Overall, the QoS for a crash-recovery FDS can be captured by $P_A$, $T_M$, $T_{MR}$, $T_D$, $T_{DR}$, and $R_{DF}$. In the next section, we will analyze the QoS bounds of the FDS based on the NFD-S algorithm in a crash-recovery run by adopting the proposed basic and extended QoS metrics.

3.4 QoS Estimate of the Crash-Recovery FDS Based on the NFD-S Algorithm

In a crash-recovery run, as the state of a CR-TS can switch between Alive and Crash, these crash or recovery events will force the output of the FDS to be accurate or inaccurate. For analyzing the behavior of the failure detection pair, we want to pick an observation period, which will cover all the events which may possibly occur. In our model, we pick one MTBF period as the observation period. This is because, as we discussed in Section 2.1, in order to study the steady state behavior of a CR-TS throughout its lifetime, we only need to observe the time period between two consecutive regeneration points (recovery times) of the CR-TS and the average duration between the two consecutive regeneration points is MTBF. In the following, we will treat these as also regeneration points of the system consisting of the failure detection pair. This is an approximation made for pragmatic reasons but it can be justified as follows:

Fig. 5 shows the relationship between an FDS and a CR-TS on the interval $t \in [t_0, t_3)$, where both $t_0$ and $t_3$ are regeneration points. Obviously, the mean time between $t_0$ and $t_3$ is the MTBF. We split $[t_0, t_3)$ into three intervals $[t_0, t_1)$, $[t_1, t_2)$, and $[t_2, t_3)$:

• $t_1$ is the time when the FDS detects the transition of the CR-TS from the Crash state to the Alive state.
• $t_2$ is the time when the service crashes. Note that the period $[t_1, t_2)$ is without failures.

Fig. 5. The analysis of the FDS based on the NFD-S algorithm in a crash-recovery run.

Additionally, we define the following times:

• $\sigma_s$ is the first liveness message sending time after a recovery;
• $\sigma_f$ is the sending time of the last liveness message before a crash;
• $\sigma_i$ is the sending time of a liveness message between $\sigma_s$ and $\sigma_f$;
• $\eta$ is the liveness message sending interval;
• $\tau_s$ is the first decision time after recovery;²
• $\tau_i$ is the time of the $i$th freshness point corresponding to $\sigma_i$;
• $\tau_b$ is the last freshness point³ before a crash; and
• $\tau_f$ is the freshness point corresponding to $\sigma_f$.

² The actual arrival time of the first received valid liveness message.
³ The expected arrival time of the liveness message.

Let time-out be the threshold waiting time for the expected arrival of the liveness message before suspecting the CR-TS (time-out $= \tau_i - \sigma_i$ in Fig. 5). Let $t_r^m$ ($m \geq 1$) be a recovery time of the current MTBF period (see Fig. 5). Then in our model, the key thing for the QoS bounds analysis is to derive the average number of mistakes that will happen between the $m$th and $(m+1)$th recovery times, and the average duration of each mistake. We make the following definitions as extensions of Definition 1 in [5]:

Definition 2. For the fail-free duration $[t_1, t_2)$ within each MTBF period:

1. $k$: for any $i \geq 1$, let $k$ be the smallest integer such that for all $j \geq i + k$, $m_j$ is sent at or after time $\tau_i$, where $m_j$ is the $j$th heartbeat message.⁴
2. For any $i \geq 1$, let $p_j^i(x)$ be the probability that the FDS does not receive the $(i+j)$th message $m_{i+j}$ by time $\tau_i + x$, for every $j \geq 0$ and every $x \geq 0$; let $p_0^i = p_0^i(0)$.
3. For any $i \geq 2$, let $q_0^i$ be the probability that the FDS receives message $m_{i-1}$ before time $\tau_i$.
4. For any $i \geq 1$, let $u_i(x)$ be the probability that the FDS suspects the CR-TS at time $\tau_i + x$, for every $x \in [0, \eta)$.
5. $p_s^i$: for any $i \geq 2$, let $p_s^i$ be the probability that an S-transition occurs at time $\tau_i$.

⁴ $k$ is assumed to be independent of $i$ approximately. In fact, in a crash-recovery run, $k$ is not completely independent of $i$. However, if the CR-TS will remain alive for a reasonable duration, $k$ will be almost independent of $i$ except for the last few messages before the CR-TS crashes.

According to the QoS analysis of the NFD-S algorithm in Proposition 3 in [5], we now analyze the basic QoS metrics of the FDS based on the NFD-S algorithm in a crash-recovery run and show the following relations hold:

Proposition 1.

1. $k = \lceil \text{time-out} / \eta \rceil$.
2. For all $j \geq 0$ and for all $x \geq 0$,
$$p_j^i(x) = \big(p_L + (1 - p_L) \cdot \Pr(D > \text{time-out} + x - j\eta)\big) \cdot \Pr\big(X_a > \tau_i - t_r^m + x\big).$$
3. $$q_0^i = (1 - p_L) \cdot \Pr(D < \text{time-out} + \eta) \cdot \Pr\big(X_a > \sigma_{i-1} - t_r^m\big).$$
4. For all $x \in [0, \eta)$, $u_i(x) = \prod_{j=0}^{k} p_j^i(x)$.
5. $p_s^i = q_0^i \cdot u_i(0)$.

In Proposition 1, the bounds of each QoS metric are derived based on the analysis of the average number of possible mistakes within the distinct intervals $[t_0, t_1)$, $[t_1, t_2)$, and $[t_2, t_3)$. In consequence, the following theorem holds and can be used to estimate the FDS's parameters or QoS bounds within a crash-recovery run:

Theorem 1. The crash-recovery FDS based on the NFD-S algorithm has the following properties:


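$$\text{MTBF} \ge E(T_{MR}) \ge \frac{\text{MTBF}}{\left(\left\lfloor \frac{\text{MTTF} - E(T_{DR})}{\eta} \right\rfloor + 1\right) \cdot p_s^i + \left\lceil \frac{E(D)}{\eta} \right\rceil + 2}. \tag{1}$$

If $X_c > \eta + \text{time-out}$, then

$$\frac{\text{MTBF}}{2} \ge E(T_{MR}) \ge \frac{\text{MTBF}}{\left(\left\lfloor \frac{\text{MTTF} - E(T_{DR})}{\eta} \right\rfloor + 1\right) \cdot p_s^i + \left\lceil \frac{E(D)}{\eta} \right\rceil + 2}, \tag{2}$$

$$P_A \ge 1 - \frac{E(T_D) + E(T_{DR}) + \frac{\text{MTTF} - E(T_{DR})}{\eta} \cdot \int_0^{\eta} u^i(x)\,dx}{\text{MTBF}}, \tag{3}$$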
$$E(T_M) \le \frac{E(T_{DR}) + \frac{\text{MTTF} - E(T_{DR})}{\eta} \cdot \int_0^{\eta} u^i(x)\,dx + E(T_D)}{\left(\left\lfloor \frac{\text{MTTF} - E(T_{DR})}{\eta} \right\rfloor + 1\right) \cdot p_s^i + 1}, \tag{4}$$

$$E(T_{DR}) = E(D) + \eta \cdot E(X_L), \tag{5}$$

$$E(R_{DF}) \ge \Pr(X_c > \eta + \text{time-out}). \tag{6}$$

Details of the proof of the theorem can be found in [29] and Appendix C.2.

When the monitored target is fail-free or crash-stop (see footnote 5), for the basic QoS metrics in [5], applying (1)-(4) of Theorem 1, we can easily deduce that

$$E(T_{MR}) \ge \frac{\eta}{p_s^i}, \tag{7}$$

$$E(T_M) \le \frac{1}{p_s^i} \cdot \int_0^{\eta} u^i(x)\,dx \le \frac{\eta}{q_0^i}, \tag{8}$$

$$P_A \ge 1 - \frac{1}{\eta} \cdot \int_0^{\eta} u^i(x)\,dx. \tag{9}$$

Thus, $E(T_{MR})$, $E(T_M)$, and $P_A$ reduce exactly to the QoS analysis results in [5] (see Appendix C.4 for the details of the proof sketch). We can conclude that, in terms of failure detection, a fail-free run or a crash-stop run with MTTF tending to infinity is a particular case of a crash-recovery run. If the monitored target's MTTF is not sufficiently long and the target is recoverable, then the impact of its dependability must also be taken into consideration. In the following section, we will introduce how to configure the crash-recovery FDS according to the QoS bounds we have derived from Theorem 1.

Footnote 5: The precrash duration of the crash-stop process is a long run.

3.5 The Configuration of the Crash-Recovery FDS Based on the NFD-S Algorithm

For crash failure detectors, it is crucial to select suitable input parameters (such as the liveness message intersending interval and the time-out duration) to satisfy a given set of QoS requirements. In this section, we will show how to do so in a crash-recovery run based on the NFD-S algorithm. In a crash-recovery run, an assumption that the sequence numbers of the heartbeat messages are continually increasing after every recovery of the CR-TS is needed to ensure that the NFD-S algorithm is still valid after each recovery. However, without persistent storage to snapshot the runtime information frequently, when a crash failure occurs, all of the current runtime information might be lost. Thus, continuously increasing the heartbeat sequence number cannot be guaranteed.

Since the NFD-S algorithm assumes that the local clocks of the FDS and the CR-TS are synchronized, we can compare the sending times of heartbeat messages instead of the heartbeat sequence numbers in the algorithm. Then, for a crash-recovery FDS, if the QoS requirements of the FDS are given, the configuration procedure is illustrated in Fig. 6.

Fig. 6. The extended FDS configuration based on the NFD-S algorithm in a crash-recovery run.

Initially, we can assume that the QoS of message communication is perfect (e.g., $p_L = 0$, $E(D)$ is small, and $E(X_L) = 0$), and the CR-TS is fail-free. As the monitoring procedure continues, the estimation of the QoS of message communication and the dependability metrics of the CR-TS will become more accurate. Thus, the FDS will be reconfigured to adapt to changing input parameters, which helps to better estimate $\eta$ and time-out.

Then, for given QoS requirements, expressed as bounds, the following inequalities need to be satisfied, where a superscript $U$ denotes an upper bound and a superscript $L$ denotes a lower bound:

$$T_D \le T_D^U, \quad E(T_{MR}) \ge T_{MR}^L, \quad P_A \ge P_A^L, \quad E(T_M) \le T_M^U, \quad E(T_{DR}) \le T_{DR}^U, \quad E(R_{DF}) \ge R_{DF}^L. \tag{10}$$

From Theorem 1, we can estimate the parameters ($\eta$ and time-out) of the NFD-S algorithm according to the following inequalities:

$$\eta + \text{time-out} \le T_D^U, \quad \eta > 0, \tag{11}$$

$$\frac{\text{MTBF}}{\left(\left\lfloor \frac{\text{MTTF} - E(T_{DR})}{\eta} \right\rfloor + 1\right) \cdot p_s^i + \left\lceil \frac{E(D)}{\eta} \right\rceil + 2} \ge T_{MR}^L, \tag{12}$$

$$1 - \frac{E(T_D) + E(T_{DR}) + \frac{\text{MTTF} - E(T_{DR})}{\eta} \cdot \int_0^{\eta} u^i(x)\,dx}{\text{MTBF}} \ge P_A^L, \tag{13}$$

$$\frac{E(T_{DR}) + \frac{\text{MTTF} - E(T_{DR})}{\eta} \cdot \int_0^{\eta} u^i(x)\,dx + E(T_D)}{\left(\left\lfloor \frac{\text{MTTF} - E(T_{DR})}{\eta} \right\rfloor + 1\right) \cdot p_s^i + 1} \le T_M^U, \tag{14}$$


$$E(D) + \eta \cdot E(X_L) \le T_{DR}^U, \tag{15}$$

$$\Pr(X_c > \eta + \text{time-out}) \ge R_{DF}^L. \tag{16}$$

Then, the task of the NFD-S algorithm is to find the largest $\eta$ satisfying inequalities (12)-(15) and, if such an $\eta$ exists, to find the largest time-out that satisfies $\eta + \text{time-out} \le T_D^U$ and $\Pr(X_c > \eta + \text{time-out}) \ge R_{DF}^L$. This can be done in the following steps:

Step I. If $T_{MR}^L \le \text{MTBF}$, continue; else the QoS of the FDS cannot be achieved.

Step II. Find the largest $\eta$ that satisfies the inequalities (12)-(15); if no such $\eta$ exists, the QoS cannot be achieved.

Step III. If $\eta > 0$, find the largest time-out such that $\text{time-out} \le T_D^U - \eta$ and $\Pr(X_c > \eta + \text{time-out}) \ge R_{DF}^L$.

From the above steps, the estimation of $\eta$ and time-out for a crash-recovery FDS based on the NFD-S algorithm amounts to finding a numerical solution for the inequalities (11)-(16). This can be done using binary search, similarly to the approach outlined in [5]. But the estimation of the input parameters of the configuration becomes more difficult because parameters such as $E(X_L)$, MTTF, and MTTR are needed. How to estimate these parameters will be discussed in Section 4.

Note that for this configuration procedure, choosing a different message transmission protocol (e.g., TCP or UDP) can imply different QoS for message communication. Thus, this new configuration can be more adaptive to the message transmission characteristics. For example, if the message loss probability or message delay is high for a certain protocol, then the FDS can switch to a more reliable protocol to achieve a better QoS without increasing the communication frequency or the time-out length.

In the next section, we will discuss how to estimate the QoS of message transmission and the dependability metrics of the CR-TS.

4 PARAMETER ESTIMATION

In the previous section, we explained how to configure a crash-recovery FDS. However, for this procedure, several input parameters are needed (see Fig. 6). In this section, we will show how to estimate these input parameters for an FDS configuration.

4.1 Dependability Metrics Estimation for the CR-TS

From the CR-TS modeling in Section 2, we see that there is an intimate relationship between the MTTF, MTTR, and MTBF and the QoS of the FDS. In order to estimate these dependability metrics, we only need to estimate the crash and recovery times of the CR-TS. We assume that the clocks of the FDS and the CR-TS are synchronized. Let $t_r^1$ be the CR-TS's first start time; then, for $m \ge 1$, $t_r^m$ represents the $m$th recovery time; $t_{dr}^m$ represents the $m$th recovery detection time; $t_c^m$ represents the $m$th crash time; and $t_d^m$ represents the $m$th crash detection time (see Fig. 7). $t_r^m$ can be saved to the persistent storage by the CR-TS after a recovery has completed (see [29]). $t_d^m$ can be recorded by the FDS when a failure is detected. $E(T_D)$ can be estimated by using $\frac{1}{n}\sum_{m=1}^{n}(t_d^m - t_c^m)$ when $t_c^m$ is known. Actually, $t_c^m$ can be estimated by saving the latest successful message sending time $\sigma_l$ in the persistent storage. If a crash event is uniformly distributed on $[\sigma_l, \sigma_l + \eta)$, then after a recovery has completed, the average $t_c^m$ can be estimated by $t_c^m = \sigma_l + \frac{\eta}{2}$. Notice that a smaller message intersending time ($\eta$) can result in a more accurate $t_c^m$ estimate. Then, the CR-TS's MTBF, MTTF, MTTR, and the probability that the CR-TS has not crashed up to time $\tau_i + x$ since its last recovery, $\Pr(X_a > \tau_i + x - t_r^m)$, can be estimated as follows:

Fig. 7. Dependability metrics estimation.

Estimate MTBF. From the definition of MTBF, we know that MTBF is only related to the CR-TS's recovery times $t_r^m$. These recovery times can be obtained by adopting the recovery time estimation methods proposed in [29]. Thus, MTBF can be estimated as below:

$$\text{MTBF} = E\bigl(t_r^{m+1} - t_r^m\bigr) = \frac{1}{n}\sum_{m=1}^{n}\bigl(t_r^{m+1} - t_r^m\bigr). \tag{17}$$

Estimate MTTF. MTTF can be estimated by using the recovery time ($t_r^m$) and the crash detection time ($t_d^m$), as $E(t_d^m - t_r^m) = \text{MTTF} + E(T_D)$. Then,

$$\text{MTTF} = E\bigl(t_d^m - t_r^m\bigr) - E(T_D) = \frac{1}{n}\sum_{m=1}^{n}\bigl(t_d^m - t_r^m\bigr) - E(T_D). \tag{18}$$

Estimate MTTR. MTTR can be estimated directly from MTBF and MTTF as $\text{MTTR} = \text{MTBF} - \text{MTTF}$, or by using $t_r^{m+1}$ and $t_d^m$. In the latter case, $E(t_r^{m+1} - t_d^m) = \text{MTTR} - E(T_D)$. Then,

$$\text{MTTR} = E\bigl(t_r^{m+1} - t_d^m\bigr) + E(T_D) = \frac{1}{n}\sum_{m=1}^{n}\bigl(t_r^{m+1} - t_d^m\bigr) + E(T_D). \tag{19}$$

Estimate $\Pr(X_a > \tau_i + x - t_r^m)$. When the probability density function $f_a(x)$ or the probability distribution function $F_a(x)$ of $X_a$ is known, the probability that the CR-TS does not crash until $\tau_i + x$ after its last recovery can be estimated as

$$\Pr\bigl(X_a > \tau_i + x - t_r^m\bigr) = 1 - \int_0^{\tau_i + x - t_r^m} f_a(x)\,dx = 1 - F_a(x)\Big|_0^{\tau_i + x - t_r^m}. \tag{20}$$

When $x = 0$, we obtain that

$$\Pr\bigl(X_a > \tau_i - t_r^m\bigr) = 1 - \int_0^{\tau_i - t_r^m} f_a(x)\,dx = 1 - F_a(x)\Big|_0^{\tau_i - t_r^m}. \tag{21}$$
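Equations (17)-(19) are plain sample means over the recorded timestamps, so they can be sketched in a few lines of Python. The function and argument names below are hypothetical and do not come from the paper; the sketch assumes that the recovery times $t_r^1, \ldots, t_r^{n+1}$ and the crash detection times $t_d^1, \ldots, t_d^n$ are already available as lists and that $E(T_D)$ has been estimated separately. The last helper evaluates (20) under the additional assumption that $X_a$ is exponentially distributed with mean MTTF, which is the distribution later used in the simulations.

```python
import math


def estimate_dependability(recovery_times, crash_detection_times, mean_detection_time):
    """Estimate (MTBF, MTTF, MTTR) following equations (17)-(19).

    recovery_times:        [t_r^1, ..., t_r^(n+1)]  (hypothetical input layout)
    crash_detection_times: [t_d^1, ..., t_d^n]
    mean_detection_time:   an estimate of E(T_D)
    """
    n = len(crash_detection_times)
    if n == 0 or len(recovery_times) != n + 1:
        raise ValueError("need n crash detections and n + 1 recovery times")

    # (17): mean gap between consecutive recoveries.
    mtbf = sum(recovery_times[m + 1] - recovery_times[m] for m in range(n)) / n
    # (18): mean gap from recovery to crash detection, minus the detection delay.
    mttf = (sum(crash_detection_times[m] - recovery_times[m] for m in range(n)) / n
            - mean_detection_time)
    # (19): mean gap from crash detection to the next recovery, plus the detection delay.
    mttr = (sum(recovery_times[m + 1] - crash_detection_times[m] for m in range(n)) / n
            + mean_detection_time)
    return mtbf, mttf, mttr


def survival_probability_exponential(elapsed, mttf):
    """Equation (20) for an assumed exponential X_a: Pr(X_a > elapsed) = 1 - F_a(elapsed)."""
    if elapsed < 0 or mttf <= 0:
        raise ValueError("elapsed must be non-negative and mttf positive")
    return math.exp(-elapsed / mttf)


if __name__ == "__main__":
    # Two observed cycles with made-up timestamps and E(T_D) = 2.
    print(estimate_dependability([0.0, 105.0, 210.0], [102.0, 207.0], 2.0))
    print(survival_probability_exponential(100.0, 100.0))
```

Because (18) and (19) subtract and add the same $E(T_D)$ term, the three estimates satisfy $\text{MTTR} = \text{MTBF} - \text{MTTF}$ by construction, which matches the alternative MTTR estimate mentioned above.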


When the probability density function $f_a(x)$ and the probability distribution function $F_a(x)$ of $X_a$ are unknown, an empirical distribution function (EDF) estimation method can be adopted to estimate $f_a(x)$ or $F_a(x)$. In addition, $\Pr(X_a > \tau_i + x - t_r^m)$ is used to estimate the probability that an S-transition happens on $[t_1, t_2)$ (see Proposition 1), which is used to count the average number of mistakes in that period. If we maximize $\Pr(X_a > \tau_i + x - t_r^m)$, then a maximum average number of mistakes on $[t_1, t_2)$ will be obtained. Therefore, we will get stricter QoS bound estimates for $P_A$, $T_M$, and $T_{MR}$. Thus, we can adopt $i = 1$ and $x = 0$ to simplify the estimation of $\Pr(X_a > \tau_i + x - t_r^m)$. Notice that the above method is only for the strict bound estimation rather than an optimized estimation.

4.2 Message Loss Length Estimation

As discussed earlier, the parameters related to message transmission are the average message delay ($E(D)$), the probability of message loss ($p_L$), and the consecutive message loss number $X_L$ (see Fig. 6). Since the estimation of $p_L$ and $E(D)$ is straightforward and has been introduced in many other papers such as [5], we do not discuss it here. The additional parameter $X_L$ captures the bursty message loss behavior. In this section, we propose a basic estimation method for $X_L$, assuming independent message transmissions.

Lemma 3. If each message's transmission and loss behavior is independent, then the mean number of consecutive message losses is

$$E(X_L) = \frac{p_L\,(1 - p_L^M)}{1 - p_L} - M p_L^{M+1},$$

where $M$ is the maximum number of consecutive messages lost and $p_L$ is the probability that each message is lost during the transmission.

The proof can be found in [29].

Remark 1. When $M \to +\infty$ and $0 \le p_L < 1$, then $p_L^M \to 0$ and $M p_L^M \to 0$, and we obtain that

$$E(X_L) = \frac{p_L}{1 - p_L}.$$

From the above lemma, we see that if each liveness message's transmission is independent, $E(X_L)$ depends only on $p_L$ and can be computed straightforwardly.

4.3 The Impact of Service Dependability Metrics on the QoS of the FDS

A thorough analysis of the impact of the service dependability metrics on the QoS of the FDS has been presented in [16]. Here, we only highlight the main observations.

4.3.1 The Impact on $T_M$ and $T_D$

Generally, for an FDS, the time-out length governs the failure detection speed because the FDS makes its decision at the time-out points. As the time-out length decreases, the FDS will make faster, but less accurate, decisions. As time-out increases, $T_D$ becomes longer but the FDS can tolerate more message delays or losses, which can improve the detection accuracy to some extent. For a CR-TS, continually increasing the time-out length may mean that failures become undetectable, because its recovery duration could be shorter than $T_D$. Thus, $E(T_M)$ will not increase beyond the recovery duration, MTTR (see footnote 6).

Footnote 6: Assuming that $p_L$ and $D$ are not very large and $\text{MTTR} \gg \eta$.

4.3.2 The Impact on $T_{MR}$

For a fail-free run, Chen et al. showed that when the time-out length increases linearly, $T_{MR}$ increases exponentially (Fig. 12 in [5]). This implies that for such systems, an arbitrary level of $T_{MR}$ can be achieved. Roughly speaking, in a fail-free run, when time-out increases to $n \times \eta$ ($n \in \mathbb{Z}^+$ and $n \ge 1$), the FDS can tolerate around $n$ consecutive communication message losses. The mistake recurrence which is caused by message latency or loss decreases rapidly with $\frac{1}{P^n}$, where

$$P = p_L + (1 - p_L) \cdot \Pr(\text{time-out} < \text{Delay} < +\infty).$$

For a crash-recovery run, mistakes may occur on both crash and recovery (see Fig. 3b), since message transmission latency will delay the detection of the CR-TS's state change. These mistakes are inevitable. This means that the upper bound on $T_{MR}$ is governed by MTTF and MTTR (see inequalities (1)-(2) in Theorem 1). Even if all message delays and losses can be tolerated, $E(T_{MR})$ cannot increase to an arbitrary level when MTTF is not $+\infty$ and MTTR is not $+\infty$ or 0. If failure is detectable, $E(T_{MR})$ cannot exceed $\frac{\text{MTBF}}{2}$, since for each MTBF duration there will be at least two mistakes, corresponding to the two changes of state in the CR-TS. When failure is undetectable, mistakes may happen at the CR-TS's crash or recovery time. Then, $E(T_{MR})$ cannot exceed MTBF. Thus, after $E(T_{MR})$ reaches $\frac{\text{MTBF}}{2}$, the overall $E(T_{MR})$ approaches MTBF gradually.

4.3.3 The Impact on $P_A$

$P_A$, the proportion of time that the FDS is not in a mistake state, will depend on the ratio of $E(T_M)$ and $E(T_{MR})$ ($P_A = 1 - \frac{E(T_M)}{E(T_{MR})}$ in [5]). If a service is fail-free, $P_A$ can rapidly approach 1. But in a crash-recovery run, when the time-out length is increased, both $E(T_M)$ and $E(T_{MR})$ will eventually reach their upper bounds, meaning that $P_A$ will also be bounded. Generally, as time-out increases, fewer failures will be detected and the mistakes caused by failures (see $T_M^3$ in Fig. 4c) will have more impact on $E(T_M)$; thus, $E(T_M)$ will approach MTTR, since the maximum length of $E(T_M^3)$ is MTTR. As the time-out length becomes larger with respect to MTTR, more failures become undetectable. Thus, $E(T_M)$ will gradually approach MTTR.

The speed of increase of $T_{MR}$ will depend on when $T_{MR}$ reaches $\frac{\text{MTBF}}{2}$. Before this bound is reached, as the time-out length increases, $T_{MR}$ can increase exponentially fast, as more message losses can be tolerated. After $T_{MR}$ exceeds $\frac{\text{MTBF}}{2}$, it can only increase gradually to MTBF, as time-out increases and more and more crashes become undetectable. Thus, when $T_{MR}$ reaches its upper bound but $T_M$ has not yet reached its upper bound, $P_A$ will decrease as the time-out length increases. When both $T_M$ and $T_{MR}$ reach their upper bounds, $P_A$ will approach $\frac{\text{MTTF}}{\text{MTBF}}$, which is equal to the availability of the CR-TS.

5 SIMULATION EVALUATION AND ANALYSIS

In previous sections, we have shown how to calculate the parameters of the FDS with a given set of QoS requirements and analyzed the QoS bounds of the crash-recovery FDS based on the NFD-S algorithm. In this section, we introduce


our analytical and simulation results, which verify our previous analysis work.

5.1 Evaluation of the Crash-Recovery FDS Based on the NFD-S Algorithm

For the simulation studies, we fix the heartbeat interval at $\eta = 1$ and gradually increase the time-out length. The message transmission parameters are $p_L = 0.01$ and $E(D) = 0.02$, and the delay is assumed to be exponentially distributed. These settings are similar to those used in the simulations in [5].

The CR-TS is defined as a recoverable process with various values of MTTF and MTTR (exponentially distributed). We choose the exponential distribution for the following reasons. First, exponential failures are widely adopted for reliability analysis in many practical systems; second, unlike some heavy-tailed distributions such as the log-normal distribution, crashes and recoveries with an exponential distribution will occur with reasonable interarrival times, avoiding the CR-TS behaving like a fail-free or crash-stop process.

5.1.1 Analysis for the Basic QoS Metrics

We implemented the NFD-S algorithm presented in [5] to evaluate the QoS of the FDS and compared the results with the analytical results derived from Theorem 1. Figs. 8, 9, and 10 compare the QoS of the FDS based on the NFD-S algorithm (simulation results) and the corresponding analytical results from different perspectives. From these three figures, we have the following observations.

Fig. 8. The NFD-S algorithm: $E(T_M)$.

Fig. 9. The NFD-S algorithm: $E(T_{MR})$.

Fig. 8 presents the $E(T_M)$ of the FDS derived from simulation and analytical results for two values of MTTR, 5 and 50, with corresponding values of MTTF, 100 and 1,000. The simulation result for $\text{MTTR} = 5$ shows that as the time-out length increases, $E(T_M)$ will tend to MTTR, i.e., $E(T_M)$ is bounded by MTTR. With the exponentially distributed MTTR used in the simulation, the proportion of detectable crashes will decrease more gradually. Thus, $E(T_M)$ approaches MTTR more slowly than in the analytical results.

Simulation results for $\text{MTTR} = 50$ confirm that if MTTR becomes large, as the time-out length increases, $E(T_M)$ can also grow large, since the bound is now large. Note that in the graph, we see only the linear part rather than the complete characteristics. If the time-out length were increased to 200, $E(T_M)$ would approach $\text{MTTR} = 50$ closely.

An interesting phenomenon is visible in the graph as time-out increases from 0.5 to 1.1: $E(T_M)$ decreases (or increases more slowly), and then increases again. We analyze this phenomenon in detail as follows. Recall that for a given length of time-out, there are four aspects which have an impact on $T_M$: the message delay and loss, and the CR-TS's crash and recovery (see Fig. 4). $T_M$ caused by a message delay is governed by the ratio between $E(D)$ and $T_D$. For the same $E(D)$, as time-out increases, more delayed messages can be tolerated. Thus, $T_M$ caused by a message delay ($T_M^1$) will decrease and occur less frequently. $T_M$ caused by a message loss ($T_M^2$) is related to $\eta$, $p_L$, $E(D)$, and the time-out length. For constant message communication QoS (i.e., fixed $p_L$ and $E(D)$), $T_M$ caused by message loss is governed by the ratio between $\eta$ and $T_D$. Since more message losses can be tolerated as the time-out length increases, the average duration of $T_M^2$ will decrease, and $T_M^2$ will occur less frequently. $T_M$ caused by a crash ($T_M^3$) is mainly governed by $T_D$ (see Fig. 4c), because if a crash occurs, a false positive mistake will last until the time-out time or until the CR-TS recovers. For detectable crashes, as the time-out length increases, $T_M^3$ will increase. $T_M$ caused by a recovery ($T_M^4$) is mainly governed by $p_L$ and $E(D)$ (see Fig. 4d), since after the CR-TS's recovery, a recovery can be detected as soon as a valid liveness message is received.

From the above analysis, we know that for the same $\eta$, $p_L$, $E(D)$, MTTF, and MTTR, when the time-out length increases, the average mistake duration caused by message delays and message losses will decrease ($T_M^1 \downarrow$ and $T_M^2 \downarrow$), the average mistake duration caused by the CR-TS's crash will increase ($T_M^3 \uparrow$), and the average mistake caused by the CR-TS's recovery from a detectable crash is unaffected ($T_M^4$), but fewer crashes and recoveries will be detected. In the simulation with $p_L = 0.01$ and $\text{MTBF} = 105$, when time-out is small, $T_M^2$ and $T_M^3$ occur with similar frequency. When time-out increases from 0.5 to 1.0 (the FDS can tolerate zero message loss and most message delays), $E(T_M)$ increases slowly because the impacts of $T_M^1 \downarrow$, $T_M^2 \downarrow$, $T_M^3 \uparrow$, and $T_M^4$ counterbalance. Overall, $E(T_M)$ is stable within this period. As the time-out length increases, $T_M^2$ will occur less frequently. But $T_M^3$ occurs every MTBF period. Thus, as

the time-out increases, $T_M^3$ will dominate and $E(T_M)$ will increase gradually.

In the simulation with $p_L = 0.01$ and $\text{MTBF} = 1{,}050$, when the time-out length is small, $T_M^2$ will have more impact than $T_M^3$, because $T_M^2$ occurs more frequently than the crash and recovery. Therefore, as the time-out length increases, the average duration of $T_M^2$ decreases and $T_M^2$ occurs less frequently; $E(T_M)$ will increase more slowly or even decrease, since more message losses are tolerated. But if time-out continues to increase, $T_M^3$ will become dominant and $E(T_M)$ will then increase gradually.

Overall, Fig. 8 shows that in a crash-recovery run, $E(T_M)$ exhibits quite different characteristics from a fail-free or crash-stop run. If the message delay and the probability of message loss are not very large, $E(T_M)$ is bounded by MTTR. From Fig. 8, we also observe that $E(T_M)$ can possibly be decreased for some time-out values. Unlike in a fail-free run, continually increasing the time-out length cannot achieve a better $E(T_M)$.

Fig. 9 presents $E(T_{MR})$ of the FDS derived analytically and from simulation with exponential MTTF and MTTR as above. We can see that with constant time-out length, as MTBF increases, $E(T_{MR})$ also increases. This implies that $E(T_{MR})$ is greatly impacted by the dependability of the CR-TS.

We can also see that for both these simulation cases, $E(T_{MR})$ initially increases exponentially fast, but after $E(T_{MR})$ reaches $\frac{\text{MTBF}}{2}$, the rate of increase is reduced. For the CR-TS with exponential MTTR, $E(T_{MR})$ will increase gradually and approach MTBF, until all crashes become undetectable. This is because for nondeterministic MTTR, as the time-out length increases, the proportion of detectable crashes decreases. Therefore, for the detectable crashes, $T_{MR} \le \frac{\text{MTBF}}{2}$, and for the undetectable crashes, $T_{MR} \le \text{MTBF}$. Thus, $E(T_{MR})$ will increase gradually within $[\frac{\text{MTBF}}{2}, \text{MTBF}]$ and finally stabilize at MTBF. All of these results match our analysis in Section 4.3 well and indicate that if a CR-TS is not fail-free ($\text{MTTF} \to \infty$) or crash-stop ($\text{MTTR} \to \infty$), $E(T_{MR})$ will be bounded by MTBF when failures are undetectable and by $\frac{\text{MTBF}}{2}$ when failures are detectable.

Fig. 10 considers $P_A$ under the same communication QoS. We see that when MTBF increases, $P_A$ will be improved. This is because $E(T_{MR})$ also increases. Thus, from the equation $P_A = 1 - \frac{E(T_M)}{E(T_{MR})}$, we know that for the same time-out length, when MTBF increases, a better $P_A$ can be achieved.

Fig. 10. The NFD-S algorithm: $P_A$.

However, from Fig. 10, we can also see that as the time-out length increases, $P_A$ is not always increasing as in a fail-free or crash-stop run. Continually increasing time-out could decrease $P_A$. This is because $T_{MR}$ is bounded by MTBF or $\frac{\text{MTBF}}{2}$, as discussed above. After $E(T_{MR})$ reaches $\frac{\text{MTBF}}{2}$, it increases slowly rather than exponentially fast, but $E(T_M)$ increases linearly and faster than $E(T_{MR})$. Thus, $P_A$ decreases, and finally, $P_A$ will approach $\frac{\text{MTTF}}{\text{MTBF}}$, which is equal to the availability of the CR-TS.

The above results indicate that for a highly available CR-TS, a reasonable QoS for the FDS can be achieved even if the FDS always trusts the CR-TS, when only the QoS metrics defined in [5] are considered. This is especially true for a highly available and highly consistent but not highly reliable CR-TS. However, the completeness property of the FDS will not be satisfied. Consequently, these simulation results demonstrate the necessity of the additional QoS metrics we proposed in Section 3.3 to measure the completeness aspects and the speed of the recovery detection of a crash-recovery FDS. Furthermore, these results also demonstrate the necessity of adopting the recovery detection protocols in [29], which can improve the proportion of detected failures without reducing other QoS aspects.

In Figs. 8, 9, and 10, we can also observe how the dependability of a CR-TS can influence the QoS of the FDS. Particularly, for a highly available but not highly reliable CR-TS, the dependability of the CR-TS can have more impact than the performance of the algorithm and the QoS of message transmission. In such situations, the dependability of the CR-TS must be taken into account for the FDS design and implementation.

From Figs. 8, 9, and 10, we can see that $P_A$, $E(T_{MR})$, and $E(T_M)$ have bounds. Continually increasing the time-out length might not be a reasonable way to achieve better $P_A$, $E(T_{MR})$, and $E(T_M)$. A potential trade-off exists between the QoS metrics. For instance, for the NFD-S algorithm, $\text{time-out} \in (1, 1.1)$ ($\text{time-out} + \eta \in [2, 2.1]$) might achieve the best overall QoS.

In addition, $E(T_M)$ in a crash-recovery run exhibits quite different characteristics compared with a fail-free or crash-stop run. This is because in a crash-recovery run, the mistakes caused by the crash and recovery are taken into consideration, which means continually increasing the time-out length will not always decrease $E(T_M)$. It may have the effect of increasing false positive mistakes ($T_M^3$, see Fig. 4). As the time-out length increases, mistakes caused by message delays and losses will occur less frequently, and false positive mistakes (which were not considered previously) will start to dominate the QoS of the FDS.

From Figs. 8, 9, and 10, we can observe that the simulation results of $E(T_M)$ are smaller than the analytical results, and the simulation results of $E(T_{MR})$ and $P_A$ are larger than the analytical results, which indicates that the bound analysis of the basic QoS metrics in Theorem 1 is valid and the simulation results satisfy the QoS requirements according to the analysis. We can also observe a gap between the analytical and simulation results. This is caused by the overestimation or underestimation of some values within the analytical results. $E(T_M)$ is overestimated


by using the total mistake duration over the underestimated average number of mistakes that might occur within a crash-recovery period. Thus, the analytical results of $E(T_M)$ will be larger than the simulation results. Similarly, $E(T_{MR})$ is underestimated by using the observation duration (MTBF) over an overestimation of the number of mistakes that might occur within a period. For instance, the number of mistakes within the period is estimated as $\lceil \frac{E(D)}{\eta} \rceil + 1$, which is an upper bound rather than the average number. It follows that $E(T_{MR})$ of the analytical results will be smaller than the simulation results. Finally, $P_A$ is underestimated by using one minus an overestimated total mistake duration over the observation period (MTBF). Thus, $P_A$ of the analytical results will be smaller than the simulation results. All of these results satisfy the QoS requirements $E(T_M) \le T_M^U$, $P_A \ge P_A^L$, and $E(T_{MR}) \ge T_{MR}^L$. In addition, according to the NFD-S algorithm, the failure detection time $T_D$ is bounded by $\eta + \text{time-out}$ regardless of the correctness of the detection; thus, $T_D \le T_D^U$ must be satisfied.

From Figs. 8, 9, and 10, we can also see that there are some gaps between the analytical results and the simulation results. This is mainly caused by the overestimating and underestimating method we adopted to restrict the failure detector's QoS bound, as discussed above. In addition, we use MTBF, MTTF, and MTTR, which are the expected values rather than the real values for each failure and recovery. In the simulation, the results are calculated according to the randomly generated failure time and recovery time, which represent the real time to failure and recovery, and these random variables will deviate from the expected values. Thus, there will be some discrepancies between the simulation and analytical results. These gaps show that there is still room to improve the accuracy of the model, and it would be interesting to investigate this point further in the future.

5.1.2 Analysis for the Extended QoS Metrics

We also plot the simulation and analytical results for the failure detection proportion ($R_{DF}$) defined in Section 3.3 to demonstrate the impact of the failure and recovery events on this metric.

Fig. 11 shows the proportion of failures detected by the FDS, for different dependability characteristics of the CR-TS, based on both simulation and analytical results. As the time-out length increases, $E(R_{DF})$ of the NFD-S algorithm decreases. When MTTR becomes shorter, $E(R_{DF})$ will decrease faster. This is because the smaller MTTR is, the sooner $\text{time-out} + \eta$ crosses MTTR ($T_D^U \ge \text{MTTR}$). Therefore, more crashes remain undetected when the NFD-S algorithm is adopted. In Fig. 11, we can also see that the simulation results of $E(R_{DF})$ are larger than the analytical results, which means that the bound analysis of $E(R_{DF})$ is valid and the simulation results satisfy the QoS requirements in terms of $R_{DF}^L$. However, since most existing failure detection algorithms increase the time-out length to tolerate more message losses and delays, if a CR-TS is recoverable and recovers fast, it could be difficult for these algorithms to achieve the QoS in [5] and satisfy the completeness property at the same time. In such a situation, the recovery detection protocol introduced in [29] can be adopted, which can solve this problem reasonably well.

Fig. 11. The NFD-S algorithm: $E(R_{DF})$.

6 CONCLUSION

In this paper, the crash-recovery target and its failure detector are modeled as stochastic processes. We redefined previously proposed QoS metrics to be applicable to crash-recovery failure detection and introduced some new metrics to measure the recovery detection speed and the completeness property of a failure detector. We also discussed the impact of the monitored target's crash-recovery behavior on each QoS metric and showed that if a failure detector's parameters are to be accurately estimated, these dependability characteristics must be taken into account. Thus, we showed how to configure the failure detector to satisfy a given set of requirements based on the dependability characteristics in addition to the QoS of message transmission (see Fig. 12). This was based on the NFD-S algorithm [5]. Our analysis shows that the QoS analysis in [5] is a particular case of a crash-recovery run. Furthermore, we discussed how to estimate the input parameters for the algorithm.

Fig. 12. The QoS relationship between communication, CR-TS, and FDS.

Finally, the plotted simulation and analytical results demonstrate that our QoS bound analysis is valid and can be used as an approximate solution for the computation of the failure detector's parameters or the QoS bounds estimation if the failure detector's parameters are given. Our simulation results confirm that when a failure detector is designed and implemented, the dependability of the crash-recovery target needs to be considered in order to achieve more accurate parameter estimation. Furthermore, if the recovery of the monitored target needs to be detected, further enhancement of the existing algorithms is needed.
