IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING,                    VOL. 7,      NO. 3,    JULY-SEPTEMBER 2010                                271




                        On the Quality of Service of
                     Crash-Recovery Failure Detectors
                                          Tiejun Ma, Jane Hillston, and Stuart Anderson

       Abstract—We model the probabilistic behavior of a system comprising a failure detector and a monitored crash-recovery target. We
       extend failure detectors to take account of failure recovery in the target system. This involves extending QoS measures to include the
       recovery detection speed and the proportion of failures detected. We also extend the estimation of the failure detector parameters needed to
       achieve a required QoS into a configuration procedure for the crash-recovery failure detector. We investigate the impact of the dependability of the
       monitored process on the QoS of our failure detector. Our analysis indicates that variation in the MTTF and MTTR of the monitored
       process can have a significant impact on the QoS of our failure detector. Our analysis is supported by simulations that validate our
       theoretical results.

       Index Terms—Failure detectors, crash recovery, quality of service, availability, dependability, performance.


. T. Ma is with the Department of Computing, Imperial College London, South Kensington Campus, 180 Queens Gate London, SW7 2AZ, UK. E-mail: tma@doc.ic.ac.uk.
. J. Hillston and S. Anderson are with the Laboratory for Foundations of Computer Science, School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, UK. E-mail: {jeh, soa}@inf.ed.ac.uk.
Manuscript received 19 Feb. 2008; revised 21 Apr. 2009; accepted 30 June 2009; published online 11 Aug. 2009.
For information on obtaining reprints of this article, please send e-mail to: tdsc@computer.org, and reference IEEECS Log Number TDSC-2008-02-0037.
Digital Object Identifier no. 10.1109/TDSC.2009.36.
1545-5971/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.

1    INTRODUCTION
Fault tolerance is one of the most important issues for achieving dependable distributed systems. One of the most challenging problems in this research area is to tolerate the Byzantine failure, which is also sometimes called the arbitrary failure. This means that a process may behave in an arbitrary manner, producing arbitrary responses at arbitrary times [1]. It is the most difficult failure to detect. One possible approach to Byzantine failure detection is to adopt consensus algorithms. To achieve K fault tolerance, 3K + 1 service replications are needed [2]. In the worst case, the K faulty services may send incorrect values, or incorrectly represent the values of others, but the remaining 2K + 1 services can still return the same correct answer. Crash failure detection is one of the most important building blocks to achieve a successful consensus. However, detecting crash failures is a difficult problem. In [3], Fischer et al. show the impossibility of distinguishing a crashed process from a very slow one in a purely asynchronous system, known as the Fischer-Lynch-Paterson impossibility result. Subsequently, failure detector oracles, which give possibly erroneous information about the state of the monitored target, have been proposed. In [4], Chandra and Toueg introduce the concept of unreliable crash failure detectors to detect the eventual crash behavior of a process and classify a set of abstract failure detectors based on the failure detectors’ eventual behavior to solve a certain set of membership problems. This work inspired many researchers to study the quality of service (QoS), such as the speed and accuracy, of crash failure detector implementations and failure detection algorithms, e.g., [5], [6], [7], [8], [9], [10].
   It is important to note that most of this previous work on the QoS of crash failure detectors is based on the crash-stop or fail-free assumption. The fail-free assumption assumes that failures do not occur. The crash-stop assumption assumes that there is only one failure and the monitoring procedure terminates once that crash failure is detected. The algorithms based on these assumptions focus on how to estimate the probabilistic message arrival time and a suitable time-out period for a failure detector to ensure a required QoS.
   However, fail-free and crash-stop can be strong assumptions. An alternative approach is to consider the crash-recovery paradigm as discussed by Guerraoui and Rodrigues [11]. A process can keep crashing and recovering infinitely often and it is eventually always up and running. In theory, a process recovery can be achieved by adopting stable storage and the state information of the process can be stored and retrieved from the storage. After a crash is detected, the recovery procedure can be initiated to retrieve the latest stored process information. In practice, in order to provide high availability, self-repairing and self-healing mechanisms are widely adopted in fault-tolerant systems to achieve automatic recovery after a crash occurs. Particularly, in middleware systems, many techniques and algorithms have been proposed to achieve the self-repairing or self-healing goal, e.g., [12], [13], [14], [15].
   In such systems, it is assumed that the system undergoes periodic crashes. During a crash period, the system is unable to service any requests or send any messages, externally behaving as if the system is unreachable. The end of the crash period is marked by a recovery, after which the system returns to normal service and its internal state is restored to the state before the crash failure occurred.
   For such systems, crash-recovery failure needs to be considered as a frequently occurring failure type to be detected. However, the crash-recovery case has been little studied, because there are more possible discrepancies between the failure detector and the monitored target, which increases the size of the state space of the monitoring process and makes the QoS analysis for such a paradigm more complicated.
   In [16], we presented an evaluation of the QoS of a crash-recovery failure detector based on a simple time-out algorithm. A crash-recovery target was modeled as an alternating renewal process. The simulation results showed that the crash-recovery behavior of the monitored target will impact the QoS of such a failure detector, which implied that the crash-recovery paradigm merited further study. Such an analysis was presented in [17]. In that paper, we outlined how to model the failure detection pair in a crash-recovery run and how to configure the failure detector to satisfy a given QoS requirement. The current paper represents a substantial expansion of [17]. We present more analytical details and support the results with further simulation studies. Analytical results, derived directly from the equations in this paper, are also plotted and compared with the simulation results. We are then able to present a detailed analysis for each of the QoS metrics, which shows the validity of our model.

1.1 Our Contribution
We show how to remove the fail-free or crash-stop assumption and model the probabilistic behavior of a failure detector with respect to a crash-recovery target, taking into consideration general dependability metrics, such as mean time to failure (MTTF) and mean time to recovery (MTTR). We outline how the QoS of a failure detector is limited by the dependability of the monitored target. Moreover, we establish that the crash-stop or fail-free models are special cases of the crash-recovery model.
   In order to effectively assess the QoS of the failure detector in a crash-recovery run, we have defined new QoS metrics to measure the recovery detection speed and the proportion of the failures of the monitored target which are detected. To make an accurate estimation of the failure detector’s parameters needed to achieve a required QoS, a configuration procedure for a crash-recovery failure detector is outlined. We demonstrate how to achieve the QoS from a given set of requirements based on the NFD-S algorithm (see Appendix B, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TDSC.2009.36) proposed by Chen et al. [5] with suitable modifications. To the best of our knowledge, none of these aspects of QoS of failure detectors have been presented before.

1.2 Related Work
In [5], Chen et al. propose a set of QoS metrics to measure the accuracy and speed of a failure detector. Their model contains a pair of processes: one is the monitor process, the other is the monitored process, and there is only one crash during the monitoring period. The analysis is based on two separate stages of failure detection: the precrash stage, which is a fail-free run; and the postcrash stage, which is a crash-stop run when the monitoring procedure will be terminated. In order to formally define the QoS metrics, Chen et al. [5] define state transitions of a failure detector monitoring a target process under the fail-free assumption. At any time, the failure detector’s state is either Trust or Suspect with respect to the monitored process’s liveness. If a failure detector moves from a Trust state to a Suspect state, then an S-transition occurs; if the failure detector moves from a Suspect state to a Trust state, then a T-transition occurs. Fig. 1 shows the state transitions of the failure detector and the QoS metrics. In terms of the transitions defined above and the fail-free assumption, Chen et al. define the following QoS metrics for a failure detector: failure detection time ($T_D$), mistake recurrence time ($T_{MR}$), mistake duration ($T_M$), good period duration ($T_G$), and query accuracy probability ($P_A$).

Fig. 1. The QoS metrics without considering false positive mistakes.

   Some recent research has extended the QoS work of [5] in a number of ways. For example, the authors of [6], [9], [10], [18] refine the model with different probabilistic message delay and loss estimation methods. Meanwhile, others, such as [7], [8], [19], [20], [21], focus on the scalability and adaptivity of crash failure detection. But all of these papers are based on eventual crash-stop behavior of the monitored process or the fail-free assumption. Crash-recovery failure detectors have been considered by several groups. For example, Boichat and Guerraoui [22] implemented reliable and total order broadcast primitives, assuming a practical asynchronous crash-recovery model in which the processes and channels may crash and recover or crash and never recover; [23], [24], [25], and [26] each propose failure detectors to solve consensus problems rather than focusing on the QoS of the failure detector itself. In [23], the monitored process is characterized as always-up, eventually-up, eventually-down, or unstable. A process which crashes and recovers infinitely many times is regarded as unstable. But crash-recovery looping behavior exists for most systems. From the perspective of stochastic theory, crash-recovery behavior can be regarded as a regenerative process in which the probabilistic live and recovery times are not zero. In the following sections, we will analyze such a crash-recovery paradigm and its failure detector from a QoS perspective.
   This paper is organized as follows: in Section 2.1, we model a crash-recovery service with general dependability metrics. Then, we show our model of the probabilistic message communication and its QoS metrics. In Section 3, we show how to model the crash-recovery failure detector’s probabilistic behavior. We refine the completeness of a crash-recovery failure detector and extend the QoS metrics to measure the completeness and the recovery detection speed of such a failure detector. Then, we show how to involve the general dependability metrics for an approximate analysis of the QoS of a failure detector and how to configure a crash-recovery failure detector to satisfy a given set of QoS requirements. Moreover, we discuss the impact of the dependability of the crash-recovery service on the QoS of failure detectors in detail. In Section 4, the estimation of the input parameters of a crash-recovery failure detector is presented. We show how to estimate the message delay, message loss, MTTF, MTTR, etc., in a crash-recovery run. In
Section 5, the analytical and simulation results are plotted and analyzed in detail. We show that the dependability of a crash-recovery target has an impact on the QoS of a failure detector and our analysis is valid. In Section 6, a brief summary of the paper is presented. Appendix A provides a notation table for the variables used in the paper. Appendix B shows the pseudocode of the NFD-S algorithm. Appendix C presents the main proofs of the lemmas and theorems presented in this paper.

2    CRASH-RECOVERY SERVICE AND QoS OF MESSAGE COMMUNICATION
In this section, we outline the assumptions underlying our framework, considering the crash-recovery behavior of the target service, its dependability characteristics, and the behavior of the communication channel which supports the failure detection process.

2.1 The Crash-Recovery Service Modeling
For a crash-recovery target service (CR-TS), we consider that the service might crash at an arbitrary time and take some time to be repaired and restarted after it fails. Let $S$ be the state space of a stochastic process $Z := \{Z(t),\ t \geq 0\}$, where $Z$ captures a CR-TS’s lifetime. Then, $S$ can be regarded as {Alive, Crash} and the CR-TS can periodically switch between these two states. A transition occurs when the state of the CR-TS changes. Fig. 2 shows the state transitions of a CR-TS, where a C-transition occurs when the state of the CR-TS switches from the Alive state to the Crash state; an R-transition occurs when the state of the CR-TS switches from the Crash state to the Alive state.

Fig. 2. Crash-recovery service modeling.

Assumption 1. If the service’s recovery is treated as a restart, then the CR-TS’s lifetime $Z$ is a regenerative process.

   Assumption 1 will be used in the following. It is based on the following observations. The CR-TS will periodically crash and recover, leading to a sequence of time points, $S_1, S_2, \ldots, S_n, \ldots$ ($n \geq 0$), representing the times of the CR-TS’s recovery. The behavior of the system after $S_n$ ($n \geq 0$) is independent of what has occurred before, and thus, $S_n$ can be regarded as a restart. Moreover, the probability of $S_n$ occurring is 1. This makes the time points $S_1, S_2, \ldots, S_n$ regeneration points.
   Since the CR-TS’s lifetime $Z$ is a regenerative process and the sequence $\{S_1, S_2, \ldots, S_n, \ldots\}$ characterizes the lifetime of the service, we can give an alternative definition of the stochastic process $Z$. The stochastic process $Z$ is a set of random variables $\{X(n),\ n \in \mathbb{N}\}$, where $X(n)$ is the random variable representing the time which elapses from the time of the $n$th regeneration point to the $(n+1)$th one (i.e., $X(n) = S_{n+1} - S_n$). For simplicity of presentation, we use $X$ instead of $X(n)$ in the following since it is sufficient to consider a single regeneration period. Furthermore, we can consider $X$ to be the sum of two independent random variables: $X_a$ and $X_c$. Here, $X_a$ represents the time which elapses from the time that the CR-TS starts a regeneration period to the time the CR-TS fails and $X_c$ represents the time from when the CR-TS fails until the time of the next regeneration point.

Lemma 1. In steady state, the CR-TS is an alternating renewal process and the time between any two consecutive recovery time points is one period of the crash-recovery service’s lifetime.

   Thus, we assert that in order to design a failure detector for the CR-TS, which is sensitive to the CR-TS’s behavior, we only need to consider one period of the CR-TS since all of the other periods are independent and identically distributed.

2.2 Dependability of a Crash-Recovery Service
Dependability, one of the most important issues for computer systems, is a complex attribute. Laprie et al. [1] define the concept of dependability as the property of a computer system such that reliance can justifiably be placed on the service it delivers. Associating timing information with the behavior of a system, its dependability can be described quantitatively. Generally speaking, the dependability of a system can be measured according to a number of different aspects such as reliability, availability, consistency, usability, security, etc. In order to simplify the measurements which are related to failure detection, here, we only introduce reliability, availability, and consistency, which are strongly related to the QoS of failure detectors.
   In [27], Knight and Strunk give a definition of software reliability and availability. We extend this with a definition of consistency as follows:

   . Reliability: is the probability that the system will operate correctly in a specified operating environment up until time $t$ ($t > 0$).
   . Availability: is the probability that the system will be operational at time $t$.
   . Consistency: is the probability that in a specified operating environment, the system will return to normal operation correctly after a failure within time $t$.

   These three metrics present different aspects of the system dependability. Generally, reliability presents how long a system will operate correctly and can be captured by MTTF, which records the likelihood of a service to persist without a failure. Availability presents the probability that a system is accessible or reachable with correct operation at an arbitrary time and can be captured by mean time to failure divided by mean time between failures ($\mathrm{MTTF}/\mathrm{MTBF}$). Consistency presents the ability of a system to recover from a failure state to the correct operation state and can be captured by MTTR, which records how quickly a system recovers.
   In different scenarios, different aspects of dependability may be given greater relative importance. For example, consistency may be valued more than reliability in a system designed to be always accessible. This means that fault-tolerance mechanisms should be able to adapt to reflect differing dependability requirements.
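
The alternating renewal view of Section 2.1 and the availability ratio of Section 2.2 can be checked numerically. The following Python sketch is an illustration only and is not part of the original analysis. It assumes that the alive time X_a and the crash time X_c are exponentially distributed (the model itself only requires them to be independent), and that MTBF = MTTF + MTTR, a common convention that the text does not state explicitly. The function names are hypothetical.

```python
import random


def simulate_availability(mttf, mttr, periods=200_000, seed=0):
    """Estimate the long-run fraction of time a crash-recovery target is Alive.

    Assumption (not from the paper): X_a and X_c are exponential with
    means mttf and mttr respectively.
    """
    rng = random.Random(seed)
    alive_time = 0.0
    total_time = 0.0
    for _ in range(periods):
        x_a = rng.expovariate(1.0 / mttf)  # Alive time until the C-transition
        x_c = rng.expovariate(1.0 / mttr)  # Crash time until the R-transition
        alive_time += x_a
        total_time += x_a + x_c            # X = X_a + X_c, one regeneration period
    return alive_time / total_time


def analytic_availability(mttf, mttr):
    """MTTF / MTBF, assuming MTBF = MTTF + MTTR (an assumption of this sketch)."""
    return mttf / (mttf + mttr)


if __name__ == "__main__":
    print(simulate_availability(100.0, 10.0))
    print(analytic_availability(100.0, 10.0))
```

With MTTF = 100 and MTTR = 10 time units, the simulated fraction should lie close to 100/110, because every regeneration period is independent and identically distributed.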
2.3 QoS of Message Communication
In order to measure the communication between the FDS and
target service quantitatively, we define the communication
path between the FDS and the target service as a channel.
Each communication component pair holds one or more
virtual one-way, source-to-destination channels. Messages
can only flow from the source component to the destination
component. In addition, the channel model in this paper
relies on the assumption of a basic unreliable communication
channel with fairness, no-creation, and no-duplication [28].
This has some similarities with the Stubborn channels in [28],
but they allow duplicated messages and we assume that
there are no duplicated messages in our model.
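
The channel assumptions above can be sketched in code. This is a minimal illustration, not the authors' implementation: the class name `UnreliableChannel`, its parameters `loss_prob` and `mean_delay`, and the exponential delay distribution are assumptions introduced here. Each call to `send` yields at most one arrival (no duplication), arrivals exist only for messages that were sent (no creation), and a loss probability strictly below 1 keeps delivery possible for a sender that keeps retrying (fairness).

```python
import random


class UnreliableChannel:
    """One-way lossy channel: each sent message arrives at most once, or is lost."""

    def __init__(self, loss_prob, mean_delay, seed=0):
        if not 0.0 <= loss_prob < 1.0:
            raise ValueError("loss_prob must be in [0, 1)")
        if mean_delay <= 0.0:
            raise ValueError("mean_delay must be positive")
        self.loss_prob = loss_prob
        self.mean_delay = mean_delay
        self._rng = random.Random(seed)

    def send(self, send_time):
        """Return the arrival time of one message, or None if the message is lost."""
        if self._rng.random() < self.loss_prob:
            return None  # message lost in transit
        # Assumed delay law: exponential with the given mean.
        delay = self._rng.expovariate(1.0 / self.mean_delay)
        return send_time + delay


if __name__ == "__main__":
    channel = UnreliableChannel(loss_prob=0.1, mean_delay=0.02, seed=1)
    arrivals = [channel.send(float(t)) for t in range(10)]
    print(arrivals)
```

Drawing loss and delay independently for each message matches the independence condition used later for consecutive losses; a different delay distribution can be substituted without changing the interface.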
   This channel-based communication, which maintains the interaction between the FDS and the CR-TS, can be characterized by the QoS of the communication, the adopted failure detection algorithm, and the adopted communication protocol, each of which has some associated properties. In particular, we take the message transmission behavior to be probabilistic: we describe the message delay or loss as probabilistic behaviors associated with the communication channel.

Definition 1. Let $D$ be a random variable representing the time which elapses from the time a message is sent until the time it arrives at the destination and $E(D)$ be the average message delay; let $p_L$ be the probability of a message loss during the transmission; let $X_L$ be a random variable representing the number of consecutive messages lost and $E(X_L)$ be the average number of consecutive messages lost.

   From these definitions, properties such as the following can be derived:

Lemma 2. If each message’s transmission and loss behavior are independent, then the probability that $x$ ($x \geq 1$) consecutive messages are lost is

   $\Pr(X_L = x) = p_L^{x} \cdot (1 - p_L).$

   Overall, the QoS of this channel-based communication between the FDS and the CR-TS can be captured by $E(D)$, $p_L$, and $E(X_L)$. In the following sections, we analyze how the FDS monitors the CR-TS and how the FDS can be configured based on the characteristics of this channel-based communication.

3     QoS OF THE CRASH-RECOVERY FDS
3.1 System Model
We consider a distributed system model with two services: one FDS and one CR-TS, distributed over a wide-area network. The FDS and the CR-TS are connected by an unreliable communication channel (see Section 2.3). Liveness (heartbeat) messages are transmitted through the channel. The communication channel does not create or duplicate liveness messages, but the messages might be lost or delayed indefinitely during transmission.1 The CR-TS can fail by crashing but can be repaired and restarted after some repair time, i.e., it behaves as a crash-recovery model. The drift of the local clocks of the FDS and the CR-TS is small enough to be ignored and their local clocks are sufficiently synchronized (this can be guaranteed by some time synchronization service such as the Network Time Protocol used in [6]) to be regarded as a clock synchronized system. The failure detection algorithm we adopt is the NFD-S algorithm proposed in [5].

   1. This channel-based message transmission is the same as the probabilistic network model in [5].

3.2 Modeling a Push-Style Crash-Recovery FDS
The failure detector (FDS) has a set of suspicion levels $S_s := \{Trust, Suspect\}$ as in [5]. The FDS can either trust or suspect a CR-TS’s liveness. Thus, for a fail-free run, a service only has one state: Alive. The state space of an FDS is $S_f := \{Trust\text{-}Alive, Suspect\text{-}Alive\}$, and the event space of an FDS is $F := \{S\text{-}transition, T\text{-}transition\}$ (Fig. 3a). For a fail-free run, the QoS metrics of an FDS can be measured quite straightforwardly. The average time spent in the Trust state is the mean length of the good period $E(T_G)$; the average time spent in the Suspect state is the mean time of the mistake duration $E(T_M)$; the average time between two consecutive transfers to the Suspect state (two consecutive S-transitions) is the mean time of the mistake recurrence $E(T_{MR})$.
   However, precisely speaking, the state space of an FDS is $S_c := S \times S_s$, where $S$ is the state space of the target service. Therefore, for a CR-TS with failures, the state space of its FDS increases because the service has more than one state (see Fig. 3b). If there are more than two suspicion levels, then $S_c$ will increase as well. The QoS metrics of an FDS are no longer as simple as for fail-free runs.

Fig. 3. State space in a crash-recovery run. (a) Fail-free transition. (b) Crash-recovery transition.

   For a fail-free run ($\mathrm{MTTF} \to +\infty$) or a crash-stop run ($\mathrm{MTTR} \to +\infty$), the CR-TS’s current state $S_{CR\text{-}TS}$ will be Alive for all time up to the crash, and it is easy to deduce the FDS’s accuracy $S_A$ directly from the FDS’s current state. However, for a crash-recovery run, since the CR-TS could fail or recover at an arbitrary time, $S_A$ cannot be deduced solely from the state of the FDS.
   Furthermore, compared with a fail-free or crash-stop run, there are more mistake types in a crash-recovery run. In previous work, such as [5], [6], [8], [9], [10], [18], [20], only the mistakes caused by the message transmission behaviors (message delay and loss) are considered. But in a crash-recovery run, a mistake starts whenever the CR-TS’s and FDS’s states diverge. Thus, there are also mistakes caused by the CR-TS’s crash (see $T_F$ in Fig. 1 or $T_M^3$ in Fig. 4c) and recovery (see Fig. 4d) due to the delayed detection of such events. Fig. 4 shows the four types of mistake which could occur within a crash-recovery run. $T_M^1$ in Fig. 4a represents a
mistake caused by a message delay. $T_M^2$ in Fig. 4b represents a mistake caused by a message loss. $T_M^3$ in Fig. 4c represents a mistake caused by the CR-TS’s crash, while the FDS still trusts the CR-TS. $T_M^4$ in Fig. 4d represents a mistake caused by the CR-TS’s recovery, while the FDS still suspects the CR-TS. A message loss or delay will result in a Suspect-Alive mistake of the FDS (see Fig. 3b). A crash failure will result in a Trust-Crash mistake. A recovery event will result in a Suspect-Alive mistake. Mistakes caused by different reasons will result in different FDS parameter reconfiguration plans. For instance, the best way for the FDS to tolerate more message losses or a longer message delay is to increase the time-out duration; the best way for the FDS to minimize the mistake duration caused by a crash event is to decrease the time-out duration; and the best way to minimize the mistake duration caused by a recovery event is to increase the liveness message sending frequency. Thus, we can see that an inaccurate mistake type identification might reduce the QoS of an FDS and should be avoided.

Fig. 4. The analysis of possible $T_M$ in a crash-recovery run. (a) $T_M^1$. (b) $T_M^2$. (c) $T_M^3$. (d) $T_M^4$.

   From the above analysis, we can see that due to the increasing mistake types in a crash-recovery run, the definition of the QoS metrics in [5] using transitions is not valid in a crash-recovery run. Thus, we redefine them as below:

   •    Detection time ($T_D$): The elapsed time from when the monitored target crashes until the failure detector correctly suspects the monitored target.
   •    Mistake recurrence time ($T_{MR}$): The time between the occurrence of two consecutive mistakes.
   •    Mistake duration ($T_M$): The time to correct a mistaken suspect or trust.
   •    Good period duration ($T_G$): The duration for which the failure detector maintains the correct state information.
   •    Query accuracy probability ($P_A$): The probability that the state information from the failure detector is correct at an arbitrary time.

   The above QoS metrics can measure some QoS aspects of a failure detector in a crash-recovery run. However, they cannot measure how fast a recovery can be detected, the proportion of the detected failures over the occurred failures (completeness), etc. In the following section, we extend the QoS metrics to measure the recovery detection speed and the completeness of a failure detector.

3.3   Extended QoS Metrics for a Crash-Recovery FDS
For an FDS in a crash-recovery run, in addition to the QoS metrics introduced above, we propose some new QoS metrics.
   First, in order to measure the speed with which an FDS can discover a recovery of the CR-TS, we define the recovery detection time ($T_{DR}$), a random variable which represents the time that elapses from the CR-TS’s recovery time (an R-transition) to the time when the FDS discovers the recovery.
   Second, in a crash-recovery run, there is no eventual behavior of a CR-TS, and a fast recovery could make a failure undetectable by the FDS. Under such circumstances, the completeness property of a failure detector defined in [4] can no longer be satisfied. In order to reflect this situation, we refine the definition of the completeness as follows:

   •    Strong completeness: Every crash failure of a recoverable process will be detected.
   •    Weak completeness: A specified proportion of the crash failures of a recoverable process will be detected.

Therefore, in order to measure the completeness property of a crash-recovery FDS, we propose a new QoS metric. The detected failure proportion ($R_{DF}$) is a random variable capturing the ratio of the detected crashes over the occurred crashes ($0 \le R_{DF} \le 1$). When no crash failures are detected, $R_{DF} = 0$. When all of the occurring crashes are detected, $R_{DF} = 1$. The strong completeness property of an FDS requires that $E(R_{DF}) = 1$ (where $E$ denotes expectation). The weak completeness property requires that $E(R_{DF}) \ge R_{DF}^L$,


where $R_{DF}^L$ is the specified lower bound of the detected failure proportion and $0 \le R_{DF}^L \le 1$.
   Overall, the QoS for a crash-recovery FDS can be captured by $P_A$, $T_M$, $T_{MR}$, $T_D$, $T_{DR}$, and $R_{DF}$. In the next section, we will analyze the QoS bounds of the FDS based on the NFD-S algorithm in a crash-recovery run by adopting the proposed basic and extended QoS metrics.

3.4   QoS Estimate of the Crash-Recovery FDS Based on the NFD-S Algorithm
In a crash-recovery run, as the state of a CR-TS can switch between Alive and Crash, these crash or recovery events will force the output of the FDS to be accurate or inaccurate. For analyzing the behavior of the failure detection pair, we want to pick an observation period, which will cover all the events which may possibly occur. In our model, we pick one MTBF period as the observation period. This is because, as we discussed in Section 2.1, in order to study the steady state behavior of a CR-TS throughout its lifetime, we only need to observe the time period between two consecutive regeneration points (recovery times) of the CR-TS and the average duration between the two consecutive regeneration points is MTBF. In the following, we will treat these as also regeneration points of the system consisting of the failure detection pair. This is an approximation made for pragmatic reasons but it can be justified as follows:
   Fig. 5 shows the relationship between an FDS and a CR-TS on the interval $t \in [t_0, t_3)$, where both $t_0$ and $t_3$ are regeneration points. Obviously, the mean time between $t_0$ and $t_3$ is the MTBF. We split $[t_0, t_3)$ into three intervals $[t_0, t_1)$, $[t_1, t_2)$, and $[t_2, t_3)$:

   •    $t_1$ is the time when the FDS detects the transition of the CR-TS from the Crash state to the Alive state.
   •    $t_2$ is the time when the service crashes. Note that the period $[t_1, t_2)$ is without failures.

   Additionally, we define the following times:

   •    $\sigma_s$ is the first liveness message sending time after a recovery;
   •    $\sigma_f$ is the sending time of the last liveness message before a crash;
   •    $\sigma_i$ is the sending time of a liveness message between $\sigma_s$ and $\sigma_f$;
   •    $\eta$ is the liveness message sending interval;
   •    $\tau_s$ is the first decision time after recovery;2
   •    $\tau_i$ is the time of the $i$th freshness point corresponding to $\sigma_i$;
   •    $\tau_b$ is the last freshness point3 before a crash; and
   •    $\tau_f$ is the freshness point corresponding to $\sigma_f$.

   2. The actual arrival time of the first received valid liveness message.
   3. The expected arrival time of the liveness message.

Fig. 5. The analysis of the FDS based on the NFD-S algorithm in a crash-recovery run.

   Let time-out be the threshold waiting time for the expected arrival of the liveness message before suspecting the CR-TS ($\text{time-out} = \tau_i - \sigma_i$ in Fig. 5). Let $t_r^m$ ($m \ge 1$) be a recovery time of the current MTBF period (see Fig. 5). Then in our model, the key thing for the QoS bounds analysis is to derive the average number of mistakes that will happen between the $m$th and $(m+1)$th recovery times, and the average duration of each mistake. We make the following definitions as extensions of Definition 1 in [5]:

Definition 2. For the fail-free duration $[t_1, t_2)$ within each MTBF period:

   1.   $k$: for any $i \ge 1$, let $k$ be the smallest integer such that for all $j \ge i + k$, $m_j$ is sent at or after time $\tau_i$, where $m_j$ is the $j$th heartbeat message.4
   2.   For any $i \ge 1$, let $p_j^i(x)$ be the probability that the FDS does not receive the $(i+j)$th message $m_{i+j}$ by time $\tau_i + x$, for every $j \ge 0$ and every $x \ge 0$; let $p_0^i = p_0^i(0)$.
   3.   For any $i \ge 2$, let $q_0^i$ be the probability that the FDS receives message $m_{i-1}$ before time $\tau_i$.
   4.   For any $i \ge 1$, let $u^i(x)$ be the probability that the FDS suspects the CR-TS at time $\tau_i + x$, for every $x \in [0, \eta)$.
   5.   $p_s^i$: for any $i \ge 2$, let $p_s^i$ be the probability that an S-transition occurs at time $\tau_i$.

   4. $k$ is assumed to be independent of $i$ approximately. In fact, in a crash-recovery run, $k$ is not completely independent of $i$. However, if the CR-TS will remain alive for a reasonable duration, $k$ will be almost independent of $i$ except for the last few messages before the CR-TS crashes.

   According to the QoS analysis of the NFD-S algorithm in Proposition 3 in [5], we now analyze the basic QoS metrics of the FDS based on the NFD-S algorithm in a crash-recovery run and show the following relations hold:

Proposition 1.

   1.   $k = \lceil \text{time-out} / \eta \rceil$.
   2.   For all $j \ge 0$ and for all $x \ge 0$,
        \[ p_j^i(x) = \bigl(p_L + (1 - p_L) \cdot \Pr(D > \text{time-out} + x - j\eta)\bigr) \cdot \Pr\bigl(X_a > \tau_i - t_r^m + x\bigr). \]
   3.   \[ q_0^i = (1 - p_L) \cdot \Pr(D < \text{time-out} + \eta) \cdot \Pr\bigl(X_a > \tau_i - t_r^m\bigr). \]
   4.   For all $x \in [0, \eta)$, $u^i(x) = \prod_{j=0}^{k} p_j^i(x)$.
   5.   $p_s^i = q_0^i \cdot u^i(0)$.

   In Proposition 1, the bounds of each QoS metric are derived based on the analysis of the average number of possible mistakes within the distinct intervals $[t_0, t_1)$, $[t_1, t_2)$, and $[t_2, t_3)$. In consequence, the following theorem holds and can be used to estimate the FDS’s parameters or QoS bounds within a crash-recovery run:

Theorem 1. The crash-recovery FDS based on the NFD-S algorithm has the following properties:

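\[ \mathrm{MTBF} \ge E(T_{MR}) \ge \frac{\mathrm{MTBF}}{\left(\left\lfloor \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \right\rfloor + 1\right) \cdot p_s^i + \left\lceil \frac{E(D)}{\eta} \right\rceil + 2}. \tag{1} \]

   If $X_c > \eta + \text{time-out}$, then

\[ \frac{\mathrm{MTBF}}{2} \ge E(T_{MR}) \ge \frac{\mathrm{MTBF}}{\left(\left\lfloor \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \right\rfloor + 1\right) \cdot p_s^i + \left\lceil \frac{E(D)}{\eta} \right\rceil + 2}, \tag{2} \]

\[ P_A \ge 1 - \frac{E(T_D) + E(T_{DR}) + \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \cdot \int_0^{\eta} u^i(x)\,dx}{\mathrm{MTBF}}, \tag{3} \]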
\[ E(T_M) \le \frac{E(T_{DR}) + \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \cdot \int_0^{\eta} u^i(x)\,dx + E(T_D)}{\left(\left\lfloor \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \right\rfloor + 1\right) \cdot p_s^i + 1}, \tag{4} \]

\[ E(T_{DR}) = E(D) + \eta \cdot E(X_L), \tag{5} \]

\[ E(R_{DF}) \ge \Pr(X_c > \eta + \text{time-out}). \tag{6} \]

Details of the proof of the theorem can be found in [29] and Appendix C.2.
   When the monitored target is fail-free or crash-stop,5 for the basic QoS metrics in [5], applying (1)-(4) of Theorem 1, we can easily deduce that

\[ E(T_{MR}) \ge \frac{\eta}{p_s^i}, \tag{7} \]

\[ E(T_M) \le \frac{1}{p_s^i} \cdot \int_0^{\eta} u^i(x)\,dx \le \frac{\eta}{q_0^i}, \tag{8} \]

\[ P_A \ge 1 - \frac{1}{\eta} \cdot \int_0^{\eta} u^i(x)\,dx. \tag{9} \]

   5. The precrash duration of the crash-stop process is a long run.

   Thus, $E(T_{MR})$, $E(T_M)$, and $P_A$ are exactly reduced to the QoS analysis results in [5] (see Appendix C.4 for the details of the proof sketch). We can conclude that in terms of failure detection, a fail-free run or a crash-stop run with MTTF tending to infinity is a particular case of a crash-recovery run. If the monitored target’s MTTF is not sufficiently long and the target is recoverable, then the impact of its dependability must also be taken into consideration. In the following section, we will introduce how to configure the crash-recovery FDS according to the QoS bounds we have derived from Theorem 1.

3.5   The Configuration of the Crash-Recovery FDS Based on the NFD-S Algorithm
For crash failure detectors, it is crucial to select some suitable input parameters (such as the liveness message intersending interval and the time-out duration) to satisfy a given set of QoS requirements. In this section, we will show how to achieve such steps in a crash-recovery run based on the NFD-S algorithm. In a crash-recovery run, an assumption that the sequence numbers of the heartbeat messages are continually increasing after every recovery of the CR-TS is needed to ensure that the NFD-S algorithm is still valid after each recovery. However, without persistent storage to snapshot the runtime information frequently, when a crash failure occurs, all of the current runtime information might be lost. Thus, continuously increasing the heartbeat sequence number cannot be guaranteed.
   Since the NFD-S algorithm assumes that the local clocks of the FDS and the CR-TS are synchronized, we can compare the sending times of heartbeat messages instead of the heartbeat sequence numbers in the algorithm. Then, for a crash-recovery FDS, if the QoS requirements of the FDS are given, the configuration procedure is illustrated in Fig. 6.

Fig. 6. The extended FDS configuration based on the NFD-S algorithm in a crash-recovery run.

   Initially, we can assume that the QoS of message communication is perfect (e.g., $p_L = 0$, $E(D)$ is small and $E(X_L) = 0$), and the CR-TS is fail-free. As the monitoring procedure continues, the estimation of the QoS of message communication and the dependability metrics of the CR-TS will become more accurate. Thus, the FDS will be reconfigured to adapt to changing input parameters, which helps to better estimate $\eta$ and time-out.
   Then for given QoS requirements, expressed as bounds, the following inequalities need to be satisfied where a superscript $U$ denotes an upper bound and a superscript $L$ denotes a lower bound:

\[ T_D \le T_D^U, \quad E(T_{MR}) \ge T_{MR}^L, \quad P_A \ge P_A^L, \quad E(T_M) \le T_M^U, \quad E(T_{DR}) \le T_{DR}^U, \quad E(R_{DF}) \ge R_{DF}^L. \tag{10} \]

From Theorem 1, we can estimate the parameters ($\eta$ and time-out) of the NFD-S algorithm according to the following inequalities:

\[ \eta + \text{time-out} \le T_D^U, \quad \eta > 0, \tag{11} \]

\[ \frac{\mathrm{MTBF}}{\left(\left\lfloor \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \right\rfloor + 1\right) \cdot p_s^i + \left\lceil \frac{E(D)}{\eta} \right\rceil + 2} \ge T_{MR}^L, \tag{12} \]

\[ 1 - \frac{E(T_D) + E(T_{DR}) + \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \cdot \int_0^{\eta} u^i(x)\,dx}{\mathrm{MTBF}} \ge P_A^L, \tag{13} \]

\[ \frac{E(T_{DR}) + \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \cdot \int_0^{\eta} u^i(x)\,dx + E(T_D)}{\left(\left\lfloor \frac{\mathrm{MTTF} - E(T_{DR})}{\eta} \right\rfloor + 1\right) \cdot p_s^i + 1} \le T_M^U, \tag{14} \]

\[ E(D) + \eta \cdot E(X_L) \le T_{DR}^U, \tag{15} \]

\[ \Pr(X_c > \eta + \text{time-out}) \ge R_{DF}^L. \tag{16} \]
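
As an illustration of how inequalities (11)-(16) might be checked for one candidate pair ($\eta$, time-out), the following Python sketch evaluates the bounds of Theorem 1 and compares them with the required bounds. It is a minimal sketch under stated assumptions, not the configuration procedure itself: the names `Inputs`, `Requirements`, `bounds`, and `satisfies` are hypothetical, the quantities $p_s^i$ and $\int_0^{\eta} u^i(x)\,dx$ are taken as precomputed inputs for the candidate pair, and the crash duration $X_c$ is assumed to be exponentially distributed with mean MTTR = MTBF − MTTF, which the text does not state.

```python
import math
from dataclasses import dataclass


@dataclass
class Inputs:
    """Estimated inputs (all field names are hypothetical)."""
    mtbf: float   # MTBF of the CR-TS
    mttf: float   # MTTF of the CR-TS
    e_d: float    # E(D), mean message delay
    e_xl: float   # E(X_L)
    e_td: float   # E(T_D), mean crash detection time
    p_s: float    # p_s^i of Proposition 1 for this (eta, time-out)
    int_u: float  # integral of u^i(x) over [0, eta) for this (eta, time-out)


@dataclass
class Requirements:
    """Required QoS bounds, as in (10)."""
    td_up: float    # T_D^U
    tmr_low: float  # T_MR^L
    pa_low: float   # P_A^L
    tm_up: float    # T_M^U
    tdr_up: float   # T_DR^U
    rdf_low: float  # R_DF^L


def bounds(eta: float, timeout: float, inp: Inputs) -> dict:
    """Evaluate the bounds (1), (3), (4), (5), (6) for one candidate."""
    e_tdr = inp.e_d + eta * inp.e_xl                      # (5)
    n_fresh = math.floor((inp.mttf - e_tdr) / eta) + 1
    tmr_low = inp.mtbf / (n_fresh * inp.p_s
                          + math.ceil(inp.e_d / eta) + 2)  # (1)
    wrong = inp.e_td + e_tdr + (inp.mttf - e_tdr) / eta * inp.int_u
    pa_low = 1.0 - wrong / inp.mtbf                       # (3)
    tm_up = wrong / (n_fresh * inp.p_s + 1)               # (4)
    # ASSUMPTION: X_c is exponential with mean MTTR = MTBF - MTTF,
    # so Pr(X_c > t) = exp(-t / MTTR).
    mttr = inp.mtbf - inp.mttf
    rdf_low = math.exp(-(eta + timeout) / mttr)           # (6)
    return {"e_tdr": e_tdr, "tmr_low": tmr_low, "pa_low": pa_low,
            "tm_up": tm_up, "rdf_low": rdf_low}


def satisfies(eta: float, timeout: float,
              inp: Inputs, req: Requirements) -> bool:
    """Return True if the candidate meets inequalities (11)-(16)."""
    if eta <= 0 or eta + timeout > req.td_up:             # (11)
        return False
    b = bounds(eta, timeout, inp)
    return (b["tmr_low"] >= req.tmr_low                   # (12)
            and b["pa_low"] >= req.pa_low                 # (13)
            and b["tm_up"] <= req.tm_up                   # (14)
            and b["e_tdr"] <= req.tdr_up                  # (15)
            and b["rdf_low"] >= req.rdf_low)              # (16)
```

A search over candidates would call `satisfies` repeatedly and recompute `p_s` and `int_u` for every new pair, since both depend on $\eta$ and time-out.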
   Then, the task of the NFD-S algorithm is to find the largest $\eta$ satisfying inequalities (12)-(15) and if such $\eta$ exists, find the largest time-out that satisfies $\eta + \text{time-out} \le T_D^U$ and $\Pr(X_c > \eta + \text{time-out}) \ge R_{DF}^L$. This can be done in the following steps:
   Step I. If $T_{MR}^L \le \mathrm{MTBF}$, continue; else the QoS of the FDS cannot be achieved.
   Step II. Find the largest $\eta$ that satisfies the inequalities (12)-(15); if no such $\eta$ exists, the QoS cannot be achieved.
   Step III. If $\eta > 0$, find the largest time-out such that $\text{time-out} \le T_D^U - \eta$ and $\Pr(X_c > \eta + \text{time-out}) \ge R_{DF}^L$.
   From the above steps, the estimation of $\eta$ and time-out for a crash-recovery FDS based on the NFD-S algorithm amounts to finding a numerical solution for the inequalities (11)-(16). This can be done using binary search similarly to the approach outlined in [5]. But the estimation of the input parameters of the configuration becomes more difficult because parameters, such as $E(X_L)$, MTTF, MTTR, etc., are needed. How to estimate these parameters will be discussed in Section 4.
   Note that for this configuration procedure, choosing a different message transmission protocol (e.g., TCP and UDP) can imply different QoS for message communication. Thus, this new configuration can be more adaptive to the message transmission characteristics. For example, if the message loss probability or message delay is high for a certain protocol, then the FDS can switch to a more reliable protocol to achieve a better QoS without increasing the communication frequency or the time-out length.
   In the next section, we will discuss how to estimate the QoS of message transmission and the dependability metrics of the CR-TS.

4     PARAMETER ESTIMATION
In the previous section, we explained how to configure a crash-recovery FDS. However, for this procedure, several input parameters are needed (see Fig. 6). In this section, we will show how to estimate these input parameters for an FDS configuration.

4.1 Dependability Metrics Estimation for the CR-TS
From the CR-TS modeling in Section 2, we see that there is an intimate relationship between the MTTF, MTTR, and MTBF and the QoS of the FDS. In order to estimate these dependability metrics, we only need to estimate the crash and recovery time of the CR-TS. We assume that the clocks between the FDS and the CR-TS are synchronized. Let $t_r^1$ be the CR-TS’s first start time, then for $m \ge 1$, $t_r^m$ represents the $m$th recovery time; $t_{dr}^m$ represents the $m$th recovery detection time; $t_c^m$ represents the $m$th crash time; and $t_d^m$ represents the $m$th crash detection time (see Fig. 7). tm can be saved to

Fig. 7. Dependability metrics estimation.

uniformly distributed on $[l, l + \eta)$, then after a recovery has completed, the average $t_c^m$ can be estimated by $t_c^m = l + \frac{\eta}{2}$. Notice that a smaller message intersending time ($\eta$) can result in a more accurate $t_c^m$ estimate. Then, the CR-TS’s MTBF, MTTF, MTTR, and the probability that the CR-TS has not crashed up to time $\tau_i + x$ since its last recovery, $\Pr(X_a > \tau_i + x - t_r^m)$, can be estimated as follows:
   Estimate MTBF. From the definition of MTBF, we know that MTBF is only related to the CR-TS’s recovery times $t_r^m$(s). These $t_r^m$(s) can be obtained by adopting the recovery time estimation methods proposed in [29]. Thus, MTBF can be estimated as below:

\[ \mathrm{MTBF} = E\bigl(t_r^{m+1} - t_r^m\bigr) = \frac{1}{n} \sum_{m=1}^{n} \bigl(t_r^{m+1} - t_r^m\bigr). \tag{17} \]

   Estimate MTTF. MTTF can be estimated by using the recovery time ($t_r^m$) and the crash detection time ($t_d^m$) as $E(t_d^m - t_r^m) = \mathrm{MTTF} + E(T_D)$. Then,

\[ \mathrm{MTTF} = E\bigl(t_d^m - t_r^m\bigr) - E(T_D) = \frac{1}{n} \sum_{m=1}^{n} \bigl(t_d^m - t_r^m\bigr) - E(T_D). \tag{18} \]

   Estimate MTTR. MTTR can be estimated directly from MTBF and MTTF as $\mathrm{MTTR} = \mathrm{MTBF} - \mathrm{MTTF}$, or by using $t_r^{m+1}$ and $t_d^m$. In the latter case, $E(t_r^{m+1} - t_d^m) = \mathrm{MTTR} - E(T_D)$. Then,

\[ \mathrm{MTTR} = E\bigl(t_r^{m+1} - t_d^m\bigr) + E(T_D) = \frac{1}{n} \sum_{m=1}^{n} \bigl(t_r^{m+1} - t_d^m\bigr) + E(T_D). \tag{19} \]

   Estimate $\Pr(X_a > \tau_i + x - t_r^m)$. When the probability density function $f_a(x)$ or the probability distribution function $F_a(x)$ of $X_a$ is known, the probability that the CR-TS does not crash until $\tau_i + x$ after its last recovery can be estimated as

\[ \Pr\bigl(X_a > \tau_i + x - t_r^m\bigr) = 1 - \int_0^{\tau_i + x - t_r^m} f_a(x)\,dx = 1 - F_a(x)\Big|_0^{\tau_i + x - t_r^m}. \tag{20} \]
                                             r
the persistent storage by the CR-TS after a recovery has            When x ¼ 0, we obtain that
completed (see [29]). tm can be recorded by the FDS when a
                         d                                                                  Z i Àtm
failure is detected, EðTD Þ can be estimated by using                   À             Á            r
                                                                                                                            Àtm
1
  Pn      m    m           m                         m               P r Xa  i À tm ¼ 1 À          fa ðxÞdx ¼ 1 À Fa ðxÞj0i r :
    m¼1 ðtd À tc Þ when tc is known. Actually, tc can be
                                                                                    r
n                                                                                                   0
estimated by saving the latest successful message sending                                                                               ð21Þ
time l in the persistent storage. If a crash event happens
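As an illustration of the estimators in (18) and (19), the following minimal Python sketch computes MTTF and MTTR from recorded recovery and crash-detection times. The function names, the sample timestamps, and the value used for $E(T_D)$ are hypothetical and chosen only for illustration. The survival function assumes an exponentially distributed $X_a$, which is one possible choice of $F_a$ in (20) and is not required by the analysis.

```python
import math
from statistics import mean


def estimate_mttf(t_r, t_d, e_td):
    """Eq. (18): mean of (t_d^m - t_r^m) minus E(T_D).

    t_r: recovery times t_r^1 .. t_r^(n+1); t_d: crash detection times t_d^1 .. t_d^n.
    """
    return mean(d - r for r, d in zip(t_r, t_d)) - e_td


def estimate_mttr(t_r, t_d, e_td):
    """Eq. (19): mean of (t_r^(m+1) - t_d^m) plus E(T_D)."""
    return mean(r_next - d for d, r_next in zip(t_d, t_r[1:])) + e_td


def survival_exponential(c, mttf):
    """1 - F_a(c) as in Eq. (20), ASSUMING X_a is exponential with mean mttf."""
    return math.exp(-c / mttf)


# Hypothetical log: the target recovers every 105 time units and each crash
# is detected 101 time units after the preceding recovery.
t_r = [0.0, 105.0, 210.0, 315.0]
t_d = [101.0, 206.0, 311.0]
e_td = 1.0  # assumed average detection time E(T_D)

mttf = estimate_mttf(t_r, t_d, e_td)
mttr = estimate_mttr(t_r, t_d, e_td)
mtbf = mttf + mttr  # follows from MTTR = MTBF - MTTF
print(mttf, mttr, mtbf)
print(survival_exponential(10.0, mttf))
```

With an unknown distribution, `survival_exponential` would be replaced by an empirical estimate of $1 - F_a$, as discussed in the text.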


When the probability density function $f_a(x)$ and the probability distribution function $F_a(x)$ of $X_a$ are unknown, an empirical distribution function (EDF) estimation method can be adopted to estimate $f_a(x)$ or $F_a(x)$. In addition, $\Pr(X_a \ge \tau_i + x - t_r^m)$ is used to estimate the probability that an S-transition happens on $[t_1, t_2)$ (see Proposition 1), which is used to count the average number of mistakes in that period. If we maximize $\Pr(X_a \ge \tau_i + x - t_r^m)$, then a maximum average number of mistakes on $[t_1, t_2)$ will be obtained. Therefore, we will get stricter QoS bound estimates for $P_A$, $T_M$, and $T_{MR}$. Thus, we can adopt $i = 1$ and $x = 0$ to simplify the estimation of $\Pr(X_a \ge \tau_i + x - t_r^m)$. Notice that the above method is only for the strict bound estimation rather than an optimized estimation.

4.2 Message Loss Length Estimation

As discussed earlier, the parameters related to message transmission are the average message delay ($E(D)$), probability of message loss ($p_L$), and the consecutive message loss number $X_L$ (see Fig. 6). Since $p_L$ and $E(D)$ estimation can be done very easily and have been introduced in many other papers such as [5], we do not discuss them here. The additional parameter $X_L$ is also used and captures the bursty message loss behavior. In this section, we propose a basic estimation method for $X_L$, assuming independent message transmissions.

Lemma 3. If each message's transmission and loss behavior is independent, then the mean number of consecutive message losses is $E(X_L) = \frac{p_L(1 - p_L^M)}{1 - p_L} - M p_L^{M+1}$, where $M$ is the maximum number of consecutive messages lost and $p_L$ is the probability that each message is lost during the transmission. The proof can be found in [29].

Remark 1. When $M \to +\infty$ and $0 < p_L < 1$, then $p_L^M \to 0$ and $M p_L^M \to 0$, and we obtain that

\[ E(X_L) = \frac{p_L}{1 - p_L}. \]

From the above lemma, we see that if each liveness message's transmission is independent, $E(X_L)$ depends only on $p_L$ and can be computed straightforwardly.

4.3 The Impact of Service Dependability Metrics on the QoS of the FDS

A thorough analysis of the impact of the service dependability metrics on the QoS of the FDS has been presented in [16]. Here, we only highlight the main observations.

4.3.1 The Impact on $T_M$ and $T_D$

Generally, for an FDS, the time-out length governs the failure detection speed because the FDS makes its decision at the time-out points. As the time-out length decreases, the FDS will make faster, but less accurate, decisions. As time-out increases, $T_D$ grows (detection slows down) but the FDS can tolerate more message delays or losses, which can improve the detection accuracy to some extent. For a CR-TS, continually increasing the time-out length may mean that failures become undetectable, because its recovery duration could be shorter than $T_D$. Thus, $E(T_M)$ will not increase more than the recovery duration, MTTR.$^6$

6. Assuming that $p_L$ and $D$ are not very large and $\mathrm{MTTR} \gg \eta$.

4.3.2 The Impact on $T_{MR}$

For a fail-free run, Chen et al. showed that when time-out length increases linearly, $T_{MR}$ increases exponentially (Fig. 12 in [5]). This implies that for such systems, an arbitrary level of $T_{MR}$ can be achieved. Roughly speaking, in a fail-free run, when time-out increases to $n\eta$ ($n \in \mathbb{Z}^+$ and $n \ge 1$), the FDS can tolerate around $n$ consecutive communication message losses. The mistake recurrence which is caused by message latency or loss decreases by $\frac{1}{P^n}$ rapidly, where

\[ P = p_L + (1 - p_L) \cdot \Pr(\text{time-out} < \text{Delay} < +\infty). \]

For a crash-recovery run, mistakes may occur on both crash and recovery (see Fig. 3b) since message transmission latency will delay the detection of the CR-TS's state change. These mistakes are inevitable. This means that the upper bound on $T_{MR}$ is governed by MTTF and MTTR (see inequalities (1)-(2) in Theorem 1). Even if all message delays and losses can be tolerated, $E(T_{MR})$ cannot increase to an arbitrary level when MTTF is not $+\infty$ and MTTR is not $+\infty$ or 0. If failure is detectable, $E(T_{MR})$ cannot exceed $\frac{\mathrm{MTBF}}{2}$ since for each MTBF duration, there will be at least two mistakes, corresponding to the two changes of state in the CR-TS. When failure is undetectable, mistakes may happen at the CR-TS's crash or recovery time. Then, $E(T_{MR})$ cannot exceed MTBF. Thus, after $E(T_{MR})$ reaches $\frac{\mathrm{MTBF}}{2}$, the overall $E(T_{MR})$ approaches MTBF gradually.

4.3.3 The Impact on $P_A$

$P_A$, the proportion of time that the FDS is not in a mistake state, will depend on the ratio of $E(T_M)$ and $E(T_{MR})$ ($P_A = 1 - \frac{E(T_M)}{E(T_{MR})}$ in [5]). If a service is fail-free, $P_A$ can rapidly approach 1. But in a crash-recovery run, when the time-out length is increased, both $E(T_M)$ and $E(T_{MR})$ will eventually reach their upper bounds, meaning that $P_A$ will also be bounded. Generally, as time-out increases, fewer failures will be detected and the mistakes caused by failures (see $T_M^3$ in Fig. 4c) will have more impact on $E(T_M)$; thus, $E(T_M)$ will approach MTTR, since the maximum length of $E(T_M^3)$ is MTTR. As the time-out length becomes larger with respect to MTTR, more failures become undetectable. Thus, $E(T_M)$ will gradually approach MTTR.

The speed of increase of $T_{MR}$ will depend on when $T_{MR}$ reaches $\frac{\mathrm{MTBF}}{2}$. Before this bound is reached, as the time-out length increases, $T_{MR}$ can increase exponentially fast, as more message losses can be tolerated. After $T_{MR}$ exceeds $\frac{\mathrm{MTBF}}{2}$, it can only increase gradually to MTBF, as time-out increases and more and more crashes become undetectable. Thus, when $T_{MR}$ reaches its upper bound but $T_M$ has not yet reached its upper bound, $P_A$ will decrease as time-out length increases. When both $T_M$ and $T_{MR}$ reach their upper bound, $P_A$ will approach $\frac{\mathrm{MTTF}}{\mathrm{MTBF}}$, which is equal to the availability of the CR-TS.

5 SIMULATION EVALUATION AND ANALYSIS

In previous sections, we have shown how to calculate the parameters of the FDS with a given set of QoS requirements and analyzed the QoS bounds of the crash-recovery FDS based on the NFD-S algorithm. In this section, we introduce




our analytical and simulation results, which verify our previous analysis work.

Fig. 8. The NFD-S algorithm: $E(T_M)$.

Fig. 9. The NFD-S algorithm: $E(T_{MR})$.

5.1 Evaluation of the Crash-Recovery FDS Based on the NFD-S Algorithm

For the simulation studies, we fix the heartbeat interval at $\eta = 1$ and gradually increase the time-out length.

The message transmission parameters are $p_L = 0.01$ and $E(D) = 0.02$, and the delay is assumed to be exponentially distributed. These settings are similar to those used in the simulations in [5].

The CR-TS is defined as a recoverable process with various values of MTTF and MTTR (exponentially distributed). We choose the exponential distribution for the following reasons. First, exponential failures are widely adopted for reliability analysis in many practical systems; second, unlike some heavy-tailed distributions such as the log-normal distribution, crash and recovery with an exponential distribution will occur with reasonable interarrival times, avoiding the CR-TS behaving like a fail-free or crash-stop process.

5.1.1 Analysis for the Basic QoS Metrics

We implemented the NFD-S algorithm presented in [5] to evaluate the QoS of the FDS and compared the results with the analytical results derived from Theorem 1. Figs. 8, 9, and 10 compare the QoS of the FDS based on the NFD-S algorithm (simulation results) and the corresponding analytical results from different perspectives. From these three figures, we have the following observations.

Fig. 8 presents the $E(T_M)$ of the FDS derived from simulation and analytical results for two values of MTTR, 5 and 50, with corresponding values of MTTF, 100 and 1,000. The simulation result for $\mathrm{MTTR} = 5$ shows that as the time-out length increases, $E(T_M)$ will tend to MTTR, i.e., $E(T_M)$ is bounded by MTTR. With the exponentially distributed MTTR used in the simulation, the proportion of the detectable crashes will decrease more gradually. Thus, $E(T_M)$ approaches MTTR more slowly than in the analytical results.

Simulation results for $\mathrm{MTTR} = 50$ confirm that if MTTR becomes large, as the time-out length increases, $E(T_M)$ can also grow large, since the bound is now large. Note that in the graph, we see only the linear part rather than the complete characteristics. If the time-out length was increased to 200, $E(T_M)$ would approach $\mathrm{MTTR} = 50$ closely.

An interesting phenomenon is visible in the graph as time-out increases from 0.5 to 1.1: $E(T_M)$ decreases (or increases more slowly), and then increases again. We analyze this phenomenon in detail as follows. Recall that for a given length of time-out, there are four aspects which have an impact on $T_M$: the message delay and loss, and the CR-TS's crash and recovery (see Fig. 4). $T_M$ caused by a message delay is governed by the ratio between $E(D)$ and $T_D$. For the same $E(D)$, as time-out increases, more delayed messages can be tolerated. Thus, $T_M$ caused by a message delay ($T_M^1$) will decrease and occur less frequently. $T_M$ caused by a message loss ($T_M^2$) is related to $\eta$, $p_L$, $E(D)$, and the time-out length. For constant message communication QoS (i.e., fixed $p_L$ and $E(D)$), $T_M$ caused by message loss is governed by the ratio between $\eta$ and $T_D$. Since more message losses can be tolerated as the time-out length increases, the average duration of $T_M^2$ will decrease, and $T_M^2$ will occur less frequently. $T_M$ caused by a crash ($T_M^3$) is mainly governed by $T_D$ (see Fig. 4c), because if a crash occurs, a false positive mistake will last until the time-out time or until the CR-TS recovers. For detectable crashes, as the time-out length increases, $T_M^3$ will increase. $T_M$ caused by a recovery ($T_M^4$) is mainly governed by $p_L$ and $E(D)$ (see Fig. 4d), since after the CR-TS's recovery, a recovery can be detected as soon as a valid liveness message is received.

From the above analysis, we know that for the same $\eta$, $p_L$, $E(D)$, MTTF, and MTTR, when the time-out length increases, the average mistake duration caused by message delays and message losses will decrease ($T_M^1 \downarrow$ and $T_M^2 \downarrow$), the average mistake duration caused by the CR-TS's crash will increase ($T_M^3 \uparrow$), and the average mistake caused by the CR-TS's recovery from a detectable crash is unaffected ($T_M^4$) but fewer crashes and recoveries will be detected. In the simulation with $p_L = 0.01$ and $\mathrm{MTBF} = 105$, when time-out is small, $T_M^2$ and $T_M^3$ occur with similar frequency. When time-out increases from 0.5 to 1.0 (the FDS can tolerate zero message loss and most message delays), $E(T_M)$ increases slowly because the impacts of $T_M^1 \downarrow$, $T_M^2 \downarrow$, $T_M^3 \uparrow$, and $T_M^4$ counterbalance. Overall, $E(T_M)$ is stable within this period. As the time-out length increases, $T_M^2$ will occur less frequently. But $T_M^3$ occurs every MTBF period. Thus, as


the time-out increases, $T_M^3$ will dominate and $E(T_M)$ will increase gradually.

Fig. 10. The NFD-S algorithm: $P_A$.

In the simulation with $p_L = 0.01$ and $\mathrm{MTBF} = 1{,}050$, when the time-out length is small, $T_M^2$ will have more impact than $T_M^3$, because $T_M^2$ occurs more frequently than the crash and recovery. Therefore, as the time-out length increases, the average duration of $T_M^2$ decreases and $T_M^2$ occurs less frequently; $E(T_M)$ will increase more slowly or even decrease since more message losses are tolerated. But if time-out continues to increase, $T_M^3$ will become dominant and $E(T_M)$ will then increase gradually.

Overall, Fig. 8 shows that in a crash-recovery run, $E(T_M)$ exhibits quite different characteristics from a fail-free or crash-stop run. If the message delay and the probability of message loss are not very large, $E(T_M)$ is bounded by MTTR. From Fig. 8, we also observe that $E(T_M)$ can possibly be decreased for some time-out values. Unlike in a fail-free run, continually increasing the time-out length cannot achieve a better $E(T_M)$.

Fig. 9 presents $E(T_{MR})$ of the FDS derived analytically and from simulation with exponential MTTF and MTTR as above. We can see that with constant time-out length, as MTBF increases, $E(T_{MR})$ also increases. This implies that $E(T_{MR})$ is greatly impacted by the dependability of the CR-TS.

We can also see that for both these simulation cases, $E(T_{MR})$ initially increases exponentially fast but after $E(T_{MR})$ reaches $\frac{\mathrm{MTBF}}{2}$, the rate of increase is reduced. For the CR-TS with exponential MTTR, $E(T_{MR})$ will increase gradually and approach MTBF, until all crashes become undetectable. This is because for nondeterministic MTTR, as the time-out length increases, the proportion of the detectable crashes decreases. Therefore, for the detectable crashes, $T_{MR} \le \frac{\mathrm{MTBF}}{2}$, and for the undetectable crashes, $T_{MR} \le \mathrm{MTBF}$. Thus, $E(T_{MR})$ will increase gradually between $[\frac{\mathrm{MTBF}}{2}, \mathrm{MTBF}]$, and finally stabilize at MTBF. All of these results match our analysis in Section 4.3 well and indicate that if a CR-TS is not fail-free ($\mathrm{MTTF} \to \infty$) or crash-stop ($\mathrm{MTTR} \to \infty$), $E(T_{MR})$ will be bounded by MTBF when failures are undetectable and by $\frac{\mathrm{MTBF}}{2}$ when failures are detectable.

Fig. 10 considers $P_A$ under the same communication QoS. We see that when MTBF increases, $P_A$ will be improved. This is because $E(T_{MR})$ also increases. Thus, from the equation $P_A = 1 - \frac{E(T_M)}{E(T_{MR})}$, we know that for the same time-out length, when MTBF increases, a better $P_A$ can be achieved.

However, from Fig. 10, we can also see that as the time-out length increases, $P_A$ is not always increasing as in a fail-free or crash-stop run. Continually increasing time-out could decrease $P_A$. This is because $T_{MR}$ is bounded by MTBF or $\frac{\mathrm{MTBF}}{2}$ as discussed above. After $E(T_{MR})$ reaches $\frac{\mathrm{MTBF}}{2}$, it increases slowly rather than exponentially fast but $E(T_M)$ increases linearly and faster than $E(T_{MR})$. Thus, $P_A$ decreases, and finally, $P_A$ will approach $\frac{\mathrm{MTTF}}{\mathrm{MTBF}}$, which is equal to the availability of the CR-TS.

The above results indicate that for a highly available CR-TS, a reasonable QoS for the FDS can be achieved even if the FDS always trusts the CR-TS, when only the QoS metrics defined in [5] are considered. This is especially true for a highly available and highly consistent but not highly reliable CR-TS. However, the completeness property of the FDS will not be satisfied. Consequently, these simulation results demonstrate the necessity of the additional QoS metrics we proposed in Section 3.3 to measure the completeness aspects and the speed of the recovery detection of a crash-recovery FDS. Furthermore, these results also demonstrate the necessity of adopting the recovery detection protocols in [29], which can improve the proportion of detected failures without reducing other QoS aspects.

In Figs. 8, 9, and 10, we can also observe how the dependability of a CR-TS can influence the QoS of the FDS. Particularly, for a highly available but not highly reliable CR-TS, the dependability of the CR-TS can have more impact than the performance of the algorithm and the QoS of message transmission. In such situations, the dependability of the CR-TS must be taken into account for the FDS design and implementation.

From Figs. 8, 9, and 10, we can see that $P_A$, $E(T_{MR})$, and $E(T_M)$ have bounds. Continually increasing the time-out length might not be a reasonable way to achieve better $P_A$, $E(T_{MR})$, and $E(T_M)$. A potential trade-off exists between the QoS metrics. For instance, for the NFD-S algorithm, $\text{time-out} \in (1, 1.1)$ ($\text{time-out} + \eta \in [2, 2.1]$) might achieve the best overall QoS.

In addition, $E(T_M)$ in a crash-recovery run exhibits quite different characteristics compared with a fail-free or crash-stop run. This is because in a crash-recovery run, the mistakes caused by the crash and recovery are taken into consideration, which means continually increasing the time-out length will not always decrease $E(T_M)$. It may have the effect of increasing false positive mistakes ($T_M^3$, see Fig. 4). As the time-out length increases, mistakes caused by message delays and losses will occur less frequently, and false positive mistakes (which were not considered previously) will start to dominate the QoS of the FDS.

From Figs. 8, 9, and 10, we can observe that the simulation results of $E(T_M)$ are smaller than the analytical results, and the simulation results of $E(T_{MR})$ and $P_A$ are larger than the analytical results, which indicates that the bound analysis of the basic QoS metrics in Theorem 1 is valid and the simulation results satisfy the QoS requirements according to the analysis. We can also observe a gap between the analytical and simulation results. This is caused by the overestimation or underestimation of some values within the analytical results. $E(T_M)$ is overestimated




                                                                   Fig. 12. The QoS relationship between communication, CR-TS,
                                                                   and FDS.

Fig. 11. The NFD-S algorithms: EðRDF Þ.                            decreases. When MTTR becomes shorter, EðRDF Þ will
                                                                   decrease faster. This is because the smaller MTTR is, the
                                                                                                           U
by using the total mistake duration over the underestimated        sooner time-out þ  crosses MTTR (TD  MTTR). Therefore,
average number of mistakes that might occur within a crash-        more crashes remain undetected when the NFD-S algorithm
recovery period. Thus, the analytical results of EðTM Þ will be    is adopted. In Fig. 11, we can also see that the simulation
larger than the simulation results. Similarly, EðTMR Þ is          results of EðRDF Þ are larger than the analytical results, which
underestimated by using the observation duration (MTBF)            means that the bound analysis of EðRDF Þ is valid and the
over an overestimation of the number of mistakes that              simulation results satisfy the QoS requirements in terms of
might occur within a period. For instance, the number of           RL . However, since most existing failure detection algo-
                                                                     DF
                                                                   rithms adopt increasing the time-out length to tolerate more
mistakes within the period is estimated as dEðDÞe þ 1, which
                                                 
                                                                   message losses and delays, if a CR-TS is recoverable and
is an upper bound rather than the average number. It
                                                                   recovers fast, it could be difficult for these algorithms to
follows that EðTMR Þ of the analytical results will be smaller
                                                                   achieve the QoS in [5] and satisfy the completeness property at
than the simulation results. Finally, PA is underestimated by
                                                                   the same time. In such a situation, the recovery detection
using one minus an overestimated total mistake duration            protocol introduced in [29] can be adopted, which can solve
over the observation period (MTBF). Thus, $P_A$ of the analytical results
will be smaller than the simulation results.
   All of these results satisfy the QoS requirements
$E(T_M) \le T_M^U$, $P_A \ge P_A^L$, and $E(T_{MR}) \ge T_{MR}^L$. In addition,
according to the NFD-S algorithm, the failure detection time $T_D$ is bounded
by $\eta$ + time-out regardless of the correctness of the detection; thus,
$T_D \le T_D^U$ must be satisfied.
   From Figs. 8, 9, and 10, we can also see that there are some gaps between
the analytical results and the simulation results. This is mainly caused by
the overestimating and underestimating method we adopted to restrict the
failure detector’s QoS bound, as discussed above. In addition, we use MTBF,
MTTF, and MTTR, which are expected values rather than the real values for
each failure and recovery. In the simulation, the results are calculated from
randomly generated failure and recovery times, which represent the real time
to failure and recovery, and these random variables will deviate from the
expected values. Thus, there will be some discrepancies between the
simulation and analytical results. These gaps show that there is still room
to improve the accuracy of the model, and it would be interesting to
investigate this point further in the future.

5.1.2 Analysis for the Extended QoS Metrics
We also plot the simulation and analytical results for the failure detection
proportion ($R_{DF}$) defined in Section 3.3 to demonstrate the impact of the
failure and recovery events on this metric.
   Fig. 11 shows the proportion of failures detected by the FDS, for different
dependability characteristics of the CR-TS, based on both simulation and
analytical results. As the time-out length increases, $E(R_{DF})$ of the NFD-S
algorithm
this problem reasonably well.

6 CONCLUSION
In this paper, the crash-recovery target and its failure detector are modeled
as stochastic processes. We redefined previously proposed QoS metrics to be
applicable to crash-recovery failure detection and introduced some new metrics
to measure the recovery detection speed and the completeness property of a
failure detector. We also discussed the impact of the monitored target’s
crash-recovery behavior on each QoS metric and showed that if a failure
detector’s parameters are to be accurately estimated, these dependability
characteristics must be taken into account. Thus, we showed how to configure
the failure detector to satisfy a given set of requirements based on the
dependability characteristics in addition to the QoS of message transmission
(see Fig. 12). This was based on the NFD-S algorithm [5]. Our analysis shows
that the QoS analysis in [5] is a particular case of a crash-recovery run.
Furthermore, we discussed how to estimate the input parameters for the
algorithm.
   Finally, the plotted simulation and analytical results demonstrate that our
QoS bound analysis is valid and can be used as an approximate solution for
computing the failure detector’s parameters, or for estimating the QoS bounds
if the failure detector’s parameters are given. Our simulation results confirm
that when a failure detector is designed and implemented, the dependability of
the crash-recovery target needs to be considered in order to achieve more
accurate parameter estimation. Furthermore, if the recovery of the monitored
target needs to be detected, further enhancement of the existing algorithms is
needed.
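
The discussion of the gaps between analytical and simulation results rests on the observation that sampled failure and recovery times deviate from MTTF and MTTR. The following is a minimal sketch of that effect, not the simulator used in this paper. It assumes exponentially distributed times to failure and to recovery, which is an illustrative choice only, and the names `simulate_crash_recovery` and `expected_availability` are hypothetical. It compares sample means over a finite number of crash-recovery cycles with the expected values, using the standard steady-state availability MTTF / (MTTF + MTTR).

```python
import random


def simulate_crash_recovery(mttf, mttr, cycles, seed=0):
    """Sample `cycles` up/down periods of an alternating renewal process.

    Assumption (illustrative, not taken from the paper): times to failure
    and times to recovery are exponentially distributed with means `mttf`
    and `mttr` respectively.
    """
    rng = random.Random(seed)
    # expovariate takes the rate, i.e. 1 / mean.
    up = [rng.expovariate(1.0 / mttf) for _ in range(cycles)]
    down = [rng.expovariate(1.0 / mttr) for _ in range(cycles)]
    total_up, total_down = sum(up), sum(down)
    return {
        "mean_ttf": total_up / cycles,        # sample mean time to failure
        "mean_ttr": total_down / cycles,      # sample mean time to recovery
        "availability": total_up / (total_up + total_down),
    }


def expected_availability(mttf, mttr):
    """Steady-state availability computed from the expected values."""
    return mttf / (mttf + mttr)


if __name__ == "__main__":
    mttf, mttr = 100.0, 10.0
    print("expected availability:", expected_availability(mttf, mttr))
    for cycles in (10, 1000, 100000):
        result = simulate_crash_recovery(mttf, mttr, cycles, seed=1)
        print(cycles, result)
```

With few cycles the sample means can differ noticeably from MTTF and MTTR, and the difference shrinks as the number of cycles grows. This is one plausible source of the discrepancies described above, alongside the over- and underestimation used to bound the QoS.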
  • 1. IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 7, NO. 3, JULY-SEPTEMBER 2010 271 On the Quality of Service of Crash-Recovery Failure Detectors Tiejun Ma, Jane Hillston, and Stuart Anderson Abstract—We model the probabilistic behavior of a system comprising a failure detector and a monitored crash-recovery target. We extend failure detectors to take account of failure recovery in the target system. This involves extending QoS measures to include the recovery detection speed and proportion of failures detected. We also extend estimating the parameters of the failure detector to achieve a required QoS to configuring the crash-recovery failure detector. We investigate the impact of the dependability of the monitored process on the QoS of our failure detector. Our analysis indicates that variation in the MTTF and MTTR of the monitored process can have a significant impact on the QoS of our failure detector. Our analysis is supported by simulations that validate our theoretical results. Index Terms—Failure detectors, crash recovery, quality of service, availability, dependability, performance. Ç 1 INTRODUCTION and accuracy, of crash failure detector implementations and F AULT tolerance is one of the most important issues for achieving dependable distributed systems. One of the most challenging problems in this research area is to tolerate failure detection algorithms, e.g., [5], [6], [7], [8], [9], [10]. It is important to note that most of this previous work the Byzantine failure, which is also sometimes called the focused on the QoS of crash failure detectors is based on the arbitrary failure. This means that a process may behave in crash-stop or fail-free assumption. The fail-free assumption an arbitrary manner, producing arbitrary responses at assumes that failures do not occur. The crash-stop assumption arbitrary time [1]. It is the most difficult failure to detect. 
assumes that there is only one failure and the monitoring One possible solution of Byzantine failure detection is procedure terminates once that crash failure is detected. The adopting consensus algorithms. To achieve K fault toler- algorithms based on these assumptions focus on how to ance, 3K þ 1 service replications are needed [2]. In the worst estimate the probabilistic message arrival time and a suitable case, the K faulty services may send incorrect values, or time-out period for a failure detector to ensure a required QoS. incorrectly represent the values of others, but the remaining However, fail-free and crash-stop can be strong assump- 2K þ 1 services can still return the same correct answer. tions. An alternative approach is to consider the crash- Crash failure detection is one of the most important building recovery paradigm as discussed by Guerraoui and Rodrigues blocks to achieve a successful consensus. However, detect- [11]. A process can keep crashing and recovering infinitely ing crash failures is a difficult problem. In [3], Fischer et al. often and it is eventually always up and running. In theory, a show the impossibility of separating a crashed process and a process recovery can be achieved by adopting stable storage very slow one, in a pure asynchronous system, known as the and the state information of the process can be stored and Fischer-Lynch-Paterson’s impossibility result. Subse- retrieved from the storage. After a crash is detected, the quently, failure detector oracles, which give possibly recovery procedure can be initiated to retrieve the latest erroneous information about the state of the monitored stored process information. In practice, in order to provide target, have been proposed. 
In [4], Chandra and Toueg high availability, self-repairing and self-healing mechanisms introduce the concept of unreliable crash failure detectors to are widely adopted in fault-tolerant systems to achieve detect the eventual crash behavior of a process and classify automatic recovery after a crash occurs. Particularly, in a set of abstract failure detectors based on the failure middleware systems, many techniques and algorithms have detectors’ eventual behavior to solve a certain set of been proposed to achieve the self-repairing or self-healing membership problems. This work inspired many research- goal, e.g., [12], [13], [14], [15]. ers to study the quality of service (QoS), such as the speed In such systems, it is assumed that the system undergoes periodic crashes. During a crash period, the system is unable to service any requests or send any messages, externally behaving as if the system is unreachable. The end of the crash period is marked by a recovery, after which the system . T. Ma is with the Department of Computing, Imperial College London, South Kensington Campus, 180 Queens Gate London, SW7 2AZ, UK. returns to normal service and its internal state is restored to E-mail: tma@doc.ic.ac.uk. the state before the crash failure occurred. . J. Hillston and S. Anderson are with the Laboratory for Foundations of For such systems, crash-recovery failure needs to be Computer Science, School of Informatics, University of Edinburgh, considered as a frequently occurring failure type to be 10 Crichton Street, Edinburgh EH8 9AB, UK. detected. However, the crash-recovery case has been little E-mail: {jeh, soa}@inf.ed.ac.uk. studied, due to the fact that there are more possible Manuscript received 19 Feb. 2008; revised 21 Apr. 2009; accepted 30 June discrepancies between the failure detector and the monitored 2009; published online 11 Aug. 2009. 
For information on obtaining reprints of this article, please send e-mail to: target, increasing the size of the state space of the monitoring tdsc@computer.org, and reference IEEECS Log Number TDSC-2008-02-0037. process, making the QoS analysis for such a paradigm more Digital Object Identifier no. 10.1109/TDSC.2009.36. complicated. 1545-5971/10/$26.00 ß 2010 IEEE Published by the IEEE Computer Society
In [16], we presented an evaluation of the QoS of a crash-recovery failure detector based on a simple time-out algorithm. A crash-recovery target was modeled as an alternating renewal process. The simulation results showed that the crash-recovery behavior of the monitored target will impact the QoS of such a failure detector, which implied that the crash-recovery paradigm merited further study. Such an analysis was presented in [17]. In that paper, we outlined how to model the failure detection pair in a crash-recovery run and how to configure the failure detector to satisfy a given QoS requirement. The current paper represents a substantial expansion of [17]. We present more analytical details and support the results with further simulation studies. Analytical results, derived directly from the equations in this paper, are also plotted and compared with the simulation results. We are then able to present a detailed analysis for each of the QoS metrics, which shows the validity of our model.

1.1 Our Contribution

We show how to remove the fail-free or crash-stop assumption and model the probabilistic behavior of a failure detector with respect to a crash-recovery target, taking into consideration general dependability metrics, such as mean time to failure (MTTF) and mean time to recovery (MTTR). We outline how the QoS of a failure detector is limited by the dependability of the monitored target. Moreover, we establish that the crash-stop or fail-free models are special cases of the crash-recovery model.

In order to effectively assess the QoS of the failure detector in a crash-recovery run, we have defined new QoS metrics to measure the recovery detection speed and the proportion of the failures of the monitored target which are detected. To make an accurate estimation of the failure detector's parameters needed to achieve a required QoS, a configuration procedure for a crash-recovery failure detector is outlined. We demonstrate how to achieve the QoS from a given set of requirements based on the NFD-S algorithm proposed by Chen et al. [5], with suitable modifications (see Appendix B, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TDSC.2009.36). To the best of our knowledge, none of these aspects of QoS of failure detectors have been presented before.

1.2 Related Work

In [5], Chen et al. propose a set of QoS metrics to measure the accuracy and speed of a failure detector. Their model contains a pair of processes: one is the monitor process, the other is the monitored process, and there is only one crash during the monitoring period. The analysis is based on two separate stages of failure detection: the precrash stage, which is a fail-free run; and the postcrash stage, which is a crash-stop run when the monitoring procedure will be terminated. In order to formally define the QoS metrics, Chen et al. [5] define state transitions of a failure detector monitoring a target process under the fail-free assumption. At any time, the failure detector's state is either Trust or Suspect with respect to the monitored process's liveness. If a failure detector moves from a Trust state to a Suspect state, then an S-transition occurs; if the failure detector moves from a Suspect state to a Trust state, then a T-transition occurs. Fig. 1 shows the state transitions of the failure detector and the QoS metrics. In terms of the transitions defined above and the fail-free assumption, Chen et al. define the following QoS metrics for a failure detector: failure detection time ($T_D$), mistake recurrence time ($T_{MR}$), mistake duration ($T_M$), good period duration ($T_G$), and query accuracy probability ($P_A$).

Fig. 1. The QoS metrics without considering false positive mistakes.

Some recent research has extended the QoS work of [5] in a number of ways. For example, the authors of [6], [9], [10], [18] refine the model with different probabilistic message delay and loss estimation methods. Meanwhile, others, such as [7], [8], [19], [20], [21], focus on the scalability and adaptivity of crash failure detection. But all of these papers are based on eventual crash-stop behavior of the monitored process or the fail-free assumption. Crash-recovery failure detectors have been considered by several groups. For example, Boichat and Guerraoui [22] implemented reliable and total order broadcast primitives, assuming a practical asynchronous crash-recovery model in which the processes and channels may crash and recover or crash and never recover. The work in [23], [24], [25], [26] proposes failure detectors to solve consensus problems rather than focusing on the QoS of the failure detector itself. In [23], the monitored process is characterized as always-up, eventually-up, eventually-down, or unstable. A process which crashes and recovers infinitely many times is regarded as unstable. But crash-recovery looping behavior exists for most systems. From the perspective of stochastic theory, crash-recovery behavior can be regarded as a regenerative process in which the probabilistic live and recovery times are not zero. In the following sections, we will analyze such a crash-recovery paradigm and its failure detector from a QoS perspective.

This paper is organized as follows: in Section 2.1, we model a crash-recovery service with general dependability metrics. Then, we show our model of the probabilistic message communication and its QoS metrics. In Section 3, we show how to model the crash-recovery failure detector's probabilistic behavior. We refine the completeness of a crash-recovery failure detector and extend the QoS metrics to measure the completeness and the recovery detection speed of such a failure detector. Then, we show how to involve the general dependability metrics for an approximate analysis of the QoS of a failure detector and how to configure a crash-recovery failure detector to satisfy a given set of QoS requirements. Moreover, we discuss the impact of the dependability of the crash-recovery service on the QoS of failure detectors in detail. In Section 4, the estimation of the input parameters of a crash-recovery failure detector is presented. We show how to estimate the message delay, message loss, MTTF, MTTR, etc., in a crash-recovery run. In
Section 5, the analytical and simulation results are plotted and analyzed in detail. We show that the dependability of a crash-recovery target has an impact on the QoS of a failure detector and our analysis is valid. In Section 6, a brief summary of the paper is presented. Appendix A provides a notation table for the variables used in the paper. Appendix B shows the pseudocode of the NFD-S algorithm. Appendix C presents the main proofs of the lemmas and theorems presented in this paper.

2 CRASH-RECOVERY SERVICE AND QoS OF MESSAGE COMMUNICATION

In this section, we outline the assumptions underlying our framework, considering the crash-recovery behavior of the target service, its dependability characteristics, and the behavior of the communication channel which supports the failure detection process.

2.1 The Crash-Recovery Service Modeling

For a crash-recovery target service (CR-TS), we consider that the service might crash at an arbitrary time and take some time to be repaired and restarted after it fails. Let $S$ be the state space of a stochastic process $Z := \{Z(t), t \ge 0\}$, where $Z$ captures a CR-TS's lifetime. Then, $S$ can be regarded as {Alive, Crash} and the CR-TS can periodically switch between these two states. A transition occurs when the state of the CR-TS changes. Fig. 2 shows the state transitions of a CR-TS, where a C-transition occurs when the state of the CR-TS switches from the Alive state to the Crash state; an R-transition occurs when the state of the CR-TS switches from the Crash state to the Alive state.

Fig. 2. Crash-recovery service modeling.

Assumption 1. If the service's recovery is treated as a restart, then the CR-TS's lifetime $Z$ is a regenerative process.

Assumption 1 will be used in the following. It is based on the following observations. The CR-TS will periodically crash and recover, leading to a sequence of time points, $S_1, S_2, \ldots, S_n, \ldots$ ($n \ge 0$), representing the times of the CR-TS's recovery. The behavior of the system after $S_n$ ($n \ge 0$) is independent of what has occurred before, and thus, $S_n$ can be regarded as a restart. Moreover, the probability of $S_n$ occurring is 1. This makes the time points $S_1, S_2, \ldots, S_n$ regeneration points.

Since the CR-TS's lifetime $Z$ is a regenerative process and the sequence $\{S_1, S_2, \ldots, S_n, \ldots\}$ characterizes the lifetime of the service, we can give an alternative definition of the stochastic process $Z$. The stochastic process $Z$ is a set of random variables $\{X(n), n \in N\}$, where $X(n)$ is the random variable representing the time which elapses from the time of the $n$th regeneration point to the $(n+1)$th one (i.e., $X(n) = S_{n+1} - S_n$). For simplicity of presentation, we use $X$ instead of $X(n)$ in the following since it is sufficient to consider a single regeneration period. Furthermore, we can consider $X$ to be the sum of two independent random variables: $X_a$ and $X_c$. Here, $X_a$ represents the time which elapses from the time that the CR-TS starts a regeneration period to the time the CR-TS fails and $X_c$ represents the time from when the CR-TS fails until the time of the next regeneration point.

Lemma 1. In steady state, the CR-TS is an alternating renewal process and the time between any two consecutive recovery time points is one period of the crash-recovery service's lifetime.

Thus, we assert that in order to design a failure detector for the CR-TS, which is sensitive to the CR-TS's behavior, we only need to consider one period of the CR-TS since all of the other periods are independent and identically distributed.

2.2 Dependability of a Crash-Recovery Service

Dependability, one of the most important issues for computer systems, is a complex attribute. Laprie et al. [1] define the concept of dependability as the property of a computer system such that reliance can justifiably be placed on the service it delivers. Associating timing information with the behavior of a system, its dependability can be described quantitatively. Generally speaking, the dependability of a system can be measured according to a number of different aspects such as reliability, availability, consistency, usability, security, etc. In order to simplify the measurements which are related to failure detection, here, we only introduce reliability, availability, and consistency, which are strongly related to the QoS of failure detectors.

In [27], Knight and Strunk give a definition of software reliability and availability. We extend this with a definition of consistency as follows:

• Reliability: the probability that the system will operate correctly in a specified operating environment up until time $t$ ($t > 0$).
• Availability: the probability that the system will be operational at time $t$.
• Consistency: the probability that, in a specified operating environment, the system will return to normal operation correctly after a failure within time $t$.

These three metrics present different aspects of the system dependability. Generally, reliability presents how long a system will operate correctly and can be captured by MTTF, which records the likelihood of a service to persist without a failure. Availability presents the probability that a system is accessible or reachable with correct operation at an arbitrary time and can be captured by mean time to failure divided by mean time between failures (MTTF/MTBF). Consistency presents the ability of a system to recover from a failure state to the correct operation state and can be captured by MTTR, which records how quickly a system recovers.

In different scenarios, different aspects of dependability may be given greater relative importance. For example, consistency may be valued more than reliability in a system designed to be always accessible. This means that fault-tolerance mechanisms should be able to adapt to reflect differing dependability requirements.
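As an illustration of the quantities in Sections 2.1 and 2.2, the following minimal sketch simulates a crash-recovery target as an alternating renewal process and computes the empirical MTTF, MTTR, MTBF, and availability (MTTF/MTBF). The function names and the choice of exponentially distributed $X_a$ and $X_c$ are assumptions made only for this example; the model in the text does not fix a particular distribution.

```python
import random


def simulate_cr_ts(mttf, mttr, periods, seed=0):
    """Draw `periods` regeneration periods of a crash-recovery target.

    Assumption (illustration only): X_a and X_c are exponentially
    distributed with means `mttf` and `mttr`.
    Returns a list of (x_a, x_c) pairs, one pair per period.
    """
    rng = random.Random(seed)
    return [
        (rng.expovariate(1.0 / mttf), rng.expovariate(1.0 / mttr))
        for _ in range(periods)
    ]


def dependability(samples):
    """Empirical MTTF, MTTR, MTBF and availability from (x_a, x_c) pairs."""
    n = len(samples)
    mttf = sum(x_a for x_a, _ in samples) / n
    mttr = sum(x_c for _, x_c in samples) / n
    mtbf = mttf + mttr  # each period has length X = X_a + X_c
    return {
        "MTTF": mttf,
        "MTTR": mttr,
        "MTBF": mtbf,
        "availability": mttf / mtbf,
    }
```

For instance, `dependability(simulate_cr_ts(1000.0, 10.0, 100_000))` should give an availability close to 1000/1010 under the exponential assumption above.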
2.3 QoS of Message Communication

In order to measure the communication between the FDS and target service quantitatively, we define the communication path between the FDS and the target service as a channel. Each communication component pair holds one or more virtual one-way, source-to-destination channels. Messages can only flow from the source component to the destination component. In addition, the channel model in this paper relies on the assumption of a basic unreliable communication channel with fairness, no-creation, and no-duplication [28]. This has some similarities with the Stubborn channels in [28], but they allow duplicated messages and we assume that there are no duplicated messages in our model.

This channel-based communication, which maintains the interaction between the FDS and the CR-TS, can be characterized by the QoS of the communication, the adopted failure detection algorithm, and the adopted communication protocol, each of which has some associated properties. In particular, we take the message transmission behavior to be probabilistic: we describe the message delay or loss as probabilistic behaviors associated with the communication channel.

Definition 1. Let $D$ be a random variable representing the time which elapses from the time a message is sent until the time it arrives at the destination and $E(D)$ be the average message delay; let $p_L$ be the probability of a message loss during the transmission; let $X_L$ be a random variable representing the number of consecutive messages lost and $E(X_L)$ be the average number of consecutive messages lost.

From these definitions, properties such as the following can be derived:

Lemma 2. If each message's transmission and loss behavior are independent, then the probability that $x$ ($x \ge 1$) consecutive messages are lost is

$$\Pr(X_L = x) = p_L^x \cdot (1 - p_L).$$

Overall, the QoS of this channel-based communication between the FDS and the CR-TS can be captured by $E(D)$, $p_L$, and $E(X_L)$. In the following sections, we analyze how the FDS monitors the CR-TS and how the FDS can be configured based on the characteristics of this channel-based communication.

3 QoS OF THE CRASH-RECOVERY FDS

3.1 System Model

We consider a distributed system model with two services: one FDS and one CR-TS, distributed over a wide-area network. The FDS and the CR-TS are connected by an unreliable communication channel (see Section 2.3). Liveness (heartbeat) messages are transmitted through the channel. The communication channel does not create or duplicate liveness messages, but the messages might be lost or delayed indefinitely during transmission (footnote 1). The CR-TS can fail by crashing but can be repaired and restarted after some repair time, i.e., it behaves as a crash-recovery model. The drift of the local clocks of the FDS and the CR-TS is small enough to be ignored and their local clocks are sufficiently synchronized (this can be guaranteed by some time synchronization service such as the Network Time Protocol used in [6]) to be regarded as a clock synchronized system. The failure detection algorithm we adopt is the NFD-S algorithm proposed in [5].

Footnote 1. This channel-based message transmission is the same as the probabilistic network model in [5].

3.2 Modeling a Push-Style Crash-Recovery FDS

The failure detector (FDS) has a set of suspicion levels $S^s := \{Trust, Suspect\}$ as in [5]. The FDS can either trust or suspect a CR-TS's liveness. Thus, for a fail-free run, a service only has one state: Alive. The state space of an FDS is $S^f := \{Trust\text{-}Alive, Suspect\text{-}Alive\}$, and the event space of an FDS is $F := \{S\text{-}transition, T\text{-}transition\}$ (Fig. 3a). For a fail-free run, the QoS metrics of an FDS can be measured quite straightforwardly. The average time spent in the Trust state is the mean length of the good period $E(T_G)$; the average time spent in the Suspect state is the mean time of the mistake duration $E(T_M)$; the average time between two consecutive transfers to the Suspect state (two consecutive S-transitions) is the mean time of the mistake recurrence $E(T_{MR})$.

Fig. 3. State space in a crash-recovery run. (a) Fail-free transition. (b) Crash-recovery transition.

However, precisely speaking, the state space of an FDS is $S^c := S \times S^s$, where $S$ is the state space of the target service. Therefore, for a CR-TS with failures, the state space of its FDS increases because the service has more than one state (see Fig. 3b). If the number of suspicion levels is more than two, then $S^c$ will increase as well. The QoS metrics of an FDS are no longer as simple as for fail-free runs.

For a fail-free run (MTTF $\to +\infty$) or a crash-stop run (MTTR $\to +\infty$), the CR-TS's current state $S^{CR\text{-}TS}$ will be Alive for all time up to the crash, and it is easy to deduce the FDS's accuracy $S^A$ directly from the FDS's current state. However, for a crash-recovery run, since the CR-TS could fail or recover at an arbitrary time, $S^A$ cannot be deduced solely from the state of the FDS.

Furthermore, compared with a fail-free or crash-stop run, there are more mistake types in a crash-recovery run. In previous work, such as [5], [6], [8], [9], [10], [18], [20], only the mistakes caused by the message transmission behaviors (message delay and loss) are considered. But in a crash-recovery run, a mistake starts whenever the CR-TS's and FDS's states diverge. Thus, there are also mistakes caused by the CR-TS's crash (see $T_F$ in Fig. 1 or $T_M^3$ in Fig. 4c) and recovery (see Fig. 4d) due to the delayed detection of such events. Fig. 4 shows the four types of mistake which could occur within a crash-recovery run. $T_M^1$ in Fig. 4a represents a
mistake caused by a message delay. $T_M^2$ in Fig. 4b represents a mistake caused by a message loss. $T_M^3$ in Fig. 4c represents a mistake caused by the CR-TS's crash, while the FDS still trusts the CR-TS. $T_M^4$ in Fig. 4d represents a mistake caused by the CR-TS's recovery, while the FDS still suspects the CR-TS. A message loss or delay will result in a Suspect-Alive mistake of the FDS (see Fig. 3b). A crash failure will result in a Trust-Crash mistake. A recovery event will result in a Suspect-Alive mistake. Mistakes caused by different reasons will result in different FDS parameter reconfiguration plans. For instance, the best way for the FDS to tolerate more message losses or a longer message delay is to increase the time-out duration; the best way for the FDS to minimize the mistake duration caused by a crash event is to decrease the time-out duration; and the best way to minimize the mistake duration caused by a recovery event is to increase the liveness message sending frequency. Thus, we can see that an inaccurate mistake type identification might reduce the QoS of an FDS and should be avoided.

Fig. 4. The analysis of possible $T_M$ in a crash-recovery run. (a) $T_M^1$. (b) $T_M^2$. (c) $T_M^3$. (d) $T_M^4$.

From the above analysis, we can see that due to the increased number of mistake types in a crash-recovery run, the definition of the QoS metrics in [5] using transitions is not valid in a crash-recovery run. Thus, we redefine them as below:

• Detection time ($T_D$): The elapsed time from when the monitored target crashes until the failure detector correctly suspects the monitored target.
• Mistake recurrence time ($T_{MR}$): The time between the occurrence of two consecutive mistakes.
• Mistake duration ($T_M$): The time to correct a mistaken suspect or trust.
• Good period duration ($T_G$): The duration for which the failure detector maintains the correct state information.
• Query accuracy probability ($P_A$): The probability that the state information from the failure detector is correct at an arbitrary time.

The above QoS metrics can measure some QoS aspects of a failure detector in a crash-recovery run. However, they cannot measure how fast a recovery can be detected, the proportion of the detected failures over the occurred failures (completeness), etc. In the following section, we extend the QoS metrics to measure the recovery detection speed and the completeness of a failure detector.

3.3 Extended QoS Metrics for a Crash-Recovery FDS

For an FDS in a crash-recovery run, in addition to the QoS metrics introduced above, we propose some new QoS metrics.

First, in order to measure the speed with which an FDS can discover a recovery of the CR-TS, we define the recovery detection time ($T_{DR}$), a random variable which represents the time that elapses from the CR-TS's recovery time (an R-transition) to the time when the FDS discovers the recovery.

Second, in a crash-recovery run there is no eventual behavior of a CR-TS, and a fast recovery could make a failure undetectable by the FDS. Under such circumstances, the completeness property of a failure detector defined in [4] can no longer be satisfied. In order to reflect this situation, we refine the definition of the completeness as follows:

• Strong completeness: Every crash failure of a recoverable process will be detected.
• Weak completeness: A specified proportion of the crash failures of a recoverable process will be detected.

Therefore, in order to measure the completeness property of a crash-recovery FDS, we propose a new QoS metric. The detected failure proportion ($R_{DF}$) is a random variable capturing the ratio of the detected crashes over the occurred crashes ($0 \le R_{DF} \le 1$). When no crash failures are detected, $R_{DF} = 0$. When all of the occurring crashes are detected, $R_{DF} = 1$. The strong completeness property of an FDS requires that $E(R_{DF}) = 1$ (where $E$ denotes expectation). The weak completeness property requires that $E(R_{DF}) \ge R_{DF}^L$,
where $R_{DF}^L$ is the specified lower bound of the detected failure proportion and $0 \le R_{DF}^L \le 1$.

Overall, the QoS for a crash-recovery FDS can be captured by $P_A$, $T_M$, $T_{MR}$, $T_D$, $T_{DR}$, and $R_{DF}$. In the next section, we will analyze the QoS bounds of the FDS based on the NFD-S algorithm in a crash-recovery run by adopting the proposed basic and extended QoS metrics.

3.4 QoS Estimate of the Crash-Recovery FDS Based on the NFD-S Algorithm

In a crash-recovery run, as the state of a CR-TS can switch between Alive and Crash, these crash or recovery events will force the output of the FDS to be accurate or inaccurate. For analyzing the behavior of the failure detection pair, we want to pick an observation period, which will cover all the events which may possibly occur. In our model, we pick one MTBF period as the observation period. This is because, as we discussed in Section 2.1, in order to study the steady state behavior of a CR-TS throughout its lifetime, we only need to observe the time period between two consecutive regeneration points (recovery times) of the CR-TS and the average duration between the two consecutive regeneration points is MTBF. In the following, we will treat these as also regeneration points of the system consisting of the failure detection pair. This is an approximation made for pragmatic reasons but it can be justified as follows:

Fig. 5 shows the relationship between an FDS and a CR-TS on the interval $t \in [t_0, t_3)$, where both $t_0$ and $t_3$ are regeneration points. Obviously, the mean time between $t_0$ and $t_3$ is the MTBF. We split $[t_0, t_3)$ into three intervals $[t_0, t_1)$, $[t_1, t_2)$, and $[t_2, t_3)$:

• $t_1$ is the time when the FDS detects the transition of the CR-TS from the Crash state to the Alive state.
• $t_2$ is the time when the service crashes. Note that the period $[t_1, t_2)$ is without failures.

Fig. 5. The analysis of the FDS based on the NFD-S algorithm in a crash-recovery run.

Additionally, we define the following times:

• $\sigma_s$ is the first liveness message sending time after a recovery;
• $\sigma_f$ is the sending time of the last liveness message before a crash;
• $\sigma_i$ is the sending time of a liveness message between $\sigma_s$ and $\sigma_f$;
• $\eta$ is the liveness message sending interval;
• $\tau_s$ is the first decision time after recovery (footnote 2);
• $\tau_i$ is the time of the $i$th freshness point corresponding to $\sigma_i$;
• $\tau_b$ is the last freshness point (footnote 3) before a crash; and
• $\tau_f$ is the freshness point corresponding to $\sigma_f$.

Footnote 2. The actual arrival time of the first received valid liveness message.
Footnote 3. The expected arrival time of the liveness message.

Let time-out be the threshold waiting time for the expected arrival of the liveness message before suspecting the CR-TS (time-out $= \tau_i - \sigma_i$ in Fig. 5). Let $t_r^m$ ($m \ge 1$) be a recovery time of the current MTBF period (see Fig. 5). Then in our model, the key thing for the QoS bounds analysis is to derive the average number of mistakes that will happen between the $m$th and $(m+1)$th recovery times, and the average duration of each mistake. We make the following definitions as extensions of Definition 1 in [5]:

Definition 2. For the fail-free duration $[t_1, t_2)$ within each MTBF period:

1. $k$: for any $i \ge 1$, let $k$ be the smallest integer such that for all $j \ge i + k$, $m_j$ is sent at or after time $\tau_i$, where $m_j$ is the $j$th heartbeat message (footnote 4).
2. For any $i \ge 1$, let $p_j^i(x)$ be the probability that the FDS does not receive the $(i+j)$th message $m_{i+j}$ by time $\tau_i + x$, for every $j \ge 0$ and every $x \ge 0$; let $p_0^i = p_0^i(0)$.
3. For any $i \ge 2$, let $q_0^i$ be the probability that the FDS receives message $m_{i-1}$ before time $\tau_i$.
4. For any $i \ge 1$, let $u_i(x)$ be the probability that the FDS suspects the CR-TS at time $\tau_i + x$, for every $x \in [0, \eta)$.
5. $p_s^i$: for any $i \ge 2$, let $p_s^i$ be the probability that an S-transition occurs at time $\tau_i$.

Footnote 4. $k$ is assumed to be approximately independent of $i$. In fact, in a crash-recovery run, $k$ is not completely independent of $i$. However, if the CR-TS will remain alive for a reasonable duration, $k$ will be almost independent of $i$ except for the last few messages before the CR-TS crashes.

According to the QoS analysis of the NFD-S algorithm in Proposition 3 in [5], we now analyze the basic QoS metrics of the FDS based on the NFD-S algorithm in a crash-recovery run and show that the following relations hold:

Proposition 1.

1. $k = \lceil \text{time-out} / \eta \rceil$.
2. For all $j \ge 0$ and for all $x \ge 0$,
   $$p_j^i(x) = \bigl(p_L + (1 - p_L) \cdot \Pr(D > \text{time-out} + x - j\eta)\bigr) \cdot \Pr\bigl(X_a > \tau_i - t_r^m + x\bigr).$$
3. $q_0^i = (1 - p_L) \cdot \Pr(D < \text{time-out} + \eta) \cdot \Pr\bigl(X_a > \tau_{i-1} - t_r^m\bigr)$.
4. For all $x \in [0, \eta)$, $u_i(x) = \prod_{j=0}^{k} p_j^i(x)$.
5. $p_s^i = q_0^i \cdot u_i(0)$.

In Proposition 1, the bounds of each QoS metric are derived based on the analysis of the average number of possible mistakes within the distinct intervals $[t_0, t_1)$, $[t_1, t_2)$, and $[t_2, t_3)$. In consequence, the following theorem holds and can be used to estimate the FDS's parameters or QoS bounds within a crash-recovery run:

Theorem 1. The crash-recovery FDS based on the NFD-S algorithm has the following properties:
  • 7. MA ET AL.: ON THE QUALITY OF SERVICE OF CRASH-RECOVERY FAILURE DETECTORS 277 MT BF ! EðTMR Þ MT BF ð1Þ ! ÀÄ MT T F ÀEðT Å Á Æ Ç : DR Þ þ 1 Á pi þ EðDÞ þ 2 s If Xc þ time-out, then MT BF ! EðTMR Þ 2 MT BF ð2Þ ! ÀÄ MT T F ÀEðT Å Á Æ Ç ; DR Þ þ 1 Á pi þ EðDÞ þ 2 s R EðTD Þ þ EðTDR Þ þ MT T F ÀEðTDR Þ Á 0 ui ðxÞdx PA ! 1 À ; ð3Þ MT BF Fig. 6. The extended FDS configuration based on the NFD-S algorithm R in a crash-recovery run. EðTDR Þ þ MT T F ÀEðTDR Þ Á 0 ui ðxÞdx þ EðTD Þ EðTM Þ ÀÄ MT T F ÀEðTDR Þ Å Á ; ð4Þ þ 1 Á pi þ 1 needed to ensure that the NFD-S algorithm is still valid after s each recovery. However, without persistent storage to snapshot the runtime information frequently, when a crash EðTDR Þ ¼ EðDÞ þ Á EðXL Þ; ð5Þ failure occurs, all of the current runtime information might be lost. Thus, continuously increasing the heartbeat se- EðRDF Þ ! P rðXc þ time-outÞ: ð6Þ quence number cannot be guaranteed. Since the NFD-S algorithm assumes that the local clocks of Details of the proof of the theorem can be found in [29] and the FDS and the CR-TS are synchronized, we can compare Appendix C.2. the sending times of heartbeat messages instead of the When the monitored target is fail-free or crash-stop,5 for heartbeat sequence numbers in the algorithm. Then, for a the basic QoS metrics in [5], applying (1)-(4) of Theorem 1, crash-recovery FDS, if the QoS requirements of the FDS are we can easily deduce that given, the configuration procedure is illustrated in Fig. 6. Initially, we can assume that the QoS of message EðTMR Þ ! ; ð7Þ communication is perfect (e.g., pL ¼ 0, EðDÞ is small and pi s EðXL Þ ¼ 0), and the CR-TS is fail-free. As the monitoring Z procedure continues, the estimation of the QoS of message 1 EðTM Þ Á ui ðxÞdx i ; ð8Þ communication and the dependability metrics of the CR-TS pi s 0 q0 will become more accurate. Thus, the FDS will be reconfi- Z gured to adapt to changing input parameters, which help 1 better estimate and time-out. PA ! 
1 À Á ui ðxÞdx: ð9Þ 0 Then for given QoS requirements, expressed as bounds, Thus, EðTMR Þ, EðTM Þ, and PA are exactly reduced to the the following inequalities need to be satisfied where a QoS analysis results in [5] (see Appendix C.4 for the details superscript U denotes an upper bound and a superscript L of the proof scratch). We can conclude that in terms of failure denotes a lower bound: detection, a fail-free run or a crash-stop run with MTTF U L L TD TD ; EðTMR Þ ! TMR ; PA ! PA ; tending to infinity is a particular case of a crash-recovery run. ð10Þ U U If the monitored target’s MTTF is not sufficiently long and EðTM Þ TM ; EðTDR Þ TDR ; EðRDF Þ ! RL : DF the target is recoverable, then the impact of its dependability From Theorem 1, we can estimate the parameters ( and must also be taken into consideration. In the following time-out) of the NFD-S algorithm according to the following section, we will introduce how to configure the crash-recovery FDS according to the QoS bounds we have derived from inequalities: Theorem 1. þ time-out U TD ; 0; ð11Þ 3.5 The Configuration of the Crash-Recovery FDS Based on the NFD-S Algorithm MTBF L ÀÄ MTTFÀEðTDR Þ Å Á Æ Ç ! TMR ; ð12Þ For crash failure detectors, it is crucial to select some þ 1 Á pi þ EðDÞ þ 2 s suitable input parameters (such as the liveness message intersending interval and the time-out duration) to satisfy a R given set of QoS requirements. In this section, we will show EðTD Þ þ EðTDR Þ þ MTTFÀEðTDR Þ Á 0 ui ðxÞdx L how to achieve such steps in a crash-recovery run based on 1À ! PA ; ð13Þ MTBF the NFD-S algorithm. In a crash-recovery run, an assumption that the sequence numbers of the heartbeat messages are R EðTDR Þ þ MTTFÀEðTDR Þ Á ui ðxÞdx þ EðTD Þ continually increasing after every recovery of the CR-TS is ÀÄ MTTFÀEðTDR Þ Å 0 Á U TM ; ð14Þ þ 1 Á pi þ 1 s 5. The precrash duration of the crash-stop process is a long run.
\[ E(D) + \eta\,E(X_L) \le T_{DR}^U; \tag{15} \]

\[ \Pr(X_c > \eta + \text{time-out}) \ge R_{DF}^L. \tag{16} \]

Then, the task of the NFD-S algorithm is to find the largest $\eta$ satisfying inequalities (12)-(15) and, if such $\eta$ exists, find the largest time-out that satisfies $\eta + \text{time-out} \le T_D^U$ and $\Pr(X_c > \eta + \text{time-out}) \ge R_{DF}^L$. This can be done in the following steps:

Step I. If $T_{MR}^L \le \mathrm{MTBF}$, continue; else the QoS of the FDS cannot be achieved.

Step II. Find the largest $\eta$ that satisfies the inequalities (12)-(15); otherwise cannot find an appropriate $\eta$ (QoS cannot be achieved).

Step III. If $\eta > 0$, find the largest time-out $\le T_D^U - \eta$ and $\Pr(X_c > \eta + \text{time-out}) \ge R_{DF}^L$.

From the above steps, the estimation of $\eta$ and time-out for a crash-recovery FDS based on the NFD-S algorithm amounts to finding a numerical solution for the inequalities (11)-(16). This can be done using binary search similarly to the approach outlined in [5]. But the estimation of the input parameters of the configuration becomes more difficult because parameters, such as $E(X_L)$, MTTF, MTTR, etc., are needed. How to estimate these parameters will be discussed in Section 4.

Note that for this configuration procedure, choosing a different message transmission protocol (e.g., TCP and UDP) can imply different QoS for message communication. Thus, this new configuration can be more adaptive to the message transmission characteristics. For example, if the message loss probability or message delay is high for a certain protocol, then the FDS can switch to a more reliable protocol to achieve a better QoS without increasing the communication frequency or the time-out length.

In the next section, we will discuss how to estimate the QoS of message transmission and the dependability metrics of the CR-TS.

4 PARAMETER ESTIMATION

In the previous section, we explained how to configure a crash-recovery FDS. However, for this procedure, several input parameters are needed (see Fig. 6). In this section, we will show how to estimate these input parameters for an FDS configuration.

4.1 Dependability Metrics Estimation for the CR-TS

From the CR-TS modeling in Section 2, we see that there is an intimate relationship between the MTTF, MTTR, and MTBF and the QoS of the FDS. In order to estimate these dependability metrics, we only need to estimate the crash and recovery time of the CR-TS. We assume that the clocks between the FDS and the CR-TS are synchronized. Let $t_r^1$ be the CR-TS's first start time, then for $m \ge 1$, $t_r^m$ represents the mth recovery time; $t_{dr}^m$ represents the mth recovery detection time; $t_c^m$ represents the mth crash time; and $t_d^m$ represents the mth crash detection time (see Fig. 7). $t_r^m$ can be saved to the persistent storage by the CR-TS after a recovery has completed (see [29]). $t_d^m$ can be recorded by the FDS when a failure is detected. $E(T_D)$ can be estimated by using $\frac{1}{n}\sum_{m=1}^{n}(t_d^m - t_c^m)$ when $t_c^m$ is known. Actually, $t_c^m$ can be estimated by saving the latest successful message sending time $\sigma_l$ in the persistent storage. If a crash event happens uniformly distributed on $[\sigma_l, \sigma_l + \eta)$, then after a recovery has completed, the average $t_c^m$ can be estimated by $t_c^m = \sigma_l + \frac{\eta}{2}$. Notice that a smaller message intersending time ($\eta$) can result in a more accurate $t_c^m$ estimate. Then, the CR-TS's MTBF, MTTF, MTTR, and the probability that the CR-TS has not crashed up to time $\tau_i + x$ since its last recovery, $\Pr(X_a > \tau_i + x - t_r^m)$, can be estimated as follows:

Fig. 7. Dependability metrics estimation.

Estimate MTBF. From the definition of MTBF, we know that MTBF is only related to the CR-TS's recovery times $t_r^m$. These recovery times can be obtained by adopting the recovery time estimation methods proposed in [29]. Thus, MTBF can be estimated as below:

\[ \mathrm{MTBF} = E\left(t_r^{m+1} - t_r^m\right) = \frac{1}{n}\sum_{m=1}^{n}\left(t_r^{m+1} - t_r^m\right). \tag{17} \]

Estimate MTTF. MTTF can be estimated by using the recovery time ($t_r^m$) and the crash detection time ($t_d^m$) as $E(t_d^m - t_r^m) = \mathrm{MTTF} + E(T_D)$. Then,

\[ \mathrm{MTTF} = E\left(t_d^m - t_r^m\right) - E(T_D) = \frac{1}{n}\sum_{m=1}^{n}\left(t_d^m - t_r^m\right) - E(T_D). \tag{18} \]

Estimate MTTR. MTTR can be estimated by using MTBF and MTTF directly, as $\mathrm{MTTR} = \mathrm{MTBF} - \mathrm{MTTF}$, or by using $t_r^{m+1}$ and $t_d^m$. Hence, the MTTR can be estimated as $E(t_r^{m+1} - t_d^m) = \mathrm{MTTR} - E(T_D)$. Then,

\[ \mathrm{MTTR} = E\left(t_r^{m+1} - t_d^m\right) + E(T_D) = \frac{1}{n}\sum_{m=1}^{n}\left(t_r^{m+1} - t_d^m\right) + E(T_D). \tag{19} \]

Estimate $\Pr(X_a > \tau_i + x - t_r^m)$. When the probability density function $f_a(x)$ or the probability distribution function $F_a(x)$ of $X_a$ is known, the probability that the CR-TS does not crash until $\tau_i + x$ after its last recovery can be estimated as

\[ \Pr\left(X_a > \tau_i + x - t_r^m\right) = 1 - \int_0^{\tau_i + x - t_r^m} f_a(x)\,dx = 1 - F_a(x)\Big|_0^{\tau_i + x - t_r^m}. \tag{20} \]

When $x = 0$, we obtain that

\[ \Pr\left(X_a > \tau_i - t_r^m\right) = 1 - \int_0^{\tau_i - t_r^m} f_a(x)\,dx = 1 - F_a(x)\Big|_0^{\tau_i - t_r^m}. \tag{21} \]
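The estimators in (17)-(19) reduce to averages over recorded timestamps, which the following minimal sketch illustrates. It is not part of the original paper: the function and parameter names are hypothetical, the timestamps are assumed to come from synchronized clocks as stated in this section, and the lists are assumed to be aligned so that `t_r` holds n + 1 recovery times while `t_d` and `t_c` hold n crash detection and crash times.

```python
from statistics import mean


def estimate_crash_time(last_send_time, eta):
    """Estimate a crash time as the midpoint of [last_send_time, last_send_time + eta).

    Assumes the crash is uniformly distributed over one intersending interval.
    """
    return last_send_time + eta / 2.0


def estimate_dependability(t_r, t_d, t_c):
    """Return (MTBF, MTTF, MTTR, E(T_D)) from recorded timestamps.

    t_r: recovery times, n + 1 entries (the first entry is the first start time)
    t_d: crash detection times, n entries
    t_c: crash times (known or estimated), n entries
    """
    n = len(t_d)
    if n == 0 or len(t_r) != n + 1 or len(t_c) != n:
        raise ValueError("need n detection times, n crash times, n + 1 recovery times")

    # Average detection time: mean of (detection time - crash time).
    e_td = mean(d - c for d, c in zip(t_d, t_c))
    # Eq. (17): mean gap between consecutive recoveries.
    mtbf = mean(t_r[m + 1] - t_r[m] for m in range(n))
    # Eq. (18): mean of (detection time - preceding recovery), minus E(T_D).
    mttf = mean(t_d[m] - t_r[m] for m in range(n)) - e_td
    # Eq. (19): mean of (next recovery - detection time), plus E(T_D).
    mttr = mean(t_r[m + 1] - t_d[m] for m in range(n)) + e_td
    return mtbf, mttf, mttr, e_td
```

With these estimators, MTTF + MTTR equals MTBF by construction, because the $E(T_D)$ terms cancel, which matches the relation MTTR = MTBF - MTTF used above.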
When the probability density function $f_a(x)$ and the probability distribution function $F_a(x)$ of $X_a$ are unknown, an empirical distribution function (EDF) estimation method can be adopted to estimate $f_a(x)$ or $F_a(x)$. In addition, $\Pr(X_a > \tau_i + x - t_r^m)$ is used to estimate the probability that an S-transition happens on $[t_1, t_2)$ (see Proposition 1), which is used to count the average number of mistakes in that period. If we maximize $\Pr(X_a > \tau_i + x - t_r^m)$, then a maximum average number of mistakes on $[t_1, t_2)$ will be obtained. Therefore, we will get stricter QoS bound estimates for $P_A$, $T_M$, and $T_{MR}$. Thus, we can adopt $i = 1$ and $x = 0$ to simplify the estimation of $\Pr(X_a > \tau_i + x - t_r^m)$. Notice that the above method is only for the strict bound estimation rather than an optimized estimation.

4.2 Message Loss Length Estimation

As discussed earlier, the parameters related to message transmission are the average message delay ($E(D)$), the probability of message loss ($p_L$), and the consecutive message loss number $X_L$ (see Fig. 6). Since $p_L$ and $E(D)$ estimation can be done very easily and have been introduced in many other papers such as [5], we do not discuss them here. The additional parameter $X_L$ is also used and captures the bursty message loss behavior. In this section, we propose a basic estimation method for $X_L$, assuming independent message transmissions.

Lemma 3. If each message's transmission and loss behavior is independent, then the mean number of consecutive message losses is

\[ E(X_L) = \frac{p_L\left(1 - p_L^M\right)}{1 - p_L} - M p_L^{M+1}, \]

where $M$ is the maximum number of consecutive messages lost and $p_L$ is the probability that each message is lost during the transmission.

The proof can be found in [29].

Remark 1. When $M \to +\infty$ and $0 < p_L < 1$, then $p_L^M \to 0$ and $M p_L^M \to 0$, and we obtain that

\[ E(X_L) = \frac{p_L}{1 - p_L}. \]

From the above lemma, we see that if each liveness message's transmission is independent, $E(X_L)$ depends only on $p_L$ and can be computed straightforwardly.

4.3 The Impact of Service Dependability Metrics on the QoS of the FDS

A thorough analysis of the impact of the service dependability metrics on the QoS of the FDS has been presented in [16]. Here, we only highlight the main observations.

4.3.1 The Impact on $T_M$ and $T_D$

Generally, for an FDS, the time-out length governs the failure detection speed because the FDS makes its decision at the time-out points. As the time-out length decreases, the FDS will make faster, but less accurate, decisions. As time-out increases, $T_D$ slows down but the FDS can tolerate more message delays or losses, which can improve the detection accuracy to some extent. For a CR-TS, continually increasing the time-out length may mean that failures become undetectable, because its recovery duration could be shorter than $T_D$. Thus, $E(T_M)$ will not increase more than the recovery duration, MTTR.6

6. Assuming that $p_L$ and $D$ are not very large and $\mathrm{MTTR} \gg \eta$.

4.3.2 The Impact on $T_{MR}$

For a fail-free run, Chen et al. showed that when time-out length increases linearly, $T_{MR}$ increases exponentially (Fig. 12 in [5]). This implies that for such systems, an arbitrary level of $T_{MR}$ can be achieved. Roughly speaking, in a fail-free run, when time-out increases to $n\eta$ ($n \in Z^+$ and $n \ge 1$), the FDS can tolerate around $n$ consecutive communication message losses. The recurrence of mistakes caused by message latency or loss then decreases rapidly, with the mistake recurrence time growing as $\frac{1}{P^n}$, where

\[ P = p_L + (1 - p_L) \cdot \Pr(\text{time-out} < \text{Delay} < +\infty). \]

For a crash-recovery run, mistakes may occur on both crash and recovery (see Fig. 3b) since message transmission latency will delay the detection of the CR-TS's state change. These mistakes are inevitable. This means that the upper bound on $T_{MR}$ is governed by MTTF and MTTR (see inequalities (1)-(2) in Theorem 1). Even if all message delays and losses can be tolerated, $E(T_{MR})$ cannot increase to an arbitrary level when MTTF is not $+\infty$ and MTTR is not $+\infty$ or 0. If failure is detectable, $E(T_{MR})$ cannot exceed $\frac{\mathrm{MTBF}}{2}$ since for each MTBF duration, there will be at least two mistakes, corresponding to the two changes of state in the CR-TS. When failure is undetectable, mistakes may happen at the CR-TS's crash or recovery time. Then, $E(T_{MR})$ cannot exceed MTBF. Thus, after $E(T_{MR})$ reaches $\frac{\mathrm{MTBF}}{2}$, the overall $E(T_{MR})$ approaches MTBF gradually.

4.3.3 The Impact on $P_A$

$P_A$, the proportion of time that the FDS is not in a mistake state, will depend on the ratio of $E(T_M)$ and $E(T_{MR})$ ($P_A = 1 - \frac{E(T_M)}{E(T_{MR})}$ in [5]). If a service is fail-free, $P_A$ can rapidly approach 1. But in a crash-recovery run, when the time-out length is increased, both $E(T_M)$ and $E(T_{MR})$ will eventually reach their upper bounds, meaning that $P_A$ will also be bounded. Generally, as time-out increases, fewer failures will be detected and the mistakes caused by failures (see $T_M^3$ in Fig. 4c) will have more impact on $E(T_M)$; thus, $E(T_M)$ will approach MTTR, since the maximum length of $E(T_M^3)$ is MTTR. As the time-out length becomes larger with respect to MTTR, more failures become undetectable. Thus, $E(T_M)$ will gradually approach MTTR.

The speed of increase of $T_{MR}$ will depend on when $T_{MR}$ reaches $\frac{\mathrm{MTBF}}{2}$. Before this bound is reached, as the time-out length increases, $T_{MR}$ can increase exponentially fast, as more message losses can be tolerated. After $T_{MR}$ exceeds $\frac{\mathrm{MTBF}}{2}$, it can only increase gradually to MTBF, as time-out increases and more and more crashes become undetectable. Thus, when $T_{MR}$ reaches its upper bound but $T_M$ has not yet reached its upper bound, $P_A$ will decrease as time-out length increases. When both $T_M$ and $T_{MR}$ reach their upper bound, $P_A$ will approach $\frac{\mathrm{MTTF}}{\mathrm{MTBF}}$, which is equal to the availability of the CR-TS.

5 SIMULATION EVALUATION AND ANALYSIS

In previous sections, we have shown how to calculate the parameters of the FDS with a given set of QoS requirements and analyzed the QoS bounds of the crash-recovery FDS based on the NFD-S algorithm. In this section, we introduce
our analytical and simulation results, which verify our previous analysis work.

5.1 Evaluation of the Crash-Recovery FDS Based on the NFD-S Algorithm

For the simulation studies, we fix the heartbeat interval at $\eta = 1$ and gradually increase the time-out length. The message transmission parameters are $p_L = 0.01$ and $E(D) = 0.02$, and the delay is assumed to be exponentially distributed. These settings are similar to those used in the simulations in [5].

The CR-TS is defined as a recoverable process with various values of MTTF and MTTR (exponentially distributed). We choose the exponential distribution for the following reasons. First, exponential failures are widely adopted for reliability analysis in many practical systems; second, unlike some heavy-tailed distributions such as the log-normal distribution, crash and recovery with an exponential distribution will occur with reasonable interarrival times, avoiding the CR-TS behaving like a fail-free or crash-stop process.

5.1.1 Analysis for the Basic QoS Metrics

We implemented the NFD-S algorithm presented in [5] to evaluate the QoS of the FDS and compared the results with the analytical results derived from Theorem 1. Figs. 8, 9, and 10 compare the QoS of the FDS based on the NFD-S algorithm (simulation results) and the corresponding analytical results from different perspectives. From these three figures, we have the following observations.

Fig. 8. The NFD-S algorithm: $E(T_M)$.

Fig. 9. The NFD-S algorithm: $E(T_{MR})$.

Fig. 8 presents the $E(T_M)$ of the FDS derived from simulation and analytical results for two values of MTTR, 5 and 50, with corresponding values of MTTF, 100 and 1,000. The simulation result for MTTR = 5 shows that as the time-out length increases, $E(T_M)$ will tend to MTTR, i.e., $E(T_M)$ is bounded by MTTR. With the exponentially distributed MTTR used in the simulation, the proportion of the detectable crashes will decrease more gradually. Thus, $E(T_M)$ approaches MTTR more slowly than in the analytical results. Simulation results for MTTR = 50 confirm that if MTTR becomes large, as the time-out length increases, $E(T_M)$ can also grow large, since the bound is now large. Note that in the graph, we see only the linear part rather than the complete characteristics. If the time-out length was increased to 200, $E(T_M)$ would approach MTTR = 50 closely.

An interesting phenomenon is visible in the graph as time-out increases from 0.5 to 1.1: $E(T_M)$ decreases (or increases more slowly), and then, increases again. We analyze this phenomenon in detail as follows: Recall that for a given length of time-out, there are four aspects which have impact on $T_M$: the message delay and loss, and the CR-TS's crash and recovery (see Fig. 4). $T_M$ caused by a message delay is governed by the ratio between $E(D)$ and $T_D$. For the same $E(D)$, as time-out increases, more delayed messages can be tolerated. Thus, $T_M$ caused by a message delay ($T_M^1$) will decrease and occur less frequently. $T_M$ caused by a message loss ($T_M^2$) is related to $\eta$, $p_L$, $E(D)$, and the time-out length. For constant message communication QoS (i.e., fixed $p_L$ and $E(D)$), $T_M$ caused by message loss is governed by the ratio between $\eta$ and $T_D$. Since more message losses can be tolerated as the time-out length increases, the average duration of $T_M^2$ will decrease, and $T_M^2$ will occur less frequently. $T_M$ caused by a crash ($T_M^3$) is mainly governed by $T_D$ (see Fig. 4c), because if a crash occurs, a false positive mistake will last until the time-out time or until the CR-TS recovers. For detectable crashes, as the time-out length increases, $T_M^3$ will increase. $T_M$ caused by a recovery ($T_M^4$) is mainly governed by $p_L$ and $E(D)$ (see Fig. 4d), since after the CR-TS's recovery, a recovery can be detected as soon as a valid liveness message is received.

From the above analysis, we know that for the same $\eta$, $p_L$, $E(D)$, MTTF, and MTTR, when the time-out length increases, the average mistake duration caused by message delays and message losses will decrease ($T_M^1 \downarrow$ and $T_M^2 \downarrow$), the average mistake duration caused by the CR-TS's crash will increase ($T_M^3 \uparrow$), and the average mistake caused by the CR-TS's recovery from a detectable crash is unaffected ($T_M^4$) but fewer crashes and recoveries will be detected. In the simulation with $p_L = 0.01$ and MTBF = 105, when time-out is small, $T_M^2$ and $T_M^3$ occur with similar frequency. When time-out increases from 0.5 to 1.0 (the FDS can tolerate zero message loss and most message delays), $E(T_M)$ increases slowly because the impacts of $T_M^1 \downarrow$, $T_M^2 \downarrow$, $T_M^3 \uparrow$, and $T_M^4$ counterbalance. Overall, $E(T_M)$ is stable within this period. As the time-out length increases, $T_M^2$ will occur less frequently. But $T_M^3$ occurs every MTBF period. Thus, as
the time-out increases, $T_M^3$ will dominate and $E(T_M)$ will increase gradually.

In the simulation with $p_L = 0.01$ and MTBF = 1,050, when the time-out length is small, $T_M^2$ will have more impact than $T_M^3$, because $T_M^2$ occurs more frequently than the crash and recovery. Therefore, as the time-out length increases, the average duration of $T_M^2$ decreases and $T_M^2$ occurs less frequently; $E(T_M)$ will increase more slowly or even decrease since more message losses are tolerated. But if time-out continues to increase, $T_M^3$ will become dominant and $E(T_M)$ will then increase gradually.

Overall, Fig. 8 shows that in a crash-recovery run, $E(T_M)$ exhibits quite different characteristics from a fail-free or crash-stop run. If the message delay and the probability of message loss are not very large, $E(T_M)$ is bounded by MTTR. From Fig. 8, we also observe that $E(T_M)$ can possibly be decreased for some time-out values. Unlike in a fail-free run, continually increasing the time-out length cannot achieve a better $E(T_M)$.

Fig. 9 presents $E(T_{MR})$ of the FDS derived analytically and from simulation with exponential MTTF and MTTR as above. We can see that with constant time-out length, as MTBF increases, $E(T_{MR})$ also increases. This implies that $E(T_{MR})$ is greatly impacted by the dependability of the CR-TS.

We can also see that for both these simulation cases, $E(T_{MR})$ initially increases exponentially fast but after $E(T_{MR})$ reaches $\frac{\mathrm{MTBF}}{2}$, the rate of increase is reduced. For the CR-TS with exponential MTTR, $E(T_{MR})$ will increase gradually and approach MTBF, until all crashes become undetectable. This is because for nondeterministic MTTR, as the time-out length increases, the proportion of the detectable crashes decreases. Therefore, for the detectable crashes, $T_{MR} \le \frac{\mathrm{MTBF}}{2}$, and for the undetectable crashes, $T_{MR} \le \mathrm{MTBF}$. Thus, $E(T_{MR})$ will increase gradually between $[\frac{\mathrm{MTBF}}{2}, \mathrm{MTBF}]$, and finally, stabilize at MTBF. All of these results match our analysis in Section 4.3 well and indicate that if a CR-TS is not fail-free ($\mathrm{MTTF} \to \infty$) or crash-stop ($\mathrm{MTTR} \to \infty$), $E(T_{MR})$ will be bounded by MTBF when failures are undetectable and by $\frac{\mathrm{MTBF}}{2}$ when failures are detectable.

Fig. 10. The NFD-S algorithms: $P_A$.

Fig. 10 considers $P_A$ under the same communication QoS. We see that when MTBF increases, $P_A$ will be improved. This is because $E(T_{MR})$ also increases. Thus, from the equation $P_A = 1 - \frac{E(T_M)}{E(T_{MR})}$, we know that for the same time-out length, when MTBF increases, a better $P_A$ can be achieved.

However, from Fig. 10, we can also see that as the time-out length increases, $P_A$ is not always increasing as in a fail-free or crash-stop run. Continually increasing time-out could decrease $P_A$. This is because $T_{MR}$ is bounded by MTBF or $\frac{\mathrm{MTBF}}{2}$ as discussed above. After $E(T_{MR})$ reaches $\frac{\mathrm{MTBF}}{2}$, it increases slowly rather than exponentially fast but $E(T_M)$ increases linearly and faster than $E(T_{MR})$. Thus, $P_A$ decreases, and finally, $P_A$ will approach $\frac{\mathrm{MTTF}}{\mathrm{MTBF}}$, which is equal to the availability of the CR-TS.

The above results indicate that for a highly available CR-TS, a reasonable QoS for the FDS can be achieved even if the FDS always trusts the CR-TS, when only the QoS metrics defined in [5] are considered. This is especially true for a highly available and highly consistent but not highly reliable CR-TS. However, the completeness property of the FDS will not be satisfied. Consequently, these simulation results demonstrate the necessity of the additional QoS metrics we proposed in Section 3.3 to measure the completeness aspects and the speed of the recovery detection of a crash-recovery FDS. Furthermore, these results also demonstrate the necessity of adopting the recovery detection protocols in [29], which can improve the proportion of detected failures without reducing other QoS aspects.

In Figs. 8, 9, and 10, we can also observe how the dependability of a CR-TS can influence the QoS of the FDS. Particularly, for a highly available but not highly reliable CR-TS, the dependability of the CR-TS can have more impact than the performance of the algorithm and the QoS of message transmission. In such situations, the dependability of the CR-TS must be taken into account for the FDS design and implementation.

From Figs. 8, 9, and 10, we can see that $P_A$, $E(T_{MR})$, and $E(T_M)$ have bounds. Continually increasing the time-out length might not be a reasonable way to achieve better $P_A$, $E(T_{MR})$, and $E(T_M)$. A potential trade-off exists between the QoS metrics. For instance, for the NFD-S algorithm, time-out $\in (1, 1.1)$ (time-out $+\ \eta \in [2, 2.1]$) might achieve the best overall QoS.

In addition, $E(T_M)$ in a crash-recovery run exhibits quite different characteristics compared with a fail-free or crash-stop run. This is because in a crash-recovery run, the mistakes caused by the crash and recovery are taken into consideration, which means continually increasing the time-out length will not always decrease $E(T_M)$. It may have the effect of increasing false positive mistakes ($T_M^3$, see Fig. 4). As the time-out length increases, mistakes caused by message delays and losses will occur less frequently, and false positive mistakes (which were not considered previously) will start to dominate the QoS of the FDS.

From Figs. 8, 9, and 10, we can observe that the simulation results of $E(T_M)$ are smaller than the analytical results, and the simulation results of $E(T_{MR})$ and $P_A$ are larger than the analytical results, which indicates that the bound analysis of the basic QoS metrics in Theorem 1 is valid and the simulation results satisfy the QoS requirements according to the analysis. We can also observe a gap between the analytical and simulation results. This is caused by the overestimation or underestimation of some values within the analytical results. $E(T_M)$ is overestimated
by using the total mistake duration over the underestimated average number of mistakes that might occur within a crash-recovery period. Thus, the analytical results of $E(T_M)$ will be larger than the simulation results. Similarly, $E(T_{MR})$ is underestimated by using the observation duration (MTBF) over an overestimation of the number of mistakes that might occur within a period. For instance, the number of mistakes within the period is estimated as $\lceil \frac{E(D)}{\eta} \rceil + 1$, which is an upper bound rather than the average number. It follows that $E(T_{MR})$ of the analytical results will be smaller than the simulation results. Finally, $P_A$ is underestimated by using one minus an overestimated total mistake duration over the observation period (MTBF). Thus, $P_A$ of the analytical results will be smaller than the simulation results. All of these results satisfy the QoS requirements $E(T_M) < T_M^U$, $P_A > P_A^L$, and $E(T_{MR}) > T_{MR}^L$. In addition, according to the NFD-S algorithm, the failure detection time $T_D$ is bounded by $\eta$ + time-out regardless of the correctness of the detection; thus, $T_D \le T_D^U$ must be satisfied.

From Figs. 8, 9, and 10, we can also see that there are some gaps between the analytical results and the simulation results. This is mainly caused by the overestimating and underestimating method we adopted to restrict the failure detector's QoS bound as discussed above. In addition, we use MTBF, MTTF, and MTTR, which are the expected values rather than the real values for each failure and recovery. In the simulation, the results are calculated according to the randomly generated failure time and recovery time, which represent the real time to failure and recovery, and these random variables will deviate from the expected values. Thus, there will be some discrepancies between the simulation and analytical results. These gaps show that there is still room to improve the accuracy of the model and it would be interesting to investigate this point further in the future.

5.1.2 Analysis for the Extended QoS Metrics

We also plot the simulation and analytical results for the failure detection proportion ($R_{DF}$) defined in Section 3.3 to demonstrate the impact of the failure and recovery events on this metric.

Fig. 11. The NFD-S algorithms: $E(R_{DF})$.

Fig. 11 shows the proportion of failures detected by the FDS, for different dependability characteristics of the CR-TS, based on both simulation and analytical results. As the time-out length increases, $E(R_{DF})$ of the NFD-S algorithm decreases. When MTTR becomes shorter, $E(R_{DF})$ will decrease faster. This is because the smaller MTTR is, the sooner time-out $+\ \eta$ crosses MTTR ($T_D^U > \mathrm{MTTR}$). Therefore, more crashes remain undetected when the NFD-S algorithm is adopted. In Fig. 11, we can also see that the simulation results of $E(R_{DF})$ are larger than the analytical results, which means that the bound analysis of $E(R_{DF})$ is valid and the simulation results satisfy the QoS requirements in terms of $R_{DF}^L$. However, since most existing failure detection algorithms increase the time-out length to tolerate more message losses and delays, if a CR-TS is recoverable and recovers fast, it could be difficult for these algorithms to achieve the QoS in [5] and satisfy the completeness property at the same time. In such a situation, the recovery detection protocol introduced in [29] can be adopted, which can solve this problem reasonably well.

6 CONCLUSION

In this paper, the crash-recovery target and its failure detector are modeled as stochastic processes. We redefined previously proposed QoS metrics to be applicable to crash-recovery failure detection and introduced some new metrics to measure the recovery detection speed and the completeness property of a failure detector. We also discussed the impact of the monitored target's crash-recovery behavior on each QoS metric and showed that if a failure detector's parameters are to be accurately estimated, these dependability characteristics must be taken into account. Thus, we showed how to configure the failure detector to satisfy a given set of requirements based on the dependability characteristics in addition to the QoS of message transmission (see Fig. 12). This was based on the NFD-S algorithm [5]. Our analysis shows that the QoS analysis in [5] is a particular case of a crash-recovery run. Furthermore, we discussed how to estimate the input parameters for the algorithm.

Fig. 12. The QoS relationship between communication, CR-TS, and FDS.

Finally, the plotted simulation and analytical results demonstrate that our QoS bound analysis is valid and can be used as an approximate solution for the computation of the failure detector's parameters or the QoS bounds estimation if the failure detector's parameters are given. Our simulation results confirm that when a failure detector is designed and implemented, the dependability of the crash-recovery target needs to be considered in order to achieve more accurate parameter estimation. Furthermore, if the recovery of the monitored target needs to be detected, further enhancement of the existing algorithms is needed.
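As a compact recap of the closed-form quantities used in the analysis, the following minimal sketch evaluates the mean consecutive-loss length of Lemma 3 and Remark 1 and the limiting values discussed in Section 4.3. It is an illustration rather than the authors' code: the function names are hypothetical, and MTBF is assumed to equal MTTF + MTTR, consistent with the relation MTTR = MTBF - MTTF used in Section 4.1.

```python
def expected_consecutive_losses(p_loss, max_losses=None):
    """Mean number of consecutive message losses under independent losses.

    With max_losses = M this evaluates Lemma 3; with max_losses = None it
    evaluates the limit M -> +infinity from Remark 1.
    """
    if not 0.0 <= p_loss < 1.0:
        raise ValueError("p_loss must lie in [0, 1)")
    if max_losses is None:
        return p_loss / (1.0 - p_loss)
    m = max_losses
    return p_loss * (1.0 - p_loss ** m) / (1.0 - p_loss) - m * p_loss ** (m + 1)


def limiting_bounds(mttf, mttr):
    """Limiting values from Section 4.3, assuming MTBF = MTTF + MTTR."""
    mtbf = mttf + mttr
    return {
        "mtbf": mtbf,
        # E(T_MR) cannot exceed MTBF / 2 while failures are detectable.
        "tmr_bound_detectable": mtbf / 2.0,
        # E(T_MR) cannot exceed MTBF once failures are undetectable.
        "tmr_bound_undetectable": mtbf,
        # E(T_M) is bounded by MTTR.
        "tm_bound": mttr,
        # P_A approaches the availability MTTF / MTBF.
        "pa_limit": mttf / mtbf,
    }
```

For the two simulated configurations (MTTF = 100 with MTTR = 5, and MTTF = 1,000 with MTTR = 50), `limiting_bounds` gives MTBF values of 105 and 1,050, matching the values quoted in Section 5.1.1, and the same limiting availability for both, since the MTTF-to-MTTR ratio is identical.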