FlexRay Fault-Tolerance:
Capabilities, weaknesses and proposed enhancements
Antonio Cappiello, Omar Jaradat
Mälardalen University, Västerås, Sweden, 03/2011
{aco10003, ojt10001}@student.mdh.se

Abstract

This paper gives an overview of FlexRay and summarizes its main
components, with adequate detail on how those components work. The
document focuses on FlexRay reliability and on why it is considered a
fault-tolerant protocol, and it discusses the protocol's capabilities and
weaknesses together with the authors' enhancement proposals. After reading
this paper, readers should have a good understanding of FlexRay and of how
well the protocol achieves reliability.

Introduction

FlexRay is a communication system considered one of the next generation of
bus protocols for automotive networks. It can be applied in any other
real-time distributed system environment, but the protocol is usually tied
to the automotive industry, simply because it was developed in 1999 by a
cooperation of leading companies in that industry, exclusively for
automotive use.

Software errors are considered one of the big challenges that seriously
affect software performance. Our mission is to show how FlexRay can be
considered a fault-tolerant system and how it can handle the failures and
errors that may happen at any given time, as well as to propose ideas that
could enhance the reliability of the FlexRay protocol.

In this paper we describe the FlexRay protocol and analyse the bus
controller structure and how the nodes communicate and interact within the
whole communication system. We begin with the protocol itself and then
turn to fault tolerance, explaining the weaknesses and the capabilities
based on two main points: the bus controller and the network topology. The
paper then proceeds with the current state of the work and ends with our
conclusions.

The FlexRay Protocol

The FlexRay protocol is a time-triggered protocol that offers options for
deterministic data that arrives in a predictable time frame. FlexRay has a
core with static frames and dynamic frames, with a communication cycle
that provides a predefined space for static and dynamic data, so nodes on
a FlexRay network must know how all the pieces of the network are
configured in order to communicate. Embedded networks differ from normal
PC networks: a FlexRay network simply has a closed configuration that
should not change once it is assembled in production. This means that
FlexRay does not need any additional mechanism to discover and configure
devices automatically at run time, unlike PC networks, which require these
procedures.

FlexRay manages multiple nodes with a Time Division Multiple Access (TDMA)
scheme. Every FlexRay node is synchronized to the same clock, and each
node waits for its turn to write on the bus. Because the timing is
consistent in a TDMA scheme, FlexRay is able to guarantee consistent
delivery of data to the nodes on the network, which provides many
advantages for systems that depend on up-to-date data between nodes.
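
The TDMA idea described for the FlexRay protocol can be illustrated with a
minimal sketch. It is not taken from the FlexRay specification: the class
name StaticSchedule, the node names and all timing values are hypothetical,
and the dynamic part of the cycle is reduced to a plain time window in
which no static slot owner exists. The sketch only shows how a closed,
design-time slot table plus a shared clock is enough for every node to
know whose turn it is.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticSchedule:
    """Closed, design-time slot assignment (all values are illustrative)."""
    slot_owners: tuple    # slot_owners[i] is the node allowed to send in slot i
    slot_len_us: int      # duration of one static slot, in microseconds
    dynamic_len_us: int   # time reserved after the static slots for dynamic data

    @property
    def cycle_len_us(self):
        # One communication cycle = all static slots + the dynamic window.
        return len(self.slot_owners) * self.slot_len_us + self.dynamic_len_us

    def owner_at(self, t_us):
        """Return the node that may write at global time t_us, or None
        when t_us falls in the dynamic window of the cycle."""
        offset = t_us % self.cycle_len_us
        slot = offset // self.slot_len_us
        if slot < len(self.slot_owners):
            return self.slot_owners[slot]
        return None


# Hypothetical three-node cluster: 3 static slots of 50 us, 100 us dynamic window.
schedule = StaticSchedule(slot_owners=("A", "B", "C"),
                          slot_len_us=50,
                          dynamic_len_us=100)

for t in (0, 60, 149, 150, 250, 310):
    print(t, schedule.owner_at(t))
```

Because the table is fixed and every node evaluates it against the same
global time, no arbitration is needed in the static part of the cycle; the
price is that the clocks must stay synchronized, which is the subject of
the clock synchronisation discussion later in the paper.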

Fault-Tolerance: Capabilities and Weaknesses

In this section of the document we point out the means adopted by the
FlexRay protocol in order to provide fault-tolerant communication.

We have identified two aspects of a FlexRay system involved in the
assurance of fault tolerance:
1. The bus controller.
2. The physical network architecture.

The bus controller

The bus controller consists of six components, as shown in Figure 1 [1].
Some of these in particular use a mechanism to protect the communication
from errors.

Figure 1

The Protocol Operation Control (POC), which is responsible for reacting to
host commands and for instructing the other components, also reacts to
error situations. For example, when an error occurs, the POC falls to the
normal passive state and tries to reintegrate, but when the error is fatal
the POC falls into the halt state and all operations are stopped. These
two states and the active state, in which the bus controller operates
normally, together constitute the so-called three-level error model
(Figure 2). This model provides a self-diagnostic mechanism for possible
errors.

Figure 2

The Frame and Symbol Processing (FSP), besides separating the payload from
the header of the received message, also provides status data to the host
regarding the frame reception, for example whether the received frame is
valid or invalid.

On the sender node, the Coding/Decoding Unit (CODEC) computes the CRC
checksum and appends it to the message that it has to encode and send on
the bus. On the receiver node, after decoding the received message, the
CODEC performs the CRC check in order to verify whether the message
integrity has been affected by electromagnetic noise on the bus, that is,
whether some bits have been flipped.

In addition, in a time-triggered real-time system such as FlexRay,
different nodes have to keep a consistent view of the global time even in
faulty situations, and the component responsible for this is the Clock
Synchronisation (CS). This component tries to improve the fault tolerance
of the protocol through two kinds of correction: the offset correction and
the rate correction. In particular, it is in the offset correction method
that the CS adopts a fault-tolerant midpoint algorithm in order to compute
an average over the time differences between the communication rounds. On
the basis of this computation, the
next message schedule is brought forward or delayed in such a way that all
nodes have almost the same time in the next cycle.

The FlexRay Consortium claims that, thanks to this algorithm, up to two
Byzantine faults (1) can be tolerated. When more than two of these faults
happen, the system can end up with different views of the global time, and
consequently another problem can affect the system: the clique problem.

(1) A Byzantine fault is typical of distributed systems and shows up as
the wrong behaviour of a node in the system, which consists in sending
arbitrary messages, including messages aimed at corrupting the system.
More details about this topic are provided in the Current State of Work
section.

A clique is a group of nodes connected to a network which can communicate
only inside the same group and not with the other nodes. FlexRay does not
provide any means to detect and resolve this kind of problem. The clique
problem in FlexRay has been well analysed in [2], where two kinds of
cliques have been identified:
1. Time domain cliques, which happen when subsets of nodes have different
   views of the global time, as described before.
2. Value domain cliques, which occur when a frame is correctly placed in a
   slot but contains a different cycle counter.

Moreover, in [2] it is said that “the FlexRay consortium is aware of the
potential clique problem” but also that “the cliques do not constitute a
noticeable risk in practice”, maybe because “there are no report published
on cliques observed in a practical setup”. For these reasons the authors
of that document show with experiments how to create cliques in a physical
FlexRay cluster and how to avoid or detect possible cliques.

Finally, when all the means adopted by the bus controller illustrated
above are not enough to prevent faulty behaviour, an additional component
can be inserted between the bus controller and the network, as shown in
Figure 1: the Bus Guardian (BG). In [3], four properties of the BG have
been identified and formally proved:

1. Correct Relay.
2. Validity.
3. Agreement.
4. Integrity.

These properties guarantee, on the one hand, that the communication
controller does not access the communication channel outside its
pre-allocated slots and, on the other hand, that messages coming from
non-faulty communication controllers are relayed correctly.

Summarising [4] [5] with regard to fault tolerance, we can state that the
FlexRay protocol:
- manages errors with a “never-give-up”-strategy thanks to the three-level
  error model explained above, because “stopping communication is a
  critical decision which must be made by the application whenever
  possible”;
- is able to handle both internal and external faults;
- does not adopt any strategy such as retransmission in case of a
  corrupted message; it is the responsibility of the host application to
  deal with these problems, because the strategy of the protocol is to
  “signal” the error;
- does not provide security, which is likewise the responsibility of the
  application;
- “requires application support for Byzantine faults (e.g. group
  membership)”.

The physical network architecture

FlexRay supports single and dual channel configurations, which consist of
one or two pairs of wires respectively. Most FlexRay nodes typically also
have power and ground wires available to power transceivers and
microprocessors.

FlexRay can be distinguished from other automotive protocols such as CAN
and LIN by its network layout, because it supports a very flexible network
topology: it has two channels that can be used in different ways. This
flexibility allows the protocol to scale its fault tolerance, and it plays
a big role in shaping the structure of a FlexRay system, so that redundant
and independent systems are possible.

There are three possible FlexRay topologies:

1. Passive Bus Topology: all nodes are connected to a bus; in the dual
   channel case a node can be connected to both channels or to only one of
   them (Figure 3.A).
2. Active Star Topology: the network is built as an active star that
   contains star couplers, and each node must be connected to one coupler
   (Figure 3.B).

3. Combination of the topologies: a combination of the passive bus and the
   active star is used (Figure 3.C).

It is very important for designers to choose carefully between these
topologies, because selecting the most suitable one can play a big role in
optimizing the cost, performance and reliability of their design.

Nodes on a FlexRay network must know how all the pieces of the network are
configured in order to communicate efficiently.

Figures 3.A, 3.B and 3.C show several possible topologies that can be
supported by FlexRay channels [1] [6].

Figure 3.A    Figure 3.B

Figure 3.C

The FlexRay communications bus is a deterministic, fault-tolerant and
high-speed bus system. Using two separate physical FlexRay communication
lines at 10 Mbps, it can implement doubly redundant fault-tolerant message
transmission, or alternatively the data throughput can be doubled. FlexRay
delivers the error tolerance and time-determinism performance required by
x-by-wire applications (i.e. drive-by-wire, steer-by-wire, brake-by-wire,
etc.) [7].

Most first-generation FlexRay networks use only a single channel in order
to keep the wiring cost down, but future networks will use dual channels
because of the big advantage they offer: the dual channel enhances fault
tolerance and increases the bandwidth.

FlexRay can redundantly transmit individual messages to provide an
additional layer of network reliability. In fact, FlexRay networks provide
scalable fault tolerance by allowing single or dual channel communication.
The dual channel is preferred in many cases; for example, in
safety-critical applications all devices connected to the bus may use both
channels for transferring data. However, it is always possible to connect
a single channel when redundancy is not needed, or to increase the
bandwidth by using both channels for transferring non-redundant data.

As a result, FlexRay can be used with single or dual channels, but since
the dual channel adds redundancy, using a dual channel topology instead of
a single channel one increases the fault tolerance accordingly [1].

Current State of Work

In this section of the document we describe the second step of our work,
which consists in collecting practical and theoretical research on the
enhancement of the FlexRay fault-tolerance capabilities.

Although FlexRay is still a new protocol in the automotive industry, many
works have been conducted by companies and researchers, on the one hand in
order to find out the true potential of the protocol and determine its
working features, and on the other hand with the purpose of improving its
reliability and effectiveness. Therefore, in collecting information we
decided to adopt a research strategy based on selecting the most reliable
work from international conferences, workshops and companies that are
leaders in
the field of embedded systems, such as the Real-Time Systems Symposium
(RTSS), the Euromicro Conference on Real-Time Systems (ECRTS), the
International Workshop on Automated Verification of Critical Systems
(AVoCS), the Real-Time and Embedded Technology and Applications Symposium
(RTAS), the IEEE Computer Society and many others.

Moreover, our research strategy focuses on selecting the works on the
reliability and fault-tolerance aspects of FlexRay that try to estimate
its capacities and propose concrete solutions to its weaknesses.

As a result of this research we describe the most interesting outcomes as
a kind of insight into the current state of work on FlexRay.

Before going into the individual results in depth, we note a common
motivation underlying each work: everyone agrees on the need to determine
precisely the true performance, predictability and reliability of the
protocol as a mandatory requirement for using FlexRay successfully in
safety-critical applications. This common view is due to the fact that
FlexRay is becoming the leader among distributed embedded systems targeted
at high-performance vehicles.

Several studies such as [8] and [9] compare the FlexRay protocol with the
protocols that are currently most popular in the automotive industry, such
as LIN, CAN, TTCAN and others, with the purpose of showing how the
flexibility and potential of FlexRay include all the benefits of the other
protocols. In addition, other works such as [16] show in practice how it
is possible to “migrate” from CAN to FlexRay, explaining the migration
requirements, parameter calculation, message analysis, payload
optimization and slot size definition. In the end, however, they indicate
that there is a big problem in optimizing a FlexRay cycle, namely
formalizing the static segment and dynamic segment parameters. The latter
is one of the most interesting aspects, on which many researchers have
spent their efforts.

For example, in [10] a technique to schedule messages on the FlexRay
static segment has been proposed in order to compensate for the protocol's
lack of handling of faulty messages caused by transient and intermittent
faults, which affect the reliability of the communication. The proposed
technique generates a schedule on the basis of the probability of failure
of the message, using a heuristic that is very close to the CLP
(Constraint Logic Programming) approach in terms of results but
computationally less expensive.

On message scheduling, a good contribution has been given by [11], where
the authors suggest different techniques to analyse the timing properties
of both the static and the dynamic segment of a FlexRay communication
cycle.

In more depth, for the timing properties of the static segment, an
algorithm that builds the static schedule has been proposed and analysed.
For the dynamic segment, several factors that can affect the worst-case
response time have been analysed with three different approaches: an
optimal (OO), a heuristic (HH) and a holistic (OH) solution. OO uses an
ILP formulation, HH treats the problem as a bin-covering problem, and OH
further reduces the time of HH by partially using an ILP formulation. All
the proposed analyses are supported by extensive experiments.

In another article [12], closely related to the previous one [11] and
written by almost the same authors, a further step toward an efficient use
of FlexRay is taken. While the first article bounds the message
transmission time on both the ST and DYN segments, the second one focuses
on finding the right bus configuration for a particular application in
order to meet all the time constraints.

This purpose is achieved by providing four techniques extensively tested
by the authors:

1. The Basic Bus Configuration (BBC), which results from analyzing the
   minimal bandwidth requirements of the application;

2. The OBC heuristic with curve fitting (OBCCF), which, instead of
   exhaustively performing the scheduling for all possible values of the
   DYN segment length, evaluates the response time for only some values
   and then extrapolates the response time for the other points with a
   curve fitting approach (this is based on the regularity of the
   dependence of the response time on the size of the DYN segment, noticed
   in several experiments and depicted in the following picture);

Figure 4

3. The OBC heuristic with an exhaustive exploration of the size of the DYN
   segment;

4. The Simulated Annealing (SA) based design space exploration, used to
   provide a baseline for the evaluation of the proposed heuristics.

The results of the experiments conducted by the authors can be summarised
by the following picture, taken from the same article:

Figure 5

As these studies have shown, designing a FlexRay schedule is a complex
operation, not only because it is necessary to guarantee the tight time
constraints and performance required by some automotive applications, but
also because, in order to increase reusability and reduce validation time,
the schedule must cope with the continuous and rapid changes in electronic
control features. This means elaborating a schedule that takes a certain
amount of uncertainty into account.

In [13] the info-gap technique has been presented with the purpose of
generating different schedules with a degree of robustness related to
different ranges of uncertainty. In more depth, the uncertainty analysed
is in the payloads of the messages, but the same approach can also be used
for uncertainty related to the dependency between tasks and messages, to
the period (rate of task execution or message transmission) and to the
topology (mapping of tasks to hosts and messages to channels).

So far we have discussed only the message scheduling problems in a system
that uses the FlexRay communication protocol, but there are many other
issues pointed out by other works that need particular attention. Most of
these are related, for example, to the Byzantine fault, which is very
common in distributed systems.

A Byzantine fault occurs when a faulty node corrupts its local state and
sends arbitrary messages. This problem can be addressed with a Byzantine
fault tolerance (BFT) technique, which masks a bounded number of Byzantine
faults, e.g. using state machine replication, or with a detection
technique, which equips each node with a detector in order to monitor
other nodes and isolate the nodes with possibly faulty behaviour. A formal
study of these techniques has been conducted in [14]. The outcome is that
the first technique is stronger than the second one, but an analysis of
the trade-off between them shows that:
- detection requires f+1 replicas vs. the 3f+1 of BFT in order to cope
  with f concurrent faults;
- detection systems need only be provisioned for the average load, while a
  BFT system must be provisioned for the peak load;
- detection is cheaper.

In addition to this analysis, in the same article the authors propose a
sketch of a system that implements a Byzantine fault detector providing
accountability, completeness and accuracy.

Against Byzantine faults, the FlexRay system can be equipped with an
additional module placed between the bus controller and the network, the
Bus Guardian. The functionality of this module has already been described
in the
previous section of the document, but the FlexRay specification does not
give any proof of its functionality. In this regard, in [15] four
properties have been identified and formally proved:
1. Correct Relay,
2. Validity,
3. Agreement,
4. Integrity.

Moreover, regarding Byzantine faults, the FlexRay specification claims
that up to two Byzantine faults can be tolerated thanks to the clock
synchronization algorithm, but this property also has to be proved, and
the author of the previous article ([15]) is currently working on this
problem as well.

Conclusion

The FlexRay communications bus is a deterministic, fault-tolerant and
high-speed bus system with high performance, and it has an increasingly
promising future in real-time distributed systems, especially in the
automotive industry. The dual channel topology offers enhanced fault
tolerance and increased bandwidth: it provides message redundancy through
duplicated transmission, which increases reliability, or the two channels
can be used only to increase the bandwidth, without making the messages
redundant. FlexRay has a good mechanism to handle errors (i.e. the
three-level error model), which provides a self-diagnostic mechanism for
possible errors.

References

[1] Introduction to FlexRay and TTA, Peter Bohm, November 21, 2005.

[2] An Investigation of the Clique Problem in FlexRay, P. Milbredt,
M. Horauer, A. Steininger, IEEE 2008.

[3] On the Formal Verification of the FlexRay Communication Protocol,
Bo Zhang, AVoCS 2006.

[4] Protocol Overview, C. Temple (Motorola), FlexRay International
Workshop, Detroit, 2003.

[5] The FlexRay Protocol, P. Koopman, Carnegie Mellon, 2010.

[6] Seminar FlexRay, Robert Rieb, Chemnitz University, 2009.

[7] FlexRay Automotive Communication Bus Overview, National Instruments
("NI").

[8] Comparision of FieldBus Systems CAN, TTCAN, FlexRay and LIN in
Passenger Vehicles, Steve C. Talbot, Shangping Ren, 29th IEEE
International Conference on Distributed Computing Systems Workshops,
Montreal, Quebec, Canada, June 22-26, 2009.

[9] In-Vehicle Networking, freescale.com.

[10] Scheduling for Fault-Tolerant Communication on the Static Segment of
FlexRay, Bogdan Tanasa, Unmesh D. Bordoloi, Petru Eles, Zebo Peng, 31st
IEEE Real-Time Systems Symposium, 2010.

[11] Timing Analysis of the FlexRay Communication Protocol, Traian Pop,
Paul Pop, Petru Eles, Zebo Peng, Alexandru Andrei, Real-Time Systems
Journal, Volume 39, Numbers 1-3, pp. 205-235, August 2008.

[12] Bus Access Optimisation for FlexRay-based Distributed Embedded
Systems, Traian Pop, Paul Pop, Petru Ion Eles, Zebo Peng, Design,
Automation, and Test in Europe Conference, DATE'07.

[13] Computing Robustness of FlexRay Schedules to Uncertainties in Design
Parameters, A. Ghosal, H. Zeng, Y. Ben-Haim, M. Di Natale, DATE '10, 2010.

[14] The Case for Byzantine Fault Detection, Andreas Haeberlen, Petr
Kouznetsov, Peter Druschel, HOTDEP'06, Proceedings of the 2nd Conference
on Hot Topics in System Dependability, Volume 2, 2006.

[15] On the Formal Verification of the FlexRay Communication Protocol,
Bo Zhang, Automatic Verification of Critical Systems (AVoCS), 2006,
pp. 184-189.

[16] Migration Framework from CAN to FlexRay, Richard Murphy, Frank Walsh,
Brendan Jackman, Automotive Control Group, Waterford Institute of
Technology, Cork Road, Waterford, Ireland.