Fault tolerant real-time scheduling
1. Quasi-static fault-tolerant scheduling schemes for
energy-efficient hard real-time systems
• Wei Tongquan, CS Department of East China Normal University, China
• Piyush Mishra, GE Global Research, Niskayuna, NY 12309, USA
• Kaijie Wu, ECE Department of University of Illinois, Chicago, IL 60607, USA
• Junlong Zhou, CS Department of East China Normal University, China
Journal of Systems and Software
2012
Reza Ramezani
2. A Unified Approach for Fault Tolerance and Dynamic
Power Management in Fixed-Priority Real-Time
Embedded Systems
• Ying Zhang
– a Senior Software engineer with the Research and Development
Department, Guidant Corporation, St. Paul, MN, USA
• Krishnendu Chakrabarty
– Department of Electrical and Computer Engineering, Duke University,
Durham, USA
Computer-Aided Design of Integrated Circuits and Systems,
IEEE Transactions on 25, no. 1 (2006): 111-125.
3. Overview
Primaries
Checkpointing & Response Time
Reliability, The best fault tolerance count?
Feasibility Analysis
Offline Application Level Voltage Scaling
Offline Task Level Voltage Scaling
Online DVS by Using Slacks
Previous Work (Ying Zhang, Krishnendu Chakrabarty, 2006)
Results
Suggestion
5. Features
• Fault Tolerance Scheduling
Transient Faults
Fast Detection
Fault occurrences at runtime, checkpointing and state restoration.
• Dynamic Voltage Scaling (DVS)
• Offline Scheduling
Application Level Voltage Scaling (A-DVS)
Task Level Voltage Scaling (T-DVS)
• Online Scheduling
Using Slacks
• Exact Rate-Monotonic Characterization
Used for feasibility analysis instead of iteratively deriving the
response time of each task.
6. Online DVS Outline
• The offline task schedules are adapted to the runtime
behavior of fault occurrences in three steps:
(1) Pre-compute and save in a lookup table the maximum slack
requirements for the processor to dynamically slow down.
(2) Retrieve the stored slack requirements and compare them with
the cumulative slack generated at runtime.
(3) Dynamically scale down the processor speed when the generated
slack is equal to or greater than the stored slack requirement.
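The three steps above can be sketched roughly as follows. This is only an illustration: the table contents, the normalized speed values, and the function name `select_speed` are assumptions made for the example, not the data structures used in the paper.

```python
# Hypothetical lookup table computed offline (step 1):
# normalized processor speed -> slack (ms) required before slowing to it.
SLACK_TABLE = {0.8: 2.0, 0.6: 5.0, 0.4: 9.0}


def select_speed(slack_table, cumulative_slack, current_speed=1.0):
    """Return the lowest speed whose stored slack requirement is met.

    Steps 2 and 3: compare each stored requirement with the slack
    generated so far and scale down only when the requirement is covered.
    If no entry qualifies, the current speed is kept.
    """
    candidates = [
        speed
        for speed, required in slack_table.items()
        if cumulative_slack >= required and speed < current_speed
    ]
    return min(candidates, default=current_speed)


if __name__ == "__main__":
    for slack in (1.0, 5.0, 10.0):
        print(slack, "->", select_speed(SLACK_TABLE, slack))
```

Keeping the requirements in a table means the runtime decision is a simple comparison, which is the point of doing the expensive analysis offline.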
10. Checkpoint count
Fault-tolerant computing refers to the correct execution of user
programs and system software in the presence of faults.
Fault tolerance is typically achieved in real-time systems through
online fault detection, checkpointing, and rollback recovery.
Checkpointing increases the task execution time, and in the absence
of faults, it might cause a missed deadline for a task that completes
on time without checkpointing.
Frequent checkpointing reduces re-execution time due to faults but
increases task execution time and vice versa.
Therefore, the checkpointing interval, i.e., the duration between two
consecutive checkpoints, must be carefully chosen to balance
checkpointing cost with the re-execution time.
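A small numerical sketch of this trade-off is given below. It assumes a deliberately simplified cost model that is not taken from the slides: the task is split into n equal segments, each segment ends with a checkpoint of fixed cost, and each of the k faults forces re-execution of at most one full segment. The function names are hypothetical.

```python
def worst_case_time(exec_time, checkpoint_cost, segments, faults):
    """Worst-case execution time under the simplified model above."""
    overhead = segments * checkpoint_cost         # grows with more checkpoints
    reexecution = faults * exec_time / segments   # shrinks with more checkpoints
    return exec_time + overhead + reexecution


def best_segment_count(exec_time, checkpoint_cost, faults, max_segments=1000):
    """Brute-force search for the segment count with the smallest worst case."""
    return min(
        range(1, max_segments + 1),
        key=lambda n: worst_case_time(exec_time, checkpoint_cost, n, faults),
    )


if __name__ == "__main__":
    n = best_segment_count(exec_time=100.0, checkpoint_cost=1.0, faults=4)
    print(n, worst_case_time(100.0, 1.0, n, 4))
```

Under this model the overhead term grows linearly and the re-execution term shrinks hyperbolically in n, so the minimum sits where the two balance.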
11. Fault occurrences count
• Relation between fault occurrences count and fault
arrival rate
k is the number of fault occurrences to be tolerated.
Given a fault arrival rate λ and a task execution interval t, the mean number
of faults that arrive during the interval is λt.
o If k is much smaller than λt, a sophisticated fault-tolerant scheme with its
associated overhead is not appropriate.
o If k is much larger than λt, a fault-tolerant scheme that provides a deterministic
real-time guarantee may not exist.
In order to target a system with reasonable real-time performance with
fault tolerance, the value of k can be taken to be a small multiple of λt,
e.g., 2λt ≤ k ≤ 3λt.
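As a quick worked example of the rule of thumb 2λt ≤ k ≤ 3λt, the sketch below lists the integer values of k that fall in that range. The function name and the sample rate and interval are assumptions chosen for illustration.

```python
import math


def fault_count_range(fault_rate, interval):
    """Integer fault counts k satisfying 2*lambda*t <= k <= 3*lambda*t."""
    mean_faults = fault_rate * interval       # lambda * t
    k_min = math.ceil(2 * mean_faults)
    k_max = math.floor(3 * mean_faults)
    return list(range(k_min, k_max + 1))      # may be empty when lambda*t is small


if __name__ == "__main__":
    # Hypothetical values: 0.5 faults per second over a 3 second interval.
    print(fault_count_range(0.5, 3.0))
```

When λt is well below 1 the range may contain no integer at all, in which case k has to be chosen by other means.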
23. Exact Characterization of RMA (ECRMA)
• Critical Instant
The worst case behavior of RMA occurs when all tasks in a task set are
instantiated simultaneously and are ready for execution immediately after
initiation.
It has been shown (Lehoczky et al., 1989) that a schedule of independent
periodic tasks is feasible if the first instance of each task is
schedulable when it is instantiated at a critical instant.
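The critical-instant result leads to the standard exact test for rate-monotonic scheduling: task i is schedulable if, at some scheduling point t no later than its period, the cumulative demand of task i and all higher-priority tasks does not exceed t. The sketch below is a plain version of that test without fault tolerance or checkpointing terms; it assumes integer periods, deadlines equal to periods, and a hypothetical function name.

```python
def rm_schedulable(tasks):
    """Exact RM test for a list of (execution_time, period) pairs."""
    tasks = sorted(tasks, key=lambda task: task[1])  # shorter period = higher priority
    for i, (_, period_i) in enumerate(tasks):
        higher = tasks[: i + 1]  # task i plus all higher-priority tasks
        # Scheduling points: multiples of each period up to period_i.
        points = sorted(
            {m * p for _, p in higher for m in range(1, period_i // p + 1)}
        )
        # Demand at time t: sum of C_j * ceil(t / T_j), using integer ceiling.
        if not any(
            sum(c * -(-t // p) for c, p in higher) <= t for t in points
        ):
            return False
    return True


if __name__ == "__main__":
    print(rm_schedulable([(1, 4), (2, 6), (3, 12)]))
    print(rm_schedulable([(2, 4), (3, 6), (3, 12)]))
```

Only the first instance of each task is checked, which is exactly what the critical-instant argument permits.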
29. A-DVS algorithm (2)
• Some Considerations
The binary search based A-DVS algorithm is valid only if the energy
consumption is monotonic with respect to frequency/voltage changes.
When the processor static power consumption as well as context
switching overhead is considered, the monotonicity does not hold.
In this case, there exists a critical processor speed below which scaling
down the processor speed will instead increase the energy consumption.
The minimum voltage level, low, is therefore initialized to the level
corresponding to the processor's critical speed.
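A minimal sketch of such a binary search is shown below. It assumes the speed levels are sorted in ascending order and that feasibility is monotonic (if a level is feasible, every faster level is too). The predicate `is_feasible`, the parameter `critical_index`, and the sample levels are hypothetical stand-ins for the feasibility analysis and the critical speed.

```python
def lowest_feasible_level(levels, is_feasible, critical_index=0):
    """Return the slowest feasible level at or above the critical speed.

    levels: speed levels in ascending order.
    critical_index: index of the critical speed; the search never goes below it.
    Returns None when even the fastest level is infeasible.
    """
    low, high = critical_index, len(levels) - 1
    if not is_feasible(levels[high]):
        return None
    while low < high:
        mid = (low + high) // 2
        if is_feasible(levels[mid]):
            high = mid        # mid works, try slower levels
        else:
            low = mid + 1     # mid is too slow
    return levels[low]


if __name__ == "__main__":
    speeds = [0.2, 0.4, 0.6, 0.8, 1.0]
    print(lowest_feasible_level(speeds, lambda s: s >= 0.5))
    print(lowest_feasible_level(speeds, lambda s: s >= 0.5, critical_index=3))
```

Starting `low` at the critical speed keeps the search inside the range where slowing down still saves energy.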
40. Online reevaluation of DVS policies
Offline scheduling assumes that all tasks exhibit the worst case execution
time and all faults occur during the checkpointing.
The runtime behavior of task execution and fault occurrences can vary
significantly.
At runtime, not all tasks execute up to their worst case execution times
and not all faults occur during task executions.
Hence, the slack generated at runtime can be used to dynamically scale
down the processor speed to save energy.
The online reevaluation of DVS policies can save significant energy by using
generated slacks due to uncertainties in fault occurrence.
59. Heuristic Method Based on GA (2)
• Init function
Initializes the search space (chromosome population).
One chromosome is initially generated using the computationally
feasible application-level speed scaling method.
The other chromosomes are generated randomly.
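The Init step could look roughly like the sketch below. The chromosome encoding (one speed level per task), the function name, and the sample values are assumptions made for illustration; the slides do not specify them.

```python
import random


def init_population(seed_chromosome, speed_levels, population_size, rng=None):
    """Build the initial population: one seeded chromosome plus random ones.

    seed_chromosome: per-task speeds from the application-level scaling
    method (assumed to be feasible).
    speed_levels: the discrete speeds a random gene may take.
    """
    rng = rng or random.Random()
    n_tasks = len(seed_chromosome)
    population = [list(seed_chromosome)]
    while len(population) < population_size:
        population.append([rng.choice(speed_levels) for _ in range(n_tasks)])
    return population


if __name__ == "__main__":
    levels = [0.4, 0.6, 0.8, 1.0]
    pop = init_population([0.6] * 4, levels, population_size=5,
                          rng=random.Random(0))
    for chromosome in pop:
        print(chromosome)
```

Seeding with a known feasible solution guarantees that the population contains at least one valid schedule from the first generation.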
70. Suggestion
The scheduler guarantees tolerance of at least k faults and then tries to
apply DVS using the available slack.
It could also tolerate more than k faults by increasing the processor speed
when more than k faults occur.