Fault tolerant real-time scheduling

Quasi-static fault-tolerant scheduling schemes for
energy-efficient hard real-time systems
• Tongquan Wei, CS Department of East China Normal University, China
• Piyush Mishra, GE Global Research, Niskayuna, NY 12309, USA
• Kaijie Wu, ECE Department of University of Illinois, Chicago, IL 60607, USA
• Junlong Zhou, CS Department of East China Normal University, China
Journal of Systems and Software
2012
Reza Ramezani

A Unified Approach for Fault Tolerance and Dynamic
Power Management in Fixed-Priority Real-Time
Embedded Systems
• Ying Zhang
– a Senior Software engineer with the Research and Development
Department, Guidant Corporation, St. Paul, MN, USA
• Krishnendu Chakrabarty
– Department of Electrical and Computer Engineering, Duke University,
Durham, USA
IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems 25, no. 1 (2006): 111-125.

Overview
➢ Primaries
➢ Checkpointing & Response Time
➢ Reliability: the best fault tolerance count?
➢ Feasibility Analysis
➢ Offline Application Level Voltage Scaling
➢ Offline Task Level Voltage Scaling
➢ Online DVS by Using Slacks
➢ Previous Work (Ying Zhang, Krishnendu Chakrabarty, 2006)
➢ Results
➢ Suggestion

Features
• Fault Tolerance Scheduling
 ▪ Transient Faults
 ▪ Fast Detection
 ▪ Fault occurrences at runtime, checkpointing and state restoration.
• Dynamic Voltage Scaling (DVS)
• Offline Scheduling
 ▪ Application Level Voltage Scaling (A-DVS)
 ▪ Task Level Voltage Scaling (T-DVS)
• Online Scheduling
 ▪ Using Slacks
• Exact Rate-Monotonic Characterization
 ▪ Used instead of iteratively deriving the response time of each task for
feasibility analysis.

Online DVS Outline
• The adaptation of the offline task schedules to the
runtime behavior of fault occurrences is implemented by:
 ▪ (1) Pre-computing and saving in a lookup table the maximum slack
requirements for the processor to dynamically slow down.
 ▪ (2) Retrieving the stored slack time requirements and comparing them with
the cumulative slack generated at runtime.
 ▪ (3) Dynamically scaling down processor speed when the generated
slack time is equal to or greater than the stored slack requirements.
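
The three steps above can be sketched as a table lookup. This is a minimal illustration, not the authors' implementation: the table contents, the millisecond unit, and the names SLACK_TABLE and select_speed are all assumptions made for the example.

```python
from bisect import bisect_right

# Hypothetical offline table: (required_slack_ms, normalized_speed), sorted by
# required slack. The values are illustrative and not taken from the paper.
SLACK_TABLE = [(0.0, 1.0), (2.0, 0.8), (5.0, 0.6), (9.0, 0.4)]


def select_speed(cumulative_slack_ms, table=SLACK_TABLE):
    """Return the lowest speed whose stored slack requirement is met.

    A speed is allowed once the cumulative slack is equal to or greater
    than the stored requirement (step 3 above).
    """
    thresholds = [required for required, _ in table]
    # Index of the last entry whose requirement is <= the available slack.
    idx = bisect_right(thresholds, cumulative_slack_ms) - 1
    # With no usable entry (negative slack), stay at the first (full) speed.
    return table[max(idx, 0)][1]
```

Because the table is computed offline, the runtime cost is a single binary search per scheduling decision, which is the point of pre-computing the requirements.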

Checkpointing & Response Time

Checkpoint count
 ▪ Fault-tolerant computing refers to the correct execution of user
programs and system software in the presence of faults.
 ▪ Fault tolerance is typically achieved in real-time systems through
online fault detection, checkpointing, and rollback recovery.
 ▪ Checkpointing increases the task execution time, so even in the absence
of faults it might cause a missed deadline for a task that would complete
on time without checkpointing.
 ▪ Frequent checkpointing reduces re-execution time due to faults but
increases task execution time, and vice versa.
 ▪ Therefore, the checkpointing interval, i.e., the duration between two
consecutive checkpoints, must be carefully chosen to balance
checkpointing cost against re-execution time.
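
The trade-off can be made concrete with a simplified cost model. The model below is an assumption for illustration, not necessarily the one used in the papers: a task with fault-free execution time E takes m equally spaced checkpoints of cost C each, and each of k faults forces re-execution of one segment of length E / (m + 1). The function names are hypothetical.

```python
import math


def worst_case_time(E, C, k, m):
    """Worst-case time under the assumed model: E + m*C + k*E/(m+1)."""
    return E + m * C + k * E / (m + 1)


def best_checkpoint_count(E, C, k):
    """Integer m that minimizes worst_case_time under the assumed model.

    Setting the derivative C - k*E/(m+1)**2 to zero gives
    m = sqrt(k*E/C) - 1; the function is convex in m, so the best integer
    is one of the two neighbors of that real-valued optimum.
    """
    m_real = math.sqrt(k * E / C) - 1
    candidates = {max(0, math.floor(m_real)), max(0, math.ceil(m_real))}
    return min(candidates, key=lambda m: worst_case_time(E, C, k, m))
```

For example, with E = 100, C = 1 and k = 4 the model gives 19 checkpoints and a worst-case time of 139, compared with 500 when no checkpoints are taken.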

Fault occurrences count
• Relation between fault occurrences count and fault
arrival rate
 ▪ k is the number of fault occurrences to be tolerated.
 ▪ Given a fault arrival rate λ and a task execution interval t, the mean number
of faults that arrive during the interval is λt.
o If k is much smaller than λt, a sophisticated fault-tolerant scheme with its
associated overhead is not appropriate.
o If k is much larger than λt, a fault-tolerant scheme that provides a deterministic
real-time guarantee may not exist.
 ▪ To target a system with reasonable real-time performance and
fault tolerance, the value of k can be taken to be a small multiple of λt,
e.g., 2λt ≤ k ≤ 3λt.
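
A short sketch of the rule of thumb above. The range computation follows the slide directly; the tail probability additionally assumes Poisson fault arrivals, which the slide does not state, and both function names are hypothetical.

```python
import math


def k_range(lam, t):
    """Smallest and largest integer k with 2*lam*t <= k <= 3*lam*t."""
    mean = lam * t
    return math.ceil(2 * mean), math.floor(3 * mean)


def prob_more_than_k(lam, t, k):
    """P(more than k faults in t), assuming Poisson arrivals with rate lam."""
    mean = lam * t
    cdf = sum(math.exp(-mean) * mean**i / math.factorial(i)
              for i in range(k + 1))
    return 1.0 - cdf
```

With λ = 0.5 and t = 4 the mean is 2 faults, so the suggested range is 4 ≤ k ≤ 6; under the Poisson assumption the chance of exceeding k = 4 is about 5%. When λt is very small the range may contain no integer, in which case the lower bound returned exceeds the upper bound.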

Reliability
The best fault tolerance count?

Exact Characterization of RMA (ECRMA)
• Critical Instant
 ▪ The worst-case behavior of RMA occurs when all tasks in a task set are
instantiated simultaneously and are ready for execution immediately after
initiation.
 ▪ It has been shown that a schedule of independent periodic tasks is
feasible if the first instance of each task is schedulable when it is
instantiated at a critical instant (Lehoczky et al., 1989).
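
The critical-instant result leads to the exact rate-monotonic test of Lehoczky et al. (1989), sketched below for the fault-free case. The sketch assumes deadlines equal to periods and tasks given as (C, T) pairs of worst-case execution time and period; the function name is hypothetical, and the fault-tolerant variant in the papers additionally accounts for checkpointing and recovery time.

```python
def rm_exact_feasible(tasks):
    """Exact rate-monotonic feasibility test at the critical instant.

    tasks: list of (C, T) pairs, with deadline equal to period T.
    Task i is schedulable if, at some scheduling point t <= T_i, the
    cumulative demand of tasks 1..i released at time 0 fits within t.
    """
    tasks = sorted(tasks, key=lambda ct: ct[1])  # shorter period = higher priority
    for i, (_, T_i) in enumerate(tasks):
        higher = tasks[: i + 1]  # task i and all higher-priority tasks
        # Scheduling points: multiples of each period up to T_i.
        points = {l * T_j for _, T_j in higher
                  for l in range(1, int(T_i // T_j) + 1)}
        # -(-t // T_j) is ceil(t / T_j): number of releases of task j in [0, t).
        if not any(sum(C_j * -(-t // T_j) for C_j, T_j in higher) <= t
                   for t in points):
            return False
    return True
```

For instance, the task set (1, 2), (2, 4) has utilization 1.0, above the Liu and Layland bound for two tasks, yet this exact test accepts it because the second task's demand fits at t = 4.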

Exact Characterization of RMA (ECRMA) (2)

Exact Characterization of RMA (ECRMA) (3)

Offline Application Level Voltage Scaling

Application level voltage scaling (A-DVS)

A-DVS algorithm (2)
• Some Considerations
 ▪ The binary-search-based A-DVS algorithm is valid only if the energy
consumption is monotonic with respect to frequency/voltage changes.
 ▪ When the processor's static power consumption and context-switching
overhead are considered, the monotonicity does not hold.
 ▪ In this case, there exists a critical processor speed below which scaling
down the processor speed increases the energy consumption instead.
 ▪ Therefore, the minimum voltage level, low, is initialized to the level
corresponding to the processor's critical speed.
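
A minimal sketch of a binary search over discrete speed levels with the lower bound clamped at the critical speed. It is an illustration of the considerations above rather than the paper's A-DVS listing: the is_feasible callback stands in for the feasibility checking algorithm, feasibility is assumed to be monotonic in speed, and all names are hypothetical.

```python
def lowest_feasible_level(levels, is_feasible, critical_speed):
    """Lowest speed level >= critical_speed at which the task set is feasible.

    levels: available normalized speeds in ascending order.
    is_feasible: callable(speed) -> bool, assumed monotonic (if a speed is
        feasible, every higher speed is feasible too).
    Returns None if even the highest level is infeasible.
    """
    # Never search below the critical speed: going lower would cost energy.
    candidates = [s for s in levels if s >= critical_speed]
    if not candidates or not is_feasible(candidates[-1]):
        return None
    low, high = 0, len(candidates) - 1
    while low < high:
        mid = (low + high) // 2
        if is_feasible(candidates[mid]):
            high = mid      # feasible: try a lower level
        else:
            low = mid + 1   # infeasible: must speed up
    return candidates[low]
```

Starting the search at the critical speed restores the monotonic relation between speed and energy within the searched range, which is what the binary search relies on.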

Feasibility Checking Algorithm (FCA)

Offline Task Level Voltage Scaling

Task level voltage scaling (T-DVS)

Schedulability Checking Algorithm (SCA)

Online reevaluation of DVS policies
 ▪ Offline scheduling assumes that all tasks exhibit their worst-case execution
time and that all faults occur during the checkpointing.
 ▪ The runtime behavior of task execution and fault occurrences can vary
significantly.
 ▪ At runtime, not all tasks execute up to their worst-case execution times
and not all faults occur during task executions.
 ▪ Hence, the slack generated at runtime can be used to dynamically scale
down the processor speed to save energy.
 ▪ The online reevaluation of DVS policies can save significant energy by using
the slack generated by uncertainties in fault occurrence.
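
To show why reclaimed slack saves energy, the sketch below stretches the remaining work over its own time plus the slack. It is a generic slack-reclamation illustration, not the reevaluation algorithm from the paper. The energy estimate assumes dynamic energy proportional to the square of the normalized speed for a fixed amount of work and ignores static power; the function names and the min_speed parameter are hypothetical.

```python
def reclaimed_speed(remaining_wcet, slack, min_speed=0.0):
    """Normalized speed that spreads remaining work over its time plus slack.

    remaining_wcet: remaining worst-case work, measured at full speed.
    slack: time freed at runtime (early completions, faults that did not occur).
    min_speed: lower bound, e.g. the processor's critical speed.
    """
    if remaining_wcet <= 0 or slack < 0:
        raise ValueError("remaining_wcet must be positive and slack non-negative")
    speed = remaining_wcet / (remaining_wcet + slack)
    return max(speed, min_speed)


def relative_dynamic_energy(speed):
    """Energy relative to full speed, assuming energy ~ speed**2 for fixed work."""
    return speed ** 2
```

With 6 time units of remaining work and 2 units of slack, the speed drops to 0.75 and, under the assumed model, dynamic energy falls to about 56% of the full-speed value.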

Reevaluation of DVS at application level

Reevaluation of DVS at application level (2)

Reevaluation of DVS at application level (3)

Reevaluation of DVS policies at task level

Reevaluation of T-DVS (D-TDVS)

Previous Work (Ying Zhang, Krishnendu Chakrabarty, 2006)

Feasibility of a task set under fault-free conditions

Tolerating k Faults in Each Task

Fault Tolerance With DVS (2)

Fault Tolerance With DVS (3)

Heuristic Method Based on GA

Heuristic Method Based on GA (2)
• Init function
 ▪ Initializes the search space (chromosome population).
 ▪ One chromosome is initially generated using the computationally
feasible application-level speed scaling method.
 ▪ The other chromosomes are generated randomly.
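
The Init step can be sketched as follows. The chromosome encoding is an assumption made for the example (one normalized speed per task), as are the function and parameter names; the slide only states that one chromosome comes from the application-level method and the rest are random.

```python
import random


def init_population(num_tasks, pop_size, levels, app_level_speed, rng=None):
    """Build the initial GA population.

    Assumed encoding: a chromosome is a list with one speed per task.
    The first chromosome assigns the application-level speed to every
    task; the remaining pop_size - 1 chromosomes pick speeds at random.
    """
    rng = rng or random.Random()
    seeded = [app_level_speed] * num_tasks
    others = [[rng.choice(levels) for _ in range(num_tasks)]
              for _ in range(pop_size - 1)]
    return [seeded] + others
```

Seeding the population with a solution already known to be feasible gives the search a valid starting point, while the random chromosomes provide diversity.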

Heuristic Method Based on GA (3)

Application level results on Transmeta Crusoe

Task level results on Transmeta Crusoe

Application level results on Intel XScale

Task level results on Intel XScale

Real life implementation
 ▪ The energy consumption of the system board, excluding the processor time.

Suggestion
 ▪ The scheduler tolerates at least k faults and then applies DVS using the
available slack.
 ▪ More than k faults could be tolerated by increasing the processor speed
when more than k faults occur.
