2. Why Fault Tolerance?
Offers many advantages:
◦ Avoids costly packet retransmissions
◦ Avoids catastrophic data loss
◦ Can increase chip yield
◦ Allows higher speed operation
In NoCs specifically:
◦ Ensures success of interconnect
◦ Grows in importance as technology scales
3. Fault Classes
Transient faults (or soft errors): appear and disappear at random
◦ Caused by alpha particles, cosmic-ray-induced neutrons, etc.
Intermittent faults: appear only under certain conditions
◦ Occur repeatedly at the same location
◦ Tend to occur in bursts
◦ Replacement of the faulty component removes the fault
Permanent faults (or hard errors): always present, but may be masked
◦ Static (occurring at manufacture time): process variability (PV), manufacturing imperfections
◦ Dynamic (occurring at run time): electromigration (EM), negative bias temperature instability (NBTI), oxide breakdown, stress-induced voiding (SIV), hot carrier injection (HCI), etc.
4. Making NoCs Reliable
Current Methods
T-error tolerant NoC design
Error Control
◦ Error detection and correction codes
◦ Hop-by-hop (HBH) retransmission mechanism
Reliable task mapping
Fault tolerant rerouting
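The error control item above pairs error detection with hop-by-hop retransmission. A minimal sketch of that pairing follows, assuming a single even-parity bit per flit as the detection code and a simple independent bit-flip link model. The names `parity_bit`, `send_over_link`, and `hop_by_hop_send`, and all parameters, are hypothetical and do not come from the slides.

```python
import random


def parity_bit(bits):
    """Even-parity bit over a list of 0/1 values."""
    return sum(bits) % 2


def send_over_link(flit, flip_prob, rng):
    """Assumed link model: each bit flips independently with probability flip_prob."""
    return [b ^ 1 if rng.random() < flip_prob else b for b in flit]


def hop_by_hop_send(payload, hops, flip_prob, rng, max_retries=10):
    """Forward payload across `hops` links with a parity check at every hop.

    On a detected error the upstream router resends its buffered copy.
    Returns (payload_at_destination, total_link_transmissions).
    """
    transmissions = 0
    current = list(payload)
    for _ in range(hops):
        flit = current + [parity_bit(current)]  # append the check bit
        for _ in range(max_retries):
            transmissions += 1
            received = send_over_link(flit, flip_prob, rng)
            if parity_bit(received[:-1]) == received[-1]:
                current = received[:-1]  # accepted, move to the next hop
                break
        else:
            raise RuntimeError("link still failing after max_retries")
    return current, transmissions
```

A single parity bit only detects an odd number of bit flips, so a CRC or an error-correcting code would take the place of `parity_bit` in a stronger design. On an error-free link the transmission count equals the hop count; every detected error adds one retransmission on that hop only.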
8. Power consumption
Observations
◦ The ee-par scheme has higher power consumption than the ee-crc and hybrid schemes.
◦ The flit-based scheme incurs more power consumption because, as the number of flits per packet increases, the number of useful bits decreases.
◦ The packet buffer requirements impact the power consumption. Hence, as the number of hops increases, the power overhead of the ss-flit scheme increases.
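The flit-based observation can be made concrete with a little arithmetic. The sketch below assumes a fixed packet size and a fixed number of check bits attached to every flit. The function name and the numbers (a 256-bit packet, 8 check bits per flit) are illustrative assumptions, not values from the slides.

```python
def useful_bits(packet_bits, flits_per_packet, check_bits_per_flit):
    """Payload bits left when every flit carries its own check bits (assumed model)."""
    overhead = flits_per_packet * check_bits_per_flit
    if overhead > packet_bits:
        raise ValueError("check bits exceed packet size")
    return packet_bits - overhead


# Illustrative numbers only: 256-bit packet, 8 check bits per flit.
for n in (1, 4, 8):
    bits = useful_bits(256, n, 8)
    print(n, bits, bits / 256)
```

Under these assumptions the useful fraction of the packet falls from about 97% with one flit to 75% with eight, so more of the energy spent per packet goes to check bits rather than payload.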
14. ROBUST: Self-Healing Router
Universal Logic Block (ULB)
Crossbar protection using multiple ULBs
Advantages
It has a higher silicon protection factor and a higher reliability improvement factor.
15. Future challenges
◦ All the schemes presented to improve the reliability of the NoC architecture carry a power overhead. The extra power dissipation raises the operating temperature, which can reduce the mean time to failure (MTTF).
◦ All the techniques should be thermal-aware in order to prevent this effect.
◦ Instead of evenly wearing out all cores in MPSoCs, a method should be designed to self-heal failed cores.
◦ Most error-resilient schemes today focus primarily on making routers and links fault tolerant. More attention should go to making memories reliable.
16. Conclusion
The ideas presented in this paper make the NoC architecture resilient to permanent and intermittent errors. Several techniques are employed to improve reliability: the t-error tolerant mechanism, the self-healing router architecture, reliability-driven task mapping, deadlock recovery, and error detection and correction schemes. Several of them rely on redundant hardware components, which is acceptable in terms of area: because of “dark silicon”, it is impossible to turn on every component on the die anyway. However, most techniques increase the power consumption of the NoC architecture, which is their main drawback. Designing systems to be resilient to errors is crucial to exploiting the advantages of networks-on-chip.