JOURNAL ARTICLE

NoC-based fault-tolerant cache design in chip multiprocessors

Abbas BanaiyanMofradGustavo GirãoNikil Dutt

Year: 2014 Journal:   ACM Transactions on Embedded Computing Systems Vol: 13 (3s)Pages: 1-26   Publisher: Association for Computing Machinery

Abstract

Advances in technology scaling increasingly make emerging Chip MultiProcessor (CMP) platforms more susceptible to failures that cause various reliability challenges. In such platforms, error-prone on-chip memories (caches) continue to dominate the chip area. Also, Network-on-Chip (NoC) fabrics are increasingly used to manage the scalability of these architectures. We present a novel solution for efficient implementation of fault-tolerant design of Last-Level Cache (LLC) in CMP architectures. The proposed approach leverages the interconnection network fabric to protect the LLC cache banks against permanent faults in an efficient and scalable way. During an LLC access to a faulty block, the network detects and corrects the faults, returning the fault-free data to the requesting core. Leveraging the NoC interconnection fabric, designers can implement any cache fault-tolerant scheme in an efficient, modular, and scalable manner for emerging multicore/manycore platforms. We propose four different policies for implementing a remapping-based fault-tolerant scheme leveraging the NoC fabric in different settings. The proposed policies enable design trade-offs between NoC traffic (packets sent through the network) and the intrinsic parallelism of these communication mechanisms, allowing designers to tune the system based on design constraints. We perform an extensive design space exploration on NoC benchmarks to demonstrate the usability and efficacy of our approach. In addition, we perform sensitivity analysis to observe the behavior of various policies in reaction to improvements in the NoC architecture. The overheads of leveraging the NoC fabric are minimal: on an 8-core, 16-cache-bank CMP we demonstrate reliable access to LLCs with additional overheads of less than 3% in area and less than 7% in power.

Keywords:
Computer science Scalability Fault tolerance Embedded system Cache Network on a chip Multi-core processor Modular design Computer architecture Cache coherence Network packet Interconnection Reliability (semiconductor) Distributed computing Computer network Parallel computing Cache algorithms CPU cache Operating system

Metrics

3
Cited By
1.10
FWCI (Field Weighted Citation Impact)
51
Refs
0.80
Citation Normalized Percentile
Is in top 1%
Is in top 10%

Citation History

Topics

Interconnection Networks and Systems
Physical Sciences →  Computer Science →  Computer Networks and Communications
Parallel Computing and Optimization Techniques
Physical Sciences →  Computer Science →  Hardware and Architecture
Advancements in Battery Materials
Physical Sciences →  Engineering →  Electrical and Electronic Engineering
© 2026 ScienceGate Book Chapters — All rights reserved.