REPAIR: Hard-Error Recovery via Re-Execution
IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems
Soman, J., Miralaei, N., Mycroft, A., & Jones, T. M. (2015). REPAIR: Hard-Error Recovery via Re-Execution. IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, 76-79. https://doi.org/10.1109/DFT.2015.7315139
Processor reliability at upcoming technology nodes presents significant challenges to designers from increased manufacturing variability, parametric variation and transistor wear-out leading to permanent faults. We present a design to tolerate this impact at the microarchitectural level—a chip with n cores together with one or more shared instruction re-execution units (IRUs). Instructions using a faulty component are identified and re-executed on an IRU. This design incurs no slowdown in the absence of errors and allows continued operation of all n cores after multiple hard errors on one or all cores in the structures protected by our scheme. Experiments show that a single-core chip experiences only a 23% slowdown with 1 error, rising to 43% in the presence of 5 errors. In a 4-core scenario with 4 errors on every core and a shared IRU, REPAIR enables performance of 0.68× of a fully functioning system.
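The routing idea in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the Instruction type, the unit names, and the route function are hypothetical, and the model ignores fault detection, timing, and contention for a shared IRU. It only shows that instructions needing a component marked faulty are diverted to the IRU while all others stay on the core, so a fault-free core diverts nothing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    opcode: str
    unit: str  # hypothetical name of the core component this instruction would use


def route(instructions, faulty_units):
    """Split instructions into those kept on the core and those sent to the IRU.

    An instruction is diverted only if the component it needs is in
    faulty_units (an assumed, already-known set of hard-faulted components).
    """
    on_core, on_iru = [], []
    for ins in instructions:
        if ins.unit in faulty_units:
            on_iru.append(ins)   # would be re-executed on the IRU
        else:
            on_core.append(ins)  # executes normally on the core
    return on_core, on_iru


program = [
    Instruction("add", "alu0"),
    Instruction("mul", "mul0"),
    Instruction("sub", "alu0"),
]

# Assume the multiplier has a hard fault.
on_core, on_iru = route(program, faulty_units={"mul0"})
print([i.opcode for i in on_core], [i.opcode for i in on_iru])
```

With an empty faulty_units set, every instruction stays on the core, which mirrors the claim that the design incurs no slowdown in the absence of errors.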
This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) through grants EP/K026399/1 and EP/J016284/1. Experiments used the Darwin Supercomputer of the University of Cambridge High Performance Computing Service (http://www.hpc.cam.ac.uk/) funded by the Higher Education Funding Council for England and the Science and Technology Facilities Council.
External DOI: https://doi.org/10.1109/DFT.2015.7315139
This record's URL: https://www.repository.cam.ac.uk/handle/1810/249256