Repair: Hard-Error Recovery via Re-Execution

Processor reliability at upcoming technology nodes poses significant challenges for designers: increased manufacturing variability, parametric variation and transistor wearout all lead to permanent faults. We present a design that tolerates these faults at the microarchitectural level: a chip with $n$ cores together with one or more shared instruction re-execution units (IRUs). Instructions that use a faulty component are identified and re-executed on an IRU. The design incurs no slowdown in the absence of errors and allows all $n$ cores to continue operating after multiple hard errors, on one or all cores, in the structures protected by our scheme. Experiments show that a single-core chip experiences only a 23% slowdown with 1 error, rising to 43% with 5 errors. In a 4-core scenario with 4 errors on every core and a shared IRU, REPAIR achieves $0.68\times$ the performance of a fully functioning system.

Keywords

33 Built Environment and Design, 40 Engineering, 3301 Architecture, 4009 Electronics, Sensors and Digital Hardware

Journal Title

2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS)

Conference Name

2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS)

Journal ISSN

1550-5774

Publisher

Institute of Electrical and Electronics Engineers (IEEE)

Publisher DOI

https://doi.org/10.1109/dft.2015.7315139

Sponsorship

Engineering and Physical Sciences Research Council (EP/K026399/1)
Engineering and Physical Sciences Research Council (EP/J016284/1)

This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) through grants EP/K026399/1 and EP/J016284/1. Experiments used the Darwin Supercomputer of the University of Cambridge High Performance Computing Service (http://www.hpc.cam.ac.uk/) funded by the Higher Education Funding Council for England and the Science and Technology Facilities Council.

Collections

Scholarly Works - Computer Science and Technology