Repair: Hard-Error Recovery via Re-Execution
Abstract
Processor reliability at upcoming technology nodes presents significant challenges to designers, as increased manufacturing variability, parametric variation and transistor wearout lead to permanent faults. We present a design to tolerate this impact at the microarchitectural level: a chip with $n$ cores together with one or more shared instruction re-execution units (IRUs). Instructions using a faulty component are identified and re-executed on an IRU. This design incurs no slowdown in the absence of errors and allows continued operation of all $n$ cores after multiple hard errors on one or all cores in the structures protected by our scheme. Experiments show that a single-core chip experiences only a 23% slowdown with 1 error, rising to 43% in the presence of 5 errors. In a 4-core scenario with 4 errors on every core and a shared IRU, REPAIR achieves $0.68\times$ the performance of a fully functioning system.
Sponsorship
Engineering and Physical Sciences Research Council (EP/J016284/1)
