
High performance fault tolerance through predictive instruction re-execution

Accepted version
Peer-reviewed

Type

Conference Object

Authors

Soman, J 
Jones, TM 

Abstract

Processor designers face the challenge of defects that form during fabrication and operation and lead to permanent faults. Tolerating permanent (hard) faults is an important problem in computing systems; solutions can help improve yield during fabrication and reduce the cost of transistor mortality during the service life of the processor.

This paper presents PreFix, a method for handling hard errors that keeps a faulty core running and executing instructions correctly. Instead of turning off faulty structures, PreFix predicts early in the pipeline whether an instruction is likely to use faulty components, then refines this prediction later in the pipeline to detect when an error has actually occurred. Instructions marked as possibly faulty in the front-end are queued for duplicate execution on a separate core. At commit, results from the original and duplicate instructions are compared. Upon a mismatch, the original instruction is patched up, the pipeline is flushed, and execution continues. Using PreFix, faulty components can continue performing useful work when their errors do not manifest in architecturally visible state changes. This enhances processor lifetime with minimal performance overhead.
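
The compare-at-commit idea described above can be illustrated with a small software model. This is only a rough sketch under stated assumptions, not the authors' hardware design: the names `Instr`, `golden`, `faulty_core` and `run` are hypothetical, the fault is modelled as a stuck-at-0 on bit 0 of a result, the front-end prediction is reduced to "the instruction uses a unit known to be faulty", and the later refinement step and the pipeline flush are not modelled.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    op: str    # "add" or "mul" (toy instruction set, assumed for illustration)
    a: int
    b: int
    unit: str  # hypothetical name of the functional unit the instruction uses


def golden(instr):
    """Fault-free execution, standing in for the duplicate run on a separate core."""
    return instr.a + instr.b if instr.op == "add" else instr.a * instr.b


def faulty_core(instr, faulty_units):
    """Original core: results from a faulty unit have bit 0 stuck at 0 (assumed fault model)."""
    result = golden(instr)
    if instr.unit in faulty_units:
        result &= ~1
    return result


def run(program, faulty_units):
    """Execute a program, duplicating only the instructions predicted as possibly faulty."""
    committed = []
    stats = {"duplicated": 0, "mismatches": 0}
    for instr in program:
        result = faulty_core(instr, faulty_units)
        # Simplified prediction: the instruction touches a known-faulty unit.
        if instr.unit in faulty_units:
            stats["duplicated"] += 1
            check = golden(instr)
            # Comparison at commit: patch the result only if the fault manifested.
            if check != result:
                stats["mismatches"] += 1
                result = check
        committed.append(result)
    return committed, stats


program = [
    Instr("add", 1, 2, "alu0"),  # healthy unit, not duplicated
    Instr("add", 2, 2, "alu1"),  # faulty unit, but the fault is masked (result is even)
    Instr("mul", 3, 5, "alu1"),  # faulty unit, fault corrupts the result
]
results, stats = run(program, faulty_units={"alu1"})
print(results, stats)
```

In this toy run the second instruction uses the faulty unit yet still produces a correct result, which mirrors the abstract's point that a faulty component can keep doing useful work when its error does not reach architecturally visible state.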

Keywords

33 Built Environment and Design, 40 Engineering, 3301 Architecture, 4009 Electronics, Sensors and Digital Hardware, Mental Health

Journal Title

2017 IEEE Int. Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2017

Conference Name

2017 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)

Journal ISSN

1550-5774

Volume Title

2018-January

Publisher

IEEE

Sponsorship

Engineering and Physical Sciences Research Council (EP/J016284/1)
Engineering and Physical Sciences Research Council (EP/K026399/1)