ParaMedic: Heterogeneous Parallel Error Correction

Accepted version
Peer-reviewed

Type

Conference Object

Authors

Jones, TM 

Abstract

The cost of processor error detection can be reduced significantly by exploiting the parallelism that exists in a repeated copy of an execution, which may not exist in the original code, to split the redundant work across a large number of small, highly efficient cores. However, such schemes do not provide a method for automatic error recovery.

We develop ParaMedic, an architecture that allows efficient, automatic correction of errors detected in a system by using parallel heterogeneous cores. It provides a fully fail-safe system that does not propagate errors to other systems and can recover without manual intervention. ParaMedic uses logging to roll back any computation that occurred after a detected error, along with a set of techniques that provide error-checking parallelism while still preventing incorrect processor values from escaping in multicore environments, where ordering individual processors' logs is not enough to roll back execution. Across a set of single-threaded and multi-threaded benchmarks, we achieve 3.1% and 1.5% overhead respectively, compared with 1.9% and 1% for error detection alone.

Keywords

fault tolerance, microarchitecture, error detection

Journal Title

Proceedings - 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2019

Conference Name

2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

Journal ISSN

1530-0889

Publisher

IEEE

Rights

All rights reserved

Sponsorship

EPSRC (1510365)
Engineering and Physical Sciences Research Council (EP/K026399/1)
Engineering and Physical Sciences Research Council (EP/M506485/1)
Arm Ltd