ParaVerser: Harnessing Heterogeneous Parallelism for Affordable Fault Detection in Data Centers

Accepted version
Peer-reviewed

Abstract

Data-center operators have awoken to the fact that silent data corruption resulting from defective silicon compute units is endemic at scale. Software scanners have been deployed to mitigate the issue, but either have low coverage or take months, leaving long windows of incorrect behaviour. By contrast, the redundancy mechanisms used in automotive double the required power and area, so cannot be practically deployed in server-space. We present ParaVerser, a high-coverage, low-overhead solution to hardware-level error detection in servers. Through minor architectural modifications, we enable conventional cores in heterogeneous server-grade processors to act as checker cores, thus exploiting heterogeneity, frequency scaling and the inherent parallelism in repeat runs to provide energy-efficient error checking. By dynamically coupling big.LITTLE-style out-of-order superscalar cores with in-order ones, we reduce energy overheads relative to a typical lockstep system by 70% with identical guarantees, at only 4.3% performance degradation, and 1064B per-core area overhead.

Journal Title

2025 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

Conference Name

2025 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

Journal ISSN

1530-0889

Publisher

Institute of Electrical and Electronics Engineers (IEEE)

Rights and licensing

Except where otherwise noted, this item's license is described as Attribution 4.0 International