Decoupled Vector Runahead

Accepted version
Peer-reviewed

Type

Conference Object

Authors

Naithani, Ajeya 
Roelandts, Jaime 
Ainsworth, Sam 
Eeckhout, Lieven 

Abstract

We present Decoupled Vector Runahead (DVR), an in-core prefetching technique that executes separately from the main application thread and exploits massive amounts of memory-level parallelism to improve the performance of applications featuring indirect memory accesses. DVR dynamically infers loop bounds at run-time, recognizes striding loads, and vectorizes the subsequent instructions that are part of an indirect chain. It proactively issues memory accesses for the resulting loads far into the future, even when the out-of-order core has not yet stalled, bringing their data into the L1 cache and thus providing timely prefetches for the main thread. DVR can adjust the degree of vectorization at run-time, vectorize the same chain of indirect memory accesses across multiple invocations of an inner loop, and efficiently handle branch divergence along the vectorized chain. DVR runs as an on-demand, speculative, in-order, lightweight hardware subthread alongside the main thread within the core and incurs a minimal hardware overhead of only 1139 bytes. Relative to a large superscalar 5-wide out-of-order baseline and to Vector Runahead, a recent microarchitectural technique to accelerate indirect memory accesses on out-of-order processors, DVR delivers 2.4× and 2× higher performance, respectively, for a set of graph analytics, database, and HPC workloads.
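
To make the access pattern in the abstract concrete, the following is a minimal software sketch of an indirect chain of the form data[index[i]]: a striding load over an index array feeds a dependent load into a data array. It is a toy model for illustration only, not the DVR hardware mechanism. The function names (detect_stride, indirect_prefetch_targets) and the parameters (lanes, loop_bound) are hypothetical and do not come from the paper.

```python
from typing import List, Optional


def detect_stride(addresses: List[int]) -> Optional[int]:
    """Return the constant non-zero stride of an address stream, or None.

    Assumption: at least three observations are required before a load
    is treated as striding. This threshold is illustrative only.
    """
    if len(addresses) < 3:
        return None
    deltas = {b - a for a, b in zip(addresses, addresses[1:])}
    if len(deltas) == 1:
        stride = deltas.pop()
        return stride if stride != 0 else None
    return None


def indirect_prefetch_targets(
    index_array: List[int], i: int, lanes: int, loop_bound: int
) -> List[int]:
    """List the data-array positions that iterations i+1 .. i+lanes will touch.

    The striding load reads index_array[j]; the dependent (indirect) load
    then reads data[index_array[j]]. The range is clipped to loop_bound so
    that no position past the end of the loop is generated.
    """
    last = min(i + 1 + lanes, loop_bound)
    return [index_array[j] for j in range(i + 1, last)]
```

For example, detect_stride([100, 108, 116, 124]) returns 8, while an irregular stream such as [100, 108, 120] returns None. In this sketch, lanes stands in for the degree of vectorization and loop_bound for the inferred loop bound; both are simplifications of what the abstract describes.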

Keywords

33 Built Environment and Design, 3301 Architecture

Journal Title

Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2023

Conference Name

56th IEEE/ACM International Symposium on Microarchitecture

Publisher

ACM

Sponsorship

EPSRC (EP/W00576X/1)