Prefetching for complex memory access patterns
Jones, Timothy M.
University of Cambridge
Computer Science and Technology
Doctor of Philosophy (PhD)
MetadataShow full item record
Ainsworth, S. (2018). Prefetching for complex memory access patterns (Doctoral thesis). https://doi.org/10.17863/CAM.25143
Modern-day workloads, particularly those in big data, are heavily memory-latency bound. This is because of both irregular memory accesses, which have no discernable pattern in their memory addresses, and large data sets that cannot fit in any cache. However, this need not be a barrier to high performance. With some data structure knowledge it is typically possible to bring data into the fast on-chip memory caches early, so that it is already available by the time it needs to be accessed. This thesis makes three contributions. I first contribute an automated software prefetching compiler technique to insert high-performance prefetches into program code to bring data into the cache early, achieving 1.3x geometric mean speedup on the most complex processors, and 2.7x on the simplest. I also provide an analysis of when and why this is likely to be successful, which data structures to target, and how to schedule software prefetches well. Then I introduce a hardware solution, the configurable graph prefetcher. This uses the example of breadth-first search on graph workloads to motivate how a hardware prefetcher armed with data-structure knowledge can avoid the instruction overheads, inflexibility and limited latency tolerance of software prefetching. The configurable graph prefetcher sits at the L1 cache and observes memory accesses, which can be configured by a programmer to be aware of a limited number of different data access patterns, achieving 2.3x geometric mean speedup on graph workloads on an out-of-order core. My final contribution extends the hardware used for the configurable graph prefetcher to make an event-triggered programmable prefetcher, using a set of a set of very small micro-controller-sized programmable prefetch units (PPUs) to cover a wide set of workloads. I do this by developing a highly parallel programming model that can be used to issue prefetches, thus allowing high-throughput prefetching with low power and area overheads of only around 3%, and a 3x geometric mean speedup for a variety of memory-bound applications. To facilitate its use, I then develop compiler techniques to help automate the process of targeting the programmable prefetcher. These provide a variety of tradeoffs from easiest to use to best performance.
prefetching, computer architecture, compilers
This work was supported by the Engineering and Physical Sciences Research Council (EPSRC), through grant references EP/K026399/1 and EP/M506485/1, and ARM Ltd.
This record's DOI: https://doi.org/10.17863/CAM.25143
Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Licence URL: https://creativecommons.org/licenses/by-sa/4.0/