Prefetching for complex memory access patterns

Ainsworth, Sam

Prefetching for complex memory access patterns

Repository URI

https://www.repository.cam.ac.uk/handle/1810/277804

Repository DOI

https://doi.org/10.17863/CAM.25143

Files

Thesis (3.38 MB)

Type

Thesis

Authors

Ainsworth, Sam

https://orcid.org/0000-0002-3726-0055

Abstract

Modern-day workloads, particularly those in big data, are heavily memory-latency bound. This is because of both irregular memory accesses, which have no discernable pattern in their memory addresses, and large data sets that cannot fit in any cache. However, this need not be a barrier to high performance. With some data structure knowledge it is typically possible to bring data into the fast on-chip memory caches early, so that it is already available by the time it needs to be accessed.

This thesis makes three contributions. I first contribute an automated software prefetching compiler technique to insert high-performance prefetches into program code to bring data into the cache early, achieving 1.3x geometric mean speedup on the most complex processors, and 2.7x on the simplest. I also provide an analysis of when and why this is likely to be successful, which data structures to target, and how to schedule software prefetches well.

Then I introduce a hardware solution, the configurable graph prefetcher. This uses the example of breadth-first search on graph workloads to motivate how a hardware prefetcher armed with data-structure knowledge can avoid the instruction overheads, inflexibility and limited latency tolerance of software prefetching. The configurable graph prefetcher sits at the L1 cache and observes memory accesses, which can be configured by a programmer to be aware of a limited number of different data access patterns, achieving 2.3x geometric mean speedup on graph workloads on an out-of-order core.

My final contribution extends the hardware used for the configurable graph prefetcher to make an event-triggered programmable prefetcher, using a set of a set of very small micro-controller-sized programmable prefetch units (PPUs) to cover a wide set of workloads. I do this by developing a highly parallel programming model that can be used to issue prefetches, thus allowing high-throughput prefetching with low power and area overheads of only around 3%, and a 3x geometric mean speedup for a variety of memory-bound applications. To facilitate its use, I then develop compiler techniques to help automate the process of targeting the programmable prefetcher. These provide a variety of tradeoffs from easiest to use to best performance.

Date

2018-02-06

Advisors

Jones, Timothy M.

Keywords

prefetching, computer architecture, compilers

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights

Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Sponsorship

This work was supported by the Engineering and Physical Sciences Research Council (EPSRC), through grant references EP/K026399/1 and EP/M506485/1, and ARM Ltd.

Collections

Theses - Computer Science and Technology