Show simple item record

dc.contributor.author: Ainsworth, Sam
dc.date.accessioned: 2018-07-04T13:32:19Z
dc.date.available: 2018-07-04T13:32:19Z
dc.date.issued: 2018-07-21
dc.date.submitted: 2018-02-06
dc.identifier.uri: https://www.repository.cam.ac.uk/handle/1810/277804
dc.description.abstract: Modern-day workloads, particularly those in big data, are heavily memory-latency bound. This is because of both irregular memory accesses, which have no discernible pattern in their memory addresses, and large data sets that cannot fit in any cache. However, this need not be a barrier to high performance. With some data-structure knowledge it is typically possible to bring data into the fast on-chip memory caches early, so that it is already available by the time it needs to be accessed. This thesis makes three contributions. I first contribute an automated software prefetching compiler technique to insert high-performance prefetches into program code to bring data into the cache early, achieving 1.3x geometric mean speedup on the most complex processors and 2.7x on the simplest. I also provide an analysis of when and why this is likely to be successful, which data structures to target, and how to schedule software prefetches well. Then I introduce a hardware solution, the configurable graph prefetcher. This uses the example of breadth-first search on graph workloads to motivate how a hardware prefetcher armed with data-structure knowledge can avoid the instruction overheads, inflexibility and limited latency tolerance of software prefetching. The configurable graph prefetcher sits at the L1 cache, observes memory accesses, and can be configured by a programmer to recognise a limited number of different data access patterns; it achieves 2.3x geometric mean speedup on graph workloads on an out-of-order core. My final contribution extends the hardware used for the configurable graph prefetcher into an event-triggered programmable prefetcher, using a set of very small micro-controller-sized programmable prefetch units (PPUs) to cover a wide set of workloads. I do this by developing a highly parallel programming model that can be used to issue prefetches, allowing high-throughput prefetching with power and area overheads of only around 3%, and a 3x geometric mean speedup for a variety of memory-bound applications. To facilitate its use, I then develop compiler techniques to help automate the process of targeting the programmable prefetcher. These provide a range of trade-offs between ease of use and performance.
dc.description.sponsorship: This work was supported by the Engineering and Physical Sciences Research Council (EPSRC), through grant references EP/K026399/1 and EP/M506485/1, and ARM Ltd.
dc.language.iso: en
dc.rights: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri: https://creativecommons.org/licenses/by-sa/4.0/
dc.subject: prefetching
dc.subject: computer architecture
dc.subject: compilers
dc.title: Prefetching for complex memory access patterns
dc.type: Thesis
dc.type.qualificationlevel: Doctoral
dc.type.qualificationname: Doctor of Philosophy (PhD)
dc.publisher.institution: University of Cambridge
dc.publisher.department: Computer Science and Technology
dc.date.updated: 2018-07-04T09:58:38Z
dc.identifier.doi: 10.17863/CAM.25143
dc.contributor.orcid: Ainsworth, Sam [0000-0002-3726-0055]
dc.publisher.college: Churchill College
dc.type.qualificationtitle: PhD in Computer Science
cam.supervisor: Jones, Timothy M.
cam.thesis.funding: true
rioxxterms.freetoread.startdate: 2018-07-04
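The software-prefetching contribution described in the abstract targets irregular, indirect accesses such as A[B[i]], whose addresses a stride-based hardware prefetcher cannot predict. The following is a minimal illustrative sketch (not code from the thesis itself) of the general idea: with knowledge of the data structure, a prefetch for a fixed look-ahead distance D ahead of the current iteration can be inserted, so the data arrives in cache before it is demanded. The function name `sum_indirect` and the distance D = 64 are assumptions for illustration only; the thesis discusses how to schedule such prefetches well.

```c
#include <stddef.h>

/* Illustrative sketch of compiler-inserted software prefetching for an
 * indirect access pattern A[B[i]]. The look-ahead distance D is a tuning
 * parameter chosen here arbitrarily; real schedules depend on the core. */
long sum_indirect(const long *A, const int *B, size_t n) {
    const size_t D = 64;          /* look-ahead distance (assumed value) */
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + D < n) {
            /* GCC/Clang builtin: hint the cache to fetch A[B[i + D]]
             * for reading (0) with high temporal locality (3). */
            __builtin_prefetch(&A[B[i + D]], 0, 3);
        }
        sum += A[B[i]];           /* the demand access the prefetch covers */
    }
    return sum;
}
```

The prefetch is a hint only; correctness is unchanged whether or not the line is in cache, which is what makes automated insertion by a compiler safe.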

