Reducing the burden of parallel loop schedulers for many-core processors

Summary

As core counts in processors increase, it becomes harder to schedule and distribute work in a timely and scalable manner. This article enhances the scalability of parallel loop schedulers by specializing schedulers for fine‐grain loops. We propose a low‐overhead work distribution mechanism for a static scheduler that uses no atomic operations. We integrate our static scheduler with the Intel OpenMP and Cilkplus parallel task schedulers to build hybrid schedulers. Compiler support enables efficient reductions for Cilk, without changing the programming interface of Cilk reducers. Detailed, quantitative measurements demonstrate that our techniques achieve scalable performance on a 48‐core machine and that the scheduling overhead is 43% lower than Intel OpenMP and 12.1× lower than Cilk. We demonstrate consistent performance improvements on a range of HPC and data analytics codes. Performance gains become more important as loops become finer‐grain and thread counts increase. We consistently observe 16%–30% speedup on 48 threads, with a peak of 2.8× speedup.
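The general idea of static loop scheduling without atomic operations can be illustrated with a minimal sketch. This is not the article's mechanism, which the summary does not describe in detail; it only shows the textbook technique in which each thread derives its own iteration range from its thread ID, so no shared counter is needed, and in which a reduction is formed from per-thread partial results. The names `static_chunk` and `parallel_sum` are hypothetical and chosen for illustration only.

```python
from threading import Thread


def static_chunk(thread_id, num_threads, num_iters):
    """Return the half-open range [begin, end) of iterations owned by thread_id.

    Each thread computes its range from its own ID alone, so threads never
    contend on a shared loop counter (hypothetical helper for illustration).
    """
    base, extra = divmod(num_iters, num_threads)
    # The first `extra` threads each take one additional iteration.
    begin = thread_id * base + min(thread_id, extra)
    end = begin + base + (1 if thread_id < extra else 0)
    return begin, end


def parallel_sum(values, num_threads):
    """Sum `values` using a static block partition and per-thread partial sums."""
    partial = [0] * num_threads  # one private slot per thread

    def worker(tid):
        begin, end = static_chunk(tid, num_threads, len(values))
        acc = 0
        for i in range(begin, end):
            acc += values[i]
        partial[tid] = acc  # each thread writes only its own slot

    threads = [Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The final combination happens serially after all threads have joined.
    return sum(partial)
```

For example, `static_chunk(1, 4, 10)` assigns iterations 3 to 5 to thread 1, and `parallel_sum(list(range(100)), 4)` combines four partial sums into the full total. A production scheduler in OpenMP or Cilk would differ substantially, in particular in how threads are woken and synchronized.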

Description

Funder: FP7 People: Marie‐Curie Actions; Id: http://dx.doi.org/10.13039/100011264; Grant(s): 327744

Keywords

parallel computing, shared‐memory synchronization

Journal Title

Concurrency and Computation: Practice and Experience

Journal ISSN

1532-0626
1532-0634

Volume Title

33

Publisher

Wiley

Publisher DOI

https://doi.org/10.1002/cpe.6241

Rights

Attribution 4.0 International

Sponsorship

Engineering and Physical Sciences Research Council (EP/L027402/1, EP/M008495/1)
FP7 Information and Communication Technologies (619706)
H2020 Future and Emerging Technologies (732631)

Collections

Jisc Publications Router