Software and Hardware Co-design for Efficient Neural Networks

Zhao, Yiren

Repository URI

https://www.repository.cam.ac.uk/handle/1810/338852

Repository DOI

https://doi.org/10.17863/CAM.86258

Files

Thesis (8.42 MB)

Type

Thesis

Authors

Zhao, Yiren

Abstract

Deep Neural Networks (DNNs) offer state-of-the-art performance in many domains, but this success comes at the cost of high computational and memory resources. Since DNN inference is now a popular workload on both edge and cloud systems, there is an urgent need to improve its energy efficiency. Improved efficiency enables the use of DNNs in a broader range of target markets and helps boost performance, as performance is often limited by power today. This thesis makes a number of contributions that help improve DNN inference performance on different hardware platforms using a hardware-software co-design approach.

I first show a number of software optimisation techniques for reducing DNN run-time costs. Overall, I demonstrate that model sizes can be reduced by up to 34×. Combining different styles of neural network compression techniques can offer multiplicative gains in shrinking the memory footprint. These techniques are suitable for running DNN inference on memory-sensitive edge devices. Using run-time and data-dependent feature information, I develop a dynamic pruning strategy that outperforms existing static pruning methods by a significant margin. The proposed dynamic pruning reduces not only model sizes but also the number of multiply-accumulate operations for GPUs. I also introduce a novel quantisation mechanism that is tuned to fit the natural distributions of model parameters; this method decreases the total number of bit-wise operations required for DNN inference.

I then focus on accelerating DNNs using custom hardware. I build a framework named Tomato that generates multi-precision and multi-arithmetic hardware accelerators on FPGA devices. The software-hardware co-generation flow deploys hardware accelerators from high-level neural network descriptions, and exploits the hardware reconfigurability of FPGA devices to support a flexible per-layer quantisation strategy. I then demonstrate that the automatically generated accelerators outperform their closest FPGA-based competitors by at least 2 to 4× for latency and throughput. The accelerator generated for ImageNet classification runs at a rate of more than 3000 frames per second with a latency of only 0.32 ms, making it a suitable candidate for latency-critical, high-throughput inference in the cloud.

Finally, I show how automated machine learning techniques can be improved with hardware-awareness to produce efficient network architectures for emerging types of neural networks and new learning problem setups. Hardware-aware network architecture search (NAS) is able to discover more power-efficient network architectures and achieve significant computational savings on emerging neural network types such as graph neural networks. The proposed Low Precision Graph Network Architecture Search improves the size-accuracy Pareto frontier when compared to seven manual and NAS-generated baselines on eight different graph datasets. In addition, I demonstrate that hardware-aware NAS can be applied to a many-task, many-device few-shot learning scenario. In popular few-shot learning benchmarks with various hardware platforms and constraints, the proposed approach outperforms a variety of NAS and manual baselines by a significant margin. On the 5-way 1-shot Mini-ImageNet classification task, the proposed method outperforms the best manual baseline by 5.21% in accuracy using 60% less computation.

Date

2022-06-13

Advisors

Mullins, Robert

Keywords

Machine Learning

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Sponsorship

Engineering and Physical Sciences Research Council (EPSRC) (1941039)

Collections

Theses - Computer Science and Technology