Enabling On-Device Smartphone GPU based Training: Lessons Learned
Publication Date
2022
Journal Title
2022 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events, PerCom Workshops 2022
Conference Name
2022 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops)
ISBN
9781665416474
Publisher
IEEE
Volume
00
Pages
533-538
Type
Article
This Version
AM (Accepted Manuscript)
Citation
Das, A., Kwon, Y. D., Chauhan, J., & Mascolo, C. (2022). Enabling On-Device Smartphone GPU based Training: Lessons Learned. 2022 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events, PerCom Workshops 2022, 00, 533-538. https://doi.org/10.1109/PerComWorkshops53856.2022.9767442
Abstract
Deep Learning (DL) has shown impressive performance in many mobile applications. Most existing work has focused on reducing the computational and resource overheads of running Deep Neural Network (DNN) inference on resource-constrained mobile devices. However, the other aspect of DNN operations, i.e., training (forward and backward passes) on smartphone GPUs, has received little attention thus far. To this end, we conduct an initial analysis to examine the feasibility of on-device training on smartphones using mobile GPUs. We first employ the open-source mobile DL framework MNN and its OpenCL backend for running compute kernels on GPUs. Next, we observe that training on CPUs is much faster than on GPUs and identify two possible bottlenecks behind this observation: (i) computation and (ii) memory. To address the computation bottleneck, we optimize the OpenCL backend's kernels, showing 2x improvements (40-70 GFLOPS) over CPUs (15-30 GFLOPS) on Snapdragon 8 series processors. However, we find that full DNN training is still much slower on GPUs than on CPUs, indicating that the memory bottleneck plays a significant role in the GPU's lower performance. Data movement takes almost 91% of the training time due to low bandwidth. Lastly, based on the findings and failures encountered during our investigation, we present limitations and practical guidelines for future directions.
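The computation-bottleneck numbers in the abstract come from profiling individual compute kernels. As a rough illustration of how such GFLOPS figures can be obtained on a mobile GPU, here is a minimal, self-contained OpenCL host sketch that times a naive matrix-multiply kernel via event profiling; the kernel, matrix size, and omitted error handling are illustrative assumptions, not MNN's actual optimized implementation.

```cpp
// Minimal sketch (not MNN's code): estimate achieved GFLOPS of one OpenCL
// kernel on the GPU using event profiling timestamps.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

static const char* kSrc = R"CLC(
__kernel void matmul_naive(__global const float* A, __global const float* B,
                           __global float* C, const int N) {
    int row = get_global_id(1);
    int col = get_global_id(0);
    float acc = 0.0f;
    for (int k = 0; k < N; ++k)
        acc += A[row * N + k] * B[k * N + col];
    C[row * N + col] = acc;
}
)CLC";

int main() {
    const int N = 1024;  // N x N matrices; illustrative size
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, nullptr);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
    // Profiling must be enabled to read start/end timestamps from events.
    cl_command_queue q =
        clCreateateCommandQueue == nullptr ? nullptr : nullptr;  // placeholder removed below
    q = clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, nullptr);

    std::vector<float> host(N * N, 1.0f);
    size_t bytes = sizeof(float) * N * N;
    cl_mem A = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, host.data(), nullptr);
    cl_mem B = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, host.data(), nullptr);
    cl_mem C = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, nullptr, nullptr);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
    clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "matmul_naive", nullptr);
    clSetKernelArg(k, 0, sizeof(cl_mem), &A);
    clSetKernelArg(k, 1, sizeof(cl_mem), &B);
    clSetKernelArg(k, 2, sizeof(cl_mem), &C);
    clSetKernelArg(k, 3, sizeof(int), &N);

    size_t gws[2] = {(size_t)N, (size_t)N};
    cl_event ev;
    clEnqueueNDRangeKernel(q, k, 2, nullptr, gws, nullptr, 0, nullptr, &ev);
    clWaitForEvents(1, &ev);

    cl_ulong t0, t1;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, nullptr);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, nullptr);
    double sec = (t1 - t0) * 1e-9;   // timestamps are in nanoseconds
    double flops = 2.0 * N * N * N;  // one mul + one add per inner-loop step
    printf("kernel time %.3f ms, %.1f GFLOPS\n", sec * 1e3, flops / sec * 1e-9);
    return 0;
}
```

With N = 1024 the kernel performs 2N^3 ≈ 2.1 GFLOP, so a device sustaining 40 GFLOPS would finish in roughly 54 ms.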
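The 91% data-movement figure implies measuring copies and compute separately. Below is a sketch of that measurement, continuing the setup above (it assumes q, k, A, C, bytes, host, and gws remain in scope inside main(), with eventMs placed at file scope); event timestamps on the write, kernel, and read commands show what fraction of one step is spent moving data.

```cpp
// Helper (file scope): elapsed milliseconds of a profiled OpenCL command.
static double eventMs(cl_event ev) {
    cl_ulong t0 = 0, t1 = 0;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, nullptr);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, nullptr);
    return (t1 - t0) * 1e-6;
}

// Inside main(), after the setup from the previous sketch:
cl_event up, run, down;
clEnqueueWriteBuffer(q, A, CL_FALSE, 0, bytes, host.data(), 0, nullptr, &up);  // host -> device
clEnqueueNDRangeKernel(q, k, 2, nullptr, gws, nullptr, 1, &up, &run);          // compute
clEnqueueReadBuffer(q, C, CL_FALSE, 0, bytes, host.data(), 1, &run, &down);    // device -> host
clWaitForEvents(1, &down);
double copy = eventMs(up) + eventMs(down);
double compute = eventMs(run);
printf("data movement: %.1f%% of the step\n", 100.0 * copy / (copy + compute));
```

On mobile SoCs, where the CPU and GPU share physical memory, allocating buffers with CL_MEM_ALLOC_HOST_PTR and accessing them via clEnqueueMapBuffer is a commonly used zero-copy technique to reduce this traffic; this is a general OpenCL practice, not necessarily the specific remedy the authors propose.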
Keywords
cs.LG, cs.AR
Identifiers
This record's URL: https://www.repository.cam.ac.uk/handle/1810/338356