Low-resource Multi-task Audio Sensing for Mobile and Embedded Devices via Shared Deep Neural Network Representations

Georgiev, P; Bhattacharya, S; Lane, N; Mascolo, C

Accepted version

Peer-reviewed

Repository URI

https://www.repository.cam.ac.uk/handle/1810/277059

Repository DOI

https://doi.org/10.17863/CAM.12234

Files

Accepted version (4.83 MB)

Type

Article

Authors

Georgiev, P

Bhattacharya, S

Lane, N

Mascolo, Cecilia

https://orcid.org/0000-0001-9614-4380

Abstract

Continuous audio analysis from embedded and mobile devices is an increasingly important application domain. More and more, appliances like the Amazon Echo, along with smartphones and watches, and even research prototypes seek to perform multiple discriminative tasks simultaneously from ambient audio; for example, monitoring background sound classes (e.g., music or conversation), recognizing certain keywords (‘Hey Siri’ or ‘Alexa’), or identifying the user and her emotion from speech. The use of deep learning algorithms typically provides state-of-the-art model performances for such general audio tasks. However, the large computational demands of deep learning models are at odds with the limited processing, energy and memory resources of mobile, embedded and IoT devices. In this paper, we propose and evaluate a novel deep learning modeling and optimization framework that specifically targets this category of embedded audio sensing tasks. Although the supported tasks are simpler than the task of speech recognition, this framework aims at maintaining accuracies in predictions while minimizing the overall processor resource footprint. The proposed model is grounded in multi-task learning principles to train shared deep layers and exploits, as input layer, only statistical summaries of audio filter banks to further lower computations. We find that for embedded audio sensing tasks our framework is able to maintain similar accuracies, which are observed in comparable deep architectures that use single-task learning and typically more complex input layers. Most importantly, on an average, this approach provides almost a 2.1× reduction in runtime, energy, and memory for four separate audio sensing tasks, assuming a variety of task combinations.
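The two ideas named in the abstract, a summarized filter-bank input and hidden layers shared across tasks, can be illustrated with a minimal sketch. This is not the authors' implementation: the choice of mean and standard deviation as the summary statistics, the layer sizes, the task names, the class counts, and every function and class name below are assumptions made for illustration, and the weights are random rather than trained. Only the forward pass is shown.

```python
import math
import random


def summarize(frames):
    """Collapse a window of filter-bank frames into one fixed-size vector.

    Assumption: the summary is the per-coefficient mean and standard
    deviation; the abstract does not say which statistics are used.
    """
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dims)
    ]
    return means + stds


def make_layer(n_in, n_out, rng):
    """Create a dense layer with small random weights and zero biases."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    """Apply an affine transform: one weighted sum plus bias per output unit."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    peak = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


class SharedMultiTaskNet:
    """Hidden layers shared by all tasks, plus one small output layer per task."""

    def __init__(self, n_in, hidden_sizes, task_classes, seed=0):
        rng = random.Random(seed)
        self.shared = []
        prev = n_in
        for size in hidden_sizes:
            self.shared.append(make_layer(prev, size, rng))
            prev = size
        # One output head per task; all heads read the same shared representation.
        self.heads = {
            task: make_layer(prev, n_classes, rng)
            for task, n_classes in task_classes.items()
        }

    def predict(self, features):
        h = features
        for layer in self.shared:  # computed once, reused by every task
            h = relu(dense(h, layer))
        return {task: softmax(dense(h, head)) for task, head in self.heads.items()}


# Hypothetical usage: 50 frames of 24 filter-bank coefficients (synthetic values).
rng = random.Random(1)
window = [[rng.random() for _ in range(24)] for _ in range(50)]
net = SharedMultiTaskNet(48, [64, 64], {"ambient_sound": 4, "speaker": 3})
predictions = net.predict(summarize(window))
```

In this sketch the shared layers are evaluated once per audio window regardless of how many tasks are attached, so each additional task costs only its own output layer. That is the general mechanism by which a shared representation can lower runtime and memory compared with running one full network per task.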

Keywords

46 Information and Computing Sciences, 4603 Computer Vision and Multimedia Computation, 4605 Data Management and Data Science, 4608 Human-Centred Computing, 4611 Machine Learning, Clinical Research, Behavioral and Social Science

Journal Title

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)

Journal ISSN

2474-9567

Volume Title

1

Publisher

Association for Computing Machinery

Publisher DOI

https://doi.org/10.1145/3131895

Rights

http://www.rioxx.net/licenses/all-rights-reserved

Sponsorship

Microsoft Research

Collections

Scholarly Works - Engineering