Waveform-based speaker representations for speech synthesis
Authors
Wan, M
Degottex, G
Gales, MJF
Publication Date
2018
Journal Title
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Conference Name
Interspeech 2018
ISSN
2308-457X
ISBN
978-1-5108-7221-9
Publisher
ISCA
Volume
2018-September
Pages
897-901
Type
Conference Object
Citation
Wan, M., Degottex, G., & Gales, M. (2018). Waveform-based speaker representations for speech synthesis. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2018-September, 897-901. https://doi.org/10.21437/Interspeech.2018-1154
Abstract
Speaker adaptation is a key aspect of building a range of speech processing systems, for example personalised speech synthesis. For deep-learning-based approaches, the model parameters are hard to interpret, making speaker adaptation more challenging. One widely used method to address this problem is to extract a fixed-length vector as the speaker representation, and use this as an additional input to the task-specific model. This allows speaker-specific output to be generated without modifying the model parameters. However, the speaker representation is often extracted in a task-independent fashion. This allows the same approach to be used for a range of tasks, but the extracted representation is unlikely to be optimal for the specific task of interest. Furthermore, the features from which the speaker representation is extracted are usually pre-defined, often a standard speech representation. This may limit the available information that can be used. In this paper, an integrated optimisation framework for building a task-specific speaker representation, making use of all the available information, is proposed. Speech synthesis is used as the example task. The speaker representation is derived from raw waveform, incorporating text information via an attention mechanism. This paper evaluates and compares this framework with standard task-independent forms.
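The abstract describes deriving a fixed-length speaker vector directly from the raw waveform, injecting text information through an attention mechanism, and feeding that vector to the synthesis model as an extra input. The PyTorch sketch below is only a rough illustration of that idea, not the paper's actual architecture: the WaveformSpeakerEncoder class, its layer sizes and all module choices are assumptions made for illustration.

# Minimal sketch (assumed architecture, not the paper's): map a raw waveform to a
# fixed-length speaker vector, using attention from text features over waveform frames.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveformSpeakerEncoder(nn.Module):
    def __init__(self, text_dim=256, spk_dim=64, hidden=128):
        super().__init__()
        # Strided 1-D convolutions turn the raw waveform into frame-level features.
        self.conv = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=400, stride=160, padding=200),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )
        # Text features act as attention queries over the waveform frames.
        self.query = nn.Linear(text_dim, hidden)
        self.proj = nn.Linear(hidden, spk_dim)

    def forward(self, waveform, text_feats):
        # waveform: (batch, samples); text_feats: (batch, text_len, text_dim)
        frames = self.conv(waveform.unsqueeze(1)).transpose(1, 2)   # (B, T, hidden)
        q = self.query(text_feats)                                  # (B, L, hidden)
        scores = torch.bmm(q, frames.transpose(1, 2))               # (B, L, T)
        attn = F.softmax(scores / frames.size(-1) ** 0.5, dim=-1)
        context = torch.bmm(attn, frames)                           # (B, L, hidden)
        # Average over text positions to obtain one fixed-length speaker vector.
        return self.proj(context.mean(dim=1))                       # (B, spk_dim)

enc = WaveformSpeakerEncoder()
wav = torch.randn(2, 16000)     # one second of 16 kHz audio (dummy data)
txt = torch.randn(2, 50, 256)   # 50 text frames of dimension 256 (dummy data)
print(enc(wav, txt).shape)      # torch.Size([2, 64])

In the integrated set-up the abstract describes, such an encoder would be trained jointly with the synthesis model, rather than extracting the speaker vector in a separate, task-independent step.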
Keywords
integrated, adaptation, speech synthesis, fixed length speaker representation, attention mechanism, waveform, vocoder
Sponsorship
EPSRC International Doctoral Scholarship, reference number 10348827;
St. John’s College Internal Graduate Scholarship; the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 655764; EPSRC grant EP/I031022/1 (Natural Speech Technology)
Funder references
EPSRC (1634918)
Identifiers
External DOI: https://doi.org/10.21437/Interspeech.2018-1154
This record's URL: https://www.repository.cam.ac.uk/handle/1810/282926
Rights
Licence:
http://www.rioxx.net/licenses/all-rights-reserved