Waveform-based speaker representations for speech synthesis
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Wan, M., Degottex, G., & Gales, M. (2018). Waveform-based speaker representations for speech synthesis. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2018-September, 897–901. https://doi.org/10.21437/Interspeech.2018-1154
Speaker adaptation is a key aspect of building a range of speech processing systems, for example personalised speech synthesis. For deep-learning-based approaches, the model parameters are hard to interpret, making speaker adaptation more challenging. One widely used method to address this problem is to extract a fixed-length vector as a speaker representation, and use this as an additional input to the task-specific model. This allows speaker-specific output to be generated without modifying the model parameters. However, the speaker representation is often extracted in a task-independent fashion. This allows the same approach to be used for a range of tasks, but the extracted representation is unlikely to be optimal for the specific task of interest. Furthermore, the features from which the speaker representation is extracted are usually pre-defined, often a standard speech representation. This may limit the available information that can be used. In this paper, an integrated optimisation framework for building a task-specific speaker representation, making use of all the available information, is proposed. Speech synthesis is used as the example task. The speaker representation is derived from the raw waveform, incorporating text information via an attention mechanism. This paper evaluates and compares this framework with standard task-independent forms.
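The paper's learned networks are not reproduced here; purely as an illustration of the abstract's core idea, the sketch below shows how a text-derived query can attend over frame-level features computed from a raw waveform, with the attention-weighted sum yielding a fixed-length speaker vector regardless of utterance length. The frame encoder and text query are stood in for by random projections; all function names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def frame_features(waveform, frame_len=160, dim=8):
    # Stand-in for a learned waveform encoder: chop the waveform into
    # fixed-length frames and linearly project each frame to `dim` features.
    n_frames = len(waveform) // frame_len
    frames = waveform[: n_frames * frame_len].reshape(n_frames, frame_len)
    proj = rng.standard_normal((frame_len, dim)) / np.sqrt(frame_len)
    return frames @ proj                        # shape (n_frames, dim)

def attention_pool(feats, query):
    # The text-derived query scores each frame; a softmax over the scores
    # gives attention weights, and the weighted sum is a fixed-length
    # vector whatever the number of frames.
    scores = feats @ query                      # shape (n_frames,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over frames
    return weights @ feats                      # shape (dim,)

waveform = rng.standard_normal(16000)           # 1 s of audio at 16 kHz
query = rng.standard_normal(8)                  # stand-in text embedding
speaker_vec = attention_pool(frame_features(waveform), query)
print(speaker_vec.shape)                        # fixed-length: (8,)
```

Because the pooled vector's size is independent of the waveform's duration, it can be concatenated to the synthesis model's inputs without changing that model's parameters, which is the adaptation mechanism the abstract describes.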
EPSRC International Doctoral Scholarship, reference number 10348827; St. John's College Internal Graduate Scholarship; the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 655764; EPSRC grant EP/I031022/1 (Natural Speech Technology)
External DOI: https://doi.org/10.21437/Interspeech.2018-1154
This record's URL: https://www.repository.cam.ac.uk/handle/1810/282926