
dc.contributor.author: Wan, Moquan
dc.contributor.author: Degottex, G
dc.contributor.author: Gales, Mark
dc.date.accessioned: 2018-09-29T06:08:43Z
dc.date.available: 2018-09-29T06:08:43Z
dc.date.issued: 2018
dc.identifier.isbn: 978-1-5108-7221-9
dc.identifier.issn: 2308-457X
dc.identifier.uri: https://www.repository.cam.ac.uk/handle/1810/282926
dc.description.abstract: Speaker adaptation is a key aspect of building a range of speech processing systems, for example personalised speech synthesis. For deep-learning-based approaches, the model parameters are hard to interpret, making speaker adaptation more challenging. One widely used method to address this problem is to extract a fixed-length vector as a speaker representation, and use this as an additional input to the task-specific model. This allows speaker-specific output to be generated without modifying the model parameters. However, the speaker representation is often extracted in a task-independent fashion. This allows the same approach to be used for a range of tasks, but the extracted representation is unlikely to be optimal for the specific task of interest. Furthermore, the features from which the speaker representation is extracted are usually pre-defined, often a standard speech representation. This may limit the available information that can be used. In this paper, an integrated optimisation framework for building a task-specific speaker representation, making use of all the available information, is proposed. Speech synthesis is used as the example task. The speaker representation is derived from the raw waveform, incorporating text information via an attention mechanism. This paper evaluates and compares this framework with standard task-independent forms.
dc.description.sponsorship: EPSRC International Doctoral Scholarship, reference number 10348827; St. John’s College Internal Graduate Scholarship; the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 655764; EPSRC grant EP/I031022/1 (Natural Speech Technology)
dc.publisher: ISCA
dc.title: Waveform-based speaker representations for speech synthesis
dc.type: Conference Object
prism.endingPage: 901
prism.publicationDate: 2018
prism.publicationName: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
prism.startingPage: 897
prism.volume: 2018-September
dc.identifier.doi: 10.17863/CAM.30289
dcterms.dateAccepted: 2018-06-03
rioxxterms.versionofrecord: 10.21437/Interspeech.2018-1154
rioxxterms.licenseref.uri: http://www.rioxx.net/licenses/all-rights-reserved
rioxxterms.licenseref.startdate: 2018-01-01
dc.contributor.orcid: Gales, Mark [0000-0002-5311-8219]
dc.identifier.eissn: 1990-9772
rioxxterms.type: Conference Paper/Proceeding/Abstract
pubs.funder-project-id: EPSRC (1634918)
cam.issuedOnline: 2018-09-02
pubs.conference-name: Interspeech 2018
pubs.conference-start-date: 2018-09-02
pubs.conference-finish-date: 2018-09-06
rioxxterms.freetoread.startdate: 2019-09-29
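
Illustrative sketch: the abstract describes deriving a fixed-length speaker vector directly from the raw waveform, using text information to drive an attention mechanism, and feeding that vector as an additional input to the synthesis model so the whole pipeline can be optimised jointly. The outline below is not the authors' implementation (see the version of record, 10.21437/Interspeech.2018-1154, for the actual architecture); the convolutional frame encoder, dot-product attention, layer sizes, and module names are all assumptions made purely to illustrate the general idea.

```python
# Hypothetical sketch of a waveform-based, text-conditioned speaker encoder
# feeding a task-specific synthesis model. Not the paper's architecture.
import torch
import torch.nn as nn


class WaveformSpeakerEncoder(nn.Module):
    """Maps raw waveform to a fixed-length speaker vector; attention weights
    over waveform frames are computed from text-derived features (assumed)."""

    def __init__(self, emb_dim=64, text_dim=128):
        super().__init__()
        # Strided 1-D convolutions turn the raw waveform into frame embeddings.
        self.frame_encoder = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=400, stride=160), nn.ReLU(),
            nn.Conv1d(64, emb_dim, kernel_size=5, stride=2), nn.ReLU(),
        )
        # Text features act as the attention query over waveform frames.
        self.query = nn.Linear(text_dim, emb_dim)

    def forward(self, waveform, text_feats):
        # waveform: (batch, samples); text_feats: (batch, text_dim)
        frames = self.frame_encoder(waveform.unsqueeze(1))       # (B, emb, T)
        q = self.query(text_feats).unsqueeze(2)                  # (B, emb, 1)
        scores = (frames * q).sum(dim=1)                         # (B, T)
        weights = torch.softmax(scores, dim=1).unsqueeze(1)      # (B, 1, T)
        return (frames * weights).sum(dim=2)                     # (B, emb)


class SynthesisModel(nn.Module):
    """Task-specific acoustic model consuming linguistic features plus the
    speaker vector; encoder and acoustic model are trained jointly."""

    def __init__(self, ling_dim=300, emb_dim=64, out_dim=80):
        super().__init__()
        self.speaker_encoder = WaveformSpeakerEncoder(emb_dim=emb_dim)
        self.acoustic_model = nn.Sequential(
            nn.Linear(ling_dim + emb_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, ling_feats, adapt_waveform, adapt_text_feats):
        # Speaker vector from adaptation data, then appended to the
        # (here per-frame, utterance-level for simplicity) linguistic input.
        spk = self.speaker_encoder(adapt_waveform, adapt_text_feats)
        return self.acoustic_model(torch.cat([ling_feats, spk], dim=-1))


if __name__ == "__main__":
    model = SynthesisModel()
    out = model(torch.randn(2, 300),     # linguistic features
                torch.randn(2, 16000),   # 1 s of 16 kHz adaptation waveform
                torch.randn(2, 128))     # text features for the attention query
    print(out.shape)  # torch.Size([2, 80])
```

Because the speaker encoder sits inside the synthesis model, its gradients flow from the synthesis loss, which is the sense in which the representation becomes task-specific rather than being extracted by a separate, task-independent system.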

