
dc.contributor.author: Wan, Moquan
dc.contributor.author: Degottex, Gilles
dc.contributor.author: Gales, Mark JF
dc.contributor.author: IEEE
dc.date.accessioned: 2018-09-26T10:19:38Z
dc.date.available: 2018-09-26T10:19:38Z
dc.date.issued: 2017-12-16
dc.identifier.isbn: 9781509047888
dc.identifier.uri: https://www.repository.cam.ac.uk/handle/1810/280719
dc.description.abstract: Enabling speech synthesis systems to rapidly adapt to sound like a particular speaker is an essential attribute for building personalised systems. For deep-learning-based approaches, this is difficult as these networks use a highly distributed representation. It is not simple to interpret the model parameters, which complicates the adaptation process. To address this problem, speaker characteristics can be encapsulated in fixed-length speaker-specific Identity Vectors (iVectors), which are appended to the input of the synthesis network. Altering the iVector changes the nature of the synthesised speech. The challenge is to derive an optimal iVector for each speaker that encodes all the speaker attributes required for the synthesis system. The standard approach involves two separate stages: estimation of the iVectors for the training data, and training of the synthesis network. This paper proposes an integrated training scheme for speaker-adaptive speech synthesis. For the iVector extraction, an attention-based mechanism, which is a function of the context labels, is used to combine the data from the target speaker. This attention mechanism, as well as the nature of the features being merged, is optimised at the same time as the synthesis network parameters. This should yield an iVector-like speaker representation that is optimal for use with the synthesis system. The system is evaluated on the Voice Bank corpus. The resulting system automatically provides a sensible attention sequence and shows improved performance over the standard approach.
dc.description.sponsorship: St. John’s College Internal Graduate Scholarship; European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 655764; EPSRC grant EP/I031022/1 (Natural Speech Technology)
dc.publisher: IEEE
dc.subject: speech synthesis
dc.subject: iVector
dc.subject: integrated
dc.subject: adaptation
dc.subject: attention mechanism
dc.title: Integrated speaker-adaptive speech synthesis
dc.type: Conference Object
prism.endingPage: 711
prism.publicationDate: 2017
prism.publicationName: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
prism.startingPage: 705
dc.identifier.doi: 10.17863/CAM.28083
dcterms.dateAccepted: 2017-08-31
rioxxterms.versionofrecord: 10.1109/ASRU.2017.8269006
rioxxterms.version: AM
rioxxterms.licenseref.uri: http://www.rioxx.net/licenses/all-rights-reserved
rioxxterms.licenseref.startdate: 2017-12-16
dc.contributor.orcid: Gales, Mark [0000-0002-5311-8219]
rioxxterms.type: Conference Paper/Proceeding/Abstract
pubs.funder-project-id: EPSRC (1634918)
pubs.funder-project-id: European Commission Horizon 2020 (H2020) Marie Skłodowska-Curie actions (655764)
cam.issuedOnline: 2018-01-25
pubs.conference-name: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
pubs.conference-start-date: 2017-12-16
pubs.conference-finish-date: 2017-12-20
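
The abstract above describes the paper's central mechanism: frame-level data from the target speaker is merged into a fixed-length, iVector-like vector using attention weights that are a function of the context labels, and this pooling is trained jointly with the synthesis network. Below is a minimal NumPy sketch of that style of context-driven attention pooling; the function name attention_pool, the parameters W_att and w_score, and all shapes are illustrative assumptions, not the authors' implementation.

import numpy as np

def attention_pool(frames, context_labels, W_att, w_score):
    """Pool a speaker's frame-level features into one fixed-length,
    iVector-like vector, weighting frames by attention scores computed
    from the linguistic context labels (illustrative names and shapes).

    frames:         (T, D) frame-level features to be merged
    context_labels: (T, C) context-label vectors, one per frame
    W_att:          (C, H) learned projection of the context labels
    w_score:        (H,)   learned scoring vector
    """
    # Score each frame as a function of its context labels only,
    # matching the abstract's description of the attention input.
    scores = np.tanh(context_labels @ W_att) @ w_score   # (T,)
    # Softmax over frames yields the attention sequence.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # The weighted sum is the fixed-length speaker representation that
    # would be appended to the synthesis network's input; training the
    # whole system end to end lets gradients reach W_att and w_score.
    return weights @ frames                              # (D,)

# Toy usage: 200 frames, 40-dim features, 120-dim context labels.
rng = np.random.default_rng(0)
T, D, C, H = 200, 40, 120, 32
speaker_vector = attention_pool(
    rng.standard_normal((T, D)),
    rng.standard_normal((T, C)),
    rng.standard_normal((C, H)) * 0.1,
    rng.standard_normal(H) * 0.1,
)
assert speaker_vector.shape == (D,)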

