Highly Intelligible Speaker-Independent Articulatory Synthesis

Accepted version
Peer-reviewed

Abstract

An articulatory synthesiser which could accurately map vocal tract features to speech would enable novel evaluation of acoustic-to-articulatory inversion models beyond the small, typically monolingual, articulatory datasets available. However, current deep articulatory synthesisers and physical simulation-based synthesisers struggle to produce consistently intelligible speech, with Word Error Rates (WER) of around 20% for real or hand-crafted articulatory input. Additionally, deep learning methods have often only achieved this level of intelligibility when training and evaluating on the same speaker (speaker-dependent training). In this paper, we create a highly intelligible (WER ∼7% for real data and ∼10% for synthetic), speaker-independent articulatory synthesiser by training a deep synthesiser on a combination of high-quality real data and synthetic data generated by inversion. We then perform a multilingual evaluation of the joint inversion-synthesis system.

Journal Title

Interspeech 2024

Conference Name

Interspeech 2024

Journal ISSN

2308-457X
1990-9772

Publisher

International Speech Communication Association

Rights and licensing

Except where otherwise noted, this item's license is described as All Rights Reserved

Sponsorship

Cambridge Assessment (Unknown)
Supported by the Automated Language Teaching and Assessment (ALTA) group sponsored by Cambridge University Press and Assessment