Highly Intelligible Speaker-Independent Articulatory Synthesis

An articulatory synthesiser that could accurately map vocal tract features to speech would enable novel evaluation of acoustic-to-articulatory inversion models beyond the small, typically monolingual, articulatory datasets available. However, current deep articulatory synthesisers and physical simulation-based synthesisers struggle to produce consistently intelligible speech, with Word Error Rates (WER) of around 20% for real or hand-crafted articulatory input. Additionally, deep learning methods have often achieved this level of intelligibility only when trained and evaluated on the same speaker (speaker-dependent training). In this paper, we create a highly intelligible (WER ∼7% for real data and ∼10% for synthetic), speaker-independent articulatory synthesiser by training a deep synthesiser on a combination of high-quality real data and synthetic data generated by inversion. We then perform a multilingual evaluation of the joint inversion-synthesis system.
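
The intelligibility figures above are reported as WER. As a minimal sketch of the standard definition only (word-level edit distance between a reference transcript and a recognised transcript, divided by the number of reference words), the metric could be computed as follows. The function name `word_error_rate` and the whitespace tokenisation are assumptions made for illustration; this record does not state which recogniser or text normalisation the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenises on whitespace and applies no text
    normalisation, which may differ from the paper's evaluation setup.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

Under this definition, a WER of around 7% corresponds to roughly seven word-level errors (substitutions, deletions, or insertions) per hundred reference words.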

Journal Title

Interspeech 2024

Conference Name

Interspeech 2024

Journal ISSN

2308-457X
1990-9772

Publisher

International Speech Communication Association

Publisher DOI

https://doi.org/10.21437/interspeech.2024-1160

Sponsorship

Cambridge Assessment (Unknown)

Supported by the Automated Language Teaching and Assessment (ALTA) group sponsored by Cambridge University Press and Assessment

Collections

University of Cambridge Research Outputs (Articles and Conferences)