Repository logo

Cross-dialect lexicon optimisation for an endangered language ASR system: the case of Irish

Accepted version


Conference Object

Change log


Lonergan, L 
Qian, M 
Chiaráin, NN 
Gobl, C 
Chasaide, AN 


Lexicon optimisation strategies, addressing the problem of dialect divergence, are tested in an ASR system for Irish. As in many endangered languages, Irish has no spoken standard, but rather, three very different dialects of Ulster (Ul), Connaught (Co) and Munster (Mu). Furthermore, the complex sound system and ancient, opaque writing system result in sound-to-grapheme mappings that differ considerably across dialects. A hybrid ASR system was trained on (predominantly) native speaker speech data, balanced across the dialects. Experiment 1 tested whether a Global lexicon, which captures dialect variant forms with relatively abstract representations, can perform as well as a Multi-dialect lexicon containing all dialect variants. Three dialect-specific lexicons were also included in the tests. The Global lexicon did yield the best performance and experiment 2 tested whether further reductions to its phoneset might further enhance its performance. These included (i) merging a Tense-Lax contrast among coronal sonorants, not common to all dialects, and (ii) merging the contrast of voiceless-voiced sonorants, as the voiceless member is relatively infrequent. Results showed but a slight enhancement and only for Mu dialect, which is the one most aligned to the phoneset reduction.



Irish, speech recognition, cross-dialect variation, lexicon, minority language

Journal Title

Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

Conference Name

Interspeech 2022

Journal ISSN


Volume Title