Automatic Speech Recognition for Irish: testing lexicons and language models

A range of lexicons and language models were tested in the development of ASR for Irish. One problem, common among minority languages, is the multiplicity of dialects, with no single spoken standard. To address this challenge, two alternative cross-dialect lexicons, both drawing on research in dialect phonology, were tested in a hybrid ASR system. First, individual lexicons were built for the three main dialects: Ulster (Ul), Connaught (Co) and Munster (Mu). From these, a Multi-dialect lexicon was built that incorporates all dialect-varying word forms. An alternative Global lexicon, essentially a trans-dialect lexicon, used abstract representations of dialect-varying forms (phoneme- or morpheme-sized units). These two cross-dialect lexicons were tested along with the three dialect-specific lexicons. Several different language models were also tested. The Global and Multi-dialect lexicons yielded the highest performance, with the Multi-dialect lexicon achieving the lowest overall word error rate (WER). There were considerable differences in results for the individual dialect lexicons: these may reflect a bias in the datasets used or could indicate the linguistic distance between the dialects; these competing hypotheses will need more rigorous testing. The choice of language model also had a strong effect on results. Error patterns showed frequent substitutions involving inflected forms.
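The WER figures referred to above follow the standard definition: the number of word substitutions, deletions and insertions needed to turn the recognised text into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that calculation, not the scoring tool used in the study; the function name `word_error_rate` and the example sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length.

    Illustrative sketch only; the function name and whitespace tokenisation
    are assumptions, not taken from the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one inflected form ("fir") recognised as another
# ("fear"), giving a single substitution in a four-word reference.
example_wer = word_error_rate("tá na fir anseo", "tá na fear anseo")
```

In this hypothetical example one of four reference words is substituted, so `example_wer` is 0.25. Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.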

Keywords

4703 Language Studies, 4704 Linguistics, 5204 Cognitive and Computational Psychology, 47 Language, Communication and Culture, 52 Psychology

Journal Title

2022 33rd Irish Signals and Systems Conference (ISSC)

Conference Name

2022 33rd Irish Signals and Systems Conference (ISSC)

Journal ISSN

2688-1446
2688-1454

Volume Title

00

Publisher

Institute of Electrical and Electronics Engineers (IEEE)

Publisher DOI

https://doi.org/10.1109/issc55427.2022.9826201

Rights and licensing

Collections

University of Cambridge Research Outputs (Articles and Conferences)