Automatic Speech Recognition for Irish: testing lexicons and language models

Accepted version
Peer-reviewed

Abstract

A range of lexicons and language models were tested in the development of automatic speech recognition (ASR) for Irish. One problem, common among minority languages, is the multiplicity of dialects, with no single spoken standard. To address this challenge, two alternative cross-dialect lexicons, which draw on research in dialect phonology, were tested in a hybrid ASR system. First, individual lexicons were built for the three main dialects of Ulster (Ul), Connaught (Co) and Munster (Mu). From these, a Multi-dialect lexicon was built that incorporates all dialect-varying word forms. An alternative Global lexicon, essentially a trans-dialect lexicon, uses abstract representations (phoneme- or morpheme-sized units) of dialect-varying forms. These two cross-dialect lexicons were tested along with the three dialect-specific lexicons. Several different language models were also tested. The Global and Multi-dialect lexicons yielded the highest performance, with the latter giving the lowest overall word error rate (WER). There were considerable differences in results for the individual dialect lexicons: these may reflect a bias in the datasets used or may indicate the linguistic distance between the dialects - competing hypotheses that will need more rigorous testing. The choice of language model had a strong effect on results. Error patterns show frequent substitutions involving inflected forms.
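As a rough illustration of two ideas in the abstract, the sketch below merges per-dialect lexicons into a single multi-dialect lexicon and computes WER with the standard word-level edit-distance definition. It is not the authors' implementation: the function names `merge_lexicons` and `wer`, the dictionary layout, and the placeholder words and phone strings (`w1`, `a b`, and so on) are all assumptions made for the example and do not come from the paper.

```python
from collections import defaultdict


def merge_lexicons(*lexicons):
    """Union the pronunciation variants of each word across lexicons.

    Each lexicon maps a word to a list of pronunciation strings.
    Variants are kept in first-seen order and duplicates are dropped.
    """
    merged = defaultdict(list)
    for lexicon in lexicons:
        for word, pronunciations in lexicon.items():
            for pron in pronunciations:
                if pron not in merged[word]:
                    merged[word].append(pron)
    return dict(merged)


def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical placeholder entries, not real Irish words or transcriptions.
ul = {"w1": ["a b"], "w2": ["c d"]}
co = {"w1": ["a b"], "w2": ["c e"]}
mu = {"w1": ["a f"]}

multi_dialect = merge_lexicons(ul, co, mu)
error_rate = wer("a b c d", "a x c")  # one substitution, one deletion
```

In this toy setup a word that is pronounced the same in two dialects keeps a single entry, while dialect-varying words carry one variant per distinct pronunciation. A Global lexicon as described in the abstract would instead need a mapping from abstract units to dialect realisations, which this sketch does not attempt.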

Description

Journal Title

2022 33rd Irish Signals and Systems Conference (ISSC)

Conference Name

2022 33rd Irish Signals and Systems Conference (ISSC)

Journal ISSN

2688-1446
2688-1454

Volume Title

00

Publisher

Institute of Electrical and Electronics Engineers (IEEE)

Rights and licensing

Except where otherwise noted, this item's license is described as All Rights Reserved