Automatic Speech Recognition for Irish: testing lexicons and language models
Accepted version
Peer-reviewed
Repository URI
Repository DOI
Change log
Abstract
A range of lexicons and language models were tested in the development of ASR for Irish. One problem, common among minority languages, is the multiplicity of dialects, with no one spoken standard. To address this challenge, in a hybrid ASR system two alternative cross-dialect lexicons are tested, which draw on research in dialect phonology. First, individual lexicons were built for the three main dialects of Ulster (Ul), Connaught (Co) and Munster (Mu). With these, a Multi-dialect lexicon incorporated all dialect-varying word forms. An alternative Global lexicon, essentially a trans-dialect lexicon, used abstract representations of dialect-varying forms (phoneme or morpheme sized units). These two cross-dialect lexicons were tested along with the three dialect-specific lexicons. Several different language models were also tested. Results for the Global and Multi-dialect lexicons were found to yield the highest performance, with the lowest overall WER for the latter. There were considerable differences in results for the individual dialect lexicons: this may reflect a bias in the datasets used or could be indicators of the linguistic distance between the dialects - competing hypotheses that will need more rigorous testing. Results showed a strong effect of the language model used. Error patterns show frequent substitutions involving inflected forms.
Description
Journal Title
Conference Name
Journal ISSN
2688-1454
