Building natural language processing tools for Runyakitara

Abstract

This paper describes an endeavour to build natural language processing (NLP) tools for Runyakitara, a group of four closely related Bantu languages spoken in western Uganda. In contrast with major world languages such as English, for which corpora are comparatively abundant and NLP tools are well developed, computational linguistic resources for Runyakitara are in short supply. We therefore first need to collect corpora for these languages before we can proceed to the design of a spell-checker, a grammar-checker and applications for computer-assisted language learning (CALL). We explain how we are collecting primary data for a new Runya Corpus of speech and writing, outline the design of a morphological analyser, and discuss how we can use these new resources to build NLP tools. We are initially working with Runyankore–Rukiga, a closely related pair of Runyakitara languages, and we frame our project in the context of NLP for low-resource languages, as well as CALL for the preservation of endangered languages. We put our project forward as a test case for the revitalization of endangered languages through education and technology.

Keywords

47 Language, Communication and Culture; 4703 Language Studies; 4704 Linguistics; Machine Learning and Artificial Intelligence; Networking and Information Technology R&D (NITRD)

Journal Title

Applied Linguistics Review

Journal ISSN

1868-6303
1868-6311

Publisher

De Gruyter

Publisher DOI

https://doi.org/10.1515/applirev-2020-2004

Collections

University of Cambridge Research Outputs (Articles and Conferences)