NLP Pipeline for Annotating (Endangered) Tibetan and Newar Varieties

Faggionato, Christian; Hill, Nathan; Meelen, Marieke

Accepted version

Peer-reviewed

Repository URI

https://www.repository.cam.ac.uk/handle/1810/337486

Repository DOI

https://doi.org/10.17863/CAM.84900

Files

Accepted version (1009.79 KB)

Type

Conference Object

Authors

Faggionato, Christian

Hill, Nathan

Meelen, Marieke

https://orcid.org/0000-0003-0395-8372

Abstract

In this paper we present work in progress on a fully implemented pipeline for creating deeply annotated corpora of several historical and contemporary Tibetan and Newar varieties. Our off-the-shelf tools allow researchers to create corpora with five different layers of annotation, ranging from morphosyntactic to information-structural annotation. We build on and optimise existing tools (in line with FAIR principles), develop new ones, and show how they can be adapted to other Tibetan and Newar languages, most notably modern endangered languages that are both extremely low-resourced and under-researched.

Journal Title

Proceedings of the EURALI workshop at LREC 2022

Conference Name

LREC-EURALI workshop

Publisher DOI

https://doi.org/10.17863/CAM.84900

Sponsorship

This research is AHRC-funded (AH/V011235/1).

Collections

Cambridge University Research Outputs