Optimisation of the Largest Annotated Tibetan Corpus Combining Rule-based, Memory-based, and Deep-learning Methods

Published version
Peer-reviewed

Authors

Roux, É 
Hill, N 

Abstract

This article presents a pipeline that converts collections of Tibetan documents in plain text or XML into a fully segmented and POS-tagged corpus. We apply the pipeline to the large extant collection of the Buddhist Digital Resource Center. The semi-supervised methods presented here not only result in a new and improved version of the largest annotated Tibetan corpus to date; the integration of rule-based, memory-based, and neural-network methods also serves as a good example of how to overcome the challenges of under-researched languages. The end-to-end accuracy of our entire automatic pipeline, 91.99%, is high enough to make the resulting corpus a useful resource for both linguists and scholars of Tibetan studies.

Keywords

NLP, POS tagging, Tibetan, historical treebanks

Journal Title

ACM Transactions on Asian and Low-Resource Language Information Processing

Journal ISSN

2375-4699
2375-4702

Volume Title

20

Publisher

Association for Computing Machinery (ACM)

Sponsorship

British Academy (PF170063)