Old Catalan Morphosyntax: Developing an Annotated Corpus

This paper presents a full procedure for the development of a Part-of-Speech (POS) tagged corpus of Old Catalan. As an extremely low-resource language with rich inflection and frequent homographs, Old Catalan poses non-trivial problems in the development of a searchable constituency-based treebank. We demonstrate, however, that a semi-supervised method of incrementally building training data using both neural and memory-based taggers, together with the Pyrrha annotation tool, is highly efficient and yields accurate results. We propose that this simple and effective method could easily be extended to other low-resource historical languages for which no NLP tools exist yet.

Keywords

46 Information and Computing Sciences, 47 Language, Communication and Culture, 4704 Linguistics

Journal Title

Journal of Open Humanities Data

Journal ISSN

2059-481X

Publisher

Ubiquity Press

Publisher DOI

https://doi.org/10.5334/johd.54

Rights and licensing

Except where otherwise noted, this item's license is described as Attribution 4.0 International

Sponsorship

British Academy (PF170063)
British Academy (SRG18R1\181450)

"Research that partially facilitated the work presented in this article was funded by the British Academy (PDF grant pf170063), and the Cambridge Humanities Research Grant (tier 1 grant, GANT011262). Additionally, this work has been supported by the French government, through the UCAJEDI Investments in the Future project managed by the National Research Agency (ANR) with the reference number C870A06228 – EOTP : SYVACA – D112."

Collections

University of Cambridge Research Outputs (Articles and Conferences)

Version History

You are currently viewing version 1 of the item.

Version	Date	Summary
2	2024-08-13 10:11:43	Published version added
1*	2021-12-22 00:30:46

* Selected version