Towards a Historical Treebank of Middle and Early Modern Welsh, Part I: Workflow and POS Tagging

Accepted version
Peer-reviewed

Type

Article

Authors

Willis, D 

Abstract

This article introduces the working methods of the Parsed Historical Corpus of the Welsh Language (PARSHCWL). The corpus is designed to provide researchers with a tool for the automatic, exhaustive extraction of instances of grammatical structures from Middle and Modern Welsh texts, comparable to tools that already exist for various European languages. The major features of the corpus are outlined, along with the overall architecture of the workflow needed for a team of researchers to produce it. In this paper, the first two stages of the process, namely pre-processing of texts and automated part-of-speech (POS) tagging, are discussed in some detail, focusing in particular on major issues involved in defining word boundaries and in defining a robust and useful tagset.

Keywords

47 Language, Communication and Culture, 4704 Linguistics

Journal Title

Journal of Celtic Linguistics

Journal ISSN

0962-1377

Volume Title

22

Publisher

University of Wales Press/Gwasg Prifysgol Cymru

Rights

All rights reserved

Sponsorship

British Academy (PF170063)
British Academy (SRG18R1\181450)