Language Modeling for Morphologically Rich Languages: Character-Aware Modeling for Word-Level Prediction

Gerz, Daniela; Vulić, Ivan; Ponti, Edoardo; Naradowsky, Jason; Reichart, Roi; Korhonen, Anna

Published version

Peer-reviewed

Repository URI

https://www.repository.cam.ac.uk/handle/1810/279936

Repository DOI

https://doi.org/10.17863/CAM.27304

Files

Published version (1.14 MB)

Type

Article

Authors

Gerz, Daniela

Vulić, Ivan

Ponti, Edoardo

https://orcid.org/0000-0002-6308-1050

Naradowsky, Jason

Reichart, Roi

Korhonen, Anna

Abstract

Neural architectures are prominent in the construction of language models (LMs). However, word-level prediction is typically agnostic of subword-level information (characters and character sequences) and operates over a closed vocabulary, consisting of a limited word set. Indeed, while subword-aware models boost performance across a variety of NLP tasks, previous work did not evaluate the ability of these models to assist next-word prediction in language modeling tasks. Such subword-level informed models should be particularly effective for morphologically-rich languages (MRLs) that exhibit high type-to-token ratios. In this work, we present a large-scale LM study on 50 typologically diverse languages covering a wide variety of morphological systems, and offer new LM benchmarks to the community, while considering subword-level information. The main technical contribution of our work is a novel method for injecting subword-level information into semantic word vectors, integrated into the neural language modeling training, to facilitate word-level prediction. We conduct experiments in the LM setting where the number of infrequent words is large, and demonstrate strong perplexity gains across our 50 languages, especially for morphologically-rich languages. Our code and data sets are publicly available.
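The abstract relies on two quantities, the type-to-token ratio and perplexity. The following is a minimal illustrative sketch of their standard textbook definitions, not the authors' released code. The function names and the toy inputs are assumptions introduced here for illustration only.

```python
import math
from collections import Counter


def type_token_ratio(tokens):
    """Distinct word types divided by total tokens (higher in MRLs)."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    return len(set(tokens)) / len(tokens)


def rare_type_share(tokens, max_count=1):
    """Share of word types seen at most `max_count` times (hypothetical helper)."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    counts = Counter(tokens)
    rare = sum(1 for c in counts.values() if c <= max_count)
    return rare / len(counts)


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each token.

    Computed as exp of the average negative log-probability.
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


if __name__ == "__main__":
    toy_tokens = "the cat sat on the mat".split()  # toy data, not from the paper
    print(type_token_ratio(toy_tokens))
    print(rare_type_share(toy_tokens))
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

A model that assigns every token probability 0.25 is as uncertain as a uniform choice among four words, so lower perplexity indicates better next-word prediction.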

Keywords

46 Information and Computing Sciences, 47 Language, Communication and Culture, 4704 Linguistics

Journal Title

Transactions of the Association for Computational Linguistics

Journal ISSN

2307-387X

Volume Title

6

Publisher

MIT Press - Journals

Publisher DOI

https://doi.org/10.1162/tacl_a_00032

Rights

Attribution 4.0 International (CC BY 4.0)

Sponsorship

European Research Council (648909)

This work is supported by the ERC Consolidator Grant LEXICAL (648909)

Collections

Cambridge University Research Outputs