
dc.contributor.author: Gerz, Daniela
dc.contributor.author: Vulić, Ivan
dc.contributor.author: Ponti, Edoardo
dc.contributor.author: Naradowsky, Jason
dc.contributor.author: Reichart, Roi
dc.contributor.author: Korhonen, Anna-Leena
dc.date.accessioned: 2018-09-08T06:35:14Z
dc.date.available: 2018-09-08T06:35:14Z
dc.date.issued: 2018-12
dc.identifier.issn: 2307-387X
dc.identifier.uri: https://www.repository.cam.ac.uk/handle/1810/279936
dc.description.abstract: Neural architectures are prominent in the construction of language models (LMs). However, word-level prediction is typically agnostic of subword-level information (characters and character sequences) and operates over a closed vocabulary, consisting of a limited word set. Indeed, while subword-aware models boost performance across a variety of NLP tasks, previous work did not evaluate the ability of these models to assist next-word prediction in language modeling tasks. Such subword-level informed models should be particularly effective for morphologically-rich languages (MRLs) that exhibit high type-to-token ratios. In this work, we present a large-scale LM study on 50 typologically diverse languages covering a wide variety of morphological systems, and offer new LM benchmarks to the community, while considering subword-level information. The main technical contribution of our work is a novel method for injecting subword-level information into semantic word vectors, integrated into the neural language modeling training, to facilitate word-level prediction. We conduct experiments in the LM setting where the number of infrequent words is large, and demonstrate strong perplexity gains across our 50 languages, especially for morphologically-rich languages. Our code and data sets are publicly available.
dc.description.sponsorship: This work is supported by the ERC Consolidator Grant LEXICAL (648909)
dc.language: en
dc.publisher: MIT Press - Journals
dc.rights: Attribution 4.0 International (CC BY 4.0)
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.title: Language Modeling for Morphologically Rich Languages: Character-Aware Modeling for Word-Level Prediction
dc.type: Article
prism.endingPage: 465
prism.publicationDate: 2018
prism.publicationName: Transactions of the Association for Computational Linguistics
prism.startingPage: 451
prism.volume: 6
dc.identifier.doi: 10.17863/CAM.27304
dcterms.dateAccepted: 2018-05-25
rioxxterms.versionofrecord: 10.1162/tacl_a_00032
rioxxterms.licenseref.uri: http://www.rioxx.net/licenses/all-rights-reserved
rioxxterms.licenseref.startdate: 2018-12
dc.contributor.orcid: Ponti, Edoardo [0000-0002-6308-1050]
dc.identifier.eissn: 2307-387X
rioxxterms.type: Journal Article/Review
pubs.funder-project-id: European Research Council (648909)
cam.issuedOnline: 2018-07-25



Except where otherwise noted, this item's licence is described as Attribution 4.0 International (CC BY 4.0)