dc.contributor.author: Gerz, Daniela
dc.contributor.author: Vulić, I
dc.contributor.author: Ponti, Edoardo
dc.contributor.author: Reichart, R
dc.contributor.author: Korhonen, Anna-Leena
dc.date.accessioned: 2018-09-27T14:13:12Z
dc.date.available: 2018-09-27T14:13:12Z
dc.date.issued: 2020-01-01
dc.identifier.isbn: 9781948087841
dc.identifier.uri: https://www.repository.cam.ac.uk/handle/1810/282852
dc.description.abstract: A key challenge in cross-lingual NLP is developing general language-independent architectures that are equally applicable to any language. However, this ambition is largely hampered by the variation in structural and semantic properties, i.e. the typological profiles of the world's languages. In this work, we analyse the implications of this variation on the language modeling (LM) task. We present a large-scale study of state-of-the-art n-gram-based and neural language models on 50 typologically diverse languages covering a wide variety of morphological systems. Operating in the full vocabulary LM setup focused on word-level prediction, we demonstrate that a coarse typology of morphological systems is predictive of absolute LM performance. Moreover, fine-grained typological features such as exponence, flexivity, fusion, and inflectional synthesis are borne out to be responsible for the proliferation of low-frequency phenomena which are organically difficult to model by statistical architectures, or for the meaning ambiguity of character n-grams. Our study strongly suggests that these features have to be taken into consideration during the construction of next-level language-agnostic LM architectures, capable of handling morphologically complex languages such as Tamil or Korean.
dc.description.sponsorship: ERC grant Lexical
dc.title: On the relation between linguistic typology and (limitations of) multilingual language modeling
dc.type: Conference Object
prism.endingPage: 327
prism.publicationDate: 2020
prism.publicationName: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018
prism.startingPage: 316
dc.identifier.doi: 10.17863/CAM.30216
dcterms.dateAccepted: 2018-08-10
rioxxterms.versionofrecord: 10.17863/CAM.30216
rioxxterms.licenseref.uri: http://www.rioxx.net/licenses/all-rights-reserved
rioxxterms.licenseref.startdate: 2020-01-01
dc.contributor.orcid: Ponti, Edoardo [0000-0002-6308-1050]
rioxxterms.type: Conference Paper/Proceeding/Abstract
pubs.funder-project-id: European Research Council (648909)
pubs.conference-name: 2018 Conference on Empirical Methods in Natural Language Processing
pubs.conference-start-date: 2018-11-02
pubs.conference-finish-date: 2018-11-04
rioxxterms.freetoread.startdate: 2019-09-24

