On the relation between linguistic typology and (limitations of) multilingual language modeling
Authors
Gerz, D
Vulić, I
Ponti, EM
Reichart, R
Korhonen, A
Publication Date
2018
Journal Title
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018
Conference Name
2018 Conference on Empirical Methods in Natural Language Processing
ISBN
9781948087841
Pages
316-327
Type
Conference Object
Citation
Gerz, D., Vulić, I., Ponti, E., Reichart, R., & Korhonen, A. (2018). On the relation between linguistic typology and (limitations of) multilingual language modeling. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, 316-327. https://doi.org/10.17863/CAM.30216
Abstract
A key challenge in cross-lingual NLP is developing general language-independent architectures that are equally applicable to any language. However, this ambition is largely hampered by the variation in structural and semantic properties, i.e. the typological profiles of the world's languages. In this work, we analyse the implications of this variation on the language modeling (LM) task. We present a large-scale study of state-of-the-art n-gram based and neural language models on 50 typologically diverse languages covering a wide variety of morphological systems. Operating in the full vocabulary LM setup focused on word-level prediction, we demonstrate that a coarse typology of morphological systems is predictive of absolute LM performance. Moreover, fine-grained typological features such as exponence, flexivity, fusion, and inflectional synthesis are borne out to be responsible for the proliferation of low-frequency phenomena which are organically difficult to model by statistical architectures, or for the meaning ambiguity of character n-grams. Our study strongly suggests that these features have to be taken into consideration during the construction of next-level language-agnostic LM architectures, capable of handling morphologically complex languages such as Tamil or Korean.
Sponsorship
ERC grant Lexical
Funder references
European Research Council (648909)
Identifiers
External DOI: https://doi.org/10.17863/CAM.30216
This record's URL: https://www.repository.cam.ac.uk/handle/1810/282852
Rights
Licence: http://www.rioxx.net/licenses/all-rights-reserved