Learning unsupervised multilingual word embeddings with incremental multilingual hubs
Authors
Heyman, G
Verreet, B
Vulić, I
Moens, MF
Publication Date
2019-01-01
Journal Title
NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference
Conference Name
Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)
ISBN
9781950737130
Volume
1
Pages
1890-1902
Type
Conference Object
This Version
AM (Accepted Manuscript)
Citation
Heyman, G., Verreet, B., Vulić, I., & Moens, M.-F. (2019). Learning unsupervised multilingual word embeddings with incremental multilingual hubs. NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, 1, 1890-1902. https://doi.org/10.17863/CAM.39779
Abstract
Recent research has shown that a shared bilingual word embedding space can be induced by projecting monolingual word embedding spaces from two languages onto each other using a self-learning paradigm, without any bilingual supervision. However, it has also been shown that for distant language pairs such fully unsupervised self-learning methods are unstable and often get stuck in poor local optima, owing to the reduced isomorphism between the starting monolingual spaces. In this work, we propose a new robust framework for learning unsupervised multilingual word embeddings that mitigates these instability issues. We learn a shared multilingual embedding space for a variable number of languages by incrementally adding new languages, one by one, to the current multilingual space. Through this gradual language addition, our method can leverage the interdependencies between the new language and all the other languages in the current multilingual hub/space. We find that it is beneficial to project more distant languages later in the iterative process. Our fully unsupervised multilingual embedding spaces yield results that are on par with state-of-the-art methods on the bilingual lexicon induction (BLI) task, and simultaneously obtain state-of-the-art scores on two downstream tasks: multilingual document classification and multilingual dependency parsing, outperforming even supervised baselines. This finding also accentuates the need to establish, in future work, evaluation protocols for cross-lingual word embeddings that go beyond the omnipresent intrinsic BLI task.
Sponsorship
European Research Council (648909)
Identifiers
External DOI: https://doi.org/10.17863/CAM.39779
This record's URL: https://www.repository.cam.ac.uk/handle/1810/292618
Rights
All rights reserved
Licence:
http://www.rioxx.net/licenses/all-rights-reserved