Learning unsupervised multilingual word embeddings with incremental multilingual hubs
Authors
Heyman, G
Verreet, B
Vulić, I
Moens, MF
Publication Date
2019-01-01
Journal Title
NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference
Conference Name
Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)
ISBN
9781950737130
Volume
1
Pages
1890-1902
Type
Conference Object
This Version
AM (Accepted Manuscript)
Citation
Heyman, G., Verreet, B., Vulić, I., & Moens, M.-F. (2019). Learning unsupervised multilingual word embeddings with incremental multilingual hubs. NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, 1, 1890-1902. https://doi.org/10.17863/CAM.39779
Abstract
Recent research has shown that a shared bilingual word embedding space can be induced by projecting monolingual word embedding spaces from two languages onto each other using a self-learning paradigm, without any bilingual supervision. However, it has also been shown that for distant language pairs such fully unsupervised self-learning methods are unstable and often get stuck in poor local optima, owing to the reduced isomorphism between the starting monolingual spaces. In this work, we propose a new robust framework for learning unsupervised multilingual word embeddings that mitigates these instability issues. We learn a shared multilingual embedding space for a variable number of languages by incrementally adding new languages, one by one, to the current multilingual space. Through this gradual language addition, our method can leverage the interdependencies between the new language and all the other languages in the current multilingual hub/space. We find that it is beneficial to project more distant languages later in the iterative process. Our fully unsupervised multilingual embedding spaces yield results that are on par with state-of-the-art methods on the bilingual lexicon induction (BLI) task, and simultaneously obtain state-of-the-art scores on two downstream tasks: multilingual document classification and multilingual dependency parsing, outperforming even supervised baselines. This finding also accentuates the need to establish, in future work, evaluation protocols for cross-lingual word embeddings that go beyond the omnipresent intrinsic BLI task.
Sponsorship
European Research Council (648909)
Identifiers
External DOI: https://doi.org/10.17863/CAM.39779
This record's URL: https://www.repository.cam.ac.uk/handle/1810/292618
Rights
All rights reserved
Licence:
http://www.rioxx.net/licenses/all-rights-reserved