Investigating Cross-Lingual Alignment Methods for Contextualized Embeddings with Token-Level Evaluation

Accepted version
Peer-reviewed

Type

Conference Object

Change log

Authors

Liu, Q 
McCarthy, D 
Vulić, I 
Korhonen, A 

Abstract

In this paper, we present a thorough investigation of methods that align pre-trained contextualized embeddings into a shared cross-lingual context-aware embedding space, providing strong reference benchmarks for future context-aware cross-lingual models. We propose a novel and challenging task, Bilingual Token-level Sense Retrieval (BTSR). It specifically evaluates the accurate alignment of words with the same meaning in cross-lingual non-parallel contexts, which is not currently evaluated by existing tasks such as Bilingual Contextual Word Similarity and Sentence Retrieval. We show how the proposed BTSR task highlights the merits of different alignment methods. In particular, we find that context-average type-level alignment is effective in transferring monolingual contextualized embeddings cross-lingually, especially in non-parallel contexts, and at the same time improves the monolingual space. Furthermore, aligning independently trained models yields better performance than aligning multilingual embeddings with a shared vocabulary.
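The context-average type-level alignment mentioned in the abstract can be sketched roughly as follows: average a word type's contextualized token embeddings across its occurrences to obtain a static anchor per type, then learn an orthogonal mapping between the two languages' anchor matrices (here via the standard Procrustes solution). This is a minimal illustration, not the authors' implementation; the function names, shapes, and use of numpy are assumptions.

```python
import numpy as np

def type_level_anchors(token_embs, token_ids, vocab_size, dim):
    """Average contextualized token embeddings per word type ("context average").

    token_embs: iterable of (dim,) vectors, one per token occurrence (hypothetical input).
    token_ids:  matching iterable of word-type indices in [0, vocab_size).
    """
    anchors = np.zeros((vocab_size, dim))
    counts = np.zeros(vocab_size)
    for emb, tid in zip(token_embs, token_ids):
        anchors[tid] += emb
        counts[tid] += 1
    seen = counts > 0
    anchors[seen] /= counts[seen, None]  # mean embedding per observed type
    return anchors

def procrustes_align(X, Y):
    """Orthogonal W minimizing ||X @ W - Y||_F (closed-form Procrustes solution).

    X, Y: (n_pairs, dim) anchor matrices for source/target, row-aligned by a
    bilingual dictionary.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt
```

Applying the learned `W` to every contextualized source-language embedding (not just the anchors) is what transfers the full token-level space cross-lingually.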

Description

Keywords

Journal Title

CoNLL 2019 - 23rd Conference on Computational Natural Language Learning, Proceedings of the Conference

Conference Name

The SIGNLL Conference on Computational Natural Language Learning

Journal ISSN

Volume Title

Publisher

Rights

All rights reserved

Sponsorship
European Research Council (648909)
Peterhouse College Studentship; ERC Consolidator Grant LEXICAL