On cross-lingual retrieval with multilingual text encoders.
dc.contributor.author | Litschko, Robert | |
dc.contributor.author | Vulić, Ivan | |
dc.contributor.author | Ponzetto, Simone Paolo | |
dc.contributor.author | Glavaš, Goran | |
dc.date.accessioned | 2022-05-10T15:00:40Z | |
dc.date.available | 2022-05-10T15:00:40Z | |
dc.date.issued | 2022 | |
dc.date.submitted | 2021-07-08 | |
dc.identifier.issn | 1386-4564 | |
dc.identifier.other | s10791-022-09406-x | |
dc.identifier.other | 9406 | |
dc.identifier.uri | https://www.repository.cam.ac.uk/handle/1810/336981 | |
dc.description | Funder: Universität Mannheim (3157) | |
dc.description.abstract | Pretrained multilingual text encoders based on neural transformer architectures, such as multilingual BERT (mBERT) and XLM, have recently become a default paradigm for cross-lingual transfer of natural language processing models, rendering cross-lingual word embedding spaces (CLWEs) effectively obsolete. In this work we present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a number of diverse language pairs. We first treat these models as multilingual text encoders and benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR. In contrast to supervised language understanding, our results indicate that for unsupervised document-level CLIR (a setup with no relevance judgments for IR-specific fine-tuning), pretrained multilingual encoders on average fail to significantly outperform earlier models based on CLWEs. For sentence-level retrieval, we do obtain state-of-the-art performance: the peak scores, however, are achieved by multilingual encoders that have been further specialized, in a supervised fashion, for sentence understanding tasks, rather than by their vanilla 'off-the-shelf' variants. Following these results, we introduce localized relevance matching for document-level CLIR, where we independently score a query against document sections. In the second part, we evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., learning to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments. Our results show that, despite the supervision, and due to the domain and language shift, supervised re-ranking rarely improves the performance of multilingual transformers over their use as unsupervised base rankers. Finally, only with in-domain contrastive fine-tuning (i.e., same domain, only language transfer) do we manage to improve the ranking quality. 
We uncover substantial empirical differences between cross-lingual retrieval results and results of (zero-shot) cross-lingual transfer for monolingual retrieval in target languages, which point to "monolingual overfitting" of retrieval models trained on monolingual (English) data, even if they are based on multilingual transformers. | |
dc.language | en | |
dc.publisher | Springer Science and Business Media LLC | |
dc.subject | Cross-lingual IR | |
dc.subject | Learning to Rank | |
dc.subject | Multilingual text encoders | |
dc.title | On cross-lingual retrieval with multilingual text encoders. | |
dc.type | Article | |
dc.date.updated | 2022-05-10T15:00:40Z | |
prism.endingPage | 183 | |
prism.issueIdentifier | 2 | |
prism.publicationName | Inf Retr Boston | |
prism.startingPage | 149 | |
prism.volume | 25 | |
dc.identifier.doi | 10.17863/CAM.84403 | |
dcterms.dateAccepted | 2022-02-05 | |
rioxxterms.versionofrecord | 10.1007/s10791-022-09406-x | |
rioxxterms.version | VoR | |
rioxxterms.licenseref.uri | http://creativecommons.org/licenses/by/4.0/ | |
dc.identifier.eissn | 1573-7659 | |
pubs.funder-project-id | European Research Council (957356) | |
pubs.funder-project-id | Ministerium für Wirtschaft, Arbeit und Wohnungsbau Baden-Württemberg (MultiConvAI) | |
cam.issuedOnline | 2022-03-07 |