
Data-Efficient Bilingual Lexicon Induction with Pretrained Language Models



Abstract

Bilingual dictionaries are essential language resources that play a crucial role in the development of modern multilingual and cross-lingual natural language processing (NLP) systems, particularly for resource-lean languages. Although there are 7,000+ languages spoken worldwide, existing bilingual dictionaries are limited in both quality and quantity. This thesis specifically focuses on the task of Bilingual Lexicon Induction (BLI) and proposes a series of innovative data-efficient BLI approaches aimed at automatically inducing high-quality bilingual dictionaries in low-data scenarios, thereby bridging the lexical gaps between languages.

While previous BLI methods rely on mapping static word embeddings, we take inspiration from the paradigm shift towards pretrained language models (PLMs) and investigate leveraging PLMs for BLI. Firstly, we propose a two-stage contrastive learning framework that combines cross-lingual word embeddings (CLWEs) mapped from static embeddings with those extracted from PLMs, both learned with contrastive learning (Chapter 3). Secondly, we put forth a retrieve-and-rerank approach: we first use any precalculated CLWEs to retrieve a small set of candidate translations and then leverage PLMs as cross-encoder rerankers (Chapter 4). Thirdly, we investigate whether autoregressive large language models (LLMs) can be prompted for BLI, departing entirely from traditional mapping-based approaches. We further employ retrieval-augmented in-context learning (ICL) to boost performance and propose inducing a self-augmented high-confidence dictionary to be used for ICL in the unsupervised BLI setting (Chapter 5). These three studies demonstrate the effectiveness of leveraging PLMs for BLI, progressively establishing robust, new state-of-the-art BLI performance.
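The retrieve-and-rerank idea can be illustrated with a minimal sketch. This is not the thesis implementation: the toy vectors stand in for precalculated CLWEs, and the pairwise `cross_scorer` stands in for a PLM-based cross-encoder; all names and data here are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(src_vec, tgt_vocab, k=2):
    """Stage 1: nearest-neighbour retrieval of k candidate
    translations in a shared (toy) CLWE space."""
    return sorted(tgt_vocab, key=lambda w: -cosine(src_vec, tgt_vocab[w]))[:k]

def rerank(src_word, candidates, cross_scorer):
    """Stage 2: rescore (source, candidate) pairs with a
    cross-encoder-style scorer and return the top candidate."""
    return max(candidates, key=lambda c: cross_scorer(src_word, c))

# Toy example: translating German "hund" into English.
src_vec = [1.0, 0.1]                       # stand-in CLWE for "hund"
tgt_vocab = {"dog": [0.9, 0.2], "cat": [0.1, 1.0], "puppy": [0.8, 0.3]}
pair_scores = {("hund", "dog"): 0.9, ("hund", "puppy"): 0.4}
cross_scorer = lambda s, t: pair_scores.get((s, t), 0.0)

candidates = retrieve(src_vec, tgt_vocab, k=2)   # ["dog", "puppy"]
best = rerank("hund", candidates, cross_scorer)  # "dog"
```

The design point is that retrieval keeps the reranking step cheap: the expensive cross-encoder only scores the small candidate set rather than the full target vocabulary.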

Finally, recognising the usefulness of BLI for neural machine translation (NMT), as indicated by related work, we propose an NMT-enhanced, parameter-efficient cross-lingual transfer learning framework for multilingual text-to-image generation (Chapter 6). In this application-oriented study, we demonstrate that translation-based approaches, again focusing on low-data setups, can also yield strong cross-lingual transfer in multilingual text-to-image generation, a task previously limited to English-only monolingual settings.

Date

2024-09-30

Advisors

Korhonen, Anna
Vulic, Ivan

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights and licensing

Except where otherwise noted, this item's license is described as All rights reserved
Sponsorship

Grace and Thomas C. H. Chan Cambridge International Scholarship