Data-Efficient Bilingual Lexicon Induction with Pretrained Language Models
Repository URI
Repository DOI
Change log
Authors
Abstract
Bilingual dictionaries are essential language resources that play a crucial role in the development of modern multilingual and cross-lingual natural language processing (NLP) systems, particularly for resource-lean languages. Although there are 7,000+ languages spoken worldwide, existing bilingual dictionaries are limited in both quality and quantity. This thesis specifically focuses on the task of Bilingual Lexicon Induction (BLI) and proposes a series of innovative data-efficient BLI approaches aimed at automatically inducing high-quality bilingual dictionaries in low-data scenarios, thereby bridging the lexical gaps between languages.
While previous BLI methods rely on mapping static word embeddings, inspired by the paradigm shift towards pretrained language models (PLMs), we investigate leveraging PLMs for BLI. Firstly, we propose a two-stage contrastive learning framework, combining cross-lingual word embeddings (CLWEs) mapped from static embeddings and those extracted from PLMs, both learned with contrastive learning (Chapter 3). Secondly, we put forth a retrieve-and-rerank approach where we first use any precalculated CLWEs to retrieve a small set of candidate translations and then leverage PLMs as cross-encoder rerankers for BLI (Chapter 4). Thirdly, we investigate whether it is possible to prompt autoregressive large language models (LLMs) for BLI, which departs completely from traditional mapping-based approaches. We further employ retrieval-augmented in-context learning (ICL) to boost performance and propose inducing a self-augmented high-confidence dictionary to be used in an ICL fashion for the unsupervised BLI task (Chapter 5). These three studies demonstrate the effectiveness of leveraging PLMs for BLI, progressively establishing robust new state-of-the-art BLI performance.
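The retrieve-and-rerank pipeline described above can be sketched in miniature: a fast retrieval step selects a shortlist of candidate translations by cosine similarity over precalculated CLWEs, and a second step rescores that shortlist with a more expensive pairwise scorer (standing in for a PLM cross-encoder). This is a minimal illustration with toy vectors and a hypothetical `scorer` callable, not the thesis's actual implementation.

```python
import numpy as np

def retrieve_candidates(src_vec, tgt_matrix, k=3):
    """Stage 1: retrieve the top-k target candidates by cosine
    similarity between a source CLWE and all target CLWEs."""
    src = src_vec / np.linalg.norm(src_vec)
    tgt = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    sims = tgt @ src
    top = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in top]

def rerank(src_word, candidates, tgt_words, scorer):
    """Stage 2: rescore the retrieved shortlist with a pairwise
    scorer (a stand-in for a PLM cross-encoder that jointly
    encodes the source word and each candidate translation)."""
    scored = [(tgt_words[i], scorer(src_word, tgt_words[i]))
              for i, _ in candidates]
    return max(scored, key=lambda pair: pair[1])[0]

# Toy example: translate "dog" into a 3-word German vocabulary.
tgt_words = ["Hund", "Katze", "Haus"]
src_vec = np.array([1.0, 0.0])
tgt_matrix = np.array([[0.9, 0.1],
                       [0.8, 0.3],
                       [0.1, 0.9]])

cands = retrieve_candidates(src_vec, tgt_matrix, k=2)

# Hypothetical cross-encoder scores, hard-coded for illustration.
toy_scores = {("dog", "Hund"): 0.95, ("dog", "Katze"): 0.40}
best = rerank("dog", cands, tgt_words,
              lambda s, t: toy_scores.get((s, t), 0.0))
```

The design point the sketch makes is that retrieval keeps the expensive reranker's workload small: the cross-encoder only scores k candidates per source word rather than the full target vocabulary.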
Finally, recognising the usefulness of BLI in neural machine translation (NMT), as indicated by related work, we further propose an NMT-enhanced parameter-efficient cross-lingual transfer learning framework for multilingual text-to-image generation (Chapter 6). In this application-oriented study, we demonstrate that translation-based approaches, again in low-data setups, yield strong cross-lingual transfer in multilingual text-to-image generation, a task previously limited to English-only monolingual settings.
Description
Date
Advisors
Vulic, Ivan
