Multi-lingual Learning with Limited Labels
Abstract
Recent advances in natural language processing with neural networks have largely benefited high-resource languages such as English, while the vast majority of the world's other languages face significant challenges due to data scarcity. Rather than focusing solely on collecting more labelled data for these languages, this thesis investigates approaches from three complementary perspectives to improve multi-lingual learning, that is, a model's ability to learn and generalise across multiple languages, with a particular emphasis on low-resource settings.
First, we leverage subword information to learn multi-lingual word representations, aiming to improve data efficiency during pre-training by incorporating subword-level knowledge. We begin by demonstrating the importance of subword-aware word representations for morphological tasks, examined under both simulated and truly low-resource language settings. Extending this study, we propose a general framework for learning subword-informed word representations, centred on the two components essential to subword integration: the word segmentation strategy (tokenisation) and the subword composition function. Within this framework, we find that no universally optimal subword model exists; instead, the choice of components must be tuned to the specific language and task to achieve the best results.
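To make the two components concrete, the following is a minimal sketch, not taken from the thesis, of how a subword-informed word representation can be assembled. The segmentation strategy shown (character n-grams with boundary markers, as popularised by fastText) and the composition function (embedding addition) are only one well-known pairing from this design space; all function names are illustrative.

```python
import numpy as np

# Component (1): a segmentation strategy mapping a word to subword units.
def segment_char_ngrams(word: str, n_min: int = 3, n_max: int = 5) -> list[str]:
    """Character n-grams over the word with boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i : i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]

# Component (2): a composition function mapping subword embeddings
# to a single word vector.
def compose_by_addition(subwords: list[str],
                        emb: dict[str, np.ndarray],
                        dim: int = 300) -> np.ndarray:
    """Sum the embeddings of all subword units; unknown units
    fall back to a zero vector."""
    return sum((emb.get(s, np.zeros(dim)) for s in subwords), np.zeros(dim))

# A word representation is then composition(segmentation(word)):
# vec = compose_by_addition(segment_char_ngrams("unhappiness"), subword_emb)
```

Other instantiations of the framework could swap in different segmenters (e.g. BPE or a morphologically motivated analyser) and richer composition functions such as attention over subword embeddings, which is precisely the space in which no single choice dominates across languages and tasks.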
Second, we integrate deep generative models into multi-lingual learning to address the challenge of labelled-data scarcity. We introduce a unified framework that combines multi-lingual pre-training based on variational autoencoders with a reformulated semi-supervised learning approach, enabling end-to-end training of a class of two-stage deep generative semi-supervised models. We show that this framework can effectively harness both general unlabelled data and task-specific unlabelled data to mitigate the paucity of labelled data across multiple languages.
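As a rough illustration of the model family involved, here is a minimal sketch, not the thesis implementation, of the labelled and unlabelled objectives in a Kingma-style semi-supervised VAE (the "M2" model). The `encoder`, `decoder`, and `classifier` modules are placeholders, and the decoder is assumed to output Bernoulli parameters for inputs in [0, 1].

```python
import torch
import torch.nn.functional as F

def neg_elbo(x, y, encoder, decoder):
    """Per-example -ELBO with the label treated as observed: q(z | x, y)."""
    mu, logvar = encoder(x, y)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation
    recon = F.binary_cross_entropy(decoder(z, y), x, reduction="none").sum(-1)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
    return recon + kl

def labelled_loss(x, y, encoder, decoder, classifier, num_classes, alpha=0.1):
    """Labelled data: -ELBO plus a weighted supervised classifier term."""
    sup = F.cross_entropy(classifier(x), y, reduction="none")
    y_onehot = F.one_hot(y, num_classes).float()
    return (neg_elbo(x, y_onehot, encoder, decoder) + alpha * sup).mean()

def unlabelled_loss(x, encoder, decoder, classifier):
    """Unlabelled data: marginalise the unknown label under q(y | x)
    and subtract its entropy, giving a bound on -log p(x)."""
    probs = classifier(x).softmax(-1)                        # q(y | x)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1)
    loss = -entropy
    for c in range(probs.size(-1)):
        y = F.one_hot(torch.full_like(probs[:, 0], c, dtype=torch.long),
                      probs.size(-1)).float()
        loss = loss + probs[:, c] * neg_elbo(x, y, encoder, decoder)
    return loss.mean()
```

Summing the two losses over labelled and unlabelled batches is what makes the training end-to-end: the classifier, which performs the downstream task, is learned jointly with the generative model rather than in a separate second stage.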
Third, we thoroughly study few-shot cross-lingual transfer to identify the key factors that influence its performance. Through extensive experiments across tasks and languages, we demonstrate that transfer effectiveness is largely determined by the selection of few-shot examples, to which model performance is highly sensitive. Language similarity, source-language adaptation, the nature of the downstream task, and the adaptation strategy employed also play important roles in overall performance.
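The standard few-shot cross-lingual transfer protocol, and why sample sensitivity matters for reporting results, can be summarised in a short sketch; `finetune` and `evaluate` are hypothetical placeholders, and drawing several random few-shot samples is what exposes the variance in performance.

```python
import random
import statistics

def few_shot_transfer(source_model, target_train, target_test,
                      finetune, evaluate, k=16, num_samples=5):
    """Further fine-tune a source-language model on k target-language
    examples, repeated over several random samples of the k shots."""
    scores = []
    for seed in range(num_samples):
        rng = random.Random(seed)
        shots = rng.sample(target_train, k)     # one few-shot sample
        model = finetune(source_model, shots)   # target-language adaptation
        scores.append(evaluate(model, target_test))
    # The spread across samples quantifies sensitivity to shot selection.
    return statistics.mean(scores), statistics.stdev(scores)
```

Reporting only a single few-shot sample, as is still common, can therefore substantially over- or under-state a method's cross-lingual transfer ability.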
The findings of this thesis further point to potential directions for democratising state-of-the-art language technologies for low-resource languages in the era of large language models.
Advisors
Shareghi, Ehsan
