Multi-lingual Learning with Limited Labels
Abstract
Recent advances in natural language processing with neural networks have largely benefited high-resource languages such as English, while the vast majority of the world's other languages face significant challenges due to data scarcity. Rather than focusing solely on collecting more labelled data for these languages, this thesis investigates approaches from three complementary perspectives to improve multi-lingual learning, that is, a model's ability to learn and generalise across multiple languages, with a particular emphasis on low-resource settings.
First, we leverage subword information to learn multi-lingual word representations, aiming to improve data efficiency during pre-training by incorporating subword-level knowledge. We begin by demonstrating the importance of subword-aware word representations for morphological tasks, examined under both simulated and truly low-resource language settings. Extending this study, we propose a general framework for learning subword-informed word representations, centred on the two components essential to subword integration: the word segmentation strategy (tokenisation) and the subword composition function. Within this framework, we find that no universally optimal subword model exists; instead, the choice of components must be tuned to the specific language and task to achieve the best results.
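To make the two components concrete, the following is a minimal sketch, not taken from the thesis, of how a subword-informed word representation can be assembled. The segmentation strategy shown (character n-grams with boundary markers, as popularised by fastText) and the composition function (embedding addition) are only one well-known pairing from this design space; all function names are illustrative.

```python
import numpy as np

# Component (1): a segmentation strategy mapping a word to subword units.
def segment_char_ngrams(word: str, n_min: int = 3, n_max: int = 5) -> list[str]:
    """Character n-grams over the word with boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i : i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]

# Component (2): a composition function mapping subword embeddings
# to a single word vector.
def compose_by_addition(subwords: list[str],
                        emb: dict[str, np.ndarray],
                        dim: int = 300) -> np.ndarray:
    """Sum the embeddings of all subword units; unknown units
    fall back to a zero vector."""
    return sum((emb.get(s, np.zeros(dim)) for s in subwords), np.zeros(dim))

# A word representation is then composition(segmentation(word)):
# vec = compose_by_addition(segment_char_ngrams("unhappiness"), subword_emb)
```

Other instantiations of the framework could swap in different segmenters (e.g. BPE or a morphologically motivated analyser) and richer composition functions such as attention over subword embeddings, which is precisely the space in which no single choice dominates across languages and tasks.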
Second, we integrate deep generative models into multi-lingual learning to address the challenge of labelled-data scarcity. We introduce a unified framework that combines multi-lingual pre-training based on variational autoencoders with a reformulated semi-supervised learning approach, enabling end-to-end training of a class of two-stage deep generative semi-supervised models. We show that this framework can effectively harness both general unlabelled data and task-specific unlabelled data to mitigate the paucity of labelled data across multiple languages.
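As a rough illustration of the model family involved, here is a minimal sketch, not the thesis implementation, of the labelled and unlabelled objectives in a Kingma-style semi-supervised VAE (the "M2" model). The `encoder`, `decoder`, and `classifier` modules are placeholders, and the decoder is assumed to output Bernoulli parameters for inputs in [0, 1].

```python
import torch
import torch.nn.functional as F

def neg_elbo(x, y, encoder, decoder):
    """Per-example -ELBO with the label treated as observed: q(z | x, y)."""
    mu, logvar = encoder(x, y)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation
    recon = F.binary_cross_entropy(decoder(z, y), x, reduction="none").sum(-1)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
    return recon + kl

def labelled_loss(x, y, encoder, decoder, classifier, num_classes, alpha=0.1):
    """Labelled data: -ELBO plus a weighted supervised classifier term."""
    sup = F.cross_entropy(classifier(x), y, reduction="none")
    y_onehot = F.one_hot(y, num_classes).float()
    return (neg_elbo(x, y_onehot, encoder, decoder) + alpha * sup).mean()

def unlabelled_loss(x, encoder, decoder, classifier):
    """Unlabelled data: marginalise the unknown label under q(y | x)
    and subtract its entropy, giving a bound on -log p(x)."""
    probs = classifier(x).softmax(-1)                        # q(y | x)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1)
    loss = -entropy
    for c in range(probs.size(-1)):
        y = F.one_hot(torch.full_like(probs[:, 0], c, dtype=torch.long),
                      probs.size(-1)).float()
        loss = loss + probs[:, c] * neg_elbo(x, y, encoder, decoder)
    return loss.mean()
```

Summing the two losses over labelled and unlabelled batches is what makes the training end-to-end: the classifier, which performs the downstream task, is learned jointly with the generative model rather than in a separate second stage.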
Third, we thoroughly study few-shot cross-lingual transfer to identify the key factors that influence its performance. Through extensive experiments across tasks and languages, we demonstrate that transfer effectiveness is largely determined by the selection of few-shot examples, to which model performance is highly sensitive. Language similarity, source-language adaptation, the nature of the downstream task, and the adaptation strategy employed also play important roles in overall performance.
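The standard few-shot cross-lingual transfer protocol, and why sample sensitivity matters for reporting results, can be summarised in a short sketch; `finetune` and `evaluate` are hypothetical placeholders, and drawing several random few-shot samples is what exposes the variance in performance.

```python
import random
import statistics

def few_shot_transfer(source_model, target_train, target_test,
                      finetune, evaluate, k=16, num_samples=5):
    """Further fine-tune a source-language model on k target-language
    examples, repeated over several random samples of the k shots."""
    scores = []
    for seed in range(num_samples):
        rng = random.Random(seed)
        shots = rng.sample(target_train, k)     # one few-shot sample
        model = finetune(source_model, shots)   # target-language adaptation
        scores.append(evaluate(model, target_test))
    # The spread across samples quantifies sensitivity to shot selection.
    return statistics.mean(scores), statistics.stdev(scores)
```

Reporting only a single few-shot sample, as is still common, can therefore substantially over- or under-state a method's cross-lingual transfer ability.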
The findings of this thesis further point to potential directions for democratising state-of-the-art language technologies for low-resource languages in the era of large language models.
Advisors
Shareghi, Ehsan
