Improving Parameter-Efficient Cross-Lingual Transfer for Low-Resource Languages

Parovic, Marinela

doi:https://doi.org/10.17863/CAM.111817

Repository URI

https://www.repository.cam.ac.uk/handle/1810/373322

Repository DOI

https://doi.org/10.17863/CAM.111817

Files

Primary Thesis (3.95 MB)

Type

Thesis

Authors

Parovic, Marinela

Abstract

The rapid development and real-world adoption of natural language processing models in recent years underscores the imperative to develop such models in different languages. This endeavour aims to give a broad spectrum of individuals access to emerging technologies, irrespective of their language. The most challenging scenario is arguably that of low-resource languages, which often lack labelled data and have only a limited amount of unlabelled data. The open question is how best to use the available data sources to achieve good performance across varying degrees of resource scarcity. In this thesis, this question is addressed from two different perspectives in the context of modular and parameter-efficient approaches to cross-lingual transfer. Firstly, we propose different strategies for adapting to a low-resource target language within the existing zero-shot cross-lingual transfer paradigm. This involves leveraging unlabelled data in the target language in conjunction with other standard data sources to augment model performance for a specific task in that target language. Despite the underlying assumption that labelled data is absent, our findings demonstrate that even exclusive reliance on unlabelled data enhances task performance in the target language. In addition, we study the trade-offs between modularity and performance across the proposed methods. Secondly, we explore several approaches to combining various data sources in a few-shot setting, assuming the availability of a limited amount of labelled data in the target language. Our results illustrate that combining this data with larger amounts of lower-quality labelled data acquired through translation, along with unlabelled data, yields large performance gains for low-resource target languages when integrated with existing cross-lingual transfer tools.
The outcomes of this research show the feasibility of refining existing methods for cross-lingual transfer through the implementation of different training procedures and data sources. We hope this thesis will provide valuable insights into cross-lingual transfer and serve as an inspiration for further advancements in models designed for low-resource scenarios.

Date

2023-12-05

Advisors

Korhonen, Anna

Keywords

cross-lingual transfer, multilingual NLP

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Sponsorship

Trinity College

Collections

Theses - Theoretical and Applied Linguistics