Analysing and Mitigating Classification Bias for Text-based Foundation Models
Abstract
The objective of text classification is to categorise texts into one of several pre-defined classes. Text classification is a standard natural language processing (NLP) task with broad applicability across many domains, such as analysing the evolving sentiment of users on a platform, identifying and filtering fraudulent reviews, or extracting useful features in a pipeline. While text classification has traditionally been performed manually, the rapid advancement of deep learning approaches has sparked considerable interest in developing automatic text classifiers. This has been accelerated by the introduction of pre-trained large language models (LLMs): models trained on vast quantities of digital textual data, enabling unprecedented capabilities across a diverse range of natural language understanding and generation tasks. These NLP foundation models are typically leveraged within two main methodologies: the "pre-train and fine-tune" paradigm, or the prompting of instruction-following LLMs. While these approaches have been widely applied and have demonstrated state-of-the-art performance across NLP benchmarks, concerns remain about their reliability, particularly regarding their susceptibility to spurious correlations and implicit model bias.
This thesis analyses the forms of bias and spurious correlations that arise when NLP foundation models are applied to text classification tasks. We analyse particular biases present in these systems, determine what impact they have on predictions, and examine whether mitigation techniques can reduce their influence. The first part of the thesis focuses on the pre-train and fine-tune methodology, where we analyse the risk of systems learning spurious correlations by examining two popular NLP tasks: sentiment classification and multiple choice reading comprehension (MCRC). For sentiment classification, we demonstrate that fine-tuned systems may exploit spurious stopword relationships arising from the stopword distribution in the training data. For MCRC, we find that models are prone to ignoring the contextual passage and may instead use world knowledge to solve the task, a behaviour we leverage to assess question quality.
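To make the MCRC observation concrete, the sketch below probes whether multiple-choice questions can be answered without the passage; accuracy well above random then signals that world knowledge alone suffices. This is a minimal illustration only: the model (gpt2), prompt format, and scoring by average per-token log-likelihood are assumptions for exposition, not the thesis's exact experimental setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def option_log_likelihood(question, option):
    # Average per-token log-likelihood of the option, conditioned on the
    # question alone (no passage is shown to the model).
    prompt = f"Question: {question}\nAnswer:"
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # row i predicts token i+1
    option_tokens = full_ids[0, prompt_len:]               # the option's tokens
    token_lls = log_probs[prompt_len - 1:].gather(-1, option_tokens.unsqueeze(-1))
    return token_lls.mean().item()

def context_free_prediction(question, options):
    # Pick the option the model finds most likely without ever seeing the passage.
    scores = [option_log_likelihood(question, o) for o in options]
    return max(range(len(options)), key=scores.__getitem__)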
In the second part of the thesis, we examine the biases present when instruction-following LLMs are prompted zero-shot for classification. We analyse the prompt-based classifier setup and the biases these systems exhibit when applied to standard text classification and multiple choice question answering (MCQA) tasks. For text classification, we demonstrate that the choice of label words can induce an implicit prior that favours particular classes over others, severely impacting the performance of these systems. However, by considering reweighting debiasing schemes, we demonstrate that both zero-resource debiasing and our proposed unsupervised reweighting scheme yield more robust performance and reduce sensitivity to the choice of label words. Further, we illustrate that when instruction-following LLMs are prompted for MCQA, they can exhibit considerable permutation bias: the systems are sensitive to the ordering of the input options, which also degrades task performance. We show that permutation debiasing improves performance significantly, and we propose a simple distillation framework that retains these gains while avoiding the cost of debiasing at inference time.
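The reweighting idea for prompt-based classifiers can be sketched as follows: estimate the classifier's implicit label-word prior from a content-free ("null") input and reweight every prediction by its inverse. The model, prompt template, label words, and null input below are illustrative assumptions; since no labelled data is used, the sketch corresponds to the zero-resource flavour of debiasing mentioned above.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

TEMPLATE = "Review: {text}\nSentiment:"
LABEL_WORDS = [" negative", " positive"]  # leading space matters for GPT-2's BPE
LABEL_IDS = [tokenizer.encode(w)[0] for w in LABEL_WORDS]

def class_probs(text):
    # Probability mass the model places on each label word, renormalised.
    ids = tokenizer(TEMPLATE.format(text=text), return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    p = torch.softmax(logits, dim=-1)[LABEL_IDS]
    return p / p.sum()

# Estimate the implicit prior from a content-free input, then reweight
# each prediction by the inverse of that prior.
prior = class_probs("N/A")

def debiased_prediction(text):
    return int(torch.argmax(class_probs(text) / prior))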
The thesis concludes by considering biases for new tasks and domains. We propose LLM comparative assessment, a general and effective approach to zero-shot natural language generation (NLG) assessment in which LLMs are prompted to make pairwise quality decisions. We find that position bias is also present in LLM comparative assessment, and demonstrate that averaging the probabilities over both permutations of a pair yields more accurate decisions and final rankings. We then extend this approach with a product-of-experts framework for LLM comparative assessment, enabling faster convergence with fewer comparisons, and investigate accounting for bias within the experts, which improves performance when only a small number of comparisons is used. Finally, we examine whether our debiasing approaches generalise to other modalities, particularly audio. We propose a novel way to leverage the emergent abilities of ASR foundation models for zero-shot audio classification and demonstrate that our proposed reweighting debiasing approaches remain effective in the audio modality as well.
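The permutation-averaging idea for pairwise assessment can be sketched as follows (again a minimal illustration; the model, comparison prompt, and answer tokens " A"/" B" are assumptions): query the comparison in both orderings and average the resulting probabilities, so that any fixed preference for one position cancels out.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

TEMPLATE = ("Response A: {a}\nResponse B: {b}\n"
            "Which response is better, A or B? Answer:")
ID_A, ID_B = tokenizer.encode(" A")[0], tokenizer.encode(" B")[0]

def p_first_better(first, second):
    # P(the response shown in position A is preferred), from the " A"/" B" logits.
    ids = tokenizer(TEMPLATE.format(a=first, b=second), return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    p = torch.softmax(logits, dim=-1)[[ID_A, ID_B]]
    return (p[0] / p.sum()).item()

def debiased_preference(x, y):
    # Average over both orderings so a fixed positional preference cancels:
    # returns P(x is better than y).
    return 0.5 * (p_first_better(x, y) + (1.0 - p_first_better(y, x)))

By construction, debiased_preference(x, y) and debiased_preference(y, x) sum to one, so the averaged decision is symmetric in the two responses regardless of how strongly the underlying model favours either position.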

