Computationally Efficient Active Learning for Large Imbalanced Datasets
Accepted version
Peer-reviewed
Abstract
Data collection for imbalanced text classification tasks is challenging: the minority class naturally occurs rarely, so gathering a large pool of unlabelled data is often essential to capture minority instances. Standard pool-based active learning is either computationally expensive, due to the repeated evaluation of the model on a large pool, or reaches low accuracy due to the learning challenges stemming from the imbalance. We propose a novel pool filtering method that scales active learning to large pools while addressing class imbalance. \name uses the semantic representation capabilities of language models to explore the input space and selects the pool instances closest to class-specific \textit{anchors} dynamically chosen from the labelled set. The active learning strategy runs on a fixed-size, smaller, and more balanced subset of the pool in each iteration, resulting in a constant instance selection time independent of the original pool size. Moreover, the filtering promotes the discovery of minority instances and prevents overfitting to the initial labelled set. Across experiments spanning binary and multiclass classification tasks, active learning strategies, and model types and sizes, \name is faster, often reducing total instance selection time from hours to minutes, while outperforming competing methods.
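The core filtering step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes precomputed embeddings (e.g. from a sentence encoder), samples a few anchors per class from the labelled set, and keeps the pool instances nearest to any anchor under cosine distance. The function name, anchor count, and distance choice are illustrative assumptions.

```python
import numpy as np

def filter_pool(pool_emb, labelled_emb, labelled_y,
                subpool_size, anchors_per_class=5, seed=None):
    """Return indices of a fixed-size subpool of the unlabelled pool.

    Illustrative sketch: anchors are sampled uniformly from each class
    of the labelled set; pool instances are ranked by cosine distance
    to their nearest anchor, so regions around every class (including
    the minority) are represented in the subpool.
    """
    rng = np.random.default_rng(seed)
    anchor_idx = []
    for c in np.unique(labelled_y):
        idx = np.flatnonzero(labelled_y == c)
        # sample up to `anchors_per_class` anchors from this class
        k = min(anchors_per_class, idx.size)
        anchor_idx.extend(rng.choice(idx, size=k, replace=False))
    anchors = labelled_emb[np.array(anchor_idx)]

    # cosine distance on L2-normalised embeddings
    pool_n = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    anch_n = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    dists = 1.0 - pool_n @ anch_n.T          # (n_pool, n_anchors)
    nearest = dists.min(axis=1)              # distance to closest anchor

    # keep the subpool_size instances closest to any anchor
    return np.argsort(nearest)[:subpool_size]
```

The active learning strategy would then score only the returned subpool each iteration, which is what keeps the per-iteration selection cost constant regardless of how large the original pool grows.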

