Computationally Efficient Active Learning for Large Imbalanced Datasets
Accepted version
Peer-reviewed
Abstract
Data collection for imbalanced text classification tasks is challenging: the minority class naturally occurs rarely, so gathering a large pool of unlabelled data is often essential to capture minority instances. Standard pool-based active learning is either computationally expensive, due to the repeated evaluation of the model on a large pool, or reaches low accuracy due to the learning challenges stemming from the imbalance. We propose a novel pool filtering method that scales active learning to large pools while addressing class imbalance. \name uses the semantic representation capabilities of language models to explore the input space and selects the pool instances closest to class-specific \textit{anchors} dynamically chosen from the labelled set. The active learning strategy runs on a fixed-size, smaller, and more balanced subset of the pool in each iteration, resulting in a constant instance selection time independent of the original pool size. Moreover, the filtering promotes the discovery of minority instances and prevents overfitting to the initial labelled set. Across experiments spanning binary and multiclass classification tasks, active learning strategies, and model types and sizes, \name is faster, often reducing total instance selection time from hours to minutes, while outperforming competing methods.
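The core filtering step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes precomputed embeddings (e.g. from a sentence encoder), samples a few anchors per class from the labelled set, and keeps the pool instances nearest to any anchor under cosine distance. The function name, anchor count, and distance choice are illustrative assumptions.

```python
import numpy as np

def filter_pool(pool_emb, labelled_emb, labelled_y,
                subpool_size, anchors_per_class=5, seed=None):
    """Return indices of a fixed-size subpool of the unlabelled pool.

    Illustrative sketch: anchors are sampled uniformly from each class
    of the labelled set; pool instances are ranked by cosine distance
    to their nearest anchor, so regions around every class (including
    the minority) are represented in the subpool.
    """
    rng = np.random.default_rng(seed)
    anchor_idx = []
    for c in np.unique(labelled_y):
        idx = np.flatnonzero(labelled_y == c)
        # sample up to `anchors_per_class` anchors from this class
        k = min(anchors_per_class, idx.size)
        anchor_idx.extend(rng.choice(idx, size=k, replace=False))
    anchors = labelled_emb[np.array(anchor_idx)]

    # cosine distance on L2-normalised embeddings
    pool_n = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    anch_n = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    dists = 1.0 - pool_n @ anch_n.T          # (n_pool, n_anchors)
    nearest = dists.min(axis=1)              # distance to closest anchor

    # keep the subpool_size instances closest to any anchor
    return np.argsort(nearest)[:subpool_size]
```

The active learning strategy would then score only the returned subpool each iteration, which is what keeps the per-iteration selection cost constant regardless of how large the original pool grows.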

