
Learning with less: machine learning techniques for scarce medical data



Abstract

Deep learning has demonstrated significant potential in medical applications. However, its success typically depends on large, clean datasets, a condition rarely met in real-world medical settings, where data is often scarce, incomplete, or expensive to collect. These limitations affect medical machine learning throughout the entire modelling pipeline, from limited training data to expensive measurements at deployment. This thesis presents a set of techniques that address these challenges at every stage of the pipeline.

The first contribution focusses on what insight can be gained from the available data before training a large predictive model. We introduce CompFS, an ensemble approach for discovering groups of features that are jointly predictive but individually uninformative. This approach generalises the problem of traditional feature selection and provides a way for domain experts to identify interactions between features, such as geneticists discovering epistatic effects in disease. The insight gained can then guide subsequent modelling or aid scientists in building mechanistic models.
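The notion of features that are jointly predictive but individually uninformative can be illustrated with a toy XOR example. The sketch below is illustrative only and is not the CompFS algorithm: a majority-vote classifier built on either feature alone does no better than chance, while the pair predicts the label perfectly.

```python
import random
from collections import Counter, defaultdict

random.seed(0)
n = 10_000
data = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(n)]
labels = [x1 ^ x2 for x1, x2 in data]  # target is the XOR of the two features

def best_accuracy(inputs):
    # Majority-vote classifier: for each observed input value,
    # predict the most common label seen with that value.
    counts = defaultdict(Counter)
    for x, y in zip(inputs, labels):
        counts[x][y] += 1
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return correct / n

acc_f1 = best_accuracy([x1 for x1, _ in data])  # feature 1 alone: ~0.5 (chance)
acc_f2 = best_accuracy([x2 for _, x2 in data])  # feature 2 alone: ~0.5 (chance)
acc_pair = best_accuracy(data)                  # both together: 1.0
print(acc_f1, acc_f2, acc_pair)
```

Univariate feature selection would discard both features here; a group-level view is needed to recover the interaction, which is the setting CompFS targets.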

The second contribution focusses on generating synthetic training data to address the challenge of limited data for training survival models, a tool for predicting patient outcomes. We discover three common failure modes when generating synthetic survival data, and introduce three corresponding metrics for evaluating its quality. Additionally, we introduce SurvivalGAN, a GAN-based generative model that produces synthetic survival data for use when access to real training datasets is limited.

In the third contribution we investigate active feature acquisition, another generalisation of feature selection. Here, features are expensive to measure at deployment time, and we must select which feature to measure next to improve the prediction. An example is in diagnosis, where a doctor will prioritise which tests to conduct based on their understanding of a specific patient's condition. We introduce SEFA, a latent variable model that addresses the drawbacks associated with existing methods. This tool can be used to improve the efficiency of feature usage during deployment, enabling machine learning systems to make better decisions under resource constraints.
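A common baseline for active feature acquisition, distinct from SEFA, is a greedy strategy that measures the feature minimising the expected posterior entropy of the prediction. The sketch below implements this under a toy naive Bayes diagnosis model with hypothetical test characteristics (all numbers are invented for illustration):

```python
import math

# Toy diagnostic model: binary disease D and three binary tests with
# known outcome probabilities. All parameters are hypothetical.
prior = 0.3                      # P(D = 1)
p_pos = {                        # P(test = 1 | D): (given D=0, given D=1)
    "test_a": (0.10, 0.90),
    "test_b": (0.20, 0.60),
    "test_c": (0.45, 0.55),      # nearly uninformative
}

def posterior(observed):
    """P(D = 1 | observed test results), by naive Bayes."""
    l0, l1 = 1 - prior, prior
    for t, r in observed.items():
        q0, q1 = p_pos[t]
        l0 *= q0 if r else 1 - q0
        l1 *= q1 if r else 1 - q1
    return l1 / (l0 + l1)

def entropy(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def expected_entropy_after(observed, test):
    """Expectation over the test outcome of the posterior entropy of D."""
    p1 = posterior(observed)
    q0, q1 = p_pos[test]
    p_t = q1 * p1 + q0 * (1 - p1)   # predictive P(test = 1)
    h_pos = entropy(posterior({**observed, test: 1}))
    h_neg = entropy(posterior({**observed, test: 0}))
    return p_t * h_pos + (1 - p_t) * h_neg

def next_test(observed):
    """Greedily pick the unmeasured test with lowest expected entropy."""
    remaining = [t for t in p_pos if t not in observed]
    return min(remaining, key=lambda t: expected_entropy_after(observed, t))

print(next_test({}))  # the most informative first measurement
```

This greedy myopic rule is a standard reference point; methods such as SEFA aim to overcome its limitations, for example by modelling dependence between unobserved features.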

Finally, in the fourth contribution, we come full circle back to the "zeroth" stage of the machine learning pipeline. Here we consider the practical, overarching decision of what data to actually use for a problem, before any collection, modelling or analysis. We conduct a study on a multiple sclerosis prediction task, exploring the option of using inexpensive features that are abundant during training and cheap to measure at deployment. We benchmark the performance of four neural differential equation based models, showing that effective medical modelling need not rely on scarce, expensive data.

Across all contributions, we run extensive evaluations using real-world tabular and biomedical datasets, comparing to state-of-the-art baselines. Our methods consistently outperform or match these baselines. Thorough ablations and sensitivity analyses validate the design choices made in each contribution. Together, these contributions form a diverse set of strategies for tackling data scarcity in medicine. Additionally, many of the methods treat data abstractly and generalise beyond this domain. By focussing on the different challenges that arise throughout the modelling process, this thesis provides both practical tools and a broad foundation for future research.

Date

2025-08-16

Advisors

Lio, Pietro

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights and licensing

Except where otherwise noted, this item's license is described as Attribution 4.0 International (CC BY 4.0)

Sponsorship

GSK plc