Artificial Intelligence for Global Complete Blood Count Data: Diagnosis, Harmonisation, and Multi-Site Collaboration


Authors

Abstract

Every day, healthcare systems worldwide generate terabytes of rich diagnostic data, and then discard almost all of it. The complete blood count (CBC), the most frequently performed medical laboratory test in the world, produces high-dimensional single-cell measurements during every routine analysis. Yet only a handful of summary statistics ever reach a clinician's report. With 3.6 billion tests performed annually, this represents an extraordinary volume of wasted information. In this thesis, I address foundational challenges in developing international multi-site machine learning (ML) models to extract clinical value from this extended CBC data, which is already collected but never used, without any alteration to established clinical pathways.

My work comprises three overarching research areas: disease detection using extended CBC data; characterisation and mitigation of domain shifts in CBC data; and federated learning to enable privacy-preserving multi-site ML model development.

Iron deficiency, a major contributor to the global burden of disease and the leading cause of anaemia and years lived with disability worldwide, serves as my primary clinical application. Despite its global prevalence and clinical significance, conventional screening misses many early cases of iron deficiency. I present the first work in the literature to address the task of non-anaemic iron deficiency detection. By leveraging the rich single-cell measurements in raw CBC data and extended tabular parameters, we achieved a nearly fourfold improvement in sensitivity compared to conventional reference-interval-based screening (79.3% versus 21.9%). We develop our models on the largest dataset used for training and validation of iron deficiency ML models, increasing training dataset size eightfold compared to the previous largest study.

Identifying challenges in multi-site deployment, I demonstrate that analyser differences, invisible in standard CBC data, produce significant effects in extended CBC measurements: analyser identification achieved 96.4% accuracy, compared to only 88.3% for biological sex classification, despite ISO 17511-traceable calibration between analysers. To address this, we introduce the Dis-AE architecture, a novel neural network approach that handles multiple simultaneous domain shift effects through a domain-instance grouping paradigm.

We conduct a systematic review of federated learning for healthcare, the largest such review at the time of submission, covering 89 papers, compared to 44 in the next-largest. We then propose FedMAP, a federated learning framework based on maximum a posteriori estimation with learnable input-convex neural network priors. FedMAP outperforms existing federated and personalised federated learning methods on three large-scale clinical datasets, showing computational scalability for healthcare federations exceeding 300 sites and 300,000 patient records. Geographical analysis reveals that underperforming regions achieve up to 14.3% performance gains, demonstrating the potential of FedMAP to reduce rather than amplify healthcare disparities.

Together, these results demonstrate that extended CBC data contains substantially more diagnostic power than the standard report; that modern ML methods can capture this information; that domain shifts arising from different analysers can be characterised and mitigated; and that federated learning can enable privacy-preserving collaborative model development without the need to share sensitive patient data.

Date

2026-01-02

Advisors

Roberts, Michael

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights and licensing

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

Sponsorship

Trinity Challenge BloodCounts! Award