Tabular Machine Learning on Small-Size and High-Dimensional Data


Abstract

This thesis introduces four novel methods to improve the generalisation of machine learning models on small-size and high-dimensional tabular datasets. Tabular data, organised as tables where each row represents an individual record and each column a feature, is ubiquitous in critical fields such as medicine, scientific research and finance. However, these areas often face data scarcity due to difficulties in data acquisition, making it challenging to gather large sample sizes. Conversely, new acquisition technologies make it possible to collect high-dimensional data, leading to datasets where the number of features greatly exceeds the number of samples. Data scarcity and high dimensionality present significant challenges for machine learning models, primarily due to the increased risk of overfitting arising from the curse of dimensionality and the limited data available to adequately represent the underlying distribution. Existing approaches often struggle to generalise effectively in such scenarios, resulting in suboptimal performance. Training models on small and high-dimensional datasets therefore requires methods designed to address these limitations and to generalise more effectively from limited data.

We introduce two new model-centric approaches to address overfitting in neural networks trained on small-size and high-dimensional data. Our key innovation is to mitigate overfitting by constraining model parameters through shared auxiliary networks, which capture latent relationships in tabular data and partially determine the predictor model's parameters, thereby reducing its degrees of freedom. First, we introduce WPFS, a parameter-efficient architecture that imposes hard parameter-sharing via weight predictor networks. Second, we introduce GCondNet, a method that uses Graph Neural Networks (GNNs) to facilitate soft parameter-sharing in an underlying predictor model. When applied to biomedical tabular datasets, these methods improve predictive performance, primarily by reducing overfitting.
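To make the hard parameter-sharing idea concrete, the following is a minimal PyTorch sketch in the spirit of WPFS: a small shared auxiliary network generates the predictor's first-layer weight matrix from per-feature embeddings, so the number of trainable parameters no longer scales with the feature count. The class names, layer sizes, and fixed random feature embeddings are illustrative assumptions for this sketch, not the thesis implementation.

```python
# Minimal sketch in the spirit of WPFS; names, sizes, and the fixed
# random feature embeddings are illustrative assumptions.
import torch
import torch.nn as nn

class WeightPredictor(nn.Module):
    """Small shared auxiliary network: maps each per-feature embedding to
    the hidden_dim weights connecting that feature to the hidden layer."""
    def __init__(self, embed_dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 32), nn.ReLU(), nn.Linear(32, hidden_dim)
        )

    def forward(self, feature_embeddings):            # (D, embed_dim)
        return self.net(feature_embeddings)           # (D, hidden_dim)

class TinyPredictor(nn.Module):
    """Predictor whose first-layer weights (D x hidden_dim) are generated
    by the shared auxiliary network instead of learned freely, reducing
    the model's degrees of freedom when D is large."""
    def __init__(self, num_features, embed_dim=16, hidden_dim=64, num_classes=2):
        super().__init__()
        # Fixed per-feature embeddings (random placeholders here; in
        # practice they would be derived from the data itself).
        self.register_buffer("feature_embeddings",
                             torch.randn(num_features, embed_dim))
        self.weight_predictor = WeightPredictor(embed_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                             # x: (batch, D)
        W1 = self.weight_predictor(self.feature_embeddings)  # (D, hidden_dim)
        h = torch.relu(x @ W1)
        return self.head(h)

model = TinyPredictor(num_features=5000)              # D >> N regime
logits = model(torch.randn(8, 5000))                  # (8, 2)
```

The same principle underlies the soft variant: rather than fully generating the weights, GCondNet uses GNNs over graphs constructed from the data to regularise them.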

While relying solely on model-centric approaches is common, integrating data-centric methods can provide additional performance gains, particularly in data-scarce tasks. To this end, we introduce two novel data augmentation methods that generate synthetic data to increase the size and diversity of the training set, capturing more of the variability in the underlying data distribution. Our key innovation is to transform pre-trained tabular classifiers into data generators, leveraging their pre-training information in two novel ways. The first method, TabEBM, constructs dedicated class-specific Energy-Based Models (EBMs) to approximate class-conditional distributions, which are then used to generate additional training data. The second method, TabMDA, introduces in-context subsetting (ICS), a technique that enables label-invariant transformations within the manifold space learned by pre-trained in-context classifiers, effectively enlarging the training dataset. Both methods are general, fast, require no additional training, and can be applied to any downstream predictor model. They consistently improve classification performance, especially on small datasets.
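As an illustration of the class-specific EBM idea, the sketch below generates synthetic rows for one class by running stochastic gradient Langevin dynamics on an energy function derived from a frozen classifier's logits. This is a generic sketch under stated assumptions (the energy definition, the step sizes, and the sample_from_class_ebm helper are hypothetical), not the TabEBM implementation.

```python
# Generic sketch of class-conditional EBM sampling via Langevin dynamics,
# in the spirit of TabEBM; the energy definition (negative class logit of
# a frozen classifier) and all hyperparameters are assumptions.
import torch

def sample_from_class_ebm(classifier, class_idx, num_samples, num_features,
                          steps=100, step_size=0.01, noise_scale=0.005):
    """Stochastic gradient Langevin dynamics on E_c(x) = -logit_c(x):
    x <- x - step_size * grad E_c(x) + Gaussian noise."""
    x = torch.randn(num_samples, num_features, requires_grad=True)
    for _ in range(steps):
        energy = -classifier(x)[:, class_idx].sum()   # E_c summed over batch
        grad, = torch.autograd.grad(energy, x)
        with torch.no_grad():
            x -= step_size * grad                     # move downhill in energy
            x += noise_scale * torch.randn_like(x)    # Langevin noise
    return x.detach()                                 # synthetic rows for class c

# Usage: append the generated rows (labelled class_idx) to the real
# training set before fitting any downstream predictor.
# clf = ...   # any frozen, differentiable classifier returning logits
# synth = sample_from_class_ebm(clf, class_idx=0, num_samples=64, num_features=30)
```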

Overall, this thesis advances machine learning by opening new directions for mitigating overfitting and for generating and augmenting tabular data. Our techniques have immediate applications in fields such as medicine, finance, and scientific research, where data scarcity and high dimensionality are common obstacles. By showing that more effective learning is possible even with limited data, this work paves the way for a future where such constraints no longer hinder the application of machine learning.

Date

2024-11-17

Advisors

Jamnik, Mateja
Simidjievski, Nikola

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights and licensing

Except where otherwise noted, this item's license is described as All rights reserved

Sponsorship

ESRC (2615996)
Economic and Social Research Council (ESRC), Cambridge Doctoral Training Partnership (DTP)