Tabular Machine Learning on Small-Size and High-Dimensional Data


Abstract

This thesis introduces four novel methods to improve the generalisation of machine learning models on small-size and high-dimensional tabular datasets. Tabular data, organised as tables where each row represents an individual record and each column a feature, is ubiquitous in critical fields such as medicine, scientific research and finance. However, these areas often face data scarcity due to difficulties in data acquisition, making it challenging to gather large sample sizes. Conversely, new acquisition technologies make it possible to collect high-dimensional data, leading to datasets where the number of features greatly exceeds the number of samples. Data scarcity and high dimensionality present significant challenges for machine learning models, primarily due to the increased risk of overfitting arising from the curse of dimensionality and the limited data available to adequately represent the underlying distribution. Existing approaches often struggle to generalise effectively in such scenarios, resulting in suboptimal performance. Training models on small and high-dimensional datasets therefore requires methods designed to address these limitations and to generalise more effectively from limited data.

We introduce two new model-centric approaches to address overfitting in neural networks trained on small-size and high-dimensional data. Our key innovation is to mitigate overfitting by constraining model parameters through shared auxiliary networks, which capture latent relationships in tabular data and partially determine the predictor model's parameters, thereby reducing its degrees of freedom. First, we introduce WPFS, a parameter-efficient architecture that imposes hard parameter-sharing via weight predictor networks. Second, we introduce GCondNet, a method that uses Graph Neural Networks (GNNs) to facilitate soft parameter-sharing in an underlying predictor model. When applied to biomedical tabular datasets, these methods improve predictive performance, primarily by reducing overfitting.
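To make the hard parameter-sharing idea concrete, the following is a minimal PyTorch sketch in the spirit of WPFS: a small shared auxiliary network generates the predictor's first-layer weight matrix from per-feature embeddings, so the number of trainable parameters no longer scales with the feature count. The class names, layer sizes, and fixed random feature embeddings are illustrative assumptions for this sketch, not the thesis implementation.

```python
# Minimal sketch in the spirit of WPFS; names, sizes, and the fixed
# random feature embeddings are illustrative assumptions.
import torch
import torch.nn as nn

class WeightPredictor(nn.Module):
    """Small shared auxiliary network: maps each per-feature embedding to
    the hidden_dim weights connecting that feature to the hidden layer."""
    def __init__(self, embed_dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 32), nn.ReLU(), nn.Linear(32, hidden_dim)
        )

    def forward(self, feature_embeddings):            # (D, embed_dim)
        return self.net(feature_embeddings)           # (D, hidden_dim)

class TinyPredictor(nn.Module):
    """Predictor whose first-layer weights (D x hidden_dim) are generated
    by the shared auxiliary network instead of learned freely, reducing
    the model's degrees of freedom when D is large."""
    def __init__(self, num_features, embed_dim=16, hidden_dim=64, num_classes=2):
        super().__init__()
        # Fixed per-feature embeddings (random placeholders here; in
        # practice they would be derived from the data itself).
        self.register_buffer("feature_embeddings",
                             torch.randn(num_features, embed_dim))
        self.weight_predictor = WeightPredictor(embed_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                             # x: (batch, D)
        W1 = self.weight_predictor(self.feature_embeddings)  # (D, hidden_dim)
        h = torch.relu(x @ W1)
        return self.head(h)

model = TinyPredictor(num_features=5000)              # D >> N regime
logits = model(torch.randn(8, 5000))                  # (8, 2)
```

The same principle underlies the soft variant: rather than fully generating the weights, GCondNet uses GNNs over graphs constructed from the data to regularise them.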

While relying solely on model-centric approaches is common, integrating data-centric methods can provide additional performance gains, particularly in data-scarce tasks. To this end, we introduce two novel data augmentation methods that generate synthetic data to increase the size and diversity of the training set, capturing more of the variability in the underlying data distribution. Our key innovation is to transform pre-trained tabular classifiers into data generators, leveraging their pre-training information in two novel ways. The first method, TabEBM, constructs dedicated class-specific Energy-Based Models (EBMs) to approximate class-conditional distributions, which are then used to generate additional training data. The second method, TabMDA, introduces in-context subsetting (ICS), a technique that enables label-invariant transformations within the manifold space learned by pre-trained in-context classifiers, effectively enlarging the training dataset. Both methods are general, fast, require no additional training, and can be applied to any downstream predictor model. They consistently improve classification performance, especially on small datasets.
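As an illustration of the class-specific EBM idea, the sketch below generates synthetic rows for one class by running stochastic gradient Langevin dynamics on an energy function derived from a frozen classifier's logits. This is a generic sketch under stated assumptions (the energy definition, the step sizes, and the sample_from_class_ebm helper are hypothetical), not the TabEBM implementation.

```python
# Generic sketch of class-conditional EBM sampling via Langevin dynamics,
# in the spirit of TabEBM; the energy definition (negative class logit of
# a frozen classifier) and all hyperparameters are assumptions.
import torch

def sample_from_class_ebm(classifier, class_idx, num_samples, num_features,
                          steps=100, step_size=0.01, noise_scale=0.005):
    """Stochastic gradient Langevin dynamics on E_c(x) = -logit_c(x):
    x <- x - step_size * grad E_c(x) + Gaussian noise."""
    x = torch.randn(num_samples, num_features, requires_grad=True)
    for _ in range(steps):
        energy = -classifier(x)[:, class_idx].sum()   # E_c summed over batch
        grad, = torch.autograd.grad(energy, x)
        with torch.no_grad():
            x -= step_size * grad                     # move downhill in energy
            x += noise_scale * torch.randn_like(x)    # Langevin noise
    return x.detach()                                 # synthetic rows for class c

# Usage: append the generated rows (labelled class_idx) to the real
# training set before fitting any downstream predictor.
# clf = ...   # any frozen, differentiable classifier returning logits
# synth = sample_from_class_ebm(clf, class_idx=0, num_samples=64, num_features=30)
```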

Overall, this thesis advances machine learning by opening new directions for mitigating overfitting and for generating and augmenting tabular data. Our techniques have immediate applications in fields such as medicine, finance, and scientific research, where data scarcity and high dimensionality are common obstacles. By showing that more effective learning is possible even with limited data, this work paves the way for a future where such constraints no longer hinder the application of machine learning.

Date

2024-11-17

Advisors

Jamnik, Mateja
Simidjievski, Nikola

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights and licensing

Except where otherwise noted, this item's license is described as All rights reserved

Sponsorship

ESRC (2615996)
Economic and Social Research Council (ESRC), Cambridge Doctoral Training Partnership (DTP)