Bayesian matrix factorisation: inference, priors, and data integration

Brouwer, Thomas Alexander

Bayesian matrix factorisation: inference, priors, and data integration

Repository URI

https://www.repository.cam.ac.uk/handle/1810/269921

Repository DOI

https://doi.org/10.17863/CAM.16797

Files

Thesis (14.48 MB)

Type

Thesis

Authors

Brouwer, Thomas Alexander

Abstract

In recent years the amount of biological data has increased exponentially. Most of these data can be represented as matrices relating two different entity types, such as drug-target interactions (relating drugs to protein targets), gene expression profiles (relating drugs or cell lines to genes), and drug sensitivity values (relating drugs to cell lines). Not only the size of these datasets is increasing, but also the number of different entity types that they relate. Furthermore, not all values in these datasets are typically observed, and some are very sparse.

Matrix factorisation is a popular group of methods that can be used to analyse these matrices. The idea is that each matrix can be decomposed into two or more smaller matrices, such that their product approximates the original one. This factorisation of the data reveals patterns in the matrix, and gives us a lower-dimensional representation. Not only can we use this technique to identify clusters and other biological signals, we can also predict the unobserved entries, allowing us to prune biological experiments.

In this thesis we introduce and explore several Bayesian matrix factorisation models, focusing on how to best use them for predicting these missing values in biological datasets. Our main hypothesis is that matrix factorisation methods, and in particular Bayesian variants, are an extremely powerful paradigm for predicting values in biological datasets, as well as other applications, and especially for sparse and noisy data. We demonstrate the competitiveness of these approaches compared to other state-of-the-art methods, and explore the conditions under which they perform the best.

We consider several aspects of the Bayesian approach to matrix factorisation. Firstly, the effect of inference approaches that are used to find the factorisation on predictive performance. Secondly, we identify different likelihood and Bayesian prior choices that we can use for these models, and explore when they are most appropriate. Finally, we introduce a Bayesian matrix factorisation model that can be used to integrate multiple biological datasets, and hence improve predictions. This model hybridly combines different matrix factorisation models and Bayesian priors. Through these models and experiments we support our hypothesis and provide novel insights into the best ways to use Bayesian matrix factorisation methods for predictive purposes.

Advisors

Lio, Pietro

Keywords

Matrix factorisation, Machine learning, Bayesian statistics, Bioinformatics

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights

Sponsorship

UK Engineering and Physical Sciences Research Council (EPSRC), grant reference EP/M506485/1.

Collections

Theses - Computer Science and Technology