
Statistical methods for multi-omic data integration





Cabassi, Alessandra


This thesis focuses on the development of new ways to integrate multiple ’omic datasets in the context of precision medicine. Analyses of this type have the potential to help researchers deepen their understanding of the biological mechanisms underlying disease. However, integrative studies pose several challenges, because the ’omic layers typically differ widely in number of predictors, type of data, and level of noise.

In this work, we first tackle the problem of performing variable selection and building supervised models while integrating multiple ’omic datasets of different types. It has recently been shown that applying classical logistic regression with an elastic-net penalty to these datasets can lead to poor results. Therefore, we suggest a two-step approach to multi-omic logistic regression in which variable selection is performed on each layer separately and a predictive model is subsequently built on the ensemble of the selected variables.
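As a rough illustration of the two-step idea, the following Python sketch selects variables in each layer separately and then fits a single model on the selected variables. It is only a sketch under stated assumptions: the synthetic data, the layer names, the use of scikit-learn, the regularisation settings (`C`, `l1_ratio`), and the plain logistic regression in the second step are all hypothetical choices made here, not details taken from the thesis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Two hypothetical 'omic layers with different numbers of predictors.
layers = {
    "layer_a": rng.normal(size=(n, 30)),
    "layer_b": rng.normal(size=(n, 80)),
}

# Synthetic binary outcome driven by one variable in each layer.
signal = 2.0 * layers["layer_a"][:, 0] - 2.0 * layers["layer_b"][:, 3]
y = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)


def select_variables(X, y, C=0.1, l1_ratio=0.5):
    """Step 1: fit an elastic-net logistic regression on a single layer
    and return the column indices with nonzero coefficients."""
    X_std = StandardScaler().fit_transform(X)
    model = LogisticRegression(
        penalty="elasticnet", solver="saga",
        l1_ratio=l1_ratio, C=C, max_iter=5000,
    )
    model.fit(X_std, y)
    return np.flatnonzero(model.coef_[0])


# Step 1: variable selection on each layer separately.
selected = {name: select_variables(X, y) for name, X in layers.items()}

# Step 2: one predictive model on the selected variables from all layers.
X_sel = np.hstack([layers[name][:, idx] for name, idx in selected.items()])
final_model = LogisticRegression(max_iter=1000).fit(X_sel, y)

# Training accuracy only; a real analysis would use held-out data.
train_acc = final_model.score(X_sel, y)
```

Selecting within each layer keeps a layer with many predictors from crowding out a smaller one at the selection stage, which is one plausible reading of why the layers are treated separately.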

In the unsupervised setting, we first examine cluster of clusters analysis (COCA), an integrative clustering approach that combines information from multiple data sources. COCA has been widely applied in the context of tumour subtyping, but its properties have never been systematically explored before, and its robustness to the inclusion of noisy datasets is unclear. Then, we propose a new statistical method for the unsupervised integration of multi-omic data, called kernel learning integrative clustering (KLIC). This approach is based on the idea of framing the challenge of combining clustering structures as a multiple kernel learning problem, in which different datasets each provide a weighted contribution to the final clustering.
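The notion of a weighted contribution from each dataset can be sketched in Python as a convex combination of per-dataset kernel matrices. This is a minimal illustration, not the KLIC algorithm: the co-clustering kernel, the function names, the toy labels, and the hand-fixed weights are all assumptions made here, whereas the abstract describes the weighting as part of a multiple kernel learning problem.

```python
import numpy as np


def coclustering_kernel(labels):
    """Kernel with entry (i, j) equal to 1 if items i and j share a
    cluster label in one dataset's clustering, and 0 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)


def combine_kernels(kernels, weights):
    """Weighted sum of kernel matrices, with nonnegative weights that
    sum to one (hypothetical helper; weights are fixed, not learned)."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be nonnegative and sum to 1")
    return sum(w * K for w, K in zip(weights, kernels))


# Toy clusterings of the same four items obtained from two datasets.
labels_a = [0, 0, 1, 1]
labels_b = [0, 1, 1, 1]

K = combine_kernels(
    [coclustering_kernel(labels_a), coclustering_kernel(labels_b)],
    weights=[0.75, 0.25],
)
```

With these weights, items 0 and 1 have combined similarity 0.75 because only the first clustering groups them together, so a dataset with a larger weight has more influence on any clustering derived from `K`.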

Finally, we build upon the notion of the posterior similarity matrix (PSM) in order to suggest new approaches for summarising the output of MCMC algorithms for Bayesian mixture models. A key contribution of our work is the observation that PSMs can be used to define probabilistically motivated kernel matrices that capture the clustering structure present in the data. This observation enables us to employ a range of kernel methods to obtain summary clusterings and, if we have multiple PSMs, to use standard methods for combining kernels in order to perform integrative clustering. We also show that one can embed PSMs within predictive kernel models in order to perform outcome-guided clustering.
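For readers unfamiliar with the PSM, the following Python sketch estimates one from MCMC output in the usual way, as the fraction of draws in which each pair of items is allocated to the same cluster. The function name and the toy draws are hypothetical, and the code is an illustration rather than the implementation used in the thesis.

```python
import numpy as np


def posterior_similarity_matrix(allocations):
    """Estimate the PSM from cluster allocations of shape
    (n_draws, n_items): entry (i, j) is the fraction of draws in
    which items i and j share a cluster label."""
    allocations = np.asarray(allocations)
    same_cluster = allocations[:, :, None] == allocations[:, None, :]
    return same_cluster.mean(axis=0)


# Four toy MCMC draws of cluster labels for three items.
draws = [
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
]
psm = posterior_similarity_matrix(draws)

# Smallest eigenvalue; nonnegative (up to rounding) for a valid kernel.
min_eigenvalue = np.linalg.eigvalsh(psm).min()
```

Each draw contributes a block-structured 0/1 co-clustering matrix, which is positive semi-definite, and an average of such matrices is again positive semi-definite. That is why the eigenvalue check passes and why a matrix built this way can be passed to kernel methods.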





Kirk, Paul


precision medicine, genomics, data integration


Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge
MRC (1795323)