Variational Mixture Models for non-Gaussian observations: Applications to molecular data

Gerontogianni, Stavroula

Repository URI

https://www.repository.cam.ac.uk/handle/1810/337911

Repository DOI

https://doi.org/10.17863/CAM.85317

Files

Thesis (3.38 MB)

Type

Thesis

Authors

Gerontogianni, Stavroula

Abstract

Epigenetics is the field of biology that studies changes in organisms caused by alterations in gene expression rather than by modification of the DNA sequence itself. DNA methylation is a well-studied type of epigenetic change, which results in gene silencing and can be dangerous when it occurs at tumour suppressor gene loci. Many techniques have been developed to map the methylation pattern of individuals at several genetic loci, such as the HumanMethylation450 BeadChip, the EPIC BeadChip and whole-genome bisulfite sequencing. Each of these DNA profiling platforms quantifies methylation in a different way, either continuously (rates of methylation intensity) or discretely (counts of methylated reads). Identifying subgroups of individuals with similar methylation patterns, as well as the genetic loci that discriminate between the subgroups, is a crucial step that helps link diseases to specific methylation patterns. We suggest two tools for this task: clustering analysis and posterior feature selection of the genetic loci that best discriminate each subgroup of individuals. Clustering DNA methylation data is not trivial, however, because the data are platform-specific and not normally distributed.

In this thesis, we propose clustering DNA methylation data according to the data type (continuous or discrete) using fast model-based clustering methods, and selecting the most important (discriminatory) genetic loci with an a posteriori feature selection measure. Specifically, we apply variational non-Gaussian Dirichlet Process mixture models because they have an infinite number of components, which allows the number of clusters to be determined from the data, and are flexible enough to model any discrete or continuous data type. We employ Variational Inference for its high speed in estimating the model parameters and its scalability to high-dimensional data, together with the “annealing” extension, which accounts for poor initialisation of the algorithm. Our real applications on neonatal DNA methylation data measured in three different ways show that the discrete data types, namely the number of aberrantly methylated genetic loci (counts) and whether a genetic locus is abnormally methylated or not (binary), can be more informative than their continuous counterpart (intensity of methylation per genetic locus) for revealing the association of artificial conception with the predisposition to developmental disorders.

Date

2022-03-28

Advisors

Bottolo, Leonardo

Keywords

Bayesian Statistics, Variational Inference, DNA methylation, Model-based clustering

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights

Sponsorship

Alan Turing Institute (unknown)

The Alan Turing Institute (full studentship for my PhD research)

Collections

Theses - Medical Genetics