Variational Mixture Models for non-Gaussian observations: Applications to molecular data
Authors
Gerontogianni, Stavroula
Advisors
Date
2022-03-28
Awarding Institution
University of Cambridge
Qualification
Doctor of Philosophy (PhD)
Type
Thesis
Citation
Gerontogianni, S. (2022). Variational Mixture Models for non-Gaussian observations: Applications to molecular data (Doctoral thesis). https://doi.org/10.17863/CAM.85317
Abstract
Epigenetics is the field of biology that studies changes in organisms caused by alteration of gene expression rather than by modification of the DNA sequence itself. DNA methylation is a well-studied type of epigenetic change, which results in gene silencing and can be dangerous when it occurs at tumour suppressor gene loci. Many techniques have been developed to map the methylation pattern of individuals at several genetic loci, such as the HumanMethylation450 BeadChip, the EPIC BeadChip and whole-genome bisulfite sequencing. Each of these DNA profiling platforms quantifies methylation occurrence in a different way, either continuously (rates of methylation intensity) or discretely (counts of methylated reads). Identifying subgroups of individuals with similar methylation patterns, as well as the genetic loci that discriminate the subgroups, is a crucial procedure that helps link diseases to specific methylation patterns. Clustering analysis and posterior feature selection of the most important genetic loci that discriminate each subgroup of individuals are the two tools we suggest for this purpose. Clustering DNA methylation data, however, is not a trivial procedure, since the data are platform-specific and not normally distributed.
In this thesis, we propose clustering DNA methylation data according to the data type (continuous or discrete) using fast model-based clustering methods, and we select the most important (discriminatory) genetic loci using an a posteriori feature selection measure. Specifically, we apply variational non-Gaussian Dirichlet Process mixture models because they have an infinite number of components, which allows model determination, and because they are flexible enough to model any discrete or continuous data type. We employ Variational Inference because of its high speed in estimating the model parameters and its scalability to high-dimensional data, together with the “annealing” extension, which accounts for poor initialisation of the algorithm. Our real applications on neonatal DNA methylation data measured in three different ways show that the discrete data types - number of aberrantly methylated genetic loci (counts) and whether a genetic locus is abnormally methylated or not (binary) - can be more informative than their continuous counterpart (intensity of methylation per genetic locus) for revealing the association of artificial conception with the predisposition to developmental disorders.
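The abstract does not give the thesis's own formulation, so the following is only a generic illustration of why a Dirichlet Process mixture can be said to have an "infinite number of components". It sketches the standard stick-breaking construction of mixture weights, truncated at a finite level as is common in variational treatments. The function name, the concentration parameter alpha and the truncation level T are all assumptions chosen for illustration, not values taken from the thesis.

```python
import numpy as np


def stick_breaking_weights(v):
    """Turn stick fractions v_1..v_{T-1} in (0, 1) into T mixture weights."""
    v = np.asarray(v, dtype=float)
    # Stick length remaining before each break: 1, (1-v1), (1-v1)(1-v2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)))
    # The first T-1 weights are v_k * remaining_k; the last takes what is left,
    # so the truncated weights sum to one.
    return np.concatenate((v * remaining[:-1], remaining[-1:]))


rng = np.random.default_rng(0)
alpha = 1.0  # hypothetical concentration parameter
T = 20       # hypothetical truncation level

# Stick fractions drawn from the Beta(1, alpha) prior of the construction.
v = rng.beta(1.0, alpha, size=T - 1)
weights = stick_breaking_weights(v)
```

Under this construction the weights tend to decay quickly, so most of the T components receive negligible mass; this is the sense in which the number of effective clusters can be determined by the model rather than fixed in advance.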
Keywords
Bayesian Statistics, Variational Inference, DNA methylation, Model-based clustering
Sponsorship
The Alan Turing Institute (full studentship for my PhD research)
Funder references
Alan Turing Institute (unknown)
Identifiers
This record's DOI: https://doi.org/10.17863/CAM.85317