dc.contributor.author: Gerontogianni, Stavroula
dc.date.accessioned: 2022-06-08T16:31:50Z
dc.date.available: 2022-06-08T16:31:50Z
dc.date.submitted: 2022-03-28
dc.identifier.uri: https://www.repository.cam.ac.uk/handle/1810/337911
dc.description.abstract: Epigenetics is the field of biology that studies changes in organisms caused by altered gene expression rather than by modification of the DNA sequence itself. DNA methylation is a well-studied type of epigenetic change that results in gene silencing and can be dangerous when it occurs at tumour suppressor gene loci. Many techniques have been developed to map the methylation patterns of individuals at several genetic loci, such as the HumanMethylation450 BeadChip, the EPIC BeadChip and whole-genome bisulfite sequencing. Each of these DNA profiling platforms quantifies methylation in a different way, either continuously (rates of methylation intensity) or discretely (counts of methylated reads). Identifying subgroups of individuals with similar methylation patterns, as well as the genetic loci that discriminate between the subgroups, is a crucial step that helps link diseases to specific methylation patterns. Cluster analysis and a posteriori feature selection of the most important genetic loci that discriminate each subgroup of individuals are the two tools we suggest for this task. Clustering DNA methylation data is not trivial, however, since the data are platform-specific and not normally distributed. In this thesis, we propose clustering DNA methylation data according to data type (continuous or discrete) using fast model-based clustering methods, and we select the most important/discriminatory genetic loci with an a posteriori feature selection measure. Specifically, we apply variational non-Gaussian Dirichlet Process mixture models because they have an infinite number of components, which allows model determination, and are flexible enough to model any discrete or continuous data type. We also employ Variational Inference, chosen for its speed in estimating the model parameters and its scalability to high-dimensional data, with the “annealing” extension that accounts for poor initialisation of the algorithm.
Our real applications on neonatal DNA methylation data measured in three different ways show that the discrete data types - number of aberrantly methylated genetic loci (counts) and whether a genetic locus is abnormally methylated or not (binary) - can be more informative than the continuous version (intensity of methylation per genetic locus) for revealing the association of artificial conception with predisposition to developmental disorders.
dc.description.sponsorship: The Alan Turing Institute (full studentship for my PhD research)
dc.rights: All Rights Reserved
dc.rights.uri: https://www.rioxx.net/licenses/all-rights-reserved/
dc.subject: Bayesian Statistics
dc.subject: Variational Inference
dc.subject: DNA methylation
dc.subject: Model-based clustering
dc.title: Variational Mixture Models for non-Gaussian observations: Applications to molecular data
dc.type: Thesis
dc.type.qualificationlevel: Doctoral
dc.type.qualificationname: Doctor of Philosophy (PhD)
dc.publisher.institution: University of Cambridge
dc.date.updated: 2022-06-08T12:16:11Z
dc.identifier.doi: 10.17863/CAM.85317
rioxxterms.licenseref.uri: https://www.rioxx.net/licenses/all-rights-reserved/
rioxxterms.type: Thesis
pubs.funder-project-id: Alan Turing Institute (unknown)
cam.supervisor: Bottolo, Leonardo
cam.supervisor.orcid: Bottolo, Leonardo [0000-0002-6381-2327]
cam.depositDate: 2022-06-08
pubs.licence-identifier: apollo-deposit-licence-2-1
pubs.licence-display-name: Apollo Repository Deposit Licence Agreement