
dc.contributor.author | Qin, Da Da | en
dc.date.accessioned | 2021-03-19T10:50:55Z
dc.date.available | 2021-03-19T10:50:55Z
dc.date.submitted | 2020-12-01 | en
dc.identifier.uri | https://www.repository.cam.ac.uk/handle/1810/318986
dc.description.abstract | Structural biology has seen major advances over the past decade. In protein structure prediction, accuracy has increased significantly with the discovery of coevolutionary signals in multiple sequence alignments (MSAs). Unlike methods that fold proteins using molecular dynamics (MD) simulations, these coevolutionary methods use correlation information to fold large protein structures orders of magnitude faster. Correlation signals in an MSA are often a strong indicator that a pair of amino acids is close enough to be in contact, and thus to interact. It has been shown that accurate inference of the amino acid pairs that are in contact in a protein gives rise to accurate prediction of the protein structure itself. Hence, statistical inference of amino acid pairs in contact is an important problem for protein folding. However, one of the major challenges for these statistical inference methods is that noise significantly overwhelms the relevant signal in protein data. In this thesis, we attempt to alleviate one of the most important, and most often ignored, sources of noise: spurious correlations induced by phylogeny. To this end, we introduce a novel method for disentangling phylogenetic noise from the relevant structural signals. This method is grounded in an extension of a well-known theorem in Random Matrix Theory. Through extensive analysis of both synthetic and protein data, we demonstrate that it is possible to disentangle these two sources of information. Crucially, we find that the phylogenetic correlations can be largely removed by identifying the principal modes of the empirical correlation matrix whose corresponding eigenvalues follow a power law. | en
dc.rights | All rights reserved | en
dc.subject | Phylogeny | en
dc.subject | Random Matrix Theory | en
dc.subject | Power law | en
dc.subject | Proteins | en
dc.title | Phylogenetic Signals in Protein Data | en
dc.type | Thesis
dc.type.qualificationlevel | Doctoral | en
dc.type.qualificationname | Doctor of Philosophy (PhD) | en
dc.publisher.institution | University of Cambridge | en
dc.identifier.doi | 10.17863/CAM.66103
rioxxterms.licenseref.uri | http://www.rioxx.net/licenses/all-rights-reserved | en
rioxxterms.type | Thesis | en
dc.publisher.college | Selwyn
dc.type.qualificationtitle | PhD in Chemistry | en
pubs.funder-project-id | EPSRC (1512266)
cam.supervisor | Colwell, Lucy


Files in this item


There are no files associated with this item.

This item appears in the following Collection(s)
