Phylogenetic Signals in Protein Data


Type

Thesis

Authors

Qin, Chongli 

Abstract

Structural biology has seen major advances over the past decade. In protein structure prediction, accuracy has increased significantly with the discovery of coevolutionary signals in multiple sequence alignments (MSAs). Unlike methods that fold proteins using molecular dynamics (MD) simulations, these coevolutionary methods use correlation information to fold large protein structures orders of magnitude faster. Correlation signals in an MSA are often a strong indicator that a pair of amino acids is close enough to be in contact, and thus to interact. It has been shown that accurate inference of the amino acid pairs that are in contact in a protein gives rise to accurate prediction of the protein structure itself. Hence, statistical inference of amino acid pairs in contact is an important problem for protein folding. However, a major challenge for these statistical inference methods is that noise levels significantly overwhelm the relevant signal in protein data. In this thesis, we attempt to alleviate one of the most important, and most often ignored, sources of noise: spurious correlations induced by phylogeny. To this end, we introduce a novel method for disentangling phylogenetic noise from the relevant structural signals. This method is grounded in an extension to a well-known theorem in Random Matrix Theory. Through extensive analysis on both synthetic and protein data, we demonstrate that it is possible to disentangle these two sources of information. Crucially, we find that the phylogenetic correlations can be largely removed by identifying the principal modes of the empirical correlation matrix whose corresponding eigenvalues follow a power law.
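
The idea of removing principal modes from an empirical correlation matrix can be illustrated with a minimal sketch. It is not the thesis's method: the abstract does not specify how the power-law modes are selected, so the number of removed modes `k` is simply a user-chosen parameter here. The helper `remove_top_modes`, the synthetic data, and all parameter values are assumptions made for illustration only. A single shared latent factor stands in for a global, phylogeny-like correlation, and one coupled column pair stands in for a contact-like signal.

```python
import numpy as np


def remove_top_modes(X, k):
    """Hypothetical helper: drop the k largest-eigenvalue modes of corr(X).

    X is an (n_samples, n_features) array. Returns (C, C_clean, evals),
    where C is the empirical correlation matrix, C_clean is C with its
    top-k modes zeroed, and evals are the eigenvalues of C (ascending).
    """
    # Standardize each column so that Z.T @ Z / n is a correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    C = (Z.T @ Z) / Z.shape[0]

    # eigh returns eigenvalues in ascending order for symmetric matrices.
    evals, evecs = np.linalg.eigh(C)

    # Zero out the k largest eigenvalues (no-op when k == 0).
    kept = evals.copy()
    kept[len(kept) - k:] = 0.0

    # Rebuild the matrix from the remaining modes: sum_j kept_j v_j v_j^T.
    C_clean = (evecs * kept) @ evecs.T
    return C, C_clean, evals


# Synthetic data (an assumption, not protein data).
rng = np.random.default_rng(0)
n, p = 500, 40
shared = rng.normal(size=(n, 1))            # global factor shared by all columns
X = rng.normal(size=(n, p)) + 1.5 * shared
pair = rng.normal(size=n)                   # extra coupling between columns 3 and 7
X[:, 3] += pair
X[:, 7] += pair

C, C_clean, evals = remove_top_modes(X, k=1)
print("largest eigenvalues:", np.round(evals[-3:], 2))
print("pair (3, 7) before / after:", round(C[3, 7], 3), round(C_clean[3, 7], 3))
```

In this toy setting the shared factor dominates the top eigenvalue, so removing that mode lowers the background correlation between unrelated columns. How well a specific pairwise signal survives depends on the data and on how the modes are chosen, which is the question the thesis addresses.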

Description

Date

2020-12-01

Advisors

Colwell, Lucy

Keywords

Phylogeny, Random Matrix Theory, Power law, Proteins

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Sponsorship

EPSRC (1512266)