Using sequence data to investigate the functional design of proteins

Puszkarska, Anna

doi:10.17863/CAM.60028

Repository URI

https://www.repository.cam.ac.uk/handle/1810/312931

Repository DOI

https://doi.org/10.17863/CAM.60028

Files

Thesis (73.6 MB) (Embargoed until: 2400-01-01)

Type

Thesis

Authors

Puszkarska, Anna

Abstract

Understanding which sequence features enable a given protein to perform a specific function is a long-standing challenge in molecular biology. Detecting functional specificity among paralogous proteins is particularly difficult: because paralogues share a common molecular origin, the variants differ only subtly, and these differences are hard to detect from observable characteristics. For example, the set of paralogous polypeptides present in the genomes of modern vertebrates gives rise to the variety of collagen proteins. These proteins preserve a high degree of sequence and structural homology, yet have been optimised by evolutionary processes to perform their specific functions: assembling into widely diverse biological materials, such as fibrils or networks. The thesis exploits statistical modelling of protein sequence data to shed light on the relationship between protein sequence and function. Specifically, the thesis develops approaches to find sequence design principles that determine the specificity of protein paralogues that have diverged in function. I show that this can be achieved by investigating evolutionary sequence variation using a probabilistic framework. A new approach is developed to examine the importance of each amino acid in the protein primary sequence. I find that the functionally important amino acids can be grouped into two clusters: (i) those shared among all paralogue sequences, which are responsible for common features, and (ii) those specific to each group, which enable functional specificity. I use data sets of orthologous collagen sequences from genomic research to build sequence models that represent each collagen paralogue variant, and use these models to carry out comparative analysis. Adaptational dependencies among seventeen types of α-paralogue sequences from two functional groups are analysed.
Moreover, a model of intermolecular interactions between fibrillar collagen trimers is proposed to show that the phenotype of the supramolecular fibrillar structure is fully encoded in the primary sequences of the collagen proteins and can be predicted purely from simple rules for the interaction between amino acid residues. Finally, I use statistical learning approaches to model the activity of peptide hormones, and use the resulting models to design novel hormone sequences with improved functional properties.

Date

2020-09-01

Advisors

Colwell, Lucy
Duer, Melinda

Keywords

protein sequence analysis, statistical modelling, evolutionary sequence variation, collagen

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Sponsorship

Raymond and Beverly Sackler Fund for Physics of Medicine, University of Cambridge, the European Research Council, the Simons Foundation

Collections

Theses - Chemistry
