Community Classification of the Protein Universe

Jeffryes, Matthew Jacob

Repository URI

https://www.repository.cam.ac.uk/handle/1810/298090

Repository DOI

https://doi.org/10.17863/CAM.45148

Files

Thesis (8.91 MB)

Type

Thesis

Authors

Jeffryes, Matthew Jacob

https://orcid.org/0000-0001-9868-6271

Abstract

Protein family databases are an important resource for biologists seeking to characterise the function of proteins, the structure of their domains, and their localisation within the cell. Operating a protein family database requires the identification of families, and the curation of literature related to the family. This labour is currently performed by skilled professional curators, whose abilities are a scarce resource. In this thesis, I have developed methods to enable some of this labour to be performed by the community of protein sequence similarity search users.

In the first chapter, I review the history of protein sequence and protein family databases, and how the abstract concept of a protein family is expressed as a computational model. I review in greater detail the protein family database Pfam and the software package HMMER, which uses hidden Markov models to search protein sequence databases.

In the second chapter, I explore how the quality of computational models for a protein family can be measured, and how these measurements might be used to assess the quality of community-sourced protein family models. I then investigate how a protein sequence similarity search can be rapidly analysed for overlap with existing protein families in Pfam, using locality sensitive hashing.
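As a rough illustration of how locality sensitive hashing can estimate overlap between a search result and an existing family, the sketch below uses MinHash signatures. It is not the thesis's implementation: the function names, the choice of MinHash, the number of hash functions, and the representation of a family or search result as a set of sequence accession strings are all assumptions made for this example.

```python
import hashlib


def minhash_signature(members, num_hashes=128):
    """Return a MinHash signature for a non-empty set of accession strings.

    Each of the num_hashes positions holds the smallest 64-bit hash seen
    under a differently salted hash function (an assumed scheme).
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # blake2b accepts salts up to 16 bytes
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(m.encode(), digest_size=8, salt=salt).digest(),
                    "little",
                )
                for m in members
            )
        )
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical accession sets: a stored family and a new search result.
family = {f"SEQ{i:05d}" for i in range(0, 100)}
search_hits = {f"SEQ{i:05d}" for i in range(50, 150)}

family_sig = minhash_signature(family)
hits_sig = minhash_signature(search_hits)
estimated_overlap = estimate_jaccard(family_sig, hits_sig)
```

The appeal of this kind of scheme is that signatures for existing families can be computed once and stored, so a new search result only needs its own signature computed before comparison, rather than a full set intersection against every family.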

In the third chapter, I discuss the use of literature search in protein family database curation, and the existing literature resources used by protein family database curators. I then develop a system for performing literature search based on protein families, exploiting the manually annotated links between literature and proteins found in the Swiss-Prot subset of the UniProt protein database.
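One simple way to picture a family-based literature search is to pool the papers linked to each member protein and rank them by how many members cite them. The sketch below does only that. The function name, the ranking rule, and the placeholder accessions and paper identifiers are assumptions for illustration, not the system described in the thesis, and a real version would read the protein-to-literature links from Swiss-Prot entries.

```python
from collections import Counter


def rank_literature(family_members, protein_to_papers):
    """Rank papers by the number of family members linked to them.

    family_members: iterable of protein accessions in the family.
    protein_to_papers: dict mapping an accession to its linked paper IDs
    (assumed here to have been extracted from curated annotations).
    """
    counts = Counter()
    for accession in family_members:
        # set() ensures one protein cannot count the same paper twice
        for paper in set(protein_to_papers.get(accession, ())):
            counts[paper] += 1
    # Most widely linked papers first; ties broken by identifier
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Placeholder data, not real accessions or publications.
links = {
    "prot-A": ["paper-1", "paper-2"],
    "prot-B": ["paper-2"],
    "prot-C": [],
}
ranked = rank_literature(["prot-A", "prot-B", "prot-C", "prot-D"], links)
```

Members with no curated links, such as `prot-C` and the unknown `prot-D` here, simply contribute nothing to the ranking.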

In the fourth chapter, I develop a web application for analysing the results of protein sequence similarity searches, using the methods discussed in the second chapter, and for performing literature search based on the results of protein sequence similarity search, using the methods discussed in the third chapter.

In the fifth chapter, I develop a web application which applies the methods developed in the third chapter to the curation of the protein classification resource InterPro.

Date

2018-09-28

Advisors

Bateman, Alex

Keywords

information retrieval, locality sensitive hashing, protein families, hidden Markov models, biocuration, curation, text mining

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights

Attribution 4.0 International (CC BY 4.0)

Sponsorship

My fellowship was funded by the European Molecular Biology Laboratory's International PhD Programme.

Collections

Theses - European Bioinformatics Institute (EMBL-EBI)