dc.contributor.author: Jeffryes, Matthew Jacob
dc.date.accessioned: 2019-10-28T09:32:11Z
dc.date.available: 2019-10-28T09:32:11Z
dc.date.issued: 2020-02-22
dc.date.submitted: 2018-09-28
dc.identifier.uri: https://www.repository.cam.ac.uk/handle/1810/298090
dc.description.abstract: Protein family databases are an important resource for biologists seeking to characterise the function of proteins, the structure of their domains, and their localisation within the cell. Operating a protein family database requires the identification of families, and the curation of literature related to each family. This labour is currently performed by skilled professional curators, whose abilities are a scarce resource. In this thesis, I have developed methods to enable some of this labour to be performed by the community of protein sequence similarity search users. In the first chapter, I review the history of protein sequence and protein family databases, and how the abstract concept of a protein family is expressed as a computational model. I review in greater detail the protein family database Pfam, and the software package HMMER, which uses hidden Markov models to search protein sequence databases. In the second chapter, I explore how the quality of computational models for a protein family can be measured, and how these measurements might be used to assess the quality of community-sourced protein family models. I then investigate how a protein sequence similarity search can be rapidly analysed for overlap with existing protein families in Pfam, using locality-sensitive hashing. In the third chapter, I discuss the use of literature search in protein family database curation, and the existing literature resources used by protein family database curators. I then develop a system for performing literature search based on protein families, exploiting the manually annotated links between literature and proteins found in the Swiss-Prot subset of the UniProt protein database.
In the fourth chapter, I develop a web application for analysing the results of protein sequence similarity searches, using the methods discussed in the second chapter, and for performing literature search based on the results of protein sequence similarity search, using the methods discussed in the third chapter. In the fifth chapter, I develop a web application which applies the methods developed in the third chapter to the task of curation of the protein classification resource, InterPro.
dc.description.sponsorship: My fellowship was funded by the European Molecular Biology Laboratory's International PhD Programme
dc.language.iso: en
dc.rights: Attribution 4.0 International (CC BY 4.0)
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: information retrieval
dc.subject: locality sensitive hashing
dc.subject: protein families
dc.subject: hidden markov models
dc.subject: biocuration
dc.subject: curation
dc.subject: text mining
dc.title: Community Classification of the Protein Universe
dc.type: Thesis
dc.type.qualificationlevel: Doctoral
dc.type.qualificationname: Doctor of Philosophy (PhD)
dc.publisher.institution: University of Cambridge
dc.publisher.department: EMBL-EBI
dc.date.updated: 2019-10-21T14:55:29Z
dc.identifier.doi: 10.17863/CAM.45148
dc.contributor.orcid: Jeffryes, Matthew Jacob [0000-0001-9868-6271]
dc.publisher.college: Corpus Christi
dc.type.qualificationtitle: PhD in Biological Science
cam.supervisor: Bateman, Alex
cam.supervisor.orcid: Bateman, Alex [0000-0002-6982-4660]
cam.thesis.funding: false


Except where otherwise noted, this item's licence is described as Attribution 4.0 International (CC BY 4.0)