Bayesian Methods for Spatial Proteomics

Crook, Oliver

Repository URI

https://www.repository.cam.ac.uk/handle/1810/324601

Repository DOI

https://doi.org/10.17863/CAM.72056

Files

Thesis (93.56 MB)

Type

Thesis

Authors

Crook, Oliver

https://orcid.org/0000-0001-5669-8506

Abstract

Proteins are biomolecules that govern the biochemical processes of the cell. Correct cellular function therefore depends on correct protein function. For a protein to function as intended, there must be sufficient copies of it, it must be correctly folded into its tertiary structure, and it must be in proximity to its interaction partners, amongst many other requirements. To be in proximity to its interaction partners, whether those are other proteins, RNA or metabolites, a protein needs to be localised to the required compartment. Cells from all organisms display sub-cellular compartmentalisation, though to vastly differing degrees. E. coli, for example, has a remarkably simple sub-cellular organisation, whilst the apicomplexan Toxoplasma gondii has a vast number of specialised organelles.

In seminal experiments, Christian de Duve showed that, upon biochemical fractionation of the cell, proteins co-fractionated if they were localised to the same organelle. These experiments led to the discovery of two organelles, the lysosome and the peroxisome, for which de Duve was awarded the Nobel Prize. With the advent of mass spectrometry, these experiments were refashioned into high-throughput techniques through the development of Localisation of Organelle Proteins by Isotope Tagging (LOPIT) and Protein Correlation Profiling (PCP). These techniques have since been developed further, and a typical experiment can now accurately measure thousands of proteins whilst also providing information on at least a dozen subcellular compartments.

Spatial proteomics data are first annotated with marker proteins, which are proteins with a priori known, unambiguous localisations. Typical analysis proceeds by training a machine learning classifier to assign proteins with unknown localisations to one of the compartments based on the spatial proteomics data. However, this framework holds spatial proteomics back from answering more complex questions. The first challenge is that proteins are not necessarily localised to a single compartment, so there is uncertainty in associating a protein with an organelle. There is also uncertainty associated with the experiment itself, for example in the reproducibility of the biochemical fractionation and the stochastic nature of mass spectrometric quantitation. Two chapters of my thesis are dedicated to alleviating this problem by developing a Bayesian model for spatial proteomics data, with dedicated software. These approaches perform competitively with state-of-the-art classification algorithms, whilst Markov chain Monte Carlo algorithms are employed to sample from the posterior distribution of localisation probabilities. This is the basis for quantifying uncertainty in protein-organelle associations.
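As a rough illustration of what a posterior distribution over localisation probabilities looks like, the sketch below uses a deliberately simplified stand-in for the thesis model: each compartment is an isotropic Gaussian with a known standard deviation, compartment means are drawn from their conjugate posterior given the markers (direct Monte Carlo rather than a Markov chain), and mixing weights are plugged in from marker counts. The marker profiles, the value of `sigma`, and all function names are hypothetical and chosen only for illustration.

```python
import math
import random


def log_gauss(x, mu, sigma):
    """Log density of an isotropic Gaussian with mean mu and std dev sigma."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return -0.5 * sq / sigma**2 - d * math.log(sigma) - 0.5 * d * math.log(2 * math.pi)


def allocation_probs(x, means, weights, sigma):
    """Probability of each compartment for profile x, given fixed parameters."""
    logp = {k: math.log(weights[k]) + log_gauss(x, means[k], sigma) for k in means}
    top = max(logp.values())  # subtract the maximum for numerical stability
    unnorm = {k: math.exp(v - top) for k, v in logp.items()}
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}


def sample_means(markers, sigma, rng):
    """Draw each compartment mean from its posterior (flat prior, known sigma)."""
    draws = {}
    for k, profiles in markers.items():
        n, d = len(profiles), len(profiles[0])
        centre = [sum(p[j] for p in profiles) / n for j in range(d)]
        draws[k] = [rng.gauss(c, sigma / math.sqrt(n)) for c in centre]
    return draws


def posterior_localisation(x, markers, sigma=0.1, n_draws=2000, seed=1):
    """Summarise the distribution of allocation probabilities over parameter draws."""
    rng = random.Random(seed)
    n_markers = sum(len(p) for p in markers.values())
    # Assumption: mixing weights are fixed at the marker proportions.
    weights = {k: len(p) / n_markers for k, p in markers.items()}
    draws = [
        allocation_probs(x, sample_means(markers, sigma, rng), weights, sigma)
        for _ in range(n_draws)
    ]
    summary = {}
    for k in markers:
        vals = sorted(d[k] for d in draws)
        summary[k] = {
            "mean": sum(vals) / n_draws,
            "lo": vals[int(0.025 * n_draws)],        # approximate 2.5% quantile
            "hi": vals[int(0.975 * n_draws) - 1],    # approximate 97.5% quantile
        }
    return summary


# Hypothetical three-fraction marker profiles for two compartments.
markers = {
    "mitochondrion": [[0.60, 0.30, 0.10], [0.55, 0.35, 0.10], [0.65, 0.25, 0.10]],
    "ER": [[0.10, 0.30, 0.60], [0.15, 0.30, 0.55], [0.10, 0.35, 0.55]],
}

clear = posterior_localisation([0.60, 0.30, 0.10], markers)
ambiguous = posterior_localisation([0.37, 0.30, 0.33], markers)
for name, res in (("clear", clear), ("ambiguous", ambiguous)):
    for compartment, s in res.items():
        print(name, compartment, round(s["mean"], 3), round(s["lo"], 3), round(s["hi"], 3))
```

A profile that sits on a marker cluster receives a tight interval near one, whereas an intermediate profile receives a wide interval. That width is the kind of uncertainty a single hard classification label cannot express.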

This Bayesian approach has several limitations; for example, it still relies on marker proteins. This precludes analysis of poorly annotated non-model organisms using spatial proteomics techniques. A chapter of my thesis is dedicated to this challenge, with a motivating application to the T. gondii sub-proteome. Following on from this, in a separate chapter, I develop a semi-supervised Bayesian model that reduces the reliance on marker proteins. The application to T. gondii constitutes a massive expansion of knowledge, revealing the localisation of thousands of proteins to complex specialised niches. I also analyse the relative redundancy of the organelle sub-proteomes and the selective pressure of the host-adaptive response, yielding new insights.

The semi-supervised Bayesian approach makes use of the principle of over-fitted mixtures, currently used for data clustering, by extending it to model spatial proteomics data. Reanalysis of spatial proteomics data reveals new annotations in all datasets and allows interrogation of previously overlooked organelles. A further limitation of the approaches described thus far is their parametric assumptions. One chapter is dedicated to placing the analysis of spatial proteomics in a semi-supervised Bayesian non-parametric context.
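The over-fitted mixture principle can be sketched on toy data: fit a mixture with deliberately more components than needed and place a sparse symmetric Dirichlet prior on the weights, so that superfluous components tend to empty during sampling. The example below is a minimal collapsed Gibbs sampler for one-dimensional data with a known variance. It is not the semi-supervised model developed in the thesis, and the data, hyperparameters and function names are all illustrative assumptions.

```python
import math
import random
from collections import Counter


def predictive_logpdf(x, members, sigma, prior_mean, prior_sd):
    """Log predictive density of x for a component currently holding `members`.

    Assumes a Gaussian likelihood with known sigma and a Gaussian prior on the
    component mean, so the predictive distribution is Gaussian in closed form.
    """
    n = len(members)
    prec = 1.0 / prior_sd**2 + n / sigma**2
    post_mean = (prior_mean / prior_sd**2 + sum(members) / sigma**2) / prec
    var = 1.0 / prec + sigma**2
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - post_mean) ** 2 / var


def overfitted_gibbs(data, n_components=6, alpha=0.01, sigma=1.0,
                     prior_mean=5.0, prior_sd=10.0, sweeps=200, seed=0):
    """Collapsed Gibbs sampler; returns final labels and occupied counts per sweep."""
    rng = random.Random(seed)
    z = [0] * len(data)  # start with every observation in component 0
    occupied_counts = []
    for _ in range(sweeps):
        for i, x in enumerate(data):
            logw = []
            for k in range(n_components):
                members = [data[j] for j in range(len(data)) if z[j] == k and j != i]
                # A small alpha makes empty components unattractive a priori.
                logw.append(math.log(len(members) + alpha)
                            + predictive_logpdf(x, members, sigma, prior_mean, prior_sd))
            top = max(logw)
            weights = [math.exp(v - top) for v in logw]
            z[i] = rng.choices(range(n_components), weights=weights)[0]
        occupied_counts.append(len(set(z)))
    return z, occupied_counts


# Two well-separated toy clusters, fitted with six available components.
low = [-1.0 + 0.25 * i for i in range(9)]
high = [9.0 + 0.25 * i for i in range(9)]
labels, occupied = overfitted_gibbs(low + high)
print(Counter(occupied[50:]))  # tally of occupied components after burn-in
```

The design choice that matters is the small concentration parameter `alpha`: because empty components carry very little prior weight, the sampler occupies only as many components as the data support, which is how the number of clusters is inferred rather than fixed in advance.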

In the final chapters of the thesis, I summarise the modern questions that spatial proteomics seeks to answer, including deciphering multi-localisation, changes in localisation and the effect of post-translational modifications on subcellular localisation. I carefully define these problems and motivate further Bayesian models. I develop a Bayesian model to analyse differential localisation experiments; that is, spatial proteomics experiments concerned with changes in localisation. This approach improves on the ad hoc methods currently applied to such data. I conclude with the limitations of our approach and potential solutions to the other methods.

Date

2020-09-01

Advisors

Lilley, Kathryn
Kirk, Paul
Gatto, Laurent

Keywords

Proteomics, Bayesian Statistics, Mass spectrometry, Machine Learning

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights

Attribution 4.0 International (CC BY 4.0)

Sponsorship

Wellcome Trust Mathematical Genomics and Medicine scholarship

Collections

Theses - Biochemistry