
Benchmarking methods for label-free quantitative proteomics data analysis


Abstract

Proteins are fundamental functional units in cellular processes. Inter-individual variation in the composition of proteins in cells, including differences in amino acid sequence and in abundance, mediates many of the effects of genetic variants on the risks and presentations of diseases. Consequently, statistical analysis of protein levels in cases and controls can identify disease-mediating proteins. Mass spectrometry (MS)-based proteomics is the predominant technology for protein quantification. In particular, MS-based label-free quantification (LFQ) proteomics can quantify thousands of proteins in complex biological samples. However, the molecular heterogeneity of such samples and the technological limitations of MS-based proteomics pose inferential challenges, including systematic biases that compromise the accuracy and precision of relative abundance estimates. To mitigate these problems, researchers rely on specialised statistical methodology and software tools for processing and analysing LFQ data. Gold-standard datasets play an essential role in the development and benchmarking of these tools. In this thesis, I will present methods for generating datasets to benchmark LFQ data analysis tools and highlight how Bayesian modelling paradigms can be used to address some of the challenges in LFQ data analysis.

In Chapter 1, I will describe the motivation behind my work and discuss the context and scope of this thesis. In Chapter 2, I will outline the biological and technical background to LFQ proteomics experiments. I will review state-of-the-art methods for collecting, processing, and analysing LFQ data and discuss their relative strengths and weaknesses. I will also discuss the problem of protein inference, which makes isoform-level relative quantification challenging. The novel contributions of this thesis are presented in Chapters 3–6. Chapter 3 contains a comprehensive review of the most commonly used LFQ benchmarking datasets and details the design of an experiment for the acquisition of a new gold-standard dataset. Chapter 4 explores the potential applications of this new dataset and its advantages over existing datasets. Chapter 5 introduces a generative algorithm for producing LFQ proteomics datasets to aid the development of tools for analysing processed LFQ data. In Chapter 6, I will illustrate how Bayesian hierarchical models can be used for analysing LFQ data and then highlight the issues that must be addressed before Bayesian approaches can be widely adopted for LFQ data analysis.

Date

2023-10-13

Advisors

Astle, William

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights and licensing

Except where otherwise noted, this item's license is described as All Rights Reserved

Sponsorship

Dr. Herchel Smith Fellowship and National Institute for Health and Care Research (NIHR) Fellowship