Methods for Dissecting High Dimensional Single Cell RNA Sequencing Data
Abstract
Since first being described in 2009 [1], single cell RNA sequencing (scRNA-seq) has rapidly advanced into a staple for interrogating cellular identity in heterogeneous populations. Researchers routinely capture transcriptome-wide snapshots of thousands or even millions of individual cells. From these readouts the challenge is to accurately identify distinct patterns of mRNA expression that separate cells with distinct identities. However, several intrinsic properties of scRNA-seq data obfuscate our ability to extract biologically relevant information. Technical error can lead to the loss of resolution in cell similarity or gene co-regulatory relationships, or generate artificial correlations that are related to experimental design rather than biological signals. Often we lack a clear ground truth that would define what cell types should be present in the data set, and which genes are strong discriminators for them. Consequently, it is difficult to know whether any information we have garnered is truly representative of the entire population of cells within the data. The lack of a ground truth is particularly relevant since scRNA-seq data often suffers from the "curse of dimensionality" [2–4], as scRNA-seq data sets typically span tens of thousands of genes. The high number of genes diminishes the validity of common data analysis techniques such as distance and correlation metrics, making it more difficult to identify cell or gene relationships in the data. My aim in this work is to establish a new methodology that can successfully navigate the complexities of scRNA-seq data and maximise our ability to obtain biologically relevant and meaningful insights into a cell population under study. Existing methods designed to interrogate scRNA-seq data struggle to account for two major phenomena: technical drop outs and uninformative genes.
Drop outs occur when a gene is measured as having low or zero expression in a cell, but the measurement is a false negative caused by technical errors in capture or amplification of the mRNA. Uninformative genes are those that are present in the data, but have little relevance to cellular identity and hinder our ability to differentiate distinct cell types. To address these issues, I introduce a framework termed Entropy Sorting (ES). ES uniquely quantifies the correlative relationships between pairs of genes as a sorting problem. Doing so enables us to quantify how far away the observed functional states of the two genes are from an ideal perfectly correlated system, where the dependent relationship between the two features has been maximised. This approach enables us to pose and test hypotheses on whether the observed gene expression states are likely due to gene dependencies or random chance. The theory around ES is encoded in an algorithm, Functional Feature Amplification Via Entropy Sorting (FFAVES). I demonstrate that FFAVES allows us to simultaneously quantify gene co-regulation, correct for false negative and false positive data points and perform feature selection for highly informative genes. Crucially, it does so in an entirely unsupervised manner, minimising the introduction of bias to the data that would prevent us from identifying unknowns, such as rare cell types or gene signatures. On synthetic data I demonstrate that FFAVES recovers gene relationships more accurately than the most popular methods currently used. On real scRNA-seq data sets I use FFAVES to uncover high resolution gene expression dynamics during human embryo pre-implantation blastocyst development. Through FFAVES, I expose rare cell types and cell type specific gene expression, by mitigating the contribution of technical confounders such as batch effects and false negative drop outs.
To demonstrate that the analysis is biologically relevant, I use detailed cell embeddings created from the human embryo scRNA-seq data to identify sequential transcription factor expression dynamics during the formation of primitive endoderm cells. Preliminary results from human embryo staining are used to validate the findings from FFAVES. In summary, I hope to demonstrate that ES and FFAVES serve as powerful tools for increasing the amount of information that can be extracted from scRNA-seq data, and more generally any high dimensional data set with complex feature relationships. Having improved the quality of a data set by removing technical noise and amplifying functional relationships with ES and FFAVES, users should gain clearer insights into their system of study, enhancing their ability to draw accurate conclusions and plan future experiments.
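The idea of measuring how far two genes sit from a perfectly correlated system can be illustrated with a small, generic sketch. It is not the ES formulation or the FFAVES algorithm, whose details are not given in this abstract. It assumes binarised (on/off) gene states and uses the standard conditional entropy H(Y | X) as the dependence measure. The function names `conditional_entropy` and `add_dropouts`, the simulated genes, and the 40% drop out rate are all hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def conditional_entropy(x, y):
    """Return H(Y | X) in bits for two equal-length sequences of discrete states."""
    n = len(x)
    joint = Counter(zip(x, y))   # counts of each (x, y) pair
    marginal_x = Counter(x)      # counts of each x state
    h = 0.0
    for (xi, _), count in joint.items():
        # Each term is p(x, y) * log2(p(x) / p(x, y)).
        h += (count / n) * math.log2(marginal_x[xi] / count)
    return h


def add_dropouts(states, rate, rng):
    """Simulate false negatives: flip each 'on' state (1) to 'off' (0) with probability `rate`."""
    return [0 if s == 1 and rng.random() < rate else s for s in states]


rng = random.Random(0)  # fixed seed so the simulation is reproducible

# Hypothetical gene A, active in roughly 30% of 1000 simulated cells.
gene_a = [1 if rng.random() < 0.3 else 0 for _ in range(1000)]
# Gene B is perfectly co-expressed with gene A (the ideal, fully dependent case).
gene_b = list(gene_a)
# The same gene B as it might be observed after an assumed 40% drop out rate.
gene_b_observed = add_dropouts(gene_b, rate=0.4, rng=rng)
# Gene C is generated independently of gene A.
gene_c = [1 if rng.random() < 0.3 else 0 for _ in range(1000)]

h_ideal = conditional_entropy(gene_a, gene_b)
h_dropout = conditional_entropy(gene_a, gene_b_observed)
h_random = conditional_entropy(gene_a, gene_c)

print(f"H(B | A), perfectly correlated: {h_ideal:.3f} bits")
print(f"H(B | A), with drop outs:       {h_dropout:.3f} bits")
print(f"H(C | A), independent genes:    {h_random:.3f} bits")
```

In this toy setting the perfectly correlated pair has zero conditional entropy, drop outs push the observed pair away from that ideal, and an independent gene sits further away still. That ordering is the intuition behind asking whether observed expression states reflect a gene dependency or random chance.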
Advisors
Nichols, Jennifer
Sponsorship
Biotechnology and Biological Sciences Research Council (1943266)
Biotechnology and Biological Sciences Research Council (2489150)

