Discovering variation from cell atlases: comparative methods for single-cell genomics

Dann, Emma

Repository URI

https://www.repository.cam.ac.uk/handle/1810/362807

Repository DOI

https://doi.org/10.17863/CAM.104756

Files

Thesis (412.72 MB)

Type

Thesis

Authors

Dann, Emma

Abstract

Single-cell genomics technologies have become the norm for investigating cell-to-cell heterogeneity, and collective research efforts have built “cell atlases” to characterize previously unknown cell types and states in tissues. For this task, the collection of cells from multiple donors or tissue sites is often required, and differences between cell populations from different samples are largely attributed to technical variation. More recently, multi-sample and multi-condition experiments are being designed to quantify differences between biological conditions, using atlases from healthy tissues as references. These studies require robust quantification of biological differences between cellular phenotypes while accounting for variability between samples and conserving fine-grained information on cell-to-cell heterogeneity. The work presented in this thesis focuses on the development and application of computational strategies and best practices for comparative analysis between samples of different biological conditions profiled with single-cell genomics.

Chapter 2 presents Milo, a statistical framework for differential abundance testing on single-cell data. By quantifying differences in cell abundances between conditions in partially overlapping neighborhoods on a k-nearest neighbor graph, Milo can identify perturbations that are obscured by discretizing cells into clusters, and it maintains false discovery rate control even in the presence of batch effects. I present a comprehensive benchmark against alternative differential abundance testing strategies, using simulations and scRNA-seq data. I then demonstrate the utility of Milo by studying perturbations across lineages in a dataset of human liver cirrhosis.

Chapters 3 and 4 present a case study where Milo and other integration and comparative analysis methods were used to study the development of the human immune system as a distributed network across tissues. In Chapter 3, I describe the integration of scRNA-seq data of almost a million cells from nine prenatal organs across 11 weeks of gestation, to define common and tissue-specific immune cell populations, and how these compare with immune cell states identified in adulthood. Using this integrated view, I show how I used Milo to identify stage- and tissue-specific subpopulations of myeloid and lymphoid cells, and discuss their potential role in maturation of immune function and tissue morphogenesis. Chapter 4 is centered around the analysis of spatial cellular environments across fetal tissues. By integrating our cross-tissue scRNA-seq atlas with spatial transcriptomics data, we identified and compared cellular niches in the developing liver, spleen, thymus and gut.

In Chapter 5, I present a systematic meta-analysis to establish best practices for identifying cell states altered in human disease using integration and differential analysis on single-cell datasets. In particular, I examined whether atlas datasets are suitable references for disease-state identification or whether matched control samples should be employed, to minimise false discoveries driven by biological and technical confounders. By quantitatively comparing the use of atlas and control datasets as references for identification of disease-associated cell states, on simulations and real disease scRNA-seq datasets, I show that reliance on a single type of reference dataset introduces false positives. Conversely, using an atlas dataset as reference for latent space learning followed by differential analysis against a matched control dataset leads to precise identification of disease-associated cell states. I demonstrate how optimized design can guide discovery in two applications. First, I used a cell atlas of blood cells from 12 studies to contextualise data from a case-control COVID-19 cohort, to detect cell states associated with infection and to distinguish heterogeneous pathological cell states associated with distinct clinical severities. In parallel, I used a healthy reference lung atlas generated from 14 studies to study single-cell profiles obtained from patients with idiopathic pulmonary fibrosis, characterising two distinct aberrant basal cell states associated with disease and identifying unique marker genes with therapeutic potential.

In the final chapter, I discuss how advancements in multi-condition single-cell analysis open up a new phase of population-level tissue genomics studies, and the impact that it will have on biomedicine.

Date

2023-12-01

Advisors

Teichmann, Sarah

Keywords

computational biology, single-cell genomics

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights

Attribution 4.0 International (CC BY 4.0)

Collections

Theses - Wellcome Sanger Institute