Cellular Kaleidoscope: Unveiling Tissue Microenvironment in Health and Disease
Abstract
Cells are the fundamental units of life. In multicellular organisms, cells are organised into tissues, where groups of similar cells work together to perform specific tasks. Understanding how tissues and organs function requires insight into the dynamics between different cell populations. Changes in cell composition and gene expression patterns within a tissue can indicate disease. Single-cell RNA sequencing has revolutionised our understanding of gene expression at the single-cell level, elucidating tissue composition and cell type-specific expression patterns and uncovering rare cell types. However, its high cost limits adoption in clinical settings. Bulk expression profiling, on the other hand, is affordable and scalable, but it lacks cell type-specific information, which is masked because bulk measurements aggregate expression over all cells in a sample.
For decades, even before single-cell sequencing, scientists have developed algorithms to computationally infer cell composition and cell type-specific expression from microarray and bulk RNA data. Despite the abundance of methods, results often vary and lack reliability. One of the main challenges is selecting suitable methods, reference data, and pre-processing steps for the condition and tissue of interest. This thesis aims to bridge this gap by developing a comprehensive benchmarking pipeline for existing methods, called CATD (Critical Assessment of Transcriptomics Deconvolution), and by providing guidelines for generating synthetic data for evaluation, pre-processing data, and selecting suitable single-cell references and methods.
Findings indicate that the choice of pseudo-bulk generation technique affects deconvolution results. Deconvolution methods also lack robustness when cell types present in the reference are missing from the sample. I provide guidelines for data pre-processing based on the selected method, and I test method robustness and accuracy in the presence of batch effects between reference and bulk data. Results show that some methods are more resilient to batch effects than others, highlighting the need to test newly developed methods under these conditions.
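To make the notion of a pseudo-bulk sample concrete, the following is a minimal sketch of one common way to build one: sample cells from a single-cell dataset, sum their counts per gene, and record the true cell type proportions as ground truth. It is an illustration only, not the CATD implementation; the function name `make_pseudobulk`, the cell type labels and the toy counts are assumptions introduced here.

```python
import random

def make_pseudobulk(cells, n_cells, seed=0):
    """Sum the counts of randomly sampled cells into one synthetic bulk sample.

    cells: list of (cell_type, counts) pairs, where counts is a list of
    per-gene counts. Returns (bulk_counts, true_proportions).
    Hypothetical helper for illustration; not part of CATD.
    """
    rng = random.Random(seed)
    sampled = rng.sample(cells, n_cells)  # sampling without replacement
    n_genes = len(cells[0][1])
    bulk = [0] * n_genes
    type_counts = {}
    for cell_type, counts in sampled:
        for g in range(n_genes):
            bulk[g] += counts[g]
        type_counts[cell_type] = type_counts.get(cell_type, 0) + 1
    proportions = {t: c / n_cells for t, c in type_counts.items()}
    return bulk, proportions

# Toy single-cell data (made up): two cell types, three genes.
toy_cells = [("T cell", [5, 0, 1])] * 6 + [("B cell", [0, 4, 1])] * 4
bulk, truth = make_pseudobulk(toy_cells, n_cells=10)
```

Because the true proportions are known by construction, the output of a deconvolution method run on `bulk` can be scored against `truth`. Different sampling schemes (for example, with or without replacement, or with skewed proportions) yield different pseudo-bulks, which is one way the generation technique can influence a benchmark.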
The next chapter explores the performance of methods using real biological data with known composition from different molecular and imaging assays. First, I validate deconvolution results from blood samples using orthogonal flow cytometry measurements. In parallel, six tissues in GTEx are deconvolved, and results are validated using paired immunohistochemistry tissue slides. A consensus approach that combines results from the EpiDISH, FARDEEP, and DWLS deconvolution methods is proposed for accurate deconvolution in different tissues.
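The abstract does not state how the results of the three methods are combined. As a minimal sketch, assuming the consensus is a simple arithmetic mean of the per-method proportion estimates followed by renormalisation, it could look like the following. The function name `consensus_proportions`, the cell type labels and all numbers are hypothetical and are not results from the thesis.

```python
def consensus_proportions(estimates):
    """Average per-cell-type proportions across methods, then renormalise.

    estimates: dict mapping method name -> dict of cell type -> proportion.
    Assumption: an unweighted mean; the thesis may combine methods differently.
    """
    cell_types = sorted({t for est in estimates.values() for t in est})
    mean = {
        t: sum(est.get(t, 0.0) for est in estimates.values()) / len(estimates)
        for t in cell_types
    }
    total = sum(mean.values())
    return {t: v / total for t, v in mean.items()}

# Made-up estimates for one sample, for illustration only.
toy_estimates = {
    "EpiDISH": {"type_a": 0.70, "type_b": 0.30},
    "FARDEEP": {"type_a": 0.60, "type_b": 0.40},
    "DWLS": {"type_a": 0.80, "type_b": 0.20},
}
consensus = consensus_proportions(toy_estimates)
```

Averaging tends to dampen the errors of any single method when those errors are not shared across methods; a median or a weighted mean would be equally plausible alternatives.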
Previously, deconvolution was primarily tested on healthy samples using single-cell atlases as references. However, there is great interest in deconvolving disease samples to understand how composition changes in disease states. Here, I investigate whether deconvolution methods can deconvolve disease samples using healthy single-cell atlases, since cell atlases of disease states are still limited. Liver fibrosis serves as the example: bulk samples from various fibrosis stages were deconvolved using the Healthy Liver Cell Atlas. Changes in the immune subtype and fibroblast composition of liver biopsies were found in later stages of non-alcoholic fatty liver disease.
Finally, I extend deconvolution applications to more complex scenarios such as cancer tissues. In the transition from healthy to cancerous tissue, significant environmental and genetic changes occur that lead to rapid cancer cell proliferation and, in many cases, immune cell infiltration of the tumour tissue. Capturing these changes is crucial for early diagnosis, treatment selection, and identifying targets for drug development. Clinically annotated bulk data are abundant, making them ideal for building machine learning models for outcome prediction. In this chapter, I dissect the melanoma microenvironment by integrating bulk and single-cell melanoma atlases using computational deconvolution. The resulting composition estimates are then used in a composition-aware differential expression analysis, which identifies genes associated with disease progression rather than expression differences arising from changes in cell composition.
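The confounding that a composition-aware analysis addresses can be shown with a toy example. The sketch below is not the method implemented in the thesis; it assumes the generic approach of including an estimated cell type proportion as a covariate in a linear model, and the helper `ols` and all data are made up. Expression of a single gene depends only on the proportion of one cell type, yet the disease samples happen to contain more of that cell type.

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations (X^T X) b = X^T y,
    solved with Gauss-Jordan elimination. Hypothetical helper for illustration."""
    p = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(p)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(p)
    ]
    for col in range(p):
        # Partial pivoting for numerical stability
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

# Toy data: expression = 2 + 10 * proportion, with no true disease effect.
disease = [0, 0, 0, 1, 1, 1]
proportion = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
expression = [2 + 10 * p for p in proportion]

# Naive model: expression ~ intercept + disease
naive = ols([[1, d] for d in disease], expression)
# Composition-aware model: expression ~ intercept + disease + proportion
adjusted = ols([[1, d, p] for d, p in zip(disease, proportion)], expression)
naive_effect, adjusted_effect = naive[1], adjusted[1]
```

In this toy setting the naive model attributes a disease effect of about 3 expression units to the gene, while the adjusted model's disease coefficient is approximately zero, because the difference is fully explained by the change in cell composition.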
In summary, this thesis presents a deconvolution framework for bulk and single-cell data integration, highlights the impact of pseudo-bulk techniques on method comparison, and demonstrates the application of deconvolution in homeostasis, metabolic disease and carcinogenesis. It underscores the importance of testing and refining deconvolution approaches to enhance clinical utility. Moreover, it highlights the value of leveraging historical bulk data to gain new insights into composition and expression patterns in health and disease. Finally, it demonstrates how differential expression can be confounded by changes in cell composition and suggests a composition-aware method for gene expression analysis.