High-dimensional covariance estimation with applications to functional genomics
Authors
Abstract
Covariance matrix estimation plays a central role in statistical analyses. In molecular biology, for instance, covariance estimation facilitates the identification of dependence structures between molecular variables that shed light on the underlying biological processes. However, covariance estimation is generally difficult because high-throughput molecular experiments often generate high-dimensional and noisy data, possibly with missing values. In such settings, there is a need for scalable and robust estimation methods that can improve inference by, for example, taking advantage of the many sources of external information available in public repositories.
This thesis introduces novel methods and software for estimating covariance matrices from high-dimensional data. Chapter 2 introduces a flexible and scalable Bayesian linear shrinkage covariance estimator. The estimator accommodates multiple shrinkage target matrices, allowing the incorporation of external information from an arbitrary number of sources. It is also less sensitive to target misspecification and can outperform state-of-the-art single-target linear shrinkage estimators.
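As a rough illustration of multi-target linear shrinkage in general, the sketch below forms a convex combination of the sample covariance and several target matrices. It is not the Bayesian estimator of Chapter 2: the function name `multi_target_shrinkage`, the two targets, and the fixed weights are all assumptions made for this example, whereas a data-driven method would estimate the weights.

```python
import numpy as np

def multi_target_shrinkage(X, targets, weights):
    """Convex combination of the sample covariance and K target matrices.

    Hypothetical helper for illustration only: the weights are supplied by
    the caller, not estimated from the data.
    """
    X = np.asarray(X, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if len(targets) != len(weights):
        raise ValueError("need exactly one weight per target")
    if np.any(weights < 0) or weights.sum() > 1:
        raise ValueError("weights must be non-negative and sum to at most 1")
    S = np.cov(X, rowvar=False)            # p x p sample covariance
    estimate = (1.0 - weights.sum()) * S   # weight left for the data
    for w, T in zip(weights, targets):
        estimate = estimate + w * T
    return estimate

rng = np.random.default_rng(0)
n, p = 20, 50                              # fewer samples than variables
X = rng.standard_normal((n, p))
S = np.cov(X, rowvar=False)

# Two simple assumed targets: a scaled identity and the diagonal of S.
targets = [np.mean(np.diag(S)) * np.eye(p), np.diag(np.diag(S))]
sigma_hat = multi_target_shrinkage(X, targets, weights=[0.3, 0.2])

print("rank of sample covariance:", np.linalg.matrix_rank(S))
print("shrunk estimate is positive definite:",
      bool(np.linalg.eigvalsh(sigma_hat).min() > 0))
```

With fewer samples than variables the sample covariance is singular, while the combination with positive definite targets is invertible. In practice, targets could instead encode external information, which is the use case the abstract describes.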
Chapter 3 explores a dimensionality reduction approach, probabilistic principal component analysis, as a model-based covariance estimation method that can handle missing values. Because it assumes a low-dimensional latent structure, this approach is particularly useful when the inverse covariance matrix is required (e.g. for network inference). All of our methods are implemented as well-documented open-source R libraries.
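To show why a low-dimensional latent structure helps when the inverse covariance is needed, the sketch below computes the standard closed-form maximum-likelihood probabilistic PCA covariance, W Wᵀ + σ² I, for complete data and inverts it with the Woodbury identity. It is a minimal illustration under stated assumptions, not the thesis software: the function name `ppca_covariance`, the simulated data, and the choice q = 3 are hypothetical, and the missing-value case (typically handled with an EM algorithm) is not covered.

```python
import numpy as np

def ppca_covariance(X, q):
    """Closed-form ML probabilistic PCA covariance and precision.

    Hypothetical helper for illustration; assumes complete data and
    0 < q < p latent dimensions.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    if not 0 < q < p:
        raise ValueError("q must satisfy 0 < q < p")
    S = np.cov(X, rowvar=False, bias=True)      # ML sample covariance (1/n)
    eigvals, eigvecs = np.linalg.eigh(S)        # ascending eigenvalues
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    sigma2 = eigvals[q:].mean()                 # average discarded variance
    scale = np.sqrt(np.maximum(eigvals[:q] - sigma2, 0.0))
    W = eigvecs[:, :q] * scale                  # p x q loading matrix
    cov = W @ W.T + sigma2 * np.eye(p)
    # Woodbury identity: only a q x q linear system has to be solved.
    M = W.T @ W + sigma2 * np.eye(q)
    precision = (np.eye(p) - W @ np.linalg.solve(M, W.T)) / sigma2
    return cov, precision

rng = np.random.default_rng(1)
n, p, q = 30, 40, 3
A = rng.standard_normal((p, q))                 # simulated loadings
X = rng.standard_normal((n, q)) @ A.T + 0.5 * rng.standard_normal((n, p))

cov, precision = ppca_covariance(X, q)
print("precision matches the inverse:",
      bool(np.allclose(cov @ precision, np.eye(p), atol=1e-8)))
```

Because the model covariance is a low-rank term plus a scaled identity, it is invertible even when the sample size is smaller than the number of variables, and the inversion only involves a q x q matrix.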
Finally, Chapter 4 presents a case study using a dataset of cytokine expression in patients with traumatic brain injury. Studies of this type are crucial for understanding the inflammatory response in the brain and its relation to patient recovery. However, because patient recruitment is difficult, they yield high-dimensional datasets with relatively small sample sizes. We show how our methods can facilitate the multivariate analysis of cytokines across time and different treatment regimes.
Advisors
Leday, Gwenaël
Vallejos, Catalina