Repository logo
 

Hypothesis testing and causal inference with heterogeneous medical data


Type

Thesis

Change log

Authors

Bellot, Alexis 

Abstract

Learning from data which associations hold and are likely to hold in the future is a fundamental part of scientific discovery. With increasingly heterogeneous data collection practices, exemplified by passively collected electronic health records or high-dimensional genetic data with only few observed samples, biases and spurious correlations are prevalent. These are called spurious because they do not contribute to the effect being studied. In this context, the modelling assumptions of existing statistical tests and causal inference methods are often found inadequate and their practical utility diminished even though these models are increasingly used as decision-support tools in practice. This thesis investigates how modern computational techniques may broaden the fields of hypothesis testing and causal inference to handle the subtleties of large heterogeneous data sets, as well as simultaneously improve the robustness and theoretical understanding of machine learning algorithms using insights from causality and statistics.

The first part of this thesis is concerned with hypothesis testing. We develop a framework for hypothesis testing on set-valued data, a representation that faithfully describes many real-world phenomena including patient biomarker trajectories in the hospital. Using similar techniques, we develop next a two-sample test for making inference on selection-biased data, in the sense that not all individuals are equally likely to be included in the study, a fact that biases tests if not accounted for and if the desideratum is to obtain conclusions that are generally applicable. We conclude this section with an investigation of conditional independence in high-dimensional data, such as found in gene expression data, and propose a test using generative adversarial networks. The second part of this thesis is concerned with causal inference and discovery, with a special focus on the influence of unobserved confounders that distort the observed associations between variables and yet may not be ruled out or adjusted for using data alone. We start by demonstrating that unobserved confounders may bias substantially the generalization performance of machine learning algorithms trained with conventional learning paradigms such as empirical risk minimization. Acknowledging this spurious effect, we develop a new learning principle inspired by causal insights that provably generalizes to test data sampled from a larger set of distributions different from the training distribution. In the last chapter we consider the influence of unobserved confounders for causal discovery. We show that with some assumptions on the type and influence on the nature of unobserved confounding one may develop provably consistent causal discovery algorithms, formulated as a solution to a continuous optimization program.

Description

Date

2020-12-27

Advisors

van Der Schaar, Mihaela

Keywords

Hypothesis Testing, Machine Learning, Healthcare, Causal Inference

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge