Modern Methods for Variable Significance Testing
Abstract
This thesis concerns the ubiquitous statistical problem of variable significance testing. The first chapter contains an account of classical approaches to variable significance testing, including different perspectives on how to formalise the notion of 'variable significance'. The historical development is contrasted with more recent methods that are adapted both to the scale of modern datasets and to the power of advanced machine learning techniques. This chapter also includes a description of and motivation for the theoretical framework that permeates the rest of the thesis: providing theoretical guarantees that hold uniformly over large classes of distributions.
The second chapter deals with testing the null hypothesis that Y ⊥ X | Z, where X and Y take values in separable Hilbert spaces, with a focus on applications to functional data. The first main result of the chapter shows that for functional data it is impossible to construct a non-trivial test for conditional independence, even when assuming that the data are jointly Gaussian. A novel regression-based test, called the Generalised Hilbertian Covariance Measure (GHCM), is presented, and theoretical guarantees for uniform asymptotic Type I error control are provided; the key assumption requires that the product of the mean squared errors of regressing Y on Z and X on Z converges to zero faster than 1/n.
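The regression-based construction behind the GHCM can be illustrated in its simplest, scalar form (a GCM-type statistic): regress Y on Z and X on Z, then test whether the products of the resulting residuals average to zero. The sketch below is a simplified scalar illustration, not the Hilbertian procedure from the chapter; the regression method (a random forest) and the name `gcm_style_test` are illustrative choices.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor


def gcm_style_test(X, Y, Z):
    """Scalar sketch of a regression-based conditional independence test.

    X, Y are 1-d response arrays; Z is a 2-d array of conditioning variables.
    Regress Y on Z and X on Z, then studentise the mean of the residual
    products; under the null the statistic is approximately N(0, 1).
    """
    # Residuals from the two nonparametric regressions on Z
    ry = Y - RandomForestRegressor(random_state=0).fit(Z, Y).predict(Z)
    rx = X - RandomForestRegressor(random_state=0).fit(Z, X).predict(Z)
    R = rx * ry
    T = np.sqrt(len(R)) * R.mean() / R.std()
    p_value = 2 * stats.norm.sf(abs(T))  # two-sided normal p-value
    return T, p_value
```

The studentisation is what delivers the uniform guarantees discussed above: the test's validity then rests only on the regression errors being small enough, not on a correctly specified model.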
The third and final chapter analyses the problem of nonparametric variable significance testing by testing for conditional mean independence, that is, testing the null hypothesis that E(Y | X, Z) = E(Y | Z) for real-valued Y. A test, called the Projected Covariance Measure (PCM), is derived by considering a family of studentised test statistics and choosing a member of this family in a data-driven way that balances the robustness and power properties of the resulting test. The test is regression-based and is computed by splitting a set of observations of (X, Y, Z) into two sets of equal size: one half is used to learn (nonparametrically) a projection of Y onto X and Z, and the second half is used to test whether the expected conditional correlation between the projection and Y given Z vanishes. The chapter contains general conditions that ensure uniform asymptotic Type I error control of the resulting test by imposing conditions on the mean squared errors of the regressions involved. A modification of the PCM using additional sample splitting and employing spline regression is shown to achieve the minimax optimal separation rate between null and alternative under Hölder smoothness assumptions on the regression functions and the conditional density of X given Z = z. Simulation studies show that the test maintains the strong Type I error control of methods such as the Generalised Covariance Measure (GCM) while having power against a broader class of alternatives.
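The sample-splitting scheme described above can be sketched as follows. This is a heavily simplified illustration of the PCM idea, not the procedure analysed in the chapter (which includes further refinements to the projection and studentisation); the regression method and the names `pcm_style_test` and `fhat` are placeholder choices.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor


def pcm_style_test(X, Y, Z, seed=0):
    """Simplified sketch of a PCM-style test of E(Y|X,Z) = E(Y|Z).

    X, Z are 2-d arrays; Y is a 1-d array. The sample is split in half:
    the first half learns a projection f(X, Z) of Y; the second half tests
    for vanishing expected conditional correlation between f and Y given Z.
    """
    rng = np.random.default_rng(seed)
    n = len(Y)
    idx = rng.permutation(n)
    a, b = idx[: n // 2], idx[n // 2:]
    XZ = np.hstack([X, Z])

    # Half one: learn the projection of Y onto (X, Z)
    f = RandomForestRegressor(random_state=0).fit(XZ[a], Y[a])
    fhat = f.predict(XZ[b])

    # Half two: remove the Z-dependence from both Y and the projection
    m_y = RandomForestRegressor(random_state=0).fit(Z[b], Y[b]).predict(Z[b])
    m_f = RandomForestRegressor(random_state=0).fit(Z[b], fhat).predict(Z[b])

    # Studentised mean of the conditional-correlation terms
    L = (Y[b] - m_y) * (fhat - m_f)
    T = np.sqrt(len(L)) * L.mean() / L.std()
    # One-sided p-value: under the alternative the projection is
    # positively correlated with Y given Z
    return T, stats.norm.sf(T)
```

The data-driven choice of projection is what gives the test power against a broad class of alternatives: rather than fixing a direction in advance, the first half of the sample estimates the direction in which E(Y | X, Z) deviates from E(Y | Z).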
Advisors
Shah, Rajen