Computationally Efficient Methods for High-Dimensional Statistical Problems


Type

Thesis

Abstract

As the computational power available continues to grow, so does the range of statistical problems that can be tackled. However, many practitioners have only an ordinary personal computer on which to do their work. The need for computationally efficient methodology is therefore as pressing as ever, and some questions remain as yet without a confident answer for a practitioner working under tight computational constraints. This thesis develops methods for three such problems. The first, introductory, chapter provides an overview of the area and an accessible preamble to the problems these methods address.

In the second chapter we address the problem of fitting a high-dimensional linear regression with categorical predictor variables. The natural sparsity assumption in this setting is on the number of unique values that the coefficients within each categorical variable can take. Under this assumption, we introduce a new form of penalty function for the problem. Although the number of ways of grouping the levels of a variable grows extremely fast in the number of levels, the particular structure of the penalty permits fast optimisation. A novel and intricate dynamic programming algorithm computes the exact global optimum over each variable, and is embedded within a block coordinate descent algorithm. This allows such models to be fitted quickly, and in a memory-efficient manner, on a laptop computer. The scaling requirements sufficient for this method to recover the correct groups cannot be relaxed for any estimator; this strong performance is validated by a range of experiments on both simulated and real data.
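To illustrate the kind of sparsity at work here, the following sketch fuses the levels of a single categorical variable: given per-level coefficient estimates, a dynamic program finds the exact partition of the sorted values into k contiguous groups minimising within-group sum of squares, so the fused coefficients take only k unique values. This is an illustrative toy (standard 1D clustering by dynamic programming), not the penalised method or algorithm developed in the thesis; the function name and interface are hypothetical.

```python
import numpy as np

def fuse_levels(means, k):
    """Exact DP sketch: partition the sorted level coefficients into k
    contiguous groups, minimising within-group sum of squares; each
    group shares one fused coefficient value."""
    x = np.sort(np.asarray(means, dtype=float))
    n = len(x)
    # prefix sums give O(1) within-group sum-of-squares for any segment
    s = np.concatenate([[0.0], np.cumsum(x)])
    s2 = np.concatenate([[0.0], np.cumsum(x * x)])

    def cost(i, j):  # within-group SS for the segment x[i:j]
        seg = s[j] - s[i]
        return (s2[j] - s2[i]) - seg * seg / (j - i)

    INF = float("inf")
    dp = np.full((k + 1, n + 1), INF)   # dp[g, j]: best cost of x[:j] in g groups
    back = np.zeros((k + 1, n + 1), dtype=int)
    dp[0, 0] = 0.0
    for g in range(1, k + 1):
        for j in range(g, n + 1):
            for i in range(g - 1, j):
                c = dp[g - 1, i] + cost(i, j)
                if c < dp[g, j]:
                    dp[g, j] = c
                    back[g, j] = i
    # walk back pointers to recover the optimal groups
    cuts, j = [], n
    for g in range(k, 0, -1):
        i = back[g, j]
        cuts.append((i, j))
        j = i
    groups = [x[i:j] for i, j in reversed(cuts)]
    return dp[k, n], [g.mean() for g in groups]
```

The thesis's actual algorithm solves a penalised problem per variable, with the number of groups determined by the penalty rather than fixed in advance; the sketch above only conveys why contiguity of the sorted coefficients makes an exact dynamic program feasible.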

In the third chapter we consider a practitioner who has some a priori belief as to which variables are most likely to be important, expressed as a permutation of the columns of the design matrix. Typical examples of such orderings include the column norms of the (unscaled) design matrix, or the recency of observations in time series data. Our approach takes this ordering and efficiently computes a grid of solution paths by sequentially removing groups of variables, without unnecessary recomputation of coefficients. This procedure, combined with selecting the size of the support set by validation on a test set, performs similarly to fitting the oracular submodel.
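The selection step can be pictured with a deliberately naive sketch: walk down the given importance ordering, fit on the first k columns for each k, and choose the support size with the smallest validation error. This ignores the efficient path computation and recomputation-avoidance that the chapter develops; the function name and interface are hypothetical.

```python
import numpy as np

def path_by_ordering(X_tr, y_tr, X_te, y_te, order):
    """Naive sketch (not the thesis algorithm): for each prefix of the
    given column ordering, fit least squares on those columns and record
    the test-set error; return the best prefix as the selected support."""
    errs = []
    for k in range(1, len(order) + 1):
        cols = order[:k]
        beta, *_ = np.linalg.lstsq(X_tr[:, cols], y_tr, rcond=None)
        resid = y_te - X_te[:, cols] @ beta
        errs.append(resid @ resid / len(y_te))   # test MSE for support size k
    k_best = int(np.argmin(errs)) + 1
    return order[:k_best], errs
```

Refitting from scratch for every prefix costs far more than necessary; the point of the chapter's method is to share computation across the grid of models so the whole path comes at little more than the cost of a single fit.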

The fourth chapter concerns the efficient estimation of conditional independence graphs in Gaussian graphical models. Neighbourhood selection is practical, popular, and enjoys good performance, but in large-scale settings its computational demands can still exceed the resources available to many practitioners. Screening approaches promise large improvements in speed at only a small price in resulting estimation performance. Although it is well known that nodes adjacent in the conditional independence graph may be uncorrelated, a minimum absolute correlation between adjacent nodes is often tacitly or explicitly assumed for screening procedures to be effective. We draw on recent work in covariance estimation and high-dimensional variable screening to develop a fast two-stage screening procedure, designed specifically for use within neighbourhood selection, that avoids this restrictive assumption. Provided a weaker version of a minimum edge strength requirement holds over most of the graph, the performance of the post-screening nodewise regressions is not compromised, while the procedure is substantially faster than the full one. The method is robust to the presence of latent confounders, as well as to other scenarios that typically impede the screening of variables. Experiments show that our approach strikes a favourable balance between edge detection and computational efficiency.
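As background, the baseline being improved upon can be sketched as follows: for one node, screen candidate neighbours by absolute marginal correlation, then run a nodewise regression on the survivors only. This one-stage marginal screen is exactly the approach that fails when adjacent nodes are uncorrelated; it is not the thesis's two-stage procedure, and the function name, thresholding rule, and use of OLS in place of the lasso are all simplifications for illustration.

```python
import numpy as np

def screened_neighbourhood(X, j, d, tau):
    """One-stage baseline sketch: estimate the neighbourhood of node j by
    (1) keeping the d nodes most marginally correlated with node j, then
    (2) regressing node j on the survivors and thresholding coefficients."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)          # centre each column
    y = Xc[:, j]
    others = [k for k in range(p) if k != j]
    # stage 1: marginal-correlation screen
    corr = np.array([abs(np.corrcoef(y, Xc[:, k])[0, 1]) for k in others])
    keep = [others[i] for i in np.argsort(corr)[::-1][:d]]
    # stage 2: nodewise regression on the screened set, then threshold
    beta, *_ = np.linalg.lstsq(Xc[:, keep], y, rcond=None)
    return {k for k, b in zip(keep, beta) if abs(b) > tau}
```

Running this over every node and symmetrising the recovered neighbourhoods yields a graph estimate; the chapter's contribution is a screening stage that remains reliable even when the marginal correlations above are misleadingly small.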

Date

2021-07-07

Advisors

Shah, Rajen

Keywords

Statistics, Regression, Sparsity, Algorithms

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Sponsorship

Cantab Capital Institute for the Mathematics of Information