New statistical perspectives on efficient Big Data algorithms for high-dimensional Bayesian regression and model selection
This thesis is focused on the development of computationally efficient procedures for regression modelling with datasets containing a large number of observations. Standard algorithms can be prohibitively computationally demanding on large datasets.
Bayesian divide-and-conquer and subsampling methods have been studied in the fixed-model setting, but little attention has been given to model selection. An important task in Bayesian model selection is the computation of the integrated likelihood. We propose divide-and-conquer and subsampling algorithms for estimating the integrated likelihood. The divide-and-conquer approach is based on data augmentation, which is particularly useful for logistic regression. The subsampling approach involves constructing upper and lower bounds on the integrated likelihood. Lower bounds can be formed using variational Bayes techniques, and we show how subsampling can be used to estimate an upper bound on the integrated likelihood.
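To make the target quantity concrete, the following is a minimal sketch in a toy conjugate Gaussian model, where the integrated likelihood has a closed form and a naive Monte Carlo estimate can be checked against it. All names and sizes are illustrative assumptions; this is not the divide-and-conquer or subsampling estimator developed in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model (illustrative only):
# y_i ~ N(theta, sigma^2) with prior theta ~ N(0, tau^2),
# so the integrated likelihood p(y) is available in closed form.
n, sigma, tau = 20, 1.0, 1.0
y = rng.normal(0.7, sigma, size=n)

# Exact log integrated likelihood: marginally y ~ N(0, sigma^2 I + tau^2 11').
# Sherman-Morrison gives the determinant and quadratic form in closed form.
s2, t2 = sigma**2, tau**2
logdet = (n - 1) * np.log(s2) + np.log(s2 + n * t2)
quad = (y @ y - t2 * y.sum() ** 2 / (s2 + n * t2)) / s2
log_ml_exact = -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

# Naive Monte Carlo estimate: average the likelihood over draws from the prior,
# computed stably on the log scale (log-sum-exp trick).
S = 100_000
theta = rng.normal(0.0, tau, size=S)
loglik = (-0.5 * n * np.log(2 * np.pi * s2)
          - 0.5 * ((y[None, :] - theta[:, None]) ** 2).sum(axis=1) / s2)
m = loglik.max()
log_ml_mc = m + np.log(np.mean(np.exp(loglik - m)))
```

In this conjugate setting the exact value is available for comparison; the thesis's methods target models such as logistic regression, where no closed form exists and each likelihood evaluation touches the full dataset, which is what motivates divide-and-conquer and subsampling strategies.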
Sketching algorithms generate a compressed set of responses and predictors that can then be used to estimate regression coefficients. Sketching algorithms use random projections to compress the original dataset, and this stochastic generation process makes them amenable to statistical analysis. We examine the statistical properties of sketching algorithms, which allows us to quantify the error in the coefficients estimated from the sketched dataset. The proportion of variance explained by the model proves to be an important quantity when choosing between alternative sketching algorithms. This is particularly relevant to genetic studies, where the signal-to-noise ratio can be low.
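As an illustration of the basic mechanism, here is a minimal example using a Gaussian random projection, one common choice of sketch; the sizes and variable names are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated regression data: n observations, p predictors; k is the sketch size, k << n.
n, p, k = 10_000, 5, 500
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = X @ beta + rng.normal(size=n)

# Gaussian sketch: a random k x n matrix S with i.i.d. N(0, 1/k) entries,
# so that E[S.T @ S] = I_n. The sketched dataset has only k rows.
S = rng.normal(scale=1.0 / np.sqrt(k), size=(k, n))
X_sketch, y_sketch = S @ X, S @ y

# Least-squares coefficients from the full and sketched data.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_sketch, *_ = np.linalg.lstsq(X_sketch, y_sketch, rcond=None)
```

The sketched fit is computed from k rows instead of n, at the cost of extra randomness in the estimates; a larger sketch size k reduces this error but raises the cost of forming the sketch, which is the trade-off examined in the thesis.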
We also investigate sketching as a tool for posterior approximation. The sketched dataset can be used to generate an approximate posterior distribution over models. As expected, the quality of the posterior approximation increases with the number of observations in the sketched dataset. The trade-off is that the computational cost of sketching increases with the size of the desired sketched dataset. The main conclusion is that impractically large sketch sizes are needed to obtain a tolerable approximation of the posterior distribution over models. We test the performance of sketching for posterior approximation on a large genetic dataset. A key finding is that false positives are a major issue when performing model selection.
Practical regression analysis with large datasets therefore requires balancing statistical accuracy against computational cost.