New statistical perspectives on efficient Big Data algorithms for high-dimensional Bayesian regression and model selection

Daniel Christian Ahfock
Fitzwilliam College
MRC Biostatistics Unit, University of Cambridge
September 2018

A thesis presented for the degree of Doctor of Philosophy

Abstract

This thesis is focused on the development of computationally efficient procedures for regression modelling with datasets containing a large number of observations. Standard algorithms can be prohibitively computationally demanding on large n datasets, and we propose and analyse new computational methods for model fitting and selection. We explore three different generic strategies for tall datasets. Divide and conquer approaches split the full dataset into subsets, with the subsets then being analysed independently in parallel. The subset results are then pooled into an overall consensus. Subsampling based methods repeatedly use minibatches of data to estimate quantities of interest. The third strategy is sketching, a probabilistic data compression technique developed in the computer science community. Sketching uses random projection to compress the original large dataset, producing a smaller surrogate dataset that is less computationally demanding to work with. The sketched dataset can be used for approximate inference. We test our regression algorithms on several large n genetic datasets, aiming to find associations between genetic variants and red blood cell traits.

Bayesian divide and conquer and subsampling methods have been studied in the fixed model setting, but little attention has been given to model selection. An important task in Bayesian model selection is computation of the integrated likelihood. We propose divide and conquer and subsampling algorithms for estimating the integrated likelihood. The divide and conquer approach is based on data augmentation, which is particularly useful for logistic regression. The subsampling approach involves constructing upper and lower bounds on the integrated likelihood using information theory.

Sketching algorithms generate a compressed set of responses and predictors that can then be used to estimate regression coefficients. Sketching algorithms use random projections to compress the original dataset, and this stochastic generation process makes them amenable to statistical analysis. We examine the statistical properties of sketching algorithms, which allows us to quantify the error in the coefficients estimated using the sketched dataset. The proportion of variance explained by the model proves to be an important quantity in choosing between alternative sketching algorithms. This is particularly relevant to genetic studies, where the signal to noise ratio can be low. We also investigate sketching as a tool for posterior approximation. The sketched dataset can be used to generate an approximate posterior distribution over models. As expected, the quality of the posterior approximation increases with the number of observations in the sketched dataset. The trade-off is that the computational cost of sketching increases with the size of the desired sketched dataset. The main conclusion is that impractically large sketch sizes are needed to obtain a tolerable approximation of the posterior distribution over models. We test the performance of sketching for posterior approximation on a large genetic dataset. A key finding is that false positives are a major issue when performing model selection.
Practical regression analysis with large n datasets can require specialised algorithms. Parallel processing, subsampling and random projection are all useful tools for computationally efficient regression modelling.

Declaration

• This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specified in the text.
• I further state that no substantial part of my dissertation has already been submitted, or is being concurrently submitted, for any such degree, diploma or other qualification at the University of Cambridge or any other University or similar institution.
• Chapters 4 and 5 are joint work with Sylvia Richardson and William Astle. The contents of these chapters have been submitted jointly for publication.

Acknowledgement

Firstly, thank you to my supervisors Sylvia Richardson and William Astle for all of their help and guidance over the course of the PhD. The work in this thesis would not have come together without their efforts. I am also very grateful to my parents, Tony and Georgette, and to my sister Melody for all of their support and encouragement. Even though Cambridge is a long way from Australia, I never felt too far away from home. Finally, thank you to my wonderful wife Amy for being such an amazing partner on this journey.

Contents

List of Figures
List of Tables
1 Introduction
2 Split, apply, combine: computing the model evidence using embarrassingly parallel processing
  2.1 Introduction
  2.2 Divide and conquer Bayesian inference
    2.2.1 Posterior sampling
    2.2.2 Model uncertainty
  2.3 Background
    2.3.1 Integrated likelihood calculation
    2.3.2 Savage-Dickey density ratio
  2.4 General Bayesian models
    2.4.1 Introduction
    2.4.2 Subset saturated model
    2.4.3 Split and apply steps
    2.4.4 Combine step
    2.4.5 Example: ESP dataset
    2.4.6 Curse of dimensionality
    2.4.7 Embarrassingly parallel evidence estimation
  2.5 Data augmentation for distributed inference
    2.5.1 Introduction
    2.5.2 Conjugate priors in the exponential family
    2.5.3 Data augmentation
    2.5.4 Chib's method
    2.5.5 Apply step
    2.5.6 Combine step
    2.5.7 Embarrassingly parallel evidence estimation
    2.5.8 Monte Carlo error
  2.6 Logistic regression
  2.7 Data application
    2.7.1 Flights dataset
    2.7.2 Pima Indians dataset
  2.8 Conclusion
3 Bounding the model evidence using the subsampled sandwich estimator
  3.1 Introduction
  3.2 Bayesian model selection
    3.2.1 Evidence bounds
    3.2.2 Entropy
    3.2.3 Upper bounding the evidence
    3.2.4 Lower bounding the evidence
    3.2.5 Sandwiching the evidence
  3.3 Related work
    3.3.1 Subsampled log likelihoods
    3.3.2 Subsampled likelihoods
    3.3.3 Importance sampling
    3.3.4 Harmonic mean estimator
    3.3.5 Bridge sampling
    3.3.6 Laplace approximation
    3.3.7 Laplace-Metropolis estimator
  3.4 Application to tall datasets
    3.4.1 Estimation of evidence bounds
  3.5 Data application: flights dataset
  3.6 Conclusion
  3.7 Appendix
    3.7.1 Control variates
    3.7.2 Asymptotic variance
4 Statistical properties of sketching algorithms
  4.1 Introduction
  4.2 Background and related work
    4.2.1 Embedding bounds
    4.2.2 Sketches
    4.2.3 Sketching examples
    4.2.4 Sketching bounds
  4.3 Gaussian sketching
    4.3.1 Complete sketching
    4.3.2 Partial sketching
    4.3.3 Relative efficiency
    4.3.4 Combined estimator
  4.4 Asymptotics
    4.4.1 Preliminaries
    4.4.2 Sketching central limit theorem
    4.4.3 Sketching estimators
  4.5 Data application
    4.5.1 Human leukocyte antigen dataset
    4.5.2 Flights dataset
  4.6 Discussion
5 Proofs regarding sketching algorithms
  5.1 Proof of Theorem 4.1 (Worst case bound for partial sketching)
  5.2 Proof of Theorem 4.2 (Hierarchical model for the Gaussian sketch)
  5.3 Variance for partial sketching
  5.4 Combined estimator results
  5.5 Proof of Theorem 4.4 (central limit theorem under asymptotic negligibility condition)
  5.6 Proof of Theorem 4.3 (Sketching central limit theorem)
    5.6.1 Clarkson-Woodruff sketch
    5.6.2 Hadamard sketch
  5.7 Proof of Theorem 4.5 (Complete sketching asymptotics)
  5.8 Proof of Theorem 4.6 (Partial sketching asymptotics)
6 On subspace embeddings, Tracy-Widom limits and approximate Bayesian subset selection
  6.1 Summary
  6.2 Introduction
  6.3 Bayesian model selection
  6.4 Approximate Bayesian inference
  6.5 Embedding probabilities
    6.5.1 Previous work
    6.5.2 Gaussian sketch
  6.6 Asymptotics
  6.7 Random matrix theory
    6.7.1 Pointwise limit
    6.7.2 Tracy-Widom limit
  6.8 Sketching asymptotics
  6.9 Data application
    6.9.1 Embedding probabilities
    6.9.2 Posterior approximation
  6.10 Conclusion
7 Conclusion
References

List of Figures

1.1 Box's loop (Box, 1976).
2.1 Template for embarrassingly parallel algorithms.
2.2 Sleep dataset from the Behavioural Risk Factor Surveillance survey.
2.3 Divide and conquer analysis of the sleep dataset with s = 2.
2.4 Subposterior distributions for the cubic model on the sleep dataset.
2.5 Target Bayesian model.
2.6 Illustration of the subposterior density identity using the ESP dataset.
2.7 Alternative hierarchical Bayesian model.
2.8 Divide and conquer analysis of the extra sensory perception dataset with s = 3.
2.9 Flights dataset and simple logistic regression model.
2.10 Comparison of logistic regression models on the flights dataset.
2.11 Comparison of subposterior and target posterior distributions on the latent variables for the Pima Indians dataset.
2.12 Uniform split of the Pima Indians dataset.
2.13 Biased split of the Pima Indians dataset.
2.14 Distribution of log Îsub for the Pima Indians dataset.
3.1 ROC curves for the flights dataset.
3.2 Reliability curves for the flights dataset.
4.1 Example sketching matrices.
4.2 Abalone dataset.
4.3 Bias of sketching estimators on the HLA dataset.
4.4 Normality tests of the sketched dataset.
6.1 Comparison of simulated embedding probabilities against theoretical results at different k and d.
6.2 Tracy-Widom distribution.
6.3 Empirical probability of obtaining an ε-subspace embedding at different k and d.
6.4 Observed and theoretical density of σmax(Id − UᵀSᵀSU) at different k and d.
6.5 Regions of interest in determining the embedding probability.
6.6 Empirical and theoretical embedding probabilities for the representative PRKCE dataset.
6.7 Empirical and theoretical embedding probabilities for the full PRKCE genetic dataset.
6.8 Empirical and theoretical density of σmax(Id − UᵀSᵀSU) for the full PRKCE genetic dataset.
6.9 Comparison of theoretical and empirical density of σmax(Id − UᵀSᵀSU) on the full PRKCE dataset and the bootstrapped PRKCE dataset.
6.10 Manhattan plot using the representative PRKCE dataset.
6.11 Marginal posterior probabilities of inclusion for the PRKCE dataset.
6.12 Boxplots of sketched marginal inclusion probabilities.
6.13 Histograms of sketched marginal inclusion probabilities (low evidence SNPs).
6.14 Histograms of sketched marginal posterior probabilities (moderate evidence SNPs).
6.15 Boxplots of sketched marginal inclusion probabilities.
6.16 Sketched marginal inclusion probabilities (low evidence SNPs).
6.17 Sketched marginal inclusion probabilities (moderate evidence SNPs).
7.1 Box's loop (Box, 1976).
7.2 Template for embarrassingly parallel algorithms.

List of Tables

2.1 Terminology for divide and conquer Bayesian inference.
2.2 Posterior and subposterior model probabilities for the sleep dataset.
2.3 Required sample size to attain a relative mean square error of less than 0.1 using nonparametric kernel density estimation against dimension (Silverman, 1986).
2.4 ESP dataset and subsets for s = 2.
2.5 Full ESP dataset and subsets for a divide and conquer approach with s = 3. The success proportion is close to 0.5 in the full dataset and in each subset.
2.6 Raw data for subjects shown in Figure 2.11.
3.1 Guidelines for the interpretation of Bayes factors (Kass and Raftery, 1995).
3.2 Log Bayes factors for the flights dataset.
3.3 Estimates of the model evidence for the flights dataset.
3.4 Time spent on likelihood evaluations for the flights dataset.
3.5 Proportion of negative likelihood estimates using the simple Poisson estimator.
4.1 Properties of different data oblivious random projections (Woodruff, 2014).
4.2 Mean square error of sketched estimators on HLA dataset.
4.3 Coverage of confidence intervals on the HLA dataset.
4.4 Mean square error of sketched estimators on flights dataset.
4.5 Coverage of 95% confidence intervals on the flights dataset.
4.6 Mean square error of sketched estimators on synthetic flights dataset.
6.1 Properties of different data oblivious random projections.
6.2 Mean sketching time for the representative PRKCE dataset.
6.3 Mean sketching time (seconds) for the full PRKCE dataset.
6.4 Monte Carlo and asymptotic estimates of sketch quality for the Clarkson-Woodruff sketch.
6.5 Monte Carlo and asymptotic estimates of sketch quality for the Gaussian sketch.

1 Introduction

In their recent monograph 'Computer age statistical inference', Efron and Hastie (2016) trace the development of statistical methodology over the past century with a particular emphasis on computationally intensive procedures. A central theme in the book is that the frontiers of statistical inference are pushed forward as both available datasets and data analysis methods evolve in complexity and richness. The capacity to acquire data and the capacity to analyse data typically advance in tandem, jointly propelled by technological developments. New systems create datasets of unprecedented size and scope that require analysis with new computer driven statistical procedures. For example, microarray technology presented large-scale hypothesis testing challenges. Computationally intensive false discovery rate methodology (Benjamini and Hochberg, 1995) was then honed in the analysis of this data. In turn, these new algorithms open the doors for more ambitious scientific investigations. False discovery rate methods can serve as a valuable tool in genome-wide association studies. New datasets can strain existing methodology, and innovation can spring from these difficulties. The emergence of Big Data marks another period where statistical methodology will simultaneously be challenged and afforded the opportunity to expand its reach.

Efron and Hastie (2016) observe that computational statistics research can fall into two camps. It is generally possible to make a categorisation between algorithm development and algorithm evaluation, although this division is not fully impermeable. Within the statistical field, algorithmic development typically takes place with inferential operating characteristics and principles in mind. New algorithms for data analysis are also developed by a wide range of subject disciplines, often shaped using other criteria. Teasing out the underlying statistical context and significance of these innovative approaches can lead to theoretical insights and improved techniques. The statistical toolbox can be applied to algorithms in two different ways:

• Development of new algorithms for statistical inference.
• Evaluation of existing algorithms using statistical theory and methods.

Big Data hoists a big tent. Research contributions in the domain will come from many fields. Statistical contributions will inevitably take both of the aforementioned forms. We can take new algorithms to Big Data or we can seek to understand the statistical basis of promising methodology that has been proposed outside of the area. The work in this thesis falls into both of these categories, to be outlined in more detail shortly.
To locate our focus within the broader field of statistics it is effective to consider Box's loop, a conceptual process model of scientific data analysis (Box, 1976). Box's loop represents the cycle of model building, fitting, checking and refining that typically takes place when interacting with data, diagrammed in Figure 1.1. Following Blei (2014) it is possible to identify four key steps in Box's loop: build, compute, critique and repeat. The initial build step is to postulate a plausible data generating process behind the observed data. Examples include but are certainly not limited to linear regression models, generalised linear models or mixture models. Substantial domain knowledge can be encoded at this stage. The second step is to compute the structural unknowns in the built model. This can involve the estimation of parameters through optimisation, or sampling from posterior distributions. The third step is to criticise the model through a formal procedure. Examples include residual diagnostics and posterior predictive checks. After the assessment, we then typically improve aspects of the model and repeat the process until we are satisfied. The final model is then deployed for prediction, visualisation, inference or any other suitable objective.

Figure 1.1: Box's loop is a conceptual process model of scientific data analysis (Box, 1976). Box's loop defines a number of key phases (build, compute, critique and repeat) when approaching a data modelling problem. The compartmentalisation of the overall task highlights important tactical decisions that are involved in the statistical modelling lifecycle and aids the elicitation of broader strategic elements that influence the work. Adapted from Blei (2014).

In an ideal world of infinite computational resources, the compute step would be instantaneous and painless. In reality we feel the pinch of scarcity, and with tall datasets the compute step can be a significant bottleneck. The work in this thesis is focused around the compute step of Box's loop. In particular, we are concerned with the development of scalable computational methods for Bayesian regression modelling on large datasets. Consider the general scenario where we have a dataset y of n observations with likelihood p(y|θ) and prior p(θ), for θ ∈ Ω ⊂ Rd. Many of the key Monte Carlo methods for Bayesian computation require a full likelihood evaluation p(y|θ) per iteration (Robert, 2007; Robert and Casella, 2010). This is a highly undesirable trait in the Big Data era when the O(n) cost of likelihood evaluations can be substantial. We propose a number of computationally efficient algorithms for Bayesian regression modelling with tall (large n) datasets that address the likelihood burden. We focus on issues surrounding model selection, in particular computation of the integrated likelihood p(y) = ∫ p(y|θ)p(θ) dθ. The latter half of the thesis takes a particular focus on the Gaussian linear model, and how random projection can be used for computationally efficient statistical inference. In particular we illuminate some statistical properties of randomised data compression algorithms proposed in the computer science and machine learning literature.

The integrated likelihood p(y) is also commonly known as the model evidence, and the quantity plays an important role in Bayesian model choice. We develop theory and methods for efficient computation of the integrated likelihood using parallel processing in Chapter 2.
Chapter 3 explores how subsampling can be used for efficient estimation of the integrated likelihood. The parallel processing and subsampling algorithms are compatible with a variety of regression models. We consider their application to logistic regression in detail. Chapters 4 and 5 investigate the use of random projection for approximate computation of the least squares estimates for Gaussian linear models. Chapter 6 extends the approach to carry out approximate Bayesian model selection.

There have been many recent advances in developing scalable computational methods for Bayesian inference in the fixed model setting. With the model given and fixed, the typical goal is to generate samples from the posterior distribution p(θ|y) ∝ p(y|θ)p(θ). Uncertainty on θ can shrink to the point of no practical significance in an analysis with a single model and huge n. This phenomenon is a motivating dynamic for our purposes. Model uncertainty, as opposed to parameter uncertainty, can often emerge as the more pressing inferential question given a massive number of observations (Varian, 2014). Bayesian model selection has received comparatively little attention in the Big Data literature, and much of the novelty in this thesis comes from the focus on this area. We explore how distributed computing, subsampling and random projection can be used as methods for efficient Bayesian model selection.

These techniques have been integrated into algorithms for sampling from the posterior distribution p(θ|y) when n is large. There are largely independent streams of literature surrounding each technique; Bardenet et al. (2017) provide an excellent survey of prior work. The problems of posterior sampling and integrated likelihood evaluation are connected, but not fully equivalent. Algorithms for calculating the model evidence typically require the generation of posterior samples as an initial step. As algorithms for posterior simulation do not typically produce the model evidence as an ancillary benefit, the posterior samples must then be used in a secondary estimation procedure to obtain the model evidence (Friel and Wyse, 2012; Raftery, 1995; Skilling et al., 2006). The two-stage nature of the problem poses new research questions in the Big Data setting. Accelerated procedures for posterior sampling do not naturally lead to accelerated procedures for calculation of the integrated likelihood. Our use of parallel processing, subsampling and random projection for calculation of the model evidence has major differences compared to prior work that is focused on posterior simulation.

Distributed computing platforms lend themselves naturally to a divide and conquer approach for the analysis of tall datasets. The difficult Big Data job can be broken down into a series of smaller manageable tasks by splitting the full dataset into subsets, and analysing each subset on a separate machine. The subset results are then merged together to provide the end result. An important factor in a divide and conquer algorithm is the degree of communication required between subprocessors during the minibatch analyses. Embarrassingly parallel algorithms require no communication during this stage, and the only pooling of information occurs in a single round at the end. Embarrassingly parallel algorithms are attractive as not all distributed computing platforms allow for communication between individual workers.
Secondly, the latency cost of regular communication between machines can be very high (Scott et al., 2013).

Subsampling also has a natural appeal for large n problems. It is sometimes possible to modify standard algorithms that involve full likelihood evaluations to instead use estimated likelihoods from a subsample of size m. Unsurprisingly, the efficiency of the modified algorithm depends on the quality of the likelihood estimator. A common finding in prior work is that Monte Carlo variance reduction techniques are needed to prevent algorithms from breaking down as the subsampling fraction m/n tends to zero (Maclaurin and Adams, 2014; Bierkens et al., 2016; Baker et al., 2017).

The computational demands of tall datasets can be addressed by sacrificing some accuracy in order to lower the computational expense of the analysis. Algorithms that play this trade-off to good effect have been identified as a promising future direction for Bayesian computation (Green et al., 2015). This principle leads us to consider 'sketching', a probabilistic data compression technique that has been developed largely within the computer science community. Sketching algorithms use random projection to generate a compressed dataset of k observations from the original source dataset of n observations. The compressed dataset is then used for approximate inference. Crucially, sketching algorithms offer stronger probabilistic guarantees on the stochastic approximation error than can be achieved through simple random sampling from the full dataset. Sketching algorithms have also shown excellent empirical performance compared to basic subsampling schemes in a number of simulation studies (Mahoney, 2011; Ma et al., 2015). Sketching has recently been explored as a method for approximate Bayesian inference in the fixed model setting (Geppert et al., 2017).
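To fix ideas, the fragment below sketches a tall simulated regression dataset with a Gaussian random projection and then solves least squares on the compressed data. This is my own toy illustration of the general recipe rather than code or data from the thesis (the data applications later use genetic and flights datasets), and the projection matrix is only formed explicitly because the example is small.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 10_000, 5, 500               # source rows n, predictors d, sketch size k

# Simulated tall dataset (X, y).
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -0.5, 0.0, 0.25, 2.0]) + rng.normal(size=n)

# Gaussian sketch: S has i.i.d. N(0, 1/k) entries so that E[S'S] = I_n.
S = rng.normal(scale=1.0 / np.sqrt(k), size=(k, n))
X_sketch, y_sketch = S @ X, S @ y      # compressed dataset with k observations

beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
beta_sketch = np.linalg.lstsq(X_sketch, y_sketch, rcond=None)[0]
print(np.round(beta_full - beta_sketch, 3))   # approximation error shrinks as k grows
```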
In Chapter 2 we develop an embarrassingly parallel algorithm for computation of the integrated likelihood. Data augmentation and Gibbs sampling are the key components of the method. In Chapter 3 we propose a computationally efficient method for estimating the integrated likelihood using subsampling. We find that it is difficult to modify existing importance sampling algorithms to use subsampling effectively. We propose an alternative interval estimator of the model evidence that has more favourable properties. As in related work, it is necessary to use variance reduction techniques to ensure the algorithm is stable even as the subsampling fraction m/n tends to zero. Chapter 4 develops statistical properties of sketched regression algorithms in the fixed model setting. Much of the existing literature on sketching is from an algorithmic point of view, and we place sketching in a statistical context. This groundwork is needed before considering sketched model selection. Chapter 5 gives proofs for many of the results in Chapter 4. Proofs are presented separately in Chapter 5 so as to not hamper the exposition in Chapter 4. Of particular note is a proof of a central limit theorem where we show the sketched dataset has a matrix normal distribution under mild regularity conditions. The regularity conditions have an appealing interpretation in terms of the geometry of the source dataset. Chapter 6 explores the use of sketching for approximate Bayesian subset selection. Random projection can be used to approximate the integrated likelihood p(y) in less time than is needed for exact computation. We find that the Tracy-Widom law (Tracy and Widom, 1994) is very useful for quantifying the error in the approximate integrated likelihood. The Tracy-Widom law describes the asymptotic distribution of the extreme eigenvalues of large random matrices and has found many applications in high-dimensional statistics (Johnstone, 2006). The connection to sketching algorithms appears to be a new result.

Each chapter is largely self-contained and includes a review of relevant literature. Returning to the development/evaluation classification that was mentioned earlier, Chapters 2 and 3 can be viewed as algorithm development. We propose new methodology for Bayesian computation with tall datasets. Chapters 4, 5 and 6 can be classified as algorithm evaluation. We investigate the statistical properties and principles that underpin existing sketching algorithms. As well as being of theoretical interest, this helps to define concrete procedures for assessing the quality of randomised algorithms. Multiple real datasets are analysed throughout in order to demonstrate the theory and methods. Chapters 2 and 3 consider a number of benchmark datasets from the computational statistics literature. Chapters 4 and 6 analyse some large genetic datasets from the UK Biobank database. The dissertation concludes in Chapter 7 with a short discussion of the major lessons and common themes that have materialised over the enclosed work.

2 Split, apply, combine: computing the model evidence using embarrassingly parallel processing

Summary

We investigate how parallel processing can be used for computing the integrated likelihood on datasets with a large number of observations n. Tall datasets can be split into many subsets that are then allocated to different machines on a compute cluster. The subsets can then be analysed concurrently; embarrassingly parallel algorithms run each minibatch analysis with no cross communication between subprocessors. We find that a combination of data augmentation and Gibbs sampling facilitates a simulation consistent and scalable embarrassingly parallel algorithm for a wide class of statistical models. We show that conditionally conjugate exponential family models exhibit structure that is amenable to embarrassingly parallel inference. A second theoretical finding involves the final step of the divide and conquer algorithm where the subset results are pooled to give the final estimate of the model evidence. We show this step has an interpretation as a Bayesian hypothesis testing problem. The connection furthers the theory of distributed Bayesian inference and leads to a second simulation consistent algorithm for computing the model evidence in parallel.

2.1 Introduction

Distributed computing platforms offer a huge amount of computing power that can be harnessed to complete many tasks simultaneously. Using parallel processing for the statistical analysis of large datasets is thus an appealing idea. The broad paradigm is to first split the full dataset into a collection of non-overlapping subsets. Subsets are then allocated to different machines on a network, and we then apply conventional statistical methods to each minibatch of data. The final step is to combine the subset output into a single estimate. Separating the algorithm into distinct split, apply and combine phases is useful for algorithm design and evaluation. Figure 2.1 displays a broad schematic of the desired approach.
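The template in Figure 2.1 can be written down very compactly. The sketch below is a deliberately simple illustration of the split/apply/combine pattern, my own example in which the apply-step summary is the per-subset least squares sufficient statistics rather than the evidence calculations developed later in this chapter. The point is that the combine step only ever sees the compact summaries returned by the workers.

```python
import numpy as np
from multiprocessing import Pool

def apply_step(subset):
    """Analyse one subset with no cross communication; return a compact summary."""
    X, y = subset
    return X.T @ X, X.T @ y

def combine_step(summaries):
    """Pool the subset summaries into a single consensus estimate."""
    XtX = sum(out[0] for out in summaries)
    Xty = sum(out[1] for out in summaries)
    return np.linalg.solve(XtX, Xty)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50_000, 3))
    y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=50_000)

    # Split step: s non-overlapping subsets, one per worker.
    s = 4
    subsets = list(zip(np.array_split(X, s), np.array_split(y, s)))

    # Apply step runs embarrassingly in parallel; the combine step never
    # touches the original dataset, only the s summaries.
    with Pool(s) as pool:
        summaries = pool.map(apply_step, subsets)
    print(combine_step(summaries))
```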
A stringent desideratum of an embarrassingly parallel scheme is that the combine phase does not allow interaction with the original dataset. The entire dataset is processed and distilled in the apply phase. Embarrassingly parallel algorithms follow a divide and conquer principle, where the original difficult problem is broken down into a series of manageable tasks. This computational blueprint presents novel challenges for statistical inference. It can be difficult to identify appropriate summary measures for the subset analyses conducted in the apply stage, and the final aggregation in the combine phase is a challenging evidence synthesis task (Jordan, 2013).

These issues are present when implementing a divide and conquer approach for Bayesian model selection. Typically, each subset will be analysed using Markov chain Monte Carlo methods and will return a rich set of output. An additional consideration with model selection is that individual subset analyses will be underpowered relative to a full dataset analysis. We are unlikely to see support for complex models in the subset analyses, despite the fact that we may have a large enough dataset to detect complex features. We explore theoretical and practical issues for distributed Bayesian model selection. Of particular concern is how to summarise the subset analyses effectively in the apply step and how to synthesise the minibatch results appropriately into an overall consensus in the combine step.

Figure 2.1: Template for embarrassingly parallel algorithms. The split step breaks the full dataset into s non-overlapping subsets. The illustration is for s = 2. Each subset is then allocated to a different machine on a cluster. During the apply step we apply conventional methodology to each data subset with no cross communication between workers. Each analysis is summarised by a consistent format of output. The s sets of output from the apply stage are then synthesised in the combine step to give the final result. In this design brief, the combine stage only involves the output from the apply step, and not the original dataset.

Suppose interest lies in a collection of M parametric models M1, . . . , MM. The prior probability of model j is given by p(Mj) for j = 1, . . . , M. In a slight abuse of notation we do not write θj to specify the parameter associated with model Mj; this is to stop the notation from becoming cluttered. The notation p(θ|Mj) is implicitly understood to refer to the distinct parameter associated with model Mj. In this work we assume that model selection is to be carried out by calculating the integrated likelihood p(y|Mj) for each model j = 1, . . . , M. The posterior distribution over models can be calculated easily by enumerating over the full set of M integrated likelihoods and prior probabilities:

p(\mathcal{M}_j \mid y) = \frac{p(y \mid \mathcal{M}_j)\, p(\mathcal{M}_j)}{\sum_{k=1}^{M} p(y \mid \mathcal{M}_k)\, p(\mathcal{M}_k)}.    (2.1)

It is therefore sufficient to consider the distributed computation of the integrated likelihood for some arbitrary model M. Given an algorithm for an arbitrary model, we simply apply it to each model in the set of interest and then enumerate to obtain the posterior probabilities using (2.1). Assume the parameter space of model M is indexed by some continuous θ ∈ Ω ⊂ Rd. Let the data y consist of n independent observations. Given a prior distribution p(θ) and a likelihood p(y|θ), the integrated likelihood is defined as

p(y \mid \mathcal{M}) = \int p(y \mid \theta, \mathcal{M})\, p(\theta \mid \mathcal{M})\, d\theta.    (2.2)
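Given integrated likelihoods of the form (2.2) for each candidate model, the enumeration in (2.1) is a short calculation. The fragment below is a hypothetical illustration rather than code from the thesis; it converts per-model log integrated likelihoods and prior model probabilities into posterior model probabilities, working on the log scale for numerical stability.

```python
import numpy as np
from scipy.special import logsumexp

def posterior_model_probs(log_evidence, prior_probs):
    """Enumerate posterior model probabilities as in (2.1).

    log_evidence : array of log p(y | M_j) for j = 1, ..., M
    prior_probs  : array of prior probabilities p(M_j)
    """
    log_numerator = np.asarray(log_evidence) + np.log(prior_probs)
    return np.exp(log_numerator - logsumexp(log_numerator))

# Toy example with three candidate models and a uniform model prior.
log_p_y = np.array([-1520.3, -1518.9, -1530.2])   # illustrative values only
prior = np.ones(3) / 3
print(posterior_model_probs(log_p_y, prior))
```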
Previous work on Bayesian divide and conquer algorithms is largely focused on how to sample from or approximate the target posterior p(θ|y, M) (Huang and Gelman, 2005; Scott et al., 2013; Neiswanger et al., 2013; Srivastava et al., 2015). Calculation of the integrated likelihood (2.2) presents different challenges, and we largely build on the existing literature for computing integrated likelihoods and Bayes factors in the single machine setting.

The first main theoretical finding is that the combine step in the divide and conquer algorithm has an interpretation as a Bayesian hypothesis testing problem. The gold standard Bayesian evidence synthesis rule can be expressed as a Savage-Dickey density ratio (Dickey, 1971). We also find that data augmentation (Tanner and Wong, 1987) and the theory of conditional conjugacy for exponential family models are of use in defining appropriate action in the apply and combine stages. From a computational point of view, we propose two simulation consistent algorithms for calculating the model evidence in parallel. The first algorithm is developed in section 2.4 and is designed for general parametric models. The algorithm presumes that the apply stage involves a generic Metropolis-Hastings sampler run on each subset. The subset output consists of the posterior draws. The combine phase then consists of a density estimation task given the subset posterior samples. The dimension of the density estimation task increases linearly with the number of sub-processors, thus limiting its scalability. The second algorithm is developed in section 2.5, and applies to conditionally conjugate exponential family models. We assume that conjugacy can be achieved through a suitable data augmentation scheme. The algorithm prescribes Gibbs sampling in the apply step. The output is no longer the raw sampled values, but rather the parameters of the conditional posterior at each stage of the Gibbs run. The combine stage then involves aggregation of the Gibbs sampler histories from each subset. Data augmentation proves useful as we can then obtain closed form rules for combining the output from the apply stage. The approach is related to the method of Chib (1995) for computing the integrated likelihood from the Gibbs output. Although limited in scope compared to the first algorithm, the approach involving data augmentation has significantly more favourable computational properties. By combining data augmentation and Gibbs sampling with parallel processing, it is possible to conduct Bayesian model selection using a divide and conquer strategy. We illustrate the theory and methods on a number of real datasets, with a particular focus on logistic regression.

2.2 Divide and conquer Bayesian inference

2.2.1 Posterior sampling

With a specified model, the divide and conquer approach is largely motivated by a factorisation of the full dataset posterior distribution into a combination of subset posterior distributions (Bardenet et al., 2017). We will assume it is possible to partition the data into s subsets y = (y1, . . . , ys) such that the subsets are independent given θ. To describe the strategy we introduce the idea of a subprior distribution and a subposterior distribution. The subprior distribution p̃(θ) is defined as

\tilde{p}(\theta) = \frac{p(\theta)^{1/s}}{\int p(\theta)^{1/s}\, d\theta}.

We assume that the fractionated prior p(θ)^{1/s} is integrable so that the subprior distribution is well defined. The assumption is always satisfied for Gaussian priors; in general the condition needs to be checked on a case-by-case basis.
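As a small worked example (my own illustration, not an example from the text), consider a univariate Gaussian prior. Fractionating a N(0, τ²) prior simply inflates its variance by a factor of s:

p(\theta) = (2\pi\tau^2)^{-1/2} \exp\!\left(-\frac{\theta^2}{2\tau^2}\right)
\quad \Longrightarrow \quad
p(\theta)^{1/s} \propto \exp\!\left(-\frac{\theta^2}{2 s \tau^2}\right),

so the subprior \tilde{p}(\theta) is the N(0, s\tau^2) density, and the subprior normalising constant is

\alpha = \int p(\theta)^{1/s}\, d\theta = (2\pi\tau^2)^{-1/(2s)}\, (2\pi s \tau^2)^{1/2}.

The same calculation holds for any multivariate Gaussian prior, with the covariance matrix scaled by s.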
The subprior distribution contains a fraction of the prior information encoded in the original prior distribution p(θ). The subposterior distributions are defined in terms of the subset likelihoods and the subprior distributions. For i = 1, . . . , s, the i-th subposterior distribution p̃(θ|yi) is defined as

\tilde{p}(\theta \mid y_i) = \frac{p(y_i \mid \theta)\, \tilde{p}(\theta)}{\int p(y_i \mid \theta)\, \tilde{p}(\theta)\, d\theta}.

The tilde is used to acknowledge the use of the subprior distribution p̃(θ) in place of the original prior distribution p(θ). Probability formulae conditional on θ do not require a tilde, so we use the regular notation for the subset likelihood p(yi|θ). We define the subprior normalising constant α as α = ∫ p(θ)^{1/s} dθ and the subposterior evidence as p̃(yi) = ∫ p(yi|θ) p̃(θ) dθ. Table 2.1 lists some key terms that we will make use of when discussing divide and conquer Bayesian inference.

Term                            Symbol      Definition
Subprior                        p̃(θ)        p̃(θ) = p(θ)^{1/s} / α
Subposterior                    p̃(θ|yi)     p̃(θ|yi) = p(yi|θ) p̃(θ) / p̃(yi)
Subprior normalising constant   α           α = ∫ p(θ)^{1/s} dθ
Subposterior evidence           p̃(yi)       p̃(yi) = ∫ p(yi|θ) p̃(θ) dθ

Table 2.1: Terminology for divide and conquer Bayesian inference. The notation and terminology in this table is implicitly conditional on some model M with parameter θ.

A divide and conquer strategy can be motivated by noting that the full dataset posterior is proportional to the product of s subposterior distributions:

p(\theta \mid y_1, \ldots, y_s)
  = \frac{p(\theta)\, p(y_1, \ldots, y_s \mid \theta)}{p(y_1, \ldots, y_s)}
  = \frac{1}{p(y_1, \ldots, y_s)}\, p(\theta) \prod_{i=1}^{s} p(y_i \mid \theta)
  = \frac{1}{p(y_1, \ldots, y_s)} \prod_{i=1}^{s} p(y_i \mid \theta)\, p(\theta)^{1/s}
  = \frac{\alpha^{s}}{p(y_1, \ldots, y_s)} \prod_{i=1}^{s} p(y_i \mid \theta)\, \tilde{p}(\theta)
  = \frac{\alpha^{s} \prod_{i=1}^{s} \tilde{p}(y_i)}{p(y_1, \ldots, y_s)} \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i)    (2.3)
  \propto \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i).    (2.4)

Dropping the constants in (2.3) leads to the relationship in (2.4). Each data subset yi can be distributed to a subprocessor. It is then possible to generate samples from each subposterior p̃(θ|yi) in parallel, typically by running MCMC on each worker. When posterior simulation is the ultimate goal, the combine step must pool the subposterior samples in a manner such that the full dataset posterior p(θ|y1, . . . , ys) is targeted. A variety of combination rules have been suggested for synthesising the subposterior output in order to target p(θ|y1, . . . , ys). Huang and Gelman (2005) and Scott et al. (2013) develop rules based on making a normal approximation to each subposterior. Neiswanger et al. (2013) propose an approach using kernel density estimation. Wang and Dunson (2013) consider the use of the Weierstrass transform. There are also aggregation rules based on geometric or robustness arguments (Srivastava et al., 2015; Minsker et al., 2017). As our goal is calculation of the integrated likelihood as opposed to posterior sampling, we do not review any of these methods in detail.
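The identity (2.4) is easy to verify numerically in conjugate settings. The sketch below is my own toy example, not an analysis from the thesis: for a Gaussian mean model with known noise variance, every subposterior is Gaussian, and the normalised product of the s subposteriors recovers the full dataset posterior exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, s = 10_000, 4
sigma2, tau2 = 1.0, 10.0          # known noise variance, prior variance
y = rng.normal(0.3, np.sqrt(sigma2), size=n)
subsets = np.array_split(y, s)

# Full dataset posterior for theta under the N(0, tau2) prior.
full_prec = 1 / tau2 + n / sigma2
full_mean = (y.sum() / sigma2) / full_prec

# Subposteriors use the subprior N(0, s * tau2); each one is Gaussian.
sub_prec = np.array([1 / (s * tau2) + len(b) / sigma2 for b in subsets])
sub_mean = np.array([(b.sum() / sigma2) / p for b, p in zip(subsets, sub_prec)])

# The normalised product of Gaussian densities is Gaussian: precisions add and
# the mean is the precision-weighted average, matching (2.4).
prod_prec = sub_prec.sum()
prod_mean = (sub_prec * sub_mean).sum() / prod_prec
print(np.allclose([full_prec, full_mean], [prod_prec, prod_mean]))  # True
```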
Divide and conquer Bayesian model selection presents different challenges, as the decomposition of the full dataset model posterior into batch posteriors is more complicated than in (2.4).

2.2.2 Model uncertainty

Motivating a divide and conquer approach for model selection requires a different line of reasoning compared to the fixed model case, as the full dataset integrated likelihood is not equal to the product of the subset integrated likelihoods:

p(y_1, y_2, \ldots, y_s \mid \mathcal{M}_j) = p(y_1 \mid \mathcal{M}_j) \prod_{i=2}^{s} p(y_i \mid y_1, \ldots, y_{i-1}, \mathcal{M}_j) \neq \prod_{i=1}^{s} p(y_i \mid \mathcal{M}_j).

The first line is simply the chain rule of probability. The inequality is present as the subsets are only conditionally independent when given both the model and the parameters. For example, if the underlying model is a normal distribution, the observations y1, . . . , ys are still dependent until the mean and variance parameters are completely specified. Although we have assumed p(yi|y1, . . . , yi−1, θ, Mj) = p(yi|θ, Mj), the integrated likelihood does not have the same property: p(yi|y1, . . . , yi−1, Mj) ≠ p(yi|Mj). As such, we cannot immediately obtain a subposterior factorisation over data batches using the same argument as in (2.4). The target integrated likelihood is related to the subposterior evidence values, the subprior normalising constant and the subposterior distributions. Rearranging (2.3) gives

p(y_1, \ldots, y_s \mid \mathcal{M}_j)\, p(\theta \mid y_1, \ldots, y_s, \mathcal{M}_j) = \alpha^{s} \prod_{i=1}^{s} \tilde{p}(y_i \mid \mathcal{M}_j) \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i, \mathcal{M}_j).

Integrating both sides over θ gives an important expression for the full dataset model evidence:

p(y_1, \ldots, y_s \mid \mathcal{M}_j) = \alpha^{s} \prod_{i=1}^{s} \tilde{p}(y_i \mid \mathcal{M}_j) \int \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i, \mathcal{M}_j)\, d\theta.    (2.5)

The full dataset model evidence is related to the subposterior evidence values, the subprior normalising constant α and an integral over the subposterior distributions. The complex relationship between the full dataset evidence and the subset output means that it is difficult to obtain a simple rule for embarrassingly parallel model selection. Define the subprior model probabilities as

\tilde{p}(\mathcal{M}_j) = \frac{p(\mathcal{M}_j)^{1/s}}{\sum_{k=1}^{M} p(\mathcal{M}_k)^{1/s}}.

We define the subposterior model probabilities p̃(Mj|yi) in terms of the subposterior evidence p̃(yi|Mj) and the subprior model probability p̃(Mj):

\tilde{p}(\mathcal{M}_j \mid y_i) = \frac{\tilde{p}(y_i \mid \mathcal{M}_j)\, \tilde{p}(\mathcal{M}_j)}{\sum_{k=1}^{M} \tilde{p}(y_i \mid \mathcal{M}_k)\, \tilde{p}(\mathcal{M}_k)}.    (2.6)

It is more difficult to represent the full dataset posterior over models as a combination of subposterior distributions. We use αj to denote the subprior normalising constant for model Mj, so αj = ∫ p(θ|Mj)^{1/s} dθ. The full dataset posterior is related to the subposterior model probabilities, the subprior normalising constants and the subposterior distributions on the model parameters. We start by writing out the decomposition in full:

p(\mathcal{M}_j \mid y_1, \ldots, y_s) = \frac{p(\mathcal{M}_j)\, p(y_1, \ldots, y_s \mid \mathcal{M}_j)}{\sum_{k=1}^{M} p(\mathcal{M}_k)\, p(y_1, \ldots, y_s \mid \mathcal{M}_k)} \propto p(\mathcal{M}_j)\, p(y_1, \ldots, y_s \mid \mathcal{M}_j).    (2.7)

Now substituting in (2.5) and using the fact that p̃(Mj) ∝ p(Mj)^{1/s},

p(\mathcal{M}_j \mid y_1, \ldots, y_s)
  \propto p(\mathcal{M}_j) \left( \prod_{i=1}^{s} \tilde{p}(y_i \mid \mathcal{M}_j) \right) \alpha_j^{s} \int \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i, \mathcal{M}_j)\, d\theta
  = \left( \prod_{i=1}^{s} \tilde{p}(y_i \mid \mathcal{M}_j)\, p(\mathcal{M}_j)^{1/s} \right) \alpha_j^{s} \int \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i, \mathcal{M}_j)\, d\theta
  \propto \left( \prod_{i=1}^{s} \tilde{p}(y_i \mid \mathcal{M}_j)\, \tilde{p}(\mathcal{M}_j) \right) \alpha_j^{s} \int \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i, \mathcal{M}_j)\, d\theta.    (2.8)

The following quantity is a constant as it involves sums over the collection of models:

\prod_{i=1}^{s} \frac{1}{\sum_{k=1}^{M} \tilde{p}(y_i \mid \mathcal{M}_k)\, \tilde{p}(\mathcal{M}_k)}.    (2.9)

Figure 2.2: Sleep dataset from the Behavioural Risk Factor Surveillance survey (n = 481,939). Panel (a) shows the number of individuals reporting a particular number of hours of sleep. Panel (b) shows the proportion of individuals reporting 'Poor' health against hours of sleep. Self-reported 'Poor' health status appears to be statistically associated with sleep habits.

Multiplying (2.8) by the constant in (2.9) yields

p(\mathcal{M}_j \mid y_1, \ldots, y_s)
  \propto \left( \prod_{i=1}^{s} \frac{\tilde{p}(y_i \mid \mathcal{M}_j)\, \tilde{p}(\mathcal{M}_j)}{\sum_{k=1}^{M} \tilde{p}(y_i \mid \mathcal{M}_k)\, \tilde{p}(\mathcal{M}_k)} \right) \alpha_j^{s} \int \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i, \mathcal{M}_j)\, d\theta
  = \left( \prod_{i=1}^{s} \tilde{p}(\mathcal{M}_j \mid y_i) \right) \alpha_j^{s} \int \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i, \mathcal{M}_j)\, d\theta.    (2.10)
The full dataset posterior can be expressed as

p(\mathcal{M}_j \mid y_1, \ldots, y_s) \propto \left( \prod_{i=1}^{s} \tilde{p}(\mathcal{M}_j \mid y_i) \right) \alpha_j^{s} \int \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i, \mathcal{M}_j)\, d\theta.    (2.11)

The immediate message from equation (2.11) is that the subposterior model probabilities p̃(Mj|y1), . . . , p̃(Mj|ys) are not sufficient to evaluate the global suitability of a model. The subposterior distributions of the parameters p̃(θ|y1, Mj), . . . , p̃(θ|ys, Mj) contain important additional information for discriminating between models.

To illustrate, we first consider a simple model choice problem. The dataset is from the 2013 Behavioural Risk Factor Surveillance System (BRFSS) survey run by the Centers for Disease Control and Prevention (Centers for Disease Control and Prevention, 2013). We examine the n = 481,939 responses for two general health questions. Respondents were asked how many hours per night they typically sleep. Respondents were also asked to rate their general health on a five point scale: Poor, Fair, Good, Very Good, Excellent. Figure 2.2 (a) plots the number of responses against hours of sleep. The majority of those surveyed report between 6 and 8 hours of sleep, with the overall counts decreasing as the hours of sleep move away from this range. The y-axis is on a log scale. Figure 2.2 (b) plots the proportion of respondents reporting 'Poor' health against hours of sleep. There appears to be a statistical association between the two variables. Atypical sleep habits, in terms of very low or very high hours of sleep, appear to be an indicator for self-reported 'Poor' health.

As a simple model choice exercise we fit two different logistic regression models to the data in (b). Let xi denote the hours of sleep for individual i, and yi be a binary indicator for health status, where yi = 1 if the reported health is 'Poor' and yi = 0 otherwise. The response is modelled as yi ∼ Bernoulli(σ(ηi)) for a linear predictor ηi, where σ(η) = 1/(1 + exp(−η)) is the inverse logit function. We compare a cubic model (M1) with four parameters to a more complex cubic spline model (M2) with 10 parameters. Under the simpler model M1, the linear predictor is given by

\mathcal{M}_1 : \eta_i = \beta_0 + x_i\beta_1 + x_i^2\beta_2 + x_i^3\beta_3.

For the cubic spline model M2 with K knots we set

\mathcal{M}_2 : \eta_i = \beta_0 + x_i\beta_1 + x_i^2\beta_2 + x_i^3\beta_3 + \sum_{k=1}^{K} \beta_{3+k}\, (x_i - \xi_k)_{+}^{3}.

We chose K = 6 knots manually. The knots ξ1, . . . , ξ6 were set at 3, 5, 7, 9, 11 and 13. As a brief aside, it is worth acknowledging that the dataset is from a complex health survey and that the Bayesian models were fit under the assumption that the data collection mechanism is ignorable (Gelman et al., 2014).

We conducted a divide and conquer analysis by splitting the data into s = 2 subsets, based on the xi values. Observations with xi ≤ 7 were placed in subset 1, and observations with xi > 7 were allocated to subset 2. This partition was a deliberate choice to highlight the influence of the split step on the combine step. Figure 2.3 shows the fitted models on the full dataset and the subsets. In each panel we plot the posterior predictive mean function from the respective analysis as a red line. Using the full dataset, the posterior predictive mean for a new response given covariates xnew is obtained by integrating over the posterior distribution of the coefficients:

E[y_{\mathrm{new}} \mid \mathcal{M}] = \int p(y_{\mathrm{new}} = 1 \mid x_{\mathrm{new}}, \beta, \mathcal{M})\, p(\beta \mid X, y, \mathcal{M})\, d\beta.    (2.12)
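In practice (2.12) is evaluated by Monte Carlo over posterior draws of the coefficients. The fragment below is a minimal illustration with fabricated draws (my own example, not the fitted BRFSS models): given B sampled coefficient vectors, the posterior predictive mean at a new covariate vector is the average of the fitted probabilities.

```python
import numpy as np

def predictive_mean(beta_draws, x_new):
    """Monte Carlo estimate of E[y_new | x_new] as in (2.12).

    beta_draws : (B, d) array of posterior draws of the coefficients
    x_new      : (d,) design vector for the new observation
    """
    eta = beta_draws @ x_new                     # linear predictor for each draw
    return np.mean(1.0 / (1.0 + np.exp(-eta)))   # average of sigma(eta)

# Illustrative use with fabricated posterior draws for a cubic model.
rng = np.random.default_rng(0)
beta_draws = rng.normal(scale=0.01, size=(2000, 4))
x_new = np.array([1.0, 7.0, 7.0**2, 7.0**3])     # intercept, x, x^2, x^3 at 7 hours
print(predictive_mean(beta_draws, x_new))
```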
Let X^(i) denote the design matrix for data subset i, for i = 1, 2. For the subset results, the posterior predictive mean for a new response given covariates xnew is obtained by integrating over the subposterior distribution of the coefficients:

E[y_{\mathrm{new}} \mid \mathcal{M}] = \int p(y_{\mathrm{new}} = 1 \mid x_{\mathrm{new}}, \beta, \mathcal{M})\, \tilde{p}(\beta \mid X^{(i)}, y_i, \mathcal{M})\, d\beta.    (2.13)

The shaded ribbons in each plot give 90 percent credible intervals for the posterior predictive mean at a given point. As expected, there is more uncertainty in the regions where we have fewer data points. From panel (a) in Figure 2.2, the overwhelming majority of respondents report between 4 and 12 hours of sleep, and the shaded ribbon is tightly concentrated around the posterior predictive mean in this range. Looking at the full dataset results in Figure 2.3 it is apparent that the spline model is the more appropriate model. The subset results are less definitive. Visually, the fits of the spline and cubic models look comparable in each subset. In particular, the posterior mean functions and credible intervals look nearly identical in subset 1.

Table 2.2 reports the full dataset posterior model probabilities and the subposterior model probabilities. The full dataset posterior puts mass one on the cubic spline model; this is not surprising given that the cubic model fits the observed data very poorly in the tails of the design space (hours of sleep). The subset results do not reflect this. In subset 1, the cubic model provides a seemingly identical fit to the spline model using the available data. As the cubic model is more parsimonious, it is favoured in the subposterior distribution on models, with p̃(M1|y1, X^(1)) = 0.84. In subset 2 the extra parameters of the cubic spline model give a better fit to the available data. As such, the spline model is strongly favoured in the second minibatch analysis. The subposterior probability for the cubic model, p̃(M1|y2, X^(2)), is zero in the second subset. It is difficult to intuitively reconcile the subset results with the full dataset results when only given subposterior and posterior model probabilities. We freely admit that this example has been artificially engineered to be melodramatic. We wished to demonstrate that the choice of partition in the split step has a large downstream effect on the difficulty in the combine step. Additionally, we would like to highlight the role of subposterior overlap as a goodness of fit diagnostic in the combine step.

                   Posterior distribution   Subposterior 1   Subposterior 2
Cubic model M1     0.00                     0.84             0.00
Spline model M2    1.00                     0.16             1.00

Table 2.2: Posterior and subposterior model probabilities for the sleep dataset. The subposterior results are in conflict despite very clear results in a full dataset analysis.

As dictated by (2.11), to properly synthesise the subset results it is necessary to look at the subposterior distributions of the parameters. The most direct interpretation of the integral term is an expectation over an arbitrary subposterior:

\int \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i, \mathcal{M}_j)\, d\theta = E_{\tilde{p}(\theta \mid y_i, \mathcal{M}_j)}\left[ \prod_{k \neq i} \tilde{p}(\theta \mid y_k, \mathcal{M}_j) \right].    (2.14)

Figure 2.3: Comparison of cubic model and spline model on the full sleep dataset and on the s = 2 subsets.
The red line gives the posterior predictive mean function. The shaded ribbons give a 90 percent credible interval for the posterior predictive mean. The full dataset contains n = 481,939 observations. Subset 1 contains n1 = 300,177 observations, and subset 2 contains n2 = 181,762 observations. The full dataset results show that the spline model provides a much better fit to the observed data. The cubic model does not fit well in the extreme regions of feature space (hours of sleep). The subset results in isolation do not make this as clear. Model 1 and Model 2 give nearly identical fits in subset 1. The results seem comparable in subset 2. The results in the apply stage may not be representative of a single full dataset analysis. The initial partition of the dataset in the split stage appears to influence the difficulty of the evidence synthesis task in the combine stage.

Qualitatively speaking, the expectation will be large if the subposterior distributions are similar across subsets, and the expectation will be small if the subposterior distributions are dissimilar. A high-level interpretation is that the integral checks for consistency of parameter estimates across different subsets. Figure 2.4 compares the marginal subposterior distributions on each parameter in the cubic model. The non-overlapping subposteriors are indicative of poor goodness of fit, which is clear in the full dataset results in Figure 2.3. The example shows that the subposterior integral (2.14) has an important role in divide and conquer model selection. The subset model probabilities in isolation do not give enough information to reconstruct the target posterior distribution over models. Distributed model selection requires a pooling rule in the combine stage that takes this into account.

[Figure 2.4: four panels, one per coefficient β0, β1, β2, β3, plotting subposterior density against parameter value.]

Figure 2.4: Subposterior distributions of parameters in the cubic model. The solid line denotes the first subposterior p̃(β | y1). The red dashed line denotes the second subposterior p̃(β | y2). The disparate subposterior distributions are a reflection of the poor overall fit of the model. Subposterior overlap is an important diagnostic in the combine stage for model selection.

It is quite straightforward to make an approximation to the subposterior integral under the assumption of Gaussian subposteriors. Normal approximations to the subposterior distributions are used in the fixed model divide and conquer work by Huang and Gelman (2005) and Scott et al. (2013). However, it is of interest to identify general strategies that do not rely on subposterior normality, as quantifying the error from making normal approximations can be difficult (Bierkens et al., 2016). Furthermore, a general solution can lead to a deeper understanding of the problem. The combine step in divide and conquer model selection involves estimation of the subposterior integral given the output from the apply stage. For later reference we denote the subposterior integral as Isub, where

Isub = ∫ ∏_{i=1}^{s} p̃(θ | yi, M) dθ.    (2.15)

In section 2.4 we establish an important link between the subposterior integral and the Savage-Dickey density ratio. The connection explains how the subposterior integral measures subposterior overlap from a Bayesian perspective, and lays out how the combine step implicitly synthesises the support for a model given the output from the apply stage.
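As a brief illustration of the Gaussian shortcut mentioned above: if each subposterior is approximated by a normal density N(μi, Σi), the product of the s densities is an unnormalised Gaussian, so Isub has a closed form. A minimal sketch, assuming the subposterior means and covariances have already been estimated from the subset MCMC output (the function name and interface are ours):

import numpy as np
from scipy.stats import multivariate_normal

def log_I_sub_gaussian(means, covs):
    # means: list of s subposterior mean vectors (NumPy arrays of length d)
    # covs:  list of s subposterior covariance matrices (d x d)
    precisions = [np.linalg.inv(C) for C in covs]
    prec_star = sum(precisions)
    cov_star = np.linalg.inv(prec_star)
    mean_star = cov_star @ sum(P @ m for P, m in zip(precisions, means))
    # The product of the s Gaussian densities is an unnormalised Gaussian with
    # mean mean_star and covariance cov_star; its normalising constant, which
    # equals I_sub, is recovered by evaluating both sides at theta = mean_star.
    log_prod = sum(multivariate_normal.logpdf(mean_star, m, C)
                   for m, C in zip(means, covs))
    return log_prod - multivariate_normal.logpdf(mean_star, mean_star, cov_star)

The approaches developed below aim to estimate Isub without relying on this normality assumption.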
The theoretical connection also leads to simple Monte Carlo procedure for estimating the integral involving kernel density estimation. Although simulation consistent, the estimator scales poorly with the number of subsets s as it involves kernel density estimation in high- dimensional spaces. To remedy this problem we propose an alternative strategy using data augmentation and Gibbs sampling in section 2.5. Although limited to certain models, the second algorithm gives a significantly better estimator of the subposterior integral (2.15) in the combine step. Section 2.3 reviews some important background theory. 14 2.3. Background Dimension Sample size 1 4 2 19 3 67 4 223 5 678 6 2,790 7 10,700 8 43,700 9 187,000 10 842,000 Table 2.3: Required sample size to attain a relative mean square error of less than 0.1 using nonparametric kernel density estimation against dimension. The generative model is a d-dimensional multivariate normal distribution with an identity covariance matrix. The density is estimated at zero using a Gaussian kernel. The bandwidth is chosen to minimise the asymptotic mean square error at zero. The required sample size increases rapidly with the dimension d. Reproduced from Silverman (1986, Table 4.2) 2.3 Background 2.3.1 Integrated likelihood calculation Estimation of the integrated likelihood has been a long standing problem in Bayesian computation. Rearranging Bayes’ theorem, an important general relationship is that for any ordinate θ ∈ Ω, log p(y|M) = log p(y|θ,M) + log p(θ|M)− log p(θ|y,M). (2.16) The identity (2.16) is often referred to as ‘Candidate’s formula’ (Besag, 1989). If the posterior density can be obtained at the point θ then we have a simple way to obtain the integrated likelihood. Given samples from the posterior distribution, it is possible to use a non-parametric kernel density estimator to estimate the density at θ (Raftery, 1996; Lewis and Raftery, 1997). A problem with this approach is that kernel density estimation is subject to the curse of dimensionality, with an asymptotically optimal pointwise mean squared error rate of O(B−4/(d+4)), where B is the number of independent posterior samples (Terrell and Scott, 1992). The bound is under the assumption that we use a second order kernel. Most common kernels (eg. Gaussian, Epanechnikov, Uniform) are second order kernels (Terrell and Scott, 1992). Use of kernel density estimation is therefore limited to models with a very small number of parameters. The impact of the curse of dimensionality is vividly presented by Silverman (1986, section 4.5.2). Silverman considers nonparametric kernel density estimation where the true density f is a d-dimensional multivariate Gaussian with identity covariance matrix. The task is to estimate the true density at zero f(0) using a Gaussian kernel. Silverman sets the single bandwidth parameter h to minimise the asymptotic pointwise mean squared error. Suppose the goal is to ensure the relative mean square error E[(f̂(0)−f(0))2]/[f(0)2] is less than 0.1. Silverman calculates the required sample size to attain this error tolerance at various dimensions d. Table 2.3 reports the required sample size against dimensionality. The required sample size increases rapidly with d. Nonparametric kernel density estimation is very hard in high-dimensional spaces. 2.3.2 Savage-Dickey density ratio The Savage-Dickey density ratio is a useful identity for computing Bayes factors that avoids the need to compute integrated likelihoods (Dickey, 1971) directly. 
For our purposes, we are concerned with the use of the Savage-Dickey density ratio to test linear restrictions on parameters. The Savage-Dickey density ratio for testing linear restrictions is discussed in Wetzels et al. (2010) and McCulloch and Rossi (1992). We present the approach as in McCulloch and Rossi (1992). The Savage-Dickey identity expresses Bayes factors as a density ratio involving posterior and prior distributions.

Suppose we have a model MS with parameter θS ∈ ΩS. Let L(·) = f(y | ·) denote the likelihood function, where the data y is treated as fixed. Let ψ represent a linear function of θS, so ψ = A θS for some matrix of known coefficients A. Let Ω0 ⊂ ΩS denote the subspace of ΩS where ψ = 0. Let θ0 denote an arbitrary element of Ω0. We wish to test the linear restriction H0 : ψ = 0. The Bayesian hypothesis test requires a prior distribution on ΩS and a prior distribution on Ω0. Let P denote the prior probability measure on ΩS and let P0 denote the prior probability measure on Ω0. Let M0 denote the restricted model where ψ = 0. The evidence under each model can be expressed as

p(y | M0) = ∫_{Ω0} L(θ0) dP0(θ0),    p(y | MS) = ∫_{ΩS} L(θS) dP(θS).

We integrate with respect to the prior measures, as P0 will not have a density with respect to Lebesgue measure. The Bayes factor in favour of the null hypothesis, B01, is given by

B01 = p(y | M0) / p(y | MS)    (2.17)
    = ∫_{Ω0} L(θ0) dP0(θ0) / ∫_{ΩS} L(θS) dP(θS).    (2.18)

The Savage-Dickey density ratio occurs when we set the prior measure P0 to be the conditional distribution of θS given that ψ = 0 under the encompassing model prior P. Let P(· | ψ = 0) be the regular conditional distribution of θS given that ψ = 0 under P. Regular conditional distributions are discussed in Billingsley (1999). Taking P0(·) = P(· | ψ = 0) gives that

p(y | M0) = ∫_{Ω0} L(θ0) dP0(θ0) = ∫_{Ω0} L(θ0) dP(θ0 | ψ = 0).    (2.19)

Integrating with respect to the conditional distribution P(· | ψ = 0) also defines a conditional marginal likelihood under MS,

p(y | ψ = 0, MS) = ∫_{Ω0} L(θ0) dP(θ0 | ψ = 0).    (2.20)

Substituting (2.20) and (2.19) into (2.17) gives another expression for the Bayes factor:

B01 = p(y | ψ = 0, MS) / p(y | MS).    (2.21)

Dickey (1971) demonstrated that the conditional evidence p(y | ψ = 0, MS) can be expressed in terms of the posterior distribution of ψ = A θS under the unconstrained model MS. The conditional evidence satisfies

p(y | ψ = 0, MS) = p(ψ = 0 | y, MS) p(y | MS) / p(ψ = 0 | MS).    (2.22)

This does not follow immediately from Bayes’ theorem, as the event ψ = 0 is a set of measure zero under P. The relationship (2.22) follows using the fact that P(· | ψ = 0) is a regular conditional distribution. Substituting (2.22) into (2.21) gives

B01 = [ p(ψ = 0 | y, MS) p(y | MS) / p(ψ = 0 | MS) ] × 1 / p(y | MS)    (2.23)
    = p(ψ = 0 | y, MS) / p(ψ = 0 | MS).    (2.24)

Figure 2.5: Target Bayesian model. Subsets are conditionally independent given the model M and the parameters θ, but conditionally dependent given only the model due to the shared dependence on θ.

The Bayes factor B01 is given by the ratio of the posterior density of ψ to the prior density of ψ under the unconstrained model MS. Evaluation of the density ratio is often easier than direct evaluation of the integrals in (2.18). A point of interest is that the Savage-Dickey density ratio gives the integrated likelihood for the restricted model M0 in terms related to the unconstrained model MS,

p(y | M0) = p(y | MS) p(ψ = 0 | y, MS) / p(ψ = 0 | MS).
(2.25) In many situations fitting the unconstrained model MS will be considerably easier than fitting the constrained modelM0. The identity (2.25) gives a useful alternative route for computing the integrated likelihood p(y|M0) in situations where it is difficult to fit the constrained modelM0. In many situations we will be able to compute the evidence for the unconstrained modelMS and generate posterior samples from MS . Given posterior and prior samples of ψ we can use nonparametric kernel density estimation to estimation the density ratio in (2.24). This is assuming that the dimension of ψ is low, as the curse of dimensionality will rule out the feasibility of this approach in high-dimensions (Table 2.3). The evidence for the unconstrained model combined with the density ratio then give the evidence for the actual model of interest M0. We will show the relationship in (2.25) has an important connection to the desired split-apply-combine methodology in Figure 2.1. We have a Big Data problem where it is difficult to fit the desired model on a single machine due to computational constraints so we adopt a divide and conquer strategy. In the next section we show how the split and apply steps effectively correspond to fitting an unconstrained model with an expanded parameter space. In the combine step we can compute a Bayes factor that connects the unconstrained model to the original model of interest. The analysis gives a Bayesian formalism to the relationship between the apply and combine steps in the embarrassingly parallel algorithm. 2.4 General Bayesian models 2.4.1 Introduction Figure 2.5 diagrams the target model that we wish to use for Bayesian inference. When analysing the dataset using an embarrassingly parallel algorithm we obtain a collection of s subposterior distributions in the apply stage. We assume that worker returns the subposterior evidence p˜(yi) and samples from the subposterior p˜(θ|yi) as output during apply stage for i = 1, . . . , s. As demonstrated in the sleep dataset example in section 2.2.2 subposterior overlap is an important consideration in the combine step. Recall from (2.5) and (2.15) the following identity for the target model evidence, p(y1, . . . ,ys|M) = ( s∏ i=1 p˜(yi|M) ) αs ∫ s∏ i=1 p˜(θ|yi,M) dθ. (2.26) = ( s∏ i=1 p˜(yi|M) ) αs × Isub. (2.27) 17 2. Computing the model evidence using parallel processing To recover the target model evidence it is necessary evaluate the subposterior integral Isub. In the case of s = 2 subsets, it is possible to see connection between the integral and a pointwise density measurement of subposterior overlap. To do so we rewrite the definition of the subposterior integral more explicitly as Isub = ∫ θ∗∈Ω s∏ i=1 p˜(θ = θ∗|yi,M) dθ∗. (2.28) The extra notation is introduced as later it will be necessary to distinguish between the random variable in the target model θ and an arbitrary ordinate in the parameter space θ∗. Each subposterior analysis returns a measure over the parameter space Ω. In (2.28) the variable of integration is now represented by the point θ∗ in the parameter space. Each subposterior density p˜(θ|y1), . . . , p˜(θ|ys) expresses different beliefs on the unknown parameter θ. Each subposterior thus assigns a different density to θ∗, and Isub is calculated by taking the product over the s subsets and accumulating over the parameter space Ω. With s = 2 subsets, we can see an immediate connection to the convolution of the two density functions p˜(θ|y1) and p˜(θ|y2). 
The convolution gives the density function of a random variable that can be defined as a difference of independent samples from each subposterior. Let S1 and S2, be independent random variables with support Ω. Let the density function of S1 be given by the subposterior density from the first subset p˜(θ|y1) and the density function of S2 be given by the subposterior density from the second subset p˜(θ|y2). Now define the random variable D = S1 − S2. The density of the difference random variable D at a point r is given by fD(r) = ∫ θ∗∈Ω p˜(θ = θ∗|y1,M)p˜(θ = θ∗ − r|y2,M) dθ∗. The density at 0 is equal to the subposterior integral Isub, fD(0) = ∫ θ∗∈Ω p˜(θ = θ∗|y1,M)p˜(θ = θ∗|y2,M) dθ = ∫ θ∗∈Ω 2∏ i=1 p˜(θ = θ∗|yi,M) dθ∗ = Isub. (2.29) If S1 and S2 are distributed similarly, then we expect the difference D to be concentrated around zero. If the subposterior distributions are dissimilar then zero will be in the tails of the distribution of D. To illustrate, we consider another simple model choice problem. The dataset is from an experiment to detect extra sensory perception (ESP), first reported in Jahn et al. (1987). The experimenters constructed a random number generator using a radioactive source. The random number generator was calibrated to output a random sequence of zeroes and ones with equal probability. A subject was asked to psychically alter the sequence of random numbers using extra sensory capabilities. There are n = 104, 490, 000 observed random numbers which can be modelled as n independent Bernoulli(θ) events. This dataset is also considered in Bernardo et al. (2011). Table 2.4 reports the number of ones in the observed sequence (successes). The fraction of ones in the dataset is slightly higher than 0.5. We compare two models,M1 : θ = 0.5, corresponding to no support for ESP andM2 : θ 6= 0.5, which allows for the possibility for ESP. We use a flat Beta(1, 1) prior on the proportion θ for model 2. The ESP dataset is commonly used to illustrate differences between Bayesian and Frequentist hypothesis testing. Suppose we assign equal prior weight to each model. The posterior probability of M1 : θ = 0.5 is then 0.92. The p-value for testing θ = 0.5 is 0.0003017. The Bayesian approach favours the simpler model M1 : θ = 0.5, when the frequentist analysis rejects the null hypothesis (θ = 0.5) at the usual 5 percent level of significance. Here default Bayesian and frequentist procedures lead to opposite conclusions regarding the existence of extra sensory perception. We use the ESP dataset to illustrate divide and conquer inference as it is a simple large n dataset where the posterior model probabilities are not highly concentrated around zero or one. 18 2.4.2. Subset saturated model Full dataset Subset 1 Subset 2 Trials 104,490,000 52,245,000 52,245,000 Successes 52,263,471 26,137,735 26,125,736 Success proportion 0.5001768 0.5002916 0.5000619 Table 2.4: Full ESP dataset and subsets for a divide and conquer approach with s = 2. The success proportion is close to 0.5 in the full dataset and each subset. We split the full dataset into two subsets, as described in Table 2.4. Let y1 denote the data for subset 1 and y2 denote the data in subset 2. We focus on the divide and conquer analysis of model 2, where we estimate the success probability. Let ni and ri denote the number of trials and successes respectively for subset i = 1, 2. 
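As an aside, the full dataset comparison quoted above (posterior probability 0.92 for M1, p-value 0.0003) can be reproduced directly from the counts in Table 2.4, since both marginal likelihoods are available in closed form; a minimal sketch (the variable names are ours):

import numpy as np
from scipy.special import betaln
from scipy.stats import norm

n, r = 104_490_000, 52_263_471        # trials and successes, Table 2.4

# Log marginal likelihoods; the binomial coefficient is common to both models and cancels.
log_m1 = n * np.log(0.5)              # M1: theta fixed at 0.5
log_m2 = betaln(r + 1, n - r + 1)     # M2: theta ~ Beta(1, 1)

post_m1 = 1.0 / (1.0 + np.exp(log_m2 - log_m1))   # posterior probability of M1 under equal prior weights, roughly 0.92

z = (r - 0.5 * n) / np.sqrt(0.25 * n) # normal approximation to the binomial test of theta = 0.5
p_value = 2.0 * norm.sf(abs(z))       # roughly 3e-4, matching the value quoted above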
The subpriors for θ are both uniform Beta(1, 1), since a fractionated uniform density is proportional to a uniform density: [Beta(1, 1)(θ)]^{1/2} = 1_{θ ∈ [0,1]} = Beta(1, 1)(θ). Given Beta(1, 1) subpriors, each subposterior is a Beta(ri + 1, ni − ri + 1) density for i = 1, 2. Figure 2.6 plots the two subposteriors. As each subposterior is a Beta density it is possible to work out the subposterior integral analytically. Let ai = ri + 1, bi = ni − ri + 1, a* = (∑_{i=1}^{s} ai) − (s − 1) and b* = (∑_{i=1}^{s} bi) − (s − 1). The subposterior integral satisfies

log Isub = log B(a*, b*) − ∑_{i=1}^{s} log B(ai, bi),    (2.30)

where B(a, b) is the Beta function. Using the data partition in Table 2.4, we calculate Isub = 259.64. Panel (b) shows the density of D = S1 − S2, where S1 and S2 are distributed according to p̃(θ | y1) and p̃(θ | y2) respectively. The value of the subposterior integral Isub = 259.64 is plotted as a dashed horizontal line and the vertical dotted line has an x intercept of zero. Looking at the intersection we see that the theoretical density of D at zero is equal to Isub, giving an empirical verification of (2.29) for this example.

The convolution by inspection argument leading to (2.29) was non-constructive, and it is not clear how to generalise the argument to an arbitrary number of subsets s. Subposterior overlap clearly has a role in the combine step, although it is not apparent if this is an inherently Bayesian approach to the evidence task in the final stage. Another open question is the significance of the subprior normalising constant α in the representation (2.5). We take a more general approach in the next subsection that addresses these issues.

2.4.2 Subset saturated model

In the divide and conquer analysis, each data subset yi is modelled separately using the subprior p̃(θ). This Big Data modelling strategy is adopted for purely computational reasons. This strategy can also be motivated through an inferential argument. Laying out the inferential argument leads to a Savage-Dickey ratio representation in the combine step. Consider an alternative hierarchical model, diagrammed in Figure 2.7, where each data subset is now allocated an independent set of parameters. As before, suppose that the arbitrary model M is indexed by parameter θ, where θ ∈ Ω. For each model M we introduce an expanded version MS that has a separate set of parameters for each subset. In the alternative hierarchical model we have an expanded parameter space. Let ΩS = ∏_{i=1}^{s} Ω. Let θS ∈ ΩS denote the parameter for the expanded model. Then θS = (θ(1), . . . , θ(s)) ∈ ΩS. Suppose the original model M is a regression model with a particular set of explanatory variables. The expanded model MS will use the same set of covariates in each subset, but allows for different coefficients

[Figure 2.6: panel (a) shows the two subposterior densities p̃(θ | y1) and p̃(θ | y2); panel (b) shows the density of D = S1 − S2, with a dashed horizontal line at the value of Isub.]

Figure 2.6: Illustration of the subposterior density identity using the ESP dataset (s = 2). The full dataset and data subsets are summarised in Table 2.4. (a) Subposterior densities. (b) Density of the difference variable D. In (b) the horizontal dashed line gives the value of Isub. The intersection of the dashed lines in (b) illustrates the density identity (2.29).

Figure 2.7: Alternative hierarchical Bayesian model (subset saturated model).
Each subset yi is allocated an independent set of parameters θ(i). Subsets are conditionally independent gives the model indicator. The split and apply steps in the divide and conquer procedure are equivalent to defining and fitting the subset saturated model. in each subset. It can be argued that split and apply steps in the embarrassingly parallel algorithm are effectively the same as fitting the alternative hierarchical model with subset specific parameters. The alternative hierarchical model is conceptually similar to the saturated model that is used when calculating the deviance in a generalised linear model. The saturated model in the context of generalised linear models allows for a unique parameter for each observation and provides the best possible goodness of fit. In the divide and conquer world, the expanded model MS allows for a unique parameter for each subset, and allows extra adaptation compared to the desired model M. As such we will refer to MS as the subset saturated model, and the hierarchical model in Figure 2.7 as the hierarchical subset saturated model. 2.4.3 Split and apply steps We make some assumptions about the subset saturated model to link it concretely to the split and apply steps that take place in the embarrassingly parallel algorithm (see Figure 2.1). Assumption 1: Subset likelihoods are the same as in the target model. That is for all θ∗ ∈ Ω, p(yi|θ(i) = θ∗,MS) = p(yi|θ = θ∗,M) for i = 1, . . . , s. Assumption 2: The subset specific parameters have independent priors, so the joint prior in the subset 20 2.4.3. Split and apply steps saturated model can be written as p(θ(i), . . . ,θ(s)|MS) = s∏ i=1 p(θ(i)|MS). Assumption 3: Each subset prior is given by the subprior used in a divide and conquer analysis (see Table 2.1). That is for all θ∗ ∈ Ω, p(θ(i) = θ∗|MS) is equal to p˜(θ = θ∗|M) for i = 1, . . . , s. The subset saturated model is a bona fide Bayesian model for any choice of prior distribution on the subset specific parameters θ(1), . . . ,θ(s). Assumptions 2 and 3 are needed to make a connection to the divide and conquer procedure. Assumption 1 is needed to be completely specify the model. Let L (i) S (θ) give the likelihood function for subset i under the subset saturated model (Figure 2.7). Let L(i)(θ) give the likelihood function for subset i under the target model M (Figure 2.5). From Assumption 1 L (i) S (θ) = L (i)(θ) for all θ ∈ Ω. Under assumptions 1,2 and 3, the posterior distribution of the subset specific parameters θ(1), . . . ,θ(s) in the subset saturated model is closely related to the subposterior output that is is generated in the apply phase. The underlying space of interest is Ω. The apply stage in the embarrassingly parallel algorithm returns s belief distributions over Ω. The subset saturated model also defines s belief distributions on Ω through the s marginal distributions on the subset specific parameter θ(i) for i = 1, . . . , s. We can show a measure theoretic equivalence between the distribution sets of the embarrassingly parallel algorithm and the marginal distributions of the subset saturated model under assumptions 1,2 and 3. Let F denote the Borel σ-algebra on Ω. Each worker in the apply step is given an initial belief distribution on Ω given by the subprior density p˜(θ|M). Let P˜ (i) denote the subprior distribution for worker i for i = 1, . . . , s. In the subset saturated model we have a collection of s prior distributions on Ω, one for each subset specific parameter. 
By assumption 3 each subset specific prior is set to the subprior distribution. Specifically, we have that for all sets F ∈ F : P˜ (i)(F ) = P (i)(F ) i = 1, . . . , s. (2.31) In the embarrassingly parallel algorithm each worker learns updated beliefs on Ω during the apply step. Let P˜ (i) pi denote the ith subposterior distribution on Ω. The subset saturated model has parameter space ΩS = ∏s i=1 Ω. Let P denote the prior distribution on ΩS under the subset saturated model and let Ppi denote the posterior distribution on Ωs under the subset saturated model. Let P (i) pi be the marginal posterior distribution on θ(i) under the subset saturated model for i = 1, . . . , s. Using the subset saturated model we also obtain a collection of beliefs on Ω through the s marginal distributions P (i) pi for i = 1, . . . , s. From Assumption 1 and 3 we have that P˜ (i)pi (F ) = P (i) pi (F ) i = 1, . . . , s. (2.32) Let µ denote Lebesgue measure on Rd where d is the dimension of Ω. Under Assumptions 1 and 3 both measures have the same Radon-Nikodym derivative with respect to Lebesgue measure, P˜ (i) pi dµ = L (i) S p˜(θ) = L (i)(θ)p˜(θ) P (i) pi dµ = L(i)(θ)p˜(θ). This is sufficient to establish (2.32). The split and apply steps in the embarrassingly parallel algorithm (see Figure 2.1) are effectively the same as building and computing the posterior distribution for the subset saturated model (Figure 2.7). The subset saturated model is also useful for describing the role of the the subposterior integral (2.15) in the combine step. 21 2. Computing the model evidence using parallel processing 2.4.4 Combine step The ideal Bayesian procedure in the combine step can be identified by considering a linearly restricted version of the subset saturated model (Figure 2.7). The key observation is that the target model M is effectively nested within the subset saturated model MS under the linear restriction that θ(1) = θ(2) = . . . = θ(s). Let ψ denote a d(s − 1) dimensional vector that captures differences in parameter values across subsets, specifically ψ =  θ(1) − θ(2) θ(1) − θ(3) ... θ(1) − θ(s)  . (2.33) Under the linear restriction ψ = 0 the structure of the subset saturated model diagrammed in Figure 2.7 collapses to that of the target hierarchical model diagrammed in Figure 2.5. We will argue that the model evidence for the linearly restricted version of the subset saturated model is equal to the model evidence for the target model under appropriate assumptions. There are some measure theoretic considerations when working with the linearly restricted version of the subset saturated model. Again let P denote the prior measure on ΩS = ∏s i=1 Ω given that assumptions 1,2 and 3 hold. Let Ω0 be the linear subspace of ΩS defined by the linear restriction ψ = 0. Let P (·|ψ = 0) denote the regular conditional distribution of θS given that ψ = 0 under P . Let M0 denote the linearly restricted model where θS ∈ Ω0. Let θ0 denote an element of Ω0. The linearly restricted model M0 is not completely equivalent to the target model M as the parameter spaces are different. Model M0 puts a prior distribution P0 on Ω0 ⊂ ∏s i=1 Ω whereas the target model M puts a prior distribution on Ω. Suppose we set P0 ≡ P (·|ψ = 0). Let P (1)0 be the marginal distribution on θ(1) under P0. Let PM be the prior distribution on Ω under the target model. The measure PM has density p(θ|M) with respect to Lebesgue measure on Rd. Assumptions 1,2 and 3 have been selected so that P (1)0 and PM give the same distribution on Ω. 
More formally, we can show that P (1) 0 (F ) = PM(F ), (2.34) for all sets F ∈ F where F is the Borel σ-algebra on Ω. Let L0(θ0) denote the likelihood function for the linearly restricted model M0 and let L(θ) denote the likelihood function for the target model M. Suppose that we have some θ0 = (θ (1), . . . ,θ(s)) ∈ Ω0. On Ω0 all subset specific parameters are equal. If θ0 ∈ Ω0 then L0(θ0) = L(θ(1)). This follows from Assumption 1, the subset specific likelihoods are the same as the target model likelihood for each subset.The model evidence for M0 is directly expressed as an integral over Ω0. We can also express the model evidence forM0 as an integral over Ω by considering the marginal distribution of θ(1) under P0. p(y1, . . . ,ys|M0) = ∫ θ0∈Ω0 L0(θ0) dP0(θ0) = ∫ θ∈Ω L(θ) dP (1) 0 (θ). (2.35) The evidence for the linearly restricted model M0 can also be obtained by integrating the target model likelihood function L(θ) with respect to P (1) 0 (θ). Under assumptions 1,2 and 3 we have the equality (2.34). We can therefore substitute the target model prior distribution PM(F ) into (2.35). The target model prior distribution PM(·) has density p(θ|M) with respect to Lebesgue measure. The evidence for 22 2.4.4. Combine step the linearly restricted model is then shown to be equal to the evidence for the target model: p(y1, . . . ,ys|M0) = ∫ θ∈Ω L(θ) dPM(θ) (2.36) = ∫ θ∈Ω p(y1, . . . ,yS |M)p(θ|M) dθ (2.37) = p(y1, . . . ,ys|M). (2.38) Using the Savage-Dickey density ratio (2.22), we can express the model evidence for the linearly restricted model M0 in terms of the subset saturated model MS : p(y1, . . . ,ys|M0) = p(y1, . . . ,ys|ψ = 0,MS) = p(y1, . . . ,ys|MS)p(ψ = 0|y1, . . . ,ys,MS) p(ψ = 0|MS) . (2.39) Now using the fact that p(y1, . . . ,ys|M0) = p(y1, . . . ,ys|M), we have the following identity relating the target model evidence to the subset saturated model: p(y1, . . . ,ys|M) = p(y1, . . . ,ys|MS)p(ψ = 0|y1, . . . ,ys,MS) p(ψ = 0|MS) . (2.40) We will show that each of the terms on the right hand side of (2.40) can be estimated using a split-apply- combine algorithm. We assume that each worker can compute the subposterior evidence score p˜(yi|M) for i = 1, . . . , s. The local evidence on each batch of data can be computed using existing techniques. Existing methods for evidence estimation are discussed in Chapter 3. We now show how the density ratio can be estimated in the combine stage using subposterior samples generated in the apply stage. The posterior distribution p(ψ|y1, . . . ,ys,MS) is effectively given by the subposterior distributions in a divide and conquer analysis. The prior distribution p(ψ|MS) is effectively given by the subprior distribution p˜(θ) (see Table 2.1). Define the random variables Si ∼ p˜(θ|yi,M) and Vi ∼ p˜(θ,M) for i = 1, . . . , s. The random variable Si is distributed according to the ith subposterior, and the random variable Vi is distributed according to the ith subprior. Now define the random vectors D and D0 as D =  S1 − S2 . . . S1 − Ss  , D0 =  V1 − V2 . . . V1 − Vs  . (2.41) Then D ∼ p(ψ|y1, . . . ,ys,MS) and D0 ∼ p(ψ|MS). We can sample from the subprior and subposterior distributions in the apply step. Using the definitions in (2.41) we can sample from p(ψ|y1, . . . ,ys,MS) and p(ψ|MS) in the combine step by pooling the subposterior and subprior samples. The remaining term on the right hand side of (2.39) is the evidence for the subset saturated model p(y1, . . . ,ys|MS). 
This can easily be computed given the subposterior evidence values p̃(y1 | M), . . . , p̃(ys | M). The model evidence for the subset saturated model is the product of the subposterior evidence scores. Specifically,

p(y1, . . . , ys | MS) = ∫_{ΩS} L(θS) dP(θS)
 = ∫_{ΩS} ( ∏_{i=1}^{s} L_S^{(i)}(θ(i)) ) ∏_{i=1}^{s} dP(i)(θ(i))
 = ∏_{i=1}^{s} ∫_{Ω} L_S^{(i)}(θ) dP(i)(θ)
 = ∏_{i=1}^{s} ∫_{Ω} L^{(i)}(θ) dP̃(θ)
 = ∏_{i=1}^{s} p̃(yi | M).

Substituting the previous result into (2.39), we obtain the following identity for the full dataset integrated likelihood:

p(y1, . . . , ys | M) = ( ∏_{i=1}^{s} p̃(yi | M) ) p(ψ = 0 | y1, . . . , ys, MS) / p(ψ = 0 | MS).    (2.42)

The density ratio in (2.42) can be interpreted as the posterior to prior odds of observing ψ = 0 in the subset saturated model. If the model is appropriate for the entire dataset, we expect to see similar subposterior distributions on θ over subsets, and the posterior density of ψ should be concentrated around zero. What exactly constitutes a ‘high’ level of agreement is determined by comparing to the prior density of ψ at zero. The ratio of the two densities has an interpretation as a Bayes factor in the subset saturated model. Recall the identity in the introduction,

p(y1, . . . , ys | M) = ( ∏_{i=1}^{s} p̃(yi | M) ) α^s ∫ ∏_{i=1}^{s} p̃(θ | yi, M) dθ.    (2.43)

Comparing (2.42) to (2.43) we see a correspondence between the density ratio and the subposterior integral and subprior normalising constant. Specifically, it can be shown that

p(ψ = 0 | y1, . . . , ys, MS) = ∫ ∏_{i=1}^{s} p̃(θ | yi) dθ = Isub,    (2.44)
p(ψ = 0 | MS) = α^s.    (2.45)

Equation (2.44) is the generalisation of the density identity for s = 2 given in (2.29). Equation (2.45) helps to explain why the subprior normalising constant α appears in the full model evidence decomposition (2.43). The ratio of Isub to α^s measures the consistency of parameter estimates across subsets. The split-apply-combine approach for computing the evidence can be viewed as defining a misspecified model in the split step, followed by fitting the misspecified model in the apply step. The misspecified model can be viewed as a subset saturated model where we have introduced extra parameters. The combine step then corrects for the misspecification by computing an adjustment that corresponds to a Bayes factor linking the subset saturated model to the target model.

2.4.5 Example: ESP dataset

To illustrate the results in the previous subsection, we present another analysis of the ESP dataset. We split the dataset into three subsets, as described in Table 2.5. Let yi denote the data in subset i for i = 1, 2, 3. Additionally, let ni denote the number of trials in subset i and ri denote the number of successes for i = 1, 2, 3.

                      Full dataset    Subset 1      Subset 2      Subset 3
Trials                104,490,000     34,830,000    34,830,000    34,830,000
Successes             52,263,471      17,421,157    17,415,157    17,427,157
Success proportion    0.5001768       0.5001768     0.5000045     0.5003490

Table 2.5: Full ESP dataset and subsets for a divide and conquer approach with s = 3. The success proportion is close to 0.5 in the full dataset and in each subset.

[Figure 2.8: panel (a) shows the three subposterior densities of θ; panels (b) and (c) show contour plots of the posterior and prior distributions of ψ = (ψ1, ψ2), with contours labelled by their log density.]

Figure 2.8: (a) Subposterior distributions for s = 3 subsets. (b) Posterior distribution of ψ.
(c) Prior distribution of ψ. The red cross in (b) and (c) denotes the point (0,0). The numeric labels on the contours in (b) and (c) give the log density on the respective contour. Subposterior overlap is measured by the posterior density of ψ at zero. This is compared to the prior density of ψ at zero to determine the evidence for a linearly restricted model. The positioning of the red cross in panels (b) and (c) illustrate the identities (2.44) and (2.45) respectively. We again use a flat prior on the unknown success probability. The subprior is a Beta(1, 1) density. Each subposterior is a Beta(1 + ri, 1 + ni − ri) distribution for i = 1, 2, 3. Using (2.30) we can calculate log Isub = 12.2307 logα = 0. Given that s = 3, ψ is a two dimensional vector, ψ = θ(1) − θ(2) θ(1) − θ(3)  , where θ(i) is the proportion parameter in subset i for i = 1, 2, 3. Figure 2.8 illustrates the identities in (2.44) and (2.45). Panel (a) shows the three subposterior distributions. Panel (b) displays contours of the posterior distribution of ψ. The red cross denotes the point (0, 0). The density contour where log p(ψ|y1, . . . ,ys) = 12.23 passes through the point (0, 0) illustrating the general result that p(ψ = 0|y1, . . . ,ys,M) = ∫ ∏s i=1 p˜(θ|yi,M) dθ. Panel (c) shows the contours of the prior density of ψ. The prior mode is at (0, 0), and it can be seen that the log density is approaching zero around this point. We thus have empirical evaluation of (2.44) and (2.45) in this example. As mentioned, it is possible to generate samples from p(ψ|y1, . . . ,ys,MS). Evaluation of the subposte- rior integral can then be approached as a pointwise density estimation task Îsub = p̂(ψ = 0|y1, . . . ,ys,MS). From this example we can identify an important issue regarding the impact of the partition made in the split step. In (a) we can see that there is only a moderate amount of subposterior overlap. As such, (0, 0) is in the tails of the posterior distribution of ψ. The red cross in panel (b) is in the tails of the posterior distribution on ψ. Estimating the pointwise density at zero will be difficult if 0 is in a region with little posterior support. This promotes splitting the data so that subposterior distributions are as similar as possible. Although judicious data partitioning can shift the posterior support of ψ to a more favourable 25 2. Computing the model evidence using parallel processing region, little can be done about the fact that ψ is a d(s − 1)-dimensional vector. The implications for estimation of Isub are discussed in the next subsection. 2.4.6 Curse of dimensionality Suppose that we have B samples from each subposterior. Let S [b] i be the bth sample from p˜(θ|yi,M) for b = 1, . . . , B and i = 1, . . . , s. We can generate B posterior samples of ψ, by setting ψ[b] =  S [b] 1 − S[b]2 S [b] 1 − S[b]3 ... S [b] 1 − S[b]s  , (2.46) for b = 1, . . . , B. Using any consistent kernel density estimator the subposterior integral can be estimated as Îsub = p̂(ψ = 0|y1, . . . ,ys,MS) = 1 B B∑ b=1 KH(ψ [b]), (2.47) for a suitable kernel density estimator KH(·) with bandwidth matrix H. Although the estimator (2.47) is simulation consistent, it will be heavily affected by the curse of dimensionality. As mentioned in Section 2.3.1, the optimal mean square error rate for nonparametric kernel density estimation is known to be O(B−2/(q+4)), where q is the dimension of the space (Terrell and Scott, 1992). As the number of subsets s increases, the dimension of ψ increases linearly. 
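The estimator (2.47) is simple to implement by pooling the subposterior draws; a minimal sketch, using scipy's Gaussian product kernel and its default bandwidth rule in place of a general bandwidth matrix H (the function and array names are ours):

import numpy as np
from scipy.stats import gaussian_kde

def estimate_I_sub_kde(draws):
    # draws: list of s arrays, each of shape (B, d); draws[i][b] is the b-th
    # draw from subposterior i, i.e. S_i^[b] in (2.46).
    psi = np.hstack([draws[0] - d_i for d_i in draws[1:]])  # psi draws, shape (B, (s-1)d)
    kde = gaussian_kde(psi.T)               # gaussian_kde expects (dimension, sample size)
    return kde(np.zeros(psi.shape[1]))[0]   # estimate of p(psi = 0 | y, M_S) = I_sub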
Non parametric kernel density estimation in a (s−1)d dimensional space will only be feasible for very small s and d (recall Table 2.3). This suggests that the Monte Carlo budget per worker needs to increase exponentially with the number of subsets s to control the error in the combine step. 2.4.7 Embarrassingly parallel evidence estimation Algorithm 2.1 gives the general algorithm for computing the model evidence in parallel using the kernel density estimator for the subposterior integral (2.47). We expect the algorithm to break down as s increases, as the error in the combine stage will increase exponentially with s for a fixed Monte Carlo budget B. This limits its use for scalable computation of the integrated likelihood. The computational benefits of increasing s in the initial split are undermined by the increased Monte Carlo error in the combine step. We thus turn to a different strategy involving data augmentation. 2.5 Data augmentation for distributed inference 2.5.1 Introduction The general treatment of the problem given in the previous section is interesting, but leads to a somewhat negative conclusion regarding the feasibility of the divide and conquer approach. We can reach a more positive outcome by working within the confines of the conditionally conjugate exponential family. To ease notation we do not explicitly condition on a particular modelM in this section. The overall strategy makes use of Gibbs sampling in the apply step so that we can implement an efficient estimator of the subposterior integral in the combine step. The data augmentation based algorithm has more favourable scaling properties than the kernel density based algorithm (Algorithm 2.1). There is a connection between our approach and Chib’s method (Chib, 1995) for estimating the marginal likelihood from the Gibbs output in the single machine setting. 26 2.5.2. Conjugate priors in the exponential family Algorithm 2.1 Divide and Conquer model evidence Apply Step: Subset Markov chain Monte Carlo (MCMC) runs for i = 1, . . . , s do for b = 1, . . . , B do Sample θ from p˜(θ|yi) Set S [b] i ← θ end for Compute l̂og p˜(yi) using method of choice. end for Combine Step: Post Processing subset MCMC output for b = 1, . . . , B do Compute ψ[b] =  S [b] 1 − S[b]2 S [b] 1 − S[b]3 ... S [b] 1 − S[b]s  . end for Compute Îsub = B −1∑B b=1KH(ψ [b]) Compute log p̂(y1, . . . ,ys) = ∑s i=1 l̂og p˜(yi) + s logα+ Îsub 2.5.2 Conjugate priors in the exponential family To describe the conjugate priors suppose we have n independently and identically distributed observations y1, . . . ,yn. Bold face is used to allow for vector values observations. The model conditioned likelihood belongs to the exponential family if the likelihood contribution for a single observation can be written as p(yi|θ) = h(yi)g(θ) exp(η(θ)Tt(yi)), (2.48) for some known functions h(·), g(·), t(·) and η(·). The function t(·) returns a vector of sufficient statistics and the function η(·) dictates how the parameter θ interacts with the data. The standard conjugate prior (Bernardo and Smith, 2006) on θ is parametrised by a scalar ν0, and a vector φ0 of the same dimension as t(yi), taking the form pi(θ; ν0,φ0) = c(ν0,φ0)g(θ) ν0 exp(η(θ)Tφ0). (2.49) The function g(·) is the same that appears in the data likelihood (2.48), and c(ν0,φ0) is an appropriate normalising constant: 1 c(ν0,φ0) = ∫ g(θ)ν0 exp(η(θ)Tφ0) dθ. Explicit forms of some standard conjugate priors are listed in Chapter 5 of Bernardo and Smith (2006). 
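As a concrete illustration of the forms (2.48) and (2.49) (this worked example is ours, not taken from the references above), consider a Bernoulli observation yi ∈ {0, 1} with success probability θ. The likelihood contribution can be written as

p(yi | θ) = θ^{yi} (1 − θ)^{1 − yi} = (1 − θ) exp( yi log[θ/(1 − θ)] ),

so h(yi) = 1, g(θ) = 1 − θ, η(θ) = log[θ/(1 − θ)] and t(yi) = yi. The standard conjugate prior (2.49) is then

π(θ; ν0, φ0) ∝ (1 − θ)^{ν0} exp( φ0 log[θ/(1 − θ)] ) = θ^{φ0} (1 − θ)^{ν0 − φ0},

that is, a Beta(φ0 + 1, ν0 − φ0 + 1) density; the flat Beta(1, 1) prior used for the ESP example in section 2.4.1 corresponds to ν0 = φ0 = 0.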
Suppose the prior is of the standard conjugate form, so p(θ) = c(ν0,φ0)g(θ) ν0 exp(η(θ)Tφ0). Given n independent and identically distributed observations from model (2.48), the posterior remains in the same family, p(θ|y1, . . . ,yn) ∝ ( n∏ i=1 h(yi)g(θ) exp(η(θ) Tt(yi)) ) c(ν0,φ0)g(θ) ν0 exp(η(θ)Tφ0) ∝ g(θ)n exp [η(θ)T(∑ni=1t(yi))] c(ν0,φ0)g(θ)ν0 exp(η(θ)Tφ0) ∝ g(θ)ν0+n exp [η(θ)T (φ0 +∑ni=1t(yi))] ∝ pi (θ; ν0 + n,φ0 + ∑n i=1t(yi)) . The form of the update equations leads to an interpretation of the prior hyperparameters ν0 and φ0. The prior distribution pi(θ; ν0,φ0) acts as a pseudo-dataset of ν0 observations with sufficient statistics φ0. 27 2. Computing the model evidence using parallel processing In practice it is common to use more flexible priors than permitted by the standard conjugate family (Gutie´rrez-Pen˜a et al., 1997; Arnold et al., 1993). For example, suppose we have n observations from a d-dimensional multivariate normal distribution with known covariance matrix Σ and unknown mean µ. Using definition (2.49), the standard conjugate family has prior N(m0,Σ/ν0) for some prior mean m0 ∈ Rd and positive scalar ν0. The prior covariance of µ must be proportional to the covariance matrix of the observations. In practice, a common conjugate prior on µ is a N(m0,V0) distribution (Gelman et al., 2014) where V0 is any positive definite d × d matrix. The flexibility of an unstructured prior covariance matrix V0 places it outside the standard class (2.49). An enriched conjugate family of priors introduces an extra r-dimensional hyperparameter ω0 = (ω1, . . . , ωr), and r positive valued functions bu(θ) : Ω → R+ for u = 1, . . . , r (Gutie´rrez-Pen˜a et al., 1997). Define the function b(θ|ω0) = r∏ h=1 [bu(θ)] ωu . (2.50) The enriched conjugate prior is defined as pi(θ; ν0,φ0, ω0) = c(ν0,φ0,ω0)b(θ|ω0)g(θ)ν0 exp(η(θ)Tφ0). (2.51) In the previous display c(ν0,φ0,ω0) gives the normalising constant 1 c(ν0,φ0,ω0) = ∫ b(θ|ω0)g(θ)ν0 exp(η(θ)Tφ0) dθ. Standard conjugate priors (2.49) can be viewed as a special case of the enriched form (2.51) where ω0 = 0, and the functions b1(θ), . . . , br(θ) can be set to return one for all θ. The update equations for the enriched prior are very similar to the update equations for the standard prior. It is simple to show that if we have n independent and identically distributed observations from model (2.48), the posterior remains in the enriched family, p(θ|y1, . . . ,yn) = pi (θ; ν0 + n,φ0 + ∑n i=1t(yi),ω0) . The ω0 hyperparameter remains unchanged, and the data still influences the posterior through the term φ0 + ∑n i=1t(yi). Almost all commonly used conjugate priors can be shown to be in the enriched family (Arnold et al., 1993). However, it can be tedious to reparametrise benchmark priors to match the enriched form. It can be shown that the N(µ0,V0) prior belongs to the enriched conjugate prior family for Gaussian likelihoods (Gutie´rrez-Pen˜a et al., 1997, Example 4.3). The abstraction of the enriched conjugate prior family is beneficial as it provides a unifying framework for establishing general results. 2.5.3 Data augmentation It is not possible to write the likelihood function in the form (2.48) for many interesting statistical models. Data augmentation is a useful technique that extends the utility of exponential family and conjugate theory to a wider class of models. Suppose that we have an observed dataset of n observations y = (y1, . . . ,yn). 
A wide range of data augmentation strategies add a hidden layer of latent variables z with state space Z to the model, such that the observed data likelihood satisfies p(y|θ) = ∫ Z p(y, z|θ) dz. The latent variables z often have a natural statistical interpretation in terms of a hidden layer in the data generating process. Ideally, the complete data likelihood p(y, z|θ) is then more tractable than the observed data likelihood p(y|θ). In the simplest form, we introduce n independent latent variable z = (z1, . . . ,zn), one for each observation in the original dataset. With n independent latent variables 28 2.5.4. Chib’s method and n independent observations, the complete data likelihood satisfies p(y, z|θ) = n∏ i=1 p(yi, zi|θ). Suppose that the complete data likelihood contribution for observation i belongs to the exponential family, so p(yi, zi|θ) = h(yi, zi)g(θ) exp(η(θ)Tt(yi, zi)), (2.52) again for some known functions h(·), g(·), t(·) and η(·). Suppose that we adopt an enriched conditionally conjugate prior p(θ) = pi(θ; ν0,φ0,ω0). It is immediate that the conditional posterior has the form p(θ|z,y) = pi(θ; ν0 + n,φ0 + ∑n i=1t(yi, zi),ω0). (2.53) The conditional posterior depends on the sufficient statistics of the augmented dataset ∑n i=1t(yi, zi). For many models, it is possible to sample from the full conditional of the latent variables p(z|θ,y). The joint posterior on the unknowns p(θ, z|y) can thus be targeted using a two block Gibbs sampler, iteratively sampling from the full conditionals p(θ|z,θ) and p(z|θ,y). This data augmentation and Gibbs approach can be used for logistic regression, negative binomial regression, probit regression and Gaussian mixtures. 2.5.4 Chib’s method Chib’s method for marginal likelihood computation (Chib, 1995) using Gibbs sampling is based on the integrated likelihood identity (2.16). The method uses the structure of data augmented models to estimate the pointwise density in a more efficient manner than nonparametric kernel density estimation. As mentioned in section 2.5.2, it is often possible to introduce a vector of latent variables z such that the conditional posterior distribution p(θ|y, z) is known exactly. Marginalising over the posterior distribution of the latent variables z gives the relationship p(θ|y) = ∫ p(θ|y, z)p(z|y). (2.54) A simulation consistent estimate of the integrated likelihood can be obtained by simulating from the posterior distribution of the latent variables z. As discussed, it is common to use a two block Gibbs sampler, iteratively sampling the full conditional distributions p(θ|z,y) and p(z|θ,y). Suppose we use a conditionally conjugate prior, so the full conditional p(θ|y, z) is known explicitly. Given B posterior samples of the latent variables, z[1], . . . ,z[b], a simulation consistent estimator of the posterior density at θ∗ ∈ Ω is then p̂(θ∗|y) = 1 B B∑ b=1 p(θ∗|z[b],y). (2.55) Substituting (2.55) into the integrated likelihood identity (2.16) gives a simulation consistent estimator of the integrated likelihood, log p̂(y) = log p(y|θ∗) + log p(θ∗)− log ( 1 B B∑ b=1 p(θ∗|y, z[b]) ) . (2.56) Chib’s method can be used for high-dimensional models where general non-parametric density esti- mators would likely fail. The key to the approach is the marginal representation of the posterior density (2.55). We can take an average over posterior draws of the latent variables to estimate the unknown posterior density p(θ∗|y). 
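The estimator (2.56) reduces to simple post-processing of the Gibbs output once the three ingredients have been evaluated at the chosen ordinate θ*; a minimal sketch (function and argument names are ours):

import numpy as np
from scipy.special import logsumexp

def chib_log_evidence(log_lik_star, log_prior_star, log_cond_post_star):
    # log_lik_star:       log p(y | theta*)
    # log_prior_star:     log p(theta*)
    # log_cond_post_star: array of log p(theta* | y, z[b]) over the B Gibbs draws of z
    B = len(log_cond_post_star)
    log_post_at_star = logsumexp(log_cond_post_star) - np.log(B)  # estimate of log p(theta* | y), cf. (2.55)
    return log_lik_star + log_prior_star - log_post_at_star       # Candidate's formula (2.16) / (2.56)

A common choice of θ* is a point of high posterior density, such as the posterior mean of the sampled θ values, which keeps the terms in (2.56) numerically stable.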
In the following subsections we develop an embarrassingly parallel algorithm that makes use of the fact that although ∫ s∏ i=1 p˜(θ|yi) dθ, 29 2. Computing the model evidence using parallel processing may be intractable, data augmentation with suitably chosen latent variables in each subset zi could lead to the augmented subposterior integral ∫ s∏ i=1 p˜(θ|yi, zi) dθ, having a closed form solution. 2.5.5 Apply step We first show how data augmentation and Gibbs sampling can be used to draw subposterior samples and compute the subposterior evidence scores in the apply step. We assume that the complete data likelihood belongs to the exponential family. Let ni give the number of observations in subset i for i = 1, . . . , s. Let yij denote the jth observation in subset i, and let zij denote an associated latent variable for the jth observation in subset i. We assume that the latent variables zij are independent given θ. Finally let zi give the vector of latent variables for subset i, that is zi = (zi1, . . . , zini) for i = 1, . . . , s. The augmented full dataset posterior distribution is again proportional to s subposteriors: p(θ|z1, . . . ,zs,y1, . . . ,ys) ∝ p(θ)p(y1, . . . ,ys, z1, . . . ,zs|θ) = s∏ i=1 p(yi, zi|θ)p(θ)1/s ∝ s∏ i=1 p(yi, zi|θ)p˜(θ) ∝ s∏ i=1 p˜(θ|yi, zi). Suppose the complete data likelihood can be written in the following form p(yij , zij |θ) = h(yij , zij)g(θ) exp(η(θ)Tt(yij , zij)). (2.57) Where the functions h(·), g(·), t(·) and η(·) are known. Suppose the original prior is of the enriched conjugate form (2.51). The fractionated prior remains in the enriched conjugate family. Fractionating yields p(θ|ν0,φ0,ω0)1/s = [ c(ν0,φ0,ω0)b(θ|ω0)g(θ)ν0 exp(η(θ)Tφ0) ]1/s = c(ν0,φ0,ω0) 1/sb(θ|ω0)1/sg(θ)ν0/s exp(η(θ)Tφ0/s). Recall that ω0 = (ω1, . . . , ωr) and the definition of the function b(θ|ω0) = ∏r u=1[bu(θ)] ωu . It then follows that b(θ|ω0)1/s = b(θ|ω0/s). Continuing, we see that the fractionated prior remains in the enriched family p(θ|ν0,φ0,ω0)1/s = c(ν0,φ0,ω0)1/sb(θ|ω0/s)g(θ)ν0/s exp(η(θ)Tφ0/s) ∝ pi(θ; ν0/s,φ0/s,ω0/s). Given the original prior is p(θ) = pi(θ; ν0,φ0,ω0), the subprior can be defined as p˜(θ) = pi(θ; ν0/s,φ0/s,ω0/s). The hyperparameters of the subprior show that each subset analysis receives a fraction of the original prior pseudo-dataset, namely ν0/s observations with sufficient statistics φ0/s. Secondly, as the ω0 hyper- parameter tends to 0 the function b(θ|ω) tends to one. The subpriors have hyperparameters ω0/s which indicates a dampened effect of the extra prior flexibility function b(θ|ω). There are open questions sur- rounding the importance of prior fractionation (Scott, 2017; Gelman and Vehtari, 2017), and the theory of exponential families can be of use in addressing these. 30 2.5.6. Combine step From the results in section 2.5.3, it follows that given the latent variables zi, the subposteriors p˜(θ|yi, zi) are also conditionally conjugate. For i = 1, . . . , s: p˜(θ|yi, zi) ∝ g(θ)ni exp ( η(θ)T ∑ni i=jt(yij , zij) ) p(θ|ν0,φ0,ω0)1/s ∝ g(θ)ni exp ( η(θ)T ∑ni j=1t(yij , zij) ) pi(θ; ν0/s,φ0/s,ω0/s) ∝ pi ( θ ∣∣∣ ν0/s+ ni,φ0/s+∑nij=1t(yij , zij),ω0/s) . Suppose that it is possible to sample from the full conditional p(zi|θ,yi) in a single machine analysis. It is then also possible to sample from conditional subposterior distribution p˜(zi|θ,yi) using the same update equations. The conditional subposterior has the same form as in regular analysis of a small portion of the full dataset: p˜(zi|yi,θ) ∝ p(yi, zi|θ)p˜(θ) ∝ p(zi|θ,yi)p(yi|θ) ∝ p(zi|θ,yi). 
Use of the subprior p˜(θ) is not material when conditioning on θ. We keep the tilde to acknowledge that both p(zi|θ,yi) and p˜(θ|yi, zi) are targeted in the apply stage by an invidual worker. As such, we can use Gibbs sampling to target the subposterior distribution p˜(θ, zi|yi) by sampling from the conditional densities p˜(θ|yi, zi) and p˜(zi|θ,yi). We have previously assumed that each worker returns the subposterior evidence scores. We can use Chib’s method in the apply stage to compute the subposterior evidence values. Let z [b] i be the sampled latent variables at iteration b in subposterior i. Assuming the chain has mixed well, z [b] i is a sample from the distribution p˜(zi|yi). At any particular ordinate θ∗ ∈ Ω we have the identity log p˜(yi) = log p(yi|θ∗) + log p˜(θ∗)− log p˜(θ∗|yi), A simulation consistent estimator of log p˜(yi) is therefore l̂og p˜(yi) = log p(yi|θ∗) + log p˜(θ∗)− log ( 1 B B∑ b=1 p˜(θ∗|yi, z[b]i ) ) . (2.58) 2.5.6 Combine step The sequence of full conditional Gibbs distributions in the apply stage can be used to give a closed form estimator of the subposterior integral Isub = ∫ s∏ i=1 p˜(θ|yi) dθ. (2.59) To describe the estimator of Isub suppose the latent variables z1, . . . ,zs take values in state spaces Z1, . . . ,Zs respectively. Let the joint space of the subposterior latent variables be denoted Z = Z1 × . . . × Zs. Each subposterior has the marginal representation p˜(θ|yi) = ∫ Zi p˜(θ|yi, zi)p˜(zi|yi) dzi. The subposterior latent variables can be considered together as a random vector (z1, z2, . . . ,zs) with joint distribution p˜(z1, z2, . . . ,zs) = ∏s i=1 p˜(zi|yi). As such, we have the following representation of the product of the subposterior distributions s∏ i=1 p˜(θ|yi) = s∏ i=1 (∫ Zi p˜(θ|yi, zi)p˜(zi|yi) dzi ) = ∫ Z1×...×Zs ( s∏ i=1 p˜(θ|yi, zi)p˜(zi|yi) ) dz1dz2 . . . dzs = ∫ Z ( s∏ i=1 p˜(θ|yi, zi) )( s∏ i=1 p˜(zi|yi) ) dz1dz2 . . . dzs. (2.60) 31 2. Computing the model evidence using parallel processing Switching the integration and product operators is justified using Fubini’s theorem. The necessary assumptions are satisfied if all the subposterior distributions are proper (Keener, 2013). Substituting (2.60) into the definition of Isub (2.59) gives ∫ Ω s∏ i=1 p˜(θ|yi) dθ = ∫ Ω ∫ Z ( s∏ i=1 p˜(θ|yi, zi) )( s∏ i=1 p˜(zi|yi) ) dz1dz2 . . . dzs dθ Using Fubini’s theorem once more allows the order of integration to be switched, ∫ Ω s∏ i=1 p˜(θ|yi) dθ = ∫ Z (∫ Ω s∏ i=1 p˜(θ|yi, zi) dθ )( s∏ i=1 p˜(zi|yi) ) dz1dz2 . . . dzs (2.61) The inner integral over Ω can be obtained in closed form given that the subposteriors are conditionally conjugate. The subposterior integral is equal to the expected value of the augmented subposterior integral, where the latent variables z1, . . . ,zs are sampled from the subposterior distributions. We can write Isub = Ep˜(z1,...,zs|y1,...,ys) [∫ Ω s∏ i=1 p˜(θ|yi, zi) dθ ] (2.62) For convenience, let t(yi, zi) denote the complete data sufficient statistic for subset i, that is t(yi, zi) =∑ni j=1 t(yij , zij). For more compact notation define for i = 1, . . . , s the subset normalising constants ci = c(ν0/s+ ni,φ0/s+ t(yi, zi),ω0/s), where c(·) is the normalising function for the enriched conjugate prior. Additionally, define the full dataset normalising constant as C = c(ν0 + n,φ0 + ∑s i=1t(yi, zi),ω0). Using properties of the standard conjugate prior we obtain a closed form expression for the augmented subposterior integral. 
∫ Ω s∏ i=1 p˜(θ|yi, zi) dθ = ∫ Ω s∏ i=1 ci × b(θ|ω0/s)g(θ)ν0/s+ni exp [ η(θ)T (φ0/s+ t(yi, zi)) ] dθ = ( s∏ i=1 ci )∫ Ω g(θ)ν0+nb(θ|ω0) exp [ η(θ)T ( φ0 + s∑ i=1 t(yi, zi) )] dθ = (∏s i=1 ci C )∫ Ω p(θ|y1, . . . ,ys, z1, . . . ,zs) dθ = ∏s i=1 c(ν0/s+ ni,φ0/s+ t(yi, zi),ω0/s) c(ν0 + n,φ0 + ∑s i=1 t(yi, zi),ω0) . (2.63) The second line uses the fact that ∏s i=1 b(θ|ω0/s) = b(θ|ω0). Substituting (2.63) into (2.64) gives an important expression for the subposterior integral Isub. Isub = Ep˜(z1,...,zs|y1,...,ys) [∏s i=1 c(ν0/s+ ni,φ0/s+ t(yi, zi),ω0/s) c(ν0 + n,φ0 + ∑s i=1 t(yi, zi),ω0) ] (2.64) Let φ [b] i = φ0/s+ t(z [b] i ,yi) be the data dependent parameter of the conditional subposterior at iteration b for i = 1, . . . , s. Suppose we have saved either the sufficient statistics t(yi, z [b] i ) or conditional posterior parameters φ [b] i at each iteration b = 1, . . . , B in each minibatch analysis i = 1, . . . , s. The subposterior integral can then be estimated in the combine step by pooling the Gibbs histories from the apply step. 32 2.5.7. Embarrassingly parallel evidence estimation In full, Îsub = 1 B B∑ b=1 ∫ Ω s∏ i=1 p˜(θ|yi, z[b]i ) dθ = 1 B B∑ b=1 ∏s i=1 c(ν0/s+ ni,φ0/s+ t(yi, z [b] i )) c ( ν0 + n,φ0 + ∑s i=1 t(z [b] i ,yi) ) (2.65) = 1 B B∑ b=1 ∏s i=1 c(ν0/s+ ni,φ [b] i ) c(ν0 + n, ∑s i=1 φ [b] i ) . (2.66) Equations (2.65) and (2.66) both define plug-in Monte Carlo estimators of the expectation in equation (2.64). 2.5.7 Embarrassingly parallel evidence estimation Recall (2.27), the subset decomposition of the full dataset model evidence log p(y1, . . . ,ys) = ( s∑ i=1 log p˜(yi) ) + log Isub + s logα. The sum over the subposterior evidence scores can be estimated during the apply stage using Chib’s method. To estimate the sequence of full conditional distributions in the combine stage it is necessary to save information about the full conditional distributions that are sampled from in the apply stage. The subprior normalising constant α can generally be determined in closed form given that the original prior is in the exponential familty. The full dataset model evidence is then estimated by combining the subset evidence scores with the subposterior integral, l̂og p(yi, . . . ,ys) = ( s∑ i=1 l̂og p˜(yi) ) + log Îsub + s logα. Algorithm 2.2 presents the proposed methodology in detail. Algorithm 2.2 Divide and conquer for conditionally conjugate models Apply Step: Subset Markov chain Monte Carlo (MCMC) runs for i = 1, . . . , s do for b = 1, . . . , B do Sample z [b] i ∼ p(zi|θ,yi) Compute φ [b] i = φ0/s+ t(z [b] i ,yi) Save φ [b] i Sample θ from p˜(θ|z[b]i ,yi) = pi(θ; ν0/s+ ni,φ[b]i ,ω0/s) end for Compute l̂og p˜(yi) using (2.58) or alternative method. end for Combine Step: Post processing subset Gibbs output for b = 1, . . . , B do Compute gb = ∏s i=1 c(ν0/s,φ [b] i ,ω0/s) c(ν0, ∑s i=1 φ [b] i ,ω0) end for Compute Îsub = B −1∑B b=1 gb Compute l̂og p(yi, . . . ,ys) = (∑s i=1 l̂og p˜(yi) ) + log Îsub + s logα To recap, we have developed an embarrassingly parallel algorithm to compute the full dataset model evidence using distributed computing and Gibbs sampling. The algorithm meets the desired template of Figure 2.1. The complexity of the combine step does not depend on n. It is quite remarkable that we can 33 2. Computing the model evidence using parallel processing estimate the integrated likelihood on massive datasets without ever fitting the model to the whole dataset. 
This is a significant departure from many other techniques for computing the integrated likelihood, where generating samples from the full dataset posterior p(θ|y1, . . . ,ys) is an important intermediate step (Friel and Wyse, 2012). Existing methods for calculating the integrated likelihood are discussed in more detail in Chapter 3. 2.5.8 Monte Carlo error For Algorithm 2.2 to be effective, the Monte Carlo error in the combine step needs to be tolerable. Algorithm 2.1 is impractical for this reason, due to the curse of dimensionality impeding nonparametric kernel density estimation of the subposterior integral. In this section we show that the variance of the Gibbs based estimatpor subposterior integral (2.66) is tied to the disparity between the subposterior distribution on the latent variables, and the full dataset posterior distribution. The variance can be characterised in terms of importance sampling. As a quick review, importance sampling is a basic Monte Carlo technique based on a change of measure. Suppose we are interested in the expectation of a function h(x) over some probability distribution p(x) on state space X . Importance sampling is based on the simple identity, ∫ X h(x)p(x) dx = ∫ X h(x) p(x) q(x) q(x) dx, for some other distribution q(x) on the same space X . A common use for importance sampling is when p(x) is difficult to sample from, but it is easy to find an approximating distribution q(x). An important diagnostic in importance sampling is the variance of the importance sampling weights varq(x) ( p(x) q(x) ) . If q(x) ∝ p(x), then the variance of the weights is zero. The variance of the weights is strongly influenced by the differences between p(x) and q(x). The tails of the distribution q(x) are particularly significant. If p(x) has high support in regions that are minimally covered by q(x), the variance of the importance weights will be high, and the importance sampler will be ineffective. It is possible for the variance of the weights to be infinite, in which case there can be a large finite sample bias on the importance sampling estimator (Robert and Casella, 2010). Let p˜(z1, . . . ,zs|y1, . . . ,ys) be the effective joint distribution on the latent variables if sampling in- dependently from each subposterior, so p˜(z1, . . . ,zs|y1, . . . ,ys) = ∏s i=1 p˜(zi|yi). The target posterior on the latent variables is p(z1, . . . ,zs|y1, . . . ,ys). To derive the variance of Îsub we start by applying Bayes theorem to the augmented posterior p(θ|z1, . . . ,zs,y1, . . . ,ys) = p(y1, . . . ,ys, z1, . . . ,zs|θ)p(θ) p(z1, . . . ,zs,y1, . . . ,ys) = αs p(z1, . . . ,zs|y1, . . . ,ys)p(y1, . . . ,ys) s∏ i=1 p(yi, zi|θ)p˜(θ) = αs p(z1, . . . ,zs|y1, . . . ,ys)p(y1, . . . ,ys) s∏ i=1 p˜(θ|yi, zi)p˜(zi|yi)p˜(yi) = αs ∏s i=1 p˜(yi) p(y1, . . . ,ys) ∏s i=1 p˜(zi|yi) p(z1, . . . ,zs|y1, . . . ,ys) s∏ i=1 p˜(θ|yi, zi). (2.67) The core identity p(y1, . . . ,ys) = ( s∏ i=1 p˜(yi) ) αs × Isub 34 2.6. Logistic regression gives the relationship ∏s i=1 p˜(yi) p(y1, . . . ,ys) = α−s Isub . (2.68) Substituting (2.68) into (2.67) and recalling the definition p˜(z1, . . . ,zs|y1, . . . ,ys) = ∏s i=1 p˜(zi|yi), p(θ|z1, . . . ,zs,y1, . . . ,ys) = α sα−s Isub p˜(z1, . . . ,zs|y1, . . . ,ys) p(z1, . . . ,zs|y1, . . . ,ys) s∏ i=1 p˜(θ|yi, zi). Integrating both sides over θ gives 1 = 1 Isub p˜(z1, . . . ,zs|y1, . . . ,ys) p(z1, . . . ,zs|y1, . . . ,ys) ∫ s∏ i=1 p˜(θ|yi, zi) dθ. Rearranging shows a connection between the integral over the augmented subposteriors and importance sampling weights. 
∫ s∏ i=1 p˜(θ|yi, zi) dθ = Isub × p(z1, . . . ,zs|y1, . . . ,ys) p˜(z1, . . . ,zs|y1, . . . ,ys) (2.69) The subposterior integral estimator is defined as the plug-in expectation of the augmented integral over p˜(z1, . . . ,zs|y1, . . . ,ys). The estimator can be written as Îsub = 1 B B∑ b=1 ∫ Ω s∏ i=1 p˜(θ|yi, z[b]i ) dθ = 1 B × Isub × B∑ b=1 p(z [b] 1 , . . . ,z [b] s |y1, . . . ,ys) p˜(z [b] 1 , . . . ,z [b] s |y1, . . . ,ys) . Assuming independent subposterior samples, the variance of the estimator is therefore var(Îsub) = 1 B × I2sub × varp˜(z1,...,zs|y1,...,ys) ( p(z1, . . . ,zs|y1, . . . ,ys) p˜(z1, . . . ,zs|y1, . . . ,ys) ) . (2.70) In reality, correlated subposterior samples will inflate the variance, but to simplify the discussion we as- sume independence or samples from a sufficiently thinned Markov chain. The variance of Îsub is tied to the mean multiplied by a density ratio involving the latent variables in the model. The final term in brackets can be interpreted as importance sampling weights if using the subposterior distributions over the latent variables to approximate the full dataset joint posterior on the latent variables. There are two important conclusions. The first is that the variance of the integral estimator is only finite if the variance of the importance sampling weights is finite. If the tails of the subposterior distribution p˜(z1, . . . ,zs|y1, . . . ,ys) contain regions where the target posterior p(z1, . . . ,zs|y1, . . . ,ys) has high support, then the estimator Îsub could have infinite variance. The second conclusion in the converse direction is that it is possible to obtain a zero variance estimator if p˜(z1, . . . ,zs|y1, . . . ,ys) is proportional to p(z1, . . . ,zs|y1, . . . ,ys). It is difficult to make a broad statement about how the variance of the proposed estimator (2.66) scales with the number of subsets s. The variance is tied to the model, the goodness of fit of the model and how the full dataset is partitioned. However, Algorithm 2.2 has more promise than the approach based on kernel density estimation as the curse of dimensionality does not completely rule out its efficacy. 2.6 Logistic regression We now show how Algorithm 2.2 can be used for large scale logistic regression. Suppose we have n binary observations yi ∈ {0, 1} with an associated p-dimensional vector of covariates xi for i = 1, . . . , n. The responses are modelled as yi ∼ Bernoulli(σ(ηi)), where ηi = xTi β and σ(η) gives the inverse logistic function σ(η) = exp(xTi β)/(1 + exp(x T i β)). Polson et al. (2013) propose a data augmentation scheme for logistic regression involving the Po´lya-Gamma distribution. A Po´lya-Gamma random variable is an infinite sum of gamma random variables, defined more precisely below. 35 2. Computing the model evidence using parallel processing Definition 2.1 (Polson et al. (2013)). A random variable X has a Po´lya-Gamma distribution b > 0 and c ∈ R, denoted X ∼ PG(b, c) if X d = 1 2pi2 ∞∑ k=1 gk (k − 1/2)2 + c2/(4pi2) , where the gk ∼ Gamma(b, 1) are independent gamma random variables with shape parameter b and rate parameter 1, and d = indicates equality in distribution. Let X ∼ PG(b, c). Let cosh(x) = (exp(x) + exp(−x))/2. The probability density function of X can be written as p(x|b, c) = coshb(c/2)2 b−1 Γ(b) ∞∑ n=0 (−1)n Γ(n+ b) Γ(n+ 1) (2n+ b)√ 2pix3 exp ( − (2n+ b) 2 8x − c 2 2 x ) . For each binary observation yi we introduce a latent Po´lya-Gamma random variable zi with parameters b = 1, c = 0. Polson et al. 
show that the likelihood contribution of observation i can be written as an integral over the latent zi, p(yi|xi,β) = [exp(x T i β)] yi 1 + exp(xtiβ) = 2−1 exp(κixTi β) ∫ ∞ 0 exp(−zi(xTi β)2/2)p(zi|1, 0) dzi, where κi = yi− 1/2 and p(zi|1, 0) is the density of a Po´lya-Gamma random variable. The complete data likelihood can be written as a quadratic function of β. Let X give the n× p design matrix where row i is given by xTi . Up to arbitrary constants, the complete data likelihood is equal to p(y, z|X,β) = p(z) n∏ i=1 exp(κix T i β − zi(xTi β)2/2) = p(z) n∏ i=1 exp(κ2i /(2zi)) exp ( −zi 2 (xTi β − κi/zi)2 ) = h(z)p(z) exp ( n∑ i=1 −zi 2 (xTi β − κi/zi)2 ) = h(z)p(z) exp ( −1 2 (y′ −Xβ)TΩ(y′ −Xβ) ) , where y′ = (κ1/z1, . . . , κn/zn) and Ω = diag(z1, . . . , zn) and h(y, z) is a function of the augmented dataset that is not dependent on β. This resembles a regression likelihood with working responses y′, design matrix X, coefficients β, and diagonal covariance matrix Ω−1. With the connection to linear regression, it is simple to see that a conditionally conjugate prior on β is a multivariate normal distribu- tion. The enriched parametrisation (2.51) is cumbersome to work with, we instead use the more typical parametrisation in terms of the prior mean m0 and prior covariance matrix V0. A two-block Gibbs sampler can be used to sample from the joint posterior of the coefficients β and the latent variables z. The full conditional on β is p(β|z,y) = N(m,V ), (2.71) where V = (XTΩX + V −10 ) −1 m = V (XTκ+ V −10 m0). Recall κ = (y1 − 1/2, . . . , yn − 1/2) and Ω is a diagonal matrix where Ωii = zi. The latent variables z are conditionally independent given y and θ. The full conditional for the latent zi is p(zi|β,y) = PG(1,xTi β), (2.72) 36 2.6. Logistic regression for i = 1, . . . , n, and PG(b, c) denotes a Po´lya-Gamma distribution with parameters (b, c). The Po´lya- Gamma distribution is not in the exponential family, but still has a useful conditional conjugacy property for the logistic regression model. From section 2.5.5 the subposterior distributions can also be targeted using Gibbs sampling. The subprior is a N(m0, sV0) distribution. The fractionation scales the covariance matrix by a factor of s. To sample from p˜(β, zi|yi) we iterate sampling from p˜(β|zi,yi) and p˜(zi|β,yi). Let xij give the vector of covariates for the jth observation in subset i for j ∈ {1, . . . , ni} and i ∈ {1, . . . , s}. Let X(i) give the ni × p matrix of covariates in subset i for i = 1, . . . , s. The subposterior full conditional on β is p˜(β|zi,yi) = N(m,V ), (2.73) where V = (XT(i)ΩX(i) + s −1V −10 ) −1 m = V (XT(i)κ+ s −1V −10 m0). Here κ = (yi1 − 1/2, . . . , yini − 1/2) and Ω = diag(zi). The latent variables zi are again conditionally independent given yi and β. The full conditional for the latent zij is p˜(zij |β,yi) = PG(1,xTijβ), (2.74) for j = 1, . . . , ni. Suppose we run B sweeps of the Gibbs sampler in each subset analysis during the apply step. As dictated by Algorithm 2.2, during the apply step we save the parameters of the full conditional p˜(β|yi, zi) at each iteration. Let m[b]i represent the conditional mean of β and let V [b]i represent the conditional variance of β at iteration b in the ith subposterior analysis for i = 1, . . . , s and b = 1, . . . , B. Likewise, let z [b] i denote the sampled latent variables at iteration b in the ith subset analysis for i = 1, . . . , s and b = 1, . . . , B. Each worker can compute the subposterior evidence using Chib’s method. 
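As an illustration of a single worker's apply-step run, the sketch below iterates the two full conditionals (2.73) and (2.74) and stores the conditional means and covariances that the combine step will need. It is a rough sketch under stated assumptions: the Pólya-Gamma draw is approximated by truncating the infinite gamma sum of Definition 2.1 at K terms, whereas in practice one would use an exact sampler such as that of Polson et al. (2013), and all variable names are illustrative.

import numpy as np

rng = np.random.default_rng(1)

def pg_draw_approx(c, K=200):
    # Approximate PG(1, c) draw by truncating the gamma sum in Definition 2.1 at K terms.
    k = np.arange(1, K + 1)
    g = rng.gamma(shape=1.0, scale=1.0, size=K)
    return np.sum(g / ((k - 0.5) ** 2 + c ** 2 / (4.0 * np.pi ** 2))) / (2.0 * np.pi ** 2)

def subset_gibbs(X_i, y_i, m0, V0, s, B=1000):
    # Two-block Gibbs sampler targeting the subposterior p~(beta, z_i | y_i).
    # Returns the saved conditional parameters (m_i^[b], V_i^[b]) for b = 1, ..., B.
    n_i, p = X_i.shape
    kappa = y_i - 0.5
    V0_inv = np.linalg.inv(V0)
    beta = np.zeros(p)
    ms, Vs = [], []
    for b in range(B):
        # z_ij | beta ~ PG(1, x_ij' beta), equation (2.74)
        z = np.array([pg_draw_approx(x @ beta) for x in X_i])
        # beta | z_i ~ N(m, V) under the subprior N(m0, s V0), equation (2.73)
        V = np.linalg.inv(X_i.T @ (z[:, None] * X_i) + V0_inv / s)
        m = V @ (X_i.T @ kappa + V0_inv @ m0 / s)
        beta = rng.multivariate_normal(m, V)
        ms.append(m)
        Vs.append(V)
    return np.array(ms), np.array(Vs)

The saved pairs (m_i^[b], V_i^[b]) are exactly the quantities required by the Chib estimator and the combine step described next.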
For any ordinate β∗ subsposterior evidence satisfies the identity log p˜(yi) = log p(yi|β∗) + log p˜(β∗)− log p˜(β∗|yi). A simulation consistent estimator is therefore l̂og p˜(yi) = log p(yi|β∗) + log p˜(β∗)− log ( B−1 B∑ b=1 p(β∗|yi, z[b]i ) ) = log p(yi|β∗) + log p˜(β∗)− log ( B−1 B∑ b=1 N(β∗;m[b]i ,V [b] i ) ) . (2.75) We now turn to the combine step. The subposterior integral has the representation Isub = Ep˜(z1,...,zs|y1,...,ys) [∫ Ω s∏ i=1 p˜(β|yi, zi) dβ ] . From the results in section 2.5.6 it is possible to obtain a closed form expression for the integral over the augmented subposterior distributions given the conjugate structure of the model. Here the condi- tional subposterior is multivariate normal. Lemma 2.1 gives a result for the integral over a product of multivariate Gaussians. Lemma 2.1. Let p(x;µi,Σi) denote a multivariate Gaussian pdf with mean vector µi and covariance matrix Σi for i = 1, . . . , s. For i = 1, . . . , s define ηi = Σ −1 i µi, Λi = Σ −1 i , ξi = −1 2 ( d log 2pi − log|Λi|+ ηTi Λiηi ) . 37 2. Computing the model evidence using parallel processing Similarly, define η = ∑s i=1ηi, Λ = ∑s i=1Λi ξ = −1 2 ( d log 2pi − log|Λ|+ ηTΛη) . The integral over the product of the s density functions has the closed form solution∫ s∏ i=1 p(x;µi,Σi) dx = exp [( ∑s i=1ξi)− ξ] . The proof uses the fact that the product of multivariate Gaussian density functions is proportional to another multivariate Gaussian density function. This is a special case of the general result in section 2.5.6. For convenience define the function f(m1, . . . ,ms,Σ1, . . . ,Σs) as f(m1, . . . ,ms,V1, . . . ,Vs) = ∫ s∏ i=1 N(x;mi,Vi) dx. The function f(·) can be numerically evaluated using the result in Lemma 2.1. The Monte-Carlo estimator of the subposterior integral is then Îsub = B −1 B∑ b=1 ∫ Ω s∏ i=1 p˜(β|yi, z[b]i ) dβ = B−1 B∑ b=1 f(m [b] 1 , . . . ,m [b] s ,V [b] 1 , . . . ,V [b] s ) (2.76) The subprior normalising constant is α = sp/2( √ (2pi)p|V0|)1−1/s. Combining (2.75) and (2.76) the full dataset model evidence is estimated as l̂og p(y1, . . . ,ys|β) = ( s∑ i=1 l̂og p˜(yi) ) + s logα+ log Îsub. 2.7 Data application 2.7.1 Flights dataset We considered a dataset on flights departing New York city, available in the R package nycflights13 (Wickham, 2014). There are n = 327, 346 observations. We dichotomised the arrival delay variable (original units in minutes) to obtain a binary outcome. We labelled flights as late if the arrival delay was greater than zero, and on time if the arrival delay was less than or equal to zero. Unsurprisingly, there is a clear statistical association between late arrival at the destination and the departure delay leaving New York. There are data on 16 different carriers (airlines). Figure 2.9 (a) plots the number of flights against departure delay. Panel (b) plots the proportion of flights that were late against departure delay in minutes. In (a) we see that the distribution of departure delay time has a mode slightly above zero. The majority of flights depart late. In (b) we see a logistic relationship between departure delay (minutes) and the empirical probability of late arrival at the destination. This suggests a logistic regression model is appropriate. Let yi ∈ {0, 1} denote the response, where yi = 1 if flight i was late and is zero otherwise. We considered two regression models for the response, M1 = intercept + delay (2.77) M2 = intercept : carrier + delay : carrier. (2.78) 38 2.7.2. 
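Before turning to the data applications, a small numerical check of Lemma 2.1 may be useful. The sketch below evaluates the log integral over a product of Gaussian densities, writing the quadratic terms in terms of the means (so that the term involving η_i is μ_i'Λ_iμ_i), and verifies the s = 2 case against the standard identity ∫ N(x; μ1, Σ1) N(x; μ2, Σ2) dx = N(μ1; μ2, Σ1 + Σ2). The function and variable names are assumptions.

import numpy as np
from scipy.stats import multivariate_normal

def log_gauss_product_integral(means, covs):
    # log of the integral over the product of Gaussian densities N(x; m_i, V_i),
    # the quantity f(m_1, ..., m_s, V_1, ..., V_s) used in the combine step (Lemma 2.1).
    d = means[0].shape[0]
    lams = [np.linalg.inv(V) for V in covs]        # precision matrices Lambda_i
    etas = [L @ m for L, m in zip(lams, means)]    # canonical means eta_i = Lambda_i mu_i
    def log_norm(eta, lam):
        # log normalising constant of exp(eta'x - x' lam x / 2)
        _, logdet = np.linalg.slogdet(lam)
        return 0.5 * (d * np.log(2.0 * np.pi) - logdet + eta @ np.linalg.solve(lam, eta))
    log_xi_i = [log_norm(e, L) for e, L in zip(etas, lams)]
    log_xi = log_norm(sum(etas), sum(lams))
    return log_xi - sum(log_xi_i)

# sanity check for s = 2: the integral equals N(mu_1; mu_2, Sigma_1 + Sigma_2)
m1, m2 = np.array([0.5, -1.0]), np.array([1.0, 0.0])
V1, V2 = np.diag([1.0, 2.0]), np.diag([0.5, 0.5])
lhs = log_gauss_product_integral([m1, m2], [V1, V2])
rhs = multivariate_normal(mean=m2, cov=V1 + V2).logpdf(m1)
assert np.isclose(lhs, rhs)

Averaging the exponential of this quantity over the saved Gibbs draws, as in equation (2.76), gives the Monte Carlo estimate Îsub for the logistic regression model.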
Figure 2.9: Flights dataset (n = 327,346) and fitted logistic model. (a) Number of flights against departure delay (minutes). (b) Probability of late arrival given departure delay. Points are observed data; the blue solid line is the posterior mean from the logistic regression model. In (a) we see that the distribution of departure delay has a mode slightly above zero, and that the majority of flights depart late. In (b) we see a logistic relationship between departure delay (minutes) and the empirical probability of late arrival at the destination.

Model 1 is a pooled model where all carriers are modelled identically. Model 2 is a completely unpooled model where each carrier is given a unique intercept and slope. We used independent N(0, 1) priors on the coefficients. In panel (b) of Figure 2.9 we plot the posterior predictive mean fit of model 1 as a dashed line. The posterior predictive mean for a new response given covariates x_new is obtained by integrating over the posterior distribution of the coefficients,

E[y_new | x_new, M] = ∫ p(y_new = 1 | x_new, β, M) p(β | X, y, M) dβ. (2.79)

The fit of M1 to the observed data appears to be extremely good. There is some deviation from the theoretical proportion for very early flights, say with a departure delay of less than -20 minutes. However, we have very few data points in this range (refer back to panel (a)), so it seems reasonable to see this level of noise in (b). Figure 2.10 plots the probability of late arrival against departure delay for each carrier (airline) in the dataset. The blue dashed line is the fitted mean from the pooled model (M1). The goodness of fit varies over airlines, with a noticeable deficiency in the pooled model for the carriers FL and MQ. The graphical diagnostic in panel (b) of Figure 2.9 did obscure some deficiencies in the model. When averaging over carriers, model 1 looks very good, but when looking at the quality of the fit for individual carriers we see some systematic problems. Model 2 allows for more flexibility. The posterior mean fit for model 2 is plotted as a solid red line. The fit is noticeably better, but requires the addition of 30 parameters. We computed the integrated likelihood using the divide and conquer strategy with s = 5 subsets. Datasets were split uniformly at random. The log Bayes factor in favour of model 2 is 2353. There is very strong support for modelling carriers separately.

2.7.2 Pima Indians dataset

We also analyse the Pima Indians diabetes dataset (Venables and Ripley, 2002). This is not a Big Data application, as n = 532; the idea is to illustrate how the initial data split affects the Monte Carlo variance of Îsub on a benchmark dataset. We take diabetes status as the response, and glucose and body mass index as the covariates. Let yj denote a binary response, where yj is equal to one if subject j has diabetes and is zero otherwise, for j = 1, . . . , n. Let xj,1 represent glucose level and xj,2 represent body mass index for subject j.
Figure 2.10: Comparison of the pooled logistic regression model to the unpooled logistic regression model on the flights dataset, with one panel per carrier. The blue dashed line shows the posterior mean fit from the pooled model. The red solid line shows the posterior mean fit from the unpooled model. There appears to be heterogeneity across carriers. The unpooled model gives a better individual fit to each carrier.

Given covariates xj = (xj,1, xj,2)^T, yj is modelled as Bernoulli(σ(ηj)), where ηj = β0 + xj,1 β1 + xj,2 β2 and σ(ηj) = 1/(1 + exp(−ηj)). We compare two partitions of the full dataset y = (y1, . . . , yn) into s = 2 subsets y1 and y2. The first split is made uniformly at random and is shown in Figure 2.12. The second split oversamples cases by a factor of 10 in the first subset; the biased split of the dataset is shown in Figure 2.13. In each panel we plot a decision boundary from the respective analysis. Let x = (x1, x2)^T represent a vector of covariates. The displayed decision boundary is given by the hyperplane x^T w + α = 0, where w and α are determined using the posterior distribution in the full dataset results and the subposterior distributions in the subset results. In the full dataset analysis we obtain the slope coefficients w = E[(β1, β2)^T | y] and the intercept α = E[β0 | y], where the expectation is over the posterior distribution p(β0, β1, β2 | y). For the subset results we set w = E[(β1, β2)^T | yi] and the intercept α = E[β0 | yi], where the expectation is over the subposterior distribution p̃(β0, β1, β2 | yi) for i = 1, 2. The subset decision boundaries are influenced by the subset specific case proportion in yi. In Figure 2.12 the decision boundaries in the subset analyses are similar to the decision boundary in the full dataset analysis, so the subset results are in line with the full dataset results. In contrast, in Figure 2.13 the subset decision boundaries are different to the decision boundary from the full dataset analysis; in the subset 2 results the decision boundary is out of the range of the panel. The biased split has caused the subset analyses to be inconsistent with the full dataset analysis. This will have a flow-on impact on the subposterior distributions of the latent variables.

Figure 2.11: Comparison of subposterior and target posterior distributions on the latent variables for the Pima Indians dataset. The solid black line represents the target posterior distribution on the latent variable p(zij | X, y) and the dashed red line gives the subposterior distribution p̃(zij | X(i), yi) for the subjects listed in Table 2.6. The disparity between the subposterior and target posterior distributions is greater under the biased split than under the uniform split. This disparity is directly related to the variance of the subposterior integral estimator (2.70).
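For reference, the two partitions compared above could be generated along the following lines. This is only a sketch: the variable names are hypothetical, and weighted sampling without replacement is one of several ways to oversample cases by a factor of 10 in the first subset.

import numpy as np

rng = np.random.default_rng(0)

def uniform_split(n, s=2):
    # assign each of the n observations to one of s subsets uniformly at random
    return rng.permutation(np.arange(n) % s)

def biased_split(y, weight=10.0):
    # two-way split in which cases (y == 1) are given ten times the weight of
    # controls when sampling the members of the first subset; the rest form the second
    n = len(y)
    w = np.where(y == 1, weight, 1.0)
    in_subset1 = rng.choice(n, size=n // 2, replace=False, p=w / w.sum())
    labels = np.ones(n, dtype=int)
    labels[in_subset1] = 0
    return labels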
We ran Algorithm 2.2 with B = 1000 for both partitions one hundred times, and then compared the variance of log Îsub. The variance of log Îsub is a function of the variance of the importance weights

p(z1, . . . , zs)/p̃(z1, . . . , zs). (2.80)

The subposterior distribution p̃(z1, . . . , zs) is expected to be very different to the target distribution p(z1, . . . , zs) under the biased split, and more similar to the target distribution under the uniform split. The biased split affects the estimate of the intercept parameter β0 in each subset. This in turn affects the subposterior distribution over the latent variables. Figure 2.11 compares the subposterior distributions on the latent variables to the target distribution for each partition. Results are shown for three observations, described in Table 2.6. In Figure 2.11, the solid line gives the target posterior distribution p(zij | y1, y2) for the three subjects of interest. The dashed line gives the subposterior distribution p̃(zij | yi) for each subject. As predicted, the biased split causes p̃(zij | yi) to be very different from p(zij | y1, y2), whereas the uniform split results in p̃(zij | yi) being very similar to p(zij | y1, y2). As expected, the Monte Carlo variance of log Îsub is very different under each split. Figure 2.14 shows boxplots of the estimates of log Îsub obtained in the combine step over the one hundred replications of Algorithm 2.2. The results under the biased split are more variable than the results under the uniform split: the standard deviation under the biased split is 3.04, while the standard deviation under the uniform split is 0.032. The initial partition in the split step has a large influence on the Monte Carlo error in the combine step.

Figure 2.12: Uniform split of the Pima Indians dataset, showing glucose against body mass index for the full dataset, subset 1 and subset 2. The solid line gives the decision boundary using the (sub)posterior mean of β = (β0, β1, β2)^T in each dataset. Red crosses denote cases and black circles denote controls.
The decision boundaries are similar across each analysis. The consistency in the results suggests the evidence synthesis in the combine step will be relatively congenial.

Figure 2.13: Biased split of the Pima Indians dataset, showing glucose against body mass index for the full dataset, subset 1 and subset 2. The solid line gives the decision boundary using the (sub)posterior mean of β = (β0, β1, β2)^T in each dataset. Red crosses denote cases, and black circles denote controls. The decision boundaries are dissimilar over the analyses. The deviation in the results suggests the evidence synthesis in the combine step will be comparatively challenging.

Table 2.6: Raw data for subjects shown in Figure 2.11.
subject   type   glu   bmi
266       No     106   30.50
327       No     78    36.90
343       No     157   25.60

Figure 2.14: Distribution of log Îsub (boxplots of log Îsub centred at its mean) under the biased split and the uniform split on the Pima Indians dataset. There is greater variance under the biased split compared to the uniform split. The initial partition in the split step influences the Monte Carlo variance in the combine step.

2.8 Conclusion

Parallel processing is a compelling computational pathway for scalable Bayesian inference. Distributed computing platforms facilitate a divide and conquer approach where the Big Data analysis problem can be broken down into a series of smaller conventional analyses. This idea has been explored for posterior simulation. We have investigated divide and conquer Bayesian model selection and found that model selection requires different theory and methods. We have considered two different algorithms that fit the mould prescribed by Figure 2.1. We initially considered a general approach that is applicable to any parametric model. MCMC can be used in the apply stage, with the output then consisting of the posterior draws. The combine step can then be approached as a Bayesian hypothesis testing problem. The posterior output can then be used to estimate a Savage-Dickey density ratio that checks for the global suitability of the model over all subsets.
The curse of dimensionality poses a major hurdle for this algorithm in the combine step. The dimension of the density estimation task increases with the number of subsets s, thus limiting the scalability of Algorithm 2.1. If the original model can be augmented with suitable latent variables, it is possible to obtain a practical algorithm. We propose an embarrassingly parallel algorithm for computing the model evidence that makes use of Gibbs sampling in the apply stage to ease the difficulty of the combine step. The subset Gibbs runs in the apply phase can be undertaken using simple modifications of standard update formulae. The combine step can easily be carried out by aggregating the Gibbs output. Interestingly, the initial split of the dataset strongly influences the Monte Carlo variance in the combine step. If the initial split is very poor, the results in the combine stage will be useless. As such it may be worth exploring adaptive procedures that optimise the data split after a preliminary Gibbs run.

The split-apply-combine methodology using data augmentation may also be of use for posterior sampling. Using the divide and conquer procedure, we generate samples from the subposterior distribution of the latent variables p̃(z1, . . . , zs | y1, . . . , ys). If we can determine the importance ratio p(z1, . . . , zs | y1, . . . , ys)/p̃(z1, . . . , zs | y1, . . . , ys) up to a constant of proportionality, we can use the subposterior samples in a self-normalised importance sampler to target the true posterior. Equation (2.69) showed that the subposterior integral is proportional to the required importance sampling weights:

∫_Ω ∏_{i=1}^{s} p̃(θ | yi, zi) dθ = Isub × p(z1, . . . , zs | y1, . . . , ys) / p̃(z1, . . . , zs | y1, . . . , ys) ∝ p(z1, . . . , zs | y1, . . . , ys) / p̃(z1, . . . , zs | y1, . . . , ys).

Given that we have a closed form expression for the augmented subposterior integral,

∫_Ω ∏_{i=1}^{s} p̃(θ | yi, zi) dθ = [ ∏_{i=1}^{s} c(ν0/s + ni, φ0/s + t(yi, zi), ω0/s) ] / c(ν0 + n, φ0 + Σ_{i=1}^{s} t(yi, zi), ω0),

we can compute the required importance sampling weights. We can then use self-normalised importance sampling in the combine stage to target the full dataset posterior p(θ | y1, . . . , ys). Existing divide and conquer procedures have been criticised for not targeting the exact posterior (Bardenet et al., 2017). Using the Gibbs based methodology here we can target the exact posterior, provided that the model has some conditionally conjugate structure. This is a promising future research direction.

Calculation of the model evidence is an important task, especially so when given masses of data with multiple plausible models. Distributed computing appears to be useful in the pursuit of this goal.

3 Bounding the model evidence using the subsampled sandwich estimator

Summary

We investigate computational methods for estimation of the integrated likelihood on large n datasets. Existing importance sampling techniques can be impractical due to the need for full likelihood evaluations. We develop a novel strategy for estimating the log model evidence using subsampling. We propose to pair a variational lower bound with an upper bound formed using a maximum entropy argument. The enclosing bounds are asymptotically tight as n grows, and permit a reduction in the computational budget. The bounds can be estimated in a computationally efficient manner using subsampling and control variates.
The sandwich approach provides more definitive error measures than other methods for approximating the integrated likelihood. We demonstrate the methods on a large logistic regression dataset. 3.1 Introduction The computational expense of likelihood evaluations is the root cause of many challenges in Bayesian computation for Big Data. Standard iterative algorithms can struggle with large datasets as the O(n) cost per loop stalls the procedure from a practical point of view. A generic goal is to reduce computational cost by allowing for access of only a small minibatch of data per iteration. Most existing algorithms for estimating the integrated likelihood suffer from the likelihood scaling issue, becoming impractical for huge n. We develop a novel strategy for estimating the model evidence on tall datasets, based on sandwiching the log integrated likelihood between upper and lower bounds. Squeezing the log integrated likelihood between encasing bounds is a departure from traditional importance sampling methods that deliver point estimates of the model evidence. Shifting to interval estimation is useful as it eases the integration of subsampling into the algorithm. We find it is necessary to use control variates to obtain a scalable algorithm, a common finding in the Big Data literature. Subsampling has been examined as a technique to reduce the computational cost of Bayesian inference in the fixed model setting, where the primary goal is to generate samples from the posterior distribution (Maclaurin and Adams, 2014; Quiroz et al., 2018). Given a likelihood p(y|θ) and prior p(θ), suppose we wish to sample from the posterior distribution p(θ|y) ∝ p(y|θ)p(θ). Markov chain Monte Carlo (MCMC) has proven to be a versatile computational engine for applied Bayesian inference, facilitating sampling from the posterior p(θ|y) in complex models. As a simple example, consider the Metropolis-Hastings algorithm, described in Algorithm 3.1. The calculation of the acceptance ratio α(θ,θ′) in line 6 requires the full likelihood p(y|θ). The O(n) cost per iteration can render the algorithm impractically slow for Big Data applications. There is a body of work exploring the integration of subsampling into the pseudo- marginal MCMC framework of Andrieu et al. (2009) to address this likelihood cost. The innovation behind the pseudo-marginal idea is that if the likelihood calculations in the acceptance ratio are replaced 45 3. Computing the model evidence using subsampling Algorithm 3.1 Metropolis-Hastings for simulating from the posterior distribution 1: Input θ[0] 2: for b = 1, . . . , B do 3: θ ← θ[b−1] 4: θ′ ∼ g(·|θ) . Sample new proposal from g 5: 6: α(θ,θ′)← p(y|θ ′)p(θ′) p(y|θ)p(θ) × g(θ|θ′) g(θ′|θ) . acceptance ratio calculation 7: 8: u ∼ Uniform(0, 1) 9: if u < α(θ,θ′) then 10: θ[b] ← θ′ . Accept proposal 11: else 12: θ[b] ← θ . Reject proposal 13: end if 14: end for 15: return { θ[1], . . . ,θ[B] } with a non-negative unbiased estimator p̂(y|θ), it is still possible to define a Markov chain that maintains the correct target distribution p(θ|y) ∝ p(θ)p(y|θ). The pseudo-marginal algorithm for tall datasets is described in Algorithm 3.1. The pseudo-marginal algorithm uses an approximate acceptance ratio in line 8. The key point of difference relative to Algorithm 3.1, is that an estimated likelihood now appears in the numerator of the acceptance ratio. 
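A minimal, runnable rendering of Algorithm 3.1 with a symmetric random-walk proposal is given below; the toy logistic regression target and the tuning constants are illustrative assumptions rather than part of the algorithm itself.

import numpy as np

def metropolis_hastings(log_post, theta0, B=5000, step=0.1, rng=None):
    # Random-walk Metropolis-Hastings: with a symmetric Gaussian proposal the
    # proposal ratio g(theta | theta') / g(theta' | theta) in Algorithm 3.1 equals one.
    rng = rng or np.random.default_rng(0)
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    draws = np.empty((B, theta.size))
    for b in range(B):
        prop = theta + step * rng.standard_normal(theta.size)
        lp_prop = log_post(prop)
        # accept with probability min(1, p(y | theta') p(theta') / (p(y | theta) p(theta)))
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        draws[b] = theta
    return draws

# toy target: logistic regression with independent N(0, 1) priors (illustrative only)
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 2))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ np.array([1.0, -0.5]))))

def log_post(beta):
    eta = X @ beta
    return np.sum(y * eta - np.log1p(np.exp(eta))) - 0.5 * np.sum(beta ** 2)

samples = metropolis_hastings(log_post, np.zeros(2))

Every evaluation of log_post touches all n observations, which is precisely the O(n) per-iteration cost discussed above.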
An important detail of the pseudo-marginal algorithm is that previous likelihood estimates need to be recycled if the proposed value θ′ is rejected (see line 16). There has been a large amount of effort in designing almost-surely non-negative unbiased estimators of the likelihood that use subsampling (Bardenet et al., 2017; Quiroz et al., 2016). Although the per iteration cost decreases when using a subsampled pseudo-marginal algorithm compared the the standard Metropolis-Hastings implementation, the autocorrelation in the chain increases with the variance of the likelihood estimator. A highly variable likelihood estimator can potentially reduce the effective sample size per unit time relative to the standard Metropolis-Hastings algorithm (Maclaurin and Adams, 2014). Algorithm 3.2 Pseudo-marginal Metropolis-Hastings using subsampling 1: Input θ[0], p(y|θ)[0] 2: for b = 1, . . . , B do 3: θ ← θ[b−1] 4: p̂(y|θ)← p̂(y|θ)[b−1] 5: θ′ ∼ g(·|θ) . Sample new proposal from g 6: Estimate likelihood p̂(y|θ′) . Use subsample of full dataset 7: 8: α̂(θ,θ′)← p̂(y|θ ′)p(θ′) p̂(y|θ)p(θ) × g(θ|θ′) g(θ′|θ) . Acceptance ratio calculation 9: 10: u ∼ Uniform(0, 1) 11: if u < α̂(θ′,θ) then 12: θ[b] ← θ′ . Accept proposal 13: p̂(y|θ)[b] ← p̂(y|θ′) . Update likelihood estimate 14: else 15: θ[b] ← θ . Reject proposal 16: p̂(y|θ)[b] ← p̂(y|θ)[b−1] . Retain likelihood estimate 17: end if 18: end for 19: return { θ[1], . . . ,θ[B] } Subsampling has also been explored as a technique to reduce the per epoch cost of other stochas- tic simulation methods. Stochastic gradient Langevin dynamics (Welling and Teh, 2011) bypasses the accept reject step entirely, the iterative algorithm uses subsampled estimates of the gradient of the log 46 3.1. Introduction likelihood for computationally cheap individual steps. The Zig-Zag sampler (Bierkens et al., 2016) and the scalable Langevin exact sampler (Pollock et al., 2016) are both based on discretisations of continuous time stochastic processes and use subsampled estimates of the gradient of the log likelihood. Unbiased estimation of the log scale is significantly easier than unbiased estimation of likelihoods, and this is seen as a competitive advantage of non pseudo-marginal based methods (Bardenet et al., 2017). The mechanics of the aforementioned large n samplers are varied and technical and will not be reviewed here. We will proceed assuming that we have a large dataset and have obtained samples from the posterior p(θ|y) using a suitable algorithm. The task of interest is to estimate the integrated likelihood p(y) = ∫ p(y|θ)p(θ) dθ. Given posterior samples, it is common to use importance sampling based methods for estimating the integrated likelihood p(y) (Friel and Wyse, 2012). Importance sampling methods for computing the evidence typically require repeated additional full likelihood evaluations, making them unsuited to large n problems. This is a roadblock for carrying out Bayesian model selection on tall datasets. We will argue that it is difficult to adapt existing importance sampling methods to use subsampling effectively. Faster methods for approximating the integrated likelihood include the Laplace approximation, the Laplace-Metropolis estimator, and variational Bayesian methods. Although computationally cheap, it is difficult to bound the approximation error of these faster methods. The emergence of tall datasets has led to a shift in the procedures used for posterior sampling. Estimation of the model evidence may also require a similar paradigm shift. 
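The skeleton below mirrors the structure of Algorithm 3.2. The function log_lik_hat is an assumed user-supplied routine returning the log of a nonnegative unbiased estimate of p(y|θ) built from a subsample (for example via the estimators of Section 3.3.2); the structural point is that the stored estimate is recycled whenever a proposal is rejected.

import numpy as np

def pseudo_marginal_mh(log_lik_hat, log_prior, theta0, B=5000, step=0.1, rng=None):
    # Pseudo-marginal Metropolis-Hastings with a symmetric random-walk proposal.
    # log_lik_hat(theta, rng) must return the log of a nonnegative unbiased
    # estimate of the likelihood p(y | theta).
    rng = rng or np.random.default_rng(0)
    theta = np.asarray(theta0, dtype=float)
    log_lik = log_lik_hat(theta, rng)              # initial estimate
    draws = np.empty((B, theta.size))
    for b in range(B):
        prop = theta + step * rng.standard_normal(theta.size)
        log_lik_prop = log_lik_hat(prop, rng)      # fresh estimate at the proposal only
        log_alpha = (log_lik_prop + log_prior(prop)) - (log_lik + log_prior(theta))
        if np.log(rng.uniform()) < log_alpha:
            theta, log_lik = prop, log_lik_prop    # accept: keep the new estimate
        # on rejection the current theta and its stored likelihood estimate are retained
        draws[b] = theta
    return draws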
Historically, estimation of the integrated likelihood is performed using a combination of posterior simulation and asymptotic approximation (DiCiccio et al., 1997). We also use these techniques but with the large n dynamics of the posterior distribution in mind. Underlying our approach is a less commonly used representation of the log model evidence. The log model evidence log p(y) can be written as an information criterion type score involving a goodness of fit term and a penalty term for model complexity. Starting from Bayes’ theorem, we easily obtain the so called ‘Candidiate’s formula’ an important identity relating the model evidence to the posterior and prior densities (Besag, 1989). For any ordinate θ ∈ Ω we have the identity : log p(y) = log p(y|θ) + log p(θ)− log p(θ|y) (3.1) = log p(y|θ)− log p(θ|y) p(θ) . (3.2) Taking the Candidate’s formula (3.2) one step further, we can integrate both sides over the posterior distribution to obtain the following decomposition of the log model evidence:∫ log p(y)p(θ|y) dθ = ∫ p(θ|y) log p(y|θ) dθ − ∫ p(θ|y) log p(θ|y) p(θ) dθ. log p(y) = Ep(θ|y)[log p(y|θ)]−D(p(θ|y) ‖ p(θ)). (3.3) The identity (3.3) is not an original result, however it seems to have attracted very little attention in the existing literature on Bayesian model selection (Robert, 2007; Gelman et al., 2014; Kass and Raftery, 1995; Gelfand and Dey, 1994; Raftery, 1995). The first term Ep(θ|y)[log p(y|θ)] is the expected log likelihood over the posterior distribution. This can be characterised as a Bayesian measure of goodness of fit (Dempster, 1997; Spiegelhalter et al., 2002). The second term D(p(θ|y) ‖ p(θ)) represents the Kullback- Leibler divergence of the prior from the posterior. This can be seen as a penalty term that captures model complexity and prior regularisation simultaneously. We propose to estimate the goodness of fit term using subsampling, and to bound the penalty term using elementary information theory. The goodness of fit term and the bounds on the penalty can be estimated using a posterior samples and subsampled evaluations of the log likelihood. Overall we obtain a computationally efficient interval estimator of the log model evidence that avoids repeated O(n) likelihood evaluations. Asymptotic arguments suggest that the evidence bounds may be sufficient to rank competing models confidently in the large n regime. The general idea of introducing bounded error in order to drastically cut the computational budget has also 47 3. Computing the model evidence using subsampling been explored for posterior simulation (Bardenet et al., 2014; Korattikara et al., 2014), and we feel it is natural to follow this avenue for estimation of the model evidence. The sandwich approach reduces the computational burden of estimating the integrated likelihood while still giving confidence measures on the quality of the estimate. Section 3.2 reviews asymptotic properties of Bayesian model selection and develops the upper and lower bounds on the penalty term. Section 3.3 reviews how subsampling can be used to estimate likelihoods and recaps existing methods for estimating the model evidence. Section 3.4 discusses the utility of our proposed evidence bounds for tall datasets. We test our proposed methodology in a logistic regression data application in Section 3.5. 3.2 Bayesian model selection Suppose that we have M competing parametric models M1, . . . ,MM . Each model j = 1, . . . ,M has parameter θj which takes values in parameter space Ωj of dimension pj . 
We assume the data y = (y1, . . . ,yn) consists of n independently and identically distributed observations from some distribution f0. We take the M-open viewpoint, in that the true generative model f0 may not be in the set of candidate models M1, . . . ,MM . Given two models Mj and Mk, let logBjk denote the log Bayes factor in favour of model j: logBjk = log p(yi|Mj)− log p(y|Mk). (3.4) The asymptotic behaviour of logBjk is of direct interest given our focus on tall data. For each model Mj we can define a measure of closeness to the true model f0 using the Kullback-Leibler divergence. Let dKL(f0, g) denote the Kullback-Leibler divergence of a candidate density g from the true generate model f0. Formally, dKL(f0, g) = ∫ f0 log f0 g . For each model Mj we can find a density p(yi|θ∗j ,Mj) such that θ∗j satisfies θ∗j := arg min θj∈Ωj dKL(f0, p(yi|θj ,Mj)). (3.5) The density p(yi|θ∗j ) is the closest density to the true model f0 in the parametric family associated with modelMj . We can characterise this as the best approximation to the truth attainable under modelMj . Let djKL denote the divergence from the truth for model j, that is djKL = dKL(f0, p(yi|θ∗j )). (3.6) Under regularity conditions, as n tends to infinity, the limiting Bayes factors between two competing models j and k will be controlled by djKL and d k KL. The asymptotic behaviour of the posterior distribution on models can be described by the asymptotic behaviour of the set of Bayes factors. In the M-open viewpoint pragmatic consistency of the posterior distribution amounts to ensuring that the posterior distribution concentrates on the best possible model in the candidate set. Bayes factor consistency is described more formally in Definition 3.1. Definition 3.1 (Pragmatic Bayes Factors Consistency (Walker et al., 2004)). Consider two models Mj and Mk. Model j has pj parameters and Model k has pk parameters. Let djKL and dkKL be defined as in (3.6). We say that the Bayes factors achieve pragmatic consistency if they exhibit the following asymptotic behaviour as n→∞ for all j, k ∈ {1, . . . ,M}. (a) Suppose that djKL < d k KL. Then we require that logBjk →∞ (a) Suppose that djKL > d k KL. Then we require that logBjk → −∞ 48 3.2.1. Evidence bounds 2 logBjk Bjk Strength of evidence 0 to 2 1 to 3 Not worth more than a bare mention 2 to 6 3 to 20 Positive 6 to 10 20 to 150 Strong >10 >150 Very strong Table 3.1: Guidelines for the interpretation of Bayes factors (Kass and Raftery, 1995). Given two modelsMjand Mk, the quantity Bjk gives the Bayes factor in favour of model j (recall equation (3.4)). (a) Suppose that djKL = d k KL and model j is nested in model k, so pj < pk. Then we require that logBjk →∞. The case of non-nested models with djKL = d k KL is more complicated, and there does not seem to be a general theory for this situation (Casella et al., 2009). We assume that this we do not encounter this problem for the sake of simplicity. Pragmatic Bayes factor consistency requires that asymptotically all Bayes factors tend to negative infinity or positive infinity. Asymptotically, the posterior distribution of models concentrates tightly on the best model in the candidate set, even under misspecification. Under relatively mild conditions it is possible to show that pragmatic Bayes factor consistency holds for a wide range of problems (Chib and Kuffner, 2016; Chatterjee et al., 2018). Key references for nested models include Schwarz (1978) and Gelfand and Dey (1994). 
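A toy illustration of this large n behaviour is the conjugate normal-mean problem, where the model evidence is available exactly through the Candidate's formula (3.1), so the growth of the log Bayes factor with n can be computed directly; here the log Bayes factor reduces to a Savage-Dickey style ratio of the prior and posterior densities at zero. The simulation settings below are arbitrary assumptions.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def log_bf_21(y, tau2=1.0):
    # log Bayes factor for M2: y_i ~ N(mu, 1), mu ~ N(0, tau2) against M1: y_i ~ N(0, 1).
    # The Candidate's formula at the ordinate mu = 0 gives
    #   log p(y | M2) = log p(y | mu = 0) + log p(mu = 0) - log p(mu = 0 | y),
    # and the log p(y | mu = 0) term cancels against log p(y | M1).
    n = len(y)
    post_var = tau2 / (1.0 + n * tau2)
    post_mean = tau2 * y.sum() / (1.0 + n * tau2)
    return norm.logpdf(0.0, 0.0, np.sqrt(tau2)) - norm.logpdf(0.0, post_mean, np.sqrt(post_var))

# a small mean shift favours M2, and the log Bayes factor grows steadily with n
y_full = rng.normal(0.1, 1.0, size=100_000)
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(log_bf_21(y_full[:n]), 1))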
We proceed assuming that the collection of models of interest and data generating process is such that pragmatic Bayes factor consistency holds. The practical interpretation of the consequences of Definition 3.1 is that we expect to see very large Bayes factors when comparing models in the large n regime. Under mild conditions the gaps between the log evidence scores are expected to widen as n grows. From an inferential point of view, this large sample dynamic needs to be taken into account when interpreting Bayes factors on tall datasets. Kass and Raftery (1995) present a table providing guidelines on the interpretation of Bayes factor, reproduced here as Table 3.1. Bayes factors and p-values do have some commonality in terms of large sample behaviour. The American Statistical Association’s statement on p-values drew attention to the fact that it is important to consider the sample size when weighing the evidence provided by a small p-value (Wasserstein and Lazar, 2016). When n is large, practically insignificant deviations from the null hypothesis can lead to very small p-values. This same phenomenon will affect Bayes factors. The log Bayes factors for two competing models are expected to tend to −∞ or ∞ as n increases, even when the two models differ in an practically inconsequential manner. For tall datasets models may become separated by larger factors than what is considered in Table 3.1. As n increases, exact calculation of the integrated likelihood becomes more computationally demand- ing. This problem is not insurmountable as the asymptotic behaviour of the posterior distribution on models works in our favour, assuming that we take Definition 3.1 as realistic. As sample sizes increase we expect to see wider margins between the log evidence values within a collection of models. From a decision theoretic point of view, exact calculation of the integrated likelihood becomes less necessary as n increases. We can still rank models confidently given interval estimates of the integrated likelihood if the interval widths are small compared to the true differences in Bayes factors. This large sample behaviour provides some leeway to carry out computationally efficient Bayesian model selection on tall datasets. This is assumption is used to motivate our evidence sandwich approach. 3.2.1 Evidence bounds An important assumption for our bound to hold is that the parameter θ has unconstrained support in Rd. It may be necessary to reparameterise the original statistical model in order to meet this condition. 49 3. Computing the model evidence using subsampling For example, variance components and proportions can be mapped to the real line using the log and logistic transform respectively. We believe this assumption to not be overly restrictive, as a wide range of statistical models can be transformed to an unconstrained parameterisation. The probabilistic pro- gramming language Stan (Carpenter et al., 2017) internally transforms all user declared models to an unconstrained parameterisation in order to improve the efficiency of the underlying Hamiltonian Monte Carlo algorithms. The default transformations are listed in the Stan manual (Stan Development Team, 2018). From here on in we assume that θ has unconstrained support in Rd. 3.2.2 Entropy The differential entropy represents the amount of uncertainty associated with a continuous random vector. 
For a continuous random vector X with probability density p(x), the differential entropy is defined as differential entropy of X = − ∫ p(x) log p(x) dx. (3.7) The differential entropy is not bounded, taking values in (−∞,∞). The differential entropy of a multi- variate normal random variable is a simple function of the covariance matrix. Lemma 3.1. Let X be a d-dimensional multivariate normal random vector X ∼ N(µ,Σ), where Σ is of full rank. The differential entropy of X is given by d 2 + d 2 log (2pi) + 1 2 log(det(Σ)). The result is easily established using properties of quadratic forms. The determinant of the covariance matrix has an interpretation as a scalar measure of the overall variability associated with a multidimen- sional random vector. The multivariate normal distribution has a maximum entropy property for random vectors in Rd. Theorem 3.1. Let X be a random vector in Rd with mean µ and full rank covariance matrix Σ. The maximum entropy probability distribution amongst all candidate distributions for X with mean µ and covariance matrix Σ is the multivariate normal distribution N(µ,Σ). Proof: Let q(x) denote some distribution on Rd with mean µ and covariance matrix Σ. Let p(x) = N(µ,Σ). Let H(q) denote the differential entropy of q(x) and let H(p) denote the differential entropy of p(x). Let D(q(x) ‖ p(x)) denote the Kullback-Leibler divergence of p from q. The Kullback-Leibler divergence is positive so 0 ≤ D(q(x) ‖ p(x)) = ∫ q(x) log q(x) p(x) dx = ∫ q(x) log q(x) dx− ∫ q(x) log p(x) dx = −H(q)− ∫ q(x) ( −d 2 log(2pi)− 1 2 log(det(Σ))− 1 2 (x− µ)TΣ−1(x− µ) ) dx = −H(q) + d 2 log(2pi) + 1 2 log(det(Σ)) + d 2 = −H(q) +H(p). The step in the second last line follows from properties of quadratic forms and the fact that q(x) has the same first and second moments as p(x). The final line establishes the desired result that H(p) ≥ H(q). As D(q(x) ‖ p(x)) = 0 if and only if q(x) is a multivariate normal distribution, H(q) = H(p) if and only if q(x) = N(µ,Σ). 50 3.2.3. Upper bounding the evidence It follows from Theorem 3.1 that given some random d-dimensional vector X with mean µ, full rank covariance matrix Σ and unknown probability density q(x), the differential entropy has the upper bound, − ∫ q(x) log q(x) dx ≤ d 2 + d 2 log (2pi) + 1 2 log(det(Σ)). (3.8) Equality holds if and only if X ∼ N(µ,Σ). We now show how the entropy bound can be used to approximate the model evidence. 3.2.3 Upper bounding the evidence The penalty term in (3.3) involves a posterior expectation and the negative posterior entropy: D(p(θ|y) ‖ p(θ)) = p(θ|y) log p(θ|y)− Ep(θ|y) [log p(θ)] . (3.9) We can lower bound the penalty term by using the entropy upper bound (3.8). Let Σθ|y be the posterior covariance matrix of θ. Throughout we make the mild assumption that Σθ|y is of full rank. The entropy bound (3.8) gives that∫ p(θ|y) log p(θ|y) dθ ≥ d 2 + d 2 log (2pi) + 1 2 log(det(Σθ|y)). The bound will be very tight when the posterior distribution of θ is approximately normal. Posterior normality is a plausible assumption when the sample size is large (Van Der Vaart, 1998, Chapter 10). We subsequently obtain the lower bound on the penalty term, D(p(θ|y) ‖ p(θ)) ≥ d 2 + d 2 log (2pi) + 1 2 log(det(Σθ|y))− Ep(θ|y) [log p(θ)] . (3.10) We now upper bound the penalty term. 3.2.4 Lower bounding the evidence Variational Bayesian inference is a generic family of methods for approximate Bayesian inference (Blei et al., 2017). 
Variational Bayesian methods introduce an approximate posterior distribution q(θ) where q(θ) is chosen to maximise a lower bound on the log model evidence log p(y). An important identity is that for any distribution q(θ), log p(y) = Eq(θ) [log p(y|θ)]−D(q(θ) ‖ p(θ)) +D(q(θ) ‖ p(θ|y)). (3.11) The proof involves a similar line of reasoning that we used to arrive at (3.3). Starting from the ‘Candi- date’s formula’ (3.2) we integrate over the arbitrary distribution q(θ). log p(y) = log p(y|θ) + log p(θ) p(θ|y) = ∫ q(θ) [ log p(y|θ) + log p(θ) p(θ|y) ] dθ = Eq(θ) [log p(y|θ)] + Eq(θ) [log p(θ)] + ∫ q(θ) log 1 p(θ|y) dθ = Eq(θ) [log p(y|θ)] + Eq(θ) [log p(θ)] + ∫ q(θ) log 1 p(θ|y) dθ+∫ q(θ) log q(θ) dθ − ∫ q(θ) log q(θ) dθ = Eq(θ) [log p(y|θ)] + Eq(θ) [log p(θ)]− Eq(θ) [log q(θ)] + ∫ q(θ) log q(θ|y) p(θ) dθ = Eq(θ) [log p(y|θ)] + Eq(θ) [log p(θ)]− Eq(θ) [log q(θ)] +D(q(θ) ‖ p(θ|y)) (3.12) = Eq(θ) [log p(y|θ)]−D(q(θ)‖p(θ)) +D(q(θ) ‖ p(θ|y)). (3.13) 51 3. Computing the model evidence using subsampling Now as D(q(θ) ‖ p(θ|y)) is positive, dropping the term from (3.13) gives a lower bound on the log model evidence. log p(y) ≥ Eq(θ) [log p(y|θ)]−D(q(θ) ‖ p(θ)). (3.14) Now substituting in (3.3) Ep(θ|y)[log p(y|θ)]−D(p(θ|y) ‖ p(θ)) ≥ Eq(θ) [log p(y|θ)]D(q(θ) ‖ p(θ)). This then gives an upper bound on the penalty term: D(p(θ|y) ‖ p(θ)) ≤ Ep(θ|y)[log p(y|θ)]− Eq(θ) [log p(y|θ)] +D(q(θ) ‖ p(θ)). (3.15) The upper bound on the penalty term will be tight when D(q(θ) ‖ p(θ|y)) is small. This is the same gap that appears in the usual variational evidence lower bound. 3.2.5 Sandwiching the evidence Substituting the lower bound on the penalty term (3.10) into (3.3) gives an upper bound on the evidence: log p(y) ≤ Ep(θ|y) [log p(y|θ)] + Ep(θ|y) [log p(θ)] + d 2 + d 2 log (2pi) + 1 2 log(det(Σθ|y)). (3.16) Substituting the upper bound on the penalty term (3.10) into (3.3) gives the usual variational lower bound on the model evidence log p(y) ≥ Eq(θ) [log p(y|θ)] + Eq(θ) [log p(θ)]− Eq(θ) [log q(θ)] . (3.17) If the posterior distribution is normally distributed, and we use a Gaussian variational distribution both the upper and lower bounds will be tight. As the integrated likelihood is invariant to reparameterisations it may be worth searching for effective normalising transforms of θ. We will argue that subsampling can be used to estimate the bounds efficiently. Before developing our methodology further we review important existing related work. 3.3 Related work In this section we review on subsampling for tall datsets and existing methods for calculating the in- tegrated likelihood. As mentioned in the introduction, the pseudo-marginal MCMC algorithm requires unbiased estimation of the likelihood from subsamples. Existing work builds an unbiased estimator, or asymptotically unbiased estimator of the likelihood from unbiased estimators of the log likelihood (Quiroz et al., 2016; Bardenet et al., 2017; Quiroz et al., 2018). We focus on Monte Carlo estimators of the integrated likelihood that can be implemented given posterior samples and some generic closed form approximations. More sophisticated estimators of the integrated likelihood such as nested sampling (Skilling et al., 2006) or the power posterior method (Friel and Pettitt, 2008) are not considered here as they require more specialised MCMC implementations. 3.3.1 Subsampled log likelihoods Unbiased estimation of the likelihood using subsamples is difficult as the likelihood is a product of n terms. 
Unbiased estimation of the log likelihood, `(θ) = log p(y|θ), is significantly easier as log p(y|θ) is a sum over n log likelihood contributions, log p(y|θ) = n∑ i=1 log p(yi|θ). 52 3.3.1. Subsampled log likelihoods The simplest approach is to take a uniform random sample of size m, where we assume m  n. Let S denote the set of subsampled indices, so S ⊂ {1, . . . , n} and |S| = m. Then a simple estimator of the log likelihood at θ is ̂` simple(θ) = n m ∑ i∈S log p(yi|θ). Although easy to understand and implement, this estimator scales very poorly with n. Let vn(θ) denote the population variance of the log likelihood values evaluated at θ over the n observations. The variance of the estimator is seen to be var ̂`simple(θ) = n2 m vn(θ). It is reasonable to expect vn(θ) to stabilise around a constant as n increases. Thus, in order to control the variance as n increases, we have to increase the batch size m quadratically with n. This is undesirable as we would like the computational cost of estimating the integrated likelihood to be sublinear in n. To reduce the variance of the log likelihood estimator we can employ control variates, a generic technique for reducing Monte Carlo variance. Control variates have been identified as a critical tool in many Big Data subsampling algorithms (Baker et al., 2017). Suppose we have a Monte Carlo estimator Z such that E[Z] = µ, where µ is the unknown quantity of interest. Suppose that we can define another random variable W on the same probability space with known expectation τ . Let α denote some constant. We can define a new estimator ZCV that makes use of the auxiliary control variate W : ZCV = Z + α(W − τ). The estimator ZCV is also an unbiased estimator for µ for any choice of constant α. The variance of the new estimator is var(ZCV ) = var(Z) + α 2var(W ) + 2α× cov(Z,W ). (3.18) If Z and W are correlated, and α is chosen appropriately, var(ZCV ) can be much smaller than var(Z). The full dataset log likelihood at θ is the sum of n contributions log p(y|θ) = ∑ni=1 log p(yi|θ) = ∑ni=1 `i(θ). The difference estimator is a special control variate scheme for estimating population totals from a subsample (Sarndal et al., 1992). The difference estimator introduces an approximate log likelihood contribution for each observation ̂`i(θ) for i = 1, . . . , n. As in Bardenet et al. (2017) and Quiroz et al. (2018) we consider the use of Taylor series approximations about the maximum likelihood estimate to form the log likelihood approximations ̂`i(θ) for i = 1, . . . , n. The control variates can reduce the variance compared to the simple estimator ̂`simple(θ). Bardenet et al. and Quiroz et al. both consider second order approximations. We also consider a first order approximation, as computing the Hessian matrix can be challenging for high-dimensional statistical models. Let θ̂ denote the maximum likelihood estimate of θ. Let gi denote the gradient of the log likelihood evaluated at θ̂ for observation i. Likewise, let Hi denote the Hessian matrix of the log likelihood contribution of observation i, evaluated at the maximum likelihood estimate. For i = 1, . . . , n: gi = ∇ log p(yi|θ)|θ=θ̂ , Hi = ∇2 log p(yi|θ) ∣∣ θ=θ̂ . Let g denote the full dataset gradient so g = ∑n i=1 gi. Let H give the full dataset Hessian matrix, so H = ∑n i=1Hi. For i = 1, . . . , n the first order approximation to the log likelihood contribution is given by ̂` i,1(θ) = `(θ̂) + (θ − θ̂)Tgi. (3.19) For i = 1, . . . 
For i = 1, . . . , n the second order approximation to the log likelihood contribution is given by

ℓ̂_{i,2}(θ) = ℓ_i(θ̂) + (θ − θ̂)ᵀ g_i + (1/2)(θ − θ̂)ᵀ H_i (θ − θ̂).   (3.20)

The first order log likelihood estimator ℓ̂_gradient(θ) is defined as

ℓ̂_gradient(θ) = log p(y|θ̂) + (n/m) Σ_{i∈S} [ℓ_i(θ) − ℓ̂_{i,1}(θ)].   (3.21)

The second order log likelihood estimator ℓ̂_hessian(θ) is defined as

ℓ̂_hessian(θ) = log p(y|θ̂) + (1/2)(θ − θ̂)ᵀ H (θ − θ̂) + (n/m) Σ_{i∈S} [ℓ_i(θ) − ℓ̂_{i,2}(θ)].   (3.22)

The variance of ℓ̂_gradient and ℓ̂_hessian is a function of the remainder terms in the Taylor series approximations (3.19) and (3.20). Qualitatively, the smaller the remainder terms, the smaller the variance. We analyse the asymptotic variance of the estimators in the Appendix to this chapter. We make a heuristic argument, under mild assumptions, that the variance of ℓ̂_gradient is O_p(m⁻¹) and the variance of ℓ̂_hessian is O_p(n⁻¹). Stochastic notation is used to account for the fact that we take θ to be a stochastic sequence, as we are treating θ as a sample from a posterior distribution.

3.3.2 Subsampled likelihoods

For simplicity suppose we have n independent observations y = (y_1, . . . , y_n). Unbiased estimation of the likelihood p(y|θ) = Π_{i=1}^{n} p(y_i|θ) is difficult as it is a product over n terms. The Poisson estimator (Wagner, 1987; Papaspiliopoulos, 2009) builds an unbiased estimator of the likelihood from unbiased estimators of the log likelihood. We assume that we use one of the subsampling based estimators of the log likelihood described in the previous subsection. Let (ℓ̂_j(θ)), j ∈ N, be a sequence of independent and identically distributed unbiased estimators of the log likelihood. The basic Poisson estimator is defined as the randomised product

p̂(y|θ) = exp(λ) Π_{j=1}^{J} [ℓ̂_j(θ)/λ],   where J ∼ Poisson(λ).   (3.23)

Taking iterated expectations shows the estimator is unbiased. We take expectations over the subsampling indices S for the log likelihood estimators, and over the random number of product terms J:

E_{S,J}[p̂(y|θ)] = E_J[ E_{S|J}[p̂(y|θ) | J = j] ]
              = E_J[ (exp(λ)/λ^j) Π_{k=1}^{j} E_{S|J}[ℓ̂_k(θ)] ]
              = E_J[ (exp(λ)/λ^j) ℓ(θ)^j ]
              = Σ_{j=0}^{∞} Pr(J = j) (exp(λ)/λ^j) ℓ(θ)^j
              = Σ_{j=0}^{∞} (1/j!) exp(−λ) λ^j × (exp(λ)/λ^j) ℓ(θ)^j
              = Σ_{j=0}^{∞} ℓ(θ)^j / j!
              = exp(ℓ(θ)) = p(y|θ).

The second line relies on the unbiasedness and the independence of the log likelihood estimators ℓ̂_1(θ), . . . , ℓ̂_j(θ). The expansion of Pr(J = j) uses the probability mass function of the Poisson distribution, and the final step uses the power series expansion exp(z) = Σ_{i=0}^{∞} z^i/i!.

The value of λ sets the expected computational cost of the Poisson estimator. Assuming the unbiased log likelihood estimators use a subsample of size m, the expected number of likelihood evaluations per use is mλ. In practice it is common to introduce an extra scalar tuning parameter ω to reduce the variance. The modified estimator is again a randomised product:

exp(λ + ω) Π_{j=1}^{J} [(ℓ̂_j(θ) − ω)/λ],   J ∼ Poisson(λ).   (3.24)

The modified Poisson estimator is also unbiased. From the general result given by Papaspiliopoulos (2009), for a fixed λ the choice of ω giving the minimum variance is ω = log p(y|θ) − λ. The minimum variance tuning parameter therefore requires the unknown log likelihood at θ.
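A minimal sketch of the Poisson estimator construction is given below, reusing ell_gradient() and theta from the previous sketch; lambda and omega are the tuning parameters in (3.23) and (3.24). Note that for tall datasets the likelihood itself underflows in double precision, so the sketch only illustrates the construction of the randomised product.

# Randomised product estimator; ell_estimator() returns an unbiased estimate of
# the log likelihood, lambda and omega are the tuning parameters in (3.23)-(3.24)
poisson_estimator <- function(ell_estimator, lambda, omega = 0) {
  J <- rpois(1, lambda)
  if (J == 0) return(exp(lambda + omega))
  factors <- replicate(J, ell_estimator()) - omega
  exp(lambda + omega) * prod(factors / lambda)
}

# Example usage with ell_gradient() from the previous sketch; omega is centred
# near a rough estimate of log p(y | theta) minus lambda
lambda <- 5
omega  <- ell_gradient(theta, m = 100) - lambda
p_hat  <- poisson_estimator(function() ell_gradient(theta, m = 100), lambda, omega)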
An important issue is that both the simple Poisson estimator and the modified Poisson estimator can be negative. The modified estimator is almost surely nonnegative if ω is chosen such that ℓ̂_j(θ) > ω holds almost surely. It can be difficult to determine such a bound analytically, and a conservative choice of ω can inflate the Monte Carlo variance (Quiroz et al., 2018). A general result is that without additional information about the log likelihood there is no algorithm that takes unbiased estimators of the log likelihood and outputs an almost surely nonnegative estimator of the likelihood (Jacob et al., 2015). An alternative subsampled likelihood estimator is the estimator proposed in Rhee and Glynn (2015). The Rhee and Glynn estimator also uses unbiased estimators of the log likelihood in a randomised product, and a lower bound on the likelihood is also required to ensure a nonnegative estimator. In general it is difficult to construct unbiased and almost surely nonnegative estimators of the likelihood that use subsampling.

3.3.3 Importance sampling

The model evidence can be viewed as the expected likelihood over the prior distribution, p(y) = ∫ p(y|θ) p(θ) dθ. A direct approach for estimating the model evidence is to take W samples from the prior distribution, θ^[1], . . . , θ^[W], and to set

p̂(y) = (1/W) Σ_{w=1}^{W} p(y|θ^[w]).   (3.25)

Although simulation consistent, this strategy is typically highly inefficient (Raftery, 1996; Friel and Wyse, 2012). The likelihood will concentrate in a small region of the parameter space supported by the prior, and as such a small fraction of the prior samples will dominate the sum. This leads to the estimator (3.25) having very high variance.

Importance sampling estimators express the model evidence as an integral over some importance distribution q(θ) as opposed to an integral over the prior distribution p(θ). The identity at the heart of this strategy is

p(y) = ∫ p(y|θ) p(θ) dθ = ∫ [p(y|θ) p(θ)/q(θ)] q(θ) dθ.

Given R samples from q(θ), denoted θ_q^[1], . . . , θ_q^[R], the importance sampling estimator is

p̂(y) = (1/R) Σ_{r=1}^{R} p(y|θ_q^[r]) p(θ_q^[r]) / q(θ_q^[r]).   (3.26)

If the importance distribution q(θ) is exactly equal to the posterior distribution p(θ|y), then the Monte Carlo estimator (3.26) has zero variance. In this ideal scenario q(θ) = p(y|θ)p(θ)/p(y) and the estimator reduces to

p̂(y) = (1/R) Σ_{r=1}^{R} p(y|θ_q^[r]) p(θ_q^[r]) / q(θ_q^[r]) = (1/R) Σ_{r=1}^{R} p(y|θ_q^[r]) p(θ_q^[r]) × p(y) / [p(y|θ_q^[r]) p(θ_q^[r])] = (1/R) Σ_{r=1}^{R} p(y) = p(y).

Gelfand and Dey (1994) propose to set the importance distribution q(θ) to be a normal distribution with mean and covariance matrix given by the sample mean and covariance of the posterior samples θ^[1], . . . , θ^[B]. This is a popular choice in practice.

3.3.4 Harmonic mean estimator

The harmonic mean estimator (Newton and Raftery, 1994) uses posterior samples to estimate the marginal likelihood. The harmonic mean estimator is based on the identity

E_{p(θ|y)}[1/p(y|θ)] = ∫ [1/p(y|θ)] p(θ|y) dθ = ∫ p(y|θ) p(θ) / [p(y|θ) p(y)] dθ = (1/p(y)) ∫ p(θ) dθ = 1/p(y).

Given B posterior samples θ^[1], . . . , θ^[B], a simulation consistent estimator of the model evidence is then

p̂(y) = [ (1/B) Σ_{b=1}^{B} 1/p(y|θ^[b]) ]⁻¹.

An advantage of the harmonic mean estimator is that it is readily computable if the posterior samples were generated using a Metropolis-Hastings algorithm (e.g. Algorithm 3.1). If the likelihood ratio calculations are saved at each iteration, the harmonic mean estimator can be computed with very little additional cost after the posterior sampling run. In the tall data setting, it is perhaps unlikely that full likelihood evaluations have been made in the process of generating θ^[1], . . . , θ^[B], so this benefit may not be present. The harmonic mean estimator can also have infinite variance, and can exhibit a large finite sample bias (Friel and Wyse, 2012).
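The sketch below shows how the Gelfand and Dey (1994) importance sampling estimator and the harmonic mean estimator might be computed on the log scale from a matrix of posterior draws. theta_samples (a B x d matrix), log_lik() and log_prior() are assumed, illustrative objects, and the mvtnorm package supplies the normal importance density.

library(mvtnorm)

log_sum_exp <- function(x) max(x) + log(sum(exp(x - max(x))))

# Normal importance distribution fitted to the posterior draws (Gelfand and Dey, 1994)
mu_hat    <- colMeans(theta_samples)
Sigma_hat <- cov(theta_samples)
R         <- 5000
theta_q   <- rmvnorm(R, mu_hat, Sigma_hat)

# Importance sampling estimator (3.26) on the log scale
log_w <- apply(theta_q, 1, log_lik) + apply(theta_q, 1, log_prior) -
  dmvnorm(theta_q, mu_hat, Sigma_hat, log = TRUE)
log_evidence_is <- log_sum_exp(log_w) - log(R)

# Harmonic mean estimator from the posterior draws themselves
log_lik_post    <- apply(theta_samples, 1, log_lik)
log_evidence_hm <- -(log_sum_exp(-log_lik_post) - log(nrow(theta_samples)))

Working on the log scale with a log-sum-exp avoids the underflow that would occur if the likelihoods were exponentiated directly on a tall dataset.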
3.3.5 Bridge sampling

Bridge sampling (Bennett, 1976; Meng and Wong, 1996) also uses an importance distribution q(θ), but is based on a slightly more complicated identity. We first start with the relationship

1 = ∫ p(y|θ) p(θ) a(θ) q(θ) dθ / ∫ p(y|θ) p(θ) a(θ) q(θ) dθ,

for some bridge function a(θ). Multiplying both sides of the equation by p(y) gives

p(y) = ∫ p(y|θ) p(θ) a(θ) q(θ) dθ / ∫ [p(y|θ) p(θ)/p(y)] a(θ) q(θ) dθ = E_{q(θ)}[p(y|θ) p(θ) a(θ)] / E_{p(θ|y)}[a(θ) q(θ)].

Suppose we have R samples from the importance density q(θ), denoted θ_q^[1], . . . , θ_q^[R], and B samples from the posterior distribution, θ^[1], . . . , θ^[B]. Bridge sampling requires the choice of importance distribution q(θ) and bridge function a(θ). There are theoretical arguments for taking q(θ) to resemble the posterior distribution; a common approach is to make a normal approximation. Let s_1 = R/(R + B) and s_2 = B/(R + B). The optimal choice of bridge function in terms of minimising mean square error is

a_opt(θ) = 1 / [s_1 p(y|θ) p(θ) + s_2 p(y) q(θ)].

When setting a(θ) = 1/q(θ), bridge sampling reduces to ordinary importance sampling. The optimal bridge function is unattainable as it includes the unknown p(y). Meng and Wong propose an iterative scheme to estimate the optimal bridge function, which in turn gives an iterative estimator of p(y). The iterative bridge sampling estimator is

p̂(y)^(t+1) = [ R⁻¹ Σ_{r=1}^{R} p(y|θ_q^[r]) p(θ_q^[r]) / (s_1 p(y|θ_q^[r]) p(θ_q^[r]) + s_2 p̂(y)^(t) q(θ_q^[r])) ] / [ B⁻¹ Σ_{b=1}^{B} q(θ^[b]) / (s_1 p(y|θ^[b]) p(θ^[b]) + s_2 p̂(y)^(t) q(θ^[b])) ].   (3.27)

The iterative scheme requires full likelihood evaluations for the B posterior samples and the R samples from the importance distribution. The iterative estimator (3.27) has an interpretation as a maximum profile likelihood estimator of the model evidence under the assumption of independent posterior samples (Geyer, 1996; Shirts et al., 2003).

3.3.6 Laplace approximation

The Laplace approximation can be motivated under the assumption that the log posterior density is approximately quadratic around the posterior mode θ̃. Let G and H denote the gradient and Hessian matrix of the unnormalised log posterior evaluated at the mode:

G = ∇[log p(y|θ) + log p(θ)]|_{θ=θ̃},   H = ∇²[log p(y|θ) + log p(θ)]|_{θ=θ̃}.

A second order Taylor expansion around the mode gives

p(y) ≈ ∫ exp[ log p(y|θ̃) + log p(θ̃) + (θ − θ̃)ᵀG + (1/2)(θ − θ̃)ᵀH(θ − θ̃) ] dθ
     = ∫ exp[ log p(y|θ̃) + log p(θ̃) + (1/2)(θ − θ̃)ᵀH(θ − θ̃) ] dθ
     = exp[ log p(y|θ̃) + log p(θ̃) ] ∫ exp[ (1/2)(θ − θ̃)ᵀH(θ − θ̃) ] dθ.

The second line uses the fact that the gradient is zero at the posterior mode. Let S⁻¹ = −H, so that the integrand can be recognised as the kernel of a multivariate Gaussian density with covariance matrix S. Introducing the required normalising constant and integrating over the density function yields the approximation

p(y) ≈ exp[ log p(y|θ̃) + log p(θ̃) ] ∫ exp[ −(1/2)(θ − θ̃)ᵀS⁻¹(θ − θ̃) ] dθ
     = exp[ log p(y|θ̃) + log p(θ̃) ] (2π)^{d/2} |S|^{1/2} ∫ (2π)^{−d/2} |S|^{−1/2} exp[ −(1/2)(θ − θ̃)ᵀS⁻¹(θ − θ̃) ] dθ
     = exp[ log p(y|θ̃) + log p(θ̃) ] (2π)^{d/2} |S|^{1/2}.

The Laplace approximation to the log model evidence is

log p(y) ≈ log p(y|θ̃) + log p(θ̃) + (d/2) log(2π) + (1/2) log(det(S)).   (3.28)
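A minimal sketch of the Laplace approximation (3.28) is given below. It assumes a user-supplied function log_post() returning the unnormalised log posterior log p(y|θ) + log p(θ) and a starting value theta0; the mode and Hessian are obtained numerically with optim(), rather than analytically.

# Numerical Laplace approximation to log p(y); log_post(theta) returns
# log p(y | theta) + log p(theta), theta0 is a starting value for the optimiser
laplace_evidence <- function(log_post, theta0) {
  neg <- function(theta) -log_post(theta)
  opt <- optim(theta0, neg, method = "BFGS", hessian = TRUE)
  d <- length(theta0)
  S <- solve(opt$hessian)      # opt$hessian estimates -H at the mode, so S = (-H)^{-1}
  log_det_S <- as.numeric(determinant(S, logarithm = TRUE)$modulus)
  -opt$value + (d / 2) * log(2 * pi) + 0.5 * log_det_S
}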
Although the Laplace approximation has been shown to be competitive in simulation studies, it can be difficult to bound the error in the approximation in practice (Gelfand and Dey, 1994). Additionally, computation of the Hessian matrix can be difficult in high-dimensional statistical models.

3.3.7 Laplace-Metropolis estimator

The Laplace-Metropolis estimator (Raftery, 1996; Lewis and Raftery, 1997) is a variant of the Laplace approximation designed to be used more easily in situations where posterior samples are available. The Laplace approximation requires the posterior mode and the Hessian evaluated at the mode. These are not immediately available given posterior samples θ^[1], . . . , θ^[B]. Raftery proposes a number of candidate estimators for the posterior mode θ̃ using the posterior samples:
1. Take θ̃ to be the posterior sample that maximises p(y|θ^[b]) p(θ^[b]) over b = 1, . . . , B.
2. Estimate the components of θ̃ by taking the component wise means of the posterior samples.
3. Estimate the components of θ̃ by taking the component wise medians of the posterior samples.
4. Estimate θ̃ by finding the multivariate median over the samples.
Let Σ_{θ|y} denote the covariance matrix of the posterior distribution. Raftery suggests using Σ_{θ|y} in place of S = (−H)⁻¹ in the Laplace approximation (3.28). This is justified using an asymptotic argument. The posterior covariance matrix Σ_{θ|y} can be estimated using the empirical covariance matrix of the posterior samples or an alternative robust covariance matrix estimator. The Laplace-Metropolis estimator is

log p(y) ≈ log p(y|θ*) + log p(θ*) + (d/2) log(2π) + (1/2) log(det(Σ̂_{θ|y})),   (3.29)

where θ* and Σ̂_{θ|y} are suitable estimators of the posterior mode and posterior covariance matrix, which can be obtained using the posterior samples. The Laplace-Metropolis estimator has performed well in simulations but its theoretical properties are not well understood (Raftery, 1996).

3.4 Application to tall datasets

Importance sampling, the harmonic mean estimator and bridge sampling all require repeated full likelihood evaluations p(y|θ). If each likelihood evaluation is O(n), these techniques may be impractical for large n datasets. As discussed in Section 3.3.2, subsampling can be used to provide unbiased estimates of the likelihood p̂(y|θ). In principle it is possible to modify existing estimators of the model evidence to use subsampled likelihoods. Consider the basic importance sampling estimator described in Section 3.3.3. Given R samples from the importance distribution q(θ), denoted θ_q^[1], . . . , θ_q^[R], we could replace the full likelihood evaluations p(y|θ_q^[r]) with unbiased subsampled estimates p̂(y|θ_q^[r]) using the modified Poisson estimator in Section 3.3.2. The subsampled importance sampling estimator could be constructed as

p̂(y) = (1/R) Σ_{r=1}^{R} p̂(y|θ_q^[r]) p(θ_q^[r]) / q(θ_q^[r]).   (3.30)

An issue is that we could obtain negative estimates of the model evidence unless the tuning parameter ω is an appropriate lower bound. Assuming that we plug in subsampled estimates of the likelihood, the possibility of negative estimates will also affect subsampled implementations of the harmonic mean estimator and subsampled bridge sampling. Although the Poisson estimator can cut the number of likelihood evaluations required to use the importance sampling estimators, the difficulty of choosing an appropriate lower bound ω makes such an approach unattractive.
The Laplace approximation and the Laplace-Metropolis estimator are attractive alternatives that avoid much of the computational expense associated with the other methods. A drawback is that the error associated with these approximate methods can be difficult to quantify in practice. As mentioned earlier, the Laplace approximation requires the posterior mode and Hessian, which are not automatically provided given posterior samples. The Laplace-Metropolis estimator approximates the mode and Hessian using information that is more readily available given simulation output. Although convenient, it is again difficult to determine the error that this introduces into the final estimator.

To integrate subsampling with the pseudo-marginal MCMC algorithm it is a requirement to have an unbiased estimator of the likelihood, necessitating the use of the Poisson estimator or alternatives. As our goal is Bayesian model selection rather than posterior sampling, we have the freedom to focus on the log model evidence. As already discussed, estimating log p(y|θ) using subsampling is significantly easier than estimating p(y|θ) using subsampling. At this point the computational advantages of the evidence bounds become more evident. As they are defined on the log scale, it is relatively easy to incorporate subsampling methodology.

Suppose we have B samples from the posterior distribution. Let Σ̂_{θ|y} be the empirical covariance matrix of the posterior samples θ^[1], . . . , θ^[B]. A simulation consistent estimate of the entropy upper bound is

(1/B) Σ_{b=1}^{B} log p(y|θ^[b]) + (1/B) Σ_{b=1}^{B} log p(θ^[b]) + d/2 + (d/2) log(2π) + (1/2) log(det(Σ̂_{θ|y})).   (3.31)

Suppose we generate R samples from the approximate variational posterior q(θ). Similar to the Gelfand and Dey (1994) recommendation for importance sampling, we propose to set the variational distribution q(θ) to be a normal distribution N(μ̂_{θ|y}, Σ̂_{θ|y}), where μ̂_{θ|y} and Σ̂_{θ|y} are the mean and covariance matrix of the posterior samples θ^[1], . . . , θ^[B]. Let the samples from q(θ) be denoted θ_q^[1], . . . , θ_q^[R]. A simulation consistent estimator of the evidence lower bound is

(1/R) Σ_{r=1}^{R} log p(y|θ_q^[r]) + (1/R) Σ_{r=1}^{R} log p(θ_q^[r]) − (1/R) Σ_{r=1}^{R} log q(θ_q^[r]).   (3.32)

In some situations the expectations E_{q(θ)}[log p(θ)] and E_{q(θ)}[log q(θ)] can be determined analytically. This is the case when both the prior p(θ) and the variational approximation q(θ) are normal distributions. The simulation consistent estimator of the variational lower bound can then be simplified to

(1/R) Σ_{r=1}^{R} log p(y|θ_q^[r]) + E_{q(θ)}[log p(θ)] − E_{q(θ)}[log q(θ)].   (3.33)

For tall datasets the B full log likelihood evaluations in (3.31) and the R full log likelihood evaluations in (3.32) and (3.33) will be computationally demanding. Subsampling can be used to estimate the log likelihood terms efficiently.

3.4.1 Estimation of evidence bounds

Suppose we have B samples θ^[1], . . . , θ^[B] from the posterior distribution, generated using one of the Big Data posterior simulation algorithms mentioned in the introduction. Pseudo-marginal MCMC or a stochastic gradient Langevin dynamics based sampler are two possibilities. To estimate the bounds (3.16) we need to estimate the posterior expectation E_{p(θ|y)}[log p(y|θ)]. We propose to use the subsampling estimators of the log likelihoods. That is, for some batch size m ≪ n, to use either of the estimators

Ê_{p(θ|y)}[log p(y|θ)] = (1/B) Σ_{b=1}^{B} ℓ̂_gradient(θ^[b]),   (3.34)
Ê_{p(θ|y)}[log p(y|θ)] = (1/B) Σ_{b=1}^{B} ℓ̂_hessian(θ^[b]).   (3.35)

Similarly, we can use the control variates to estimate the required goodness of fit term E_{q(θ)}[log p(y|θ)] in the variational lower bound (3.14). Suppose R samples from the variational distribution q(θ) are available, denoted θ_q^[1], . . . , θ_q^[R]. We propose to use either of the estimators

Ê_{q(θ)}[log p(y|θ)] = (1/R) Σ_{r=1}^{R} ℓ̂_gradient(θ_q^[r]),   (3.36)
Ê_{q(θ)}[log p(y|θ)] = (1/R) Σ_{r=1}^{R} ℓ̂_hessian(θ_q^[r]).   (3.37)

Given the subsampled estimators of the log likelihood we can define a subsampled estimator of the upper bound,

log p(y) ≤ Ê_{p(θ|y)}[log p(y|θ)] + Ê_{p(θ|y)}[log p(θ)] + d/2 + (d/2) log(2π) + (1/2) log(det(Σ̂_{θ|y})).   (3.38)

We can also define a subsampled estimator of the lower bound,

log p(y) ≥ Ê_{q(θ)}[log p(y|θ)] + E_{q(θ)}[log p(θ)] − E_{q(θ)}[log q(θ)].   (3.39)
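The sketch below indicates how the subsampled bounds (3.38) and (3.39) might be assembled from posterior draws, reusing ell_gradient() from the sketch in Section 3.3.1. theta_samples (a B x d matrix) and log_prior() are assumed, illustrative objects, and the variational distribution is the normal approximation fitted to the posterior draws.

library(mvtnorm)
m <- 500
d <- ncol(theta_samples)
mu_hat    <- colMeans(theta_samples)
Sigma_hat <- cov(theta_samples)

# Upper bound (3.38): posterior expectations estimated from the posterior draws,
# with the log likelihood terms replaced by subsampled estimates
E_loglik_post   <- mean(apply(theta_samples, 1, ell_gradient, m = m))
E_logprior_post <- mean(apply(theta_samples, 1, log_prior))
upper <- E_loglik_post + E_logprior_post + d / 2 + (d / 2) * log(2 * pi) +
  0.5 * as.numeric(determinant(Sigma_hat, logarithm = TRUE)$modulus)

# Lower bound (3.39): q(theta) = N(mu_hat, Sigma_hat) fitted to the posterior draws
theta_q <- rmvnorm(5000, mu_hat, Sigma_hat)
E_loglik_q <- mean(apply(theta_q, 1, ell_gradient, m = m))
lower <- E_loglik_q + mean(apply(theta_q, 1, log_prior)) -
  mean(dmvnorm(theta_q, mu_hat, Sigma_hat, log = TRUE))

c(lower = lower, upper = upper)

Every likelihood-related term costs O(m) per draw rather than O(n), which is the source of the computational savings reported later in the chapter.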
3.5 Data application: flights dataset

We analyse the flights dataset that was also considered in Chapter 2. The flights dataset is available in the R package nycflights13 (Wickham, 2014). There are n = 327,346 observations on flights departing New York City in 2013, with data on 16 different carriers (airlines). We dichotomised the arrival delay variable (originally in minutes) to obtain a binary outcome. We labelled flights as late if the arrival delay was greater than zero, and on time if the arrival delay was less than or equal to zero. We compared three different logistic regression models for late arrival:

M1 = intercept : carrier + delay : carrier,   (3.40)
M2 = intercept : carrier + delay : carrier + weekday,   (3.41)
M3 = intercept : carrier + delay : carrier + weekday : carrier.   (3.42)

Model 1 is the unpooled model from Chapter 2. Model 2 introduces a fixed effect for the day of the week; it could be the case that flights are more likely to arrive late on a weekend. Model 3 allows for an interaction between weekday and carrier. We used independent N(0, 1) priors on all coefficients.

We used a Gibbs sampler to generate B = 5000 observations from the full dataset posterior distribution. This dataset is of moderate size, and the covariates can be grouped for faster sampling. Ideally we would use a Big Data algorithm to generate the posterior samples, but we were unable to find an R package for doing so. We have not been concerned with the method of generation of the posterior samples, as our interest has been the follow up likelihood cost for then obtaining the model evidence. Our conclusions should not be affected by the fact that we have used a conventional Gibbs sampler to generate the initial collection of posterior samples.

The goodness of fit of each model was summarised by computing a Receiver Operating Characteristic (ROC) curve and a reliability plot. We generated predicted probabilities of late arrival for each observation, again taking the posterior predictive mean as the predicted probability of late arrival. The posterior predictive mean for a response given covariates x_i is obtained by integrating over the posterior distribution of the coefficients,

E[y_i|M] = ∫ p(y_i = 1|x_i, β, M) p(β|X, y, M) dβ.   (3.43)

The ROC curve plots the in-sample true positive rate against the false positive rate using the predicted probabilities, and summarises the predictive ability of each classifier. The area under the curve (AUC) is a summary measure of predictive performance: a perfect classifier has an AUC of 1, while random guessing gives an AUC of 0.5. ROC curves were generated using the R package pROC (Robin et al., 2011).
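For orientation, the sketch below reconstructs the data preparation and fits Model 2 by maximum likelihood with glm(); this is not the Bayesian analysis reported in the text, which used N(0, 1) priors and a Gibbs sampler. It assumes that dep_delay is the delay covariate in (3.40)-(3.42), and uses pROC for the ROC curve as in the text.

library(nycflights13)
library(pROC)

dat <- flights[complete.cases(flights[, c("arr_delay", "dep_delay", "carrier")]), ]
dat$late    <- as.integer(dat$arr_delay > 0)       # dichotomised response
dat$weekday <- factor(weekdays(dat$time_hour))     # day of the week

# Model 2: carrier-specific intercepts and delay slopes plus a weekday effect
m2 <- glm(late ~ 0 + carrier + carrier:dep_delay + weekday,
          family = binomial(), data = dat)

roc_m2 <- roc(dat$late, fitted(m2))
auc(roc_m2)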
Good predictive performance does not mean that a model is correctly specified. Another useful diagnostic is a reliability plot. Suppose a group of observations has a predicted probability of late arrival of 0.2. We would expect 20 percent of the observed data points in this group to be late arrivals. The reliability plot checks the goodness of fit of the model in terms of empirical event probabilities matching the predicted event probabilities. Observations are binned according to predicted probabilities, and the average predicted probability is compared to the empirical probability of late arrival within the bin. The reliability curves were generated using the R package caret (Kuhn, 2008).

Figure 3.1 shows the ROC curves for each model. The AUC for each model is given in the figure legend. The models have very similar AUC scores: Model 2 outperforms Model 1 by 0.0023 and Model 3 outperforms Model 2 by 0.0006. Models 2 and 3 have more free parameters than Model 1, so it is not surprising that they have better classification performance. What is more interesting is how the differences in predictive ability translate into Bayes factors. Before looking at this in more detail it is worth studying the reliability plots shown in Figure 3.2. Reliability curves are shown for each of the three models. A perfectly calibrated probabilistic classifier would give a reliability curve falling along the identity line in a sufficiently large sample. All three models appear very well calibrated. From the diagnostics in Figures 3.1 and 3.2 it is difficult to make a clear ranking of the models.

We now turn to the model evidence. We first computed the model evidence for each model using the importance sampling method proposed by Gelfand and Dey (1994), described in Section 3.3.3. We generated R = 5000 samples from the importance distribution, so the estimator required R = 5000 full likelihood evaluations. Estimated log Bayes factors are reported in Table 3.2. Moving from left to right, both Model 2 and Model 3 are supported over Model 1. There is evidence for a day of the week effect on the probability of late arrival. Model 2 is preferred over Model 3 with 2 log B23 = 55. The day of the week effect appears constant across carriers. Model 2 has fewer parameters, and gives very similar results to Model 3, as seen in Figures 3.1 and 3.2.

The computed Bayes factors are much larger than what is considered in Table 3.1. Here we have n = 327,346, and we expect the typical magnitude of log Bayes factors to increase as n grows under standard asymptotic theory. Large sample Bayes factors can be difficult to interpret in isolation. The Bayes factors provide extra information to discriminate between models over what is shown in Figures 3.1 and 3.2, however it is difficult to give an intuitive measure of the strength of evidence.

We then estimated the model evidence using the subsampled bound estimators (3.39) and (3.38). We estimated the required goodness of fit expectations using the three log likelihood estimators ℓ̂_simple, ℓ̂_gradient and ℓ̂_hessian discussed in Section 3.3.1. For the lower bound, we took the variational distribution q(θ) to be the same normal approximation that was used in the implementation of importance sampling. We set the subsample size at m = 500. Table 3.3 reports the estimated evidence bounds and the Monte Carlo standard errors. Standard errors were obtained using the default method in the mcmcse R package (Flegal et al., 2017).
We also report the estimates of the log model evidence using the importance sampling (IS) method. Standard errors for importance sampling were obtained using the nonparametric bootstrap.

The results in Table 3.3 show the importance of using control variates when estimating the full log likelihood from a subsample. The standard errors for ℓ̂_simple are much higher than for the other estimators for all three models. In particular, the standard errors for the bounds on Models 2 and 3 are larger than the difference in log Bayes factors. We cannot determine which model has the highest evidence score using the simple estimator ℓ̂_simple. We obtain more precise estimates of the bounds using the control variate estimators ℓ̂_gradient and ℓ̂_hessian. The estimated bounds are reasonably tight: the model evidence can generally be enclosed in an interval of approximately width one. This suggests the normal approximation to the posterior distribution is reasonable under each model.

Interestingly, the standard error of ℓ̂_hessian is greater than the standard error of ℓ̂_gradient for Model 3. This is contrary to what we expect from the asymptotic analysis of the estimators. The larger standard error could be a result of the sampling method. We have used simple random sampling in a situation where we do not have an equal number of observations for each carrier. The dummy coding in the design matrix means that a simple random subsample can contain very little information about the entire parameter set. We should aim to stratify the sample so that we get information about each parameter in the subsample. The second order estimator may be sensitive to this issue.

[Figure 3.1 (ROC curves; 1 − specificity against sensitivity): ROC curves for the flights dataset. Curves are shown for each of the three models in the candidate set. Classification performance is very similar across the three models. AUC values: Model 1 = 0.8313, Model 2 = 0.8336, Model 3 = 0.8342. The addition of weekday improves in-sample predictive performance by a small margin.]

The standard errors for the subsampled bound estimators are roughly an order of magnitude larger than the standard error of the importance sampling estimator. This should be taken into consideration when comparing the wall-clock time for each estimator. The standard errors in Table 3.3 are all estimated from a single replication of the computational experiment, and we use different methodology to estimate the standard errors for the bounds and for the importance sampling estimator. As future work we will repeat the experiment multiple times to obtain more robust estimates of the Monte Carlo standard errors.

Table 3.4 reports the time spent on likelihood evaluations for each estimator. We report the sum of the time to compute the upper and lower bounds for the subsampling based estimators. The subsampled estimators are multiple orders of magnitude faster than the standard importance sampling method: the subsampled estimators require m = 500 likelihood evaluations at each θ, while standard importance sampling requires n = 327,346 likelihood evaluations. The second order estimator ℓ̂_hessian is considerably slower than the first order estimator ℓ̂_gradient. The cost of the extra quadratic adjustment in (3.20) compared to (3.19) is non-negligible.
As the second order estimator has higher standard errors, it seems the gradient based estimator is a better choice in this example.

We also implemented a subsampled version of importance sampling as per (3.30). We again generated R = 5000 samples from the same normal importance distribution that was used for the regular importance sampling method. We used the simple Poisson estimator (3.23) to generate unbiased estimates of the likelihood. The Poisson estimator requires unbiased estimators of the log likelihood as input, and we tried it with each of the log likelihood estimators ℓ̂_simple, ℓ̂_gradient and ℓ̂_hessian. We set λ = 5 and took m = 100 when using the subsampling estimators, so that the expected number of likelihood evaluations per use was equal to five hundred. Table 3.5 reports the proportion of negative likelihood estimates obtained using each subsampling estimator. The proportion of negative estimates is close to 0.5 for each model and estimator. It is hard to construct a good estimator of the log model evidence given many negative likelihood estimates. It is difficult to use the simple Poisson estimator to accelerate importance samplers for the model evidence.

[Figure 3.2 (reliability plots; bin midpoint against observed event percentage): Reliability curves for the flights dataset. Curves are shown for each of the three models in the candidate set. All models are very well calibrated across the range of theoretical probabilities. Each model could be considered to be correctly specified.]

          2 log B21   2 log B31   2 log B23
Value     941         887         55

Table 3.2: Log Bayes factors for the flights dataset (n = 327,346). The values of the Bayes factors are all outside the range considered in Table 3.1. The models in the candidate set appear to give very similar predictions and are all well calibrated. Interpreting the strength of evidence when n is large can be difficult.

           ℓ̂_simple                              ℓ̂_gradient                          ℓ̂_hessian                           IS
           Lower              Upper              Lower             Upper             Lower             Upper
Model 1    -147209.1 (98.02)  -147100.8 (99.46)  -147112.5 (0.21)  -147111.9 (0.16)  -147112.6 (0.25)  -147111.9 (0.63)  -147112.0 (0.02)
Model 2    -146548.8 (122.18) -146637.6 (97.23)  -146641.2 (0.22)  -146641.2 (0.19)  -146640.7 (0.76)  -146641.8 (0.20)  -146641.0 (0.02)
Model 3    -146491.5 (106.59) -146611.4 (93.61)  -146669.6 (0.55)  -146668.5 (0.53)  -146579.6 (21.04) -146588.0 (17.07) -146668.7 (0.04)

Table 3.3: Estimates of the model evidence for the flights dataset. The first three column groups report the evidence bounds obtained using the different log likelihood estimators. Estimated standard errors are in brackets. The Monte Carlo variance of the simple log likelihood estimator is too large to rank models confidently. The control variate based estimators have significantly lower standard errors. The widths of the evidence bounds are small in relation to the estimated Bayes factors between models.

           ℓ̂_simple   ℓ̂_gradient   ℓ̂_hessian   IS
Model 1    3           4            10          523
Model 2    4           4            12          603
Model 3    12          13           85          1743

Table 3.4: Time spent on likelihood evaluations for the flights dataset (seconds). The importance sampling method requires a full O(n) likelihood evaluation for each posterior sample. The subsampling based estimators require O(m) likelihood evaluations per sample. We report the total time for calculating the lower and upper bounds for the subsampling estimators. In this simulation n = 327,346 and m = 500.
We report the time spent on likelihood evaluations only; the time spent on generating the subsampling index sets S was recorded separately.

           ℓ̂_simple   ℓ̂_gradient   ℓ̂_hessian
Model 1    0.52        0.50         0.50
Model 2    0.49        0.50         0.51
Model 3    0.50        0.50         0.51

Table 3.5: Proportion of negative likelihood estimates using the simple Poisson estimator over the B = 5000 posterior samples. We applied the simple Poisson estimator using each of the subsampled log likelihood estimators. The high fraction of negative likelihood estimates makes it difficult to construct a subsampled version of importance sampling for the purposes of estimating the model evidence. The evidence bounds are defined on the log scale and are not subject to a positivity requirement.

3.6 Conclusion

Much of the status quo in Bayesian computation is not well suited to Big Data. As mentioned in the introduction, procedures that require repeated O(n) evaluations of the full dataset likelihood are unsustainable as datasets grow taller. Estimation of the integrated likelihood can be computationally demanding on tall datasets because of this likelihood burden. We proposed a method for estimating the integrated likelihood using subsampling that avoids repeated O(n) likelihood evaluations. The relevance to model choice is that the proposed estimator of the model evidence can be calculated in less time than traditional importance sampling methods. Bayesian model choice can then proceed at a less glacial pace. Practitioners will inevitably perform a cost benefit analysis when deciding what procedure to use for model selection. If the computational cost of the Bayesian approach is very high, the scales will tip in favour of alternative approaches, whatever the benefits of a Bayesian analysis may be.

The main objective of the thesis is to identify and investigate strategies that minimise the expense of the compute step in Box's loop (refer back to Figure 1.1 in Chapter 1). The point of this chapter is to see how subsampling can be used to minimise the expense of the compute step when we want to critique multiple models on a large dataset using the fully Bayesian paradigm. To reiterate, we are trying to give an air of computational feasibility to Bayesian purism in the huge n setting.

Subsampling has been used in algorithms for generating samples from the posterior distribution p(θ|y) given a large dataset y, however subsampling based estimation of the model evidence does not seem to have been explored in great depth. Subsampling methodology for posterior simulation makes use of subsampled estimates of the log likelihood. Integrating these estimators into existing importance sampling methods for calculating the evidence is difficult due to the possibility of negative estimates of the likelihood. We found that the log likelihood estimators can also be used for efficient estimation of the model evidence in a different manner. Our starting point was an identity for the log model evidence:

log p(y) = E_{p(θ|y)}[log p(y|θ)] − D(p(θ|y) ‖ p(θ)).   (3.44)

The goodness of fit term E_{p(θ|y)}[log p(y|θ)] can be estimated efficiently using subsampling. Given posterior samples, we can bound the penalty term D(p(θ|y) ‖ p(θ)) using a maximum entropy argument and a standard variational Bayes approach. Control variates are essential in order to control the Monte Carlo variance of the algorithm; this is a similar finding to other work on subsampling algorithms for Bayesian computation (Baker et al., 2017; Bierkens et al., 2016).
We investigated both first order and second order Taylor series approximations as control variates. The first order estimator has attractive theoretical properties and performed competitively with the second order estimator in the simulations. Asymptotic properties of Bayes factors suggests that interval estimators of the log model evidence may be acceptable for ranking models in the large n regime. The proposed methodology is closely related to the standard Laplace approximation, however there are some key differences. The Laplace approximation is based on a second order Taylor series expansion about the posterior mode. It can be difficult to determine the error in the Laplace approximation in practice 64 3.7. Appendix (Gelfand and Dey, 1994). We make a normal approximation to the posterior distribution. Assuming the parameter space is unconstrained, this provides an upper bound on the log model evidence using the maximum entropy property of the Gaussian distribution. Although our approach is more computationally demanding than the standard Laplace approximation or the Laplace-Metropolis estimator it does come with additional guarantees on the error in the approximation. We have used the maximum entropy property of the normal distribution in Rd, and assumed that the parameter space is unconstrained. Another way to achieve a closed form upper bound on the posterior entropy is to use general properties of the exponential family. We typically have closed form expressions for the entropy of a distribution in the exponential family. The maximum entropy distribution of a random variables X taking values in the interval [0,∞) subject to the constraint that E[X] = 1/λ is an exponential distribution with density f(x) = λ exp(−λx). Suppose we had a single parameter θ with support [0,∞) in our model. Given posterior samples we can form an estimate of E[θ|y]. We could then form an upper bound on the posterior entropy using the maximum entropy property of the exponential distribution. Using exponential family distributions for the maximum entropy bound also suggests taking the same exponential family distributions for the variational approximation q(θ) for the lower bound. This is an interesting direction that helps to further distinguish our approach from the Laplace and Laplace-Metropolis estimators. 3.7 Appendix 3.7.1 Control variates We have defined `(θ) = log p(y|θ) and `i(θ) = log p(yi|θ). The variance of the gradient estimator̂` gradient(θ) (3.21) will be a function of the remainder terms from each of the individual log likelihood contributions. Let Rigradient denote the remainder term from the approximation for observation i, so Rigradient = `i(θ)− ̂`i,1(θ). An alternative expression for (3.21) is then ̂` gradient(θ) = `(θ̂) + n m ∑ i∈S Rigradient. Now as var(Z) ≤ E(Z2) for any random variable Z, the variance can be upper bounded in terms of the expected squared remainder term var(̂`gradient(θ)) ≤ n m n∑ i=1 |Rigradient|2. (3.45) If the remainder terms are small the variance of the estimator will be small. The introduction of the first order control variates can decrease the Monte Carlo variance of the log likelihood estimator if the linear approximations to the log likelihood contributions are accurate. The variance of ̂`hessian(θ) estimator will also be a function of the remainder terms from each of the individual log likelihood approximations. Let Rihessian denote the remainder term from the approximation for observation i, so Rihessian = `i(θ)− ̂`i,2(θ). 
The second order estimator can be expressed as

ℓ̂_hessian(θ) = log p(y|θ̂) + (1/2)(θ − θ̂)ᵀH(θ − θ̂) + (n/m) Σ_{i∈S} R_i^hessian.

Now as var(Z) ≤ E(Z²) for any random variable Z, the variance of ℓ̂_hessian(θ) can be upper bounded in terms of the expected squared remainder term,

var(ℓ̂_hessian(θ)) ≤ (n/m) Σ_{i=1}^{n} |R_i^hessian|².   (3.46)

It is reasonable to expect the remainder terms R_i^hessian from the second order approximation to be smaller in magnitude than the remainders R_i^gradient from the first order approximation. As a consequence the variance of the second order estimator should be smaller than the variance of the first order estimator. We study the asymptotic variance of each estimator to compare them in a more precise manner.

3.7.2 Asymptotic variance

Here we consider the asymptotic variance of the proposed estimators ℓ̂_gradient(θ) and ℓ̂_hessian(θ) as the sample size n is taken to infinity. As mentioned earlier, the variance of the estimators will depend on the remainder terms from the Taylor approximations for each likelihood contribution. Recall that R_i^gradient denotes the remainder term for observation i from the first order Taylor expansion about the maximum likelihood estimate. Suppose that we can bound the second order derivatives of the log likelihood for each observation i = 1, . . . , n. Given a d-dimensional parameter, θ = (θ_1, . . . , θ_d)ᵀ, assume that for all j, k ∈ {1, . . . , d},

|∂² log p(y_i|θ)/∂θ_j∂θ_k| ≤ M_i,

for some constant M_i, and i = 1, . . . , n. This is possible for a wide range of models in the exponential family. Then by the Taylor-Lagrange inequality we can form bounds on the remainder term,

|R_i^gradient| ≤ (M_i/2) ‖θ − θ̂‖₁²,   |R_i^gradient|² ≤ (M_i²/4) ‖θ − θ̂‖₁⁴.

Substituting the squared remainder bound into (3.45) gives an upper bound on the variance,

var(ℓ̂_gradient(θ)) ≤ (n/m) Σ_{i=1}^{n} (M_i²/4) ‖θ − θ̂‖₁⁴ = (1/m) ( Σ_{i=1}^{n} n⁻¹ M_i²/4 ) n² ‖θ − θ̂‖₁⁴.   (3.47)

To determine the asymptotic variance of the estimators we take the evaluation point θ to be a stochastic sequence. Roughly speaking, we assume the posterior distribution concentrates in a ball of radius of order n^{−1/2} around the maximum likelihood estimate θ̂ as n increases. One way to motivate this is through the Bernstein-von Mises theorem (Van Der Vaart, 1998, Chapter 10). Under mild conditions we expect the posterior distribution to be asymptotically normal, p(θ|y) ≈ N(θ̂, n⁻¹ I₁(θ̂)⁻¹), where I₁(θ̂) is the Fisher information matrix for a single observation evaluated at the maximum likelihood estimate (Van Der Vaart, 1998, Chapter 10). The Fisher information for a single observation is defined as I₁(θ) = −E[∇² log p(y_i|θ)], where the expectation is over the generative model with known parameter θ. We expect the posterior variance to be O(n⁻¹). Using properties of the folded normal distribution, each term |θ_i − θ̂_i| in the norm

‖θ − θ̂‖₁ = Σ_{i=1}^{d} |θ_i − θ̂_i|

can be reasoned to be O_p(n^{−1/2}). As such we can take ‖θ − θ̂‖₁⁴ to be of the order O_p(n⁻²). Plugging this deviation rate into the variance bound (3.47), we can cancel out the inflation by n that debilitates the simple likelihood estimator,

var(ℓ̂_gradient(θ)) ≤ (1/m) ( Σ_{i=1}^{n} n⁻¹ M_i²/4 ) n² ‖θ − θ̂‖₁⁴ ≈ (1/m) ( Σ_{i=1}^{n} n⁻¹ M_i²/4 ) n² × O_p(n⁻²) ≈ O_p(m⁻¹).

Under mild assumptions the term inside the sum, Σ_{i=1}^{n} n⁻¹ M_i², can be assumed to approach a constant as n → ∞. As such we expect the variance of the gradient estimator to be O_p(1/m) for large n. The approximation in the final line reflects that this is a heuristic argument.
This is a significant improvement over the simple subsampling estimator where var(̂`simple(θ)) is expected to be Op(n2/m). A similar analysis can be performed for the second order estimator ̂`hessian(θ). Let Rihessian denote the remainder term for observation i from the second order Taylor expansion about the maximum like- lihood estimate. We assume that we can bound the third order partial derivatives of the log likelihood contribution for each observation. Specifically, suppose that for all j, k, l ∈ {1, . . . , d},∣∣∣∣∂3 log p(yi|θ)∂θj∂θkθl ∣∣∣∣ ≤Mi, for some constant Mi, and i = 1, . . . , n. Then by the Taylor-Lagrange inequality we can form bounds on the remainder term (Walschap, 2015, Chapter 2), |Rihessian| ≤ Mi 6 ‖θ − θ̂‖31, |Rihessian|2 ≤ M2i 36 ‖θ − θ̂‖61. Substituting the squared remainder bound into (3.46) gives an upper bound on the variance var(̂`hessian(θ)) ≤ n m n∑ i=1 M2i 36 ‖θ − θ̂‖61 = 1 m ( n∑ i=1 1 n M2i 36 ) n2‖θ − θ̂‖61. We again use the fact that we expect ‖θ − θ̂‖1 to be Op(n−1/2). Making another heuristic argument for the asymptotic variance, var(̂`hessian(θ)) ≤ 1 m ( n∑ i=1 1 n M2i 36 ) n2‖θ − θ̂‖61 ≈ 1 m ( n∑ i=1 1 n M2i 36 ) n2 ×Op(n−3) ≈ 1 m ( n∑ i=1 1 n M2i 36 ) ×Op(n−1). Again assuming that ∑n i=1 n −1M2i approaches a constant, we can reason that the variance of the second order estimator approaches zero as n tends to infinity. It is interesting to compare the behaviour of ̂`gradient(θ) an ̂`hessian(θ). The assumption that θ approaches θ̂ at rate Op(n −1/2) implies that the remainder terms vanish. This is counterbalanced by the scaling factor n/m that appears in both the definition of both estimators (equations (3.34)). For thê` hessian(θ), the remainder terms vanish at a sufficiently fast rate such that the variance tends to zero. For the ̂`gradient(θ) the remainder terms diminish at a rate that leads to stable finite variance. Example: logistic regression The logistic regression model satisfies the assumptions in the analysis of the asymptotic variance. Let yi denote the binary response and xi represent the column vector of covariates for observation i, for i = 1, . . . , n. We assume that yi ∼ Bernoulli(σ(ηi)), where ηi = xTi β and σ(z) = 1/(1 + exp(−z)). For the analysis of ̂`gradient in Section 3.7.2 to apply, it is necessary to bound the second order partial derivatives. The Hessian matrix of each log likelihood contribution is ∇2 log p(yi|xi,β) = −ŷi(1− ŷi)xixTi , 67 3. Computing the model evidence using subsampling where ŷi = σ(x T i β). As 0 ≤ ŷi ≤ 1, it holds that ŷi(1− ŷi) < 1/4. The absolute value of the second order partial derivatives then satisfies ∣∣∣∣∂2 log p(yi|xi,β)∂βj∂βk ∣∣∣∣ ≤ 14‖xi‖21, for each observation i = 1, . . . , n. The analysis of ̂`hessian in Section 3.7.2 assumed bounded third order partial derivatives. The logistic regression model satisfies∣∣∣∣∂3 log p(yi|xi,β)∂βj∂βkβl ∣∣∣∣ ≤ 14‖xi‖31, for each observation i = 1, . . . , n. 68 4Statistical properties of sketching algorithms Summary Sketching is a probabilistic data compression technique that has been largely developed in the computer science community. Numerical operations on big datasets can be intolerably slow; sketching algorithms address this issue by generating a smaller surrogate dataset. Typically, inference proceeds on the com- pressed dataset. Sketching algorithms generally use random projections to compress the original dataset and this stochastic generation process makes them amenable to statistical analysis. 
We argue that the sketched data can be modelled as a random sample, thus placing this family of data compression methods firmly within an inferential framework. In particular, we focus on the Gaussian, Hadamard and Clarkson- Woodruff sketches, and their use in single pass sketching algorithms for linear regression with huge n. We explore the statistical properties of sketched regression algorithms and derive new distributional results for a large class of sketched estimators. We develop confidence intervals for sketching estimators and bounds on the relative efficiency of different sketching algorithms. An important finding is that the best choice of sketching algorithm in terms of mean square error is related to the signal to noise ratio in the source dataset. Finally, we demonstrate the theory and the limits of its applicability on two real datasets. 4.1 Introduction Sketching is a general probabilistic data compression technique designed for Big Data applications (Cor- mode, 2011). Even routine calculations can be prohibitively computationally expensive on massive datasets. Computation time can be reduced to an acceptable level by allowing for some approxima- tion error in the results. Sketching algorithms relax the computational task by generating a compressed version of the original dataset which then serves as a surrogate for calculations. The compressed dataset is referred to as a sketch as it acts as a compact representation of the full dataset. Sketching algorithms use a randomised compression stage which makes them interesting from a statistical viewpoint. Sketching algorithms for linear regression have attracted significant attention in the numerical linear algebra and theoretical computer science communities (Woodruff, 2014; Mahoney, 2011). In this paper we investi- gate the statistical properties of sketched regression algorithms, a perspective which has received little attention up to now. To describe sketched regression in more detail, we first assume the data consists of a n-length response vector y and a n×p matrix of covariates, X which is of full rank. It is assumed that n > p. The objective is to find the optimal least squares coefficients. Given sufficient computational resources, these could be computed exactly as βF = (X TX)−1XTy. The subscript F is used to indicate the connection to the full dataset. Only two quantities are needed in order to determine βF , the Gram matrix X TX, and the marginal associations XTy. Calculation of 69 4. Statistical properties of sketching algorithms XTX requires O(np2) operations while computation of XTy needs only O(np) calculations. There are two broad methods for sketched regression, complete sketching and partial sketching. Complete sketching is based on approximating both XTX and XTy, whereas partial sketching only approximates the Gram matrix. Sketching algorithms use random linear mappings to reduce the size of the dataset from n to k observations. The random linear mapping can be represented as a k × n sketching matrix S. Complete sketching generates a k-length sketched response vector y˜ and a k × p matrix of sketched predictors X˜. The sketched data are computed through the linear mappings y˜ = Sy and X˜ = SX. Partial sketching only generates a k × p matrix of sketched covariates X˜. We again use the random mapping X˜ = SX. The complete sketching estimator, βS , is defined as the least squares coefficients using the sketched responses and predictors, βS = (X˜ TX˜)−1X˜Ty˜. 
(4.1) The partial sketching estimator, βP , is defined as βP = (X˜ TX˜)−1XTy. (4.2) The key difference between (4.1) and (4.2) is that the partial sketched estimator βP is constructed using the exact marginal associations XTy. Given the sketched data, computation of βS or βP requires only O(kp2) operations, compared with the O(np2) required for βF . The estimand within a sketching algorithm is the optimal coefficient vector βF . Sketching algorithms have the property that given a fixed k, the approximation error ‖βS − βF ‖2 or ‖βP − βF ‖2 remains probabilistically bounded even as n → ∞. Designing estimators for approximate computation with such properties is very difficult, and is a common goal in the development of techniques for Big Data (Bardenet and Maillard, 2015; Phillips, 2016). The favourable scaling properties of sketching algorithms are a critical factor in making them stand apart from simple subsampling approaches, where it can be difficult to establish universal worst case bounds for large n (Drineas et al., 2006; Ma et al., 2015). The fact that sketching algorithms provide finite k guarantees for arbitrarily large n is a major reason they have received so much attention in the computer science community. There is a large literature concerned with designing appropriate distributions for the random sketching matrix S. Our focus is on data-oblivious random projections, where the distribution of the sketching matrix is not a function of the source data (y,X). An example is the Gaussian sketch, where each element is independently distributed as a N(0, 1/k) variate. We also consider the Hadamard sketch and the Clarkson-Woodruff sketch, random projections that exploit structure and sparsity for computational efficiency. Most existing results on the accuracy of sketching are universal worst case bounds (Woodruff, 2014; Mahoney and Drineas, 2016). This is typical for randomised algorithms, however a more detailed error analysis can provide important insights (Halko et al., 2011). We investigate the statistical properties of βP and βS when using data oblivious sketches. An important finding is upper and lower bounds on the relative efficiency of complete sketching to partial sketching in terms of the signal to noise ratio in the source dataset. The statistical analysis also allows the construction of exact confidence intervals for the Gaussian sketch, and asymptotic confidence intervals for other random projections, paving the way for their wider use in the statistical community interested in Big Data methods. We start by reviewing the existing literature on sketching algorithms before investigating the statistical properties in more detail. At its core, sketched regression is a randomised algorithm for approximate computation of βF . Repeated application of the sketching algorithm on the same dataset will produce different results. The first stage in our analysis is to establish the distributional properties of the sketched estimators with the source dataset fixed. This gives a clear statistical picture of the behaviour of the randomised algorithm. An important result is a conditional central limit theorem for the sketched dataset 70 4.2. Background and related work that connects the Hadamard and Clarkson-Woodruff projections to the Gaussian sketch. The regularity conditions have a intuitive interpretation in terms of the geometry of the source dataset. 
We then analyse a large genetic dataset to compare the performance of different sketching algorithms and to test the asymptotic theory that we have developed.

4.2 Background and related work

Before proceeding, it is worth mentioning alternatives to sketching, in particular iterative methods for calculating the least squares coefficients β_F. These include coordinate descent or stochastic gradient methods. Iterative methods are guaranteed to converge to β_F under very mild conditions. These iterative techniques assume that the entire dataset can be stored in memory in a single location, or require regular communication if the full dataset is distributed across multiple sites. Sketching algorithms are not burdened by these memory and communication costs, with the drawback of no convergence guarantees to β_F. Connections to iterative methods are postponed until the discussion; the focus for now is on the single pass estimators β_S and β_P.

The purpose of this section is to review the existing theoretical framework for sketching algorithms. Sketching algorithms are largely motivated through worst case guarantees. We recap how these bounds can be developed before studying the statistical properties of the sketched estimators. It will be helpful to define a number of quantities related to the full dataset before moving on. Let TSS_F = yᵀy, RSS_F = ‖y − Xβ_F‖₂², MSS_F = ‖Xβ_F‖₂² and R²_F = MSS_F/TSS_F. These terms summarise the goodness of fit of the model. The total, residual and model sums of squares are given by TSS_F, RSS_F and MSS_F respectively, with TSS_F = MSS_F + RSS_F. The proportion of variance explained by the model is given by R²_F. These values will be important in characterising the behaviour of β_S and β_P.

4.2.1 Embedding bounds

A key concept in the construction of sketching algorithms is the notion of an ε-subspace embedding (Woodruff, 2014; Meng and Mahoney, 2013; Yang et al., 2015a).

Definition 4.1 (ε-subspace embedding). For a given n × d matrix A, we call a k × n matrix S an ε-subspace embedding for A if for all vectors z ∈ R^d,

(1 − ε)‖Az‖₂² ≤ ‖SAz‖₂² ≤ (1 + ε)‖Az‖₂².

Speaking broadly, an ε-subspace embedding preserves the linear structure of the columns of the original dataset up to a multiplicative (1 ± ε) factor. In particular, if ε is small, the linear mapping S approximately preserves the covariance structure of the source dataset. Most theoretical arguments for sketching algorithms are predicated on the idea that the sketching matrix S is an ε-subspace embedding for the source dataset. The general notion is that it is possible to use a linear mapping S that reduces the sample size from n to k whilst preserving much of the linear information in the full dataset. The issue of how to generate ε-subspace embeddings is deferred until Section 4.2.2; the present focus is on the utility of ε-subspace embeddings for linear regression problems. For now, assume that we have some method for generating ε-subspace embeddings for the source data matrix A. It will be convenient to refer to Ã = SA as an ε-subspace embedding of A if S is an ε-subspace embedding for A. As regression is the focus from this point forward, we will define the source data matrix as A = [y, X], the sketched data matrix as Ã = [ỹ, X̃] and set d = p + 1.

The complete sketched estimator β_S is given by the least squares coefficients using the sketched responses ỹ and the sketched predictors X̃,

β_S = argmin_{β∈R^p} ‖ỹ − X̃β‖₂².
An ε-subspace embedding is useful as it relates the sketched optimisation problem to the full dataset optimisation problem. If Ã = [ỹ, X̃] is an ε-subspace embedding of A = [y, X], it must hold that for all β ∈ R^p,

(1 − ε)‖y − Xβ‖₂² ≤ ‖ỹ − X̃β‖₂² ≤ (1 + ε)‖y − Xβ‖₂².

If ε is small, minimising the sum of squared residuals on the sketched dataset is similar to minimising the sum of squared residuals on the full dataset. If this is the case, it can be expected that β_S will be close to β_F. It is possible to establish a concrete bound: if Ã is an ε-subspace embedding of A (Sarlos, 2006),

‖β_S − β_F‖₂² ≤ (ε²/σ²_min(X)) RSS_F,   (4.3)

where σ_min(X) represents the smallest singular value of the design matrix X. A very similar argument can be used to motivate the partial sketched estimator β_P. Existing bounds for the partial sketch focus on the prediction error ‖Xβ_P − Xβ_F‖₂² (Becker et al., 2015; Pilanci and Wainwright, 2016). To make a direct comparison to (4.3) we establish a bound on the coefficient error.

Theorem 4.1. Suppose that X̃ is an ε-subspace embedding of X with ε < 0.5. Then the following bound holds,

‖β_P − β_F‖₂² ≤ (4ε²/σ²_min(X)) MSS_F.   (4.4)

For the proof see Chapter 5. The mild requirement that ε < 0.5 is imposed so that the bound matches the functional form of the complete sketching bound (4.3). Comparing the partial sketching bound to (4.3), we see that the tightness of the bound is controlled by the model sum of squares as opposed to the residual sum of squares. The sensitivity of partial sketching to the model sum of squares as opposed to the residual sum of squares has been noted in previous work on partial sketching (Dhillon et al., 2013; Pilanci and Wainwright, 2016; Becker et al., 2015). This suggests that the signal to noise ratio in the source dataset will be important when selecting which sketched estimator to use. A naive conclusion is that complete sketching is preferred when RSS_F < 4MSS_F, or equivalently R²_F > 0.2. Such a result is hardly prescriptive, as the worst case bound is not necessarily indicative of expected performance. A second point of interest is that if the k × n matrix S is an ε-subspace embedding for A = [y, X], it is also an ε-subspace embedding for X. This suggests that it is reasonable to compute both β_P and β_S from a single sketch, although it is not clear how to combine the estimators into a single point estimator. These issues will be explored in more depth by examining the statistical properties of both complete and partial sketching. Before moving on to the statistical analysis we review some of the existing methods for generating ε-subspace embeddings.

4.2.2 Sketches

There are two general categories of distributions for the random matrix S: data aware random projections and data oblivious random projections. A data aware random projection uses information in the source data y, X to generate S. In contrast, a data oblivious random projection can be sampled without knowledge of y or X. Data oblivious projections are designed to produce ε-subspace embeddings for an arbitrary source data matrix with high probability. Our focus is on data oblivious random projections.

The Gaussian sketch was one of the first projections proposed for sketched regression (Sarlos, 2006). Recall that a Gaussian sketch is formed by independently sampling each element of S from a N(0, 1/k) distribution. The drawback of the Gaussian sketch is that computation of the sketched data is quite demanding, taking O(npk) operations.
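The sketch below illustrates the Gaussian sketch together with the complete (4.1) and partial (4.2) sketched estimators on simulated data; the dataset, sketch size k and variable names are illustrative. Forming the dense k × n matrix S explicitly makes the O(npk) cost and memory footprint of the Gaussian sketch visible.

set.seed(1)
n <- 2e4; p <- 10; k <- 500
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% rnorm(p)) + rnorm(n, sd = 2)

S    <- matrix(rnorm(k * n, sd = 1 / sqrt(k)), k, n)   # dense Gaussian sketch
X_sk <- S %*% X                                        # O(npk) cost to form the sketch
y_sk <- drop(S %*% y)

beta_F <- solve(crossprod(X), crossprod(X, y))             # full dataset least squares
beta_S <- solve(crossprod(X_sk), crossprod(X_sk, y_sk))    # complete sketching (4.1)
beta_P <- solve(crossprod(X_sk), crossprod(X, y))          # partial sketching (4.2)

c(complete = sqrt(sum((beta_S - beta_F)^2)),
  partial  = sqrt(sum((beta_P - beta_F)^2)))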
4.2.2 Sketches

There are two general categories of distributions for the random matrix $S$: data aware random projections and data oblivious random projections. A data aware random projection uses information in the source data $y$, $X$ to generate $S$. In contrast, a data oblivious random projection can be sampled without knowledge of $y$ or $X$. Data oblivious projections are designed to produce $\epsilon$-subspace embeddings for an arbitrary source data matrix with high probability. Our focus is on data oblivious random projections.

The Gaussian sketch was one of the first projections proposed for sketched regression (Sarlos, 2006). Recall that a Gaussian sketch is formed by independently sampling each element of $S$ from a $N(0, 1/k)$ distribution. The drawback of the Gaussian sketch is that computation of the sketched data is quite demanding, taking $O(npk)$ operations. As such, there has been work on designing more computationally efficient random projections. The Hadamard sketch and the Clarkson-Woodruff sketch are two examples of more efficient methods for generating $\epsilon$-subspace embeddings.

The Hadamard sketch is a structured random matrix (Ailon and Chazelle, 2009). The sketching matrix is formed as $S = \Phi H D/\sqrt{k}$, where $\Phi$ is a $k \times n$ matrix and $H$ and $D$ are both $n \times n$ matrices. The fixed matrix $H$ is a Hadamard matrix of order $n$. A Hadamard matrix is a square matrix with elements that are either $+1$ or $-1$ and orthogonal rows. Hadamard matrices do not exist for all integers $n$, but the source dataset can be padded with zeroes so that a conformable Hadamard matrix is available. The random matrix $D$ is a diagonal matrix where each nonzero element is an independent Rademacher random variable. The random matrix $\Phi$ subsamples $k$ rows of $H$ with replacement. The structure of the Hadamard sketch allows for fast matrix multiplication, reducing calculation of the sketched dataset to $O(nd \log k)$ operations.

The Clarkson-Woodruff sketch is a sparse random matrix (Clarkson and Woodruff, 2013). The projection can be represented as the product of two independent random matrices, $S = \Gamma D$, where $\Gamma$ is a random $k \times n$ matrix and $D$ is a random $n \times n$ matrix. The matrix $\Gamma$ is formed by choosing one element in each column independently and setting the entry to $+1$. The matrix $D$ is a diagonal matrix where each nonzero element is an independent Rademacher random variable. This results in a sparse $S$, where there is only one nonzero entry per column. The sparsity of the Clarkson-Woodruff sketch speeds up matrix multiplication, dropping the complexity of generating the sketched dataset to $O(nd)$.

4.2.3 Sketching examples

As examples, we demonstrate the construction of a Hadamard sketch and a Clarkson-Woodruff sketch for $k = 3$, $n = 4$. The Hadamard sketch matrix is formed as $S = \Phi H D/\sqrt{k}$, where $\Phi$ is a $k \times n$ matrix and $H$ and $D$ are both $n \times n$ matrices. The fixed matrix $H$ is a Hadamard matrix of order $n$. The random matrix $D$ is a diagonal matrix where each nonzero element is an independent Rademacher random variable. The random matrix $\Phi$ subsamples $k$ rows of $H$ with replacement. The display below shows an example of the random projection. The first matrix represents $\Phi H$, a subsample of three rows from a $4 \times 4$ Hadamard matrix. In step 2, the diagonal matrix $D$ is generated, with independent Rademacher random variables along the diagonal. In step 3 the matrix multiplication $\Phi H D$ is performed. This outputs the sketching matrix $S$ (up to the scaling factor $1/\sqrt{k}$).
\[
\Phi H = \begin{pmatrix} 1 & -1 & 1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{pmatrix}, \qquad
D = \mathrm{diag}(+1, -1, +1, +1), \qquad
\Phi H D = \begin{pmatrix} 1 & 1 & 1 & -1 \\ 1 & 1 & -1 & 1 \\ 1 & 1 & 1 & -1 \end{pmatrix}.
\]

The Clarkson-Woodruff sketch is a sparse random matrix. The projection can be represented as the product of two independent random matrices, $S = \Gamma D$, where $\Gamma$ is a random $k \times n$ matrix and $D$ is a random $n \times n$ matrix. The matrix $\Gamma$ is formed by choosing one element in each column independently and setting the entry to $+1$. The matrix $D$ is a diagonal matrix where each nonzero element is an independent Rademacher random variable. This results in a sparse $S$, where there is only one nonzero entry per column. The display below shows an example of the random projection. The first matrix represents $\Gamma$, a random matrix where a single element in each column is set to one. In step 2, the diagonal matrix $D$ is generated, with independent Rademacher random variables along the diagonal. In step 3 the matrix multiplication $\Gamma D$ is performed. This outputs the sketching matrix $S$.
\[
\Gamma = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix}, \qquad
D = \mathrm{diag}(+1, -1, +1, -1), \qquad
\Gamma D = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & -1 \\ 0 & -1 & 1 & 0 \end{pmatrix}.
\]

Figure 4.1 shows examples of the three sketches for $k = 32$, $n = 36$.

[Figure 4.1: Sampled sketching matrices $S$ for $k = 32$, $n = 36$. Panels: (a) Gaussian sketch, (b) Hadamard sketch, (c) Clarkson-Woodruff sketch. Elements in the sketching matrix are coloured based on their value: one and negative one are coloured black and white respectively, and intermediate values are in shades of grey.]
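Both projections can be generated in a few lines of code. The snippet below is a minimal sketch of the two constructions; it assumes scipy is available for the Hadamard matrix (which requires $n$ to be a power of two, consistent with the zero padding mentioned above), and all dimensions are illustrative.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(2)
n, k = 8, 3                    # n must be a power of two for scipy.linalg.hadamard

# Hadamard sketch: S = Phi H D / sqrt(k).
H = hadamard(n)                                        # fixed n x n Hadamard matrix
D = np.diag(rng.choice([-1, 1], size=n))               # Rademacher diagonal matrix
Phi = np.zeros((k, n))
Phi[np.arange(k), rng.integers(0, n, size=k)] = 1      # subsample k rows of H with replacement
S_hadamard = Phi @ H @ D / np.sqrt(k)

# Clarkson-Woodruff sketch: S = Gamma D, a single nonzero entry in each column.
Gamma = np.zeros((k, n))
Gamma[rng.integers(0, k, size=n), np.arange(n)] = 1    # assign each column to one row
S_cw = Gamma @ np.diag(rng.choice([-1, 1], size=n))

print(S_hadamard)
print(S_cw)
```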
4.2.4 Sketching bounds

Data oblivious sketches are designed to give an $\epsilon$-subspace embedding for an arbitrary source dataset with probability at least $(1 - \delta)$. Sketching algorithms are appealing for large $n$ problems as the required $k$ to attain the $(\epsilon, \delta)$ bound is independent of $n$ for the Gaussian and Clarkson-Woodruff sketches, and very weakly dependent on $n$ for the Hadamard sketch. Table 4.1 summarises existing results on the necessary $k$ to attain the $(\epsilon, \delta)$ bound. Probabilistic worst case bounds for sketched regression are formed by noting that if a sketch produces an $\epsilon$-subspace embedding with probability at least $(1 - \delta)$, then the bounds in Section 4.2.1 must hold with probability at least $(1 - \delta)$. Woodruff (2014) gives an excellent survey of work in this area. We consider embedding probabilities in more detail in Chapter 6.

Sketch               Sketching time    Required sketch size $k$
Gaussian             $O(ndk)$          $O((d + \log(1/\delta))/\epsilon^2)$
Hadamard             $O(nd\log k)$     $O((\sqrt{d} + \sqrt{\log n})^2\log(d/\delta)/\epsilon^2)$
Clarkson-Woodruff    $O(nd)$           $O(d^2/(\delta\epsilon^2))$

Table 4.1: Properties of different data oblivious random projections (Woodruff, 2014). The third column refers to the necessary sketch size $k$ to obtain an $\epsilon$-subspace embedding for an arbitrary $n \times d$ source dataset with probability at least $(1 - \delta)$.

As mentioned, data aware random projections can also be used to generate $\epsilon$-subspace embeddings. Existing data aware projections perform weighted sampling with replacement from the source dataset. As such, data aware sketching methods are closely related to resampling methods such as the bootstrap and the jackknife (Ma and Sun, 2015). We focus on data oblivious random projections, where there is no direct connection to resampling methods.

The Gaussian sketch is mathematically tractable, and it is possible to establish a number of exact finite sample results regarding the performance of the sketched estimators. In the next section, we obtain the exact distribution of $\beta_S$ and the bias and variance of $\beta_P$. This provides guidance on issues regarding the relative efficiency of complete to partial sketching.
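For a fixed source matrix, the embedding probability appearing in these $(\epsilon, \delta)$ statements can be estimated by simulation. The sketch below relies on the standard characterisation that $S$ is an $\epsilon$-subspace embedding for $A$ exactly when the eigenvalues of $U^T S^T S U$ lie in $[1 - \epsilon, 1 + \epsilon]$, where $U$ is an orthonormal basis for the column space of $A$ (the same characterisation used in the proofs of Chapter 5); the dataset, sketch size and $\epsilon$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k, eps = 2_000, 6, 200, 0.5

A = rng.normal(size=(n, d))                        # illustrative source data matrix
U, _, _ = np.linalg.svd(A, full_matrices=False)    # orthonormal basis for col(A)

def cw_sketch(n, k, rng):
    """Dense representation of a k x n Clarkson-Woodruff sketching matrix."""
    S = np.zeros((k, n))
    S[rng.integers(0, k, size=n), np.arange(n)] = rng.choice([-1.0, 1.0], size=n)
    return S

reps, hits = 500, 0
for _ in range(reps):
    SU = cw_sketch(n, k, rng) @ U
    eigs = np.linalg.eigvalsh(SU.T @ SU)           # eigenvalues of U^T S^T S U
    hits += (eigs.min() >= 1 - eps) and (eigs.max() <= 1 + eps)

print("estimated probability of an epsilon-subspace embedding:", hits / reps)
```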
4.3 Gaussian sketching

4.3.1 Complete sketching

The Gaussian sketch is mathematically tractable, and it is possible to establish a number of exact finite sample results regarding the performance of the sketched estimators. In this section we develop the distribution of $\beta_S$ when using a Gaussian sketch. As mentioned previously, all results treat $y$ and $X$ as fixed. The variability in $\beta_S$ is solely due to the use of the random sketching matrix $S$. Let $(\tilde{y}_j, \tilde{x}_j^T)^T$ refer to the $j$th row in the sketched data matrix $\tilde{A} = [\tilde{y}, \tilde{X}]$ for $j = 1, \ldots, k$. Similarly, let $s_j^T$ denote the $j$th row in the sketching matrix $S$. The sketched dataset consists of $k$ random units $(\tilde{y}_j, \tilde{x}_j^T)$ for $j = 1, \ldots, k$. The $j$th sketched response is given by $\tilde{y}_j = s_j^T y$, and the $j$th sketched predictor is calculated as $\tilde{x}_j = s_j^T X$ for $j = 1, \ldots, k$. The $k$ sketched instances are independently distributed, because the rows of the sketching matrix are independent.

We take an indirect route to find the distribution of $\beta_S$, by focusing on the distribution of the sketched data $\tilde{A} = [\tilde{y}, \tilde{X}]$ conditional on the original dataset $A = [y, X]$. The initial step is to decompose the joint distribution of the sketched responses and predictors as the product of a marginal and a conditional distribution. Specifically,
\[
p(\tilde{y}, \tilde{X} \mid y, X) = p(\tilde{y} \mid \tilde{X}, y, X)\, p(\tilde{X} \mid y, X).
\]
It can be shown that $p(\tilde{y} \mid \tilde{X}, y, X)\, p(\tilde{X} \mid y, X)$ has the structure of a hierarchical Gaussian linear model.

We first show that the sketched dataset has a multivariate normal distribution, conditional on the source dataset. This follows as the sketched dataset can be expressed as a linear combination of Gaussian random variables. Specifically, row $j$ in the sketched dataset is given by $\tilde{a}_j = (\tilde{y}_j, \tilde{x}_j) = s_j^T A$. To be clear, it is helpful to express $(\tilde{y}_j, \tilde{x}_j)$ as a column vector,
\[
\begin{pmatrix} \tilde{y}_j \\ \tilde{x}_j^T \end{pmatrix} = A^T s_j.
\]
Conditional on $A = [y, X]$, $A^T s_j$ is a linear combination of independent Gaussians, as $s_j \sim N(0, I_n/k)$. As affine transformations of Gaussians are also multivariate normal, $(\tilde{y}_j, \tilde{x}_j)$ must then be jointly normally distributed, conditional on the source data $(y, X)$. It is easily shown that the joint distribution of the sketched responses and predictors is then
\[
\begin{pmatrix} \tilde{y}_j \\ \tilde{x}_j^T \end{pmatrix} \Bigg|\; y, X \;\sim\; N\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\; \frac{1}{k}\begin{pmatrix} y^T y & y^T X \\ X^T y & X^T X \end{pmatrix}\right),
\]
independently for $j = 1, \ldots, k$.

Standard results on multivariate normals give that the conditional distribution of $\tilde{y}_j$ given $\tilde{x}_j$ is also normal. A routine calculation shows that the conditional mean is related to $\beta_F$, that is $\mathbb{E}_S[\tilde{y}_j \mid \tilde{x}_j, y, X] = \tilde{x}_j \beta_F$. The subscript $S$ is used on the expectation operator to emphasise that the only random quantity is the sketching matrix. The conditional variance is related to the prediction error on the source dataset $\mathrm{RSS}_F$,
\[
\mathrm{var}_S(\tilde{y}_j \mid \tilde{x}_j, y, X) = \frac{1}{k}\left[y^T y - y^T X (X^T X)^{-1} X^T y\right] = \frac{1}{k}\,\mathrm{RSS}_F.
\]
The subscript $S$ is again used to recognise that the source of the variance is the random sketching matrix; the source dataset is fixed. The second equality follows from the sum of squares partition in linear models (Searle, 1997, Chapter 3). Therefore, the conditional distribution of $\tilde{y}_j$ given the sketched predictors $\tilde{x}_j$ and the source dataset $y, X$ is
\[
\tilde{y}_j \mid \tilde{x}_j, y, X \sim N\!\left(\tilde{x}_j \beta_F, \frac{\mathrm{RSS}_F}{k}\right),
\]
independently for $j = 1, \ldots, k$. This is the exact form of a standard Gaussian linear model. The distribution $p(\tilde{X} \mid y, X)$ is easily obtained, as the marginal distribution of $\tilde{x}_j$ is also multivariate normal,
\[
\tilde{x}_j^T \sim N\!\left(0, X^T X/k\right),
\]
independently for $j = 1, \ldots, k$. The sketching process can be described using the following hierarchical model,
\[
\tilde{y} \mid \tilde{X}, y, X \sim N\!\left(\tilde{X}\beta_F, \frac{\mathrm{RSS}_F}{k} I_k\right), \qquad
\tilde{X} \mid y, X \sim \mathrm{MN}\!\left(0_{k \times p}, I_k, \frac{1}{k}X^T X\right).
\]
A Gaussian sketch effectively simulates a series of observations from a Gaussian linear model parametrised in terms of $\beta_F$ and $\sigma^2_F$, where the design matrix has a matrix normal distribution.
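This hierarchical representation can be checked by simulation: over repeated Gaussian sketches, the sketched residual $\tilde{y} - \tilde{X}\beta_F$ should have variance close to $\mathrm{RSS}_F/k$, and $\beta_S$ should be centred at $\beta_F$. The snippet below is a rough Monte Carlo check on simulated data; the sizes and number of replications are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 2_000, 4, 200

X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
beta_F = np.linalg.solve(X.T @ X, X.T @ y)
RSS_F = np.sum((y - X @ beta_F) ** 2)

betas, resid_vars = [], []
for _ in range(500):
    S = rng.normal(scale=1.0 / np.sqrt(k), size=(k, n))   # Gaussian sketch
    y_sk, X_sk = S @ y, S @ X
    betas.append(np.linalg.solve(X_sk.T @ X_sk, X_sk.T @ y_sk))
    resid_vars.append(np.mean((y_sk - X_sk @ beta_F) ** 2))

print("mean sketched residual variance:", round(np.mean(resid_vars), 4))
print("theoretical value RSS_F / k    :", round(RSS_F / k, 4))
print("max |mean(beta_S) - beta_F|    :", round(np.abs(np.mean(betas, axis=0) - beta_F).max(), 4))
```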
We now turn to the distribution of $\beta_S$. The distribution of $\beta_S$ conditional on the sketched predictors follows immediately from standard results on linear models (Searle, 1997, Chapter 3),
\[
\beta_S \mid \tilde{X}, y, X \sim N\!\left(\beta_F, \frac{\mathrm{RSS}_F}{k}(\tilde{X}^T\tilde{X})^{-1}\right). \tag{4.5}
\]
To obtain the marginal distribution of $\beta_S$ it is necessary to integrate over the random sketched design matrix $\tilde{X}$. From properties of the normal distribution (Eaton, 2007), it is possible to show
\[
(\tilde{X}^T\tilde{X}) \mid y, X \sim \mathrm{Wishart}(k, X^T X/k).
\]
As such,
\[
(\tilde{X}^T\tilde{X})^{-1} \mid y, X \sim \mathrm{InvWishart}\!\left(k, k(X^T X)^{-1}\right).
\]
As seen in equation (4.5), $\beta_S$ is normally distributed when conditioned on the random Inverse-Wishart matrix $(\tilde{X}^T\tilde{X})^{-1}$. The marginal distribution of $\beta_S$ can then be described using the Normal Inverse-Wishart distribution (Gelman et al., 2014, p. 73). The following theorem characterises the distribution of $\beta_S$ under the Gaussian sketch.

Theorem 4.2. Suppose $\beta_S$ is computed using a Gaussian sketch and $k > p + 1$. The conditional distribution of $\beta_S$ is
(i) $\beta_S \mid \tilde{X}, y, X \sim N\!\left(\beta_F, \frac{n\sigma^2_F}{k}(\tilde{X}^T\tilde{X})^{-1}\right)$.
The marginal distribution of $\beta_S$ is
(ii) $\beta_S \mid y, X \sim \mathrm{Student}\!\left(\beta_F, \frac{n\sigma^2_F}{k - p + 1}(X^T X)^{-1}, k - p + 1\right)$.
For the proof see Chapter 5.

An immediate application of result (i) is the ability to generate exact confidence intervals for the elements of $\beta_F$, methodology that does not appear to be present in the existing literature. It is also possible to construct exact joint confidence regions for the entire vector $\beta_F$. Again assuming that $k > p + 1$, it should be noted that the variance of $\beta_S$,
\[
\mathrm{var}(\beta_S \mid y, X) = \frac{\mathrm{RSS}_F}{k - p + 1}(X^T X)^{-1}, \tag{4.6}
\]
is not dependent on the compression ratio $k/n$. Although $\mathrm{RSS}_F$ can be expected to grow linearly with $n$, this will generally be counterbalanced by $(X^T X)^{-1}$ decreasing linearly with $n$. The distribution of the approximation error $\|\beta_S - \beta_F\|_2$ will largely be controlled by the target dimension $k$. This speaks to the defining characteristic of sketching algorithms: given a fixed $k$, the stochastic approximation error does not necessarily increase with the size of the original dataset $n$. Probabilistic worst case bounds on the error $\|\beta_S - \beta_F\|_2$ can also be obtained by making an appeal to Chebyshev's inequality.
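Equation (4.6) can be checked against the empirical variance of $\beta_S$ over repeated sketches. The short simulation below is illustrative only; the dimensions are kept small so that the Monte Carlo loop runs quickly.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, k = 1_000, 3, 50

X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
beta_F = np.linalg.solve(X.T @ X, X.T @ y)
RSS_F = np.sum((y - X @ beta_F) ** 2)

# Theoretical variance of beta_S under a Gaussian sketch, equation (4.6).
var_theory = RSS_F / (k - p + 1) * np.linalg.inv(X.T @ X)

betas = []
for _ in range(2_000):
    S = rng.normal(scale=1.0 / np.sqrt(k), size=(k, n))
    X_sk, y_sk = S @ X, S @ y
    betas.append(np.linalg.solve(X_sk.T @ X_sk, X_sk.T @ y_sk))
var_mc = np.cov(np.array(betas).T)              # empirical covariance over sketches

print("theoretical variances:", np.round(np.diag(var_theory), 4))
print("Monte Carlo variances:", np.round(np.diag(var_mc), 4))
```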
4.3.2 Partial sketching

Partial sketching was first proposed by Dhillon et al. (2013) using uniform subsampling, and later studied for general sketches by Pilanci and Wainwright (2016). Existing results on partial sketching highlight that the model sum of squares influences the approximation error of the partial sketched estimator $\beta_P$. An important finding in the previous section was that the variance of the complete sketching estimator is dependent on the residual sum of squares. It is simple to see that the variance of the partial sketched estimator will not be a function of the residual sum of squares. From the normal equations it holds that $X^T y = X^T X \beta_F$. Using this property, we see that conditional on $y, X$, the variance of the random linear combination
\[
\beta_P = (X^T S^T S X)^{-1} X^T y = (X^T S^T S X)^{-1} X^T X \beta_F
\]
will be a function of the covariates and the fitted values. The residual vector has no influence on the variance of the partial sketching estimator, and as such the variance of $\beta_P$ will not be related to the residual sum of squares. This suggests that when the noise level is high, partial sketching may become preferable to complete sketching. This idea has been touched on in the existing literature, but specific guidelines are lacking (Becker et al., 2015; Dhillon et al., 2013). A statistical analysis can provide some insight into this issue.

The hierarchical model for complete sketching gave an intuitive statistical perspective on the mechanics of the algorithm. Partial sketching seems to lack a similar conceptual device. The least squares coefficients can be represented as the solution to the linear system of equations $X^T X b = X^T y$. Partial sketching simply returns the solution, $b$, to the approximate linear system $\tilde{X}^T\tilde{X} b = X^T y$. Lacking a convenient representation for the estimator, we must proceed in a more pedestrian manner. The mean square error of the estimator $\beta_P$ can be determined using only mean and variance information, and this will be the goal for now.

The key observation is that
\[
(\tilde{X}^T\tilde{X})^{-1} \mid y, X \sim \mathrm{InvWishart}\!\left(k, k(X^T X)^{-1}\right).
\]
Conditional on $y, X$, the estimator $\beta_P = (\tilde{X}^T\tilde{X})^{-1} X^T y$ is a linear combination of the elements of an Inverse-Wishart random variable. However, this is a non-standard distribution and it is difficult to directly express the distribution function of $\beta_P$. Despite this, it is straightforward to determine the mean and variance of $\beta_P$. From properties of the Inverse-Wishart distribution, it can be seen that the partial sketched estimator is biased, with mean
\[
\mathbb{E}_S[\beta_P \mid y, X] = \frac{k}{k - p - 1}\,\beta_F,
\]
where it is assumed that $k > p + 3$. This motivates an alternative unbiased estimator
\[
\beta^*_P = \frac{k - p - 1}{k}(\tilde{X}^T\tilde{X})^{-1} X^T y.
\]
Determining the variance of $\beta_P$ and the unbiased $\beta^*_P$ is a more lengthy computation (see Chapter 5). Skipping the work, the variance of the biased estimator $\beta_P$ is
\[
\mathrm{var}(\beta_P \mid y, X) = \frac{k^2}{(k-p)(k-p-1)(k-p-3)}\left(\mathrm{MSS}_F (X^T X)^{-1} + \frac{k-p+1}{k-p-1}\,\beta_F\beta_F^T\right). \tag{4.7}
\]
The variance of the unbiased estimator $\beta^*_P$ is
\[
\mathrm{var}(\beta^*_P \mid y, X) = \frac{k-p-1}{(k-p)(k-p-3)}\left(\mathrm{MSS}_F (X^T X)^{-1} + \frac{k-p+1}{k-p-1}\,\beta_F\beta_F^T\right). \tag{4.8}
\]
The variances of $\beta_P$ and $\beta^*_P$ have a similar structure to the variance of $\beta_S$. The main point of difference is that the variance of $\beta_S$ depends on the residual sum of squares, whereas the variance of $\beta_P$ and $\beta^*_P$ depends on the model sum of squares. As mentioned, the explicit form of the sampling distribution is hard to obtain, but by making a connection with method of moments estimation it is possible to establish asymptotic normality of both $\beta_P$ and $\beta^*_P$ as $k$ tends to infinity. This motivates the construction of approximate confidence intervals. As the exact variance is unknown we propose the following estimator,
\[
\widehat{\mathrm{var}}(\beta^*_P \mid y, X) = \frac{k-p-1}{(k-p)(k-p-3)}\left(\frac{k-p-1}{k}\,\mathrm{MSS}_S(\tilde{X}^T\tilde{X})^{-1} + \beta^*_P\beta^{*T}_P\right). \tag{4.9}
\]

4.3.3 Relative efficiency

The relative efficiency of complete and partial sketching is also of interest. As the plug-in estimator $\beta_P$ has a higher mean square error than $\beta^*_P$, it will not be considered in this section. The performance of the complete sketching estimator $\beta_S$ and the unbiased partial sketched estimator $\beta^*_P$ will be compared in terms of mean squared error. As both $\beta_S$ and $\beta^*_P$ are unbiased, the mean squared error can be computed using their respective covariance matrices, that is
\[
\mathbb{E}_S\!\left(\|\beta_S - \beta_F\|_2^2 \mid y, X\right) = \mathrm{tr}(\mathrm{var}(\beta_S)), \qquad
\mathbb{E}_S\!\left(\|\beta^*_P - \beta_F\|_2^2 \mid y, X\right) = \mathrm{tr}(\mathrm{var}(\beta^*_P)).
\]
Comparing (4.6) and (4.8), the variance of $\beta^*_P$ is dependent on $\mathrm{MSS}_F$, whereas the variance of $\beta_S$ is dependent on $\mathrm{RSS}_F$. This suggests that the signal to noise ratio in the source dataset will be an influential factor in determining which estimator is more efficient. When $R^2_F$ is close to one we expect complete sketching to be orders of magnitude more efficient than partial sketching, and when $R^2_F$ is close to zero we expect partial sketching to be orders of magnitude more efficient than complete sketching.

4.3.4 Combined estimator

So far we have assumed that an analyst must choose between one of the two methods.
Obtaining both β∗P and βS from a single sketch is computationally cheap, and may be an attractive strategy. The most demanding operation with the sketched data is calculating (X˜TX˜)−1. Given this quantity it is economical to compute both βS and β ∗ P . Becker et al. (2015) mention they are presently investigating such a strategy, but do not give any details. Our motivation for a combined estimator is driven by the fact even when using a single sketch (y˜, X˜), the two estimators are uncorrelated, that is cov (β∗P ,βS) = 0. This is established by taking iterated expectations, and using the hierarchical model established in section 4.3.1 (see Chapter 5). A simple strategy is then to take a weighted combination of βS and β ∗ P . A combined estimator βC can be defined as βC = αβS + (1− α)β∗P , for some 0 < α < 1. The value of α that minimises the mean square error is αopt = tr(var(β∗P )) tr(var(β∗P )) + tr(var(βS)) . The weight given to the βS is related to the relative efficiency of the two estimators. Use of the weighted estimator is expected to be most beneficial when the signal to noise ratio is moderate, that is R2F ≈ 0.5. When the signal to noise ratio is either very high or very low, there is little gain from using the weighted estimator as either the complete or partial estimator will dominate. 78 4.4. Asymptotics 4.4 Asymptotics 4.4.1 Preliminaries Finite sample distributions of random projection estimators can be mathematically intractable, and as such asymptotic analysis can be a powerful tool (Li et al., 2006). It is a very difficult task to establish meaningful finite sample results for the Hadamard and Clarkson-Woodruff sketches, as they are discrete distributions over a very large combinatorial space. The explicit finite sample distribution of the sketched estimators can be written as a sum over all these possible combinations, but such a representation is not very informative. Instead, it is useful to study the large n distribution of the estimators βS and βP to obtain an interpretable expression. As βF is the estimand in sketching algorithms, this requires conditioning on the source data in the asymptotic analysis. We make no assumption on the nature of the data generating process. To elaborate, let A(n) = [y(n),X(n)] represent the n×d source data matrix of full column rank. Any source data matrix A(n) has a set of associated least squares coefficients, which will here be denoted β (n) F . The overall goal is to determine the asymptotic form of the distributions p(βS |A(n)) and p(β∗P |A(n)) for some arbitrary large dataset A(n). To take limits, we employ a fixed sequence of n× d datasets, all of rank d. In the regression scenario this amounts to assuming thatX(n) is of full column rank and that y(n) is not a perfect linear combination of the columns of X(n) for all n. Conditioning on A(n) is effectively the same as treating the full dataset as an arbitrary sequence of constants Aij for i = 1, . . . , n, j = 1, . . . , d. This is analogous to large sample results for regression models where the design matrix is treated as arbitrary set of constants, and the random variables of interest are the error terms, for example see Van Der Vaart (1998, section 2.5). Here the source dataset is treated as a sequence of constants and the random variables of interest are the elements of the sketching matrix. The asymptotic analysis is carried out in two stages. The initial step is to establish asymptotic normality of the sketched dataset. 
This is then followed by an analysis of the limiting distribution of βS , and β ∗ P . There is some related work by Ma et al. (2015) who develop asymptotic expressions for the bias and variance of data aware sketched regression estimators, where limits are taken in the sketch size k. Our work is different as we study data oblivious random projections and take limits in n, which is perhaps more natural in the Big Data setting. 4.4.2 Sketching central limit theorem Using a Gaussian sketch, treating the source data matrix A as fixed: [A˜|A] ∼MN(0, Ik,ATA/k). (4.10) Each row is statistically independent, and marginally normally distributed with covariance matrixATA/k. We would like to establish an analogous asymptotic result for the Clarkson-Woodruff and Hadamard sketches. In (4.10), the source dataset is treated as fixed. The original data matrix A can originate from any data generating process. The source dataset could arise from a spatial point process, a collection of time series or a Gaussian Markov random field. When we treat the source data matrix A as fixed we are completely agnostic as to the data generating process that led to the creation of A. We would like to establish an asymptotic result in this vein, where we make minimal assumptions on A. We have gone to considerable effort to establish a conditional central limit theorem, where the source dataset is treated as an arbitrary sequence of fixed constants. This is so our expedition into asymptopia remains tied as closely to reality as possible. In practice, the large dataset that we apply a random projection to will be a fixed file living on a server or hard drive. We would like to describe the distribution of the random sketched dataset in relation to the fixed source dataset. This is the distribution that is reported in (4.10) for the Gaussian sketch. We want to understand the asymptotic form of this conditional distribution for 79 4. Statistical properties of sketching algorithms the Hadamard and Clarkson-Woodruff projections. The only random variables in the sequence of distri- butions that we study are elements of the random projections. An unconditional central limit theorem for sparse sketching matrices with independent entries is given in Li et al. (2006). We cannot easily extend their method as we wish to establish a conditional central limit theorem and the Clarkson-Woodruff sketch and the Hadamard sketch have dependent entries. Our method of proof differs in many significant aspects. Under some regularity conditions the Hadamard and Clarkson-Woodruff sketches produce sketched data that asymptotically has the same matrix normal distribution as under the Gaussian sketch. Al- though asymptotic normality may not be particularly surprising seeing as the sketched data are linear combinations of random vectors, the proof is not immediate due to the dependence in the Hadamard and Clarkson-Woodruff sketches and our conditional perspective. The difficulties we face are most easily illustrated for the Clarkson-Woodruff sketch. Algorithm 4.1 Clarkson-Woodruff sketch A˜← 0 Initialise sketched dataset as k × d matrix of zeroes For i = 1 to i = n Sample z ∼ Uniform(1, . . . , k) Sample random index Sample r ∼ Uniform(−1,+1) Sample random sign A˜z ← r ×Ai + A˜z Multiply by r and add to row z in sketch Output A˜ Output sketched dataset The behaviour of the Clarkson-Woodruff sketch can be represented as a many to less mapping. Each row in the source dataset is assigned to a single row in the sketched dataset. 
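For concreteness, Algorithm 4.1 can be transcribed almost line for line into numpy. The function below is an illustrative sketch of the streaming construction (the function name and dimensions are arbitrary), not the implementation used for the experiments in this chapter.

```python
import numpy as np

def clarkson_woodruff_stream(A, k, rng):
    """One pass Clarkson-Woodruff sketch of an n x d array A, following Algorithm 4.1."""
    n, d = A.shape
    A_sk = np.zeros((k, d))              # initialise sketched dataset as k x d zeroes
    for i in range(n):                   # single pass over the rows of the source dataset
        z = rng.integers(0, k)           # sample a random row index of the sketch
        r = rng.choice([-1.0, 1.0])      # sample a random sign
        A_sk[z] += r * A[i]              # add the signed source row to row z of the sketch
    return A_sk

rng = np.random.default_rng(6)
A = rng.normal(size=(1_000, 3))          # arbitrary source dataset
print(clarkson_woodruff_stream(A, k=20, rng=rng).shape)
```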
The Clarkson-Woodruff sketch has an alternative streaming construction that highlights this property, given in Algorithm 4.1. As each row in the source dataset only contributes to a single row in the sketched dataset, it might be expected that this results in some statistical dependence amongst the rows of the sketched dataset. The concept of dependence here is again conditional on the source dataset A being known. The independence in (4.10) refers to the fact that knowledge of any particular row in the sketched dataset does not help to predict any other row in the sketched dataset, given that we know the full source dataset A. This definition of independence has interesting implications for the Clarkson-Woodruff sketch. Suppose that we have the following source dataset where n = 3 and d = 2: A =  10 10 1 −1 0.1 0.1  (4.11) Suppose that we take a Clarkson-Woodruff sketch of size k = 2 to obtain the k × d matrix A˜. Suppose we reveal the first row to an outside observer who then has to predict the second row in the sketched data matrix. The observer knows A˜ = 11 9 ? ?  (4.12) Conditional on the observer having access to the source data matrix A (4.11) they could easily reason that the second row in A˜ will either be (0.1, 0.1) or (−0.1,−0.1). The observer can piece this together by realising that rows 1 and 2 in the source dataset must have been assigned to the first row in the sketched dataset. Knowledge ofA and the first row in A˜ is enough to reverse engineer the sketched mapping and to predict the second row in the sketched data matrix. For asymptotic equivalence with the Gaussian sketch, we require no information gain from revealing any row of the sketched data matrix given that we have access to the full source dataset A. The predictive ability of any row in the sketched data matrix needs 80 4.4.2. Sketching central limit theorem to dissipate with n even after accounting for the fact that we are always conditioning on knowing the full source dataset A. We will see that we can obtain asymptotic independence of the sketched dataset rows conditional on knowing the full source dataset under mild regularity conditions. Additionally, although it seems each row in the sketched dataset will be marginally normally distributed, it is not clear if joint asymptotic normality over all rows will hold. Similar conundrums arise when examining the Hadamard sketch in detail. The k × d random matrix A˜ is the output of a stochastic process governed by the fixed n× d source dataset A(n) and the distribution of the random k × n sketching matrix S. The sketched dataset is a linear combination of random vectors, the number of which increases with n. As such, we can expect A˜ to demonstrate some stable limiting behaviour as n grows larger. Under an assumption on the limiting leverage scores of the source data matrix, we can establish a central limit theorem for the sketched dataset. Recall the singular value decomposition of the source dataset A(n) = U(n)D(n)V T (n). The leverage scores for observation i in the source dataset is defined as ‖u(n)i‖22 where uT(n)i gives row i in U(n). The leverage scores of the observations in the source data matrix have been identified an important structural property of sketching algorithms (Mahoney and Drineas, 2016). Assumption 1 highlights their role in establishing asymptotic normality of the sketched data matrix. Assumption 1 Let the singular value decomposition of the n × d source dataset be given by A(n) = U(n)D(n)V T (n). 
Let $u_{(n)i}^T$ give the $i$th row in $U_{(n)}$. Assume that the maximum leverage score tends to zero, that is
\[
\lim_{n \to \infty} \max_{i = 1, \ldots, n} \|u_{(n)i}\|_2^2 = 0.
\]

Theorem 4.3 gives the sketching central limit theorem.

Theorem 4.3. Consider a fixed sequence of arbitrary $n \times d$ data matrices $A_{(n)}$, where $d$ is fixed. Let $A_{(n)} = U_{(n)}D_{(n)}V_{(n)}^T$ represent the singular value decomposition of $A_{(n)}$. Let $S$ be a $k \times n$ Hadamard or Clarkson-Woodruff sketching matrix where $k$ is also fixed. Suppose that Assumption 1 on the maximum leverage score is satisfied. Then as $n$ tends to infinity with $k$ and $d$ fixed,
\[
[\tilde{A}V_{(n)}D_{(n)}^{-1} \mid A_{(n)}] \xrightarrow{d} \mathrm{MN}(0, I_k, I_d/k).
\]
The proof of Theorem 4.3 is given in Chapter 5. Heuristically, for large $n$ we expect the matrix normal result (4.10) to approximately hold for the Hadamard and Clarkson-Woodruff sketches.

The significance of Assumption 1 is perhaps best explained by making a connection to a version of the Lindeberg-Feller theorem for triangular arrays of uniformly bounded random variables (Billingsley, 1999).

Theorem 4.4 (Billingsley, 1999). For each $n \in \mathbb{N}$, let $Z_{n1}, Z_{n2}, \ldots, Z_{nr_n}$ be a sequence of independent random variables with $\mathbb{E}(Z_{ni}) = 0$ and $\mathrm{var}(Z_{ni}) = \sigma^2_{ni}$ for $i = 1, \ldots, r_n$. Let $s_n^2 = \sum_{i=1}^{r_n}\sigma^2_{ni}$ and assume that $r_n \to \infty$ as $n \to \infty$. Suppose that we can form a sequence of upper bounds $(K_n)_{n \in \mathbb{N}}$ such that $|Z_{ni}| \le K_n$ almost surely for $i = 1, \ldots, r_n$. Then if $K_n/s_n \to 0$ as $n \to \infty$ we have the convergence in distribution
\[
\frac{1}{s_n}\sum_{i=1}^{r_n} Z_{ni} \xrightarrow{d} N(0, 1).
\]

In Theorem 4.4 the condition that $K_n/s_n \to 0$ ensures that no random variable in a particular row of the array has too much pull over the sum $\sum_{i=1}^{r_n} Z_{ni}$. A triangular array of random variables satisfying the conditions in Theorem 4.4 is often said to be uniformly asymptotically negligible, in that no single term has undue influence over the random sum. We can make an analogy to the leverage score condition in the sketching central limit theorem (Theorem 4.3). The sum of the statistical leverage scores is always equal to the rank of the source dataset. As we have assumed that each dataset in the sequence is of rank $d$, we have that $\sum_{i=1}^n \|u_{(n)i}\|_2^2 = d$ for all $n$. As $n$ grows we need the maximum contribution from a single term in the sum to tend to zero. The limiting leverage scores must satisfy an asymptotic negligibility condition, so that each individual observation provides a vanishingly small contribution to the total sum of the leverage scores.

Given a source dataset with centred columns, the leverage scores have a particularly intuitive interpretation in terms of the principal components decomposition of the source dataset. The row vector $u_{(n)i}^T D_{(n)}$ gives the coordinates of observation $i$ on the principal component axes. The elements of the vector $u_{(n)i}$ give the coordinates of observation $i$ in a scaled system where the variance along each principal coordinate axis is set to be one. If we think of the source dataset as a point cloud in Euclidean space, Assumption 1 essentially implies that there are no extreme outliers as $n$ tends to infinity. Each observation must have a negligible contribution to the total variance along each principal component axis.

[Figure 4.2: Abalone dataset ($n_{\mathrm{full}} = 4{,}167$) with principal component axes. Panel (a): mean centred length (mm) against mean centred total weight (grams). Panel (b): maximum leverage score against the number of observations ($n$).]

We consider a real dataset to illustrate this concept.
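Before turning to that illustration, note that the leverage scores are cheap to compute from a thin singular value decomposition. The snippet below is a small illustrative check on simulated data; it stands in for, but does not reproduce, the abalone example that follows.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(500, 2))            # stand-in for a mean centred source dataset

def max_leverage(A):
    """Maximum leverage score of the rows of A, computed from the thin SVD."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return np.max(np.sum(U ** 2, axis=1))

# Track the maximum leverage score as observations are added sequentially.
for n in (50, 100, 200, 500):
    print(n, round(max_leverage(A[:n]), 4))
```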
Figure 4.2 shows a dataset consisting of physical measurements on abalone. Each variable has been mean centred. The first and second principal components are shows as a solid and dashed lines respectively. We define a sequence of datasets A(n) by taking the first n rows in the full dataset. Panel (b) shows the maximum leverage score against n, where we move through the dataset sequentially. We compute the maximum leverage score at each value of n for n = 10, 11, . . . , nfull. The maximum leverage score appears to be heading towards zero as n increases. Assumption 1 rules out gross outliers in the source dataset. The proof of Theorem 4.3 relies heavily on the fact that the random variables that are involved in the construction of the Clarkson-Woodruff sketch and the Hadamard sketch are bounded. The random projections use random variables that take values in {−1, 0,+1}. A second important factor is that the leverage scores of an arbitrary data matrix A(n) = U(n)D(n)V T (n) are also bounded. For all n ∈ N, ‖u(n)i‖22 ≤ 1 for all i = 1, . . . , n. We can thus study a sequence of bounded random matrices. Another key factor in the proof is that Hadamard matrices have a number of symmetry properties that lead to a pairwise independence structure in the random projection. These issues are discussed in more detail in Chapter 5. 82 4.4.3. Sketching estimators 4.4.3 Sketching estimators The central limit theorem for the sketched data suggests that the results about βS and βP for the Gaussian sketch will also approximately hold for the Hadamard and Clarkson-Woodruff sketches for large n. In order to establish convergence of the estimators it helps to adopt an extra assumption on the sequence of source datasets. Assumption 2: lim n→∞n −1  yT(n)y(n) yT(n)X(n) XT(n)y(n) X T (n)X(n)  = Q for some positive-definite matrix Q. It is worth discussing the significance of the limiting matrix Q. A useful comparison can be made to asymptotic theory for regression models, where a common assumption is that the design matrix satisfies the limit condition n−1XT(n)X(n) → B, where B is some positive definite matrix (White, 1984; Greene, 1997). The development of asymptotic results is often eased by treating the covariates as a random sample, although this requires positing a realistic probability model for the covariates, which may be difficult. Treating the covariates as an arbitrary fixed sequence relaxes this assumption and covers more general scenarios. Although it is possible to establish asymptotic results when n−1XT(n)X(n) is not required to converge to any fixed matrix, proofs can become very technical (Fahrmeir and Tutz, 1994, Appendix A.2). Imposing a limiting value for n−1XT(n)X(n) simplifies arguments and can be seen as a compromise between making strong and weak assumptions about the covariates (Fahrmeir and Tutz, 1994, p.46). There is an analogous motivation for Assumption 2, the limiting matrix Q is present to avoid specifying a probability model for the source dataset, without overcomplicating the mathematical analysis. Setting up a limit theorem requires a little extra care with notation. As we have a sequence of datasets A(n), there is a corresponding sequence of optimal least squares coefficients β (n) F . Similarly, there is a sequence of squared residual errors RSS (n) F and model sum of squares MSS (n) F . As the sequence of datasets are fixed, β (n) F , RSS (n) F and MSS (n) F are a deterministic sequence. 
Under the assumptions 1 and 2, it is possible to establish an asymptotic result for βS and βP . Theorem 4.5. Suppose that Assumptions 1 and 2 hold, k ≥ p, and βS is computed using a Hadamard or Clarkson-Woodruff sketch. Let (X˜TX˜)+ denote the Moore-Penrose pseudo-inverse of (X˜TX˜). Let H˜(n) = RSS (n) F k ( X˜TX˜ )+ and H(n) = RSS (n) F k − p+ 1 ( XT(n)X(n) )−1 . Then as n→∞, convergence in distribution holds for (i)[H −1/2 (n) (βS − β(n)F )|A(n)]→ Student (0, Ip, k − p+ 1) , (ii)[H˜ −1/2 (n) (βS − β(n)F ) |A(n)]→ N (0, Ip) . The proof is in Chapter 5. For large n, we expect βS to be approximately distributed as per Theorem 4.5 for both the Hadamard and Clarkson-Woodruff sketches. It is harder to establish a comparable limit theorem for β∗P , due to the non-standard distribution of β∗P when using a Gaussian sketch. There is no typical normalised distribution to target. Instead, we wish to show asymptotic equivalence in moments. The partially sketched estimator under the Hadamard and Clarkson-Woodruff sketches should have similar mean and variance properties to the Gaussian partially sketched estimator. An extra assumption has to be made to show convergence in moments. A sufficient condition is a stability condition on the singular values of the sketched data matrix. Assumption 3. Let G be the Gram matrix of the scaled sketched dataset, G = n−1X˜TX˜. Assume that the sequence of source datasets is such that ES ( 1 σ2min(G) )2 is finite for large enough n. This additional regularity condition enables a formal limit theorem regarding the moments of β∗P . 83 4. Statistical properties of sketching algorithms Theorem 4.6. Suppose that Assumptions 1, 2 and 3 hold, k > p + 3, and β∗P is computed using a Hadamard or Clarkson-Woodruff sketch. Let H(n) = (k − p− 1) (k − p)(k − p− 3) ( MSS (n) F (X T (n)X(n)) −1 + (k − p+ 1) (k − p− 1)β (n) F β (n)T F ) . Then as n→∞, (i) ES [β ∗ P − β(n)F |A(n)] → 0. (ii) varS ( H −1/2 (n) (β ∗ P − β(n)F ) |A(n) ) → Id The proof is in Chapter 5. Once again, the heavy notation may obscure the essence of the result. The subscript S is used to emphasise that the only source of randomness is the sketching matrix, and that the source dataset is fixed. The theorem suggests that the bias and variance of β∗P under the Clarkson- Woodruff and Hadamard sketches should be approximately equal to that under the Gaussian sketch. Specifically, we expect equations (4.6), (4.7), and (4.8) to be good approximations for the variance of the sketched estimators using the Hadamard or Clarkson-Woodruff sketches. The results here are meant to be useful heuristics to assess the uncertainty attached to the output of the randomised approximation algorithm. There is a need to communicate and quantify the approximation error of sketching algorithms to end users, and the asymptotic results developed in this section can be of use. 4.5 Data application 4.5.1 Human leukocyte antigen dataset We compared the performance of the sketching estimators on a real genetic dataset taken from the UK Biobank database. We use a small extract from the data in Astle et al. (2016). The selected response variable was mean red cell volume (MCV), taken from the full blood count assay and adjusted for various technical and environmental covariates. Genome-wide imputed genotype data in expected allele dose format were available on n = 132,353 study subjects (Howie et al., 2009). 
We consider 1000 genetic variants in the Human leukocyte antigen (HLA) region of chromosome 6, selected so that no pair of variants had a Pearson correlation of allelic scores greater than 0.8. The region was chosen as many associations were discovered in a genome-wide scan using univariable models; these associations were with variants with different allele frequencies, suggesting multiple distinct causal variants in the region. The aim is to perform a multivariable regression analysis to obtain variant effect size estimates that are conditional on the other variants in the region.

An early theoretical finding was that the partial sketched estimator $\beta_P$ was biased. One thousand sketches were taken to estimate the bias $\mathbb{E}_S(\beta_P - \beta_F)$ with $k = 1{,}500$. We also computed the bias corrected estimator $\beta^*_P$ in each replication. Figure 4.3 plots the average value of the estimators against the true value of the least squares coefficient using the full dataset. The top row (a)-(c) shows results for $\beta_P$, and the bottom row (d)-(f) shows results for $\beta^*_P$. The first, second and third columns display the results for the Gaussian, Hadamard and Clarkson-Woodruff sketches respectively. The solid line in each panel is the identity line. The dashed line in the top row shows the theoretical bias, having slope $k/(k - p - 1)$. The results in the top row show that $\beta_P$ is biased for each of the random projections. The bias closely matches the theoretical factor. The bottom row shows that the adjusted estimator $\beta^*_P$ appears to be unbiased, with the mean values falling closely along the identity line.

[Figure 4.3: Bias of sketching estimators on the HLA dataset. Average sketched dataset coefficients are plotted against the full dataset coefficients; in this scenario $n = 132{,}353$, $p = 1000$, $k = 1500$. Panels (a)-(c) show $\beta_P$ and panels (d)-(f) show $\beta^*_P$ for the Gaussian, Hadamard and Clarkson-Woodruff sketches respectively. The solid line is the identity line and the dashed line represents the theoretical bias factor.]
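The inflation factor $k/(k - p - 1)$ is easy to reproduce on simulated data. The sketch below is illustrative only: the dimensions are far smaller than the HLA analysis and a Gaussian sketch is used for speed.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, k = 2_000, 20, 100

X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
beta_F = np.linalg.solve(X.T @ X, X.T @ y)
Xty = X.T @ y

# Average beta_P over repeated Gaussian sketches and compare the coefficient
# inflation with the theoretical bias factor k / (k - p - 1).
reps = 1_000
est = np.zeros(p)
for _ in range(reps):
    X_sk = rng.normal(scale=1.0 / np.sqrt(k), size=(k, n)) @ X
    est += np.linalg.solve(X_sk.T @ X_sk, Xty)
est /= reps

print("fitted slope          :", round(np.polyfit(beta_F, est, 1)[0], 3))
print("theoretical k/(k-p-1) :", round(k / (k - p - 1), 3))
```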
We also compared the complete and partially sketched estimators on mean square error and the coverage of confidence intervals at $k = 1{,}500$ and $k = 10{,}000$. We did not consider a combined estimator as the small $R^2_F$ value would give an optimal complete sketching weight of close to zero. Table 4.2 reports the mean square error for each of the estimators. The signal to noise ratio is quite low for this dataset, with $R^2_F = 0.02$. The relative efficiency bound dictates that partial sketching will be much more efficient than complete sketching on this dataset. The simulation results support this idea, with $\beta^*_P$ having a mean square error roughly sixty times smaller than $\beta_S$ at both values of $k$. Results are very similar for each of the random projections, suggesting that the asymptotic approximations are reasonable for this dataset. For $k = 1{,}500$, the mean square error of $\beta_P$ is approximately ten times that of $\beta^*_P$. For $k = 10{,}000$, there is less of a difference, as the ratio $k/(k - p - 1)$ is closer to one. The bias adjusted estimator $\beta^*_P$ has significant advantages over $\beta_P$ when $k/(k - p - 1)$ is appreciably larger than one.

                      k = 1,500                              k = 10,000
                      $\beta_S$   $\beta_P$   $\beta^*_P$    $\beta_S$    $\beta_P$      $\beta^*_P$
Gaussian              235 (2)     39 (0.4)    3.9 (0.04)     13.1 (0.1)   0.28 (0.002)   0.214 (0.001)
Hadamard              233 (2)     39 (0.4)    3.8 (0.04)     12.5 (0.1)   0.27 (0.002)   0.204 (0.002)
Clarkson-Woodruff     237 (2)     39 (0.5)    4.0 (0.05)     13.1 (0.1)   0.27 (0.002)   0.212 (0.002)

Table 4.2: Mean square error of sketched estimators on the HLA dataset. Standard errors are in brackets.

Table 4.3 summarises the coverage of 95% confidence intervals for the sketched estimators. We report the overall proportion of intervals that contained the true value of the least squares estimate $\beta_F$ over the two hundred and fifty sketches and $p = 1{,}000$ coefficients. The observed coverage is close to the nominal level of 0.95 at both levels of $k$. The different random projections give very similar results, suggesting that the use of asymptotic approximations is again reasonable on this dataset. The intervals for the Hadamard sketch appear to be slightly conservative at $k = 10{,}000$. This may be due to the specialised random number generator used in the implementation of the Hadamard sketch. The Rademacher random variables are only four-wise independent as opposed to being mutually independent (Geppert et al., 2017, p. 85).

                      k = 1,500                  k = 10,000
                      $\beta_S$   $\beta^*_P$    $\beta_S$   $\beta^*_P$
Gaussian              0.950       0.951          0.950       0.950
Hadamard              0.949       0.952          0.953       0.953
Clarkson-Woodruff     0.950       0.951          0.950       0.950

Table 4.3: Coverage of confidence intervals on the HLA dataset. The largest standard error is 0.001.

4.5.2 Flights dataset

The sketching algorithms were also evaluated on the New York flights dataset available in the R package nycflights13 (Wickham, 2014). Arrival delay was taken as the response, and departure delay, distance, departure time, origin, month and day were chosen to be the covariates. Rows of the dataset with missing data were omitted, leaving $n = 327{,}346$ and $p = 47$. We fit a saturated linear model with the 47 predictors. The goal was to compare the accuracy of the various sketches on real data rather than to build a statistical model for the flights dataset. We compared the mean square error of the estimators and the coverage of confidence intervals for $k = 5{,}000$. In contrast to the HLA dataset, the flights dataset has a very high $R^2_F$ value of 0.99. We took one thousand sketches to compare complete and partial sketching. Table 4.4 reports the mean square error of $\beta_S$, $\beta_P$ and $\beta^*_P$. As expected, complete sketching has a much smaller mean square error than partial sketching. Table 4.5 summarises the coverage rates of the 95% confidence intervals.

                      $\beta_S$   $\beta_P$       $\beta^*_P$
Gaussian              62 (1)      15,200 (300)    14,400 (300)
Hadamard              59 (1)      14,600 (300)    14,800 (300)
Clarkson-Woodruff     59 (1)      14,700 (300)    14,100 (300)

Table 4.4: Mean square error of sketched estimators on the flights dataset with $k = 5000$. Standard errors are in brackets.

                      $\beta_S$   $\beta^*_P$
Gaussian              0.950       0.954
Hadamard              0.950       0.949
Clarkson-Woodruff     0.948       0.952

Table 4.5: Coverage of 95% confidence intervals on the flights dataset with $k = 5000$. The largest standard error is 0.004.
We report the overall proportion of intervals that contained the true value of the least squares estimate over the thousand sketches and p = 47 coefficients. We also generated a synthetic flights dataset with an R2F of close to 0.5. This was achieved by generating a synthetic response vector y′ using the fitted model. The simulated response was computed as y′ = XβF + 15e, where e was the residual vector from the least squares fit e = y −XβF . We took one thousand sketches and computed βS , β ∗ P and a weighted estimator using the optimal weight αopt in each iteration. Table 4.6 reports the mean square error for each estimator. From the theoretical analysis the mean square error of the weighted estimator is expected to be roughly half that of βS or β ∗ P . The results support this for each of the three different sketches. We also assessed the finite sample behaviour of the normal approximation in Theorem 4.3 at different levels of k and p. We dropped some predictors from the full flights dataset to give smaller datasets with p = 10 and p = 25 covariates. We then took subsamples of different sizes from each of the datasets. A single subsample was taken at each value of n, so the same subsampled dataset was being sketched each 86 4.6. Discussion βS β ∗ P βC with α = αopt Gaussian 13,500 (300) 13,700 (300) 7,000 (150) Hadamard 13,100 (300) 14,300 (300) 6,900 (150) Clarkson-Woodruff 13,600 (300) 14,600 (300) 7,000 (150) Table 4.6: Mean square error of sketched estimators on synthetic flights dataset with k = 5000. Standard errors are in brackets. time. One thousand sketches were taken of each dataset at different values of k. We tested the joint multivariate normality of [y˜, X˜] and the normality of the sketched residual e˜ = S(y−XβF ). The squared Mahalanobis distance of the sketched observations was compared to the theoretical χ2-distribution. As n increases the rejection rate is expected to fall to the type one error rate of 0.05. Figure 4.4 plots the proportion of times the null hypothesis of normality is rejected against the size of the source dataset. The Hadamard sketch appears to have a much faster rate of convergence than the Clarkson-Woodruff sketch. When using a Hadamard sketch, each row in the sketched dataset is a linear combination of n observations from the source dataset. When using a Clarkson-Woodruff sketch, each row in the sketched dataset is expected to be a combination of only n/k observations from the source dataset. As such, n/k must be large for the normal approximation to hold. As expected, the rejection rate for the Clarkson- Woodruff sketch increases with k, but remains stable for the Hadamard sketch. In Figure 4.4 the rejection rate for the Clarkson-Woodruff sketch increases with p. The Hadamard sketch seems to be less sensitive to the number of covariates. The extra log k computation cost associated with the Hadamard sketch (Table 4.1) appears to have the benefit of accelerated convergence to normality. Even though joint normality may not be holding for the Clarkson-Woodruff sketch for the flights dataset, the coverage of the confidence intervals is still very good. As y˜ = X˜βF + e˜, normality of the sketched residual is perhaps sufficient in justifying the approximate confidence intervals given by Theorem 4 (ii). The sketched residual converges much more quickly than the full sketched data matrix, which perhaps explains the good coverage properties of the confidence intervals for βS in Table 4.5. 
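A rough version of the normality diagnostic described above can be written in a few lines. The snippet below uses simulated data, a Clarkson-Woodruff sketch and a Kolmogorov-Smirnov comparison of the squared Mahalanobis distances to a $\chi^2_d$ reference; these are illustrative choices and not the exact testing procedure behind Figure 4.4.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, d, k = 50_000, 5, 100

A = rng.normal(size=(n, d))                    # stand-in for the source dataset [y, X]
Sigma_inv = np.linalg.inv(A.T @ A / k)         # covariance of a sketched row under (4.10)

# Clarkson-Woodruff sketch via the streaming construction.
rows = rng.integers(0, k, size=n)
signs = rng.choice([-1.0, 1.0], size=n)
A_sk = np.zeros((k, d))
np.add.at(A_sk, rows, signs[:, None] * A)

# Squared Mahalanobis distances of the sketched rows; these should be
# approximately chi-squared with d degrees of freedom if (4.10) holds.
d2 = np.einsum("ij,jk,ik->i", A_sk, Sigma_inv, A_sk)
print(stats.kstest(d2, stats.chi2(df=d).cdf))
```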
[Figure 4.4: Proportion of times the null hypothesis of normality is rejected at $\alpha = 0.05$ against the size of the source dataset ($n$), for the Hadamard (solid line) and Clarkson-Woodruff (dashed line) sketches. Columns correspond to $k = 500, 1500, 5000$ and rows to $p = 10, 25, 46$. Results for tests of the sketched residual vector $\tilde{e} = S(y - X\beta_F)$ are plotted as triangles, and results for tests of the entire sketched dataset $[\tilde{y}, \tilde{X}]$ are plotted as circles. The horizontal line gives the type 1 error of 0.05. The y-axis is on a log scale.]

4.6 Discussion

Sketching algorithms have emerged in the computer science community as a powerful device for the analysis of massive datasets (Mahoney and Drineas, 2016). Sketched regression algorithms use random projections to reduce the size of the original dataset; the sketched dataset is then used to estimate the optimal least squares coefficients. Most existing theory for sketched regression is from an algorithmic worst case perspective and connects with random matrix theory and computational geometry (Raskutti and Mahoney, 2014; Thanei et al., 2017). In this chapter we have provided a complementary statistical perspective and derived new tools for assessing the uncertainty attached to sketched estimators, as well as guidelines for choosing between competing sketching algorithms.

Iterative methods, in particular stochastic gradient descent, have not been mentioned so far. For large $n$ regression problems, stochastic gradient descent will produce iterates that converge to $\beta_F$ under very mild conditions. Comparisons between single pass sketching and stochastic gradient methods are difficult, as the two techniques are not formulated for the exact same purpose. Single pass sketching algorithms are designed to return an approximate solution in finite time with probabilistically controlled error, whereas stochastic gradient methods are designed to converge to the exact solution asymptotically. It is perhaps more appropriate to compare stochastic gradient descent to iterative sketching methods, as iterative sketching algorithms also come with convergence guarantees to $\beta_F$ (Pilanci and Wainwright, 2016; Gower and Richtárik, 2015). Iterative sketching methods make use of approximate second order information that can lead to a potential improvement compared to first order stochastic gradient methods (Roosta-Khorasani and Mahoney, 2016). Our focus has been on characterising the approximation error attached to single pass sketching estimators. The spectral theory developed in Chapter 6 has applications to iterative sketching algorithms, although that is not the focus of the chapter.

There has been recent work in adapting sketching methods for statistical inference in large datasets, building from the worst case bounds in the computer science literature. Geppert et al. (2017) and Bardenet and Maillard (2015) investigate sketching algorithms for Bayesian regression, and derive bounds on the difference between the sketched posterior distribution and the full data posterior distribution. Yang et al. (2015b) consider sketched penalised regression, and give bounds between the sketched solution and the full data solution similar to the results in Section 4.2.1. Only complete sketching is considered in the aforementioned work.
The results on the advantages of partial sketching in this paper could motivate adaptations that make use of the exact marginal associations XTy. Sketching ideas have been used to develop methods for approximate non-linear regression (Avron et al., 2014; Banerjee et al., 2013). A related branch of work uses random projections to reduce the number of predictors in regression and classification problems (Shah and Meinshausen, 2013; Cannings and Samworth, 2015; Guhaniyogi and Dunson, 2015). Sketching can also be used to ensure privacy when sharing sensitive datasets (Zhou et al., 2009). 88 4.6. Discussion Acknowledgement This work has been conducted using the UK Biobank resource under applications number 13745. Many thanks to Rajen Shah for helpful discussions. 89 5Proofs regarding sketching algorithms Summary This Chapter contains proofs for the results in Chapter 4. We restate the major results before giving proofs. 5.1 Proof of Theorem 4.1 (Worst case bound for partial sketching) Theorem. Suppose that X˜ is an -subspace embedding of X with 0 <  < 0.5. Then the following bound holds, ‖βP − βF ‖22 ≤ 42 σ2min(X) MSSF . Let the singular value decomposition ofX be given byX = UDV T. The singular value decomposition will help to simplify expressions in later working. If the sketching matrix S is an -subspace embedding for the source dataset with 0 <  < 1, then UTSTSU is necessarily invertible. The expression for βP can then be simplified to βP = V D −1(UTSTSU)−1D−1V TXTy = V D−1(UTSTSU)−1D−1V TV DUTy = V D−1(UTSTSU)−1UTy. Similarly, βF can be written as βF = V D −1UTy. The Euclidean norm of the approximation error can thus be expressed as ‖βP − βF ‖2 = ‖V D−1(UTSTSU)−1UTy − V D−1UTy‖2 = ‖{V D−1(UTSTSU)−1 − V D−1}UTy‖2 = ‖{V D−1[(UTSTSU)−1 − Ip]}UTy‖2. The model sum of squares can be written as MSSF = ‖XβF ‖22 = ‖XVD−1UTy‖22 = ‖UDV TV D−1UTy‖22 = ‖UUTy‖22 = ‖UTy‖22. (5.1) 90 5.2. Proof of Theorem 4.2 (Hierarchical model for the Gaussian sketch) The final line uses the fact that UTU = Ip. Using the matrix norm induced by the Euclidean norm and the usual Euclidean norm for vectors we can form an upper bound on the error. ‖βP − βF ‖2 ≤ ‖V D−1 { (UTSTSU)−1 − Ip }‖2‖UTy‖2 ≤ ‖V D−1‖2‖UTy‖2‖(UTSTSU)−1 − Ip‖2 = MSS 1/2 F σmin(X) ‖(UTSTSU)−1 − Ip‖2. (5.2) It remains to upper bound the maximum singular value of the matrix (UTSTSU)−1 − Ip. Let M = UTSTSU . The maximum absolute value of the singular values of (UTSTSU)−1 − Ip will be given by max(|1/σmin(M) − 1|, |1/σmax(M) − 1|), where σmin(M) is the minimum singular value of M , and σmax(M) is the maximum singular value of M . If S is an -subspace embedding for the source covariate matrix X then it must hold that σmin(M) ≥ 1 − , and σmax(M) ≤ 1 +  (Woodruff, 2014, p.11). As such, max(|1/σmin(M)−1|, |1/σmax(M)−1|) ≤ |1/(1− )−1|. It is simple to show that over the interval 0 ≤  ≤ 0.5, |1/(1− )− 1| ≤ 2. This results in an upper bound on the singular value of interest, ‖(UTSTSU)−1 − Ip‖2 ≤ |1/(1− )− 1| ≤ 2. Substituting this back into (5.2) gives that under the condition that  < 0.5 ‖βP − βF ‖2 ≤ MSS 1/2 F σmin(X) × 2. Squaring both sides gives the final result, that if  < 0.5 ‖βP − βF ‖22 ≤ 42 σ2min(X) MSSF . 5.2 Proof of Theorem 4.2 (Hierarchical model for the Gaussian sketch) Theorem. Suppose βS is computed using a Gaussian sketch and k > p+ 1. The conditional distribution of βS is (i) βS |X˜,y,X ∼ N ( βF , nσ2F k ( X˜TX˜ )−1) . The marginal distribution of βS is (ii) βS |y,X ∼ Student ( βF , nσ2F k − p+ 1 ( XTX )−1 , k − p+ 1 ) . 
We use the following lemma about the Normal Inverse-Wishart distribution in many of our results (Gelman et al., 2014, p.73). Suppose that Σ is a random d× d matrix and y is a d-dimensional random vector from the following hierarchical model y|Σ ∼ N (µ,Σ/κ) , Σ ∼ Inv-Wishart(Λ, ν), where Λ is a d× d scale matrix, ν is a scalar giving degrees of freedom, and κ is a scaling constant. Then marginally, y ∼ Student(µ,Λ/(κ(ν − d+ 1)), ν − d+ 1). Theorem 4.2 follows from setting µ = βF , Σ = (X˜ TX˜)−1, κ = k/RSSF , Λ = k(XTX)−1, ν = k and d = p. 91 5. Proofs regarding sketching algorithms 5.3 Variance for partial sketching Using a Gaussian sketch of size k where k > p + 3, the standard partial sketching estimator βP has variance var(βP ) = k2 (k − p)(k − p− 1)(k − p− 3) ( MSSF (X TX)−1 + (k − p+ 1) (k − p− 1)βFβ T F ) . (5.3) The bias corrected partial sketching estimator β∗P has variance var(β∗P ) = (k − p− 1) (k − p)(k − p− 3) ( MSSF (X TX)−1 + (k − p+ 1) (k − p− 1)βFβ T F ) . (5.4) We now prove (5.3) and (5.4). Let the singular value decomposition ofX be given byX = UDV T. The singular value decomposition will help to simplify expressions in later working. The sketched Gram matrix has the form X˜TX˜ = V DUTSTSUDV T. As UTSTSU ∼Wishart(k, Ip/k), the matrix UTSTSU is almost surely invertible. The inverse Gram matrix can then be written as (X˜TX˜)−1 = [DV T]−1(UTSTSU)−1[V D]−1 = V D−1(UTSTSU)−1D−1V T. The expression for βP can then be simplified to βP = V D −1(UTSTSU)−1D−1V TXTy = V D−1(UTSTSU)−1D−1V TV DUTy = V D−1(UTSTSU)−1UTy. Let M = (UTSTSU)−1. We know that M ∼ Inverse-Wishart(k, kIp). Properties of the Inverse-Wishart distribution give that that for i = 1, . . . , p, var(Mii) = 2k2 (k − p− 1)2(k − p− 3) . (5.5) Additionally, for i, j = 1, . . . , p, where j 6= i var(Mij) = k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3) . (5.6) Finally we have that for i, j = 1, . . . , p, i 6= j, cov(Mij ,Mji) = k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3) , (5.7) cov(Mii,Mjj) = 2k2 (k − p)(k − p− 1)2(k − p− 3) . (5.8) All other covariances cov(Mij ,Mbr) are equal to zero unless they reduce to the cases in (5.7) or (5.8). Let z = UTy. Let W = cov ( MUTy ) = cov(Mz). The elements of W can be determined using the properties in equations (5.5) to (5.8). Starting with the diagonal entries, Wii = var  p∑ j=1 Mijzj  = p∑ j=1 z2j var(Mij) + p∑ j=1 p∑ w 6=j zjzwcov(Mij ,Miw). As cov(Mij ,Miw) is equal to zero for all w 6= j this simplifies to Wii = var  p∑ j=1 Mijzj  = p∑ j=1 z2j var(Mij). 92 5.3. Variance for partial sketching It is helpful to split the sum into two pieces, a single term for j = i and then a sum over the remaining indices. Grouping terms leads to an expression involving the model sum of squares MSSF . Wii = z 2 i 2k2 (k − p− 1)2(k − p− 3) + p∑ j=1,j 6=i z2j k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3) = z2i 2k2(k − p) (k − p)(k − p− 1)2(k − p− 3) + p∑ j=1,j 6=i z2j k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3) = z2i 2k2(k − p− 1) + 2k2 (k − p)(k − p− 1)2(k − p− 3) + p∑ j=1,j 6=i z2j k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3) = k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3) p∑ j=1 z2j + k2(k − p− 1) + 2k2 (k − p)(k − p− 1)2(k − p− 3)z 2 i = k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3)MSSF + k2(k − p+ 1) (k − p)(k − p− 1)2(k − p− 3)z 2 i . In the second line the first term is modified to have the same denominator as the remainder sum. In the third line we add and subtract by 2k2 so that the numerator in the first term matches the numerator in the remainder sum. 
This allows the zj terms to be grouped into a sum over the full set of indexes j = 1, . . . , p in the third line. The fourth line uses the fact that ∑p j=1 z 2 j = z Tz = MSSF . This was shown in the proof of Theorem 1 (5.1). For the off diagonal entries Wib where b 6= i, Wib = cov  p∑ j=1 Mijzj , p∑ r=1 Mbrzr  = p∑ j=1 p∑ r=1 zjzr cov(Mij ,Mbr). Now cov(Mij ,Mbr) is only nonzero for cov(Mib,Mbi) and cov(Mii,Mbb). Using (5.7) and (5.8) we obtain Wib = zizb cov(Mib,Mbi) + zizb cov(Mii,Mbb) = k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3)zizb + 2k2 (k − p)(k − p− 1)2(k − p− 3)zizb = k2(k − p+ 1) (k − p)(k − p− 1)2(k − p− 3)zizb. The entire covariance matrix W can therefore be written compactly as W = k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3) (MSSF Ip) + k2(k − p+ 1) (k − p)(k − p− 1)2(k − p− 3)zz T = k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3) ( MSSF Ip + (k − p+ 1) (k − p− 1)zz T ) . = k2 (k − p)(k − p− 1)(k − p− 3) ( MSSF Ip + (k − p+ 1) (k − p− 1)zz T ) . Now βP = V D −1Mz. Therefore var(βP ) = V D−1var(Mz)D−1V T = V D−1WD−1V T. The variance of βP is then a linear function of W , var(βP ) = V D −1WD−1V T = V D−1 k2 (k − p)(k − p− 1)(k − p− 3) ( MSSF Ip + (k − p+ 1) (k − p− 1)zz T ) D−1V T = k2 (k − p)(k − p− 1)(k − p− 3)MSSF (V D −2V T)+ k2(k − p+ 1) (k − p)(k − p− 1)2(k − p− 3)V D −1zzTD−1V T (5.9) 93 5. Proofs regarding sketching algorithms Recall that z = UTy and βF = (X TX)−1XTy (5.10) = V D−1UTy (5.11) = V D−1z. (5.12) The term V D−1z appears in (5.9). Substituting (5.12) into (5.9) gives var(βP ) = k2 (k − p)(k − p− 1)(k − p− 3)MSSF (V D −2V T)+ k2(k − p+ 1) (k − p)(k − p− 1)2(k − p− 3)βFβ T F . A final simplification can be made by noting that (XTX)−1 = V D−2V T giving var(βP ) = k2 (k − p)(k − p− 1)(k − p− 3)MSSF (X TX)−1 + k2(k − p+ 1) (k − p)(k − p− 1)2(k − p− 3)βFβ T F = k2 (k − p)(k − p− 1)(k − p− 3) ( MSSF (X TX)−1 + (k − p+ 1) (k − p− 1)βFβ T F ) . The variance of β∗P = [(k − p− 1)/k]βP is then var(β∗P ) = ( k − p− 1 k )2 k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3) ( MSSF (X TX)−1 + (k − p+ 1) (k − p− 1)βFβ T F ) = (k − p− 1) (k − p)(k − p− 3) ( MSSF (X TX)−1 + (k − p+ 1) (k − p− 1)βFβ T F ) . (5.13) 5.4 Combined estimator results We first show that β∗P and βS are uncorrelated. We again avoid notation explicitly conditioning on the source dataset [y,X] in every step, which is always treated as fixed. The covariance between β∗P and βS computed from the same sketch can be shown to be zero. Using the definition of covariance, and taking iterated expectations cov(β∗P ,βS) = E [ (β∗P − βF )(βS − βF )T ] = E { E [ (β∗P − βF )(βS − βF )T | X˜ ]} . Recall the hierarchical model for complete sketching, y˜ | X˜ ∼ N ( X˜βF , RSSF k Ik ) . Equivalently, y˜ | X˜ = X˜βF + e˜, where e˜ | X˜ ∼ N(0, RSSF k Ik). So βS | X˜,y,X = βF + (X˜TX˜)−1X˜Te˜. Substituting back into the expression for the covariance, cov(β∗P ,βS) = E { E [ (β∗P − βF )(βF + (X˜TX˜)−1X˜Te˜− βF )T | X˜ ]} = E {[ (β∗P − βF )(βF + (X˜TX˜)−1X˜TE[e˜ | X˜]− βF )T | X˜ ]} = E {[ (β∗P − βF )(βF − βF )T | X˜ ]} = E {[ (β∗P − βF )0T | X˜ ]} = 0p×p. 94 5.5. Proof of Theorem 4.4 (central limit theorem under asymptotic negligibility condition) The final line may be obtained by noting that E[β∗P ] = βF . Simple calculus shows that the value which minimises the expected mean square error ES(‖βC − βF ‖22 | y,X) is αopt = trace(var(β∗P )) trace(var(β∗P )) + trace(var(βS)) . 
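The variance expression (5.4) and the optimal weight $\alpha_{\mathrm{opt}}$ can be checked empirically. The following snippet is an illustrative Monte Carlo check on simulated data (dimensions, seed and the number of replications are arbitrary choices, and the code is ours rather than the thesis implementation): it repeatedly draws Gaussian sketches, compares the empirical covariance of $\beta^*_P$ with (5.4), and reports a Monte Carlo estimate of $\alpha_{\mathrm{opt}}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k, n_rep = 2_000, 5, 100, 1_000          # requires k > p + 3

X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)
XtX_inv = np.linalg.inv(X.T @ X)
Xty = X.T @ y
beta_F = XtX_inv @ Xty
MSS_F = np.sum((X @ beta_F) ** 2)

# Theoretical covariance of the bias-corrected partial estimator, equation (5.4).
V_theory = (k - p - 1) / ((k - p) * (k - p - 3)) * (
    MSS_F * XtX_inv + (k - p + 1) / (k - p - 1) * np.outer(beta_F, beta_F)
)

bP_star, bS = [], []
for _ in range(n_rep):
    S = rng.standard_normal((k, n)) / np.sqrt(k)          # Gaussian sketch
    X_sk, y_sk = S @ X, S @ y
    G_inv = np.linalg.inv(X_sk.T @ X_sk)
    bP_star.append((k - p - 1) / k * (G_inv @ Xty))       # bias-corrected partial estimator
    bS.append(G_inv @ (X_sk.T @ y_sk))                    # complete sketching estimator

var_P = np.cov(np.array(bP_star).T)
var_S = np.cov(np.array(bS).T)
print(f"trace var(beta_P*): Monte Carlo {np.trace(var_P):.3f}, theory {np.trace(V_theory):.3f}")
print(f"Monte Carlo estimate of alpha_opt: {np.trace(var_P) / (np.trace(var_P) + np.trace(var_S)):.3f}")
```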
5.5 Proof of Theorem 4.4 (central limit theorem under asymptotic negligibility condition) A triangular array of random variables is a useful structure for studying weak convergence. To establish a triangular array, define for every n ∈ N a collection of random variables Zn1, Zn2, . . . , Znrn . There are rn random variables in row n of the array. Suppose that rn = n. Visually we can represent the first three rows of the array as Z11 Z21 Z22 Z31 Z32 Z33 Theorem (Billingsley, 1999). For each n ∈ N, let Zn1, Zn2, . . . , Znrn be a sequence of independent random variables with E(Zni) = 0 and var(Zni) = σ2ni for i = 1, . . . , rn. Let s2n = ∑rn i=1 σ 2 ni and assume that rn →∞ as n→∞. Suppose that we can form a sequence of upper bounds (Kn)n∈N such that |Zni| ≤ Kn almost surely for i = 1, . . . , rn. Then if Kn/sn → 0 as n→∞ we have the convergence in distribution 1 sn rn∑ i=1 Zni d→ N(0, 1) Lindeberg’s condition is a critical component in establishing asymptotic normality. We state Linde- berg’s condition for triangular arrays of random variables. Definition 5.1 (Lindeberg’s condition). For each n ∈ N, let Zn1, Zn2, . . . , Znrn be a sequence of random variables with E(Zni) = 0 and var(Zni) = σ2ni for i = 1, . . . , rn. Let s2n = ∑n i=1 σ 2 ni and suppose that rn →∞ as n→∞. The random variables are said to satisfy Lindeberg’s condition if for all η > 0, lim n→∞ 1 s2n rn∑ i=1 E(Z2ni1{|Zni|>ηsn}) = 0. (5.14) The triangular array of random variables does not have to have independent random variables in each row in order to satisfy the condition. The general form of the Lindeberg-Feller central limit theorem shows that a triangular array of independent random variables satisfying Lindeberg’s condition is asymptotically normal after suitable scaling. Theorem 5.1 (Lindeberg-Feller). For each n ∈ N, let Zn1, Zn2, . . . , Znrn be a sequence of random variables with E(Zni) = 0 and var(Zni) = σ2ni for i = 1, . . . , rn. Let s2n = ∑rn i=1 σ 2 ni and suppose that rn → ∞ as n → ∞. Suppose the triangular array of random variables satisfies Lindeberg’s condition (Definition 5.1). Then 1 sn rn∑ i=1 Zni d→ N(0, 1) For a proof see Loeve (1977). It can be difficult to show Lindeberg’s condition directly. A stronger condition that implies the Lindeberg condition is the Lyapunov condition. 95 5. Proofs regarding sketching algorithms Definition 5.2 (Lyapunov’s condition). For each n ∈ N, let Zn1, Zn2, . . . , Znrn be a sequence of random variables with E(Zni) = 0 and var(Zni) = σ2ni for i = 1, . . . , rn. Let s2n = ∑rn i=1 σ 2 ni and suppose that rn →∞ as n→∞. The triangular array of random variables is said to satisfy Lyapunov’s condition if there exists a δ > 0 such that lim n→∞ 1 s2+δn rn∑ i=1 E(|Zni|2+δ) = 0. (5.15) The Lyapunov condition implies the Lindeberg condition. We state this in a Lemma for later reference. Lemma 5.1. The Lyapunov condition implies the Lindeberg condition. To see this assume the Lyapunov condition is satisfied and fix η > 0. Now |Zni| ≥ ηsn implies that 1 ≤ |Zni/(ηsn)|δ. We can then form an upper bound on the sequence of partial sums that appear in Lindeberg’s condition. 1 s2n rn∑ i=1 E(Z2ni1{|Zni|>ηsn}) ≤ 1 s2n rn∑ i=1 E(Z2ni|Zni/(ηsn)|δ1{|Zni|>ηsn}) = 1 s2n rn∑ i=1 E(|Zni|2|Zni/(ηsn)|δ1{|Zni|>ηsn}) = 1 s2n 1 (ηsn)δ rn∑ i=1 E(|Zni|2+δ1{|Zni|>ηsn}) = 1 ηδ 1 s2+δn rn∑ i=1 E(|Zni|2+δ). 
Assuming that Lyapunov’s condition holds we can establish zero as an upper bound lim n→∞ 1 s2n rn∑ i=1 E(Z2ni1{|Zni|>ηsn}) ≤ limn→∞ 1 ηδ 1 s2+δn rn∑ i=1 E(|Zni|2+δ) = 1 ηδ lim n→∞ 1 s2+δn rn∑ i=1 E(|Zni|2+δ) = 0. We also have the lower bound 0 ≤ lim n→∞ 1 s2n rn∑ i=1 E(Z2ni1{|Zni|>ηsn}). By the squeeze theorem we then have that the Lyapunov condition implies the Lindeberg condition. We now present a useful Lemma for showing the Lyapunov condition. The result is from Billingsley (1999) and applies to triangular arrays of uniformly bounded random variables. Lemma 5.2 (Billingsley, 1999). For each n ∈ N, let Zn1, Zn2, . . . , Znrn be a sequence of random variables with E(Zni) = 0 and var(Zni) = σ2ni for i = 1, . . . , rn. Let s2n = ∑rn i=1 σ 2 ni and suppose that rn → ∞ as n→∞. Suppose that we can form a sequence of upper bounds (Kn)n∈N such that |Zni| ≤ Kn almost surely for i = 1, . . . , rn. Then if Kn/sn → 0 as n→∞ the Lyapunov condition holds for the triangular array of random variables. Lemma 5.2 is useful as it does not impose a constant uniform bound on the random variables. In the special case where |Zni| ≤ M almost surely for some constant M for all n ∈ N and all i = 1, . . . , rn we have that Lyapunov’s condition is satisfied providing that sn →∞. Lemma 5.2 allows for the bound Kn to increase with n as long as the rate of growth is slower than the rate of growth of sn. Lyapunov’s condition holds providing that Kn = o(sn). 96 5.6. Proof of Theorem 4.3 (Sketching central limit theorem) The proof of Lemma 5.2 is given below. Again fix some δ > 0. If |Zni| ≤ Kn almost surely for i = 1, . . . , rn it must hold that |Zni|δ ≤ Kδn as |Zni|,Kn and δ are all positive. As such |Zni|2+δ = |Zni|2|Zni|δ ≤ |Zni|2Kδn. We can then form an upper bound on the sequence of partial sums that appear in Lyapunov’s condition. 1 s2+δn rn∑ i=1 E(|Zni|2+δ) ≤ 1 s2+δn rn∑ i=1 E(|Zni|2)Kδn = Kδn s2+δn rn∑ i=1 E|Zni|2. = Kδn s2+δn s2n = ( Kn sn )δ . (5.16) Now assuming that Kn = o(sn) we have that Kn/sn → 0 as n→∞. We then also have that lim n→∞ ( Kn sn )δ = ( lim n→∞ Kn sn )δ = 0, as the exponentiation by δ > 0 is a continuous function. Now taking limits on both sides of the inequality (5.16): lim n→∞ 1 s2+δn rn∑ i=1 E(|Zni|2+δ) ≤ lim n→∞ ( Kn sn )δ (5.17) = 0. (5.18) We also have the lower bound 0 ≤ lim n→∞ 1 s2+δn rn∑ i=1 E(|Zni|2+δ). By the squeeze theorem we then have that Kn = o(sn) is sufficient for Lyapunov’s condition to hold. The triangular array of independent random variables in Theorem 4.4 satisfies Lyapunov’s condition by Lemma 5.2. As the Lyapunov condition implies the Lindeberg condition (Lemma 5.1) the general Lindeberg-Feller central limit theorem (Theorem 5.1) gives asymptotic normality of the scaled row sums, thus proving Theorem 4.4. 5.6 Proof of Theorem 4.3 (Sketching central limit theorem) Assumption 1 Let the singular value decomposition of the n × d source dataset be given by A(n) = U(n)D(n)V T (n). Let u T (n)i give the ith row in U(n). Assume that the maximum leverage score tends to zero, that is lim n→∞ maxi=1,...,n ‖u(n)i‖22 = 0. Theorem 4.3 gives the sketching central limit theorem. Theorem. Consider a fixed sequence of arbitrary n×d data matrices A(n), where d is fixed. Let A(n) = U(n)D(n)V T (n) represent the singular value decomposition of A(n). Let S be a k×n Hadamard or Clarkson- Woodruff sketching matrix where k is also fixed. Suppose that Assumption 1 on the maximum leverage score is satisfied. 
Then as $n$ tends to infinity with $k$ and $d$ fixed,
$$[\tilde{A}V_{(n)}D_{(n)}^{-1} \mid A_{(n)}] \xrightarrow{d} MN(0, I_k, I_d/k).$$

To prove the sketching central limit theorem it helps to restate Lemma 5.2, as this highlights the importance of the leverage scores in establishing asymptotic normality. Lemma 5.2 provided a sufficient condition for Lindeberg's condition to hold. We can restate Lemma 5.2 in terms of a normalised triangular array.

Theorem 5.2 (Billingsley, 1999). For each $n \in \mathbb{N}$, let $Z_{n1}, Z_{n2}, \ldots, Z_{nr_n}$ be a sequence of random variables with $E(Z_{ni}) = 0$ and $\mathrm{var}(Z_{ni}) = \sigma^2_{ni}$ for $i = 1, \ldots, r_n$. Define $s_n^2 = \sum_{i=1}^{r_n} \sigma^2_{ni}$ for each $n$. Suppose that the rows of the triangular array are standardised such that $s_n^2 = 1$ for all $n$, and that $r_n \to \infty$ as $n \to \infty$. Suppose we have a sequence of upper bounds $(K_n)_{n \in \mathbb{N}}$ such that $|Z_{ni}| \le K_n$ almost surely for all $i = 1, \ldots, r_n$. Then a sufficient condition for Lyapunov's condition to hold is $K_n \to 0$ as $n \to \infty$.

The standardisation of the triangular array gives an intuitive condition for Lyapunov's, and hence Lindeberg's, condition to hold: the upper bound $K_n$ must tend to zero as $n \to \infty$, so that every random variable in row $n$ is forced towards zero almost surely. Almost sure convergence is stronger than convergence in probability and rules out pathological cases where a single random variable in a row can take a large value with small probability.

Assumption 1 on the leverage scores in the sketching central limit theorem enforces a bounded growth condition that relates to Theorem 5.2. Let $n \in \mathbb{N}$ index the sequence of source datasets of increasing size. We assume that the source dataset consists of $r_n$ observations, where $r_n \to \infty$ as $n \to \infty$. For now we can take $r_n = n$ to ease interpretation. We take the singular value decomposition of each dataset, $A_{(n)} = U_{(n)}D_{(n)}V_{(n)}^T$. All results in this section treat the source dataset $A_{(n)}$ as fixed; only the sketching matrix is random. We consider the sequence of whitened sketched datasets
$$\tilde{A}V_{(n)}D_{(n)}^{-1} = (SA)V_{(n)}D_{(n)}^{-1} = SU_{(n)}D_{(n)}V_{(n)}^T V_{(n)}D_{(n)}^{-1} = SU_{(n)}.$$
The whitened sketched dataset $\tilde{A}V_{(n)}D_{(n)}^{-1}$ has a $MN(0, I_k, I_d/k)$ distribution when $S$ is a Gaussian sketch. We need to show that as $n$ tends to infinity, $SU_{(n)}$ converges in distribution to a $MN(0, I_k, I_d/k)$ random matrix for both the Clarkson-Woodruff and Hadamard sketches.

Let $u_{(n)i}^T$ denote row $i$ of the matrix of left singular vectors $U_{(n)}$. We write $u_{(n)i}^T$ so that we can form a triangular array of left singular vectors. Taking $r_n = n$, the first three rows of the triangular array can be written as
$$\begin{array}{ccc} u_{(1)1} & & \\ u_{(2)1} & u_{(2)2} & \\ u_{(3)1} & u_{(3)2} & u_{(3)3} \end{array}$$
An important property is that for all $n$, the leverage scores (the squared row norms $\|u_{(n)i}\|_2^2$) always sum to the number of variables in the source dataset, $d$:
$$\sum_{i=1}^{r_n} \|u_{(n)i}\|_2^2 = d. \quad (5.19)$$
As $n$ increases, the typical norm of each vector $u_{(n)i}$, $i \in \{1, \ldots, r_n\}$, is expected to decrease. For completeness we restate Assumption 1 in terms of the triangular array formulation.

Assumption 1. Let the singular value decomposition of the $r_n \times d$ source dataset be given by $A_{(n)} = U_{(n)}D_{(n)}V_{(n)}^T$. Let $u_{(n)i}^T$ give the $i$th row in $U_{(n)}$ for $i = 1, \ldots, r_n$. Assume that the maximum leverage score tends to zero, that is
$$\lim_{n\to\infty} \max_{i=1,\ldots,r_n} \|u_{(n)i}\|_2^2 = 0.$$
This increasing collection of shrinking quantities is similar to the behaviour of the triangular array of random variables in Theorem 5.2.
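The two properties just described, the normalisation (5.19) and the shrinking maximum in Assumption 1, can be seen in a small numerical illustration. The snippet below is ours and uses simulated Gaussian design matrices purely for convenience; a sequence of this type satisfies Assumption 1.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
for n in (100, 1_000, 10_000):
    A = rng.standard_normal((n, d))                    # simulated source dataset A_(n)
    U = np.linalg.svd(A, full_matrices=False)[0]
    lev = np.sum(U**2, axis=1)                         # leverage scores ||u_(n)i||_2^2
    print(f"n = {n:6d}:  sum = {lev.sum():.2f} (equals d),  max = {lev.max():.4f}")
```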
The standardisation property in equation (5.19), namely that ∑rn i=1‖u(n)i‖22 = d for all n is similar to the assumption that sn = 1 in each row of the triangular array of random variables in Theorem 5.2. Assumption 1 on the leverage scores, where the maximum individual norm tends to zero is similar to the assumption that Kn → 0 in Theorem 5.2. This will be made more explicit in the proofs. Before moving on we make a note that assumption 1 also implies that the maximum square root of the leverage scores also tends to zero. As max i=1,...,rn ‖u(n)i‖2 = ( max i=1,...,rn ‖u(n)i‖22 )1/2 (5.20) We have that lim n→∞ maxi=1,...,rn ‖u(n)i‖2 = lim n→∞ ( max i=1,...,rn ‖u(n)i‖22 )1/2 = ( lim n→∞ maxi=1,...,rn ‖u(n)i‖22 )1/2 = 0. (5.21) To establish joint asymptotic normality of the sketched data matrix we use the Crame´r-Wold device. Lemma 5.3 (Crame´r-Wold device). Let z be a real valued random vector of dimension v and let (zn) be a sequence of real valued random vectors of the same fixed dimension v. The sequence of random vectors (zn) converges in distribution to z as n tends to infinity if and only if (λ Tzn) converges in distribution to λTz for all unit vectors λ ∈ Rv. Let zn represent the kd length vector formed by stacking transposed rows of the whitened sketched dataset U˜ = SU(n). Let u˜ T j give row j in U˜ for j = 1, . . . , k. Formally, zn =  u˜1 u˜2 ... u˜k  . (5.22) Let us define the random matrix k × d random matrix W as having the matrix normal distribution W ∼MN(0, Ik, Id/k) Let wTi refer to row i in W for i = 1, . . . , k. Let zL refer to the stacked transposed rows of W , so zL =  w1 w2 ... wk  . (5.23) Let λ be an arbitrary unit vector in Rk×d. It will be useful to also partition the vector λ into k sub-vectors, λ =  λ1 λ2 ... λk  , (5.24) 99 5. Proofs regarding sketching algorithms where λj is a d-dimensional vector for j = 1, . . . , k. For any unit vector λ ∈ Rk×d, λTzL is distributed as N(0, 1/k). We will aim to show that the distribution of the whitened sketched data SA(n)V(n)D −1 (n) converges to that of W through the Crame´r-Wold device. We must show that for any fixed k× d length unit vector λ, λTzn converges in distribution to N(0, 1/k) as n→∞. We will rely on a central limit theorem for jointly symmetric, pairwise independent random variables (Pruss and Szynal, 2000). A collection of random variables (Z1, . . . Zn) is said to be jointly symmetric if (Z1, . . . Zn) has the same distribution as (q1Z1, . . . , qnZn), where qi ∈ {+1,−1} for i = 1, . . . , n. Given a set of random variables Y1, . . . , Yn, a jointly symmetric collection Z1, . . . , Zn can be formed by sampling n independent Rademacher random variables h1, . . . , hn, and setting Zi = hiYi (Pruss and Szynal, 2000). It is possible to establish a central limit theorem for jointly symmetric, pairwise independent random variables. Theorem 5.3 (Pruss and Szynal (2000), Theorem 1, Corollary 2). For each n ∈ N, let Zn1, Zn2, . . . , Znrn be a sequence of jointly symmetric pairwise independent random variables with E(Zni) = 0 and var(Zni) = σ2ni for i = 1, . . . , rn. Let s 2 n = ∑rn i=1 σ 2 ni and assume that rn → ∞ as n → ∞. Suppose the triangular array of random variables satisfies Lindeberg’s condition. Then as n → ∞, s−1n ∑rn i=1 Zni converges in distribution to N(0, 1). Not all triangular arrays with pairwise independent random variables in each row satisfy a central limit theorem. 
The joint symmetry property is very important (Pruss and Szynal, 2000; Janson, 1988). To use Theorem 5.3 we need to show that the triangular array of random variables satisfies Lindeberg’s condition. As discussed this can be very difficult to establish directly. If the triagular array of random variables can be appropriately bounded, we can use Theorem 5.2 to show that Lyapunov’s condition holds, and subsequently that Lindeberg’s condition holds. This is the approach we take in proving the sketching central limit theorem. The Crame´r-Wold device is used to reduce the study of multivariate convergence to univariate convergence. We can then form a triangular array of random variables such that elements in each row are jointly symmetric and pairwise independent. We then show that triangular array satisfies Lindeberg’s condition using Theorem 5.2. Assumption 1 on the maximum leverage score enforces the necessary cap on the rate of growth. Theorem 5.3 is then used to establish asymptotic normality. 5.6.1 Clarkson-Woodruff sketch The Clarkson-Woodruff sketch can be represented as the product of two independent random matrices, S = ΓD, where Γ is a random k × n matrix and D is a random n× n matrix. The diagonal matrix D contains n independent Rademacher random variables on the diagonal. Let hi ∈ {+1,−1} be the random sign in element Dii. The matrix Γ is formed by choosing one element in each column independently and setting the entry to +1. Element Γij is equal to +1 if we add observation i in the original dataset to sketched observation j. The signs in row i are flipped if hi is equal to negative one. Each observation in the original dataset is assigned to one sketched observation as each column of Γ contains a single +1 entry. Using a Clarkson-Woodruff sketch row j in the sketched data matrix can be represented as u˜Tj = n∑ i=1 hiΓiju T (n)i, where hi represents the random sign flip applied to row i of the original data matrix, and Γij is the indicator variable which is equal to one if row i of the original data is added to row j of the sketched dataset. Let us consider the linear combination λTz, where λ and z are defined as in (5.22) and (5.24) respectively. The sum over the k rows in the sketched dataset can be rearranged into a sum over the n 100 5.6.1. Clarkson-Woodruff sketch rows in the source dataset, λTzn = k∑ j=1 λTj u˜j = k∑ j=1 λTj n∑ i=1 hiΓiju(n)i = n∑ i=1 hi k∑ j=1 Γijλ T j u(n)i. (5.25) The scalar λTzn is equal to the sum of n independent random variables. Independence holds as the signs flips hi on each observation are independent, and each column of Γ is independent. In the language of Theorem 5.2 we can form a triangular array of random variables setting Zni = hi k∑ j=1 Γijλ T j u(n)i. (5.26) for i = 1, . . . , n and n ∈ N. The linear combination in (5.25) then be expressed as a row sum over the triangular array defined in (5.26): λTzn = n∑ i=1 Zni. (5.27) Our goal of showing that λTzn converges in distribution to a N(0, 1/k) random variable is achieved if we can show that ∑n i=1 Zni converges in distribution to a N(0, 1/k) random variable. It is worth making a connection to Theorem 5.3, because of the random sign flips hi appearing in (5.26), we have a sequence of mutually independent jointly symmetric random variables. Mutually independent random variables are also necessarily pairwise independent. Theorem 5.3 can be used to establish asymptotic normality of the sum in (5.27) and hence the linear combination λTzn. 
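As an aside, the Clarkson-Woodruff construction described above is straightforward to implement without ever forming $S = \Gamma D$ explicitly. The sketch below is our own illustrative code (the helper name and the simulated data are not from the thesis): each row of the source matrix is assigned to one sketched row uniformly at random after a Rademacher sign flip, giving the $O(nd)$ cost.

```python
import numpy as np

def clarkson_woodruff(A: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Apply the Clarkson-Woodruff projection S = Gamma D to the rows of A.

    Every row of A is added to exactly one of the k sketched rows, chosen
    uniformly at random, after an independent Rademacher sign flip, so the
    sketch is computed in O(nd) time without forming S explicitly.
    """
    n, d = A.shape
    bucket = rng.integers(0, k, size=n)          # sketched row receiving source row i
    sign = rng.choice([-1.0, 1.0], size=n)       # diagonal of D (random sign flips)
    A_sk = np.zeros((k, d))
    np.add.at(A_sk, bucket, sign[:, None] * A)   # accumulate signed rows into buckets
    return A_sk

# Complete sketching of a simulated dataset: sketch [y, X] and refit.
rng = np.random.default_rng(4)
n, p, k = 20_000, 10, 500
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)
A_sk = clarkson_woodruff(np.column_stack([y, X]), k, rng)
y_sk, X_sk = A_sk[:, 0], A_sk[:, 1:]
beta_S = np.linalg.lstsq(X_sk, y_sk, rcond=None)[0]
```

The bucket-and-sign representation used here is exactly the assignment structure exploited in the triangular array argument above.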
To show that the triangular array of random variables defined in (5.26) satisfies Lindeberg’s condition we use Theorem 5.2. Set s2n = ∑n i=1 var(Zni). We first determine s 2 n. We then form the necessary sequence of upper bounds Kn such that |Zni| ≤ Kn almost surely for i = 1, . . . , n. The variance of a single term in the sum (5.25) is var(Zni) = var hi k∑ j=1 Γijλ T j u(n)i  (5.28) = k∑ j=1 1 k λTj u(n)iu T (n)iλj . (5.29) 101 5. Proofs regarding sketching algorithms The row-wise variance totals s2n are then s2n = n∑ i=1 var(Zni) = n∑ i=1 var hi k∑ j=1 Γijλ T j u(n)i  = 1 k n∑ i=1 k∑ j=1 λTj u(n)iu T (n)iλj = 1 k k∑ j=1 λTj ( n∑ i=1 u(n)iu T (n)i ) λj = 1 k k∑ j=1 λTjU T (n)U(n)λj = 1 k k∑ j=1 λTj Idλj . = 1 k k∑ j=1 λTj λj = 1 k . The fact that UT(n)U(n) = Id for all n serves as a useful normalisation to give stable limiting behaviour. The step in the last line follows as we have taken λ to be a unit vector. We have sn = 1/k for all n in the triangular array. We now establish a sequence of upper bounds (Kn). As the random variables in the construction of construction of the sketch are bounded, we can bound the random variables in the triangular array using the leverage scores of the sequence of source dataset. Now as the random sign hi ∈ {+1,−1} |Zni| = |hi k∑ j=1 Γijλ T j u(n)i| = |  k∑ j=1 Γijλ T j u(n)i|. (5.30) Now by the Cauchy-Schwarz inequality |  k∑ j=1 Γijλ T j u(n)i| ≤ ‖ k∑ j=1 Γijλj‖2‖u(n)i‖2 (5.31) Now as Γij = 1 for a single j ∈ {1, . . . , k} and is zero otherwise we have that ‖ k∑ j=1 Γijλj‖2 ≤ max j=1, . . . , k ‖λj‖2 ≤ 1. (5.32) The last line follows as we have taken λ to be a unit vector. Substituting (5.32) and (5.31) into (5.30) we arrive at |Zni| ≤ ‖u(n)i‖2. We can then form the sequence of upper bounds Kn, Kn = max i=1,...,n ‖u(n)i‖2. 102 5.6.2. Hadamard sketch We have that |Zni| ≤ Kn almost surely for i = 1, . . . , n and n ∈ N. Assumption 1 controls the limiting behaviour of Kn = max i=1,...,n ‖u(n)i‖2 (recall equation (5.21)). Taking limits and using Assumption 1 shows that Kn → 0, lim n→∞ Kn = limn→∞ maxi=1,...,n ‖u(n)i‖2 = 0. By theorem 5.2 we have that the triangular array of random variables in (5.26) satisfies Lindeberg’s condition. As such the conditions of Theorem 5.3 are satisfied, giving that λTzn converges in distribution to N(0, 1/k). Finally, the Crame´r-Wold device gives that the whitened sketched dataset has a limiting matrix normal distribution, that is A˜V(n)D −1 (n) converges in distribution to a MN (0k×d, Ik, Id/k) random matrix. 5.6.2 Hadamard sketch Recall that the Hadamard sketch is defined through S = ΦHD/ √ k. Here H is a Hadamard matrix. Hadamard matrices are square matrices with 2n rows for some integer n. To take limits we have to define our sequence of source datasets (A(n) = U(n)D(n)V T (n)) as having rn = 2 n rows for n ∈ N+. In practice when taking a Hadamard sketch we pad the original dataset with zeros if the original number of observations is not a power of two. To rigourously establish asymptotic normality for the Hadamard sketch we have to take rn = 2 n. The first three rows of the triangular array of left singular vectors now looks like u(1)1 u(2)1 u(2)2 u(3)1 u(3)2 u(3)3 u(n)4. The intuition is the same as with the Clarkson-Woodruff sketch, as we move down the rows n we expect the norms u(n)i, i ∈ {1, . . . , 2n} to decrease. This follows from the implicit row-wise normalisation property rn∑ i=1 ‖u(n)i‖22 = d. 
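For concreteness, the construction $S = \Phi H D / \sqrt{k}$ can be implemented as follows. This is an illustrative sketch of our own: the helper name, the use of scipy.linalg.hadamard and the simulated data are our choices, and an explicit Hadamard matrix is formed only to keep the example short; in practice a fast Walsh-Hadamard transform would be used to reach the $O(nd \log k)$ cost.

```python
import numpy as np
from scipy.linalg import hadamard

def hadamard_sketch(A: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Apply the Hadamard sketch S = Phi H D / sqrt(k) to the rows of A.

    The rows of A are zero-padded up to the next power of two, randomly
    sign-flipped (the diagonal matrix D), rotated by the Hadamard matrix H,
    and k rows of the result are subsampled with replacement (Phi).
    """
    n, d = A.shape
    m = 1 << (n - 1).bit_length()              # next power of two >= n
    A_pad = np.zeros((m, d))
    A_pad[:n] = A
    sign = rng.choice([-1.0, 1.0], size=m)     # Rademacher signs on the diagonal of D
    HDA = hadamard(m, dtype=float) @ (sign[:, None] * A_pad)
    rows = rng.integers(0, m, size=k)          # Phi: k rows sampled with replacement
    return HDA[rows] / np.sqrt(k)

rng = np.random.default_rng(5)
A = rng.standard_normal((1_000, 8))            # padded internally to 1024 rows
A_sk = hadamard_sketch(A, k=200, rng=rng)
# In expectation A_sk.T @ A_sk equals A.T @ A, so the sketched Gram matrix
# approximates the full one.
```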
The indexing change to rn = 2 n instead of rn = n has very little impact on the underlying arguments. There are two independent sources of randomness in a Hadamard sketch, the rn = 2 n independent random Rademacher variables in the diagonal matrix D, and the random matrix Φ which subsamples k rows with replacement from the Hadamard matrix H. Hadamard matrices have a number of properties that we will use (Anderson, 1997, section 3.2). • (P1) The first column contains all ones. • (P2) Every column other than the first contains an equal number of +1 and −1 entries. • (P3) Consider any two different columns i and s, where i, s ∈ {2, . . . , rn}, i 6= s. Columns i and s will have +1 together in a quarter of the rows, and −1 together in a quarter of the rows. Furthermore, a quarter of the rows will have +1 in column i and −1 in column s. Similarly, a quarter of the rows will have −1 in column i and +1 in column s. Let M represent the random k × n matrix from the subsampling operation M = ΦH. Let mji refer to the element in row j and column i of M . Each element in M is equal to +1 or −1. Let hi ∈ {+1,−1} be the random sign in element Dii. We now represent the Hadamard sketch as S = MD/ √ k. 103 5. Proofs regarding sketching algorithms The structure of the Hadamard matrix gives the random matrix M some useful properties. Consider an arbitrary row j in M . By (P1) listed above regarding the first column of M , mj1 = 1 with probability one. For the other columns, mji = 1 with probability half, and mji = −1 with probability half for i = 2, . . . , rn by (P2). By (P3) listed above, we have pairwise independence between elements in row j of M , that is p(mji|mjs) = p(mji) for i, s ∈ {1, . . . , rn}, i 6= s. As rows of M are sampled independently, each column of M is pairwise independent. Row j in the sketched dataset is given by u˜Tj = 1√ k rn∑ i=1 mjihiu T (n)i. Let us again consider the linear combination λTzn, where λ and zn are defined as in (5.22) and (5.24) respectively. The sum over the k rows in the sketched dataset can be rearranged into a sum over the rn = 2 n rows in the source dataset, λTzn = k∑ j=1 λTj u˜j = 1√ k k∑ j=1 λTj rn∑ i=1 mjihiu(n)i = 1√ k rn∑ i=1 hi  k∑ j=1 mjiλ T j u(n)i. (5.33) In the language of Theorem 5.2 we can form a triangular array of random variables setting Zni = 1√ k hi  k∑ j=1 mjiλ T j u(n)i. (5.34) for i = 1, . . . , rn and n ∈ N. The linear combination in (5.33) can then be expressed as a row sum of the triangular array defined by (5.34) λTzn = rn∑ i=1 Zni. (5.35) Our goal of showing that λTzn converges in distribution to a N(0, 1/k) random variable is achieved if we can show that ∑n i=1 Zni converges in distribution to a N(0, 1/k) random variable. The sequence of random variables in each row of the triangular array Zn1, . . . , Znrn are not mutually independent over i = 1 . . . , rn. This is because the columns ofM are not mutually independent. However, as the columns of M are pairwise independent, the random sums ∑k j=1mjiλ T j appearing in (5.34) are also pairwise independent. Again making a connection to Theorem 5.3, the independent sign flips hi appearing in (5.34) ensure that the random variables in each row of the triangular array are jointly symmetric and pairwise independent. Theorem 5.3 can be used to establish asymptotic normality of the sum in (5.35) and hence the linear combination λTzn. To show that the triangular array of random variables defined in (5.34) satisfies Lindeberg’s condition we use Theorem 5.2. Set s2n = ∑n i=1 var(Zni). 
We first determine s 2 n. We then form the necessary sequence of upper bounds Kn such that |Zni| ≤ Kn almost surely for i = 1, . . . , n. We start by considering the variance of a single term in the triangular array var(Zni). We have that var(Zni) = 1 k var hi  k∑ j=1 mjiλ T j u(n)i  (5.36) 104 5.6.2. Hadamard sketch It is important to consider the covariance between the elements of the sum over j = 1, . . . , k. For i 6= 1 and j, v ∈ {1, . . . , k}, j 6= v the covariance is zero cov ( himjiλ T j u(n)i, himviλ T vu(n)i ) = E [ h2imjimviλ T j u(n)iλ T vu(n)i ] = E [mjimvi]λTj u(n)iλTvu(n)i = 0. We use (P2) to conclude that E [mjimvi] = 0. Therefore for i = 2, . . . , rn var(Zni) = 1 k var  k∑ j=1 himjiλ T j u(n)i  = 1 k k∑ j=1 var ( himjiλ T j u(n)i ) = 1 k k∑ j=1 λTj u(n)iu T (n)iλj . (5.37) Results are different for i = 1 as the first column of the Hadamard matrix is all ones (P1). For j, v ∈ {1, . . . , k}, j 6= v the covariance is cov ( h1mj1λ T j u(n)1, h1mv1λ T vu(n)1 ) = E [ h21mj1mv1λ T j u(n)1λ T vu(n)1 ] = E [mj1mv1]λTj u(n)1λTvu(n)1 = λTj u(n)1λ T vu(n)1. From (P1) mj1 = mv1 = 1. Now using the Cauchy-Schwarz inequality, |cov (h1mj1λTj u(n)1, h1mv1λTvu(n)1)| = |λTj u(n)1||λTvu(n)1| ≤ ‖λj‖2‖u(n)1‖2‖λv‖2‖u(n)1‖2 ≤ ‖u(n)1‖2‖u(n)1‖2 = ‖u(n)1‖22 The second last last uses the fact that λ is a unit vector and we must have ‖λj‖2 ≤ 1, ‖λj‖2 ≤ 1 for any j, k. From assumption 1, the right hand side of the previous inequality tends to zero as n tends to infinity. As such we conclude that |cov (h1mj1λTj u(n)1, h1mv1λTvu(n)1)| is o(1). Some covariance terms appear in the expression for var(Zn1) var(Zn1) = 1 k var  k∑ j=1 h1mj1λ T j u(n)1  = 1 k k∑ j=1 var ( himjiλ T j u(n)1 ) + 1 k 2 k−1∑ j=1 k∑ v=j+1 cov ( h1mj1λ T j u(n)1, h1mv1λ T vu(n)1 ) = 1 k k∑ j=1 var ( h1mj1λ T j u(n)1 ) + 1 k 2 k−1∑ j=1 k∑ v=j+1 λTj u(n)1λ T vu(n)1 = 1 k k∑ j=1 var ( h1mj1λ T j u(n)1 ) + o(1) = 1 k k∑ j=1 λTj u(n)1u T (n)1λj + o(1) (5.38) 105 5. Proofs regarding sketching algorithms The trailing term can be grouped into an o(1) term as the sketch size k is fixed in our analysis. Using (5.37) and (5.38) we can then determine the row-wise variance totals s2n: s2n = 1 k rn∑ i=1 var(Zni) = 1 k rn∑ i=1 k∑ j=1 λTj u(n)iu T (n)iλj + o(1) = 1 k k∑ j=1 λTj ( rn∑ i=1 u(n)iu T (n)i ) λj + o(1) = 1 k k∑ j=1 λTjU T (n)U(n)λj + o(1) = 1 k k∑ j=1 λTj Idλj + o(1) = 1 k k∑ j=1 λTj λj + o(1) = 1 k + o(1). The step in the last line follows as we have taken λ to be a unit vector. The fact that UT(n)U(n) = Id for all n serves as a useful normalisation to give stable limiting behaviour. We are working with a triangular array where the rows are nearly standardised. Asymptotically in n, s2n → 1/k. We now establish a sequence of upper bounds (Kn). As the random variables in the construction of construction of the Hadamard sketch are bounded, we can bound the random variables in the triangular array (5.34) using the leverage scores of the sequence of source datasets. Now as the random sign hi ∈ {+1,−1} we have that for all i = 1, . . . , rn: |Zni| = 1√ k |hi k∑ j=1 mjiλ T j u(n)i| = 1√ k | k∑ j=1 mjiλ T j u(n)i|. Now using the Cauchy-Schwarz inequality, 1√ k |  k∑ j=1 mjiλ T j u(n)i| ≤ 1√ k ‖  k∑ j=1 mjiλj ‖2‖u(n)i‖2. (5.39) Using the triangle inequality, ‖  k∑ j=1 mjiλj ‖2 ≤ k∑ j=1 ‖mjiλj‖2. (5.40) Now as mji ∈ {+1,−1} for all j = 1, . . . , k, k∑ j=1 ‖mjiλj‖2 = k∑ j=1 ‖λj‖2. (5.41) As λ is a unit vector we can easily form the bound k∑ j=1 ‖λj‖2 ≤ k. (5.42) 106 5.7. 
Proof of Theorem 4.5 (Complete sketching asymptotics) Substituting (5.41) and (5.42) into (5.39) leads to the upper bound for i = 1, . . . , rn: |Zni| ≤ 1√ k k‖u(n)i‖2 = √ k‖u(n)i‖2. (5.43) This is less than ideal as the upper bound is a function of k. In the present analysis the sketch size k is fixed. In future work we would like establish limit theorems letting k and d grow. This is discussed more in Chapter 6. We would like to eliminate the sketch size k from the upper bound (5.43). To do so we establish a tighter bound than in (5.42). It is reasonable to form a tighter bound as we have the restriction that λ is a unit vector, and we have not made full use of that in (5.42). To improve (5.42) we use the following lemma Lemma 5.4. Let a1, . . . , ak be a sequence of positive scalars such that ∑k j=1 aj = 1. Define f(a1, . . . , ak) =∑k j=1 a 1/2 j . Then f s maximised by setting aj = 1/k for j = 1, . . . , k. Furthermore f(1/k, . . . , 1/k) = √ k. The proof is simple using Lagrange multipliers. The quantity in (5.41) can be upper bounded using Lemma 5.4. We set aj = ‖λj‖22 for j = 1, . . . , k. As λ is a unit vector we have that ∑k j=1‖λj‖22 =∑k j=1 aj = 1. We can write ∑k j=1‖λj‖2 = ∑k j=1 a 1/2 j . Take f(a1, . . . , ak) = ∑k j=1 a 1/2 j . We can form an upper bound on ∑k j=1‖λj‖2 by maximising f over the arguments a1, . . . , ak subject to the constraint that ∑k j=1 aj = 1. From Lemma 5.4 we have that k∑ j=1 ‖λj‖2 ≤ √ k. (5.44) We can then form the tighter bounds on the triangular array |Zni| ≤ 1√ k √ k‖u(n)i‖2 (5.45) = ‖u(n)i‖2. (5.46) The upper bound no longer depends on the sketch size k as was desired. Continuing, we can then form the sequence of upper bounds Kn: Kn = max i=1,...,rn ‖u(n)i‖2. We have that |Zni| ≤ Kn almost surely for i = 1, . . . , rn and n ∈ N. Assumption 1 (recall equation (5.21)) gives the limiting behaviour of Kn. lim n→∞ Kn = limn→∞ maxi=1,...,rn ‖u(n)i‖2 = 0. We have that Kn → 0 as n → ∞. As s2n = 1/k + o(1) we have an asymptotically standardised array and we can use Theorem 5.2 to conclude that the triangular array of random variables defined in (5.34) satisfies Lindeberg’s condition. As such the conditions of Theorem 5.3 are satisfied. We conclude that the row sums in (5.35) converge in distribution to N(0, 1/k). Finally, the Crame´r-Wold device gives that the whitened sketched dataset has a limiting matrix normal distribution. That is the sequence of random matrices A˜V(n)D −1 (n) converges in distribution to a MN (0, Ik, Id/k) random matrix. 5.7 Proof of Theorem 4.5 (Complete sketching asymptotics) Assumption 2: lim n→∞n −1  yT(n)y(n) yT(n)X(n) XT(n)y(n) X T (n)X(n)  = Q for some positive-definite matrix Q. 107 5. Proofs regarding sketching algorithms Theorem. Suppose that Assumptions 1 and 2 hold, k ≥ p, and βS is computed using a Hadamard or Clarkson-Woodruff sketch. Let (X˜TX˜)+ denote the Moore-Penrose pseudo-inverse of (X˜TX˜). Let H˜(n) = RSS (n) F k ( X˜TX˜ )+ and H(n) = RSS (n) F k − p+ 1 ( XT(n)X(n) )−1 . Then as n→∞, convergence in distribution holds for (i)[H −1/2 (n) (βS − β(n)F )|A(n)]→ Student (0, Ip, k − p+ 1) , (ii)[H˜ −1/2 (n) (βS − β(n)F ) |A(n)]→ N (0, Ip) . Notation is slightly heavier in the proof compared to the main text for the sake of clarity. Again we do not explicitly condition on the source dataset A(n), the source dataset is always fixed, and the only randomness is from the sketching matrix. The sketched data will be denoted y˜(n) and X˜(n) to denote the dependence on the n× d source dataset. 
So y˜(n) = Sy(n) and X˜(n) = SX(n). The dimension of the sketched dataset does not change. Assumption 2 is of assistance in establishing the limit theorem. Let Q(n) = n −1  yT(n)y(n) yT(n)X(n) XT(n)y(n) X T (n)X(n).  The matrixQ(n) contains the sufficient statistics needed to fit a Gaussian linear model, y T (n)y(n),X T (n)y(n) and XT(n)X(n) given the source dataset A(n) = [y(n),X(n)]. Assumption 2 states the averaged sufficient statistic matrix converges to a limiting matrix Q. It will be helpful to partition the limiting matrix Q as Q = lim n→∞n −1  yT(n)y(n) yT(n)X(n) XT(n)y(n) X T (n)X(n)  =  s mT m G  , (5.47) where s is a scalar, G is a p× p matrix and b is a p-length column vector. The matrix G is the limiting averaged Gram matrix of the predictors. The vector m is the limit of the predictor response inner products n−1XT(n)y(n), and the scalar s is the limit of the mean total sum of squares n −1yT(n)y(n). As mentioned, the assumption of a sequence of source datasets also gives a sequence of optimal least squares coefficients and residual errors. Let σ 2(n) F = RSSF /n. Define the limiting least squares coefficient estimate as β = lim n→∞β (n) F and the limiting residual error as σ 2 = lim n→∞σ 2(n) F . Both β and σ 2 can be expressed as functions of the matrix Q. Specifically, β = G−1m, (5.48) σ2 = s−mTG−1m. (5.49) From Assumption 2, we have that n−1V(n)D2(n)V T (n) → Q. As such we have that n−1/2D(n)V T(n) → Q1/2 From the sketching central limit theorem the whitened sketched data converges to a matrix normal distribution [y˜(n), X˜(n)]V(n)D −1 (n) d→MN (0k×d, Ik, Id/k) The benefit of adding Assumption 2 is that using Slutsky’s theorem we have the additional convergence result n−1/2[y˜(n), X˜(n)] d→MN (0k×d, Ik,Q/k) . To prove results (i) and (ii) we use the continuous mapping theorem (Van Der Vaart, 1998, p. 7) in conjunction with the previous convergence result. It will be helpful to define the random variables y˜L, X˜L as having the above limiting matrix normal distribution [y˜L, X˜L] ∼MN (0k×d, Ik,Q/k) . 108 5.7. Proof of Theorem 4.5 (Complete sketching asymptotics) This is so we can say that n−1/2[y˜(n), X˜(n)] d→ [y˜L, X˜L]. Lemma 5.5 (Continuous Mapping Theorem). Let Zn indicate a sequence of random vectors and Z indicate another random vector. Suppose the function g : Rd → Rm is continuous at every point of a set C such that P (Z ∈ C) = 1. Then if Zn d→ Z then g(Zn) d→ g(Z). In Lemma 5.5, the function g : Rd → Rm does not change with n, and the dimensions d and m are fixed when taking limits. The sketched estimator βS can be defined as a function of the sketched data that is continuous over the set where X˜(n) is of full rank. Formally we could say that βS = g(n−1/2y˜(n), n−1/2X˜(n)). As X˜L is of rank p almost surely, and X˜(n) d→ X˜L we can apply the continuous mapping theorem to determine the limiting distribution of the βS . The random matrix [y˜L, X˜L] can be described using a hierarchical model completely analogous in structure to the hierarchical model established for the Gaussian sketch in section 3 of the main text. Specifically, y˜L | X˜L ∼ N ( X˜β, 1 k σ2Ik ) , X˜L ∼MN ( 0k×p, Ik, 1 k Q ) . From Theorem 2 in the main text, and recalling that the function g outputs βS , we have that g(y˜L, X˜L) ∼ Student ( β, σ2 k − p+ 1G −1, k − p+ 1 ) . As such, for the Hadamard and Clarkson-Woodruff sketches, [βS | y(n),X(n)] d→ Student ( β, σ2 k − p+ 1G −1, k − p+ 1 ) . Let H(n) = σ 2(n) F /(k − p+ 1) ( n−1XT(n)X(n) )−1 . 
Now as n−1XT(n)X(n) → G, σ2(n)F → σ2, and β(n)F → β, Slutsky’s theorem can be used to arrive at (i), H −1/2 (n) (βS − βF ) d→ Student (0, Ip, k − p+ 1) . For result (ii), let us define the function f(n−1/2y˜(n), n−1/2X˜(n)) = [ n ( X˜TX˜ )+]−1/2 ( X˜+y˜ − β ) = [ n ( X˜TX˜ )+]−1/2 (βS − β) . This function transforms the βS so that the output is uncorrelated. This function is also continuous over the set where X˜(n) is of rank p. Again using the fact that X˜L has rank p almost surely, it follows from the continuous mapping theorem that f(n−1/2y˜(n), n−1/2X˜(n)) d→ f(y˜L, X˜L). Result (ii) in Theorem 4.2 also applies to the hierarchical model for y˜L, X˜L, and gives the distribution of the transformed βS under the Gaussian sketch. The distribution of f(y˜L, X˜L) will be f(y˜L, X˜L) ∼ N ( 0, σ2 k Ip ) . As such, for the Clarkson-Woodruff and Hadamard sketches,[ n ( X˜TX˜ )+]−1/2 (βS − β) d→ N ( 0, σ2 k Ip ) . 109 5. Proofs regarding sketching algorithms Now let H˜(n) = nσ 2(n) F /k ( X˜TX˜ )+ Now as σ 2(n) F → σ2, and β(n)F → β, Slutsky’s theorem can be used to arrive at (ii) H˜ −1/2 (n) (βS − β(n)F ) d→ N (0, Ip) . 5.8 Proof of Theorem 4.6 (Partial sketching asymptotics) Theorem. Suppose that Assumptions 1, 2 and 3 hold, k > p+ 3, and β∗P is computed using a Hadamard or Clarkson-Woodruff sketch. Let H(n) = (k − p− 1) (k − p)(k − p− 3) ( MSS (n) F (X T (n)X(n)) −1 + β(n)F β (n)T F ) . Then as n→∞, (i) ES [β ∗ P − β(n)F |A(n)] → 0. (ii) varS ( H −1/2 (n) (β ∗ P − β(n)F ) |A(n) ) → Id Application of the continuous mapping theorem gives that the distribution of βS and β ∗ P under the Hadamard and Clarkson-Woodruff sketches converges to the distribution of the estimators under the Gaussian sketch. This does not necessarily guarantee convergence in moments. To establish a limit theorem for the bias and variance of the estimators, we need a uniform integrability condition on the sketched dataset. The sketched data will be denoted X˜(n) to denote the dependence on the n × p source covariate matrix. So X˜(n) = SX(n). We again do not explicitly condition on the source dataset A(n) = [y(n),X(n)] in the following working. Let G(n) = n −1X˜T(n)X˜(n). From the continuous mapping theorem and Theorem 4.3, it is known that G−1(n) d→W , where W has an Inverse-Wishart(k, kQ−1) distribution and Q is the limiting matrix from assumption 2. We would like to establish convergence in first and second moments, that is E[G−1(n)]→ E[W ], var(G−1(n))→ var(W ). If convergence in first and second moments occurs, then we can show that (i) and (ii) will hold. If E[G−1(n)]→ E[W ], we can say that E[β∗P − β]→ 0, where β is the limiting ordinary least squares estimator (5.48), that is a function of the limiting matrix Q in assumption 2. From here, using that lim n→∞β (n) F , Slutsky’s theorem can be used to arrive at (i) E[β∗P − β(n)F ]→ 0. To show convergence of the variance of the sketched estimator (ii), we define H(n) = (k − p− 1) (k − p)(k − p− 3) ( MSS (n) F ( XT(n)X(n) )−1 + (k − p+ 1) (k − p− 1)β (n) F β (n)T F ) , H = (k − p− 1) (k − p)(k − p− 3) ( (s− σ2)G−1 + (k − p+ 1) (k − p− 1)ββ T ) . 110 5.8. Proof of Theorem 4.6 (Partial sketching asymptotics) Where s, σ2 and and G are functions of the limiting matrix Q, as in (5.47), (5.48) and (5.49). If var(G−1(n))→ var(W ) it follows that varS ( H−1/2(β∗P − β) ) → Id. 
As H(n) converges to H and β (n) F converges to β asymptotically with n, an application of Slutsky’s theorem gives (ii), varS ( H −1/2 (n) (β ∗ P − β(n)F ) ) → Id As such, if we can establish that var(G−1(n)) → var(W ) we have proved (ii). The following theorem describes the necessary conditions for such convergence to occur. Theorem 5.4. (Billinglsley, 1968, Theorem 5.4) Let X1, . . . ,Xn be a sequence of random vectors. Sup- pose Xn converges in distribution to a random variable Z as n tends to infinity. For the additional convergence of moments E[Xn] → E[Z] and var[Xn] → var[Z], it must hold that for all conformable constant vectors λ lim M→∞ lim sup n→∞ |λTXn|21{|λTXn|2≥M} = 0. The above condition can be difficult to verify directly. It can be shown that if asymptotically |λTXn| has a bounded fourth moment, then the integrability condition is satisfied (Van Der Vaart, 1998, section 2.5). A linear combination of the elements of the random matrix G−1(n) can be written an trace(ΛG −1 (n)) for a p × p matrix of constants Λ. It is easier to work with this form rather than stacking the elements of the random matrix to form a random vector. From theorem 5.4, it is sufficient to show that show that the expected value of |trace(ΛG−1(n))|4 is finite for large n to show the desired convergence in moments. As trace(ΛG−1(n)) equals the sum of the singular values of the matrix ΛG −1 (n), we can form an upper bound on the value, trace(ΛG−1(n)) ≤ p‖ΛG−1(n)‖2 ≤ p‖Λ‖2‖G−1(n)‖2 Squaring both sides gives an upper bound on the quantity that must satisfy the uniform integrability condition, |trace(ΛG−1(n))|2 ≤ p2‖Λ‖22‖G−1(n)‖22. Squaring again gives an upper bound on the fourth moment of the linear combination of interest |trace(ΛG−1(n))|4 ≤ p4‖Λ‖42‖G−1(n)‖42 = p4‖Λ‖42 ( 1 σ2min(G(n)) )2 By assumption 3, the expectation of the right hand side is finite. As such, the uniform integrability condition holds and we can conclude that E[G−1(n)]→ E[W ], var(G−1(n))→ var(W ). As discussed at the beginning of the proof this is sufficient to show that (i) and (ii) hold. 111 6On subspace embeddings, Tracy-Widom limits and approximate Bayesian subset selection 6.1 Summary Sketching is a probabilistic data compression technique that uses random projection. Sketching has recently been proposed as a method for approximate Bayesian regression on large datasets. We investigate sketching for approximate Bayesian subset selection. Sketching algorithms offer probabilistic bounds on the discrepancy introduced from using the sketched dataset in place of the full dataset. Existing bounds give limited information on how to choose the size of the compressed dataset. We consider the asymptotic behaviour of various random sketching matrices and establish an important link between the Tracy-Widom distribution and the stochastic error rates of sketching algorithms. The Tracy-Widom distribution can be used to construct asymptotic error bounds describing the information loss from the use of the randomised algorithm. The asymptotic results provide new insights on the comparative performance of different sketching algorithms and will help to give useful guidelines for practitioners. We test the theory and methods on a large genetic dataset. 6.2 Introduction As discussed in Chapter 4 sketching algorithms use random projection to generate a smaller surrogate dataset for efficient approximate computation. 
Recent work has examined the possibility of using sketch- ing for approximate Bayesian inference (Geppert et al., 2017; Bardenet and Maillard, 2015). An apprecia- ble barrier to the adoption of inferential procedures using sketching is the difficulty in constructing tight error bounds on the sketching noise injected into the analysis. We take a new approach and use random matrix asymptotics to assess the suitability of random projections for approximate Bayesian inference. A main finding is that under mild regularity conditions, the stochastic distortion caused by a wide class of random projections is well described by the Tracy-Widom law (Tracy and Widom, 1994). The asymptotic results help to determine appropriate compression ratios for sketched analyses and to compare the operat- ing characteristics of different random projections. We assess the suitability of sketching for approximate Bayesian subset selection using a combination of asymptotic theory and simulation. We assume we have n independent observations from the standard Gaussian linear model. For sim- plicity the error variance σ2 is treated as known. Let y denote a vector of n responses, and let X denote a n×p matrix of covariates. As discussed in Chapter 4, sketching algorithms use random linear mappings to the size of the dataset from n to k observations. The random linear mapping can be represented as a k × n sketching matrix S. Complete sketching generates a k-length sketched response vector y˜ and a k × p matrix of sketched predictors X˜. The sketched data are computed through the linear mappings y˜ = Sy and X˜ = SX. Partial sketching only generates a k × p matrix of sketched covariates X˜. We again use the random mapping X˜ = SX. To simplify the analysis we only consider complete sketching in this chapter. Extensions to partial sketching are covered in the discussion. Working with the sketched 112 6.2. Introduction responses y˜ and sketched covariates X˜ is more computationally efficient than operating on the full dataset of n observations. The use of structured random projections S allows one to quantify the information loss incurred from working with the compressed dataset. As discussed in section 4.2.1 of Chapter 4, sketching algorithms are typically motivated from objects known as -subspace embeddings (Woodruff, 2014; Meng and Mahoney, 2013; Yang et al., 2015a). Recall the formal definition of an -subspace embedding. Definition 6.1. -subspace embedding. For a given n× p matrix A, we call a k × n matrix S an -subspace embedding for A, if for all vectors z ∈ Rp (1− )||Az||22 ≤ ||SAz||22 ≤ (1 + )||Az||22. An -subspace preserves the linear structure of the original dataset up to a multiplicative (1 ± ) factor. Broadly speaking, the covariance matrix of the sketched dataset is similar to the covariance matrix of the source dataset if  is small. Mathematical arguments show that the sketched dataset is a good surrogate for many linear statistical methods if the sketching matrix S is an -subspace embedding for the original dataset, with  sufficiently small (Woodruff, 2014). Epsilon-subspace embeddings are central in our analysis for developing posterior approximations and for constructing probabilistic bounds on the compression loss. Suitable ranges for  depend on the task of interest and structural properties of the source dataset. Sketching has been investigated for posterior approximation in the fixed model setting. The likelihood is p(y|X, β, σ2) = N(Xβ, σ2In), where β is a p-dimensional vector of coefficients. 
Given a prior distri- bution p(β), the target posterior distribution of interest is p(β|y,X, σ2) ∝ p(y|X,β, σ2)p(β). For large n, it may be computationally expensive to sample from the posterior distribution using Markov-Chain Monte Carlo, or to determine the analytic form if a conjugate prior is used. Geppert et al. (2017) consider using the sketched dataset [y˜, X˜] to construct an approximate posterior distribution on the coefficients β. Given an -subspace embedding it is possible to establish bounds on the difference between the sketched posterior and the target posterior as a function of  (Geppert et al., 2017). We consider approximate Bayesian model selection, also motivated by properties of -subspace embeddings. To assess the confidence in the approximate posterior it is necessary to obtain the probability that the random projection S is an -subspace embedding for the source dataset. The embedding probability Pr(S is an -subspace embedding for A), (6.1) is a critical feature of many sketching algorithms that is difficult to characterise precisely using existing theory (Venkatasubramanian and Wang, 2011). Most existing results on the embedding probability are finite sample lower bounds, that can potentially contain hidden constants. This makes it difficult to design methods for reporting the level of uncertainty associated with the sketched approximation. We take a new approach and develop asymptotic expressions for the embedding probability (6.1). We find that the Tracy-Widom law can be used to describe the probability of obtaining an -subspace embedding for many data oblivious sketches in the literature. The Tracy-Widom law has many applications in high-dimensional statistics (Johnstone, 2006; Bai and Silverstein, 2010), however there is little work to our knowledge foregrounding the importance of the law for statistical computation. The connection to sketching algorithms is novel and hints at deeper principles that may apply to randomised algorithms that use data oblivious projections. We test the accuracy of the asymptotic results on a large genetic dataset and find that the Tracy- Widom law gives a very good approximation for the embedding probability (6.1). We also analyse a real genetic dataset to provide guidelines on the necessary  required for approximate Bayesian model selection. It appears that the noise introduced by the sketch can overwhelm the posterior distribution in situations where there is model uncertainty. False positives are also another systemic issue with sketched model 113 6. Sketching for posterior approximation selection. Although single pass sketching does not appear to be a viable route for posterior approximation, we believe the asymptotic analysis presented here will nevertheless be useful in the design and analysis of general sketching algorithms. 6.3 Bayesian model selection We assume that the model is unknown, and wish to incorporate model uncertainty into the Bayesian analysis. The p-length binary vector γ is used to index different models. Element j in γ is equal to one if variable j is included in the model, and zero otherwise. Let Xγ denote the design matrix for a particular model. The matrix Xγ is a submatrix of X where column j is included if γj = 1. Let pγ denote the number of predictors in model γ, pγ is equal to the sum of the elements of γ. The likelihood is defined as p(y|X,βγ , σ2,γ) = N(Xγβγ , σ2In), where βγ is a pγ-dimensional vector of coefficients for model γ. 
We will use Zellner’s g-prior, p(βγ |γ, σ2) = N(0, gσ2(XTγXγ)−1), where g is a hyper-parameter controlling the prior variance over the coefficients. As stated in the intro- duction we treat σ2 as known. Let the residual sum of squares for a particular model γ be denoted RSSγF = y Ty − yTXγ(XTγXγ)−1XTγ y. The marginal likelihood is a function of the residual sum of squares RSSγF and the total sum of squares yTy. The expression is p(y|γ, g, σ2) = (1 + g)−pγ/2 exp [ − 1 2σ2 ( g g + 1 RSSγF + 1 g + 1 yTy )] . (6.2) We assume we use Markov chain Monte Carlo (MCMC) to sample from the posterior distribution over models, for example the evolutionary stochastic search algorithm in Bottolo and Richardson (2010). For tall datasets it is likely to be advantageous to precompute the sufficient statistics XTX,XTy,yTy. Computing XTX is O(np2), computing XTy is O(np) and computing yTy is O(n). This is a one-off set up cost of O(np2) operations, dominated by computation of the Gram matrix of predictors XTX. At each step in the MCMC algorithm it is only necessary to compute the model sum of squares yTXγ(X T γXγ) −1XTγ y for each new model γ visited. This can be obtained in O(p 3 γ) operations. The first step is to compute (XTγXγ) −1XTγ y by solving the linear system XTγXγb = X T γ y (6.3) for b. The matrix XTγXγ is obtained by taking the appropriate subset from the full Gram matrix of the predictors XTX. Likewise, the marginal associations XTγ y are read from the summary statistic X Ty. Solving the linear system (6.3) takes O(p3γ) operations. Given the solution b̂ = (XTγXγ) −1XTγ y, computation of the product y TXγ [(X T γXγ) −1XTγ y] can be done in O(pγ) time, leading to an overall O(p 3 γ) cost for computing the integrated likelihood. The pay-off from the initial investment in computing the sufficient statistics is that the subsequent cost of each MCMC step is then independent of n. For a very tall dataset the O(np2) set up cost of computing XTX may be undesirable. To reduce the initial computational expense, we can approximate the sufficient statistics using the sketched dataset. Given a sketched dataset of k observations [y˜, X˜], the sketched sufficient statistics X˜TX˜, X˜Ty˜, y˜Ty˜ can be computed in O(kp2) time. The approximate posterior distribution over models is defined using the sketched sufficient statistics. The approximate posterior can be explored using the same MCMC algorithm that would be applied to the full dataset sufficient statistics. Use of the sketched sufficient statistics is formally motivated using -subspace embeddings in the next section. 114 6.4. Approximate Bayesian inference 6.4 Approximate Bayesian inference Geppert et al. (2017) examine sketching for approximate Bayesian regression in the fixed model setting, using properties of -subspace embeddings to motivate their approach. If A˜ = [y˜, X˜] is an -subspace embedding of A = [y,X], it must hold that for all β ∈ Rp, (1− )||y −Xβ||22 ≤ ||y˜ − X˜β||22 ≤ (1 + )||y −Xβ||22. (6.4) Let p(β) denote the prior distribution and p(y|β,X, σ2) denote the likelihood, so p(y|β,X, σ2) = 1 (2piσ2)n/2 exp ( − 1 2σ2 ||y −Xβ||22 ) . The target posterior distribution is p(β|y,X) ∝ p(y|β,X, σ2)p(β). If n is very large it may be compu- tationally infeasible to compute or sample from the target posterior distribution. Geppert et al. 
propose to form an approximate likelihood function p˜(y|β,X, σ2) by substituting the sketched squared residuals ||y˜ − X˜β||22 for the true squared residuals ‖y −Xβ‖22, p˜(y|β,X, σ2) = 1 (2piσ2)n/2 exp ( − 1 2σ2 ||y˜ − X˜β||22 ) . If  is small, the bounds (6.4) imply that p˜(y|β,X, σ2) ≈ p(y|β,X, σ2). Using the sketched likelihood we can then define a sketched posterior p˜(β|X,y, σ2) ∝ p(β)p˜(y|X,β, σ2). As → 0, the approximate likelihood p˜(y|X,β, σ2) approaches the true likelihood p(y|β,X, σ2), and the sketched approximate posterior approaches the target posterior distribution. More formally, as → 0, D(p(β|y,X, σ2) ‖ p˜(β|y,X, σ2))→ 0. where D(p(β|β,X, σ2) ‖ p˜(β|y,X, σ2)) denotes the Kullback-Leibler divergence of the sketched posterior from the target posterior. Geppert et al. establish a bound on the Wasserstein distance between the sketched posterior and the target posterior for finite . We can use a similar argument to construct an approximate posterior distribution over models. Let RSSγS denote the residual sum of squares using the sketched dataset for a particular model γ, RSSγS = min β∈Rpγ ‖y˜ − X˜γβ‖22 = y˜Ty˜ − y˜TX˜γ(X˜Tγ X˜γ)−1X˜Tγ y˜. Lemma 6.1 gives an important result that motivates our approach. Lemma 6.1. Suppose A˜ = [y˜, X˜] is an -subspace embedding of A = [y,X]. Then for all models γ (1− )RSSγF ≤ RSSγS ≤ (1 + )RSSγF . Proof: Let βγF denote the least squares coefficients for model γ using the full dataset, where omitted covariates have their respective coefficients set to zero. The vector βγF is p dimensional, with pγ nonzero elements. Similarly, let βγS denote the least squares coefficients for model γ using the sketched dataset, where again omitted covariates have their respective coefficients set to zero. The vector βγS is also p dimensional, with pγ nonzero elements. It holds that RSS γ F = ‖y −XβγF ‖22 and RSSγS = ‖y˜ − X˜βγS‖22. To establish the upper bound in Lemma 6.1 we use the upper bound in (6.4), ‖y˜ − X˜βγS ||22 ≤ ‖y˜ − X˜βγF ||22 ≤ (1 + )||y −XβγF ||22 = (1 + )RSSγF . 115 6. Sketching for posterior approximation The first line follows from the optimality of βγS on the sketched dataset. To establish the lower bound in Lemma 6.1 we use the lower bound in (6.4), ||y˜ − X˜βγS ||22 ≥ (1− )‖y −XβγS ||22 ≥ (1− )‖y −XβγF ||22 = (1− )RSSγF . The second line follows from the optimality of βγF on the full dataset, completing the proof. We can define an approximate integrated likelihood p˜(y|γ, g, σ2) in terms of the residual sum of squares on the sketched dataset. We substitute RSSγS in place of RSS γ F in the target integrated likelihood (6.2), obtaining p˜(y|γ, g, σ2) = (1 + g)−pγ/2 exp [ − 1 2σ2 ( g g + 1 RSSγS + 1 g + 1 yTy )] . (6.5) If [y˜, X˜] is an -subspace embedding of [y,X], the bounds on the residual sum of squares (1− )RSSγF ≤ RSSγS ≤ (1 + )RSSγF must hold. If  is small, the approximate integrated likelihood p˜(y|γ, g, σ2) should be a good approximation of the target integrated likelihood p(y|γ, g, σ2). The sketched posterior approximation over models is then defined as p˜(γ|y, g, σ2) ∝ p(γ)p˜(y|γ, g, σ2). As → 0, p˜(y|γ, g, σ2) approaches p(y|γ, g, σ2), and the sketched approximate posterior approaches the target posterior distribution. More formally, as → 0, D(p(γ|y, g, σ2) ‖ p˜(γ|y, g, σ2))→ 0, where D(p(γ|y, g, σ2) ‖ p˜(γ|y, g, σ2)) denotes the Kullback-Leibler divergence of the sketched posterior from the target posterior. 
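To make the computational recipe concrete, the snippet below evaluates the log integrated likelihood from precomputed sufficient statistics, for the full dataset as in (6.2) and for a sketched dataset as in (6.5). The code is an illustrative sketch of ours: the function name, the use of a Gaussian sketch, and all dimensions are arbitrary choices, and the sketched total sum of squares $\tilde{y}^T\tilde{y}$ is used in place of $y^Ty$, a term that is constant across models and so does not affect model comparison.

```python
import numpy as np

def log_integrated_likelihood(gamma, XtX, Xty, yty, g, sigma2):
    """Log integrated likelihood under the g-prior, computed from sufficient
    statistics; corresponds to (6.2) for the full data and (6.5) when the
    sketched sufficient statistics are supplied."""
    idx = np.flatnonzero(gamma)
    if idx.size == 0:
        rss = yty
    else:
        b = np.linalg.solve(XtX[np.ix_(idx, idx)], Xty[idx])   # solve (6.3), O(p_gamma^3)
        rss = yty - Xty[idx] @ b                               # residual sum of squares
    return -0.5 * idx.size * np.log(1 + g) - (g / (g + 1) * rss + yty / (g + 1)) / (2 * sigma2)

rng = np.random.default_rng(6)
n, p, k, g, sigma2 = 10_000, 15, 500, 1_000.0, 1.0
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([0.5, -0.5, 0.25]) + rng.standard_normal(n)
gamma = np.zeros(p)
gamma[:3] = 1                                                  # a candidate model

# One-off sufficient statistics: O(np^2) for the full data, O(kp^2) after sketching.
S = rng.standard_normal((k, n)) / np.sqrt(k)                   # Gaussian sketch for illustration
y_sk, X_sk = S @ y, S @ X
full = (X.T @ X, X.T @ y, y @ y)
sketched = (X_sk.T @ X_sk, X_sk.T @ y_sk, y_sk @ y_sk)

print("full    :", log_integrated_likelihood(gamma, *full, g, sigma2))
print("sketched:", log_integrated_likelihood(gamma, *sketched, g, sigma2))
```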
A Markov chain Monte Carlo algorithm targeting the sketched posterior requires a set up cost of O(kp2) operations, after which each MCMC step to a new model has O(p3γ) cost. If we can reliably generate -subspace embeddings with small  we can arrive at a posterior approximation with a smaller overall computational cost than through exact computation. To form concrete probabilistic error bounds we need to determine the probability that the random sketching matrix S is an -subspace embedding for the source dataset [y,X]. The success probability (6.1) determines the confidence we have in the randomised algorithm for approximate regression. We develop useful guidelines for practitioners in section 4. Secondly, we also need to know what value of  is needed to obtain a tolerable posterior approximation. Geppert et al. (2017) aim for  ∈ (0.1, 0.2) for posterior approximation in the fixed model setting. We examine what is an appropriate  for posterior approximation through simulation. Our results suggest that  needs to be much smaller than 0.1 for approximating the posterior distribution over models. We now turn to the first question of interest, determining the embedding probability. 6.5 Embedding probabilities 6.5.1 Previous work We first review some key theoretical results used to support sketching algorithms from the computer science literature. Let A denote an arbitrary n×d data matrix that we wish to compress. Data oblivious random projections are distributions over sketching matrices S ∈ Rk×n that are not a function of the source dataset A ∈ Rn×d. At a high-level, data oblivious random projections offer guarantees on the success probability Pr(S is an -subspace embedding for A), 116 6.5.1. Previous work Sketch Sketching time Required sketch size k Gaussian O(ndk) O((d+ log(1/δ))/2) Hadamard O(nd log k) O(( √ d+ √ log n)2(log(d/δ))/2) Clarkson-Woodruff O(nd) O(d2/δ2) Table 6.1: Properties of different data oblivious random projections (Woodruff, 2014). The third column refers to the necessary sketch size k to obtain an -subspace embedding for an arbitrary n × d source dataset with at least probability (1− δ). for an arbitrary n× d source dataset A. The guarantees are typically finite sample lower bounds on the embedding probability. The bounds concern the required sketch size k needed to obtain an -subspace embedding with probability at least 1− δ. The lower bounds for the Gaussian, Hadamard and Clarkson- Woodruff sketches are listed in Table 6.1. The results are expressed in Big-O notation, the formulae for the required sketch size k are not fully explicit. For the Gaussian sketch, the notation O(( √ d + √ log n)2(log(d/δ))/2) indicates the existence of constants c1 and c2 that are independent of n, such that if k is chosen within the bounds c1(d+ log(1/δ))/ 2) ≤ k ≤ c2(d+ log(1/δ))/2), then the probability of obtaining an -subspace embedding is at least (1 − δ). The hidden constants in Table 6.1 are of little concern from a computer science perspective as the algorithmic significance is that the required sketch size k is independent of n for the Gaussian and Clarkson-Woodruff sketches, and very weakly dependent on n for the Hadamard projection. This implies that size of the sketch does not necessarily have to increase with n to obtain a fixed error guarantee. This is a very desirable property for a data compression algorithm. 
Many proofs of the results in Table 6.1 use the following important lemma that gives a necessary and sufficient condition to obtain an -subspace embedding (Woodruff, 2014, Chapter 2). Lemma 6.2. For a given matrix n×d matrix A of rank d, let U be an orthonormal basis for the columns of A. Let S be some k×n sketching matrix. The matrix S an -subspace embedding for A, if and only if σmax(Id −UTSTSU) ≤ , where σmax(Id −UTSTSU) denotes the maximum singular value of the d× d matrix Id −UTSTSU . Proof: From Definition 6.1, it is necessary to preserve norms in the column space of A up to (1 ± ) factor, (1− )||Az||22 ≤ ||SAz||22 ≤ (1 + )||Az||22. It is therefore sufficient to consider an orthonormal basis for the column space of A, as{ Az : z ∈ Rd} = {Uv : v ∈ Rd} . We can limit attention to unit vectors v, as if (1− )||Uv||22 ≤ ||SUv||22 ≤ (1 + )||Uv||22, holds for all unit vectors v, it holds for all vectors v′ ∈ Rd by scaling. It is therefore sufficient to show that for all unit vectors v, (1− ) ≤ ‖SUv‖22 ≤ (1 + ). 117 6. Sketching for posterior approximation Let λmin(U TSTSU) and λmax(U TSTSU) denote the minimum and maximum eigenvalues of the matrix UTSTSU respectively. The extreme eigenvalues are the solutions to the optimisation problems over unit vectors v, λmin(U TSTSU) = min v ‖SUv‖22, λmax(UTSTSU) = max v ‖SUv‖22. The extreme eigenvalues give upper and lower bounds on the distortion in norms caused by S. As we need (1 − ) ≤ ‖SUv‖22 ≤ (1 + ), the matrix S is then an -subspace embedding if and only if |1−λmin(UTSTSU)| ≤  and |1−λmax(UTSTSU)| ≤ . The singular values of the matrix Id−UTSTSU are given by the absolute values of the eigenvalues of Id−UTSTSU . The maximal singular value is then σmax(Id −UTSTSU) = max(|1− λmin(UTSTSU)|, |1− λmax(UTSTSU)|) (6.6) It follows that σmax(Id−UTSTSU) ≤  is a necessary and sufficient condition for |1−λmin(UTSTSU)| ≤  and |1− λmax(UTSTSU)| ≤  to hold jointly. As such the matrix S is an -subspace embedding if and only if σmax(Id −UTSTSU) ≤  As discussed in Woodruff (2014, Chapter 2), the bounds in Table 6.1 can be obtained by giving up- per bounds on the failure probability 1− Pr(S is an -subspace embedding for A) = Pr(σmax(Id −UTSTSU) > ). Let ‖A‖F denote the Frobenius norm of a n× d matrix A, ‖A‖F = √√√√ n∑ i=1 d∑ j=1 A2ij . The proof for the Clarkson-Woodruff sketch uses Markov’s inequality and the fact that σ2max(A) ≤ ‖A‖2F (Nelson and Nguyeˆn, 2013; Meng, 2014). The key argument is, Pr(σmax(Id −UTSTSU) > ) = Pr(σ2max(Id −UTSTSU) > 2) ≤ −2E (σ2max(Id −UTSTSU)]) ≤ −2E (‖Id −UTSTSU‖2F ) . The expected squared Frobenius norm can be upper bounded using properties of the Clarkson-Woodruff sketch and the fact that UTU = Id (Nelson and Nguyeˆn, 2013; Meng, 2014). The proof for the Hadamard sketch uses a matrix Chernoff bound and Boole’s inequality to upper bound Pr(σmax(Id −UTSTSU) > ) (Tropp, 2011; Meng, 2014). The matrix Chernoff bound is used to bound the extreme eigenvalues of a random matrix. Boole’s inequality upper bounds the probability of a union of events, Pr ( ⋃m i=1Bi) ≤ ∑m i=1 Pr(Bi) for any countable set of events B1, . . . , Bm. The argument is more technical and not reviewed here. See Tropp (2011) for an insightful proof. The different functional forms of the bounds in Table 6.1 are due to the different methods of proof, and the different statistical properties of the Gaussian, Hadamard and the Clarkson-Woodruff sketches. 
The use of chained bounds makes it difficult to keep track of the necessary constant factors to obtain a tight bound on the embedding probability. Finite sample, worst case bounds can be pessimistic in regards to the performance of the randomised algorithm (Halko et al., 2011; Mahoney and Drineas, 2016). We propose to use asymptotic random matrix theory to characterise the distribution of the critical term σmax(Id − UTSTSU). The singular values of large random matrices is a well studied topic in random matrix theory, and we can draw from existing results to analyse the behaviour of sketching algorithms (Bai et al., 2014). Given the asymptotic distribution of σmax(Id−UTSTSU) we can give the asymptotic probability of obtaining an -subspace embedding rather than forming a lower bound. We first analyse the Gaussian sketch. 118 6.5.2. Gaussian sketch 6.5.2 Gaussian sketch From Lemma 6.2, the embedding probability is a function of the maximum singular value Pr(S is an -subspace embedding for A) = Pr(σmax(Id −UTSTSU) ≤ ), As noted in Meng (2014, Section 2.3), when using a Gaussian sketch it is instructive to consider directly the distribution of the random variable σmax(Id−UTSTSU). Lemma 6.2 helps to show why the Gaussian projection is useful as a data oblivious sketch. Consider an arbitrary n×d data matrixA. Let the singular value decomposition ofA be given byA = UDV T. The success probability of interest is can be expressed as Pr(S is an -subspace embedding for A) = Pr(σmax(Id −UTSTSU) ≤ ). Note that on the right hand side the sketching matrix only enters through the term UTSTSU . As S is a matrix of independent Gaussians with mean zero and variance 1/k, it is possible to show that UTSTSU ∼Wishart (k,UTU/k) = Wishart (k, Id/k) , where the second line follows as U is an orthonormal matrix. The key term UTSTSU is in some sense a pivotal quantity, as its distribution is invariant to the actual values of the data matrix A. When using a Gaussian sketch, the probability of obtaining an -subspace embedding has no dependence on the number of original observations n, or on the values in the data matrix A. This is a useful property for a data oblivious sketch, as it is possible to develop universal performance guarantees that will hold for any possible source dataset. This invariance property is also noted in Meng (2014), although the derivation is different. Let us define the random matrix W ∼Wishart(k, Id/k). The success probability of interest can then be expressed in terms of the Wishart distribution, Pr(S is an -subspace embedding for A) = Pr ( σmax(Id −UTSTSU) ≤  ) = Pr(σmax (Id −W ) ≤ ). (6.7) Now as σmax(Id−W ) = max(|1−λmin(W )|, |1−λmax(W )|) the embedding probability can be expressed in terms of the extreme eigenvalues of the Wishart distribution. The embedding probability of interest has the representation Pr(S is an -subspace embedding for A) = Pr(σmax(Id −W ) ≤ ) = Pr (|1− λmin(W )| ≤ , |1− λmax(W )| ≤ ) , (6.8) where W ∼ Wishart(k, Id/k). It is difficult to obtain a closed form expression for the embedding probability as it involves the joint distribution of the extreme eigenvalues. Meng forms a lower bound on this probability using concentration results on the eigenvalues of the Wishart distribution. The lower bound is reported here in Lemma 6.3. Lemma 6.3. (Meng, 2014, Lemma 11) Suppose we have an arbitrary n× d data matrix A where n > d and A is of rank d. Suppose we take a Gaussian projection with k ≥ 6(√d+√2 log(2/δ))2/2 then Pr(S is an -subspace embedding for A) ≥ 1− δ. 
For a proof see (Meng, 2014, Chapter 2, Section 2.5). The bound can be inverted to give a lower bound on the success probability as a function of . Doing so gives Pr(S is an -subspace embedding for A) ≥ 1− exp [ log(2)− ( √ k2/6−√d)√ 2 ] . (6.9) 119 6. Sketching for posterior approximation We can investigate the tightness of the bound (6.9) through simulation. From equation (6.7), for an arbitrary n× d data matrix A, Pr(S is an -subspace embedding for A) = Pr(σmax (Id −W ) < ), whereW ∼Wishart(k, Id/k). To estimate the embedding probability for the Gaussian sketch we can sim- ulateW ∼Wishart(k, Id/k) and look at the empirical distribution of the random variable σmax (Id −W ). We first generated B = 10000 random Wishart matrices W [1], . . . ,W [B]. For each simulated matrix W [b] we computed the distortion factor [b]: [b] = σmax(Id −W [b]), (6.10) for b = 1, . . . , B. The simulated distortion factors [1], . . . , [1B] can be used to give a Monte Carlo estimate of the embedding probability: P̂r(S is an -subspace embedding for A) = P̂r(σmax(Id −W ) ≤ ) = 1 B B∑ b=1 1([b] ≤ ). We used the ARPACK library Lehoucq et al. (1998) to compute the maximum singular value of the high- dimensional matrices. The estimated embedding probabilities are displayed in Figure 6.1 for different dimensions d. The sketch size was kept at k = 20 × d. The red line shows the empirical probability of obtaining an -subspace embedding. The dot-dash line shows the lower bound of Meng (2014), given in equation (6.9). The solid vertical line gives an asymptotic limiting value that will be discussed later in section 6.7.1. Comparing the empirical embedding probabilities to the lower bound, we see that the lower bound is quite loose. The lower bound is zero in each plot at points where the empirical cdf is close to one. To obtain a tighter prediction on the embedding probability we consider the asymptotic distribution of the eigenvalues of a random Wishart matrix. This is a well studied area of random matrix theory (Edelman, 1988). Asymptotic analysis is adopted in order to obtain point estimate of the success probability rather than a worst case bound. We develop two asymptotic expressions for the embedding probability. We introduce the random matrix asymptotics by first considering the pointwise limit of the extreme eigenvalues. The pointwise asymptotic result suggests a sharp phase change in the success probability of the algorithm. We then develop a more accurate approximation for the embedding probability by making a connection to the Tracy-Widom distribution. 6.6 Asymptotics Our asymptotic arguments in section 6.7 concern the convergence of probability measures. Billingsley (1999) is an authoritative reference on the topic. We now recap some useful foundational theory, as is presented in Van Der Vaart (1998, Chapter 2). Let (Zn)n∈N denote a sequence of real v-dimensional random vectors with cumulative distribution functions (Fn)n∈N. Let Z denote a real v-dimensional random vector with cumulative distribution function F . The sequence of random vectors (Zn) converges in distribution to Z if lim n→∞Fn(z) = F (z), at every point z ∈ Rv where the limit distribution F is continuous. Convergence in distribution is denoted as Zn d→ Z. Convergence in distribution is often referred to as weak convergence. Let ‖.‖2 denote the Euclidean norm. A sequence of random vectors (Zn)n∈N, is said to converge in probability to a random vector Z if for all δ > 0: lim n→∞Pr(‖Zn −Z‖2 > δ) = 0. 120 6.6. 
Asymptotics 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 (a) d=10, k=20*d Epsilon Em be dd in g pr ob ab ilit y 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 (b) d=20, k=20*d Epsilon Em be dd in g pr ob ab ilit y 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 (c) d=100, k=20*d Epsilon Em be dd in g pr ob ab ilit y 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 (d) d=1000, k=20*d Epsilon Em be dd in g pr ob ab ilit y Empirical Lower bound Pointwise limit Figure 6.1: Comparison of simulated embedding probabilities against theoretical results at different k and d. The sketch size to variables ratio is kept constant at twenty. We want the sketching matrix S to be an -subspace embedding for the source dataset with small  with high probability. The y-axis gives the proportion of times we attain an -subspace embedding. The x-axis gives the distortion factor  (recall Definition 6.1). The dot-dash line gives a finite sample lower bound. The vertical line gives the asymptotic limiting distribution as d increases. As d increases the empirical cdf concentrates around the step-function given by pointwise asymptotic theory (Theorem 6.5). Convergence in probability is denoted Zn p→ Z. Convergence in probability requires that (Zn) and Z be defined on the same probability space. Convergence in distribution has no such requirement. The Portmanteau lemma gives a number of useful equivalent definitions of convergence in distribution (weak convergence). Lemma 6.4 (Portmanteau). Let (Zn)n∈N denote a sequence of random vectors of fixed dimension, and Z denote another random vector of the same dimension. The following statements are equivalent, where limits are being taken in n: (a) Pr(Zn ≤ z)→ Pr(Z ≤ x) at all continuity points z of the cumulative distribution function Pr(Z ≤ z). (b) Ef(Zn)→ Ef(Z) for all bounded, continuous function f . (c) Ef(Zn)→ Ef(Z) for all bounded Lipschitz functions f (d) lim inf Ef(Zn) ≥ Ef(Z) for all nonnegative, continuous functions f . (e) lim inf Pr(Zn ∈ G) ≥ Pr(Z ∈ G) for every open set G. (f) lim sup Pr(Zn ∈ F ) ≤ Pr(Z ∈ F ) for every closed set F . 121 6. Sketching for posterior approximation (g) Pr(Zn ∈ B) → Pr(Z ∈ B) for all Borel sets B with Pr(X ∈ ∂B) = 0, where ∂B denotes the boundary of the set B. The boundary is defined as the closure of the set B minus the interior of B, so ∂B = B \Bo. The continuous mapping theorem is a useful result for studying the weak convergence of functions of random variables. The continuous mapping theorem is quite powerful as we only need to show that the limiting random variable Z satisfies P (Z ∈ C) = 1. Possible discontinuities of g(Zn) do not cause any difficulties in establishing a limit theorem. Theorem 6.1 (Continuous Mapping Theorem). Let (Zn)n∈N indicate a sequence of v-dimensional ran- dom vectors. Let Z denote another random vector of dimension v. Suppose the function g : Rv → Rm is continuous at every point of a set C such that P (Z ∈ C) = 1. Then the following results hold (a) If Zn d→ Z then g(Zn) d→ g(Z). (b) If Zn p→ Z then g(Zn) p→ g(Z). There are some useful relationships between convergence in probability and convergence in distribu- tion. Additionally, there are some useful relationships between joint and marginal convergence when one or more sequences in question converges weakly to a constant. The following theorem summarises some useful results. Theorem 6.2 (Relationship between modes of convergence). Let (Zn)n∈N and (Yn)n∈N be sequences of random vectors. 
Let Z and Y denote random vectors. Then the following relationships hold: (a) Xn p→X implies Xn d→X. (a) Xn p→ c for a constant c if and only if Xn d→ c. (a) If Xn p→ c1 and Yn p→ c2 for constants c1 and c2, then (Xn,Yn) p→ (c1, c2) Slutsky’s theorem concerns the joint convergence of functions of random variables when there is some convergence to a constant. Theorem 6.3 (Slutsky). Let (Zn)n∈N and (Yn)n∈N be sequences of random vectors or random variables. Let X denote random vector or random variable. If Xn d→X and Yn d→ c for a constant c, then (a) Xn + Yn d→X + c (a) YnXn d→ cX (a) Y −1n Xn d→ c−1X provided that c 6= 0. Lemma 6.5 (Uniform convergence). Suppose that (Zn)converges in distribution to a random vector Z with a continuous distribution function. Then lim n→∞supz |Pr(Zn ≤ z)− Pr(Z ≤ z)| = 0. Proofs of all the results in this subsection are given in Chapter 2 of Van Der Vaart (1998). 122 6.7. Random matrix theory 6.7 Random matrix theory 6.7.1 Pointwise limit The extreme eigenvalues of a Wishart random matrix converge in probability to fixed values as both the dimension and degrees of freedom expand. The result for the largest eigenvalue is due to Geman (1980) and the result for the smallest eigenvalue is due to Silverstein (1985). Theorem 6.4. (Geman, 1980; Silverstein, 1985) Consider a sequence of Wishart(k, Id/k) random matrices where the degrees of freedom k and dimension d are both taken to infinity. Suppose that the variables to samples ratio d/k converges to a constant (d/k) → α, where α ∈ (0, 1]. Then the extreme eigenvalues of the random matrix, λmin and λmax converge in probability to the limits (i) λmin p→ (1−√α)2, (6.11) (ii) λmax p→ (1 +√α)2. (6.12) Theorem 6.4 and the continuous mapping theorem (Theorem 6.1) can be used to determine the asymptotic embedding probability for the Gaussian sketch. Theorem 6.5 gives the limiting probability of obtaining an -subspace embedding for a Gaussian sketch. The limit is asymptotic in n, d and k. This can be interpreted as a type of Big Data asymptotic, where we consider tall and wide datasets through the limit in n and d, and increasing sketch sizes k to cope with the with the expanding number of variables d. In Theorem 6.5 the limiting variables to sketch ratio d/k is taken to a constant that is less than one. This is reasonable as the sketched dataset will only have a full rank covariance matrix if k > d. Theorem 6.5. Suppose we have an arbitrary n × d data matrix A where n > d and A is of rank d. Assume we take a Gaussian sketch of size k. Then asymptotically in n, k and d, with d/k → α where α ∈ (0, 1], lim n,d,k→∞ Pr(S is an -subspace embedding for A) = 0 if  < (1 + √ α)2 − 1 1 if  > (1 + √ α)2 − 1 Before giving a proof we first make a comment on the practical interpretation of Theorem 6.5. For large d and k, we expect the embedding probability to be a step function at (1 + √ d/k)2 − 1. In the simulations depicted in Figure 6.1, the sketch size to variable ratio was held constant at twenty. As k and d grow larger, the embedding probability curve should concentrate at (1 + √ 1/20)2−1 ≈ 0.5. This value is plotted as a dashed line in each panel. Moving through the panels in the order (a) through (d) we can see visible convergence to the pointwise limit around 0.5. As d and k increase, the empirical cdf of the simulated values [1], . . . , [B], (see equation (6.10)) approaches a step function at the pointwise limit. 
Proof: Let W ∼Wishart(k, Id/k), and let λmin and λmax denote the minimum and maximum eigen- values of W respectively. Using Slutsky’s theorem and the continuous mapping theorem we have the joint convergence result |1− λmin| |1− λmax|  p→ |1− (1−√α)2| |1− (1 +√α)2|  . (6.13) For large k and d, the maximum eigenvalue λmax is expected to show greater deviation from one than the minimum eigenvalue λmin. Over the interval α ∈ (0, 1] it holds that |1− (1 +√α)2| > |1− (1−√α)2|. Applying the continuous mapping theorem to the random vector in (6.13), max |1− λmin| |1− λmax|  p→ max |1− (1−√α)2| |1− (1 +√α)2|  , 123 6. Sketching for posterior approximation yielding max(|1 − λmin|, |1 − λmax|) p→ |1 − (1 + √ α)2|. Now as (1 + √α)2 is greater than one for all α > 0, the absolute value sign can be removed in the limit giving the equivalent statement max(|1 − λmin|, |1 − λmax|) p→ (1 + √ α)2 − 1. Recalling that σmax(Id −W ) = max(|1 − λmin|, |1 − λmax|), we establish convergence of the limiting singular value σmax(Id −W ) p→ (1 + √ α)2 − 1. (6.14) Property Portmanteau lemma then gives the probabilistic statement lim n,d,k→∞ Pr(σmax(Id −W ) ≤ ) = 0 if  < (1 + √ α)2 − 1, 1 if  > (1 + √ α)2 − 1. As  = (1 + √ α)2 − 1 is a discontinuity point of the limiting distribution function we do not make a statement about the case  = (1 + √ α)2 − 1. From (6.7) we have the equality in limits lim n,d,k→∞ Pr(S is an -subspace embedding for A) = lim n,d,k→∞ Pr(σmax(Id −W ) ≤ ), giving the final result. In Figure 6.1 the Monte Carlo estimate of the embedding probability curve shows deviation from the sharp step function predicted by Theorem 6.5. We can obtain a more accurate approximation of the embedding probability using more sophisticated asymptotics. 6.7.2 Tracy-Widom limit We will show that the embedding probability (6.1) under the Gaussian sketch is closely related to the Tracy-Widom law. The Tracy-Widom distribution is the limiting distribution of the maximum eigenvalue of a random Wishart(m, Im) matrix. Note that in the previous expression the degrees of freedom m matches the dimension of the m×m random matrix. The cumulative distribution function of the Tracy- Widom distribution F1(x), is defined as F1(x) = exp ( −1 2 ∫ ∞ x q(t) + (t− x)q2(t) dt ) . Where q(x) satisfies the nonlinear differential equation q′′(x) = xq(x) + 2q3(x), subject to the asymptotic boundary condition, q(x) ∼ Ai(x) as x → ∞. The function Ai(x) denotes the Airy function, defined as Ai(x) = 1 pi ∫ ∞ 0 cos ( t3 3 + xt ) dt. The Airy equation y′′ − xy = 0 has solution y = Ai(x) under the boundary condition y → 0 as x → ∞. Figure 6.2 plots the density function of the Tracy-Widom distribution and the Airy function. The R package RMTstat has a suite of functions for working with the Tracy-Widom distribution (Johnstone et al., 2014). The Tracy-Widom law describes Wishart(m, Im) random matrices, however we are interested in situations where the degrees of freedom may not match the dimension of the matrix. For sketching algorithms the random matrix of interest W ∼ Wishart(k, Id/k) has the degrees of freedom controlled by the sketch size k, and the dimension of the matrix is given by the number of variables in the source dataset d. Johnstone (2001) showed that Tracy-Widom law also gives the asymptotic distribution of the maximum eigenvalue of a Wishart(k, Id/k) matrix after appropriate centring and scaling. 
In subsequent work Ma (2012) showed that the rate of convergence could be improved by using different centering and scaling constants than in Johnstone (2001). We present the convergence result given by Ma. 124 6.7.2. Tracy-Widom limit −6 −4 −2 0 2 4 6 0. 00 0. 10 0. 20 0. 30 (a) Tracy−Widom distribution x D en si ty −15 −10 −5 0 5 10 − 0. 4 − 0. 2 0. 0 0. 2 0. 4 (b) Airy function Ai(x) x Ai (x) Figure 6.2: (a) Tracy-Widom distribution. (b) Airy function. The Tracy-Widom distribution has a significant role in describing the asymptotic distributions of the eigenvalues of large random matrices. Theorem 6.6. (Ma, 2012) Consider a sequence of Wishart(k, Id/k) random matrices where the degrees of freedom k and dimension d are both taken to infinity. Let λmax denote the maximum eigenvalue of the random matrix. Suppose that the variables to samples ratio d/k converges such that d/k → α where α ∈ (0, 1]. Define the centring and scaling constants as µk,d = k −1( √ k − 1/2 + √ d− 1/2)2, σk,d = k −1( √ k − 1/2 + √ d− 1/2) ( 1√ k − 1/2 + 1√ d− 1/2 )1/3 . Then (λmax − µk,d) σk,d d→ Z, where Z ∼ F1 and F1 is the Tracy-Widom distribution. The limiting values of µk,d and σk,d are (1+ √ α)2 and 0 as we take k, d to infinity with d/k → α. This shows a correspondence with the pointwise limit of maximum eigenvalue given in Theorem 6.4. Theorem 6.7 uses the Tracy-Widom law to describe the asymptotic embedding probability when using a Gaussian sketch. Theorem 6.7. Suppose we have an arbitrary n× d data matrix A where n > d and A is of rank d. Let the singular value decomposition of A be given by A = UDV T. Furthermore assume we take a Gaussian sketch of size k. Consider the limit in n, k and d, such that d/k → α with α ∈ (0, 1]. Let µk,d and σk,d be the centering and scaling constants given in Theorem 6.6. The asymptotically in n, d and k, (i) lim n,d,k→∞ ∣∣∣∣Pr (S is an -subspace embedding for A)− Pr(Z ≤ + 1− µk,dσk,d )∣∣∣∣ = 0. Where Z ∼ F1 and F1 is the Tracy-Widom distribution. Furthermore we have convergence in distribution (ii) σmax(Id −USTSU)− µk,d + 1 σk,d d→ Z. 125 6. Sketching for posterior approximation 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 (a) d=10, k=20*d Epsilon Em be dd in g pr ob ab ilit y 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 (b) d=20, k=20*d Epsilon Em be dd in g pr ob ab ilit y 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 (c) d=100, k=20*d Epsilon Em be dd in g pr ob ab ilit y 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 (d) d=1000, k=20*d Epsilon Em be dd in g pr ob ab ilit y Empirical Lower bound Pointwise limit Tracy−Widom limit Figure 6.3: Empirical probability of obtaining an -subspace embedding at different k and d. The sketch to vari- ables ratio is kept constant at twenty. We want the sketching matrix S to be an -subspace embedding for the source dataset with small  with high probability. The y-axis gives the proportion of times we attain an -subspace embedding. The x-axis gives the distortion factor  (recall Definition 6.1). The empirical cdf is obtained by simulat- ing σmax(Id −W ) where W ∼ Wishart(k, Id/k). The Tracy-Widom law gives the most accurate predictions of the embedding probability in the simulation. The lower bound is conservative at each d. The Tracy-Widom based pre- diction (Theorem 6.7) describes the fluctuations around the sharp step-function given by the pointwise asymptotic theory (Theorem 6.5). The accuracy of the Tracy-Widom approximation improves as d increases. 
Before presenting the proof we show an application of Theorem 6.7. Result (i) gives an asymptotic approximation for the embedding probability given that we take a sketch of size k of a source dataset with d variables. We compare the Tracy-Widom approximation in Theorem 6.7 to the simulation results reported in section 6.5.2. Figure 6.3 displays the results. The asymptotic expression for the embedding probability is superimposed as a blue dashed line over the empirical estimate (red line). The pointwise limit discussed in section 6.7.1 is again plotted as a solid vertical line. The Tracy-Widom limit is clearly more accurate than the pointwise approximation, modelling the fluctuation around the limiting value (1 −√(1/20))2 − 1 ≈ 0.5. The Tracy-Widom approximation improves as k and d grow larger. In (a) we can see that the left tail of the Tracy-Widom approximation is not accurate for d = 10. In (c) we can see that the approximation is very good at d = 100. The lower bound (6.9) is again plotted as the dot-dash line. Although an asymptotic result, the Tracy-Widom limit gives a more accurate prediction of the embedding probability than the finite sample lower bound in this simulation. Theorem 6.7 (ii) gives the asymptotic distribution of σmax(Id − UTSTSU) for an arbitrary data matrix A = UDV T. The simulated values [1], . . . , [B] (recall equation (6.10)) can be used to estimate the density of σmax(Id−UTSTSU). Figure 6.4 compares the asymptotic theoretical density of σmax(Id− UTSTSU) to a kernel density estimate from the simulated values. The agreement improves as d grows, with the asymptotic approximation being extremely good for d = 100 and d = 1000. Theorem 6.7 is not an immediate extension of Theorem 6.6 as Theorem 6.6 does not pertain describe the joint distribution 126 6.7.2. Tracy-Widom limit 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0 2 4 6 8 (a) d=10, k=20*d σmax(Id − UTSTSU) D en si ty 0.3 0.4 0.5 0.6 0.7 0 2 4 6 8 (b) d=20, k=20*d σmax(Id − UTSTSU) D en si ty 0.40 0.45 0.50 0.55 0.60 0 5 10 15 20 25 (c) d=100, k=20*d σmax(Id − UTSTSU) D en si ty 0.47 0.48 0.49 0.50 0.51 0.52 0 20 40 60 80 10 0 (d) d=1000, k=20*d σmax(Id − UTSTSU) D en si ty Empirical Tracy−Widom limit Pointwise limit Figure 6.4: Observed and theoretical density of σmax(Id −UTSTSU) at different k and d. The sketch to variables ratio is kept constant at twenty. We want the sketching matrix S to be an -subspace embedding for the source dataset with small  with high probability. The sketching matrix S is an -subspace embedding for the source dataset if and only if σmax(Id − UTSTSU) ≤ . The x-axis limits are different in each plot. The accuracy of the Tracy-Widom approximation improves as d increases as is expected from the asymptotic theory (Theorem 6.7). of λmin and λmax. The embedding probability involves the joint distribution of λmin and λmax. We use Theorem 6.6 in conjunction with the Portmanteau lemma, Theorem 6.5 and the continuous mapping theorem in order to establish Theorem 6.7. We now present the proof of Theorem 6.7. Proof: Let W ∼Wishart(k, Id/k), and let λmin and λmax denote the minimum and maximum eigenvalues of W respectively. The majority of the proof comes down to showing that λmax controls the embedding probability. Using the Portmanteau lemma (Lemma 6.4) we will show that lim d,k→∞ Pr(σmax(Id −W ) ≤ ) = lim d,k→∞ Pr (|1− λmax| ≤ ) . Recall the key expression given in (6.8), Pr(S is an -subspace embedding for A) = Pr(σmax(Id −W ) ≤ ) = Pr (|1− λmin| ≤ , |1− λmax| ≤ ) . 
The Tracy-Widom law describes the marginal distributions of λmin and λmax. We would like to avoid working with the joint distribution of the extreme eigenvalues, and instead restrict attention to the distribution of the maximum. Let denote the random vector X = (|1 − λmin|, |1 − λmax|)T. Figure 6.5 presents some diagrams that will be useful. We wish to know the probability that X lies in the region shaded region C in panel (a). For every  > 0 we have that Pr (|1− λmin| ≤ , |1− λmax| ≤ ) = Pr(X ∈ C). The region C can be expressed as C = M−R where M and R are the shaded regions in panels (b) and 127 6. Sketching for posterior approximation Figure 6.5: Regions of interest in determining the embedding probability. To obtain an -subspace embedding we require that |1− λmin| ≤  and |1− λmax| ≤ . If we define X = (|1− λmin|, |1− λmax|)T, we have that Pr(X ∈ C) = Pr(X ∈M)− Pr(X ∈ R). In panel (c) the dot-dash line gives the identity line where |1− λmax| = |1− λmin|. (c) respectively. The probability Pr(X ∈M) represents the marginal probability that |1−λmax| ≤ . The probability Pr(X ∈ R) represents the probability of the joint event that (|1− λmax| ≤ , |1− λmin| > ). We have that Pr(X ∈ C) = Pr(X ∈M)− Pr(X ∈ R). In panel (c) the dot-dash line gives the identity line where |1−λmax| = |1−λmin|. From Theorem 6.5 we know that as d, k tends to infinity X converges in distribution to the constant vector XL = (|1 − (1 −√ α)2|, |1−(1+√α)2|)T. As such, asymptotically |1−λmax| > |1−λmin| with probability one. Referring to panel (c), the random vector XL takes values in the region below the dot-dash line with probability one. The limiting random vector XL thus satisfies Pr(XL ∈ R) = 0 and Pr(XL ∈ ∂R) = 0. As X d→ XL, property (g) of the Portmanteau lemma (Lemma 6.4) gives that Pr(X ∈ R) → Pr(XL ∈ R) = 0 . The limiting probability is then lim d,k→∞ Pr (|1− λmin| ≤ , |1− λmax| ≤ ) = lim d,k→∞ Pr(X ∈ C) (6.15) = lim d,k→∞ Pr(X ∈M)− lim d,k→∞ Pr(X ∈ R) (6.16) = lim d,k→∞ Pr(X ∈M)− 0 (6.17) = lim d,k→∞ Pr (|1− λmax| ≤ ) .. (6.18) We have now isolated the maximum eigenvalue λmax as the determining factor in obtaining an -subspace embedding. We make another application of the Portmanteau lemma to arrive at the final result. From here we can write Pr (|1− λmax| ≤ ) = Pr(λmax ≤ + 1)− Pr(λmax ≤ 1− ). (6.19) From Theorem 6.5 we know that λmax converges in distribution to the constant random variable ZL = (1 + √ α)2, where we have assumed α ∈ (0, 1]. Let B denote the interval (−∞, 1]. The limiting random variable ZL satisfies Pr(ZL ∈ B) = 0 and Pr(ZL ∈ ∂B) = 0. As such using property g of the Portmanteau lemma, lim d,k→∞ Pr(λmax ∈ B) = 0. Now Pr(λmax ≤ 1 − ) ≤ Pr(λmax ∈ B) for any  > 0. We can then conclude that lim d,k→∞ Pr(λmax ≤ 1 − ) = 0 for any  > 0. Asymptotically, the term Pr(λmax ≤ 1 − ) drops out of the expression for the embedding probability. Taking limits over (6.19) lim d,k→∞ Pr (|1− λmax| ≤ ) = lim d,k→∞ Pr(λmax ≤ + 1)− lim d,k→∞ Pr(λmax ≤ 1− ) = lim d,k→∞ Pr(λmax ≤ + 1)− 0. 128 6.8. Sketching asymptotics The asymptotic embedding probability is then related to the asymptotic cdf of λmax. Let Z be a random variable with Tracy-Widom distribution F1. Now for any fixed d and k, it must hold that for any fixed  > 0,∣∣∣∣Pr(λmax − µk,dσk,d ≤ + 1− µk,dσk,d ) − Pr ( Z ≤ + 1− µk,d σk,d )∣∣∣∣ ≤ sup z∈R ∣∣∣∣Pr(λmax − µk,dσk,d ≤ z ) − Pr(Z ≤ z) ∣∣∣∣ . Using Theorem 6.6, the random variable (λmax − µk,d)/σk,d converges in distribution to the continuous random variable Z ∼ F1. 
It follows from Lemma 6.5 that lim d,k→∞ sup z∈R ∣∣∣∣Pr(λmax − µk,dσk,d ≤ z ) − Pr(Z ≤ z) ∣∣∣∣ = 0. Now by the squeeze theorem, it holds that for all  > 0, lim d,k→∞ ∣∣∣∣Pr(λmax − µk,dσk,d ≤ + 1− µk,dσk,d ) − Pr ( Z ≤ + 1− µk,d σk,d )∣∣∣∣ = 0. Property (a) of the Lemma (6.4) then gives part (ii). 6.8 Sketching asymptotics We also wish to characterise the probability of obtaining an -subspace embedding for the Hadamard and Clarkson-Woodruff projections. Again letA be some large data matrix with singular value decomposition A = UDV T. The embedding probability of interest is Pr(S is an -subspace embedding for A) = Pr(σmax ( Id −UTSTSU) ≤  ) . Because of the discrete nature of the Hadamard and Clarkson-Woodruff projections it is cumbersome to express this probability in a meaningful way. As in Chapter 4 we again appeal to large n asymptotics to obtain an interpretable expression. Using the sketching central limit theorem (Theorem 4.3) we can argue that the Hadamard and Clarkson-Woodruff sketches have the same limiting embedding probability as the Gaussian projection. The sketching central limit theorem is restated here as Theorem 6.8. The necessary regularity conditions are described again in Assumption 1. Assumption 1 Let the singular value decomposition of the n × d source dataset be given by A(n) = U(n)D(n)V T (n). Let u T (n)i give the ith row in U(n) for i = 1, . . . , n. Assume that the maximum leverage score tends to zero, that is lim n→∞ maxi=1,...,rn ‖u(n)i‖22 = 0. Theorem 6.8. Consider a sequence of arbitrary n× d data matrices A(n), where d is fixed. Let A(n) = U(n)D(n)V T (n) represent the singular value decomposition of A(n). Let S be a k×n Hadamard or Clarkson- Woodruff sketching matrix where k is also fixed. Let uT(n)i represent the ith row of U(n). Suppose that Assumption 1 on the maximum leverage score is satisfied. Then as n tends to infinity with k and d fixed, [SA(n)]V(n)D −1 (n) d→ MN(0, Ik, Id/k). Theorem 6.9 gives the asymptotic probability of obtaining an -subspace embedding for the Hadamard and Clarkson-Woodruff sketches. Theorem 6.9. Consider a sequence of arbitrary n× d data matrices A(n), where each data matrix is of rank d, and d is fixed. Let A(n) = U(n)D(n)V T (n) represent the singular value decomposition of A(n). Let S be a k× n Hadamard or Clarkson-Woodruff sketching matrix where k is also fixed. Let uT(n)i represent the ith row of U(n). Assume that the maximum leverage score tends to zero, so lim n→∞ maxi=1,...,n ‖u(n)i‖22 = 0. 129 6. Sketching for posterior approximation Then as n tends to infinity with k and d fixed, lim n→∞Pr ( S is an -subspace embedding for A(n) ) = Pr (σmax(Id −W ) ≤ ) , where W ∼Wishart(k, Id/k). Proof: The sketching central limit theorem holds under Assumption 1. As we only need to con- sider the sequence of orthonormal matrices U(n) to determine the embedding probability, we can use Theorem 6.8 with D(n) and V(n) set to the d × d identity matrix. As such we conclude that SU(n) d→ MN(0k×d, Ik, Id/k). By the continuous mapping theorem it holds that for fixed d and k, asymptotically with n, UT(n)S TSU(n) d→Wishart(k, Id/k). Another application of the continuous mapping theorem gives σmax(Id −UT(n)STSU(n)) d→ σmax(Id −W ), where W ∼ Wishart(k, Id/k). We can use the continuous mapping theorem as the limiting Wishart matrix W has rank d with probability one. The maximum singular value function is continuous over the range where W has full rank (Bhatia, 1996). 
By the Portmanteau lemma it then holds that lim n→∞Pr ( σmax(Id −UT(n)STSU(n)) ≤  ) = Pr (σmax(Id −W ) ≤ ) . Now as lim n→∞Pr ( S is an -subspace embedding for A(n) ) = lim n→∞Pr ( σmax(Id −UT(n)STSU(n)) ≤  ) = Pr (σmax(Id −W ) ≤ ) , we have the final result The limiting embedding probability in Theorem 6.9 is the exact same embedding probability as for the Gaussian sketch (6.7). As n grows, we expect the Hadamard and Clarkson-Woodruff sketches to be as effective as the Gaussian sketch for generating -subspace embeddings. This is significant as the Hadamard and Clarkson-Woodruff sketches are dramatically faster than the Gaussian projection (see Table 6.1). The Clarkson-Woodruff sketch is O(k) times faster than the Gaussian projection, which can be a very large factor in practice. We can again use the Tracy-Widom distribution to approximate the limiting embedding probability Pr (σmax(Id −W ) ≤ ) for the Clarkson-Woodruff and Hadamard sketches. We are unable to establish a formal limit theorem in terms of the Tracy-Widom distribution lim n,d,k→∞ Pr(S is an -subspace embedding for A(n)) = Pr ( Z ≤ + 1− µk,d σk,d ) , where Z ∼ F1 as per Theorem 6.7. This is because the sketching central limit theorem (Theorem 6.8) has k and d fixed, with only n being taken to infinity. Central limit theorems under expanding dimension are more challenging (Portnoy, 1986), and this is left as future work. It is possible that Assumption 1 on the leverage scores will remain sufficient in the expanding dimension scenario. The following reasoning was also used by Huber (1973) in the analysis of high dimensional regression models. For any d, the maximum leverage score must be greater than the average leverage score. We thus have max i=1,...,n ‖u(n)i‖22 ≥ 1 n n∑ i=1 ‖u(n)i‖22 = d n . If we maintain that Assumption 1 holds on the leverage scores as n, d, k →∞ we necessarily imply that d/n→ 0. This is intuitively reasonable. If we simultaneously require d/k → α where α ∈ (0, 1] we must 130 6.9. Data application also have that k/n → 0. This is also a sensible requirement. The proofs in Chapter 5 do not appear to have any critical weaknesses that prevent an extension to the expanding dimension scenario. The lack of a central limit theorem in the expanding dimension scenario is not a critical gap, as the key result is that the Hadamard and Clarkson-Woodruff sketches behave like the Gaussian projection for large n, with k and d fixed. If the Tracy-Widom approximation in Theorem 6.7 is good for finite k and d with the Gaussian sketch, it should hold well for the Hadamard and Clarkson-Woodruff projections for n sufficiently large. 6.9 Data application We test the theory on a large genetic dataset. The covariate data consists of genotypes at p = 1032 genetic variants on n = 407, 779 subjects. The genetic variants are in the Protein Kinase C Epsilon (PRKCE) gene. Variants were filtered to have mean allele frequency of greater than one percent. The response variable is haemoglobin concentration adjusted for age/sex and technical covariates. We also consider a subset of this dataset with p = 130 representative markers identified by hierarchical clustering. We first assess the accuracy of Theorem 6.7 and 6.9 by comparing the theoretical embedding proba- bility to the observed embedding probability. We then check the accuracy of the posterior approximation as a function of  in a simulation study. We compare the Gaussian, Hadamard and Clarkson-Woodruff sketches. 
We also compare the data oblivious sketches to simple random sampling with replacement. The simple random sampling of observations is also referred to as the Uniform projection in the sketching literature. As mentioned in Chapter 4, although the uniform projection is simple to understand and implement, it is difficult to establish strong error bounds on its performance. 6.9.1 Embedding probabilities The dataset is of moderate size, so it is feasible to take the singular value decomposition of the full n × d dataset A = UDV T. Given the singular value decomposition we can run an oracle procedure to determine the empirical embedding probability. Suppose we take B sketches. Let S[1], . . . ,S[B] denote the B sketching matrices. Define [b] as [b] = σmax(Id −UTS[b]TS[b]U), for b = 1, . . . , B. The value [b] measures the quality of the sketching matrix S[b]. The estimated embedding probability is then P̂r(S is an -subspace embedding for A) = P̂r(σmax(Id −UTSTSU) ≤ ) = 1 B B∑ b=1 1([b] ≤ ). Figure 6.6 shows the empirical and theoretical embedding probabilities for the representative PRKCE dataset (n = 407, 779, d = 132) for each type of sketch. We took one hundred sketches at k = 20 × d. The observed and theoretical curves match well for the Gaussian, Hadamard and Clarkson-Woodruff projection. The uniform projection performs worse than the other data-oblivious random projections. The distribution of  is shifted to the right compared to the other projections. Larger values of  indicate weaker approximation bounds. The uniform projection does not satisfy a central limit theorem for fixed k, so we do not necessarily expect the Tracy-Widom law to give a good approximation for the uniform projection. Table 6.2 reports the average sketching time for each projection. The Clarkson-Woodruff projection is much faster than the Gaussian projection, but has the same embedding probability, as predicted by Theorem 6.9. Figure 6.7 shows the empirical and theoretical embedding probabilities for the full PRKCE dataset (n = 407, 779, d = 1034). We took one hundred sketches at k = 20 × d. The x-axis is different in 131 6. Sketching for posterior approximation Clarkson−Woodruff Uniform Gaussian Hadamard 0.5 0.6 0.7 0.8 0.5 0.6 0.7 0.8 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Epsilon Em be dd in g pr ob ab ilit y Empirical Pointwise limit Tracy−Widom limit Figure 6.6: Empirical and theoretical embedding probabilities for the representative PRKCE dataset. Hierarchical clustering was used to select a subset of genetic variants from the full PRKCE dataset. We want the sketching matrix S to be an -subspace embedding for the source dataset with small  with high probability. The y-axis gives the proportion of times we attain an -subspace embedding. The x-axis gives the distortion factor  (recall Definition 6.1). In this simulation n = 407, 779, d = 132, k = 20× d. One thousand sketches were generated for each type of sketch. The vertical line gives the asymptotic pointwise limit. The Tracy-Widom approximation is very accurate for the Gaussian, Hadamard and Clarkson-Woodruff sketches. Projection Gaussian Hadamard Clarkson-Woodruff Uniform Time (seconds) 769 17.6 1.34 0.03 Table 6.2: Mean sketching time for the representative PRKCE dataset. Results for the Hadamard, Clarkson- Woodruff and Uniform projections are over one hundred sketches. Results for the Gaussian sketch are over ten sketches.The Gaussian sketch is considerably slower than the Hadamard and Clarkson-Woodruff sketches as is expected from Table 6.1. 
132 6.9.1. Embedding probabilities Clarkson−Woodruff Uniform Gaussian Hadamard 0.5 0.6 0.7 0.8 0.9 0 5 10 15 0.450 0.475 0.500 0.525 0.550 0.450 0.475 0.500 0.525 0.550 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Epsilon Em be dd in g pr ob ab ilit y Empirical Pointwise limit Tracy−Widom limit Figure 6.7: Empirical and theoretical embedding probabilities for the full PRKCE genetic dataset (n = 407, 779, d = 1034, k = 20 × d). One hundred sketches were generated for each type of sketch. We want the sketching matrix S to be an -subspace embedding for the source dataset with small  with high probability. The y-axis gives the proportion of times we attain an -subspace embedding. The x-axis gives the distortion factor  (recall Definition 6.1). Different x-scales are used in each plot. The Tracy-Widom approximation is very accurate for the Gaussian sketch. There is moderate deviation from the asymptotic approximation for the Hadamard sketch and larger deviation for the Clarkson-Woodruff sketch. each plot. Overall, the empirical and theoretical curves do not match as well compared to Figure 6.6. The approximation for the Gaussian projection is very accurate. Interestingly, the Hadamard projection slightly outperforms the Gaussian projection. The dashed line in the Hadamard panel is shifted slightly to the left compared to the theoretical curve, representing better than expected performance. The Clarkson-Woodruff projection has a much longer right tail then is predicted by the Tracy-Widom law. The Uniform projection does very poorly in this simulation compared to the the other data oblivious projections. Even though the distribution of  under Clarkson-Woodruff projection has a longer right tail than the theoretical, the distortion factors are always below 1. The median for the Uniform projection is above 5. The Hadamard and Clarkson-Woodruff projections to be more stable in this example. Data oblivious projections are designed to be robust, and it appears there are features in this dataset that cause major problems for uniform subsampling that are less of an issue for the data oblivious sketches. The extra computational cost of the Hadamard and Clarkson-Woodruff projections come with some demonstrable benefits. Table 6.3 reports the average sketching time for each projection. The Gaussian sketch was not timed in this example as we simulated directly from the matrix normal distribution of the sketched data A˜ ∼ MN(0, Ik,ATA/k) rather than computing the matrix product SA directly. This was so that the simulation could be run in a reasonable amount of time. The Clarkson-Woodruff sktch is again faster than the Hadamard sketch as is expected. We can better understand the deviation from the Tracy-Widom limit in Figure 6.7 by looking at the density of σmax(Id − UTSTSU). Figure 6.8 compares the empirical density to the Tracy-Widom approximation given in Theorem 6.7. The empirical distribution under the Hadamard sketch is shifted to 133 6. Sketching for posterior approximation Clarkson−Woodruff Uniform Gaussian Hadamard 0.5 0.6 0.7 0.8 0 5 10 15 0.450 0.475 0.500 0.525 0.550 0.450 0.475 0.500 0.525 0.550 0 30 60 90 0.00 0.05 0.10 0.15 0 30 60 90 0 30 60 90 σmax(Id − UTSTSU) de ns ity Empirical Pointwise limit Tracy−Widom limit Figure 6.8: Empirical and theoretical density of σmax(Id−UTSTSU) for the full PRKCE genetic dataset. We want the sketching matrix S to be an -subspace embedding for the source dataset with small  with high probability. 
The sketching matrix S is an -subspace embedding for the source dataset if and only if σmax(Id −UTSTSU) ≤ . One hundred sketches were generated for each type of sketch. The vertical line gives the asymptotic pointwise limit. Different x-scales are used in each plot. The Tracy-Widom approximation is very accurate for the Gaussian sketch. There is moderate deviation from the asymptotic approximation for the Hadamard sketch and larger deviation for the Clarkson-Woodruff sketch. the left compared to the asymptotic result, and the empirical distribution under the Clarkson-Woodruff sketch has more right skew. The deviation from the Tracy-Widom limit in Figure 6.8 could be because the finite sample approx- imation is poor. Theorem 6.9 suggests that the Hadamard and Clarkson-Woodruff projections behave like the Gaussian sketch for n sufficiently large. To test this we bootstrapped the full PRKCE dataset to be ten times its original size. The bootstrapped PRKCE dataset has n = 4, 077, 790, d = 1034. We took one thousand sketches of size k = 20 × d using the Clarkson-Woodruff projection and ran the or- acle procedure of computing [b] = σmax(Id − UTS[b]TS[b]U) for each sketch. Figure 6.9 compares the distribution of σmax(Id −UTSTSU) using Clarkson-Woodruff projection on the original dataset and on the large bootstrapped dataset. As n increases we expect the quality of the Tracy-Widom approximation to improve. Panel (a) compares the theoretical to the simulation results on the original dataset. The Clarkson-Woodruff projection shows greater variance than expected. Panel (b) compares the theoretical to the simulation results on the bootstrapped dataset. In (b) we see very good agreement between the empirical distribution and the theoretical distribution. It seems that for this dataset n ≈ 400, 000 is not big enough for the large sample asymptotics to kick in. At n ≈ 4 million we see the expected asymptotic behaviour. The results in Figure 6.9 are significant as they reflect the culmination of the asymptotic analysis in Chapter 5 and the asymptotic results in this chapter. The sketching central limit theorem ties the large n behaviour of the Clarkson-Woodruff projection to that of the Gaussian projection. Theorem 6.7 links the large d and k behaviour of the Gaussian projection to the Tracy-Widom law. The finite sample analysis of the Clarkson-Woodruff projection is very difficult. By chaining together asymptotic results we can obtain asymptotic predictions on its performance. As with any asymptotic analysis there 134 6.9.2. Posterior approximation 0.48 0.50 0.52 0.54 0.56 0 20 40 60 80 10 0 12 0 Orginal PRKCE dataset (n=407,779, d=1034, k=20*d) N = 100 Bandwidth = 0.003686 D en si ty 0.48 0.50 0.52 0.54 0.56 0 20 40 60 80 10 0 12 0 Boostrapped PRKCE dataset (n=4,077,790, d=1034, k=20*d) N = 1000 Bandwidth = 0.0007265 D en si ty Figure 6.9: Comparison of theoretical and empirical density of σmax(Id − UTSTSU) on the full PRKCE dataset (n = 407, 779) and the bootstrapped PRKCE dataset (n = 4, 077, 790). We want the sketching matrix S to be an -subspace embedding for the source dataset with small  with high probability. The sketching matrix S is an -subspace embedding for the source dataset if and only if σmax(Id −UTSTSU) ≤ . The dashed black line gives the Tracy-Widom limit and the solid red line shows a kernel density estimate. The sketch size was set at k = 20×d, where d = 1034. We used a Clarkson-Woodruff sketch. 
The accuracy of the Tracy-Widom approximation improves as n increases as is expected from the asymptotic theory (Theorem 6.9). Projection Gaussian Hadamard Clarkson-Woodruff Uniform Time (seconds) - 156 21 2.8 Table 6.3: Mean sketching time (seconds) for the full PRKCE dataset (n = 407, 779, d = 1034). This dataset is too large to use Gaussian sketch directly. The Clarkson-Woodruff sketch is faster than the Hadamard sketch as is expected from Table 6.1. is there is the question of how large n has to be before the approximation is reasonable. Further study on the rate of convergence and possible finite sample bounds on the error in the normal approximation would be very useful and interesting research directions. We have used asymptotic theory to assess the embed- ding probability (6.1). The results in Figure 6.9 show the asymptotic approximations are reasonable in a realistic data setting. Given the rapidly decreasing cost of genotyping, it is reasonable to anticipate the n and d in panel (b) to be commonplace in multivariable genotype-phenotype studies in the near future. To integrate sketching methods in the analysis pipeline it is necessary to understand the level of error introduced by using the random projection. The asymptotic results are useful as they provide important guidance on the quality of the randomised data compression step. Table 6.3 reports the average sketching time for each projection on the full dataset. This is the time required to compute the random sketched dataset A˜ = SA. The Clarkson-Woodruff sketch is faster than the Hadamard sketch as is expected from Table 6.1. 6.9.2 Posterior approximation We also compare the accuracy of the sketched posterior distribution at different sketch sizes. Sketching is a tool for approximate computation. As the sketch size k increases the accuracy of the approximate calculation increases. As discussed in section 6.4 this behaviour can be formalised using -subspace embeddings. As k increases we  p→ 0 and the sketched posterior distribution will approach the target posterior distribution. Here we examine the sensitivity of the results to k and  in a practical setting. 135 6. Sketching for posterior approximation llll l ll l l lll l ll l l ll l l l l l l l l l l l l l l l l l l ll ll l l l l l l l l l l l l l l l l l l l l l l l ll lll ll l l l l l ll ll l lll l l ll l l ll l l l l ll l l l l l ll l lll l ll l ll l l l l l l l l l l l l lll 0 20 40 60 80 100 120 0 1 2 3 4 5 6 Manhattan plot using multivariate regression model SNP index − lo g1 0 p− va lu e Figure 6.10: Manhattan plot using the representative PRKCE dataset. The y-axis gives the minus log 10 p-value for each predictor when testing a null hypothesis of zero. We fit the saturated model using p = 130 markers and performed a t-test on each coefficient associated with a particular genetic marker. l l l l l l l l l l l l l l l 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 Marginal probability of inclusion Po st er io r p ro ba bi lity rs 13 39 64 24 rs 61 75 68 62 rs 11 27 88 36 6 rs 56 40 93 20 rs 11 49 48 63 9 rs 10 20 44 6 rs 18 68 27 4 rs 67 12 55 7 rs 75 79 48 46 rs 13 88 54 32 8 rs 10 17 10 92 rs 10 16 83 49 rs 79 80 68 47 rs 67 24 31 5 rs 28 14 82 SNP Figure 6.11: Marginal inclusion probabilities from an analysis using the top 15 SNPs. These results were obtained by enumerating over all models using the full dataset. There are some variants with low support for inclusion, some variants with moderate support for inclusion and some variants with high support for inclusion. 
The range of target marginal inclusion probabilities makes this a useful benchmark dataset for sketching. To isolate the error introduced by the sketch we limit our analysis to a dataset with p = 15 genetic variants so that the posterior can be computed by enumeration. This is so there is not Monte Carlo error in the target posterior distribution. We took the top 15 variants identified by fitting a multivariable regression model using the representative PRKCE dataset. Figure 6.10 shows a Manhattan plot of the p = 130 genetic markers in the dataset. We take the 15 markers with the smallest p-values. The dashed line in the plot shows the cutoff used. The design matrix X was then set to only contain the top 15 genetic variants. From here we computed the target posterior over models γ. There are 216 = 65536 candidate models as we allow for an intercept. Figure 6.11 shows the marginal posterior probability of inclusion for each of the top 15 variants in the restricted analysis. We fixed the error variance at σ2 = n/(n − 131)σ̂2 where σ̂2 is the maximum likelihood estimate of the residual variance from fitting the saturated model with p = 130 genetic predictors. We then computed the integrated likelihood for all models using (6.2). By normalising we obtained the the exact posterior distribution on the full dataset. We benchmark the sketched posterior against the true posterior distribution. 136 6.9.2. Posterior approximation We then took B = 100 hundred sketches at different sizes k using the Clarkson-Woodruff projection. Let A˜[b] be the bth sketched dataset for b = 1, . . . , B. For b = 1, . . . , B we then computed the sketched sufficient statistics A˜[b]TA˜[b]. We then computed the approximate integrated likelihood for each model using (6.5). Formally, for b = 1, . . . , B and all models γ we calculated p˜[b](y|γ, g, σ2) = (1 + g)−pγ/2 exp [ − 1 2σ2 ( g g + 1 (RSSγS) [b] + 1 g + 1 yTy )] . (6.20) Using the approximate integrated likelihoods we obtain a sketched posterior over models p˜[b](γ|y, g, σ2) for b = 1, . . . , B. We computed the sketched marginal probabilities of inclusion for b = 1, . . . , B. We compare p˜[b](γ|y, g, σ2) for b = 1, . . . , B to the target posterior p˜(γ|y, g, σ2) using the marginal probabilities of inclusion. Figure 6.12 shows boxplots of the sketched posterior results over the B = 100 sketches. Each panel shows the results for the marginal inclusion probability for a particular genetic variant. The red dashed line gives the target marginal inclusion probability. At a high-level, as the sketch size increases, the quality of the approximate posterior should increase. Here quality refers to the difference compared to the exact posterior distribution computed on the full dataset p˜(γ|y, g, σ2). The gold-standard results are given by the red-dashed line in each panel. We would like to see boxplots tightly concentrated around the red dashed line. We do not see this. The panel in row 1 column 3 is worth studying in detail. The target marginal inclusion probability is below 0.05. The sketched approximations return marginal inclusion probabilities close to one for k up to ten thousand. The quality of the approximate calculation is very poor for this SNP. The panel in row 3 column 2 is also worth examining. The target marginal inclusion probability is around 0.45. The sketched marginal inclusion probabilities are biased upwards for k up to ten thousand. The sketched inclusion probabilities remain highly variable at k equal to one hundred thousand. 
The variance of the sketched approximation is very high, and it is hard to discern any clear trends from the boxplots alone. Figures 6.13 and 6.14 show histograms of the approximate marginal inclusion probabilities for selected SNPs. Looking at the first column in Figure 6.13 we can see a large positive bias at k = 100. The sketched inclusion probabilities are bunched around one when the target inclusion probabilities are below 0.5. The results in Figure 6.14 show some interesting patterns. At k equal to one hundred thousand we see that the sketched posterior inclusion probabilities have a U-shaped distribution with modes at zero and one. We would like to see a mode around the red dashed line. The noise introduced by the sketch has a strong and perhaps unpredictable effect on the marginal inclusion probabilities.

The Gaussian sketch does not require that k be smaller than n. We repeated the analysis with very large sketch sizes to identify the k necessary to give a tolerable posterior approximation. We would also like to see if smoother behaviour emerges at larger sketch sizes. We simulated directly from the distribution of the sketched sufficient statistics, Ã^T Ã ~ Wishart(k, A^T A/k), to bypass the huge computational cost of generating impractically large sketches. Figure 6.15 shows boxplots of the approximate marginal inclusion probabilities. Figures 6.16 and 6.17 show histograms of the approximate marginal inclusion probabilities for selected SNPs. These results show the general dynamic that we expect. As k increases we see concentration around the target marginal inclusion probabilities (red dashed line). In the fourth column of Figure 6.17 we see a unimodal distribution centred around the target marginal inclusion probability, rather than the U-shaped distribution that was noted in the fourth column of Figure 6.14. Even at k = 10 million there is considerable variance around the target inclusion probability. At k = 1 billion the error looks acceptable.

The posterior distribution over models is a complicated nonlinear function of the sufficient statistics. Small changes in the approximated sufficient statistics can lead to a large change in the approximated posterior distribution. It is hard to determine a priori what level of deviation to expect in the approximate posterior for a given sketch size. As before, we can run an oracle procedure to measure the quality of the sketch at each k. Let the singular value decomposition of the source dataset again be given by A = UDV^T, and define ε = E_S[σ_max(I_d − U^T S^T S U)].
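Both the Wishart representation above and this oracle distortion factor can be simulated cheaply for the Gaussian sketch without ever forming S or SA. The snippet below is a minimal sketch under stated assumptions: the Gaussian sketch is taken to have i.i.d. N(0, 1/k) entries (consistent with the Wishart(k, A^T A/k) representation quoted above), the dimensions and variable names are illustrative, and a synthetic matrix stands in for A^T A.

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(1)
d, k, B = 17, 10_000, 100                           # columns of A = [X, y]; sketch size; repeats

# (i) Gaussian-sketch sufficient statistics drawn directly from the Wishart
# distribution: (SA)^T (SA) ~ Wishart(k, A^T A / k). A synthetic positive
# definite matrix stands in for the real A^T A here.
M = rng.standard_normal((d, d))
AtA = M @ M.T + d * np.eye(d)
sketched_gram = wishart(df=k, scale=AtA / k).rvs(random_state=1)
print(sketched_gram.shape)                          # (17, 17)

# (ii) Monte Carlo estimate of the oracle distortion factor
# eps = E_S[sigma_max(I_d - U^T S^T S U)]. For the Gaussian sketch, S @ U has
# i.i.d. N(0, 1/k) entries because the columns of U are orthonormal, so
# U^T S^T S U can be simulated without touching A.
def distortion_draw(k, d, rng):
    G = rng.standard_normal((k, d)) / np.sqrt(k)    # same law as S @ U
    eigs = np.linalg.eigvalsh(np.eye(d) - G.T @ G)
    return np.abs(eigs).max()                       # sigma_max of a symmetric matrix

eps_hat = np.mean([distortion_draw(k, d, rng) for _ in range(B)])
print(eps_hat, (1 + np.sqrt(d / k)) ** 2 - 1)       # Monte Carlo vs pointwise prediction
```

Values produced in this way can be set against the asymptotic predictions discussed below and reported in Tables 6.4 and 6.5.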
Figure 6.12: Boxplots of sketched marginal posterior probabilities of inclusion. The Clarkson-Woodruff projection was used at different sketch sizes k. One hundred sketches were taken at each value of k. The target marginal probability of inclusion for each SNP is plotted as a red dashed line. The sketched posterior approximations should approach the target values as k increases. The variability in the sketched approximation differs over SNPs. In general the sketched estimates lie above the target values (the red dashed line). It is hard to see a clear relationship between the variance of the sketched approximations and k in this plot.

Figure 6.13: Histograms of sketched marginal posterior probabilities of inclusion for SNPs where the target marginal probability is less than 0.5. Results were obtained using the Clarkson-Woodruff projection at different sketch sizes k. The target marginal probability of inclusion for each SNP is plotted as a red dashed line. The sketched posterior approximations should approach the target values as k increases. The sketched marginal inclusion probabilities are concentrated around zero and one in each panel. It is hard to see a clear relationship between the variance of the sketched approximations and k in this plot.

Figure 6.14: Histograms of sketched marginal posterior probabilities of inclusion for SNPs where the target marginal probability is between 0.5 and 0.9. Results were obtained using the Clarkson-Woodruff projection at different sketch sizes k. The target marginal probability of inclusion for each SNP is plotted as a red dashed line. The sketched posterior approximations should approach the target values as k increases. The sketched marginal inclusion probabilities are concentrated around zero and one in each panel. It is hard to see a clear relationship between the variance of the sketched approximations and k in this plot.
Figure 6.15: Boxplots of sketched marginal posterior probabilities of inclusion. The target marginal probability of inclusion for each SNP is plotted as a red dashed line. The Gaussian sketch was used at different sketch sizes k. One hundred sketches were taken at each value of k. The sketched posterior approximations should approach the target values as k increases. As the sketch size k increases the approximate marginal inclusion probabilities concentrate around the target values.

Figure 6.16: Histograms of sketched marginal posterior probabilities of inclusion for SNPs where the target marginal probability is less than 0.5. Results were obtained using the Gaussian projection at different sketch sizes k. The target marginal probability of inclusion is plotted as a dashed line. The sketched posterior approximations should approach the target values as k increases. As the sketch size k increases the approximated marginal inclusion probabilities concentrate around the target values.

Figure 6.17: Histograms of sketched marginal posterior probabilities of inclusion for SNPs where the target marginal probability is between 0.5 and 0.9. Results were obtained using the Gaussian projection at different sketch sizes k. The target marginal probability of inclusion is plotted as a dashed line. The sketched posterior approximations should approach the target values as k increases. As the sketch size k increases the approximated marginal inclusion probabilities concentrate around the target values.

The value ε gives the expected distortion factor over the random sketch. This can be compared to the values predicted by Theorem 6.5 and Theorem 6.7. Using the Tracy-Widom approximation we expect that ε ≈ σ_{k,d} E[Z] + µ_{k,d} − 1, where µ_{k,d} and σ_{k,d} are the centring and scaling constants in Theorem 6.6, and E[Z] is the mean of a Tracy-Widom random variable, E[Z] ≈ −1.21.
Using the cruder pointwise approximation in Theorem 6.5 we have ε ≈ (1 + √(d/k))² − 1. A Monte Carlo estimate of ε is obtained by taking the sample average of ε^[1], ..., ε^[B], where ε^[b] is the observed distortion factor of the b-th sketch. Table 6.4 compares the sample average to the asymptotic predictions for the Clarkson-Woodruff projection. There is a good correspondence between the theoretical and observed values. Table 6.5 compares the sample average to the asymptotic predictions for the Gaussian sketch. The Tracy-Widom approximation from Theorem 6.6 gives more accurate predictions than the pointwise approximation in Theorem 6.5. We are able to estimate the expected distortion factor ε for a particular sketch size k using the asymptotic theory. The sketches are behaving as expected in terms of the distortion factor ε. It seems that very small ε is required to give a tolerable posterior approximation. We can determine ahead of time what ε is expected for a given sketch size k. This can be used to set expectations appropriately when interpreting the sketched results.

Sketch size (k)               one hundred   one thousand   ten thousand   one hundred thousand
Monte Carlo estimate of ε     0.857         0.244          0.079          0.024
Tracy-Widom expectation       0.851         0.245          0.075          0.023
Pointwise expectation         0.995         0.278          0.084          0.026

Table 6.4: Comparison of Monte Carlo and asymptotic estimates of ε = E_S[σ_max(I_d − U^T S^T S U)] for the Clarkson-Woodruff sketch. The quality of a sketch can be summarised by ε. The source dataset is the top 15 SNP dataset described in section 6.9.2 (n = 407,779, d = 17). The Tracy-Widom expectations (Theorem 6.7) match the Monte Carlo estimates more closely than the pointwise limit (Theorem 6.5). We are able to estimate the expected distortion factor ε for a particular sketch size k using the asymptotic theory.

Sketch size (k)               one million   ten million   one hundred million   one billion
Monte Carlo estimate of ε     0.00771       0.00247       0.00077               0.00025
Tracy-Widom expectation       0.00738       0.00233       0.00074               0.00023
Pointwise expectation         0.00826       0.00261       0.00082               0.00026

Table 6.5: Comparison of Monte Carlo and asymptotic estimates of ε = E_S[σ_max(I_d − U^T S^T S U)] for the Gaussian sketch. The quality of a sketch can be summarised by ε. The source dataset is the top 15 SNP dataset described in section 6.9.2 (n = 407,779, d = 17). The Tracy-Widom expectations (Theorem 6.7) match the Monte Carlo estimates more closely than the pointwise limit (Theorem 6.5). We are able to estimate the expected distortion factor ε for a particular sketch size k using the asymptotic theory.

6.10 Conclusion

We have investigated the use of sketching for approximate Bayesian subset selection. Approximation bounds are developed using ε-subspace embeddings. Data oblivious random projections are designed to output an ε-subspace embedding of the source dataset with high probability. We obtained asymptotic expressions for the embedding probability for the Gaussian, Hadamard and Clarkson-Woodruff sketches in terms of the Tracy-Widom law. Simulations showed that the asymptotic expression is accurate on large datasets. The asymptotic theory can be used to estimate the sketch size k needed to obtain an ε-subspace embedding with probability at least (1 − δ). Determination of the embedding probability Pr(S is an ε-subspace embedding for A) is a largely open question, and the asymptotic equivalence of the Gaussian, Hadamard and Clarkson-Woodruff sketches is of theoretical and practical interest.
The universality of the Tracy-Widom law is a topic of current research (Bao et al., 2015), and it is also of interest to determine the full set of data oblivious random projections that satisfy the Tracy-Widom limiting embedding probability. We do not think that asymptotic analysis can supplant the finite sample results that drive much of the existing work on randomised algorithms. Rather, we feel that asymptotic results are a useful complement that can provide answers to important questions that are all but unresolvable in a finite sample framework. A combination of asymptotic and non-asymptotic results could be useful for the development of more advanced algorithms.

We also investigated the range of ε needed to obtain a tolerable posterior approximation. Sketching has been proposed for approximating the posterior distribution over coefficients in the fixed model setting (Geppert et al., 2017). We found that ε needs to be much smaller than in the fixed model setting to maintain the important features of the target posterior over models. It appears that approximation of the integrated likelihood is a more difficult task than preservation of the normal likelihood conditional on a model.

7 Conclusion

As the thesis winds down, we can ease into the customary retrospection by cycling back to the introduction and reconsidering Box's loop (Figure 7.1). Box's loop breaks down a generic data analysis problem into a series of connected sub-problems involving model formulation, fitting, checking and revising. This staged workflow is described by Blei (2014) as build, compute, critique and repeat. The modularisation helped us to pin down the central research question of interest. The emergence of Big Data has led to the compute step becoming a bottleneck in practical applications. As datasets grow larger, the computational budget does not stretch as far and we are forced to adapt. New algorithms and methods are needed to facilitate statistical inference at scale. Our particular focus has been on computational methods for Bayesian analyses; however, we have encountered a number of general principles that will impact statistical computing research on Big Data problems. We have investigated how distributed computing, subsampling and randomised data compression can be used for efficient statistical computation.

A subtle yet fundamental distinction underpins how algorithms should be compared and assessed. This is the distinction between the number of floating point operations required by an algorithm and the typical execution time as measured by the end user (wall-clock time). Making such a differentiation is critical when allowing for parallel computation. A hardline accounting perspective is that the computational expense of an algorithm is defined by the number of floating point operations that are invoked, whether in parallel or not. A purely utilitarian viewpoint is that the only relevant factor is the time taken for the algorithm to return the end result, making wall-clock time the sole metric of interest.

Figure 7.1: Box's loop is a conceptual process model of scientific data analysis (Box, 1976). Box's loop defines a number of key phases (build, compute, critique and repeat) when approaching a data modelling problem. The compartmentalisation of the overall task highlights important tactical decisions that are involved in the statistical analysis lifecycle and aids the elicitation of broader strategic elements that influence the work. Adapted from Blei (2014).
Consider the task of calculating the mean of each variable given a large dataset of n observations on p variables. This can be executed on a single machine in O(np) operations. It is trivial to define an embarrassingly parallel algorithm for carrying out this task. Given that we distribute the job across s processors, we can expect the wall-clock time to drop by a factor of s, assuming minimal communication costs. The wall-clock problem is solved by an embarrassingly parallel algorithm; however, the raw number of floating point operations remains the same. Each of the s workers completes an O(np/s) task, leading to an overall expense of O(np) operations. A pragmatic solution exists, yet the deeper issue of the floating point operation cost is not addressed by using a divide and conquer approach. The underlying expense of an algorithm is of interest in theoretical computer science. Reflective of this is that early work on sketching considered the estimation of a mean (Cormode, 2011). A more thorough analysis would of course include memory, disk and communication costs; we choose to emphasise the floating point operation count as it is of the most relevance to our work. If we can fundamentally reduce the number of floating point operations required in the compute step and still obtain an acceptable result, we have made great headway for numerical computing with Big Data. To summarise, subsampling and sketching cut wall-clock time by reducing the underlying number of floating point operations. Divide and conquer approaches attack the wall-clock time issue by distributing the floating point calculation burden over a large number of workers. A very classical statistical perspective may gloss over such implementation details, but this is an important aspect of dealing with Big Data (Donoho, 2017; Bryan and Wickham, 2017). The distinction between computational cost and wall-clock time should be kept in mind when comparing the work across chapters.

Specialised algorithms can be necessary for the practical analysis of large datasets. Parallel processing, subsampling and random projection can act as useful components within algorithms for Big Data regression. Each broad approach has unique strengths and weaknesses, and the best tool for the job may change on a case by case basis. To be more formal, recall the generic problem set-up from the introduction. Suppose we have a dataset y of n observations with likelihood p(y|θ). The parameter θ ∈ R^d has prior p(θ). Bayesian analyses can be challenging on tall datasets as computational procedures can be intolerably slow. The typical bottleneck encountered is the need for repeated full likelihood evaluations p(y|θ), which carry an O(n) cost. This can grow to an infeasible burden for a sufficiently large number of observations n (Robert et al., 2018). A main goal in the thesis was to develop algorithms that minimise or eliminate O(n) calculations. The divide and conquer work in Chapter 2 splits the initial dataset into s subsets so that each worker only has tasks that are O(n/s). In the subsampling chapter we use subsamples of size m ≪ n to estimate the log likelihood in O(m) time. In the work on sketching we replace the full dataset of n observations with a compressed pseudo-dataset of k ≪ n observations. Numerical work on the sketched dataset demands O(k) operations. Existing work in the field is largely directed at sampling from the posterior distribution p(θ|y).
To carry out model selection it is desirable to have the integrated likelihood (model evidence) p(y) = ∫ p(y|θ) p(θ) dθ, and this was a motivating concern throughout the thesis. Chapter 2 uses divide and conquer methodology in order to compute the model evidence. Chapter 3 uses subsampling to bound the model evidence. Chapters 4, 5 and 6 examine how random projection can be used to approximate the least squares estimate θ̂ and the integrated likelihood p(y) for Gaussian linear models. The analysis of each technique required different mathematical methods and concepts.

In Chapter 2 we developed an embarrassingly parallel algorithm using Gibbs sampling and conditional conjugacy. The approach fits the split-apply-combine template shown in Figure 7.2. We propose to use Gibbs sampling in the apply stage, and to take the Gibbs sampler history as part of the output. The combine step involves pooling the Gibbs trajectories from each subset run. The initial partition of the data in the split step has a strong influence on the Monte Carlo variance in the combine step.

Figure 7.2: Template for embarrassingly parallel algorithms. The split step breaks the full dataset into s non-overlapping subsets. The illustration is for s = 2. Each subset is then allocated to a different machine on a cluster. During the apply step we apply conventional methodology to each data subset with no cross communication between workers. Each analysis is summarised by a consistent format of output. The s sets of output from the apply stage are then synthesised in the combine step to give the final result. In this design brief, the combine stage only involves the output from the apply step, and not the original dataset.

In Chapter 3 we motivated the use of subsampling for estimation of the integrated likelihood using the identity

log p(y) = E_{p(θ|y)}[log p(y|θ)] − D(p(θ|y) ‖ p(θ)).   (7.1)

The goodness of fit term E_{p(θ|y)}[log p(y|θ)] can be estimated using subsampling, and the penalty term D(p(θ|y) ‖ p(θ)) can be bounded using information theoretic arguments. The main contribution is an upper bound on the model evidence using the maximum entropy property of the normal distribution. Exact calculation of the model evidence becomes increasingly computationally demanding as n increases. Evidence bounds are attractive for large n problems as they can be expected to squeeze together as n increases under mild assumptions, and they can be estimated cheaply using control variate estimators.

Chapters 4, 5 and 6 considered sketching, a probabilistic data compression technique. As mentioned in the introduction, most of our work on sketching can be seen as a statistical evaluation of algorithms developed in the computer science and machine learning literature. We pried into the nature of the existing stochastic machinery behind randomised algorithms. Randomised algorithms are interesting from a statistical perspective as repeated application of the algorithm to the same dataset will produce different results. The distribution of the output is a measure of the quality of the algorithm. We established a number of asymptotic results regarding the distribution of the sketched output. In Chapter 4 we considered the distribution of least squares coefficients on the sketched dataset. In Chapter 6 we considered the probability of obtaining an ε-subspace embedding (Definition 6.1). Existing results on sketching algorithms are typically finite sample worst case bounds that can be pessimistic.
We took a different approach and tried to characterise typical behaviour under regularity conditions. It is important to be able to quantify the uncertainty attached to a randomised algorithm, and the results in Chapter 4 on the variance of the sketched coefficients and the results in Chapter 6 on the embedding probability are useful measures for end users. We found that the central limit theorem and the Tracy-Widom law were useful for analysing the behaviour of data oblivious sketching algorithms. Weak convergence is a useful concept in the analysis of randomised algorithms. We believe our results serve as a useful complement to the existing body of work in the area. Sketching algorithms can be more richly analysed using a combination of non-asymptotic and asymptotic results.

The fundamental trade-off between the computational budget and the accuracy of the calculation is a primary concern for computational statistics in the Big Data world. This is in some sense a no-free-lunch problem, where lowering the computational expense of a Monte Carlo method can be expected to increase the Monte Carlo error. Successful Big Data algorithms use available resources wisely. Embarrassingly parallel algorithms spread the floating point operation cost over a large number of nodes without grossly increasing communication costs. Subsampling based algorithms use control variates as a stabilising influence. Sketching algorithms use well designed random projections to give robust performance guarantees in terms of the compression ratio and the structure of the source dataset. Truly scalable methodology controls the error in conjunction with the cost as datasets increase in size, and this is an overarching concern in each piece of work. Much of the statistical analysis in this thesis has been directed at non-sampling based sources of error.

The divide and conquer approach for computing the integrated likelihood (Chapter 2) provides an interesting example of this challenge. We easily achieve a linear reduction in the analysis time given that we split the full dataset into s subsets and process them in parallel. However, the difficulty of the combine step increases with the number of subsets s. The variance of the kernel density estimator of the subposterior integral will be extremely high when s is large. As such, there is a limit to how much can be gained from parallel processing. Data augmentation can be used to address the scaling problem: by using Gibbs sampling in the intermediate apply step of the algorithm we can obtain a feasible combine step for large s. In Chapter 3, we find that although unbiased estimators of the log likelihood can be constructed using simple random sampling, the variance of such estimators explodes as n tends to infinity. Control variates are needed to obtain a scalable algorithm as n increases. A similar principle applies to sketching. A simple random sample of size k can be viewed as a sketch; however, it does not offer the same probabilistic guarantees as the Hadamard and Clarkson-Woodruff sketches as n tends to infinity. Algorithms need to be designed carefully in order to control the associated Monte Carlo or approximation error.

There are a number of research avenues that appear to be worth pursuing. As mentioned in Chapter 2, the split-apply-combine approach using data augmentation and Gibbs sampling can also be used for posterior sampling. In Chapter 2 we considered estimation of the model evidence.
We prescribe Gibbs sampling during the apply step and take the output to be the sequence of full conditional distributions. The combine step used the history of the subset Gibbs runs to estimate the integrated likelihood p(y). We can also make use of the Gibbs history to target the full dataset posterior p(θ|y) in the combine step: a self-normalised importance sampler targeting the true posterior distribution can be defined in the combine step, providing that the model satisfies conditional conjugacy constraints.

The work on subsampling based estimation of the log model evidence in Chapter 3 can also be taken in new directions. Exact calculation of the integrated likelihood appears to require a large number of floating point operations. If this cost is inescapable, a useful strategy is to instead bound the integrated likelihood. We built our strategy around the identity (7.1). Subsampling can be used to estimate the goodness of fit term E_{p(θ|y)}[log p(y|θ)]. It is of interest to determine other cheap estimators of the Kullback-Leibler penalty term D(p(θ|y) ‖ p(θ)).

There was an important common pattern to the analysis in Chapters 4 and 6. Recall that the Gaussian sketch is particularly mathematically tractable but is computationally demanding, whereas the Hadamard and Clarkson-Woodruff projections are computationally efficient but difficult to characterise mathematically. Our broad strategy was to analyse the Gaussian sketch and then establish asymptotic equivalence for the other projections. This was useful for developing guidelines for sketching applications. This general idea was suggested by Li et al. (2006); however, it appears to have gained little traction in the sketching literature. We found this line of reasoning to be highly effective, and anticipate that it can be used to address other questions concerning the behaviour of sketching algorithms. As in earlier chapters, suppose we have a large n × d source dataset and we take a sketch of size k where k ≪ n. It is of interest to extend the sketching central limit theorem to the 'Big Data' asymptotic regime:

n, d, k → ∞,   d/k → α ∈ (0, 1],   d/n → 0,   k/n → 0.   (7.2)

In the regime (7.2) the sketched dataset becomes an infinite dimensional random matrix. It is possible to establish weak convergence in general metric spaces (Van Der Vaart, 1998), and we aim to identify the appropriate framework to study sketching under the limit conditions in (7.2). The concept of joint asymptotic normality as espoused by Mallows (1972) appears to be useful for establishing the necessary convergence of the finite dimensional distributions. We anticipate that it will be necessary to treat the sketched dataset as a random bounded linear operator.

The connections between the Tracy-Widom law and the behaviour of data oblivious sketches may have deeper ramifications for statistical computation. Iterative sketching algorithms for least squares problems use a combination of approximate second order information from a sketch and exact gradient information. Iterative sketching algorithms can be interpreted as classical iterative methods with a random preconditioner (Gower and Richtárik, 2015; Pilanci and Wainwright, 2016). The iterates come with stochastic convergence guarantees to the optimal solution. It can be shown that the rate of convergence of the algorithm is governed by the eigenvalue distribution of the random preconditioner (Young, 1972).
Knowledge of the spectrum can be used to accelerate the rate of convergence of the algorithm through the addition of learning rate and momentum hyperparameters (Young, 1972). We can implement Chebyshev acceleration on the sketched linear system given that we know the asymptotic spectrum of the random preconditioner. We are currently investigating the acceleration of iterative sketching algorithms using the spectral theory developed in Chapter 6.

The Bayesian analysis of tall data can be challenging due to obstacles encountered in the compute step of Box's loop. We believe distributed computing, subsampling and random projection are all promising approaches for computationally efficient Bayesian inference. We hope that the algorithms and results here expand the suite of computational tools for analysts to use in practice.

References

Ailon, N. and Chazelle, B. (2009) The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39, 302–322.
Anderson, I. (1997) Combinatorial Designs and Tournaments. Oxford lecture series in mathematics and its applications. Clarendon Press.
Andrieu, C., Roberts, G. O. et al. (2009) The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37, 697–725.
Arnold, B. C., Castillo, E. and Sarabia, J. M. (1993) Conjugate exponential family priors for exponential family likelihoods. Statistics, 25, 71–77.
Astle, W. J., Elding, H., Jiang, T., Allen, D., Ruklisa, D., Mann, A. L., Mead, D., Bouman, H., Riveros-Mckay, F., Kostadima, M. A. et al. (2016) The allelic landscape of human blood cell trait variation and links to common complex disease. Cell, 167, 1415–1429.
Avron, H., Nguyen, H. and Woodruff, D. (2014) Subspace embeddings for the polynomial kernel. In Advances in Neural Information Processing Systems, 2258–2266.
Bai, Z., Fang, Z., Liang, Y.-C. and Fange, Z. (2014) Spectral theory of large dimensional random matrices. Singapore; Hackensack, N.J.
Bai, Z. and Silverstein, J. W. (2010) Spectral Analysis of Large Dimensional Random Matrices. Springer Series in Statistics. New York, NY: Springer New York, 2nd edn.
Baker, J., Fearnhead, P., Fox, E. B. and Nemeth, C. (2017) Control variates for stochastic gradient MCMC. arXiv preprint arXiv:1706.05439.
Banerjee, A., Dunson, D. B. and Tokdar, S. T. (2013) Efficient Gaussian process regression for large datasets. Biometrika, 100, 75–89.
Bao, Z., Pan, G. and Zhou, W. (2015) Universality for the largest eigenvalue of sample covariance matrices with general population. The Annals of Statistics, 43, 382–421.
Bardenet, R., Doucet, A. and Holmes, C. (2014) Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In International Conference on Machine Learning (ICML), 405–413.
Bardenet, R., Doucet, A. and Holmes, C. (2015) On Markov chain Monte Carlo methods for tall data. arXiv preprint 1505.02827.
Bardenet, R., Doucet, A. and Holmes, C. (2017) On Markov chain Monte Carlo methods for tall data. The Journal of Machine Learning Research, 18, 1515–1557.
Bardenet, R. and Maillard, O.-A. (2015) A note on replacing uniform subsampling by random projections in MCMC for linear regression of tall datasets. HAL preprint 01248841.
Becker, S., Kawas, B., Petrik, M. and Ramamurthy, K. (2015) Robust partially-compressed least-squares. arXiv preprint arXiv:1510.04905v1.
Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57, 289–300.
Bennett, C. H. (1976) Efficient estimation of free energy differences from Monte Carlo data. Journal of Computational Physics, 22, 245–268.
Bernardo, J., Bayarri, M. and Berger, J. (2011) Bayesian Statistics 9. Oxford science publications. OUP Oxford.
Bernardo, J. and Smith, A. (2006) Bayesian Theory. Wiley Series in Probability and Statistics. John Wiley & Sons Canada, Limited.
Besag, J. (1989) A candidate's formula: A curious result in Bayesian prediction. Biometrika, 76, 183.
Bhatia, R. (1996) Matrix Analysis. Springer.
Bierkens, J., Fearnhead, P. and Roberts, G. (2016) The zig-zag process and super-efficient sampling for Bayesian analysis of big data. arXiv preprint arXiv:1607.03188.
Billingsley, P. (1968) Convergence of probability measures. Wiley.
Billingsley, P. (1999) Convergence of probability measures. Wiley Series in Probability and Statistics. New York: Wiley, 2nd edn.
Blei, D. M. (2014) Build, compute, critique, repeat: Data analysis with latent variable models. Annual Review of Statistics and Its Application, 1, 203–232.
Blei, D. M., Kucukelbir, A. and McAuliffe, J. D. (2017) Variational inference: A review for statisticians. Journal of the American Statistical Association, 112, 859–877.
Bottolo, L. and Richardson, S. (2010) Evolutionary stochastic search for Bayesian model exploration. Bayesian Analysis, 5, 583–618.
Box, G. E. P. (1976) Science and statistics. Journal of the American Statistical Association, 71, 791–799.
Bryan, J. and Wickham, H. (2017) Data science: A three ring circus or a big tent? Journal of Computational and Graphical Statistics, 26, 784–785.
Cannings, T. I. and Samworth, R. J. (2015) Random projection ensemble classification. arXiv preprint arXiv:1504.04595.
Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P. and Riddell, A. (2017) Stan: A probabilistic programming language. Journal of Statistical Software, 76.
Casella, G., Girón, F. J., Martínez, M. L., Moreno, E. et al. (2009) Consistency of Bayesian procedures for variable selection. The Annals of Statistics, 37, 1207–1228.
Centers for Disease Control and Prevention (2013) Behavioral Risk Factor Surveillance System. Available online at https://www.cdc.gov/brfss/annual_data/annual_2013.html.
Chatterjee, D., Maitra, T. and Bhattacharya, S. (2018) A short note on almost sure convergence of Bayes factors in the general set-up. The American Statistician, 0, 1–4.
Chib, S. (1995) Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90, 1313–1321.
Chib, S. and Kuffner, T. A. (2016) Bayes factor consistency. arXiv preprint arXiv:1607.00292.
Clarkson, K. L. and Woodruff, D. P. (2013) Low rank approximation and regression in input sparsity time. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, 81–90. ACM.
Cormode, G. (2011) Sketch techniques for approximate query processing. Foundations and Trends in Databases.
Dempster, A. P. (1997) The direct use of likelihood for significance testing. Statistics and Computing, 7, 247–252.
Dhillon, P., Lu, Y., Foster, D. P. and Ungar, L. (2013) New subsampling algorithms for fast least squares regression. In Advances in Neural Information Processing Systems, 360–368.
DiCiccio, T. J., Kass, R. E., Raftery, A. and Wasserman, L. (1997) Computing Bayes factors by combining simulation and asymptotic approximations. Journal of the American Statistical Association, 92, 903–915.
Dickey, J. M. (1971) The weighted likelihood ratio, linear hypotheses on normal location parameters. The Annals of Mathematical Statistics, 204–223.
Donoho, D. (2017) 50 years of data science. Journal of Computational and Graphical Statistics, 26, 745–766.
Drineas, P., Mahoney, M. W. and Muthukrishnan, S. (2006) Sampling algorithms for l2 regression and applications. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, 1127–1136. Society for Industrial and Applied Mathematics.
Eaton, M. L. (2007) Chapter 8: The Wishart Distribution. In Multivariate Statistics, vol. 53 of Lecture Notes Monograph Series, 302–333. Ohio: Institute of Mathematical Statistics.
Edelman, A. (1988) Eigenvalues and condition numbers of random matrices. SIAM Journal on Matrix Analysis and Applications, 9, 543–560.
Efron, B. and Hastie, T. (2016) Computer Age Statistical Inference. Institute of Mathematical Statistics Monographs. Cambridge University Press.
Fahrmeir, L. and Tutz, G. (1994) Multivariate statistical modelling based on generalized linear models. Springer series in statistics. Springer-Verlag.
Flegal, J. M., Hughes, J., Vats, D. and Dai, N. (2017) mcmcse: Monte Carlo Standard Errors for MCMC. Riverside, CA, Denver, CO, Coventry, UK, and Minneapolis, MN. R package version 1.3-2.
Friel, N. and Pettitt, A. N. (2008) Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70, 589–607.
Friel, N. and Wyse, J. (2012) Estimating the evidence - a review. Statistica Neerlandica, 66, 288–308.
Gelfand, A. E. and Dey, D. K. (1994) Bayesian model choice: asymptotics and exact calculations. Journal of the Royal Statistical Society: Series B (Methodological), 501–514.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. and Rubin, D. B. (2014) Bayesian data analysis. Boca Raton: Chapman & Hall, 3 edn.
Gelman, A. and Vehtari, A. (2017) Comment: Consensus Monte Carlo using expectation propagation. Brazilian Journal of Probability and Statistics, 31, 692–696.
Geman, S. (1980) A limit theorem for the norm of random matrices. The Annals of Probability, 252–261.
Geppert, L. N., Ickstadt, K., Munteanu, A., Quedenfeld, J. and Sohler, C. (2017) Random projections for Bayesian regression. Statistics and Computing, 27, 79–101.
Geyer, C. (1996) Estimating normalizing constants and reweighting mixtures in Markov chain Monte Carlo. Tech. Rep. 568, University of Minnesota.
Gower, R. M. and Richtárik, P. (2015) Randomized iterative methods for linear systems. arXiv preprint arXiv:1506.03296.
Green, P. J., Łatuszyński, K., Pereyra, M. and Robert, C. P. (2015) Bayesian computation: a summary of the current state, and samples backwards and forwards. Statistics and Computing, 25, 835–862.
Greene, W. (1997) Econometric Analysis. Prentice-Hall international editions. Prentice Hall.
Guhaniyogi, R. and Dunson, D. B. (2015) Bayesian compressed regression. Journal of the American Statistical Association, 110, 1500–1514.
Gutiérrez-Peña, E., Smith, A., Bernardo, J. M., Consonni, G., Veronese, P., George, E., Girón, F., Martínez, M., Letac, G. and Morris, C. N. (1997) Exponential and Bayesian conjugate families: review and extensions. Test, 6, 1–90.
Halko, N., Martinsson, P. G. and Tropp, J. A. (2011) Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53, 217–288.
Howie, B. N., Donnelly, P. and Marchini, J. (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet, 5, e1000529.
Huang, Z. and Gelman, A. (2005) Sampling for Bayesian computation with large datasets. Tech. rep., University of Columbia.
Huber, P. J. (1973) Robust regression: Asymptotics, conjectures and Monte Carlo. The Annals of Statistics, 1, 799–821. URL https://doi.org/10.1214/aos/1176342503.
Jacob, P. E., Thiery, A. H. et al. (2015) On nonnegative unbiased estimators. The Annals of Statistics, 43, 769–784.
Jahn, R., Dunne, B. and Nelson, R. (1987) Engineering anomalies research. Journal of Scientific Exploration.
Janson, S. (1988) Some pairwise independent sequences for which the central limit theorem fails. Stochastics, 23, 439–448.
Johnstone, I. M. (2001) On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 295–327.
— (2006) High dimensional statistical inference and random matrices. arXiv preprint arXiv:0611589.
Johnstone, I. M., Ma, Z., Perry, P. O. and Shahram, M. (2014) RMTstat: Distributions, Statistics and Tests derived from Random Matrix Theory. R package version 0.3.
Jordan, M. I. (2013) On statistics, computation and scalability. Bernoulli, 19, 1378–1390.
Kass, R. E. and Raftery, A. E. (1995) Bayes factors. Journal of the American Statistical Association, 90, 773–795.
Keener, R. W. (2013) Theoretical Statistics. Springer.
Korattikara, A., Chen, Y. and Welling, M. (2014) Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In International Conference on Machine Learning, 181–189.
Kuhn, M. (2008) Building predictive models in R using the caret package. Journal of Statistical Software, Articles, 28, 1–26.
Lehoucq, R. B., Sorensen, D. C. and Yang, C. (1998) ARPACK users' guide: solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods, vol. 6. SIAM.
Lewis, S. M. and Raftery, A. E. (1997) Estimating Bayes factors via posterior simulation with the Laplace-Metropolis estimator. Journal of the American Statistical Association, 92, 648–655.
Li, P., Hastie, T. J. and Church, K. W. (2006) Very sparse random projections. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 287–296. ACM.
Loève, M. (1977) Probability Theory. Springer.
Ma, P., Mahoney, M. W. and Yu, B. (2015) A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, 861–911.
Ma, P. and Sun, X. (2015) Leveraging for big data regression. Wiley Interdisciplinary Reviews: Computational Statistics, 7, 70–76.
Ma, Z. (2012) Accuracy of the Tracy–Widom limits for the extreme eigenvalues in white Wishart matrices. Bernoulli, 18, 322–359.
Maclaurin, D. and Adams, R. P. (2014) Firefly Monte Carlo: Exact MCMC with Subsets of Data. arXiv preprint arXiv:1403.5693.
Mahoney, M. (2011) Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3, 123–224.
Mahoney, M. and Drineas, P. (2016) Structural properties underlying high-quality Randomized Numerical Linear Algebra algorithms. In Handbook of Big Data (eds. P. Bühlmann, P. Drineas, M. Kane and M. van der Laan), 137–154. Chapman and Hall.
Mallows, C. (1972) A note on asymptotic joint normality. The Annals of Mathematical Statistics, 508–515.
McCulloch, R. E. and Rossi, P. E. (1992) Bayes factors for nonlinear hypotheses and likelihood distributions. Biometrika, 79, 663–676.
Meng, X. (2014) Randomized Algorithms for Large-scale Strongly Over-determined Linear Regression Problems. Ph.D. thesis, Stanford University, Stanford, California, United States.
Meng, X. and Mahoney, M. M. (2013) Low-distortion Subspace Embeddings in Input-sparsity Time and Applications to Robust Linear Regression. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, 91–100. ACM.
Meng, X.-L. and Wong, W. H. (1996) Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica, 831–860.
Minsker, S., Srivastava, S., Lin, L. and Dunson, D. B. (2017) Robust and scalable Bayes via a median of subset posterior measures. The Journal of Machine Learning Research, 18, 4488–4527.
Neiswanger, W., Wang, C. and Xing, E. (2013) Asymptotically exact, embarrassingly parallel MCMC. arXiv preprint arXiv:1311.4780.
Nelson, J. and Nguyễn, H. L. (2013) OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, 117–126. IEEE.
Newton, M. A. and Raftery, A. E. (1994) Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society: Series B (Methodological), 3–48.
Papaspiliopoulos, O. (2009) A methodological framework for Monte Carlo probabilistic inference for diffusion processes. Working paper, University of Warwick, Coventry.
Phillips, J. M. (2016) Coresets and Sketches. arXiv preprint arXiv:1601.00617.
Pilanci, M. and Wainwright, M. J. (2016) Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares. Journal of Machine Learning Research, 17, 1842–1879.
Pollock, M., Fearnhead, P., Johansen, A. M. and Roberts, G. O. (2016) The scalable Langevin exact algorithm: Bayesian inference for big data. arXiv preprint arXiv:1609.03436.
Polson, N. G., Scott, J. G. and Windle, J. (2013) Bayesian inference for logistic models using Pólya-gamma latent variables. Journal of the American Statistical Association, 108, 1339–1349.
Portnoy, S. (1986) On the central limit theorem in R^p when p tends to infinity. Probability Theory and Related Fields, 73, 571–583.
Pruss, A. R. and Szynal, D. (2000) On the central limit theorem for negatively correlated random variables with negatively correlated squares. Stochastic Processes and their Applications, 87, 299–309.
Quiroz, M., Kohn, R., Villani, M. and Tran, M.-N. (2018) Speeding up MCMC by efficient data subsampling. Journal of the American Statistical Association, 0, 1–35.
Quiroz, M., Tran, M.-N., Villani, M., Kohn, R. and Dang, K.-D. (2016) The block-Poisson estimator for optimally tuned exact subsampling MCMC. ArXiv e-prints.
Raftery, A. E. (1995) Bayesian model selection in social research. Sociological Methodology, 25, 111–163.
— (1996) Hypothesis testing and model selection via posterior simulation. In Markov chain Monte Carlo in practice, 163–188.
Raskutti, G. and Mahoney, M. (2014) A Statistical Perspective on Randomized Sketching for Ordinary Least-Squares. arXiv preprint arXiv:1406.5986.
Rhee, C.-h. and Glynn, P. W. (2015) Unbiased estimation with square root convergence for SDE models. Operations Research, 63, 1026–1043.
Robert, C. P. (2007) The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer Texts in Statistics. New York, NY: Springer New York, 2nd edn.
Robert, C. P. and Casella, G. (2010) Monte Carlo Statistical Methods. Springer.
Robert, C. P., Elvira, V., Tawn, N. and Wu, C. (2018) Accelerating MCMC algorithms. Wiley Interdisciplinary Reviews: Computational Statistics, 10, 1–14.
Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C. and Müller, M. (2011) pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77.
Roosta-Khorasani, F. and Mahoney, M. W. (2016) Sub-Sampled Newton Methods I: Globally Convergent Algorithms. arXiv preprint arXiv:1601.04737.
Sarlos, T. (2006) Improved approximation algorithms for large matrices via random projections. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), 143–152. IEEE.
Särndal, C.-E., Swensson, B. and Wretman, J. H. (1992) Model assisted survey sampling. Springer series in statistics. New York: Springer-Verlag.
Schwarz, G. (1978) Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
Scott, S. L. (2017) Comparing consensus Monte Carlo strategies for distributed Bayesian computation. Brazilian Journal of Probability and Statistics, 31, 668–685.
Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H., George, E. and McCulloch, R. (2013) Bayes and big data: the consensus Monte Carlo algorithm. In EFaBBayes 250 conference, vol. 16.
Searle, S. R. (1997) Linear Models. New Jersey: Wiley-Interscience.
Shah, R. D. and Meinshausen, N. (2013) Min-wise hashing for large-scale regression and classification with sparse data. arXiv preprint arXiv:1308.1269.
Shirts, M. R., Bair, E., Hooker, G. and Pande, V. S. (2003) Equilibrium free energies from nonequilibrium measurements using maximum-likelihood methods. Physical Review Letters, 91, 140601.
Silverman, B. W. (1986) Density estimation for statistics and data analysis. Monographs on statistics and applied probability (Series). London; New York: Chapman and Hall.
Silverstein, J. W. (1985) The smallest eigenvalue of a large dimensional Wishart matrix. The Annals of Probability, 13, 1364–1368. URL https://doi.org/10.1214/aop/1176992819.
Skilling, J. et al. (2006) Nested sampling for general Bayesian computation. Bayesian Analysis, 1, 833–859.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P. and Van Der Linde, A. (2002) Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64, 583–639.
Srivastava, S., Cevher, V., Tran-Dinh, Q., Dunson, D. B. and de Lausanne, F. (2015) WASP: Scalable Bayes via barycenters of subset posteriors. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 912–920.
Stan Development Team (2018) Stan Modeling Language Users Guide and Reference Manual.
Tanner, M. A. and Wong, W. H. (1987) The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82, 528–540.
Terrell, G. R. and Scott, D. W. (1992) Variable kernel density estimation. The Annals of Statistics, 20, 1236–1265.
Thanei, G.-A., Heinze, C. and Meinshausen, N. (2017) Random projections for large-scale regression. arXiv preprint 1701.05325.
Tracy, C. A. and Widom, H. (1994) Level-spacing distributions and the Airy kernel. Communications in Mathematical Physics, 159, 151–174.
Tropp, J. A. (2011) Improved analysis of the subsampled randomized Hadamard transform. Advances in Adaptive Data Analysis, 3, 115–126.
Van Der Vaart, A. (1998) Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics, 3. Cambridge University Press.
Varian, H. R. (2014) Big data: New tricks for econometrics. Journal of Economic Perspectives, 28, 3–28.
Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. New York: Springer, fourth edn. ISBN 0-387-95457-0.
Venkatasubramanian, S. and Wang, Q. (2011) The Johnson-Lindenstrauss transform: an empirical study. In 2011 Proceedings of the Thirteenth Workshop on Algorithm Engineering and Experiments (ALENEX), 164–173. SIAM.
Wagner, W. (1987) Unbiased Monte Carlo evaluation of certain functional integrals. Journal of Computational Physics, 71, 21–33.
Walker, S., Damien, P. and Lenk, P. (2004) On priors with a Kullback-Leibler property. Journal of the American Statistical Association, 99, 404–408.
Walschap, G. (2015) Multivariable Calculus and Differential Geometry. De Gruyter.
Wang, X. and Dunson, D. B. (2013) Parallelizing MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605.
Wasserstein, R. L. and Lazar, N. A. (2016) The ASA's statement on p-values: Context, process, and purpose. The American Statistician, 70, 129–133.
Welling, M. and Teh, Y. W. (2011) Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 681–688.
Wetzels, R., Grasman, R. P. and Wagenmakers, E.-J. (2010) An encompassing prior generalization of the Savage–Dickey density ratio. Computational Statistics & Data Analysis, 54, 2094–2102.
White, H. (1984) Asymptotic theory for econometricians. Economic Theory, Econometrics and Mathematical Economics Series. Academic Press.
Wickham, H. (2014) nycflights13: Data about flights departing NYC in 2013. R package version 0.1.
Woodruff, D. P. (2014) Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10, 1–157.
Yang, J., Meng, X. and Mahoney, M. W. (2015a) Implementing randomized matrix algorithms in parallel and distributed environments. arXiv preprint arXiv:1502.03032.
Yang, T., Zhang, L., Lin, Q. and Jin, R. (2015b) Fast sparse least-squares regression with non-asymptotic guarantees. arXiv preprint arXiv:1507.05185.
Young, D. M. (1972) Second-degree iterative methods for the solution of large linear systems. Journal of Approximation Theory, 5, 137–148.
Zhou, S., Ligett, K. and Wasserman, L. (2009) Differential privacy with compression. In Information Theory, 2009. ISIT 2009. IEEE International Symposium on, 2718–2722. IEEE.