New statistical perspectives on efficient Big Data algorithms for high-dimensional Bayesian regression and model selection

Daniel Christian Ahfock
Fitzwilliam College
MRC Biostatistics Unit, University of Cambridge
September 2018

A thesis presented for the degree of Doctor of Philosophy

Abstract

This thesis is focused on the development of computationally efficient procedures for regression modelling with datasets containing a large number of observations. Standard algorithms can be prohibitively computationally demanding on large n datasets, and we propose and analyse new computational methods for model fitting and selection. We explore three different generic strategies for tall datasets. Divide and conquer approaches split the full dataset into subsets, with the subsets then being analysed independently in parallel. The subset results are then pooled into an overall consensus. Subsampling based methods repeatedly use minibatches of data to estimate quantities of interest. The third strategy is sketching, a probabilistic data compression technique developed in the computer science community. Sketching uses random projection to compress the original large dataset, producing a smaller surrogate dataset that is less computationally demanding to work with. The sketched dataset can be used for approximate inference. We test our regression algorithms on several large n genetic datasets, aiming to find associations between genetic variants and red blood cell traits.

Bayesian divide and conquer and subsampling methods have been studied in the fixed model setting, but little attention has been given to model selection. An important task in Bayesian model selection is computation of the integrated likelihood. We propose divide and conquer and subsampling algorithms for estimating the integrated likelihood. The divide and conquer approach is based on data augmentation, which is particularly useful for logistic regression. The subsampling approach involves constructing upper and lower bounds on the integrated likelihood using information theory.

Sketching algorithms generate a compressed set of responses and predictors that can then be used to estimate regression coefficients. Sketching algorithms use random projections to compress the original dataset, and this stochastic generation process makes them amenable to statistical analysis. We examine the statistical properties of sketching algorithms, which allows us to quantify the error in the coefficients estimated using the sketched dataset. The proportion of variance explained by the model proves to be an important quantity in choosing between alternative sketching algorithms. This is particularly relevant to genetic studies, where the signal to noise ratio can be low. We also investigate sketching as a tool for posterior approximation. The sketched dataset can be used to generate an approximate posterior distribution over models. As expected, the quality of the posterior approximation increases with the number of observations in the sketched dataset. The trade-off is that the computational cost of sketching increases with the size of the desired sketched dataset. The main conclusion is that impractically large sketch sizes are needed to obtain a tolerable approximation of the posterior distribution over models. We test the performance of sketching for posterior approximation on a large genetic dataset. A key finding is that false positives are a major issue when performing model selection.
Practical regression analysis with large n datasets can require specialised algorithms. Parallel processing, subsampling and random projection are all useful tools for computationally efficient regression modelling.

Declaration

• This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specified in the text.
• I further state that no substantial part of my dissertation has already been submitted, or is being concurrently submitted, for any such degree, diploma or other qualification at the University of Cambridge or any other University or similar institution.
• Chapters 4 and 5 are joint work with Sylvia Richardson and William Astle. The contents of these chapters have been submitted jointly for publication.

Acknowledgement

Firstly, thank you to my supervisors Sylvia Richardson and William Astle for all of their help and guidance over the course of the PhD. The work in this thesis would not have come together without their efforts. I am also very grateful to my parents, Tony and Georgette, and to my sister Melody for all of their support and encouragement. Even though Cambridge is a long way from Australia, I never felt too far away from home. Finally, thank you to my wonderful wife Amy for being such an amazing partner on this journey.

Contents

List of Figures
List of Tables
1 Introduction
2 Split, apply, combine: computing the model evidence using embarrassingly parallel processing
  2.1 Introduction
  2.2 Divide and conquer Bayesian inference
    2.2.1 Posterior sampling
    2.2.2 Model uncertainty
  2.3 Background
    2.3.1 Integrated likelihood calculation
    2.3.2 Savage-Dickey density ratio
  2.4 General Bayesian models
    2.4.1 Introduction
    2.4.2 Subset saturated model
    2.4.3 Split and apply steps
    2.4.4 Combine step
    2.4.5 Example: ESP dataset
    2.4.6 Curse of dimensionality
    2.4.7 Embarrassingly parallel evidence estimation
  2.5 Data augmentation for distributed inference
    2.5.1 Introduction
    2.5.2 Conjugate priors in the exponential family
    2.5.3 Data augmentation
    2.5.4 Chib's method
    2.5.5 Apply step
    2.5.6 Combine step
    2.5.7 Embarrassingly parallel evidence estimation
    2.5.8 Monte Carlo error
  2.6 Logistic regression
  2.7 Data application
    2.7.1 Flights dataset
    2.7.2 Pima Indians dataset
  2.8 Conclusion
3 Bounding the model evidence using the subsampled sandwich estimator
  3.1 Introduction
  3.2 Bayesian model selection
    3.2.1 Evidence bounds
    3.2.2 Entropy
    3.2.3 Upper bounding the evidence
    3.2.4 Lower bounding the evidence
    3.2.5 Sandwiching the evidence
  3.3 Related work
    3.3.1 Subsampled log likelihoods
    3.3.2 Subsampled likelihoods
    3.3.3 Importance sampling
    3.3.4 Harmonic mean estimator
    3.3.5 Bridge sampling
    3.3.6 Laplace approximation
    3.3.7 Laplace-Metropolis estimator
  3.4 Application to tall datasets
    3.4.1 Estimation of evidence bounds
  3.5 Data application: flights dataset
  3.6 Conclusion
  3.7 Appendix
    3.7.1 Control variates
    3.7.2 Asymptotic variance
4 Statistical properties of sketching algorithms
  4.1 Introduction
  4.2 Background and related work
    4.2.1 Embedding bounds
    4.2.2 Sketches
    4.2.3 Sketching examples
    4.2.4 Sketching bounds
  4.3 Gaussian sketching
    4.3.1 Complete sketching
    4.3.2 Partial sketching
    4.3.3 Relative efficiency
    4.3.4 Combined estimator
  4.4 Asymptotics
    4.4.1 Preliminaries
    4.4.2 Sketching central limit theorem
    4.4.3 Sketching estimators
  4.5 Data application
    4.5.1 Human leukocyte antigen dataset
    4.5.2 Flights dataset
  4.6 Discussion
5 Proofs regarding sketching algorithms
  5.1 Proof of Theorem 4.1 (Worst case bound for partial sketching)
  5.2 Proof of Theorem 4.2 (Hierarchical model for the Gaussian sketch)
  5.3 Variance for partial sketching
  5.4 Combined estimator results
  5.5 Proof of Theorem 4.4 (central limit theorem under asymptotic negligibility condition)
  5.6 Proof of Theorem 4.3 (Sketching central limit theorem)
    5.6.1 Clarkson-Woodruff sketch
    5.6.2 Hadamard sketch
  5.7 Proof of Theorem 4.5 (Complete sketching asymptotics)
  5.8 Proof of Theorem 4.6 (Partial sketching asymptotics)
6 On subspace embeddings, Tracy-Widom limits and approximate Bayesian subset selection
  6.1 Summary
  6.2 Introduction
  6.3 Bayesian model selection
  6.4 Approximate Bayesian inference
  6.5 Embedding probabilities
    6.5.1 Previous work
    6.5.2 Gaussian sketch
  6.6 Asymptotics
  6.7 Random matrix theory
    6.7.1 Pointwise limit
    6.7.2 Tracy-Widom limit
  6.8 Sketching asymptotics
  6.9 Data application
    6.9.1 Embedding probabilities
    6.9.2 Posterior approximation
  6.10 Conclusion
7 Conclusion
References

List of Figures

1.1 Box's loop (Box, 1976).
2.1 Template for embarrassingly parallel algorithms.
2.2 Sleep dataset from the Behavioural Risk Factor Surveillance survey.
2.3 Divide and conquer analysis of the sleep dataset with s = 2.
2.4 Subposterior distributions for the cubic model on the sleep dataset.
2.5 Target Bayesian model.
2.6 Illustration of the subposterior density identity using the ESP dataset.
2.7 Alternative hierarchical Bayesian model.
2.8 Divide and conquer analysis of the extra sensory perception dataset with s = 3.
2.9 Flights dataset and simple logistic regression model.
2.10 Comparison of logistic regression models on the flights dataset.
2.11 Comparison of subposterior and target posterior distributions on the latent variables for the Pima Indians dataset.
2.12 Uniform split of the Pima Indians dataset.
2.13 Biased split of the Pima Indians dataset.
2.14 Distribution of log Îsub for the Pima Indians dataset.
3.1 ROC curves for the flights dataset.
3.2 Reliability curves for the flights dataset.
4.1 Example sketching matrices.
4.2 Abalone dataset.
4.3 Bias of sketching estimators on the HLA dataset.
4.4 Normality tests of the sketched dataset.
6.1 Comparison of simulated embedding probabilities against theoretical results at different k and d.
6.2 Tracy-Widom distribution.
6.3 Empirical probability of obtaining an ε-subspace embedding at different k and d.
6.4 Observed and theoretical density of σmax(Id − UᵀSᵀSU) at different k and d.
6.5 Regions of interest in determining the embedding probability.
6.6 Empirical and theoretical embedding probabilities for the representative PRKCE dataset.
6.7 Empirical and theoretical embedding probabilities for the full PRKCE genetic dataset.
6.8 Empirical and theoretical density of σmax(Id − UᵀSᵀSU) for the full PRKCE genetic dataset.
6.9 Comparison of theoretical and empirical density of σmax(Id − UᵀSᵀSU) on the full PRKCE dataset and the bootstrapped PRKCE dataset.
6.10 Manhattan plot using the representative PRKCE dataset.
6.11 Marginal posterior probabilities of inclusion for the PRKCE dataset.
6.12 Boxplots of sketched marginal inclusion probabilities.
6.13 Histograms of sketched marginal inclusion probabilities (low evidence SNPs).
6.14 Histograms of sketched marginal posterior probabilities (moderate evidence SNPs).
6.15 Boxplots of sketched marginal inclusion probabilities.
6.16 Sketched marginal inclusion probabilities (low evidence SNPs).
6.17 Sketched marginal inclusion probabilities (moderate evidence SNPs).
7.1 Box's loop (Box, 1976).
7.2 Template for embarrassingly parallel algorithms.

List of Tables

2.1 Terminology for divide and conquer Bayesian inference.
2.2 Posterior and subposterior model probabilities for the sleep dataset.
2.3 Required sample size to attain a relative mean square error of less than 0.1 using nonparametric kernel density estimation against dimension (Silverman, 1986).
2.4 ESP dataset and subsets for s = 2.
2.5 Full ESP dataset and subsets for a divide and conquer approach with s = 3. The success proportion is close to 0.5 in the full dataset and in each subset.
2.6 Raw data for subjects shown in Figure 2.11.
3.1 Guidelines for the interpretation of Bayes factors (Kass and Raftery, 1995).
3.2 Log Bayes factors for the flights dataset.
3.3 Estimates of the model evidence for the flights dataset.
3.4 Time spent on likelihood evaluations for the flights dataset.
3.5 Proportion of negative likelihood estimates using the simple Poisson estimator.
4.1 Properties of different data oblivious random projections (Woodruff, 2014).
4.2 Mean square error of sketched estimators on HLA dataset.
4.3 Coverage of confidence intervals on the HLA dataset.
4.4 Mean square error of sketched estimators on flights dataset.
4.5 Coverage of 95% confidence intervals on the flights dataset.
4.6 Mean square error of sketched estimators on synthetic flights dataset.
6.1 Properties of different data oblivious random projections.
6.2 Mean sketching time for the representative PRKCE dataset.
6.3 Mean sketching time (seconds) for the full PRKCE dataset.
6.4 Monte Carlo and asymptotic estimates of sketch quality for the Clarkson-Woodruff sketch.
6.5 Monte Carlo and asymptotic estimates of sketch quality for the Gaussian sketch.

1 Introduction

In their recent monograph 'Computer age statistical inference', Efron and Hastie (2016) trace the development of statistical methodology over the past century with a particular emphasis on computationally intensive procedures. A central theme in the book is that the frontiers of statistical inference are pushed forward as both available datasets and data analysis methods evolve in complexity and richness. The capacity to acquire data and the capacity to analyse data typically advance in tandem, jointly propelled by technological developments. New systems create datasets of unprecedented size and scope that require analysis with new computer driven statistical procedures. For example, microarray technology presented large-scale hypothesis testing challenges. Computationally intensive false discovery rate methodology (Benjamini and Hochberg, 1995) was then honed in the analysis of this data. In turn, these new algorithms open the doors for more ambitious scientific investigations. False discovery rate methods can serve as a valuable tool in genome-wide association studies. New datasets can strain existing methodology, and innovation can spring from these difficulties. The emergence of Big Data marks another period where statistical methodology will simultaneously be challenged and afforded the opportunity to expand its reach.

Efron and Hastie (2016) observe that computational statistics research can fall into two camps. It is generally possible to make a categorisation between algorithm development and algorithm evaluation, although this division is not fully impermeable. Within the statistical field, algorithmic development typically takes place with inferential operating characteristics and principles in mind. New algorithms for data analysis are also developed by a wide range of subject disciplines, often shaped using other criteria. Teasing out the underlying statistical context and significance of these innovative approaches can lead to theoretical insights and improved techniques. The statistical toolbox can be applied to algorithms in two different ways:

• Development of new algorithms for statistical inference.
• Evaluation of existing algorithms using statistical theory and methods.

Big Data hoists a big tent. Research contributions in the domain will come from many fields. Statistical contributions will inevitably take both of the aforementioned forms. We can take new algorithms to Big Data or we can seek to understand the statistical basis of promising methodology that has been proposed outside of the area. The work in this thesis falls into both of these categories, to be outlined in more detail shortly.
To locate our focus within the broader field of statistics it is effective to consider Box's loop, a conceptual process model of scientific data analysis (Box, 1976). Box's loop represents the cycle of model building, fitting, checking and refining that typically takes place when interacting with data, diagrammed in Figure 1.1. Following Blei (2014) it is possible to identify four key steps in Box's loop: build, compute, critique and repeat. The initial build step is to postulate a plausible data generating process behind the observed data. Examples include but are certainly not limited to linear regression models, generalised linear models or mixture models. Substantial domain knowledge can be encoded at this stage. The second step is to compute the structural unknowns in the built model. This can involve the estimation of parameters through optimisation, or sampling from posterior distributions. The third step is to criticise the model through a formal procedure. Examples include residual diagnostics and posterior predictive checks. After the assessment, we then typically improve aspects of the model and repeat the process until we are satisfied. The final model is then deployed for prediction, visualisation, inference or any other suitable objective.

Figure 1.1: Box's loop is a conceptual process model of scientific data analysis (Box, 1976). Box's loop defines a number of key phases (build, compute, critique and repeat) when approaching a data modelling problem. The compartmentalisation of the overall task highlights important tactical decisions that are involved in the statistical modelling lifecycle and aids the elicitation of broader strategic elements that influence the work. Adapted from Blei (2014).

In an ideal world of infinite computational resources, the compute step would be instantaneous and painless. In reality we feel the pinch of scarcity, and with tall datasets the compute step can be a significant bottleneck. The work in this thesis is focused around the compute step of Box's loop. In particular, we are concerned with the development of scalable computational methods for Bayesian regression modelling on large datasets. Consider the general scenario where we have a dataset y of n observations with likelihood p(y|θ) and prior p(θ), for θ ∈ Ω ⊂ Rd. Many of the key Monte Carlo methods for Bayesian computation require a full likelihood evaluation p(y|θ) per iteration (Robert, 2007; Robert and Casella, 2010). This is a highly undesirable trait in the Big Data era when the O(n) cost of likelihood evaluations can be substantial. We propose a number of computationally efficient algorithms for Bayesian regression modelling with tall (large n) datasets that address the likelihood burden. We focus on issues surrounding model selection, in particular computation of the integrated likelihood p(y) = ∫ p(y|θ)p(θ) dθ. The latter half of the thesis takes a particular focus on the Gaussian linear model, and how random projection can be used for computationally efficient statistical inference. In particular we illuminate some statistical properties of randomised data compression algorithms proposed in the computer science and machine learning literature.

The integrated likelihood p(y) is also commonly known as the model evidence, and the quantity plays an important role in Bayesian model choice. We develop theory and methods for efficient computation of the integrated likelihood using parallel processing in Chapter 2.
Chapter 3 explores how subsampling can be used for efficient estimation of the integrated likelihood. The parallel processing and subsampling algorithms are compatible with a variety of regression models. We consider their application to logistic regression in detail. Chapters 4 and 5 investigate the use of random projection for approximate computation of the least squares estimates for Gaussian linear models. Chapter 6 extends the approach to carry out approximate Bayesian model selection.

There have been many recent advances in developing scalable computational methods for Bayesian inference in the fixed model setting. With the model given and fixed, the typical goal is to generate samples from the posterior distribution p(θ|y) ∝ p(y|θ)p(θ). Uncertainty on θ can shrink to the point of no practical significance in an analysis with a single model and huge n. This phenomenon is a motivating dynamic for our purposes. Model uncertainty, as opposed to parameter uncertainty, can often emerge as the more pressing inferential question given a massive number of observations (Varian, 2014). Bayesian model selection has received comparatively little attention in the Big Data literature, and much of the novelty in this thesis comes from the focus on this area. We explore how distributed computing, subsampling and random projection can be used as methods for efficient Bayesian model selection.

These techniques have been integrated into algorithms for sampling from the posterior distribution p(θ|y) when n is large. There are largely independent streams of literature surrounding each technique; Bardenet et al. (2017) provide an excellent survey of prior work. The problems of posterior sampling and integrated likelihood evaluation are connected, but not fully equivalent. Algorithms for calculating the model evidence typically require the generation of posterior samples as an initial step. As algorithms for posterior simulation do not typically produce the model evidence as an ancillary benefit, the posterior samples must then be used in a secondary estimation procedure to obtain the model evidence (Friel and Wyse, 2012; Raftery, 1995; Skilling et al., 2006). The two-stage nature of the problem poses new research questions in the Big Data setting. Accelerated procedures for posterior sampling do not naturally lead to accelerated procedures for calculation of the integrated likelihood. Our use of parallel processing, subsampling and random projection for calculation of the model evidence has major differences compared to prior work that is focused on posterior simulation.

Distributed computing platforms lend themselves naturally to a divide and conquer approach for the analysis of tall datasets. The difficult Big Data job can be broken down into a series of smaller manageable tasks by splitting the full dataset into subsets, and analysing each subset on a separate machine. The subset results are then merged together to provide the end result. An important factor in a divide and conquer algorithm is the degree of communication required between subprocessors during the minibatch analyses. Embarrassingly parallel algorithms require no communication during this stage, and the only pooling of information occurs in a single round at the end. Embarrassingly parallel algorithms are attractive as not all distributed computing platforms allow for communication between individual workers.
Secondly, the latency cost of regular communication between machines can be very high (Scott et al., 2013).

Subsampling also has a natural appeal for large n problems. It is sometimes possible to modify standard algorithms that involve full likelihood evaluations to instead use estimated likelihoods from a subsample of size m. Unsurprisingly, the efficiency of the modified algorithm depends on the quality of the likelihood estimator. A common finding in prior work is that Monte Carlo variance reduction techniques are needed to prevent algorithms from breaking down as the subsampling fraction m/n tends to zero (Maclaurin and Adams, 2014; Bierkens et al., 2016; Baker et al., 2017).

The computational demands of tall datasets can be addressed by sacrificing some accuracy in order to lower the computational expense of the analysis. Algorithms that play this trade-off to good effect have been identified as a promising future direction for Bayesian computation (Green et al., 2015). This principle leads us to consider 'sketching', a probabilistic data compression technique that has been developed largely within the computer science community. Sketching algorithms use random projection to generate a compressed dataset of k observations from the original source dataset of n observations. The compressed dataset is then used for approximate inference. Crucially, sketching algorithms offer stronger probabilistic guarantees on the stochastic approximation error than can be achieved through simple random sampling from the full dataset. Sketching algorithms have also shown excellent empirical performance compared to basic subsampling schemes in a number of simulation studies (Mahoney, 2011; Ma et al., 2015). Sketching has recently been explored as a method for approximate Bayesian inference in the fixed model setting (Geppert et al., 2017).
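To fix ideas, the fragment below sketches a tall simulated regression dataset with a Gaussian random projection and then solves least squares on the compressed data. This is my own toy illustration of the general recipe rather than code or data from the thesis (the data applications later use genetic and flights datasets), and the projection matrix is only formed explicitly because the example is small.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 10_000, 5, 500               # source rows n, predictors d, sketch size k

# Simulated tall dataset (X, y).
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -0.5, 0.0, 0.25, 2.0]) + rng.normal(size=n)

# Gaussian sketch: S has i.i.d. N(0, 1/k) entries so that E[S'S] = I_n.
S = rng.normal(scale=1.0 / np.sqrt(k), size=(k, n))
X_sketch, y_sketch = S @ X, S @ y      # compressed dataset with k observations

beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
beta_sketch = np.linalg.lstsq(X_sketch, y_sketch, rcond=None)[0]
print(np.round(beta_full - beta_sketch, 3))   # approximation error shrinks as k grows
```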
In Chapter 2 we develop an embarrassingly parallel algorithm for computation of the integrated likelihood. Data augmentation and Gibbs sampling are the key components of the method. In Chapter 3 we propose a computationally efficient method for estimating the integrated likelihood using subsampling. We find that it is difficult to modify existing importance sampling algorithms to use subsampling effectively. We propose an alternative interval estimator of the model evidence that has more favourable properties. As in related work, it is necessary to use variance reduction techniques to ensure the algorithm is stable even as the subsampling fraction m/n tends to zero. Chapter 4 develops statistical properties of sketched regression algorithms in the fixed model setting. Much of the existing literature on sketching is from an algorithmic point of view, and we place sketching in a statistical context. This groundwork is needed before considering sketched model selection. Chapter 5 gives proofs for many of the results in Chapter 4. Proofs are presented separately in Chapter 5 so as to not hamper the exposition in Chapter 4. Of particular note is a proof of a central limit theorem where we show the sketched dataset has a matrix normal distribution under mild regularity conditions. The regularity conditions have an appealing interpretation in terms of the geometry of the source dataset. Chapter 6 explores the use of sketching for approximate Bayesian subset selection. Random projection can be used to approximate the integrated likelihood p(y) in less time than is needed for exact computation. We find that the Tracy-Widom law (Tracy and Widom, 1994) is very useful for quantifying the error in the approximate integrated likelihood. The Tracy-Widom law describes the asymptotic distribution of the extreme eigenvalues of large random matrices and has found many applications in high-dimensional statistics (Johnstone, 2006). The connection to sketching algorithms appears to be a new result.

Each chapter is largely self-contained and includes a review of relevant literature. Returning to the development/evaluation classification that was mentioned earlier, Chapters 2 and 3 can be viewed as algorithm development. We propose new methodology for Bayesian computation with tall datasets. Chapters 4, 5 and 6 can be classified as algorithm evaluation. We investigate the statistical properties and principles that underpin existing sketching algorithms. As well as being of theoretical interest, this helps to define concrete procedures for assessing the quality of randomised algorithms. Multiple real datasets are analysed throughout in order to demonstrate the theory and methods. Chapters 2 and 3 consider a number of benchmark datasets from the computational statistics literature. Chapters 4 and 6 analyse some large genetic datasets from the UK Biobank database. The dissertation concludes in Chapter 7 with a short discussion of the major lessons and common themes that have materialised over the enclosed work.

2 Split, apply, combine: computing the model evidence using embarrassingly parallel processing

Summary

We investigate how parallel processing can be used for computing the integrated likelihood on datasets with a large number of observations n. Tall datasets can be split into many subsets that are then allocated to different machines on a compute cluster. The subsets can then be analysed concurrently; embarrassingly parallel algorithms run each minibatch analysis with no cross communication between subprocessors. We find that a combination of data augmentation and Gibbs sampling facilitates a simulation consistent and scalable embarrassingly parallel algorithm for a wide class of statistical models. We show that conditionally conjugate exponential family models exhibit structure that is amenable to embarrassingly parallel inference. A second theoretical finding involves the final step of the divide and conquer algorithm where the subset results are pooled to give the final estimate of the model evidence. We show this step has an interpretation as a Bayesian hypothesis testing problem. The connection furthers the theory of distributed Bayesian inference and leads to a second simulation consistent algorithm for computing the model evidence in parallel.

2.1 Introduction

Distributed computing platforms offer a huge amount of computing power that can be harnessed to complete many tasks simultaneously. Using parallel processing for the statistical analysis of large datasets is thus an appealing idea. The broad paradigm is to first split the full dataset into a collection of non-overlapping subsets. Subsets are then allocated to different machines on a network, and we then apply conventional statistical methods to each minibatch of data. The final step is to combine the subset output into a single estimate. Separating the algorithm into distinct split, apply and combine phases is useful for algorithm design and evaluation. Figure 2.1 displays a broad schematic of the desired approach.
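The template in Figure 2.1 can be written down very compactly. The sketch below is a deliberately simple illustration of the split/apply/combine pattern, my own example in which the apply-step summary is the per-subset least squares sufficient statistics rather than the evidence calculations developed later in this chapter. The point is that the combine step only ever sees the compact summaries returned by the workers.

```python
import numpy as np
from multiprocessing import Pool

def apply_step(subset):
    """Analyse one subset with no cross communication; return a compact summary."""
    X, y = subset
    return X.T @ X, X.T @ y

def combine_step(summaries):
    """Pool the subset summaries into a single consensus estimate."""
    XtX = sum(out[0] for out in summaries)
    Xty = sum(out[1] for out in summaries)
    return np.linalg.solve(XtX, Xty)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50_000, 3))
    y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=50_000)

    # Split step: s non-overlapping subsets, one per worker.
    s = 4
    subsets = list(zip(np.array_split(X, s), np.array_split(y, s)))

    # Apply step runs embarrassingly in parallel; the combine step never
    # touches the original dataset, only the s summaries.
    with Pool(s) as pool:
        summaries = pool.map(apply_step, subsets)
    print(combine_step(summaries))
```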
A stringent desideratum of an embarrassingly parallel scheme is that the combine phase does not allow interaction with the original dataset. The entire dataset is processed and distilled in the apply phase. Embarrassingly parallel algorithms follow a divide and conquer principle, where the original difficult problem is broken down into a series of manageable tasks. This computational blueprint presents novel challenges for statistical inference. It can be difficult to identify appropriate summary measures for the subset analyses conducted in the apply stage, and the final aggregation in the combine phase is a challenging evidence synthesis task (Jordan, 2013).

These issues are present when implementing a divide and conquer approach for Bayesian model selection. Typically, each subset will be analysed using Markov chain Monte Carlo methods and will return a rich set of output. An additional consideration with model selection is that individual subset analyses will be underpowered relative to a full dataset analysis. We are unlikely to see support for complex models in the subset analyses, despite the fact that we may have a large enough dataset to detect complex features. We explore theoretical and practical issues for distributed Bayesian model selection. Of particular concern is how to summarise the subset analyses effectively in the apply step and how to synthesise the minibatch results appropriately into an overall consensus in the combine step.

Figure 2.1: Template for embarrassingly parallel algorithms. The split step breaks the full dataset into s non-overlapping subsets. The illustration is for s = 2. Each subset is then allocated to a different machine on a cluster. During the apply step we apply conventional methodology to each data subset with no cross communication between workers. Each analysis is summarised by a consistent format of output. The s sets of output from the apply stage are then synthesised in the combine step to give the final result. In this design brief, the combine stage only involves the output from the apply step, and not the original dataset.

Suppose interest lies in a collection of M parametric models M1, . . . , MM. The prior probability of model j is given by p(Mj) for j = 1, . . . , M. In a slight abuse of notation we do not write θj to specify the parameter associated with model Mj; this is to stop the notation from becoming cluttered. The notation p(θ|Mj) is implicitly understood to refer to the distinct parameter associated with model Mj. In this work we assume that model selection is to be carried out by calculating the integrated likelihood p(y|Mj) for each model j = 1, . . . , M. The posterior distribution over models can be calculated easily by enumerating over the full set of M integrated likelihoods and prior probabilities:

p(\mathcal{M}_j \mid y) = \frac{p(y \mid \mathcal{M}_j)\, p(\mathcal{M}_j)}{\sum_{k=1}^{M} p(y \mid \mathcal{M}_k)\, p(\mathcal{M}_k)}.    (2.1)

It is therefore sufficient to consider the distributed computation of the integrated likelihood for some arbitrary model M. Given an algorithm for an arbitrary model, we simply apply it to each model in the set of interest and then enumerate to obtain the posterior probabilities using (2.1). Assume the parameter space of model M is indexed by some continuous θ ∈ Ω ⊂ Rd. Let the data y consist of n independent observations. Given a prior distribution p(θ) and a likelihood p(y|θ), the integrated likelihood is defined as

p(y \mid \mathcal{M}) = \int p(y \mid \theta, \mathcal{M})\, p(\theta \mid \mathcal{M})\, d\theta.    (2.2)
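Given integrated likelihoods of the form (2.2) for each candidate model, the enumeration in (2.1) is a short calculation. The fragment below is a hypothetical illustration rather than code from the thesis; it converts per-model log integrated likelihoods and prior model probabilities into posterior model probabilities, working on the log scale for numerical stability.

```python
import numpy as np
from scipy.special import logsumexp

def posterior_model_probs(log_evidence, prior_probs):
    """Enumerate posterior model probabilities as in (2.1).

    log_evidence : array of log p(y | M_j) for j = 1, ..., M
    prior_probs  : array of prior probabilities p(M_j)
    """
    log_numerator = np.asarray(log_evidence) + np.log(prior_probs)
    return np.exp(log_numerator - logsumexp(log_numerator))

# Toy example with three candidate models and a uniform model prior.
log_p_y = np.array([-1520.3, -1518.9, -1530.2])   # illustrative values only
prior = np.ones(3) / 3
print(posterior_model_probs(log_p_y, prior))
```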
Previous work on Bayesian divide and conquer algorithms is largely focused on how to sample from or approximate the target posterior p(θ|y, M) (Huang and Gelman, 2005; Scott et al., 2013; Neiswanger et al., 2013; Srivastava et al., 2015). Calculation of the integrated likelihood (2.2) presents different challenges, and we largely build on the existing literature for computing integrated likelihoods and Bayes factors in the single machine setting.

The first main theoretical finding is that the combine step in the divide and conquer algorithm has an interpretation as a Bayesian hypothesis testing problem. The gold standard Bayesian evidence synthesis rule can be expressed as a Savage-Dickey density ratio (Dickey, 1971). We also find that data augmentation (Tanner and Wong, 1987) and the theory of conditional conjugacy for exponential family models are of use in defining appropriate action in the apply and combine stages. From a computational point of view, we propose two simulation consistent algorithms for calculating the model evidence in parallel. The first algorithm is developed in section 2.4 and is designed for general parametric models. The algorithm presumes that the apply stage involves a generic Metropolis-Hastings sampler run on each subset. The subset output consists of the posterior draws. The combine phase then consists of a density estimation task given the subset posterior samples. The dimension of the density estimation task increases linearly with the number of sub-processors, thus limiting its scalability. The second algorithm is developed in section 2.5, and applies to conditionally conjugate exponential family models. We assume that conjugacy can be achieved through a suitable data augmentation scheme. The algorithm prescribes Gibbs sampling in the apply step. The output is no longer the raw sampled values, but rather the parameters of the conditional posterior at each stage of the Gibbs run. The combine stage then involves aggregation of the Gibbs sampler histories from each subset. Data augmentation proves useful as we can then obtain closed form rules for combining the output from the apply stage. The approach is related to the method of Chib (1995) for computing the integrated likelihood from the Gibbs output. Although limited in scope compared to the first algorithm, the approach involving data augmentation has significantly more favourable computational properties. By combining data augmentation and Gibbs sampling with parallel processing, it is possible to conduct Bayesian model selection using a divide and conquer strategy. We illustrate the theory and methods on a number of real datasets, with a particular focus on logistic regression.

2.2 Divide and conquer Bayesian inference

2.2.1 Posterior sampling

With a specified model, the divide and conquer approach is largely motivated by a factorisation of the full dataset posterior distribution into a combination of subset posterior distributions (Bardenet et al., 2017). We will assume it is possible to partition the data into s subsets y = (y1, . . . , ys) such that the subsets are independent given θ. To describe the strategy we introduce the idea of a subprior distribution and a subposterior distribution. The subprior distribution p̃(θ) is defined as

\tilde{p}(\theta) = \frac{p(\theta)^{1/s}}{\int p(\theta)^{1/s}\, d\theta}.

We assume that the fractionated prior p(θ)^{1/s} is integrable so that the subprior distribution is well defined. The assumption is always satisfied for Gaussian priors; in general the condition needs to be checked on a case-by-case basis.
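As a small worked example (my own illustration, not an example from the text), consider a univariate Gaussian prior. Fractionating a N(0, τ²) prior simply inflates its variance by a factor of s:

p(\theta) = (2\pi\tau^2)^{-1/2} \exp\!\left(-\frac{\theta^2}{2\tau^2}\right)
\quad \Longrightarrow \quad
p(\theta)^{1/s} \propto \exp\!\left(-\frac{\theta^2}{2 s \tau^2}\right),

so the subprior \tilde{p}(\theta) is the N(0, s\tau^2) density, and the subprior normalising constant is

\alpha = \int p(\theta)^{1/s}\, d\theta = (2\pi\tau^2)^{-1/(2s)}\, (2\pi s \tau^2)^{1/2}.

The same calculation holds for any multivariate Gaussian prior, with the covariance matrix scaled by s.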
The subprior distribution contains a fraction of the prior information encoded in the original prior distribution p(θ). The subposterior distributions are defined in terms of the subset likelihoods and the subprior distributions. For i = 1, . . . , s, the i-th subposterior distribution p̃(θ|yi) is defined as

\tilde{p}(\theta \mid y_i) = \frac{p(y_i \mid \theta)\, \tilde{p}(\theta)}{\int p(y_i \mid \theta)\, \tilde{p}(\theta)\, d\theta}.

The tilde is used to acknowledge the use of the subprior distribution p̃(θ) in place of the original prior distribution p(θ). Probability formulae conditional on θ do not require a tilde, so we use the regular notation for the subset likelihood p(yi|θ). We define the subprior normalising constant α as α = ∫ p(θ)^{1/s} dθ and the subposterior evidence as p̃(yi) = ∫ p(yi|θ) p̃(θ) dθ. Table 2.1 lists some key terms that we will make use of when discussing divide and conquer Bayesian inference.

Term                            Symbol      Definition
Subprior                        p̃(θ)        p̃(θ) = p(θ)^{1/s} / α
Subposterior                    p̃(θ|yi)     p̃(θ|yi) = p(yi|θ) p̃(θ) / p̃(yi)
Subprior normalising constant   α           α = ∫ p(θ)^{1/s} dθ
Subposterior evidence           p̃(yi)       p̃(yi) = ∫ p(yi|θ) p̃(θ) dθ

Table 2.1: Terminology for divide and conquer Bayesian inference. The notation and terminology in this table is implicitly conditional on some model M with parameter θ.

A divide and conquer strategy can be motivated by noting that the full dataset posterior is proportional to the product of s subposterior distributions:

p(\theta \mid y_1, \ldots, y_s)
  = \frac{p(\theta)\, p(y_1, \ldots, y_s \mid \theta)}{p(y_1, \ldots, y_s)}
  = \frac{1}{p(y_1, \ldots, y_s)}\, p(\theta) \prod_{i=1}^{s} p(y_i \mid \theta)
  = \frac{1}{p(y_1, \ldots, y_s)} \prod_{i=1}^{s} p(y_i \mid \theta)\, p(\theta)^{1/s}
  = \frac{\alpha^{s}}{p(y_1, \ldots, y_s)} \prod_{i=1}^{s} p(y_i \mid \theta)\, \tilde{p}(\theta)
  = \frac{\alpha^{s} \prod_{i=1}^{s} \tilde{p}(y_i)}{p(y_1, \ldots, y_s)} \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i)    (2.3)
  \propto \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i).    (2.4)

Dropping the constants in (2.3) leads to the relationship in (2.4). Each data subset yi can be distributed to a subprocessor. It is then possible to generate samples from each subposterior p̃(θ|yi) in parallel, typically by running MCMC on each worker. When posterior simulation is the ultimate goal, the combine step must pool the subposterior samples in a manner such that the full dataset posterior p(θ|y1, . . . , ys) is targeted. A variety of combination rules have been suggested for synthesising the subposterior output in order to target p(θ|y1, . . . , ys). Huang and Gelman (2005) and Scott et al. (2013) develop rules based on making a normal approximation to each subposterior. Neiswanger et al. (2013) propose an approach using kernel density estimation. Wang and Dunson (2013) consider the use of the Weierstrass transform. There are also aggregation rules based on geometric or robustness arguments (Srivastava et al., 2015; Minsker et al., 2017). As our goal is calculation of the integrated likelihood as opposed to posterior sampling, we do not review any of these methods in detail.
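The identity (2.4) is easy to verify numerically in conjugate settings. The sketch below is my own toy example, not an analysis from the thesis: for a Gaussian mean model with known noise variance, every subposterior is Gaussian, and the normalised product of the s subposteriors recovers the full dataset posterior exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, s = 10_000, 4
sigma2, tau2 = 1.0, 10.0          # known noise variance, prior variance
y = rng.normal(0.3, np.sqrt(sigma2), size=n)
subsets = np.array_split(y, s)

# Full dataset posterior for theta under the N(0, tau2) prior.
full_prec = 1 / tau2 + n / sigma2
full_mean = (y.sum() / sigma2) / full_prec

# Subposteriors use the subprior N(0, s * tau2); each one is Gaussian.
sub_prec = np.array([1 / (s * tau2) + len(b) / sigma2 for b in subsets])
sub_mean = np.array([(b.sum() / sigma2) / p for b, p in zip(subsets, sub_prec)])

# The normalised product of Gaussian densities is Gaussian: precisions add and
# the mean is the precision-weighted average, matching (2.4).
prod_prec = sub_prec.sum()
prod_mean = (sub_prec * sub_mean).sum() / prod_prec
print(np.allclose([full_prec, full_mean], [prod_prec, prod_mean]))  # True
```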
Divide and conquer Bayesian model selection presents different challenges, as the decomposition of the full dataset model posterior into batch posteriors is more complicated than in (2.4).

2.2.2 Model uncertainty

Motivating a divide and conquer approach for model selection requires a different line of reasoning compared to the fixed model case, as the full dataset integrated likelihood is not equal to the product of the subset integrated likelihoods:

p(y_1, y_2, \ldots, y_s \mid \mathcal{M}_j) = p(y_1 \mid \mathcal{M}_j) \prod_{i=2}^{s} p(y_i \mid y_1, \ldots, y_{i-1}, \mathcal{M}_j) \neq \prod_{i=1}^{s} p(y_i \mid \mathcal{M}_j).

The first line is simply the chain rule of probability. The inequality is present as the subsets are only conditionally independent when given both the model and the parameters. For example, if the underlying model is a normal distribution, the observations y1, . . . , ys are still dependent until the mean and variance parameters are completely specified. Although we have assumed p(yi|y1, . . . , yi−1, θ, Mj) = p(yi|θ, Mj), the integrated likelihood does not have the same property: p(yi|y1, . . . , yi−1, Mj) ≠ p(yi|Mj). As such, we cannot immediately obtain a subposterior factorisation over data batches using the same argument as in (2.4). The target integrated likelihood is related to the subposterior evidence values, the subprior normalising constant and the subposterior distributions. Rearranging (2.3) gives

p(y_1, \ldots, y_s \mid \mathcal{M}_j)\, p(\theta \mid y_1, \ldots, y_s, \mathcal{M}_j) = \alpha^{s} \prod_{i=1}^{s} \tilde{p}(y_i \mid \mathcal{M}_j) \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i, \mathcal{M}_j).

Integrating both sides over θ gives an important expression for the full dataset model evidence:

p(y_1, \ldots, y_s \mid \mathcal{M}_j) = \alpha^{s} \prod_{i=1}^{s} \tilde{p}(y_i \mid \mathcal{M}_j) \int \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i, \mathcal{M}_j)\, d\theta.    (2.5)

The full dataset model evidence is related to the subposterior evidence values, the subprior normalising constant α and an integral over the subposterior distributions. The complex relationship between the full dataset evidence and the subset output means that it is difficult to obtain a simple rule for embarrassingly parallel model selection. Define the subprior model probabilities as

\tilde{p}(\mathcal{M}_j) = \frac{p(\mathcal{M}_j)^{1/s}}{\sum_{k=1}^{M} p(\mathcal{M}_k)^{1/s}}.

We define the subposterior model probabilities p̃(Mj|yi) in terms of the subposterior evidence p̃(yi|Mj) and the subprior model probability p̃(Mj):

\tilde{p}(\mathcal{M}_j \mid y_i) = \frac{\tilde{p}(y_i \mid \mathcal{M}_j)\, \tilde{p}(\mathcal{M}_j)}{\sum_{k=1}^{M} \tilde{p}(y_i \mid \mathcal{M}_k)\, \tilde{p}(\mathcal{M}_k)}.    (2.6)

It is more difficult to represent the full dataset posterior over models as a combination of subposterior distributions. We use αj to denote the subprior normalising constant for model Mj, so αj = ∫ p(θ|Mj)^{1/s} dθ. The full dataset posterior is related to the subposterior model probabilities, the subprior normalising constants and the subposterior distributions on the model parameters. We start by writing out the decomposition in full:

p(\mathcal{M}_j \mid y_1, \ldots, y_s) = \frac{p(\mathcal{M}_j)\, p(y_1, \ldots, y_s \mid \mathcal{M}_j)}{\sum_{k=1}^{M} p(\mathcal{M}_k)\, p(y_1, \ldots, y_s \mid \mathcal{M}_k)} \propto p(\mathcal{M}_j)\, p(y_1, \ldots, y_s \mid \mathcal{M}_j).    (2.7)

Now substituting in (2.5) and using the fact that p̃(Mj) ∝ p(Mj)^{1/s},

p(\mathcal{M}_j \mid y_1, \ldots, y_s)
  \propto p(\mathcal{M}_j) \left( \prod_{i=1}^{s} \tilde{p}(y_i \mid \mathcal{M}_j) \right) \alpha_j^{s} \int \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i, \mathcal{M}_j)\, d\theta
  = \left( \prod_{i=1}^{s} \tilde{p}(y_i \mid \mathcal{M}_j)\, p(\mathcal{M}_j)^{1/s} \right) \alpha_j^{s} \int \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i, \mathcal{M}_j)\, d\theta
  \propto \left( \prod_{i=1}^{s} \tilde{p}(y_i \mid \mathcal{M}_j)\, \tilde{p}(\mathcal{M}_j) \right) \alpha_j^{s} \int \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i, \mathcal{M}_j)\, d\theta.    (2.8)

The following quantity is a constant as it involves sums over the collection of models:

\prod_{i=1}^{s} \frac{1}{\sum_{k=1}^{M} \tilde{p}(y_i \mid \mathcal{M}_k)\, \tilde{p}(\mathcal{M}_k)}.    (2.9)

Figure 2.2: Sleep dataset from the Behavioural Risk Factor Surveillance survey (n = 481,939). Panel (a) shows the number of individuals reporting a particular number of hours of sleep. Panel (b) shows the proportion of individuals reporting 'Poor' health against hours of sleep. Self-reported 'Poor' health status appears to be statistically associated with sleep habits.

Multiplying (2.8) by the constant in (2.9) yields

p(\mathcal{M}_j \mid y_1, \ldots, y_s)
  \propto \left( \prod_{i=1}^{s} \frac{\tilde{p}(y_i \mid \mathcal{M}_j)\, \tilde{p}(\mathcal{M}_j)}{\sum_{k=1}^{M} \tilde{p}(y_i \mid \mathcal{M}_k)\, \tilde{p}(\mathcal{M}_k)} \right) \alpha_j^{s} \int \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i, \mathcal{M}_j)\, d\theta
  = \left( \prod_{i=1}^{s} \tilde{p}(\mathcal{M}_j \mid y_i) \right) \alpha_j^{s} \int \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i, \mathcal{M}_j)\, d\theta.    (2.10)
The full dataset posterior can be expressed as

p(\mathcal{M}_j \mid y_1, \ldots, y_s) \propto \left( \prod_{i=1}^{s} \tilde{p}(\mathcal{M}_j \mid y_i) \right) \alpha_j^{s} \int \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i, \mathcal{M}_j)\, d\theta.    (2.11)

The immediate message from equation (2.11) is that the subposterior model probabilities p̃(Mj|y1), . . . , p̃(Mj|ys) are not sufficient to evaluate the global suitability of a model. The subposterior distributions of the parameters p̃(θ|y1, Mj), . . . , p̃(θ|ys, Mj) contain important additional information for discriminating between models.

To illustrate, we first consider a simple model choice problem. The dataset is from the 2013 Behavioural Risk Factor Surveillance System (BRFSS) survey run by the Centers for Disease Control and Prevention (Centers for Disease Control and Prevention, 2013). We examine the n = 481,939 responses for two general health questions. Respondents were asked how many hours per night they typically sleep. Respondents were also asked to rate their general health on a five point scale: Poor, Fair, Good, Very Good, Excellent. Figure 2.2 (a) plots the number of responses against hours of sleep. The majority of those surveyed report between 6 and 8 hours of sleep, with the overall counts decreasing as the hours of sleep move away from this range. The y-axis is on a log scale. Figure 2.2 (b) plots the proportion of respondents reporting 'Poor' health against hours of sleep. There appears to be a statistical association between the two variables. Atypical sleep habits, in terms of very low or very high hours of sleep, appear to be an indicator for self-reported 'Poor' health.

As a simple model choice exercise we fit two different logistic regression models to the data in (b). Let xi denote the hours of sleep for individual i, and yi be a binary indicator for health status, where yi = 1 if the reported health is 'Poor' and yi = 0 otherwise. The response is modelled as yi ∼ Bernoulli(σ(ηi)) for a linear predictor ηi, where σ(η) = 1/(1 + exp(−η)) is the inverse logit function. We compare a cubic model (M1) with four parameters to a more complex cubic spline model (M2) with 10 parameters. Under the simpler model M1, the linear predictor is given by

\mathcal{M}_1 : \eta_i = \beta_0 + x_i\beta_1 + x_i^2\beta_2 + x_i^3\beta_3.

For the cubic spline model M2 with K knots we set

\mathcal{M}_2 : \eta_i = \beta_0 + x_i\beta_1 + x_i^2\beta_2 + x_i^3\beta_3 + \sum_{k=1}^{K} \beta_{3+k}\, (x_i - \xi_k)_{+}^{3}.

We chose K = 6 knots manually. The knots ξ1, . . . , ξ6 were set at 3, 5, 7, 9, 11 and 13. As a brief aside, it is worth acknowledging that the dataset is from a complex health survey and that the Bayesian models were fit under the assumption that the data collection mechanism is ignorable (Gelman et al., 2014).

We conducted a divide and conquer analysis by splitting the data into s = 2 subsets, based on the xi values. Observations with xi ≤ 7 were placed in subset 1, and observations with xi > 7 were allocated to subset 2. This partition was a deliberate choice to highlight the influence of the split step on the combine step. Figure 2.3 shows the fitted models on the full dataset and the subsets. In each panel we plot the posterior predictive mean function from the respective analysis as a red line. Using the full dataset, the posterior predictive mean for a new response given covariates xnew is obtained by integrating over the posterior distribution of the coefficients:

E[y_{\mathrm{new}} \mid \mathcal{M}] = \int p(y_{\mathrm{new}} = 1 \mid x_{\mathrm{new}}, \beta, \mathcal{M})\, p(\beta \mid X, y, \mathcal{M})\, d\beta.    (2.12)
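In practice (2.12) is evaluated by Monte Carlo over posterior draws of the coefficients. The fragment below is a minimal illustration with fabricated draws (my own example, not the fitted BRFSS models): given B sampled coefficient vectors, the posterior predictive mean at a new covariate vector is the average of the fitted probabilities.

```python
import numpy as np

def predictive_mean(beta_draws, x_new):
    """Monte Carlo estimate of E[y_new | x_new] as in (2.12).

    beta_draws : (B, d) array of posterior draws of the coefficients
    x_new      : (d,) design vector for the new observation
    """
    eta = beta_draws @ x_new                     # linear predictor for each draw
    return np.mean(1.0 / (1.0 + np.exp(-eta)))   # average of sigma(eta)

# Illustrative use with fabricated posterior draws for a cubic model.
rng = np.random.default_rng(0)
beta_draws = rng.normal(scale=0.01, size=(2000, 4))
x_new = np.array([1.0, 7.0, 7.0**2, 7.0**3])     # intercept, x, x^2, x^3 at 7 hours
print(predictive_mean(beta_draws, x_new))
```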
Let X^(i) denote the design matrix for data subset i, for i = 1, 2. For the subset results, the posterior predictive mean for a new response given covariates xnew is obtained by integrating over the subposterior distribution of the coefficients:

E[y_{\mathrm{new}} \mid \mathcal{M}] = \int p(y_{\mathrm{new}} = 1 \mid x_{\mathrm{new}}, \beta, \mathcal{M})\, \tilde{p}(\beta \mid X^{(i)}, y_i, \mathcal{M})\, d\beta.    (2.13)

The shaded ribbons in each plot give 90 percent credible intervals for the posterior predictive mean at a given point. As expected, there is more uncertainty in the regions where we have fewer data points. From panel (a) in Figure 2.2, the overwhelming majority of respondents report between 4 and 12 hours of sleep, and the shaded ribbon is tightly concentrated around the posterior predictive mean in this range. Looking at the full dataset results in Figure 2.3 it is apparent that the spline model is the more appropriate model. The subset results are less definitive. Visually, the fits of the spline and cubic models look comparable in each subset. In particular, the posterior mean functions and credible intervals look nearly identical in subset 1.

Table 2.2 reports the full dataset posterior model probabilities and the subposterior model probabilities. The full dataset posterior puts mass one on the cubic spline model; this is not surprising given that the cubic model fits the observed data very poorly in the tails of the design space (hours of sleep). The subset results do not reflect this. In subset 1, the cubic model provides a seemingly identical fit to the spline model using the available data. As the cubic model is more parsimonious, it is favoured in the subposterior distribution on models, with p̃(M1|y1, X^(1)) = 0.84. In subset 2 the extra parameters of the cubic spline model give a better fit to the available data. As such, the spline model is strongly favoured in the second minibatch analysis. The subposterior probability for the cubic model, p̃(M1|y2, X^(2)), is zero in the second subset. It is difficult to intuitively reconcile the subset results with the full dataset results when only given subposterior and posterior model probabilities. We freely admit that this example has been artificially engineered to be melodramatic. We wished to demonstrate that the choice of partition in the split step has a large downstream effect on the difficulty in the combine step. Additionally, we would like to highlight the role of subposterior overlap as a goodness of fit diagnostic in the combine step.

                   Posterior distribution   Subposterior 1   Subposterior 2
Cubic model M1     0.00                     0.84             0.00
Spline model M2    1.00                     0.16             1.00

Table 2.2: Posterior and subposterior model probabilities for the sleep dataset. The subposterior results are in conflict despite very clear results in a full dataset analysis.

As dictated by (2.11), to properly synthesise the subset results it is necessary to look at the subposterior distributions of the parameters. The most direct interpretation of the integral term is an expectation over an arbitrary subposterior:

\int \prod_{i=1}^{s} \tilde{p}(\theta \mid y_i, \mathcal{M}_j)\, d\theta = E_{\tilde{p}(\theta \mid y_i, \mathcal{M}_j)}\left[ \prod_{k \neq i} \tilde{p}(\theta \mid y_k, \mathcal{M}_j) \right].    (2.14)

Figure 2.3: Comparison of cubic model and spline model on the full sleep dataset and on the s = 2 subsets.
The red line gives the posterior predictive mean function. The shaded ribbons give a 90 percent credible interval for the posterior predictive mean. The full dataset contains n = 481,939 observations. Subset 1 contains n1 = 300,177 observations, and subset 2 contains n2 = 181,762 observations. The full dataset results show that the spline model provides a much better fit to the observed data. The cubic model does not fit well in the extreme regions of feature space (hours of sleep). The subset results in isolation do not make this as clear. Model 1 and Model 2 give nearly identical fits in subset 1. The results seem comparable in subset 2. The results in the apply stage may not be representative of a single full dataset analysis. The initial partition of the dataset in the split stage appears to influence the difficulty of the evidence synthesis task in the combine stage.

Qualitatively speaking, the expectation will be large if the subposterior distributions are similar across subsets, and the expectation will be small if the subposterior distributions are dissimilar. A high-level interpretation is that the integral checks for consistency of parameter estimates across different subsets. Figure 2.4 compares the marginal subposterior distributions on each parameter in the cubic model. The non-overlapping subposteriors are indicative of poor goodness of fit, which is clear in the full dataset results in Figure 2.3. The example shows that the subposterior integral (2.14) has an important role in divide and conquer model selection. The subset model probabilities in isolation do not give enough information to reconstruct the target posterior distribution over models. Distributed model selection requires a pooling rule in the combine stage that takes this into account.

[Figure 2.4: four panels, one per coefficient β0, β1, β2, β3, plotting subposterior density against parameter value.]

Figure 2.4: Subposterior distributions of parameters in the cubic model. The solid line denotes the first subposterior p̃(β | y1). The red dashed line denotes the second subposterior p̃(β | y2). The disparate subposterior distributions are a reflection of the poor overall fit of the model. Subposterior overlap is an important diagnostic in the combine stage for model selection.

It is quite straightforward to make an approximation to the subposterior integral under the assumption of Gaussian subposteriors. Normal approximations to the subposterior distributions are used in the fixed model divide and conquer work by Huang and Gelman (2005) and Scott et al. (2013). However, it is of interest to identify general strategies that do not rely on subposterior normality, as quantifying the error from making normal approximations can be difficult (Bierkens et al., 2016). Furthermore, a general solution can lead to a deeper understanding of the problem. The combine step in divide and conquer model selection involves estimation of the subposterior integral given the output from the apply stage. For later reference we denote the subposterior integral as Isub, where

Isub = ∫ ∏_{i=1}^{s} p̃(θ | yi, M) dθ.    (2.15)

In section 2.4 we establish an important link between the subposterior integral and the Savage-Dickey density ratio. The connection explains how the subposterior integral measures subposterior overlap from a Bayesian perspective, and lays out how the combine step implicitly synthesises the support for a model given the output from the apply stage.
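As a brief illustration of the Gaussian shortcut mentioned above: if each subposterior is approximated by a normal density N(μi, Σi), the product of the s densities is an unnormalised Gaussian, so Isub has a closed form. A minimal sketch, assuming the subposterior means and covariances have already been estimated from the subset MCMC output (the function name and interface are ours):

import numpy as np
from scipy.stats import multivariate_normal

def log_I_sub_gaussian(means, covs):
    # means: list of s subposterior mean vectors (NumPy arrays of length d)
    # covs:  list of s subposterior covariance matrices (d x d)
    precisions = [np.linalg.inv(C) for C in covs]
    prec_star = sum(precisions)
    cov_star = np.linalg.inv(prec_star)
    mean_star = cov_star @ sum(P @ m for P, m in zip(precisions, means))
    # The product of the s Gaussian densities is an unnormalised Gaussian with
    # mean mean_star and covariance cov_star; its normalising constant, which
    # equals I_sub, is recovered by evaluating both sides at theta = mean_star.
    log_prod = sum(multivariate_normal.logpdf(mean_star, m, C)
                   for m, C in zip(means, covs))
    return log_prod - multivariate_normal.logpdf(mean_star, mean_star, cov_star)

The approaches developed below aim to estimate Isub without relying on this normality assumption.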
The theoretical connection also leads to simple Monte Carlo procedure for estimating the integral involving kernel density estimation. Although simulation consistent, the estimator scales poorly with the number of subsets s as it involves kernel density estimation in high- dimensional spaces. To remedy this problem we propose an alternative strategy using data augmentation and Gibbs sampling in section 2.5. Although limited to certain models, the second algorithm gives a significantly better estimator of the subposterior integral (2.15) in the combine step. Section 2.3 reviews some important background theory. 14 2.3. Background Dimension Sample size 1 4 2 19 3 67 4 223 5 678 6 2,790 7 10,700 8 43,700 9 187,000 10 842,000 Table 2.3: Required sample size to attain a relative mean square error of less than 0.1 using nonparametric kernel density estimation against dimension. The generative model is a d-dimensional multivariate normal distribution with an identity covariance matrix. The density is estimated at zero using a Gaussian kernel. The bandwidth is chosen to minimise the asymptotic mean square error at zero. The required sample size increases rapidly with the dimension d. Reproduced from Silverman (1986, Table 4.2) 2.3 Background 2.3.1 Integrated likelihood calculation Estimation of the integrated likelihood has been a long standing problem in Bayesian computation. Rearranging Bayes’ theorem, an important general relationship is that for any ordinate θ ∈ Ω, log p(y|M) = log p(y|θ,M) + log p(θ|M)− log p(θ|y,M). (2.16) The identity (2.16) is often referred to as ‘Candidate’s formula’ (Besag, 1989). If the posterior density can be obtained at the point θ then we have a simple way to obtain the integrated likelihood. Given samples from the posterior distribution, it is possible to use a non-parametric kernel density estimator to estimate the density at θ (Raftery, 1996; Lewis and Raftery, 1997). A problem with this approach is that kernel density estimation is subject to the curse of dimensionality, with an asymptotically optimal pointwise mean squared error rate of O(B−4/(d+4)), where B is the number of independent posterior samples (Terrell and Scott, 1992). The bound is under the assumption that we use a second order kernel. Most common kernels (eg. Gaussian, Epanechnikov, Uniform) are second order kernels (Terrell and Scott, 1992). Use of kernel density estimation is therefore limited to models with a very small number of parameters. The impact of the curse of dimensionality is vividly presented by Silverman (1986, section 4.5.2). Silverman considers nonparametric kernel density estimation where the true density f is a d-dimensional multivariate Gaussian with identity covariance matrix. The task is to estimate the true density at zero f(0) using a Gaussian kernel. Silverman sets the single bandwidth parameter h to minimise the asymptotic pointwise mean squared error. Suppose the goal is to ensure the relative mean square error E[(f̂(0)−f(0))2]/[f(0)2] is less than 0.1. Silverman calculates the required sample size to attain this error tolerance at various dimensions d. Table 2.3 reports the required sample size against dimensionality. The required sample size increases rapidly with d. Nonparametric kernel density estimation is very hard in high-dimensional spaces. 2.3.2 Savage-Dickey density ratio The Savage-Dickey density ratio is a useful identity for computing Bayes factors that avoids the need to compute integrated likelihoods (Dickey, 1971) directly. 
For our purposes, we are concerned with the use of the Savage-Dickey density ratio to test linear restrictions on parameters. The Savage-Dickey density ratio for testing linear restrictions is discussed in Wetzels et al. (2010) and McCulloch and Rossi (1992). We present the approach as in McCulloch and Rossi (1992). The Savage-Dickey identity expresses Bayes factors as a density ratio involving posterior and prior distributions.

Suppose we have a model MS with parameter θS ∈ ΩS. Let L(·) = f(y | ·) denote the likelihood function, where the data y is treated as fixed. Let ψ represent a linear function of θS, so ψ = A θS for some matrix of known coefficients A. Let Ω0 ⊂ ΩS denote the subspace of ΩS where ψ = 0. Let θ0 denote an arbitrary element of Ω0. We wish to test the linear restriction H0 : ψ = 0. The Bayesian hypothesis test requires a prior distribution on ΩS and a prior distribution on Ω0. Let P denote the prior probability measure on ΩS and let P0 denote the prior probability measure on Ω0. Let M0 denote the restricted model where ψ = 0. The evidence under each model can be expressed as

p(y | M0) = ∫_{Ω0} L(θ0) dP0(θ0),    p(y | MS) = ∫_{ΩS} L(θS) dP(θS).

We integrate with respect to the prior measures, as P0 will not have a density with respect to Lebesgue measure. The Bayes factor in favour of the null hypothesis, B01, is given by

B01 = p(y | M0) / p(y | MS)    (2.17)
    = ∫_{Ω0} L(θ0) dP0(θ0) / ∫_{ΩS} L(θS) dP(θS).    (2.18)

The Savage-Dickey density ratio occurs when we set the prior measure P0 to be the conditional distribution of θS given that ψ = 0 under the encompassing model prior P. Let P(· | ψ = 0) be the regular conditional distribution of θS given that ψ = 0 under P. Regular conditional distributions are discussed in Billingsley (1999). Taking P0(·) = P(· | ψ = 0) gives that

p(y | M0) = ∫_{Ω0} L(θ0) dP0(θ0) = ∫_{Ω0} L(θ0) dP(θ0 | ψ = 0).    (2.19)

Integrating with respect to the conditional distribution P(· | ψ = 0) also defines a conditional marginal likelihood under MS,

p(y | ψ = 0, MS) = ∫_{Ω0} L(θ0) dP(θ0 | ψ = 0).    (2.20)

Substituting (2.20) and (2.19) into (2.17) gives another expression for the Bayes factor:

B01 = p(y | ψ = 0, MS) / p(y | MS).    (2.21)

Dickey (1971) demonstrated that the conditional evidence p(y | ψ = 0, MS) can be expressed in terms of the posterior distribution of ψ = A θS under the unconstrained model MS. The conditional evidence satisfies

p(y | ψ = 0, MS) = p(ψ = 0 | y, MS) p(y | MS) / p(ψ = 0 | MS).    (2.22)

This does not follow immediately from Bayes’ theorem, as the event ψ = 0 is a set of measure zero under P. The relationship (2.22) follows using the fact that P(· | ψ = 0) is a regular conditional distribution. Substituting (2.22) into (2.21) gives

B01 = [ p(ψ = 0 | y, MS) p(y | MS) / p(ψ = 0 | MS) ] × 1 / p(y | MS)    (2.23)
    = p(ψ = 0 | y, MS) / p(ψ = 0 | MS).    (2.24)

Figure 2.5: Target Bayesian model. Subsets are conditionally independent given the model M and the parameters θ, but conditionally dependent given only the model due to the shared dependence on θ.

The Bayes factor B01 is given by the ratio of the posterior density of ψ to the prior density of ψ under the unconstrained model MS. Evaluation of the density ratio is often easier than direct evaluation of the integrals in (2.18). A point of interest is that the Savage-Dickey density ratio gives the integrated likelihood for the restricted model M0 in terms related to the unconstrained model MS,

p(y | M0) = p(y | MS) p(ψ = 0 | y, MS) / p(ψ = 0 | MS).
(2.25) In many situations fitting the unconstrained model MS will be considerably easier than fitting the constrained modelM0. The identity (2.25) gives a useful alternative route for computing the integrated likelihood p(y|M0) in situations where it is difficult to fit the constrained modelM0. In many situations we will be able to compute the evidence for the unconstrained modelMS and generate posterior samples from MS . Given posterior and prior samples of ψ we can use nonparametric kernel density estimation to estimation the density ratio in (2.24). This is assuming that the dimension of ψ is low, as the curse of dimensionality will rule out the feasibility of this approach in high-dimensions (Table 2.3). The evidence for the unconstrained model combined with the density ratio then give the evidence for the actual model of interest M0. We will show the relationship in (2.25) has an important connection to the desired split-apply-combine methodology in Figure 2.1. We have a Big Data problem where it is difficult to fit the desired model on a single machine due to computational constraints so we adopt a divide and conquer strategy. In the next section we show how the split and apply steps effectively correspond to fitting an unconstrained model with an expanded parameter space. In the combine step we can compute a Bayes factor that connects the unconstrained model to the original model of interest. The analysis gives a Bayesian formalism to the relationship between the apply and combine steps in the embarrassingly parallel algorithm. 2.4 General Bayesian models 2.4.1 Introduction Figure 2.5 diagrams the target model that we wish to use for Bayesian inference. When analysing the dataset using an embarrassingly parallel algorithm we obtain a collection of s subposterior distributions in the apply stage. We assume that worker returns the subposterior evidence p˜(yi) and samples from the subposterior p˜(θ|yi) as output during apply stage for i = 1, . . . , s. As demonstrated in the sleep dataset example in section 2.2.2 subposterior overlap is an important consideration in the combine step. Recall from (2.5) and (2.15) the following identity for the target model evidence, p(y1, . . . ,ys|M) = ( s∏ i=1 p˜(yi|M) ) αs ∫ s∏ i=1 p˜(θ|yi,M) dθ. (2.26) = ( s∏ i=1 p˜(yi|M) ) αs × Isub. (2.27) 17 2. Computing the model evidence using parallel processing To recover the target model evidence it is necessary evaluate the subposterior integral Isub. In the case of s = 2 subsets, it is possible to see connection between the integral and a pointwise density measurement of subposterior overlap. To do so we rewrite the definition of the subposterior integral more explicitly as Isub = ∫ θ∗∈Ω s∏ i=1 p˜(θ = θ∗|yi,M) dθ∗. (2.28) The extra notation is introduced as later it will be necessary to distinguish between the random variable in the target model θ and an arbitrary ordinate in the parameter space θ∗. Each subposterior analysis returns a measure over the parameter space Ω. In (2.28) the variable of integration is now represented by the point θ∗ in the parameter space. Each subposterior density p˜(θ|y1), . . . , p˜(θ|ys) expresses different beliefs on the unknown parameter θ. Each subposterior thus assigns a different density to θ∗, and Isub is calculated by taking the product over the s subsets and accumulating over the parameter space Ω. With s = 2 subsets, we can see an immediate connection to the convolution of the two density functions p˜(θ|y1) and p˜(θ|y2). 
The convolution gives the density function of a random variable that can be defined as a difference of independent samples from each subposterior. Let S1 and S2, be independent random variables with support Ω. Let the density function of S1 be given by the subposterior density from the first subset p˜(θ|y1) and the density function of S2 be given by the subposterior density from the second subset p˜(θ|y2). Now define the random variable D = S1 − S2. The density of the difference random variable D at a point r is given by fD(r) = ∫ θ∗∈Ω p˜(θ = θ∗|y1,M)p˜(θ = θ∗ − r|y2,M) dθ∗. The density at 0 is equal to the subposterior integral Isub, fD(0) = ∫ θ∗∈Ω p˜(θ = θ∗|y1,M)p˜(θ = θ∗|y2,M) dθ = ∫ θ∗∈Ω 2∏ i=1 p˜(θ = θ∗|yi,M) dθ∗ = Isub. (2.29) If S1 and S2 are distributed similarly, then we expect the difference D to be concentrated around zero. If the subposterior distributions are dissimilar then zero will be in the tails of the distribution of D. To illustrate, we consider another simple model choice problem. The dataset is from an experiment to detect extra sensory perception (ESP), first reported in Jahn et al. (1987). The experimenters constructed a random number generator using a radioactive source. The random number generator was calibrated to output a random sequence of zeroes and ones with equal probability. A subject was asked to psychically alter the sequence of random numbers using extra sensory capabilities. There are n = 104, 490, 000 observed random numbers which can be modelled as n independent Bernoulli(θ) events. This dataset is also considered in Bernardo et al. (2011). Table 2.4 reports the number of ones in the observed sequence (successes). The fraction of ones in the dataset is slightly higher than 0.5. We compare two models,M1 : θ = 0.5, corresponding to no support for ESP andM2 : θ 6= 0.5, which allows for the possibility for ESP. We use a flat Beta(1, 1) prior on the proportion θ for model 2. The ESP dataset is commonly used to illustrate differences between Bayesian and Frequentist hypothesis testing. Suppose we assign equal prior weight to each model. The posterior probability of M1 : θ = 0.5 is then 0.92. The p-value for testing θ = 0.5 is 0.0003017. The Bayesian approach favours the simpler model M1 : θ = 0.5, when the frequentist analysis rejects the null hypothesis (θ = 0.5) at the usual 5 percent level of significance. Here default Bayesian and frequentist procedures lead to opposite conclusions regarding the existence of extra sensory perception. We use the ESP dataset to illustrate divide and conquer inference as it is a simple large n dataset where the posterior model probabilities are not highly concentrated around zero or one. 18 2.4.2. Subset saturated model Full dataset Subset 1 Subset 2 Trials 104,490,000 52,245,000 52,245,000 Successes 52,263,471 26,137,735 26,125,736 Success proportion 0.5001768 0.5002916 0.5000619 Table 2.4: Full ESP dataset and subsets for a divide and conquer approach with s = 2. The success proportion is close to 0.5 in the full dataset and each subset. We split the full dataset into two subsets, as described in Table 2.4. Let y1 denote the data for subset 1 and y2 denote the data in subset 2. We focus on the divide and conquer analysis of model 2, where we estimate the success probability. Let ni and ri denote the number of trials and successes respectively for subset i = 1, 2. 
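As an aside, the full dataset comparison quoted above (posterior probability 0.92 for M1, p-value 0.0003) can be reproduced directly from the counts in Table 2.4, since both marginal likelihoods are available in closed form; a minimal sketch (the variable names are ours):

import numpy as np
from scipy.special import betaln
from scipy.stats import norm

n, r = 104_490_000, 52_263_471        # trials and successes, Table 2.4

# Log marginal likelihoods; the binomial coefficient is common to both models and cancels.
log_m1 = n * np.log(0.5)              # M1: theta fixed at 0.5
log_m2 = betaln(r + 1, n - r + 1)     # M2: theta ~ Beta(1, 1)

post_m1 = 1.0 / (1.0 + np.exp(log_m2 - log_m1))   # posterior probability of M1 under equal prior weights, roughly 0.92

z = (r - 0.5 * n) / np.sqrt(0.25 * n) # normal approximation to the binomial test of theta = 0.5
p_value = 2.0 * norm.sf(abs(z))       # roughly 3e-4, matching the value quoted above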
The subpriors for θ are both uniform Beta(1, 1), since a fractionated uniform density is proportional to a uniform density: [Beta(1, 1)(θ)]^{1/2} = 1_{θ ∈ [0,1]} = Beta(1, 1)(θ). Given Beta(1, 1) subpriors, each subposterior is a Beta(ri + 1, ni − ri + 1) density for i = 1, 2. Figure 2.6 plots the two subposteriors. As each subposterior is a Beta density it is possible to work out the subposterior integral analytically. Let ai = ri + 1, bi = ni − ri + 1, a* = (∑_{i=1}^{s} ai) − (s − 1) and b* = (∑_{i=1}^{s} bi) − (s − 1). The subposterior integral satisfies

log Isub = log B(a*, b*) − ∑_{i=1}^{s} log B(ai, bi),    (2.30)

where B(a, b) is the Beta function. Using the data partition in Table 2.4, we calculate Isub = 259.64. Panel (b) shows the density of D = S1 − S2, where S1 and S2 are distributed according to p̃(θ | y1) and p̃(θ | y2) respectively. The value of the subposterior integral Isub = 259.64 is plotted as a dashed horizontal line and the vertical dotted line has an x intercept of zero. Looking at the intersection we see that the theoretical density of D at zero is equal to Isub, giving an empirical verification of (2.29) for this example.

The convolution by inspection argument leading to (2.29) was non-constructive, and it is not clear how to generalise the argument to an arbitrary number of subsets s. Subposterior overlap clearly has a role in the combine step, although it is not apparent if this is an inherently Bayesian approach to the evidence task in the final stage. Another open question is the significance of the subprior normalising constant α in the representation (2.5). We take a more general approach in the next subsection that addresses these issues.

2.4.2 Subset saturated model

In the divide and conquer analysis, each data subset yi is modelled separately using the subprior p̃(θ). This Big Data modelling strategy is adopted for purely computational reasons. This strategy can also be motivated through an inferential argument. Laying out the inferential argument leads to a Savage-Dickey ratio representation in the combine step. Consider an alternative hierarchical model, diagrammed in Figure 2.7, where each data subset is now allocated an independent set of parameters. As before, suppose that the arbitrary model M is indexed by parameter θ, where θ ∈ Ω. For each model M we introduce an expanded version MS that has a separate set of parameters for each subset. In the alternative hierarchical model we have an expanded parameter space. Let ΩS = ∏_{i=1}^{s} Ω. Let θS ∈ ΩS denote the parameter for the expanded model. Then θS = (θ(1), . . . , θ(s)) ∈ ΩS. Suppose the original model M is a regression model with a particular set of explanatory variables. The expanded model MS will use the same set of covariates in each subset, but allows for different coefficients

[Figure 2.6: panel (a) shows the two subposterior densities p̃(θ | y1) and p̃(θ | y2); panel (b) shows the density of D = S1 − S2, with a dashed horizontal line at the value of Isub.]

Figure 2.6: Illustration of the subposterior density identity using the ESP dataset (s = 2). The full dataset and data subsets are summarised in Table 2.4. (a) Subposterior densities. (b) Density of the difference variable D. In (b) the horizontal dashed line gives the value of Isub. The intersection of the dashed lines in (b) illustrates the density identity (2.29).

Figure 2.7: Alternative hierarchical Bayesian model (subset saturated model).
Each subset yi is allocated an independent set of parameters θ(i). Subsets are conditionally independent gives the model indicator. The split and apply steps in the divide and conquer procedure are equivalent to defining and fitting the subset saturated model. in each subset. It can be argued that split and apply steps in the embarrassingly parallel algorithm are effectively the same as fitting the alternative hierarchical model with subset specific parameters. The alternative hierarchical model is conceptually similar to the saturated model that is used when calculating the deviance in a generalised linear model. The saturated model in the context of generalised linear models allows for a unique parameter for each observation and provides the best possible goodness of fit. In the divide and conquer world, the expanded model MS allows for a unique parameter for each subset, and allows extra adaptation compared to the desired model M. As such we will refer to MS as the subset saturated model, and the hierarchical model in Figure 2.7 as the hierarchical subset saturated model. 2.4.3 Split and apply steps We make some assumptions about the subset saturated model to link it concretely to the split and apply steps that take place in the embarrassingly parallel algorithm (see Figure 2.1). Assumption 1: Subset likelihoods are the same as in the target model. That is for all θ∗ ∈ Ω, p(yi|θ(i) = θ∗,MS) = p(yi|θ = θ∗,M) for i = 1, . . . , s. Assumption 2: The subset specific parameters have independent priors, so the joint prior in the subset 20 2.4.3. Split and apply steps saturated model can be written as p(θ(i), . . . ,θ(s)|MS) = s∏ i=1 p(θ(i)|MS). Assumption 3: Each subset prior is given by the subprior used in a divide and conquer analysis (see Table 2.1). That is for all θ∗ ∈ Ω, p(θ(i) = θ∗|MS) is equal to p˜(θ = θ∗|M) for i = 1, . . . , s. The subset saturated model is a bona fide Bayesian model for any choice of prior distribution on the subset specific parameters θ(1), . . . ,θ(s). Assumptions 2 and 3 are needed to make a connection to the divide and conquer procedure. Assumption 1 is needed to be completely specify the model. Let L (i) S (θ) give the likelihood function for subset i under the subset saturated model (Figure 2.7). Let L(i)(θ) give the likelihood function for subset i under the target model M (Figure 2.5). From Assumption 1 L (i) S (θ) = L (i)(θ) for all θ ∈ Ω. Under assumptions 1,2 and 3, the posterior distribution of the subset specific parameters θ(1), . . . ,θ(s) in the subset saturated model is closely related to the subposterior output that is is generated in the apply phase. The underlying space of interest is Ω. The apply stage in the embarrassingly parallel algorithm returns s belief distributions over Ω. The subset saturated model also defines s belief distributions on Ω through the s marginal distributions on the subset specific parameter θ(i) for i = 1, . . . , s. We can show a measure theoretic equivalence between the distribution sets of the embarrassingly parallel algorithm and the marginal distributions of the subset saturated model under assumptions 1,2 and 3. Let F denote the Borel σ-algebra on Ω. Each worker in the apply step is given an initial belief distribution on Ω given by the subprior density p˜(θ|M). Let P˜ (i) denote the subprior distribution for worker i for i = 1, . . . , s. In the subset saturated model we have a collection of s prior distributions on Ω, one for each subset specific parameter. 
By assumption 3 each subset specific prior is set to the subprior distribution. Specifically, we have that for all sets F ∈ F : P˜ (i)(F ) = P (i)(F ) i = 1, . . . , s. (2.31) In the embarrassingly parallel algorithm each worker learns updated beliefs on Ω during the apply step. Let P˜ (i) pi denote the ith subposterior distribution on Ω. The subset saturated model has parameter space ΩS = ∏s i=1 Ω. Let P denote the prior distribution on ΩS under the subset saturated model and let Ppi denote the posterior distribution on Ωs under the subset saturated model. Let P (i) pi be the marginal posterior distribution on θ(i) under the subset saturated model for i = 1, . . . , s. Using the subset saturated model we also obtain a collection of beliefs on Ω through the s marginal distributions P (i) pi for i = 1, . . . , s. From Assumption 1 and 3 we have that P˜ (i)pi (F ) = P (i) pi (F ) i = 1, . . . , s. (2.32) Let µ denote Lebesgue measure on Rd where d is the dimension of Ω. Under Assumptions 1 and 3 both measures have the same Radon-Nikodym derivative with respect to Lebesgue measure, P˜ (i) pi dµ = L (i) S p˜(θ) = L (i)(θ)p˜(θ) P (i) pi dµ = L(i)(θ)p˜(θ). This is sufficient to establish (2.32). The split and apply steps in the embarrassingly parallel algorithm (see Figure 2.1) are effectively the same as building and computing the posterior distribution for the subset saturated model (Figure 2.7). The subset saturated model is also useful for describing the role of the the subposterior integral (2.15) in the combine step. 21 2. Computing the model evidence using parallel processing 2.4.4 Combine step The ideal Bayesian procedure in the combine step can be identified by considering a linearly restricted version of the subset saturated model (Figure 2.7). The key observation is that the target model M is effectively nested within the subset saturated model MS under the linear restriction that θ(1) = θ(2) = . . . = θ(s). Let ψ denote a d(s − 1) dimensional vector that captures differences in parameter values across subsets, specifically ψ =  θ(1) − θ(2) θ(1) − θ(3) ... θ(1) − θ(s)  . (2.33) Under the linear restriction ψ = 0 the structure of the subset saturated model diagrammed in Figure 2.7 collapses to that of the target hierarchical model diagrammed in Figure 2.5. We will argue that the model evidence for the linearly restricted version of the subset saturated model is equal to the model evidence for the target model under appropriate assumptions. There are some measure theoretic considerations when working with the linearly restricted version of the subset saturated model. Again let P denote the prior measure on ΩS = ∏s i=1 Ω given that assumptions 1,2 and 3 hold. Let Ω0 be the linear subspace of ΩS defined by the linear restriction ψ = 0. Let P (·|ψ = 0) denote the regular conditional distribution of θS given that ψ = 0 under P . Let M0 denote the linearly restricted model where θS ∈ Ω0. Let θ0 denote an element of Ω0. The linearly restricted model M0 is not completely equivalent to the target model M as the parameter spaces are different. Model M0 puts a prior distribution P0 on Ω0 ⊂ ∏s i=1 Ω whereas the target model M puts a prior distribution on Ω. Suppose we set P0 ≡ P (·|ψ = 0). Let P (1)0 be the marginal distribution on θ(1) under P0. Let PM be the prior distribution on Ω under the target model. The measure PM has density p(θ|M) with respect to Lebesgue measure on Rd. Assumptions 1,2 and 3 have been selected so that P (1)0 and PM give the same distribution on Ω. 
More formally, we can show that P (1) 0 (F ) = PM(F ), (2.34) for all sets F ∈ F where F is the Borel σ-algebra on Ω. Let L0(θ0) denote the likelihood function for the linearly restricted model M0 and let L(θ) denote the likelihood function for the target model M. Suppose that we have some θ0 = (θ (1), . . . ,θ(s)) ∈ Ω0. On Ω0 all subset specific parameters are equal. If θ0 ∈ Ω0 then L0(θ0) = L(θ(1)). This follows from Assumption 1, the subset specific likelihoods are the same as the target model likelihood for each subset.The model evidence for M0 is directly expressed as an integral over Ω0. We can also express the model evidence forM0 as an integral over Ω by considering the marginal distribution of θ(1) under P0. p(y1, . . . ,ys|M0) = ∫ θ0∈Ω0 L0(θ0) dP0(θ0) = ∫ θ∈Ω L(θ) dP (1) 0 (θ). (2.35) The evidence for the linearly restricted model M0 can also be obtained by integrating the target model likelihood function L(θ) with respect to P (1) 0 (θ). Under assumptions 1,2 and 3 we have the equality (2.34). We can therefore substitute the target model prior distribution PM(F ) into (2.35). The target model prior distribution PM(·) has density p(θ|M) with respect to Lebesgue measure. The evidence for 22 2.4.4. Combine step the linearly restricted model is then shown to be equal to the evidence for the target model: p(y1, . . . ,ys|M0) = ∫ θ∈Ω L(θ) dPM(θ) (2.36) = ∫ θ∈Ω p(y1, . . . ,yS |M)p(θ|M) dθ (2.37) = p(y1, . . . ,ys|M). (2.38) Using the Savage-Dickey density ratio (2.22), we can express the model evidence for the linearly restricted model M0 in terms of the subset saturated model MS : p(y1, . . . ,ys|M0) = p(y1, . . . ,ys|ψ = 0,MS) = p(y1, . . . ,ys|MS)p(ψ = 0|y1, . . . ,ys,MS) p(ψ = 0|MS) . (2.39) Now using the fact that p(y1, . . . ,ys|M0) = p(y1, . . . ,ys|M), we have the following identity relating the target model evidence to the subset saturated model: p(y1, . . . ,ys|M) = p(y1, . . . ,ys|MS)p(ψ = 0|y1, . . . ,ys,MS) p(ψ = 0|MS) . (2.40) We will show that each of the terms on the right hand side of (2.40) can be estimated using a split-apply- combine algorithm. We assume that each worker can compute the subposterior evidence score p˜(yi|M) for i = 1, . . . , s. The local evidence on each batch of data can be computed using existing techniques. Existing methods for evidence estimation are discussed in Chapter 3. We now show how the density ratio can be estimated in the combine stage using subposterior samples generated in the apply stage. The posterior distribution p(ψ|y1, . . . ,ys,MS) is effectively given by the subposterior distributions in a divide and conquer analysis. The prior distribution p(ψ|MS) is effectively given by the subprior distribution p˜(θ) (see Table 2.1). Define the random variables Si ∼ p˜(θ|yi,M) and Vi ∼ p˜(θ,M) for i = 1, . . . , s. The random variable Si is distributed according to the ith subposterior, and the random variable Vi is distributed according to the ith subprior. Now define the random vectors D and D0 as D =  S1 − S2 . . . S1 − Ss  , D0 =  V1 − V2 . . . V1 − Vs  . (2.41) Then D ∼ p(ψ|y1, . . . ,ys,MS) and D0 ∼ p(ψ|MS). We can sample from the subprior and subposterior distributions in the apply step. Using the definitions in (2.41) we can sample from p(ψ|y1, . . . ,ys,MS) and p(ψ|MS) in the combine step by pooling the subposterior and subprior samples. The remaining term on the right hand side of (2.39) is the evidence for the subset saturated model p(y1, . . . ,ys|MS). 
This can easily be computed given the subposterior evidence values p̃(y1 | M), . . . , p̃(ys | M). The model evidence for the subset saturated model is the product of the subposterior evidence scores. Specifically,

p(y1, . . . , ys | MS) = ∫_{ΩS} L(θS) dP(θS)
 = ∫_{ΩS} ( ∏_{i=1}^{s} L_S^{(i)}(θ(i)) ) ∏_{i=1}^{s} dP(i)(θ(i))
 = ∏_{i=1}^{s} ∫_{Ω} L_S^{(i)}(θ) dP(i)(θ)
 = ∏_{i=1}^{s} ∫_{Ω} L^{(i)}(θ) dP̃(θ)
 = ∏_{i=1}^{s} p̃(yi | M).

Substituting the previous result into (2.39), we obtain the following identity for the full dataset integrated likelihood:

p(y1, . . . , ys | M) = ( ∏_{i=1}^{s} p̃(yi | M) ) p(ψ = 0 | y1, . . . , ys, MS) / p(ψ = 0 | MS).    (2.42)

The density ratio in (2.42) can be interpreted as the posterior to prior odds of observing ψ = 0 in the subset saturated model. If the model is appropriate for the entire dataset, we expect to see similar subposterior distributions on θ over subsets, and the posterior density of ψ should be concentrated around zero. What exactly constitutes a ‘high’ level of agreement is determined by comparing to the prior density of ψ at zero. The ratio of the two densities has an interpretation as a Bayes factor in the subset saturated model. Recall the identity in the introduction,

p(y1, . . . , ys | M) = ( ∏_{i=1}^{s} p̃(yi | M) ) α^s ∫ ∏_{i=1}^{s} p̃(θ | yi, M) dθ.    (2.43)

Comparing (2.42) to (2.43) we see a correspondence between the density ratio and the subposterior integral and subprior normalising constant. Specifically, it can be shown that

p(ψ = 0 | y1, . . . , ys, MS) = ∫ ∏_{i=1}^{s} p̃(θ | yi) dθ = Isub,    (2.44)
p(ψ = 0 | MS) = α^s.    (2.45)

Equation (2.44) is the generalisation of the density identity for s = 2 given in (2.29). Equation (2.45) helps to explain why the subprior normalising constant α appears in the full model evidence decomposition (2.43). The ratio of Isub to α^s measures the consistency of parameter estimates across subsets. The split-apply-combine approach for computing the evidence can be viewed as defining a misspecified model in the split step, followed by fitting the misspecified model in the apply step. The misspecified model can be viewed as a subset saturated model where we have introduced extra parameters. The combine step then corrects for the misspecification by computing an adjustment that corresponds to a Bayes factor linking the subset saturated model to the target model.

2.4.5 Example: ESP dataset

To illustrate the results in the previous subsection, we present another analysis of the ESP dataset. We split the dataset into three subsets, as described in Table 2.5. Let yi denote the data in subset i for i = 1, 2, 3. Additionally, let ni denote the number of trials in subset i and ri denote the number of successes for i = 1, 2, 3.

                      Full dataset    Subset 1      Subset 2      Subset 3
Trials                104,490,000     34,830,000    34,830,000    34,830,000
Successes             52,263,471      17,421,157    17,415,157    17,427,157
Success proportion    0.5001768       0.5001768     0.5000045     0.5003490

Table 2.5: Full ESP dataset and subsets for a divide and conquer approach with s = 3. The success proportion is close to 0.5 in the full dataset and in each subset.

[Figure 2.8: panel (a) shows the three subposterior densities of θ; panels (b) and (c) show contour plots of the posterior and prior distributions of ψ = (ψ1, ψ2), with contours labelled by their log density.]

Figure 2.8: (a) Subposterior distributions for s = 3 subsets. (b) Posterior distribution of ψ.
(c) Prior distribution of ψ. The red cross in (b) and (c) denotes the point (0,0). The numeric labels on the contours in (b) and (c) give the log density on the respective contour. Subposterior overlap is measured by the posterior density of ψ at zero. This is compared to the prior density of ψ at zero to determine the evidence for a linearly restricted model. The positioning of the red cross in panels (b) and (c) illustrate the identities (2.44) and (2.45) respectively. We again use a flat prior on the unknown success probability. The subprior is a Beta(1, 1) density. Each subposterior is a Beta(1 + ri, 1 + ni − ri) distribution for i = 1, 2, 3. Using (2.30) we can calculate log Isub = 12.2307 logα = 0. Given that s = 3, ψ is a two dimensional vector, ψ = θ(1) − θ(2) θ(1) − θ(3)  , where θ(i) is the proportion parameter in subset i for i = 1, 2, 3. Figure 2.8 illustrates the identities in (2.44) and (2.45). Panel (a) shows the three subposterior distributions. Panel (b) displays contours of the posterior distribution of ψ. The red cross denotes the point (0, 0). The density contour where log p(ψ|y1, . . . ,ys) = 12.23 passes through the point (0, 0) illustrating the general result that p(ψ = 0|y1, . . . ,ys,M) = ∫ ∏s i=1 p˜(θ|yi,M) dθ. Panel (c) shows the contours of the prior density of ψ. The prior mode is at (0, 0), and it can be seen that the log density is approaching zero around this point. We thus have empirical evaluation of (2.44) and (2.45) in this example. As mentioned, it is possible to generate samples from p(ψ|y1, . . . ,ys,MS). Evaluation of the subposte- rior integral can then be approached as a pointwise density estimation task Îsub = p̂(ψ = 0|y1, . . . ,ys,MS). From this example we can identify an important issue regarding the impact of the partition made in the split step. In (a) we can see that there is only a moderate amount of subposterior overlap. As such, (0, 0) is in the tails of the posterior distribution of ψ. The red cross in panel (b) is in the tails of the posterior distribution on ψ. Estimating the pointwise density at zero will be difficult if 0 is in a region with little posterior support. This promotes splitting the data so that subposterior distributions are as similar as possible. Although judicious data partitioning can shift the posterior support of ψ to a more favourable 25 2. Computing the model evidence using parallel processing region, little can be done about the fact that ψ is a d(s − 1)-dimensional vector. The implications for estimation of Isub are discussed in the next subsection. 2.4.6 Curse of dimensionality Suppose that we have B samples from each subposterior. Let S [b] i be the bth sample from p˜(θ|yi,M) for b = 1, . . . , B and i = 1, . . . , s. We can generate B posterior samples of ψ, by setting ψ[b] =  S [b] 1 − S[b]2 S [b] 1 − S[b]3 ... S [b] 1 − S[b]s  , (2.46) for b = 1, . . . , B. Using any consistent kernel density estimator the subposterior integral can be estimated as Îsub = p̂(ψ = 0|y1, . . . ,ys,MS) = 1 B B∑ b=1 KH(ψ [b]), (2.47) for a suitable kernel density estimator KH(·) with bandwidth matrix H. Although the estimator (2.47) is simulation consistent, it will be heavily affected by the curse of dimensionality. As mentioned in Section 2.3.1, the optimal mean square error rate for nonparametric kernel density estimation is known to be O(B−2/(q+4)), where q is the dimension of the space (Terrell and Scott, 1992). As the number of subsets s increases, the dimension of ψ increases linearly. 
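The estimator (2.47) is simple to implement by pooling the subposterior draws; a minimal sketch, using scipy's Gaussian product kernel and its default bandwidth rule in place of a general bandwidth matrix H (the function and array names are ours):

import numpy as np
from scipy.stats import gaussian_kde

def estimate_I_sub_kde(draws):
    # draws: list of s arrays, each of shape (B, d); draws[i][b] is the b-th
    # draw from subposterior i, i.e. S_i^[b] in (2.46).
    psi = np.hstack([draws[0] - d_i for d_i in draws[1:]])  # psi draws, shape (B, (s-1)d)
    kde = gaussian_kde(psi.T)               # gaussian_kde expects (dimension, sample size)
    return kde(np.zeros(psi.shape[1]))[0]   # estimate of p(psi = 0 | y, M_S) = I_sub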
Non parametric kernel density estimation in a (s−1)d dimensional space will only be feasible for very small s and d (recall Table 2.3). This suggests that the Monte Carlo budget per worker needs to increase exponentially with the number of subsets s to control the error in the combine step. 2.4.7 Embarrassingly parallel evidence estimation Algorithm 2.1 gives the general algorithm for computing the model evidence in parallel using the kernel density estimator for the subposterior integral (2.47). We expect the algorithm to break down as s increases, as the error in the combine stage will increase exponentially with s for a fixed Monte Carlo budget B. This limits its use for scalable computation of the integrated likelihood. The computational benefits of increasing s in the initial split are undermined by the increased Monte Carlo error in the combine step. We thus turn to a different strategy involving data augmentation. 2.5 Data augmentation for distributed inference 2.5.1 Introduction The general treatment of the problem given in the previous section is interesting, but leads to a somewhat negative conclusion regarding the feasibility of the divide and conquer approach. We can reach a more positive outcome by working within the confines of the conditionally conjugate exponential family. To ease notation we do not explicitly condition on a particular modelM in this section. The overall strategy makes use of Gibbs sampling in the apply step so that we can implement an efficient estimator of the subposterior integral in the combine step. The data augmentation based algorithm has more favourable scaling properties than the kernel density based algorithm (Algorithm 2.1). There is a connection between our approach and Chib’s method (Chib, 1995) for estimating the marginal likelihood from the Gibbs output in the single machine setting. 26 2.5.2. Conjugate priors in the exponential family Algorithm 2.1 Divide and Conquer model evidence Apply Step: Subset Markov chain Monte Carlo (MCMC) runs for i = 1, . . . , s do for b = 1, . . . , B do Sample θ from p˜(θ|yi) Set S [b] i ← θ end for Compute l̂og p˜(yi) using method of choice. end for Combine Step: Post Processing subset MCMC output for b = 1, . . . , B do Compute ψ[b] =  S [b] 1 − S[b]2 S [b] 1 − S[b]3 ... S [b] 1 − S[b]s  . end for Compute Îsub = B −1∑B b=1KH(ψ [b]) Compute log p̂(y1, . . . ,ys) = ∑s i=1 l̂og p˜(yi) + s logα+ Îsub 2.5.2 Conjugate priors in the exponential family To describe the conjugate priors suppose we have n independently and identically distributed observations y1, . . . ,yn. Bold face is used to allow for vector values observations. The model conditioned likelihood belongs to the exponential family if the likelihood contribution for a single observation can be written as p(yi|θ) = h(yi)g(θ) exp(η(θ)Tt(yi)), (2.48) for some known functions h(·), g(·), t(·) and η(·). The function t(·) returns a vector of sufficient statistics and the function η(·) dictates how the parameter θ interacts with the data. The standard conjugate prior (Bernardo and Smith, 2006) on θ is parametrised by a scalar ν0, and a vector φ0 of the same dimension as t(yi), taking the form pi(θ; ν0,φ0) = c(ν0,φ0)g(θ) ν0 exp(η(θ)Tφ0). (2.49) The function g(·) is the same that appears in the data likelihood (2.48), and c(ν0,φ0) is an appropriate normalising constant: 1 c(ν0,φ0) = ∫ g(θ)ν0 exp(η(θ)Tφ0) dθ. Explicit forms of some standard conjugate priors are listed in Chapter 5 of Bernardo and Smith (2006). 
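As a concrete illustration of the forms (2.48) and (2.49) (this worked example is ours, not taken from the references above), consider a Bernoulli observation yi ∈ {0, 1} with success probability θ. The likelihood contribution can be written as

p(yi | θ) = θ^{yi} (1 − θ)^{1 − yi} = (1 − θ) exp( yi log[θ/(1 − θ)] ),

so h(yi) = 1, g(θ) = 1 − θ, η(θ) = log[θ/(1 − θ)] and t(yi) = yi. The standard conjugate prior (2.49) is then

π(θ; ν0, φ0) ∝ (1 − θ)^{ν0} exp( φ0 log[θ/(1 − θ)] ) = θ^{φ0} (1 − θ)^{ν0 − φ0},

that is, a Beta(φ0 + 1, ν0 − φ0 + 1) density; the flat Beta(1, 1) prior used for the ESP example in section 2.4.1 corresponds to ν0 = φ0 = 0.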
Suppose the prior is of the standard conjugate form, so p(θ) = c(ν0,φ0)g(θ) ν0 exp(η(θ)Tφ0). Given n independent and identically distributed observations from model (2.48), the posterior remains in the same family, p(θ|y1, . . . ,yn) ∝ ( n∏ i=1 h(yi)g(θ) exp(η(θ) Tt(yi)) ) c(ν0,φ0)g(θ) ν0 exp(η(θ)Tφ0) ∝ g(θ)n exp [η(θ)T(∑ni=1t(yi))] c(ν0,φ0)g(θ)ν0 exp(η(θ)Tφ0) ∝ g(θ)ν0+n exp [η(θ)T (φ0 +∑ni=1t(yi))] ∝ pi (θ; ν0 + n,φ0 + ∑n i=1t(yi)) . The form of the update equations leads to an interpretation of the prior hyperparameters ν0 and φ0. The prior distribution pi(θ; ν0,φ0) acts as a pseudo-dataset of ν0 observations with sufficient statistics φ0. 27 2. Computing the model evidence using parallel processing In practice it is common to use more flexible priors than permitted by the standard conjugate family (Gutie´rrez-Pen˜a et al., 1997; Arnold et al., 1993). For example, suppose we have n observations from a d-dimensional multivariate normal distribution with known covariance matrix Σ and unknown mean µ. Using definition (2.49), the standard conjugate family has prior N(m0,Σ/ν0) for some prior mean m0 ∈ Rd and positive scalar ν0. The prior covariance of µ must be proportional to the covariance matrix of the observations. In practice, a common conjugate prior on µ is a N(m0,V0) distribution (Gelman et al., 2014) where V0 is any positive definite d × d matrix. The flexibility of an unstructured prior covariance matrix V0 places it outside the standard class (2.49). An enriched conjugate family of priors introduces an extra r-dimensional hyperparameter ω0 = (ω1, . . . , ωr), and r positive valued functions bu(θ) : Ω → R+ for u = 1, . . . , r (Gutie´rrez-Pen˜a et al., 1997). Define the function b(θ|ω0) = r∏ h=1 [bu(θ)] ωu . (2.50) The enriched conjugate prior is defined as pi(θ; ν0,φ0, ω0) = c(ν0,φ0,ω0)b(θ|ω0)g(θ)ν0 exp(η(θ)Tφ0). (2.51) In the previous display c(ν0,φ0,ω0) gives the normalising constant 1 c(ν0,φ0,ω0) = ∫ b(θ|ω0)g(θ)ν0 exp(η(θ)Tφ0) dθ. Standard conjugate priors (2.49) can be viewed as a special case of the enriched form (2.51) where ω0 = 0, and the functions b1(θ), . . . , br(θ) can be set to return one for all θ. The update equations for the enriched prior are very similar to the update equations for the standard prior. It is simple to show that if we have n independent and identically distributed observations from model (2.48), the posterior remains in the enriched family, p(θ|y1, . . . ,yn) = pi (θ; ν0 + n,φ0 + ∑n i=1t(yi),ω0) . The ω0 hyperparameter remains unchanged, and the data still influences the posterior through the term φ0 + ∑n i=1t(yi). Almost all commonly used conjugate priors can be shown to be in the enriched family (Arnold et al., 1993). However, it can be tedious to reparametrise benchmark priors to match the enriched form. It can be shown that the N(µ0,V0) prior belongs to the enriched conjugate prior family for Gaussian likelihoods (Gutie´rrez-Pen˜a et al., 1997, Example 4.3). The abstraction of the enriched conjugate prior family is beneficial as it provides a unifying framework for establishing general results. 2.5.3 Data augmentation It is not possible to write the likelihood function in the form (2.48) for many interesting statistical models. Data augmentation is a useful technique that extends the utility of exponential family and conjugate theory to a wider class of models. Suppose that we have an observed dataset of n observations y = (y1, . . . ,yn). 
A wide range of data augmentation strategies add a hidden layer of latent variables z with state space Z to the model, such that the observed data likelihood satisfies p(y|θ) = ∫ Z p(y, z|θ) dz. The latent variables z often have a natural statistical interpretation in terms of a hidden layer in the data generating process. Ideally, the complete data likelihood p(y, z|θ) is then more tractable than the observed data likelihood p(y|θ). In the simplest form, we introduce n independent latent variable z = (z1, . . . ,zn), one for each observation in the original dataset. With n independent latent variables 28 2.5.4. Chib’s method and n independent observations, the complete data likelihood satisfies p(y, z|θ) = n∏ i=1 p(yi, zi|θ). Suppose that the complete data likelihood contribution for observation i belongs to the exponential family, so p(yi, zi|θ) = h(yi, zi)g(θ) exp(η(θ)Tt(yi, zi)), (2.52) again for some known functions h(·), g(·), t(·) and η(·). Suppose that we adopt an enriched conditionally conjugate prior p(θ) = pi(θ; ν0,φ0,ω0). It is immediate that the conditional posterior has the form p(θ|z,y) = pi(θ; ν0 + n,φ0 + ∑n i=1t(yi, zi),ω0). (2.53) The conditional posterior depends on the sufficient statistics of the augmented dataset ∑n i=1t(yi, zi). For many models, it is possible to sample from the full conditional of the latent variables p(z|θ,y). The joint posterior on the unknowns p(θ, z|y) can thus be targeted using a two block Gibbs sampler, iteratively sampling from the full conditionals p(θ|z,θ) and p(z|θ,y). This data augmentation and Gibbs approach can be used for logistic regression, negative binomial regression, probit regression and Gaussian mixtures. 2.5.4 Chib’s method Chib’s method for marginal likelihood computation (Chib, 1995) using Gibbs sampling is based on the integrated likelihood identity (2.16). The method uses the structure of data augmented models to estimate the pointwise density in a more efficient manner than nonparametric kernel density estimation. As mentioned in section 2.5.2, it is often possible to introduce a vector of latent variables z such that the conditional posterior distribution p(θ|y, z) is known exactly. Marginalising over the posterior distribution of the latent variables z gives the relationship p(θ|y) = ∫ p(θ|y, z)p(z|y). (2.54) A simulation consistent estimate of the integrated likelihood can be obtained by simulating from the posterior distribution of the latent variables z. As discussed, it is common to use a two block Gibbs sampler, iteratively sampling the full conditional distributions p(θ|z,y) and p(z|θ,y). Suppose we use a conditionally conjugate prior, so the full conditional p(θ|y, z) is known explicitly. Given B posterior samples of the latent variables, z[1], . . . ,z[b], a simulation consistent estimator of the posterior density at θ∗ ∈ Ω is then p̂(θ∗|y) = 1 B B∑ b=1 p(θ∗|z[b],y). (2.55) Substituting (2.55) into the integrated likelihood identity (2.16) gives a simulation consistent estimator of the integrated likelihood, log p̂(y) = log p(y|θ∗) + log p(θ∗)− log ( 1 B B∑ b=1 p(θ∗|y, z[b]) ) . (2.56) Chib’s method can be used for high-dimensional models where general non-parametric density esti- mators would likely fail. The key to the approach is the marginal representation of the posterior density (2.55). We can take an average over posterior draws of the latent variables to estimate the unknown posterior density p(θ∗|y). 
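The estimator (2.56) reduces to simple post-processing of the Gibbs output once the three ingredients have been evaluated at the chosen ordinate θ*; a minimal sketch (function and argument names are ours):

import numpy as np
from scipy.special import logsumexp

def chib_log_evidence(log_lik_star, log_prior_star, log_cond_post_star):
    # log_lik_star:       log p(y | theta*)
    # log_prior_star:     log p(theta*)
    # log_cond_post_star: array of log p(theta* | y, z[b]) over the B Gibbs draws of z
    B = len(log_cond_post_star)
    log_post_at_star = logsumexp(log_cond_post_star) - np.log(B)  # estimate of log p(theta* | y), cf. (2.55)
    return log_lik_star + log_prior_star - log_post_at_star       # Candidate's formula (2.16) / (2.56)

A common choice of θ* is a point of high posterior density, such as the posterior mean of the sampled θ values, which keeps the terms in (2.56) numerically stable.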
In the following subsections we develop an embarrassingly parallel algorithm that makes use of the fact that although ∫ s∏ i=1 p˜(θ|yi) dθ, 29 2. Computing the model evidence using parallel processing may be intractable, data augmentation with suitably chosen latent variables in each subset zi could lead to the augmented subposterior integral ∫ s∏ i=1 p˜(θ|yi, zi) dθ, having a closed form solution. 2.5.5 Apply step We first show how data augmentation and Gibbs sampling can be used to draw subposterior samples and compute the subposterior evidence scores in the apply step. We assume that the complete data likelihood belongs to the exponential family. Let ni give the number of observations in subset i for i = 1, . . . , s. Let yij denote the jth observation in subset i, and let zij denote an associated latent variable for the jth observation in subset i. We assume that the latent variables zij are independent given θ. Finally let zi give the vector of latent variables for subset i, that is zi = (zi1, . . . , zini) for i = 1, . . . , s. The augmented full dataset posterior distribution is again proportional to s subposteriors: p(θ|z1, . . . ,zs,y1, . . . ,ys) ∝ p(θ)p(y1, . . . ,ys, z1, . . . ,zs|θ) = s∏ i=1 p(yi, zi|θ)p(θ)1/s ∝ s∏ i=1 p(yi, zi|θ)p˜(θ) ∝ s∏ i=1 p˜(θ|yi, zi). Suppose the complete data likelihood can be written in the following form p(yij , zij |θ) = h(yij , zij)g(θ) exp(η(θ)Tt(yij , zij)). (2.57) Where the functions h(·), g(·), t(·) and η(·) are known. Suppose the original prior is of the enriched conjugate form (2.51). The fractionated prior remains in the enriched conjugate family. Fractionating yields p(θ|ν0,φ0,ω0)1/s = [ c(ν0,φ0,ω0)b(θ|ω0)g(θ)ν0 exp(η(θ)Tφ0) ]1/s = c(ν0,φ0,ω0) 1/sb(θ|ω0)1/sg(θ)ν0/s exp(η(θ)Tφ0/s). Recall that ω0 = (ω1, . . . , ωr) and the definition of the function b(θ|ω0) = ∏r u=1[bu(θ)] ωu . It then follows that b(θ|ω0)1/s = b(θ|ω0/s). Continuing, we see that the fractionated prior remains in the enriched family p(θ|ν0,φ0,ω0)1/s = c(ν0,φ0,ω0)1/sb(θ|ω0/s)g(θ)ν0/s exp(η(θ)Tφ0/s) ∝ pi(θ; ν0/s,φ0/s,ω0/s). Given the original prior is p(θ) = pi(θ; ν0,φ0,ω0), the subprior can be defined as p˜(θ) = pi(θ; ν0/s,φ0/s,ω0/s). The hyperparameters of the subprior show that each subset analysis receives a fraction of the original prior pseudo-dataset, namely ν0/s observations with sufficient statistics φ0/s. Secondly, as the ω0 hyper- parameter tends to 0 the function b(θ|ω) tends to one. The subpriors have hyperparameters ω0/s which indicates a dampened effect of the extra prior flexibility function b(θ|ω). There are open questions sur- rounding the importance of prior fractionation (Scott, 2017; Gelman and Vehtari, 2017), and the theory of exponential families can be of use in addressing these. 30 2.5.6. Combine step From the results in section 2.5.3, it follows that given the latent variables zi, the subposteriors p˜(θ|yi, zi) are also conditionally conjugate. For i = 1, . . . , s: p˜(θ|yi, zi) ∝ g(θ)ni exp ( η(θ)T ∑ni i=jt(yij , zij) ) p(θ|ν0,φ0,ω0)1/s ∝ g(θ)ni exp ( η(θ)T ∑ni j=1t(yij , zij) ) pi(θ; ν0/s,φ0/s,ω0/s) ∝ pi ( θ ∣∣∣ ν0/s+ ni,φ0/s+∑nij=1t(yij , zij),ω0/s) . Suppose that it is possible to sample from the full conditional p(zi|θ,yi) in a single machine analysis. It is then also possible to sample from conditional subposterior distribution p˜(zi|θ,yi) using the same update equations. The conditional subposterior has the same form as in regular analysis of a small portion of the full dataset: p˜(zi|yi,θ) ∝ p(yi, zi|θ)p˜(θ) ∝ p(zi|θ,yi)p(yi|θ) ∝ p(zi|θ,yi). 
Use of the subprior p˜(θ) is not material when conditioning on θ. We keep the tilde to acknowledge that both p(zi|θ,yi) and p˜(θ|yi, zi) are targeted in the apply stage by an invidual worker. As such, we can use Gibbs sampling to target the subposterior distribution p˜(θ, zi|yi) by sampling from the conditional densities p˜(θ|yi, zi) and p˜(zi|θ,yi). We have previously assumed that each worker returns the subposterior evidence scores. We can use Chib’s method in the apply stage to compute the subposterior evidence values. Let z [b] i be the sampled latent variables at iteration b in subposterior i. Assuming the chain has mixed well, z [b] i is a sample from the distribution p˜(zi|yi). At any particular ordinate θ∗ ∈ Ω we have the identity log p˜(yi) = log p(yi|θ∗) + log p˜(θ∗)− log p˜(θ∗|yi), A simulation consistent estimator of log p˜(yi) is therefore l̂og p˜(yi) = log p(yi|θ∗) + log p˜(θ∗)− log ( 1 B B∑ b=1 p˜(θ∗|yi, z[b]i ) ) . (2.58) 2.5.6 Combine step The sequence of full conditional Gibbs distributions in the apply stage can be used to give a closed form estimator of the subposterior integral Isub = ∫ s∏ i=1 p˜(θ|yi) dθ. (2.59) To describe the estimator of Isub suppose the latent variables z1, . . . ,zs take values in state spaces Z1, . . . ,Zs respectively. Let the joint space of the subposterior latent variables be denoted Z = Z1 × . . . × Zs. Each subposterior has the marginal representation p˜(θ|yi) = ∫ Zi p˜(θ|yi, zi)p˜(zi|yi) dzi. The subposterior latent variables can be considered together as a random vector (z1, z2, . . . ,zs) with joint distribution p˜(z1, z2, . . . ,zs) = ∏s i=1 p˜(zi|yi). As such, we have the following representation of the product of the subposterior distributions s∏ i=1 p˜(θ|yi) = s∏ i=1 (∫ Zi p˜(θ|yi, zi)p˜(zi|yi) dzi ) = ∫ Z1×...×Zs ( s∏ i=1 p˜(θ|yi, zi)p˜(zi|yi) ) dz1dz2 . . . dzs = ∫ Z ( s∏ i=1 p˜(θ|yi, zi) )( s∏ i=1 p˜(zi|yi) ) dz1dz2 . . . dzs. (2.60) 31 2. Computing the model evidence using parallel processing Switching the integration and product operators is justified using Fubini’s theorem. The necessary assumptions are satisfied if all the subposterior distributions are proper (Keener, 2013). Substituting (2.60) into the definition of Isub (2.59) gives ∫ Ω s∏ i=1 p˜(θ|yi) dθ = ∫ Ω ∫ Z ( s∏ i=1 p˜(θ|yi, zi) )( s∏ i=1 p˜(zi|yi) ) dz1dz2 . . . dzs dθ Using Fubini’s theorem once more allows the order of integration to be switched, ∫ Ω s∏ i=1 p˜(θ|yi) dθ = ∫ Z (∫ Ω s∏ i=1 p˜(θ|yi, zi) dθ )( s∏ i=1 p˜(zi|yi) ) dz1dz2 . . . dzs (2.61) The inner integral over Ω can be obtained in closed form given that the subposteriors are conditionally conjugate. The subposterior integral is equal to the expected value of the augmented subposterior integral, where the latent variables z1, . . . ,zs are sampled from the subposterior distributions. We can write Isub = Ep˜(z1,...,zs|y1,...,ys) [∫ Ω s∏ i=1 p˜(θ|yi, zi) dθ ] (2.62) For convenience, let t(yi, zi) denote the complete data sufficient statistic for subset i, that is t(yi, zi) =∑ni j=1 t(yij , zij). For more compact notation define for i = 1, . . . , s the subset normalising constants ci = c(ν0/s+ ni,φ0/s+ t(yi, zi),ω0/s), where c(·) is the normalising function for the enriched conjugate prior. Additionally, define the full dataset normalising constant as C = c(ν0 + n,φ0 + ∑s i=1t(yi, zi),ω0). Using properties of the standard conjugate prior we obtain a closed form expression for the augmented subposterior integral. 
∫ Ω s∏ i=1 p˜(θ|yi, zi) dθ = ∫ Ω s∏ i=1 ci × b(θ|ω0/s)g(θ)ν0/s+ni exp [ η(θ)T (φ0/s+ t(yi, zi)) ] dθ = ( s∏ i=1 ci )∫ Ω g(θ)ν0+nb(θ|ω0) exp [ η(θ)T ( φ0 + s∑ i=1 t(yi, zi) )] dθ = (∏s i=1 ci C )∫ Ω p(θ|y1, . . . ,ys, z1, . . . ,zs) dθ = ∏s i=1 c(ν0/s+ ni,φ0/s+ t(yi, zi),ω0/s) c(ν0 + n,φ0 + ∑s i=1 t(yi, zi),ω0) . (2.63) The second line uses the fact that ∏s i=1 b(θ|ω0/s) = b(θ|ω0). Substituting (2.63) into (2.64) gives an important expression for the subposterior integral Isub. Isub = Ep˜(z1,...,zs|y1,...,ys) [∏s i=1 c(ν0/s+ ni,φ0/s+ t(yi, zi),ω0/s) c(ν0 + n,φ0 + ∑s i=1 t(yi, zi),ω0) ] (2.64) Let φ [b] i = φ0/s+ t(z [b] i ,yi) be the data dependent parameter of the conditional subposterior at iteration b for i = 1, . . . , s. Suppose we have saved either the sufficient statistics t(yi, z [b] i ) or conditional posterior parameters φ [b] i at each iteration b = 1, . . . , B in each minibatch analysis i = 1, . . . , s. The subposterior integral can then be estimated in the combine step by pooling the Gibbs histories from the apply step. 32 2.5.7. Embarrassingly parallel evidence estimation In full, Îsub = 1 B B∑ b=1 ∫ Ω s∏ i=1 p˜(θ|yi, z[b]i ) dθ = 1 B B∑ b=1 ∏s i=1 c(ν0/s+ ni,φ0/s+ t(yi, z [b] i )) c ( ν0 + n,φ0 + ∑s i=1 t(z [b] i ,yi) ) (2.65) = 1 B B∑ b=1 ∏s i=1 c(ν0/s+ ni,φ [b] i ) c(ν0 + n, ∑s i=1 φ [b] i ) . (2.66) Equations (2.65) and (2.66) both define plug-in Monte Carlo estimators of the expectation in equation (2.64). 2.5.7 Embarrassingly parallel evidence estimation Recall (2.27), the subset decomposition of the full dataset model evidence log p(y1, . . . ,ys) = ( s∑ i=1 log p˜(yi) ) + log Isub + s logα. The sum over the subposterior evidence scores can be estimated during the apply stage using Chib’s method. To estimate the sequence of full conditional distributions in the combine stage it is necessary to save information about the full conditional distributions that are sampled from in the apply stage. The subprior normalising constant α can generally be determined in closed form given that the original prior is in the exponential familty. The full dataset model evidence is then estimated by combining the subset evidence scores with the subposterior integral, l̂og p(yi, . . . ,ys) = ( s∑ i=1 l̂og p˜(yi) ) + log Îsub + s logα. Algorithm 2.2 presents the proposed methodology in detail. Algorithm 2.2 Divide and conquer for conditionally conjugate models Apply Step: Subset Markov chain Monte Carlo (MCMC) runs for i = 1, . . . , s do for b = 1, . . . , B do Sample z [b] i ∼ p(zi|θ,yi) Compute φ [b] i = φ0/s+ t(z [b] i ,yi) Save φ [b] i Sample θ from p˜(θ|z[b]i ,yi) = pi(θ; ν0/s+ ni,φ[b]i ,ω0/s) end for Compute l̂og p˜(yi) using (2.58) or alternative method. end for Combine Step: Post processing subset Gibbs output for b = 1, . . . , B do Compute gb = ∏s i=1 c(ν0/s,φ [b] i ,ω0/s) c(ν0, ∑s i=1 φ [b] i ,ω0) end for Compute Îsub = B −1∑B b=1 gb Compute l̂og p(yi, . . . ,ys) = (∑s i=1 l̂og p˜(yi) ) + log Îsub + s logα To recap, we have developed an embarrassingly parallel algorithm to compute the full dataset model evidence using distributed computing and Gibbs sampling. The algorithm meets the desired template of Figure 2.1. The complexity of the combine step does not depend on n. It is quite remarkable that we can 33 2. Computing the model evidence using parallel processing estimate the integrated likelihood on massive datasets without ever fitting the model to the whole dataset. 
This is a significant departure from many other techniques for computing the integrated likelihood, where generating samples from the full dataset posterior p(θ|y1, . . . ,ys) is an important intermediate step (Friel and Wyse, 2012). Existing methods for calculating the integrated likelihood are discussed in more detail in Chapter 3. 2.5.8 Monte Carlo error For Algorithm 2.2 to be effective, the Monte Carlo error in the combine step needs to be tolerable. Algorithm 2.1 is impractical for this reason, due to the curse of dimensionality impeding nonparametric kernel density estimation of the subposterior integral. In this section we show that the variance of the Gibbs based estimatpor subposterior integral (2.66) is tied to the disparity between the subposterior distribution on the latent variables, and the full dataset posterior distribution. The variance can be characterised in terms of importance sampling. As a quick review, importance sampling is a basic Monte Carlo technique based on a change of measure. Suppose we are interested in the expectation of a function h(x) over some probability distribution p(x) on state space X . Importance sampling is based on the simple identity, ∫ X h(x)p(x) dx = ∫ X h(x) p(x) q(x) q(x) dx, for some other distribution q(x) on the same space X . A common use for importance sampling is when p(x) is difficult to sample from, but it is easy to find an approximating distribution q(x). An important diagnostic in importance sampling is the variance of the importance sampling weights varq(x) ( p(x) q(x) ) . If q(x) ∝ p(x), then the variance of the weights is zero. The variance of the weights is strongly influenced by the differences between p(x) and q(x). The tails of the distribution q(x) are particularly significant. If p(x) has high support in regions that are minimally covered by q(x), the variance of the importance weights will be high, and the importance sampler will be ineffective. It is possible for the variance of the weights to be infinite, in which case there can be a large finite sample bias on the importance sampling estimator (Robert and Casella, 2010). Let p˜(z1, . . . ,zs|y1, . . . ,ys) be the effective joint distribution on the latent variables if sampling in- dependently from each subposterior, so p˜(z1, . . . ,zs|y1, . . . ,ys) = ∏s i=1 p˜(zi|yi). The target posterior on the latent variables is p(z1, . . . ,zs|y1, . . . ,ys). To derive the variance of Îsub we start by applying Bayes theorem to the augmented posterior p(θ|z1, . . . ,zs,y1, . . . ,ys) = p(y1, . . . ,ys, z1, . . . ,zs|θ)p(θ) p(z1, . . . ,zs,y1, . . . ,ys) = αs p(z1, . . . ,zs|y1, . . . ,ys)p(y1, . . . ,ys) s∏ i=1 p(yi, zi|θ)p˜(θ) = αs p(z1, . . . ,zs|y1, . . . ,ys)p(y1, . . . ,ys) s∏ i=1 p˜(θ|yi, zi)p˜(zi|yi)p˜(yi) = αs ∏s i=1 p˜(yi) p(y1, . . . ,ys) ∏s i=1 p˜(zi|yi) p(z1, . . . ,zs|y1, . . . ,ys) s∏ i=1 p˜(θ|yi, zi). (2.67) The core identity p(y1, . . . ,ys) = ( s∏ i=1 p˜(yi) ) αs × Isub 34 2.6. Logistic regression gives the relationship ∏s i=1 p˜(yi) p(y1, . . . ,ys) = α−s Isub . (2.68) Substituting (2.68) into (2.67) and recalling the definition p˜(z1, . . . ,zs|y1, . . . ,ys) = ∏s i=1 p˜(zi|yi), p(θ|z1, . . . ,zs,y1, . . . ,ys) = α sα−s Isub p˜(z1, . . . ,zs|y1, . . . ,ys) p(z1, . . . ,zs|y1, . . . ,ys) s∏ i=1 p˜(θ|yi, zi). Integrating both sides over θ gives 1 = 1 Isub p˜(z1, . . . ,zs|y1, . . . ,ys) p(z1, . . . ,zs|y1, . . . ,ys) ∫ s∏ i=1 p˜(θ|yi, zi) dθ. Rearranging shows a connection between the integral over the augmented subposteriors and importance sampling weights. 
∫ s∏ i=1 p˜(θ|yi, zi) dθ = Isub × p(z1, . . . ,zs|y1, . . . ,ys) p˜(z1, . . . ,zs|y1, . . . ,ys) (2.69) The subposterior integral estimator is defined as the plug-in expectation of the augmented integral over p˜(z1, . . . ,zs|y1, . . . ,ys). The estimator can be written as Îsub = 1 B B∑ b=1 ∫ Ω s∏ i=1 p˜(θ|yi, z[b]i ) dθ = 1 B × Isub × B∑ b=1 p(z [b] 1 , . . . ,z [b] s |y1, . . . ,ys) p˜(z [b] 1 , . . . ,z [b] s |y1, . . . ,ys) . Assuming independent subposterior samples, the variance of the estimator is therefore var(Îsub) = 1 B × I2sub × varp˜(z1,...,zs|y1,...,ys) ( p(z1, . . . ,zs|y1, . . . ,ys) p˜(z1, . . . ,zs|y1, . . . ,ys) ) . (2.70) In reality, correlated subposterior samples will inflate the variance, but to simplify the discussion we as- sume independence or samples from a sufficiently thinned Markov chain. The variance of Îsub is tied to the mean multiplied by a density ratio involving the latent variables in the model. The final term in brackets can be interpreted as importance sampling weights if using the subposterior distributions over the latent variables to approximate the full dataset joint posterior on the latent variables. There are two important conclusions. The first is that the variance of the integral estimator is only finite if the variance of the importance sampling weights is finite. If the tails of the subposterior distribution p˜(z1, . . . ,zs|y1, . . . ,ys) contain regions where the target posterior p(z1, . . . ,zs|y1, . . . ,ys) has high support, then the estimator Îsub could have infinite variance. The second conclusion in the converse direction is that it is possible to obtain a zero variance estimator if p˜(z1, . . . ,zs|y1, . . . ,ys) is proportional to p(z1, . . . ,zs|y1, . . . ,ys). It is difficult to make a broad statement about how the variance of the proposed estimator (2.66) scales with the number of subsets s. The variance is tied to the model, the goodness of fit of the model and how the full dataset is partitioned. However, Algorithm 2.2 has more promise than the approach based on kernel density estimation as the curse of dimensionality does not completely rule out its efficacy. 2.6 Logistic regression We now show how Algorithm 2.2 can be used for large scale logistic regression. Suppose we have n binary observations yi ∈ {0, 1} with an associated p-dimensional vector of covariates xi for i = 1, . . . , n. The responses are modelled as yi ∼ Bernoulli(σ(ηi)), where ηi = xTi β and σ(η) gives the inverse logistic function σ(η) = exp(xTi β)/(1 + exp(x T i β)). Polson et al. (2013) propose a data augmentation scheme for logistic regression involving the Po´lya-Gamma distribution. A Po´lya-Gamma random variable is an infinite sum of gamma random variables, defined more precisely below. 35 2. Computing the model evidence using parallel processing Definition 2.1 (Polson et al. (2013)). A random variable X has a Po´lya-Gamma distribution b > 0 and c ∈ R, denoted X ∼ PG(b, c) if X d = 1 2pi2 ∞∑ k=1 gk (k − 1/2)2 + c2/(4pi2) , where the gk ∼ Gamma(b, 1) are independent gamma random variables with shape parameter b and rate parameter 1, and d = indicates equality in distribution. Let X ∼ PG(b, c). Let cosh(x) = (exp(x) + exp(−x))/2. The probability density function of X can be written as p(x|b, c) = coshb(c/2)2 b−1 Γ(b) ∞∑ n=0 (−1)n Γ(n+ b) Γ(n+ 1) (2n+ b)√ 2pix3 exp ( − (2n+ b) 2 8x − c 2 2 x ) . For each binary observation yi we introduce a latent Po´lya-Gamma random variable zi with parameters b = 1, c = 0. Polson et al. 
show that the likelihood contribution of observation i can be written as an integral over the latent zi, p(yi|xi,β) = [exp(x T i β)] yi 1 + exp(xtiβ) = 2−1 exp(κixTi β) ∫ ∞ 0 exp(−zi(xTi β)2/2)p(zi|1, 0) dzi, where κi = yi− 1/2 and p(zi|1, 0) is the density of a Po´lya-Gamma random variable. The complete data likelihood can be written as a quadratic function of β. Let X give the n× p design matrix where row i is given by xTi . Up to arbitrary constants, the complete data likelihood is equal to p(y, z|X,β) = p(z) n∏ i=1 exp(κix T i β − zi(xTi β)2/2) = p(z) n∏ i=1 exp(κ2i /(2zi)) exp ( −zi 2 (xTi β − κi/zi)2 ) = h(z)p(z) exp ( n∑ i=1 −zi 2 (xTi β − κi/zi)2 ) = h(z)p(z) exp ( −1 2 (y′ −Xβ)TΩ(y′ −Xβ) ) , where y′ = (κ1/z1, . . . , κn/zn) and Ω = diag(z1, . . . , zn) and h(y, z) is a function of the augmented dataset that is not dependent on β. This resembles a regression likelihood with working responses y′, design matrix X, coefficients β, and diagonal covariance matrix Ω−1. With the connection to linear regression, it is simple to see that a conditionally conjugate prior on β is a multivariate normal distribu- tion. The enriched parametrisation (2.51) is cumbersome to work with, we instead use the more typical parametrisation in terms of the prior mean m0 and prior covariance matrix V0. A two-block Gibbs sampler can be used to sample from the joint posterior of the coefficients β and the latent variables z. The full conditional on β is p(β|z,y) = N(m,V ), (2.71) where V = (XTΩX + V −10 ) −1 m = V (XTκ+ V −10 m0). Recall κ = (y1 − 1/2, . . . , yn − 1/2) and Ω is a diagonal matrix where Ωii = zi. The latent variables z are conditionally independent given y and θ. The full conditional for the latent zi is p(zi|β,y) = PG(1,xTi β), (2.72) 36 2.6. Logistic regression for i = 1, . . . , n, and PG(b, c) denotes a Po´lya-Gamma distribution with parameters (b, c). The Po´lya- Gamma distribution is not in the exponential family, but still has a useful conditional conjugacy property for the logistic regression model. From section 2.5.5 the subposterior distributions can also be targeted using Gibbs sampling. The subprior is a N(m0, sV0) distribution. The fractionation scales the covariance matrix by a factor of s. To sample from p˜(β, zi|yi) we iterate sampling from p˜(β|zi,yi) and p˜(zi|β,yi). Let xij give the vector of covariates for the jth observation in subset i for j ∈ {1, . . . , ni} and i ∈ {1, . . . , s}. Let X(i) give the ni × p matrix of covariates in subset i for i = 1, . . . , s. The subposterior full conditional on β is p˜(β|zi,yi) = N(m,V ), (2.73) where V = (XT(i)ΩX(i) + s −1V −10 ) −1 m = V (XT(i)κ+ s −1V −10 m0). Here κ = (yi1 − 1/2, . . . , yini − 1/2) and Ω = diag(zi). The latent variables zi are again conditionally independent given yi and β. The full conditional for the latent zij is p˜(zij |β,yi) = PG(1,xTijβ), (2.74) for j = 1, . . . , ni. Suppose we run B sweeps of the Gibbs sampler in each subset analysis during the apply step. As dictated by Algorithm 2.2, during the apply step we save the parameters of the full conditional p˜(β|yi, zi) at each iteration. Let m[b]i represent the conditional mean of β and let V [b]i represent the conditional variance of β at iteration b in the ith subposterior analysis for i = 1, . . . , s and b = 1, . . . , B. Likewise, let z [b] i denote the sampled latent variables at iteration b in the ith subset analysis for i = 1, . . . , s and b = 1, . . . , B. Each worker can compute the subposterior evidence using Chib’s method. 
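As an illustration of a single worker's apply-step run, the sketch below iterates the two full conditionals (2.73) and (2.74) and stores the conditional means and covariances that the combine step will need. It is a rough sketch under stated assumptions: the Pólya-Gamma draw is approximated by truncating the infinite gamma sum of Definition 2.1 at K terms, whereas in practice one would use an exact sampler such as that of Polson et al. (2013), and all variable names are illustrative.

import numpy as np

rng = np.random.default_rng(1)

def pg_draw_approx(c, K=200):
    # Approximate PG(1, c) draw by truncating the gamma sum in Definition 2.1 at K terms.
    k = np.arange(1, K + 1)
    g = rng.gamma(shape=1.0, scale=1.0, size=K)
    return np.sum(g / ((k - 0.5) ** 2 + c ** 2 / (4.0 * np.pi ** 2))) / (2.0 * np.pi ** 2)

def subset_gibbs(X_i, y_i, m0, V0, s, B=1000):
    # Two-block Gibbs sampler targeting the subposterior p~(beta, z_i | y_i).
    # Returns the saved conditional parameters (m_i^[b], V_i^[b]) for b = 1, ..., B.
    n_i, p = X_i.shape
    kappa = y_i - 0.5
    V0_inv = np.linalg.inv(V0)
    beta = np.zeros(p)
    ms, Vs = [], []
    for b in range(B):
        # z_ij | beta ~ PG(1, x_ij' beta), equation (2.74)
        z = np.array([pg_draw_approx(x @ beta) for x in X_i])
        # beta | z_i ~ N(m, V) under the subprior N(m0, s V0), equation (2.73)
        V = np.linalg.inv(X_i.T @ (z[:, None] * X_i) + V0_inv / s)
        m = V @ (X_i.T @ kappa + V0_inv @ m0 / s)
        beta = rng.multivariate_normal(m, V)
        ms.append(m)
        Vs.append(V)
    return np.array(ms), np.array(Vs)

The saved pairs (m_i^[b], V_i^[b]) are exactly the quantities required by the Chib estimator and the combine step described next.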
For any ordinate β∗ subsposterior evidence satisfies the identity log p˜(yi) = log p(yi|β∗) + log p˜(β∗)− log p˜(β∗|yi). A simulation consistent estimator is therefore l̂og p˜(yi) = log p(yi|β∗) + log p˜(β∗)− log ( B−1 B∑ b=1 p(β∗|yi, z[b]i ) ) = log p(yi|β∗) + log p˜(β∗)− log ( B−1 B∑ b=1 N(β∗;m[b]i ,V [b] i ) ) . (2.75) We now turn to the combine step. The subposterior integral has the representation Isub = Ep˜(z1,...,zs|y1,...,ys) [∫ Ω s∏ i=1 p˜(β|yi, zi) dβ ] . From the results in section 2.5.6 it is possible to obtain a closed form expression for the integral over the augmented subposterior distributions given the conjugate structure of the model. Here the condi- tional subposterior is multivariate normal. Lemma 2.1 gives a result for the integral over a product of multivariate Gaussians. Lemma 2.1. Let p(x;µi,Σi) denote a multivariate Gaussian pdf with mean vector µi and covariance matrix Σi for i = 1, . . . , s. For i = 1, . . . , s define ηi = Σ −1 i µi, Λi = Σ −1 i , ξi = −1 2 ( d log 2pi − log|Λi|+ ηTi Λiηi ) . 37 2. Computing the model evidence using parallel processing Similarly, define η = ∑s i=1ηi, Λ = ∑s i=1Λi ξ = −1 2 ( d log 2pi − log|Λ|+ ηTΛη) . The integral over the product of the s density functions has the closed form solution∫ s∏ i=1 p(x;µi,Σi) dx = exp [( ∑s i=1ξi)− ξ] . The proof uses the fact that the product of multivariate Gaussian density functions is proportional to another multivariate Gaussian density function. This is a special case of the general result in section 2.5.6. For convenience define the function f(m1, . . . ,ms,Σ1, . . . ,Σs) as f(m1, . . . ,ms,V1, . . . ,Vs) = ∫ s∏ i=1 N(x;mi,Vi) dx. The function f(·) can be numerically evaluated using the result in Lemma 2.1. The Monte-Carlo estimator of the subposterior integral is then Îsub = B −1 B∑ b=1 ∫ Ω s∏ i=1 p˜(β|yi, z[b]i ) dβ = B−1 B∑ b=1 f(m [b] 1 , . . . ,m [b] s ,V [b] 1 , . . . ,V [b] s ) (2.76) The subprior normalising constant is α = sp/2( √ (2pi)p|V0|)1−1/s. Combining (2.75) and (2.76) the full dataset model evidence is estimated as l̂og p(y1, . . . ,ys|β) = ( s∑ i=1 l̂og p˜(yi) ) + s logα+ log Îsub. 2.7 Data application 2.7.1 Flights dataset We considered a dataset on flights departing New York city, available in the R package nycflights13 (Wickham, 2014). There are n = 327, 346 observations. We dichotomised the arrival delay variable (original units in minutes) to obtain a binary outcome. We labelled flights as late if the arrival delay was greater than zero, and on time if the arrival delay was less than or equal to zero. Unsurprisingly, there is a clear statistical association between late arrival at the destination and the departure delay leaving New York. There are data on 16 different carriers (airlines). Figure 2.9 (a) plots the number of flights against departure delay. Panel (b) plots the proportion of flights that were late against departure delay in minutes. In (a) we see that the distribution of departure delay time has a mode slightly above zero. The majority of flights depart late. In (b) we see a logistic relationship between departure delay (minutes) and the empirical probability of late arrival at the destination. This suggests a logistic regression model is appropriate. Let yi ∈ {0, 1} denote the response, where yi = 1 if flight i was late and is zero otherwise. We considered two regression models for the response, M1 = intercept + delay (2.77) M2 = intercept : carrier + delay : carrier. (2.78) 38 2.7.2. 
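Before turning to the data applications, a small numerical check of Lemma 2.1 may be useful. The sketch below evaluates the log integral over a product of Gaussian densities, writing the quadratic terms in terms of the means (so that the term involving η_i is μ_i'Λ_iμ_i), and verifies the s = 2 case against the standard identity ∫ N(x; μ1, Σ1) N(x; μ2, Σ2) dx = N(μ1; μ2, Σ1 + Σ2). The function and variable names are assumptions.

import numpy as np
from scipy.stats import multivariate_normal

def log_gauss_product_integral(means, covs):
    # log of the integral over the product of Gaussian densities N(x; m_i, V_i),
    # the quantity f(m_1, ..., m_s, V_1, ..., V_s) used in the combine step (Lemma 2.1).
    d = means[0].shape[0]
    lams = [np.linalg.inv(V) for V in covs]        # precision matrices Lambda_i
    etas = [L @ m for L, m in zip(lams, means)]    # canonical means eta_i = Lambda_i mu_i
    def log_norm(eta, lam):
        # log normalising constant of exp(eta'x - x' lam x / 2)
        _, logdet = np.linalg.slogdet(lam)
        return 0.5 * (d * np.log(2.0 * np.pi) - logdet + eta @ np.linalg.solve(lam, eta))
    log_xi_i = [log_norm(e, L) for e, L in zip(etas, lams)]
    log_xi = log_norm(sum(etas), sum(lams))
    return log_xi - sum(log_xi_i)

# sanity check for s = 2: the integral equals N(mu_1; mu_2, Sigma_1 + Sigma_2)
m1, m2 = np.array([0.5, -1.0]), np.array([1.0, 0.0])
V1, V2 = np.diag([1.0, 2.0]), np.diag([0.5, 0.5])
lhs = log_gauss_product_integral([m1, m2], [V1, V2])
rhs = multivariate_normal(mean=m2, cov=V1 + V2).logpdf(m1)
assert np.isclose(lhs, rhs)

Averaging the exponential of this quantity over the saved Gibbs draws, as in equation (2.76), gives the Monte Carlo estimate Îsub for the logistic regression model.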
Figure 2.9: Flights dataset (n = 327,346) and fitted logistic model. (a) Number of flights against departure delay (minutes). (b) Probability of late arrival given departure delay. Points are observed data; the blue solid line is the posterior mean from the logistic regression model. In (a) we see that the distribution of departure delay has a mode slightly above zero, and that the majority of flights depart late. In (b) we see a logistic relationship between departure delay (minutes) and the empirical probability of late arrival at the destination.

Model 1 is a pooled model where all carriers are modelled identically. Model 2 is a completely unpooled model where each carrier is given a unique intercept and slope. We used independent N(0, 1) priors on the coefficients. In panel (b) of Figure 2.9 we plot the posterior predictive mean fit of model 1 as a dashed line. The posterior predictive mean for a new response given covariates x_new is obtained by integrating over the posterior distribution of the coefficients,

E[y_new | x_new, M] = ∫ p(y_new = 1 | x_new, β, M) p(β | X, y, M) dβ. (2.79)

The fit of M1 to the observed data appears to be extremely good. There is some deviation from the theoretical proportion for very early flights, say with a departure delay of less than -20 minutes. However, we have very few data points in this range (refer back to panel (a)), so it seems reasonable to see this level of noise in (b). Figure 2.10 plots the probability of late arrival against departure delay for each carrier (airline) in the dataset. The blue dashed line is the fitted mean from the pooled model (M1). The goodness of fit varies over airlines, with a noticeable deficiency in the pooled model for the carriers FL and MQ. The graphical diagnostic in panel (b) of Figure 2.9 did obscure some deficiencies in the model. When averaging over carriers, model 1 looks very good, but when looking at the quality of the fit for individual carriers we see some systematic problems. Model 2 allows for more flexibility. The posterior mean fit for model 2 is plotted as a solid red line. The fit is noticeably better, but requires the addition of 30 parameters. We computed the integrated likelihood using the divide and conquer strategy with s = 5 subsets. Datasets were split uniformly at random. The log Bayes factor in favour of model 2 is 2353. There is very strong support for modelling carriers separately.

2.7.2 Pima Indians dataset

We also analyse the Pima Indians diabetes dataset (Venables and Ripley, 2002). This is not a Big Data application, as n = 532; the idea is to illustrate how the initial data split affects the Monte Carlo variance of Îsub on a benchmark dataset. We take diabetes status as the response, and glucose and body mass index as the covariates. Let yj denote a binary response, where yj is equal to one if subject j has diabetes and is zero otherwise, for j = 1, . . . , n. Let xj,1 represent glucose level and xj,2 represent body mass index for subject j.
Figure 2.10: Comparison of the pooled logistic regression model to the unpooled logistic regression model on the flights dataset, with one panel per carrier. The blue dashed line shows the posterior mean fit from the pooled model. The red solid line shows the posterior mean fit from the unpooled model. There appears to be heterogeneity across carriers. The unpooled model gives a better individual fit to each carrier.

Given covariates xj = (xj,1, xj,2)^T, yj is modelled as Bernoulli(σ(ηj)), where ηj = β0 + xj,1 β1 + xj,2 β2 and σ(ηj) = 1/(1 + exp(−ηj)). We compare two partitions of the full dataset y = (y1, . . . , yn) into s = 2 subsets y1 and y2. The first split is made uniformly at random and is shown in Figure 2.12. The second split oversamples cases by a factor of 10 in the first subset; the biased split of the dataset is shown in Figure 2.13. In each panel we plot a decision boundary from the respective analysis. Let x = (x1, x2)^T represent a vector of covariates. The displayed decision boundary is given by the hyperplane x^T w + α = 0, where w and α are determined using the posterior distribution in the full dataset results and the subposterior distributions in the subset results. In the full dataset analysis we obtain the slope coefficients w = E[(β1, β2)^T | y] and the intercept α = E[β0 | y], where the expectation is over the posterior distribution p(β0, β1, β2 | y). For the subset results we set w = E[(β1, β2)^T | yi] and the intercept α = E[β0 | yi], where the expectation is over the subposterior distribution p̃(β0, β1, β2 | yi) for i = 1, 2. The subset decision boundaries are influenced by the subset specific case proportion in yi. In Figure 2.12 the decision boundaries in the subset analyses are similar to the decision boundary in the full dataset analysis, so the subset results are in line with the full dataset results. In contrast, in Figure 2.13 the subset decision boundaries are different to the decision boundary from the full dataset analysis; in the subset 2 results the decision boundary is out of the range of the panel. The biased split has caused the subset analyses to be inconsistent with the full dataset analysis. This will have a flow-on impact on the subposterior distributions of the latent variables.

Figure 2.11: Comparison of subposterior and target posterior distributions on the latent variables for the Pima Indians dataset. The solid black line represents the target posterior distribution on the latent variable p(zij | X, y) and the dashed red line gives the subposterior distribution p̃(zij | X(i), yi) for the subjects listed in Table 2.6. The disparity between the subposterior and target posterior distributions is greater under the biased split than under the uniform split. This disparity is directly related to the variance of the subposterior integral estimator (2.70).
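For reference, the two partitions compared above could be generated along the following lines. This is only a sketch: the variable names are hypothetical, and weighted sampling without replacement is one of several ways to oversample cases by a factor of 10 in the first subset.

import numpy as np

rng = np.random.default_rng(0)

def uniform_split(n, s=2):
    # assign each of the n observations to one of s subsets uniformly at random
    return rng.permutation(np.arange(n) % s)

def biased_split(y, weight=10.0):
    # two-way split in which cases (y == 1) are given ten times the weight of
    # controls when sampling the members of the first subset; the rest form the second
    n = len(y)
    w = np.where(y == 1, weight, 1.0)
    in_subset1 = rng.choice(n, size=n // 2, replace=False, p=w / w.sum())
    labels = np.ones(n, dtype=int)
    labels[in_subset1] = 0
    return labels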
We ran Algorithm 2.2 with B = 1000 for both partitions one hundred times, and then compared the variance of log Îsub. The variance of log Îsub is a function of the variance of the importance weights

p(z1, . . . , zs)/p̃(z1, . . . , zs). (2.80)

The subposterior distribution p̃(z1, . . . , zs) is expected to be very different to the target distribution p(z1, . . . , zs) under the biased split, and more similar to the target distribution under the uniform split. The biased split affects the estimate of the intercept parameter β0 in each subset. This in turn affects the subposterior distribution over the latent variables. Figure 2.11 compares the subposterior distributions on the latent variables to the target distribution for each partition. Results are shown for three observations, described in Table 2.6. In Figure 2.11, the solid line gives the target posterior distribution p(zij | y1, y2) for the three subjects of interest. The dashed line gives the subposterior distribution p̃(zij | yi) for each subject. As predicted, the biased split causes p̃(zij | yi) to be very different from p(zij | y1, y2), whereas the uniform split results in p̃(zij | yi) being very similar to p(zij | y1, y2). As expected, the Monte Carlo variance of log Îsub is very different under each split. Figure 2.14 shows boxplots of the estimates of log Îsub obtained in the combine step over the one hundred replications of Algorithm 2.2. The results under the biased split are more variable than the results under the uniform split: the standard deviation under the biased split is 3.04, while the standard deviation under the uniform split is 0.032. The initial partition in the split step has a large influence on the Monte Carlo error in the combine step.

Figure 2.12: Uniform split of the Pima Indians dataset, showing glucose against body mass index for the full dataset, subset 1 and subset 2. The solid line gives the decision boundary using the (sub)posterior mean of β = (β0, β1, β2)^T in each dataset. Red crosses denote cases and black circles denote controls.
The decision boundaries are similar across each analysis. The consistency in the results suggests the evidence synthesis in the combine step will be relatively congenial.

Figure 2.13: Biased split of the Pima Indians dataset, showing glucose against body mass index for the full dataset, subset 1 and subset 2. The solid line gives the decision boundary using the (sub)posterior mean of β = (β0, β1, β2)^T in each dataset. Red crosses denote cases, and black circles denote controls. The decision boundaries are dissimilar over the analyses. The deviation in the results suggests the evidence synthesis in the combine step will be comparatively challenging.

Table 2.6: Raw data for subjects shown in Figure 2.11.
subject   type   glu   bmi
266       No     106   30.50
327       No     78    36.90
343       No     157   25.60

Figure 2.14: Distribution of log Îsub (boxplots of log Îsub centred at its mean) under the biased split and the uniform split on the Pima Indians dataset. There is greater variance under the biased split compared to the uniform split. The initial partition in the split step influences the Monte Carlo variance in the combine step.

2.8 Conclusion

Parallel processing is a compelling computational pathway for scalable Bayesian inference. Distributed computing platforms facilitate a divide and conquer approach where the Big Data analysis problem can be broken down into a series of smaller conventional analyses. This idea has been explored for posterior simulation. We have investigated divide and conquer Bayesian model selection and found that model selection requires different theory and methods. We have considered two different algorithms that fit the mould prescribed by Figure 2.1. We initially considered a general approach that is applicable to any parametric model. MCMC can be used in the apply stage, with the output then consisting of the posterior draws. The combine step can then be approached as a Bayesian hypothesis testing problem. The posterior output can then be used to estimate a Savage-Dickey density ratio that checks for the global suitability of the model over all subsets.
The curse of dimensionality poses a major hurdle for this algorithm in the combine step. The dimension of the density estimation task increases with the number of subsets s, thus limiting the scalability of Algorithm 2.1. If the original model can be augmented with suitable latent variables, it is possible to obtain a practical algorithm. We propose an embarrassingly parallel algorithm for computing the model evidence that makes use of Gibbs sampling in the apply stage to ease the difficulty of the combine step. The subset Gibbs runs in the apply phase can be undertaken using simple modifications of standard update formulae. The combine step can easily be carried out by aggregating the Gibbs output. Interestingly, the initial split of the dataset strongly influences the Monte Carlo variance in the combine step. If the initial split is very poor, the results in the combine stage will be useless. As such it may be worth exploring adaptive procedures that optimise the data split after a preliminary Gibbs run.

The split-apply-combine methodology using data augmentation may also be of use for posterior sampling. Using the divide and conquer procedure, we generate samples from the subposterior distribution of the latent variables p̃(z1, . . . , zs | y1, . . . , ys). If we can determine the importance ratio p(z1, . . . , zs | y1, . . . , ys)/p̃(z1, . . . , zs | y1, . . . , ys) up to a constant of proportionality, we can use the subposterior samples in a self-normalised importance sampler to target the true posterior. Equation (2.69) showed that the subposterior integral is proportional to the required importance sampling weights:

∫_Ω ∏_{i=1}^{s} p̃(θ | yi, zi) dθ = Isub × p(z1, . . . , zs | y1, . . . , ys) / p̃(z1, . . . , zs | y1, . . . , ys) ∝ p(z1, . . . , zs | y1, . . . , ys) / p̃(z1, . . . , zs | y1, . . . , ys).

Given that we have a closed form expression for the augmented subposterior integral,

∫_Ω ∏_{i=1}^{s} p̃(θ | yi, zi) dθ = [ ∏_{i=1}^{s} c(ν0/s + ni, φ0/s + t(yi, zi), ω0/s) ] / c(ν0 + n, φ0 + Σ_{i=1}^{s} t(yi, zi), ω0),

we can compute the required importance sampling weights. We can then use self-normalised importance sampling in the combine stage to target the full dataset posterior p(θ | y1, . . . , ys). Existing divide and conquer procedures have been criticised for not targeting the exact posterior (Bardenet et al., 2017). Using the Gibbs based methodology here we can target the exact posterior, provided that the model has some conditionally conjugate structure. This is a promising future research direction.

Calculation of the model evidence is an important task, especially so when given masses of data with multiple plausible models. Distributed computing appears to be useful in the pursuit of this goal.

3 Bounding the model evidence using the subsampled sandwich estimator

Summary

We investigate computational methods for estimation of the integrated likelihood on large n datasets. Existing importance sampling techniques can be impractical due to the need for full likelihood evaluations. We develop a novel strategy for estimating the log model evidence using subsampling. We propose to pair a variational lower bound with an upper bound formed using a maximum entropy argument. The enclosing bounds are asymptotically tight as n grows, and permit a reduction in the computational budget. The bounds can be estimated in a computationally efficient manner using subsampling and control variates.
The sandwich approach provides more definitive error measures than other methods for approximating the integrated likelihood. We demonstrate the methods on a large logistic regression dataset. 3.1 Introduction The computational expense of likelihood evaluations is the root cause of many challenges in Bayesian computation for Big Data. Standard iterative algorithms can struggle with large datasets as the O(n) cost per loop stalls the procedure from a practical point of view. A generic goal is to reduce computational cost by allowing for access of only a small minibatch of data per iteration. Most existing algorithms for estimating the integrated likelihood suffer from the likelihood scaling issue, becoming impractical for huge n. We develop a novel strategy for estimating the model evidence on tall datasets, based on sandwiching the log integrated likelihood between upper and lower bounds. Squeezing the log integrated likelihood between encasing bounds is a departure from traditional importance sampling methods that deliver point estimates of the model evidence. Shifting to interval estimation is useful as it eases the integration of subsampling into the algorithm. We find it is necessary to use control variates to obtain a scalable algorithm, a common finding in the Big Data literature. Subsampling has been examined as a technique to reduce the computational cost of Bayesian inference in the fixed model setting, where the primary goal is to generate samples from the posterior distribution (Maclaurin and Adams, 2014; Quiroz et al., 2018). Given a likelihood p(y|θ) and prior p(θ), suppose we wish to sample from the posterior distribution p(θ|y) ∝ p(y|θ)p(θ). Markov chain Monte Carlo (MCMC) has proven to be a versatile computational engine for applied Bayesian inference, facilitating sampling from the posterior p(θ|y) in complex models. As a simple example, consider the Metropolis-Hastings algorithm, described in Algorithm 3.1. The calculation of the acceptance ratio α(θ,θ′) in line 6 requires the full likelihood p(y|θ). The O(n) cost per iteration can render the algorithm impractically slow for Big Data applications. There is a body of work exploring the integration of subsampling into the pseudo- marginal MCMC framework of Andrieu et al. (2009) to address this likelihood cost. The innovation behind the pseudo-marginal idea is that if the likelihood calculations in the acceptance ratio are replaced 45 3. Computing the model evidence using subsampling Algorithm 3.1 Metropolis-Hastings for simulating from the posterior distribution 1: Input θ[0] 2: for b = 1, . . . , B do 3: θ ← θ[b−1] 4: θ′ ∼ g(·|θ) . Sample new proposal from g 5: 6: α(θ,θ′)← p(y|θ ′)p(θ′) p(y|θ)p(θ) × g(θ|θ′) g(θ′|θ) . acceptance ratio calculation 7: 8: u ∼ Uniform(0, 1) 9: if u < α(θ,θ′) then 10: θ[b] ← θ′ . Accept proposal 11: else 12: θ[b] ← θ . Reject proposal 13: end if 14: end for 15: return { θ[1], . . . ,θ[B] } with a non-negative unbiased estimator p̂(y|θ), it is still possible to define a Markov chain that maintains the correct target distribution p(θ|y) ∝ p(θ)p(y|θ). The pseudo-marginal algorithm for tall datasets is described in Algorithm 3.1. The pseudo-marginal algorithm uses an approximate acceptance ratio in line 8. The key point of difference relative to Algorithm 3.1, is that an estimated likelihood now appears in the numerator of the acceptance ratio. 
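A minimal, runnable rendering of Algorithm 3.1 with a symmetric random-walk proposal is given below; the toy logistic regression target and the tuning constants are illustrative assumptions rather than part of the algorithm itself.

import numpy as np

def metropolis_hastings(log_post, theta0, B=5000, step=0.1, rng=None):
    # Random-walk Metropolis-Hastings: with a symmetric Gaussian proposal the
    # proposal ratio g(theta | theta') / g(theta' | theta) in Algorithm 3.1 equals one.
    rng = rng or np.random.default_rng(0)
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    draws = np.empty((B, theta.size))
    for b in range(B):
        prop = theta + step * rng.standard_normal(theta.size)
        lp_prop = log_post(prop)
        # accept with probability min(1, p(y | theta') p(theta') / (p(y | theta) p(theta)))
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        draws[b] = theta
    return draws

# toy target: logistic regression with independent N(0, 1) priors (illustrative only)
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 2))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ np.array([1.0, -0.5]))))

def log_post(beta):
    eta = X @ beta
    return np.sum(y * eta - np.log1p(np.exp(eta))) - 0.5 * np.sum(beta ** 2)

samples = metropolis_hastings(log_post, np.zeros(2))

Every evaluation of log_post touches all n observations, which is precisely the O(n) per-iteration cost discussed above.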
An important detail of the pseudo-marginal algorithm is that previous likelihood estimates need to be recycled if the proposed value θ′ is rejected (see line 16). There has been a large amount of effort in designing almost-surely non-negative unbiased estimators of the likelihood that use subsampling (Bardenet et al., 2017; Quiroz et al., 2016). Although the per iteration cost decreases when using a subsampled pseudo-marginal algorithm compared the the standard Metropolis-Hastings implementation, the autocorrelation in the chain increases with the variance of the likelihood estimator. A highly variable likelihood estimator can potentially reduce the effective sample size per unit time relative to the standard Metropolis-Hastings algorithm (Maclaurin and Adams, 2014). Algorithm 3.2 Pseudo-marginal Metropolis-Hastings using subsampling 1: Input θ[0], p(y|θ)[0] 2: for b = 1, . . . , B do 3: θ ← θ[b−1] 4: p̂(y|θ)← p̂(y|θ)[b−1] 5: θ′ ∼ g(·|θ) . Sample new proposal from g 6: Estimate likelihood p̂(y|θ′) . Use subsample of full dataset 7: 8: α̂(θ,θ′)← p̂(y|θ ′)p(θ′) p̂(y|θ)p(θ) × g(θ|θ′) g(θ′|θ) . Acceptance ratio calculation 9: 10: u ∼ Uniform(0, 1) 11: if u < α̂(θ′,θ) then 12: θ[b] ← θ′ . Accept proposal 13: p̂(y|θ)[b] ← p̂(y|θ′) . Update likelihood estimate 14: else 15: θ[b] ← θ . Reject proposal 16: p̂(y|θ)[b] ← p̂(y|θ)[b−1] . Retain likelihood estimate 17: end if 18: end for 19: return { θ[1], . . . ,θ[B] } Subsampling has also been explored as a technique to reduce the per epoch cost of other stochas- tic simulation methods. Stochastic gradient Langevin dynamics (Welling and Teh, 2011) bypasses the accept reject step entirely, the iterative algorithm uses subsampled estimates of the gradient of the log 46 3.1. Introduction likelihood for computationally cheap individual steps. The Zig-Zag sampler (Bierkens et al., 2016) and the scalable Langevin exact sampler (Pollock et al., 2016) are both based on discretisations of continuous time stochastic processes and use subsampled estimates of the gradient of the log likelihood. Unbiased estimation of the log scale is significantly easier than unbiased estimation of likelihoods, and this is seen as a competitive advantage of non pseudo-marginal based methods (Bardenet et al., 2017). The mechanics of the aforementioned large n samplers are varied and technical and will not be reviewed here. We will proceed assuming that we have a large dataset and have obtained samples from the posterior p(θ|y) using a suitable algorithm. The task of interest is to estimate the integrated likelihood p(y) = ∫ p(y|θ)p(θ) dθ. Given posterior samples, it is common to use importance sampling based methods for estimating the integrated likelihood p(y) (Friel and Wyse, 2012). Importance sampling methods for computing the evidence typically require repeated additional full likelihood evaluations, making them unsuited to large n problems. This is a roadblock for carrying out Bayesian model selection on tall datasets. We will argue that it is difficult to adapt existing importance sampling methods to use subsampling effectively. Faster methods for approximating the integrated likelihood include the Laplace approximation, the Laplace-Metropolis estimator, and variational Bayesian methods. Although computationally cheap, it is difficult to bound the approximation error of these faster methods. The emergence of tall datasets has led to a shift in the procedures used for posterior sampling. Estimation of the model evidence may also require a similar paradigm shift. 
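The skeleton below mirrors the structure of Algorithm 3.2. The function log_lik_hat is an assumed user-supplied routine returning the log of a nonnegative unbiased estimate of p(y|θ) built from a subsample (for example via the estimators of Section 3.3.2); the structural point is that the stored estimate is recycled whenever a proposal is rejected.

import numpy as np

def pseudo_marginal_mh(log_lik_hat, log_prior, theta0, B=5000, step=0.1, rng=None):
    # Pseudo-marginal Metropolis-Hastings with a symmetric random-walk proposal.
    # log_lik_hat(theta, rng) must return the log of a nonnegative unbiased
    # estimate of the likelihood p(y | theta).
    rng = rng or np.random.default_rng(0)
    theta = np.asarray(theta0, dtype=float)
    log_lik = log_lik_hat(theta, rng)              # initial estimate
    draws = np.empty((B, theta.size))
    for b in range(B):
        prop = theta + step * rng.standard_normal(theta.size)
        log_lik_prop = log_lik_hat(prop, rng)      # fresh estimate at the proposal only
        log_alpha = (log_lik_prop + log_prior(prop)) - (log_lik + log_prior(theta))
        if np.log(rng.uniform()) < log_alpha:
            theta, log_lik = prop, log_lik_prop    # accept: keep the new estimate
        # on rejection the current theta and its stored likelihood estimate are retained
        draws[b] = theta
    return draws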
Historically, estimation of the integrated likelihood is performed using a combination of posterior simulation and asymptotic approximation (DiCiccio et al., 1997). We also use these techniques but with the large n dynamics of the posterior distribution in mind. Underlying our approach is a less commonly used representation of the log model evidence. The log model evidence log p(y) can be written as an information criterion type score involving a goodness of fit term and a penalty term for model complexity. Starting from Bayes’ theorem, we easily obtain the so called ‘Candidiate’s formula’ an important identity relating the model evidence to the posterior and prior densities (Besag, 1989). For any ordinate θ ∈ Ω we have the identity : log p(y) = log p(y|θ) + log p(θ)− log p(θ|y) (3.1) = log p(y|θ)− log p(θ|y) p(θ) . (3.2) Taking the Candidate’s formula (3.2) one step further, we can integrate both sides over the posterior distribution to obtain the following decomposition of the log model evidence:∫ log p(y)p(θ|y) dθ = ∫ p(θ|y) log p(y|θ) dθ − ∫ p(θ|y) log p(θ|y) p(θ) dθ. log p(y) = Ep(θ|y)[log p(y|θ)]−D(p(θ|y) ‖ p(θ)). (3.3) The identity (3.3) is not an original result, however it seems to have attracted very little attention in the existing literature on Bayesian model selection (Robert, 2007; Gelman et al., 2014; Kass and Raftery, 1995; Gelfand and Dey, 1994; Raftery, 1995). The first term Ep(θ|y)[log p(y|θ)] is the expected log likelihood over the posterior distribution. This can be characterised as a Bayesian measure of goodness of fit (Dempster, 1997; Spiegelhalter et al., 2002). The second term D(p(θ|y) ‖ p(θ)) represents the Kullback- Leibler divergence of the prior from the posterior. This can be seen as a penalty term that captures model complexity and prior regularisation simultaneously. We propose to estimate the goodness of fit term using subsampling, and to bound the penalty term using elementary information theory. The goodness of fit term and the bounds on the penalty can be estimated using a posterior samples and subsampled evaluations of the log likelihood. Overall we obtain a computationally efficient interval estimator of the log model evidence that avoids repeated O(n) likelihood evaluations. Asymptotic arguments suggest that the evidence bounds may be sufficient to rank competing models confidently in the large n regime. The general idea of introducing bounded error in order to drastically cut the computational budget has also 47 3. Computing the model evidence using subsampling been explored for posterior simulation (Bardenet et al., 2014; Korattikara et al., 2014), and we feel it is natural to follow this avenue for estimation of the model evidence. The sandwich approach reduces the computational burden of estimating the integrated likelihood while still giving confidence measures on the quality of the estimate. Section 3.2 reviews asymptotic properties of Bayesian model selection and develops the upper and lower bounds on the penalty term. Section 3.3 reviews how subsampling can be used to estimate likelihoods and recaps existing methods for estimating the model evidence. Section 3.4 discusses the utility of our proposed evidence bounds for tall datasets. We test our proposed methodology in a logistic regression data application in Section 3.5. 3.2 Bayesian model selection Suppose that we have M competing parametric models M1, . . . ,MM . Each model j = 1, . . . ,M has parameter θj which takes values in parameter space Ωj of dimension pj . 
We assume the data y = (y1, . . . ,yn) consists of n independently and identically distributed observations from some distribution f0. We take the M-open viewpoint, in that the true generative model f0 may not be in the set of candidate models M1, . . . ,MM . Given two models Mj and Mk, let logBjk denote the log Bayes factor in favour of model j: logBjk = log p(yi|Mj)− log p(y|Mk). (3.4) The asymptotic behaviour of logBjk is of direct interest given our focus on tall data. For each model Mj we can define a measure of closeness to the true model f0 using the Kullback-Leibler divergence. Let dKL(f0, g) denote the Kullback-Leibler divergence of a candidate density g from the true generate model f0. Formally, dKL(f0, g) = ∫ f0 log f0 g . For each model Mj we can find a density p(yi|θ∗j ,Mj) such that θ∗j satisfies θ∗j := arg min θj∈Ωj dKL(f0, p(yi|θj ,Mj)). (3.5) The density p(yi|θ∗j ) is the closest density to the true model f0 in the parametric family associated with modelMj . We can characterise this as the best approximation to the truth attainable under modelMj . Let djKL denote the divergence from the truth for model j, that is djKL = dKL(f0, p(yi|θ∗j )). (3.6) Under regularity conditions, as n tends to infinity, the limiting Bayes factors between two competing models j and k will be controlled by djKL and d k KL. The asymptotic behaviour of the posterior distribution on models can be described by the asymptotic behaviour of the set of Bayes factors. In the M-open viewpoint pragmatic consistency of the posterior distribution amounts to ensuring that the posterior distribution concentrates on the best possible model in the candidate set. Bayes factor consistency is described more formally in Definition 3.1. Definition 3.1 (Pragmatic Bayes Factors Consistency (Walker et al., 2004)). Consider two models Mj and Mk. Model j has pj parameters and Model k has pk parameters. Let djKL and dkKL be defined as in (3.6). We say that the Bayes factors achieve pragmatic consistency if they exhibit the following asymptotic behaviour as n→∞ for all j, k ∈ {1, . . . ,M}. (a) Suppose that djKL < d k KL. Then we require that logBjk →∞ (a) Suppose that djKL > d k KL. Then we require that logBjk → −∞ 48 3.2.1. Evidence bounds 2 logBjk Bjk Strength of evidence 0 to 2 1 to 3 Not worth more than a bare mention 2 to 6 3 to 20 Positive 6 to 10 20 to 150 Strong >10 >150 Very strong Table 3.1: Guidelines for the interpretation of Bayes factors (Kass and Raftery, 1995). Given two modelsMjand Mk, the quantity Bjk gives the Bayes factor in favour of model j (recall equation (3.4)). (a) Suppose that djKL = d k KL and model j is nested in model k, so pj < pk. Then we require that logBjk →∞. The case of non-nested models with djKL = d k KL is more complicated, and there does not seem to be a general theory for this situation (Casella et al., 2009). We assume that this we do not encounter this problem for the sake of simplicity. Pragmatic Bayes factor consistency requires that asymptotically all Bayes factors tend to negative infinity or positive infinity. Asymptotically, the posterior distribution of models concentrates tightly on the best model in the candidate set, even under misspecification. Under relatively mild conditions it is possible to show that pragmatic Bayes factor consistency holds for a wide range of problems (Chib and Kuffner, 2016; Chatterjee et al., 2018). Key references for nested models include Schwarz (1978) and Gelfand and Dey (1994). 
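A toy illustration of this large n behaviour is the conjugate normal-mean problem, where the model evidence is available exactly through the Candidate's formula (3.1), so the growth of the log Bayes factor with n can be computed directly; here the log Bayes factor reduces to a Savage-Dickey style ratio of the prior and posterior densities at zero. The simulation settings below are arbitrary assumptions.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def log_bf_21(y, tau2=1.0):
    # log Bayes factor for M2: y_i ~ N(mu, 1), mu ~ N(0, tau2) against M1: y_i ~ N(0, 1).
    # The Candidate's formula at the ordinate mu = 0 gives
    #   log p(y | M2) = log p(y | mu = 0) + log p(mu = 0) - log p(mu = 0 | y),
    # and the log p(y | mu = 0) term cancels against log p(y | M1).
    n = len(y)
    post_var = tau2 / (1.0 + n * tau2)
    post_mean = tau2 * y.sum() / (1.0 + n * tau2)
    return norm.logpdf(0.0, 0.0, np.sqrt(tau2)) - norm.logpdf(0.0, post_mean, np.sqrt(post_var))

# a small mean shift favours M2, and the log Bayes factor grows steadily with n
y_full = rng.normal(0.1, 1.0, size=100_000)
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(log_bf_21(y_full[:n]), 1))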
We proceed assuming that the collection of models of interest and data generating process is such that pragmatic Bayes factor consistency holds. The practical interpretation of the consequences of Definition 3.1 is that we expect to see very large Bayes factors when comparing models in the large n regime. Under mild conditions the gaps between the log evidence scores are expected to widen as n grows. From an inferential point of view, this large sample dynamic needs to be taken into account when interpreting Bayes factors on tall datasets. Kass and Raftery (1995) present a table providing guidelines on the interpretation of Bayes factor, reproduced here as Table 3.1. Bayes factors and p-values do have some commonality in terms of large sample behaviour. The American Statistical Association’s statement on p-values drew attention to the fact that it is important to consider the sample size when weighing the evidence provided by a small p-value (Wasserstein and Lazar, 2016). When n is large, practically insignificant deviations from the null hypothesis can lead to very small p-values. This same phenomenon will affect Bayes factors. The log Bayes factors for two competing models are expected to tend to −∞ or ∞ as n increases, even when the two models differ in an practically inconsequential manner. For tall datasets models may become separated by larger factors than what is considered in Table 3.1. As n increases, exact calculation of the integrated likelihood becomes more computationally demand- ing. This problem is not insurmountable as the asymptotic behaviour of the posterior distribution on models works in our favour, assuming that we take Definition 3.1 as realistic. As sample sizes increase we expect to see wider margins between the log evidence values within a collection of models. From a decision theoretic point of view, exact calculation of the integrated likelihood becomes less necessary as n increases. We can still rank models confidently given interval estimates of the integrated likelihood if the interval widths are small compared to the true differences in Bayes factors. This large sample behaviour provides some leeway to carry out computationally efficient Bayesian model selection on tall datasets. This is assumption is used to motivate our evidence sandwich approach. 3.2.1 Evidence bounds An important assumption for our bound to hold is that the parameter θ has unconstrained support in Rd. It may be necessary to reparameterise the original statistical model in order to meet this condition. 49 3. Computing the model evidence using subsampling For example, variance components and proportions can be mapped to the real line using the log and logistic transform respectively. We believe this assumption to not be overly restrictive, as a wide range of statistical models can be transformed to an unconstrained parameterisation. The probabilistic pro- gramming language Stan (Carpenter et al., 2017) internally transforms all user declared models to an unconstrained parameterisation in order to improve the efficiency of the underlying Hamiltonian Monte Carlo algorithms. The default transformations are listed in the Stan manual (Stan Development Team, 2018). From here on in we assume that θ has unconstrained support in Rd. 3.2.2 Entropy The differential entropy represents the amount of uncertainty associated with a continuous random vector. 
For a continuous random vector X with probability density p(x), the differential entropy is defined as differential entropy of X = − ∫ p(x) log p(x) dx. (3.7) The differential entropy is not bounded, taking values in (−∞,∞). The differential entropy of a multi- variate normal random variable is a simple function of the covariance matrix. Lemma 3.1. Let X be a d-dimensional multivariate normal random vector X ∼ N(µ,Σ), where Σ is of full rank. The differential entropy of X is given by d 2 + d 2 log (2pi) + 1 2 log(det(Σ)). The result is easily established using properties of quadratic forms. The determinant of the covariance matrix has an interpretation as a scalar measure of the overall variability associated with a multidimen- sional random vector. The multivariate normal distribution has a maximum entropy property for random vectors in Rd. Theorem 3.1. Let X be a random vector in Rd with mean µ and full rank covariance matrix Σ. The maximum entropy probability distribution amongst all candidate distributions for X with mean µ and covariance matrix Σ is the multivariate normal distribution N(µ,Σ). Proof: Let q(x) denote some distribution on Rd with mean µ and covariance matrix Σ. Let p(x) = N(µ,Σ). Let H(q) denote the differential entropy of q(x) and let H(p) denote the differential entropy of p(x). Let D(q(x) ‖ p(x)) denote the Kullback-Leibler divergence of p from q. The Kullback-Leibler divergence is positive so 0 ≤ D(q(x) ‖ p(x)) = ∫ q(x) log q(x) p(x) dx = ∫ q(x) log q(x) dx− ∫ q(x) log p(x) dx = −H(q)− ∫ q(x) ( −d 2 log(2pi)− 1 2 log(det(Σ))− 1 2 (x− µ)TΣ−1(x− µ) ) dx = −H(q) + d 2 log(2pi) + 1 2 log(det(Σ)) + d 2 = −H(q) +H(p). The step in the second last line follows from properties of quadratic forms and the fact that q(x) has the same first and second moments as p(x). The final line establishes the desired result that H(p) ≥ H(q). As D(q(x) ‖ p(x)) = 0 if and only if q(x) is a multivariate normal distribution, H(q) = H(p) if and only if q(x) = N(µ,Σ). 50 3.2.3. Upper bounding the evidence It follows from Theorem 3.1 that given some random d-dimensional vector X with mean µ, full rank covariance matrix Σ and unknown probability density q(x), the differential entropy has the upper bound, − ∫ q(x) log q(x) dx ≤ d 2 + d 2 log (2pi) + 1 2 log(det(Σ)). (3.8) Equality holds if and only if X ∼ N(µ,Σ). We now show how the entropy bound can be used to approximate the model evidence. 3.2.3 Upper bounding the evidence The penalty term in (3.3) involves a posterior expectation and the negative posterior entropy: D(p(θ|y) ‖ p(θ)) = p(θ|y) log p(θ|y)− Ep(θ|y) [log p(θ)] . (3.9) We can lower bound the penalty term by using the entropy upper bound (3.8). Let Σθ|y be the posterior covariance matrix of θ. Throughout we make the mild assumption that Σθ|y is of full rank. The entropy bound (3.8) gives that∫ p(θ|y) log p(θ|y) dθ ≥ d 2 + d 2 log (2pi) + 1 2 log(det(Σθ|y)). The bound will be very tight when the posterior distribution of θ is approximately normal. Posterior normality is a plausible assumption when the sample size is large (Van Der Vaart, 1998, Chapter 10). We subsequently obtain the lower bound on the penalty term, D(p(θ|y) ‖ p(θ)) ≥ d 2 + d 2 log (2pi) + 1 2 log(det(Σθ|y))− Ep(θ|y) [log p(θ)] . (3.10) We now upper bound the penalty term. 3.2.4 Lower bounding the evidence Variational Bayesian inference is a generic family of methods for approximate Bayesian inference (Blei et al., 2017). 
Variational Bayesian methods introduce an approximate posterior distribution q(θ) where q(θ) is chosen to maximise a lower bound on the log model evidence log p(y). An important identity is that for any distribution q(θ), log p(y) = Eq(θ) [log p(y|θ)]−D(q(θ) ‖ p(θ)) +D(q(θ) ‖ p(θ|y)). (3.11) The proof involves a similar line of reasoning that we used to arrive at (3.3). Starting from the ‘Candi- date’s formula’ (3.2) we integrate over the arbitrary distribution q(θ). log p(y) = log p(y|θ) + log p(θ) p(θ|y) = ∫ q(θ) [ log p(y|θ) + log p(θ) p(θ|y) ] dθ = Eq(θ) [log p(y|θ)] + Eq(θ) [log p(θ)] + ∫ q(θ) log 1 p(θ|y) dθ = Eq(θ) [log p(y|θ)] + Eq(θ) [log p(θ)] + ∫ q(θ) log 1 p(θ|y) dθ+∫ q(θ) log q(θ) dθ − ∫ q(θ) log q(θ) dθ = Eq(θ) [log p(y|θ)] + Eq(θ) [log p(θ)]− Eq(θ) [log q(θ)] + ∫ q(θ) log q(θ|y) p(θ) dθ = Eq(θ) [log p(y|θ)] + Eq(θ) [log p(θ)]− Eq(θ) [log q(θ)] +D(q(θ) ‖ p(θ|y)) (3.12) = Eq(θ) [log p(y|θ)]−D(q(θ)‖p(θ)) +D(q(θ) ‖ p(θ|y)). (3.13) 51 3. Computing the model evidence using subsampling Now as D(q(θ) ‖ p(θ|y)) is positive, dropping the term from (3.13) gives a lower bound on the log model evidence. log p(y) ≥ Eq(θ) [log p(y|θ)]−D(q(θ) ‖ p(θ)). (3.14) Now substituting in (3.3) Ep(θ|y)[log p(y|θ)]−D(p(θ|y) ‖ p(θ)) ≥ Eq(θ) [log p(y|θ)]D(q(θ) ‖ p(θ)). This then gives an upper bound on the penalty term: D(p(θ|y) ‖ p(θ)) ≤ Ep(θ|y)[log p(y|θ)]− Eq(θ) [log p(y|θ)] +D(q(θ) ‖ p(θ)). (3.15) The upper bound on the penalty term will be tight when D(q(θ) ‖ p(θ|y)) is small. This is the same gap that appears in the usual variational evidence lower bound. 3.2.5 Sandwiching the evidence Substituting the lower bound on the penalty term (3.10) into (3.3) gives an upper bound on the evidence: log p(y) ≤ Ep(θ|y) [log p(y|θ)] + Ep(θ|y) [log p(θ)] + d 2 + d 2 log (2pi) + 1 2 log(det(Σθ|y)). (3.16) Substituting the upper bound on the penalty term (3.10) into (3.3) gives the usual variational lower bound on the model evidence log p(y) ≥ Eq(θ) [log p(y|θ)] + Eq(θ) [log p(θ)]− Eq(θ) [log q(θ)] . (3.17) If the posterior distribution is normally distributed, and we use a Gaussian variational distribution both the upper and lower bounds will be tight. As the integrated likelihood is invariant to reparameterisations it may be worth searching for effective normalising transforms of θ. We will argue that subsampling can be used to estimate the bounds efficiently. Before developing our methodology further we review important existing related work. 3.3 Related work In this section we review on subsampling for tall datsets and existing methods for calculating the in- tegrated likelihood. As mentioned in the introduction, the pseudo-marginal MCMC algorithm requires unbiased estimation of the likelihood from subsamples. Existing work builds an unbiased estimator, or asymptotically unbiased estimator of the likelihood from unbiased estimators of the log likelihood (Quiroz et al., 2016; Bardenet et al., 2017; Quiroz et al., 2018). We focus on Monte Carlo estimators of the integrated likelihood that can be implemented given posterior samples and some generic closed form approximations. More sophisticated estimators of the integrated likelihood such as nested sampling (Skilling et al., 2006) or the power posterior method (Friel and Pettitt, 2008) are not considered here as they require more specialised MCMC implementations. 3.3.1 Subsampled log likelihoods Unbiased estimation of the likelihood using subsamples is difficult as the likelihood is a product of n terms. 
Unbiased estimation of the log likelihood, `(θ) = log p(y|θ), is significantly easier as log p(y|θ) is a sum over n log likelihood contributions, log p(y|θ) = n∑ i=1 log p(yi|θ). 52 3.3.1. Subsampled log likelihoods The simplest approach is to take a uniform random sample of size m, where we assume m  n. Let S denote the set of subsampled indices, so S ⊂ {1, . . . , n} and |S| = m. Then a simple estimator of the log likelihood at θ is ̂` simple(θ) = n m ∑ i∈S log p(yi|θ). Although easy to understand and implement, this estimator scales very poorly with n. Let vn(θ) denote the population variance of the log likelihood values evaluated at θ over the n observations. The variance of the estimator is seen to be var ̂`simple(θ) = n2 m vn(θ). It is reasonable to expect vn(θ) to stabilise around a constant as n increases. Thus, in order to control the variance as n increases, we have to increase the batch size m quadratically with n. This is undesirable as we would like the computational cost of estimating the integrated likelihood to be sublinear in n. To reduce the variance of the log likelihood estimator we can employ control variates, a generic technique for reducing Monte Carlo variance. Control variates have been identified as a critical tool in many Big Data subsampling algorithms (Baker et al., 2017). Suppose we have a Monte Carlo estimator Z such that E[Z] = µ, where µ is the unknown quantity of interest. Suppose that we can define another random variable W on the same probability space with known expectation τ . Let α denote some constant. We can define a new estimator ZCV that makes use of the auxiliary control variate W : ZCV = Z + α(W − τ). The estimator ZCV is also an unbiased estimator for µ for any choice of constant α. The variance of the new estimator is var(ZCV ) = var(Z) + α 2var(W ) + 2α× cov(Z,W ). (3.18) If Z and W are correlated, and α is chosen appropriately, var(ZCV ) can be much smaller than var(Z). The full dataset log likelihood at θ is the sum of n contributions log p(y|θ) = ∑ni=1 log p(yi|θ) = ∑ni=1 `i(θ). The difference estimator is a special control variate scheme for estimating population totals from a subsample (Sarndal et al., 1992). The difference estimator introduces an approximate log likelihood contribution for each observation ̂`i(θ) for i = 1, . . . , n. As in Bardenet et al. (2017) and Quiroz et al. (2018) we consider the use of Taylor series approximations about the maximum likelihood estimate to form the log likelihood approximations ̂`i(θ) for i = 1, . . . , n. The control variates can reduce the variance compared to the simple estimator ̂`simple(θ). Bardenet et al. and Quiroz et al. both consider second order approximations. We also consider a first order approximation, as computing the Hessian matrix can be challenging for high-dimensional statistical models. Let θ̂ denote the maximum likelihood estimate of θ. Let gi denote the gradient of the log likelihood evaluated at θ̂ for observation i. Likewise, let Hi denote the Hessian matrix of the log likelihood contribution of observation i, evaluated at the maximum likelihood estimate. For i = 1, . . . , n: gi = ∇ log p(yi|θ)|θ=θ̂ , Hi = ∇2 log p(yi|θ) ∣∣ θ=θ̂ . Let g denote the full dataset gradient so g = ∑n i=1 gi. Let H give the full dataset Hessian matrix, so H = ∑n i=1Hi. For i = 1, . . . , n the first order approximation to the log likelihood contribution is given by ̂` i,1(θ) = `(θ̂) + (θ − θ̂)Tgi. (3.19) For i = 1, . . . 
For i = 1, . . . , n the second order approximation to the log likelihood contribution is given by

ℓ̂_{i,2}(θ) = ℓ_i(θ̂) + (θ − θ̂)ᵀ g_i + (1/2)(θ − θ̂)ᵀ H_i (θ − θ̂).   (3.20)

The first order log likelihood estimator ℓ̂_gradient(θ) is defined as

ℓ̂_gradient(θ) = log p(y|θ̂) + (n/m) Σ_{i∈S} [ℓ_i(θ) − ℓ̂_{i,1}(θ)].   (3.21)

The second order log likelihood estimator ℓ̂_hessian(θ) is defined as

ℓ̂_hessian(θ) = log p(y|θ̂) + (1/2)(θ − θ̂)ᵀ H (θ − θ̂) + (n/m) Σ_{i∈S} [ℓ_i(θ) − ℓ̂_{i,2}(θ)].   (3.22)

The variance of ℓ̂_gradient and ℓ̂_hessian is a function of the remainder terms in the Taylor series approximations (3.19) and (3.20). Qualitatively, the smaller the remainder terms, the smaller the variance. We analyse the asymptotic variance of the estimators in the Appendix to this chapter. We make a heuristic argument, under mild assumptions, that the variance of ℓ̂_gradient is O_p(m⁻¹) and the variance of ℓ̂_hessian is O_p(n⁻¹). Stochastic notation is used to account for the fact that we take θ to be a stochastic sequence, as we are treating θ as a sample from a posterior distribution.

3.3.2 Subsampled likelihoods

For simplicity suppose we have n independent observations y = (y_1, . . . , y_n). Unbiased estimation of the likelihood p(y|θ) = Π_{i=1}^{n} p(y_i|θ) is difficult as it is a product over n terms. The Poisson estimator (Wagner, 1987; Papaspiliopoulos, 2009) builds an unbiased estimator of the likelihood from unbiased estimators of the log likelihood. We assume that we use one of the subsampling based estimators of the log likelihood described in the previous subsection. Let (ℓ̂_j(θ)), j ∈ N, be a sequence of independent and identically distributed unbiased estimators of the log likelihood. The basic Poisson estimator is defined as the randomised product

p̂(y|θ) = exp(λ) Π_{j=1}^{J} [ℓ̂_j(θ)/λ],   where J ∼ Poisson(λ).   (3.23)

Taking iterated expectations shows the estimator is unbiased. We take expectations over the subsampling indices S for the log likelihood estimators, and over the random number of product terms J:

E_{S,J}[p̂(y|θ)] = E_J[ E_{S|J}[p̂(y|θ) | J = j] ]
              = E_J[ (exp(λ)/λ^j) Π_{k=1}^{j} E_{S|J}[ℓ̂_k(θ)] ]
              = E_J[ (exp(λ)/λ^j) ℓ(θ)^j ]
              = Σ_{j=0}^{∞} Pr(J = j) (exp(λ)/λ^j) ℓ(θ)^j
              = Σ_{j=0}^{∞} (1/j!) exp(−λ) λ^j × (exp(λ)/λ^j) ℓ(θ)^j
              = Σ_{j=0}^{∞} ℓ(θ)^j / j!
              = exp(ℓ(θ)) = p(y|θ).

The second line relies on the unbiasedness and the independence of the log likelihood estimators ℓ̂_1(θ), . . . , ℓ̂_j(θ). The expansion of Pr(J = j) uses the probability mass function of the Poisson distribution, and the final step uses the power series expansion exp(z) = Σ_{i=0}^{∞} z^i/i!.

The value of λ sets the expected computational cost of the Poisson estimator. Assuming the unbiased log likelihood estimators use a subsample of size m, the expected number of likelihood evaluations per use is mλ. In practice it is common to introduce an extra scalar tuning parameter ω to reduce the variance. The modified estimator is again a randomised product:

exp(λ + ω) Π_{j=1}^{J} [(ℓ̂_j(θ) − ω)/λ],   J ∼ Poisson(λ).   (3.24)

The modified Poisson estimator is also unbiased. From the general result given by Papaspiliopoulos (2009), for a fixed λ the choice of ω giving the minimum variance is ω = log p(y|θ) − λ. The minimum variance tuning parameter therefore requires the unknown log likelihood at θ.
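A minimal sketch of the Poisson estimator construction is given below, reusing ell_gradient() and theta from the previous sketch; lambda and omega are the tuning parameters in (3.23) and (3.24). Note that for tall datasets the likelihood itself underflows in double precision, so the sketch only illustrates the construction of the randomised product.

# Randomised product estimator; ell_estimator() returns an unbiased estimate of
# the log likelihood, lambda and omega are the tuning parameters in (3.23)-(3.24)
poisson_estimator <- function(ell_estimator, lambda, omega = 0) {
  J <- rpois(1, lambda)
  if (J == 0) return(exp(lambda + omega))
  factors <- replicate(J, ell_estimator()) - omega
  exp(lambda + omega) * prod(factors / lambda)
}

# Example usage with ell_gradient() from the previous sketch; omega is centred
# near a rough estimate of log p(y | theta) minus lambda
lambda <- 5
omega  <- ell_gradient(theta, m = 100) - lambda
p_hat  <- poisson_estimator(function() ell_gradient(theta, m = 100), lambda, omega)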
An important issue is that both the simple Poisson estimator and the modified Poisson estimator can be negative. The modified estimator is almost surely nonnegative if ω is chosen such that ℓ̂_j(θ) > ω holds almost surely. It can be difficult to determine such a bound analytically, and a conservative choice of ω can inflate the Monte Carlo variance (Quiroz et al., 2018). A general result is that without additional information about the log likelihood there is no algorithm that takes unbiased estimators of the log likelihood and outputs an almost surely nonnegative estimator of the likelihood (Jacob et al., 2015). An alternative subsampled likelihood estimator is the estimator proposed in Rhee and Glynn (2015). The Rhee and Glynn estimator also uses unbiased estimators of the log likelihood in a randomised product, and a lower bound on the likelihood is also required to ensure a nonnegative estimator. In general it is difficult to construct unbiased and almost surely nonnegative estimators of the likelihood that use subsampling.

3.3.3 Importance sampling

The model evidence can be viewed as the expected likelihood over the prior distribution, p(y) = ∫ p(y|θ) p(θ) dθ. A direct approach for estimating the model evidence is to take W samples from the prior distribution, θ^[1], . . . , θ^[W], and to set

p̂(y) = (1/W) Σ_{w=1}^{W} p(y|θ^[w]).   (3.25)

Although simulation consistent, this strategy is typically highly inefficient (Raftery, 1996; Friel and Wyse, 2012). The likelihood will concentrate in a small region of the parameter space supported by the prior, and as such a small fraction of the prior samples will dominate the sum. This leads to the estimator (3.25) having very high variance.

Importance sampling estimators express the model evidence as an integral over some importance distribution q(θ) as opposed to an integral over the prior distribution p(θ). The identity at the heart of this strategy is

p(y) = ∫ p(y|θ) p(θ) dθ = ∫ [p(y|θ) p(θ)/q(θ)] q(θ) dθ.

Given R samples from q(θ), denoted θ_q^[1], . . . , θ_q^[R], the importance sampling estimator is

p̂(y) = (1/R) Σ_{r=1}^{R} p(y|θ_q^[r]) p(θ_q^[r]) / q(θ_q^[r]).   (3.26)

If the importance distribution q(θ) is exactly equal to the posterior distribution p(θ|y), then the Monte Carlo estimator (3.26) has zero variance. In this ideal scenario q(θ) = p(y|θ)p(θ)/p(y) and the estimator reduces to

p̂(y) = (1/R) Σ_{r=1}^{R} p(y|θ_q^[r]) p(θ_q^[r]) / q(θ_q^[r]) = (1/R) Σ_{r=1}^{R} p(y|θ_q^[r]) p(θ_q^[r]) × p(y) / [p(y|θ_q^[r]) p(θ_q^[r])] = (1/R) Σ_{r=1}^{R} p(y) = p(y).

Gelfand and Dey (1994) propose to set the importance distribution q(θ) to be a normal distribution with mean and covariance matrix given by the sample mean and covariance of the posterior samples θ^[1], . . . , θ^[B]. This is a popular choice in practice.

3.3.4 Harmonic mean estimator

The harmonic mean estimator (Newton and Raftery, 1994) uses posterior samples to estimate the marginal likelihood. The harmonic mean estimator is based on the identity

E_{p(θ|y)}[1/p(y|θ)] = ∫ [1/p(y|θ)] p(θ|y) dθ = ∫ p(y|θ) p(θ) / [p(y|θ) p(y)] dθ = (1/p(y)) ∫ p(θ) dθ = 1/p(y).

Given B posterior samples θ^[1], . . . , θ^[B], a simulation consistent estimator of the model evidence is then

p̂(y) = [ (1/B) Σ_{b=1}^{B} 1/p(y|θ^[b]) ]⁻¹.

An advantage of the harmonic mean estimator is that it is readily computable if the posterior samples were generated using a Metropolis-Hastings algorithm (e.g. Algorithm 3.1). If the likelihood ratio calculations are saved at each iteration, the harmonic mean estimator can be computed with very little additional cost after the posterior sampling run. In the tall data setting, it is perhaps unlikely that full likelihood evaluations have been made in the process of generating θ^[1], . . . , θ^[B], so this benefit may not be present. The harmonic mean estimator can also have infinite variance, and can exhibit a large finite sample bias (Friel and Wyse, 2012).
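The sketch below shows how the Gelfand and Dey (1994) importance sampling estimator and the harmonic mean estimator might be computed on the log scale from a matrix of posterior draws. theta_samples (a B x d matrix), log_lik() and log_prior() are assumed, illustrative objects, and the mvtnorm package supplies the normal importance density.

library(mvtnorm)

log_sum_exp <- function(x) max(x) + log(sum(exp(x - max(x))))

# Normal importance distribution fitted to the posterior draws (Gelfand and Dey, 1994)
mu_hat    <- colMeans(theta_samples)
Sigma_hat <- cov(theta_samples)
R         <- 5000
theta_q   <- rmvnorm(R, mu_hat, Sigma_hat)

# Importance sampling estimator (3.26) on the log scale
log_w <- apply(theta_q, 1, log_lik) + apply(theta_q, 1, log_prior) -
  dmvnorm(theta_q, mu_hat, Sigma_hat, log = TRUE)
log_evidence_is <- log_sum_exp(log_w) - log(R)

# Harmonic mean estimator from the posterior draws themselves
log_lik_post    <- apply(theta_samples, 1, log_lik)
log_evidence_hm <- -(log_sum_exp(-log_lik_post) - log(nrow(theta_samples)))

Working on the log scale with a log-sum-exp avoids the underflow that would occur if the likelihoods were exponentiated directly on a tall dataset.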
3.3.5 Bridge sampling

Bridge sampling (Bennett, 1976; Meng and Wong, 1996) also uses an importance distribution q(θ), but is based on a slightly more complicated identity. We first start with the relationship

1 = ∫ p(y|θ) p(θ) a(θ) q(θ) dθ / ∫ p(y|θ) p(θ) a(θ) q(θ) dθ,

for some bridge function a(θ). Multiplying both sides of the equation by p(y) gives

p(y) = ∫ p(y|θ) p(θ) a(θ) q(θ) dθ / ∫ [p(y|θ) p(θ)/p(y)] a(θ) q(θ) dθ = E_{q(θ)}[p(y|θ) p(θ) a(θ)] / E_{p(θ|y)}[a(θ) q(θ)].

Suppose we have R samples from the importance density q(θ), denoted θ_q^[1], . . . , θ_q^[R], and B samples from the posterior distribution, θ^[1], . . . , θ^[B]. Bridge sampling requires the choice of importance distribution q(θ) and bridge function a(θ). There are theoretical arguments for taking q(θ) to resemble the posterior distribution; a common approach is to make a normal approximation. Let s_1 = R/(R + B) and s_2 = B/(R + B). The optimal choice of bridge function in terms of minimising mean square error is

a_opt(θ) = 1 / [s_1 p(y|θ) p(θ) + s_2 p(y) q(θ)].

When setting a(θ) = 1/q(θ), bridge sampling reduces to ordinary importance sampling. The optimal bridge function is unattainable as it includes the unknown p(y). Meng and Wong propose an iterative scheme to estimate the optimal bridge function, which in turn gives an iterative estimator of p(y). The iterative bridge sampling estimator is

p̂(y)^(t+1) = [ R⁻¹ Σ_{r=1}^{R} p(y|θ_q^[r]) p(θ_q^[r]) / (s_1 p(y|θ_q^[r]) p(θ_q^[r]) + s_2 p̂(y)^(t) q(θ_q^[r])) ] / [ B⁻¹ Σ_{b=1}^{B} q(θ^[b]) / (s_1 p(y|θ^[b]) p(θ^[b]) + s_2 p̂(y)^(t) q(θ^[b])) ].   (3.27)

The iterative scheme requires full likelihood evaluations for the B posterior samples and the R samples from the importance distribution. The iterative estimator (3.27) has an interpretation as a maximum profile likelihood estimator of the model evidence under the assumption of independent posterior samples (Geyer, 1996; Shirts et al., 2003).

3.3.6 Laplace approximation

The Laplace approximation can be motivated under the assumption that the log posterior density is approximately quadratic around the posterior mode θ̃. Let G and H denote the gradient and Hessian matrix of the unnormalised log posterior evaluated at the mode:

G = ∇[log p(y|θ) + log p(θ)]|_{θ=θ̃},   H = ∇²[log p(y|θ) + log p(θ)]|_{θ=θ̃}.

A second order Taylor expansion around the mode gives

p(y) ≈ ∫ exp[ log p(y|θ̃) + log p(θ̃) + (θ − θ̃)ᵀG + (1/2)(θ − θ̃)ᵀH(θ − θ̃) ] dθ
     = ∫ exp[ log p(y|θ̃) + log p(θ̃) + (1/2)(θ − θ̃)ᵀH(θ − θ̃) ] dθ
     = exp[ log p(y|θ̃) + log p(θ̃) ] ∫ exp[ (1/2)(θ − θ̃)ᵀH(θ − θ̃) ] dθ.

The second line uses the fact that the gradient is zero at the posterior mode. Let S⁻¹ = −H, so that the integrand can be recognised as the kernel of a multivariate Gaussian density with covariance matrix S. Introducing the required normalising constant and integrating over the density function yields the approximation

p(y) ≈ exp[ log p(y|θ̃) + log p(θ̃) ] ∫ exp[ −(1/2)(θ − θ̃)ᵀS⁻¹(θ − θ̃) ] dθ
     = exp[ log p(y|θ̃) + log p(θ̃) ] (2π)^{d/2} |S|^{1/2} ∫ (2π)^{−d/2} |S|^{−1/2} exp[ −(1/2)(θ − θ̃)ᵀS⁻¹(θ − θ̃) ] dθ
     = exp[ log p(y|θ̃) + log p(θ̃) ] (2π)^{d/2} |S|^{1/2}.

The Laplace approximation to the log model evidence is

log p(y) ≈ log p(y|θ̃) + log p(θ̃) + (d/2) log(2π) + (1/2) log(det(S)).   (3.28)
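A minimal sketch of the Laplace approximation (3.28) is given below. It assumes a user-supplied function log_post() returning the unnormalised log posterior log p(y|θ) + log p(θ) and a starting value theta0; the mode and Hessian are obtained numerically with optim(), rather than analytically.

# Numerical Laplace approximation to log p(y); log_post(theta) returns
# log p(y | theta) + log p(theta), theta0 is a starting value for the optimiser
laplace_evidence <- function(log_post, theta0) {
  neg <- function(theta) -log_post(theta)
  opt <- optim(theta0, neg, method = "BFGS", hessian = TRUE)
  d <- length(theta0)
  S <- solve(opt$hessian)      # opt$hessian estimates -H at the mode, so S = (-H)^{-1}
  log_det_S <- as.numeric(determinant(S, logarithm = TRUE)$modulus)
  -opt$value + (d / 2) * log(2 * pi) + 0.5 * log_det_S
}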
Although the Laplace approximation has been shown to be competitive in simulation studies, it can be difficult to bound the error in the approximation in practice (Gelfand and Dey, 1994). Additionally, computation of the Hessian matrix can be difficult in high-dimensional statistical models.

3.3.7 Laplace-Metropolis estimator

The Laplace-Metropolis estimator (Raftery, 1996; Lewis and Raftery, 1997) is a variant of the Laplace approximation designed to be used more easily in situations where posterior samples are available. The Laplace approximation requires the posterior mode and the Hessian evaluated at the mode. These are not immediately available given posterior samples θ^[1], . . . , θ^[B]. Raftery proposes a number of candidate estimators for the posterior mode θ̃ using the posterior samples:
1. Take θ̃ to be the posterior sample that maximises p(y|θ^[b]) p(θ^[b]) over b = 1, . . . , B.
2. Estimate the components of θ̃ by taking the component wise means of the posterior samples.
3. Estimate the components of θ̃ by taking the component wise medians of the posterior samples.
4. Estimate θ̃ by finding the multivariate median over the samples.
Let Σ_{θ|y} denote the covariance matrix of the posterior distribution. Raftery suggests using Σ_{θ|y} in place of S = (−H)⁻¹ in the Laplace approximation (3.28). This is justified using an asymptotic argument. The posterior covariance matrix Σ_{θ|y} can be estimated using the empirical covariance matrix of the posterior samples or an alternative robust covariance matrix estimator. The Laplace-Metropolis estimator is

log p(y) ≈ log p(y|θ*) + log p(θ*) + (d/2) log(2π) + (1/2) log(det(Σ̂_{θ|y})),   (3.29)

where θ* and Σ̂_{θ|y} are suitable estimators of the posterior mode and posterior covariance matrix, which can be obtained using the posterior samples. The Laplace-Metropolis estimator has performed well in simulations but its theoretical properties are not well understood (Raftery, 1996).

3.4 Application to tall datasets

Importance sampling, the harmonic mean estimator and bridge sampling all require repeated full likelihood evaluations p(y|θ). If each likelihood evaluation is O(n), these techniques may be impractical for large n datasets. As discussed in Section 3.3.2, subsampling can be used to provide unbiased estimates of the likelihood p̂(y|θ). In principle it is possible to modify existing estimators of the model evidence to use subsampled likelihoods. Consider the basic importance sampling estimator described in Section 3.3.3. Given R samples from the importance distribution q(θ), denoted θ_q^[1], . . . , θ_q^[R], we could replace the full likelihood evaluations p(y|θ_q^[r]) with unbiased subsampled estimates p̂(y|θ_q^[r]) using the modified Poisson estimator in Section 3.3.2. The subsampled importance sampling estimator could be constructed as

p̂(y) = (1/R) Σ_{r=1}^{R} p̂(y|θ_q^[r]) p(θ_q^[r]) / q(θ_q^[r]).   (3.30)

An issue is that we could obtain negative estimates of the model evidence unless the tuning parameter ω is an appropriate lower bound. Assuming that we plug in subsampled estimates of the likelihood, the possibility of negative estimates will also affect subsampled implementations of the harmonic mean estimator and subsampled bridge sampling. Although the Poisson estimator can cut the number of likelihood evaluations required to use the importance sampling estimators, the difficulty of choosing an appropriate lower bound ω makes such an approach unattractive.
The Laplace approximation and the Laplace-Metropolis estimator are attractive alternatives that avoid much of the computational expense associated with the other methods. A drawback is that the error associated with these approximate methods can be difficult to quantify in practice. As mentioned earlier, the Laplace approximation requires the posterior mode and Hessian, which are not automatically provided given posterior samples. The Laplace-Metropolis estimator approximates the mode and Hessian using information that is more readily available given simulation output. Although convenient, it is again difficult to determine the error that this introduces into the final estimator.

To integrate subsampling with the pseudo-marginal MCMC algorithm it is a requirement to have an unbiased estimator of the likelihood, necessitating the use of the Poisson estimator or alternatives. As our goal is Bayesian model selection rather than posterior sampling, we have the freedom to focus on the log model evidence. As already discussed, estimating log p(y|θ) using subsampling is significantly easier than estimating p(y|θ) using subsampling. At this point the computational advantages of the evidence bounds become more evident. As they are defined on the log scale, it is relatively easy to incorporate subsampling methodology.

Suppose we have B samples from the posterior distribution. Let Σ̂_{θ|y} be the empirical covariance matrix of the posterior samples θ^[1], . . . , θ^[B]. A simulation consistent estimate of the entropy upper bound is

(1/B) Σ_{b=1}^{B} log p(y|θ^[b]) + (1/B) Σ_{b=1}^{B} log p(θ^[b]) + d/2 + (d/2) log(2π) + (1/2) log(det(Σ̂_{θ|y})).   (3.31)

Suppose we generate R samples from the approximate variational posterior q(θ). Similar to the Gelfand and Dey (1994) recommendation for importance sampling, we propose to set the variational distribution q(θ) to be a normal distribution N(μ̂_{θ|y}, Σ̂_{θ|y}), where μ̂_{θ|y} and Σ̂_{θ|y} are the mean and covariance matrix of the posterior samples θ^[1], . . . , θ^[B]. Let the samples from q(θ) be denoted θ_q^[1], . . . , θ_q^[R]. A simulation consistent estimator of the evidence lower bound is

(1/R) Σ_{r=1}^{R} log p(y|θ_q^[r]) + (1/R) Σ_{r=1}^{R} log p(θ_q^[r]) − (1/R) Σ_{r=1}^{R} log q(θ_q^[r]).   (3.32)

In some situations the expectations E_{q(θ)}[log p(θ)] and E_{q(θ)}[log q(θ)] can be determined analytically. This is the case when both the prior p(θ) and the variational approximation q(θ) are normal distributions. The simulation consistent estimator of the variational lower bound can then be simplified to

(1/R) Σ_{r=1}^{R} log p(y|θ_q^[r]) + E_{q(θ)}[log p(θ)] − E_{q(θ)}[log q(θ)].   (3.33)

For tall datasets the B full log likelihood evaluations in (3.31) and the R full log likelihood evaluations in (3.32) and (3.33) will be computationally demanding. Subsampling can be used to estimate the log likelihood terms efficiently.

3.4.1 Estimation of evidence bounds

Suppose we have B samples θ^[1], . . . , θ^[B] from the posterior distribution, generated using one of the Big Data posterior simulation algorithms mentioned in the introduction. Pseudo-marginal MCMC or a stochastic gradient Langevin dynamics based sampler are two possibilities. To estimate the bounds (3.16) we need to estimate the posterior expectation E_{p(θ|y)}[log p(y|θ)]. We propose to use the subsampling estimators of the log likelihoods. That is, for some batch size m ≪ n, to use either of the estimators

Ê_{p(θ|y)}[log p(y|θ)] = (1/B) Σ_{b=1}^{B} ℓ̂_gradient(θ^[b]),   (3.34)
Ê_{p(θ|y)}[log p(y|θ)] = (1/B) Σ_{b=1}^{B} ℓ̂_hessian(θ^[b]).   (3.35)

Similarly, we can use the control variates to estimate the required goodness of fit term E_{q(θ)}[log p(y|θ)] in the variational lower bound (3.14). Suppose R samples from the variational distribution q(θ) are available, denoted θ_q^[1], . . . , θ_q^[R]. We propose to use either of the estimators

Ê_{q(θ)}[log p(y|θ)] = (1/R) Σ_{r=1}^{R} ℓ̂_gradient(θ_q^[r]),   (3.36)
Ê_{q(θ)}[log p(y|θ)] = (1/R) Σ_{r=1}^{R} ℓ̂_hessian(θ_q^[r]).   (3.37)

Given the subsampled estimators of the log likelihood we can define a subsampled estimator of the upper bound,

log p(y) ≤ Ê_{p(θ|y)}[log p(y|θ)] + Ê_{p(θ|y)}[log p(θ)] + d/2 + (d/2) log(2π) + (1/2) log(det(Σ̂_{θ|y})).   (3.38)

We can also define a subsampled estimator of the lower bound,

log p(y) ≥ Ê_{q(θ)}[log p(y|θ)] + E_{q(θ)}[log p(θ)] − E_{q(θ)}[log q(θ)].   (3.39)
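The sketch below indicates how the subsampled bounds (3.38) and (3.39) might be assembled from posterior draws, reusing ell_gradient() from the sketch in Section 3.3.1. theta_samples (a B x d matrix) and log_prior() are assumed, illustrative objects, and the variational distribution is the normal approximation fitted to the posterior draws.

library(mvtnorm)
m <- 500
d <- ncol(theta_samples)
mu_hat    <- colMeans(theta_samples)
Sigma_hat <- cov(theta_samples)

# Upper bound (3.38): posterior expectations estimated from the posterior draws,
# with the log likelihood terms replaced by subsampled estimates
E_loglik_post   <- mean(apply(theta_samples, 1, ell_gradient, m = m))
E_logprior_post <- mean(apply(theta_samples, 1, log_prior))
upper <- E_loglik_post + E_logprior_post + d / 2 + (d / 2) * log(2 * pi) +
  0.5 * as.numeric(determinant(Sigma_hat, logarithm = TRUE)$modulus)

# Lower bound (3.39): q(theta) = N(mu_hat, Sigma_hat) fitted to the posterior draws
theta_q <- rmvnorm(5000, mu_hat, Sigma_hat)
E_loglik_q <- mean(apply(theta_q, 1, ell_gradient, m = m))
lower <- E_loglik_q + mean(apply(theta_q, 1, log_prior)) -
  mean(dmvnorm(theta_q, mu_hat, Sigma_hat, log = TRUE))

c(lower = lower, upper = upper)

Every likelihood-related term costs O(m) per draw rather than O(n), which is the source of the computational savings reported later in the chapter.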
3.5 Data application: flights dataset

We analyse the flights dataset that was also considered in Chapter 2. The flights dataset is available in the R package nycflights13 (Wickham, 2014). There are n = 327,346 observations on flights departing New York City in 2013, with data on 16 different carriers (airlines). We dichotomised the arrival delay variable (originally in minutes) to obtain a binary outcome. We labelled flights as late if the arrival delay was greater than zero, and on time if the arrival delay was less than or equal to zero. We compared three different logistic regression models for late arrival:

M1 = intercept : carrier + delay : carrier,   (3.40)
M2 = intercept : carrier + delay : carrier + weekday,   (3.41)
M3 = intercept : carrier + delay : carrier + weekday : carrier.   (3.42)

Model 1 is the unpooled model from Chapter 2. Model 2 introduces a fixed effect for the day of the week; it could be the case that flights are more likely to arrive late on a weekend. Model 3 allows for an interaction between weekday and carrier. We used independent N(0, 1) priors on all coefficients.

We used a Gibbs sampler to generate B = 5000 observations from the full dataset posterior distribution. This dataset is of moderate size, and the covariates can be grouped for faster sampling. Ideally we would use a Big Data algorithm to generate the posterior samples, but we were unable to find an R package for doing so. We have not been concerned with the method of generation of the posterior samples, as our interest has been the follow up likelihood cost for then obtaining the model evidence. Our conclusions should not be affected by the fact that we have used a conventional Gibbs sampler to generate the initial collection of posterior samples.

The goodness of fit of each model was summarised by computing a Receiver Operating Characteristic (ROC) curve and a reliability plot. We generated predicted probabilities of late arrival for each observation, again taking the posterior predictive mean as the predicted probability of late arrival. The posterior predictive mean for a response given covariates x_i is obtained by integrating over the posterior distribution of the coefficients,

E[y_i|M] = ∫ p(y_i = 1|x_i, β, M) p(β|X, y, M) dβ.   (3.43)

The ROC curve plots the in-sample true positive rate against the false positive rate using the predicted probabilities, and summarises the predictive ability of each classifier. The area under the curve (AUC) is a summary measure of predictive performance: a perfect classifier has an AUC of 1, while random guessing gives an AUC of 0.5. ROC curves were generated using the R package pROC (Robin et al., 2011).
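For orientation, the sketch below reconstructs the data preparation and fits Model 2 by maximum likelihood with glm(); this is not the Bayesian analysis reported in the text, which used N(0, 1) priors and a Gibbs sampler. It assumes that dep_delay is the delay covariate in (3.40)-(3.42), and uses pROC for the ROC curve as in the text.

library(nycflights13)
library(pROC)

dat <- flights[complete.cases(flights[, c("arr_delay", "dep_delay", "carrier")]), ]
dat$late    <- as.integer(dat$arr_delay > 0)       # dichotomised response
dat$weekday <- factor(weekdays(dat$time_hour))     # day of the week

# Model 2: carrier-specific intercepts and delay slopes plus a weekday effect
m2 <- glm(late ~ 0 + carrier + carrier:dep_delay + weekday,
          family = binomial(), data = dat)

roc_m2 <- roc(dat$late, fitted(m2))
auc(roc_m2)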
Good predictive performance does not mean that a model is correctly specified. Another useful diagnostic is a reliability plot. Suppose a group of observations has a predicted probability of late arrival of 0.2. We would expect 20 percent of the observed data points in this group to be late arrivals. The reliability plot checks the goodness of fit of the model in terms of empirical event probabilities matching the predicted event probabilities. Observations are binned according to predicted probabilities, and the average predicted probability is compared to the empirical probability of late arrival within the bin. The reliability curves were generated using the R package caret (Kuhn, 2008).

Figure 3.1 shows the ROC curves for each model. The AUC for each model is given in the figure legend. The models have very similar AUC scores: Model 2 outperforms Model 1 by 0.0023 and Model 3 outperforms Model 2 by 0.0006. Models 2 and 3 have more free parameters than Model 1, so it is not surprising that they have better classification performance. What is more interesting is how the differences in predictive ability translate into Bayes factors. Before looking at this in more detail it is worth studying the reliability plots shown in Figure 3.2. Reliability curves are shown for each of the three models. A perfectly calibrated probabilistic classifier would give a reliability curve falling along the identity line in a sufficiently large sample. All three models appear very well calibrated. From the diagnostics in Figures 3.1 and 3.2 it is difficult to make a clear ranking of the models.

We now turn to the model evidence. We first computed the model evidence for each model using the importance sampling method proposed by Gelfand and Dey (1994), described in Section 3.3.3. We generated R = 5000 samples from the importance distribution, so the estimator required R = 5000 full likelihood evaluations. Estimated log Bayes factors are reported in Table 3.2. Moving from left to right, both Model 2 and Model 3 are supported over Model 1. There is evidence for a day of the week effect on the probability of late arrival. Model 2 is preferred over Model 3 with 2 log B23 = 55. The day of the week effect appears constant across carriers. Model 2 has fewer parameters, and gives very similar results to Model 3, as seen in Figures 3.1 and 3.2.

The computed Bayes factors are much larger than what is considered in Table 3.1. Here we have n = 327,346, and we expect the typical magnitude of log Bayes factors to increase as n grows under standard asymptotic theory. Large sample Bayes factors can be difficult to interpret in isolation. The Bayes factors provide extra information to discriminate between models over what is shown in Figures 3.1 and 3.2, however it is difficult to give an intuitive measure of the strength of evidence.

We then estimated the model evidence using the subsampled bound estimators (3.39) and (3.38). We estimated the required goodness of fit expectations using the three log likelihood estimators ℓ̂_simple, ℓ̂_gradient and ℓ̂_hessian discussed in Section 3.3.1. For the lower bound, we took the variational distribution q(θ) to be the same normal approximation that was used in the implementation of importance sampling. We set the subsample size at m = 500. Table 3.3 reports the estimated evidence bounds and the Monte Carlo standard errors. Standard errors were obtained using the default method in the mcmcse R package (Flegal et al., 2017).
We also report the estimates of the log model evidence using the importance sampling (IS) method. Standard errors for importance sampling were obtained using the nonparametric bootstrap.

The results in Table 3.3 show the importance of using control variates when estimating the full log likelihood from a subsample. The standard errors for ℓ̂_simple are much higher than for the other estimators for all three models. In particular, the standard errors for the bounds on Models 2 and 3 are larger than the difference in log Bayes factors. We cannot determine which model has the highest evidence score using the simple estimator ℓ̂_simple. We obtain more precise estimates of the bounds using the control variate estimators ℓ̂_gradient and ℓ̂_hessian. The estimated bounds are reasonably tight: the model evidence can generally be enclosed in an interval of approximately width one. This suggests the normal approximation to the posterior distribution is reasonable under each model.

Interestingly, the standard error of ℓ̂_hessian is greater than the standard error of ℓ̂_gradient for Model 3. This is contrary to what we expect from the asymptotic analysis of the estimators. The larger standard error could be a result of the sampling method. We have used simple random sampling in a situation where we do not have an equal number of observations for each carrier. The dummy coding in the design matrix means that a simple random subsample can contain very little information about the entire parameter set. We should aim to stratify the sample so that we get information about each parameter in the subsample. The second order estimator may be sensitive to this issue.

[Figure 3.1 (ROC curves; 1 − specificity against sensitivity): ROC curves for the flights dataset. Curves are shown for each of the three models in the candidate set. Classification performance is very similar across the three models. AUC values: Model 1 = 0.8313, Model 2 = 0.8336, Model 3 = 0.8342. The addition of weekday improves in-sample predictive performance by a small margin.]

The standard errors for the subsampled bound estimators are roughly an order of magnitude larger than the standard error of the importance sampling estimator. This should be taken into consideration when comparing the wall-clock time for each estimator. The standard errors in Table 3.3 are all estimated from a single replication of the computational experiment, and we use different methodology to estimate the standard errors for the bounds and for the importance sampling estimator. As future work we will repeat the experiment multiple times to obtain more robust estimates of the Monte Carlo standard errors.

Table 3.4 reports the time spent on likelihood evaluations for each estimator. We report the sum of the time to compute the upper and lower bounds for the subsampling based estimators. The subsampled estimators are multiple orders of magnitude faster than the standard importance sampling method: the subsampled estimators require m = 500 likelihood evaluations at each θ, while standard importance sampling requires n = 327,346 likelihood evaluations. The second order estimator ℓ̂_hessian is considerably slower than the first order estimator ℓ̂_gradient. The cost of the extra quadratic adjustment in (3.20) compared to (3.19) is non-negligible.
As the second order estimator has higher standard errors, it seems the gradient based estimator is a better choice in this example.

We also implemented a subsampled version of importance sampling as per (3.30). We again generated R = 5000 samples from the same normal importance distribution that was used for the regular importance sampling method. We used the simple Poisson estimator (3.23) to generate unbiased estimates of the likelihood. The Poisson estimator requires unbiased estimators of the log likelihood as input, and we tried it with each of the log likelihood estimators ℓ̂_simple, ℓ̂_gradient and ℓ̂_hessian. We set λ = 5 and took m = 100 when using the subsampling estimators, so that the expected number of likelihood evaluations per use was equal to five hundred. Table 3.5 reports the proportion of negative likelihood estimates obtained using each subsampling estimator. The proportion of negative estimates is close to 0.5 for each model and estimator. It is hard to construct a good estimator of the log model evidence given many negative likelihood estimates. It is difficult to use the simple Poisson estimator to accelerate importance samplers for the model evidence.

[Figure 3.2 (reliability plots; bin midpoint against observed event percentage): Reliability curves for the flights dataset. Curves are shown for each of the three models in the candidate set. All models are very well calibrated across the range of theoretical probabilities. Each model could be considered to be correctly specified.]

          2 log B21   2 log B31   2 log B23
Value     941         887         55

Table 3.2: Log Bayes factors for the flights dataset (n = 327,346). The values of the Bayes factors are all outside the range considered in Table 3.1. The models in the candidate set appear to give very similar predictions and are all well calibrated. Interpreting the strength of evidence when n is large can be difficult.

           ℓ̂_simple                              ℓ̂_gradient                          ℓ̂_hessian                           IS
           Lower              Upper              Lower             Upper             Lower             Upper
Model 1    -147209.1 (98.02)  -147100.8 (99.46)  -147112.5 (0.21)  -147111.9 (0.16)  -147112.6 (0.25)  -147111.9 (0.63)  -147112.0 (0.02)
Model 2    -146548.8 (122.18) -146637.6 (97.23)  -146641.2 (0.22)  -146641.2 (0.19)  -146640.7 (0.76)  -146641.8 (0.20)  -146641.0 (0.02)
Model 3    -146491.5 (106.59) -146611.4 (93.61)  -146669.6 (0.55)  -146668.5 (0.53)  -146579.6 (21.04) -146588.0 (17.07) -146668.7 (0.04)

Table 3.3: Estimates of the model evidence for the flights dataset. The first three column groups report the evidence bounds obtained using the different log likelihood estimators. Estimated standard errors are in brackets. The Monte Carlo variance of the simple log likelihood estimator is too large to rank models confidently. The control variate based estimators have significantly lower standard errors. The widths of the evidence bounds are small in relation to the estimated Bayes factors between models.

           ℓ̂_simple   ℓ̂_gradient   ℓ̂_hessian   IS
Model 1    3           4            10          523
Model 2    4           4            12          603
Model 3    12          13           85          1743

Table 3.4: Time spent on likelihood evaluations for the flights dataset (seconds). The importance sampling method requires a full O(n) likelihood evaluation for each posterior sample. The subsampling based estimators require O(m) likelihood evaluations per sample. We report the total time for calculating the lower and upper bounds for the subsampling estimators. In this simulation n = 327,346 and m = 500.
We report the time spent on likelihood evaluations only; the time spent on generating the subsampling index sets S was recorded separately.

           ℓ̂_simple   ℓ̂_gradient   ℓ̂_hessian
Model 1    0.52        0.50         0.50
Model 2    0.49        0.50         0.51
Model 3    0.50        0.50         0.51

Table 3.5: Proportion of negative likelihood estimates using the simple Poisson estimator over the B = 5000 posterior samples. We applied the simple Poisson estimator using each of the subsampled log likelihood estimators. The high fraction of negative likelihood estimates makes it difficult to construct a subsampled version of importance sampling for the purposes of estimating the model evidence. The evidence bounds are defined on the log scale and are not subject to a positivity requirement.

3.6 Conclusion

Much of the status quo in Bayesian computation is not well suited to Big Data. As mentioned in the introduction, procedures that require repeated O(n) evaluations of the full dataset likelihood are unsustainable as datasets grow taller. Estimation of the integrated likelihood can be computationally demanding on tall datasets because of this likelihood burden. We proposed a method for estimating the integrated likelihood using subsampling that avoids repeated O(n) likelihood evaluations. The relevance to model choice is that the proposed estimator of the model evidence can be calculated in less time than traditional importance sampling methods. Bayesian model choice can then proceed at a less glacial pace. Practitioners will inevitably perform a cost benefit analysis when deciding what procedure to use for model selection. If the computational cost of the Bayesian approach is very high, the scales will tip in favour of alternative approaches, whatever the benefits of a Bayesian analysis may be.

The main objective of the thesis is to identify and investigate strategies that minimise the expense of the compute step in Box's loop (refer back to Figure 1.1 in Chapter 1). The point of this chapter is to see how subsampling can be used to minimise the expense of the compute step when we want to critique multiple models on a large dataset using the fully Bayesian paradigm. To reiterate, we are trying to give an air of computational feasibility to Bayesian purism in the huge n setting.

Subsampling has been used in algorithms for generating samples from the posterior distribution p(θ|y) given a large dataset y, however subsampling based estimation of the model evidence does not seem to have been explored in great depth. Subsampling methodology for posterior simulation makes use of subsampled estimates of the log likelihood. Integrating these estimators into existing importance sampling methods for calculating the evidence is difficult due to the possibility of negative estimates of the likelihood. We found that the log likelihood estimators can also be used for efficient estimation of the model evidence in a different manner. Our starting point was an identity for the log model evidence:

log p(y) = E_{p(θ|y)}[log p(y|θ)] − D(p(θ|y) ‖ p(θ)).   (3.44)

The goodness of fit term E_{p(θ|y)}[log p(y|θ)] can be estimated efficiently using subsampling. Given posterior samples, we can bound the penalty term D(p(θ|y) ‖ p(θ)) using a maximum entropy argument and a standard variational Bayes approach. Control variates are essential in order to control the Monte Carlo variance of the algorithm; this is a similar finding to other work on subsampling algorithms for Bayesian computation (Baker et al., 2017; Bierkens et al., 2016).
We investigated both first order and second order Taylor series approximations as control variates. The first order estimator has attractive theoretical properties and performed competitively with the second order estimator in the simulations. Asymptotic properties of Bayes factors suggests that interval estimators of the log model evidence may be acceptable for ranking models in the large n regime. The proposed methodology is closely related to the standard Laplace approximation, however there are some key differences. The Laplace approximation is based on a second order Taylor series expansion about the posterior mode. It can be difficult to determine the error in the Laplace approximation in practice 64 3.7. Appendix (Gelfand and Dey, 1994). We make a normal approximation to the posterior distribution. Assuming the parameter space is unconstrained, this provides an upper bound on the log model evidence using the maximum entropy property of the Gaussian distribution. Although our approach is more computationally demanding than the standard Laplace approximation or the Laplace-Metropolis estimator it does come with additional guarantees on the error in the approximation. We have used the maximum entropy property of the normal distribution in Rd, and assumed that the parameter space is unconstrained. Another way to achieve a closed form upper bound on the posterior entropy is to use general properties of the exponential family. We typically have closed form expressions for the entropy of a distribution in the exponential family. The maximum entropy distribution of a random variables X taking values in the interval [0,∞) subject to the constraint that E[X] = 1/λ is an exponential distribution with density f(x) = λ exp(−λx). Suppose we had a single parameter θ with support [0,∞) in our model. Given posterior samples we can form an estimate of E[θ|y]. We could then form an upper bound on the posterior entropy using the maximum entropy property of the exponential distribution. Using exponential family distributions for the maximum entropy bound also suggests taking the same exponential family distributions for the variational approximation q(θ) for the lower bound. This is an interesting direction that helps to further distinguish our approach from the Laplace and Laplace-Metropolis estimators. 3.7 Appendix 3.7.1 Control variates We have defined `(θ) = log p(y|θ) and `i(θ) = log p(yi|θ). The variance of the gradient estimator̂` gradient(θ) (3.21) will be a function of the remainder terms from each of the individual log likelihood contributions. Let Rigradient denote the remainder term from the approximation for observation i, so Rigradient = `i(θ)− ̂`i,1(θ). An alternative expression for (3.21) is then ̂` gradient(θ) = `(θ̂) + n m ∑ i∈S Rigradient. Now as var(Z) ≤ E(Z2) for any random variable Z, the variance can be upper bounded in terms of the expected squared remainder term var(̂`gradient(θ)) ≤ n m n∑ i=1 |Rigradient|2. (3.45) If the remainder terms are small the variance of the estimator will be small. The introduction of the first order control variates can decrease the Monte Carlo variance of the log likelihood estimator if the linear approximations to the log likelihood contributions are accurate. The variance of ̂`hessian(θ) estimator will also be a function of the remainder terms from each of the individual log likelihood approximations. Let Rihessian denote the remainder term from the approximation for observation i, so Rihessian = `i(θ)− ̂`i,2(θ). 
The second order estimator can be expressed as

ℓ̂_hessian(θ) = log p(y|θ̂) + (1/2)(θ − θ̂)ᵀH(θ − θ̂) + (n/m) Σ_{i∈S} R_i^hessian.

Now as var(Z) ≤ E(Z²) for any random variable Z, the variance of ℓ̂_hessian(θ) can be upper bounded in terms of the expected squared remainder term,

var(ℓ̂_hessian(θ)) ≤ (n/m) Σ_{i=1}^{n} |R_i^hessian|².   (3.46)

It is reasonable to expect the remainder terms R_i^hessian from the second order approximation to be smaller in magnitude than the remainders R_i^gradient from the first order approximation. As a consequence the variance of the second order estimator should be smaller than the variance of the first order estimator. We study the asymptotic variance of each estimator to compare them in a more precise manner.

3.7.2 Asymptotic variance

Here we consider the asymptotic variance of the proposed estimators ℓ̂_gradient(θ) and ℓ̂_hessian(θ) as the sample size n is taken to infinity. As mentioned earlier, the variance of the estimators will depend on the remainder terms from the Taylor approximations for each likelihood contribution. Recall that R_i^gradient denotes the remainder term for observation i from the first order Taylor expansion about the maximum likelihood estimate. Suppose that we can bound the second order derivatives of the log likelihood for each observation i = 1, . . . , n. Given a d-dimensional parameter, θ = (θ_1, . . . , θ_d)ᵀ, assume that for all j, k ∈ {1, . . . , d},

|∂² log p(y_i|θ)/∂θ_j∂θ_k| ≤ M_i,

for some constant M_i, and i = 1, . . . , n. This is possible for a wide range of models in the exponential family. Then by the Taylor-Lagrange inequality we can form bounds on the remainder term,

|R_i^gradient| ≤ (M_i/2) ‖θ − θ̂‖₁²,   |R_i^gradient|² ≤ (M_i²/4) ‖θ − θ̂‖₁⁴.

Substituting the squared remainder bound into (3.45) gives an upper bound on the variance,

var(ℓ̂_gradient(θ)) ≤ (n/m) Σ_{i=1}^{n} (M_i²/4) ‖θ − θ̂‖₁⁴ = (1/m) ( Σ_{i=1}^{n} n⁻¹ M_i²/4 ) n² ‖θ − θ̂‖₁⁴.   (3.47)

To determine the asymptotic variance of the estimators we take the evaluation point θ to be a stochastic sequence. Roughly speaking, we assume the posterior distribution concentrates in a ball of radius of order n^{−1/2} around the maximum likelihood estimate θ̂ as n increases. One way to motivate this is through the Bernstein-von Mises theorem (Van Der Vaart, 1998, Chapter 10). Under mild conditions we expect the posterior distribution to be asymptotically normal, p(θ|y) ≈ N(θ̂, n⁻¹ I₁(θ̂)⁻¹), where I₁(θ̂) is the Fisher information matrix for a single observation evaluated at the maximum likelihood estimate (Van Der Vaart, 1998, Chapter 10). The Fisher information for a single observation is defined as I₁(θ) = −E[∇² log p(y_i|θ)], where the expectation is over the generative model with known parameter θ. We expect the posterior variance to be O(n⁻¹). Using properties of the folded normal distribution, each term |θ_i − θ̂_i| in the norm

‖θ − θ̂‖₁ = Σ_{i=1}^{d} |θ_i − θ̂_i|

can be reasoned to be O_p(n^{−1/2}). As such we can take ‖θ − θ̂‖₁⁴ to be of the order O_p(n⁻²). Plugging this deviation rate into the variance bound (3.47), we can cancel out the inflation by n that debilitates the simple likelihood estimator,

var(ℓ̂_gradient(θ)) ≤ (1/m) ( Σ_{i=1}^{n} n⁻¹ M_i²/4 ) n² ‖θ − θ̂‖₁⁴ ≈ (1/m) ( Σ_{i=1}^{n} n⁻¹ M_i²/4 ) n² × O_p(n⁻²) ≈ O_p(m⁻¹).

Under mild assumptions the term inside the sum, Σ_{i=1}^{n} n⁻¹ M_i², can be assumed to approach a constant as n → ∞. As such we expect the variance of the gradient estimator to be O_p(1/m) for large n. The approximation in the final line reflects that this is a heuristic argument.
This is a significant improvement over the simple subsampling estimator where var(̂`simple(θ)) is expected to be Op(n2/m). A similar analysis can be performed for the second order estimator ̂`hessian(θ). Let Rihessian denote the remainder term for observation i from the second order Taylor expansion about the maximum like- lihood estimate. We assume that we can bound the third order partial derivatives of the log likelihood contribution for each observation. Specifically, suppose that for all j, k, l ∈ {1, . . . , d},∣∣∣∣∂3 log p(yi|θ)∂θj∂θkθl ∣∣∣∣ ≤Mi, for some constant Mi, and i = 1, . . . , n. Then by the Taylor-Lagrange inequality we can form bounds on the remainder term (Walschap, 2015, Chapter 2), |Rihessian| ≤ Mi 6 ‖θ − θ̂‖31, |Rihessian|2 ≤ M2i 36 ‖θ − θ̂‖61. Substituting the squared remainder bound into (3.46) gives an upper bound on the variance var(̂`hessian(θ)) ≤ n m n∑ i=1 M2i 36 ‖θ − θ̂‖61 = 1 m ( n∑ i=1 1 n M2i 36 ) n2‖θ − θ̂‖61. We again use the fact that we expect ‖θ − θ̂‖1 to be Op(n−1/2). Making another heuristic argument for the asymptotic variance, var(̂`hessian(θ)) ≤ 1 m ( n∑ i=1 1 n M2i 36 ) n2‖θ − θ̂‖61 ≈ 1 m ( n∑ i=1 1 n M2i 36 ) n2 ×Op(n−3) ≈ 1 m ( n∑ i=1 1 n M2i 36 ) ×Op(n−1). Again assuming that ∑n i=1 n −1M2i approaches a constant, we can reason that the variance of the second order estimator approaches zero as n tends to infinity. It is interesting to compare the behaviour of ̂`gradient(θ) an ̂`hessian(θ). The assumption that θ approaches θ̂ at rate Op(n −1/2) implies that the remainder terms vanish. This is counterbalanced by the scaling factor n/m that appears in both the definition of both estimators (equations (3.34)). For thê` hessian(θ), the remainder terms vanish at a sufficiently fast rate such that the variance tends to zero. For the ̂`gradient(θ) the remainder terms diminish at a rate that leads to stable finite variance. Example: logistic regression The logistic regression model satisfies the assumptions in the analysis of the asymptotic variance. Let yi denote the binary response and xi represent the column vector of covariates for observation i, for i = 1, . . . , n. We assume that yi ∼ Bernoulli(σ(ηi)), where ηi = xTi β and σ(z) = 1/(1 + exp(−z)). For the analysis of ̂`gradient in Section 3.7.2 to apply, it is necessary to bound the second order partial derivatives. The Hessian matrix of each log likelihood contribution is ∇2 log p(yi|xi,β) = −ŷi(1− ŷi)xixTi , 67 3. Computing the model evidence using subsampling where ŷi = σ(x T i β). As 0 ≤ ŷi ≤ 1, it holds that ŷi(1− ŷi) < 1/4. The absolute value of the second order partial derivatives then satisfies ∣∣∣∣∂2 log p(yi|xi,β)∂βj∂βk ∣∣∣∣ ≤ 14‖xi‖21, for each observation i = 1, . . . , n. The analysis of ̂`hessian in Section 3.7.2 assumed bounded third order partial derivatives. The logistic regression model satisfies∣∣∣∣∂3 log p(yi|xi,β)∂βj∂βkβl ∣∣∣∣ ≤ 14‖xi‖31, for each observation i = 1, . . . , n. 68 4Statistical properties of sketching algorithms Summary Sketching is a probabilistic data compression technique that has been largely developed in the computer science community. Numerical operations on big datasets can be intolerably slow; sketching algorithms address this issue by generating a smaller surrogate dataset. Typically, inference proceeds on the com- pressed dataset. Sketching algorithms generally use random projections to compress the original dataset and this stochastic generation process makes them amenable to statistical analysis. 
We argue that the sketched data can be modelled as a random sample, thus placing this family of data compression methods firmly within an inferential framework. In particular, we focus on the Gaussian, Hadamard and Clarkson- Woodruff sketches, and their use in single pass sketching algorithms for linear regression with huge n. We explore the statistical properties of sketched regression algorithms and derive new distributional results for a large class of sketched estimators. We develop confidence intervals for sketching estimators and bounds on the relative efficiency of different sketching algorithms. An important finding is that the best choice of sketching algorithm in terms of mean square error is related to the signal to noise ratio in the source dataset. Finally, we demonstrate the theory and the limits of its applicability on two real datasets. 4.1 Introduction Sketching is a general probabilistic data compression technique designed for Big Data applications (Cor- mode, 2011). Even routine calculations can be prohibitively computationally expensive on massive datasets. Computation time can be reduced to an acceptable level by allowing for some approxima- tion error in the results. Sketching algorithms relax the computational task by generating a compressed version of the original dataset which then serves as a surrogate for calculations. The compressed dataset is referred to as a sketch as it acts as a compact representation of the full dataset. Sketching algorithms use a randomised compression stage which makes them interesting from a statistical viewpoint. Sketching algorithms for linear regression have attracted significant attention in the numerical linear algebra and theoretical computer science communities (Woodruff, 2014; Mahoney, 2011). In this paper we investi- gate the statistical properties of sketched regression algorithms, a perspective which has received little attention up to now. To describe sketched regression in more detail, we first assume the data consists of a n-length response vector y and a n×p matrix of covariates, X which is of full rank. It is assumed that n > p. The objective is to find the optimal least squares coefficients. Given sufficient computational resources, these could be computed exactly as βF = (X TX)−1XTy. The subscript F is used to indicate the connection to the full dataset. Only two quantities are needed in order to determine βF , the Gram matrix X TX, and the marginal associations XTy. Calculation of 69 4. Statistical properties of sketching algorithms XTX requires O(np2) operations while computation of XTy needs only O(np) calculations. There are two broad methods for sketched regression, complete sketching and partial sketching. Complete sketching is based on approximating both XTX and XTy, whereas partial sketching only approximates the Gram matrix. Sketching algorithms use random linear mappings to reduce the size of the dataset from n to k observations. The random linear mapping can be represented as a k × n sketching matrix S. Complete sketching generates a k-length sketched response vector y˜ and a k × p matrix of sketched predictors X˜. The sketched data are computed through the linear mappings y˜ = Sy and X˜ = SX. Partial sketching only generates a k × p matrix of sketched covariates X˜. We again use the random mapping X˜ = SX. The complete sketching estimator, βS , is defined as the least squares coefficients using the sketched responses and predictors, βS = (X˜ TX˜)−1X˜Ty˜. 
(4.1) The partial sketching estimator, βP , is defined as βP = (X˜ TX˜)−1XTy. (4.2) The key difference between (4.1) and (4.2) is that the partial sketched estimator βP is constructed using the exact marginal associations XTy. Given the sketched data, computation of βS or βP requires only O(kp2) operations, compared with the O(np2) required for βF . The estimand within a sketching algorithm is the optimal coefficient vector βF . Sketching algorithms have the property that given a fixed k, the approximation error ‖βS − βF ‖2 or ‖βP − βF ‖2 remains probabilistically bounded even as n → ∞. Designing estimators for approximate computation with such properties is very difficult, and is a common goal in the development of techniques for Big Data (Bardenet and Maillard, 2015; Phillips, 2016). The favourable scaling properties of sketching algorithms are a critical factor in making them stand apart from simple subsampling approaches, where it can be difficult to establish universal worst case bounds for large n (Drineas et al., 2006; Ma et al., 2015). The fact that sketching algorithms provide finite k guarantees for arbitrarily large n is a major reason they have received so much attention in the computer science community. There is a large literature concerned with designing appropriate distributions for the random sketching matrix S. Our focus is on data-oblivious random projections, where the distribution of the sketching matrix is not a function of the source data (y,X). An example is the Gaussian sketch, where each element is independently distributed as a N(0, 1/k) variate. We also consider the Hadamard sketch and the Clarkson-Woodruff sketch, random projections that exploit structure and sparsity for computational efficiency. Most existing results on the accuracy of sketching are universal worst case bounds (Woodruff, 2014; Mahoney and Drineas, 2016). This is typical for randomised algorithms, however a more detailed error analysis can provide important insights (Halko et al., 2011). We investigate the statistical properties of βP and βS when using data oblivious sketches. An important finding is upper and lower bounds on the relative efficiency of complete sketching to partial sketching in terms of the signal to noise ratio in the source dataset. The statistical analysis also allows the construction of exact confidence intervals for the Gaussian sketch, and asymptotic confidence intervals for other random projections, paving the way for their wider use in the statistical community interested in Big Data methods. We start by reviewing the existing literature on sketching algorithms before investigating the statistical properties in more detail. At its core, sketched regression is a randomised algorithm for approximate computation of βF . Repeated application of the sketching algorithm on the same dataset will produce different results. The first stage in our analysis is to establish the distributional properties of the sketched estimators with the source dataset fixed. This gives a clear statistical picture of the behaviour of the randomised algorithm. An important result is a conditional central limit theorem for the sketched dataset 70 4.2. Background and related work that connects the Hadamard and Clarkson-Woodruff projections to the Gaussian sketch. The regularity conditions have a intuitive interpretation in terms of the geometry of the source dataset. 
We then analyse a large genetic dataset to compare the performance of different sketching algorithms and to test the asymptotic theory that we have developed.

4.2 Background and related work

Before proceeding, it is worth mentioning alternatives to sketching, in particular iterative methods for calculating the least squares coefficients β_F. These include coordinate descent or stochastic gradient methods. Iterative methods are guaranteed to converge to β_F under very mild conditions. These iterative techniques assume that the entire dataset can be stored in memory in a single location, or require regular communication if the full dataset is distributed across multiple sites. Sketching algorithms are not burdened by these memory and communication costs, with the drawback of no convergence guarantees to β_F. Connections to iterative methods are postponed until the discussion; the focus for now is on the single pass estimators β_S and β_P.

The purpose of this section is to review the existing theoretical framework for sketching algorithms. Sketching algorithms are largely motivated through worst case guarantees. We recap how these bounds can be developed before studying the statistical properties of the sketched estimators. It will be helpful to define a number of quantities related to the full dataset before moving on. Let TSS_F = yᵀy, RSS_F = ‖y − Xβ_F‖₂², MSS_F = ‖Xβ_F‖₂² and R²_F = MSS_F/TSS_F. These terms summarise the goodness of fit of the model. The total, residual and model sums of squares are given by TSS_F, RSS_F and MSS_F respectively, with TSS_F = MSS_F + RSS_F. The proportion of variance explained by the model is given by R²_F. These values will be important in characterising the behaviour of β_S and β_P.

4.2.1 Embedding bounds

A key concept in the construction of sketching algorithms is the notion of an ε-subspace embedding (Woodruff, 2014; Meng and Mahoney, 2013; Yang et al., 2015a).

Definition 4.1 (ε-subspace embedding). For a given n × d matrix A, we call a k × n matrix S an ε-subspace embedding for A if for all vectors z ∈ R^d,

(1 − ε)‖Az‖₂² ≤ ‖SAz‖₂² ≤ (1 + ε)‖Az‖₂².

Speaking broadly, an ε-subspace embedding preserves the linear structure of the columns of the original dataset up to a multiplicative (1 ± ε) factor. In particular, if ε is small, the linear mapping S approximately preserves the covariance structure of the source dataset. Most theoretical arguments for sketching algorithms are predicated on the idea that the sketching matrix S is an ε-subspace embedding for the source dataset. The general notion is that it is possible to use a linear mapping S that reduces the sample size from n to k whilst preserving much of the linear information in the full dataset. The issue of how to generate ε-subspace embeddings is deferred until Section 4.2.2; the present focus is on the utility of ε-subspace embeddings for linear regression problems. For now, assume that we have some method for generating ε-subspace embeddings for the source data matrix A. It will be convenient to refer to Ã = SA as an ε-subspace embedding of A if S is an ε-subspace embedding for A. As regression is the focus from this point forward, we will define the source data matrix as A = [y, X], the sketched data matrix as Ã = [ỹ, X̃] and set d = p + 1.

The complete sketched estimator β_S is given by the least squares coefficients using the sketched responses ỹ and the sketched predictors X̃,

β_S = argmin_{β∈R^p} ‖ỹ − X̃β‖₂².
An ε-subspace embedding is useful as it relates the sketched optimisation problem to the full dataset optimisation problem. If Ã = [ỹ, X̃] is an ε-subspace embedding of A = [y, X], it must hold that for all β ∈ R^p,

(1 − ε)‖y − Xβ‖₂² ≤ ‖ỹ − X̃β‖₂² ≤ (1 + ε)‖y − Xβ‖₂².

If ε is small, minimising the sum of squared residuals on the sketched dataset is similar to minimising the sum of squared residuals on the full dataset. If this is the case, it can be expected that β_S will be close to β_F. It is possible to establish a concrete bound: if Ã is an ε-subspace embedding of A (Sarlos, 2006),

‖β_S − β_F‖₂² ≤ (ε²/σ²_min(X)) RSS_F,   (4.3)

where σ_min(X) represents the smallest singular value of the design matrix X. A very similar argument can be used to motivate the partial sketched estimator β_P. Existing bounds for the partial sketch focus on the prediction error ‖Xβ_P − Xβ_F‖₂² (Becker et al., 2015; Pilanci and Wainwright, 2016). To make a direct comparison to (4.3) we establish a bound on the coefficient error.

Theorem 4.1. Suppose that X̃ is an ε-subspace embedding of X with ε < 0.5. Then the following bound holds,

‖β_P − β_F‖₂² ≤ (4ε²/σ²_min(X)) MSS_F.   (4.4)

For the proof see Chapter 5. The mild requirement that ε < 0.5 is imposed so that the bound matches the functional form of the complete sketching bound (4.3). Comparing the partial sketching bound to (4.3), we see that the tightness of the bound is controlled by the model sum of squares as opposed to the residual sum of squares. The sensitivity of partial sketching to the model sum of squares as opposed to the residual sum of squares has been noted in previous work on partial sketching (Dhillon et al., 2013; Pilanci and Wainwright, 2016; Becker et al., 2015). This suggests that the signal to noise ratio in the source dataset will be important when selecting which sketched estimator to use. A naive conclusion is that complete sketching is preferred when RSS_F < 4MSS_F, or equivalently R²_F > 0.2. Such a result is hardly prescriptive, as the worst case bound is not necessarily indicative of expected performance. A second point of interest is that if the k × n matrix S is an ε-subspace embedding for A = [y, X], it is also an ε-subspace embedding for X. This suggests that it is reasonable to compute both β_P and β_S from a single sketch, although it is not clear how to combine the estimators into a single point estimator. These issues will be explored in more depth by examining the statistical properties of both complete and partial sketching. Before moving on to the statistical analysis we review some of the existing methods for generating ε-subspace embeddings.

4.2.2 Sketches

There are two general categories of distributions for the random matrix S: data aware random projections and data oblivious random projections. A data aware random projection uses information in the source data y, X to generate S. In contrast, a data oblivious random projection can be sampled without knowledge of y or X. Data oblivious projections are designed to produce ε-subspace embeddings for an arbitrary source data matrix with high probability. Our focus is on data oblivious random projections.

The Gaussian sketch was one of the first projections proposed for sketched regression (Sarlos, 2006). Recall that a Gaussian sketch is formed by independently sampling each element of S from a N(0, 1/k) distribution. The drawback of the Gaussian sketch is that computation of the sketched data is quite demanding, taking O(npk) operations.
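The sketch below illustrates the Gaussian sketch together with the complete (4.1) and partial (4.2) sketched estimators on simulated data; the dataset, sketch size k and variable names are illustrative. Forming the dense k × n matrix S explicitly makes the O(npk) cost and memory footprint of the Gaussian sketch visible.

set.seed(1)
n <- 2e4; p <- 10; k <- 500
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% rnorm(p)) + rnorm(n, sd = 2)

S    <- matrix(rnorm(k * n, sd = 1 / sqrt(k)), k, n)   # dense Gaussian sketch
X_sk <- S %*% X                                        # O(npk) cost to form the sketch
y_sk <- drop(S %*% y)

beta_F <- solve(crossprod(X), crossprod(X, y))             # full dataset least squares
beta_S <- solve(crossprod(X_sk), crossprod(X_sk, y_sk))    # complete sketching (4.1)
beta_P <- solve(crossprod(X_sk), crossprod(X, y))          # partial sketching (4.2)

c(complete = sqrt(sum((beta_S - beta_F)^2)),
  partial  = sqrt(sum((beta_P - beta_F)^2)))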
4.2.2 Sketches

There are two general categories of distributions for the random matrix $S$: data aware random projections and data oblivious random projections. A data aware random projection uses information in the source data $y$, $X$ to generate $S$. In contrast, a data oblivious random projection can be sampled without knowledge of $y$ or $X$. Data oblivious projections are designed to produce $\epsilon$-subspace embeddings for an arbitrary source data matrix with high probability. Our focus is on data oblivious random projections.

The Gaussian sketch was one of the first projections proposed for sketched regression (Sarlos, 2006). Recall that a Gaussian sketch is formed by independently sampling each element of $S$ from a $N(0, 1/k)$ distribution. The drawback of the Gaussian sketch is that computation of the sketched data is quite demanding, taking $O(npk)$ operations. As such, there has been work on designing more computationally efficient random projections. The Hadamard sketch and the Clarkson-Woodruff sketch are two examples of more efficient methods for generating $\epsilon$-subspace embeddings.

The Hadamard sketch is a structured random matrix (Ailon and Chazelle, 2009). The sketching matrix is formed as $S = \Phi H D/\sqrt{k}$, where $\Phi$ is a $k \times n$ matrix and $H$ and $D$ are both $n \times n$ matrices. The fixed matrix $H$ is a Hadamard matrix of order $n$. A Hadamard matrix is a square matrix with elements that are either $+1$ or $-1$ and orthogonal rows. Hadamard matrices do not exist for all integers $n$, but the source dataset can be padded with zeroes so that a conformable Hadamard matrix is available. The random matrix $D$ is a diagonal matrix where each nonzero element is an independent Rademacher random variable. The random matrix $\Phi$ subsamples $k$ rows of $H$ with replacement. The structure of the Hadamard sketch allows for fast matrix multiplication, reducing calculation of the sketched dataset to $O(nd \log k)$ operations.

The Clarkson-Woodruff sketch is a sparse random matrix (Clarkson and Woodruff, 2013). The projection can be represented as the product of two independent random matrices, $S = \Gamma D$, where $\Gamma$ is a random $k \times n$ matrix and $D$ is a random $n \times n$ matrix. The matrix $\Gamma$ is formed by choosing one element in each column independently and setting the entry to $+1$. The matrix $D$ is a diagonal matrix where each nonzero element is an independent Rademacher random variable. This results in a sparse $S$, where there is only one nonzero entry per column. The sparsity of the Clarkson-Woodruff sketch speeds up matrix multiplication, dropping the complexity of generating the sketched dataset to $O(nd)$.

4.2.3 Sketching examples

As examples, we demonstrate the construction of a Hadamard sketch and a Clarkson-Woodruff sketch for $k = 3$, $n = 4$. The Hadamard sketch matrix is formed as $S = \Phi H D/\sqrt{k}$, where $\Phi$ is a $k \times n$ matrix and $H$ and $D$ are both $n \times n$ matrices. The fixed matrix $H$ is a Hadamard matrix of order $n$. The random matrix $D$ is a diagonal matrix where each nonzero element is an independent Rademacher random variable. The random matrix $\Phi$ subsamples $k$ rows of $H$ with replacement. The display below shows an example of the random projection. The first matrix represents $\Phi H$, a subsample of three rows from a $4 \times 4$ Hadamard matrix. In step 2, the diagonal matrix $D$ is generated, with independent Rademacher random variables along the diagonal. In step 3 the matrix multiplication $\Phi H D$ is performed. This outputs the sketching matrix $S$ (up to the scaling factor $1/\sqrt{k}$).
\[
\Phi H = \begin{pmatrix} 1 & -1 & 1 & -1 \\ 1 & -1 & -1 & 1 \\ 1 & -1 & 1 & -1 \end{pmatrix}, \qquad
D = \mathrm{diag}(+1, -1, +1, +1), \qquad
\Phi H D = \begin{pmatrix} 1 & 1 & 1 & -1 \\ 1 & 1 & -1 & 1 \\ 1 & 1 & 1 & -1 \end{pmatrix}.
\]

The Clarkson-Woodruff sketch is a sparse random matrix. The projection can be represented as the product of two independent random matrices, $S = \Gamma D$, where $\Gamma$ is a random $k \times n$ matrix and $D$ is a random $n \times n$ matrix. The matrix $\Gamma$ is formed by choosing one element in each column independently and setting the entry to $+1$. The matrix $D$ is a diagonal matrix where each nonzero element is an independent Rademacher random variable. This results in a sparse $S$, where there is only one nonzero entry per column. The display below shows an example of the random projection. The first matrix represents $\Gamma$, a random matrix where a single element in each column is set to one. In step 2, the diagonal matrix $D$ is generated, with independent Rademacher random variables along the diagonal. In step 3 the matrix multiplication $\Gamma D$ is performed. This outputs the sketching matrix $S$.
\[
\Gamma = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix}, \qquad
D = \mathrm{diag}(+1, -1, +1, -1), \qquad
\Gamma D = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & -1 \\ 0 & -1 & 1 & 0 \end{pmatrix}.
\]

Figure 4.1 shows examples of the three sketches for $k = 32$, $n = 36$.

[Figure 4.1: Sampled sketching matrices $S$ for $k = 32$, $n = 36$. Panels: (a) Gaussian sketch, (b) Hadamard sketch, (c) Clarkson-Woodruff sketch. Elements in the sketching matrix are coloured based on their value: one and negative one are coloured black and white respectively, and intermediate values are in shades of grey.]
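Both projections can be generated in a few lines of code. The snippet below is a minimal sketch of the two constructions; it assumes scipy is available for the Hadamard matrix (which requires $n$ to be a power of two, consistent with the zero padding mentioned above), and all dimensions are illustrative.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(2)
n, k = 8, 3                    # n must be a power of two for scipy.linalg.hadamard

# Hadamard sketch: S = Phi H D / sqrt(k).
H = hadamard(n)                                        # fixed n x n Hadamard matrix
D = np.diag(rng.choice([-1, 1], size=n))               # Rademacher diagonal matrix
Phi = np.zeros((k, n))
Phi[np.arange(k), rng.integers(0, n, size=k)] = 1      # subsample k rows of H with replacement
S_hadamard = Phi @ H @ D / np.sqrt(k)

# Clarkson-Woodruff sketch: S = Gamma D, a single nonzero entry in each column.
Gamma = np.zeros((k, n))
Gamma[rng.integers(0, k, size=n), np.arange(n)] = 1    # assign each column to one row
S_cw = Gamma @ np.diag(rng.choice([-1, 1], size=n))

print(S_hadamard)
print(S_cw)
```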
4.2.4 Sketching bounds

Data oblivious sketches are designed to give an $\epsilon$-subspace embedding for an arbitrary source dataset with probability at least $(1 - \delta)$. Sketching algorithms are appealing for large $n$ problems as the required $k$ to attain the $(\epsilon, \delta)$ bound is independent of $n$ for the Gaussian and Clarkson-Woodruff sketches, and very weakly dependent on $n$ for the Hadamard sketch. Table 4.1 summarises existing results on the necessary $k$ to attain the $(\epsilon, \delta)$ bound. Probabilistic worst case bounds for sketched regression are formed by noting that if a sketch produces an $\epsilon$-subspace embedding with probability at least $(1 - \delta)$, then the bounds in Section 4.2.1 must hold with probability at least $(1 - \delta)$. Woodruff (2014) gives an excellent survey of work in this area. We consider embedding probabilities in more detail in Chapter 6.

Sketch               Sketching time    Required sketch size $k$
Gaussian             $O(ndk)$          $O((d + \log(1/\delta))/\epsilon^2)$
Hadamard             $O(nd\log k)$     $O((\sqrt{d} + \sqrt{\log n})^2\log(d/\delta)/\epsilon^2)$
Clarkson-Woodruff    $O(nd)$           $O(d^2/(\delta\epsilon^2))$

Table 4.1: Properties of different data oblivious random projections (Woodruff, 2014). The third column refers to the necessary sketch size $k$ to obtain an $\epsilon$-subspace embedding for an arbitrary $n \times d$ source dataset with probability at least $(1 - \delta)$.

As mentioned, data aware random projections can also be used to generate $\epsilon$-subspace embeddings. Existing data aware projections perform weighted sampling with replacement from the source dataset. As such, data aware sketching methods are closely related to resampling methods such as the bootstrap and the jackknife (Ma and Sun, 2015). We focus on data oblivious random projections, where there is no direct connection to resampling methods.

The Gaussian sketch is mathematically tractable, and it is possible to establish a number of exact finite sample results regarding the performance of the sketched estimators. In the next section, we obtain the exact distribution of $\beta_S$ and the bias and variance of $\beta_P$. This provides guidance on issues regarding the relative efficiency of complete to partial sketching.
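For a fixed source matrix, the embedding probability appearing in these $(\epsilon, \delta)$ statements can be estimated by simulation. The sketch below relies on the standard characterisation that $S$ is an $\epsilon$-subspace embedding for $A$ exactly when the eigenvalues of $U^T S^T S U$ lie in $[1 - \epsilon, 1 + \epsilon]$, where $U$ is an orthonormal basis for the column space of $A$ (the same characterisation used in the proofs of Chapter 5); the dataset, sketch size and $\epsilon$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k, eps = 2_000, 6, 200, 0.5

A = rng.normal(size=(n, d))                        # illustrative source data matrix
U, _, _ = np.linalg.svd(A, full_matrices=False)    # orthonormal basis for col(A)

def cw_sketch(n, k, rng):
    """Dense representation of a k x n Clarkson-Woodruff sketching matrix."""
    S = np.zeros((k, n))
    S[rng.integers(0, k, size=n), np.arange(n)] = rng.choice([-1.0, 1.0], size=n)
    return S

reps, hits = 500, 0
for _ in range(reps):
    SU = cw_sketch(n, k, rng) @ U
    eigs = np.linalg.eigvalsh(SU.T @ SU)           # eigenvalues of U^T S^T S U
    hits += (eigs.min() >= 1 - eps) and (eigs.max() <= 1 + eps)

print("estimated probability of an epsilon-subspace embedding:", hits / reps)
```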
4.3 Gaussian sketching

4.3.1 Complete sketching

The Gaussian sketch is mathematically tractable, and it is possible to establish a number of exact finite sample results regarding the performance of the sketched estimators. In this section we develop the distribution of $\beta_S$ when using a Gaussian sketch. As mentioned previously, all results treat $y$ and $X$ as fixed. The variability in $\beta_S$ is solely due to the use of the random sketching matrix $S$. Let $(\tilde{y}_j, \tilde{x}_j^T)^T$ refer to the $j$th row in the sketched data matrix $\tilde{A} = [\tilde{y}, \tilde{X}]$ for $j = 1, \ldots, k$. Similarly, let $s_j^T$ denote the $j$th row in the sketching matrix $S$. The sketched dataset consists of $k$ random units $(\tilde{y}_j, \tilde{x}_j^T)$ for $j = 1, \ldots, k$. The $j$th sketched response is given by $\tilde{y}_j = s_j^T y$, and the $j$th sketched predictor is calculated as $\tilde{x}_j = s_j^T X$ for $j = 1, \ldots, k$. The $k$ sketched instances are independently distributed, because the rows of the sketching matrix are independent.

We take an indirect route to find the distribution of $\beta_S$, by focusing on the distribution of the sketched data $\tilde{A} = [\tilde{y}, \tilde{X}]$ conditional on the original dataset $A = [y, X]$. The initial step is to decompose the joint distribution of the sketched responses and predictors as the product of a marginal and a conditional distribution. Specifically,
\[
p(\tilde{y}, \tilde{X} \mid y, X) = p(\tilde{y} \mid \tilde{X}, y, X)\, p(\tilde{X} \mid y, X).
\]
It can be shown that $p(\tilde{y} \mid \tilde{X}, y, X)\, p(\tilde{X} \mid y, X)$ has the structure of a hierarchical Gaussian linear model.

We first show that the sketched dataset has a multivariate normal distribution, conditional on the source dataset. This follows as the sketched dataset can be expressed as a linear combination of Gaussian random variables. Specifically, row $j$ in the sketched dataset is given by $\tilde{a}_j = (\tilde{y}_j, \tilde{x}_j) = s_j^T A$. To be clear, it is helpful to express $(\tilde{y}_j, \tilde{x}_j)$ as a column vector,
\[
\begin{pmatrix} \tilde{y}_j \\ \tilde{x}_j^T \end{pmatrix} = A^T s_j.
\]
Conditional on $A = [y, X]$, $A^T s_j$ is a linear combination of independent Gaussians, as $s_j \sim N(0, I_n/k)$. As affine transformations of Gaussians are also multivariate normal, $(\tilde{y}_j, \tilde{x}_j)$ must then be jointly normally distributed, conditional on the source data $(y, X)$. It is easily shown that the joint distribution of the sketched responses and predictors is then
\[
\begin{pmatrix} \tilde{y}_j \\ \tilde{x}_j^T \end{pmatrix} \Bigg|\; y, X \;\sim\; N\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\; \frac{1}{k}\begin{pmatrix} y^T y & y^T X \\ X^T y & X^T X \end{pmatrix}\right),
\]
independently for $j = 1, \ldots, k$.

Standard results on multivariate normals give that the conditional distribution of $\tilde{y}_j$ given $\tilde{x}_j$ is also normal. A routine calculation shows that the conditional mean is related to $\beta_F$, that is $\mathbb{E}_S[\tilde{y}_j \mid \tilde{x}_j, y, X] = \tilde{x}_j \beta_F$. The subscript $S$ is used on the expectation operator to emphasise that the only random quantity is the sketching matrix. The conditional variance is related to the prediction error on the source dataset $\mathrm{RSS}_F$,
\[
\mathrm{var}_S(\tilde{y}_j \mid \tilde{x}_j, y, X) = \frac{1}{k}\left[y^T y - y^T X (X^T X)^{-1} X^T y\right] = \frac{1}{k}\,\mathrm{RSS}_F.
\]
The subscript $S$ is again used to recognise that the source of the variance is the random sketching matrix; the source dataset is fixed. The second equality follows from the sum of squares partition in linear models (Searle, 1997, Chapter 3). Therefore, the conditional distribution of $\tilde{y}_j$ given the sketched predictors $\tilde{x}_j$ and the source dataset $y, X$ is
\[
\tilde{y}_j \mid \tilde{x}_j, y, X \sim N\!\left(\tilde{x}_j \beta_F, \frac{\mathrm{RSS}_F}{k}\right),
\]
independently for $j = 1, \ldots, k$. This is the exact form of a standard Gaussian linear model. The distribution $p(\tilde{X} \mid y, X)$ is easily obtained, as the marginal distribution of $\tilde{x}_j$ is also multivariate normal,
\[
\tilde{x}_j^T \sim N\!\left(0, X^T X/k\right),
\]
independently for $j = 1, \ldots, k$. The sketching process can be described using the following hierarchical model,
\[
\tilde{y} \mid \tilde{X}, y, X \sim N\!\left(\tilde{X}\beta_F, \frac{\mathrm{RSS}_F}{k} I_k\right), \qquad
\tilde{X} \mid y, X \sim \mathrm{MN}\!\left(0_{k \times p}, I_k, \frac{1}{k}X^T X\right).
\]
A Gaussian sketch effectively simulates a series of observations from a Gaussian linear model parametrised in terms of $\beta_F$ and $\sigma^2_F$, where the design matrix has a matrix normal distribution.
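This hierarchical representation can be checked by simulation: over repeated Gaussian sketches, the sketched residual $\tilde{y} - \tilde{X}\beta_F$ should have variance close to $\mathrm{RSS}_F/k$, and $\beta_S$ should be centred at $\beta_F$. The snippet below is a rough Monte Carlo check on simulated data; the sizes and number of replications are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 2_000, 4, 200

X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
beta_F = np.linalg.solve(X.T @ X, X.T @ y)
RSS_F = np.sum((y - X @ beta_F) ** 2)

betas, resid_vars = [], []
for _ in range(500):
    S = rng.normal(scale=1.0 / np.sqrt(k), size=(k, n))   # Gaussian sketch
    y_sk, X_sk = S @ y, S @ X
    betas.append(np.linalg.solve(X_sk.T @ X_sk, X_sk.T @ y_sk))
    resid_vars.append(np.mean((y_sk - X_sk @ beta_F) ** 2))

print("mean sketched residual variance:", round(np.mean(resid_vars), 4))
print("theoretical value RSS_F / k    :", round(RSS_F / k, 4))
print("max |mean(beta_S) - beta_F|    :", round(np.abs(np.mean(betas, axis=0) - beta_F).max(), 4))
```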
We now turn to the distribution of $\beta_S$. The distribution of $\beta_S$ conditional on the sketched predictors follows immediately from standard results on linear models (Searle, 1997, Chapter 3),
\[
\beta_S \mid \tilde{X}, y, X \sim N\!\left(\beta_F, \frac{\mathrm{RSS}_F}{k}(\tilde{X}^T\tilde{X})^{-1}\right). \tag{4.5}
\]
To obtain the marginal distribution of $\beta_S$ it is necessary to integrate over the random sketched design matrix $\tilde{X}$. From properties of the normal distribution (Eaton, 2007), it is possible to show
\[
(\tilde{X}^T\tilde{X}) \mid y, X \sim \mathrm{Wishart}(k, X^T X/k).
\]
As such,
\[
(\tilde{X}^T\tilde{X})^{-1} \mid y, X \sim \mathrm{InvWishart}\!\left(k, k(X^T X)^{-1}\right).
\]
As seen in equation (4.5), $\beta_S$ is normally distributed when conditioned on the random Inverse-Wishart matrix $(\tilde{X}^T\tilde{X})^{-1}$. The marginal distribution of $\beta_S$ can then be described using the Normal Inverse-Wishart distribution (Gelman et al., 2014, p. 73). The following theorem characterises the distribution of $\beta_S$ under the Gaussian sketch.

Theorem 4.2. Suppose $\beta_S$ is computed using a Gaussian sketch and $k > p + 1$. The conditional distribution of $\beta_S$ is
(i) $\beta_S \mid \tilde{X}, y, X \sim N\!\left(\beta_F, \frac{n\sigma^2_F}{k}(\tilde{X}^T\tilde{X})^{-1}\right)$.
The marginal distribution of $\beta_S$ is
(ii) $\beta_S \mid y, X \sim \mathrm{Student}\!\left(\beta_F, \frac{n\sigma^2_F}{k - p + 1}(X^T X)^{-1}, k - p + 1\right)$.
For the proof see Chapter 5.

An immediate application of result (i) is the ability to generate exact confidence intervals for the elements of $\beta_F$, methodology that does not appear to be present in the existing literature. It is also possible to construct exact joint confidence regions for the entire vector $\beta_F$. Again assuming that $k > p + 1$, it should be noted that the variance of $\beta_S$,
\[
\mathrm{var}(\beta_S \mid y, X) = \frac{\mathrm{RSS}_F}{k - p + 1}(X^T X)^{-1}, \tag{4.6}
\]
is not dependent on the compression ratio $k/n$. Although $\mathrm{RSS}_F$ can be expected to grow linearly with $n$, this will generally be counterbalanced by $(X^T X)^{-1}$ decreasing linearly with $n$. The distribution of the approximation error $\|\beta_S - \beta_F\|_2$ will largely be controlled by the target dimension $k$. This speaks to the defining characteristic of sketching algorithms: given a fixed $k$, the stochastic approximation error does not necessarily increase with the size of the original dataset $n$. Probabilistic worst case bounds on the error $\|\beta_S - \beta_F\|_2$ can also be obtained by making an appeal to Chebyshev's inequality.
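Equation (4.6) can be checked against the empirical variance of $\beta_S$ over repeated sketches. The short simulation below is illustrative only; the dimensions are kept small so that the Monte Carlo loop runs quickly.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, k = 1_000, 3, 50

X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
beta_F = np.linalg.solve(X.T @ X, X.T @ y)
RSS_F = np.sum((y - X @ beta_F) ** 2)

# Theoretical variance of beta_S under a Gaussian sketch, equation (4.6).
var_theory = RSS_F / (k - p + 1) * np.linalg.inv(X.T @ X)

betas = []
for _ in range(2_000):
    S = rng.normal(scale=1.0 / np.sqrt(k), size=(k, n))
    X_sk, y_sk = S @ X, S @ y
    betas.append(np.linalg.solve(X_sk.T @ X_sk, X_sk.T @ y_sk))
var_mc = np.cov(np.array(betas).T)              # empirical covariance over sketches

print("theoretical variances:", np.round(np.diag(var_theory), 4))
print("Monte Carlo variances:", np.round(np.diag(var_mc), 4))
```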
4.3.2 Partial sketching

Partial sketching was first proposed by Dhillon et al. (2013) using uniform subsampling, and later studied for general sketches by Pilanci and Wainwright (2016). Existing results on partial sketching highlight that the model sum of squares influences the approximation error of the partial sketched estimator $\beta_P$. An important finding in the previous section was that the variance of the complete sketching estimator is dependent on the residual sum of squares. It is simple to see that the variance of the partial sketched estimator will not be a function of the residual sum of squares. From the normal equations it holds that $X^T y = X^T X \beta_F$. Using this property, we see that conditional on $y, X$, the variance of the random linear combination
\[
\beta_P = (X^T S^T S X)^{-1} X^T y = (X^T S^T S X)^{-1} X^T X \beta_F
\]
will be a function of the covariates and the fitted values. The residual vector has no influence on the variance of the partial sketching estimator, and as such the variance of $\beta_P$ will not be related to the residual sum of squares. This suggests that when the noise level is high, partial sketching may become preferable to complete sketching. This idea has been touched on in the existing literature, but specific guidelines are lacking (Becker et al., 2015; Dhillon et al., 2013). A statistical analysis can provide some insight into this issue.

The hierarchical model for complete sketching gave an intuitive statistical perspective on the mechanics of the algorithm. Partial sketching seems to lack a similar conceptual device. The least squares coefficients can be represented as the solution to the linear system of equations $X^T X b = X^T y$. Partial sketching simply returns the solution, $b$, to the approximate linear system $\tilde{X}^T\tilde{X} b = X^T y$. Lacking a convenient representation for the estimator, we must proceed in a more pedestrian manner. The mean square error of the estimator $\beta_P$ can be determined using only mean and variance information, and this will be the goal for now.

The key observation is that
\[
(\tilde{X}^T\tilde{X})^{-1} \mid y, X \sim \mathrm{InvWishart}\!\left(k, k(X^T X)^{-1}\right).
\]
Conditional on $y, X$, the estimator $\beta_P = (\tilde{X}^T\tilde{X})^{-1} X^T y$ is a linear combination of the elements of an Inverse-Wishart random variable. However, this is a non-standard distribution and it is difficult to directly express the distribution function of $\beta_P$. Despite this, it is straightforward to determine the mean and variance of $\beta_P$. From properties of the Inverse-Wishart distribution, it can be seen that the partial sketched estimator is biased, with mean
\[
\mathbb{E}_S[\beta_P \mid y, X] = \frac{k}{k - p - 1}\,\beta_F,
\]
where it is assumed that $k > p + 3$. This motivates an alternative unbiased estimator
\[
\beta^*_P = \frac{k - p - 1}{k}(\tilde{X}^T\tilde{X})^{-1} X^T y.
\]
Determining the variance of $\beta_P$ and the unbiased $\beta^*_P$ is a more lengthy computation (see Chapter 5). Skipping the work, the variance of the biased estimator $\beta_P$ is
\[
\mathrm{var}(\beta_P \mid y, X) = \frac{k^2}{(k-p)(k-p-1)(k-p-3)}\left(\mathrm{MSS}_F (X^T X)^{-1} + \frac{k-p+1}{k-p-1}\,\beta_F\beta_F^T\right). \tag{4.7}
\]
The variance of the unbiased estimator $\beta^*_P$ is
\[
\mathrm{var}(\beta^*_P \mid y, X) = \frac{k-p-1}{(k-p)(k-p-3)}\left(\mathrm{MSS}_F (X^T X)^{-1} + \frac{k-p+1}{k-p-1}\,\beta_F\beta_F^T\right). \tag{4.8}
\]
The variances of $\beta_P$ and $\beta^*_P$ have a similar structure to the variance of $\beta_S$. The main point of difference is that the variance of $\beta_S$ depends on the residual sum of squares, whereas the variance of $\beta_P$ and $\beta^*_P$ depends on the model sum of squares. As mentioned, the explicit form of the sampling distribution is hard to obtain, but by making a connection with method of moments estimation it is possible to establish asymptotic normality of both $\beta_P$ and $\beta^*_P$ as $k$ tends to infinity. This motivates the construction of approximate confidence intervals. As the exact variance is unknown we propose the following estimator,
\[
\widehat{\mathrm{var}}(\beta^*_P \mid y, X) = \frac{k-p-1}{(k-p)(k-p-3)}\left(\frac{k-p-1}{k}\,\mathrm{MSS}_S(\tilde{X}^T\tilde{X})^{-1} + \beta^*_P\beta^{*T}_P\right). \tag{4.9}
\]

4.3.3 Relative efficiency

The relative efficiency of complete and partial sketching is also of interest. As the plug-in estimator $\beta_P$ has a higher mean square error than $\beta^*_P$, it will not be considered in this section. The performance of the complete sketching estimator $\beta_S$ and the unbiased partial sketched estimator $\beta^*_P$ will be compared in terms of mean squared error. As both $\beta_S$ and $\beta^*_P$ are unbiased, the mean squared error can be computed using their respective covariance matrices, that is
\[
\mathbb{E}_S\!\left(\|\beta_S - \beta_F\|_2^2 \mid y, X\right) = \mathrm{tr}(\mathrm{var}(\beta_S)), \qquad
\mathbb{E}_S\!\left(\|\beta^*_P - \beta_F\|_2^2 \mid y, X\right) = \mathrm{tr}(\mathrm{var}(\beta^*_P)).
\]
Comparing (4.6) and (4.8), the variance of $\beta^*_P$ is dependent on $\mathrm{MSS}_F$, whereas the variance of $\beta_S$ is dependent on $\mathrm{RSS}_F$. This suggests that the signal to noise ratio in the source dataset will be an influential factor in determining which estimator is more efficient. When $R^2_F$ is close to one we expect complete sketching to be orders of magnitude more efficient than partial sketching, and when $R^2_F$ is close to zero we expect partial sketching to be orders of magnitude more efficient than complete sketching.

4.3.4 Combined estimator

So far we have assumed that an analyst must choose between one of the two methods.
Obtaining both β∗P and βS from a single sketch is computationally cheap, and may be an attractive strategy. The most demanding operation with the sketched data is calculating (X˜TX˜)−1. Given this quantity it is economical to compute both βS and β ∗ P . Becker et al. (2015) mention they are presently investigating such a strategy, but do not give any details. Our motivation for a combined estimator is driven by the fact even when using a single sketch (y˜, X˜), the two estimators are uncorrelated, that is cov (β∗P ,βS) = 0. This is established by taking iterated expectations, and using the hierarchical model established in section 4.3.1 (see Chapter 5). A simple strategy is then to take a weighted combination of βS and β ∗ P . A combined estimator βC can be defined as βC = αβS + (1− α)β∗P , for some 0 < α < 1. The value of α that minimises the mean square error is αopt = tr(var(β∗P )) tr(var(β∗P )) + tr(var(βS)) . The weight given to the βS is related to the relative efficiency of the two estimators. Use of the weighted estimator is expected to be most beneficial when the signal to noise ratio is moderate, that is R2F ≈ 0.5. When the signal to noise ratio is either very high or very low, there is little gain from using the weighted estimator as either the complete or partial estimator will dominate. 78 4.4. Asymptotics 4.4 Asymptotics 4.4.1 Preliminaries Finite sample distributions of random projection estimators can be mathematically intractable, and as such asymptotic analysis can be a powerful tool (Li et al., 2006). It is a very difficult task to establish meaningful finite sample results for the Hadamard and Clarkson-Woodruff sketches, as they are discrete distributions over a very large combinatorial space. The explicit finite sample distribution of the sketched estimators can be written as a sum over all these possible combinations, but such a representation is not very informative. Instead, it is useful to study the large n distribution of the estimators βS and βP to obtain an interpretable expression. As βF is the estimand in sketching algorithms, this requires conditioning on the source data in the asymptotic analysis. We make no assumption on the nature of the data generating process. To elaborate, let A(n) = [y(n),X(n)] represent the n×d source data matrix of full column rank. Any source data matrix A(n) has a set of associated least squares coefficients, which will here be denoted β (n) F . The overall goal is to determine the asymptotic form of the distributions p(βS |A(n)) and p(β∗P |A(n)) for some arbitrary large dataset A(n). To take limits, we employ a fixed sequence of n× d datasets, all of rank d. In the regression scenario this amounts to assuming thatX(n) is of full column rank and that y(n) is not a perfect linear combination of the columns of X(n) for all n. Conditioning on A(n) is effectively the same as treating the full dataset as an arbitrary sequence of constants Aij for i = 1, . . . , n, j = 1, . . . , d. This is analogous to large sample results for regression models where the design matrix is treated as arbitrary set of constants, and the random variables of interest are the error terms, for example see Van Der Vaart (1998, section 2.5). Here the source dataset is treated as a sequence of constants and the random variables of interest are the elements of the sketching matrix. The asymptotic analysis is carried out in two stages. The initial step is to establish asymptotic normality of the sketched dataset. 
This is then followed by an analysis of the limiting distribution of βS , and β ∗ P . There is some related work by Ma et al. (2015) who develop asymptotic expressions for the bias and variance of data aware sketched regression estimators, where limits are taken in the sketch size k. Our work is different as we study data oblivious random projections and take limits in n, which is perhaps more natural in the Big Data setting. 4.4.2 Sketching central limit theorem Using a Gaussian sketch, treating the source data matrix A as fixed: [A˜|A] ∼MN(0, Ik,ATA/k). (4.10) Each row is statistically independent, and marginally normally distributed with covariance matrixATA/k. We would like to establish an analogous asymptotic result for the Clarkson-Woodruff and Hadamard sketches. In (4.10), the source dataset is treated as fixed. The original data matrix A can originate from any data generating process. The source dataset could arise from a spatial point process, a collection of time series or a Gaussian Markov random field. When we treat the source data matrix A as fixed we are completely agnostic as to the data generating process that led to the creation of A. We would like to establish an asymptotic result in this vein, where we make minimal assumptions on A. We have gone to considerable effort to establish a conditional central limit theorem, where the source dataset is treated as an arbitrary sequence of fixed constants. This is so our expedition into asymptopia remains tied as closely to reality as possible. In practice, the large dataset that we apply a random projection to will be a fixed file living on a server or hard drive. We would like to describe the distribution of the random sketched dataset in relation to the fixed source dataset. This is the distribution that is reported in (4.10) for the Gaussian sketch. We want to understand the asymptotic form of this conditional distribution for 79 4. Statistical properties of sketching algorithms the Hadamard and Clarkson-Woodruff projections. The only random variables in the sequence of distri- butions that we study are elements of the random projections. An unconditional central limit theorem for sparse sketching matrices with independent entries is given in Li et al. (2006). We cannot easily extend their method as we wish to establish a conditional central limit theorem and the Clarkson-Woodruff sketch and the Hadamard sketch have dependent entries. Our method of proof differs in many significant aspects. Under some regularity conditions the Hadamard and Clarkson-Woodruff sketches produce sketched data that asymptotically has the same matrix normal distribution as under the Gaussian sketch. Al- though asymptotic normality may not be particularly surprising seeing as the sketched data are linear combinations of random vectors, the proof is not immediate due to the dependence in the Hadamard and Clarkson-Woodruff sketches and our conditional perspective. The difficulties we face are most easily illustrated for the Clarkson-Woodruff sketch. Algorithm 4.1 Clarkson-Woodruff sketch A˜← 0 Initialise sketched dataset as k × d matrix of zeroes For i = 1 to i = n Sample z ∼ Uniform(1, . . . , k) Sample random index Sample r ∼ Uniform(−1,+1) Sample random sign A˜z ← r ×Ai + A˜z Multiply by r and add to row z in sketch Output A˜ Output sketched dataset The behaviour of the Clarkson-Woodruff sketch can be represented as a many to less mapping. Each row in the source dataset is assigned to a single row in the sketched dataset. 
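For concreteness, Algorithm 4.1 can be transcribed almost line for line into numpy. The function below is an illustrative sketch of the streaming construction (the function name and dimensions are arbitrary), not the implementation used for the experiments in this chapter.

```python
import numpy as np

def clarkson_woodruff_stream(A, k, rng):
    """One pass Clarkson-Woodruff sketch of an n x d array A, following Algorithm 4.1."""
    n, d = A.shape
    A_sk = np.zeros((k, d))              # initialise sketched dataset as k x d zeroes
    for i in range(n):                   # single pass over the rows of the source dataset
        z = rng.integers(0, k)           # sample a random row index of the sketch
        r = rng.choice([-1.0, 1.0])      # sample a random sign
        A_sk[z] += r * A[i]              # add the signed source row to row z of the sketch
    return A_sk

rng = np.random.default_rng(6)
A = rng.normal(size=(1_000, 3))          # arbitrary source dataset
print(clarkson_woodruff_stream(A, k=20, rng=rng).shape)
```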
The Clarkson-Woodruff sketch has an alternative streaming construction that highlights this property, given in Algorithm 4.1. As each row in the source dataset only contributes to a single row in the sketched dataset, it might be expected that this results in some statistical dependence amongst the rows of the sketched dataset. The concept of dependence here is again conditional on the source dataset A being known. The independence in (4.10) refers to the fact that knowledge of any particular row in the sketched dataset does not help to predict any other row in the sketched dataset, given that we know the full source dataset A. This definition of independence has interesting implications for the Clarkson-Woodruff sketch. Suppose that we have the following source dataset where n = 3 and d = 2: A =  10 10 1 −1 0.1 0.1  (4.11) Suppose that we take a Clarkson-Woodruff sketch of size k = 2 to obtain the k × d matrix A˜. Suppose we reveal the first row to an outside observer who then has to predict the second row in the sketched data matrix. The observer knows A˜ = 11 9 ? ?  (4.12) Conditional on the observer having access to the source data matrix A (4.11) they could easily reason that the second row in A˜ will either be (0.1, 0.1) or (−0.1,−0.1). The observer can piece this together by realising that rows 1 and 2 in the source dataset must have been assigned to the first row in the sketched dataset. Knowledge ofA and the first row in A˜ is enough to reverse engineer the sketched mapping and to predict the second row in the sketched data matrix. For asymptotic equivalence with the Gaussian sketch, we require no information gain from revealing any row of the sketched data matrix given that we have access to the full source dataset A. The predictive ability of any row in the sketched data matrix needs 80 4.4.2. Sketching central limit theorem to dissipate with n even after accounting for the fact that we are always conditioning on knowing the full source dataset A. We will see that we can obtain asymptotic independence of the sketched dataset rows conditional on knowing the full source dataset under mild regularity conditions. Additionally, although it seems each row in the sketched dataset will be marginally normally distributed, it is not clear if joint asymptotic normality over all rows will hold. Similar conundrums arise when examining the Hadamard sketch in detail. The k × d random matrix A˜ is the output of a stochastic process governed by the fixed n× d source dataset A(n) and the distribution of the random k × n sketching matrix S. The sketched dataset is a linear combination of random vectors, the number of which increases with n. As such, we can expect A˜ to demonstrate some stable limiting behaviour as n grows larger. Under an assumption on the limiting leverage scores of the source data matrix, we can establish a central limit theorem for the sketched dataset. Recall the singular value decomposition of the source dataset A(n) = U(n)D(n)V T (n). The leverage scores for observation i in the source dataset is defined as ‖u(n)i‖22 where uT(n)i gives row i in U(n). The leverage scores of the observations in the source data matrix have been identified an important structural property of sketching algorithms (Mahoney and Drineas, 2016). Assumption 1 highlights their role in establishing asymptotic normality of the sketched data matrix. Assumption 1 Let the singular value decomposition of the n × d source dataset be given by A(n) = U(n)D(n)V T (n). 
Let $u_{(n)i}^T$ give the $i$th row in $U_{(n)}$. Assume that the maximum leverage score tends to zero, that is
\[
\lim_{n \to \infty} \max_{i = 1, \ldots, n} \|u_{(n)i}\|_2^2 = 0.
\]

Theorem 4.3 gives the sketching central limit theorem.

Theorem 4.3. Consider a fixed sequence of arbitrary $n \times d$ data matrices $A_{(n)}$, where $d$ is fixed. Let $A_{(n)} = U_{(n)}D_{(n)}V_{(n)}^T$ represent the singular value decomposition of $A_{(n)}$. Let $S$ be a $k \times n$ Hadamard or Clarkson-Woodruff sketching matrix where $k$ is also fixed. Suppose that Assumption 1 on the maximum leverage score is satisfied. Then as $n$ tends to infinity with $k$ and $d$ fixed,
\[
[\tilde{A}V_{(n)}D_{(n)}^{-1} \mid A_{(n)}] \xrightarrow{d} \mathrm{MN}(0, I_k, I_d/k).
\]
The proof of Theorem 4.3 is given in Chapter 5. Heuristically, for large $n$ we expect the matrix normal result (4.10) to approximately hold for the Hadamard and Clarkson-Woodruff sketches.

The significance of Assumption 1 is perhaps best explained by making a connection to a version of the Lindeberg-Feller theorem for triangular arrays of uniformly bounded random variables (Billingsley, 1999).

Theorem 4.4 (Billingsley, 1999). For each $n \in \mathbb{N}$, let $Z_{n1}, Z_{n2}, \ldots, Z_{nr_n}$ be a sequence of independent random variables with $\mathbb{E}(Z_{ni}) = 0$ and $\mathrm{var}(Z_{ni}) = \sigma^2_{ni}$ for $i = 1, \ldots, r_n$. Let $s_n^2 = \sum_{i=1}^{r_n}\sigma^2_{ni}$ and assume that $r_n \to \infty$ as $n \to \infty$. Suppose that we can form a sequence of upper bounds $(K_n)_{n \in \mathbb{N}}$ such that $|Z_{ni}| \le K_n$ almost surely for $i = 1, \ldots, r_n$. Then if $K_n/s_n \to 0$ as $n \to \infty$ we have the convergence in distribution
\[
\frac{1}{s_n}\sum_{i=1}^{r_n} Z_{ni} \xrightarrow{d} N(0, 1).
\]

In Theorem 4.4 the condition that $K_n/s_n \to 0$ ensures that no random variable in a particular row of the array has too much pull over the sum $\sum_{i=1}^{r_n} Z_{ni}$. A triangular array of random variables satisfying the conditions in Theorem 4.4 is often said to be uniformly asymptotically negligible, in that no single term has undue influence over the random sum. We can make an analogy to the leverage score condition in the sketching central limit theorem (Theorem 4.3). The sum of the statistical leverage scores is always equal to the rank of the source dataset. As we have assumed that each dataset in the sequence is of rank $d$, we have that $\sum_{i=1}^n \|u_{(n)i}\|_2^2 = d$ for all $n$. As $n$ grows we need the maximum contribution from a single term in the sum to tend to zero. The limiting leverage scores must satisfy an asymptotic negligibility condition, so that each individual observation provides a vanishingly small contribution to the total sum of the leverage scores.

Given a source dataset with centred columns, the leverage scores have a particularly intuitive interpretation in terms of the principal components decomposition of the source dataset. The row vector $u_{(n)i}^T D_{(n)}$ gives the coordinates of observation $i$ on the principal component axes. The elements of the vector $u_{(n)i}$ give the coordinates of observation $i$ in a scaled system where the variance along each principal coordinate axis is set to be one. If we think of the source dataset as a point cloud in Euclidean space, Assumption 1 essentially implies that there are no extreme outliers as $n$ tends to infinity. Each observation must have a negligible contribution to the total variance along each principal component axis.

[Figure 4.2: Abalone dataset ($n_{\mathrm{full}} = 4{,}167$) with principal component axes. Panel (a): mean centred length (mm) against mean centred total weight (grams). Panel (b): maximum leverage score against the number of observations ($n$).]

We consider a real dataset to illustrate this concept.
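Before turning to that illustration, note that the leverage scores are cheap to compute from a thin singular value decomposition. The snippet below is a small illustrative check on simulated data; it stands in for, but does not reproduce, the abalone example that follows.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(500, 2))            # stand-in for a mean centred source dataset

def max_leverage(A):
    """Maximum leverage score of the rows of A, computed from the thin SVD."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return np.max(np.sum(U ** 2, axis=1))

# Track the maximum leverage score as observations are added sequentially.
for n in (50, 100, 200, 500):
    print(n, round(max_leverage(A[:n]), 4))
```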
Figure 4.2 shows a dataset consisting of physical measurements on abalone. Each variable has been mean centred. The first and second principal components are shows as a solid and dashed lines respectively. We define a sequence of datasets A(n) by taking the first n rows in the full dataset. Panel (b) shows the maximum leverage score against n, where we move through the dataset sequentially. We compute the maximum leverage score at each value of n for n = 10, 11, . . . , nfull. The maximum leverage score appears to be heading towards zero as n increases. Assumption 1 rules out gross outliers in the source dataset. The proof of Theorem 4.3 relies heavily on the fact that the random variables that are involved in the construction of the Clarkson-Woodruff sketch and the Hadamard sketch are bounded. The random projections use random variables that take values in {−1, 0,+1}. A second important factor is that the leverage scores of an arbitrary data matrix A(n) = U(n)D(n)V T (n) are also bounded. For all n ∈ N, ‖u(n)i‖22 ≤ 1 for all i = 1, . . . , n. We can thus study a sequence of bounded random matrices. Another key factor in the proof is that Hadamard matrices have a number of symmetry properties that lead to a pairwise independence structure in the random projection. These issues are discussed in more detail in Chapter 5. 82 4.4.3. Sketching estimators 4.4.3 Sketching estimators The central limit theorem for the sketched data suggests that the results about βS and βP for the Gaussian sketch will also approximately hold for the Hadamard and Clarkson-Woodruff sketches for large n. In order to establish convergence of the estimators it helps to adopt an extra assumption on the sequence of source datasets. Assumption 2: lim n→∞n −1  yT(n)y(n) yT(n)X(n) XT(n)y(n) X T (n)X(n)  = Q for some positive-definite matrix Q. It is worth discussing the significance of the limiting matrix Q. A useful comparison can be made to asymptotic theory for regression models, where a common assumption is that the design matrix satisfies the limit condition n−1XT(n)X(n) → B, where B is some positive definite matrix (White, 1984; Greene, 1997). The development of asymptotic results is often eased by treating the covariates as a random sample, although this requires positing a realistic probability model for the covariates, which may be difficult. Treating the covariates as an arbitrary fixed sequence relaxes this assumption and covers more general scenarios. Although it is possible to establish asymptotic results when n−1XT(n)X(n) is not required to converge to any fixed matrix, proofs can become very technical (Fahrmeir and Tutz, 1994, Appendix A.2). Imposing a limiting value for n−1XT(n)X(n) simplifies arguments and can be seen as a compromise between making strong and weak assumptions about the covariates (Fahrmeir and Tutz, 1994, p.46). There is an analogous motivation for Assumption 2, the limiting matrix Q is present to avoid specifying a probability model for the source dataset, without overcomplicating the mathematical analysis. Setting up a limit theorem requires a little extra care with notation. As we have a sequence of datasets A(n), there is a corresponding sequence of optimal least squares coefficients β (n) F . Similarly, there is a sequence of squared residual errors RSS (n) F and model sum of squares MSS (n) F . As the sequence of datasets are fixed, β (n) F , RSS (n) F and MSS (n) F are a deterministic sequence. 
Under the assumptions 1 and 2, it is possible to establish an asymptotic result for βS and βP . Theorem 4.5. Suppose that Assumptions 1 and 2 hold, k ≥ p, and βS is computed using a Hadamard or Clarkson-Woodruff sketch. Let (X˜TX˜)+ denote the Moore-Penrose pseudo-inverse of (X˜TX˜). Let H˜(n) = RSS (n) F k ( X˜TX˜ )+ and H(n) = RSS (n) F k − p+ 1 ( XT(n)X(n) )−1 . Then as n→∞, convergence in distribution holds for (i)[H −1/2 (n) (βS − β(n)F )|A(n)]→ Student (0, Ip, k − p+ 1) , (ii)[H˜ −1/2 (n) (βS − β(n)F ) |A(n)]→ N (0, Ip) . The proof is in Chapter 5. For large n, we expect βS to be approximately distributed as per Theorem 4.5 for both the Hadamard and Clarkson-Woodruff sketches. It is harder to establish a comparable limit theorem for β∗P , due to the non-standard distribution of β∗P when using a Gaussian sketch. There is no typical normalised distribution to target. Instead, we wish to show asymptotic equivalence in moments. The partially sketched estimator under the Hadamard and Clarkson-Woodruff sketches should have similar mean and variance properties to the Gaussian partially sketched estimator. An extra assumption has to be made to show convergence in moments. A sufficient condition is a stability condition on the singular values of the sketched data matrix. Assumption 3. Let G be the Gram matrix of the scaled sketched dataset, G = n−1X˜TX˜. Assume that the sequence of source datasets is such that ES ( 1 σ2min(G) )2 is finite for large enough n. This additional regularity condition enables a formal limit theorem regarding the moments of β∗P . 83 4. Statistical properties of sketching algorithms Theorem 4.6. Suppose that Assumptions 1, 2 and 3 hold, k > p + 3, and β∗P is computed using a Hadamard or Clarkson-Woodruff sketch. Let H(n) = (k − p− 1) (k − p)(k − p− 3) ( MSS (n) F (X T (n)X(n)) −1 + (k − p+ 1) (k − p− 1)β (n) F β (n)T F ) . Then as n→∞, (i) ES [β ∗ P − β(n)F |A(n)] → 0. (ii) varS ( H −1/2 (n) (β ∗ P − β(n)F ) |A(n) ) → Id The proof is in Chapter 5. Once again, the heavy notation may obscure the essence of the result. The subscript S is used to emphasise that the only source of randomness is the sketching matrix, and that the source dataset is fixed. The theorem suggests that the bias and variance of β∗P under the Clarkson- Woodruff and Hadamard sketches should be approximately equal to that under the Gaussian sketch. Specifically, we expect equations (4.6), (4.7), and (4.8) to be good approximations for the variance of the sketched estimators using the Hadamard or Clarkson-Woodruff sketches. The results here are meant to be useful heuristics to assess the uncertainty attached to the output of the randomised approximation algorithm. There is a need to communicate and quantify the approximation error of sketching algorithms to end users, and the asymptotic results developed in this section can be of use. 4.5 Data application 4.5.1 Human leukocyte antigen dataset We compared the performance of the sketching estimators on a real genetic dataset taken from the UK Biobank database. We use a small extract from the data in Astle et al. (2016). The selected response variable was mean red cell volume (MCV), taken from the full blood count assay and adjusted for various technical and environmental covariates. Genome-wide imputed genotype data in expected allele dose format were available on n = 132,353 study subjects (Howie et al., 2009). 
We consider 1000 genetic variants in the Human leukocyte antigen (HLA) region of chromosome 6, selected so that no pair of variants had a Pearson correlation of allelic scores greater than 0.8. The region was chosen as many associations were discovered in a genome-wide scan using univariable models; these associations were with variants with different allele frequencies, suggesting multiple distinct causal variants in the region. The aim is to perform a multivariable regression analysis to obtain variant effect size estimates that are conditional on the other variants in the region.

An early theoretical finding was that the partial sketched estimator $\beta_P$ was biased. One thousand sketches were taken to estimate the bias $\mathbb{E}_S(\beta_P - \beta_F)$ with $k = 1{,}500$. We also computed the bias corrected estimator $\beta^*_P$ in each replication. Figure 4.3 plots the average value of the estimators against the true value of the least squares coefficient using the full dataset. The top row (a)-(c) shows results for $\beta_P$, and the bottom row (d)-(f) shows results for $\beta^*_P$. The first, second and third columns display the results for the Gaussian, Hadamard and Clarkson-Woodruff sketches respectively. The solid line in each panel is the identity line. The dashed line in the top row shows the theoretical bias, having slope $k/(k - p - 1)$. The results in the top row show that $\beta_P$ is biased for each of the random projections. The bias closely matches the theoretical factor. The bottom row shows that the adjusted estimator $\beta^*_P$ appears to be unbiased, with the mean values falling closely along the identity line.

[Figure 4.3: Bias of sketching estimators on the HLA dataset. Average sketched dataset coefficients are plotted against the full dataset coefficients; in this scenario $n = 132{,}353$, $p = 1000$, $k = 1500$. Panels (a)-(c) show $\beta_P$ and panels (d)-(f) show $\beta^*_P$ for the Gaussian, Hadamard and Clarkson-Woodruff sketches respectively. The solid line is the identity line and the dashed line represents the theoretical bias factor.]
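The inflation factor $k/(k - p - 1)$ is easy to reproduce on simulated data. The sketch below is illustrative only: the dimensions are far smaller than the HLA analysis and a Gaussian sketch is used for speed.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, k = 2_000, 20, 100

X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
beta_F = np.linalg.solve(X.T @ X, X.T @ y)
Xty = X.T @ y

# Average beta_P over repeated Gaussian sketches and compare the coefficient
# inflation with the theoretical bias factor k / (k - p - 1).
reps = 1_000
est = np.zeros(p)
for _ in range(reps):
    X_sk = rng.normal(scale=1.0 / np.sqrt(k), size=(k, n)) @ X
    est += np.linalg.solve(X_sk.T @ X_sk, Xty)
est /= reps

print("fitted slope          :", round(np.polyfit(beta_F, est, 1)[0], 3))
print("theoretical k/(k-p-1) :", round(k / (k - p - 1), 3))
```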
We also compared the complete and partially sketched estimators on mean square error and the coverage of confidence intervals at $k = 1{,}500$ and $k = 10{,}000$. We did not consider a combined estimator as the small $R^2_F$ value would give an optimal complete sketching weight of close to zero. Table 4.2 reports the mean square error for each of the estimators. The signal to noise ratio is quite low for this dataset, with $R^2_F = 0.02$. The relative efficiency bound dictates that partial sketching will be much more efficient than complete sketching on this dataset. The simulation results support this idea, with $\beta^*_P$ having a mean square error roughly sixty times smaller than $\beta_S$ at both values of $k$. Results are very similar for each of the random projections, suggesting that the asymptotic approximations are reasonable for this dataset. For $k = 1{,}500$, the mean square error of $\beta_P$ is approximately ten times that of $\beta^*_P$. For $k = 10{,}000$, there is less of a difference, as the ratio $k/(k - p - 1)$ is closer to one. The bias adjusted estimator $\beta^*_P$ has significant advantages over $\beta_P$ when $k/(k - p - 1)$ is appreciably larger than one.

                      k = 1,500                              k = 10,000
                      $\beta_S$   $\beta_P$   $\beta^*_P$    $\beta_S$    $\beta_P$      $\beta^*_P$
Gaussian              235 (2)     39 (0.4)    3.9 (0.04)     13.1 (0.1)   0.28 (0.002)   0.214 (0.001)
Hadamard              233 (2)     39 (0.4)    3.8 (0.04)     12.5 (0.1)   0.27 (0.002)   0.204 (0.002)
Clarkson-Woodruff     237 (2)     39 (0.5)    4.0 (0.05)     13.1 (0.1)   0.27 (0.002)   0.212 (0.002)

Table 4.2: Mean square error of sketched estimators on the HLA dataset. Standard errors are in brackets.

Table 4.3 summarises the coverage of 95% confidence intervals for the sketched estimators. We report the overall proportion of intervals that contained the true value of the least squares estimate $\beta_F$ over the two hundred and fifty sketches and $p = 1{,}000$ coefficients. The observed coverage is close to the nominal level of 0.95 at both levels of $k$. The different random projections give very similar results, suggesting that the use of asymptotic approximations is again reasonable on this dataset. The intervals for the Hadamard sketch appear to be slightly conservative at $k = 10{,}000$. This may be due to the specialised random number generator used in the implementation of the Hadamard sketch. The Rademacher random variables are only four-wise independent as opposed to being mutually independent (Geppert et al., 2017, p. 85).

                      k = 1,500                  k = 10,000
                      $\beta_S$   $\beta^*_P$    $\beta_S$   $\beta^*_P$
Gaussian              0.950       0.951          0.950       0.950
Hadamard              0.949       0.952          0.953       0.953
Clarkson-Woodruff     0.950       0.951          0.950       0.950

Table 4.3: Coverage of confidence intervals on the HLA dataset. The largest standard error is 0.001.

4.5.2 Flights dataset

The sketching algorithms were also evaluated on the New York flights dataset available in the R package nycflights13 (Wickham, 2014). Arrival delay was taken as the response, and departure delay, distance, departure time, origin, month and day were chosen to be the covariates. Rows of the dataset with missing data were omitted, leaving $n = 327{,}346$ and $p = 47$. We fit a saturated linear model with the 47 predictors. The goal was to compare the accuracy of the various sketches on real data rather than to build a statistical model for the flights dataset. We compared the mean square error of the estimators and the coverage of confidence intervals for $k = 5{,}000$. In contrast to the HLA dataset, the flights dataset has a very high $R^2_F$ value of 0.99. We took one thousand sketches to compare complete and partial sketching. Table 4.4 reports the mean square error of $\beta_S$, $\beta_P$ and $\beta^*_P$. As expected, complete sketching has a much smaller mean square error than partial sketching. Table 4.5 summarises the coverage rates of the 95% confidence intervals.

                      $\beta_S$   $\beta_P$       $\beta^*_P$
Gaussian              62 (1)      15,200 (300)    14,400 (300)
Hadamard              59 (1)      14,600 (300)    14,800 (300)
Clarkson-Woodruff     59 (1)      14,700 (300)    14,100 (300)

Table 4.4: Mean square error of sketched estimators on the flights dataset with $k = 5000$. Standard errors are in brackets.

                      $\beta_S$   $\beta^*_P$
Gaussian              0.950       0.954
Hadamard              0.950       0.949
Clarkson-Woodruff     0.948       0.952

Table 4.5: Coverage of 95% confidence intervals on the flights dataset with $k = 5000$. The largest standard error is 0.004.
We report the overall proportion of intervals that contained the true value of the least squares estimate over the thousand sketches and p = 47 coefficients. We also generated a synthetic flights dataset with an R2F of close to 0.5. This was achieved by generating a synthetic response vector y′ using the fitted model. The simulated response was computed as y′ = XβF + 15e, where e was the residual vector from the least squares fit e = y −XβF . We took one thousand sketches and computed βS , β ∗ P and a weighted estimator using the optimal weight αopt in each iteration. Table 4.6 reports the mean square error for each estimator. From the theoretical analysis the mean square error of the weighted estimator is expected to be roughly half that of βS or β ∗ P . The results support this for each of the three different sketches. We also assessed the finite sample behaviour of the normal approximation in Theorem 4.3 at different levels of k and p. We dropped some predictors from the full flights dataset to give smaller datasets with p = 10 and p = 25 covariates. We then took subsamples of different sizes from each of the datasets. A single subsample was taken at each value of n, so the same subsampled dataset was being sketched each 86 4.6. Discussion βS β ∗ P βC with α = αopt Gaussian 13,500 (300) 13,700 (300) 7,000 (150) Hadamard 13,100 (300) 14,300 (300) 6,900 (150) Clarkson-Woodruff 13,600 (300) 14,600 (300) 7,000 (150) Table 4.6: Mean square error of sketched estimators on synthetic flights dataset with k = 5000. Standard errors are in brackets. time. One thousand sketches were taken of each dataset at different values of k. We tested the joint multivariate normality of [y˜, X˜] and the normality of the sketched residual e˜ = S(y−XβF ). The squared Mahalanobis distance of the sketched observations was compared to the theoretical χ2-distribution. As n increases the rejection rate is expected to fall to the type one error rate of 0.05. Figure 4.4 plots the proportion of times the null hypothesis of normality is rejected against the size of the source dataset. The Hadamard sketch appears to have a much faster rate of convergence than the Clarkson-Woodruff sketch. When using a Hadamard sketch, each row in the sketched dataset is a linear combination of n observations from the source dataset. When using a Clarkson-Woodruff sketch, each row in the sketched dataset is expected to be a combination of only n/k observations from the source dataset. As such, n/k must be large for the normal approximation to hold. As expected, the rejection rate for the Clarkson- Woodruff sketch increases with k, but remains stable for the Hadamard sketch. In Figure 4.4 the rejection rate for the Clarkson-Woodruff sketch increases with p. The Hadamard sketch seems to be less sensitive to the number of covariates. The extra log k computation cost associated with the Hadamard sketch (Table 4.1) appears to have the benefit of accelerated convergence to normality. Even though joint normality may not be holding for the Clarkson-Woodruff sketch for the flights dataset, the coverage of the confidence intervals is still very good. As y˜ = X˜βF + e˜, normality of the sketched residual is perhaps sufficient in justifying the approximate confidence intervals given by Theorem 4 (ii). The sketched residual converges much more quickly than the full sketched data matrix, which perhaps explains the good coverage properties of the confidence intervals for βS in Table 4.5. 
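A rough version of the normality diagnostic described above can be written in a few lines. The snippet below uses simulated data, a Clarkson-Woodruff sketch and a Kolmogorov-Smirnov comparison of the squared Mahalanobis distances to a $\chi^2_d$ reference; these are illustrative choices and not the exact testing procedure behind Figure 4.4.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, d, k = 50_000, 5, 100

A = rng.normal(size=(n, d))                    # stand-in for the source dataset [y, X]
Sigma_inv = np.linalg.inv(A.T @ A / k)         # covariance of a sketched row under (4.10)

# Clarkson-Woodruff sketch via the streaming construction.
rows = rng.integers(0, k, size=n)
signs = rng.choice([-1.0, 1.0], size=n)
A_sk = np.zeros((k, d))
np.add.at(A_sk, rows, signs[:, None] * A)

# Squared Mahalanobis distances of the sketched rows; these should be
# approximately chi-squared with d degrees of freedom if (4.10) holds.
d2 = np.einsum("ij,jk,ik->i", A_sk, Sigma_inv, A_sk)
print(stats.kstest(d2, stats.chi2(df=d).cdf))
```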
[Figure 4.4: Proportion of times the null hypothesis of normality is rejected at $\alpha = 0.05$ against the size of the source dataset ($n$), for the Hadamard (solid line) and Clarkson-Woodruff (dashed line) sketches. Columns correspond to $k = 500, 1500, 5000$ and rows to $p = 10, 25, 46$. Results for tests of the sketched residual vector $\tilde{e} = S(y - X\beta_F)$ are plotted as triangles, and results for tests of the entire sketched dataset $[\tilde{y}, \tilde{X}]$ are plotted as circles. The horizontal line gives the type 1 error of 0.05. The y-axis is on a log scale.]

4.6 Discussion

Sketching algorithms have emerged in the computer science community as a powerful device for the analysis of massive datasets (Mahoney and Drineas, 2016). Sketched regression algorithms use random projections to reduce the size of the original dataset; the sketched dataset is then used to estimate the optimal least squares coefficients. Most existing theory for sketched regression is from an algorithmic worst case perspective and connects with random matrix theory and computational geometry (Raskutti and Mahoney, 2014; Thanei et al., 2017). In this chapter we have provided a complementary statistical perspective and derived new tools for assessing the uncertainty attached to sketched estimators, as well as guidelines for choosing between competing sketching algorithms.

Iterative methods, in particular stochastic gradient descent, have not been mentioned so far. For large $n$ regression problems, stochastic gradient descent will produce iterates that converge to $\beta_F$ under very mild conditions. Comparisons between single pass sketching and stochastic gradient methods are difficult, as the two techniques are not formulated for the exact same purpose. Single pass sketching algorithms are designed to return an approximate solution in finite time with probabilistically controlled error, whereas stochastic gradient methods are designed to converge to the exact solution asymptotically. It is perhaps more appropriate to compare stochastic gradient descent to iterative sketching methods, as iterative sketching algorithms also come with convergence guarantees to $\beta_F$ (Pilanci and Wainwright, 2016; Gower and Richtárik, 2015). Iterative sketching methods make use of approximate second order information that can lead to a potential improvement compared to first order stochastic gradient methods (Roosta-Khorasani and Mahoney, 2016). Our focus has been on characterising the approximation error attached to single pass sketching estimators. The spectral theory developed in Chapter 6 has applications to iterative sketching algorithms, although that is not the focus of the chapter.

There has been recent work in adapting sketching methods for statistical inference in large datasets, building from the worst case bounds in the computer science literature. Geppert et al. (2017) and Bardenet and Maillard (2015) investigate sketching algorithms for Bayesian regression, and derive bounds on the difference between the sketched posterior distribution and the full data posterior distribution. Yang et al. (2015b) consider sketched penalised regression, and give bounds between the sketched solution and the full data solution similar to the results in Section 4.2.1. Only complete sketching is considered in the aforementioned work.
The results on the advantages of partial sketching in this paper could motivate adaptations that make use of the exact marginal associations XTy. Sketching ideas have been used to develop methods for approximate non-linear regression (Avron et al., 2014; Banerjee et al., 2013). A related branch of work uses random projections to reduce the number of predictors in regression and classification problems (Shah and Meinshausen, 2013; Cannings and Samworth, 2015; Guhaniyogi and Dunson, 2015). Sketching can also be used to ensure privacy when sharing sensitive datasets (Zhou et al., 2009). 88 4.6. Discussion Acknowledgement This work has been conducted using the UK Biobank resource under applications number 13745. Many thanks to Rajen Shah for helpful discussions. 89 5Proofs regarding sketching algorithms Summary This Chapter contains proofs for the results in Chapter 4. We restate the major results before giving proofs. 5.1 Proof of Theorem 4.1 (Worst case bound for partial sketching) Theorem. Suppose that X˜ is an -subspace embedding of X with 0 <  < 0.5. Then the following bound holds, ‖βP − βF ‖22 ≤ 42 σ2min(X) MSSF . Let the singular value decomposition ofX be given byX = UDV T. The singular value decomposition will help to simplify expressions in later working. If the sketching matrix S is an -subspace embedding for the source dataset with 0 <  < 1, then UTSTSU is necessarily invertible. The expression for βP can then be simplified to βP = V D −1(UTSTSU)−1D−1V TXTy = V D−1(UTSTSU)−1D−1V TV DUTy = V D−1(UTSTSU)−1UTy. Similarly, βF can be written as βF = V D −1UTy. The Euclidean norm of the approximation error can thus be expressed as ‖βP − βF ‖2 = ‖V D−1(UTSTSU)−1UTy − V D−1UTy‖2 = ‖{V D−1(UTSTSU)−1 − V D−1}UTy‖2 = ‖{V D−1[(UTSTSU)−1 − Ip]}UTy‖2. The model sum of squares can be written as MSSF = ‖XβF ‖22 = ‖XVD−1UTy‖22 = ‖UDV TV D−1UTy‖22 = ‖UUTy‖22 = ‖UTy‖22. (5.1) 90 5.2. Proof of Theorem 4.2 (Hierarchical model for the Gaussian sketch) The final line uses the fact that UTU = Ip. Using the matrix norm induced by the Euclidean norm and the usual Euclidean norm for vectors we can form an upper bound on the error. ‖βP − βF ‖2 ≤ ‖V D−1 { (UTSTSU)−1 − Ip }‖2‖UTy‖2 ≤ ‖V D−1‖2‖UTy‖2‖(UTSTSU)−1 − Ip‖2 = MSS 1/2 F σmin(X) ‖(UTSTSU)−1 − Ip‖2. (5.2) It remains to upper bound the maximum singular value of the matrix (UTSTSU)−1 − Ip. Let M = UTSTSU . The maximum absolute value of the singular values of (UTSTSU)−1 − Ip will be given by max(|1/σmin(M) − 1|, |1/σmax(M) − 1|), where σmin(M) is the minimum singular value of M , and σmax(M) is the maximum singular value of M . If S is an -subspace embedding for the source covariate matrix X then it must hold that σmin(M) ≥ 1 − , and σmax(M) ≤ 1 +  (Woodruff, 2014, p.11). As such, max(|1/σmin(M)−1|, |1/σmax(M)−1|) ≤ |1/(1− )−1|. It is simple to show that over the interval 0 ≤  ≤ 0.5, |1/(1− )− 1| ≤ 2. This results in an upper bound on the singular value of interest, ‖(UTSTSU)−1 − Ip‖2 ≤ |1/(1− )− 1| ≤ 2. Substituting this back into (5.2) gives that under the condition that  < 0.5 ‖βP − βF ‖2 ≤ MSS 1/2 F σmin(X) × 2. Squaring both sides gives the final result, that if  < 0.5 ‖βP − βF ‖22 ≤ 42 σ2min(X) MSSF . 5.2 Proof of Theorem 4.2 (Hierarchical model for the Gaussian sketch) Theorem. Suppose βS is computed using a Gaussian sketch and k > p+ 1. The conditional distribution of βS is (i) βS |X˜,y,X ∼ N ( βF , nσ2F k ( X˜TX˜ )−1) . The marginal distribution of βS is (ii) βS |y,X ∼ Student ( βF , nσ2F k − p+ 1 ( XTX )−1 , k − p+ 1 ) . 
We use the following lemma about the Normal Inverse-Wishart distribution in many of our results (Gelman et al., 2014, p.73). Suppose that Σ is a random d× d matrix and y is a d-dimensional random vector from the following hierarchical model y|Σ ∼ N (µ,Σ/κ) , Σ ∼ Inv-Wishart(Λ, ν), where Λ is a d× d scale matrix, ν is a scalar giving degrees of freedom, and κ is a scaling constant. Then marginally, y ∼ Student(µ,Λ/(κ(ν − d+ 1)), ν − d+ 1). Theorem 4.2 follows from setting µ = βF , Σ = (X˜ TX˜)−1, κ = k/RSSF , Λ = k(XTX)−1, ν = k and d = p. 91 5. Proofs regarding sketching algorithms 5.3 Variance for partial sketching Using a Gaussian sketch of size k where k > p + 3, the standard partial sketching estimator βP has variance var(βP ) = k2 (k − p)(k − p− 1)(k − p− 3) ( MSSF (X TX)−1 + (k − p+ 1) (k − p− 1)βFβ T F ) . (5.3) The bias corrected partial sketching estimator β∗P has variance var(β∗P ) = (k − p− 1) (k − p)(k − p− 3) ( MSSF (X TX)−1 + (k − p+ 1) (k − p− 1)βFβ T F ) . (5.4) We now prove (5.3) and (5.4). Let the singular value decomposition ofX be given byX = UDV T. The singular value decomposition will help to simplify expressions in later working. The sketched Gram matrix has the form X˜TX˜ = V DUTSTSUDV T. As UTSTSU ∼Wishart(k, Ip/k), the matrix UTSTSU is almost surely invertible. The inverse Gram matrix can then be written as (X˜TX˜)−1 = [DV T]−1(UTSTSU)−1[V D]−1 = V D−1(UTSTSU)−1D−1V T. The expression for βP can then be simplified to βP = V D −1(UTSTSU)−1D−1V TXTy = V D−1(UTSTSU)−1D−1V TV DUTy = V D−1(UTSTSU)−1UTy. Let M = (UTSTSU)−1. We know that M ∼ Inverse-Wishart(k, kIp). Properties of the Inverse-Wishart distribution give that that for i = 1, . . . , p, var(Mii) = 2k2 (k − p− 1)2(k − p− 3) . (5.5) Additionally, for i, j = 1, . . . , p, where j 6= i var(Mij) = k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3) . (5.6) Finally we have that for i, j = 1, . . . , p, i 6= j, cov(Mij ,Mji) = k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3) , (5.7) cov(Mii,Mjj) = 2k2 (k − p)(k − p− 1)2(k − p− 3) . (5.8) All other covariances cov(Mij ,Mbr) are equal to zero unless they reduce to the cases in (5.7) or (5.8). Let z = UTy. Let W = cov ( MUTy ) = cov(Mz). The elements of W can be determined using the properties in equations (5.5) to (5.8). Starting with the diagonal entries, Wii = var  p∑ j=1 Mijzj  = p∑ j=1 z2j var(Mij) + p∑ j=1 p∑ w 6=j zjzwcov(Mij ,Miw). As cov(Mij ,Miw) is equal to zero for all w 6= j this simplifies to Wii = var  p∑ j=1 Mijzj  = p∑ j=1 z2j var(Mij). 92 5.3. Variance for partial sketching It is helpful to split the sum into two pieces, a single term for j = i and then a sum over the remaining indices. Grouping terms leads to an expression involving the model sum of squares MSSF . Wii = z 2 i 2k2 (k − p− 1)2(k − p− 3) + p∑ j=1,j 6=i z2j k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3) = z2i 2k2(k − p) (k − p)(k − p− 1)2(k − p− 3) + p∑ j=1,j 6=i z2j k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3) = z2i 2k2(k − p− 1) + 2k2 (k − p)(k − p− 1)2(k − p− 3) + p∑ j=1,j 6=i z2j k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3) = k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3) p∑ j=1 z2j + k2(k − p− 1) + 2k2 (k − p)(k − p− 1)2(k − p− 3)z 2 i = k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3)MSSF + k2(k − p+ 1) (k − p)(k − p− 1)2(k − p− 3)z 2 i . In the second line the first term is modified to have the same denominator as the remainder sum. In the third line we add and subtract by 2k2 so that the numerator in the first term matches the numerator in the remainder sum. 
This allows the zj terms to be grouped into a sum over the full set of indexes j = 1, . . . , p in the third line. The fourth line uses the fact that ∑p j=1 z 2 j = z Tz = MSSF . This was shown in the proof of Theorem 1 (5.1). For the off diagonal entries Wib where b 6= i, Wib = cov  p∑ j=1 Mijzj , p∑ r=1 Mbrzr  = p∑ j=1 p∑ r=1 zjzr cov(Mij ,Mbr). Now cov(Mij ,Mbr) is only nonzero for cov(Mib,Mbi) and cov(Mii,Mbb). Using (5.7) and (5.8) we obtain Wib = zizb cov(Mib,Mbi) + zizb cov(Mii,Mbb) = k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3)zizb + 2k2 (k − p)(k − p− 1)2(k − p− 3)zizb = k2(k − p+ 1) (k − p)(k − p− 1)2(k − p− 3)zizb. The entire covariance matrix W can therefore be written compactly as W = k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3) (MSSF Ip) + k2(k − p+ 1) (k − p)(k − p− 1)2(k − p− 3)zz T = k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3) ( MSSF Ip + (k − p+ 1) (k − p− 1)zz T ) . = k2 (k − p)(k − p− 1)(k − p− 3) ( MSSF Ip + (k − p+ 1) (k − p− 1)zz T ) . Now βP = V D −1Mz. Therefore var(βP ) = V D−1var(Mz)D−1V T = V D−1WD−1V T. The variance of βP is then a linear function of W , var(βP ) = V D −1WD−1V T = V D−1 k2 (k − p)(k − p− 1)(k − p− 3) ( MSSF Ip + (k − p+ 1) (k − p− 1)zz T ) D−1V T = k2 (k − p)(k − p− 1)(k − p− 3)MSSF (V D −2V T)+ k2(k − p+ 1) (k − p)(k − p− 1)2(k − p− 3)V D −1zzTD−1V T (5.9) 93 5. Proofs regarding sketching algorithms Recall that z = UTy and βF = (X TX)−1XTy (5.10) = V D−1UTy (5.11) = V D−1z. (5.12) The term V D−1z appears in (5.9). Substituting (5.12) into (5.9) gives var(βP ) = k2 (k − p)(k − p− 1)(k − p− 3)MSSF (V D −2V T)+ k2(k − p+ 1) (k − p)(k − p− 1)2(k − p− 3)βFβ T F . A final simplification can be made by noting that (XTX)−1 = V D−2V T giving var(βP ) = k2 (k − p)(k − p− 1)(k − p− 3)MSSF (X TX)−1 + k2(k − p+ 1) (k − p)(k − p− 1)2(k − p− 3)βFβ T F = k2 (k − p)(k − p− 1)(k − p− 3) ( MSSF (X TX)−1 + (k − p+ 1) (k − p− 1)βFβ T F ) . The variance of β∗P = [(k − p− 1)/k]βP is then var(β∗P ) = ( k − p− 1 k )2 k2(k − p− 1) (k − p)(k − p− 1)2(k − p− 3) ( MSSF (X TX)−1 + (k − p+ 1) (k − p− 1)βFβ T F ) = (k − p− 1) (k − p)(k − p− 3) ( MSSF (X TX)−1 + (k − p+ 1) (k − p− 1)βFβ T F ) . (5.13) 5.4 Combined estimator results We first show that β∗P and βS are uncorrelated. We again avoid notation explicitly conditioning on the source dataset [y,X] in every step, which is always treated as fixed. The covariance between β∗P and βS computed from the same sketch can be shown to be zero. Using the definition of covariance, and taking iterated expectations cov(β∗P ,βS) = E [ (β∗P − βF )(βS − βF )T ] = E { E [ (β∗P − βF )(βS − βF )T | X˜ ]} . Recall the hierarchical model for complete sketching, y˜ | X˜ ∼ N ( X˜βF , RSSF k Ik ) . Equivalently, y˜ | X˜ = X˜βF + e˜, where e˜ | X˜ ∼ N(0, RSSF k Ik). So βS | X˜,y,X = βF + (X˜TX˜)−1X˜Te˜. Substituting back into the expression for the covariance, cov(β∗P ,βS) = E { E [ (β∗P − βF )(βF + (X˜TX˜)−1X˜Te˜− βF )T | X˜ ]} = E {[ (β∗P − βF )(βF + (X˜TX˜)−1X˜TE[e˜ | X˜]− βF )T | X˜ ]} = E {[ (β∗P − βF )(βF − βF )T | X˜ ]} = E {[ (β∗P − βF )0T | X˜ ]} = 0p×p. 94 5.5. Proof of Theorem 4.4 (central limit theorem under asymptotic negligibility condition) The final line may be obtained by noting that E[β∗P ] = βF . Simple calculus shows that the value which minimises the expected mean square error ES(‖βC − βF ‖22 | y,X) is αopt = trace(var(β∗P )) trace(var(β∗P )) + trace(var(βS)) . 
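The variance expression (5.4) and the optimal weight $\alpha_{\mathrm{opt}}$ can be checked empirically. The following snippet is an illustrative Monte Carlo check on simulated data (dimensions, seed and the number of replications are arbitrary choices, and the code is ours rather than the thesis implementation): it repeatedly draws Gaussian sketches, compares the empirical covariance of $\beta^*_P$ with (5.4), and reports a Monte Carlo estimate of $\alpha_{\mathrm{opt}}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k, n_rep = 2_000, 5, 100, 1_000          # requires k > p + 3

X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)
XtX_inv = np.linalg.inv(X.T @ X)
Xty = X.T @ y
beta_F = XtX_inv @ Xty
MSS_F = np.sum((X @ beta_F) ** 2)

# Theoretical covariance of the bias-corrected partial estimator, equation (5.4).
V_theory = (k - p - 1) / ((k - p) * (k - p - 3)) * (
    MSS_F * XtX_inv + (k - p + 1) / (k - p - 1) * np.outer(beta_F, beta_F)
)

bP_star, bS = [], []
for _ in range(n_rep):
    S = rng.standard_normal((k, n)) / np.sqrt(k)          # Gaussian sketch
    X_sk, y_sk = S @ X, S @ y
    G_inv = np.linalg.inv(X_sk.T @ X_sk)
    bP_star.append((k - p - 1) / k * (G_inv @ Xty))       # bias-corrected partial estimator
    bS.append(G_inv @ (X_sk.T @ y_sk))                    # complete sketching estimator

var_P = np.cov(np.array(bP_star).T)
var_S = np.cov(np.array(bS).T)
print(f"trace var(beta_P*): Monte Carlo {np.trace(var_P):.3f}, theory {np.trace(V_theory):.3f}")
print(f"Monte Carlo estimate of alpha_opt: {np.trace(var_P) / (np.trace(var_P) + np.trace(var_S)):.3f}")
```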
5.5 Proof of Theorem 4.4 (central limit theorem under asymptotic negligibility condition) A triangular array of random variables is a useful structure for studying weak convergence. To establish a triangular array, define for every n ∈ N a collection of random variables Zn1, Zn2, . . . , Znrn . There are rn random variables in row n of the array. Suppose that rn = n. Visually we can represent the first three rows of the array as Z11 Z21 Z22 Z31 Z32 Z33 Theorem (Billingsley, 1999). For each n ∈ N, let Zn1, Zn2, . . . , Znrn be a sequence of independent random variables with E(Zni) = 0 and var(Zni) = σ2ni for i = 1, . . . , rn. Let s2n = ∑rn i=1 σ 2 ni and assume that rn →∞ as n→∞. Suppose that we can form a sequence of upper bounds (Kn)n∈N such that |Zni| ≤ Kn almost surely for i = 1, . . . , rn. Then if Kn/sn → 0 as n→∞ we have the convergence in distribution 1 sn rn∑ i=1 Zni d→ N(0, 1) Lindeberg’s condition is a critical component in establishing asymptotic normality. We state Linde- berg’s condition for triangular arrays of random variables. Definition 5.1 (Lindeberg’s condition). For each n ∈ N, let Zn1, Zn2, . . . , Znrn be a sequence of random variables with E(Zni) = 0 and var(Zni) = σ2ni for i = 1, . . . , rn. Let s2n = ∑n i=1 σ 2 ni and suppose that rn →∞ as n→∞. The random variables are said to satisfy Lindeberg’s condition if for all η > 0, lim n→∞ 1 s2n rn∑ i=1 E(Z2ni1{|Zni|>ηsn}) = 0. (5.14) The triangular array of random variables does not have to have independent random variables in each row in order to satisfy the condition. The general form of the Lindeberg-Feller central limit theorem shows that a triangular array of independent random variables satisfying Lindeberg’s condition is asymptotically normal after suitable scaling. Theorem 5.1 (Lindeberg-Feller). For each n ∈ N, let Zn1, Zn2, . . . , Znrn be a sequence of random variables with E(Zni) = 0 and var(Zni) = σ2ni for i = 1, . . . , rn. Let s2n = ∑rn i=1 σ 2 ni and suppose that rn → ∞ as n → ∞. Suppose the triangular array of random variables satisfies Lindeberg’s condition (Definition 5.1). Then 1 sn rn∑ i=1 Zni d→ N(0, 1) For a proof see Loeve (1977). It can be difficult to show Lindeberg’s condition directly. A stronger condition that implies the Lindeberg condition is the Lyapunov condition. 95 5. Proofs regarding sketching algorithms Definition 5.2 (Lyapunov’s condition). For each n ∈ N, let Zn1, Zn2, . . . , Znrn be a sequence of random variables with E(Zni) = 0 and var(Zni) = σ2ni for i = 1, . . . , rn. Let s2n = ∑rn i=1 σ 2 ni and suppose that rn →∞ as n→∞. The triangular array of random variables is said to satisfy Lyapunov’s condition if there exists a δ > 0 such that lim n→∞ 1 s2+δn rn∑ i=1 E(|Zni|2+δ) = 0. (5.15) The Lyapunov condition implies the Lindeberg condition. We state this in a Lemma for later reference. Lemma 5.1. The Lyapunov condition implies the Lindeberg condition. To see this assume the Lyapunov condition is satisfied and fix η > 0. Now |Zni| ≥ ηsn implies that 1 ≤ |Zni/(ηsn)|δ. We can then form an upper bound on the sequence of partial sums that appear in Lindeberg’s condition. 1 s2n rn∑ i=1 E(Z2ni1{|Zni|>ηsn}) ≤ 1 s2n rn∑ i=1 E(Z2ni|Zni/(ηsn)|δ1{|Zni|>ηsn}) = 1 s2n rn∑ i=1 E(|Zni|2|Zni/(ηsn)|δ1{|Zni|>ηsn}) = 1 s2n 1 (ηsn)δ rn∑ i=1 E(|Zni|2+δ1{|Zni|>ηsn}) = 1 ηδ 1 s2+δn rn∑ i=1 E(|Zni|2+δ). 
Assuming that Lyapunov’s condition holds we can establish zero as an upper bound lim n→∞ 1 s2n rn∑ i=1 E(Z2ni1{|Zni|>ηsn}) ≤ limn→∞ 1 ηδ 1 s2+δn rn∑ i=1 E(|Zni|2+δ) = 1 ηδ lim n→∞ 1 s2+δn rn∑ i=1 E(|Zni|2+δ) = 0. We also have the lower bound 0 ≤ lim n→∞ 1 s2n rn∑ i=1 E(Z2ni1{|Zni|>ηsn}). By the squeeze theorem we then have that the Lyapunov condition implies the Lindeberg condition. We now present a useful Lemma for showing the Lyapunov condition. The result is from Billingsley (1999) and applies to triangular arrays of uniformly bounded random variables. Lemma 5.2 (Billingsley, 1999). For each n ∈ N, let Zn1, Zn2, . . . , Znrn be a sequence of random variables with E(Zni) = 0 and var(Zni) = σ2ni for i = 1, . . . , rn. Let s2n = ∑rn i=1 σ 2 ni and suppose that rn → ∞ as n→∞. Suppose that we can form a sequence of upper bounds (Kn)n∈N such that |Zni| ≤ Kn almost surely for i = 1, . . . , rn. Then if Kn/sn → 0 as n→∞ the Lyapunov condition holds for the triangular array of random variables. Lemma 5.2 is useful as it does not impose a constant uniform bound on the random variables. In the special case where |Zni| ≤ M almost surely for some constant M for all n ∈ N and all i = 1, . . . , rn we have that Lyapunov’s condition is satisfied providing that sn →∞. Lemma 5.2 allows for the bound Kn to increase with n as long as the rate of growth is slower than the rate of growth of sn. Lyapunov’s condition holds providing that Kn = o(sn). 96 5.6. Proof of Theorem 4.3 (Sketching central limit theorem) The proof of Lemma 5.2 is given below. Again fix some δ > 0. If |Zni| ≤ Kn almost surely for i = 1, . . . , rn it must hold that |Zni|δ ≤ Kδn as |Zni|,Kn and δ are all positive. As such |Zni|2+δ = |Zni|2|Zni|δ ≤ |Zni|2Kδn. We can then form an upper bound on the sequence of partial sums that appear in Lyapunov’s condition. 1 s2+δn rn∑ i=1 E(|Zni|2+δ) ≤ 1 s2+δn rn∑ i=1 E(|Zni|2)Kδn = Kδn s2+δn rn∑ i=1 E|Zni|2. = Kδn s2+δn s2n = ( Kn sn )δ . (5.16) Now assuming that Kn = o(sn) we have that Kn/sn → 0 as n→∞. We then also have that lim n→∞ ( Kn sn )δ = ( lim n→∞ Kn sn )δ = 0, as the exponentiation by δ > 0 is a continuous function. Now taking limits on both sides of the inequality (5.16): lim n→∞ 1 s2+δn rn∑ i=1 E(|Zni|2+δ) ≤ lim n→∞ ( Kn sn )δ (5.17) = 0. (5.18) We also have the lower bound 0 ≤ lim n→∞ 1 s2+δn rn∑ i=1 E(|Zni|2+δ). By the squeeze theorem we then have that Kn = o(sn) is sufficient for Lyapunov’s condition to hold. The triangular array of independent random variables in Theorem 4.4 satisfies Lyapunov’s condition by Lemma 5.2. As the Lyapunov condition implies the Lindeberg condition (Lemma 5.1) the general Lindeberg-Feller central limit theorem (Theorem 5.1) gives asymptotic normality of the scaled row sums, thus proving Theorem 4.4. 5.6 Proof of Theorem 4.3 (Sketching central limit theorem) Assumption 1 Let the singular value decomposition of the n × d source dataset be given by A(n) = U(n)D(n)V T (n). Let u T (n)i give the ith row in U(n). Assume that the maximum leverage score tends to zero, that is lim n→∞ maxi=1,...,n ‖u(n)i‖22 = 0. Theorem 4.3 gives the sketching central limit theorem. Theorem. Consider a fixed sequence of arbitrary n×d data matrices A(n), where d is fixed. Let A(n) = U(n)D(n)V T (n) represent the singular value decomposition of A(n). Let S be a k×n Hadamard or Clarkson- Woodruff sketching matrix where k is also fixed. Suppose that Assumption 1 on the maximum leverage score is satisfied. 
Then as $n$ tends to infinity with $k$ and $d$ fixed,
$$[\tilde{A}V_{(n)}D_{(n)}^{-1} \mid A_{(n)}] \xrightarrow{d} MN(0, I_k, I_d/k).$$

To prove the sketching central limit theorem it helps to restate Lemma 5.2, as this highlights the importance of the leverage scores in establishing asymptotic normality. Lemma 5.2 provided a sufficient condition for Lindeberg's condition to hold. We can restate Lemma 5.2 in terms of a normalised triangular array.

Theorem 5.2 (Billingsley, 1999). For each $n \in \mathbb{N}$, let $Z_{n1}, Z_{n2}, \ldots, Z_{nr_n}$ be a sequence of random variables with $E(Z_{ni}) = 0$ and $\mathrm{var}(Z_{ni}) = \sigma^2_{ni}$ for $i = 1, \ldots, r_n$. Define $s_n^2 = \sum_{i=1}^{r_n} \sigma^2_{ni}$ for each $n$. Suppose that the rows of the triangular array are standardised such that $s_n^2 = 1$ for all $n$, and that $r_n \to \infty$ as $n \to \infty$. Suppose we have a sequence of upper bounds $(K_n)_{n \in \mathbb{N}}$ such that $|Z_{ni}| \le K_n$ almost surely for all $i = 1, \ldots, r_n$. Then a sufficient condition for Lyapunov's condition to hold is $K_n \to 0$ as $n \to \infty$.

The standardisation of the triangular array gives an intuitive condition for Lyapunov's, and hence Lindeberg's, condition to hold: the upper bound $K_n$ must tend to zero as $n \to \infty$, so that every random variable in row $n$ is forced towards zero almost surely. Almost sure convergence is stronger than convergence in probability and rules out pathological cases where a single random variable in a row can take a large value with small probability.

Assumption 1 on the leverage scores in the sketching central limit theorem enforces a bounded growth condition that relates to Theorem 5.2. Let $n \in \mathbb{N}$ index the sequence of source datasets of increasing size. We assume that the source dataset consists of $r_n$ observations, where $r_n \to \infty$ as $n \to \infty$. For now we can take $r_n = n$ to ease interpretation. We take the singular value decomposition of each dataset, $A_{(n)} = U_{(n)}D_{(n)}V_{(n)}^T$. All results in this section treat the source dataset $A_{(n)}$ as fixed; only the sketching matrix is random. We consider the sequence of whitened sketched datasets
$$\tilde{A}V_{(n)}D_{(n)}^{-1} = (SA)V_{(n)}D_{(n)}^{-1} = SU_{(n)}D_{(n)}V_{(n)}^T V_{(n)}D_{(n)}^{-1} = SU_{(n)}.$$
The whitened sketched dataset $\tilde{A}V_{(n)}D_{(n)}^{-1}$ has a $MN(0, I_k, I_d/k)$ distribution when $S$ is a Gaussian sketch. We need to show that as $n$ tends to infinity, $SU_{(n)}$ converges in distribution to a $MN(0, I_k, I_d/k)$ random matrix for both the Clarkson-Woodruff and Hadamard sketches.

Let $u_{(n)i}^T$ denote row $i$ of the matrix of left singular vectors $U_{(n)}$. We write $u_{(n)i}^T$ so that we can form a triangular array of left singular vectors. Taking $r_n = n$, the first three rows of the triangular array can be written as
$$\begin{array}{ccc} u_{(1)1} & & \\ u_{(2)1} & u_{(2)2} & \\ u_{(3)1} & u_{(3)2} & u_{(3)3} \end{array}$$
An important property is that for all $n$, the leverage scores (the squared row norms $\|u_{(n)i}\|_2^2$) always sum to the number of variables in the source dataset, $d$:
$$\sum_{i=1}^{r_n} \|u_{(n)i}\|_2^2 = d. \quad (5.19)$$
As $n$ increases, the typical norm of each vector $u_{(n)i}$, $i \in \{1, \ldots, r_n\}$, is expected to decrease. For completeness we restate Assumption 1 in terms of the triangular array formulation.

Assumption 1. Let the singular value decomposition of the $r_n \times d$ source dataset be given by $A_{(n)} = U_{(n)}D_{(n)}V_{(n)}^T$. Let $u_{(n)i}^T$ give the $i$th row in $U_{(n)}$ for $i = 1, \ldots, r_n$. Assume that the maximum leverage score tends to zero, that is
$$\lim_{n\to\infty} \max_{i=1,\ldots,r_n} \|u_{(n)i}\|_2^2 = 0.$$
This increasing collection of shrinking quantities is similar to the behaviour of the triangular array of random variables in Theorem 5.2.
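The two properties just described, the normalisation (5.19) and the shrinking maximum in Assumption 1, can be seen in a small numerical illustration. The snippet below is ours and uses simulated Gaussian design matrices purely for convenience; a sequence of this type satisfies Assumption 1.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
for n in (100, 1_000, 10_000):
    A = rng.standard_normal((n, d))                    # simulated source dataset A_(n)
    U = np.linalg.svd(A, full_matrices=False)[0]
    lev = np.sum(U**2, axis=1)                         # leverage scores ||u_(n)i||_2^2
    print(f"n = {n:6d}:  sum = {lev.sum():.2f} (equals d),  max = {lev.max():.4f}")
```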
The standardisation property in equation (5.19), namely that ∑rn i=1‖u(n)i‖22 = d for all n is similar to the assumption that sn = 1 in each row of the triangular array of random variables in Theorem 5.2. Assumption 1 on the leverage scores, where the maximum individual norm tends to zero is similar to the assumption that Kn → 0 in Theorem 5.2. This will be made more explicit in the proofs. Before moving on we make a note that assumption 1 also implies that the maximum square root of the leverage scores also tends to zero. As max i=1,...,rn ‖u(n)i‖2 = ( max i=1,...,rn ‖u(n)i‖22 )1/2 (5.20) We have that lim n→∞ maxi=1,...,rn ‖u(n)i‖2 = lim n→∞ ( max i=1,...,rn ‖u(n)i‖22 )1/2 = ( lim n→∞ maxi=1,...,rn ‖u(n)i‖22 )1/2 = 0. (5.21) To establish joint asymptotic normality of the sketched data matrix we use the Crame´r-Wold device. Lemma 5.3 (Crame´r-Wold device). Let z be a real valued random vector of dimension v and let (zn) be a sequence of real valued random vectors of the same fixed dimension v. The sequence of random vectors (zn) converges in distribution to z as n tends to infinity if and only if (λ Tzn) converges in distribution to λTz for all unit vectors λ ∈ Rv. Let zn represent the kd length vector formed by stacking transposed rows of the whitened sketched dataset U˜ = SU(n). Let u˜ T j give row j in U˜ for j = 1, . . . , k. Formally, zn =  u˜1 u˜2 ... u˜k  . (5.22) Let us define the random matrix k × d random matrix W as having the matrix normal distribution W ∼MN(0, Ik, Id/k) Let wTi refer to row i in W for i = 1, . . . , k. Let zL refer to the stacked transposed rows of W , so zL =  w1 w2 ... wk  . (5.23) Let λ be an arbitrary unit vector in Rk×d. It will be useful to also partition the vector λ into k sub-vectors, λ =  λ1 λ2 ... λk  , (5.24) 99 5. Proofs regarding sketching algorithms where λj is a d-dimensional vector for j = 1, . . . , k. For any unit vector λ ∈ Rk×d, λTzL is distributed as N(0, 1/k). We will aim to show that the distribution of the whitened sketched data SA(n)V(n)D −1 (n) converges to that of W through the Crame´r-Wold device. We must show that for any fixed k× d length unit vector λ, λTzn converges in distribution to N(0, 1/k) as n→∞. We will rely on a central limit theorem for jointly symmetric, pairwise independent random variables (Pruss and Szynal, 2000). A collection of random variables (Z1, . . . Zn) is said to be jointly symmetric if (Z1, . . . Zn) has the same distribution as (q1Z1, . . . , qnZn), where qi ∈ {+1,−1} for i = 1, . . . , n. Given a set of random variables Y1, . . . , Yn, a jointly symmetric collection Z1, . . . , Zn can be formed by sampling n independent Rademacher random variables h1, . . . , hn, and setting Zi = hiYi (Pruss and Szynal, 2000). It is possible to establish a central limit theorem for jointly symmetric, pairwise independent random variables. Theorem 5.3 (Pruss and Szynal (2000), Theorem 1, Corollary 2). For each n ∈ N, let Zn1, Zn2, . . . , Znrn be a sequence of jointly symmetric pairwise independent random variables with E(Zni) = 0 and var(Zni) = σ2ni for i = 1, . . . , rn. Let s 2 n = ∑rn i=1 σ 2 ni and assume that rn → ∞ as n → ∞. Suppose the triangular array of random variables satisfies Lindeberg’s condition. Then as n → ∞, s−1n ∑rn i=1 Zni converges in distribution to N(0, 1). Not all triangular arrays with pairwise independent random variables in each row satisfy a central limit theorem. 
The joint symmetry property is very important (Pruss and Szynal, 2000; Janson, 1988). To use Theorem 5.3 we need to show that the triangular array of random variables satisfies Lindeberg’s condition. As discussed this can be very difficult to establish directly. If the triagular array of random variables can be appropriately bounded, we can use Theorem 5.2 to show that Lyapunov’s condition holds, and subsequently that Lindeberg’s condition holds. This is the approach we take in proving the sketching central limit theorem. The Crame´r-Wold device is used to reduce the study of multivariate convergence to univariate convergence. We can then form a triangular array of random variables such that elements in each row are jointly symmetric and pairwise independent. We then show that triangular array satisfies Lindeberg’s condition using Theorem 5.2. Assumption 1 on the maximum leverage score enforces the necessary cap on the rate of growth. Theorem 5.3 is then used to establish asymptotic normality. 5.6.1 Clarkson-Woodruff sketch The Clarkson-Woodruff sketch can be represented as the product of two independent random matrices, S = ΓD, where Γ is a random k × n matrix and D is a random n× n matrix. The diagonal matrix D contains n independent Rademacher random variables on the diagonal. Let hi ∈ {+1,−1} be the random sign in element Dii. The matrix Γ is formed by choosing one element in each column independently and setting the entry to +1. Element Γij is equal to +1 if we add observation i in the original dataset to sketched observation j. The signs in row i are flipped if hi is equal to negative one. Each observation in the original dataset is assigned to one sketched observation as each column of Γ contains a single +1 entry. Using a Clarkson-Woodruff sketch row j in the sketched data matrix can be represented as u˜Tj = n∑ i=1 hiΓiju T (n)i, where hi represents the random sign flip applied to row i of the original data matrix, and Γij is the indicator variable which is equal to one if row i of the original data is added to row j of the sketched dataset. Let us consider the linear combination λTz, where λ and z are defined as in (5.22) and (5.24) respectively. The sum over the k rows in the sketched dataset can be rearranged into a sum over the n 100 5.6.1. Clarkson-Woodruff sketch rows in the source dataset, λTzn = k∑ j=1 λTj u˜j = k∑ j=1 λTj n∑ i=1 hiΓiju(n)i = n∑ i=1 hi k∑ j=1 Γijλ T j u(n)i. (5.25) The scalar λTzn is equal to the sum of n independent random variables. Independence holds as the signs flips hi on each observation are independent, and each column of Γ is independent. In the language of Theorem 5.2 we can form a triangular array of random variables setting Zni = hi k∑ j=1 Γijλ T j u(n)i. (5.26) for i = 1, . . . , n and n ∈ N. The linear combination in (5.25) then be expressed as a row sum over the triangular array defined in (5.26): λTzn = n∑ i=1 Zni. (5.27) Our goal of showing that λTzn converges in distribution to a N(0, 1/k) random variable is achieved if we can show that ∑n i=1 Zni converges in distribution to a N(0, 1/k) random variable. It is worth making a connection to Theorem 5.3, because of the random sign flips hi appearing in (5.26), we have a sequence of mutually independent jointly symmetric random variables. Mutually independent random variables are also necessarily pairwise independent. Theorem 5.3 can be used to establish asymptotic normality of the sum in (5.27) and hence the linear combination λTzn. 
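As an aside, the Clarkson-Woodruff construction described above is straightforward to implement without ever forming $S = \Gamma D$ explicitly. The sketch below is our own illustrative code (the helper name and the simulated data are not from the thesis): each row of the source matrix is assigned to one sketched row uniformly at random after a Rademacher sign flip, giving the $O(nd)$ cost.

```python
import numpy as np

def clarkson_woodruff(A: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Apply the Clarkson-Woodruff projection S = Gamma D to the rows of A.

    Every row of A is added to exactly one of the k sketched rows, chosen
    uniformly at random, after an independent Rademacher sign flip, so the
    sketch is computed in O(nd) time without forming S explicitly.
    """
    n, d = A.shape
    bucket = rng.integers(0, k, size=n)          # sketched row receiving source row i
    sign = rng.choice([-1.0, 1.0], size=n)       # diagonal of D (random sign flips)
    A_sk = np.zeros((k, d))
    np.add.at(A_sk, bucket, sign[:, None] * A)   # accumulate signed rows into buckets
    return A_sk

# Complete sketching of a simulated dataset: sketch [y, X] and refit.
rng = np.random.default_rng(4)
n, p, k = 20_000, 10, 500
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)
A_sk = clarkson_woodruff(np.column_stack([y, X]), k, rng)
y_sk, X_sk = A_sk[:, 0], A_sk[:, 1:]
beta_S = np.linalg.lstsq(X_sk, y_sk, rcond=None)[0]
```

The bucket-and-sign representation used here is exactly the assignment structure exploited in the triangular array argument above.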
To show that the triangular array of random variables defined in (5.26) satisfies Lindeberg’s condition we use Theorem 5.2. Set s2n = ∑n i=1 var(Zni). We first determine s 2 n. We then form the necessary sequence of upper bounds Kn such that |Zni| ≤ Kn almost surely for i = 1, . . . , n. The variance of a single term in the sum (5.25) is var(Zni) = var hi k∑ j=1 Γijλ T j u(n)i  (5.28) = k∑ j=1 1 k λTj u(n)iu T (n)iλj . (5.29) 101 5. Proofs regarding sketching algorithms The row-wise variance totals s2n are then s2n = n∑ i=1 var(Zni) = n∑ i=1 var hi k∑ j=1 Γijλ T j u(n)i  = 1 k n∑ i=1 k∑ j=1 λTj u(n)iu T (n)iλj = 1 k k∑ j=1 λTj ( n∑ i=1 u(n)iu T (n)i ) λj = 1 k k∑ j=1 λTjU T (n)U(n)λj = 1 k k∑ j=1 λTj Idλj . = 1 k k∑ j=1 λTj λj = 1 k . The fact that UT(n)U(n) = Id for all n serves as a useful normalisation to give stable limiting behaviour. The step in the last line follows as we have taken λ to be a unit vector. We have sn = 1/k for all n in the triangular array. We now establish a sequence of upper bounds (Kn). As the random variables in the construction of construction of the sketch are bounded, we can bound the random variables in the triangular array using the leverage scores of the sequence of source dataset. Now as the random sign hi ∈ {+1,−1} |Zni| = |hi k∑ j=1 Γijλ T j u(n)i| = |  k∑ j=1 Γijλ T j u(n)i|. (5.30) Now by the Cauchy-Schwarz inequality |  k∑ j=1 Γijλ T j u(n)i| ≤ ‖ k∑ j=1 Γijλj‖2‖u(n)i‖2 (5.31) Now as Γij = 1 for a single j ∈ {1, . . . , k} and is zero otherwise we have that ‖ k∑ j=1 Γijλj‖2 ≤ max j=1, . . . , k ‖λj‖2 ≤ 1. (5.32) The last line follows as we have taken λ to be a unit vector. Substituting (5.32) and (5.31) into (5.30) we arrive at |Zni| ≤ ‖u(n)i‖2. We can then form the sequence of upper bounds Kn, Kn = max i=1,...,n ‖u(n)i‖2. 102 5.6.2. Hadamard sketch We have that |Zni| ≤ Kn almost surely for i = 1, . . . , n and n ∈ N. Assumption 1 controls the limiting behaviour of Kn = max i=1,...,n ‖u(n)i‖2 (recall equation (5.21)). Taking limits and using Assumption 1 shows that Kn → 0, lim n→∞ Kn = limn→∞ maxi=1,...,n ‖u(n)i‖2 = 0. By theorem 5.2 we have that the triangular array of random variables in (5.26) satisfies Lindeberg’s condition. As such the conditions of Theorem 5.3 are satisfied, giving that λTzn converges in distribution to N(0, 1/k). Finally, the Crame´r-Wold device gives that the whitened sketched dataset has a limiting matrix normal distribution, that is A˜V(n)D −1 (n) converges in distribution to a MN (0k×d, Ik, Id/k) random matrix. 5.6.2 Hadamard sketch Recall that the Hadamard sketch is defined through S = ΦHD/ √ k. Here H is a Hadamard matrix. Hadamard matrices are square matrices with 2n rows for some integer n. To take limits we have to define our sequence of source datasets (A(n) = U(n)D(n)V T (n)) as having rn = 2 n rows for n ∈ N+. In practice when taking a Hadamard sketch we pad the original dataset with zeros if the original number of observations is not a power of two. To rigourously establish asymptotic normality for the Hadamard sketch we have to take rn = 2 n. The first three rows of the triangular array of left singular vectors now looks like u(1)1 u(2)1 u(2)2 u(3)1 u(3)2 u(3)3 u(n)4. The intuition is the same as with the Clarkson-Woodruff sketch, as we move down the rows n we expect the norms u(n)i, i ∈ {1, . . . , 2n} to decrease. This follows from the implicit row-wise normalisation property rn∑ i=1 ‖u(n)i‖22 = d. 
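For concreteness, the construction $S = \Phi H D / \sqrt{k}$ can be implemented as follows. This is an illustrative sketch of our own: the helper name, the use of scipy.linalg.hadamard and the simulated data are our choices, and an explicit Hadamard matrix is formed only to keep the example short; in practice a fast Walsh-Hadamard transform would be used to reach the $O(nd \log k)$ cost.

```python
import numpy as np
from scipy.linalg import hadamard

def hadamard_sketch(A: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Apply the Hadamard sketch S = Phi H D / sqrt(k) to the rows of A.

    The rows of A are zero-padded up to the next power of two, randomly
    sign-flipped (the diagonal matrix D), rotated by the Hadamard matrix H,
    and k rows of the result are subsampled with replacement (Phi).
    """
    n, d = A.shape
    m = 1 << (n - 1).bit_length()              # next power of two >= n
    A_pad = np.zeros((m, d))
    A_pad[:n] = A
    sign = rng.choice([-1.0, 1.0], size=m)     # Rademacher signs on the diagonal of D
    HDA = hadamard(m, dtype=float) @ (sign[:, None] * A_pad)
    rows = rng.integers(0, m, size=k)          # Phi: k rows sampled with replacement
    return HDA[rows] / np.sqrt(k)

rng = np.random.default_rng(5)
A = rng.standard_normal((1_000, 8))            # padded internally to 1024 rows
A_sk = hadamard_sketch(A, k=200, rng=rng)
# In expectation A_sk.T @ A_sk equals A.T @ A, so the sketched Gram matrix
# approximates the full one.
```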
The indexing change to rn = 2 n instead of rn = n has very little impact on the underlying arguments. There are two independent sources of randomness in a Hadamard sketch, the rn = 2 n independent random Rademacher variables in the diagonal matrix D, and the random matrix Φ which subsamples k rows with replacement from the Hadamard matrix H. Hadamard matrices have a number of properties that we will use (Anderson, 1997, section 3.2). • (P1) The first column contains all ones. • (P2) Every column other than the first contains an equal number of +1 and −1 entries. • (P3) Consider any two different columns i and s, where i, s ∈ {2, . . . , rn}, i 6= s. Columns i and s will have +1 together in a quarter of the rows, and −1 together in a quarter of the rows. Furthermore, a quarter of the rows will have +1 in column i and −1 in column s. Similarly, a quarter of the rows will have −1 in column i and +1 in column s. Let M represent the random k × n matrix from the subsampling operation M = ΦH. Let mji refer to the element in row j and column i of M . Each element in M is equal to +1 or −1. Let hi ∈ {+1,−1} be the random sign in element Dii. We now represent the Hadamard sketch as S = MD/ √ k. 103 5. Proofs regarding sketching algorithms The structure of the Hadamard matrix gives the random matrix M some useful properties. Consider an arbitrary row j in M . By (P1) listed above regarding the first column of M , mj1 = 1 with probability one. For the other columns, mji = 1 with probability half, and mji = −1 with probability half for i = 2, . . . , rn by (P2). By (P3) listed above, we have pairwise independence between elements in row j of M , that is p(mji|mjs) = p(mji) for i, s ∈ {1, . . . , rn}, i 6= s. As rows of M are sampled independently, each column of M is pairwise independent. Row j in the sketched dataset is given by u˜Tj = 1√ k rn∑ i=1 mjihiu T (n)i. Let us again consider the linear combination λTzn, where λ and zn are defined as in (5.22) and (5.24) respectively. The sum over the k rows in the sketched dataset can be rearranged into a sum over the rn = 2 n rows in the source dataset, λTzn = k∑ j=1 λTj u˜j = 1√ k k∑ j=1 λTj rn∑ i=1 mjihiu(n)i = 1√ k rn∑ i=1 hi  k∑ j=1 mjiλ T j u(n)i. (5.33) In the language of Theorem 5.2 we can form a triangular array of random variables setting Zni = 1√ k hi  k∑ j=1 mjiλ T j u(n)i. (5.34) for i = 1, . . . , rn and n ∈ N. The linear combination in (5.33) can then be expressed as a row sum of the triangular array defined by (5.34) λTzn = rn∑ i=1 Zni. (5.35) Our goal of showing that λTzn converges in distribution to a N(0, 1/k) random variable is achieved if we can show that ∑n i=1 Zni converges in distribution to a N(0, 1/k) random variable. The sequence of random variables in each row of the triangular array Zn1, . . . , Znrn are not mutually independent over i = 1 . . . , rn. This is because the columns ofM are not mutually independent. However, as the columns of M are pairwise independent, the random sums ∑k j=1mjiλ T j appearing in (5.34) are also pairwise independent. Again making a connection to Theorem 5.3, the independent sign flips hi appearing in (5.34) ensure that the random variables in each row of the triangular array are jointly symmetric and pairwise independent. Theorem 5.3 can be used to establish asymptotic normality of the sum in (5.35) and hence the linear combination λTzn. To show that the triangular array of random variables defined in (5.34) satisfies Lindeberg’s condition we use Theorem 5.2. Set s2n = ∑n i=1 var(Zni). 
We first determine s 2 n. We then form the necessary sequence of upper bounds Kn such that |Zni| ≤ Kn almost surely for i = 1, . . . , n. We start by considering the variance of a single term in the triangular array var(Zni). We have that var(Zni) = 1 k var hi  k∑ j=1 mjiλ T j u(n)i  (5.36) 104 5.6.2. Hadamard sketch It is important to consider the covariance between the elements of the sum over j = 1, . . . , k. For i 6= 1 and j, v ∈ {1, . . . , k}, j 6= v the covariance is zero cov ( himjiλ T j u(n)i, himviλ T vu(n)i ) = E [ h2imjimviλ T j u(n)iλ T vu(n)i ] = E [mjimvi]λTj u(n)iλTvu(n)i = 0. We use (P2) to conclude that E [mjimvi] = 0. Therefore for i = 2, . . . , rn var(Zni) = 1 k var  k∑ j=1 himjiλ T j u(n)i  = 1 k k∑ j=1 var ( himjiλ T j u(n)i ) = 1 k k∑ j=1 λTj u(n)iu T (n)iλj . (5.37) Results are different for i = 1 as the first column of the Hadamard matrix is all ones (P1). For j, v ∈ {1, . . . , k}, j 6= v the covariance is cov ( h1mj1λ T j u(n)1, h1mv1λ T vu(n)1 ) = E [ h21mj1mv1λ T j u(n)1λ T vu(n)1 ] = E [mj1mv1]λTj u(n)1λTvu(n)1 = λTj u(n)1λ T vu(n)1. From (P1) mj1 = mv1 = 1. Now using the Cauchy-Schwarz inequality, |cov (h1mj1λTj u(n)1, h1mv1λTvu(n)1)| = |λTj u(n)1||λTvu(n)1| ≤ ‖λj‖2‖u(n)1‖2‖λv‖2‖u(n)1‖2 ≤ ‖u(n)1‖2‖u(n)1‖2 = ‖u(n)1‖22 The second last last uses the fact that λ is a unit vector and we must have ‖λj‖2 ≤ 1, ‖λj‖2 ≤ 1 for any j, k. From assumption 1, the right hand side of the previous inequality tends to zero as n tends to infinity. As such we conclude that |cov (h1mj1λTj u(n)1, h1mv1λTvu(n)1)| is o(1). Some covariance terms appear in the expression for var(Zn1) var(Zn1) = 1 k var  k∑ j=1 h1mj1λ T j u(n)1  = 1 k k∑ j=1 var ( himjiλ T j u(n)1 ) + 1 k 2 k−1∑ j=1 k∑ v=j+1 cov ( h1mj1λ T j u(n)1, h1mv1λ T vu(n)1 ) = 1 k k∑ j=1 var ( h1mj1λ T j u(n)1 ) + 1 k 2 k−1∑ j=1 k∑ v=j+1 λTj u(n)1λ T vu(n)1 = 1 k k∑ j=1 var ( h1mj1λ T j u(n)1 ) + o(1) = 1 k k∑ j=1 λTj u(n)1u T (n)1λj + o(1) (5.38) 105 5. Proofs regarding sketching algorithms The trailing term can be grouped into an o(1) term as the sketch size k is fixed in our analysis. Using (5.37) and (5.38) we can then determine the row-wise variance totals s2n: s2n = 1 k rn∑ i=1 var(Zni) = 1 k rn∑ i=1 k∑ j=1 λTj u(n)iu T (n)iλj + o(1) = 1 k k∑ j=1 λTj ( rn∑ i=1 u(n)iu T (n)i ) λj + o(1) = 1 k k∑ j=1 λTjU T (n)U(n)λj + o(1) = 1 k k∑ j=1 λTj Idλj + o(1) = 1 k k∑ j=1 λTj λj + o(1) = 1 k + o(1). The step in the last line follows as we have taken λ to be a unit vector. The fact that UT(n)U(n) = Id for all n serves as a useful normalisation to give stable limiting behaviour. We are working with a triangular array where the rows are nearly standardised. Asymptotically in n, s2n → 1/k. We now establish a sequence of upper bounds (Kn). As the random variables in the construction of construction of the Hadamard sketch are bounded, we can bound the random variables in the triangular array (5.34) using the leverage scores of the sequence of source datasets. Now as the random sign hi ∈ {+1,−1} we have that for all i = 1, . . . , rn: |Zni| = 1√ k |hi k∑ j=1 mjiλ T j u(n)i| = 1√ k | k∑ j=1 mjiλ T j u(n)i|. Now using the Cauchy-Schwarz inequality, 1√ k |  k∑ j=1 mjiλ T j u(n)i| ≤ 1√ k ‖  k∑ j=1 mjiλj ‖2‖u(n)i‖2. (5.39) Using the triangle inequality, ‖  k∑ j=1 mjiλj ‖2 ≤ k∑ j=1 ‖mjiλj‖2. (5.40) Now as mji ∈ {+1,−1} for all j = 1, . . . , k, k∑ j=1 ‖mjiλj‖2 = k∑ j=1 ‖λj‖2. (5.41) As λ is a unit vector we can easily form the bound k∑ j=1 ‖λj‖2 ≤ k. (5.42) 106 5.7. 
Proof of Theorem 4.5 (Complete sketching asymptotics) Substituting (5.41) and (5.42) into (5.39) leads to the upper bound for i = 1, . . . , rn: |Zni| ≤ 1√ k k‖u(n)i‖2 = √ k‖u(n)i‖2. (5.43) This is less than ideal as the upper bound is a function of k. In the present analysis the sketch size k is fixed. In future work we would like establish limit theorems letting k and d grow. This is discussed more in Chapter 6. We would like to eliminate the sketch size k from the upper bound (5.43). To do so we establish a tighter bound than in (5.42). It is reasonable to form a tighter bound as we have the restriction that λ is a unit vector, and we have not made full use of that in (5.42). To improve (5.42) we use the following lemma Lemma 5.4. Let a1, . . . , ak be a sequence of positive scalars such that ∑k j=1 aj = 1. Define f(a1, . . . , ak) =∑k j=1 a 1/2 j . Then f s maximised by setting aj = 1/k for j = 1, . . . , k. Furthermore f(1/k, . . . , 1/k) = √ k. The proof is simple using Lagrange multipliers. The quantity in (5.41) can be upper bounded using Lemma 5.4. We set aj = ‖λj‖22 for j = 1, . . . , k. As λ is a unit vector we have that ∑k j=1‖λj‖22 =∑k j=1 aj = 1. We can write ∑k j=1‖λj‖2 = ∑k j=1 a 1/2 j . Take f(a1, . . . , ak) = ∑k j=1 a 1/2 j . We can form an upper bound on ∑k j=1‖λj‖2 by maximising f over the arguments a1, . . . , ak subject to the constraint that ∑k j=1 aj = 1. From Lemma 5.4 we have that k∑ j=1 ‖λj‖2 ≤ √ k. (5.44) We can then form the tighter bounds on the triangular array |Zni| ≤ 1√ k √ k‖u(n)i‖2 (5.45) = ‖u(n)i‖2. (5.46) The upper bound no longer depends on the sketch size k as was desired. Continuing, we can then form the sequence of upper bounds Kn: Kn = max i=1,...,rn ‖u(n)i‖2. We have that |Zni| ≤ Kn almost surely for i = 1, . . . , rn and n ∈ N. Assumption 1 (recall equation (5.21)) gives the limiting behaviour of Kn. lim n→∞ Kn = limn→∞ maxi=1,...,rn ‖u(n)i‖2 = 0. We have that Kn → 0 as n → ∞. As s2n = 1/k + o(1) we have an asymptotically standardised array and we can use Theorem 5.2 to conclude that the triangular array of random variables defined in (5.34) satisfies Lindeberg’s condition. As such the conditions of Theorem 5.3 are satisfied. We conclude that the row sums in (5.35) converge in distribution to N(0, 1/k). Finally, the Crame´r-Wold device gives that the whitened sketched dataset has a limiting matrix normal distribution. That is the sequence of random matrices A˜V(n)D −1 (n) converges in distribution to a MN (0, Ik, Id/k) random matrix. 5.7 Proof of Theorem 4.5 (Complete sketching asymptotics) Assumption 2: lim n→∞n −1  yT(n)y(n) yT(n)X(n) XT(n)y(n) X T (n)X(n)  = Q for some positive-definite matrix Q. 107 5. Proofs regarding sketching algorithms Theorem. Suppose that Assumptions 1 and 2 hold, k ≥ p, and βS is computed using a Hadamard or Clarkson-Woodruff sketch. Let (X˜TX˜)+ denote the Moore-Penrose pseudo-inverse of (X˜TX˜). Let H˜(n) = RSS (n) F k ( X˜TX˜ )+ and H(n) = RSS (n) F k − p+ 1 ( XT(n)X(n) )−1 . Then as n→∞, convergence in distribution holds for (i)[H −1/2 (n) (βS − β(n)F )|A(n)]→ Student (0, Ip, k − p+ 1) , (ii)[H˜ −1/2 (n) (βS − β(n)F ) |A(n)]→ N (0, Ip) . Notation is slightly heavier in the proof compared to the main text for the sake of clarity. Again we do not explicitly condition on the source dataset A(n), the source dataset is always fixed, and the only randomness is from the sketching matrix. The sketched data will be denoted y˜(n) and X˜(n) to denote the dependence on the n× d source dataset. 
So y˜(n) = Sy(n) and X˜(n) = SX(n). The dimension of the sketched dataset does not change. Assumption 2 is of assistance in establishing the limit theorem. Let Q(n) = n −1  yT(n)y(n) yT(n)X(n) XT(n)y(n) X T (n)X(n).  The matrixQ(n) contains the sufficient statistics needed to fit a Gaussian linear model, y T (n)y(n),X T (n)y(n) and XT(n)X(n) given the source dataset A(n) = [y(n),X(n)]. Assumption 2 states the averaged sufficient statistic matrix converges to a limiting matrix Q. It will be helpful to partition the limiting matrix Q as Q = lim n→∞n −1  yT(n)y(n) yT(n)X(n) XT(n)y(n) X T (n)X(n)  =  s mT m G  , (5.47) where s is a scalar, G is a p× p matrix and b is a p-length column vector. The matrix G is the limiting averaged Gram matrix of the predictors. The vector m is the limit of the predictor response inner products n−1XT(n)y(n), and the scalar s is the limit of the mean total sum of squares n −1yT(n)y(n). As mentioned, the assumption of a sequence of source datasets also gives a sequence of optimal least squares coefficients and residual errors. Let σ 2(n) F = RSSF /n. Define the limiting least squares coefficient estimate as β = lim n→∞β (n) F and the limiting residual error as σ 2 = lim n→∞σ 2(n) F . Both β and σ 2 can be expressed as functions of the matrix Q. Specifically, β = G−1m, (5.48) σ2 = s−mTG−1m. (5.49) From Assumption 2, we have that n−1V(n)D2(n)V T (n) → Q. As such we have that n−1/2D(n)V T(n) → Q1/2 From the sketching central limit theorem the whitened sketched data converges to a matrix normal distribution [y˜(n), X˜(n)]V(n)D −1 (n) d→MN (0k×d, Ik, Id/k) The benefit of adding Assumption 2 is that using Slutsky’s theorem we have the additional convergence result n−1/2[y˜(n), X˜(n)] d→MN (0k×d, Ik,Q/k) . To prove results (i) and (ii) we use the continuous mapping theorem (Van Der Vaart, 1998, p. 7) in conjunction with the previous convergence result. It will be helpful to define the random variables y˜L, X˜L as having the above limiting matrix normal distribution [y˜L, X˜L] ∼MN (0k×d, Ik,Q/k) . 108 5.7. Proof of Theorem 4.5 (Complete sketching asymptotics) This is so we can say that n−1/2[y˜(n), X˜(n)] d→ [y˜L, X˜L]. Lemma 5.5 (Continuous Mapping Theorem). Let Zn indicate a sequence of random vectors and Z indicate another random vector. Suppose the function g : Rd → Rm is continuous at every point of a set C such that P (Z ∈ C) = 1. Then if Zn d→ Z then g(Zn) d→ g(Z). In Lemma 5.5, the function g : Rd → Rm does not change with n, and the dimensions d and m are fixed when taking limits. The sketched estimator βS can be defined as a function of the sketched data that is continuous over the set where X˜(n) is of full rank. Formally we could say that βS = g(n−1/2y˜(n), n−1/2X˜(n)). As X˜L is of rank p almost surely, and X˜(n) d→ X˜L we can apply the continuous mapping theorem to determine the limiting distribution of the βS . The random matrix [y˜L, X˜L] can be described using a hierarchical model completely analogous in structure to the hierarchical model established for the Gaussian sketch in section 3 of the main text. Specifically, y˜L | X˜L ∼ N ( X˜β, 1 k σ2Ik ) , X˜L ∼MN ( 0k×p, Ik, 1 k Q ) . From Theorem 2 in the main text, and recalling that the function g outputs βS , we have that g(y˜L, X˜L) ∼ Student ( β, σ2 k − p+ 1G −1, k − p+ 1 ) . As such, for the Hadamard and Clarkson-Woodruff sketches, [βS | y(n),X(n)] d→ Student ( β, σ2 k − p+ 1G −1, k − p+ 1 ) . Let H(n) = σ 2(n) F /(k − p+ 1) ( n−1XT(n)X(n) )−1 . 
Now as n−1XT(n)X(n) → G, σ2(n)F → σ2, and β(n)F → β, Slutsky’s theorem can be used to arrive at (i), H −1/2 (n) (βS − βF ) d→ Student (0, Ip, k − p+ 1) . For result (ii), let us define the function f(n−1/2y˜(n), n−1/2X˜(n)) = [ n ( X˜TX˜ )+]−1/2 ( X˜+y˜ − β ) = [ n ( X˜TX˜ )+]−1/2 (βS − β) . This function transforms the βS so that the output is uncorrelated. This function is also continuous over the set where X˜(n) is of rank p. Again using the fact that X˜L has rank p almost surely, it follows from the continuous mapping theorem that f(n−1/2y˜(n), n−1/2X˜(n)) d→ f(y˜L, X˜L). Result (ii) in Theorem 4.2 also applies to the hierarchical model for y˜L, X˜L, and gives the distribution of the transformed βS under the Gaussian sketch. The distribution of f(y˜L, X˜L) will be f(y˜L, X˜L) ∼ N ( 0, σ2 k Ip ) . As such, for the Clarkson-Woodruff and Hadamard sketches,[ n ( X˜TX˜ )+]−1/2 (βS − β) d→ N ( 0, σ2 k Ip ) . 109 5. Proofs regarding sketching algorithms Now let H˜(n) = nσ 2(n) F /k ( X˜TX˜ )+ Now as σ 2(n) F → σ2, and β(n)F → β, Slutsky’s theorem can be used to arrive at (ii) H˜ −1/2 (n) (βS − β(n)F ) d→ N (0, Ip) . 5.8 Proof of Theorem 4.6 (Partial sketching asymptotics) Theorem. Suppose that Assumptions 1, 2 and 3 hold, k > p+ 3, and β∗P is computed using a Hadamard or Clarkson-Woodruff sketch. Let H(n) = (k − p− 1) (k − p)(k − p− 3) ( MSS (n) F (X T (n)X(n)) −1 + β(n)F β (n)T F ) . Then as n→∞, (i) ES [β ∗ P − β(n)F |A(n)] → 0. (ii) varS ( H −1/2 (n) (β ∗ P − β(n)F ) |A(n) ) → Id Application of the continuous mapping theorem gives that the distribution of βS and β ∗ P under the Hadamard and Clarkson-Woodruff sketches converges to the distribution of the estimators under the Gaussian sketch. This does not necessarily guarantee convergence in moments. To establish a limit theorem for the bias and variance of the estimators, we need a uniform integrability condition on the sketched dataset. The sketched data will be denoted X˜(n) to denote the dependence on the n × p source covariate matrix. So X˜(n) = SX(n). We again do not explicitly condition on the source dataset A(n) = [y(n),X(n)] in the following working. Let G(n) = n −1X˜T(n)X˜(n). From the continuous mapping theorem and Theorem 4.3, it is known that G−1(n) d→W , where W has an Inverse-Wishart(k, kQ−1) distribution and Q is the limiting matrix from assumption 2. We would like to establish convergence in first and second moments, that is E[G−1(n)]→ E[W ], var(G−1(n))→ var(W ). If convergence in first and second moments occurs, then we can show that (i) and (ii) will hold. If E[G−1(n)]→ E[W ], we can say that E[β∗P − β]→ 0, where β is the limiting ordinary least squares estimator (5.48), that is a function of the limiting matrix Q in assumption 2. From here, using that lim n→∞β (n) F , Slutsky’s theorem can be used to arrive at (i) E[β∗P − β(n)F ]→ 0. To show convergence of the variance of the sketched estimator (ii), we define H(n) = (k − p− 1) (k − p)(k − p− 3) ( MSS (n) F ( XT(n)X(n) )−1 + (k − p+ 1) (k − p− 1)β (n) F β (n)T F ) , H = (k − p− 1) (k − p)(k − p− 3) ( (s− σ2)G−1 + (k − p+ 1) (k − p− 1)ββ T ) . 110 5.8. Proof of Theorem 4.6 (Partial sketching asymptotics) Where s, σ2 and and G are functions of the limiting matrix Q, as in (5.47), (5.48) and (5.49). If var(G−1(n))→ var(W ) it follows that varS ( H−1/2(β∗P − β) ) → Id. 
As H(n) converges to H and β (n) F converges to β asymptotically with n, an application of Slutsky’s theorem gives (ii), varS ( H −1/2 (n) (β ∗ P − β(n)F ) ) → Id As such, if we can establish that var(G−1(n)) → var(W ) we have proved (ii). The following theorem describes the necessary conditions for such convergence to occur. Theorem 5.4. (Billinglsley, 1968, Theorem 5.4) Let X1, . . . ,Xn be a sequence of random vectors. Sup- pose Xn converges in distribution to a random variable Z as n tends to infinity. For the additional convergence of moments E[Xn] → E[Z] and var[Xn] → var[Z], it must hold that for all conformable constant vectors λ lim M→∞ lim sup n→∞ |λTXn|21{|λTXn|2≥M} = 0. The above condition can be difficult to verify directly. It can be shown that if asymptotically |λTXn| has a bounded fourth moment, then the integrability condition is satisfied (Van Der Vaart, 1998, section 2.5). A linear combination of the elements of the random matrix G−1(n) can be written an trace(ΛG −1 (n)) for a p × p matrix of constants Λ. It is easier to work with this form rather than stacking the elements of the random matrix to form a random vector. From theorem 5.4, it is sufficient to show that show that the expected value of |trace(ΛG−1(n))|4 is finite for large n to show the desired convergence in moments. As trace(ΛG−1(n)) equals the sum of the singular values of the matrix ΛG −1 (n), we can form an upper bound on the value, trace(ΛG−1(n)) ≤ p‖ΛG−1(n)‖2 ≤ p‖Λ‖2‖G−1(n)‖2 Squaring both sides gives an upper bound on the quantity that must satisfy the uniform integrability condition, |trace(ΛG−1(n))|2 ≤ p2‖Λ‖22‖G−1(n)‖22. Squaring again gives an upper bound on the fourth moment of the linear combination of interest |trace(ΛG−1(n))|4 ≤ p4‖Λ‖42‖G−1(n)‖42 = p4‖Λ‖42 ( 1 σ2min(G(n)) )2 By assumption 3, the expectation of the right hand side is finite. As such, the uniform integrability condition holds and we can conclude that E[G−1(n)]→ E[W ], var(G−1(n))→ var(W ). As discussed at the beginning of the proof this is sufficient to show that (i) and (ii) hold. 111 6On subspace embeddings, Tracy-Widom limits and approximate Bayesian subset selection 6.1 Summary Sketching is a probabilistic data compression technique that uses random projection. Sketching has recently been proposed as a method for approximate Bayesian regression on large datasets. We investigate sketching for approximate Bayesian subset selection. Sketching algorithms offer probabilistic bounds on the discrepancy introduced from using the sketched dataset in place of the full dataset. Existing bounds give limited information on how to choose the size of the compressed dataset. We consider the asymptotic behaviour of various random sketching matrices and establish an important link between the Tracy-Widom distribution and the stochastic error rates of sketching algorithms. The Tracy-Widom distribution can be used to construct asymptotic error bounds describing the information loss from the use of the randomised algorithm. The asymptotic results provide new insights on the comparative performance of different sketching algorithms and will help to give useful guidelines for practitioners. We test the theory and methods on a large genetic dataset. 6.2 Introduction As discussed in Chapter 4 sketching algorithms use random projection to generate a smaller surrogate dataset for efficient approximate computation. 
Recent work has examined the possibility of using sketch- ing for approximate Bayesian inference (Geppert et al., 2017; Bardenet and Maillard, 2015). An apprecia- ble barrier to the adoption of inferential procedures using sketching is the difficulty in constructing tight error bounds on the sketching noise injected into the analysis. We take a new approach and use random matrix asymptotics to assess the suitability of random projections for approximate Bayesian inference. A main finding is that under mild regularity conditions, the stochastic distortion caused by a wide class of random projections is well described by the Tracy-Widom law (Tracy and Widom, 1994). The asymptotic results help to determine appropriate compression ratios for sketched analyses and to compare the operat- ing characteristics of different random projections. We assess the suitability of sketching for approximate Bayesian subset selection using a combination of asymptotic theory and simulation. We assume we have n independent observations from the standard Gaussian linear model. For sim- plicity the error variance σ2 is treated as known. Let y denote a vector of n responses, and let X denote a n×p matrix of covariates. As discussed in Chapter 4, sketching algorithms use random linear mappings to the size of the dataset from n to k observations. The random linear mapping can be represented as a k × n sketching matrix S. Complete sketching generates a k-length sketched response vector y˜ and a k × p matrix of sketched predictors X˜. The sketched data are computed through the linear mappings y˜ = Sy and X˜ = SX. Partial sketching only generates a k × p matrix of sketched covariates X˜. We again use the random mapping X˜ = SX. To simplify the analysis we only consider complete sketching in this chapter. Extensions to partial sketching are covered in the discussion. Working with the sketched 112 6.2. Introduction responses y˜ and sketched covariates X˜ is more computationally efficient than operating on the full dataset of n observations. The use of structured random projections S allows one to quantify the information loss incurred from working with the compressed dataset. As discussed in section 4.2.1 of Chapter 4, sketching algorithms are typically motivated from objects known as -subspace embeddings (Woodruff, 2014; Meng and Mahoney, 2013; Yang et al., 2015a). Recall the formal definition of an -subspace embedding. Definition 6.1. -subspace embedding. For a given n× p matrix A, we call a k × n matrix S an -subspace embedding for A, if for all vectors z ∈ Rp (1− )||Az||22 ≤ ||SAz||22 ≤ (1 + )||Az||22. An -subspace preserves the linear structure of the original dataset up to a multiplicative (1 ± ) factor. Broadly speaking, the covariance matrix of the sketched dataset is similar to the covariance matrix of the source dataset if  is small. Mathematical arguments show that the sketched dataset is a good surrogate for many linear statistical methods if the sketching matrix S is an -subspace embedding for the original dataset, with  sufficiently small (Woodruff, 2014). Epsilon-subspace embeddings are central in our analysis for developing posterior approximations and for constructing probabilistic bounds on the compression loss. Suitable ranges for  depend on the task of interest and structural properties of the source dataset. Sketching has been investigated for posterior approximation in the fixed model setting. The likelihood is p(y|X, β, σ2) = N(Xβ, σ2In), where β is a p-dimensional vector of coefficients. 
Given a prior distri- bution p(β), the target posterior distribution of interest is p(β|y,X, σ2) ∝ p(y|X,β, σ2)p(β). For large n, it may be computationally expensive to sample from the posterior distribution using Markov-Chain Monte Carlo, or to determine the analytic form if a conjugate prior is used. Geppert et al. (2017) consider using the sketched dataset [y˜, X˜] to construct an approximate posterior distribution on the coefficients β. Given an -subspace embedding it is possible to establish bounds on the difference between the sketched posterior and the target posterior as a function of  (Geppert et al., 2017). We consider approximate Bayesian model selection, also motivated by properties of -subspace embeddings. To assess the confidence in the approximate posterior it is necessary to obtain the probability that the random projection S is an -subspace embedding for the source dataset. The embedding probability Pr(S is an -subspace embedding for A), (6.1) is a critical feature of many sketching algorithms that is difficult to characterise precisely using existing theory (Venkatasubramanian and Wang, 2011). Most existing results on the embedding probability are finite sample lower bounds, that can potentially contain hidden constants. This makes it difficult to design methods for reporting the level of uncertainty associated with the sketched approximation. We take a new approach and develop asymptotic expressions for the embedding probability (6.1). We find that the Tracy-Widom law can be used to describe the probability of obtaining an -subspace embedding for many data oblivious sketches in the literature. The Tracy-Widom law has many applications in high-dimensional statistics (Johnstone, 2006; Bai and Silverstein, 2010), however there is little work to our knowledge foregrounding the importance of the law for statistical computation. The connection to sketching algorithms is novel and hints at deeper principles that may apply to randomised algorithms that use data oblivious projections. We test the accuracy of the asymptotic results on a large genetic dataset and find that the Tracy- Widom law gives a very good approximation for the embedding probability (6.1). We also analyse a real genetic dataset to provide guidelines on the necessary  required for approximate Bayesian model selection. It appears that the noise introduced by the sketch can overwhelm the posterior distribution in situations where there is model uncertainty. False positives are also another systemic issue with sketched model 113 6. Sketching for posterior approximation selection. Although single pass sketching does not appear to be a viable route for posterior approximation, we believe the asymptotic analysis presented here will nevertheless be useful in the design and analysis of general sketching algorithms. 6.3 Bayesian model selection We assume that the model is unknown, and wish to incorporate model uncertainty into the Bayesian analysis. The p-length binary vector γ is used to index different models. Element j in γ is equal to one if variable j is included in the model, and zero otherwise. Let Xγ denote the design matrix for a particular model. The matrix Xγ is a submatrix of X where column j is included if γj = 1. Let pγ denote the number of predictors in model γ, pγ is equal to the sum of the elements of γ. The likelihood is defined as p(y|X,βγ , σ2,γ) = N(Xγβγ , σ2In), where βγ is a pγ-dimensional vector of coefficients for model γ. 
We will use Zellner’s g-prior, p(βγ |γ, σ2) = N(0, gσ2(XTγXγ)−1), where g is a hyper-parameter controlling the prior variance over the coefficients. As stated in the intro- duction we treat σ2 as known. Let the residual sum of squares for a particular model γ be denoted RSSγF = y Ty − yTXγ(XTγXγ)−1XTγ y. The marginal likelihood is a function of the residual sum of squares RSSγF and the total sum of squares yTy. The expression is p(y|γ, g, σ2) = (1 + g)−pγ/2 exp [ − 1 2σ2 ( g g + 1 RSSγF + 1 g + 1 yTy )] . (6.2) We assume we use Markov chain Monte Carlo (MCMC) to sample from the posterior distribution over models, for example the evolutionary stochastic search algorithm in Bottolo and Richardson (2010). For tall datasets it is likely to be advantageous to precompute the sufficient statistics XTX,XTy,yTy. Computing XTX is O(np2), computing XTy is O(np) and computing yTy is O(n). This is a one-off set up cost of O(np2) operations, dominated by computation of the Gram matrix of predictors XTX. At each step in the MCMC algorithm it is only necessary to compute the model sum of squares yTXγ(X T γXγ) −1XTγ y for each new model γ visited. This can be obtained in O(p 3 γ) operations. The first step is to compute (XTγXγ) −1XTγ y by solving the linear system XTγXγb = X T γ y (6.3) for b. The matrix XTγXγ is obtained by taking the appropriate subset from the full Gram matrix of the predictors XTX. Likewise, the marginal associations XTγ y are read from the summary statistic X Ty. Solving the linear system (6.3) takes O(p3γ) operations. Given the solution b̂ = (XTγXγ) −1XTγ y, computation of the product y TXγ [(X T γXγ) −1XTγ y] can be done in O(pγ) time, leading to an overall O(p 3 γ) cost for computing the integrated likelihood. The pay-off from the initial investment in computing the sufficient statistics is that the subsequent cost of each MCMC step is then independent of n. For a very tall dataset the O(np2) set up cost of computing XTX may be undesirable. To reduce the initial computational expense, we can approximate the sufficient statistics using the sketched dataset. Given a sketched dataset of k observations [y˜, X˜], the sketched sufficient statistics X˜TX˜, X˜Ty˜, y˜Ty˜ can be computed in O(kp2) time. The approximate posterior distribution over models is defined using the sketched sufficient statistics. The approximate posterior can be explored using the same MCMC algorithm that would be applied to the full dataset sufficient statistics. Use of the sketched sufficient statistics is formally motivated using -subspace embeddings in the next section. 114 6.4. Approximate Bayesian inference 6.4 Approximate Bayesian inference Geppert et al. (2017) examine sketching for approximate Bayesian regression in the fixed model setting, using properties of -subspace embeddings to motivate their approach. If A˜ = [y˜, X˜] is an -subspace embedding of A = [y,X], it must hold that for all β ∈ Rp, (1− )||y −Xβ||22 ≤ ||y˜ − X˜β||22 ≤ (1 + )||y −Xβ||22. (6.4) Let p(β) denote the prior distribution and p(y|β,X, σ2) denote the likelihood, so p(y|β,X, σ2) = 1 (2piσ2)n/2 exp ( − 1 2σ2 ||y −Xβ||22 ) . The target posterior distribution is p(β|y,X) ∝ p(y|β,X, σ2)p(β). If n is very large it may be compu- tationally infeasible to compute or sample from the target posterior distribution. Geppert et al. 
propose to form an approximate likelihood function p˜(y|β,X, σ2) by substituting the sketched squared residuals ||y˜ − X˜β||22 for the true squared residuals ‖y −Xβ‖22, p˜(y|β,X, σ2) = 1 (2piσ2)n/2 exp ( − 1 2σ2 ||y˜ − X˜β||22 ) . If  is small, the bounds (6.4) imply that p˜(y|β,X, σ2) ≈ p(y|β,X, σ2). Using the sketched likelihood we can then define a sketched posterior p˜(β|X,y, σ2) ∝ p(β)p˜(y|X,β, σ2). As → 0, the approximate likelihood p˜(y|X,β, σ2) approaches the true likelihood p(y|β,X, σ2), and the sketched approximate posterior approaches the target posterior distribution. More formally, as → 0, D(p(β|y,X, σ2) ‖ p˜(β|y,X, σ2))→ 0. where D(p(β|β,X, σ2) ‖ p˜(β|y,X, σ2)) denotes the Kullback-Leibler divergence of the sketched posterior from the target posterior. Geppert et al. establish a bound on the Wasserstein distance between the sketched posterior and the target posterior for finite . We can use a similar argument to construct an approximate posterior distribution over models. Let RSSγS denote the residual sum of squares using the sketched dataset for a particular model γ, RSSγS = min β∈Rpγ ‖y˜ − X˜γβ‖22 = y˜Ty˜ − y˜TX˜γ(X˜Tγ X˜γ)−1X˜Tγ y˜. Lemma 6.1 gives an important result that motivates our approach. Lemma 6.1. Suppose A˜ = [y˜, X˜] is an -subspace embedding of A = [y,X]. Then for all models γ (1− )RSSγF ≤ RSSγS ≤ (1 + )RSSγF . Proof: Let βγF denote the least squares coefficients for model γ using the full dataset, where omitted covariates have their respective coefficients set to zero. The vector βγF is p dimensional, with pγ nonzero elements. Similarly, let βγS denote the least squares coefficients for model γ using the sketched dataset, where again omitted covariates have their respective coefficients set to zero. The vector βγS is also p dimensional, with pγ nonzero elements. It holds that RSS γ F = ‖y −XβγF ‖22 and RSSγS = ‖y˜ − X˜βγS‖22. To establish the upper bound in Lemma 6.1 we use the upper bound in (6.4), ‖y˜ − X˜βγS ||22 ≤ ‖y˜ − X˜βγF ||22 ≤ (1 + )||y −XβγF ||22 = (1 + )RSSγF . 115 6. Sketching for posterior approximation The first line follows from the optimality of βγS on the sketched dataset. To establish the lower bound in Lemma 6.1 we use the lower bound in (6.4), ||y˜ − X˜βγS ||22 ≥ (1− )‖y −XβγS ||22 ≥ (1− )‖y −XβγF ||22 = (1− )RSSγF . The second line follows from the optimality of βγF on the full dataset, completing the proof. We can define an approximate integrated likelihood p˜(y|γ, g, σ2) in terms of the residual sum of squares on the sketched dataset. We substitute RSSγS in place of RSS γ F in the target integrated likelihood (6.2), obtaining p˜(y|γ, g, σ2) = (1 + g)−pγ/2 exp [ − 1 2σ2 ( g g + 1 RSSγS + 1 g + 1 yTy )] . (6.5) If [y˜, X˜] is an -subspace embedding of [y,X], the bounds on the residual sum of squares (1− )RSSγF ≤ RSSγS ≤ (1 + )RSSγF must hold. If  is small, the approximate integrated likelihood p˜(y|γ, g, σ2) should be a good approximation of the target integrated likelihood p(y|γ, g, σ2). The sketched posterior approximation over models is then defined as p˜(γ|y, g, σ2) ∝ p(γ)p˜(y|γ, g, σ2). As → 0, p˜(y|γ, g, σ2) approaches p(y|γ, g, σ2), and the sketched approximate posterior approaches the target posterior distribution. More formally, as → 0, D(p(γ|y, g, σ2) ‖ p˜(γ|y, g, σ2))→ 0, where D(p(γ|y, g, σ2) ‖ p˜(γ|y, g, σ2)) denotes the Kullback-Leibler divergence of the sketched posterior from the target posterior. 
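To make the computational recipe concrete, the snippet below evaluates the log integrated likelihood from precomputed sufficient statistics, for the full dataset as in (6.2) and for a sketched dataset as in (6.5). The code is an illustrative sketch of ours: the function name, the use of a Gaussian sketch, and all dimensions are arbitrary choices, and the sketched total sum of squares $\tilde{y}^T\tilde{y}$ is used in place of $y^Ty$, a term that is constant across models and so does not affect model comparison.

```python
import numpy as np

def log_integrated_likelihood(gamma, XtX, Xty, yty, g, sigma2):
    """Log integrated likelihood under the g-prior, computed from sufficient
    statistics; corresponds to (6.2) for the full data and (6.5) when the
    sketched sufficient statistics are supplied."""
    idx = np.flatnonzero(gamma)
    if idx.size == 0:
        rss = yty
    else:
        b = np.linalg.solve(XtX[np.ix_(idx, idx)], Xty[idx])   # solve (6.3), O(p_gamma^3)
        rss = yty - Xty[idx] @ b                               # residual sum of squares
    return -0.5 * idx.size * np.log(1 + g) - (g / (g + 1) * rss + yty / (g + 1)) / (2 * sigma2)

rng = np.random.default_rng(6)
n, p, k, g, sigma2 = 10_000, 15, 500, 1_000.0, 1.0
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([0.5, -0.5, 0.25]) + rng.standard_normal(n)
gamma = np.zeros(p)
gamma[:3] = 1                                                  # a candidate model

# One-off sufficient statistics: O(np^2) for the full data, O(kp^2) after sketching.
S = rng.standard_normal((k, n)) / np.sqrt(k)                   # Gaussian sketch for illustration
y_sk, X_sk = S @ y, S @ X
full = (X.T @ X, X.T @ y, y @ y)
sketched = (X_sk.T @ X_sk, X_sk.T @ y_sk, y_sk @ y_sk)

print("full    :", log_integrated_likelihood(gamma, *full, g, sigma2))
print("sketched:", log_integrated_likelihood(gamma, *sketched, g, sigma2))
```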
A Markov chain Monte Carlo algorithm targeting the sketched posterior requires a set up cost of O(kp2) operations, after which each MCMC step to a new model has O(p3γ) cost. If we can reliably generate -subspace embeddings with small  we can arrive at a posterior approximation with a smaller overall computational cost than through exact computation. To form concrete probabilistic error bounds we need to determine the probability that the random sketching matrix S is an -subspace embedding for the source dataset [y,X]. The success probability (6.1) determines the confidence we have in the randomised algorithm for approximate regression. We develop useful guidelines for practitioners in section 4. Secondly, we also need to know what value of  is needed to obtain a tolerable posterior approximation. Geppert et al. (2017) aim for  ∈ (0.1, 0.2) for posterior approximation in the fixed model setting. We examine what is an appropriate  for posterior approximation through simulation. Our results suggest that  needs to be much smaller than 0.1 for approximating the posterior distribution over models. We now turn to the first question of interest, determining the embedding probability. 6.5 Embedding probabilities 6.5.1 Previous work We first review some key theoretical results used to support sketching algorithms from the computer science literature. Let A denote an arbitrary n×d data matrix that we wish to compress. Data oblivious random projections are distributions over sketching matrices S ∈ Rk×n that are not a function of the source dataset A ∈ Rn×d. At a high-level, data oblivious random projections offer guarantees on the success probability Pr(S is an -subspace embedding for A), 116 6.5.1. Previous work Sketch Sketching time Required sketch size k Gaussian O(ndk) O((d+ log(1/δ))/2) Hadamard O(nd log k) O(( √ d+ √ log n)2(log(d/δ))/2) Clarkson-Woodruff O(nd) O(d2/δ2) Table 6.1: Properties of different data oblivious random projections (Woodruff, 2014). The third column refers to the necessary sketch size k to obtain an -subspace embedding for an arbitrary n × d source dataset with at least probability (1− δ). for an arbitrary n× d source dataset A. The guarantees are typically finite sample lower bounds on the embedding probability. The bounds concern the required sketch size k needed to obtain an -subspace embedding with probability at least 1− δ. The lower bounds for the Gaussian, Hadamard and Clarkson- Woodruff sketches are listed in Table 6.1. The results are expressed in Big-O notation, the formulae for the required sketch size k are not fully explicit. For the Gaussian sketch, the notation O(( √ d + √ log n)2(log(d/δ))/2) indicates the existence of constants c1 and c2 that are independent of n, such that if k is chosen within the bounds c1(d+ log(1/δ))/ 2) ≤ k ≤ c2(d+ log(1/δ))/2), then the probability of obtaining an -subspace embedding is at least (1 − δ). The hidden constants in Table 6.1 are of little concern from a computer science perspective as the algorithmic significance is that the required sketch size k is independent of n for the Gaussian and Clarkson-Woodruff sketches, and very weakly dependent on n for the Hadamard projection. This implies that size of the sketch does not necessarily have to increase with n to obtain a fixed error guarantee. This is a very desirable property for a data compression algorithm. 
Many proofs of the results in Table 6.1 use the following important lemma that gives a necessary and sufficient condition to obtain an -subspace embedding (Woodruff, 2014, Chapter 2). Lemma 6.2. For a given matrix n×d matrix A of rank d, let U be an orthonormal basis for the columns of A. Let S be some k×n sketching matrix. The matrix S an -subspace embedding for A, if and only if σmax(Id −UTSTSU) ≤ , where σmax(Id −UTSTSU) denotes the maximum singular value of the d× d matrix Id −UTSTSU . Proof: From Definition 6.1, it is necessary to preserve norms in the column space of A up to (1 ± ) factor, (1− )||Az||22 ≤ ||SAz||22 ≤ (1 + )||Az||22. It is therefore sufficient to consider an orthonormal basis for the column space of A, as{ Az : z ∈ Rd} = {Uv : v ∈ Rd} . We can limit attention to unit vectors v, as if (1− )||Uv||22 ≤ ||SUv||22 ≤ (1 + )||Uv||22, holds for all unit vectors v, it holds for all vectors v′ ∈ Rd by scaling. It is therefore sufficient to show that for all unit vectors v, (1− ) ≤ ‖SUv‖22 ≤ (1 + ). 117 6. Sketching for posterior approximation Let λmin(U TSTSU) and λmax(U TSTSU) denote the minimum and maximum eigenvalues of the matrix UTSTSU respectively. The extreme eigenvalues are the solutions to the optimisation problems over unit vectors v, λmin(U TSTSU) = min v ‖SUv‖22, λmax(UTSTSU) = max v ‖SUv‖22. The extreme eigenvalues give upper and lower bounds on the distortion in norms caused by S. As we need (1 − ) ≤ ‖SUv‖22 ≤ (1 + ), the matrix S is then an -subspace embedding if and only if |1−λmin(UTSTSU)| ≤  and |1−λmax(UTSTSU)| ≤ . The singular values of the matrix Id−UTSTSU are given by the absolute values of the eigenvalues of Id−UTSTSU . The maximal singular value is then σmax(Id −UTSTSU) = max(|1− λmin(UTSTSU)|, |1− λmax(UTSTSU)|) (6.6) It follows that σmax(Id−UTSTSU) ≤  is a necessary and sufficient condition for |1−λmin(UTSTSU)| ≤  and |1− λmax(UTSTSU)| ≤  to hold jointly. As such the matrix S is an -subspace embedding if and only if σmax(Id −UTSTSU) ≤  As discussed in Woodruff (2014, Chapter 2), the bounds in Table 6.1 can be obtained by giving up- per bounds on the failure probability 1− Pr(S is an -subspace embedding for A) = Pr(σmax(Id −UTSTSU) > ). Let ‖A‖F denote the Frobenius norm of a n× d matrix A, ‖A‖F = √√√√ n∑ i=1 d∑ j=1 A2ij . The proof for the Clarkson-Woodruff sketch uses Markov’s inequality and the fact that σ2max(A) ≤ ‖A‖2F (Nelson and Nguyeˆn, 2013; Meng, 2014). The key argument is, Pr(σmax(Id −UTSTSU) > ) = Pr(σ2max(Id −UTSTSU) > 2) ≤ −2E (σ2max(Id −UTSTSU)]) ≤ −2E (‖Id −UTSTSU‖2F ) . The expected squared Frobenius norm can be upper bounded using properties of the Clarkson-Woodruff sketch and the fact that UTU = Id (Nelson and Nguyeˆn, 2013; Meng, 2014). The proof for the Hadamard sketch uses a matrix Chernoff bound and Boole’s inequality to upper bound Pr(σmax(Id −UTSTSU) > ) (Tropp, 2011; Meng, 2014). The matrix Chernoff bound is used to bound the extreme eigenvalues of a random matrix. Boole’s inequality upper bounds the probability of a union of events, Pr ( ⋃m i=1Bi) ≤ ∑m i=1 Pr(Bi) for any countable set of events B1, . . . , Bm. The argument is more technical and not reviewed here. See Tropp (2011) for an insightful proof. The different functional forms of the bounds in Table 6.1 are due to the different methods of proof, and the different statistical properties of the Gaussian, Hadamard and the Clarkson-Woodruff sketches. 
The use of chained bounds makes it difficult to keep track of the necessary constant factors to obtain a tight bound on the embedding probability. Finite sample, worst case bounds can be pessimistic in regards to the performance of the randomised algorithm (Halko et al., 2011; Mahoney and Drineas, 2016). We propose to use asymptotic random matrix theory to characterise the distribution of the critical term σmax(Id − UTSTSU). The singular values of large random matrices is a well studied topic in random matrix theory, and we can draw from existing results to analyse the behaviour of sketching algorithms (Bai et al., 2014). Given the asymptotic distribution of σmax(Id−UTSTSU) we can give the asymptotic probability of obtaining an -subspace embedding rather than forming a lower bound. We first analyse the Gaussian sketch. 118 6.5.2. Gaussian sketch 6.5.2 Gaussian sketch From Lemma 6.2, the embedding probability is a function of the maximum singular value Pr(S is an -subspace embedding for A) = Pr(σmax(Id −UTSTSU) ≤ ), As noted in Meng (2014, Section 2.3), when using a Gaussian sketch it is instructive to consider directly the distribution of the random variable σmax(Id−UTSTSU). Lemma 6.2 helps to show why the Gaussian projection is useful as a data oblivious sketch. Consider an arbitrary n×d data matrixA. Let the singular value decomposition ofA be given byA = UDV T. The success probability of interest is can be expressed as Pr(S is an -subspace embedding for A) = Pr(σmax(Id −UTSTSU) ≤ ). Note that on the right hand side the sketching matrix only enters through the term UTSTSU . As S is a matrix of independent Gaussians with mean zero and variance 1/k, it is possible to show that UTSTSU ∼Wishart (k,UTU/k) = Wishart (k, Id/k) , where the second line follows as U is an orthonormal matrix. The key term UTSTSU is in some sense a pivotal quantity, as its distribution is invariant to the actual values of the data matrix A. When using a Gaussian sketch, the probability of obtaining an -subspace embedding has no dependence on the number of original observations n, or on the values in the data matrix A. This is a useful property for a data oblivious sketch, as it is possible to develop universal performance guarantees that will hold for any possible source dataset. This invariance property is also noted in Meng (2014), although the derivation is different. Let us define the random matrix W ∼Wishart(k, Id/k). The success probability of interest can then be expressed in terms of the Wishart distribution, Pr(S is an -subspace embedding for A) = Pr ( σmax(Id −UTSTSU) ≤  ) = Pr(σmax (Id −W ) ≤ ). (6.7) Now as σmax(Id−W ) = max(|1−λmin(W )|, |1−λmax(W )|) the embedding probability can be expressed in terms of the extreme eigenvalues of the Wishart distribution. The embedding probability of interest has the representation Pr(S is an -subspace embedding for A) = Pr(σmax(Id −W ) ≤ ) = Pr (|1− λmin(W )| ≤ , |1− λmax(W )| ≤ ) , (6.8) where W ∼ Wishart(k, Id/k). It is difficult to obtain a closed form expression for the embedding probability as it involves the joint distribution of the extreme eigenvalues. Meng forms a lower bound on this probability using concentration results on the eigenvalues of the Wishart distribution. The lower bound is reported here in Lemma 6.3. Lemma 6.3. (Meng, 2014, Lemma 11) Suppose we have an arbitrary n× d data matrix A where n > d and A is of rank d. Suppose we take a Gaussian projection with k ≥ 6(√d+√2 log(2/δ))2/2 then Pr(S is an -subspace embedding for A) ≥ 1− δ. 
For a proof see (Meng, 2014, Chapter 2, Section 2.5). The bound can be inverted to give a lower bound on the success probability as a function of . Doing so gives Pr(S is an -subspace embedding for A) ≥ 1− exp [ log(2)− ( √ k2/6−√d)√ 2 ] . (6.9) 119 6. Sketching for posterior approximation We can investigate the tightness of the bound (6.9) through simulation. From equation (6.7), for an arbitrary n× d data matrix A, Pr(S is an -subspace embedding for A) = Pr(σmax (Id −W ) < ), whereW ∼Wishart(k, Id/k). To estimate the embedding probability for the Gaussian sketch we can sim- ulateW ∼Wishart(k, Id/k) and look at the empirical distribution of the random variable σmax (Id −W ). We first generated B = 10000 random Wishart matrices W [1], . . . ,W [B]. For each simulated matrix W [b] we computed the distortion factor [b]: [b] = σmax(Id −W [b]), (6.10) for b = 1, . . . , B. The simulated distortion factors [1], . . . , [1B] can be used to give a Monte Carlo estimate of the embedding probability: P̂r(S is an -subspace embedding for A) = P̂r(σmax(Id −W ) ≤ ) = 1 B B∑ b=1 1([b] ≤ ). We used the ARPACK library Lehoucq et al. (1998) to compute the maximum singular value of the high- dimensional matrices. The estimated embedding probabilities are displayed in Figure 6.1 for different dimensions d. The sketch size was kept at k = 20 × d. The red line shows the empirical probability of obtaining an -subspace embedding. The dot-dash line shows the lower bound of Meng (2014), given in equation (6.9). The solid vertical line gives an asymptotic limiting value that will be discussed later in section 6.7.1. Comparing the empirical embedding probabilities to the lower bound, we see that the lower bound is quite loose. The lower bound is zero in each plot at points where the empirical cdf is close to one. To obtain a tighter prediction on the embedding probability we consider the asymptotic distribution of the eigenvalues of a random Wishart matrix. This is a well studied area of random matrix theory (Edelman, 1988). Asymptotic analysis is adopted in order to obtain point estimate of the success probability rather than a worst case bound. We develop two asymptotic expressions for the embedding probability. We introduce the random matrix asymptotics by first considering the pointwise limit of the extreme eigenvalues. The pointwise asymptotic result suggests a sharp phase change in the success probability of the algorithm. We then develop a more accurate approximation for the embedding probability by making a connection to the Tracy-Widom distribution. 6.6 Asymptotics Our asymptotic arguments in section 6.7 concern the convergence of probability measures. Billingsley (1999) is an authoritative reference on the topic. We now recap some useful foundational theory, as is presented in Van Der Vaart (1998, Chapter 2). Let (Zn)n∈N denote a sequence of real v-dimensional random vectors with cumulative distribution functions (Fn)n∈N. Let Z denote a real v-dimensional random vector with cumulative distribution function F . The sequence of random vectors (Zn) converges in distribution to Z if lim n→∞Fn(z) = F (z), at every point z ∈ Rv where the limit distribution F is continuous. Convergence in distribution is denoted as Zn d→ Z. Convergence in distribution is often referred to as weak convergence. Let ‖.‖2 denote the Euclidean norm. A sequence of random vectors (Zn)n∈N, is said to converge in probability to a random vector Z if for all δ > 0: lim n→∞Pr(‖Zn −Z‖2 > δ) = 0. 120 6.6. 
Asymptotics 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 (a) d=10, k=20*d Epsilon Em be dd in g pr ob ab ilit y 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 (b) d=20, k=20*d Epsilon Em be dd in g pr ob ab ilit y 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 (c) d=100, k=20*d Epsilon Em be dd in g pr ob ab ilit y 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 (d) d=1000, k=20*d Epsilon Em be dd in g pr ob ab ilit y Empirical Lower bound Pointwise limit Figure 6.1: Comparison of simulated embedding probabilities against theoretical results at different k and d. The sketch size to variables ratio is kept constant at twenty. We want the sketching matrix S to be an -subspace embedding for the source dataset with small  with high probability. The y-axis gives the proportion of times we attain an -subspace embedding. The x-axis gives the distortion factor  (recall Definition 6.1). The dot-dash line gives a finite sample lower bound. The vertical line gives the asymptotic limiting distribution as d increases. As d increases the empirical cdf concentrates around the step-function given by pointwise asymptotic theory (Theorem 6.5). Convergence in probability is denoted Zn p→ Z. Convergence in probability requires that (Zn) and Z be defined on the same probability space. Convergence in distribution has no such requirement. The Portmanteau lemma gives a number of useful equivalent definitions of convergence in distribution (weak convergence). Lemma 6.4 (Portmanteau). Let (Zn)n∈N denote a sequence of random vectors of fixed dimension, and Z denote another random vector of the same dimension. The following statements are equivalent, where limits are being taken in n: (a) Pr(Zn ≤ z)→ Pr(Z ≤ x) at all continuity points z of the cumulative distribution function Pr(Z ≤ z). (b) Ef(Zn)→ Ef(Z) for all bounded, continuous function f . (c) Ef(Zn)→ Ef(Z) for all bounded Lipschitz functions f (d) lim inf Ef(Zn) ≥ Ef(Z) for all nonnegative, continuous functions f . (e) lim inf Pr(Zn ∈ G) ≥ Pr(Z ∈ G) for every open set G. (f) lim sup Pr(Zn ∈ F ) ≤ Pr(Z ∈ F ) for every closed set F . 121 6. Sketching for posterior approximation (g) Pr(Zn ∈ B) → Pr(Z ∈ B) for all Borel sets B with Pr(X ∈ ∂B) = 0, where ∂B denotes the boundary of the set B. The boundary is defined as the closure of the set B minus the interior of B, so ∂B = B \Bo. The continuous mapping theorem is a useful result for studying the weak convergence of functions of random variables. The continuous mapping theorem is quite powerful as we only need to show that the limiting random variable Z satisfies P (Z ∈ C) = 1. Possible discontinuities of g(Zn) do not cause any difficulties in establishing a limit theorem. Theorem 6.1 (Continuous Mapping Theorem). Let (Zn)n∈N indicate a sequence of v-dimensional ran- dom vectors. Let Z denote another random vector of dimension v. Suppose the function g : Rv → Rm is continuous at every point of a set C such that P (Z ∈ C) = 1. Then the following results hold (a) If Zn d→ Z then g(Zn) d→ g(Z). (b) If Zn p→ Z then g(Zn) p→ g(Z). There are some useful relationships between convergence in probability and convergence in distribu- tion. Additionally, there are some useful relationships between joint and marginal convergence when one or more sequences in question converges weakly to a constant. The following theorem summarises some useful results. Theorem 6.2 (Relationship between modes of convergence). Let (Zn)n∈N and (Yn)n∈N be sequences of random vectors. 
Let Z and Y denote random vectors. Then the following relationships hold: (a) Xn p→X implies Xn d→X. (a) Xn p→ c for a constant c if and only if Xn d→ c. (a) If Xn p→ c1 and Yn p→ c2 for constants c1 and c2, then (Xn,Yn) p→ (c1, c2) Slutsky’s theorem concerns the joint convergence of functions of random variables when there is some convergence to a constant. Theorem 6.3 (Slutsky). Let (Zn)n∈N and (Yn)n∈N be sequences of random vectors or random variables. Let X denote random vector or random variable. If Xn d→X and Yn d→ c for a constant c, then (a) Xn + Yn d→X + c (a) YnXn d→ cX (a) Y −1n Xn d→ c−1X provided that c 6= 0. Lemma 6.5 (Uniform convergence). Suppose that (Zn)converges in distribution to a random vector Z with a continuous distribution function. Then lim n→∞supz |Pr(Zn ≤ z)− Pr(Z ≤ z)| = 0. Proofs of all the results in this subsection are given in Chapter 2 of Van Der Vaart (1998). 122 6.7. Random matrix theory 6.7 Random matrix theory 6.7.1 Pointwise limit The extreme eigenvalues of a Wishart random matrix converge in probability to fixed values as both the dimension and degrees of freedom expand. The result for the largest eigenvalue is due to Geman (1980) and the result for the smallest eigenvalue is due to Silverstein (1985). Theorem 6.4. (Geman, 1980; Silverstein, 1985) Consider a sequence of Wishart(k, Id/k) random matrices where the degrees of freedom k and dimension d are both taken to infinity. Suppose that the variables to samples ratio d/k converges to a constant (d/k) → α, where α ∈ (0, 1]. Then the extreme eigenvalues of the random matrix, λmin and λmax converge in probability to the limits (i) λmin p→ (1−√α)2, (6.11) (ii) λmax p→ (1 +√α)2. (6.12) Theorem 6.4 and the continuous mapping theorem (Theorem 6.1) can be used to determine the asymptotic embedding probability for the Gaussian sketch. Theorem 6.5 gives the limiting probability of obtaining an -subspace embedding for a Gaussian sketch. The limit is asymptotic in n, d and k. This can be interpreted as a type of Big Data asymptotic, where we consider tall and wide datasets through the limit in n and d, and increasing sketch sizes k to cope with the with the expanding number of variables d. In Theorem 6.5 the limiting variables to sketch ratio d/k is taken to a constant that is less than one. This is reasonable as the sketched dataset will only have a full rank covariance matrix if k > d. Theorem 6.5. Suppose we have an arbitrary n × d data matrix A where n > d and A is of rank d. Assume we take a Gaussian sketch of size k. Then asymptotically in n, k and d, with d/k → α where α ∈ (0, 1], lim n,d,k→∞ Pr(S is an -subspace embedding for A) = 0 if  < (1 + √ α)2 − 1 1 if  > (1 + √ α)2 − 1 Before giving a proof we first make a comment on the practical interpretation of Theorem 6.5. For large d and k, we expect the embedding probability to be a step function at (1 + √ d/k)2 − 1. In the simulations depicted in Figure 6.1, the sketch size to variable ratio was held constant at twenty. As k and d grow larger, the embedding probability curve should concentrate at (1 + √ 1/20)2−1 ≈ 0.5. This value is plotted as a dashed line in each panel. Moving through the panels in the order (a) through (d) we can see visible convergence to the pointwise limit around 0.5. As d and k increase, the empirical cdf of the simulated values [1], . . . , [B], (see equation (6.10)) approaches a step function at the pointwise limit. 
Proof: Let W ∼Wishart(k, Id/k), and let λmin and λmax denote the minimum and maximum eigen- values of W respectively. Using Slutsky’s theorem and the continuous mapping theorem we have the joint convergence result |1− λmin| |1− λmax|  p→ |1− (1−√α)2| |1− (1 +√α)2|  . (6.13) For large k and d, the maximum eigenvalue λmax is expected to show greater deviation from one than the minimum eigenvalue λmin. Over the interval α ∈ (0, 1] it holds that |1− (1 +√α)2| > |1− (1−√α)2|. Applying the continuous mapping theorem to the random vector in (6.13), max |1− λmin| |1− λmax|  p→ max |1− (1−√α)2| |1− (1 +√α)2|  , 123 6. Sketching for posterior approximation yielding max(|1 − λmin|, |1 − λmax|) p→ |1 − (1 + √ α)2|. Now as (1 + √α)2 is greater than one for all α > 0, the absolute value sign can be removed in the limit giving the equivalent statement max(|1 − λmin|, |1 − λmax|) p→ (1 + √ α)2 − 1. Recalling that σmax(Id −W ) = max(|1 − λmin|, |1 − λmax|), we establish convergence of the limiting singular value σmax(Id −W ) p→ (1 + √ α)2 − 1. (6.14) Property Portmanteau lemma then gives the probabilistic statement lim n,d,k→∞ Pr(σmax(Id −W ) ≤ ) = 0 if  < (1 + √ α)2 − 1, 1 if  > (1 + √ α)2 − 1. As  = (1 + √ α)2 − 1 is a discontinuity point of the limiting distribution function we do not make a statement about the case  = (1 + √ α)2 − 1. From (6.7) we have the equality in limits lim n,d,k→∞ Pr(S is an -subspace embedding for A) = lim n,d,k→∞ Pr(σmax(Id −W ) ≤ ), giving the final result. In Figure 6.1 the Monte Carlo estimate of the embedding probability curve shows deviation from the sharp step function predicted by Theorem 6.5. We can obtain a more accurate approximation of the embedding probability using more sophisticated asymptotics. 6.7.2 Tracy-Widom limit We will show that the embedding probability (6.1) under the Gaussian sketch is closely related to the Tracy-Widom law. The Tracy-Widom distribution is the limiting distribution of the maximum eigenvalue of a random Wishart(m, Im) matrix. Note that in the previous expression the degrees of freedom m matches the dimension of the m×m random matrix. The cumulative distribution function of the Tracy- Widom distribution F1(x), is defined as F1(x) = exp ( −1 2 ∫ ∞ x q(t) + (t− x)q2(t) dt ) . Where q(x) satisfies the nonlinear differential equation q′′(x) = xq(x) + 2q3(x), subject to the asymptotic boundary condition, q(x) ∼ Ai(x) as x → ∞. The function Ai(x) denotes the Airy function, defined as Ai(x) = 1 pi ∫ ∞ 0 cos ( t3 3 + xt ) dt. The Airy equation y′′ − xy = 0 has solution y = Ai(x) under the boundary condition y → 0 as x → ∞. Figure 6.2 plots the density function of the Tracy-Widom distribution and the Airy function. The R package RMTstat has a suite of functions for working with the Tracy-Widom distribution (Johnstone et al., 2014). The Tracy-Widom law describes Wishart(m, Im) random matrices, however we are interested in situations where the degrees of freedom may not match the dimension of the matrix. For sketching algorithms the random matrix of interest W ∼ Wishart(k, Id/k) has the degrees of freedom controlled by the sketch size k, and the dimension of the matrix is given by the number of variables in the source dataset d. Johnstone (2001) showed that Tracy-Widom law also gives the asymptotic distribution of the maximum eigenvalue of a Wishart(k, Id/k) matrix after appropriate centring and scaling. 
In subsequent work Ma (2012) showed that the rate of convergence could be improved by using different centering and scaling constants than in Johnstone (2001). We present the convergence result given by Ma. 124 6.7.2. Tracy-Widom limit −6 −4 −2 0 2 4 6 0. 00 0. 10 0. 20 0. 30 (a) Tracy−Widom distribution x D en si ty −15 −10 −5 0 5 10 − 0. 4 − 0. 2 0. 0 0. 2 0. 4 (b) Airy function Ai(x) x Ai (x) Figure 6.2: (a) Tracy-Widom distribution. (b) Airy function. The Tracy-Widom distribution has a significant role in describing the asymptotic distributions of the eigenvalues of large random matrices. Theorem 6.6. (Ma, 2012) Consider a sequence of Wishart(k, Id/k) random matrices where the degrees of freedom k and dimension d are both taken to infinity. Let λmax denote the maximum eigenvalue of the random matrix. Suppose that the variables to samples ratio d/k converges such that d/k → α where α ∈ (0, 1]. Define the centring and scaling constants as µk,d = k −1( √ k − 1/2 + √ d− 1/2)2, σk,d = k −1( √ k − 1/2 + √ d− 1/2) ( 1√ k − 1/2 + 1√ d− 1/2 )1/3 . Then (λmax − µk,d) σk,d d→ Z, where Z ∼ F1 and F1 is the Tracy-Widom distribution. The limiting values of µk,d and σk,d are (1+ √ α)2 and 0 as we take k, d to infinity with d/k → α. This shows a correspondence with the pointwise limit of maximum eigenvalue given in Theorem 6.4. Theorem 6.7 uses the Tracy-Widom law to describe the asymptotic embedding probability when using a Gaussian sketch. Theorem 6.7. Suppose we have an arbitrary n× d data matrix A where n > d and A is of rank d. Let the singular value decomposition of A be given by A = UDV T. Furthermore assume we take a Gaussian sketch of size k. Consider the limit in n, k and d, such that d/k → α with α ∈ (0, 1]. Let µk,d and σk,d be the centering and scaling constants given in Theorem 6.6. The asymptotically in n, d and k, (i) lim n,d,k→∞ ∣∣∣∣Pr (S is an -subspace embedding for A)− Pr(Z ≤ + 1− µk,dσk,d )∣∣∣∣ = 0. Where Z ∼ F1 and F1 is the Tracy-Widom distribution. Furthermore we have convergence in distribution (ii) σmax(Id −USTSU)− µk,d + 1 σk,d d→ Z. 125 6. Sketching for posterior approximation 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 (a) d=10, k=20*d Epsilon Em be dd in g pr ob ab ilit y 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 (b) d=20, k=20*d Epsilon Em be dd in g pr ob ab ilit y 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 (c) d=100, k=20*d Epsilon Em be dd in g pr ob ab ilit y 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 (d) d=1000, k=20*d Epsilon Em be dd in g pr ob ab ilit y Empirical Lower bound Pointwise limit Tracy−Widom limit Figure 6.3: Empirical probability of obtaining an -subspace embedding at different k and d. The sketch to vari- ables ratio is kept constant at twenty. We want the sketching matrix S to be an -subspace embedding for the source dataset with small  with high probability. The y-axis gives the proportion of times we attain an -subspace embedding. The x-axis gives the distortion factor  (recall Definition 6.1). The empirical cdf is obtained by simulat- ing σmax(Id −W ) where W ∼ Wishart(k, Id/k). The Tracy-Widom law gives the most accurate predictions of the embedding probability in the simulation. The lower bound is conservative at each d. The Tracy-Widom based pre- diction (Theorem 6.7) describes the fluctuations around the sharp step-function given by the pointwise asymptotic theory (Theorem 6.5). The accuracy of the Tracy-Widom approximation improves as d increases. 
Before presenting the proof we show an application of Theorem 6.7. Result (i) gives an asymptotic approximation for the embedding probability given that we take a sketch of size k of a source dataset with d variables. We compare the Tracy-Widom approximation in Theorem 6.7 to the simulation results reported in section 6.5.2. Figure 6.3 displays the results. The asymptotic expression for the embedding probability is superimposed as a blue dashed line over the empirical estimate (red line). The pointwise limit discussed in section 6.7.1 is again plotted as a solid vertical line. The Tracy-Widom limit is clearly more accurate than the pointwise approximation, modelling the fluctuation around the limiting value (1 −√(1/20))2 − 1 ≈ 0.5. The Tracy-Widom approximation improves as k and d grow larger. In (a) we can see that the left tail of the Tracy-Widom approximation is not accurate for d = 10. In (c) we can see that the approximation is very good at d = 100. The lower bound (6.9) is again plotted as the dot-dash line. Although an asymptotic result, the Tracy-Widom limit gives a more accurate prediction of the embedding probability than the finite sample lower bound in this simulation. Theorem 6.7 (ii) gives the asymptotic distribution of σmax(Id − UTSTSU) for an arbitrary data matrix A = UDV T. The simulated values [1], . . . , [B] (recall equation (6.10)) can be used to estimate the density of σmax(Id−UTSTSU). Figure 6.4 compares the asymptotic theoretical density of σmax(Id− UTSTSU) to a kernel density estimate from the simulated values. The agreement improves as d grows, with the asymptotic approximation being extremely good for d = 100 and d = 1000. Theorem 6.7 is not an immediate extension of Theorem 6.6 as Theorem 6.6 does not pertain describe the joint distribution 126 6.7.2. Tracy-Widom limit 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0 2 4 6 8 (a) d=10, k=20*d σmax(Id − UTSTSU) D en si ty 0.3 0.4 0.5 0.6 0.7 0 2 4 6 8 (b) d=20, k=20*d σmax(Id − UTSTSU) D en si ty 0.40 0.45 0.50 0.55 0.60 0 5 10 15 20 25 (c) d=100, k=20*d σmax(Id − UTSTSU) D en si ty 0.47 0.48 0.49 0.50 0.51 0.52 0 20 40 60 80 10 0 (d) d=1000, k=20*d σmax(Id − UTSTSU) D en si ty Empirical Tracy−Widom limit Pointwise limit Figure 6.4: Observed and theoretical density of σmax(Id −UTSTSU) at different k and d. The sketch to variables ratio is kept constant at twenty. We want the sketching matrix S to be an -subspace embedding for the source dataset with small  with high probability. The sketching matrix S is an -subspace embedding for the source dataset if and only if σmax(Id − UTSTSU) ≤ . The x-axis limits are different in each plot. The accuracy of the Tracy-Widom approximation improves as d increases as is expected from the asymptotic theory (Theorem 6.7). of λmin and λmax. The embedding probability involves the joint distribution of λmin and λmax. We use Theorem 6.6 in conjunction with the Portmanteau lemma, Theorem 6.5 and the continuous mapping theorem in order to establish Theorem 6.7. We now present the proof of Theorem 6.7. Proof: Let W ∼Wishart(k, Id/k), and let λmin and λmax denote the minimum and maximum eigenvalues of W respectively. The majority of the proof comes down to showing that λmax controls the embedding probability. Using the Portmanteau lemma (Lemma 6.4) we will show that lim d,k→∞ Pr(σmax(Id −W ) ≤ ) = lim d,k→∞ Pr (|1− λmax| ≤ ) . Recall the key expression given in (6.8), Pr(S is an -subspace embedding for A) = Pr(σmax(Id −W ) ≤ ) = Pr (|1− λmin| ≤ , |1− λmax| ≤ ) . 
The Tracy-Widom law describes the marginal distributions of λmin and λmax. We would like to avoid working with the joint distribution of the extreme eigenvalues, and instead restrict attention to the distribution of the maximum. Let denote the random vector X = (|1 − λmin|, |1 − λmax|)T. Figure 6.5 presents some diagrams that will be useful. We wish to know the probability that X lies in the region shaded region C in panel (a). For every  > 0 we have that Pr (|1− λmin| ≤ , |1− λmax| ≤ ) = Pr(X ∈ C). The region C can be expressed as C = M−R where M and R are the shaded regions in panels (b) and 127 6. Sketching for posterior approximation Figure 6.5: Regions of interest in determining the embedding probability. To obtain an -subspace embedding we require that |1− λmin| ≤  and |1− λmax| ≤ . If we define X = (|1− λmin|, |1− λmax|)T, we have that Pr(X ∈ C) = Pr(X ∈M)− Pr(X ∈ R). In panel (c) the dot-dash line gives the identity line where |1− λmax| = |1− λmin|. (c) respectively. The probability Pr(X ∈M) represents the marginal probability that |1−λmax| ≤ . The probability Pr(X ∈ R) represents the probability of the joint event that (|1− λmax| ≤ , |1− λmin| > ). We have that Pr(X ∈ C) = Pr(X ∈M)− Pr(X ∈ R). In panel (c) the dot-dash line gives the identity line where |1−λmax| = |1−λmin|. From Theorem 6.5 we know that as d, k tends to infinity X converges in distribution to the constant vector XL = (|1 − (1 −√ α)2|, |1−(1+√α)2|)T. As such, asymptotically |1−λmax| > |1−λmin| with probability one. Referring to panel (c), the random vector XL takes values in the region below the dot-dash line with probability one. The limiting random vector XL thus satisfies Pr(XL ∈ R) = 0 and Pr(XL ∈ ∂R) = 0. As X d→ XL, property (g) of the Portmanteau lemma (Lemma 6.4) gives that Pr(X ∈ R) → Pr(XL ∈ R) = 0 . The limiting probability is then lim d,k→∞ Pr (|1− λmin| ≤ , |1− λmax| ≤ ) = lim d,k→∞ Pr(X ∈ C) (6.15) = lim d,k→∞ Pr(X ∈M)− lim d,k→∞ Pr(X ∈ R) (6.16) = lim d,k→∞ Pr(X ∈M)− 0 (6.17) = lim d,k→∞ Pr (|1− λmax| ≤ ) .. (6.18) We have now isolated the maximum eigenvalue λmax as the determining factor in obtaining an -subspace embedding. We make another application of the Portmanteau lemma to arrive at the final result. From here we can write Pr (|1− λmax| ≤ ) = Pr(λmax ≤ + 1)− Pr(λmax ≤ 1− ). (6.19) From Theorem 6.5 we know that λmax converges in distribution to the constant random variable ZL = (1 + √ α)2, where we have assumed α ∈ (0, 1]. Let B denote the interval (−∞, 1]. The limiting random variable ZL satisfies Pr(ZL ∈ B) = 0 and Pr(ZL ∈ ∂B) = 0. As such using property g of the Portmanteau lemma, lim d,k→∞ Pr(λmax ∈ B) = 0. Now Pr(λmax ≤ 1 − ) ≤ Pr(λmax ∈ B) for any  > 0. We can then conclude that lim d,k→∞ Pr(λmax ≤ 1 − ) = 0 for any  > 0. Asymptotically, the term Pr(λmax ≤ 1 − ) drops out of the expression for the embedding probability. Taking limits over (6.19) lim d,k→∞ Pr (|1− λmax| ≤ ) = lim d,k→∞ Pr(λmax ≤ + 1)− lim d,k→∞ Pr(λmax ≤ 1− ) = lim d,k→∞ Pr(λmax ≤ + 1)− 0. 128 6.8. Sketching asymptotics The asymptotic embedding probability is then related to the asymptotic cdf of λmax. Let Z be a random variable with Tracy-Widom distribution F1. Now for any fixed d and k, it must hold that for any fixed  > 0,∣∣∣∣Pr(λmax − µk,dσk,d ≤ + 1− µk,dσk,d ) − Pr ( Z ≤ + 1− µk,d σk,d )∣∣∣∣ ≤ sup z∈R ∣∣∣∣Pr(λmax − µk,dσk,d ≤ z ) − Pr(Z ≤ z) ∣∣∣∣ . Using Theorem 6.6, the random variable (λmax − µk,d)/σk,d converges in distribution to the continuous random variable Z ∼ F1. 
It follows from Lemma 6.5 that lim d,k→∞ sup z∈R ∣∣∣∣Pr(λmax − µk,dσk,d ≤ z ) − Pr(Z ≤ z) ∣∣∣∣ = 0. Now by the squeeze theorem, it holds that for all  > 0, lim d,k→∞ ∣∣∣∣Pr(λmax − µk,dσk,d ≤ + 1− µk,dσk,d ) − Pr ( Z ≤ + 1− µk,d σk,d )∣∣∣∣ = 0. Property (a) of the Lemma (6.4) then gives part (ii). 6.8 Sketching asymptotics We also wish to characterise the probability of obtaining an -subspace embedding for the Hadamard and Clarkson-Woodruff projections. Again letA be some large data matrix with singular value decomposition A = UDV T. The embedding probability of interest is Pr(S is an -subspace embedding for A) = Pr(σmax ( Id −UTSTSU) ≤  ) . Because of the discrete nature of the Hadamard and Clarkson-Woodruff projections it is cumbersome to express this probability in a meaningful way. As in Chapter 4 we again appeal to large n asymptotics to obtain an interpretable expression. Using the sketching central limit theorem (Theorem 4.3) we can argue that the Hadamard and Clarkson-Woodruff sketches have the same limiting embedding probability as the Gaussian projection. The sketching central limit theorem is restated here as Theorem 6.8. The necessary regularity conditions are described again in Assumption 1. Assumption 1 Let the singular value decomposition of the n × d source dataset be given by A(n) = U(n)D(n)V T (n). Let u T (n)i give the ith row in U(n) for i = 1, . . . , n. Assume that the maximum leverage score tends to zero, that is lim n→∞ maxi=1,...,rn ‖u(n)i‖22 = 0. Theorem 6.8. Consider a sequence of arbitrary n× d data matrices A(n), where d is fixed. Let A(n) = U(n)D(n)V T (n) represent the singular value decomposition of A(n). Let S be a k×n Hadamard or Clarkson- Woodruff sketching matrix where k is also fixed. Let uT(n)i represent the ith row of U(n). Suppose that Assumption 1 on the maximum leverage score is satisfied. Then as n tends to infinity with k and d fixed, [SA(n)]V(n)D −1 (n) d→ MN(0, Ik, Id/k). Theorem 6.9 gives the asymptotic probability of obtaining an -subspace embedding for the Hadamard and Clarkson-Woodruff sketches. Theorem 6.9. Consider a sequence of arbitrary n× d data matrices A(n), where each data matrix is of rank d, and d is fixed. Let A(n) = U(n)D(n)V T (n) represent the singular value decomposition of A(n). Let S be a k× n Hadamard or Clarkson-Woodruff sketching matrix where k is also fixed. Let uT(n)i represent the ith row of U(n). Assume that the maximum leverage score tends to zero, so lim n→∞ maxi=1,...,n ‖u(n)i‖22 = 0. 129 6. Sketching for posterior approximation Then as n tends to infinity with k and d fixed, lim n→∞Pr ( S is an -subspace embedding for A(n) ) = Pr (σmax(Id −W ) ≤ ) , where W ∼Wishart(k, Id/k). Proof: The sketching central limit theorem holds under Assumption 1. As we only need to con- sider the sequence of orthonormal matrices U(n) to determine the embedding probability, we can use Theorem 6.8 with D(n) and V(n) set to the d × d identity matrix. As such we conclude that SU(n) d→ MN(0k×d, Ik, Id/k). By the continuous mapping theorem it holds that for fixed d and k, asymptotically with n, UT(n)S TSU(n) d→Wishart(k, Id/k). Another application of the continuous mapping theorem gives σmax(Id −UT(n)STSU(n)) d→ σmax(Id −W ), where W ∼ Wishart(k, Id/k). We can use the continuous mapping theorem as the limiting Wishart matrix W has rank d with probability one. The maximum singular value function is continuous over the range where W has full rank (Bhatia, 1996). 
By the Portmanteau lemma it then holds that lim n→∞Pr ( σmax(Id −UT(n)STSU(n)) ≤  ) = Pr (σmax(Id −W ) ≤ ) . Now as lim n→∞Pr ( S is an -subspace embedding for A(n) ) = lim n→∞Pr ( σmax(Id −UT(n)STSU(n)) ≤  ) = Pr (σmax(Id −W ) ≤ ) , we have the final result The limiting embedding probability in Theorem 6.9 is the exact same embedding probability as for the Gaussian sketch (6.7). As n grows, we expect the Hadamard and Clarkson-Woodruff sketches to be as effective as the Gaussian sketch for generating -subspace embeddings. This is significant as the Hadamard and Clarkson-Woodruff sketches are dramatically faster than the Gaussian projection (see Table 6.1). The Clarkson-Woodruff sketch is O(k) times faster than the Gaussian projection, which can be a very large factor in practice. We can again use the Tracy-Widom distribution to approximate the limiting embedding probability Pr (σmax(Id −W ) ≤ ) for the Clarkson-Woodruff and Hadamard sketches. We are unable to establish a formal limit theorem in terms of the Tracy-Widom distribution lim n,d,k→∞ Pr(S is an -subspace embedding for A(n)) = Pr ( Z ≤ + 1− µk,d σk,d ) , where Z ∼ F1 as per Theorem 6.7. This is because the sketching central limit theorem (Theorem 6.8) has k and d fixed, with only n being taken to infinity. Central limit theorems under expanding dimension are more challenging (Portnoy, 1986), and this is left as future work. It is possible that Assumption 1 on the leverage scores will remain sufficient in the expanding dimension scenario. The following reasoning was also used by Huber (1973) in the analysis of high dimensional regression models. For any d, the maximum leverage score must be greater than the average leverage score. We thus have max i=1,...,n ‖u(n)i‖22 ≥ 1 n n∑ i=1 ‖u(n)i‖22 = d n . If we maintain that Assumption 1 holds on the leverage scores as n, d, k →∞ we necessarily imply that d/n→ 0. This is intuitively reasonable. If we simultaneously require d/k → α where α ∈ (0, 1] we must 130 6.9. Data application also have that k/n → 0. This is also a sensible requirement. The proofs in Chapter 5 do not appear to have any critical weaknesses that prevent an extension to the expanding dimension scenario. The lack of a central limit theorem in the expanding dimension scenario is not a critical gap, as the key result is that the Hadamard and Clarkson-Woodruff sketches behave like the Gaussian projection for large n, with k and d fixed. If the Tracy-Widom approximation in Theorem 6.7 is good for finite k and d with the Gaussian sketch, it should hold well for the Hadamard and Clarkson-Woodruff projections for n sufficiently large. 6.9 Data application We test the theory on a large genetic dataset. The covariate data consists of genotypes at p = 1032 genetic variants on n = 407, 779 subjects. The genetic variants are in the Protein Kinase C Epsilon (PRKCE) gene. Variants were filtered to have mean allele frequency of greater than one percent. The response variable is haemoglobin concentration adjusted for age/sex and technical covariates. We also consider a subset of this dataset with p = 130 representative markers identified by hierarchical clustering. We first assess the accuracy of Theorem 6.7 and 6.9 by comparing the theoretical embedding proba- bility to the observed embedding probability. We then check the accuracy of the posterior approximation as a function of  in a simulation study. We compare the Gaussian, Hadamard and Clarkson-Woodruff sketches. 
We also compare the data oblivious sketches to simple random sampling with replacement. The simple random sampling of observations is also referred to as the Uniform projection in the sketching literature. As mentioned in Chapter 4, although the uniform projection is simple to understand and implement, it is difficult to establish strong error bounds on its performance. 6.9.1 Embedding probabilities The dataset is of moderate size, so it is feasible to take the singular value decomposition of the full n × d dataset A = UDV T. Given the singular value decomposition we can run an oracle procedure to determine the empirical embedding probability. Suppose we take B sketches. Let S[1], . . . ,S[B] denote the B sketching matrices. Define [b] as [b] = σmax(Id −UTS[b]TS[b]U), for b = 1, . . . , B. The value [b] measures the quality of the sketching matrix S[b]. The estimated embedding probability is then P̂r(S is an -subspace embedding for A) = P̂r(σmax(Id −UTSTSU) ≤ ) = 1 B B∑ b=1 1([b] ≤ ). Figure 6.6 shows the empirical and theoretical embedding probabilities for the representative PRKCE dataset (n = 407, 779, d = 132) for each type of sketch. We took one hundred sketches at k = 20 × d. The observed and theoretical curves match well for the Gaussian, Hadamard and Clarkson-Woodruff projection. The uniform projection performs worse than the other data-oblivious random projections. The distribution of  is shifted to the right compared to the other projections. Larger values of  indicate weaker approximation bounds. The uniform projection does not satisfy a central limit theorem for fixed k, so we do not necessarily expect the Tracy-Widom law to give a good approximation for the uniform projection. Table 6.2 reports the average sketching time for each projection. The Clarkson-Woodruff projection is much faster than the Gaussian projection, but has the same embedding probability, as predicted by Theorem 6.9. Figure 6.7 shows the empirical and theoretical embedding probabilities for the full PRKCE dataset (n = 407, 779, d = 1034). We took one hundred sketches at k = 20 × d. The x-axis is different in 131 6. Sketching for posterior approximation Clarkson−Woodruff Uniform Gaussian Hadamard 0.5 0.6 0.7 0.8 0.5 0.6 0.7 0.8 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Epsilon Em be dd in g pr ob ab ilit y Empirical Pointwise limit Tracy−Widom limit Figure 6.6: Empirical and theoretical embedding probabilities for the representative PRKCE dataset. Hierarchical clustering was used to select a subset of genetic variants from the full PRKCE dataset. We want the sketching matrix S to be an -subspace embedding for the source dataset with small  with high probability. The y-axis gives the proportion of times we attain an -subspace embedding. The x-axis gives the distortion factor  (recall Definition 6.1). In this simulation n = 407, 779, d = 132, k = 20× d. One thousand sketches were generated for each type of sketch. The vertical line gives the asymptotic pointwise limit. The Tracy-Widom approximation is very accurate for the Gaussian, Hadamard and Clarkson-Woodruff sketches. Projection Gaussian Hadamard Clarkson-Woodruff Uniform Time (seconds) 769 17.6 1.34 0.03 Table 6.2: Mean sketching time for the representative PRKCE dataset. Results for the Hadamard, Clarkson- Woodruff and Uniform projections are over one hundred sketches. Results for the Gaussian sketch are over ten sketches.The Gaussian sketch is considerably slower than the Hadamard and Clarkson-Woodruff sketches as is expected from Table 6.1. 
132 6.9.1. Embedding probabilities Clarkson−Woodruff Uniform Gaussian Hadamard 0.5 0.6 0.7 0.8 0.9 0 5 10 15 0.450 0.475 0.500 0.525 0.550 0.450 0.475 0.500 0.525 0.550 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Epsilon Em be dd in g pr ob ab ilit y Empirical Pointwise limit Tracy−Widom limit Figure 6.7: Empirical and theoretical embedding probabilities for the full PRKCE genetic dataset (n = 407, 779, d = 1034, k = 20 × d). One hundred sketches were generated for each type of sketch. We want the sketching matrix S to be an -subspace embedding for the source dataset with small  with high probability. The y-axis gives the proportion of times we attain an -subspace embedding. The x-axis gives the distortion factor  (recall Definition 6.1). Different x-scales are used in each plot. The Tracy-Widom approximation is very accurate for the Gaussian sketch. There is moderate deviation from the asymptotic approximation for the Hadamard sketch and larger deviation for the Clarkson-Woodruff sketch. each plot. Overall, the empirical and theoretical curves do not match as well compared to Figure 6.6. The approximation for the Gaussian projection is very accurate. Interestingly, the Hadamard projection slightly outperforms the Gaussian projection. The dashed line in the Hadamard panel is shifted slightly to the left compared to the theoretical curve, representing better than expected performance. The Clarkson-Woodruff projection has a much longer right tail then is predicted by the Tracy-Widom law. The Uniform projection does very poorly in this simulation compared to the the other data oblivious projections. Even though the distribution of  under Clarkson-Woodruff projection has a longer right tail than the theoretical, the distortion factors are always below 1. The median for the Uniform projection is above 5. The Hadamard and Clarkson-Woodruff projections to be more stable in this example. Data oblivious projections are designed to be robust, and it appears there are features in this dataset that cause major problems for uniform subsampling that are less of an issue for the data oblivious sketches. The extra computational cost of the Hadamard and Clarkson-Woodruff projections come with some demonstrable benefits. Table 6.3 reports the average sketching time for each projection. The Gaussian sketch was not timed in this example as we simulated directly from the matrix normal distribution of the sketched data A˜ ∼ MN(0, Ik,ATA/k) rather than computing the matrix product SA directly. This was so that the simulation could be run in a reasonable amount of time. The Clarkson-Woodruff sktch is again faster than the Hadamard sketch as is expected. We can better understand the deviation from the Tracy-Widom limit in Figure 6.7 by looking at the density of σmax(Id − UTSTSU). Figure 6.8 compares the empirical density to the Tracy-Widom approximation given in Theorem 6.7. The empirical distribution under the Hadamard sketch is shifted to 133 6. Sketching for posterior approximation Clarkson−Woodruff Uniform Gaussian Hadamard 0.5 0.6 0.7 0.8 0 5 10 15 0.450 0.475 0.500 0.525 0.550 0.450 0.475 0.500 0.525 0.550 0 30 60 90 0.00 0.05 0.10 0.15 0 30 60 90 0 30 60 90 σmax(Id − UTSTSU) de ns ity Empirical Pointwise limit Tracy−Widom limit Figure 6.8: Empirical and theoretical density of σmax(Id−UTSTSU) for the full PRKCE genetic dataset. We want the sketching matrix S to be an -subspace embedding for the source dataset with small  with high probability. 
The sketching matrix S is an -subspace embedding for the source dataset if and only if σmax(Id −UTSTSU) ≤ . One hundred sketches were generated for each type of sketch. The vertical line gives the asymptotic pointwise limit. Different x-scales are used in each plot. The Tracy-Widom approximation is very accurate for the Gaussian sketch. There is moderate deviation from the asymptotic approximation for the Hadamard sketch and larger deviation for the Clarkson-Woodruff sketch. the left compared to the asymptotic result, and the empirical distribution under the Clarkson-Woodruff sketch has more right skew. The deviation from the Tracy-Widom limit in Figure 6.8 could be because the finite sample approx- imation is poor. Theorem 6.9 suggests that the Hadamard and Clarkson-Woodruff projections behave like the Gaussian sketch for n sufficiently large. To test this we bootstrapped the full PRKCE dataset to be ten times its original size. The bootstrapped PRKCE dataset has n = 4, 077, 790, d = 1034. We took one thousand sketches of size k = 20 × d using the Clarkson-Woodruff projection and ran the or- acle procedure of computing [b] = σmax(Id − UTS[b]TS[b]U) for each sketch. Figure 6.9 compares the distribution of σmax(Id −UTSTSU) using Clarkson-Woodruff projection on the original dataset and on the large bootstrapped dataset. As n increases we expect the quality of the Tracy-Widom approximation to improve. Panel (a) compares the theoretical to the simulation results on the original dataset. The Clarkson-Woodruff projection shows greater variance than expected. Panel (b) compares the theoretical to the simulation results on the bootstrapped dataset. In (b) we see very good agreement between the empirical distribution and the theoretical distribution. It seems that for this dataset n ≈ 400, 000 is not big enough for the large sample asymptotics to kick in. At n ≈ 4 million we see the expected asymptotic behaviour. The results in Figure 6.9 are significant as they reflect the culmination of the asymptotic analysis in Chapter 5 and the asymptotic results in this chapter. The sketching central limit theorem ties the large n behaviour of the Clarkson-Woodruff projection to that of the Gaussian projection. Theorem 6.7 links the large d and k behaviour of the Gaussian projection to the Tracy-Widom law. The finite sample analysis of the Clarkson-Woodruff projection is very difficult. By chaining together asymptotic results we can obtain asymptotic predictions on its performance. As with any asymptotic analysis there 134 6.9.2. Posterior approximation 0.48 0.50 0.52 0.54 0.56 0 20 40 60 80 10 0 12 0 Orginal PRKCE dataset (n=407,779, d=1034, k=20*d) N = 100 Bandwidth = 0.003686 D en si ty 0.48 0.50 0.52 0.54 0.56 0 20 40 60 80 10 0 12 0 Boostrapped PRKCE dataset (n=4,077,790, d=1034, k=20*d) N = 1000 Bandwidth = 0.0007265 D en si ty Figure 6.9: Comparison of theoretical and empirical density of σmax(Id − UTSTSU) on the full PRKCE dataset (n = 407, 779) and the bootstrapped PRKCE dataset (n = 4, 077, 790). We want the sketching matrix S to be an -subspace embedding for the source dataset with small  with high probability. The sketching matrix S is an -subspace embedding for the source dataset if and only if σmax(Id −UTSTSU) ≤ . The dashed black line gives the Tracy-Widom limit and the solid red line shows a kernel density estimate. The sketch size was set at k = 20×d, where d = 1034. We used a Clarkson-Woodruff sketch. 
The accuracy of the Tracy-Widom approximation improves as n increases as is expected from the asymptotic theory (Theorem 6.9). Projection Gaussian Hadamard Clarkson-Woodruff Uniform Time (seconds) - 156 21 2.8 Table 6.3: Mean sketching time (seconds) for the full PRKCE dataset (n = 407, 779, d = 1034). This dataset is too large to use Gaussian sketch directly. The Clarkson-Woodruff sketch is faster than the Hadamard sketch as is expected from Table 6.1. is there is the question of how large n has to be before the approximation is reasonable. Further study on the rate of convergence and possible finite sample bounds on the error in the normal approximation would be very useful and interesting research directions. We have used asymptotic theory to assess the embed- ding probability (6.1). The results in Figure 6.9 show the asymptotic approximations are reasonable in a realistic data setting. Given the rapidly decreasing cost of genotyping, it is reasonable to anticipate the n and d in panel (b) to be commonplace in multivariable genotype-phenotype studies in the near future. To integrate sketching methods in the analysis pipeline it is necessary to understand the level of error introduced by using the random projection. The asymptotic results are useful as they provide important guidance on the quality of the randomised data compression step. Table 6.3 reports the average sketching time for each projection on the full dataset. This is the time required to compute the random sketched dataset A˜ = SA. The Clarkson-Woodruff sketch is faster than the Hadamard sketch as is expected from Table 6.1. 6.9.2 Posterior approximation We also compare the accuracy of the sketched posterior distribution at different sketch sizes. Sketching is a tool for approximate computation. As the sketch size k increases the accuracy of the approximate calculation increases. As discussed in section 6.4 this behaviour can be formalised using -subspace embeddings. As k increases we  p→ 0 and the sketched posterior distribution will approach the target posterior distribution. Here we examine the sensitivity of the results to k and  in a practical setting. 135 6. Sketching for posterior approximation llll l ll l l lll l ll l l ll l l l l l l l l l l l l l l l l l l ll ll l l l l l l l l l l l l l l l l l l l l l l l ll lll ll l l l l l ll ll l lll l l ll l l ll l l l l ll l l l l l ll l lll l ll l ll l l l l l l l l l l l l lll 0 20 40 60 80 100 120 0 1 2 3 4 5 6 Manhattan plot using multivariate regression model SNP index − lo g1 0 p− va lu e Figure 6.10: Manhattan plot using the representative PRKCE dataset. The y-axis gives the minus log 10 p-value for each predictor when testing a null hypothesis of zero. We fit the saturated model using p = 130 markers and performed a t-test on each coefficient associated with a particular genetic marker. l l l l l l l l l l l l l l l 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 Marginal probability of inclusion Po st er io r p ro ba bi lity rs 13 39 64 24 rs 61 75 68 62 rs 11 27 88 36 6 rs 56 40 93 20 rs 11 49 48 63 9 rs 10 20 44 6 rs 18 68 27 4 rs 67 12 55 7 rs 75 79 48 46 rs 13 88 54 32 8 rs 10 17 10 92 rs 10 16 83 49 rs 79 80 68 47 rs 67 24 31 5 rs 28 14 82 SNP Figure 6.11: Marginal inclusion probabilities from an analysis using the top 15 SNPs. These results were obtained by enumerating over all models using the full dataset. There are some variants with low support for inclusion, some variants with moderate support for inclusion and some variants with high support for inclusion. 
The range of target marginal inclusion probabilities makes this a useful benchmark dataset for sketching. To isolate the error introduced by the sketch we limit our analysis to a dataset with p = 15 genetic variants so that the posterior can be computed by enumeration. This is so there is not Monte Carlo error in the target posterior distribution. We took the top 15 variants identified by fitting a multivariable regression model using the representative PRKCE dataset. Figure 6.10 shows a Manhattan plot of the p = 130 genetic markers in the dataset. We take the 15 markers with the smallest p-values. The dashed line in the plot shows the cutoff used. The design matrix X was then set to only contain the top 15 genetic variants. From here we computed the target posterior over models γ. There are 216 = 65536 candidate models as we allow for an intercept. Figure 6.11 shows the marginal posterior probability of inclusion for each of the top 15 variants in the restricted analysis. We fixed the error variance at σ2 = n/(n − 131)σ̂2 where σ̂2 is the maximum likelihood estimate of the residual variance from fitting the saturated model with p = 130 genetic predictors. We then computed the integrated likelihood for all models using (6.2). By normalising we obtained the the exact posterior distribution on the full dataset. We benchmark the sketched posterior against the true posterior distribution. 136 6.9.2. Posterior approximation We then took B = 100 hundred sketches at different sizes k using the Clarkson-Woodruff projection. Let A˜[b] be the bth sketched dataset for b = 1, . . . , B. For b = 1, . . . , B we then computed the sketched sufficient statistics A˜[b]TA˜[b]. We then computed the approximate integrated likelihood for each model using (6.5). Formally, for b = 1, . . . , B and all models γ we calculated p˜[b](y|γ, g, σ2) = (1 + g)−pγ/2 exp [ − 1 2σ2 ( g g + 1 (RSSγS) [b] + 1 g + 1 yTy )] . (6.20) Using the approximate integrated likelihoods we obtain a sketched posterior over models p˜[b](γ|y, g, σ2) for b = 1, . . . , B. We computed the sketched marginal probabilities of inclusion for b = 1, . . . , B. We compare p˜[b](γ|y, g, σ2) for b = 1, . . . , B to the target posterior p˜(γ|y, g, σ2) using the marginal probabilities of inclusion. Figure 6.12 shows boxplots of the sketched posterior results over the B = 100 sketches. Each panel shows the results for the marginal inclusion probability for a particular genetic variant. The red dashed line gives the target marginal inclusion probability. At a high-level, as the sketch size increases, the quality of the approximate posterior should increase. Here quality refers to the difference compared to the exact posterior distribution computed on the full dataset p˜(γ|y, g, σ2). The gold-standard results are given by the red-dashed line in each panel. We would like to see boxplots tightly concentrated around the red dashed line. We do not see this. The panel in row 1 column 3 is worth studying in detail. The target marginal inclusion probability is below 0.05. The sketched approximations return marginal inclusion probabilities close to one for k up to ten thousand. The quality of the approximate calculation is very poor for this SNP. The panel in row 3 column 2 is also worth examining. The target marginal inclusion probability is around 0.45. The sketched marginal inclusion probabilities are biased upwards for k up to ten thousand. The sketched inclusion probabilities remain highly variable at k equal to one hundred thousand. 
The variance of the sketched approximation is very high, and it is hard to discern any clear trends from the boxplots alone. Figures 6.13 and 6.14 show histograms of the approximate marginal inclusion probabilities for selected SNPs. Looking at the first column in Figure 6.13 we can see a large positive bias at k = 100. The sketched inclusion probabilities are bunched around one when the target inclusion probabilities are below 0.5. The results in Figure 6.14 show some interesting patterns. At k equal to one hundred thousand we see that the sketched posterior inclusion probabilities have a U-shaped distribution with modes at zero and one. We would like to see a mode around the red dashed line. The noise introduced by the sketch has a strong and perhaps unpredictable effect on the marginal inclusion probabilities.

The Gaussian sketch does not require that k be smaller than n. We repeated the analysis with very large sketch sizes to identify the k necessary to give a tolerable posterior approximation. We would also like to see if smoother behaviour emerges at larger sketch sizes. We simulated directly from the distribution of the sketched sufficient statistics, Ã^T Ã ~ Wishart(k, A^T A/k), to bypass the huge computational cost of generating impractically large sketches. Figure 6.15 shows boxplots of the approximate marginal inclusion probabilities. Figures 6.16 and 6.17 show histograms of the approximate marginal inclusion probabilities for selected SNPs. These results show the general dynamic that we expect. As k increases we see concentration around the target marginal inclusion probabilities (red dashed line). In the fourth column of Figure 6.17 we see a unimodal distribution centred around the target marginal inclusion probability, rather than the U-shaped distribution that was noted in the fourth column of Figure 6.14. Even at k = 10 million there is considerable variance around the target inclusion probability. At k = 1 billion the error looks acceptable.

The posterior distribution over models is a complicated nonlinear function of the sufficient statistics. Small changes in the approximated sufficient statistics can lead to a large change in the approximated posterior distribution. It is hard to determine a priori what level of deviation to expect in the approximate posterior for a given sketch size. As before, we can run an oracle procedure to measure the quality of the sketch at each k. Let the singular value decomposition of the source dataset again be given by A = UDV^T, and define ε = E_S[σ_max(I_d − U^T S^T S U)].
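Both the Wishart representation above and this oracle distortion factor can be simulated cheaply for the Gaussian sketch without ever forming S or SA. The snippet below is a minimal sketch under stated assumptions: the Gaussian sketch is taken to have i.i.d. N(0, 1/k) entries (consistent with the Wishart(k, A^T A/k) representation quoted above), the dimensions and variable names are illustrative, and a synthetic matrix stands in for A^T A.

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(1)
d, k, B = 17, 10_000, 100                           # columns of A = [X, y]; sketch size; repeats

# (i) Gaussian-sketch sufficient statistics drawn directly from the Wishart
# distribution: (SA)^T (SA) ~ Wishart(k, A^T A / k). A synthetic positive
# definite matrix stands in for the real A^T A here.
M = rng.standard_normal((d, d))
AtA = M @ M.T + d * np.eye(d)
sketched_gram = wishart(df=k, scale=AtA / k).rvs(random_state=1)
print(sketched_gram.shape)                          # (17, 17)

# (ii) Monte Carlo estimate of the oracle distortion factor
# eps = E_S[sigma_max(I_d - U^T S^T S U)]. For the Gaussian sketch, S @ U has
# i.i.d. N(0, 1/k) entries because the columns of U are orthonormal, so
# U^T S^T S U can be simulated without touching A.
def distortion_draw(k, d, rng):
    G = rng.standard_normal((k, d)) / np.sqrt(k)    # same law as S @ U
    eigs = np.linalg.eigvalsh(np.eye(d) - G.T @ G)
    return np.abs(eigs).max()                       # sigma_max of a symmetric matrix

eps_hat = np.mean([distortion_draw(k, d, rng) for _ in range(B)])
print(eps_hat, (1 + np.sqrt(d / k)) ** 2 - 1)       # Monte Carlo vs pointwise prediction
```

Values produced in this way can be set against the asymptotic predictions discussed below and reported in Tables 6.4 and 6.5.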
Figure 6.12: Boxplots of sketched marginal posterior probabilities of inclusion. The Clarkson-Woodruff projection was used at different sketch sizes k. One hundred sketches were taken at each value of k. The target marginal probability of inclusion for each SNP is plotted as a red dashed line. The sketched posterior approximations should approach the target values as k increases. The variability in the sketched approximation differs over SNPs. In general the sketched estimates lie above the target values (the red dashed line). It is hard to see a clear relationship between the variance of the sketched approximations and k in this plot.

Figure 6.13: Histograms of sketched marginal posterior probabilities of inclusion for SNPs where the target marginal probability is less than 0.5. Results were obtained using the Clarkson-Woodruff projection at different sketch sizes k. The target marginal probability of inclusion for each SNP is plotted as a red dashed line. The sketched posterior approximations should approach the target values as k increases. The sketched marginal inclusion probabilities are concentrated around zero and one in each panel. It is hard to see a clear relationship between the variance of the sketched approximations and k in this plot.

Figure 6.14: Histograms of sketched marginal posterior probabilities of inclusion for SNPs where the target marginal probability is between 0.5 and 0.9. Results were obtained using the Clarkson-Woodruff projection at different sketch sizes k. The target marginal probability of inclusion for each SNP is plotted as a red dashed line. The sketched posterior approximations should approach the target values as k increases. The sketched marginal inclusion probabilities are concentrated around zero and one in each panel. It is hard to see a clear relationship between the variance of the sketched approximations and k in this plot.
Figure 6.15: Boxplots of sketched marginal posterior probabilities of inclusion. The target marginal probability of inclusion for each SNP is plotted as a red dashed line. The Gaussian sketch was used at different sketch sizes k. One hundred sketches were taken at each value of k. The sketched posterior approximations should approach the target values as k increases. As the sketch size k increases the approximate marginal inclusion probabilities concentrate around the target values.

Figure 6.16: Histograms of sketched marginal posterior probabilities of inclusion for SNPs where the target marginal probability is less than 0.5. Results were obtained using the Gaussian projection at different sketch sizes k. The target marginal probability of inclusion is plotted as a dashed line. The sketched posterior approximations should approach the target values as k increases. As the sketch size k increases the approximated marginal inclusion probabilities concentrate around the target values.

Figure 6.17: Histograms of sketched marginal posterior probabilities of inclusion for SNPs where the target marginal probability is between 0.5 and 0.9. Results were obtained using the Gaussian projection at different sketch sizes k. The target marginal probability of inclusion is plotted as a dashed line. The sketched posterior approximations should approach the target values as k increases. As the sketch size k increases the approximated marginal inclusion probabilities concentrate around the target values.

The value ε gives the expected distortion factor over the random sketch. This can be compared to the values predicted by Theorem 6.5 and Theorem 6.7. Using the Tracy-Widom approximation we expect that ε ≈ σ_{k,d} E[Z] + µ_{k,d} − 1, where µ_{k,d} and σ_{k,d} are the centring and scaling constants in Theorem 6.6, and E[Z] is the mean of a Tracy-Widom random variable, E[Z] ≈ −1.21.
Using the cruder pointwise approximation in Theorem 6.5 we have ε ≈ (1 + √(d/k))² − 1. A Monte Carlo estimate of ε is obtained by taking the sample average of ε^[1], ..., ε^[B], where ε^[b] is the observed distortion factor of the b-th sketch. Table 6.4 compares the sample average to the asymptotic predictions for the Clarkson-Woodruff projection. There is a good correspondence between the theoretical and observed values. Table 6.5 compares the sample average to the asymptotic predictions for the Gaussian sketch. The Tracy-Widom approximation from Theorem 6.6 gives more accurate predictions than the pointwise approximation in Theorem 6.5. We are able to estimate the expected distortion factor ε for a particular sketch size k using the asymptotic theory. The sketches are behaving as expected in terms of the distortion factor ε. It seems that very small ε is required to give a tolerable posterior approximation. We can determine ahead of time what ε is expected for a given sketch size k. This can be used to set expectations appropriately when interpreting the sketched results.

Sketch size (k)               one hundred   one thousand   ten thousand   one hundred thousand
Monte Carlo estimate of ε     0.857         0.244          0.079          0.024
Tracy-Widom expectation       0.851         0.245          0.075          0.023
Pointwise expectation         0.995         0.278          0.084          0.026

Table 6.4: Comparison of Monte Carlo and asymptotic estimates of ε = E_S[σ_max(I_d − U^T S^T S U)] for the Clarkson-Woodruff sketch. The quality of a sketch can be summarised by ε. The source dataset is the top 15 SNP dataset described in section 6.9.2 (n = 407,779, d = 17). The Tracy-Widom expectations (Theorem 6.7) match the Monte Carlo estimates more closely than the pointwise limit (Theorem 6.5). We are able to estimate the expected distortion factor ε for a particular sketch size k using the asymptotic theory.

Sketch size (k)               one million   ten million   one hundred million   one billion
Monte Carlo estimate of ε     0.00771       0.00247       0.00077               0.00025
Tracy-Widom expectation       0.00738       0.00233       0.00074               0.00023
Pointwise expectation         0.00826       0.00261       0.00082               0.00026

Table 6.5: Comparison of Monte Carlo and asymptotic estimates of ε = E_S[σ_max(I_d − U^T S^T S U)] for the Gaussian sketch. The quality of a sketch can be summarised by ε. The source dataset is the top 15 SNP dataset described in section 6.9.2 (n = 407,779, d = 17). The Tracy-Widom expectations (Theorem 6.7) match the Monte Carlo estimates more closely than the pointwise limit (Theorem 6.5). We are able to estimate the expected distortion factor ε for a particular sketch size k using the asymptotic theory.

6.10 Conclusion

We have investigated the use of sketching for approximate Bayesian subset selection. Approximation bounds are developed using ε-subspace embeddings. Data oblivious random projections are designed to output an ε-subspace embedding of the source dataset with high probability. We obtained asymptotic expressions for the embedding probability for the Gaussian, Hadamard and Clarkson-Woodruff sketches in terms of the Tracy-Widom law. Simulations showed that the asymptotic expression is accurate on large datasets. The asymptotic theory can be used to estimate the sketch size k needed to obtain an ε-subspace embedding with probability at least (1 − δ). Determination of the embedding probability Pr(S is an ε-subspace embedding for A) is a largely open question, and the asymptotic equivalence of the Gaussian, Hadamard and Clarkson-Woodruff sketches is of theoretical and practical interest.
The universality of the Tracy-Widom law is a topic of current research (Bao et al., 2015), and it is also of interest to determine the full set of data oblivious random projections that satisfy the Tracy-Widom limiting embedding probability. We do not think that asymptotic analysis can supplant the finite sample results that drive much of the existing work on randomised algorithms. Rather, we feel that asymptotic results are a useful complement that can provide answers to important questions that are all but unresolvable in a finite sample framework. A combination of asymptotic and non-asymptotic results could be useful for the development of more advanced algorithms.

We also investigated the range of ε needed to obtain a tolerable posterior approximation. Sketching has been proposed for approximating the posterior distribution over coefficients in the fixed model setting (Geppert et al., 2017). We found that ε needs to be much smaller than in the fixed model setting to maintain the important features of the target posterior over models. It appears that approximation of the integrated likelihood is a more difficult task than preservation of the normal likelihood conditional on a model.

7 Conclusion

As the thesis winds down, we can ease into the customary retrospection by cycling back to the introduction and reconsidering Box's loop (Figure 7.1). Box's loop breaks down a generic data analysis problem into a series of connected sub-problems involving model formulation, fitting, checking and revising. This staged workflow is described by Blei (2014) as build, compute, critique and repeat. The modularisation helped us to pin down the central research question of interest. The emergence of Big Data has led to the compute step becoming a bottleneck in practical applications. As datasets grow larger, the computational budget does not stretch as far and we are forced to adapt. New algorithms and methods are needed to facilitate statistical inference at scale. Our particular focus has been on computational methods for Bayesian analyses; however, we have encountered a number of general principles that will impact statistical computing research on Big Data problems. We have investigated how distributed computing, subsampling and randomised data compression can be used for efficient statistical computation.

A subtle yet fundamental distinction underpins how algorithms should be compared and assessed. This is the distinction between the number of floating point operations required by an algorithm and the typical execution time as measured by the end user (wall-clock time). Making such a differentiation is critical when allowing for parallel computation. A hardline accounting perspective is that the computational expense of an algorithm is defined by the number of floating point operations that are invoked, whether in parallel or not. A purely utilitarian viewpoint is that the only relevant factor is the time taken for the algorithm to return the end result, making wall-clock time the sole metric of interest.

Figure 7.1: Box's loop is a conceptual process model of scientific data analysis (Box, 1976). Box's loop defines a number of key phases (build, compute, critique and repeat) when approaching a data modelling problem. The compartmentalisation of the overall task highlights important tactical decisions that are involved in the statistical analysis lifecycle and aids the elicitation of broader strategic elements that influence the work. Adapted from Blei (2014).
Consider the task of calculating the mean of each variable given a large dataset of n observations on p variables. This can be executed on a single machine in O(np) operations. It is trivial to define an embarrassingly parallel algorithm for carrying out this task. Given that we distribute the job across s processors, we can expect the wall-clock time to drop by a factor of s, assuming minimal communication costs. The wall-clock problem is solved by an embarrassingly parallel algorithm; however, the raw number of floating point operations remains the same. Each of the s workers completes an O(np/s) task, leading to an overall expense of O(np) operations. A pragmatic solution exists, yet the deeper issue of the floating point operation cost is not addressed by using a divide and conquer approach. The underlying expense of an algorithm is of interest in theoretical computer science. Reflective of this is that early work on sketching considered the estimation of a mean (Cormode, 2011). A more thorough analysis would of course include memory, disk and communication costs; we choose to emphasise the floating point operation count as it is of the most relevance to our work. If we can fundamentally reduce the number of floating point operations required in the compute step and still obtain an acceptable result, we have made great headway for numerical computing with Big Data. To summarise, subsampling and sketching cut wall-clock time by reducing the underlying number of floating point operations. Divide and conquer approaches attack the wall-clock time issue by distributing the floating point calculation burden over a large number of workers. A very classical statistical perspective may gloss over such implementation details, but this is an important aspect of dealing with Big Data (Donoho, 2017; Bryan and Wickham, 2017). The distinction between computational cost and wall-clock time should be kept in mind when comparing the work across chapters.

Specialised algorithms can be necessary for the practical analysis of large datasets. Parallel processing, subsampling and random projection can act as useful components within algorithms for Big Data regression. Each broad approach has unique strengths and weaknesses, and the best tool for the job may change on a case by case basis. To be more formal, recall the generic problem set-up from the introduction. Suppose we have a dataset y of n observations with likelihood p(y|θ). The parameter θ ∈ R^d has prior p(θ). Bayesian analyses can be challenging on tall datasets as computational procedures can be intolerably slow. The typical bottleneck encountered is the need for repeated full likelihood evaluations p(y|θ), which carry an O(n) cost. This can grow to an infeasible burden for a sufficiently large number of observations n (Robert et al., 2018). A main goal in the thesis was to develop algorithms that minimise or eliminate O(n) calculations. The divide and conquer work in Chapter 2 splits the initial dataset into s subsets so that each worker only has tasks that are O(n/s). In the subsampling chapter we use subsamples of size m ≪ n to estimate the log likelihood in O(m) time. In the work on sketching we replace the full dataset of n observations with a compressed pseudo-dataset of k ≪ n observations. Numerical work on the sketched dataset demands O(k) operations. Existing work in the field is largely directed at sampling from the posterior distribution p(θ|y).
To carry out model selection it is desirable to have the integrated likelihood (model evidence) p(y) = ∫ p(y|θ) p(θ) dθ, and this was a motivating concern throughout the thesis. Chapter 2 uses divide and conquer methodology in order to compute the model evidence. Chapter 3 uses subsampling to bound the model evidence. Chapters 4, 5 and 6 examine how random projection can be used to approximate the least squares estimate θ̂ and the integrated likelihood p(y) for Gaussian linear models. The analysis of each technique required different mathematical methods and concepts.

In Chapter 2 we developed an embarrassingly parallel algorithm using Gibbs sampling and conditional conjugacy. The approach fits the split-apply-combine template shown in Figure 7.2. We propose to use Gibbs sampling in the apply stage, and to take the Gibbs sampler history as part of the output. The combine step involves pooling the Gibbs trajectories from each subset run. The initial partition of the data in the split step has a strong influence on the Monte Carlo variance in the combine step.

Figure 7.2: Template for embarrassingly parallel algorithms. The split step breaks the full dataset into s non-overlapping subsets. The illustration is for s = 2. Each subset is then allocated to a different machine on a cluster. During the apply step we apply conventional methodology to each data subset with no cross communication between workers. Each analysis is summarised by a consistent format of output. The s sets of output from the apply stage are then synthesised in the combine step to give the final result. In this design brief, the combine stage only involves the output from the apply step, and not the original dataset.

In Chapter 3 we motivated the use of subsampling for estimation of the integrated likelihood using the identity

log p(y) = E_{p(θ|y)}[log p(y|θ)] − D(p(θ|y) ‖ p(θ)).   (7.1)

The goodness of fit term E_{p(θ|y)}[log p(y|θ)] can be estimated using subsampling, and the penalty term D(p(θ|y) ‖ p(θ)) can be bounded using information theoretic arguments. The main contribution is an upper bound on the model evidence using the maximum entropy property of the normal distribution. Exact calculation of the model evidence becomes increasingly computationally demanding as n increases. Evidence bounds are attractive for large n problems as they can be expected to squeeze together as n increases under mild assumptions, and they can be estimated cheaply using control variate estimators.

Chapters 4, 5 and 6 considered sketching, a probabilistic data compression technique. As mentioned in the introduction, most of our work on sketching can be seen as a statistical evaluation of algorithms developed in the computer science and machine learning literature. We pried into the nature of the existing stochastic machinery behind randomised algorithms. Randomised algorithms are interesting from a statistical perspective as repeated application of the algorithm to the same dataset will produce different results. The distribution of the output is a measure of the quality of the algorithm. We established a number of asymptotic results regarding the distribution of the sketched output. In Chapter 4 we considered the distribution of least squares coefficients on the sketched dataset. In Chapter 6 we considered the probability of obtaining an ε-subspace embedding (Definition 6.1). Existing results on sketching algorithms are typically finite sample worst case bounds that can be pessimistic.
We took a different approach and tried to characterise typical behaviour under regularity conditions. It is important to be able to quantify the uncertainty attached to a randomised algorithm, and the results in Chapter 4 on the variance of the sketched coefficients and the results in Chapter 6 on the embedding probability are useful measures for end users. We found that the central limit theorem and the Tracy-Widom law were useful for analysing the behaviour of data oblivious sketching algorithms. Weak convergence is a useful concept in the analysis of randomised algorithms. We believe our results serve as a useful complement to the existing body of work in the area. Sketching algorithms can be more richly analysed using a combination of non-asymptotic and asymptotic results.

The fundamental trade-off between the computational budget and the accuracy of the calculation is a primary concern for computational statistics in the Big Data world. This is in some sense a no-free-lunch problem, where lowering the computational expense of a Monte Carlo method can be expected to increase the Monte Carlo error. Successful Big Data algorithms use available resources wisely. Embarrassingly parallel algorithms spread the floating point operation cost over a large number of nodes without grossly increasing communication costs. Subsampling based algorithms use control variates as a stabilising influence. Sketching algorithms use well designed random projections to give robust performance guarantees in terms of the compression ratio and the structure of the source dataset. Truly scalable methodology controls the error in conjunction with the cost as datasets increase in size, and this is an overarching concern in each piece of work. Much of the statistical analysis in this thesis has been directed at non-sampling based sources of error.

The divide and conquer approach for computing the integrated likelihood (Chapter 2) provides an interesting example of this challenge. We easily achieve a linear reduction in the analysis time given that we split the full dataset into s subsets and process them in parallel. However, the difficulty of the combine step increases with the number of subsets s. The variance of the kernel density estimator of the subposterior integral will be extremely high when s is large. As such, there is a limit to how much can be gained from parallel processing. Data augmentation can be used to address the scaling problem: by using Gibbs sampling in the intermediate apply step of the algorithm we can obtain a feasible combine step for large s. In Chapter 3, we find that although unbiased estimators of the log likelihood can be constructed using simple random sampling, the variance of such estimators explodes as n tends to infinity. Control variates are needed to obtain a scalable algorithm as n increases. A similar principle applies to sketching. A simple random sample of size k can be viewed as a sketch; however, it does not offer the same probabilistic guarantees as the Hadamard and Clarkson-Woodruff sketches as n tends to infinity. Algorithms need to be designed carefully in order to control the associated Monte Carlo or approximation error.

There are a number of research avenues that appear to be worth pursuing. As mentioned in Chapter 2, the split-apply-combine approach using data augmentation and Gibbs sampling can also be used for posterior sampling. In Chapter 2 we considered estimation of the model evidence.
We prescribe Gibbs sampling during the apply step and take the output to be the sequence of full conditional distributions. The combine step used the history of the subset Gibbs runs to estimate the integrated likelihood p(y). We can also make use of the Gibbs history to target the full dataset posterior p(θ|y) in the combine step: a self-normalised importance sampler targeting the true posterior distribution can be defined in the combine step, providing that the model satisfies conditional conjugacy constraints.

The work on subsampling based estimation of the log model evidence in Chapter 3 can also be taken in new directions. Exact calculation of the integrated likelihood appears to require a large number of floating point operations. If this cost is inescapable, a useful strategy is to instead bound the integrated likelihood. We built our strategy around the identity (7.1). Subsampling can be used to estimate the goodness of fit term E_{p(θ|y)}[log p(y|θ)]. It is of interest to determine other cheap estimators of the Kullback-Leibler penalty term D(p(θ|y) ‖ p(θ)).

There was an important common pattern to the analysis in Chapters 4 and 6. Recall that the Gaussian sketch is particularly mathematically tractable but is computationally demanding, whereas the Hadamard and Clarkson-Woodruff projections are computationally efficient but difficult to characterise mathematically. Our broad strategy was to analyse the Gaussian sketch and then establish asymptotic equivalence for the other projections. This was useful for developing guidelines for sketching applications. This general idea was suggested by Li et al. (2006); however, it appears to have gained little traction in the sketching literature. We found this line of reasoning to be highly effective, and anticipate that it can be used to address other questions concerning the behaviour of sketching algorithms. As in earlier chapters, suppose we have a large n × d source dataset and we take a sketch of size k where k ≪ n. It is of interest to extend the sketching central limit theorem to the 'Big Data' asymptotic regime:

n, d, k → ∞,   d/k → α ∈ (0, 1],   d/n → 0,   k/n → 0.   (7.2)

In the regime (7.2) the sketched dataset becomes an infinite dimensional random matrix. It is possible to establish weak convergence in general metric spaces (Van Der Vaart, 1998), and we aim to identify the appropriate framework to study sketching under the limit conditions in (7.2). The concept of joint asymptotic normality as espoused by Mallows (1972) appears to be useful for establishing the necessary convergence of the finite dimensional distributions. We anticipate that it will be necessary to treat the sketched dataset as a random bounded linear operator.

The connections between the Tracy-Widom law and the behaviour of data oblivious sketches may have deeper ramifications for statistical computation. Iterative sketching algorithms for least squares problems use a combination of approximate second order information from a sketch and exact gradient information. Iterative sketching algorithms can be interpreted as classical iterative methods with a random preconditioner (Gower and Richtárik, 2015; Pilanci and Wainwright, 2016). The iterates come with stochastic convergence guarantees to the optimal solution. It can be shown that the rate of convergence of the algorithm is governed by the eigenvalue distribution of the random preconditioner (Young, 1972).
Knowledge of the spectrum can be used to accelerate the rate of convergence of the algorithm through the addition of learning rate and momentum hyperparameters (Young, 1972). We can implement Chebyshev acceleration on the sketched linear system given that we know the asymptotic spectrum of the random preconditioner. We are currently investigating the acceleration of iterative sketching algorithms using the spectral theory developed in Chapter 6.

The Bayesian analysis of tall data can be challenging due to obstacles encountered in the compute step of Box's loop. We believe distributed computing, subsampling and random projection are all promising approaches for computationally efficient Bayesian inference. We hope that the algorithms and results here expand the suite of computational tools for analysts to use in practice.

References

Ailon, N. and Chazelle, B. (2009) The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39, 302–322.
Anderson, I. (1997) Combinatorial Designs and Tournaments. Oxford lecture series in mathematics and its applications. Clarendon Press.
Andrieu, C., Roberts, G. O. et al. (2009) The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37, 697–725.
Arnold, B. C., Castillo, E. and Sarabia, J. M. (1993) Conjugate exponential family priors for exponential family likelihoods. Statistics, 25, 71–77.
Astle, W. J., Elding, H., Jiang, T., Allen, D., Ruklisa, D., Mann, A. L., Mead, D., Bouman, H., Riveros-Mckay, F., Kostadima, M. A. et al. (2016) The allelic landscape of human blood cell trait variation and links to common complex disease. Cell, 167, 1415–1429.
Avron, H., Nguyen, H. and Woodruff, D. (2014) Subspace embeddings for the polynomial kernel. In Advances in Neural Information Processing Systems, 2258–2266.
Bai, Z., Fang, Z., Liang, Y.-C. and Fange, Z. (2014) Spectral theory of large dimensional random matrices. Singapore; Hackensack, N.J.
Bai, Z. and Silverstein, J. W. (2010) Spectral Analysis of Large Dimensional Random Matrices. Springer Series in Statistics. New York, NY: Springer New York, 2nd edn.
Baker, J., Fearnhead, P., Fox, E. B. and Nemeth, C. (2017) Control variates for stochastic gradient MCMC. arXiv preprint arXiv:1706.05439.
Banerjee, A., Dunson, D. B. and Tokdar, S. T. (2013) Efficient Gaussian process regression for large datasets. Biometrika, 100, 75–89.
Bao, Z., Pan, G. and Zhou, W. (2015) Universality for the largest eigenvalue of sample covariance matrices with general population. The Annals of Statistics, 43, 382–421.
Bardenet, R., Doucet, A. and Holmes, C. (2014) Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In International Conference on Machine Learning (ICML), 405–413.
Bardenet, R., Doucet, A. and Holmes, C. (2015) On Markov chain Monte Carlo methods for tall data. arXiv preprint 1505.02827.
Bardenet, R., Doucet, A. and Holmes, C. (2017) On Markov chain Monte Carlo methods for tall data. The Journal of Machine Learning Research, 18, 1515–1557.
Bardenet, R. and Maillard, O.-A. (2015) A note on replacing uniform subsampling by random projections in MCMC for linear regression of tall datasets. HAL preprint 01248841.
Becker, S., Kawas, B., Petrik, M. and Ramamurthy, K. (2015) Robust partially-compressed least-squares. arXiv preprint arXiv:1510.04905v1.
Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57, 289–300.
Bennett, C. H. (1976) Efficient estimation of free energy differences from Monte Carlo data. Journal of Computational Physics, 22, 245–268.
Bernardo, J., Bayarri, M. and Berger, J. (2011) Bayesian Statistics 9. Oxford science publications. OUP Oxford.
Bernardo, J. and Smith, A. (2006) Bayesian Theory. Wiley Series in Probability and Statistics. John Wiley & Sons Canada, Limited.
Besag, J. (1989) A candidate's formula: A curious result in Bayesian prediction. Biometrika, 76, 183.
Bhatia, R. (1996) Matrix Analysis. Springer.
Bierkens, J., Fearnhead, P. and Roberts, G. (2016) The zig-zag process and super-efficient sampling for Bayesian analysis of big data. arXiv preprint arXiv:1607.03188.
Billingsley, P. (1968) Convergence of probability measures. Wiley.
Billingsley, P. (1999) Convergence of probability measures. Wiley Series in Probability and Statistics. New York: Wiley, 2nd edn.
Blei, D. M. (2014) Build, compute, critique, repeat: Data analysis with latent variable models. Annual Review of Statistics and Its Application, 1, 203–232.
Blei, D. M., Kucukelbir, A. and McAuliffe, J. D. (2017) Variational inference: A review for statisticians. Journal of the American Statistical Association, 112, 859–877.
Bottolo, L. and Richardson, S. (2010) Evolutionary stochastic search for Bayesian model exploration. Bayesian Analysis, 5, 583–618.
Box, G. E. P. (1976) Science and statistics. Journal of the American Statistical Association, 71, 791–799.
Bryan, J. and Wickham, H. (2017) Data science: A three ring circus or a big tent? Journal of Computational and Graphical Statistics, 26, 784–785.
Cannings, T. I. and Samworth, R. J. (2015) Random projection ensemble classification. arXiv preprint arXiv:1504.04595.
Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P. and Riddell, A. (2017) Stan: A probabilistic programming language. Journal of Statistical Software, 76.
Casella, G., Girón, F. J., Martínez, M. L., Moreno, E. et al. (2009) Consistency of Bayesian procedures for variable selection. The Annals of Statistics, 37, 1207–1228.
Centers for Disease Control and Prevention (2013) Behavioral Risk Factor Surveillance System. Available online at https://www.cdc.gov/brfss/annual_data/annual_2013.html.
Chatterjee, D., Maitra, T. and Bhattacharya, S. (2018) A short note on almost sure convergence of Bayes factors in the general set-up. The American Statistician, 0, 1–4.
Chib, S. (1995) Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90, 1313–1321.
Chib, S. and Kuffner, T. A. (2016) Bayes factor consistency. arXiv preprint arXiv:1607.00292.
Clarkson, K. L. and Woodruff, D. P. (2013) Low rank approximation and regression in input sparsity time. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, 81–90. ACM.
Cormode, G. (2011) Sketch techniques for approximate query processing. Foundations and Trends in Databases.
Dempster, A. P. (1997) The direct use of likelihood for significance testing. Statistics and Computing, 7, 247–252.
Dhillon, P., Lu, Y., Foster, D. P. and Ungar, L. (2013) New subsampling algorithms for fast least squares regression. In Advances in Neural Information Processing Systems, 360–368.
DiCiccio, T. J., Kass, R. E., Raftery, A. and Wasserman, L. (1997) Computing Bayes factors by combining simulation and asymptotic approximations. Journal of the American Statistical Association, 92, 903–915.
Dickey, J. M. (1971) The weighted likelihood ratio, linear hypotheses on normal location parameters. The Annals of Mathematical Statistics, 204–223.
Donoho, D. (2017) 50 years of data science. Journal of Computational and Graphical Statistics, 26, 745–766.
Drineas, P., Mahoney, M. W. and Muthukrishnan, S. (2006) Sampling algorithms for l2 regression and applications. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, 1127–1136. Society for Industrial and Applied Mathematics.
Eaton, M. L. (2007) Chapter 8: The Wishart Distribution. In Multivariate Statistics, vol. 53 of Lecture Notes Monograph Series, 302–333. Ohio: Institute of Mathematical Statistics.
Edelman, A. (1988) Eigenvalues and condition numbers of random matrices. SIAM Journal on Matrix Analysis and Applications, 9, 543–560.
Efron, B. and Hastie, T. (2016) Computer Age Statistical Inference. Institute of Mathematical Statistics Monographs. Cambridge University Press.
Fahrmeir, L. and Tutz, G. (1994) Multivariate statistical modelling based on generalized linear models. Springer series in statistics. Springer-Verlag.
Flegal, J. M., Hughes, J., Vats, D. and Dai, N. (2017) mcmcse: Monte Carlo Standard Errors for MCMC. Riverside, CA, Denver, CO, Coventry, UK, and Minneapolis, MN. R package version 1.3-2.
Friel, N. and Pettitt, A. N. (2008) Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70, 589–607.
Friel, N. and Wyse, J. (2012) Estimating the evidence - a review. Statistica Neerlandica, 66, 288–308.
Gelfand, A. E. and Dey, D. K. (1994) Bayesian model choice: asymptotics and exact calculations. Journal of the Royal Statistical Society: Series B (Methodological), 501–514.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. and Rubin, D. B. (2014) Bayesian data analysis. Boca Raton: Chapman & Hall, 3 edn.
Gelman, A. and Vehtari, A. (2017) Comment: Consensus Monte Carlo using expectation propagation. Brazilian Journal of Probability and Statistics, 31, 692–696.
Geman, S. (1980) A limit theorem for the norm of random matrices. The Annals of Probability, 252–261.
Geppert, L. N., Ickstadt, K., Munteanu, A., Quedenfeld, J. and Sohler, C. (2017) Random projections for Bayesian regression. Statistics and Computing, 27, 79–101.
Geyer, C. (1996) Estimating normalizing constants and reweighting mixtures in Markov chain Monte Carlo. Tech. Rep. 568, University of Minnesota.
Gower, R. M. and Richtárik, P. (2015) Randomized iterative methods for linear systems. arXiv preprint arXiv:1506.03296.
Green, P. J., Łatuszyński, K., Pereyra, M. and Robert, C. P. (2015) Bayesian computation: a summary of the current state, and samples backwards and forwards. Statistics and Computing, 25, 835–862.
Greene, W. (1997) Econometric Analysis. Prentice-Hall international editions. Prentice Hall.
Guhaniyogi, R. and Dunson, D. B. (2015) Bayesian compressed regression. Journal of the American Statistical Association, 110, 1500–1514.
Gutiérrez-Peña, E., Smith, A., Bernardo, J. M., Consonni, G., Veronese, P., George, E., Girón, F., Martínez, M., Letac, G. and Morris, C. N. (1997) Exponential and Bayesian conjugate families: review and extensions. Test, 6, 1–90.
Halko, N., Martinsson, P. G. and Tropp, J. A. (2011) Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53, 217–288.
Howie, B. N., Donnelly, P. and Marchini, J. (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet, 5, e1000529.
Huang, Z. and Gelman, A. (2005) Sampling for Bayesian computation with large datasets. Tech. rep., University of Columbia.
Huber, P. J. (1973) Robust regression: Asymptotics, conjectures and Monte Carlo. The Annals of Statistics, 1, 799–821. URL https://doi.org/10.1214/aos/1176342503.
Jacob, P. E., Thiery, A. H. et al. (2015) On nonnegative unbiased estimators. The Annals of Statistics, 43, 769–784.
Jahn, R., Dunne, B. and Nelson, R. (1987) Engineering anomalies research. Journal of Scientific Exploration.
Janson, S. (1988) Some pairwise independent sequences for which the central limit theorem fails. Stochastics, 23, 439–448.
Johnstone, I. M. (2001) On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 295–327.
— (2006) High dimensional statistical inference and random matrices. arXiv preprint arXiv:0611589.
Johnstone, I. M., Ma, Z., Perry, P. O. and Shahram, M. (2014) RMTstat: Distributions, Statistics and Tests derived from Random Matrix Theory. R package version 0.3.
Jordan, M. I. (2013) On statistics, computation and scalability. Bernoulli, 19, 1378–1390.
Kass, R. E. and Raftery, A. E. (1995) Bayes factors. Journal of the American Statistical Association, 90, 773–795.
Keener, R. W. (2013) Theoretical Statistics. Springer.
Korattikara, A., Chen, Y. and Welling, M. (2014) Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In International Conference on Machine Learning, 181–189.
Kuhn, M. (2008) Building predictive models in R using the caret package. Journal of Statistical Software, Articles, 28, 1–26.
Lehoucq, R. B., Sorensen, D. C. and Yang, C. (1998) ARPACK users' guide: solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods, vol. 6. SIAM.
Lewis, S. M. and Raftery, A. E. (1997) Estimating Bayes factors via posterior simulation with the Laplace-Metropolis estimator. Journal of the American Statistical Association, 92, 648–655.
Li, P., Hastie, T. J. and Church, K. W. (2006) Very sparse random projections. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 287–296. ACM.
Loève, M. (1977) Probability Theory. Springer.
Ma, P., Mahoney, M. W. and Yu, B. (2015) A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, 861–911.
Ma, P. and Sun, X. (2015) Leveraging for big data regression. Wiley Interdisciplinary Reviews: Computational Statistics, 7, 70–76.
Ma, Z. (2012) Accuracy of the Tracy–Widom limits for the extreme eigenvalues in white Wishart matrices. Bernoulli, 18, 322–359.
Maclaurin, D. and Adams, R. P. (2014) Firefly Monte Carlo: Exact MCMC with Subsets of Data. arXiv preprint arXiv:1403.5693.
Mahoney, M. (2011) Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3, 123–224.
Mahoney, M. and Drineas, P. (2016) Structural properties underlying high-quality Randomized Numerical Linear Algebra algorithms. In Handbook of Big Data (eds. P. Bühlmann, P. Drineas, M. Kane and M. van der Laan), 137–154. Chapman and Hall.
Mallows, C. (1972) A note on asymptotic joint normality. The Annals of Mathematical Statistics, 508–515.
McCulloch, R. E. and Rossi, P. E. (1992) Bayes factors for nonlinear hypotheses and likelihood distributions. Biometrika, 79, 663–676.
Meng, X. (2014) Randomized Algorithms for Large-scale Strongly Over-determined Linear Regression Problems. Ph.D. thesis, Stanford University, Stanford, California, United States.
Meng, X. and Mahoney, M. M. (2013) Low-distortion Subspace Embeddings in Input-sparsity Time and Applications to Robust Linear Regression. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, 91–100. ACM.
Meng, X.-L. and Wong, W. H. (1996) Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica, 831–860.
Minsker, S., Srivastava, S., Lin, L. and Dunson, D. B. (2017) Robust and scalable Bayes via a median of subset posterior measures. The Journal of Machine Learning Research, 18, 4488–4527.
Neiswanger, W., Wang, C. and Xing, E. (2013) Asymptotically exact, embarrassingly parallel MCMC. arXiv preprint arXiv:1311.4780.
Nelson, J. and Nguyễn, H. L. (2013) OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, 117–126. IEEE.
Newton, M. A. and Raftery, A. E. (1994) Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society: Series B (Methodological), 3–48.
Papaspiliopoulos, O. (2009) A methodological framework for Monte Carlo probabilistic inference for diffusion processes. Working paper, University of Warwick, Coventry.
Phillips, J. M. (2016) Coresets and Sketches. arXiv preprint arXiv:1601.00617.
Pilanci, M. and Wainwright, M. J. (2016) Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares. Journal of Machine Learning Research, 17, 1842–1879.
Pollock, M., Fearnhead, P., Johansen, A. M. and Roberts, G. O. (2016) The scalable Langevin exact algorithm: Bayesian inference for big data. arXiv preprint arXiv:1609.03436.
Polson, N. G., Scott, J. G. and Windle, J. (2013) Bayesian inference for logistic models using Pólya-gamma latent variables. Journal of the American Statistical Association, 108, 1339–1349.
Portnoy, S. (1986) On the central limit theorem in R^p when p tends to infinity. Probability Theory and Related Fields, 73, 571–583.
Pruss, A. R. and Szynal, D. (2000) On the central limit theorem for negatively correlated random variables with negatively correlated squares. Stochastic Processes and their Applications, 87, 299–309.
Quiroz, M., Kohn, R., Villani, M. and Tran, M.-N. (2018) Speeding up MCMC by efficient data subsampling. Journal of the American Statistical Association, 0, 1–35.
Quiroz, M., Tran, M.-N., Villani, M., Kohn, R. and Dang, K.-D. (2016) The block-Poisson estimator for optimally tuned exact subsampling MCMC. ArXiv e-prints.
Raftery, A. E. (1995) Bayesian model selection in social research. Sociological Methodology, 25, 111–163.
— (1996) Hypothesis testing and model selection via posterior simulation. In Markov chain Monte Carlo in practice, 163–188.
Raskutti, G. and Mahoney, M. (2014) A Statistical Perspective on Randomized Sketching for Ordinary Least-Squares. arXiv preprint arXiv:1406.5986.
Rhee, C.-h. and Glynn, P. W. (2015) Unbiased estimation with square root convergence for SDE models. Operations Research, 63, 1026–1043.
Robert, C. P. (2007) The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer Texts in Statistics. New York, NY: Springer New York, 2nd edn.
Robert, C. P. and Casella, G. (2010) Monte Carlo Statistical Methods. Springer.
Robert, C. P., Elvira, V., Tawn, N. and Wu, C. (2018) Accelerating MCMC algorithms. Wiley Interdisciplinary Reviews: Computational Statistics, 10, 1–14.
Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C. and Müller, M. (2011) pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77.
Roosta-Khorasani, F. and Mahoney, M. W. (2016) Sub-Sampled Newton Methods I: Globally Convergent Algorithms. arXiv preprint arXiv:1601.04737.
Sarlos, T. (2006) Improved approximation algorithms for large matrices via random projections. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), 143–152. IEEE.
Särndal, C.-E., Swensson, B. and Wretman, J. H. (1992) Model assisted survey sampling. Springer series in statistics. New York: Springer-Verlag.
Schwarz, G. (1978) Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
Scott, S. L. (2017) Comparing consensus Monte Carlo strategies for distributed Bayesian computation. Brazilian Journal of Probability and Statistics, 31, 668–685.
Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H., George, E. and McCulloch, R. (2013) Bayes and big data: the consensus Monte Carlo algorithm. In EFaBBayes 250 conference, vol. 16.
Searle, S. R. (1997) Linear Models. New Jersey: Wiley-Interscience.
Shah, R. D. and Meinshausen, N. (2013) Min-wise hashing for large-scale regression and classification with sparse data. arXiv preprint arXiv:1308.1269.
Shirts, M. R., Bair, E., Hooker, G. and Pande, V. S. (2003) Equilibrium free energies from nonequilibrium measurements using maximum-likelihood methods. Physical Review Letters, 91, 140601.
Silverman, B. W. (1986) Density estimation for statistics and data analysis. Monographs on statistics and applied probability (Series). London; New York: Chapman and Hall.
Silverstein, J. W. (1985) The smallest eigenvalue of a large dimensional Wishart matrix. The Annals of Probability, 13, 1364–1368. URL https://doi.org/10.1214/aop/1176992819.
Skilling, J. et al. (2006) Nested sampling for general Bayesian computation. Bayesian Analysis, 1, 833–859.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P. and Van Der Linde, A. (2002) Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64, 583–639.
Srivastava, S., Cevher, V., Tran-Dinh, Q., Dunson, D. B. and de Lausanne, F. (2015) WASP: Scalable Bayes via barycenters of subset posteriors. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 912–920.
Stan Development Team (2018) Stan Modeling Language Users Guide and Reference Manual.
Tanner, M. A. and Wong, W. H. (1987) The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82, 528–540.
Terrell, G. R. and Scott, D. W. (1992) Variable kernel density estimation. The Annals of Statistics, 20, 1236–1265.
Thanei, G.-A., Heinze, C. and Meinshausen, N. (2017) Random projections for large-scale regression. arXiv preprint 1701.05325.
Tracy, C. A. and Widom, H. (1994) Level-spacing distributions and the Airy kernel. Communications in Mathematical Physics, 159, 151–174.
Tropp, J. A. (2011) Improved analysis of the subsampled randomized Hadamard transform. Advances in Adaptive Data Analysis, 3, 115–126.
Van Der Vaart, A. (1998) Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics, 3. Cambridge University Press.
Varian, H. R. (2014) Big data: New tricks for econometrics. Journal of Economic Perspectives, 28, 3–28.
Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. New York: Springer, fourth edn. ISBN 0-387-95457-0.
Venkatasubramanian, S. and Wang, Q. (2011) The Johnson-Lindenstrauss transform: an empirical study. In 2011 Proceedings of the Thirteenth Workshop on Algorithm Engineering and Experiments (ALENEX), 164–173. SIAM.
Wagner, W. (1987) Unbiased Monte Carlo evaluation of certain functional integrals. Journal of Computational Physics, 71, 21–33.
Walker, S., Damien, P. and Lenk, P. (2004) On priors with a Kullback-Leibler property. Journal of the American Statistical Association, 99, 404–408.
Walschap, G. (2015) Multivariable Calculus and Differential Geometry. De Gruyter.
Wang, X. and Dunson, D. B. (2013) Parallelizing MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605.
Wasserstein, R. L. and Lazar, N. A. (2016) The ASA's statement on p-values: Context, process, and purpose. The American Statistician, 70, 129–133.
Welling, M. and Teh, Y. W. (2011) Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 681–688.
Wetzels, R., Grasman, R. P. and Wagenmakers, E.-J. (2010) An encompassing prior generalization of the Savage–Dickey density ratio. Computational Statistics & Data Analysis, 54, 2094–2102.
White, H. (1984) Asymptotic theory for econometricians. Economic Theory, Econometrics and Mathematical Economics Series. Academic Press.
Wickham, H. (2014) nycflights13: Data about flights departing NYC in 2013. R package version 0.1.
Woodruff, D. P. (2014) Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10, 1–157.
Yang, J., Meng, X. and Mahoney, M. W. (2015a) Implementing randomized matrix algorithms in parallel and distributed environments. arXiv preprint arXiv:1502.03032.
Yang, T., Zhang, L., Lin, Q. and Jin, R. (2015b) Fast sparse least-squares regression with non-asymptotic guarantees. arXiv preprint arXiv:1507.05185.
Young, D. M. (1972) Second-degree iterative methods for the solution of large linear systems. Journal of Approximation Theory, 5, 137–148.
Zhou, S., Ligett, K. and Wasserman, L. (2009) Differential privacy with compression. In Information Theory, 2009. ISIT 2009. IEEE International Symposium on, 2718–2722. IEEE.