Statistical properties of sketching algorithms

Summary Sketching is a probabilistic data compression technique that has been largely developed by the computer science community. Numerical operations on big datasets can be intolerably slow; sketching algorithms address this issue by generating a smaller surrogate dataset. Typically, inference proceeds on the compressed dataset. Sketching algorithms generally use random projections to compress the original dataset, and this stochastic generation process makes them amenable to statistical analysis. We argue that the sketched data can be modelled as a random sample, thus placing this family of data compression methods firmly within an inferential framework. In particular, we focus on the Gaussian, Hadamard and Clarkson–Woodruff sketches and their use in single-pass sketching algorithms for linear regression with huge samples. We explore the statistical properties of sketched regression algorithms and derive new distributional results for a large class of sketching estimators. A key result is a conditional central limit theorem for data-oblivious sketches. An important finding is that the best choice of sketching algorithm in terms of mean squared error is related to the signal-to-noise ratio in the source dataset. Finally, we demonstrate the theory and the limits of its applicability on two datasets.


Introduction
Sketching is a general probabilistic data compression technique designed for Big Data applications (Cormode, 2011).Even routine calculations can be prohibitively computationally expensive on massive datasets.Computation time can be reduced to an acceptable level by allowing for some approximation error in the results.Sketching algorithms relax the computational task by generating a compressed version of the original dataset which then serves as a surrogate for calculations.The compressed dataset is referred to as a sketch as it acts as a compact representation of the full dataset.Sketching algorithms use a randomised compression stage which makes them interesting from a statistical viewpoint.Sketching algorithms for linear regression have attracted significant attention in the numerical linear algebra and theoretical computer science communities (Woodruff, 2014;Mahoney, 2011).In this paper we investigate the statistical properties of sketched regression algorithms, a perspective which has received little attention up to now.
To describe sketched regression in more detail, we first assume the data consists of a nlength response vector y and a n × p matrix of covariates, X which is of full rank.It is assumed that n > p.The objective is to find the optimal least squares coefficients.Given sufficient computational resources, these could be computed exactly as The subscript F is used to indicate the connection to the full dataset.Only two quantities are needed in order to determine β F , the Gram matrix X T X, and the marginal associations X T y.Calculation of X T X requires O(np 2 ) operations while computation of X T y needs only O(np) calculations.There are two broad methods for sketched regression, complete sketching and partial sketching.Complete sketching is based on approximating both X T X and X T y, whereas partial sketching only approximates the Gram matrix.
Sketching algorithms use random linear mappings to reduce the size of the dataset from n to k observations.The random linear mapping can be represented as a k × n sketching matrix S. Complete sketching generates a k-length sketched response vector y and a k × p matrix of sketched predictors X.The sketched data are computed through the linear mappings y = Sy and X = SX.Partial sketching only generates a k × p matrix of sketched covariates X.We again use the random mapping X = SX.
The complete sketching estimator, β S , is defined as the least squares coefficients using the sketched responses and predictors, β S = ( X T X) −1 X T y. (1) The partial sketching estimator, β P , is defined as The key difference between (1) and ( 2) is that the partial sketched estimator β P is constructed using the exact marginal associations X T y.Given the sketched data, computation of β S or β P requires only O(kp 2 ) operations, compared with the O(np 2 ) required for β F .The estimand within a sketching algorithm is the optimal coefficient vector β F .Sketching algorithms have the property that given a fixed k, the approximation error β S − β F 2 or β P − β F 2 remains probabilistically bounded even as n → ∞.Designing estimators for approximate computation with such properties is very difficult, and is a common goal in the development of techniques for Big Data (Bardenet & Maillard, 2015;Phillips, 2016).The favourable scaling properties of sketching algorithms are a critical factor in making them stand apart from simple subsampling approaches, where it can be difficult to establish universal worst case bounds for large n (Drineas et al., 2006;Ma et al., 2015).The fact that sketching algorithms provide finite k guarantees for arbitrarily large n is a major reason they have received so much attention in the computer science community.
There is a large literature concerned with designing appropriate distributions for the random sketching matrix S. Our focus is on data-oblivious random projections, where the distribution of the sketching matrix is not a function of the source data (y, X).An example is the Gaussian sketch, where each element is independently distributed as a N (0, 1/k) variate.We also consider the Hadamard sketch and the Clarkson-Woodruff sketch, random projections that exploit structure and sparsity for computational efficiency.
Most existing results on the accuracy of sketching are universal worst case bounds (Woodruff, 2014;Mahoney & Drineas, 2016).This is typical for randomised algorithms, however a more detailed error analysis can provide important insights (Halko et al., 2011).We investigate the statistical properties of β P and β S when using data oblivious sketches.An important finding is upper and lower bounds on the relative efficiency of complete sketching to partial sketching in terms of the signal to noise ratio in the source dataset.The statistical analysis also allows the construction of exact confidence intervals for the Gaussian sketch, and asymptotic confidence intervals for other random projections, paving the way for their wider use in the statistical community interested in Big Data methods.
We start by reviewing the existing literature on sketching algorithms before investigating the statistical properties in more detail.The analysis departs from usual statistical practice, in that no model is assumed for the source dataset [y, X].The source dataset is treated as fixed, and distributional results are taken with respect to the random sketching matrix.

Background and Related Work
Before proceeding, it is worth mentioning alternatives to sketching, in particular iterative methods for calculating the least squares coefficients β F .These include coordinate descent or stochastic gradient methods.Iterative methods are guaranteed to converge to β F under very mild conditions.These iterative techniques assume that the entire dataset can stored in memory in a single location, or require regular communication if the full dataset is distributed across multiple sites.Sketching algorithms are not burdened by these memory and communication costs, with the drawback of no convergence guarantees to β F .Connections to iterative methods are postponed until the discussion, the focus for now is on the single pass estimators β S and β P .
The purpose of this section is to review the existing theoretical framework for sketching algorithms.Sketching algorithms are largely motivated through worst case guarantees.We recap how these bounds can be developed before studying the statistical properties of the sketched estimators.
It will be helpful to define a number of quantities related to the full dataset before moving on.Let These terms summarise the goodness of fit of the model.The total, residual and model sum of squares are given by T SS F , RSS F and M SS F respectively, with T SS F = M SS F +RSS F .The proportion of variance explained by the model is given by R 2 F .These values will be important in characterising the behaviour of β S and β P .

Worst Case bounds
A key concept in the construction of sketching algorithms is the notion of an -subspace embedding (Woodruff, 2014;Meng & Mahoney, 2013;Yang et al., 2015a).
Theorem 1. -subspace embedding.For a given n × d matrix A, we call a k × n matrix S an -subspace embedding for A, if for all Speaking broadly, an -subspace preserves the linear structure of the columns of the original dataset up to some multiplicative (1 ± ) factor.In particular, if is small, the linear mapping S approximately preserves the covariance structure of the source dataset.Most theoretical arguments for sketching algorithms are predicated on the idea that the sketching matrix S is an -subspace embedding for the source dataset.The general notion is that it is possible to use a linear mapping S that reduces the sample size from n to k whilst preserving much of the linear information in the full dataset.
The issue of how to generate -subspace embeddings is deferred until section 3.2, the present focus will be on the utility of -subspace embeddings for linear regression problems.For now, assume that we have some method for generating -subspace embeddings for the source data matrix A. It will be convenient to refer to A = SA as an -subspace embedding of A if S is an -subspace embedding for A. As regression is the focus from this point forward, we will define the source data matrix as A = [y, X], the sketched data matrix as A = [ y, X] and set d = p + 1.
The complete sketched estimator β S is given by the least squares coefficients using the sketched responses y and the sketched predictors X, An -subspace embedding is useful as it relates the sketched optimisation problem to the full dataset optimisation problem.
If is small, minimising the sum of squared residuals on the sketched dataset is similar to minimising the sum of squared residuals on the full dataset.If this is the case, it can be expected that β S will be close to β F .It is possible to establish the concrete bounds, that if A is an -subspace embedding of A (Sarlos, 2006), where σ min (X) represents the smallest singular value of the design matrix X.A very similar argument can be used to motivate the partial sketched estimator β P .Existing bounds for the partial sketch focus on the prediction error Xβ P − Xβ F 2 2 (Becker et al., 2015;Pilanci & Wainwright, 2016).To make a direct comparison to (3) we establish a bound on the coefficient error Theorem 1. Suppose that X is an -subspace embedding of X with < 0.5.Then the following bound holds, For proof see the supplementary material.The mild requirement that < 0.5 is imposed so that the bound matches the functional form of the complete sketching bound (3).Comparing the partial sketching bound to (3), we see that the tightness of the bound is controlled by the model sum of squares as opposed to the residual sum of squares.This suggests that the signal to noise ratio in the source dataset will be important when selecting which sketched estimator to use.A naive conclusion is that complete sketching is preferred when RSS F < 4M SS F , or equivalently R 2 F > 0.25.Such a result is hardly prescriptive, as the worst case bound is not necessarily indicative of expected performance.A second point of interest is that if the k × n matrix S is an -subspace embedding for A = [y, X], it is also an -subspace embedding for X.This suggests that it is reasonable to compute both β P and β S from a single sketch, although it is not clear how to combine the estimators into a single point estimator.These issues will be explored in more depth by examining the statistical properties of both complete and partial sketching.Before moving on to the statistical analysis we review some of the existing methods for generating -subspace embeddings.

Sketching time
Required sketch size k Gaussian sketch Table 1: Properties of different data oblivious random projections (Woodruff, 2014).The third column refers to the necessary sketch size k to obtain an -subspace embedding for an arbitrary n × d source dataset with at least probability (1 − δ).

Sketches
There are two general categories of distributions for the random matrix S, data aware random projections and data oblivious random projections.A data aware random projection uses information in the source data y, X to generate S. In contrast, a data oblivious random projection can be sampled without knowledge of y or X.Data oblivious projections are designed to produce -subspace embeddings for an arbitrary source data matrix with high probability.
Our main focus is on data oblivious random projections.
The Gaussian sketch was one of the first projections proposed for sketched regression (Sarlos, 2006).Recall that a Gaussian sketch is formed by independently sampling each element of S from a N (0, 1/k) distribution.The drawback of the Gaussian sketch is that computation of the sketched data is quite demanding, taking O(npk) operations.As such, there has been work on designing more computationally efficient random projections.The Hadamard sketch and the Clarkson-Woodruff sketch are two examples of more efficient methods for generating -subspace embeddings.
The Hadamard sketch is a structured random matrix (Ailon & Chazelle, 2009).The sketching matrix is formed as S = ΦHD/ √ k, where Φ is a k × n matrix and H and D are both n × n matrices.The fixed matrix H is a Hadamard matrix of order n.A Hadamard matrix is a square matrix with elements that are either +1 or −1 and orthogonal rows.Hadamard matrices do not exist for all integers n, the source dataset can be padded with zeroes so that a conformable Hadamard matrix is available.The random matrix D is a diagonal matrix where each nonzero element is an independent Rademacher random variable.The random matrix Φ subsamples k rows of H with replacement.The structure of the Hadamard sketch allows for fast matrix multiplication, reducing calculation of the sketched dataset to O(nd log k) operations.
The Clarkson-Woodruff sketch is a sparse random matrix (Clarkson & Woodruff, 2013).The projection can be represented as the product of two independent random matrices, S = ΓD, where Γ is a random k × n matrix and D is a random n × n matrix.The matrix Γ is formed by choosing one element in each column independently and setting the entry to +1.The matrix D is a diagonal matrix where each nonzero element is an independent Rademacher random variable.This results in a sparse S, where there is only one nonzero entry per column.The sparsity of the Clarkson-Woodruff sketch speeds up matrix multiplication, dropping the complexity of generating the sketched dataset to O(nd).
Figure 1 shows examples of the three sketches for k = 32, n = 36.The sketches are discussed in more detail in the supplementary material.
Data oblivious sketches are designed to give an -subspace embedding for an arbitrary source dataset with at least probability (1 − δ).Sketching algorithms are appealing for large n problems as the required k to attain the (δ, ) bound is independent of n for the Gaussian and Clarkson-Woodruff sketches, and very weakly dependent on n for the Hadamard sketch.Table 1 summarises existing results on the necessary k to attain the ( , δ) bound.Probabilistic worst case bounds for sketched regression are formed by noting that if a sketch produces an -subspace embedding with probability at least (1 − δ), then the bounds in section 2 must hold with probability at least (1 − δ).Woodruff (2014) gives an excellent survey of work in this area.
As mentioned, data aware random projections can also be used to generate -subspace embeddings.Existing data aware projections perform weighted sampling with replacement from the source dataset.As such, data aware sketching methods are closely related to resampling methods such as the bootstrap and the jacknife (Ma & Sun, 2015).We focus on data oblivious random projections, where there is no direct connection to resampling methods.The Gaussian sketch is mathematically tractable, and it is possible to establish a number of exact finite sample results regarding the performance of the sketched estimators.In the next section, we obtain the exact distribution of β S and the bias and variance of β P .This provides guidance on issues regarding the relative efficiency of complete to partial sketching.

Complete Sketching
In this section we will develop the distribution of β S when using a Gaussian sketch.As mentioned previously, all results treat {y, X} as fixed.The variability in β S is solely due to the use of the random sketching matrix S. Let a j refer to the jth row in the sketched data matrix A = [ y, X] for j = 1, . . ., k.Similarly, let s j denote the jth row in the sketching matrix S.
The sketched dataset consists of k random units { y j , x j } k j=1 .The jth sketched response is given by y j = s j y, and the jth sketched predictor is calculated as x j = s j X for j = 1, . . ., k.The k sketched instances are independently distributed, because rows of the sketching matrix are independent.
We take an indirect route to find the distribution of β S , by focusing on the distribution of the sketched data { y, X} conditional on the original dataset {y, X}.The initial step is to decompose the joint distribution on the sketched responses and predictors as the product of a marginal and conditional distribution.Specifically, p( y, X|y, X) = p( y| X, y, X)p( X|y, X).
It can be shown that p( y| X, y, X)p( X|y, X) has the structure of a hierarchical Gaussian linear model.We first show that the sketched dataset has a multivariate normal distribution, conditional on the source dataset.This follows as the sketched dataset can be expressed as a linear combination of Gaussian random variables.Specifically, row j in the sketched dataset is given by a j = ( y j , x j ) = s j A. To be clear, it is helpful to express ( y j , x j ) as a column vector Conditional on A = [y, X], A T s T j is a linear combination of independent Gaussians as s T j ∼ N (0, I d /k).As affine transformations of Gaussians are also multivariate normal, ( y j , x j ) must then be jointly normally distributed, conditional on the source data (y, X).It is easily shown that the joint distribution of the sketched responses and predictors is then , independently for j = 1, . . ., k.
Standard results on multivariate normals give that the conditional distribution of y j given x j is also normal.A routine calculation shows that the conditional mean is related to The subscript S is used on the expectation operator to emphasise that only random quantity is the sketching matrix.The conditional variance is related to the prediction error on the source dataset RSS F , var S ( The subscript S is again used to recognise that the source of the variance is the random sketching matrix, the source dataset is fixed.The step in the second line follows from sum of squares partitions in linear models (Searle, 1997, Chapter 3).Therefore, the conditional distribution of y j given the sketched predictors x j and the source dataset y, X is This is the exact form of a standard Gaussian linear model.The distribution p( X|y, X) is easily obtained as the marginal distribution of x j is also multivariate normal, The sketching process can be described using the following hierarchical model, A Gaussian sketch effectively simulates a series of observations from a Gaussian linear model parametrised in terms of β F and σ 2 F , where the design matrix has a matrix normal distribution.We now turn to the distribution of β S .The distribution of β S conditional on the sketched predictors follows immediately from standard results on linear models (Searle, 1997, Chapter 3).
To obtain the marginal distribution of β S it is necessary to integrate over the random sketched design matrix X.From properties of the normal distribution (Eaton, 2007), it is possible to show ( As seen in equation ( 5), β S is normally distributed when conditioned on the random Inverse-Wishart matrix ( X T X) −1 .The marginal distribution of β S can then be described using the Normal Inverse-Wishart distribution (Gelman et al., 2014, p.73).The following theorem characterises the distribution of β S under the Gaussian sketch.
Theorem 2. Suppose β S is computed using a Gaussian sketch and k > p + 1.The conditional distribution of β S is For proof see the the supplementary material.
An immediate application of result (i) is the ability to generate exact confidence intervals for the elements of β S , methodology that does not appear to be present in the existing literature.It is also possible to estimate exact joint confidence regions for the entire vector β S .Again assuming that k > p + 1, it should be noted that the variance of β S , is not dependent on the compression ratio k/n.Although RSS F can be expected to grow linearly with n, this will generally be counterbalanced by (X T X) −1 decreasing linearly with n.The distribution of the approximation error β S − β F 2 will largely be controlled by the target dimension k.This speaks to the defining characteristic of sketching algorithms, that given a fixed k, the stochastic approximation error does not necessarily increase with size of the original dataset n.Probabilistic worst case bounds on the error β S − β F 2 can also be obtained by making an appeal to Chebyschev's inequality.

Partial sketching
An important finding was that the variance of the complete sketching estimator is dependent on the residual sum of squares.It is simple to see that the variance of the partial sketched estimator will not be a function of the residual sum of squares.From the normal equations it holds that X T y = X T Xβ F .Using this property, we see that conditional on y, X, the variance of the random linear combination β P = (X T S T SX) −1 X T y = (X T S T SX) −1 X T Xβ F will be a function of the covariates and the fitted values.The residual vector has no influence on the variance of the partial sketching estimator, and as such the variance of β P will not be related to the residual sum of squares.This suggests that when the noise level is high, partial sketching may become preferrable to complete sketching.This idea has been touched on in the existing literature, but specific guidelines are lacking (Becker et al., 2015;Dhillon et al., 2013).
A statistical analysis can provide some insight into this issue.
The hierarchical model for complete sketching gave an intuitive statistical perspective on the mechanics of the algorithm.Partial sketching seems to lack a similar conceptual device.The least squares coefficients can be represented as the solution to the linear system of the equations X T Xb = X T y.Partial sketching simply returns the solution, b, to the approximate linear system X T Xb = X T y.Lacking a convenient representation for the estimator, we must proceed in a more pedestrian manner.The mean square error of the estimator β P can be determined using only mean and variance information, and this will be the goal for now.The key observation is that ( X T X) −1 |y, X ∼ InvWishart k, k(X T X) −1 .Conditional on y, X, the estimator β P = ( X T X) −1 X T y is a linear combination of the elements of an Inverse-Wishart random variable.However, this is a non-standard distribution and it is difficult to directly express the distribution function of β P .Despite this, it is straightforward to determine the mean and variance of β P .From properties of the Inverse-Wishart distribution, it can be seen that the partial sketched estimator is biased, with mean where it is assumed that k > p + 3.This motivates an alternative unbiased estimator Determining the variance of β P and the unbiased β * P is a more lengthy computation (see supplementary material).Skipping the working, the variance of the biased estimator The variance of the unbiased estimator The variances of β P and β * P have a similar structure to the variance of β S .The main point of difference is that the variance of β S depends on the residual sum of squares, whereas the variance of β P and β * P depends on the model sum of squares.As mentioned the explicit form of the sampling distribution is hard to obtain, but by making a connection with method of moments estimation it is possible to establish asymptotic normality of both β P and β * P as k tends to infinity.This motivates the construction of approximate confidence intervals.As the exact variance is unknown we propose the following estimator

Relative efficiency
The relative efficacy of complete and partial sketching is also of interest.As the plug in estimator β P has a higher mean square error than β * P , it will not be considered in this section.The performance of the complete sketching estimator β S and the unbiased partial sketched estimator β * P will be compared in terms of mean squared error.As both β F and β * P are unbiased, the mean squared error can be computed using their respective covariance matrices, that is | y, X = tr(var(β * P )).Comparing ( 6) and ( 8), the variance of β * P is dependent on M SS F , whereas the variance of β S is dependent on RSS F .This suggests that the signal to noise ratio in the source dataset will be an influential factor in determining which estimator is more efficient.It is possible to establish a bound on the relative efficiency of complete to partial sketching which captures this intuition, The lower bound indicates that if R 2 F ≥ 0.5, then the mean squared error of β S is less than the mean squared error of β * P .The upper bound depends on k and p.Assuming that k > p + 3, the scaling factor (k − p − 1)(k − p + 1)/(k − p)(k − p − 3) is upper bounded by 3.75.This gives a strong upper bound that is only a function of R 2 F , however it is quite loose.Strong upper and lower bounds are plotted in Figure 2. The scaling factor tends to one as k grows larger, giving a much tighter bound.A specific bound for k = 1500, p = 1000 is also plotted in fig. 2 as these values are used in a data application later on.
When R 2 F is close to one complete sketching can be orders of magnitude more efficient than partial sketching, and when R 2 F is close to zero, partial sketching can be orders of magnitude more efficient than complete sketching.

Combined estimator
So far we have assumed that an analyst much choose between one of the two methods.Obtaining both β * P and β S from a single sketch is computationally cheap, and may be an attractive strategy.The most demanding operation with the sketched data is calculating ( X T X) −1 .Given this quantity it is economical to compute both β S and β * P .Becker et al. (2015) mention they are presently investigating such a strategy, but do not give any details.Our motivation for a combined estimator is driven by the fact even when using a single sketch ( y, X), the two estimators are uncorrelated, that is cov (β * P , β S ) = 0.This is established by taking iterated expectations, and using the hierarchical model established in section 3.1 (see supplementary material).A simple strategy is then to take a weighted combination of β S and β * P .A combined estimator β C can be defined as P , for some 0 < α < 1.The value of α that minimises the mean square error is The weight given to the β S is related to the relative efficiency of the two estimators.Starting from the relative efficiency bounds (10), it is possible to establish a bound on the optimal weights (see Supplementary material this is the scaling factor in the relative efficiency upper bound.This is approximately equal to one unless k is very close to p.The optimal weight bound can then be written as The scaling factor c is upper bounded by 3.75, this again provides a strong upper bound that is only a function of R 2 F .The strong bound is not very useful for capping the optimal weight.Figure 3 plots the strong upper and lower bounds on the optimal weight.A particular upper bound for k = 1500, p = 1000 is also plotted, and it is significantly tighter than the strong upper bound.
Use of the weighted estimator is expected to be most beneficial when the signal to noise ratio is moderate, that is R 2 F ≈ 0.5.When the signal to noise ratio is either very high or very low, there is little gain from using the weighted estimator as either the complete or partial estimator will dominate.

Preliminaries
Finite sample distributions of random projection estimators can be mathematically intractable, and as such asymptotic analysis can be a powerful tool (Li et al., 2006).It is a very difficult task to establish meaningful finite sample results for the Hadamard and Clarkson-Woodruff sketches, as they are discrete distributions over a very large combinatorial space.The explicit finite sample distribution of the sketched estimators can be written as a sum over all these possible combinations, but such a representation is not very informative.Instead, it is useful to study the large n distribution of the estimators β S and β P to obtain an interpretable expression.As β F is the estimand in sketching algorithms, this requires conditioning on the source data in the asymptotic analysis.To elaborate, let A (n) = [y (n) , X (n) ] represent the n × d source data matrix of full column rank.Any source data matrix A (n) has a set of associated least squares coefficients, which will here be denoted β (n) F .The overall goal is to determine the asymptotic form of the distributions p(β S |A (n) ) and p(β * P |A (n) ) for some arbitrary large dataset A (n) .To take limits, we employ a fixed sequence of n × d datasets, all of rank d.In the regression scenario this amounts to assuming that X (n) is of full column rank and that y (n) is not a perfect linear combination of the columns of X (n) for all n.Conditioning on A (n) is effectively the same as treating the full dataset as an arbitrary sequence of constants A ij for i = 1, . . ., n, j = 1, . . ., d.This is to analogous large sample results for regression models where the design matrix is treated as arbitrary set of constants, and the random variables of interest are the error terms, for example see Van Der Vaart (1998, section 2.5).Here the source dataset is treated as a sequence of constants and the random variables of interest are the elements of the sketching matrix.
The asymptotic analysis is carried out in two stages.The initial step is to establish asymptotic normality of the sketched dataset.This is then followed by an analysis of the limiting distribution of β S , and β * P .There is some related work by Ma et al. (2015) who develop asymptotic expressions for the bias and variance of data aware sketched regression estimators, where limits are taken in the sketch size k.Our work is different as we study data oblivious random projections and take limits in n, which is perhaps more natural in the Big Data setting.

Sketching Central Limit Theorem
A central limit theorem for sparse sketching matrices with independent entries is given in Li et al. (2006).The Clarkson-Woodruff sketch and the Hadamard sketch have dependent entries, and as such we use a different method of proof.Under some regularity conditions the Hadamard and Clarkson-Woodruff sketches produce sketched data that asymptotically has the same matrix normal distribution as under the Gaussian sketch.Using a Gaussian sketch, Each row is statistically independent, and marginally normally distributed with covariance matrix A T A. Although asymptotic normality may not be particularly surprising seeing as the sketched data are linear combinations of random vectors, the proof is not immediate due to the dependence in the Hadamard and Clarkson-Woodruff sketches.The difficulties caused by the dependence are most easily illustrated for the Clarkson-Woodruff sketch.
Algorithm 1. Clarkson-Woodruff sketch Multiply by r and add to row z in sketch Output A Output sketched dataset The behaviour of the Clarkson-Woodruff sketch can be represented as a many to less mapping.Each row in the source dataset is assigned to a single row in the sketched dataset.The Clarkson-Woodruff sketch has an alternative streaming construction that highlights this property, given in Algorithm 1.As each row in the source dataset only contributes to a single row in the sketched dataset, it might be expected that this results in some statistical dependence amongst the rows of the sketched dataset.Additionally, although it seems each row in the sketched dataset will be marginally normally distributed, it is not clear if joint asymptotic normality over all rows will hold.Similar conundrums arise when examining the Hadamard sketch in detail.
Under some assumptions it is nevertheless possible to establish a central limit theorem for the sketched data matrix.
The core restriction imposed by Assumption 1 is a stability condition on the growth of the variance in the dataset.It is worth discussing the significance of the limiting matrix Q.A useful comparison can be made to asymptotic theory for regression models, where a common assumption is that the design matrix satisfies the limit condition n −1 X T (n) X (n) → B, where B is some positive definite matrix (White, 1984;Greene, 1997).The development of asymptotic results is often eased by treating the covariates as a random sample, although this requires positing a realistic probability model for the covariates, which may be difficult.Treating the covariates as an arbitrary fixed sequence relaxes this assumption and covers more general scenarios.Although it is possible to establish asymptotic results when n −1 X T (n) X (n) is not required to converge to any fixed matrix, proofs can become very technical (Fahrmeir & Tutz, 1994, Appendix A.2). Imposing a limiting value for n −1 X T (n) X (n) simplifies arguments and can be seen as a compromise between making strong and weak assumptions about the covariates (Fahrmeir & Tutz, 1994, p.46).There is an analogous motivation for Assumption 1, the limiting matrix Q is present to avoid specifying a probability model for the source dataset, without overcomplicating the mathematical analysis.
Again in reference to regression models, the importance of the assumption , indicating that the Fisher information in the dataset tends to infinity (Fahrmeir & Tutz, 1994, p.46).There is a similar motivation here, assumption 1 implies that A T (n) A (n) diverges, and as such the total variance in the dataset grows without bound.
The second assumption controls the grows of outliers in the dataset.This is to ensure that no single row in the source dataset has undue influence over the sketched dataset.With these assumptions, the limiting behaviour of the sketched data is given in the following theorem.
Theorem 3. Assume that we generate the sketched data using a Hadamard or Clarkson-Woodruff sketch.Suppose that assumptions 1 and 2 hold.Let , and define the sketched data matrix as A = [ y, X].Then as n → ∞, the sketched data converges in distribution to a matrix normal distribution, Normalising the sketched data by n −1/2 is necessary to control the variance.The k × d random matrix A is the output of a stochastic process governed by the fixed n × d source dataset A (n) and the distribution of the random k × n sketching matrix S. The sketched dataset is a linear combination of random vectors, the number of which increases with n.As such, we can expect A to demonstrate some stable limiting behaviour as n grows larger.The proof of Theorem 3 is in the supplementary material.
Heuristically, for large n we expect the matrix normal result (11) to approximately hold for the Hadamard and Clarkson-Woodruff sketches.The indicates rows in the sketched data matrix are asymptotically statistically independent, in the sense that knowledge of one row in the sketched data matrix does not help predict any other row in the sketched data matrix.Assumption 1 imposes sufficient variation in the source dataset for this to occur.
It would be pleasing to also establish convergence in distribution of the conditional distribution as n → ∞ for general random projections, as we could then use the same hierarchical model representation as with the Gaussian sketch.Conditional convergence will occur if the full dataset satisfies a set of additional technical conditions (Sweeting, 1989).We choose to not enforce these extra assumptions as they are not needed to establish convergence of the sketched estimators.

Sketching Estimators
The central limit theorem for the sketched data suggests that the results about β S and β P for the Gaussian sketch will also approximately hold for the Hadamard and Clarkson-Woodruff sketches for large n.Setting up a limit theorem requires a little extra care with notation.As we have a sequence of datasets A (n) , there is a corresponding sequence of optimal least squares coefficients β (n) F .Similarly, there is a sequence of squared residual errors RSS Under the assumptions 1 and 2, it is possible to establish an asymptotic result for β S and β P .
Theorem 4. Suppose that Assumptions 1 and 2 hold, k ≥ p, and β S is computed using a Hadamard or Clarkson-Woodruff sketch.Let ( X T X) + denote the Moore-Penrose pseudoinverse of ( X T X).Let .
Then as n → ∞, convergence in distribution holds for For large n, we expect β S to be approximately distributed as per Theorem 4 for both the Hadamard and Clarkson-Woodruff sketches.
It is harder to establish a comparable limit theorem for β * P , due to the non-standard distribution of β * P when using a Gaussian sketch.There is no typical normalised distribution to target.Instead, we wish to show asymptotic equivalence in moments.The partially sketched estimator under the Hadamard and Clarkson-Woodruff sketches should have similar mean and variance properties to the Gaussian partially sketched estimator.An extra assumption has to be made to show convergence in moments.A sufficient condition is a stability condition on the singular values of the sketched data matrix.Assumption 3. Let G be the Gram matrix of the scaled sketched dataset, G = n −1 X T X.
Assume that the sequence of source datasets is such that 2 is finite for large enough n.This additional regularity condition enables a formal limit theorem regarding the moments of β * P .Theorem 5. Suppose that Assumptions 1, 2 and 3 hold, k > p + 3, and β * P is computed using a Hadamard or Clarkson-Woodruff sketch.Let Once again, the heavy notation may obscure the essence of the result.The subscript S is used to emphasise that the only source of randomness is the sketching matrix, and that the source dataset is fixed.The theorem suggests that the bias and variance of β * P under the Clarkson-Woodruff and Hadamard sketches should be approximately equal to that under the Gaussian sketch.Specifically, we expect equations ( 6), ( 7), and (8) to be good approximations for the variance of the sketched estimators using the Hadamard or Clarkson-Woodruff sketches.
The results here are meant to be useful heuristics to assess the uncertainty attached to the output of the randomised approximation algorithm.There is a need to communicate and quantify the approximation error of sketching algorithms to end users, and the asymptotic results developed in this section can be of use.

Human leukocyte antigen dataset
We compared the performance of the sketching estimators on a real genetic dataset taken from the UK Biobank database.We use a small extract from the data in Astle et al. (2016).The selected response variable was mean red cell volume (MCV), taken from the full blood count assay and adjusted for various technical and environmental covariates.Genome-wide imputed genotype data in expected allele dose format were available on n = 132353 study subjects (Howie et al., 2009).We consider 1000 genetic variants in the Human leukocyte antigen (HLA) region of chromosome 6, selected so that no pair of variants had Pearson correlation of allelic scores less than 0.8.The region was chosen as many associations were discovered in a genomewide scan using univariable models; these associations were with variants with different allele frequencies, suggesting multiple distinct causal variants in the region.The aim is to perform a multivariable regression analysis to obtain variant effect size estimates that are conditional on the other variants in the region.
An early theoretical finding was that the partial sketched estimator β P was biased.One thousand sketches were taken to estimate the bias E S (β P − β F ) with k = 1500.We also computed the bias corrected estimator β * P in each replication.Figure 4  The results in the top row show that β P is biased for each of the random projections.The bias closely matches the theoretical factor.The bottom row shows that the adjusted estimator β * P appears to be unbiased, with the mean values falling closely along the identity line.
We also compared the complete and partially sketched estimators on mean square error and the coverage of confidence intervals at k = 1500 and k = 10000.We did not consider a combined estimator as the small R 2 F value would mean give an optimal complete sketching weight of close to zero.Table 2 reports the mean square error for each of the estimators.The signal to noise ratio is quite low for this dataset with R 2 F = 0.02.The relative efficiency bound dictates that partial sketching will be much more efficient than complete sketching on this dataset.The simulation results support this idea, with β * P having a mean square error roughly sixty times smaller than β S at both values of k. Results are very similar for each of the random projections, suggesting that the asymptotic approximations are reasonable for this dataset.For k = 1500, the mean square error of β P is approximately ten times that of β * P .For k = 10000, there is less of a difference, as the ratio k/(k − p − 1) is closer to one.The bias adjusted estimator β * P has significant advantages over β P when k/(k − p − 1) is larger than one.
Table 3 summarises the coverage of 95% confidence intervals for the sketched estimators.We report the overall proportion of intervals that contained the true value of the least squares estimate β F over the two hundred and fifty sketches and p = 1000 coefficients.The observed coverage is close the nominal level of 0.95 at both levels of k.The different random projections give very similar results, suggesting that the use of asymptotic approximations is again reasonable on this dataset.The intervals for the Hadamard sketch appear to be slightly conservative at k = 10000.This may be due to the specialised random number generator used in the implementation of the Hadamard sketch.The Radamacher random variables are only four-wise independent as opposed to being mutually independent (Geppert et al., 2017, p.85).

Flights Dataset
The sketching algorithms were also evaluated on the New York flights dataset available in the R package nycflights13 (Wickham, 2014).Arrival delay was taken as the response, and departure delay, distance, departure time, origin and month and day were chosen to be the covariates.Rows of the dataset with missing data were omitted, leaving n = 327346 and p = 47.The goal was to compare the accuracy of the various sketches on real data rather than to build a statistical model for the flights dataset.We compared the mean square error of the estimators and the coverage of confidence intervals for k = 5000.In contrast to the HLA F value of 0.99.We took one thousand sketches to compare complete and partial sketching.
Table 4 reports the mean square error of β S , β P and β * P .As expected, complete sketching has a much smaller mean square error than partial sketching.Table 5 summarises the coverage rates of the 95% confidence intervals.We report the overall proportion of intervals that contained the true value of the least squares estimate over the thousand sketches and p = 47 coefficients.
We also generated a synthetic flights dataset with an R 2 F of close to 0.5.This was achieved by generating a synthetic response vector y using the fitted model.The simulated response was computed as y = Xβ F + 15e, where e was the residual vector from the least squares fit e = y − Xβ F .We took one thousand sketches and computed β S , β * P and a weighted estimator using the optimal weight α opt in each iteration.Table 6 reports the mean square error for each estimator.From the theoretical analysis the mean square error of the weighted estimator is expected to be roughly half that of β S or β * P .The results support this for each of the three different sketches.
We also assessed the finite sample behaviour of the normal approximation in Theorem 3 at different levels of k and p.We dropped some predictors from the full flights dataset to give smaller datasets with p = 10 and p = 25 covariates.We then took subsamples of different sizes from each of the datasets.A single subsample was taken at each value of n, so the same subsampled dataset was being sketched each time.One thousand sketches were taken of each dataset at different values of k.We tested the joint multivariate normality of [ y, X] and the normality of the sketched residual e = S(y − Xβ F ). the sketched observations was compared to the theoretical χ 2 -distribution.As n increases the rejection rate is expected to fall to the type one error rate of 0.05. Figure 5 plots the proportion of times the null hypothesis of normality is rejected against the size of the source dataset.
The Hadamard sketch appears to have a much faster rate of convergence than the Clarkson-Woodruff sketch.When using a Hadamard sketch, each row in the sketched dataset is a linear combination of n observations from the source dataset.When using a Clarkson-Woodruff sketch, each row in the sketched dataset is expected to be a combination of only n/k observations from the source dataset.As such, n/k must be large for the normal approximation to hold.As expected, the rejection rate for the Clarkson-Woodruff sketch increases with k, but remains stable for the Hadamard sketch.In fig. 5 the rejection rate for the Clarkson-Woodruff sketch increases with p.The Hadamard sketch seems to be less sensitive to the number of covariates.The extra log k computation cost associated with the Hadamard sketch (Table 1) appears to have the benefit of accelerated convergence to normality.Even though joint normality may not be holding for the Clarkson-Woodruff sketch for the flights dataset, the coverage of the confidence intervals is still very good.As y = Xβ F + e, normality of the sketched residual is perhaps sufficient in justifying the approximate confidence intervals given by Theorem 4 (ii).The sketched residual converges much more quickly than the full sketched data matrix, which perhaps explains the good coverage properties of the confidence intervals for β S in Table 5.

Discussion
Sketching algorithms have emerged in the computer science community as a powerful device for the analysis of massive datasets (Mahoney & Drineas, 2016).Sketched regression algorithms use random projections to reduce the size of the original dataset, the sketched dataset is then used to estimate the optimal least squares coefficients.Most existing theory for sketched regression is from an algorithmic worst case perspective and connects with random matrix theory and computational geometry (Raskutti & Mahoney, 2014;Thanei et al., 2017) .In this paper we have provided a complementary statistical perspective and derived new tools for assessing the uncertainty attached to sketched estimators as well as guidelines for choosing between competing sketching algorithms.
Iterative methods, in particular stochastic gradient descent have not been mentioned so far.For large n regression problems, stochastic gradient descent will produce iterates that converge to β F under very mild conditions.Comparisons between single pass sketching and stochastic gradient methods are difficult, as the two techniques are not formulated for the exact same purpose.Single pass sketching algorithms are designed to return an approximate solution in finite time with probabilistically controlled error, whereas stochastic gradient methods are designed to converge to the exact solution asymptotically.It is perhaps more appropriate to compare stochastic gradient descent to iterative sketching methods, as iterative sketching algorithms also come with convergence guarantees to β F (Pilanci & Wainwright, 2016;Gower & Richtrik, 2015).Iterative sketching methods make use of approximate second order information that can lead to a potential improvement compared to first order stochastic gradient methods (Roosta-Khorasani & Mahoney, 2016).Our focus has been on characterising the approximation error attached to single pass sketching estimators.
There has been recent work in adapting sketching methods for statistical inference in large datasets, building from the worst case bounds in the computer science literature.Geppert et al. (2017) and Bardenet & Maillard (2015) investigate sketching algorithms for Bayesian regression, and derive bounds on the difference between the sketched posterior distribution and the full data posterior distribution.Yang et al. (2015b) consider sketched penalised regression, and give bounds between the sketched solution and the full data solution similar to the results in section 2.1.Only complete sketching is considered in the aforementioned work.The results on the advantages of partial sketching in this paper could motivate adaptations that make use of the exact marginal associations X T y.
Sketching ideas have been used to develop methods for approximate non-linear regression (Avron et al., 2014;Banerjee et al., 2013).A related branch of work uses random projections to reduce the number of predictors in regression and classification problems (Shah & Meinshausen, 2013;Cannings & Samworth, 2015;Guhaniyogi & Dunson, 2015).

Figure 1 :
Figure 1: Sampled sketching matrices S for k = 32, n = 36.Elements in the sketching matrix are coloured based on the value.One and negative one are coloured as black and white respectively.Intermediate values are in shades of grey.

Figure 2 :
Figure 2: Solid lines give the strong upper and lower bounds on the relative efficiency of β S to β * P as a function of R 2 F .The dashed line gives the upper bound for k = 1500, p = 1000.

Figure 3 :
Figure 3: Solid lines give the strong upper and lower bounds on the optimal weight α opt as a function of R 2 F .The dashed line gives the upper bound for k = 1500, p = 1000.

F
. As the sequence of datasets are fixed, β plots the average value of the estimators against the true value of the least squares coefficient using the full dataset.The top row (a)-(c) shows results for β P , and the bottom row (d)-(f) shows results for β * P .The first, second and third columns display the results for the Gaussian, Hadamard and Clarkson-Woodruff sketches respectively.The solid line in each panel is the identity line.The dashed line in the top row shows the theoretical bias, having slope k/(k − p − 1).

Figure 4 :
Figure 4: Bias of sketching estimators on the HLA dataset.Mean estimates are plotted against true values.In this scenario n = 132353, p = 1000, k = 1500.Solid line is the identity line and dashed line represents the theoretical bias factor.

Table 2 :
Mean square error of sketched estimators on HLA dataset.Standard errors are in brackets.

Table 3 :
Coverage of confidence intervals on the HLA dataset.The largest standard error is 0.001

Table 4 :
Mean square error of sketched estimators on flights dataset with k = 5000.Standard errors are in brackets.

Table 5 :
Coverage of 95% confidence intervals on the flights dataset with k = 5000.The largest standard error is 0.004 dataset, the flights dataset has a very high R 2

Table 6 :
The squared Mahalanobis distance of Mean square error of sketched estimators on synthetic flights dataset with k = 5000.Standard errors are in brackets.