On b-bit min-wise hashing for large-scale regression and classification with sparse data

Shah, Rajen D; Meinshausen, Nicolai

doi:10.17863/CAM.43394

Published version

Peer-reviewed

Repository URI

https://www.repository.cam.ac.uk/handle/1810/296348

Repository DOI

https://doi.org/10.17863/CAM.43394

Files

Published version (1.74 MB)

Type

Article

Authors

Shah, Rajen D

Meinshausen, Nicolai

Abstract

Large-scale regression problems where both the number of variables, $p$, and the number of observations, $n$, may be large, on the order of millions or more, are becoming increasingly common. Typically the data are sparse: only a fraction of a percent of the entries in the design matrix are non-zero. Nevertheless, often the only computationally feasible approach is to perform dimension reduction to obtain a new design matrix with far fewer columns and then work with this compressed data.

$b$-bit min-wise hashing is a promising dimension reduction scheme for sparse matrices which produces a set of random features such that regression on the resulting design matrix approximates a kernel regression with the resemblance kernel. In this work, we derive bounds on the prediction error of such regressions. For both linear and logistic models, we show that the average prediction error vanishes asymptotically as long as $q \|\beta^*\|_2^2 / n \to 0$, where $q$ is the average number of non-zero entries in each row of the design matrix and $\beta^*$ is the coefficient of the linear predictor.
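To make the feature construction concrete, the following is a minimal sketch of $b$-bit min-wise hashing for a binary sparse design matrix, assuming the commonly described scheme: draw random permutations of the columns, record for each row the smallest permuted index among its non-zero columns, keep the lowest $b$ bits, and one-hot encode the result. It is not necessarily the exact variant analysed in the paper. The function name `b_bit_minwise_features`, its parameters, and the toy data are hypothetical and chosen for illustration only.

```python
import numpy as np

def b_bit_minwise_features(rows, p, num_perms, b, seed=0):
    """Sketch of b-bit min-wise hashing (illustrative, not the paper's code).

    rows      : list of iterables, the non-zero column indices of each row
    p         : number of columns in the original design matrix
    num_perms : number of random permutations
    b         : number of low-order bits kept per permutation
    Returns an n x (num_perms * 2**b) binary matrix.
    """
    rng = np.random.default_rng(seed)
    width = 2 ** b
    features = np.zeros((len(rows), num_perms * width), dtype=np.int8)
    for l in range(num_perms):
        # perm[j] is the position of column j under the l-th permutation
        perm = rng.permutation(p)
        for i, cols in enumerate(rows):
            cols = np.fromiter(cols, dtype=np.int64)
            if cols.size == 0:
                continue  # an all-zero row has no minimum; leave its features at 0
            min_rank = perm[cols].min()      # min-wise hash of row i
            code = min_rank & (width - 1)    # keep only the lowest b bits
            features[i, l * width + code] = 1  # one-hot encode within block l
    return features

# Toy usage: 4 rows over p = 10 columns, 5 permutations, b = 2 bits
rows = [[0, 3, 7], [0, 3, 7], [1, 2], []]
Z = b_bit_minwise_features(rows, p=10, num_perms=5, b=2)
print(Z.shape)
```

Under a single random permutation, two rows share the same minimum with probability equal to the Jaccard similarity of their non-zero sets, which is why a linear model fitted on features of this kind relates to regression with the resemblance kernel. The compressed matrix `Z` has `num_perms * 2**b` columns regardless of `p`.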

We also show that ordinary least squares or ridge regression applied to the reduced data can in fact allow us to fit more flexible models. We obtain non-asymptotic prediction error bounds for interaction models and for models where an unknown row normalisation must be applied in order for the signal to be linear in the predictors.

Keywords

math.ST, stat.ML, stat.TH

Journal Title

Journal of Machine Learning Research

Journal ISSN

1532-4435
1533-7928

Volume Title

18

Publisher

Microtome Publishing

Publisher URL

http://www.jmlr.org/papers/v18/16-587.html

Rights

Attribution 4.0 International

Sponsorship

The first author was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1 and an EPSRC programme grant.

Collections

University of Cambridge Research Outputs (Articles and Conferences)
