High-dimensional regression with potential prior information on variable importance
Authors
Stokell, Benjamin G
Shah, Rajen D
Publication Date
2022-06-14
Journal Title
Statistics and Computing
ISSN
0960-3174
Publisher
Springer US
Volume
32
Issue
3
Language
en
Type
Article
This Version
VoR
Citation
Stokell, B. G., & Shah, R. D. (2022). High-dimensional regression with potential prior information on variable importance. Statistics and Computing, 32(3). https://doi.org/10.1007/s11222-022-10110-5
Abstract
There are a variety of high-dimensional regression settings where vague prior information may be available on the importance of predictors. Examples include the ordering on the variables offered by their empirical variances (which is typically discarded through standardisation), the lag of predictors when fitting autoregressive models in time series settings, or the level of missingness of the variables. Whilst such orderings may not match the true importance of variables, we argue that there is little to be lost, and potentially much to be gained, by using them. We propose a simple scheme involving fitting a sequence of models indicated by the ordering. We show that the computational cost for fitting all models when ridge regression is used is no more than for a single fit of ridge regression, and describe a strategy for Lasso regression that makes use of previous fits to greatly speed up fitting the entire sequence of models. We propose to select a final estimator by cross-validation and provide a general result on the quality of the best performing estimator on a test set selected from among a number M of competing estimators in a high-dimensional linear regression setting. Our result requires no sparsity assumptions and shows that only a log M price is incurred compared to the unknown best estimator. We demonstrate the effectiveness of our approach when applied to missing or corrupted data, and in time series settings. An R package is available on GitHub.
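The scheme outlined in the abstract (fit a sequence of models indicated by an ordering, then pick one by out-of-sample error) can be illustrated with a minimal Python sketch. This is not the authors' algorithm or their R package: it refits ridge regression from scratch for each model rather than using the efficient computation the abstract describes, it uses a single hold-out split in place of cross-validation, and the function names, the grid `sizes`, the penalty `lam`, and the simulated data are all assumptions made for illustration. The ordering used is decreasing empirical variance, one of the examples the abstract mentions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients (no intercept) for penalty lam."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def nested_ridge_select(X_train, y_train, X_val, y_val, order, sizes, lam=1.0):
    """Fit ridge on the first k variables of `order` for each k in `sizes`.

    Returns the k with the smallest hold-out mean squared error, the
    corresponding full-length coefficient vector, and all hold-out errors.
    (Hypothetical helper; naive refitting, one fit per k.)
    """
    errors = []
    best = None
    for k in sizes:
        idx = order[:k]
        beta = np.zeros(X_train.shape[1])
        beta[idx] = ridge_fit(X_train[:, idx], y_train, lam)
        err = float(np.mean((y_val - X_val @ beta) ** 2))
        errors.append(err)
        if best is None or err < best[2]:
            best = (k, beta, err)
    return best[0], best[1], errors

# Simulated data (assumption): columns with larger variance carry the signal.
rng = np.random.default_rng(0)
n, p = 200, 50
scales = np.linspace(3.0, 0.5, p)
X = rng.standard_normal((n, p)) * scales
beta_true = np.zeros(p)
beta_true[:5] = 1.0
y = X @ beta_true + rng.standard_normal(n)

X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

# Candidate ordering: decreasing empirical variance on the training data.
order = np.argsort(-X_train.var(axis=0))
sizes = [5, 10, 20, 50]

best_k, best_beta, errors = nested_ridge_select(
    X_train, y_train, X_val, y_val, order, sizes, lam=1.0
)
print("hold-out MSE per model size:", dict(zip(sizes, errors)))
print("selected number of variables:", best_k)
```

Because the full model (all variables) is always among the candidates, a poor ordering costs little in this sketch: the selection step can simply fall back to the largest model.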
Keywords
High-dimensional data, Low variance filter, Lasso, Ridge regression, Missing data, Corrupted data
Sponsorship
Engineering and Physical Sciences Research Council (EP/N031938/1)
Identifiers
s11222-022-10110-5, 10110
External DOI: https://doi.org/10.1007/s11222-022-10110-5
This record's URL: https://www.repository.cam.ac.uk/handle/1810/338087
Rights
Licence:
http://creativecommons.org/licenses/by/4.0/