High-dimensional regression with potential prior information on variable importance

Stokell, Benjamin; Shah, Rajen

cam.depositDate	2022-05-19
cam.issuedOnline	2022-06-14
cam.orpheus.counter	3
cam.orpheus.success	Tue Jun 21 09:21:04 BST 2022 - Embargo updated
dc.contributor.author	Stokell, Benjamin
dc.contributor.author	Shah, Rajen
dc.contributor.orcid	Stokell, Benjamin [0000-0002-8365-715X]
dc.contributor.orcid	Shah, Rajen [0000-0001-9073-3782]
dc.date.accessioned	2022-05-19T23:30:31Z
dc.date.available	2022-05-19T23:30:31Z
dc.date.issued	2022-06
dc.date.updated	2022-05-19T10:02:06Z
dc.description.abstract	In a variety of high-dimensional regression settings, vague prior information may be available on the importance of predictors. Examples include the ordering of the variables offered by their empirical variances (which is typically discarded through standardisation), the lag of predictors when fitting autoregressive models in time series settings, or the level of missingness of the variables. Whilst such orderings may not match the true importance of variables, we argue that there is little to be lost, and potentially much to be gained, by using them. We propose a simple scheme involving fitting a sequence of models indicated by the ordering. We show that the computational cost of fitting all models when ridge regression is used is no more than that of a single fit of ridge regression, and describe a strategy for Lasso regression that makes use of previous fits to greatly speed up fitting the entire sequence of models. We propose to select a final estimator by cross-validation and provide a general result on the quality of the best performing estimator on a test set selected from among a number $M$ of competing estimators in a high-dimensional linear regression setting. Our result requires no sparsity assumptions and shows that only a $\log M$ price is incurred compared to the unknown best estimator. We demonstrate the effectiveness of our approach when applied to missing or corrupted data, and to time series settings. An R package is available on GitHub.
dc.identifier.doi	10.17863/CAM.84739
dc.identifier.eissn	1573-1375
dc.identifier.issn	0960-3174
dc.identifier.uri	https://www.repository.cam.ac.uk/handle/1810/337325
dc.language.iso	eng
dc.publisher	Springer Science and Business Media LLC
dc.publisher.department	Department of Pure Mathematics And Mathematical Statistics
dc.rights	Attribution 4.0 International
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	stat.ME
dc.subject	stat.CO
dc.subject	stat.ML
dc.subject	62J07
dc.title	High-dimensional regression with potential prior information on variable importance
dc.type	Article
dcterms.dateAccepted	2022-05-16
prism.publicationName	Statistics and Computing
pubs.funder-project-id	Engineering and Physical Sciences Research Council (EP/N031938/1)
pubs.licence-display-name	Apollo Repository Deposit Licence Agreement
pubs.licence-identifier	apollo-deposit-licence-2-1
rioxxterms.type	Journal Article/Review
rioxxterms.version	VoR
rioxxterms.versionofrecord	10.1007/s11222-022-10110-5
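
The scheme summarised in the abstract (fit a sequence of nested models following a prior ordering of the variables, then pick one using held-out data) can be illustrated with a minimal sketch. This is not the authors' R package or their efficient algorithm: it naively refits ridge regression for every model size, uses a single train/validation split instead of cross-validation, and all names (`ridge_fit`, `select_nested_ridge`, `lam`, the simulated data) are hypothetical choices made for illustration. The ordering used here is decreasing empirical variance, one of the examples the abstract mentions.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients without intercept: (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def select_nested_ridge(X_train, y_train, X_val, y_val, order, lam=1.0):
    """Fit ridge on the first k variables of `order` for k = 1..p.

    Returns the best k, the coefficient vector padded with zeros for the
    excluded variables, and the validation MSE of every model size.
    """
    p = len(order)
    val_mse = np.empty(p)
    fits = []
    for k in range(1, p + 1):
        idx = order[:k]  # the k variables ranked most important a priori
        beta = ridge_fit(X_train[:, idx], y_train, lam)
        fits.append((idx, beta))
        resid = y_val - X_val[:, idx] @ beta
        val_mse[k - 1] = np.mean(resid ** 2)
    best = int(np.argmin(val_mse))
    idx, beta = fits[best]
    full = np.zeros(X_train.shape[1])
    full[idx] = beta
    return best + 1, full, val_mse


# Simulated data (an assumption for the demo): columns with larger variance
# carry the signal, so the variance ordering is informative here.
rng = np.random.default_rng(0)
n, p = 200, 20
scales = np.linspace(3.0, 0.5, p)
X = rng.standard_normal((n, p)) * scales
beta_true = np.zeros(p)
beta_true[:5] = 1.0
y = X @ beta_true + rng.standard_normal(n)

X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

# Prior ordering: decreasing empirical variance on the training data.
order = np.argsort(-X_train.var(axis=0))

best_k, beta_hat, val_mse = select_nested_ridge(
    X_train, y_train, X_val, y_val, order, lam=1.0
)
print("selected model size:", best_k)
```

The sketch costs one ridge solve per model size; the abstract states that the paper obtains all ridge fits for no more than the cost of a single fit, which this example does not attempt to reproduce.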

Files

Original bundle

Name: Stokell-Shah2022_Article_High-dimensionalRegressionWith.pdf
Size: 518.55 KB
Format: Adobe Portable Document Format
Description: Published version
Licence: https://creativecommons.org/licenses/by/4.0/

Collections

Cambridge University Research Outputs