High-dimensional regression with potential prior information on variable importance

cam.depositDate: 2022-05-19
cam.issuedOnline: 2022-06-14
cam.orpheus.counter: 3
cam.orpheus.success: Tue Jun 21 09:21:04 BST 2022 - Embargo updated
dc.contributor.author: Stokell, Benjamin
dc.contributor.author: Shah, Rajen
dc.contributor.orcid: Stokell, Benjamin [0000-0002-8365-715X]
dc.contributor.orcid: Shah, Rajen [0000-0001-9073-3782]
dc.date.accessioned: 2022-05-19T23:30:31Z
dc.date.available: 2022-05-19T23:30:31Z
dc.date.issued: 2022-06
dc.date.updated: 2022-05-19T10:02:06Z
dc.description.abstract: There are a variety of settings where vague prior information may be available on the importance of predictors in high-dimensional regression settings. Examples include ordering on the variables offered by their empirical variances (which is typically discarded through standardisation), the lag of predictors when fitting autoregressive models in time series settings, or the level of missingness of the variables. Whilst such orderings may not match the true importance of variables, we argue that there is little to be lost, and potentially much to be gained, by using them. We propose a simple scheme involving fitting a sequence of models indicated by the ordering. We show that the computational cost for fitting all models when ridge regression is used is no more than for a single fit of ridge regression, and describe a strategy for Lasso regression that makes use of previous fits to greatly speed up fitting the entire sequence of models. We propose to select a final estimator by cross-validation and provide a general result on the quality of the best performing estimator on a test set selected from among a number $M$ of competing estimators in a high-dimensional linear regression setting. Our result requires no sparsity assumptions and shows that only a $\log M$ price is incurred compared to the unknown best estimator. We demonstrate the effectiveness of our approach when applied to missing or corrupted data, and time series settings. An R package is available on github.
dc.identifier.doi: 10.17863/CAM.84739
dc.identifier.eissn: 1573-1375
dc.identifier.issn: 0960-3174
dc.identifier.uri: https://www.repository.cam.ac.uk/handle/1810/337325
dc.language.iso: eng
dc.publisher: Springer Science and Business Media LLC
dc.publisher.department: Dept of Pure Mathematics And Mathematical Statistics
dc.publisher.department: Department of Pure Mathematics And Mathematical Statistics
dc.rights: Attribution 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: stat.ME
dc.subject: stat.CO
dc.subject: stat.ML
dc.subject: 62J07
dc.title: High-dimensional regression with potential prior information on variable importance
dc.type: Article
dcterms.dateAccepted: 2022-05-16
prism.publicationName: Statistics and Computing
pubs.funder-project-id: Engineering and Physical Sciences Research Council (EP/N031938/1)
pubs.licence-display-name: Apollo Repository Deposit Licence Agreement
pubs.licence-identifier: apollo-deposit-licence-2-1
rioxxterms.type: Journal Article/Review
rioxxterms.version: VoR
rioxxterms.versionofrecord: 10.1007/s11222-022-10110-5

Files

Original bundle
Name: Stokell-Shah2022_Article_High-dimensionalRegressionWith.pdf
Size: 518.55 KB
Format: Adobe Portable Document Format
Description: Published version
Licence: https://creativecommons.org/licenses/by/4.0/