Correcting for optimistic prediction in small data sets.
Publication Date
2014-08
Journal Title
American Journal of Epidemiology
ISSN
0002-9262
Volume
180
Issue
3
Pages
318-324
Language
eng
Type
Article
This Version
VoR
Physical Medium
Print-Electronic
Citation
Smith, G., Seaman, S., Wood, A., Royston, P., & White, I. R. (2014). Correcting for optimistic prediction in small data sets. American Journal of Epidemiology, 180 (3), 318-324. https://doi.org/10.1093/aje/kwu140
Abstract
The C statistic is a commonly reported measure of screening test performance. Optimistic estimation of the C statistic is a frequent problem because of overfitting of statistical models in small data sets, and methods exist to correct for this issue. However, many studies do not use such methods, and those that do correct for optimism use diverse methods, some of which are known to be biased. We used clinical data sets (United Kingdom Down syndrome screening data from Glasgow (1991-2003), Edinburgh (1999-2003), and Cambridge (1990-2006), as well as Scottish national pregnancy discharge data (2004-2007)) to evaluate different approaches to adjustment for optimism. We found that sample splitting, cross-validation without replication, and leave-1-out cross-validation produced optimism-adjusted estimates of the C statistic that were biased and/or associated with greater absolute error than other available methods. Cross-validation with replication, bootstrapping, and a new method (leave-pair-out cross-validation) all generated unbiased optimism-adjusted estimates of the C statistic and had similar absolute errors in the clinical data set. Larger simulation studies confirmed that all 3 methods performed similarly with 10 or more events per variable, or when the C statistic was 0.9 or greater. However, with lower events per variable or lower C statistics, bootstrapping tended to be optimistic but with lower absolute and mean squared errors than both methods of cross-validation.
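The two central ideas in the abstract, the C statistic and leave-pair-out cross-validation, can be illustrated with a minimal sketch. This is not the authors' code. It assumes that the C statistic is the proportion of (event, non-event) pairs in which the event receives the higher predicted score, with ties counted as one half, and that leave-pair-out cross-validation refits the model once per such pair with both members held out. The function names, the simulated data, and the toy "difference of class means" model (standing in for the logistic models used in the paper) are all hypothetical.

```python
import random


def c_statistic(scores, labels):
    """Share of (event, non-event) pairs where the event scores higher.

    Ties count as 0.5 (an assumption of this sketch).
    """
    events = [s for s, y in zip(scores, labels) if y == 1]
    nonevents = [s for s, y in zip(scores, labels) if y == 0]
    total = 0.0
    for e in events:
        for n in nonevents:
            if e > n:
                total += 1.0
            elif e == n:
                total += 0.5
    return total / (len(events) * len(nonevents))


def fit_mean_difference(X, y):
    """Hypothetical stand-in model: project onto the difference of class means.

    Returns a function mapping one row of predictors to a score.
    """
    p = len(X[0])
    pos = [x for x, lab in zip(X, y) if lab == 1]
    neg = [x for x, lab in zip(X, y) if lab == 0]
    w = [
        sum(x[j] for x in pos) / len(pos) - sum(x[j] for x in neg) / len(neg)
        for j in range(p)
    ]
    return lambda x: sum(wj * xj for wj, xj in zip(w, x))


def leave_pair_out_c(X, y, fit):
    """Leave-pair-out estimate of the C statistic.

    For every (event, non-event) pair, refit on the remaining rows and check
    whether the held-out event outscores the held-out non-event.
    Requires at least two events and two non-events.
    """
    idx_pos = [i for i, lab in enumerate(y) if lab == 1]
    idx_neg = [i for i, lab in enumerate(y) if lab == 0]
    total = 0.0
    for i in idx_pos:
        for j in idx_neg:
            train = [k for k in range(len(y)) if k != i and k != j]
            score = fit([X[k] for k in train], [y[k] for k in train])
            s_event, s_nonevent = score(X[i]), score(X[j])
            if s_event > s_nonevent:
                total += 1.0
            elif s_event == s_nonevent:
                total += 0.5
    return total / (len(idx_pos) * len(idx_neg))


# Simulated small data set (hypothetical): 12 events, 18 non-events,
# one informative predictor and one pure-noise predictor.
random.seed(0)
y = [1] * 12 + [0] * 18
X = [[random.gauss(1.0 if lab == 1 else 0.0, 1.0), random.gauss(0.0, 1.0)] for lab in y]

# Apparent C: the model is fitted and evaluated on the same rows.
model = fit_mean_difference(X, y)
apparent_c = c_statistic([model(x) for x in X], y)

# Leave-pair-out C: every compared pair was excluded from model fitting.
lpo_c = leave_pair_out_c(X, y, fit_mean_difference)

print(f"apparent C = {apparent_c:.3f}, leave-pair-out C = {lpo_c:.3f}")
```

The apparent estimate reuses the fitting data and is therefore prone to the optimism the abstract describes, whereas the leave-pair-out estimate compares only predictions for observations the model has not seen. The cost is one refit per (event, non-event) pair, which is practical mainly for the small data sets the paper is concerned with.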
Keywords
Humans, Down Syndrome, Epidemiologic Methods, Multivariate Analysis, Data Interpretation, Statistical, Models, Statistical, Logistic Models, ROC Curve, Databases, Factual
Sponsorship
MRC (G0701619)
MRC (MR/L003120/1)
British Heart Foundation (RG/08/014/24067)
Identifiers
External DOI: https://doi.org/10.1093/aje/kwu140
This record's URL: https://www.repository.cam.ac.uk/handle/1810/277048