Sample size determination via learning-type curves

This paper is concerned with sample size determination methodology for prediction models. We propose combining the individual calculations via a learning-type curve. We suggest two distinct ways of doing so, a deterministic skeleton of a learning curve and a Gaussian process centred upon its deterministic counterpart. We employ several learning algorithms for modelling the primary endpoint and distinct measures for trial efficacy. We find that the performance may vary with the sample size, but borrowing information across sample size universally improves the performance of such calculations. The Gaussian process-based learning curve appears more robust and statistically efficient, while computational efficiency is comparable. We suggest that anchoring against historical evidence when extrapolating sample sizes should be adopted when such data are available. The methods are illustrated on binary and survival endpoints.


Introduction
Risk prediction models are routinely used in healthcare and medical research [1,2] to inform the diagnosis and/or prognosis of clinical events [3,4]. Constructing risk prediction models relies on different modelling approaches, ranging from well-established statistical methods to more recent machine learning algorithms. However, irrespective of the underlying modelling methodology, developing such models on data with an appropriate sample size is imperative for achieving robust and accurate predictive performance on a given task, such as predicting binary, continuous or time-to-event outcomes. Accurate sample size calculations are therefore necessary at the development phase to obtain reliable prediction models.
Approaches to sample size calculation can differ based on the predictive task and modelling strategy as well as the underlying study design. In this paper, we focus on risk prediction models that address binary and time-to-event outcomes, but the techniques we develop are also directly applicable to continuous outcomes. This problem is distinct from sample size calculation for estimating the accuracy of a diagnostic test, since in that setting the target accuracy may be independent of the sample size. In contrast, when developing prediction models, performance may vary with sample size, possibly because covariates that improve predictive performance can be included as the sample size increases, or because a more elaborate model may be employed.
A typical approach to determining an adequate sample size is to factor in the number of predictor variables, for example by ensuring at least ten events per predictor [5]. Whilst simple, such criteria do not consider the predictors' type, magnitude and possible values (e.g. categorical variables may require more events), often leading to poorly fitted models that fail to generalise well to out-of-sample data [6,7]. In response, recent simulation studies [6] point to additional requirements for informing the sample size estimation, which relate to the choice of modelling strategy and its expected out-of-sample performance. Riley et al. [8] worked on models where $R^2$-type measures are applicable and incorporate the expected model performance, the number of (candidate) predictors and the outcome prevalence in the target population into the sample size calculation. The performance of a prediction model varies depending on the modelling strategy and the variables included in the model, including the relationships between predictors and response variable(s).
In a clinical research setting, the statistician is typically interested in (1) calculating the optimal sample size associated with the study design before the data collection stage or (2) assessing the feasibility of an ongoing study after a certain number of samples has been accrued. To address these issues, our study focuses on estimating the predictive performance of a modelling strategy at the design stage, possibly utilising external data if applicable. Specifically, we are concerned with (i) selecting a strategy and the associated sample size for developing a prediction model, and (ii) assessing the feasibility of an ongoing study after a certain number of samples has been accrued. The technique used to develop a risk prediction model may differ depending on the method used and the variables included. For example, prediction effects and collinearity issues may affect the downstream performance of the resulting model. To assess the operating characteristics of a strategy, we propose a learning curve approach that provides performance estimates at different sample sizes and can leverage historical evidence.
A learning curve may be based on an inverse power law model that describes the performance of a modelling strategy as a function of the sample size [9,10] used for developing a prediction model. Related work in different scenarios [11,12,13] estimates the expected performance of a linear classifier for different sample sizes. However, prediction models can be highly unstable, particularly when they are developed on small samples. One may therefore benefit from leveraging publicly available data from related studies to inform the operating characteristics of a strategy's performance at substantially larger sample sizes. In particular, during the study planning stage, this may help determine the optimal sample size needed to achieve a desired accuracy in a robust manner.
In this work, we study a learning curve-based framework to estimate the expected predictive performance of a modelling strategy. We propose and evaluate approaches that utilise external data from similar studies, leading to robust estimates for a given sample size. Specifically, we employ a bootstrap strategy on external data to derive performance estimates at different (incremental) sample sizes. We treat these estimates as data and estimate the parameters of a learning curve, which in turn estimates the expected performance of the underlying model. We employ four statistical models and evaluate the proposed approaches on a series of real-world experiments, predicting binary and time-to-event outcomes. We show that extrapolation may be unstable in scenarios with limited available data and illustrate that incorporating external data via evidence synthesis can be highly beneficial at the design stage. The paper is structured as follows. Section 2 discusses the developed learning-curve framework that estimates the accuracy of prediction models at a given sample size. In Section 3, we apply this method to publicly available data with binary and survival outcomes, and Section 4 concludes with a discussion.

Methods
We introduce a learning curve meta-modelling framework to estimate the expected performance of risk prediction models (and associated strategies) conditional on sample size. Specifically, we fit a learning curve using estimates of predictive performance derived from different modelling strategies over repeated random draws of incremental sample sizes. To achieve more stable performance, we leverage information from external data and anchor against those for extrapolating to larger sample sizes. The main elements of our proposed framework are presented below.

Learning curves
Learning curves model the predictive performance (PP), $Y_{pp}$, of a prediction model as a function of the sample size $n$ used for constructing it [14]. Here we employ a regular learning curve whose expected predictive performance is non-decreasing with sample size:

$$E[Y_{pp}(n)] \le E[Y_{pp}(n')] \quad \text{for } n \le n'.$$

While there are several curve types that can be utilised for modelling such behaviour, in this paper we focus on the power law model since it typically provides a very good fit [15,16] while retaining a natural interpretation of its parameters. The functional form is given by:

$$f(n) = (1 - a) - b\,n^{-c}, \tag{1}$$

where the parameter $a$ denotes the minimum achievable error (ranging from 0 to 1), $b$ denotes the learning rate and $c$ the decay rate with $c \in (0, 1)$. For the predictive tasks considered in this paper $Y_{pp}$ is bounded in $[0, 1]$ and the curve approaches its maximum, $1 - a$, as $n \to \infty$.
In this work, we build on two distinct but related approaches for estimating a learning curve: (i) a frequentist approach fitted via Nonlinear Least Squares (NLS), and (ii) a Bayesian approach where the learning curve is modelled as a Gaussian process (GP). The NLS is implemented using the Levenberg-Marquardt algorithm [17] for weighted NLS to fit the learning curve given in equation (1). The $a$ and $c$ parameters are logit transformed for stability.
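As a rough illustration of the NLS route, the sketch below fits the power-law curve $(1 - a) - b\,n^{-c}$ by weighted Levenberg-Marquardt with $a$ and $c$ on the logit scale. It assumes SciPy's `curve_fit` as the optimiser, which the paper does not name; the helper names, starting values and synthetic data are hypothetical and are not the authors' implementation.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import expit, logit


def power_law(n, a, b, c):
    """Expected performance (1 - a) - b * n**(-c)."""
    return (1.0 - a) - b * np.power(n, -c)


def power_law_unconstrained(n, logit_a, b, logit_c):
    """Same curve, with a and c mapped from the real line into (0, 1)."""
    return power_law(n, expit(logit_a), b, expit(logit_c))


def fit_nls(sizes, mean_perf, se_perf, start=(0.3, 1.0, 0.5)):
    """Weighted NLS via Levenberg-Marquardt; returns (a, b, c)."""
    a0, b0, c0 = start  # assumed starting values, not taken from the paper
    theta, _ = curve_fit(
        power_law_unconstrained,
        sizes,
        mean_perf,
        p0=[logit(a0), b0, logit(c0)],
        sigma=se_perf,  # residuals are divided by these standard errors
        method="lm",
    )
    return expit(theta[0]), theta[1], expit(theta[2])


# Synthetic, noise-free illustration (all values are made up).
sizes = np.array([50, 100, 200, 400, 800, 1600], dtype=float)
truth = power_law(sizes, 0.25, 1.5, 0.6)
a_hat, b_hat, c_hat = fit_nls(sizes, truth, np.full(sizes.shape, 0.01))
fitted = power_law(sizes, a_hat, b_hat, c_hat)
```

Working on the logit scale keeps $a$ and $c$ inside $(0, 1)$ without box constraints, which the Levenberg-Marquardt routine does not support directly.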
The Bayesian approach is based upon a Gaussian process, assuming the model performance is distributed as:

$$Y_{pp}(n) \sim N\big(g(n), \sigma_y^2\big), \tag{2}$$

where $\sigma_y$ denotes the standard deviation and $g(n)$ is a Gaussian process with mean $\mu(n) = (1 - a) - b\,n^{-c}$ and covariance matrix $\Sigma$. We consider $\Sigma_{ij} = \phi^2 \exp(-\rho\,|n_i - n_j|^2)$ and complete the model with equation (2). We place beta priors on $a$ and $c$ and a normal distribution on $b$. Note that, since $Y_{pp} \in [0, 1]$, we could have transformed $Y_{pp}$ to $\mathbb{R}$, but we found this to be unnecessary in our application.
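To make the GP specification concrete, the following sketch builds the covariance $\Sigma_{ij} = \phi^2 \exp(-\rho\,|n_i - n_j|^2)$ around the power-law mean and draws one noisy learning curve from it. The parameter values, the grid of sample sizes and the small diagonal jitter are assumptions chosen for illustration only; no priors or posterior inference are shown.

```python
import numpy as np


def mean_curve(n, a, b, c):
    """Power-law mean mu(n) = (1 - a) - b * n**(-c)."""
    return (1.0 - a) - b * np.power(n, -c)


def sq_exp_cov(n1, n2, phi, rho):
    """Covariance phi^2 * exp(-rho * |n_i - n_j|^2) between two size grids."""
    d = np.subtract.outer(np.asarray(n1, float), np.asarray(n2, float))
    return phi**2 * np.exp(-rho * d**2)


rng = np.random.default_rng(0)
sizes = np.linspace(50, 2000, 40)

# Illustrative (made-up) parameter values.
mu = mean_curve(sizes, a=0.25, b=1.5, c=0.6)
Sigma = sq_exp_cov(sizes, sizes, phi=0.02, rho=1e-5)

# A small jitter keeps the matrix numerically positive definite.
jitter = 1e-10 * np.eye(sizes.size)
g = rng.multivariate_normal(mu, Sigma + jitter)

# Observation noise with sigma_y = 0.01 gives one simulated set of estimates.
y = g + rng.normal(0.0, 0.01, size=sizes.size)
```

Larger $\rho$ lets the curve deviate from the power-law mean more locally, while $\phi$ controls how far it can deviate at all.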
While superficially there are similarities between our work and the methods of [11] and [12] who also use NLS to construct a deterministic learning curve, our approach facilitates several additional methodological features mostly related to the Gaussian process and anchoring against external data for both linear and non-linear models.

Other learning curve components
The estimated learning curve parameters generally depend on the data, the sampling method, the predictive tasks, and the modelling approach. An essential component of developing risk prediction models is the choice of appropriate evaluation criteria that assess the predictive and/or discriminative ability of the developed models for clinical decision support/making. Commonly used metrics include $R^2$, the Brier score and the Area Under the Receiver Operating Characteristics Curve (AUC), among others. In this study, we measure the predictive performance $Y_{pp}$ using the C-statistic, commonly used in clinical studies to evaluate the accuracy of risk prediction models [18]. For binary outcomes, it is equivalent to the AUC [19]. For time-to-event outcomes we use the censoring-adjusted C-statistic [20] as a prediction measure.
A crucial aspect for the model's generalisability and downstream predictive performance is the underlying modelling methodology. Risk prediction models typically build on well-established approaches from statistical modelling and machine learning such as (group) lasso [21,22], support vector machines [23] and ensemble methods [24,25,26,27], which achieve high predictive performance by combining multiple predictors. The choice of an appropriate modelling methodology can rely on many different factors that relate to the study design, dataset properties and the predictive task at hand, and the approach we propose may be used with any such method.
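For binary outcomes the C-statistic can be computed directly as the proportion of event/non-event pairs in which the event receives the higher predicted risk, counting ties as one half. The short sketch below shows this pairwise definition; the function name and toy data are hypothetical, and the censoring-adjusted version used for time-to-event outcomes is not covered.

```python
import numpy as np


def c_statistic(y_true, risk):
    """Concordance for a binary outcome: P(risk of event > risk of non-event)."""
    y_true = np.asarray(y_true)
    risk = np.asarray(risk, dtype=float)
    pos = risk[y_true == 1]
    neg = risk[y_true == 0]
    # Compare every event with every non-event; ties count as half.
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size


# Toy example: three of the four event/non-event pairs are ordered correctly.
example = c_statistic([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```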

Algorithm
Fitting a learning curve of a model's predictive performance, $Y_{pp}$, as a function of the sample size requires a series of performance estimates obtained from $m$ sub-samples with sizes $\{s_1, s_2, \ldots, s_m\}$, drawn from the complete data of size $N$; we refer to the sub-sample of size $s_m$ as $n_m$. The sub-samples are drawn at random and increase in size from as small as $s_1 = 50$ up to a maximum of $s_m = N$.
In this work, we use $m = 50$ such sample sizes, at which we train and evaluate predictive models and, in turn, fit the learning curve using the resulting performance estimates. Note that, for each sample size, we repeatedly sampled $k$ times without replacement. Specifically, for each sub-sample $n_{mk}$ we obtain a performance estimate $Y_{pp}^{mk}$, where $n_m$ denotes a particular sample with size $s_m$ from the set $m$, and $k$ indexes the repeated random draws, $k \in \{1, 2, \ldots, 100\}$. We develop and validate the modelling strategy via a stratified split of each $n_{mk}$ into 70% training and 30% validation data, the latter being used to obtain $Y_{pp}^{mk}$. More formally, the procedure is as follows:
1. Select the number of sample sizes $m$ used to fit the learning curve.
2. For each size $s_m$ of the set $m$:
(a) Randomly sample $n_{mk}$ of size $s_m$ from the complete data of size $N$ without replacement.
(b) Perform a random stratified 70/30 split of $n_{mk}$ into $D_{train}$ for training and $D_{test}$ for validating the modelling strategy.
(c) Fit a model with a pre-specified modelling strategy using the training set $D_{train}$.
(d) Validate the fitted model on $D_{test}$ and calculate the predictive performance $Y_{pp}^{mk}$.
(e) Repeat the previous steps $k = 100$ times for every size $s_m$.
3. Fit a learning curve to the obtained $Y_{pp}^{mk}$ estimates.
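The repeated sub-sampling procedure can be sketched as follows for a binary outcome. This is a simplified illustration under stated assumptions: the data are synthetic, `centroid_scorer` is a hypothetical stand-in for the pre-specified modelling strategy, and only a handful of sizes and repeats are used to keep the example fast.

```python
import numpy as np


def auc(y, score):
    """C-statistic for a binary outcome (ties count as half)."""
    diff = score[y == 1][:, None] - score[y == 0][None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size


def stratified_split(y, train_frac, rng):
    """Return train/test indices keeping the class proportions."""
    parts = []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        parts.append(idx[: int(round(train_frac * idx.size))])
    train = np.concatenate(parts)
    test = np.setdiff1d(np.arange(y.size), train)
    return train, test


def centroid_scorer(X_train, y_train, X_test):
    """Hypothetical stand-in model: project onto the class-mean difference."""
    direction = X_train[y_train == 1].mean(axis=0) - X_train[y_train == 0].mean(axis=0)
    return X_test @ direction


def performance_estimates(X, y, sizes, fit_predict, k=100, seed=0):
    """Matrix of validation C-statistics, one row per size, one column per draw."""
    rng = np.random.default_rng(seed)
    out = np.full((len(sizes), k), np.nan)
    for i, s in enumerate(sizes):
        for j in range(k):
            sub = rng.choice(y.size, size=s, replace=False)  # without replacement
            Xs, ys = X[sub], y[sub]
            tr, te = stratified_split(ys, 0.7, rng)
            # Skip draws where a class is missing from either part.
            if np.unique(ys[tr]).size < 2 or np.unique(ys[te]).size < 2:
                continue
            out[i, j] = auc(ys[te], fit_predict(Xs[tr], ys[tr], Xs[te]))
    return out


# Synthetic data for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 3))
prob = 1.0 / (1.0 + np.exp(-(X @ np.array([1.0, -1.0, 0.5]))))
y = rng.binomial(1, prob)

sizes = [50, 100, 200, 400, 600]
perf = performance_estimates(X, y, sizes, centroid_scorer, k=20)
mean_perf = np.nanmean(perf, axis=1)  # inputs for the learning-curve fit
```

The row means (and their standard errors) are the quantities to which a learning curve would then be fitted.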

Transfer learning strategy
Extrapolating beyond the observed range of the data, typically in scenarios with limited sample size, can lead to highly unstable predictions and should be done with caution. In such cases, anchoring upon data from an external study allows for robust extrapolation to large sample sizes. In particular, at the study planning stage one may fit a learning curve, $f(n)_{ext}$, to external but related data. The degree of relatedness may be assessed using commonly used methods. These can vary from trivial similarity, such as the same (say binary) outcome, to disease type, to more stringent similarity based on inclusion-exclusion criteria.
In the case of estimating the predictive performance of a modelling strategy via NLS, we assume the learning curves fitted on the target and external data have a similar shape but a different scale. Therefore, we first pre-fit a learning curve on the external data, $f(n)_{ext}$, and on the small-sample target data, $f(n)_{target}$. For the estimation, we transfer information on the learning rate $b$ and the decay rate $c$ from the external data, while learning the increment on $a$ from the target data. For a given sample size $n$, the estimated predictive performance is calculated as:

$$\hat{Y}_{pp}(n) = (1 - a_{target}) - b_{ext}\,n^{-c_{ext}},$$

where $a_{target}$ is estimated from the target data, while $b_{ext}$ and $c_{ext}$ are estimated from the external data.
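One simple way to realise the NLS anchoring is to hold $b$ and $c$ at their external estimates and solve for $a$ by least squares on the target data, which has a closed form. The sketch below makes that assumption explicitly (unweighted, no logit transform); the function names and all numbers are hypothetical, and the authors' exact estimation of the increment on $a$ may differ.

```python
import numpy as np


def anchored_a(sizes, perf, b_ext, c_ext):
    """Least-squares estimate of a with b and c fixed at external values.

    Minimising sum((perf - (1 - a) + b * n**(-c))**2) over a gives
    1 - a = mean(perf + b * n**(-c)).
    """
    sizes = np.asarray(sizes, dtype=float)
    perf = np.asarray(perf, dtype=float)
    return 1.0 - np.mean(perf + b_ext * np.power(sizes, -c_ext))


def predict_anchored(n, a_target, b_ext, c_ext):
    """Extrapolated performance (1 - a_target) - b_ext * n**(-c_ext)."""
    return (1.0 - a_target) - b_ext * np.power(n, -c_ext)


# Made-up external estimates and a small, noise-free target sample.
b_ext, c_ext = 1.5, 0.6
target_sizes = np.array([30.0, 35.0, 40.0, 45.0, 50.0])
target_perf = (1.0 - 0.30) - b_ext * np.power(target_sizes, -c_ext)

a_target = anchored_a(target_sizes, target_perf, b_ext, c_ext)
pred_full = predict_anchored(1640.0, a_target, b_ext, c_ext)
```

Because only the level of the curve is learned from the target data, the extrapolation inherits its shape entirely from the external study.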
Gaussian processes naturally facilitate the synthesis of different sources of evidence. We use the posterior of $b$, $c$ and $\phi$ estimated from external data as the prior when fitting the learning curve to the target data. The predicted performance of $Y_{pp}$ at a sample $n_i$ is used to predict the accuracy of the chosen modelling strategy at a larger sample size for the target data. Given the marginal estimates of $\mu_p$ at $n_p$ and $\mu_o$ at $n_o$, the predicted value $y_p$ is given by:

$$y_p = \mu_p + \Sigma_{po}\,\Sigma_{oo}^{-1}\,(y_o - \mu_o),$$

where $y_o$ denotes the observed performance and $\Sigma$ the covariance matrix. The subscript $o$ denotes the observed data and $p$ denotes the data points to be predicted. In contrast to NLS, using the GP means that the uncertainty around the predicted value can naturally be obtained via the covariance matrix.
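The GP prediction step is the standard Gaussian conditioning formula, sketched below with fixed (made-up) parameter values rather than posterior draws. Adding $\sigma_y^2$ to the diagonal of the observed block is an assumption about how observation noise enters; the function names are hypothetical.

```python
import numpy as np


def sq_exp_cov(n1, n2, phi, rho):
    """Covariance phi^2 * exp(-rho * |n_i - n_j|^2)."""
    d = np.subtract.outer(np.asarray(n1, float), np.asarray(n2, float))
    return phi**2 * np.exp(-rho * d**2)


def gp_predict(n_obs, y_obs, n_new, mean_fn, phi, rho, sigma_y):
    """Conditional mean and covariance of the curve at n_new given y_obs."""
    n_obs = np.asarray(n_obs, float)
    n_new = np.asarray(n_new, float)
    mu_o, mu_p = mean_fn(n_obs), mean_fn(n_new)
    # Observed block includes the noise variance (an assumption of this sketch).
    S_oo = sq_exp_cov(n_obs, n_obs, phi, rho) + sigma_y**2 * np.eye(n_obs.size)
    S_po = sq_exp_cov(n_new, n_obs, phi, rho)
    S_pp = sq_exp_cov(n_new, n_new, phi, rho)
    mean = mu_p + S_po @ np.linalg.solve(S_oo, np.asarray(y_obs, float) - mu_o)
    cov = S_pp - S_po @ np.linalg.solve(S_oo, S_po.T)
    return mean, cov


def mean_fn(n):
    """Power-law mean with made-up parameters a = 0.27, b = 1.5, c = 0.6."""
    return (1.0 - 0.27) - 1.5 * np.power(n, -0.6)


n_obs = np.array([30.0, 35.0, 40.0, 45.0, 50.0])
y_obs = mean_fn(n_obs) + 0.02  # observations sitting slightly above the mean
n_new = np.array([100.0, 1640.0])

mean, cov = gp_predict(n_obs, y_obs, n_new, mean_fn, phi=0.03, rho=1e-4, sigma_y=0.01)
half_width = 1.96 * np.sqrt(np.diag(cov))  # pointwise 95% band for the curve
```

With this kernel, predictions far from the observed sizes revert to the mean curve, which is why the quality of the (anchored) power-law mean matters most for long-range extrapolation.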

Experimental Design
Our application focuses on two clinical settings: (i) estimating the sample size needed at the study design stage and (ii) evaluating the feasibility of achieving a specific performance for an ongoing study with limited samples accrued. We demonstrate our framework using two distinct breast cancer datasets as described below.

Breast cancer use cases
Consider a typical scenario in clinical trials where an investigator aims to design a study to explore the predictive abilities of different modelling strategies. In this context, we focus on predicting breast cancer outcomes and select the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) [28] study as external data. METABRIC is an extensive dataset, consisting of detailed genomic profiles and clinical data from breast cancer patients. In particular, the genomic data include mRNA expression and copy number alterations (CNA), while the clinical data encompass patient outcomes and treatment responses from patients across various subtypes.
For our limited (target) data, we select a subset of the Memorial Sloan Kettering Cancer Center (MSK) breast cancer data [29]. The complete dataset consists of genomic profiles (CNA) obtained from targeted sequencing of tumour and normal sample pairs from 1,918 breast cancers from 1,756 patients. The genomic data are paired with clinical data that include patient outcomes and treatment responses. In our analysis, we experiment with a subset of the MSK dataset, attempting to extrapolate from it to larger sample sizes by anchoring it to the METABRIC data. The clinical and genomic data for both the METABRIC and MSK datasets are publicly available and were obtained from cBioPortal [30,31].
We consider both binary (five-year survival status) and time-to-event (overall survival) outcomes. The models incorporate variables reported by Margolin et al. [32]. In the case of METABRIC this includes age at diagnosis, tumour size, number of positive lymph nodes, tumour grade, oestrogen receptor (ER) status, progesterone receptor (PR) status, human epidermal growth factor receptor 2 (HER2) status, radiotherapy, and hormone therapy. After excluding missing observations, the METABRIC dataset includes 1978 participants, with 427 (21.6%) death events within the first five years and 1144 (57.8%) over the complete follow-up period, resulting in a median survival of 13 years. In terms of analysis, we followed Curtis et al. [28] and used a pre-selected set of the input CNA features. Specifically, we used the cis-acting genes most significantly associated with CNAs, as determined by a gene-centric ANOVA test. To simplify the computational analysis, we selected the genes with the most significant Bonferroni-adjusted p-values from the Illumina database containing 30566 probes. After missing-data removal, the input data sets consisted of 1000 categorical CNA features.
In the case of MSK, clinical variables include age at diagnosis, tumour stage, ER status, HER2 status, and PR status. The MSK dataset, after removing missing observations, comprises 1640 participants; 175 (10.7%) died within 5 years, and 343 (20.9%) overall, with a median survival of 14.8 years. We present the Kaplan-Meier curves of both datasets in Figure S1 in the Appendix. We assess accuracy using the C-statistic: the AUC for binary outcomes and Uno's C at 10 years for time-to-event outcomes.

Learning curve by sample size
We used the METABRIC data to explore the baseline behaviour of the learning curve over different total sample sizes $N \in \{100, 150, 300, 450, 600, 900, 1200, 1500, 1978\}$. We used logistic regression with the clinical variables and a total of $m = 50$ data points, with $s_1 = 50$ being the smallest sample size (that of $n_1$). Both NLS and GP were used to fit a learning curve. To further investigate the impact of the number of data points on fitting the learning curve, a series of $m \in \{5, 10, 20, 30, 40, 50\}$ was explored at total sample sizes $N \in \{100, 300, 600\}$ on the METABRIC data with the same modelling strategy described above.

Leveraging multi-modal data and models
In clinical research it is vital to select a modelling strategy, set of variables and prediction model that achieve good and robust performance with as small a sample size as possible. Under such circumstances, this performance may be further improved by integrating different data modalities collected from the same patients. We explore the accuracy of different models in predicting the METABRIC 5-year survival status using (1) clinical data only and (2) clinical data plus copy number alteration (CNA) variables. The clinical variables included in the prediction model were outlined in Section 3.1. In this setting we used all the METABRIC data and a total of $m = 50$ data points, with $s_1 = 50$ being the smallest sample size (that of $n_1$).
We evaluate prediction models developed using four different approaches: (1) Elastic net (ElNet) [33], (2) Support vector machines (SVMs) [23], (3) Random Forests (RF) [24] and (4) Gradient Boosted Trees (LGBM) [25,34]. All models, from each of the four methods, have their hyper-parameters optimised and are calibrated. For the hyper-parameter optimisation procedure we use Bayesian optimisation search. Specifically, before performing the $k = 100$ training repetitions, we perform one extra repetition at each sample size, which we use for a cross-validated search over the hyper-parameters of a given model for the optimal C-statistic. We use the resulting hyper-parameter values for the subsequent repetitions. Each model, at every repeat, is calibrated based on Platt's logistic model [35] using internal validation. This is largely redundant for logistic regression but we retain the same approach for consistency.

Sample size extrapolation
We consider a scenario where limited data are available and a feasibility analysis is planned: the study may terminate early or recruit more patients if the expected performance of the prediction model seems feasible and appropriate. It is hard to fit a robust model with a small sample size, so we may borrow information from external data as described above. We illustrate this task using a subset of the MSK study as target data and information from METABRIC as an anchor for extrapolation. Logistic regression and the Cox model were selected as prediction models using the available clinical variables. We assume that an interim analysis to evaluate the feasibility of the proposed modelling strategy will be performed after limited samples become available from the MSK study. We choose $m = 20$ data points and $s_1 = 30$ as the smallest sample size, assuming a total of $N \in \{50, 80, 100, 150, 300, 450\}$ samples were available from the MSK study. The value $s_1 = 30$ was chosen in this example to allow for 20 data points when the total sample size is $N = 50$.

Results
Figure 1 presents the two (NLS and GP) C-statistic learning curves as a function of sample size. Both GP and NLS work well, with most predicted C-statistics falling within the 95% CIs across sample sizes. The GP learning curve appears more robust, especially for sample sizes larger than 100, while the NLS method stabilises only beyond a sample size of 300. The shape of both learning curves was similar and the improvement in the C-statistic saturated at sample sizes higher than 1000. It is apparent that a C-statistic of 0.8 (or more) is not feasible with the current modelling strategy and the problem at hand. Table 1 presents the point estimates and standard errors of the coefficients of both learning curves. The standard errors are higher for the NLS curve and its point estimates stabilise only at larger sample sizes, suggesting that the GP curve is statistically more efficient. In this setting, a total of $m = 50$ data points were used. Table A1 and Figure S2 in the Appendix reveal that the proposed approach is reasonably robust to a smaller series of sample sizes (data points), such as $m = 5$, with the GP curve being more efficient.
Figure 2 presents the learning curves with the predicted C-statistics of different modelling strategies in a uni-modal (clinical data alone) and a multi-modal (clinical data plus CNA data) setting. The NLS and GP-based learning curves were similar across sample sizes. Including the CNA data in the prediction model did not considerably improve the accuracy in this use case. All models performed similarly at large sample sizes, while the LGBM slightly under-performed at smaller samples. The ElNet and SVM reached a plateau at 1000 samples, while accuracy improvement was minimal when more data were added to the other models.
Figure 3 presents the behaviour of NLS and GP learning-curve models on the MSK cohort, with and without anchoring to the METABRIC data. For binary outcomes, both GP and NLS can produce curves that are effective and stable, independently of anchoring. Specifically, while NLS generally stabilises at sample sizes of 80 and larger, GPs exhibit reasonable performance at sample sizes of 50. However, for survival outcomes, anchoring can lead to additional stability, especially at small sample sizes, in the case of GP learning curves. This, however, is not the case for NLS curves, which tend to diverge at sample sizes smaller than 300.
In a case study, the MSK study in this example, an interim analysis was performed after accruing a total of 50 samples. The investigator was interested in predicting the accuracy of the model at 1640 samples (the full MSK sample). Without anchoring, the logistic model predicted C-statistics of 0.90 (95% CI 0.56-1.23) and 0.74 (95% CI 0.69-0.79) for the NLS and GP methods, respectively. When anchoring against METABRIC data, the predicted C-statistics were 0.98 and 0.73 (95% CI 0.72-0.73) for the NLS and GP methods. The raw C-statistic in the random draws from the total MSK sample was 0.76 (95% CI 0.70-0.81). For the survival model without anchoring, the predicted C-statistics were 0.83 (95% CI 0.49-1.16) and 0.61 (95% CI 0.54-0.68) for the NLS and GP methods, respectively. When anchoring against METABRIC data, the predicted C-statistics were 1.00 and 0.70 (95% CI 0.69-0.70) for the NLS and GP methods. The raw C-statistic in the random draws from the total MSK sample was 0.69 (95% CI 0.64-0.73).

Discussion
We proposed a flexible and robust approach to sample size prediction based on learning curves. The goal is to estimate the expected predictive performance of a model (and a modelling strategy) at a given sample size in the design stage. Our approach has several practical benefits: it can be employed for model selection by analysing the performance of different prediction models and, as such, used in an ongoing study to inform its feasibility or futility. Specifically, we propose two variants of a learning curve, showing that the Bayesian GP-based version can achieve better performance at small to moderate sample sizes. The learning curves obtained with NLS were considerably different at sample sizes $N = 100$ and $N = 150$, which is reflected in the NLS parameter estimates. At a small sample size, the C-statistics were still increasing before the inflection point. The parameters of the NLS were bounded to the parameter space by parameterisation, which may lead to convergence issues. These differences, on the other hand, were negligible at large sample sizes. We also showed that the stability and robustness of the developed learning curves could be further improved by anchoring the fitting process against a larger study on related data, where available.
We explored the value of adding different data modalities such as genetic data, which in our experiments led to small gains in predictive performance. Implicitly and more broadly, our approach allows for assessing the gain and the cost of collecting distinct data modalities at the planning stage. As a result, it appears sensible for one to modify the prioritisation of the research question in order to maximise the expected benefit.
The proposed approach is general, modular and flexible. We used the C-statistic for assessing the predictive performance, but different scoring rules, like the Brier score, may also be used. Moreover, our approach allows for principled analysis of modelling strategies. As the sample size increases, different statistical and machine learning algorithms may be investigated for model development. Within-model variable selection can also be explored in a similar manner. We employed a 70/30 train/test split when obtaining predictive performance estimates but different rules, such as 80/20 splits, may also be used.
Figure 3: Predicted C-statistics of learning curves of MSK with GP and NLS, with (solid line) and without (dashed line) anchoring to METABRIC data, for binary and time-to-event outcomes using logistic regression and a survival model, respectively. The shaded area is the 95% confidence interval of C-statistics from repeated random draws using all MSK samples without extrapolation.
Finally, in the presence of limited data, we suggest anchoring on a larger study of similar data with a GP approach. The use of external data is routinely employed in different extrapolation settings like survival analysis and makes intuitive sense. It may be hard to select the external data and quantify the similarity to the target data. The choice of external data can rely on expert (e.g. clinical) opinion or on inspecting the definition of the populations, such as inclusion-exclusion criteria and the relatedness of the endpoints. In the absence of high-quality clinical trial data, real-world evidence could be a useful alternative. From the theoretical standpoint, one could define an appropriate loss function that assesses fidelity to the distinct data sources via appropriate weights, and this is the subject of current research.
size N ∈ {100, 300, 600}.We use METABRIC data and evaluate a modelling strategy with a logistic regression model trained and evaluated only with clinical data on a task of predicting a binary outcome.
Figure ?? presents the two (NLS and GP) C-statistic learning curves as a function of sample size. Both GP and NLS work well, with most predicted C-statistics falling within the 95% CIs for most sample sizes. Table ?? presents the point estimates of the learning-curve model. The GP model appears more robust even with only 5 data points and a total sample size of 100. GP and NLS gave similar results once the total sample size was 300 or more.


Figure 1: Predicted C-statistics of the GP and NLS learning curves fitted to the METABRIC data alone, using logistic regression for binary outcomes at different sample sizes (lines), with the 95% confidence interval of the raw results using all samples (shaded area).

Table 1: Point estimates and standard errors (SE) of the GP and NLS learning-curve parameters at different total sample sizes.