Bayesian modelling strategies for borrowing of information in randomised basket trials

Abstract Basket trials are an innovative precision medicine clinical trial design evaluating a single targeted therapy across multiple diseases that share a common characteristic. To date, most basket trials have been conducted in early-phase oncology settings, for which several Bayesian methods permitting information sharing across subtrials have been proposed. With the increasing interest in implementing randomised basket trials, information borrowing could be exploited in two ways: by considering the commensurability of either the treatment effects or the outcomes specific to each of the treatment groups between the subtrials. In this article, we extend a previous analysis model based on distributional discrepancy for borrowing over the subtrial treatment effects ('treatment effect borrowing', TEB) to borrowing over the subtrial groupwise responses ('treatment response borrowing', TRB). Simulation results demonstrate that both modelling strategies provide substantial gains over an approach with no borrowing. TRB outperforms TEB on all operational characteristics, especially when subtrial sample sizes are small, while TEB has considerable gains in performance over TRB when subtrial sample sizes are large, or when the treatment effects and groupwise mean responses are noticeably heterogeneous across subtrials. Further, we notice that TRB and TEB can potentially lead to different conclusions in the analysis of real data.

K E Y W O R D S
biomarker-guided trial, master protocol, personalised medicine, precision medicine, randomised controlled basket trial

INTRODUCTION
Improved understanding of molecular biology has facilitated rapid advances in the identification of patient subgroups with better response to a targeted therapy. For instance, ado-trastuzumab emtansine in HER2-amplified ovarian and lung cancers (Li et al., 2018) and pembrolizumab in metastatic, microsatellite-instability-high cancer patients (Lemery et al., 2017) have demonstrated efficacy leading to FDA approval. Consequently, several innovative trial designs have emerged to efficiently evaluate targeted therapies under a single master protocol, with the potential to expedite drug development (Woodcock & LaVange, 2017). The core idea in these designs is to (i) stratify patients with a given disease based on the disease characteristics most predictive of treatment response and evaluate several linked targeted therapies (umbrella trials) (Woodcock & LaVange, 2017), or (ii) stratify patients with a shared potential therapeutic target in different underlying conditions based on their disease subgroups, evaluating a single therapy (basket trials) (Kaizer et al., 2019; Siden et al., 2019; Woodcock & LaVange, 2017). Trial designs (i) and (ii) can be randomised or non-randomised and can allow the addition of arms and the incorporation of adaptive design considerations. This paper will focus on basket trials that evaluate a single treatment for multiple disease indications with a common characteristic (Tao et al., 2018; Woodcock & LaVange, 2017), which could be a common genomic biomarker, mechanism of drug activity or clinical symptom that the treatment targets. Figure 1 illustrates randomised and non-randomised basket trial designs. An example of a basket trial is that of the TRK inhibitor Larotrectinib in 17 unique rare TRK fusion-positive cancers (Drilon et al., 2018). This study enrolled both paediatric and adult patients and reported a 75% overall response rate, demonstrating the utility of basket designs in streamlining drug development in rare diseases.
In general, statistical analysis of basket trials follows one of three approaches: (i) stand-alone analysis, where subtrials are conducted and analysed separately as if they are individual studies (an example is the imatinib basket trial [Heinrich et al., 2008]); (ii) pooled analysis, where all patients are considered 'homogeneous' and thus combined into one pooled data set, as in the basket trial of Larotrectinib (Drilon et al., 2018); or (iii) an analysis strategy that accounts for potential heterogeneity in treatment response across subtrials (Kaizer et al., 2019; Pohl et al., 2021). In practice, considerable heterogeneity in treatment response across subgroups within a basket trial can be observed. An example is the basket trial by Hyman et al., where vemurafenib displayed efficacy in BRAF-V600-mutant NSCLC and Erdheim-Chester disease patients but not in colorectal cancer patients (Hyman et al., 2015). Consequently, there is a need to account for potential heterogeneity in both design and analysis; one basket trial accounting for heterogeneity is the ongoing Chinese Bayesian basket trial of Danzhu Fuyuan Granule as Adjunctive Therapy for Chronic Stable Angina, Vascular Dementia, and Idiopathic Membranous Nephropathy (NCT04498962). Several statistical analysis methods accounting for potential heterogeneity (type [iii] analyses above) in basket trials have been proposed and/or implemented, using Bayesian approaches, frequentist approaches or a combination of both. Such methods enable borrowing of information across subtrials; consequently, statistical power may be increased for detecting that a treatment benefits a specific subgroup. This could be particularly important when a basket trial involves rare subgroups. Pohl et al. present a comprehensive review of existing statistical analysis methods that incorporate borrowing of information in basket trials (Pohl et al., 2021).
Notable Bayesian methods include Bayesian hierarchical modelling (Berry et al., 2013), EXNEX (Neuenschwander et al., 2016), Simon's basket design (Simon et al., 2016), calibrated Bayesian hierarchical models (Chu & Yuan, 2018a), the Bayesian latent subgroup design (Chu & Yuan, 2018b), Bayesian model averaging (Asano & Hirakawa, 2020; Psioda et al., 2019), Bayesian cluster hierarchical models (Chen & Lee, 2020), and the commensurate predictive prior (CPP) approach (Zheng & Wason, 2022). Notably, the majority of these methods consider a single-arm, non-randomised trial setting that is often used in early-phase oncology trials.
Recently, there has been increased interest in conducting randomised precision medicine trials, whenever possible, to improve the chances of a successful phase III trial (Saad et al., 2017), even in rare diseases (Lasch et al., 2017; Prasad & Oseran, 2015). The potential and growing use of basket trials outside of oncology carries even greater prospects for randomisation as an important feature in the design of future basket trials. Currently, randomised basket trials have been implemented in oncology (IMPACT II trial, NCT02152254), HIV/AIDS (Moore et al., 2021), autoimmune and neurological disorders such as primary biliary cholangitis (PBC) and Parkinson's disease (Zheng & Wason, 2022), and auto-inflammatory syndromes such as familial Mediterranean fever and TRAPS (De Benedetti et al., 2018); they have also been proposed for immune-mediated inflammatory diseases (Grayling et al., 2021). Evidently, the development of statistical methodology for randomised basket trials will be of great importance.
A key question for randomised basket trials is how borrowing of information is handled in the analysis. In general, approaches to borrowing information involve consideration of how similar the treatment response is across subtrials. In a two-arm setting, this could be achieved by considering either the treatment effects across subtrials (referred to as 'Treatment Effect Borrowing', TEB) or the responses per treatment arm (later called groupwise mean responses in this paper, and referred to as 'Treatment Response Borrowing', TRB) across subtrials. This raises two important issues. Firstly, a suitable method for TRB needs to be proposed for randomised basket trials. Secondly, the question of which borrowing approach is better is important; it is plausible that either method might be best in different situations. It is these points that this paper seeks to address. This paper builds on work by Zheng and Wason (2022), which proposed a CPP approach that models the commensurability of treatment effects to leverage information from complementary subtrials (i.e., it facilitates TEB). We propose a new, modified implementation of their methodology that models the commensurability of the groupwise mean responses to permit borrowing of information across subtrials (i.e., we extend their approach to allow for TRB). The groupwise mean responses refer to the mean response in either the control or experimental treatment arm within a subtrial. It is expected that this new TRB approach could outperform the original TEB modelling strategy if the response in one arm (e.g., the experimental treatment) is more similar across subtrials than in the other (e.g., the control). We then evaluate the new and existing Bayesian modelling strategies for borrowing of information in randomised basket trials under different scenarios, through a comprehensive simulation study.
Our research is motivated by a phase II randomised basket trial (hereafter referred to as the OACS trial), which evaluates the efficacy of Obeticholic acid for treating cognitive dysfunction in three distinct patient subgroups: early-stage PBC, late-stage PBC and Parkinson's disease. Each patient subgroup has a corresponding subtrial in which patients are randomised equally to a new targeted treatment or placebo. The primary outcome is the change in a composite score that measures the patients' overall cognitive performance after 26 weeks of treatment, which is regarded as a continuous endpoint. Each subtrial is to be analysed separately; a secondary analysis will allow borrowing of information across subtrials. The exact model to be used, and whether TEB or TRB will be carried out, is currently not determined. This work is part of ongoing research to ascertain which approach will be used.
The remainder of this article is organised as follows. In Section 2 we elaborate on our proposed TRB approach, provide an overview of the TEB methodology, and describe the simulation study including considered scenarios and trial characteristics. We also provide a description of case studies that will be used to further investigate the relative merits of the methods considered. Section 3 presents our results and Section 4 a discussion.

METHODS
In this paper, we focus on the commensurate prior approach (Zheng & Wason, 2022) to implementing borrowing of information across patient subgroups in randomised basket trials based on two modelling strategies. To be more precise, we parameterise the level of commensurability in terms of (i) the subtrial-specific treatment effects (TEB) or (ii) the mean responses by treatment groups across the subtrials (TRB). We elaborate on an existing commensurate prior approach incorporating TEB in Section 2.1 (Zheng & Wason, 2022), and develop a new strategy for TRB in Section 2.2. Both will be compared with stand-alone analyses (i.e., no borrowing of information; referred to as 'No Borrowing', NB) in a comprehensive simulation study.

Treatment effect borrowing
Consider a basket trial that enrols patients in K subgroups, where n_k patients are randomised (equally) to receive a new treatment or a control in subtrial k = 1, …, K. The primary endpoint is a continuous outcome, denoted by y_ik, collected from each patient i = 1, …, n_k in each subtrial k = 1, …, K. Let θ_k be the subtrial-specific treatment effect and T_ik a binary treatment assignment indicator; T_ik = 1 if patient i in subtrial k receives the experimental treatment and T_ik = 0 otherwise (control). For simplicity, we consider here a regression model that includes just the intercept and treatment effect parameters. Then, for each subtrial k = 1, …, K, we have

y_ik = β_0k + θ_k T_ik + ε_ik,  ε_ik ~ N(0, σ²),  i = 1, …, n_k,

where θ_k quantifies the experimental treatment benefit over control. We use k* to label a specific subtrial of local estimation interest. We wish to assess the degree of commensurability between θ_k* and θ_k, for k ≠ k*, to determine how much information we can leverage from complementary subtrial k. We introduce a precision parameter, denoted by ν_kk*, which captures the degree of consistency between the two location parameters θ_k* and θ_k. Then, θ_k* has a normal predictive prior centred on the complementary subtrial parameter θ_k, such that

θ_k* | θ_k, ν_kk* ~ N(θ_k, 1/ν_kk*).

We place a spike-and-slab prior (Hobbs et al., 2012; Mitchell & Beauchamp, 1988) g(ν_kk*) on each precision parameter ν_kk*. The spike-and-slab prior is a discrete mixture distribution defined as locally uniform between two limits 0 ≤ ν_l ≤ ν_u (the slab component), and with a probability mass concentrated at a point S > ν_u (the spike component). Explicitly, the prior assumes that

ν_kk* ~ π_kk* · Uniform(ν_l, ν_u) + (1 − π_kk*) · δ_S,

where π_kk* denotes the prior probability that ν_l ≤ ν_kk* ≤ ν_u and δ_S denotes a point mass at S. A small π_kk* corresponds to a high probability that ν_kk* takes the large value S, thus allowing for strong borrowing of information.
When the complementary subtrials are incommensurate, a greater probability mass is allocated to the slab prior through a large π_kk*; hence, information from the complementary subtrial is discounted.
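To make the spike-and-slab construction concrete, the following is a minimal Python sketch of sampling precision parameters from such a prior. It is illustrative only (the paper's analyses use R/JAGS), and the slab limits ν_l, ν_u and spike location S below are assumed placeholder values, not those calibrated for the OACS trial.

```python
import numpy as np

def sample_spike_slab(pi_slab, nu_l=0.1, nu_u=1.0, spike=1000.0, size=1, rng=None):
    """Draw precision parameters from the spike-and-slab prior:
    with probability pi_slab,     nu ~ Uniform(nu_l, nu_u)          (the slab);
    with probability 1 - pi_slab, nu = spike, a point mass S >> nu_u (the spike)."""
    rng = np.random.default_rng() if rng is None else rng
    is_slab = rng.random(size) < pi_slab
    return np.where(is_slab, rng.uniform(nu_l, nu_u, size), spike)

# A small slab probability puts most mass on the spike -> strong borrowing
draws = sample_spike_slab(pi_slab=0.1, size=10_000, rng=np.random.default_rng(1))
print((draws == 1000.0).mean())  # roughly 0.9
```

A small π_kk* thus yields mostly spike draws (large precision, small prior variance around θ_k), which is exactly the strong-borrowing regime described above.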
To quantify π_kk*, the prior belief about the incommensurability between subtrials k and k*, we compute the distributional divergence between the posterior distributions π_k(θ_k | y_k) and π_k*(θ_k* | y_k*) and relate these distances to the spike prior probability (y_k and y_k* are outcome data from subtrials k and k*, respectively, k* ≠ k). Following Zheng and Wason (2022), we use the Hellinger distance, computed as

d_H(π_k*, π_k) = [ 1 − ∫ √( π_k*(θ | y_k*) π_k(θ | y_k) ) dθ ]^{1/2}.

The prior probability π_kk* can be defined as a function of the computed Hellinger distance, so that only commensurate information is incorporated. For simplicity, we let π_kk* = d_H(π_k*, π_k): as d_H(π_k*, π_k) → 1, more probability mass is allocated to the slab prior, hence a smaller precision parameter (larger variance), which denotes a high degree of incommensurability between subtrials k* and k. Consequently, we down-weight information from the corresponding complementary subtrial. On the contrary, as d_H(π_k*, π_k) → 0, more probability mass is allocated to the spike prior, leading to a large precision parameter ν_kk* (smaller variance). Consequently, information from the complementary subtrial is largely incorporated. The Hellinger distances are then normalised into a series of weights to determine how much information is leveraged from each complementary subtrial. When K ≥ 3, the K − 1 CPPs are combined into a marginal predictive prior for θ_k*.
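When the subtrial posteriors are (approximately) normal, the Hellinger distance admits a closed form, which makes this step cheap to compute. The sketch below assumes such normal approximations; in general, d_H can instead be evaluated by numerical integration of the definition above.

```python
import numpy as np

def hellinger_normal(mu1, sd1, mu2, sd2):
    """Hellinger distance between N(mu1, sd1^2) and N(mu2, sd2^2),
    using the closed form for two univariate normal densities."""
    var_sum = sd1 ** 2 + sd2 ** 2
    # Bhattacharyya coefficient for two normals; d_H = sqrt(1 - bc)
    bc = np.sqrt(2.0 * sd1 * sd2 / var_sum) * np.exp(-((mu1 - mu2) ** 2) / (4.0 * var_sum))
    return np.sqrt(1.0 - bc)

# Identical posteriors give distance 0; well-separated posteriors approach 1
print(hellinger_normal(0.5, 0.2, 0.5, 0.2))  # 0.0
print(hellinger_normal(0.5, 0.2, 5.0, 0.2))  # close to 1
```

Note that d_H is bounded in [0, 1] and symmetric in its two arguments, the property invoked later for the K = 2 case.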
The resulting marginal predictive prior for θ_k* (Equation 5) is updated to the posterior using data from subtrial k* to inform decision-making:

π(θ_k* | y_k*) ∝ L(y_k* | θ_k*) π(θ_k*),

where π(⋅) and L(⋅) denote the prior/posterior distribution and likelihood function, respectively. Specifically, go and no-go decisions are made, which refer to whether to recommend continuation or termination of investigation of the experimental therapy. Such decisions are made within all K subtrials, each testing the hypothesis that the experimental therapy provides benefit over control:

H_0k: θ_k ≤ δ_0 versus H_1k: θ_k > δ_0.

Precisely, decisions are based on probabilistic inferences about θ_k: a go decision is made if the estimated posterior probability that θ_k* is greater than a prespecified threshold δ_0 exceeds a desired probability level (e.g., 0.975), that is, P(θ_k* > δ_0) > 0.975; otherwise a no-go decision is taken. The threshold δ_0 represents a nominated magnitude of improvement that provides compelling evidence of experimental treatment benefit over control.
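The go/no-go rule is easy to apply to posterior draws. A minimal Python sketch, assuming MCMC samples of the subtrial treatment effect are available; the threshold and probability level mirror those used in the text.

```python
import numpy as np

def go_decision(theta_samples, delta0=0.3, prob_level=0.975):
    """Go if the estimated posterior probability that theta exceeds delta0
    is larger than the required probability level (e.g. 0.975)."""
    post_prob = float(np.mean(np.asarray(theta_samples) > delta0))
    return post_prob > prob_level, post_prob

# Posterior draws concentrated well above delta0 = 0.3 -> go decision
rng = np.random.default_rng(0)
go, p = go_decision(rng.normal(0.6, 0.05, 20_000))
print(go, round(p, 4))
```

In practice the samples would come from the posterior in the update step above rather than a synthetic normal.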
This TEB approach (i) accommodates complete discarding of information if θ_k and θ_k* are largely incommensurate and (ii) reduces to the complete pooling approach if the parameters are commensurate. In the special case when K = 2, and there is no a priori assumption about which subtrial has a larger treatment effect, the degree of borrowing between θ_1 and θ_2 can be anticipated to be symmetric (although this may differ), due to the symmetry property of the Hellinger distance used in Equation (4) to define the spike-and-slab prior.

Treatment response borrowing
Let j = E, C index the experimental treatment and control groups, respectively. Denote the groupwise mean response by μ_jk, for subtrial k = 1, …, K; the groupwise mean response refers to the mean response of treatment group j within subtrial k. Given that μ_jk* is our subtrial-specific parameter of interest, we define subtrials k* and k (k* ≠ k) as our contemporary and complementary subtrials, respectively. Let y_k^(j) represent the outcome measurements from treatment arm j in subtrial k. We place an operational prior π_0k*(μ_jk*) on the parameter μ_jk*. Using Bayes' theorem, we have the operational posterior

π_k*(μ_jk* | y_k*^(j)) ∝ L(y_k*^(j) | μ_jk*) π_0k*(μ_jk*),

which represents the information carried by the contemporary subtrial k*.
To facilitate information borrowing, we make use of a CPP: a normal predictive prior centred at μ_jk, with a precision parameter ν_kk*^(j). For treatment arm j, we can represent the normal predictive distribution for μ_jk* as

μ_jk* | μ_jk, ν_kk*^(j) ~ N(μ_jk, 1/ν_kk*^(j)).

As such, we have the CPP as

π(μ_jk* | y_k^(j)) = ∫∫ √ν φ(√ν (μ_jk* − μ_jk)) g(ν) π_k(μ_jk | y_k^(j)) dμ_jk dν,

where g denotes the prior distribution of the unknown precision ν = ν_kk*^(j) and φ is the standard normal density function. The choice of g is important to allow for robust borrowing on treatment arm j from subtrial k to k*; specifically, we expect that (i) information can be completely discarded if μ_jk* and μ_jk are incommensurate (ν_kk*^(j) ≈ 0), and (ii) strong borrowing occurs when subtrials k* and k are largely commensurate (ν_kk*^(j) large). Our new modelling strategy allows for more flexible borrowing of information between subtrials, using their commensurability in either the experimental or control treatment, or both. Thus, for each precision parameter ν_kk*^(j), we place a spike-and-slab prior with respective prior probabilities π_kk*^(j) and 1 − π_kk*^(j). Specifically, when π_kk*^(j) is small, ν_kk*^(j) takes the large value S ≫ 0 with high probability, thus allowing for strong borrowing of information.
To quantify ν_kk*^(j), we compute the distributional divergence between the posterior distributions of μ_jk* and μ_jk using the Hellinger distance discrepancy measure, d_H (Equation 10), as previously described in Section 2.1.
Likewise, we borrow strongly as d_H → 0 (high commensurability) and discount information as d_H → 1 (high incommensurability). In a randomised basket trial with K ≥ 3 subtrials, we have K − 1 complementary subtrials from which we may borrow information when estimating μ_jk*. From Equation (8), this translates to having K − 1 CPPs marginally on μ_jk*; hence, we need to obtain a collective prior for μ_jk*. In theory, we can view the location parameter estimate for μ_jk* as a weighted sum of K − 1 hypothetical random variables, where p_kk*^(j) quantifies the relative importance of information from arm j in subtrial k. Given that each of the K − 1 CPPs in Equation (8) can be represented approximately by a normal distribution, the collective marginal predictive prior for μ_jk* is the weighted mixture

π(μ_jk* | y_(−k*)^(j)) = Σ_{k ≠ k*} p_kk*^(j) π(μ_jk* | y_k^(j)).

To obtain p_kk*^(j) (the weight representing the degree of commensurability between μ_jk* and μ_jk, k ≠ k*), we organise the pairwise discrepancy measurements into a symmetric K × K matrix, where each column lists the pairwise discrepancies between μ_jk* and μ_jk, for k* = 1, …, K. We then normalise the discrepancies columnwise. A decreasing function given by p_kk*^(j) ∝ exp(−d_kk*^(j)/R_0) has been shown to have satisfactory properties for computing these weights (Zheng et al., 2020; Zheng & Wason, 2022). The largest weight is assigned to the subtrial with the smallest d_kk*^(j), where d_kk*^(j) is the Hellinger distance when performing borrowing in treatment arm j across the K subgroups, while R_0 controls the influence of d_H on the weights. The value of R_0 is pre-defined here to be 0.15, as in Zheng and Wason (2022). Simulation results elsewhere have shown that small values are preferred, as the methodology then identifies commensurate subtrials more sensitively, whereas it loses this ability when R_0 is large (Zheng & Wason, 2022). When R_0 is sufficiently large (R_0 ≫ d_kk*^(j)), the weights will be approximately the same irrespective of the magnitude of the pairwise Hellinger distance. However, as R_0 tends to 0 (R_0 → 0+), p_kk*^(j) → 1 as d_kk*^(j) → 0. Thereafter, we update the marginal predictive prior (Equation 12) using data from the contemporary subtrial k*:

π(μ_jk* | y^(j)) ∝ L(y_k*^(j) | μ_jk*) π(μ_jk* | y_(−k*)^(j)),

where y_(−k*)^(j) denotes information from all subtrials except the contemporary subtrial k* for arm j.
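The weight computation can be sketched as follows. This Python fragment assumes the exponential-decay reading of the weight function, exp(−d/R_0) with normalisation across the complementary subtrials, following Zheng and Wason (2022); it is illustrative rather than the authors' implementation.

```python
import numpy as np

def commensurability_weights(d_col, R0=0.15):
    """Normalise pairwise Hellinger distances from the K-1 complementary
    subtrials into weights via exp(-d / R0): smaller distance
    (more commensurate) -> larger weight."""
    raw = np.exp(-np.asarray(d_col, dtype=float) / R0)
    return raw / raw.sum()

# Hypothetical distances from three complementary subtrials
d = [0.05, 0.10, 0.60]
w = commensurability_weights(d, R0=0.15)
print(np.round(w, 3))  # largest weight goes to the smallest distance
```

Increasing R_0 flattens the weights towards uniformity, while R_0 → 0+ concentrates all weight on the most commensurate subtrial, matching the limiting behaviour described in the text.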
Although we implement borrowing by treatment response based on μ_jk*, our interest lies in the subtrial-specific treatment effect, defined as θ_k = μ_Ek − μ_Ck for subtrial k. The decision-making criteria (i.e., for go vs. no-go conclusions) are the same as in Section 2.1, based on θ_k.

Basic setting of the randomised basket trials
Consider a hypothetical phase II basket trial with K = 5 subtrials. A maximum of 336 patients are to be recruited overall, and within each subtrial k = 1, …, 5, unequal numbers of patients n_k are to be enrolled: n_1 = 70, n_2 = 66, n_3 = 64, n_4 = n_5 = 68. Our choice of sample sizes and number of subtrials is informed by recent basket trials. In particular, the CLUSTER randomised basket trial (De Benedetti et al., 2018) had 63-72 patients in two-thirds of its subtrials, while several basket trials (Hyman et al., 2015; Slosberg et al., 2018) have examined a fairly large number of patient subgroups. We simulate data from a hypothetical randomised basket trial as follows: we suppose that y_ik ~ N(μ_ik, σ²) with μ_ik = β_0k + θ_k T_ik. We assume β_0k = 5 and an inter-patient SD of σ = 0.4. In the Bayesian analysis, we specify a random-effects model for β_0k: β_0k ~ N(β_0, σ_0²). An uninformative normal prior is specified for β_0, β_0 ~ N(0, 5), and σ_0² ~ Half-Normal(1), where Half-Normal(1) denotes an N(0, 1) distribution truncated to the range (0, ∞).
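The data-generating mechanism above can be sketched as follows; this is a minimal Python version (the paper's simulations use R/JAGS), with the treatment effect θ_k set to an illustrative value rather than a Table 1 scenario value.

```python
import numpy as np

def simulate_subtrial(n_k, beta_0k=5.0, theta_k=0.35, sigma=0.4, rng=None):
    """Simulate one subtrial: y_ik ~ N(beta_0k + theta_k * T_ik, sigma^2),
    with 1:1 randomisation to control (T=0) and experimental treatment (T=1)."""
    rng = np.random.default_rng() if rng is None else rng
    treat = rng.permutation(np.repeat([0, 1], n_k // 2))  # equal allocation
    y = beta_0k + theta_k * treat + rng.normal(0.0, sigma, n_k)
    return y, treat

# The K = 5 subtrial sample sizes used in the hypothetical trial
sizes = [70, 66, 64, 68, 68]
rng = np.random.default_rng(42)
data = [simulate_subtrial(n, rng=rng) for n in sizes]
print([len(y) for y, _ in data])  # [70, 66, 64, 68, 68]
```

Each simulated subtrial would then be passed to the NB, TEB or TRB analysis model.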
Besides TRB and TEB, we also consider stand-alone subtrial analyses, referred to as the NB approach. The NB approach implements a Bayesian analysis of the basket trial without sharing information across subtrials: each subtrial is analysed in isolation. We set a normal prior N(0, 10²) for each subtrial-specific treatment effect θ_k. If there is no information to be shared across subtrials, estimation of θ_k using the treatment effect or the difference in treatment responses is equivalent; hence, we do not need to consider two versions of NB.

Simulation scenarios
We simulate randomised basket trial data under nine plausible scenarios, listed in Table 1. We assume a known 'true' treatment effect for each subtrial. Scenarios 1-4 feature consistency of responses on both the experimental and control arms (not only the treatment effect) across subtrials. For scenarios 5 and 6, the sets of treatment effects, that is, the values set for θ_k, are generated from distinct multivariate normal distributions, with a high pairwise correlation coefficient (ρ = 0.8) for low heterogeneity and a low pairwise correlation coefficient (ρ = 0.1) for high heterogeneity, respectively. Furthermore, scenarios 5 and 6 feature consistency of responses on the control arm only. Scenario 7 represents cases of inconsistency of responses on both experimental and control arms (not only the treatment effect) across subtrials. Scenario 8 is a case with consistent (null) treatment effects, while scenario 9 has consistent responses on the control arm. Scenarios 1 and 8 are global null scenarios, while scenario 9 is a mixed null scenario (some subtrials have a non-null treatment effect). The simulation parameters are chosen to represent plausible settings, informed by varying specific parameters in recent basket trials with continuous endpoints. We additionally evaluate the performance of the borrowing-of-information approaches for settings with smaller (n_S1 = 12, n_S2 = 18, n_S3 = 10, n_S4 = 16, n_S5 = 14) and larger (n_L1 = n_L2 = 100, n_L3 = 106, n_L4 = 110, n_L5 = 104) subtrial sample sizes.
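To illustrate how a pairwise correlation coefficient induces low or high heterogeneity in the generated treatment effects, subtrial effects can be drawn from a multivariate normal with a common pairwise correlation. The mean vector, standard deviation and exchangeable correlation structure below are assumed placeholders for illustration, not the Table 1 values.

```python
import numpy as np

def draw_treatment_effects(mean_effects, sd=0.1, rho=0.8, rng=None):
    """Draw K subtrial treatment effects from a multivariate normal with a
    common pairwise correlation rho: rho = 0.8 for low heterogeneity,
    rho = 0.1 for high heterogeneity (in the spirit of scenarios 5 and 6)."""
    rng = np.random.default_rng() if rng is None else rng
    K = len(mean_effects)
    # Exchangeable covariance: sd^2 on the diagonal, rho * sd^2 off-diagonal
    cov = sd ** 2 * (rho * np.ones((K, K)) + (1.0 - rho) * np.eye(K))
    return rng.multivariate_normal(np.asarray(mean_effects, dtype=float), cov)

theta_low_het = draw_treatment_effects([0.5] * 5, rho=0.8, rng=np.random.default_rng(0))
print(theta_low_het.round(2))
```

With ρ = 0.8 the drawn effects tend to move together across subtrials; with ρ = 0.1 they vary much more independently.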

Performance evaluation metrics
We investigate the performance of the modelling approaches, focusing on the bias, mean square error (MSE), empirical standard error (EmpSE), coverage and width of the 95% posterior credible interval for θ_k, and the Monte-Carlo standard error (MCSE). These are given by

Bias = b_sim^{−1} Σ_l (θ̂_k^l − θ_k),
MSE = b_sim^{−1} Σ_l (θ̂_k^l − θ_k)²,
EmpSE = [ (b_sim − 1)^{−1} Σ_l (θ̂_k^l − θ̄_k)² ]^{1/2},
Coverage = b_sim^{−1} Σ_l 1{ Q_θ^l(2.5) ≤ θ_k ≤ Q_θ^l(97.5) },

where l = 1, …, b_sim indexes the simulated basket trials (i.e., b_sim is the total number of replicates performed for each simulation scenario); θ_k is the value assumed in the particular simulation scenario; θ̂_k^l is the estimate from one of the analysis models in Section 2 for the lth simulation replicate; θ̄_k = b_sim^{−1} Σ_l θ̂_k^l is the mean of the θ̂_k^l across all replicates; and Q_θ^l(a), with 0 < a ≤ 100, denotes the ath percentile of the posterior distribution for θ. The lower and upper limits of the coverage definition are the 2.5th and 97.5th percentiles.
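These metrics are straightforward to compute from stored simulation output. A minimal Python sketch, assuming per-replicate point estimates and 95% credible-interval limits are available; the inputs below are synthetic placeholders for illustration.

```python
import numpy as np

def performance_metrics(theta_hat, theta_true, ci_lower, ci_upper):
    """Summarise b_sim simulated trials: bias, MSE, empirical SE,
    coverage of the 95% credible interval, and the Monte-Carlo SE of the bias."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    b_sim = theta_hat.size
    bias = theta_hat.mean() - theta_true
    mse = np.mean((theta_hat - theta_true) ** 2)
    emp_se = theta_hat.std(ddof=1)
    coverage = np.mean((np.asarray(ci_lower) <= theta_true) & (theta_true <= np.asarray(ci_upper)))
    return dict(bias=bias, mse=mse, emp_se=emp_se, coverage=coverage,
                mcse=emp_se / np.sqrt(b_sim))

# Synthetic example: estimates scattered around a true effect of 0.5
rng = np.random.default_rng(7)
est = rng.normal(0.5, 0.1, 10_000)
m = performance_metrics(est, 0.5, est - 0.2, est + 0.2)
print(round(m["bias"], 3), round(m["coverage"], 3))
```

In the actual study, the interval limits would be the 2.5th and 97.5th percentiles of each replicate's posterior rather than symmetric offsets.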
For each subtrial k, we also obtain the estimated posterior probability that θ_k exceeds the threshold δ_0 to estimate type I error-rates and power, with a go decision when P(θ_k > δ_0) > 0.975. For illustration, we specify δ_0 = 0.3. Of interest are the subtrial-wise and familywise type I error-rates and the power for identifying effective subgroups. The subtrial-wise type I error-rate is the proportion of trials, with respect to a given subtrial, with an erroneous go decision made for a given null scenario (scenario 1). The familywise error-rate is computed as the proportion of trials in which an incorrect go decision is made for any null hypothesis that is true. A similar approach to type I error-rate and power computation has been used previously (Hobbs & Landin, 2018; Zheng & Wason, 2022).
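The subtrial-wise and familywise error-rate computations can be sketched as follows; the go-decision matrix below is a synthetic placeholder, assuming decisions have been recorded across simulated trials.

```python
import numpy as np

def error_rates(go_matrix, null_subtrials):
    """go_matrix: boolean array (simulated trials x subtrials) of go decisions.
    Subtrial-wise type I error: per-column go rate among truly null subtrials.
    Familywise error: fraction of trials with a go decision in ANY true null."""
    go_null = np.asarray(go_matrix)[:, null_subtrials]
    return go_null.mean(axis=0), go_null.any(axis=1).mean()

# Hypothetical decisions from 10,000 simulated trials, all 5 subtrials truly null
rng = np.random.default_rng(3)
go = rng.random((10_000, 5)) < 0.01  # ~1% erroneous go per subtrial
per_subtrial, fwer = error_rates(go, null_subtrials=[0, 1, 2, 3, 4])
print(np.round(per_subtrial, 3), round(fwer, 3))
```

Power is computed analogously, as the per-subtrial go rate in columns where the alternative hypothesis is true.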
We simulate 10,000 randomised basket trials for each of the simulation scenarios. For each replicate, we use two parallel chains, each with 10,000 MCMC iterations; the first 3000 samples in each chain are discarded as burn-in. Convergence of the MCMC chains was checked using Gelman-Rubin diagnostic plots. All Bayesian analysis models are implemented in R (version 3.6.3) using the rjags package. Both R and JAGS functions to reproduce the simulation study are available on GitHub (https://github.com/oondijo/RandBasketTrials).

RESULTS

Figure 2 compares the TRB, TEB and NB analysis strategies in terms of the bias and MSE of the posterior mean estimators for the subtrial-specific treatment effects. Under the global null scenarios, there is similar and noticeably small bias across the three approaches, although TRB and TEB have smaller MSE, especially in scenario 1. In scenario 8 (the same as scenario 1 except that the mean response on control, μ_Ck, differs across subtrials), the bias and MSE for TEB and TRB are slightly different in some subtrials, occasionally higher for TRB. The MSEs of the point estimates differ considerably; the NB approach produces the highest MSE in scenarios 1-3 and comparable MSE to the TRB approach in scenarios 7-9. When all or most subtrials are consistent in the mean responses on both the experimental and control arms (scenarios 1-4), the TRB and TEB approaches have comparable bias, but the TRB modelling approach has smaller MSE. An exception to this is scenario 3, where all μ_Ck's and θ_1-θ_4 are similar but θ_5 is different; we observe here that TRB has slightly higher bias, although the MSEs are similar. In the mixed null scenario (scenario 9), where we anticipate very limited borrowing or heavy discounting of information, TEB produces smaller bias (similar to the NB approach) than TRB, as well as the smallest MSE. We observe similar results under varying degrees of heterogeneity of the treatment effect across the subtrials (scenarios 5 and 6).

Overall, the TRB approach consistently provides smaller bias and MSE when the θ_k are small and similar across subtrials and the responses on the control and experimental arms are also similar (for instance, scenarios 1, 2 and 4), while TEB provides more accurate estimates when at least one of the true treatment effects and groupwise mean responses (μ_Ck) is different (for instance, scenarios 3, 8 and 9). In Figure S3, we demonstrate that these differences hold for the considered circumstances of smaller and larger basket trial sample sizes.

Figure 3 visualises the median width of the posterior credible intervals for the θ_k yielded by the respective Bayesian analysis models. In all scenarios, either TEB or TRB has narrower 95% credible intervals than the NB approach. The differences are particularly noticeable in scenarios 1-4 (apart from subtrial 5 of scenario 3, where differences are marginal), where treatment effects are consistent across all subtrials. This is because borrowing of information leads to greater efficiency gains and more precise posterior estimates of the treatment effect when subtrials are more commensurate. Although there are no observable differences in scenario 7, where both μ_Ck and θ_k differ across subtrials, it is clear from Figure S6 that TRB works best under small sample sizes. However, in this scenario there are no differences when the sample size is large, another indication that multiple metrics are necessary to distinguish TRB and TEB in that case.

For the global null scenario 1, we notice that TRB leads to more precise estimates, owing to the commensurability of the mean responses of the control arm across all subtrials. However, when the mean response on control differs across subtrials (scenario 8), TEB gives the narrowest posterior credible intervals across all subtrials. It is interesting to note that under a small overall basket trial sample size, TRB gives the narrowest 95% credible intervals across nearly all scenarios, whereas under a large sample size, TEB gives far narrower posterior credible intervals for the treatment effect in nearly all scenarios (Figure S6).
We also compared the two modelling strategies for borrowing of information based on their statistical power, type I error-rate, and familywise error-rate. Figure 4 compares the statistical power under small (S) or large (L) sample sizes. The first thing to note is the increase in statistical power to demonstrate a treatment benefit in a particular subtrial as we leverage information from other subtrials. Both TRB and TEB achieve higher statistical power in almost all scenarios compared to the NB approach.

[FIGURE 4: A comparison of the Bayesian analysis models for randomised basket trials, in terms of the frequentist power, under small (S: n_S1 = 12, n_S2 = 18, n_S3 = 10, n_S4 = 16, n_S5 = 14) or large (L: n_L1 = n_L2 = 100, n_L3 = 106, n_L4 = 110, n_L5 = 104) sample sizes. Frequentist power is defined as the proportion of trials with a correct go decision under H_1k: θ_k > δ_0. The horizontal dashed line represents 80% power.]

We observe very low statistical power in scenario 2 (θ_k = 0.35 for all k), which is expected because go decisions based on P(θ_k > 0.3) > 0.975 are made to be very unlikely. In scenarios 3 and 4, where treatment effects are commensurate across most subtrials (except subtrial 5), we observe that TRB leads to higher statistical power than TEB and no borrowing. As the treatment effect gets larger (scenario 4), there is an increase in power, apart from subtrial 5, which differs between scenarios 3 and 4. Notably, the differences between the borrowing approaches are more pronounced when the subtrial sample sizes are small. For instance, in scenarios 3 and 4, TRB gives 0.3%-11% and 2%-15% higher power than TEB on average in subtrials 1-4 when sample sizes are small. However, when the sample size is larger, TEB outperforms TRB marginally (<2%) in nearly all scenarios. For the mixed global null (scenario 9), TEB yields at least 13% higher power than TRB in subtrial 4. Scenarios 5 and 7 mimic situations where the true θ_k are highly heterogeneous across all subtrials; scenario 7 is further characterised by the mean responses of the control group differing across the subtrials. We observe that TRB achieves the highest statistical power in scenario 5, but this is not always the case in scenario 7: in subtrial 5, TRB outperforms TEB by 14% in scenario 5, while in scenario 7 the gap is 10% in the opposite direction (Figure 4-S).
Principally, TRB achieves higher power than TEB when there is high commensurability across all subtrials, the mean response in control is similar across subtrials, and the basket trial sample size is small. However, when the treatment effects and/or the mean response in the control or experimental arm are inconsistent across subtrials (scenarios 5-7 and 9), TRB generally still outperforms TEB in small sample sizes, with exceptions: in scenario 7 (subtrial 5) and scenario 9 (subtrial 4), TEB has higher power by at least 10% and 13%, respectively. When the sample size is larger, TEB has a marginal absolute gain in power over TRB in all scenarios. The observation that power is high and similar across the three approaches in Figure 4 (bottom panel) is entirely expected: when a treatment truly works, power increases with sample size for the same specified effect sizes. The low power in subtrial 4 in scenarios 5, 6 and 7 arises because θ_4 is assumed to be small (see Table 1), hence P(θ_4 > 0 | data) > 0.975 is rarely attained. Table 2 quantifies the coverage probability of the posterior distributions based on the respective Bayesian models as we vary the sample size. Both TEB and TRB guarantee high coverage, well above the target (95%), in most scenarios. However, we notice substantial differences in coverage for small subtrial sample sizes. For example, in the global null scenario where the mean response in control differs across subtrials (scenario 8), TEB gives higher coverage for θ_k (also above the target) relative to TRB (approximately 4% higher in some cases). When the sample size is large, similar conclusions apply, except that the differences shrink to 1%-2%. Further, TEB and TRB have similar coverage probability for θ_k (differences of <1%) in the null scenario 1, where the mean response in the control arm was assumed to be similar across subtrials.
When treatment effects are assumed to be heterogeneous across subtrials (with subtrials inconsistent in the mean responses of the experimental arm, the control arm, or both), we observe mixed results, with TRB at times performing the poorest of the three approaches (subtrial 1 of scenarios 5 and 7).
In scenario 7, where the true θ_k are highly heterogeneous across all subtrials and the true mean of the control arm varies across subgroups, all approaches produce coverage below the target (95%) in nearly all subtrials (Table 2). For instance, in subtrial 3, TEB gives coverage probability nearly 3% higher than TRB when the overall sample size is small, which is also closest to the target. Overall, TEB gives higher coverage probability in most scenarios when the sample size is large. In Figures S1 and S5, we present graphical comparisons of the empirical standard error (EmpSE) across the Bayesian modelling approaches, and in Table S2 we present additional results on the subtrial-wise and familywise error-rate control for each of the three analysis approaches. We notice that TRB has the smallest EmpSE when the sample size is small, but as the sample size grows, TEB has the smallest EmpSE. Furthermore, the error rates (FWER and type I error rate) from the borrowing strategies are smaller than those of the no-borrowing strategy.
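The coverage probability and EmpSE metrics reported above can be estimated from simulation replicates as follows. This is a generic sketch with synthetic replicates, not our actual simulation pipeline: each replicate is reduced to a point estimate and a 95% interval, and the metrics are computed across replicates.

```python
import numpy as np

rng = np.random.default_rng(7)

def coverage_and_empse(estimates, lowers, uppers, true_effect):
    """Empirical coverage of 95% intervals and the empirical standard
    error (SD of point estimates) over simulation replicates."""
    estimates = np.asarray(estimates)
    covered = (np.asarray(lowers) <= true_effect) & (true_effect <= np.asarray(uppers))
    return covered.mean(), estimates.std(ddof=1)

# Toy replicates: 1000 point estimates scattered around a true effect
# of 0.5 with SD 0.2, and intervals of the matching half-width, so the
# empirical coverage should sit near the nominal 95%.
true_effect = 0.5
est = rng.normal(true_effect, 0.2, size=1000)
cov, empse = coverage_and_empse(est, est - 1.96 * 0.2, est + 1.96 * 0.2, true_effect)
print(round(cov, 2), round(empse, 2))
```

In the basket trial setting, borrowing shrinks the spread of the point estimates (smaller EmpSE) but, when subtrials are incommensurate, can pull intervals away from the truth and drive coverage below the target, which is the pattern seen in scenario 7.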

Data description
We apply the TRB, TEB and NB approaches to Bayesian analyses of two case study data sets. Based on our assessment of recent systematic reviews, a search of clinicaltrials.gov, and an ongoing systematic review in our group, there are currently no published randomised basket trials with a continuous primary outcome. We instead identified trials with continuous primary outcomes whose design matches that of a randomised basket trial to use as case studies. The first case study data set is drawn from a trial of 60 participants randomised between different structured exercise and diet interventions. Participants were first randomly assigned to one of three exercise regimes, and within each exercise regime subgroup they were further randomised to one of two dietary interventions. Although not labelled as a basket trial, this can be considered a randomised basket trial with three subtrials defined by the assigned exercise regime. Each subtrial then comprises the 20 patients further randomised to either diet 1 or diet 2, with both diets the same across subtrials. The post-intervention clinical outcome of interest was the pulse rate, a continuous measurement taken at three or four timepoints from baseline for each participant; we analyse the pulse rate data from the end of the trial. This data set is publicly available (Repeated measures analysis with R. UCLA: Statistical Consulting Group, n.d.). The second case study data set is drawn from a multi-centre, randomised, double-blind, placebo-controlled, proof-of-concept study of iscalimab in patients with primary Sjogren's syndrome (Fisher et al., 2020). Iscalimab is a novel anti-CD40 monoclonal antibody that targets the CD40-CD154-mediated T cell-B cell interactions shown to be responsible for aberrant lymphocyte activation in inflamed tissue (Fisher et al., 2020).
The trial was conducted in 10 sites across five countries (United Kingdom, Germany, Switzerland, Hungary and United States), recruiting patients aged 18-75 years, and had two cohorts. In cohort 1, 12 patients were randomised to receive subcutaneous iscalimab (3 mg/kg) or placebo for 12 weeks, after which patients on placebo would receive iscalimab up to week 32 (end of the trial). Cohort 2 (with 32 patients) followed a similar design except that a higher dose of iscalimab (10 mg/kg) was administered. As such, this trial can be viewed as a randomised basket trial with two subtrials. The primary outcome was the ESSDAI score, which has a maximum possible score of 123 points, with higher scores indicating more severe symptoms; the minimum clinically important improvement in ESSDAI has been published as three points. Our interest is the comparison of ESSDAI scores between iscalimab and placebo patients at week 12. For this trial, the process to obtain individual-level patient data would have been lengthy (lasting several months), so we instead obtained all relevant information on the distribution of ESSDAI scores for each treatment arm in the two subtrials (mean ESSDAI score at baseline and at each timepoint, range of scores, SDs) from the chief investigator and the published paper. Using this information, we simulate data for 24 patients in subtrial 1 and 64 patients in subtrial 2. One key assumption in our analysis is that the difference in iscalimab dosage between the cohorts does not influence the outcomes. We view this as appropriate because our goal is to illustrate how different the results of the analyses can be, and what the differences may look like in a real trial.
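A reconstruction of patient-level data from published arm-level summaries can be sketched as below. The means, SDs and range shown are illustrative placeholders only, not the values supplied by the chief investigator or reported by Fisher et al.; rejection sampling is one simple way to honour a published score range while matching a published mean and SD approximately.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_arm(n, mean, sd, lo, hi):
    """Draw n outcomes from a normal with the given summary mean/SD,
    truncated by rejection sampling to the published range [lo, hi]."""
    out = []
    while len(out) < n:
        x = rng.normal(mean, sd)
        if lo <= x <= hi:
            out.append(x)
    return np.array(out)

# Illustrative placeholder summaries for subtrial 1 (24 patients,
# split equally); ESSDAI scores lie in [0, 123] by construction.
iscalimab = simulate_arm(12, mean=4.0, sd=3.0, lo=0, hi=123)
placebo = simulate_arm(12, mean=7.0, sd=4.0, lo=0, hi=123)
print(len(iscalimab) + len(placebo))
```

A more faithful reconstruction would also match baseline scores and the within-patient trajectory across timepoints, but for illustrating the borrowing analyses the endpoint values at week 12 suffice.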

4.2 Analysis of case studies

Table 3 presents the results of the three analyses (TRB, TEB, NB) of the two case study datasets.
Our results demonstrate that TRB and TEB can give substantially different point estimates for the θ_k, a consequence of the differential information borrowing from complementary subtrials. For case study 1, we notice that TEB has the smallest estimates in subtrials 1 and 2 and the largest in subtrial 3. Considering the posterior probabilities P(θ_k > 3 | data), we see that for subtrial 2, P(θ_2 > 3 | data) = 0.615, 0.852 and 0.40 for no borrowing, TRB and TEB, respectively (see Table 4). Thus, if a typical threshold probability to inform decision-making such as 0.9 was specified (i.e., decision-making was based on P(θ_k > 3 | data) > 0.9), we would arrive at identical no-go conclusions for all subtrials, since all the posterior probabilities P(θ_k > 3 | data) are less than 0.9. However, if a smaller threshold such as 0.8 is chosen, for example as in the case of a small proof-of-concept study, the TRB and TEB analyses would lead to different conclusions in subtrial 2 of case study 1: there is an 85.2% chance of recommending the intervention by the TRB analysis, while for the NB and TEB analyses this probability is 61.5% and 40%, respectively.
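The sensitivity of the conclusion to the chosen threshold can be made explicit in a few lines. The probabilities below are those reported in Table 4 for subtrial 2 of case study 1; the `decision` helper is a trivial illustrative wrapper, not part of our models.

```python
def decision(posterior_prob, gamma):
    """Go if P(theta_k > 3 | data) exceeds the pre-specified gamma."""
    return "go" if posterior_prob > gamma else "no-go"

# Posterior probabilities for subtrial 2 of case study 1 (Table 4).
probs = {"NB": 0.615, "TRB": 0.852, "TEB": 0.40}

print({m: decision(p, 0.9) for m, p in probs.items()})  # all no-go at 0.9
print({m: decision(p, 0.8) for m, p in probs.items()})  # TRB alone flips to go
```

At the stricter threshold the three analyses agree; relaxing it to 0.8 makes the TRB analysis, and only the TRB analysis, recommend the intervention.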
As in case study 1, the treatment effect estimates for case study 2 differ between the NB approach and the two borrowing strategies, TEB and TRB (Table 3). Here TRB and TEB produce comparable results, both for the treatment effects and the 95% credible intervals, an observation more likely when the groupwise mean responses are comparable across subtrials. However, the posterior probabilities used to inform go/no-go decisions were slightly different (0.365 vs. 0.468; Table 4). While at most three subtrials are analysed here whereas our simulation study had five, we do not expect the number of subtrials itself to affect these findings; in principle, the distributional discrepancy approach allows borrowing that depends only on the commensurability of information across subtrials.

DISCUSSION
The use of basket trials has grown substantially in the last decade, especially in early-phase trial settings targeting rare diseases (Hazim & Prasad, 2018; Park et al., 2019). Randomised basket trials provide higher-quality evidence of treatment benefit over the control relative to non-randomised basket trials and will increasingly characterise the future landscape of phase II and phase III basket trials, given the enormous weight placed on randomised evidence in drug approvals (Prasad & Oseran, 2015; Saad et al., 2017). Recent ongoing randomised basket studies include NCT04498962 and NCT04988087.
In this article, we sought to identify the best modelling strategy for borrowing information in the Bayesian analysis of a randomised basket trial with a continuous endpoint, motivated in part by our desire to identify a suitable analysis approach for the OACS randomised basket trial. We have proposed a TRB approach focusing on borrowing by treatment arm (groupwise responses) and compared its performance to TEB as well as the NB approach. Specifically, the TRB approach assesses how similar the responses to a specific treatment are across subtrials, in contrast to TEB, which assesses how similar the experimental treatment's benefit over the control (the treatment effect) is across subtrials, when leveraging information from commensurate subtrials.
Our work is relevant to debates on arm-based (Hong et al., 2016) versus contrast-based (Begg & Pilote, 1991) frameworks for performing meta-analysis of randomised controlled trials with at least two treatment arms (Dias & Ades, 2016). The newly proposed TRB offers added flexibility in implementing borrowing of information within basket trials. Firstly, data from complementary subtrials that are commensurate with the contemporary subtrial in either the experimental or control arm can be leveraged. Secondly, incommensurate data can be down-weighted to a different extent by treatment arm. Our results are congruent with studies comparing arm-based and contrast-based models: estimates from arm-based models were slightly larger than those obtained from contrast-based models, but with smaller SEs (Karahalios et al., 2022).
Both TRB and TEB are based on the commensurate predictive prior approach to quantifying the degree of commensurability; synthesis weights are specified as a function of the pairwise Hellinger distance and assigned to the corresponding complementary subtrial. When fitting the Bayesian models with a TRB or TEB approach, we placed a spike-and-slab prior on the precision parameter that gauges the degree of borrowing, but alternative priors could be considered. Zheng et al. instead used a two-component gamma mixture prior for analytic tractability in the exact Bayesian inference of treatment effects when leveraging pre-experimental data from multiple sources (Zheng et al., 2020). A two-component gamma mixture prior brings the convenience of analytic tractability and would particularly facilitate the derivation of a sample size formula for the trial design. The two choices of prior on the commensurability parameter imply different interpretations of a specified value for the prior probability of incommensurability. For instance, setting this prior probability to a value as small as 0.3 may imply a priori strong borrowing under a spike-and-slab prior, while the degree of borrowing might be attenuated under a two-component gamma mixture prior. Indeed, we expect differences associated with these two prior specifications, and we plan to perform another comprehensive simulation study to formally compare the marginal changes in the degree of borrowing induced by them.
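The pairwise Hellinger distance underpinning the synthesis weights has a closed form when the two distributions being compared are normal. The sketch below computes only that distance; how the distance is then mapped to a synthesis weight follows the specification in Section 2 and is not reproduced here.

```python
import math

def hellinger_normal(mu1, sd1, mu2, sd2):
    """Closed-form Hellinger distance between N(mu1, sd1^2) and
    N(mu2, sd2^2): H^2 = 1 - BC, where BC is the Bhattacharyya
    coefficient for two normal densities."""
    bc = math.sqrt(2 * sd1 * sd2 / (sd1**2 + sd2**2)) * \
         math.exp(-((mu1 - mu2) ** 2) / (4 * (sd1**2 + sd2**2)))
    return math.sqrt(1 - bc)

# Identical distributions give distance 0 (full borrowing plausible);
# widely discrepant distributions give a distance near 1, so the
# complementary subtrial's contribution would be heavily down-weighted.
print(round(hellinger_normal(0.5, 0.2, 0.5, 0.2), 3))
print(round(hellinger_normal(0.5, 0.2, 2.0, 0.2), 3))
```

The distance is bounded in [0, 1], which is what makes it convenient as the building block of a weighting scheme: commensurate subtrials contribute near-fully, incommensurate ones barely at all.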
Our simulation results show that permitting borrowing of information in the analysis of randomised basket trials offers substantial gains in efficiency over independent analysis of subtrials (NB) as shown in previous simulation studies (Jin et al., 2020;Zheng & Wason, 2022). The proposed approaches to TRB and TEB differ most widely in performance when subtrial sample sizes are small, with differences depending on the similarity of treatment effects or treatment responses within each subtrial. In circumstances of small subtrial sizes, TRB typically outperforms TEB, especially when either the treatment effects or the mean response in control (or both) are consistent across subtrials. On the contrary, TEB performs better when there is considerable heterogeneity in treatment effects and/or the mean response in either the control or experimental arm across subtrials. Our findings indicate that TEB is a more robust approach that provides more precise estimates with better coverage probability and marginally higher power over TRB in nearly all scenarios when the trial sample size is sufficiently large. We also applied both TRB and TEB to the analysis of two case study data sets, showing that a trial may arrive at a different conclusion based on the chosen analysis approach. An important note is that our choice of case study 1 is different from the common basket design (Figure 1). This helps illustrate that basket trials are equally applicable to non-oncology settings and non-pharmaceutical interventions and are not restricted to biomarker-guided pharmaceutical interventions as is common in oncology.
While we have considered a single Bayesian analysis methodology in Section 2, the concept underlying TRB can be implemented using any proposed Bayesian hierarchical method for basket trials. For illustrative purposes, we have focused on a setting where patients are equally randomised to a single experimental treatment or control within each subtrial, with a common continuous endpoint. A possible extension to our proposed methodology is thus to consider binary or time-to-event outcomes such as progression-free survival. In a binary setting, we can assume the response in each subgroup follows a binomial distribution and model the log-odds of response. Borrowing would then be based on the similarity of the difference in log-odds per subgroup for TEB, while for TRB we would account for the similarity of the log-odds separately for the control and treatment arms across subgroups. The log-odds can be assumed to follow a normal distribution, with normal hyperpriors, and the Bayesian hierarchical modelling proceeds as in Sections 2.1 and 2.2.
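To make the binary extension concrete, the fragment below contrasts the quantities each strategy would compare across subgroups. The response counts are hypothetical, and the 0.5 continuity correction is one common convention for empirical log-odds, not part of our model specification, in which the log-odds would be parameters with normal hyperpriors.

```python
import math

def log_odds(successes, n):
    """Empirical log-odds with a 0.5 continuity correction."""
    p = (successes + 0.5) / (n + 1)
    return math.log(p / (1 - p))

# Hypothetical (responders, n) counts for two subtrials.
trt = [(18, 30), (20, 30)]
ctl = [(9, 30), (10, 30)]

# TEB would assess the similarity of the per-subgroup differences in
# log-odds (the treatment effects) ...
effects = [log_odds(*t) - log_odds(*c) for t, c in zip(trt, ctl)]

# ... while TRB would assess the similarity of the arm-specific
# log-odds separately for the treatment and control arms.
trt_lo = [log_odds(*t) for t in trt]
ctl_lo = [log_odds(*c) for c in ctl]
print([round(e, 2) for e in effects])
```

In the hierarchical model these empirical quantities are replaced by posterior distributions, and the discrepancy measure of Section 2 is applied to the effect-scale posteriors (TEB) or the arm-scale posteriors (TRB).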
There are also several avenues to enhance the practicality of our proposed methodology (particularly the TRB strategy) in precision medicine trials. One future direction is to explore its performance in a basket trial setting where patients are randomised to at least two experimental treatment groups plus control in all or some of the subtrials. Another type of randomised basket design could incorporate response-adaptive randomisation (Lin et al., 2021; Ventz et al., 2017); more specifically, the randomisation ratio could be adapted at an interim analysis to favour the treatment with the highest efficacy, as estimated by our proposed TRB. Ventz and colleagues recently developed a class of Bayesian response-adaptive basket designs and demonstrated the advantages of adapting randomisation with borrowing over not borrowing (Ventz et al., 2017). Furthermore, it is also possible to incorporate early stopping for futility or efficacy. For example, we may decide to drop subgroup k if the posterior probability that the treatment effect θ_k exceeds a prespecified threshold θ_0 is low, that is, if Pr(θ_k > θ_0 | data) falls below a prespecified probability bound; similarly, if this posterior probability is high, the subgroup can be stopped for efficacy. An important note here is that the decision-making criterion using θ_k is the same for TRB and TEB, as this decision is made after computation of the treatment effects.
The use of TRB is in no way restricted to basket trial designs. Our proposed methodology could potentially be used in umbrella trials with a common control in each subtrial. The Lung-MAP umbrella trial (Ferrarotto et al., 2015) is one such example where this may be plausible. Here, information borrowing across subtrials would be implemented for the control arm only. Relevant to this context, Zang et al. proposed a Bayesian hierarchical model to borrow information across subgroups receiving standard-of-care only in a Bayesian adaptive marker-stratified design (Zang et al., 2019). By using the TRB proposed in this paper, we could expect a more precise estimate of the mean response in each control arm to be achieved, which would aid the estimation of the treatment effect specific to each subgroup.
We caution that despite the efficiencies of borrowing across subgroups in basket trials, borrowing should always be motivated by a strong biological and clinical rationale. For instance, when the control (e.g., the standard-of-care treatment) differs between certain subgroups, the strength of scientific and clinical evidence underpinning such borrowing is weak, because different drugs may have different mechanisms of action and may elicit different responses across subgroups. As such, borrowing is more justifiable across similar controls, and TRB would be suitable if the experimental arm is the same but the control arm is not. We also note that the Bayesian simulations undertaken here have high computational requirements. The computational time for a single scenario with 10,000 simulation replicates and two MCMC chains each with 10,000 iterations for TEB/TRB was four to five hours on average when parallelising on 44 cores in a high-performance computing cluster node (2.2 GHz Intel Xeon E5-2699 v4 processor); for the NB approach, this takes approximately 10 min.
Finally, we conclude that borrowing of information in randomised basket trials offers considerable efficiency gains but must be implemented carefully depending on the potential heterogeneity in effects across subgroups and the subtrial sample sizes. Consequently, investigators and statisticians must thoroughly investigate any borrowing approach based on plausible trial scenarios.