Theses - MRC Biostatistics Unit


Recent Submissions

  • ItemOpen Access
    Dynamic risk prediction of cardiovascular disease using primary care data from New Zealand
    Barrott, Isobel
Cardiovascular disease (CVD) progresses gradually over time and can lead to a cardiovascular disease event (“CVD event”) such as stroke or heart attack. There are several widely researched risk factors for CVD, such as smoking, diet, exercise, and stress (Perk et al., 2012). These risk factors can affect biomarkers such as blood pressure and lipid levels, which can be measured by a primary care practitioner and are themselves risk factors (World Health Organization, 2021). The PREDICT cohort study (Wells et al., 2017; Pylypchuk et al., 2018) comprises the electronic health records (EHRs) of such CVD risk factor measurements, which were collected to assess the 5-year CVD risk of patients in primary care. A risk prediction model previously developed for this population by Pylypchuk et al. (2018) uses only the most recent observation of each biomarker. Dynamic prediction is an alternative approach that updates risk predictions as measurements are collected, thereby using the entire history of these measurements. Two main statistical frameworks exist for performing dynamic prediction: the joint model and the landmark model. This thesis explores the use of dynamic prediction, and in particular the landmark model, to improve CVD risk prediction. Two types of landmark model for 5-year CVD risk are presented in this thesis, developed using the PREDICT cohort study dataset: one models the longitudinal data using a linear mixed effects (LME) model, and the other uses the last observation carried forward (LOCF) approach. These dynamic prediction models were found to offer some improvement in performance over a “static” model similar to that developed by Pylypchuk et al. (2018). 
This thesis also presents the results of a simulation study exploring the difference between these two types of landmark model as the number of repeated measurements of the biomarkers increases, finding in particular that there is little difference in model performance. Finally, this thesis presents an R package, ‘Landmarking’, which allows the user to perform various analyses relating to the landmark model.
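To make the LOCF idea above concrete, here is a minimal sketch (hypothetical helper names, not the actual ‘Landmarking’ package API): at each landmark time, a patient's covariate value is simply the most recent measurement recorded at or before that time.

```python
def locf_at_landmark(records, landmark_time):
    """For each patient, carry forward the last biomarker value
    observed at or before the landmark time (LOCF)."""
    latest = {}  # patient id -> (time, value) of most recent eligible record
    for patient, time, value in records:
        if time <= landmark_time:
            if patient not in latest or time > latest[patient][0]:
                latest[patient] = (time, value)
    # Patients with no measurement before the landmark drop out of
    # that landmark's risk set.
    return {p: v for p, (t, v) in latest.items()}

# Systolic blood pressure readings: (patient, years since entry, mmHg)
records = [
    ("A", 0.0, 150), ("A", 2.0, 140), ("A", 6.0, 130),
    ("B", 1.0, 120),
    ("C", 7.0, 135),  # first measurement arrives after the landmark
]
covariates = locf_at_landmark(records, landmark_time=5.0)
# covariates == {"A": 140, "B": 120}; patient "C" is excluded
```

In a full landmark analysis these carried-forward values would then enter a survival model fitted to the 5-year window following the landmark time.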
  • ItemOpen Access
    Using generative modelling in healthcare
    Skoularidou, Maria
In the present thesis a broad spectrum of high-dimensional problems with applications to healthcare will be explored. We shall review the state-of-the-art methods employed when trying to detect genetic factors that affect gene expression, a core problem in genetics. We shall also present two popular classes of generative models, namely Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), and their variants. Subsequently, we shall review some newly developed imputation methods based on GANs and VAEs, and assess their performance under various missingness scenarios via appropriately designed experiments and simulation studies. We shall then introduce our method for GAN inversion and evaluate its performance in a newly proposed manner. Finally, we shall conclude this thesis with our main findings and future work.
  • ItemOpen Access
    Bayesian methodology for integrating multiple data sources and specifying priors from predictive information
Manderson, Andrew [0000-0002-4946-9016]
The joint model for observable quantities and latent parameters is the starting point for Bayesian inference. It is challenging to specify such a model so that it both accurately describes the phenomena being studied and is compatible with the available data. In this thesis we address challenges to model specification when we have multiple data sources and/or expert knowledge about the observable quantities in our model. We often collect many distinct data sets that capture different, but partially overlapping, aspects of a complex phenomenon. Instead of immediately specifying a single joint model for all these data, it may be easier to specify distinct submodels for each source of data and then join the submodels together. We specifically consider chains of submodels, where submodels relate directly to their neighbours via common quantities, which may be parameters or deterministic functions thereof. We propose chained Markov melding, an extension of Markov melding, a generic method for combining chains of submodels into a joint model. When using any form of Markov melding, one challenge is that the prior for the common quantities can be implicit, and their marginal prior densities must be estimated. Specifically, when we have just two submodels, we show that error in this density estimate makes the two-stage Markov chain Monte Carlo sampler employed by Markov melding unstable and unreliable. We propose a robust two-stage algorithm that estimates the required prior marginal self-density ratios using weighted samples, dramatically improving accuracy in the tails of the distribution. Expert information often pertains to the observable quantities in our model, or can be elicited more easily for these quantities. However, the appropriate informative prior that matches this information is not always obvious, particularly for complex models. 
Prior predictive checks and the Bayesian workflow are often used to iteratively specify a prior that agrees with the elicited expert information, but for complex models it is difficult to manually adjust the prior to better align the prior predictive distribution with the elicited information. We propose a multi-objective global optimisation approach that aligns these quantities by adjusting the hyperparameters of the prior, thus "translating" the elicited information into an informative prior.
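As a toy, one-dimensional analogue of matching a prior to elicited predictive information (a sketch under stated assumptions, far simpler than the multi-objective optimisation proposed in the thesis): with y | mu ~ N(mu, 1) and prior mu ~ N(0, sigma²), the prior predictive is N(0, 1 + sigma²), so a single hyperparameter can be tuned by bisection until the prior predictive 95% interval matches an elicited interval. All names here are hypothetical.

```python
import math

def prior_predictive_sd(sigma):
    """With y | mu ~ N(mu, 1) and prior mu ~ N(0, sigma^2), the prior
    predictive distribution of y is N(0, 1 + sigma^2)."""
    return math.sqrt(1.0 + sigma ** 2)

def match_elicited_upper(upper, z=1.96, tol=1e-8):
    """Bisect on sigma so the central 95% prior predictive interval
    has the elicited upper endpoint `upper`."""
    lo, hi = 0.0, 100.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if z * prior_predictive_sd(mid) < upper:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Expert judgement: 95% of plausible observations lie within [-4, 4].
sigma = match_elicited_upper(upper=4.0)
# Now 1.96 * sqrt(1 + sigma^2) is (numerically) 4.
```

With many elicited summaries and many hyperparameters this becomes a genuine multi-objective optimisation problem, which is the setting the thesis addresses.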
  • ItemOpen Access
    Bayesian model-based clustering of multi-source data
    Coleman, Stephen
Inferring a partition of a dataset can help in downstream analyses and decision making. However, there often exist many feasible partitions, which makes the problem of inferring clusters challenging. A particularly demanding problem is the analysis of data generated across multiple sources. Bayesian mixture models and their extensions are effective tools for partition inference in this setting, as we can use them to describe and infer the relationships between different sources. I consider applying such methods to two cases of multi-source data: multi-view, where the same items have data generated across different contexts, and multi-batch, where the same measurements are taken on sets of items. I develop and explore a consensus clustering approach to navigate the problem of poor mixing, a failure of Markov chain Monte Carlo methods wherein the sampler becomes trapped in local high-posterior-density modes; this problem is commonly encountered when seeking to infer latent structure in high-dimensional data. I propose running many short Markov chains in parallel and using the final sample from each chain. My results suggest that performing inference this way frequently describes model uncertainty better than individual long chains do. I use the method in a multi-omics analysis of the cell cycle of Saccharomyces cerevisiae and identify biologically meaningful structure. I subsequently implement Multiple Dataset Integration (MDI), a Bayesian integrative clustering method, in C++ with a wrapper in R, correcting an error present in previous implementations and extending MDI to be semi-supervised. My implementation allows a range of models for a variety of data types, such as t-augmented mixtures of Gaussians and Gaussian processes. I then consider a semi-supervised multi-omics analysis of the model apicomplexan Toxoplasma gondii. In my final content chapter I consider the problem of analysing data generated across multiple batches. 
Such data can have structural differences which should be accounted for when inferring a partition. I propose a mixture model that includes both cluster/class and batch parameters, to model batch effects on location and scale simultaneously with the partition. I validate my method in a simulation study and using held-out seroprevalence data, and compare it to existing methods. Finally, I discuss the state of the field of Bayesian mixture models and some potential future research directions.
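The consensus clustering idea above, pooling the final sample from many short chains, can be sketched as follows (a minimal illustration with hypothetical names, not the author's implementation): build a matrix recording how often each pair of items is co-clustered across chains.

```python
import numpy as np

def consensus_matrix(partitions):
    """Fraction of chains in which each pair of items is co-clustered.
    `partitions` is a list of label vectors, one final sample per chain."""
    P = np.asarray(partitions)            # shape (n_chains, n_items)
    n_chains, n_items = P.shape
    C = np.zeros((n_items, n_items))
    for labels in P:
        C += (labels[:, None] == labels[None, :])
    return C / n_chains

# Final samples from three short, independently initialised chains
parts = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
C = consensus_matrix(parts)
# Items 0 and 1 are co-clustered in every chain, so C[0, 1] == 1.0,
# while items 2 and 3 are together in only two of the three chains.
```

Label values need not agree across chains; only co-membership matters, which is what makes pooling independent chains straightforward.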
  • ItemOpen Access
    Adaptive Designs and Methods for More Efficient Drug Development
    Serra, Alessandra
The development of a novel drug is a time-consuming and expensive process. Innovative trial designs and optimal sequences of clinical trials aim to increase the efficiency of this process by improving flexibility and maximising the use of information accumulated throughout the trials, while minimising the number of patients exposed to unsafe or ineffective regimens. In Chapter 2 of this thesis, we focus on confirmatory trials, which are among the largest contributors to cost and time in the later stages of the drug development process. We consider a clinical trial setting where multiple treatment arms are studied concurrently and an ‘order’ (i.e. a monotonic relationship) among the treatment effects can be assumed. We propose a novel design which incorporates the order information into the decision-making while controlling error rates, without assuming any parametric arm-response model. We compare the performance of this novel approach with currently used trial designs, and in Chapter 3 we describe its application to the design of an actual trial in tuberculosis. In Chapter 4, we propose a Bayesian extension of the design described in Chapter 2 which allows the order assumption to be relaxed and historical information to be incorporated. This is needed in settings where, for example, increased side effects or poor compliance with the treatment lead to reduced efficacy and, hence, violation of the order assumption. We compare this design with competing approaches that do not consider uncertainty in the order. In Chapter 5, we focus on the drug development process as a whole. We consider an oncology trial setting and compare two sequences of clinical trials, the first targeting the whole patient population and the second a molecularly defined subgroup within it. We propose a metric to quantify the expected clinical benefit of these two strategies. 
In addition, for each strategy we measure the cost of development as the expected proportion of patients enrolled relative to the total common sample size. We illustrate the evaluation of the proposed metric in an actual trial.
  • ItemOpen Access
Modularized Bayesian Inference: Methodology, Algorithm, Theory and Application
Liu, Yang [0000-0001-7221-5877]
Bayesian inference is a powerful tool for understanding and explaining data and their generating mechanisms, but misspecification of the model is a major threat to the validity of the inference. Although methods that deal with misspecification have been developed and their properties studied, these methods are mainly built on the premise that the whole model is misspecified. Since the real data-generating mechanism is often complex and many factors can affect how data are observed and collected, the reliability of the model may vary widely across its components, leading to partial misspecification. Dealing with such partial misspecification to obtain robust inference remains challenging and requires comprehensive study of methodology, algorithms, theory and potential applications. Modularized Bayesian inference has been developed as a robust alternative to standard Bayesian inference under partial misspecification. In one particular form, cut inference, the influence of misspecified components is removed entirely, yielding a cut distribution that differs from the standard posterior distribution. Existing algorithms that sample from this cut distribution suffer from unclear convergence properties or slow computation. A novel algorithm, the stochastic approximation cut algorithm (SACut), is proposed in this thesis, and its theoretical and computational properties are studied. A general framework for cut inference beyond the generic two-module case, in which one component is assumed to be misspecified, is lacking. In particular, the definition of a ‘module’ remains vague in the literature, and implementing cut inference for an arbitrary multiple-module case remains an open question. Solving these basic questions is both appealing and necessary. 
This thesis formulates the rules one should follow to implement cut inference within an arbitrary model structure: the definition of modules, the determination of relationships between modules, and the construction of the cut distribution. Semi-Modular inference bridges the gap between standard Bayesian inference and cut inference through the use of a likelihood with a power term. Interestingly, this feature corresponds to the geographically weighted regression (GWR) model, which was developed to handle spatial non-stationarity but has hitherto not been extended to Bayesian inference except for Gaussian regression. This thesis proposes the Bayesian GWR model as a particular multiple-module case of Semi-Modular inference, and extends the theory of Semi-Modular inference to the multiple-module case to justify it. Modularized Bayesian inference remains a young and emerging topic. As one of many pioneering works promoting modularized Bayesian inference across a broader range of statistical models, it is hoped that this thesis will inform future developments in methodology and algorithms, and stimulate applications of modularized Bayesian inference.
  • ItemOpen Access
    Statistical methods to improve understanding of the genetic basis of complex diseases
Hutchinson, Anna [0000-0002-9224-4410]
Robust statistical methods, utilising the vast amounts of genetic data that are now available, are required to resolve the genetic aetiology of complex human diseases, including immune-mediated diseases. Essential to this process is, firstly, the use of genome-wide association studies (GWAS) to identify regions of the genome that determine susceptibility to a given complex disease. Identified regions can then be fine-mapped with the aim of deducing the specific sequence variants that are causal for the disease of interest. Functional genomic data are now routinely generated from high-throughput experiments. These data can reveal clues about disease biology, for example elucidating the functional genomic annotations that are enriched for disease-associated variants. In this thesis I describe a novel methodology based on the conditional false discovery rate (cFDR) that leverages functional genomic data together with genetic association data to increase statistical power for GWAS discovery whilst controlling the FDR. I demonstrate the practical potential of my method through applications to asthma and type 1 diabetes (T1D) and validate my results using the larger, independent UK Biobank data resource. Fine-mapping is used to derive credible sets of putative causal variants in regions found to be associated in GWAS. I show that these sets are generally over-conservative because fine-mapping data sets are not randomly sampled, but are instead sampled from a subset of those with the largest effect sizes. I develop a method to derive credible sets that contain fewer variants whilst still containing the true causal variant with high probability. I use my method to improve the resolution of fine-mapping studies for T1D and ankylosing spondylitis. This enables a more efficient allocation of resources in the expensive functional follow-up studies that are used to distinguish the true causal variants from the prioritised sets of variants. 
Whilst GWAS investigate genome-wide patterns of association, studying a specific biological factor using a variety of data sources is likely to give a more detailed perspective on disease pathogenesis. Taking this more holistic approach, I utilise a variety of genetic and functional genomic data in a range of statistical genetics techniques to try to decipher the role of the Ikaros family of transcription factors in T1D pathogenesis. I find that T1D-associated variants are enriched in Ikaros binding sites in immune-relevant cell types, but that there is no evidence of epistatic effects between causal variants residing in the Ikaros gene region and variants residing in genome-wide Ikaros binding sites, suggesting that these sets of variants do not act synergistically to influence T1D risk. Altogether, in this thesis I develop and examine a range of statistical methods to aid understanding of the genetic basis of complex human diseases, with application specifically to immune-mediated diseases.
  • ItemOpen Access
    Approaches to developing clinically useful Bayesian risk prediction models
Karapanagiotis, Solon [0000-0003-4460-2073]
Prediction of the presence of disease (diagnosis) or of an event in the future course of disease (prognosis) is becoming increasingly important in the current era of personalised medicine. Both tasks are supported by (risk) prediction models, which usually combine multiple variables using statistical and/or machine learning approaches. Recent advances in prediction models have improved diagnostic and prognostic accuracy, in some cases surpassing the performance of clinicians. However, evidence is lacking that deployment of these models has improved care and patient outcomes; that is, their clinical usefulness is debatable. One barrier to demonstrating such improvement is the basis used to evaluate their performance. In this thesis, we explore methods for developing (building and evaluating) risk prediction models, in an attempt to create clinically useful models. We start by introducing a few commonly used metrics for evaluating the predictive performance of prediction models. We then show that good predictive performance is not enough to guarantee clinical usefulness: a well-performing model can be clinically useless, and a poor model valuable. Following a recent line of work, we adopt a decision-theoretic approach to model evaluation that allows us to determine whether the model would change medical decisions and, if so, whether the outcome of interest would improve as a result. We then apply this approach to investigate the clinical usefulness of including information about circulating tumour DNA (ctDNA) when predicting response to treatment in metastatic breast cancer. ctDNA has been proposed as a promising means of assessing response to treatment. We show that incorporating ctDNA trajectories results in a clinically useful model and can improve clinical decisions. 
However, an inherent limitation of the decision-theoretic approach (and related ones) is that model building and evaluation are done independently. During training, the prediction model is agnostic of the clinical consequences of its use, that is, of its (clinical) purpose, e.g. which type of classification error is more costly (i.e. undesirable). We address this shortcoming by introducing Tailored Bayes (TB), a novel Bayesian inference framework which “tailors” model fitting to optimise predictive performance with respect to unbalanced misclassification costs. In both simulated and real-world applications, we find that our approach performs favourably in comparison to standard Bayesian methods. We then extend the framework to situations where a large number of (potentially irrelevant) variables are measured; such high-dimensional settings represent a ubiquitous challenge in modern scientific research. We introduce a sparse TB framework for variable selection and find that TB favours smaller models (with fewer variables) than standard Bayesian methods, whilst performing as well or better, in both simulated and real data. In addition, we show that the relative importance of the variables changes when we consider unbalanced misclassification costs.
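One widely used decision-theoretic evaluation metric of the kind discussed above is the net benefit from decision curve analysis, which weighs true positives against false positives at a clinically chosen risk threshold. The sketch below is a minimal illustration with hypothetical names; the thesis may use a different formulation.

```python
def net_benefit(y_true, risk, threshold):
    """Net benefit of treating patients whose predicted risk is at
    least `threshold` (decision curve analysis): the true-positive
    rate minus the false-positive rate weighted by the threshold odds."""
    n = len(y_true)
    treat = [r >= threshold for r in risk]
    tp = sum(1 for y, t in zip(y_true, treat) if t and y == 1)
    fp = sum(1 for y, t in zip(y_true, treat) if t and y == 0)
    odds = threshold / (1.0 - threshold)
    return tp / n - fp / n * odds

y = [1, 1, 0, 0, 1, 0, 0, 0]                     # observed outcomes
p = [0.9, 0.6, 0.4, 0.1, 0.8, 0.3, 0.2, 0.7]     # predicted risks
nb_model = net_benefit(y, p, threshold=0.5)
nb_treat_all = net_benefit(y, [1.0] * len(y), threshold=0.5)
# The model beats the treat-everyone policy at this threshold.
```

A model is clinically useful at a given threshold only if its net benefit exceeds both the treat-all and treat-none (net benefit zero) strategies, which is exactly the sense in which a well-calibrated but decision-irrelevant model can fail to be useful.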
  • ItemOpen Access
    Weighting and moment conditions in Bayesian inference
    (2021-10-23) Yiu, Andrew
The work presented in this thesis was motivated by the goal of developing Bayesian methods for "weighted" biomedical data. To be more specific, we are referring to probability weights, which are used to adjust for distributional differences between the sample and the population. Sometimes, these differences occur by design; data collectors can choose to implement an unequal probability sampling frame to optimize efficiency subject to constraints. If so, the probability weights are known and are traditionally equal to the inverse of the unit sampling probabilities. It is often the case, however, that the sampling mechanism is unknown. Methods that use estimated weights include so-called doubly robust estimators, which have become popular in causal inference. There is a lack of consensus regarding the role of probability weights in Bayesian inference. In some settings, it is reasonable to believe that conditioning on certain observed variables is sufficient to adjust for selection; the sampling mechanism is then deemed ignorable in a Bayesian analysis. In Chapter 2, we develop a Bayesian approach for case-cohort data that ignores the sampling mechanism and outperforms existing methods, including those that involve inverse probability weighting. Our approach showcases some key strengths of the Bayesian paradigm: namely, the marginalization of nuisance parameters, and the availability of sophisticated computational techniques from the MCMC literature. We analyse data from the EPIC-Norfolk cohort study to investigate the associations between saturated fatty acids and incident type-2 diabetes. However, ignoring the sampling is not always beneficial. For a variety of popular problems, weighting offers the potential for increased robustness, efficiency and bias-correction. It is also of interest to consider settings where sampling is nonignorable, but weights are available (only) for the selected units. 
This is tricky to handle in a conventional Bayesian framework; one must either make ad-hoc adjustments, or attempt to model the distribution of the weights. The latter is infeasible without additional untestable assumptions if the weights are not exact probability weights, e.g. due to trimming or calibration. By contrast, weighting methods are usually simple to implement in this context and are virtually model-free. Chapters 3 and 4 develop approaches that are capable of combining weighting with Bayesian modelling. A key ingredient is to define target quantities as the solutions to moment conditions, as opposed to "true" components of parametric models. By doing so, the quantities coincide with the usual definitions if working model assumptions hold, but retain the interpretation of being projections if the assumptions are violated. This allows us to nonparametrically model the data-generating distribution and obtain the posterior of the target quantity implicitly. Crucially, our approaches still enable the user to directly specify their prior for the target quantity, in contrast to common nonparametric Bayesian models like Dirichlet processes. The scope of our methodology extends beyond our original motivations. In particular, we can tackle a whole class of problems that would ordinarily be handled using estimating equations and robust variance estimation. Such problems are often called semiparametric because we are interested in estimating a finite-dimensional parameter in the presence of an infinite-dimensional nuisance parameter. Chapter 4 studies examples such as linear regression with heteroscedastic errors, and quantile regression.
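The inverse probability weighting referred to above can be illustrated with a minimal (Hájek-normalised) sketch, assuming the unit inclusion probabilities are known; the function name is hypothetical.

```python
def ipw_mean(values, sampling_probs):
    """Inverse-probability-weighted (Hajek) estimate of a population
    mean from a sample drawn with known, unequal inclusion
    probabilities: weight each unit by 1 / (its inclusion probability)."""
    weights = [1.0 / p for p in sampling_probs]
    total = sum(w * v for w, v in zip(weights, values))
    return total / sum(weights)

# Units from a rare subgroup were oversampled (inclusion probability
# 0.8) relative to the rest (0.2); weighting undoes the imbalance.
values = [10, 10, 50]        # observed outcomes
probs = [0.2, 0.2, 0.8]      # known inclusion probabilities
est = ipw_mean(values, probs)
# est lies well below the naive sample mean, because the oversampled
# high-outcome unit is down-weighted.
```

This is the "virtually model-free" flavour of weighting the abstract contrasts with fully Bayesian modelling of the sampling mechanism.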
  • ItemOpen Access
    Curtailed phase II binary outcome trials and adaptive multi-outcome trials
(2021-11-27) Law, Martin [0000-0001-9594-348X]
Phase II clinical trials are a critical aspect of the drug development process. With drug development costs ever increasing, novel designs that can improve the efficiency of phase II trials are extremely valuable. Phase II clinical trials for cancer treatments often measure a binary outcome. The final trial decision is generally whether to continue or cease development. When this decision is based solely on the result of a hypothesis test, the result may be known with certainty before the planned end of the trial; unfortunately, there is often no opportunity for early stopping when this occurs. Some existing designs do permit early stopping in this case, accordingly reducing the required sample size and potentially speeding up drug development. Further improvements can be achieved by stopping early when the final trial decision is very likely, rather than certain, an approach known as stochastic curtailment. While some authors have proposed approaches of this form, they have limitations, such as relying on simulation, considering relatively few possible designs, and not permitting early stopping when a treatment is promising. In this thesis we address these limitations by proposing design approaches for single-arm and two-arm phase II binary-outcome trials. We use exact distributions, avoiding simulation, consider a wider range of possible designs, and permit early stopping for promising treatments. As a result, we obtain trial designs with considerably reduced sample sizes on average. We then turn to the fact that clinical trials often measure multiple outcomes of interest. Existing multi-outcome designs focus almost entirely on evaluating whether all outcomes, or whether at least one outcome, show evidence of efficacy. 
While a small number of authors have provided multi-outcome designs that evaluate whether a general number of outcomes show promise, these designs have been single-stage in nature only. We therefore propose two designs, of group-sequential and drop-the-loser form, that provide this characteristic in a multi-stage setting. Previous such multi-outcome multi-stage designs have allowed for a maximum of two outcomes; our designs thus also extend previous related proposals by permitting any number of outcomes.
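The difference between exact (non-stochastic) and stochastic curtailment for a single-arm binary-outcome trial can be sketched as follows, assuming a simple design that declares success on at least r responses out of n patients. Helper names are hypothetical; the point is that both checks use exact binomial probabilities rather than simulation.

```python
from math import comb

def decision_settled(successes, observed, n, r):
    """Exact (non-stochastic) curtailment: is the final decision of a
    trial declaring success on >= r responses in n already certain?"""
    remaining = n - observed
    if successes >= r:
        return "stop: success certain"
    if successes + remaining < r:
        return "stop: failure certain"
    return "continue"

def conditional_power(successes, observed, n, r, p):
    """Probability, assuming true response rate p, that the trial
    still ends in success; stochastic curtailment stops for futility
    when this drops below a chosen threshold."""
    remaining = n - observed
    need = max(r - successes, 0)
    return sum(comb(remaining, k) * p ** k * (1 - p) ** (remaining - k)
               for k in range(need, remaining + 1))

# After 20 of 30 patients with only 2 responses, failure is certain:
print(decision_settled(successes=2, observed=20, n=30, r=13))
# With 8 responses the decision is not settled, but success is
# unlikely, so stochastic curtailment might stop for futility:
cp = conditional_power(successes=8, observed=20, n=30, r=13, p=0.3)
```

Stopping when `conditional_power` is merely small, rather than exactly zero, is what yields the larger expected sample-size savings the abstract describes, at the price of a small chance of a wrong early decision.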
  • ItemOpen Access
    Methods for Using Biomarker Information in Randomized Clinical Trials
    Wang, Jixiong
Advances in high-throughput biological technologies have led to large numbers of potentially predictive biomarkers becoming routinely measured in modern clinical trials. Biomarkers which influence treatment efficacy may be used to find subgroups of patients who are most likely to benefit from a new treatment. Consequently, there is growing interest in better approaches to identify biomarker signatures and utilize biomarker information in clinical trials. The first focus of this thesis is developing methods for detecting biomarker-treatment interactions in large-scale trials. Traditional interaction analysis, which uses regression models to test biomarker-treatment interactions one biomarker at a time, may suffer from poor power when there is a large multiple-testing burden. I adapt recently proposed two-stage interaction-detection procedures for application in randomized clinical trials, and propose two new stage 1 multivariate screening strategies using lasso and ridge regressions to account for correlations among biomarkers. For these new multivariate screening strategies, I prove the asymptotic between-stage independence required for family-wise error rate control. Simulation and real-data application results demonstrate greater power of the new strategies compared with previously existing approaches. The second focus of this thesis is developing methods for utilizing biomarker information during the course of a randomized clinical trial to improve the informativeness of results. Under the adaptive signature design (ASD) framework, I propose two new classifiers that more efficiently leverage biomarker signatures to select a subgroup of patients who are most likely to benefit from the new treatment. I provide analytical arguments and demonstrate through simulations that these two proposed classification criteria can provide at least as good, and sometimes significantly greater, power than the originally proposed ASD classifier. 
Third, I focus on an important issue in the statistical analysis of interactions for binary outcomes, which is pertinent to both topics above. Testing for biomarker-treatment interactions with logistic regression can suffer from an elevated number of type I errors due to asymptotic bias of the interaction regression coefficient under model misspecification. I analyze this problem in the randomized clinical trial setting and propose two new de-biasing procedures, which offer improved family-wise error rate control in various simulated scenarios. Finally, I summarize the main contributions of the work above, discuss some practical limitations as well as its real-world value, and prioritize future directions of research building upon this thesis.
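The two-stage idea of screening with a multivariate fit before testing interactions can be sketched roughly as follows (an illustrative toy with hypothetical names; the thesis's actual screening statistics and its proof of between-stage independence are more involved). Stage 1 ranks biomarkers by the magnitude of closed-form ridge coefficients; stage 2 would then test biomarker-treatment interactions only for the screened subset, shrinking the multiple-testing burden.

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_screen(X, y, lam, top_k):
    """Stage 1: rank biomarkers by the magnitude of their ridge
    regression coefficients (closed form) and keep the top_k."""
    p = X.shape[1]
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    return np.argsort(-np.abs(beta))[:top_k]

# Toy trial: 200 patients, 50 biomarkers; only biomarker 0 has a
# main effect and a treatment interaction.
n, p = 200, 50
X = rng.standard_normal((n, p))
treat = rng.integers(0, 2, size=n)
y = 1.5 * X[:, 0] + 2.0 * treat * X[:, 0] + rng.standard_normal(n)

kept = ridge_screen(X, y, lam=1.0, top_k=5)
# Stage 2 would test interactions only for the 5 screened biomarkers,
# reducing the Bonferroni correction from 50 tests to 5.
```

The family-wise error rate guarantee rests on the stage 1 statistic being asymptotically independent of the stage 2 interaction tests, which is the property the thesis establishes for its screening strategies.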
  • ItemOpen Access
    Modelling longitudinal data on respiratory infections to inform health policy
Chiavenna, Chiara [0000-0003-0761-5633]
Detecting the start of an outbreak, quantifying its burden, disentangling the contributions of different pathogens and evaluating the effectiveness of an intervention are research questions common to several infectious diseases. The answers to these questions provide the epidemiological understanding needed to prevent future outbreaks, by informing public health policies such as drug stockpiling, vaccination regimes or non-medical interventions. We investigate the use of statistical models to quantify the burden of respiratory disease and evaluate the effectiveness of public health interventions, while accounting for the challenges posed by surveillance data. The observational nature of the available information, which is affected by confounding, makes causal statements difficult. Improvements to routinely employed methodologies are proposed, employing phenomenological models to estimate a counterfactual, i.e. what would have happened in the absence of a contributing factor or intervention. We apply these methods to different types of studies, to address specific gaps in the literature. S. pneumoniae is the leading cause of respiratory morbidity and mortality globally, especially in young children and the elderly. To improve understanding of the factors triggering disease progression, we first analyse individual-level information about pneumococcal carriage and lower respiratory tract infection with a multi-state model, using data from a cohort study in Thailand. Secondly, we clarify the role of viral coinfection and meteorological conditions in invasive pneumococcal disease (IPD) incidence using English surveillance data. A novel multivariate linear regression model is proposed to estimate the influenza-specific contribution additional to the seasonal IPD burden across age groups. We then quantify the impact of the currently implemented vaccination policy, by estimating the counterfactual IPD incidence in the absence of vaccination. 
This allows serotype replacement to be disentangled from the vaccine effect, making use of a synthetic control approach. Finally, an empirical dynamical modelling strategy is employed to quantify the interaction between influenza and pneumococcus. Counterfactual analysis can also be employed to quantify the burden of novel respiratory pathogens. The last application of this approach is to estimate the excess mortality during the COVID-19 pandemic in England.
  • ItemOpen Access
    Statistical methods for multi-omic data integration
    Cabassi, Alessandra; Cabassi, Alessandra [0000-0003-1605-652X]
    The thesis is focused on the development of new ways to integrate multiple ’omic datasets in the context of precision medicine. Analyses of this type have the potential to help researchers deepen their understanding of the biological mechanisms underlying disease. However, integrative studies pose several challenges, due to the typically widely differing characteristics of the ’omic layers in terms of number of predictors, type of data, and level of noise. In this work, we first tackle the problem of performing variable selection and building supervised models, while integrating multiple ’omic datasets of different types. It has recently been shown that applying classical logistic regression with elastic-net penalty to these datasets can lead to poor results. Therefore, we suggest a two-step approach to multi-omic logistic regression in which variable selection is performed on each layer separately and a predictive model is subsequently built on the ensemble of the selected variables. In the unsupervised setting, we first examine cluster of clusters analysis (COCA), an integrative clustering approach that combines information from multiple data sources. COCA has been widely applied in the context of tumour subtyping, but its properties have never been systematically explored before, and its robustness to the inclusion of noisy datasets is unclear. Then, we propose a new statistical method for the unsupervised integration of multi-omic data, called kernel learning integrative clustering (KLIC). This approach is based on the idea of framing the challenge of combining clustering structures as a multiple kernel learning problem, in which different datasets each provide a weighted contribution to the final clustering. Finally, we build upon the notion of the posterior similarity matrix (PSM) in order to suggest new approaches for summarising the output of MCMC algorithms for Bayesian mixture models.
A key contribution of our work is the observation that PSMs can be used to define probabilistically-motivated kernel matrices that capture the clustering structure present in the data. This observation enables us to employ a range of kernel methods to obtain summary clusterings, and, if we have multiple PSMs, use standard methods for combining kernels in order to perform integrative clustering. We also show that one can embed PSMs within predictive kernel models in order to perform outcome-guided clustering.
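As a toy illustration of the PSM idea (a minimal sketch, not the thesis code; the function name and toy draws are invented): the PSM entry for a pair of items is the fraction of posterior samples in which the two items are allocated to the same cluster, giving a symmetric matrix with unit diagonal that can serve as a kernel.

```python
import numpy as np

def posterior_similarity_matrix(labels):
    """Posterior similarity matrix from MCMC cluster allocations.

    labels: (n_samples, n_items) array; labels[s, i] is the cluster
    of item i at MCMC iteration s.
    Returns an (n_items, n_items) matrix of co-clustering frequencies.
    """
    labels = np.asarray(labels)
    S, n = labels.shape
    psm = np.zeros((n, n))
    for s in range(S):
        # Outer comparison: 1 where items share a cluster at iteration s
        psm += (labels[s][:, None] == labels[s][None, :])
    return psm / S

# Toy draws: items 0 and 1 always co-cluster, item 2 joins them half the time
draws = np.array([[0, 0, 1],
                  [1, 1, 1],
                  [0, 0, 0],
                  [2, 2, 1]])
psm = posterior_similarity_matrix(draws)
# psm[0, 1] = 1.0, psm[0, 2] = 0.5
```

Because each iteration contributes an outer product of cluster-indicator vectors, the PSM is positive semi-definite, which is what allows it to be treated as a kernel matrix.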
  • ItemOpen Access
    Developing tailored approaches from multi-arm randomised trials with an application to blood donation
    Xu, Yuejia
    There is a growing interest in personalised medicine where individual heterogeneity is incorporated into decision-making and treatments are tailored to individual patients or patient subgroups in order to provide better healthcare. The National Health Service Blood and Transplant (NHSBT) in England aims to move towards a more personalised service and the National Institute for Health Research Blood and Transplant Research Unit (NIHR BTRU) call has mandated research to “identify, characterise and exploit biomarkers in personalising donation strategies to maximise donor health and the blood supply”. The work presented in this thesis was motivated by a large-scale, UK-based blood donation trial called INTERVAL. In INTERVAL, male donors were randomly assigned to 12-week, 10-week, and 8-week inter-donation intervals, and female donors to 16-week, 14-week, and 12-week inter-donation intervals. The outcomes of this trial include the amount of blood collected (primary), the number of low haemoglobin deferrals, and donors' quality of life. The INTERVAL trial has collected a wealth of information on individual donor characteristics, enabling us to explore (i) whether different inter-donation intervals should be recommended for donors with different characteristics (by age, blood measurements, etc.), and (ii) donor stratification schemes, for example, how to partition donors into those who have the capacity to give blood more frequently than the general donor population and those who tend to be deferred more often due to safety concerns than the average donor.
One of the main statistical challenges arising from the development of personalised donation strategies using the data from the INTERVAL trial is that there are three (ordered) randomised groups for each gender in this trial, while the majority of existing statistical approaches developed in the personalised medicine context can only handle two randomised groups and thus are not directly applicable to the INTERVAL data. This thesis aims to address issues related to this added methodological complexity. We hope that the methodologies developed in this thesis can not only help us better analyse the INTERVAL data but also facilitate the analysis of other multi-arm trials in a wider range of medical applications in addition to blood donation. We begin by summarising methods that can be used to estimate the optimal individualised treatment rule (ITR) in multi-arm trials and comparing their performance in large-scale trials via simulation studies in Chapter 2. We also apply these methods to the data from male donors in the INTERVAL trial to estimate the optimal personalised donation strategies under three different objectives: (i) maximise the total units of blood collected by the blood service, (ii) minimise the low haemoglobin deferral rates, and (iii) maximise a utility score that “discounts” the total units of blood collected by the incidences of low haemoglobin deferrals. The three inter-donation intervals in the INTERVAL trial exhibit a natural ordering, and applying the ITR estimation methods that ignore the ordinality may result in suboptimal decisions. We are thus motivated to propose a method that effectively incorporates information on the ordinality of randomised groups to identify the optimal ITR in the ordinal-arm setting in Chapter 3. We further develop variable selection methods under the proposed framework to handle situations with noise covariates that are irrelevant for decision-making. 
Through simulation studies and an application to the data from a target donor population (“much-in-demand but vulnerable”) in the INTERVAL trial, we demonstrate that the proposed method has superior performance over methods that ignore the ordinality. In Chapter 4, we switch focus to donor (or “patient” in a more general sense) stratification in multi-arm trials and develop a novel method for stratifying subjects with heterogeneous intervention responses and covariate profiles into more homogeneous subgroups using Bayesian clustering techniques. The “imputed” potential outcomes under different randomised groups are linked to subjects’ baseline characteristics nonparametrically through cluster (subgroup) memberships. We examine the performance of our proposed method via simulation studies and we illustrate the utility of the method by applying it to the INTERVAL data to stratify donors based on their capacity to donate.
  • ItemOpen Access
    Beyond Parameter Estimation: Analysis of the Case-Cohort Design in Cox Models
    (2020-06-30) Connolly, Susan Elizabeth; Connolly, Susan Elizabeth [0000-0003-1443-3598]
    Cohort studies allow for powerful analysis, but an exposure may be too expensive to measure in the whole cohort. The case-cohort design measures covariates in a random sample (subcohort) of the full cohort, as well as in all cases that emerge, regardless of their initial presence in the subcohort. It is an increasingly popular method, particularly for medical and biological research, due to its efficiency and flexibility. However, the case-cohort design poses a number of challenges for estimation and post-estimation procedures. Cases are over-represented in the dataset, and hence estimation of coefficients in this design requires weighting of observations. This results in a pseudopartial likelihood, and standard post-estimation methods may not be readily transferable to the case-cohort design. This thesis presents theory and simulation studies for application of estimation and post-estimation methods in the case-cohort design. In the majority of extant literature considering methods for the case-cohort design, simulation studies generally consider full cohort sizes, sampling fractions, and case percentages that are dissimilar to those seen in practice. In this thesis the design of the simulation studies aims to provide circumstances which are similar to those encountered when using case-cohort designs in practice. Further, these methods are applied to the InterAct dataset, and practical advice and sample code for Stata are presented. Estimation of Coefficients & Cumulative Baseline Hazard: For estimation of coefficients, Prentice weighting and Barlow weighting are the most commonly used (Sharp et al., 2014). Inverse Probability Weighting (IPW), in this context, refers to methods where the entire case-cohort sample at risk is used in the analysis, as opposed to Prentice and Barlow weighting systems, where cases outside the subcohort sample are only included in risk sets just prior to their time of failure.
This thesis assesses bias and precision of Prentice, Barlow and IPW weighting methods in the case-cohort design. Simulation studies show IPW, Prentice and Barlow weighting to have similar low bias. Where case percentage is high, IPW weighting shows an increase in precision over Prentice and Barlow, though this improvement is small. Checks of Model Assumptions: Appropriateness of covariate functional form in the standard Cox model can be assessed graphically by smoothed martingale residuals against various other values, such as time and covariates of interest (Therneau et al., 1990). The over-representation of cases in the case-cohort data, as compared to the full cohort, distorts the properties of such residuals. Methods related to IPW that adapt such plots to the case-cohort design are presented. Detection of non-proportional hazards by use of Schoenfeld residuals, scaled Schoenfeld residuals, and inclusion of time-varying covariates in the model are assessed and compared by simulation studies, finding that where risk set sizes are not overly variable, all three methods are appropriate for use in the case-cohort design, with similar power. Where case-cohort risk set sizes are more variable, methods based on Schoenfeld residuals and scaled Schoenfeld residuals show high Type 1 error rate. Model Comparison & Variable Selection: The methods of Lumley & Scott (2013, 2015) for modification of the Likelihood Ratio test (dLR), AIC (dAIC) and BIC (dBIC) in complex survey sampling are applied to case-cohort data and assessed in simulation studies. In the absence of sparse data, dLR is found to have similar power to robust Wald tests, with Type 1 error rate approximately 5%. In the presence of sparse data, the dLR is superior to robust Wald tests. In the absence of sparse data dBIC shows little difference from the naive use of the pseudo-log-likelihood in the standard BIC formula (pBIC).
In the presence of sparse data dBIC shows reduced power to select the true model, and pBIC is superior. dAIC shows improvement in power to select the true model over naive methods. Where subcohort size and number of cases are not overly small, loss of power from the full cohort for dAIC, dBIC and pBIC is not substantial.
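For intuition, the IPW scheme described above can be sketched as follows (an illustrative sketch, not the thesis code; the function name is invented). Cases keep weight 1, subcohort non-cases are upweighted by the inverse of the subcohort sampling fraction, and the resulting weights would then be supplied to a weighted Cox fit:

```python
import numpy as np

def ipw_weights(in_subcohort, is_case, sampling_fraction):
    """Inverse-probability weights for a case-cohort analysis.

    Cases (sampled with probability 1) get weight 1; subcohort
    non-cases stand in for the unsampled cohort and are upweighted
    by the inverse of the subcohort sampling fraction.
    """
    in_subcohort = np.asarray(in_subcohort, dtype=bool)
    is_case = np.asarray(is_case, dtype=bool)
    w = np.where(is_case, 1.0, 1.0 / sampling_fraction)
    # Non-cases outside the subcohort were never sampled: weight 0
    w[~in_subcohort & ~is_case] = 0.0
    return w

# 10% subcohort: a non-case subcohort member represents 10 cohort members
w = ipw_weights(in_subcohort=[1, 1, 0], is_case=[0, 1, 1],
                sampling_fraction=0.1)
# w -> [10., 1., 1.]
```

This differs from Prentice and Barlow weighting precisely in keeping cases in all risk sets from entry, rather than only just before their failure time.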
  • ItemOpen Access
    Statistical Methods to Improve Efficiency in Composite Endpoint Analysis
    (2020-04-25) McMenamin, Martina; McMenamin, Martina [0000-0001-7784-2271]
    Composite endpoints combine a number of outcomes to assess the efficacy of a treatment. They are used in situations where it is difficult to identify a single relevant endpoint, such as in complex multisystem diseases. Our focus in this thesis is on composite responder endpoints, which allocate patients as either ‘responders’ or ‘non-responders’ based on whether they cross predefined thresholds in the individual outcomes. These composites are often combinations of continuous and discrete measures and are typically collapsed into a single binary endpoint and analysed using logistic regression. However, this is at the expense of losing information on how close each patient was to the responder threshold. As well as being inefficient, the analysis is sensitive to misclassification due to measurement error. The augmented binary method was introduced to improve the analysis of composite responder endpoints comprised of a single continuous and binary endpoint, by making use of the continuous information. In this thesis we build on this work to address some of the existing limitations. We implement small sample corrections for the standard binary and augmented binary methods and assess the performance for application in rare disease trials, where the gains are most needed. We find that employing the small sample corrected augmented binary method results in a reduction of required sample size of 32%. Motivated by systemic lupus erythematosus (SLE), we consider the case where the composite has multiple continuous, ordinal and binary components. We adapt latent variable models for application to these endpoints and assess the performance in simulated data and phase IIb trial data in SLE. Our findings show reductions in required sample size of at least 60%, however the magnitude of the gains depends on which components drive response. Finally, we develop a method for sample size estimation so that the model may be used as a primary analysis method in clinical trials.
We assess the impact of correlation structure and drivers of response on the sample size required.
  • ItemOpen Access
    High-dimensional covariance estimation with applications to functional genomics
    (2020-03-11) Gray, Harry; Gray, Harry [0000-0002-6714-0089]
    Covariance matrix estimation plays a central role in statistical analyses. In molecular biology, for instance, covariance estimation facilitates the identification of dependence structures between molecular variables that shed light on the underlying biological processes. However, covariance estimation is generally difficult because high-throughput molecular experiments often generate high-dimensional and noisy data, possibly with missing values. In this context, there is a need to develop scalable and robust estimation methods that can improve inference by, for example, taking advantage of the many sources of external information available in public repositories. This thesis introduces novel methods and software for estimating covariance matrices from high-dimensional data. Chapter 2 introduces a flexible and scalable Bayesian linear shrinkage covariance estimator. This accommodates multiple shrinkage target matrices, allowing the incorporation of external information from an arbitrary number of sources. It is also less sensitive to target misspecification and can outperform state-of-the-art single-target linear shrinkage estimators. Chapter 3 explores a dimensionality reduction approach --- probabilistic principal component analysis --- as a model-based covariance estimation method that can handle missing values. By assuming a low-dimensional latent structure, this is particularly useful when the inverse covariance is required (e.g. network inference). All of our methods are implemented as well-documented open-source R libraries. Finally, Chapter 4 presents a case study using a dataset of cytokine expression in patients with traumatic brain injury. Studies of this type are crucial to researching the inflammatory response in the brain and potential patient recovery. However, due to the difficulties in patient recruitment, they result in high-dimensional datasets with relatively low sample sizes.
We show how our methods can facilitate the multivariate analysis of cytokines across time and different treatment regimes.
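A single-target version of the linear shrinkage idea can be sketched in a few lines (illustrative only; the thesis develops a multi-target Bayesian estimator, and the fixed shrinkage weight and function name here are assumptions):

```python
import numpy as np

def linear_shrinkage(X, target=None, lam=0.5):
    """Single-target linear shrinkage of the sample covariance.

    Shrinks the sample covariance S towards a target matrix T:
        S_shrunk = (1 - lam) * S + lam * T
    By default the target is the identity scaled by the mean variance.
    """
    X = np.asarray(X, dtype=float)
    S = np.cov(X, rowvar=False)
    if target is None:
        target = np.eye(S.shape[0]) * np.trace(S) / S.shape[0]
    return (1.0 - lam) * S + lam * target

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))   # n = 10 samples, p = 50 variables: p > n
S_shrunk = linear_shrinkage(X, lam=0.3)
# The rank-deficient sample covariance becomes full rank (invertible)
```

Blending a positive definite target into the rank-deficient sample covariance is what makes the estimate invertible when p exceeds n, which is the regime these genomics applications live in.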
  • ItemOpen Access
    Statistical inference in stochastic/deterministic epidemic models to jointly estimate transmission and severity
    (2019-07-19) Corbella, Alice; Corbella, Alice [0000-0002-8751-181X]
    This thesis explores the joint estimation of transmission and severity of infectious diseases, focussing on the specific case of influenza. Transmission governs the speed and magnitude of viral spread in a population, while severity determines morbidity and mortality and the resulting effect on health care facilities. Their quantification is crucial to inform public health policies, motivating the routine collection of data on influenza cases. The estimation of severity is compromised by the high degree of censoring affecting the data early during the epidemic. The challenge of estimating transmission is that each influenza data source is often affected by noise and selection bias and individually provides only partial information on the underlying process. To address severity estimation with highly censored data, new methods, inspired by demographic models and by parametric survival analysis, are formulated. A comprehensive review of these and existing methods is also carried out. To jointly estimate transmission and severity, an initial Bayesian epidemic model is fitted to historical data on severe cases, assuming a deterministic severity process and using a single data source. This model is then extended to describe a more stochastic and hence more realistic process of severe events, with the data generating process governed by hidden random variables in a state-space framework. Such increased realism necessitates the use of multiple data sources to enhance parameter identifiability, in a Bayesian evidence synthesis context. In contrast to the literature in the field, the model introduced accounts for dependencies between datasets. The added stochasticity and unmeasured dependencies result in an intractable likelihood. Inference therefore requires a new approach based on Monte Carlo methods. The method proposed proves its potential and usefulness in the concluding application to real data from the latest (2017/18) epidemic of influenza in England.
  • ItemOpen Access
    Bayesian modelling and sampling strategies for ordering and clustering problems with a focus on next-generation sequencing data
    (2019-07-20) Strauss, Magdalena Elisabeth
    This thesis presents novel methods for ordering and clustering problems. The first two parts focus on the development of models and sampling strategies specifically tailored for next-generation sequencing data. Most high-throughput measurements for single-cell data are destructive, resulting in the loss of longitudinal information. I developed a new Bayesian approach to reconstructing this information computationally, sampling orders efficiently using MCMC on a space of permutations. This Bayesian approach provides novel insights into biological phenomena and experimental artefacts. The second part presents a new clustering method for single-cell data, which specifically models the uncertainty of the clustering structure that results in part from the uncertainty of the orders discussed above. The proposed method uses nonparametric Bayesian methods, consensus clustering and efficient MCMC sampling to identify differences in dynamic patterns for different branches of gene expression data. It also categorises genes in a way consistent with biological function in an application to stimulated dendritic cells, and integrates data from different cell lines in a principled way. The third part of the thesis adapts some of the methods developed in the first two parts to applications with very sparsely and irregularly sampled data, and explores through simulations the applicability of such models in different circumstances. The fourth part discusses clustering methods for samples in a variety of different contexts, such as RNA expression, methylation or protein expression, and develops and critically discusses a novel hierarchical Bayesian method that integrates both different contexts and different groups of samples, for example different cancer types.
The unifying underlying theme of the thesis is the development of methods and efficient sampling and approximation strategies capable of capturing the uncertainty inherent in any statistical analysis of high-dimensional and noisy data.
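The idea of MCMC on a space of permutations can be illustrated with a generic Metropolis sketch (not the thesis sampler; the toy target favouring orders close to the identity is invented): propose a random transposition of the current order and accept with the usual symmetric-proposal ratio.

```python
import numpy as np

def metropolis_permutations(log_target, n_items, n_iter, rng):
    """Metropolis sampler on the space of permutations.

    Proposes swapping two randomly chosen positions of the current
    order; the swap proposal is symmetric, so the acceptance ratio
    reduces to the target ratio. `log_target` scores a permutation.
    """
    order = rng.permutation(n_items)
    samples = []
    for _ in range(n_iter):
        i, j = rng.choice(n_items, size=2, replace=False)
        proposal = order.copy()
        proposal[i], proposal[j] = proposal[j], proposal[i]
        if np.log(rng.uniform()) < log_target(proposal) - log_target(order):
            order = proposal
        samples.append(order.copy())
    return samples

# Toy target: favour orders close to the identity permutation
rng = np.random.default_rng(0)
ident = np.arange(5)
samples = metropolis_permutations(
    lambda o: -np.sum(np.abs(o - ident)), 5, 2_000, rng)
```

Real pseudotime applications replace the toy score with a model likelihood for the ordered cells, but the transposition-proposal mechanics are the same.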
  • ItemOpen Access
    New statistical perspectives on efficient Big Data algorithms for high-dimensional Bayesian regression and model selection
    (2019-04-27) Ahfock, Daniel Christian
    This thesis is focused on the development of computationally efficient procedures for regression modelling with datasets containing a large number of observations. Standard algorithms can be prohibitively computationally demanding on large $n$ datasets, and we propose and analyse new computational methods for model fitting and selection. We explore three different generic strategies for tall datasets. Divide and conquer approaches split the full dataset into subsets, with the subsets then being analysed independently in parallel. The subset results are then pooled into an overall consensus. Subsampling based methods repeatedly use minibatches of data to estimate quantities of interest. The third strategy is ‘sketching’, a probabilistic data compression technique developed in the computer science community. Sketching uses random projection to compress the original large dataset, producing a smaller surrogate dataset that is less computationally demanding to work with. The sketched dataset can be used for approximate inference. We test our regression algorithms on a number of large $n$ genetic datasets, aiming to find associations between genetic variants and red blood cell traits. Bayesian divide and conquer and subsampling methods have been studied in the fixed model setting but little attention has been given to model selection. An important task in Bayesian model selection is computation of the integrated likelihood. We propose divide and conquer and subsampling algorithms for estimating the integrated likelihood. The divide and conquer approach is based on data augmentation, which is particularly useful for logistic regression. The subsampling approach involves constructing upper and lower bounds on the integrated likelihood. Lower bounds can be formed using variational Bayes techniques and we show how subsampling can be used to estimate an upper bound on the integrated likelihood.
Sketching algorithms generate a compressed set of responses and predictors that can then be used to estimate regression coefficients. Sketching algorithms use random projections to compress the original dataset and this stochastic generation process makes them amenable to statistical analysis. We examine the statistical properties of sketching algorithms, which allows us to quantify the error in the coefficients estimated using the sketched dataset. The proportion of variance explained by the model proves to be an important quantity in choosing between alternative sketching algorithms. This is particularly relevant to genetic studies, where the signal to noise ratio can be low. We also investigate sketching as a tool for posterior approximation. The sketched dataset can be used to generate an approximate posterior distribution over models. As expected, the quality of the posterior approximation increases with the number of observations in the sketched dataset. The trade-off is that the computational cost of sketching increases with the size of the desired sketched dataset. The main conclusion is that impractically large sketch sizes are needed to obtain a tolerable approximation of the posterior distribution over models. We test the performance of sketching for posterior approximation on a large genetic dataset. A key finding is that false positives are a major issue when performing model selection. Practical regression analysis with large $n$ datasets can require specialised algorithms. Parallel processing, subsampling and random projection are all useful tools for computationally efficient regression modelling.
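The sketch-and-solve strategy for least squares can be illustrated with a short Gaussian-projection example (a generic sketch under stated assumptions, not the thesis implementation; the function name and toy data are invented):

```python
import numpy as np

def gaussian_sketch_ols(X, y, m, rng):
    """Least squares on a Gaussian random projection of (X, y).

    Projects the n-row dataset down to m << n sketched rows,
    then solves ordinary least squares on the compressed data.
    """
    n = X.shape[0]
    # Entries N(0, 1/m) so that E[S'S] = I and inner products are preserved
    S = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))
    Xs, ys = S @ X, S @ y
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta

rng = np.random.default_rng(1)
n, p = 10_000, 5
X = rng.normal(size=(n, p))
beta_true = np.arange(1.0, p + 1.0)
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Solve on 1,000 sketched rows instead of 10,000 originals
beta_sketch = gaussian_sketch_ols(X, y, m=1_000, rng=rng)
```

The quality of the approximation degrades as the residual variance grows relative to the signal, which is one way to see why the proportion of variance explained matters when choosing a sketch.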