
Beyond Parameter Estimation: Analysis of the Case-Cohort Design in Cox Models





Connolly, Susan Elizabeth


Cohort studies allow for powerful analysis, but an exposure may be too expensive to measure in the whole cohort. The case-cohort design measures covariates in a random sample (the subcohort) of the full cohort and in all cases that emerge, whether or not they were sampled into the subcohort. It is an increasingly popular method, particularly in medical and biological research, because of its efficiency and flexibility. However, the case-cohort design poses a number of challenges for estimation and post-estimation procedures. Cases are over-represented in the dataset, so estimation of coefficients in this design requires weighting of observations. The weighting yields a pseudo-partial likelihood rather than a true partial likelihood, and standard post-estimation methods may not transfer readily to the case-cohort design.
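
The sampling scheme described above can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function name `draw_case_cohort`, the field names `id`, `is_case` and `in_subcohort`, and the toy cohort are all hypothetical, and the subcohort is assumed to be a simple random sample drawn without regard to case status.

```python
import random

def draw_case_cohort(subjects, sampling_fraction, seed=0):
    """Return the subjects whose covariates would be measured.

    subjects: list of dicts with hypothetical keys 'id' and 'is_case'.
    The subcohort is a simple random sample of the full cohort; every
    case is added whether or not it was sampled into the subcohort.
    """
    rng = random.Random(seed)
    n_subcohort = round(sampling_fraction * len(subjects))
    subcohort_ids = set(rng.sample([s["id"] for s in subjects], n_subcohort))
    sample = []
    for s in subjects:
        in_subcohort = s["id"] in subcohort_ids
        if in_subcohort or s["is_case"]:
            sample.append({**s, "in_subcohort": in_subcohort})
    return sample

# Toy cohort (illustrative only): 1000 subjects, 5% of whom become cases.
cohort = [{"id": i, "is_case": i % 20 == 0} for i in range(1000)]
case_cohort = draw_case_cohort(cohort, sampling_fraction=0.05)

cases_in_sample = sum(s["is_case"] for s in case_cohort)
print("case-cohort sample size:", len(case_cohort))
print("case proportion in full cohort:", 50 / 1000)
print("case proportion in sample:", cases_in_sample / len(case_cohort))
```

Comparing the two printed proportions shows the over-representation of cases that makes weighting necessary.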

This thesis presents theory and simulation studies for the application of estimation and post-estimation methods in the case-cohort design. In most of the existing literature on methods for the case-cohort design, simulation studies consider full cohort sizes, sampling fractions, and case percentages that are dissimilar to those seen in practice. The simulation studies in this thesis are designed to reflect the circumstances encountered when case-cohort designs are used in practice. The methods are also applied to the InterAct dataset, and practical advice and sample code for Stata are presented.

Estimation of Coefficients & Cumulative Baseline Hazard: For estimation of coefficients, Prentice weighting and Barlow weighting are the most commonly used (Sharp et al., 2014). Inverse Probability Weighting (IPW), in this context, refers to methods where the entire case-cohort sample at risk is used in the analysis, as opposed to the Prentice and Barlow weighting systems, where cases outside the subcohort sample are only included in risk sets just prior to their time of failure. This thesis assesses the bias and precision of the Prentice, Barlow and IPW methods in the case-cohort design. Simulation studies show that IPW, Prentice and Barlow weighting have similarly low bias. Where the case percentage is high, IPW shows an increase in precision over Prentice and Barlow weighting, though this improvement is small.
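
To make the contrast between the three schemes concrete, the following Python sketch returns the weight a subject would receive in a single risk set. It follows the usual textbook descriptions of Prentice and Barlow weighting rather than the thesis's own code, and the IPW branch is one common variant (cases weighted 1, subcohort non-cases weighted by the inverse of their sampling fraction) that is assumed here, not confirmed by the text above. The function name `risk_set_weight` and all parameter names are hypothetical.

```python
def risk_set_weight(scheme, in_subcohort, is_case, at_own_failure,
                    alpha, alpha_noncase=None):
    """Weight of one subject in one risk set; 0.0 means 'not in the risk set'.

    alpha:         subcohort sampling fraction (assumed known).
    alpha_noncase: sampling fraction among non-cases, used only by the
                   assumed IPW variant; defaults to alpha.
    """
    if scheme in ("prentice", "barlow"):
        if not in_subcohort:
            # Cases outside the subcohort enter only their own failure's risk set.
            return 1.0 if (is_case and at_own_failure) else 0.0
        if scheme == "prentice":
            return 1.0  # subcohort members are unweighted
        # Barlow: the failing case counts once; other subcohort members
        # at risk are up-weighted by the inverse sampling fraction.
        return 1.0 if (is_case and at_own_failure) else 1.0 / alpha
    if scheme == "ipw":
        if is_case:
            return 1.0  # all cases are sampled, at risk from entry
        if in_subcohort:
            return 1.0 / (alpha_noncase if alpha_noncase is not None else alpha)
        return 0.0  # non-subcohort non-cases are not observed
    raise ValueError(f"unknown scheme: {scheme}")

# A non-subcohort case, in a risk set before its own failure time:
for scheme in ("prentice", "barlow", "ipw"):
    print(scheme, risk_set_weight(scheme, in_subcohort=False, is_case=True,
                                  at_own_failure=False, alpha=0.25))
```

The loop highlights the defining difference: under Prentice and Barlow weighting a non-subcohort case contributes nothing before its own failure, whereas under the assumed IPW variant it is in the risk set throughout its time at risk.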

Checks of Model Assumptions: The appropriateness of a covariate's functional form in the standard Cox model can be assessed graphically by plotting smoothed martingale residuals against other quantities, such as time and covariates of interest (Therneau et al., 1990). The over-representation of cases in the case-cohort data, as compared to the full cohort, distorts the properties of such residuals. Methods related to IPW that adapt such plots to the case-cohort design are presented. Three approaches to detecting non-proportional hazards are assessed and compared by simulation studies: Schoenfeld residuals, scaled Schoenfeld residuals, and inclusion of time-varying covariates in the model. Where risk set sizes are not overly variable, all three methods are appropriate for use in the case-cohort design, with similar power. Where case-cohort risk set sizes are more variable, methods based on Schoenfeld residuals and scaled Schoenfeld residuals show a high Type 1 error rate.

Model Comparison & Variable Selection: The methods of Lumley & Scott (2013, 2015) for modification of the Likelihood Ratio test (dLR), AIC (dAIC) and BIC (dBIC) in complex survey sampling are applied to case-cohort data and assessed in simulation studies. In the absence of sparse data, dLR is found to have similar power to robust Wald tests, with a Type 1 error rate of approximately 5%. In the presence of sparse data, dLR is superior to robust Wald tests. In the absence of sparse data, dBIC shows little difference from the naive use of the pseudo-log-likelihood in the standard BIC formula (pBIC). In the presence of sparse data, dBIC shows reduced power to select the true model, and pBIC is superior. dAIC shows improved power to select the true model over naive methods. Where the subcohort size and number of cases are not overly small, the loss of power relative to the full cohort for dAIC, dBIC and pBIC is not substantial.
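
As a rough illustration of how a design-adjusted AIC differs from the naive criterion, the sketch below assumes the penalty takes the form 2·tr(Δ), where Δ is the product of the naive (model-based) information matrix and the robust covariance matrix of the coefficient estimates. This form, the assumption that the weights are scaled so the two matrices are comparable, and the function names are all assumptions made for illustration; they are not taken from the thesis, and the numbers are invented.

```python
def trace_of_product(a, b):
    """Trace of the matrix product a @ b for square lists-of-lists."""
    n = len(a)
    return sum(a[i][k] * b[k][i] for i in range(n) for k in range(n))

def naive_aic(pseudo_loglik, n_params):
    """Standard AIC formula applied directly to the pseudo-log-likelihood."""
    return -2.0 * pseudo_loglik + 2.0 * n_params

def design_adjusted_aic(pseudo_loglik, naive_information, robust_covariance):
    """Assumed form: the penalty 2p is replaced by 2 * tr(information @ covariance)."""
    penalty = 2.0 * trace_of_product(naive_information, robust_covariance)
    return -2.0 * pseudo_loglik + penalty

# Invented two-parameter example.
information = [[4.0, 0.0], [0.0, 2.0]]
model_based_cov = [[0.25, 0.0], [0.0, 0.5]]   # inverse of the information
inflated_cov = [[0.5, 0.0], [0.0, 1.0]]       # robust variance twice as large

print(naive_aic(-100.0, 2))
print(design_adjusted_aic(-100.0, information, model_based_cov))
print(design_adjusted_aic(-100.0, information, inflated_cov))
```

When the robust covariance equals the inverse of the naive information, the trace equals the number of parameters and the adjusted criterion coincides with the naive one; a larger robust variance increases the penalty.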





White, Ian


case-cohort, complex survey sampling, cohort study, Cox model, proportional hazards, EPIC, InterAct, EPIC-InterAct


Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge
The EPIC-InterAct study received funding from the European Union (Integrated Project LSHM-CT-2006-037197 in the Framework Programme 6 of the European Community). I thank all participants and staff for their contribution to this study. I thank the EPIC-InterAct PI, management team and wider consortium for their permission to use the data, and Nicola Kerrison (MRC Epidemiology Unit, University of Cambridge) for preparing the dataset which I used in Chapters 2 and 7. I acknowledge personal financial support from the UK Medical Research Council and St John's College, Cambridge.