Effect of Sociodemographic Characteristics on Longitudinal Progression of Multimorbidity: A Multistate Modelling Analysis of a Large Primary Care Records Dataset

Background: Multimorbidity presents a growing public health challenge. While existing research has mainly focused on cross-sectional patterns of multimorbidity, there remains a need to better understand the longitudinal accumulation of diseases and the effects of important sociodemographic characteristics on the accumulation and progression of chronic conditions. Methods and findings: We use a spline-based parametric multistate modelling (MSM) framework for investigating the role of several sociodemographic characteristics such as ethnicity, deprivation, gender, and age on the rate of disease transitions along different pathways to multimorbidity accumulation. The model is highly flexible and permits a parallel estimation framework that scales well to very large datasets. Focusing on cardiometabolic, renal conditions and mental conditions, we analysed a dataset of about 13 million English electronic primary care records. Our results revealed complex interactions between temporal trajectories of multimorbidity and the sociodemographic factors, with effects that varied depending on the comorbidity status of the subject. These factors tended to have a larger impact on initial multimorbidity development while a smaller impact on later stages. Mental health as a comorbidity often strengthened the adverse effects of sociodemographic characteristics on disease accumulation. Conclusion: Our findings underscore the potential importance of earlier screening and intervention of chronic conditions, particularly mental health conditions, in higher-risk populations. This may have important implications for the management of multimorbidity.


Background
Multimorbidity or multiple long-term conditions is defined as the co-existence of two or more chronic conditions in an individual [1].Multimorbidity increases with age and because of the complexity of managing those with multiple health conditions, it is of increasing importance in countries with ageing populations [2].Existing research indicates that multimorbidity is associated with poorer quality of life, worsened health outcomes, increased healthcare use and higher mortality [3,4].Multimorbidity develops sequentially, as individuals accumulate diagnoses, but to date, research has mainly described crosssectional patterns of multimorbidity, primarily focusing on identifying multimorbidity clusters [5][6][7].We need to understand how diseases accumulate over time and the effects of potentially important sociodemographic characteristics on the accumulation and progression of diagnoses.This may help identify those at high risk of developing further diseases.A recent systematic scoping review summarised research on longitudinal modelling and analysis of multimorbidity patterns, see [8].One of these analyses investigated the sequence of occurrence of diseases using a visualisation approach, stratified by deprivation status, age and ethnicity [9].They reported that the more socially deprived had a greater risk for multimorbidity and that ethnicity was associated with the development of multimorbidity.
The multistate model (MSM) extends the traditional survival model to encompass more than two states, providing a robust framework for modelling longitudinal event history data and making predictions.MSMs have been effectively used in studies of multimorbidity, where states represent the individual's health status.In recent studies, MSMs have been applied to examine the progression of chronic conditions [10] and the impact of lifestyle factors on the accumulation of diseases [11].In [12] the authors evaluated the effect of the order of disease acquisition on patient mortality using a 17-state MSM, while in [11] a 5-state MSM is used to study the impact of lifestyle factors on cancer and cardiometabolic diseases.MSMs have also been used to analyse socioeconomic inequalities in life expectancy with and without multimorbidity [13], the effects of sociodemographic and clinical factors on different stages of multimorbidity development [14], and the potentially differing impacts of lifestyle risk factors on cardiometabolic multimorbidity [15].MSMs have also been used to examine the role of midlife factors on different cardiometabolic disease trajectories [16].
Most of the existing MSM models of multimorbidity have included only small numbers of conditions (e.g.three) or represent multimorbidity status as counts of conditions rather than combinations of specific conditions.The resulting MSMs thus permit a relatively simple structure and are specified either using simple parametric models such as the Weibull model, which has limited flexibility, or Cox semi-parametric models, which are incapable of absolute risk quantifications or predictions over time.While some risk factors such as lifestyle factors have been considered, detailed studies on the effects of sociodemographic characteristics on the risk of disease-specific multimorbidity progression remain scarce.In addition, previous analyses are mostly based on cohorts of relatively small sample sizes, which may limit the power of their findings relative to the number of states of the models and subgroups.
Here we propose a flexible parametric multistate modelling framework to investigate the role of several sociodemographic risk factors on the temporal progression of 5 chronic conditions and 33 states, based on a dataset of approximately 13 million English electronic primary care records.We have focused on cardiometabolic, renal conditions and mental conditions because they are common, important conditions and are known to frequently co-occur, especially in socio-economically deprived areas [17].The considered conditions are cardiovascular disease (CVD), type 2 diabetes (T2D), chronic kidney disease (CKD), mental health problems (MH) and heart failure (HF).Our MSM extends previous works by allowing for more comorbidity states and disease transition types, providing a more detailed characterization of the pattern of progression of these five diseases.

Data sources
The study investigated the temporal sequence of diagnosed chronic conditions in a large anonymised dataset of electronic primary care records, the Clinical Practice Research Datalink (CPRD) [18].The CPRD Aurum database contains routinely-collected primary care electronic health records from 19 million patients across the 738 English general practices that use EMIS Web® software [18].It is representative of the English population in terms of age, gender, regional geographical spread and deprivation.It contains information on individual patient demographics.Clinical observations, diagnoses and treatments are recorded as Read Version 2, SNOMED-CT, and EMIS Web® clinical codes.The full data resource profile has been described elsewhere [18].

Conditions
Five chronic conditions were considered in this study, namely: 1) cardiovascular disease (CVD), including ischaemic heart disease (IHD), peripheral arterial disease (PAD) and stroke or transient ischaemic attack (CVA); 2) type-2 diabetes (T2D); 3) chronic kidney disease (CKD); 4) mental health problems (MH), including depression or anxiety; and 5) heart failure (HF).The presence of a chronic condition was determined by the occurrence of one of the relevant clinical codes (code lists are provided in the supplementary material) in the records, and the date of diagnosis of a condition was taken as the date of the first occurrence of any associated clinical codes.An absence of the code was considered to indicate an absence of the condition.

Sociodemographic characteristics
Sociodemographic characteristics include ethnicity, social and material deprivation, age (calculated from the year of birth) and sex (male/female).Ethnicity was initially classified into five different groups, including White, South Asian (South Asian or British Asian), Black (Black African, Black Carribean or Black British), Mixed and Others (ethnicity codings are provided in the supplementary material) [19].In our multistate analysis described below, Mixed was merged into Others due to their infrequency in many disease transitions.Social and material deprivation was quantified by the English Index of Multiple Deprivation (IMD), a widely used area-based composite measure of individual-level deprivation [20].IMD is derived from income, employment, education and living environment and in the CPRD Aurum data, each subject was assigned an IMD decile from 1 (least deprived) to 10 (most deprived).For our analysis, we used IMD quintiles.

Cohort description
Participants were eligible for inclusion in the study if they were registered in a general practice contributing to CPRD between the study start date and the study end date.The study start date was defined as the latest of date the subject turned 18, 01/01/2005, or six months after the patient registration start date.The study end date was defined as the earliest of the patient registration end date, date of death, or the last collection date for practice (11/05/2020).The minimum period of registration ensures time for patient characteristics to be recorded and prior diagnoses to be transferred/recorded.Patients were followed up continuously during the study period.For conditions where code lists were combined the earliest valid code in any code list was used as the date of incidence, and long-term conditions recorded before the study start date were treated as prevalent.For our analysis dataset, we excluded patients with gender classified as indeterminate, those aged >90 years at the study start date and those with missing data on ethnicity or IMD (the vast majority is due to missing information on ethnicity), see supplemental Figure A for more details on sample selection and supplemental Figure C for results of a sensitivity analysis.The resulting cohort consists of a total of 13.48 million individuals.

Statistical method
To examine the effects of ethnicity, deprivation, age and sex on the rate of different pathways of multimorbidity accumulation, we consider a MSM with a total of 33 states and 112 allowable direct transitions (see Figure 1 for an illustration).The states incorporate all possible combinations of the five conditions a person may have during the process of disease progression and the permitted instantaneous state-to-state transitions were indicated by a direct solid line with an arrow indicating the direction of the transition.The death state is the terminal state and can be directly reached from any other state.Here we made the simplifying assumption that no more than one event per person (diagnosis or death) can happen on the same day (so the number of transitions to be estimated is greatly reduced).It is extremely rare for two or more events to take place on a single day and even the most common transitions on the same day were no more than about 5000 among the 13.47 million subjects (i.e. less than 0.04%) in the cohort.Disease transitions recorded on the same day are also likely to be artificially increased by same-day diagnostic tests or recording of diagnoses, so the accuracy of the timing of these recorded transitions is questionable.To flexibly specify the MSM, we followed the suggestions of [21] to use a separate time-toevent model for each transition, and we used a clock-reset time scale (i.e. the time scale resets to zero after every transition) so that the outcome of interest is the time from one disease diagnosis to the next diagnosis or death.More specifically, we modelled the logarithm of the cumulative transition-specific intensities as a spline function of log time [22], from which age, sex, ethnicity and deprivation adjusted hazard ratios (HR) were estimated, assuming proportional hazards and no interactions between these predictors.For each sociodemographic characteristic, we report hazard ratios (and the associated 95% confidence intervals) adjusted for the other sociodemographic characteristics; e.g.hazard ratios for the ethnic group are adjusted for age, sex and IMD.We were able to conduct inference for such a complex MSM by taking advantage of the factorization property of the model likelihood and parallel computing; see the supplementary file for additional technical details.Model fitting for different transitions is performed in parallel via maximum likelihood estimation using the flexsurv package [23] in the R software environment.
Figure 1: The multistate diagram.The comorbidity states are represented in black boxes where we use red, yellow, green, blue and purple colours to distinguish between cardiovascular disease (CV), type-2 diabetes (T2), chronic kidney disease (CK), mental health (MH) and heart failure (HF), respectively.The transitions are represented using coloured arrows, with colours indicating the disease that is diagnosed when the transition takes place.

Summary statistics
The cohort baseline characteristics are shown in Table 1.At study entry, 78.5% (n=10,577,667) of the eligible participants had no diagnosis of any of the five conditions of interest.By the end of the follow-up, 83.7% (n=8,856,098) of these had not developed any of the five conditions, 13.8% (n=1,462,966) had developed one, 2% (n=209,237) had developed two, 0.4% (n=42504) had developed three, 0.06% (n=6399) four and 0.0044% (n=463) five of the conditions.Five percent of the study population (n=670,753) died during the study period.Detailed summary statistics for each comorbidity state in our MSM are presented in Supplementary Table A. A summary of the incidence rate for each disease transition in our MSM is provided in Supplementary Table B. The incidence rates of transitions to the first morbidity were notably lower compared to the other transitions, and generally, the incidence rates for transitions increased with the number of comorbidities.It should be noted that the relatively lower incidence rates for transitions from MH compared to other disease transitions could be due to the fact that subjects in the MH state were generally younger (this age effect is adjusted in our MSM analysis).

Effect of ethnicity on the rate of disease progression
There was no consistent association pattern between ethnicity and disease accumulation as this was also modified by comorbidity status (Figure 2).For CVD diagnosis among different comorbidity groups (panel (a)), the hazard ratios tended to be higher for South Asian but lower for black compared to white ethnicity.For T2D (panel (b)), hazard ratios for South Asian, black and other ethnicities compared to White were higher than 1 for most comorbidity groups, but the differences between each of the non-white and white groups decreased with increasing numbers of comorbidities.Generally, differences were less pronounced between the four ethnic groups for CKD (panel (c)), HF (panel (d)) and MH (panel (e)) diagnosis among groups with different comorbidities.For a few transitions, people from the three non-white groups had a slightly reduced rate of disease progression compared to the white ethnicity.For MH, the hazard ratios for black compared to white were mostly lower than 1.

Effect of deprivation on the rate of disease progression
For the diagnosis of CVD, T2D, CKD, HF or MH, hazard ratios were greater for the more deprived groups (IMD>1) compared with the least deprived group (IMD=1), with a higher level of deprivation associated with a higher rate of disease transition (Figure 3).The strength of association with deprivation was stronger for T2D (HR between 1 to 2) and HF (HR between 1 to 1.7), while it was weaker for CKD, MH, and CVD.The effect of greater deprivation on all five conditions tended to decrease with increasing numbers of comorbidities.

Effect of gender on the rate of disease progression
Adjusted hazard ratios for men (versus women) were statistically significantly higher for CVD, T2D and HF diagnosis but lower for CKD and MH diagnosis across different comorbidity groups (Figure 4).The difference between men and women for CVD and T2D decreased with increasing numbers of comorbidities (HR for men compared to women was still statistically significantly higher than 1 for most transitions).In contrast, for CKD and HF diagnosis (except for transitions from baseline or MH), there was little difference between men and women regardless of the number of comorbidities.The lower hazards of MH diagnosis for men were slightly reduced with increasing number of comorbidities.

Effect of age on the rate of disease progression
A 10-year increase in age was associated with an elevated rate of CVD (HR between 1 to 2), CKD (HR between 1.3 to 2.5) or HF (HR between 1.2 to 2.4) diagnosis (Figure 5).Adjusted hazard ratios for a 10-year increase in age were greatest for those with a single condition, particularly MH, and approached the null for those with increasing numbers of comorbidities.By contrast, adjusted hazard ratios for age were generally statistically significantly lower than 1 for T2D or MH diagnosis, with little difference across the number of comorbidities.

Comparing the effects of the sociodemographic characteristics
For CVD, CKD, MH and HF diagnosis, a 10-year increase in age, higher deprivation levels, and the sex of the individual (male for CVD and HF, female for CKD and MH) tended to be more strongly associated with an increased rate of diagnosis of a condition compared to ethnic group differences (White as the reference group).Among these four characteristics, the age effect was strongest for CKD and HF.For T2D, belonging to the South Asian ethnic group was associated with the greatest increase in the rate of the diagnosis.

Mental health
The impact of the sociodemographic characteristics on the rate of diagnosis of a condition was generally more significant (in the direction of increasing the transition intensity) for individuals with current diagnoses of MH compared to those with other comorbidities.For age and deprivation (particularly age), this pattern is observed for transitions to all four conditions except MH.Even the slightly protective effect of higher age on transitions to T2D is reversed in the presence of a comorbidity of MH.The adverse effect of being male on transitions to CVD, T2D, or HF was amplified with a MH comorbidity.For ethnicity, the pattern is only observed for transitions to T2D, where the hazard ratio for South Asian ethnicity compared to White was higher for those with a MH diagnosis.

Sensitivity analysis
We conducted a sensitivity analysis to examine the impact of data exclusion on the inferences (here we focused on missing ethnicity as it accounts for the vast majority of missing data).More specifically, we expanded our cohort by including subjects with missing ethnicity, who were assigned to an artificial ethnic group named "missing".We then re-fitted the MSM in the same manner as before.The results, including estimated HRs and associated confidence intervals for each transition, are presented in supplemental Figure C. Upon comparison with Figures 2-5 in the main paper, we observed that there were no significant differences and the key patterns remained unchanged.

Main findings
We analysed the progression of multimorbidity, defined as the combination of five chronic conditions, including CVD, T2D, CKD, MH and HF, in a cohort of more than 13 million English adults.To our knowledge, this is the largest cohort and the highest number of chronic conditions and disease transitions analysed using a flexible multistate modelling framework.The proposed MSM-based approach provides a detailed picture of the effects of sociodemographic characteristics, including ethnicity, deprivation, gender and age, on the rate of disease transitions along different pathways to multimorbidity accumulation.
Our study revealed complex interactions between trajectories of multimorbidity progression and ethnicity, deprivation, gender and age, with effects that varied depending on mental and physical health conditions.Generally, the effect of these characteristics was greater for those with fewer morbidities than those with many comorbidities.This suggests sociodemographic characteristics have a larger impact on initial multimorbidity development while a smaller impact on those with many long-term conditions.Notably, MH, as a co-morbidity, was typically an "outlier" as often exacerbated the effects of sociodemographic characteristics on the transition to the next disease.We also found that the relative magnitude of the effects of these sociodemographic characteristics on the rate of diagnosis of a new condition also depended greatly on the comorbidity status of the subject.

Ethnicity
Compared to the White UK population, a higher risk of diabetes in those of South Asian and Black ethnicity has been observed in other epidemiological research [24].This research also found higher risk and lower risks of CVD in those of South Asian and Black ethnicity respectively.CKD is underdiagnosed in UK primary care, therefore diagnosis is likely to be influenced by the frequency of diagnostic testing [25].Rates of testing are strongly associated with the presence of chronic conditions but less strongly with ethnicity [26].
In the past, the estimation of CKD prevalence in Black patients and women may have been overestimated due to adjustments for Black ethnicity and female sex in the calculation of the estimated glomerular filtration rate (eGFR) [27].Epidemiological research has suggested that depression (the most common mental health condition) is more common in older UK adults of Black and South Asian ethnicity than those of White ethnicity [28].Our finding of slightly lower rates of mental health problems for those in the Black and South Asian ethnic groups may suggest underascertainment of mental health problems, particularly for these groups.

Deprivation
Our results show that higher levels of deprivation tended to be associated with an elevated risk of disease progression for all conditions and types of transitions.This is in keeping with a previous study, which showed that multimorbidity occurred 10-15 years earlier and physical-mental comorbidity was more than twice as prevalent in the most socioeconomically deprived areas of Scotland compared to the least deprived areas [17].We found that the effect of deprivation was less strong as the number of comorbidities increased, suggesting that deprivation may have a stronger influence on the initial trajectory of multimorbidity.

Sex
As seen in other studies, men tend to have an increased risk of CVD, T2D, and HF, but a lower risk of MH problems and CKD [29].Our results echo previous findings and additionally demonstrate that the effect of sex on disease progression is dependent on the underlying comorbidity status.For instance, the association between the male gender and the rate of transition to CVD or T2D was most significant for those with the least comorbidities.For MH, the greater the number of comorbidities, the weaker the gender difference in the rate of MH accumulation.For HF and CKD, the gender effect is less influenced by the level of multimorbidity.Note that our results may be affected by gender disparities in the presentation to healthcare, where women are more likely to present for screening and investigations [30].On the other hand, those with known health conditions are more likely to be invited for (and maybe more inclined to present for) further monitoring and screening for other health conditions.Thus this gender difference regarding clinical presentation is expected to be smaller with increasing comorbid conditions.

Age
As expected, age was associated with increased risk of CVD, CKD and HF.In addition, we found that age amplified the effect of MH on the likelihood of transition to each of these conditions.Increasing age had a much greater impact on the likelihood of transition to a new condition on those with a single condition than those with multiple conditions.Multimorbidity is increasingly prevalent with increasing age [17].

Mental health
In comparison to physical health conditions such as CVD, T2D, HF, and CKD, MH was a notable outlier in its effect on the rates of disease transition.Our results suggest that the adverse effect of sociodemographic factors on the rate of transitioning to the next condition (CVD, T2D, CKD or HF) tended to be enhanced for trajectories starting with MH compared to those starting with other conditions.The pattern may be partly explained from biological and behavioural points of view.Those with MH as a starting condition tend to fall into the younger age groups (see Supplemental Table A) and mental ill health is associated with unhealthy lifestyle habits such as smoking, alcohol or substance abuse etc [31], which are also known to be important risk factors for cardiometabolic diseases and contribute to the worsening of multimorbidity [32,11].On the other hand, this adverse impact of MH is attenuated with additional comorbidities, which could be influenced by the additional medical intervention/treatment received by the subject and relevant lifestyle modifications.

Strengths and limitations
This study benefits from a large dataset from primary health care which is broadly representative of the English population.The conditions considered in this study are included in an incentive scheme for the management of chronic conditions, which means they are likely to be more accurately coded [33].The use of a highly flexible multistate model allows for a detailed characterization of the effect of key sociodemographic characteristics on 112 different transitions with a very good overall fit for the vast majority of the transitions, demonstrating the utility and computational feasibility of the approach.
There are some limitations with the current study.One primary limitation is the possibility of underdiagnosis or ascertainment bias of the onset of chronic conditions.For instance, diseases like CKD and T2D are often asymptomatic in the early stages, and diagnosis may occur during routine health check/screening or following the diagnosis of a related condition (once diagnosed, general practices actively follow up patients for further disease monitoring).The uptake of such checkups is generally less than 50%, with lower figures for men and some ethnic groups [34].For CKD, renal function testing is lower in minority ethnic groups [26,35].For MH, certain groups (e.g.those in the Black or South Asian ethnic groups) are less likely to present to the GP even with symptoms developed [36].The implication is that the identified date of diagnosis or the transition time between diagnoses of the conditions in the electronic health record data may not reflect the corresponding true onset or transition times.Additionally, the ethnicity categories used in this study crudely group together communities with diverse cultural practices and sociodemographic characteristics, which may mask differences in outcomes within each category.It is also important to note that time-varying factors associated with the disease progression such as treatment information, lifestyle modifications, and change in deprivation status are not available from the present dataset and thus they were not taken into account in the present analysis.Finally, the median follow-up time for the study participants was about 5 years, with a maximum period of about 15 years, which limits our ability to observe complete trajectories for most subjects and gives rise to a very high right censoring rate for most transitions.A longer follow-up period would be desirable for obtaining better inference for transitions in later stages of multimorbidity.

Unanswered questions and future research
Clinical research implications of this analysis are to investigate and better understand the biopsychosocial factors driving progression to multimorbidity.The greatest differences in transition rates to multimorbidity between different groups are in the development of the first few conditions, which suggests the early stages in the pathway to multimorbidity may be particularly important.
The MSM framework used here could permit broader types of inference, such as the probability of transition between two comorbidity states over a given time interval (conditional on sociodemographic characteristics), which can be inferred by simulating trajectories conditional on the fitted model.We could also analyse the effect of disease transition history on subsequent disease progression by representing previous transitions as additional covariates in the model.There is also scope to study the association between concurrently measured longitudinal covariates (e.g.biomarkers) and multimorbidity progression using a joint modelling framework for both the longitudinal marker and the multistate processes [37].Further methodological research is required to overcome the computational challenge of fitting this model.

Conclusion
Multimorbidity is a growing public health challenge and there is a need of moving from a single diseaseoriented to a more individualised and holistic approach.In this study, we investigated the influence of key sociodemographic characteristics on the development and progression of multimorbidity, revealing distinctive association patterns for different levels of multimorbidity and disease transitions.Our results indicate that the effects of these factors tend to be weakened with increasing levels of multimorbidity.The difference of mental health as a comorbidity is also highlighted as it often strengthens the adverse effects of sociodemographic characteristics on disease accumulation.These findings underscore the need for earlier screening and intervention of mental health conditions, especially for those in the higher-risk populations, which may have important implications for the management of multimorbidity.

Supplemental Technical Details
The multistate modelling approach In this section we provide a brief introduction to multistate modelling (MSM), and we refer to [38] and references therein for more in-depth coverage of the topic.A multistate model is built from a mathematical object named a multistate process.A multistate process is a stochastic process {(),  ≥ 0} with a finite state space  = {1, … , } (here we focus on the setting where  is continuous and () is continuously observed).The law of a multistate process can be fully characterised via the transition intensities, which directly describe the instantaneous dynamic of the process where   − represents the history of the process over [0, ).There are two special classes of the multistate processes that have been thoroughly studied.The first is the Markov processes, which relies on the Markov assumption such that risks at a particular time are independent of the history, so that  ℎ (;   − ) =  ℎ ().
The second class is the so-called clock-reset semi-Markov processes, which relaxes the Markov property by assuming that where () is the time since entry to the current state h.Therefore,  ℎ depends on the history only through (), and in effect, the time scale after each transition is reset to zero.In the former scenario, the transition probabilities between two states can be derived from the intensities by solving the Kolmogorov differential equations, whereas for the latter case, no explicit formulation exists.In such scenarios, one could estimate transition probabilities from intensities via simulation-based techniques.
A multistate model can be fitted to data representing the evolution of  subjects (defined on the same state space) in a population, each of whom is followed up continuously for a potentially different length of time (may be subject to delayed entry or right censoring).To model the covariate-dependent transition intensities for subject ,  ℎ () (;   − ), Markov or semi-Markov assumptions are commonly adopted and semi-parametric or parametric proportional hazard regression framework is often used.If we in addition assume that each individual's trajectory,   (), is independent, then it can be shown using counting process theory that the MSM likelihood factorizes as where  is the total number of allowed transition types in the multistate process (e.g. if  ℎ ≠ 0 then ℎ →  represents one type of transition),   is the likelihood obtained by considering transition type  as a survival problem with data based on all subjects at risk for this transition and  is the parameter vector consisting of parameters from all transitions.An important consequence of this factorization property is that if no parameters are shared across transitions, i.e.  = ( 1 , … ,   ), then we could perform maximum likelihood estimation of the MSM by maximising each   independently with respect to its parameter set   (see e.g.[39]).
In our analysis we specify the MSM by using a separate time-to-event model for each permitted ℎ →  transition.To ensure flexibility, we use the spline-based Royston-Parmar model [22] for the transitionspecific intensities,  ℎ () (; ()),  ( ℎ () (; )) =  ℎ (  ;  ℎ ) +  ℎ    , where  refers to the time since entry to the state ℎ,  ℎ () is the cumulative transition intensity associated with  ℎ () ,  ℎ (  ;  ℎ ) is a natural cubic spline in   with coefficient vector  ℎ (see [22] for more details on the construction of the spline function),   denotes the vector of exogenous covariates for subject  and  ℎ is the vector of the associated linear effects.Note that the resulting time-to-event model retains a proportional hazard structure so that elements of  ℎ are still interpreted as log-hazard ratios.In our context, the are IMD, ethnicity, gender and age (at the time of the previous transition; in 10 years).The first three variables are categorical and are represented via dummy variables with reference groups set to female, IMD=1 (the least deprived group) and White (the dominant ethnic group), respectively.Age is approximated by a piecewise constant process with jumps occurring at the transition times (hence is treated as a time constant covariate in a transition model).A practical issue here is to determine the number and location of the internal knots for the spline term  ℎ , which would influence the flexibility of the resulting model.We follow the suggestions in [22] to use the Akaike information criterion (AIC) to determine the number of internal knots in the spline from the candidate values of {0, … ,5}.Note that when there are zero knots the model reduces to a Weibull model.Conditional on the knot number, the location of these knots is set to equally spaced percentiles of the logarithm of the uncensored transition times for ℎ →  (empirical study indicates that the knot position is not a key driver of model fit).

Model checking
We used the Cox-Snell residuals to graphically examine the goodness of fit of each of our transition models [40,41].For a fitted time-to-event model, the Cox-Snell residuals are defined as    =  ̂(  ), where  ̂ is the estimated cumulative hazard function for subject  under the fitted model and   is the observed event time.If the model fits the data well, we expect the    to be close to draws from a unit exponential distribution.This diagnostic is motivated by the fact that for any random variable  with survival function () = ( > ), the transformed random variable −  () is distributed as an exponential random variable with unit mean.To appropriately take censored event times into account, in practice we estimate the cumulative hazard of the residuals, i.e. −   ̂ (   ), where  ̂ is the Kaplan-Meier estimate of the survival function associated with    (for censored event times   the corresponding    are treated as censored as well).A good fit is suggested if a plot of −   ̂ (   ) against    is close to a straight line an intercept of zero and a slope of one.For our fitted MSM, we computed the Cox-Snell residuals for each transition, and we observed satisfactory results for most of the transitions, see Figure B for selective results.Here we observed a slight lack of fit for transitions MH->MH+T2 and MH->MH+CK.We found that residuals that significantly deviated from the reference line (in red) correspond to subjects who had a long transition time, but their sociodemographic characteristics suggest that the transition should have happened earlier (e.g.diagnosed MH at a high age and with high deprivation levels).The presence of such "outliers" may not be surprising given the complexity of the disease progression mechanisms and the intersubject heterogeneity in such a large cohort.On the other hand, it should be noted that these outliers only account for an extremely small proportion of the data involved in these 112 transitions.Out of the 112 transitions these two were among the most ill-fitting.

Figure 2 .
Figure 2. Effect of ethnicity on the rate of disease transition by comorbidity.Estimated hazard ratios are shown in black dots and the associated 95% confidence intervals are represented with coloured bands.White ethnicity is treated as the reference category.

Figure 3 .
Figure 3.Effect of deprivation on the rate of disease transition by comorbidity.Estimated hazard ratios are shown in black dots and the associated 95% confidence intervals are represented with coloured bands.Quintile 1 is treated as the reference category and represents the least deprived.

Figure 4 .
Figure 4. Effect of gender (male) on the rate of disease transition by comorbidity.Estimated hazard ratios are shown in red dots and the associated 95% confidence intervals are represented with black bands.The female gender is treated as the reference category.

Figure 5 .
Figure 5.Effect of age at entry to the current state (in 10 years) on the rate of next disease transition by comorbidity.Estimated hazard ratios are shown in red dots and the associated 95% confidence intervals are represented with black bands.

Supplemental
Figure B. Plots of the estimated cumulative hazard of the Cox-Snell residuals.Note: The red line is a reference line with an intercept of zero and a slope of one.Supplemental Figure C. Effect the four sociodemographic characteristics on the rate of disease transition by comorbidity.Results are obtained based on the cohort including those with missing ethnicity information.Graphical settings are the same as in Figures 2-5 of the main paper.

Table 1 .
Baseline sociodemographic characteristics of study participants Supplemental TableA.Summary of disease states by sociodemographic characteristics.