In simulated data and health records, latent class analysis was the optimum multimorbidity clustering algorithm.

Published version



Nichols, Linda 
Taverner, Tom 
Crowe, Francesca 
Richardson, Sylvia 
Yau, Christopher 


BACKGROUND AND OBJECTIVES: To investigate the reproducibility and validity of latent class analysis (LCA), hierarchical cluster analysis (HCA), multiple correspondence analysis followed by k-means (MCA-kmeans), and k-means (kmeans) for multimorbidity clustering. METHODS: We first investigated the clustering algorithms in simulated datasets with 26 diseases of varying prevalence in predetermined clusters, comparing the derived clusters to the known clusters using the adjusted Rand Index (aRI). We then investigated the medical records of male patients, aged 65 to 84 years, from 50 UK general practices, with 49 long-term health conditions. We compared within-cluster morbidity profiles using the Pearson correlation coefficient and assessed cluster stability using 400 bootstrap samples. RESULTS: In the simulated datasets, the closest agreement (largest aRI) to the known clusters was with LCA, followed by MCA-kmeans. In the medical records dataset, all four algorithms identified one cluster comprising 20-25% of the dataset, with about 82% of the same patients across all four algorithms. LCA and MCA-kmeans both found a second cluster comprising 7% of the dataset. Other clusters were found by only one algorithm. LCA and MCA-kmeans clustering gave the most similar partitioning (aRI 0.54). CONCLUSION: LCA achieved higher aRI than the other clustering algorithms.
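The abstract compares derived clusters to known clusters using the adjusted Rand Index. As an illustration only, the following is a minimal sketch of how that index can be computed from two label vectors using the standard pair-counting formula. The function name `adjusted_rand_index` and the toy label vectors are assumptions made for this example; they are not taken from the study, and the authors' actual implementation is not described here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same items.

    Illustrative helper (hypothetical name), using the pair-counting formula.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)
    if n < 2:
        # With fewer than two items there are no pairs to compare.
        return 1.0

    # Contingency counts: items in cluster i of A and cluster j of B.
    pair_counts = Counter(zip(labels_a, labels_b))
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)

    # Number of item pairs placed together by both, by A, and by B.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in a_counts.values())
    sum_b = sum(comb(c, 2) for c in b_counts.values())

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions use a single cluster.
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)


# Toy example (made-up labels, not study data).
known = [0, 0, 0, 1, 1, 1]
relabelled = [1, 1, 1, 0, 0, 0]   # same partition, different label names
partial = [0, 0, 1, 1, 2, 2]      # only partly agrees with `known`

print(adjusted_rand_index(known, relabelled))
print(adjusted_rand_index(known, partial))
```

Because the index depends only on which items are grouped together, renaming the cluster labels does not change it. This is what allows partitions from different algorithms to be compared directly.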



Clustering methods, Electronic medical records, Hierarchical cluster analysis, K-means, Latent class analysis, Multimorbidity, Multiple correspondence analysis, Humans, Male, Reproducibility of Results, Algorithms, Cluster Analysis

Journal Title

J Clin Epidemiol

Elsevier BV
MRC (MC_UU_00006/6)
Medical Research Council (MR/S027602/1)
Medical Research Council (MR/P021573/1)
This work is part of the Bringing Innovative Research Methods to Clustering Analysis of Multimorbidity (BIRM-CAM) project funded by the UKRI. SR, PK, JB are funded by the Medical Research Council as part of the Precision Medicine and Inference for Complex Outcomes theme of the MRC Biostatistics Unit. TM is supported by the National Institute for Health Research Applied Research Collaboration West Midlands (NIHR ARC WM). The views expressed in this publication are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care. Neither funder had any role in the design of the study, the collection, analysis and interpretation of data, or the writing of the manuscript. KN is funded by a Health Data Research UK Fellowship.