Statistical power for cluster analysis.
Authors
Dalmaijer, Edwin S
Nord, Camilla L
Astle, Duncan E
Publication Date
2022-05-31
Journal Title
BMC Bioinformatics
ISSN
1471-2105
Publisher
Springer Science and Business Media LLC
Volume
23
Issue
1
Language
en
Type
Article
This Version
VoR
Citation
Dalmaijer, E. S., Nord, C. L., & Astle, D. E. (2022). Statistical power for cluster analysis. BMC Bioinformatics, 23(1). https://doi.org/10.1186/s12859-022-04675-1
Description
Funder: AXA Research Fund; doi: http://dx.doi.org/10.13039/501100001961
Abstract
BACKGROUND: Cluster algorithms are gaining in popularity in biomedical research due to their compelling ability to identify discrete subgroups in data, and their increasing accessibility in mainstream software. While guidelines exist for algorithm selection and outcome evaluation, there are no firmly established ways of computing a priori statistical power for cluster analysis. Here, we estimated power and classification accuracy for common analysis pipelines through simulation. We systematically varied subgroup size, number, separation (effect size), and covariance structure. We then subjected generated datasets to dimensionality reduction approaches (none, multi-dimensional scaling, or uniform manifold approximation and projection) and cluster algorithms (k-means, agglomerative hierarchical clustering with Ward or average linkage and Euclidean or cosine distance, HDBSCAN). Finally, we directly compared the statistical power of discrete (k-means), "fuzzy" (c-means), and finite mixture modelling approaches (which include latent class analysis and latent profile analysis).
RESULTS: We found that clustering outcomes were driven by large effect sizes or the accumulation of many smaller effects across features, and were mostly unaffected by differences in covariance structure. Sufficient statistical power was achieved with relatively small samples (N = 20 per subgroup), provided cluster separation is large (Δ = 4). Finally, we demonstrated that fuzzy clustering can provide a more parsimonious and powerful alternative for identifying separable multivariate normal distributions, particularly those with slightly lower centroid separation (Δ = 3).
CONCLUSIONS: Traditional intuitions about statistical power only partially apply to cluster analysis: increasing the number of participants above a sufficient sample size did not improve power, but effect size was crucial. Notably, for the popular dimensionality reduction and clustering algorithms tested here, power was only satisfactory for relatively large effect sizes (clear separation between subgroups). Fuzzy clustering provided higher power in multivariate normal distributions. Overall, we recommend that researchers (1) only apply cluster analysis when large subgroup separation is expected, (2) aim for sample sizes of N = 20 to N = 30 per expected subgroup, (3) use multi-dimensional scaling to improve cluster separation, and (4) use fuzzy clustering or mixture modelling approaches that are more powerful and more parsimonious with partially overlapping multivariate normal distributions.
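The simulation idea summarised in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes two subgroups with unit-variance, uncorrelated features, treats Δ as the Euclidean distance between subgroup centroids in standard-deviation units, uses a plain two-cluster k-means, and counts a simulated dataset as a "detection" when its mean silhouette coefficient reaches 0.5. The threshold, the function names (`simulate_two_groups`, `two_means`, `silhouette`, `estimate`), and all default parameters are assumptions made for illustration.

```python
import math
import random


def simulate_two_groups(n_per_group, n_features, delta, rng):
    """Draw two unit-variance Gaussian subgroups whose centroids lie `delta` apart."""
    # Equal shift on every feature, so the Euclidean centroid distance equals delta.
    shift = delta / math.sqrt(n_features)
    points, labels = [], []
    for group in (0, 1):
        for _ in range(n_per_group):
            points.append([rng.gauss(group * shift, 1.0) for _ in range(n_features)])
            labels.append(group)
    return points, labels


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def two_means(points, rng, n_iter=50):
    """Lloyd's algorithm with k = 2 and two randomly chosen starting centroids."""
    centroids = rng.sample(points, 2)
    assign = []
    for _ in range(n_iter):
        new_assign = [0 if dist(p, centroids[0]) <= dist(p, centroids[1]) else 1
                      for p in points]
        if new_assign == assign:  # converged
            break
        assign = new_assign
        for k in (0, 1):
            members = [p for p, lab in zip(points, assign) if lab == k]
            if members:  # keep the old centroid if a cluster empties
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def silhouette(points, assign):
    """Mean silhouette coefficient for a two-cluster assignment."""
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if assign[j] == assign[i] and j != i]
        other = [dist(p, q) for q, lab in zip(points, assign) if lab != assign[i]]
        if not own or not other:  # singleton or empty cluster: score 0 by convention
            scores.append(0.0)
            continue
        within, between = sum(own) / len(own), sum(other) / len(other)
        scores.append((between - within) / max(within, between))
    return sum(scores) / len(scores)


def estimate(n_per_group, delta, n_features=2, n_sims=100, threshold=0.5, seed=1):
    """Return (power, accuracy): detection rate and mean classification accuracy."""
    rng = random.Random(seed)
    hits, accuracies = 0, []
    for _ in range(n_sims):
        points, labels = simulate_two_groups(n_per_group, n_features, delta, rng)
        assign = two_means(points, rng)
        agree = sum(a == b for a, b in zip(assign, labels)) / len(labels)
        accuracies.append(max(agree, 1 - agree))  # cluster labels are arbitrary
        hits += silhouette(points, assign) >= threshold  # assumed detection criterion
    return hits / n_sims, sum(accuracies) / n_sims


for delta in (1, 2, 3, 4):
    power, accuracy = estimate(n_per_group=20, delta=delta)
    print(f"delta={delta}: power={power:.2f}, accuracy={accuracy:.2f}")
```

Varying `n_per_group` while holding `delta` fixed, and vice versa, gives a rough feel for the abstract's point that separation matters more than sample size once a sufficient N is reached. The numbers this sketch prints depend on its simplifying assumptions and should not be read as reproducing the paper's results.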
Keywords
Cluster analysis, Covariance, Dimensionality reduction, Effect size, Latent class analysis, Latent profile analysis, Sample size, Simulation, Statistical power, Algorithms, Cluster Analysis, Humans, Normal Distribution, Sample Size, Software
Sponsorship
Medical Research Council (MC-A0606-5PQ41)
Templeton World Charity Foundation (TWCF0159)
Identifiers
s12859-022-04675-1, 4675
External DOI: https://doi.org/10.1186/s12859-022-04675-1
This record's URL: https://www.repository.cam.ac.uk/handle/1810/337753
Rights
Licence:
http://creativecommons.org/licenses/by/4.0/