Fast hierarchical Bayesian analysis of population structure.

Published version
Peer-reviewed

Type

Article

Authors

Lees, John A 
Bentley, Stephen D 
Frost, Simon DW 
Corander, Jukka 

Abstract

We present fastbaps, a fast solution to the genetic clustering problem. Fastbaps rapidly identifies an approximate fit to a Dirichlet process mixture model (DPM) for clustering multilocus genotype data. Our efficient model-based clustering approach can cluster datasets 10-100 times larger than existing model-based methods can handle, which we demonstrate by analyzing an alignment of over 110 000 sequences of HIV-1 pol genes. We also provide a method for rapidly partitioning an existing hierarchy to maximize the DPM marginal likelihood, allowing us to split phylogenetic trees into clades and subclades using a population genomic model. Extensive tests on simulated data as well as a diverse set of real bacterial and viral datasets show that fastbaps provides comparable or improved solutions to previous model-based methods, while being significantly faster. The method is made freely available under an open source MIT licence as an easy-to-use R package at https://github.com/gtonkinhill/fastbaps.
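
Illustrative sketch

The abstract describes choosing a partition of sequences that maximizes a marginal likelihood. The Python sketch below is not the fastbaps implementation and is not taken from the article. It is a simplified illustration that assumes a symmetric Dirichlet prior (the hypothetical parameter `alpha`) over the four nucleotides at each site, treats sites as independent, and ignores any prior over partitions. All function names and the toy alignment are made up for this example. It scores candidate partitions of a small alignment so that the idea of comparing partitions by marginal likelihood is concrete.

```python
from math import lgamma

ALPHABET = "ACGT"  # assumption: ungapped nucleotide data only


def log_marginal_cluster(seqs, alpha=1.0):
    """Log marginal likelihood of one cluster under a per-site
    Dirichlet-multinomial model with a symmetric Dirichlet(alpha) prior."""
    k = len(ALPHABET)
    total = 0.0
    for site in range(len(seqs[0])):
        # Count how often each nucleotide occurs at this site in the cluster.
        counts = [sum(1 for s in seqs if s[site] == a) for a in ALPHABET]
        n = sum(counts)
        total += lgamma(k * alpha) - lgamma(k * alpha + n)
        total += sum(lgamma(alpha + c) - lgamma(alpha) for c in counts)
    return total


def log_marginal_partition(seqs, labels, alpha=1.0):
    """Sum the cluster-level log marginal likelihoods over a partition."""
    clusters = {}
    for seq, label in zip(seqs, labels):
        clusters.setdefault(label, []).append(seq)
    return sum(log_marginal_cluster(m, alpha) for m in clusters.values())


# Toy alignment (hypothetical): two visibly distinct groups of three sequences.
seqs = [
    "AAAAAAAA", "AAAAAAAA", "AAAAAAAT",
    "CCCCCCCC", "CCCCCCCC", "CCCCCCCG",
]

one_cluster = log_marginal_partition(seqs, [0, 0, 0, 0, 0, 0])
two_clusters = log_marginal_partition(seqs, [0, 0, 0, 1, 1, 1])
singletons = log_marginal_partition(seqs, [0, 1, 2, 3, 4, 5])

print("one cluster: ", one_cluster)
print("two clusters:", two_clusters)
print("singletons:  ", singletons)
```

On this toy input the two-group partition scores higher than either lumping all sequences together or splitting every sequence into its own cluster. This is the kind of comparison a marginal-likelihood criterion makes when deciding where to cut a hierarchy.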

Keywords

Algorithms; Bacterial Proteins; Bayes Theorem; Cluster Analysis; Computational Biology; Databases, Protein; Human Immunodeficiency Virus Proteins; Models, Theoretical; Phylogeny; Reproducibility of Results

Journal Title

Nucleic Acids Res

Journal ISSN

0305-1048
1362-4962

Volume Title

47

Publisher

Oxford University Press (OUP)

Sponsorship

Wellcome Trust (204016/Z/16/Z)