A theoretical analysis of taxonomic binning accuracy.
Publication Date
2022-08
Journal Title
Mol Ecol Resour
ISSN
1755-098X
Publisher
Wiley
Language
en
Type
Article
This Version
AO
VoR
Citation
De Sanctis, B., Money, D., Pedersen, M. W., & Durbin, R. (2022). A theoretical analysis of taxonomic binning accuracy. Mol Ecol Resour. https://doi.org/10.1111/1755-0998.13608
Abstract
Many metagenomic and environmental DNA studies require the taxonomic assignment of individual reads or sequences by aligning reads to a reference database, known as taxonomic binning. When a read aligns to more than one reference sequence, it is often classified based on sequence similarity. This step can assign reads to incorrect taxa, at a rate which depends both on the assignment algorithm and on underlying population genetic and database parameters. In particular, as we move towards using environmental DNA to study eukaryotic taxa subject to regular recombination, we must take into account issues concerning gene tree discordance. Though accuracy is often compared across algorithms using a fixed data set, the relative impact of these population genetic and database parameters on accuracy has not yet been quantified. Here, we develop both a theoretical and simulation framework in the simplified case of two reference species, and compute binning accuracy over a wide range of parameters, including sequence length, species-query divergence time, divergence times of the reference species, reference database completeness, sample age and effective population size. We consider two assignment methods and contextualize our results using parameters from a recent ancient environmental DNA study (Current Biology, 31, 2021, 2728), comparing them to the commonly used discriminative k-mer-based method Clark (BMC Genomics, 16, 2015, 1). Our results quantify the degradation in assignment accuracy as the samples diverge from their closest reference sequence, and with incompleteness of reference sequences. We also provide a framework in which others can compute expected accuracy for their particular method or parameter set. Code is available at https://github.com/bdesanctis/binning-accuracy.
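As a rough illustration of the kind of calculation the abstract describes, the following minimal sketch simulates binning accuracy for a query species Q and two reference species A and B, with Q sister to A. It is not the authors' framework (that is in the linked repository); it assumes a standard three-taxon coalescent with equal, constant population sizes, infinite-sites mutations, complete reference sequences, no sample age, and assignment to the reference with fewer mismatches (ties broken at random). All function and parameter names (`p_concordant`, `simulate_accuracy`, `t_close`, `t_far`, `mut_per_read`) are hypothetical.

```python
import math
import random


def p_concordant(internal_branch):
    """Probability that the gene tree matches the species tree ((Q,A),B).

    internal_branch is the time between the two species splits, in
    coalescent units (standard three-taxon coalescent result).
    """
    return 1.0 - (2.0 / 3.0) * math.exp(-internal_branch)


def poisson(rng, lam):
    """Draw a Poisson variate with Knuth's method (fine for modest lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_accuracy(t_close, t_far, mut_per_read, n_reads=5000, seed=1):
    """Fraction of reads from Q assigned to its sister reference A.

    t_close: split time of Q and A; t_far: split time of (Q,A) and B,
    both in coalescent units. mut_per_read: expected mutations per read
    per coalescent unit on one lineage (mutation rate x read length).
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_reads):
        wait = rng.expovariate(1.0)
        if wait < t_far - t_close:
            # Q and A coalesce in their shared ancestral population.
            sisters, t_pair = "QA", t_close + wait
            t_root = t_far + rng.expovariate(1.0)
        else:
            # All three lineages reach the root population; the first
            # pair to coalesce is chosen uniformly at random.
            sisters = rng.choice(["QA", "QB", "AB"])
            t_pair = t_far + rng.expovariate(3.0)
            t_root = t_pair + rng.expovariate(1.0)
        # Infinite-sites mutation counts on each branch of the gene tree.
        m = {x: poisson(rng, mut_per_read * t_pair) for x in sisters}
        (outgroup,) = set("QAB") - set(sisters)
        m[outgroup] = poisson(rng, mut_per_read * t_root)
        m_internal = poisson(rng, mut_per_read * (t_root - t_pair))
        # Pairwise mismatches; non-sister pairs also cross the internal branch.
        d_a = m["Q"] + m["A"] + (0 if sisters == "QA" else m_internal)
        d_b = m["Q"] + m["B"] + (0 if sisters == "QB" else m_internal)
        if d_a < d_b or (d_a == d_b and rng.random() < 0.5):
            correct += 1
    return correct / n_reads
```

For example, `simulate_accuracy(0.1, 5.1, 5.0)` models a long internal branch with informative reads, while `simulate_accuracy(0.1, 0.2, 0.5)` models closely related references and short reads, where gene tree discordance and a lack of distinguishing mutations both pull accuracy towards 0.5.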
Keywords
RESOURCE ARTICLE, RESOURCE ARTICLES, coalescent theory, environmental DNA, metagenomics, taxonomic binning
Sponsorship
Wellcome Trust (UNS69906, WT220023)
Identifiers
men13608
External DOI: https://doi.org/10.1111/1755-0998.13608
This record's URL: https://www.repository.cam.ac.uk/handle/1810/336849
Rights
Licence:
http://creativecommons.org/licenses/by/4.0/