
A framework for automated scalable designation of viral pathogen lineages from genomic data.

Published version



de Bernardi Schneider, Adriano 
Roemer, Cornelius 
Wolfinger, Michael T


Pathogen lineage nomenclature systems are a key component of effective communication and collaboration for researchers and public health workers. Since February 2021, the Pango dynamic lineage nomenclature for SARS-CoV-2 has been sustained by crowdsourced lineage proposals as new isolates were sequenced. This approach is vulnerable to time-critical delays as well as regional and personal bias. Here we developed a simple heuristic approach for dividing phylogenetic trees into lineages, including the prioritization of key mutations or genes. Our implementation is efficient on extremely large phylogenetic trees consisting of millions of sequences and produces similar results to existing manually curated lineage designations when applied to SARS-CoV-2 and other viruses including chikungunya virus, Venezuelan equine encephalitis virus complex and Zika virus. This method offers a simple, automated and consistent approach to pathogen nomenclature that can assist researchers in developing and maintaining phylogeny-based classifications in the face of ever-increasing genomic datasets.


Acknowledgements: This study was funded in part by the Austrian Science Fund (FWF) project I 6440-N to M.T.W. This study was funded in part by CDC award BAA 200-2021-11554 to R.C.-D. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the paper. We gratefully acknowledge O. Pybus and S. Bell for contributing to early discussions on the topic. We also thank all the volunteers who have worked on finding new Pango lineages.


Animals; Horses; Phylogeny; Encephalitis Virus, Venezuelan Equine; Genomics; Base Sequence; Genome, Viral; SARS-CoV-2; Zika Virus; Zika Virus Infection

Journal Title

Nat Microbiol


Springer Science and Business Media LLC
U.S. Department of Health & Human Services | CDC | Center for Surveillance, Epidemiology, and Laboratory Services (CSELS) (BAA 200-2021-11554)
Austrian Science Fund (Fonds zur Förderung der Wissenschaftlichen Forschung) (I 6440-N)