Identifying and removing haplotypic duplication in primary genome assemblies

Published version
Peer-reviewed

Type

Article

Authors

Wood, Jonathan 
Wang, Yadong 

Abstract

Motivation: Rapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often causes assemblers to create two copies of a region rather than one, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either focus only on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors.

Results: Here we present a novel tool, “purge_dups”, that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with the current standard, purge_haplotigs, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can be easily integrated into assembly pipelines.

Availability: The source code is written in C and is available at https://github.com/dfguan/purge_dups.

Contact: ydwang@hit.edu.cn, rd109@cam.ac.uk

Journal Title

Bioinformatics

Publisher

Oxford University Press

Sponsorship

This work was supported by the National Key Research and Development Program of China [2017YFC0907503, 2018YFC0910504 and 2017YFC1201201 to D.G. and Y.W.]; China Scholarship Council to D.G.; Wellcome Trust [WT207492 to S.A.M. and R.D., and WT206194 to J.W. and K.H.].