Identification and Prioritisation of Variants in the Short Open-Reading Frame Regions of the Human Genome

Jackson, Felix
Prabakaran, Sudhakaran

As whole-genome sequencing technologies improve and accurate maps of the entire genome are assembled, short open-reading frames (sORFs) are garnering interest as functionally important regions that were previously overlooked. However, few tools are available for investigating variants in sORF regions of the genome. Here we investigate the performance of commonly used tools for variant calling and variant prioritisation in these regions, and present a framework for optimising these processes. First, the performance of four widely used germline variant calling algorithms is systematically compared. HaplotypeCaller is found to perform best across the whole genome, but FreeBayes is shown to produce the most accurate variant set in sORF regions. An accurate set of variants is then obtained by taking the intersection of the variants called by the different algorithms. The potential deleteriousness of each variant is then predicted using a pathogenicity scoring algorithm developed here, called sORF-c. This algorithm uses supervised machine learning to predict the pathogenicity of each variant from a broad range of functional, conservation-based and region-based scores defined for that variant. Trained on a dataset of over 130,000 variants, sORF-c outperforms other comparable pathogenicity scoring algorithms on a test set of variants in sORF regions of the human genome.
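The intersection step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`parse_vcf_lines`, `consensus_variants`) and the toy records are hypothetical, and a variant is assumed to be identified by its chromosome, position, reference allele and alternate allele. Real call sets would normally be normalised first so that equivalent indels are represented identically.

```python
from typing import Iterable, List, Set, Tuple

# A variant is keyed by (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def parse_vcf_lines(lines: Iterable[str]) -> Set[Variant]:
    """Collect variant keys from VCF-style, tab-separated records.

    Header lines (starting with '#') and blank lines are skipped.
    Multi-allelic records are split into one key per alternate allele.
    """
    variants: Set[Variant] = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            variants.add((chrom, int(pos), ref, alt))
    return variants


def consensus_variants(call_sets: List[Set[Variant]]) -> Set[Variant]:
    """Return the variants present in every call set (their intersection)."""
    if not call_sets:
        return set()
    return set.intersection(*call_sets)


# Toy records standing in for the output of two callers (hypothetical data).
caller_a = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tA\tG",
    "chr1\t200\t.\tC\tT,G",
    "chr2\t50\t.\tG\tGA",
]
caller_b = [
    "chr1\t100\t.\tA\tG",
    "chr1\t200\t.\tC\tG",
    "chr2\t60\t.\tT\tC",
]

shared = consensus_variants([parse_vcf_lines(caller_a), parse_vcf_lines(caller_b)])
print(sorted(shared))
```

Taking the intersection trades sensitivity for precision: a variant reported by only one caller is dropped, which suits a downstream prioritisation step that depends on a clean input set.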

List of Abbreviations

AUPRC: Area under the precision-recall curve
BED: Browser Extensible Data
CADD: Combined annotation-dependent depletion
DANN: Deleterious annotation of genetic variants using neural networks
EPO: Enredo, Pecan, Ortheus pipeline
GATK: Genome analysis toolkit
GIAB: Genome in a bottle
HGMD: Human gene mutation database
Indels: Insertions and deletions
MS: Mass spectrometry
ORF: Open reading frame
RF: Random Forests
ROC: Receiver Operating Characteristics
SEP: sORF encoded peptide
sklearn: Scikit-learn package
SNVs: Single nucleotide variants
sORF: Short open-reading frame
TF: Transcription factor
TSS: Transcription start site
VCF: Variant Call Format file
