dc.contributor.author: Jackson, Felix (en)
dc.contributor.author: Wayland, Matthew (en)
dc.contributor.author: Prabakaran, Sudhakaran (en)
dc.date.accessioned: 2018-12-18T00:32:41Z
dc.date.available: 2018-12-18T00:32:41Z
dc.date.issued: 2017-05-03 (en)
dc.identifier.uri: https://www.repository.cam.ac.uk/handle/1810/287105
dc.description.abstract: As whole-genome sequencing technologies improve and accurate maps of the entire genome are assembled, short open-reading frames (sORFs) are garnering interest as functionally important regions that were previously overlooked. However, there is a paucity of tools available to investigate variants in sORF regions of the genome. Here we investigate the performance of commonly used tools for variant calling and variant prioritisation in these regions, and present a framework for optimising these processes. First, the performance of four widely used germline variant calling algorithms is systematically compared. HaplotypeCaller is found to perform best across the whole genome, but FreeBayes is shown to produce the most accurate variant set in sORF regions. An accurate set of variants is found by taking the intersection of called variants. The potential deleteriousness of each variant is then predicted using a pathogenicity scoring algorithm developed here, called sORF-c. This algorithm uses supervised machine-learning to predict the pathogenicity of each variant, based on a holistic range of functional, conservation-based and region-based scores defined for each variant. By training on a dataset of over 130,000 variants, sORF-c outperforms other comparable pathogenicity scoring algorithms on a test set of variants in sORF regions of the human genome.
<h4>List of Abbreviations</h4> AUPRC: Area under the precision-recall curve; BED: Browser Extensible Data; CADD: Combined annotation-dependent depletion; DANN: Deleterious annotation of genetic variants using neural networks; EPO: Enredo, Pecan, Ortheus pipeline; GATK: Genome analysis toolkit; GIAB: Genome in a bottle; HGMD: Human gene mutation database; Indels: Insertions and deletions; MS: Mass spectrometry; ORF: Open reading frame; RF: Random Forests; ROC: Receiver Operating Characteristics; SEP: sORF encoded peptide; sklearn: Scikit-learn package; SNVs: Single nucleotide variants; sORF: Short open-reading frame; TF: Transcription factor; TSS: Transcription start site; VCF: Variant Call Format file (en)
dc.title: Identification and Prioritisation of Variants in the Short Open-Reading Frame Regions of the Human Genome (en)
dc.type: Webpages
prism.publicationDate: 2017 (en)
dc.identifier.doi: 10.17863/CAM.34415
rioxxterms.versionofrecord: 10.1101/133645 (en)
rioxxterms.version: AM
rioxxterms.licenseref.uri: http://www.rioxx.net/licenses/all-rights-reserved (en)
rioxxterms.licenseref.startdate: 2017-05-03 (en)
dc.contributor.orcid: Wayland, Matthew [0000-0002-8095-858X]
dc.contributor.orcid: Prabakaran, Sudhakaran [0000-0002-6527-1085]
rioxxterms.type: Other (en)
rioxxterms.freetoread.startdate: 2018-05-03
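
The abstract in this record says that an accurate set of variants is found by taking the intersection of the variants called by several algorithms. The Python sketch below illustrates that idea only; it is not the authors' pipeline. It assumes each caller's output is available as VCF-style tab-separated lines and that two variants match when chromosome, position, reference allele and alternate allele are identical. The function names `parse_vcf_lines` and `consensus`, and the toy records, are hypothetical and invented for illustration.

```python
from typing import Iterable, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def parse_vcf_lines(lines: Iterable[str]) -> Set[Variant]:
    """Collect (chrom, pos, ref, alt) keys from VCF data lines."""
    variants: Set[Variant] = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):  # one key per alternate allele
            variants.add((chrom, pos, ref, alt))
    return variants


def consensus(call_sets: Iterable[Set[Variant]]) -> Set[Variant]:
    """Return the variants present in every caller's set."""
    sets = list(call_sets)
    if not sets:
        return set()
    return set.intersection(*sets)


# Toy input, invented for illustration only (not data from the study).
caller_a = parse_vcf_lines([
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT\t40\tPASS\t.",
])
caller_b = parse_vcf_lines([
    "chr1\t100\t.\tA\tG\t60\tPASS\t.",
    "chr2\t300\t.\tG\tA,C\t30\tPASS\t.",
])
shared = consensus([caller_a, caller_b])
print(sorted(shared))  # [('chr1', 100, 'A', 'G')]
```

Exact-key matching is the simplest possible choice. Indels can be written in more than one equivalent way, so a real comparison would likely need to normalise variant representations before intersecting.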

