Genome-graph based genotyping with applications to highly variable genes in P. falciparum


Type

Thesis

Authors

Abstract

Analysing genetic variation in pathogen genomes is key to understanding their biology, evolution and epidemiology. Typically, this is done by assembling one arbitrary genome, defined as the ‘reference’, and describing other samples as deviations from it. However, this model breaks down in highly diverse regions of the genome, where sample sequencing reads, differing too substantially from the reference, fail to map. This ‘bias against diversity’, due to using a single reference, naturally affects genomic regions under pressure to diversify: this includes the human MHC, a motivating example for the field, and vaccine candidate genes in the malaria parasite Plasmodium falciparum (Pf), the motivating example for this thesis. The growing solution in the field is to build graph-based models, representing not one, but a population of genomes from a species, and using these genome graphs as a substrate for read mapping and genotyping instead. In this thesis, I develop new algorithms and data structures for genotyping highly variable genes in Pf, using genome graphs. In Chapters 2 and 3, I describe methods and code to analyse variation in highly diverse regions of the genome, across many genomes in a cohort. In doing so I provide two main advances on the state-of-the-art: jointly studying small (SNPs, indels) and large (indels >50bp) variation, and accessing variation on multiple references. I validate these methods using different datasets, ultimately genotyping SNPs on diverged haplotypes in two highly variable Pf genes, including one gene from the major methodological and biological motivation for this thesis, paralogs DBLMSP and DBLMSP2 (DBs). In Chapter 4, I study the DBs in greater detail, using a global dataset of >3,500 Pf genomes. Building a genome-graph-based pipeline, I recover variation inaccessible to single-reference-based approaches (GATK), before uncovering new biology.
Expressing each diverged DB haplotype as a mosaic of the others, I find widespread recombination in each gene, and also discover recent evidence of gene conversion between the two genes. In summary, this thesis provides both methodological advances in genome-graph based genotyping, and practical insights into the genome biology of an important human pathogen.

Description

Date

2022-09-23

Advisors

Iqbal, Zamin

Keywords

bioinformatics, genotyping, genome graphs, malaria genomics, evolution

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Sponsorship

EMBL International PhD Programme