
Pathogen evolution within and between hosts with applications to the pneumococcus and SARS-CoV-2






Tracking a pathogen's evolution over different timescales is essential for designing vaccines, controlling antibiotic resistance, conducting disease surveillance, identifying transmission chains and understanding the mechanisms behind the pathogen's evolution. Although the first bacterial genome was sequenced in 1995, techniques and costs have only recently enabled the large-scale study of the evolution of pathogens in human populations over short timescales. This thesis begins by developing methods for the analysis of pathogen genomes, including the identification of population structure, characterisation of gene content and tracking of transmission. These techniques are then applied to the analysis of two large pathogen sequencing studies investigating the evolution of Streptococcus pneumoniae and SARS-CoV-2 within the host and during transmission.

Initially, a fast solution to the problem of clustering pathogen genomes is presented. An approximation to a Dirichlet process mixture model is shown to produce accurate genome clusterings 10 to 100 times faster than existing methods. Next, the problem of identifying the pangenome (the set of all genes that have been found in a species) is considered, and a method is presented that reduces the number of errors introduced by misannotation by more than 10-fold. Methods for the inference of transmission chains using genomic data are then described. Existing methods are adapted to improve their efficiency and make them applicable to the tracking of endemic disease. In addition, a novel simulation-based transmission inference method is introduced.

Following the description of these techniques, a large dataset of nearly 4000 pneumococcal-positive nasopharyngeal swabs is considered, comprising samples taken over a period of two years from mothers and their children in the Maela refugee camp in Thailand. After culturing S. pneumoniae using selective agar, whole plate sweeps were taken for sequencing, allowing for the investigation of the within-host genetic diversity of S. pneumoniae. Analysis of the resulting genomic data reveals that the technique increases sensitivity for detecting instances of multiple pneumococcal carriage by 17.7% over latex sweeps and that different pneumococcal lineages are associated with increased and decreased rates of multiple carriage. The impact of antibiotic resistance and treatment within the host is examined, with treatment identified as a risk factor for the transmission of multidrug-resistant lineages. The evolution of pneumococcal lineages within the host is explored, allowing for the consideration of selection and the identification of a unique mutational spectrum consistent with previous laboratory experiments in E. coli. Using minority variant calls found in each S. pneumoniae sample, routes of transmission are then determined, including those between mother and child. This analysis indicates that mothers were more likely to transmit pneumococci to their children during the first year of the child's life than in the subsequent year.

Finally, a similar analysis is performed on a large collection of over 1000 SARS-CoV-2 isolates, sequenced in replicate, from the East of England. In addition to considering the selective and mutational processes acting on the virus within the host, a high rate of independent recurrent within-host mutations is identified, which could complicate efforts to infer transmission chains using within-host minority variants.





Bentley, Stephen
Frost, Simon
Corander, Jukka


Genomics, Epidemiology, Pathogen, Transmission, Streptococcus pneumoniae, SARS-CoV-2


Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge
This work was funded by a Wellcome Trust PhD scholarship grant (204016/Z/16/Z) and the Cambridge Commonwealth, European & International Trust.