Computational optimisation strategies for targeted DNA sequencing using nanopores
Long-read DNA sequencing is causing a generational shift in genome sequencing productivity and is revolutionising many aspects of biological discovery. One of the technologies behind this transformation is sequencing using nanopores that act as biosensors measuring fluctuations of an ionic current caused by traversing polynucleotides. By partially blocking the flow of ions, distinct patterns of current are generated as nucleotides pass through the narrowest constriction; these are subsequently bioinformatically demixed into nucleobase sequences.
This approach to sequencing has multiple advantages, including the ability to read native molecules of up to megabases in length without prior amplification, and the ability to analyse data in real time. This enables investigations with very short time-to-answer, and also allows ongoing experiments to be manipulated by processing the generated data and feeding instructions back to the sequencing machine, a feature not realisable with any other existing sequencing technology.
Currently, a few methods exist that make use of this by analysing nascent fragments of DNA and testing whether they originate from predetermined areas of a genome marked as targets. If a fragment does not originate from a target, the voltage bias across the membrane in which the nanopores are embedded can be reversed, ejecting the molecule from the nanopore and allowing another one to be sequenced in its place, with the aim of saving time and enriching for on-target sequences. This process has been termed 'adaptive sampling'; but prior to the work presented in my thesis, these methods were based entirely on static instructions. In other words, target regions of a genome were defined before an experiment and remained constant throughout.
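The static decision rule described above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation of any existing tool: the function names, the half-open interval representation of targets, and the choice to keep sequencing fragments that could not be mapped are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: a target is a half-open interval [start, end)
# on a single reference sequence.
Interval = Tuple[int, int]


def on_target(position: int, targets: List[Interval]) -> bool:
    """Return True if a mapped position lies inside any target interval."""
    return any(start <= position < end for start, end in targets)


def static_decision(mapped_position: Optional[int], targets: List[Interval]) -> str:
    """Decide the fate of a nascent fragment under a fixed target list.

    Assumption for this sketch: fragments that could not be mapped
    (mapped_position is None) are left to continue sequencing.
    """
    if mapped_position is None:
        return "continue"
    return "continue" if on_target(mapped_position, targets) else "eject"


# The target list is defined once and never changes during the run.
targets = [(100, 200), (500, 600)]
decisions = [static_decision(p, targets) for p in (150, 250, None)]
```

Because `targets` is fixed before the run starts, the same fragment always receives the same decision, regardless of how much data has already been collected.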
With this thesis, I extend adaptive sampling such that decisions about molecule rejections can incorporate information obtained during an ongoing experiment. One of the main motivations is to address the current need for oversampling genomes many-fold, which can be wasteful, to ensure that the minimum coverage across a sequenced genome is high enough for downstream analyses. Similarly, when mixed samples are sequenced without such oversampling, rare species may be underrepresented or missed entirely. This thesis describes two approaches for dynamic extensions to adaptive sampling to address these issues, by implementing more versatile real-time analysis and control of sequencing experiments.
The first approach is intended for resequencing experiments, where reference sequences of the studied sample are available. For this, I implement an algorithmic framework and software that generates dynamically adapting decision strategies that are continuously updated to steer an active sequencing run. More specifically, this method quantifies uncertainty at each position in a genome and for each novel DNA fragment decides whether the expected decrease in uncertainty warrants fully sequencing it. This way, sequencing can be focused on molecules from areas with the highest uncertainty, e.g. regions of low coverage, thus optimising the information gain. I illustrate the effectiveness of the method by mitigating coverage bias between and within members of a microbial mixture sample. In particular, it adapts to the differential abundances without prior knowledge about sample composition, thereby reducing the interspecies bias and effectively redistributing coverage within species.
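To make the idea of an uncertainty-driven decision concrete, the following is a deliberately simplified sketch and not the framework developed in the thesis. It assumes that uncertainty at a position is measured as the Shannon entropy of the smoothed base frequencies observed so far, and it uses the summed entropy over the stretch a fragment is expected to cover as a proxy for the expected decrease in uncertainty. The function names, the pseudocount, the expected fragment length, and the threshold are all hypothetical.

```python
import math
from typing import List, Sequence


def position_entropy(counts: Sequence[int], pseudocount: float = 1.0) -> float:
    """Shannon entropy (bits) of smoothed base frequencies at one position.

    counts holds the observed numbers of A, C, G and T. With no observations
    the entropy is maximal (2 bits); it shrinks as consistent coverage grows.
    """
    total = sum(counts) + pseudocount * len(counts)
    entropy = 0.0
    for c in counts:
        p = (c + pseudocount) / total
        entropy -= p * math.log2(p)
    return entropy


def fragment_benefit(pileup: List[Sequence[int]], start: int, expected_length: int) -> float:
    """Proxy for information gain: summed entropy over the expected span."""
    window = pileup[start:start + expected_length]
    return sum(position_entropy(c) for c in window)


def accept_fragment(pileup: List[Sequence[int]], start: int,
                    expected_length: int, threshold: float) -> bool:
    """Keep sequencing only if the proxy benefit reaches the threshold."""
    return fragment_benefit(pileup, start, expected_length) >= threshold


# Toy genome: five positions without coverage, then five well-covered ones.
pileup = [[0, 0, 0, 0]] * 5 + [[30, 0, 0, 0]] * 5
low_cov = accept_fragment(pileup, start=0, expected_length=5, threshold=5.0)
high_cov = accept_fragment(pileup, start=5, expected_length=5, threshold=5.0)
```

In this toy example the fragment starting in the uncovered region is accepted and the one starting in the well-covered region is rejected. Because `pileup` would be updated as reads accumulate, the same start position can receive a different decision later in the run, which is the key difference from a static target list.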
In some scenarios, the need for reference genomes poses a limitation, e.g. when sample content is unknown. In this case, previous implementations are not useful, since underrepresented species cannot be targeted. A second approach I develop in the thesis aims to overcome this limitation by exploring how rejection decisions can be made while simultaneously creating a genome assembly from the fragments read so far. Here, the method rejects molecules from regions of genomes that are already well-represented and instead focuses on sequences that either help to extend a species' assembly or are entirely unknown. I show how refocusing sequencing in this way is useful to improve the detection limit for rare organisms in a mixed sample, leads to higher quality assemblies, and allows for true de novo enrichment of unknown species for the first time.
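The reference-free rule described above might look roughly as follows. This is an illustrative sketch under stated assumptions, not the method implemented in the thesis: the `Hit` record, the coverage threshold, and the margin that defines the end of a contig are hypothetical, and the mapping of a fragment against the current assembly is assumed to have happened elsewhere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    """Hypothetical summary of where a fragment maps on the current assembly."""
    contig_length: int     # length of the contig the fragment maps to
    position: int          # mapping position on that contig (0-based)
    local_coverage: float  # coverage around the mapping position


def assembly_decision(hit: Optional[Hit], min_coverage: float = 30.0,
                      end_margin: int = 1000) -> str:
    """Decide whether to keep sequencing a fragment without a reference.

    Fragments are kept if they are unknown to the assembly, if they map
    near a contig end (and may extend it), or if the region they map to
    is not yet well covered. Everything else is ejected.
    """
    if hit is None:
        return "continue"  # entirely unknown sequence
    near_end = (hit.position < end_margin
                or hit.position >= hit.contig_length - end_margin)
    if near_end:
        return "continue"  # may help to extend the contig
    if hit.local_coverage < min_coverage:
        return "continue"  # region is still under-represented
    return "eject"         # region is already well-represented


unknown = assembly_decision(None)
edge = assembly_decision(Hit(contig_length=50_000, position=49_500, local_coverage=80.0))
interior = assembly_decision(Hit(contig_length=50_000, position=25_000, local_coverage=80.0))
```

As the assembly grows and coverage accumulates, more fragments fall into the `eject` branch, so sequencing capacity shifts towards sequences that are rare or not yet assembled.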
Overall, the data-driven approaches to targeted sequencing with nanopores that I have created expand the applicability of adaptive sampling and could be applied to many other sequencing scenarios. The resulting reduction in the time-to-answer or increased information gain might be critical in clinical settings or for pathogen surveillance.