Repository logo
 

Biological applications, visualizations, and extensions of the long short-term memory network


Type

Thesis

Change log

Authors

van der Westhuizen, Jos 

Abstract

Sequences are ubiquitous in the domain of biology. One of the current best machine learning techniques for analysing sequences is the long short-term memory (LSTM) network. Owing to significant barriers to adoption in biology, focussed efforts are required to realize the use of LSTMs in practice. Thus, the aim of this work is to improve the state of LSTMs for biology, and we focus on biological tasks pertaining to physiological signals, peripheral neural signals, and molecules. This goal drives the three subplots in this thesis: biological applications, visualizations, and extensions.

We start by demonstrating the utility of LSTMs for biological applications. On two new physiological-signal datasets, LSTMs were found to outperform hidden Markov models. LSTM-based models, implemented by other researchers, also constituted the majority of the best performing approaches on publicly available medical datasets. However, even if these models achieve the best performance on such datasets, their adoption will be limited if they fail to indicate when they are likely mistaken. Thus, we demonstrate on medical data that it is straightforward to use LSTMs in a Bayesian framework via dropout, providing model predictions with corresponding uncertainty estimates.

Another dataset used to show the utility of LSTMs is a novel collection of peripheral neural signals. Manual labelling of this dataset is prohibitively expensive, and as a remedy, we propose a sequence-to-sequence model regularized by Wasserstein adversarial networks. The results indicate that the proposed model is able to infer which actions a subject performed based on its peripheral neural signals with reasonable accuracy.

As these LSTMs achieve state-of-the-art performance on many biological datasets, one of the main concerns for their practical adoption is their interpretability. We explore various visualization techniques for LSTMs applied to continuous-valued medical time series and find that learning a mask to optimally delete information in the input provides useful interpretations. Furthermore, we find that the input features looked for by the LSTM align well with medical theory.

For many applications, extensions of the LSTM can provide enhanced suitability. One such application is drug discovery -- another important aspect of biology. Deep learning can aid drug discovery by means of generative models, but they often produce invalid molecules due to their complex discrete structures. As a solution, we propose a version of active learning that leverages the sequential nature of the LSTM along with its Bayesian capabilities. This approach enables efficient learning of the grammar that governs the generation of discrete-valued sequences such as molecules. Efficiency is achieved by reducing the search space from one over sequences to one over the set of possible elements at each time step -- a much smaller space.

Having demonstrated the suitability of LSTMs for biological applications, we seek a hardware efficient implementation. Given the success of the gated recurrent unit (GRU), which has two gates, a natural question is whether any of the LSTM gates are redundant. Research has shown that the forget gate is one of the most important gates in the LSTM. Hence, we propose a forget-gate-only version of the LSTM -- the JANET -- which outperforms both the LSTM and some of the best contemporary models on benchmark datasets, while also reducing computational cost.

Description

Date

2018-06-01

Advisors

Lasenby, Joan

Keywords

long short-term memory, janet, human machine interfacing, predictive diagnostics, neural network, deep learning, drug discovery, lstm

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge
Sponsorship
My PhD was funded by the Skye Cambridge Trust.