Sequential decision making with feature-linear models
This thesis is concerned with the problem of sequential decision making, where an agent interacts sequentially with an unknown environment and aims to maximise the sum of the rewards it receives. Our focus is on methods that model the reward as linear in some feature space. We consider a bandit problem, where the rewards are linear in a reproducing kernel Hilbert space, and a reinforcement learning setting with features given by a neural network. The thesis is split into two parts accordingly.
In part I, we introduce a new algorithm for the optimisation of continuous functions with a known number of derivatives under noisy bandit feedback. The algorithm, Partitioned GP-UCB, combines ideas from classic kernel-based bandit algorithms and hierarchical optimisation methods to provide near-tight confidence intervals and strong guarantees on computational complexity. As part of our analysis, we develop a novel bound on the effective dimension of kernel ridge regression with the Matérn kernel with respect to the volume of the domain; this bound may be of interest beyond the bandit setting.
In part II, we tackle practical exploration in deep reinforcement learning, focusing on methods that combine linear uncertainty estimates with feature embeddings given by deep Q-function networks. We identify a flaw in previous such work: while these methods enforce a temporal difference relationship on the mean state-action values given by the linear model, they do not constrain its inputs, the neural network embeddings, and it is these embeddings that determine the uncertainty estimates of the linear model. We show that such embeddings need to follow a certain successor feature structure, and we develop a model based on this insight: Successor Uncertainties. We demonstrate that our model solves hard tabular exploration challenges and scales to the Arcade Learning Environment, where it attains strong scores relative to competing methods.