Towards Improved Variational Inference for Deep Bayesian Models

Author
Ober, Sebastian William

Over the last decade, deep learning has been at the forefront of extraordinary advances in a wide range of tasks, including computer vision, natural language processing, and reinforcement learning, to name but a few. However, it is well known that deep models trained via maximum likelihood estimation tend to be overconfident and give poorly calibrated predictions. Bayesian deep learning attempts to address this by placing priors on the model parameters, which are then combined with a likelihood to perform posterior inference. Unfortunately, for deep models the true posterior is intractable, forcing the user to resort to approximations.

In this thesis, we explore the use of variational inference as an approximation, as it is unique in simultaneously approximating the posterior and providing a lower bound to the marginal likelihood. If tight enough, this lower bound can be used to optimize hyperparameters and to facilitate model selection. However, this capacity has rarely been used to its full extent for Bayesian neural networks, likely because the approximate posteriors typically used in practice can lack the flexibility to effectively bound the marginal likelihood.

We therefore explore three aspects of Bayesian learning for deep models. First, we ask whether it is necessary to perform inference over as many parameters as possible, or whether it is reasonable to treat many of them as hyperparameters that we optimize with respect to the marginal likelihood. This would yield significant computational savings; however, we observe that it can lead to pathological behavior and severe overfitting, suggesting that it is better to be as “fully Bayesian” as possible. Second, we propose a variational posterior that provides a unified view of inference in Bayesian neural networks and deep Gaussian processes, which we show is flexible enough to take advantage of added prior hyperparameters. Finally, we demonstrate how variational inference can be improved in certain deep Gaussian process models by analytically removing symmetries from the posterior and performing inference on Gram matrices instead of features. While we do not directly investigate the use of our improvements for model selection, we hope that our contributions will provide a stepping stone to fully realizing the promise of variational inference in the future.
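For concreteness, the lower bound referred to above is the standard evidence lower bound (ELBO); in generic notation (not necessarily that of the thesis), with data D, parameters w, prior p(w), and variational posterior q(w):

\log p(\mathcal{D}) = \underbrace{\mathbb{E}_{q(w)}\!\left[\log p(\mathcal{D} \mid w)\right] - \mathrm{KL}\!\left(q(w) \,\|\, p(w)\right)}_{\text{ELBO}} + \mathrm{KL}\!\left(q(w) \,\|\, p(w \mid \mathcal{D})\right) \;\ge\; \text{ELBO}.

The gap is the KL divergence from q(w) to the true posterior, so the bound is tight exactly when the approximate posterior matches the true one; this is why flexible approximate posteriors matter for using the bound in hyperparameter optimization and model selection.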

Supervisor
Rasmussen, Carl

Keywords
Bayesian neural networks, Deep Gaussian processes, Variational inference

Qualification
Doctor of Philosophy (PhD)
Awarding Institution
University of Cambridge
Sponsorship
Gates Cambridge Trust