Priors in finite and infinite Bayesian convolutional neural networks

Garriga Alonso, Adrià

Bayesian neural networks (BNNs) have undergone many changes since the seminal work of Neal [Nea96]. Advances in approximate inference and the use of GPUs have scaled BNNs to larger data sets and to much higher layer and parameter counts. Yet the priors used for BNN parameters have remained essentially the same. The isotropic Gaussian prior introduced by Neal, where each element of the weights and biases is drawn independently from a Gaussian, is still used almost everywhere.

This thesis seeks to undo the neglect in the development of priors for BNNs, especially convolutional BNNs, using a two-pronged approach. First, I theoretically examine the effect of the isotropic Gaussian prior on the distribution over functions induced by a deep BNN. I show that, as the number of channels of a convolutional BNN goes to infinity, its output converges in distribution to a Gaussian process (GP). Thus, we can draw rough conclusions about the function-space distribution of finite BNNs by looking at the mean and covariance of their limiting GPs.
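
The flavour of this limit can be illustrated with a much simpler model than the convolutional networks studied here. The sketch below is not the thesis's code: it assumes a one-hidden-layer, fully connected ReLU network without biases, unit-variance input weights, and output weights with variance 1/width. For that assumed model the limiting covariance is the degree-1 arc-cosine kernel. The names `relu_limit_kernel`, `sample_outputs` and `excess_kurtosis` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np


def relu_limit_kernel(x1, x2):
    """Covariance of the limiting GP for the assumed one-hidden-layer ReLU net.

    This is E[relu(w @ x1) * relu(w @ x2)] for w ~ N(0, I), i.e. the
    degree-1 arc-cosine kernel (with a 1 / (2 * pi) normalisation).
    """
    n1, n2 = np.linalg.norm(x1), np.linalg.norm(x2)
    cos_t = np.clip(x1 @ x2 / (n1 * n2), -1.0, 1.0)
    theta = np.arccos(cos_t)
    return n1 * n2 * (np.sin(theta) + (np.pi - theta) * cos_t) / (2.0 * np.pi)


def sample_outputs(x, width, n_draws, rng):
    """Network outputs at the rows of x for n_draws independent prior draws.

    x has shape (n_points, d); the result has shape (n_draws, n_points).
    """
    d = x.shape[1]
    # Assumption: input-to-hidden weights ~ N(0, 1), no biases.
    w1 = rng.standard_normal((n_draws, d, width))
    # Assumption: hidden-to-output weights ~ N(0, 1 / width).
    w2 = rng.standard_normal((n_draws, width)) / np.sqrt(width)
    hidden = np.maximum(np.einsum("nd,kdw->knw", x, w1), 0.0)
    return np.einsum("knw,kw->kn", hidden, w2)


def excess_kurtosis(values):
    """Sample excess kurtosis; zero for a Gaussian."""
    centred = values - values.mean()
    return np.mean(centred**4) / np.mean(centred**2) ** 2 - 3.0
```

Under these assumptions, the empirical covariance `np.cov(sample_outputs(x, width, n_draws, rng), rowvar=False)` can be compared entry by entry with `relu_limit_kernel`, and the excess kurtosis of the outputs at a single input shrinks towards zero as `width` grows, which is one simple symptom of the prior over functions becoming Gaussian.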

The limiting GP itself performs surprisingly well at image classification, suggesting that knowledge encoded in the convolutional neural network (CNN) architecture, as opposed to the learned features, plays a larger role than previously thought.

Examining the derived CNN kernel shows that, if the weights are independent, the output of the limiting GP loses translation equivariance. This is an important inductive bias for learning from images. We can prevent this loss by introducing spatial correlations in the weight prior of a Bayesian CNN, which still results in a GP in the infinite width limit.
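
One way to picture a spatially correlated weight prior is to draw each convolutional filter from a Gaussian whose covariance depends on the distance between filter positions. The following is a minimal sketch under assumptions that do not come from this abstract: a squared-exponential (RBF) covariance over filter positions, unit marginal variance, and a free `lengthscale` parameter. The function names are hypothetical.

```python
import numpy as np


def spatial_covariance(kernel_size, lengthscale):
    """RBF covariance between the kernel_size**2 positions of a square filter.

    Assumption: correlation decays with squared Euclidean distance between
    positions; the abstract does not specify the covariance form.
    """
    idx = np.arange(kernel_size)
    coords = np.stack(np.meshgrid(idx, idx, indexing="ij"), axis=-1).reshape(-1, 2)
    sq_dist = np.sum((coords[:, None, :] - coords[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * lengthscale**2))


def sample_filters(n_filters, kernel_size, lengthscale, rng):
    """Draw filters of shape (n_filters, kernel_size, kernel_size) from the prior."""
    cov = spatial_covariance(kernel_size, lengthscale)
    # A small jitter keeps the Cholesky factorisation numerically stable.
    chol = np.linalg.cholesky(cov + 1e-9 * np.eye(cov.shape[0]))
    z = rng.standard_normal((n_filters, kernel_size**2))
    # Rows of z @ chol.T have covariance chol @ chol.T, i.e. cov (plus jitter).
    return (z @ chol.T).reshape(n_filters, kernel_size, kernel_size)
```

In this sketch, a very small `lengthscale` recovers nearly independent weights, close to the isotropic Gaussian prior, while larger values make neighbouring filter entries move together.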

The second prong is an empirical methodology for identifying new priors for BNNs. Since BNNs are often considered to underfit, I examine the empirical distribution of weights learned using stochastic gradient descent (SGD). The resulting weight distributions tend to have heavier tails than a Gaussian, and display strong spatial correlations in CNNs.
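
Two simple diagnostics of this kind can be sketched as follows. This is an illustrative assumption about how such a check might look, not the procedure used in the thesis: `tail_fraction` measures how much mass lies far from the mean, and `spatial_correlation` estimates the correlation between filter positions, assuming convolutional weights stored with shape `(out_channels, in_channels, height, width)`. Both names are hypothetical.

```python
import numpy as np


def tail_fraction(weights, n_std=3.0):
    """Fraction of entries more than n_std standard deviations from the mean.

    For a Gaussian this is about 0.0027 at n_std=3; larger values suggest
    heavier tails.
    """
    w = np.asarray(weights).ravel()
    return float(np.mean(np.abs(w - w.mean()) > n_std * w.std()))


def spatial_correlation(conv_weights):
    """Correlation matrix between filter positions.

    Assumption: conv_weights has shape (out_channels, in_channels, kh, kw).
    Every (out, in) filter is treated as one sample of a kh * kw vector.
    """
    out_c, in_c, kh, kw = conv_weights.shape
    flat = conv_weights.reshape(out_c * in_c, kh * kw)
    return np.corrcoef(flat, rowvar=False)
```

Applied to the weights of an SGD-trained CNN, a `tail_fraction` well above the Gaussian reference value and large off-diagonal entries in `spatial_correlation` would correspond to the two features described above.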

I incorporate the found features into BNN priors, and test the performance of the resulting posterior. The spatially correlated priors, recommended by both prongs, greatly increase the classification performance of Bayesian CNNs. However, they do not at all reduce the cold-posterior effect (CPE), which indicates model misspecification or inference failure in BNNs. Heavy-tailed priors somewhat reduce the CPE in fully connected neural networks.

Ultimately, it is unlikely that the remaining misspecification is all in the prior. Nevertheless, I have found better priors for Bayesian CNNs. I have provided empirical methods that can be used to further improve BNN priors.

Rasmussen, Carl
Bayesian neural networks, Machine learning, Gaussian processes, Infinitely wide neural networks, Cold-posterior effect, Convolutional neural networks, Bayesian statistics, Neural tangent kernel
Doctor of Philosophy (PhD)
Awarding Institution
University of Cambridge
Engineering and Physical Sciences Research Council (1950008)