
Priors in finite and infinite Bayesian convolutional neural networks



Type

Thesis

Authors

Garriga Alonso, Adrià (ORCID: https://orcid.org/0000-0003-3409-5047)

Abstract

Bayesian neural networks (BNNs) have undergone many changes since the seminal work of Neal [Nea96]. Advances in approximate inference and the use of GPUs have scaled BNNs to larger data sets and much higher layer and parameter counts. Yet, the priors used for BNN parameters have remained essentially the same. The isotropic Gaussian prior introduced by Neal, where each element of the weights and biases is drawn independently from a Gaussian, is still used almost everywhere.

This thesis seeks to remedy this neglect of priors for BNNs, especially convolutional BNNs, using a two-pronged approach. First, I theoretically examine the effect of the isotropic Gaussian prior on the distribution over functions induced by a deep BNN. I show that, as the number of channels of a convolutional BNN goes to infinity, its output converges in distribution to a Gaussian process (GP). Thus, we can draw rough conclusions about the function-space prior of finite BNNs by looking at the mean and covariance of their limiting GPs.
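
As a rough, self-contained illustration of this limit (a sketch added here, not code from the thesis; the architecture, layer sizes, and sample counts are my own assumptions): drawing many random one-hidden-layer CNNs and recording a single scalar output shows the prior output distribution becoming Gaussian as the channel count grows, visible in the excess kurtosis shrinking towards zero.

# Sketch: the prior over outputs of a random one-hidden-layer CNN
# approaches a Gaussian as the number of channels grows (NumPy only).
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 10, 10))        # one fixed input: 3 channels, 10x10

def sample_output(n_channels):
    # Hidden conv layer: n_channels 3x3 filters with i.i.d. Gaussian weights
    # scaled by 1/sqrt(fan_in), followed by a ReLU and a scalar linear readout.
    w1 = rng.standard_normal((n_channels, 3, 3, 3)) / np.sqrt(3 * 3 * 3)
    h = np.zeros((n_channels, 8, 8))
    for i in range(8):
        for j in range(8):
            h[:, i, j] = np.tensordot(w1, x[:, i:i+3, j:j+3], axes=3)
    h = np.maximum(h, 0.0)                                   # ReLU
    w2 = rng.standard_normal(h.shape) / np.sqrt(h.size)      # readout weights
    return float(np.sum(w2 * h))

for c in (1, 8, 64):
    samples = np.array([sample_output(c) for _ in range(2000)])
    print(c, round(kurtosis(samples), 2))   # excess kurtosis -> 0 (Gaussian)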

The limiting GP itself performs surprisingly well at image classification, suggesting that knowledge encoded in the convolutional neural network (CNN) architecture, as opposed to the learned features, plays a larger role than previously thought.

Examining the derived CNN kernel shows that, if the weights are independent, the output of the limiting GP loses translation equivariance. This is an important inductive bias for learning from images. We can prevent this loss by introducing spatial correlations in the weight prior of a Bayesian CNN, which still results in a GP in the infinite width limit.
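
A minimal sketch of what such a spatially correlated filter prior might look like (the squared-exponential correlation over tap positions and all numerical values are illustrative assumptions on my part, not the thesis's exact construction):

# Sample a 5x5 convolutional filter whose taps are jointly Gaussian with
# spatial correlations, instead of being drawn i.i.d.
import numpy as np

size, lengthscale = 5, 1.5
coords = np.stack(np.meshgrid(np.arange(size), np.arange(size), indexing="ij"), -1)
coords = coords.reshape(-1, 2).astype(float)               # (25, 2) tap positions
sq_dists = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
cov = np.exp(-0.5 * sq_dists / lengthscale ** 2)           # squared-exponential kernel
cov += 1e-6 * np.eye(size * size)                          # jitter for the Cholesky factor
chol = np.linalg.cholesky(cov)

rng = np.random.default_rng(0)
filt = (chol @ rng.standard_normal(size * size)).reshape(size, size)
print(filt.round(2))                                       # nearby taps have similar values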

The second prong is an empirical methodology for identifying new priors for BNNs. Since BNNs are often considered to underfit, I examine the empirical distribution of weights learned using stochastic gradient descent (SGD). The resulting weight distributions tend to have heavier tails than a Gaussian, and display strong spatial correlations in CNNs.
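
The summary statistics this methodology relies on can be computed directly from a trained layer's weight tensor. Below is a sketch (the tensor shape and the choice of statistics are my own assumptions) that estimates tail heaviness via excess kurtosis and the correlation between horizontally adjacent filter taps.

# Given a conv layer's weights of shape (out_channels, in_channels, k, k),
# summarise tail heaviness and spatial correlation of the filter taps.
import numpy as np
from scipy.stats import kurtosis

def weight_summary(w):
    excess_kurt = kurtosis(w.reshape(-1))      # 0 for a Gaussian, > 0 for heavy tails
    left = w[..., :, :-1].reshape(-1)          # each tap and its right-hand neighbour
    right = w[..., :, 1:].reshape(-1)
    spatial_corr = np.corrcoef(left, right)[0, 1]
    return excess_kurt, spatial_corr

# Baseline check on i.i.d. Gaussian weights: both statistics should be near 0.
print(weight_summary(np.random.default_rng(0).standard_normal((64, 32, 3, 3))))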

I incorporate the features found this way into BNN priors, and test the performance of the resulting posteriors. The spatially correlated priors, recommended by both prongs, greatly increase the classification performance of Bayesian CNNs. However, they do nothing to reduce the cold-posterior effect (CPE), which indicates model misspecification or inference failure in BNNs. Heavy-tailed priors somewhat reduce the CPE in fully connected neural networks.
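
For reference, the cold-posterior effect is usually stated in terms of a tempered posterior (this is the common formulation in the literature, not a definition specific to this thesis),

\[ p_T(\theta \mid \mathcal{D}) \;\propto\; \big[\, p(\mathcal{D} \mid \theta)\, p(\theta) \,\big]^{1/T}, \]

where the effect is that predictive performance peaks at "cold" temperatures T < 1 rather than at the Bayes posterior T = 1.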

Ultimately, it is unlikely that the remaining misspecification is all in the prior. Nevertheless, I have found better priors for Bayesian CNNs, and have provided empirical methods that can be used to improve BNN priors further.

Date

2021-12-01

Advisors

Rasmussen, Carl

Keywords

Bayesian neural networks, Machine learning, Gaussian processes, Infinitely wide neural networks, Cold-posterior effect, Convolutional neural networks, Bayesian statistics, Neural tangent kernel

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Sponsorship

Engineering and Physical Sciences Research Council (EPSRC), grant 1950008