
Scaling behaviour of neural networks: Existence and character of large width limits





Hron, Jiri 


It took until the last decade for machines to finally match human performance on essentially any task related to vision or natural language understanding. Most of these successes were achieved by neural networks (NNs), a class of algorithms for finding patterns in large swaths of data. Progress since 2020 in particular has been driven by: (i) growing the number of parameters NNs can use to make predictions, and (ii) increasing the amount of data used to optimise these parameters. The race for scale has been fuelled by the discovery of scaling laws, an empirical phenomenon whereby the number of errors a model makes decays as a power law of the dataset size and the NN parameter count.

This thesis is devoted to understanding how parameter count affects the behaviour of NNs. Our results are probabilistic in nature, describing how random initialisation and the non-convexity of the loss surface impact NNs. We prove that most NNs exhibit Gaussian process (GP) type behaviour when the parameter count (in each layer) is sufficiently large. The choice of NN architecture and initialisation distribution determines the GP kernel function. Each layer is thus associated with its own kernel transformation, and these transformations can be mixed and matched the same way NN layers can. The only exception in this thesis is single-head attention, which exhibits Gaussian mixture, rather than GP, behaviour, regardless of the number of parameters.
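The per-layer kernel composition described above can be illustrated with a minimal NumPy sketch for a fully connected ReLU network, using the known closed form for the ReLU post-activation expectation (the arc-cosine kernel). The function names and the He-style weight variance of 2 are illustrative assumptions, not the thesis's own notation or code.

```python
import numpy as np

def relu_kernel_layer(K, sigma_w2=2.0, sigma_b2=0.0):
    """One step of the GP kernel recursion for a ReLU layer.

    Maps the kernel matrix K of the previous layer's outputs to the
    kernel of the next layer, using the arc-cosine closed form for
    the expectation of products of ReLU activations.
    """
    d = np.sqrt(np.diag(K))                            # per-input standard deviations
    cos_theta = np.clip(K / np.outer(d, d), -1.0, 1.0) # correlations, clipped for safety
    theta = np.arccos(cos_theta)
    J = (np.sin(theta) + (np.pi - theta) * cos_theta) / (2 * np.pi)
    return sigma_w2 * np.outer(d, d) * J + sigma_b2

def nngp_kernel(X, depth=3, sigma_w2=2.0, sigma_b2=0.0):
    """Compose per-layer kernel transformations, mirroring layer stacking."""
    K = sigma_w2 * (X @ X.T) / X.shape[1] + sigma_b2   # input (linear) layer
    for _ in range(depth):
        K = relu_kernel_layer(K, sigma_w2, sigma_b2)
    return K

# Kernel matrix for 4 inputs of dimension 10.
X = np.random.default_rng(0).normal(size=(4, 10))
K = nngp_kernel(X)
```

Stacking a different architecture corresponds to composing different kernel maps in the same order, which is the mix-and-match property mentioned above; with the variance set to 2 and no bias term, the recursion preserves the diagonal of the kernel, matching He-style initialisation.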

The GP behaviour occurs at initialisation, during training, and after training. We focus on two types of training: gradient descent and Bayesian inference. We obtain an exact description of the large width Bayesian posterior both in parameter space and in the space of functions represented by the NNs. For gradient descent, we derive a formula which characterises the behaviour of large attention NNs; this complements earlier results for fully connected and convolutional architectures.

In the last chapter, we focus on making inference in large NNs easier and better. The first part is about Neural Tangents, a Python deep learning library which automates computation of the GP limits for a wide variety of NNs. Beyond our own work, the library has facilitated around 100 independent research papers to date. The second part of the chapter exploits our insights about the behaviour of wide Bayesian NNs to design a posterior inference method which becomes more efficient the larger the NN is, while introducing minimal computational overhead.





Turner, Richard
Ghahramani, Zoubin


Deep learning, Gaussian processes, Machine learning, Neural networks


Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge
Engineering and Physical Sciences Research Council (EPSRC) (1949878)
EPSRC, Nokia ICASE Fellowship