Methods for Constructing and Exploiting Information Measures for Neural Networks
Abstract
Gregory Chaitin, the noted mathematician and co-founder of algorithmic information theory, once put forth the claim that “compression is comprehension” [28]. This hypothesis, which echoes similar sentiments found in principles and theories such as Occam’s Razor and Solomonoff induction, underscores the duality of information and learning. In this thesis we present a selection of methods that study and leverage this duality in the context of neural networks.

The first subject of discussion is the description length of deep neural networks (DNNs) and the various methodologies for estimating minimum description length (MDL) codelengths. The symbiotic relationship between neural networks and compression research has produced profound work in both fields: many state-of-the-art compression algorithms are driven by neural networks, while, in the other direction, variational inference for neural networks has important roots in the MDL principle and its compression perspective. However, DNNs appear to violate MDL tenets and the compression-comprehension hypothesis, as they achieve excellent generalisation on a variety of complex tasks yet have exceedingly large parameter counts. In an important paper studying this apparent contradiction, Blier and Ollivier [16] found that the description lengths of DNNs could be significantly reduced using a coding technique derived from the prequential approach to statistics. Prequential coding outperformed all other algorithms, including two-part and variational methods, and has endured as the standard, even powering the benchmark-topping NNCP compression algorithm [12, 13]. Our research asks the following question: are there better alternatives to prequential coding for obtaining description lengths of DNNs? We begin by generalising the prequential method into a broad framework we call instructional coding and find that other compression algorithms also fall under this paradigm. We then describe a particular instance of instructional coding based on intermediate datasets which allow false labelling. This approach, which we term prechastic coding, achieves promising results and is competitive with prequential coding in select scenarios.
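For intuition, the prequential codelength that serves as the baseline for this comparison can be written, in one common form (the notation here is illustrative and not necessarily that used in the thesis), as

\[
L_{\mathrm{preq}}(y_{1:n} \mid x_{1:n}) \;=\; -\sum_{i=1}^{n} \log_2 p_{\hat{\theta}_{i-1}}\!\left(y_i \mid x_i\right),
\]

where \(\hat{\theta}_{i-1}\) denotes the parameters of a model trained only on the first \(i-1\) labelled examples. Each term charges the model for predicting a label before having seen it, so a model that learns quickly from little data achieves a short codelength.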
In the subsequent chapter we progress to the topic of diffusion and conditional generation and present our paper on importance-guided diffusion. Diffusion models are a class of generative models, inspired by non-equilibrium thermodynamics, which have become extremely popular in recent years for image synthesis and related tasks. The generative process involves an extensive sequence of latent variables that iteratively refine Gaussian noise into an image resembling those in the training set. This process has a clear information-theoretic interpretation, and in our research we harness this perspective to design a principled conditional generative algorithm which does not require any retraining of the original neural network. This is accomplished using techniques from relative entropy coding, which also featured in our work on prequential coding, and achieves visually compelling results.

Our final setting is the algorithmic notion of information (Kolmogorov complexity), along with related topics such as algorithmic probability, which together have laid important theoretical foundations for machine learning. While practical results remain elusive due to the uncomputability of Kolmogorov complexity, recent work on subjects such as simplicity bias and generalisation in deep neural networks has brought algorithmic information theory back into the applied spotlight. We present our paper on regularising link prediction for graphs with potentially simple generating mechanisms. Specifically, we detail a regularised loss function for link prediction with graph neural networks, utilising machinery which claims to estimate the Kolmogorov complexity of graphs. Although the loss function demonstrates improved link prediction on a selection of real-world networks, this performance is roughly equivalent to that of a control version which does not directly estimate Kolmogorov complexity. Consequently, we conclude that the successful regularisation results were not due to any actual approximation of Kolmogorov complexity, and we frame our findings within the context of an ongoing debate about such estimation methods.
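As a rough sketch (the exact estimator and weighting used in the thesis may differ), a regularised objective of this kind can be written as

\[
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\mathrm{link}}(\theta) \;+\; \lambda\, \hat{K}\!\big(\hat{A}_{\theta}\big),
\]

where \(\mathcal{L}_{\mathrm{link}}\) is a standard link-prediction loss such as binary cross-entropy over candidate edges, \(\hat{A}_{\theta}\) is the predicted adjacency structure, \(\hat{K}\) is the machinery claimed to approximate the Kolmogorov complexity of the graph, and \(\lambda\) sets the regularisation strength; the control variant mentioned above swaps \(\hat{K}\) for a term that makes no such claim.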
The algorithms presented in this thesis offer a glimpse into the rich intersection of information theory, description lengths and neural networks. Furthermore, they advance the current understanding of methods for gauging and manipulating notions of information pertaining to neural networks, for both compressive and generative purposes. The resulting insights represent stepping stones towards future research into neural network performance from an information-theoretic and MDL perspective.