On Machine Learning Loss Landscapes
Abstract
Loss functions are pivotal to the training of every machine learning model. Evaluating the loss function over a large range of parameter values yields a loss landscape. In this thesis, loss landscapes for various classes of machine learning models are explored. Geometric properties of the loss landscape can provide deep insights into machine learning models and their decision-making processes. Today, the most critical questions in machine learning concern model performance, robustness and interpretability. Here, tools from the field of energy landscapes in theoretical chemistry are used to study these questions for neural networks and Gaussian processes. This cross-disciplinary approach is facilitated by a collection of analogues between the two fields developed in this work, such as the concepts of heat capacity, monotonic sequence basins and catastrophe theory.
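To make the central object concrete, the following is a minimal sketch (not taken from the thesis) of how a loss landscape arises: a loss function is evaluated over a grid of parameter values, here a two-parameter toy model fit to synthetic data. The model, data and grid ranges are illustrative assumptions.

```python
import numpy as np

# Synthetic regression data (illustrative assumption, not thesis data).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = np.tanh(1.5 * x - 0.3) + 0.1 * rng.normal(size=50)

def loss(w, b):
    """Mean-squared error of the toy model y ~ tanh(w*x + b)."""
    return np.mean((np.tanh(w * x + b) - y) ** 2)

# Evaluating the loss over a grid of the two parameters gives a
# two-dimensional slice of the loss landscape.
ws = np.linspace(-4.0, 4.0, 200)
bs = np.linspace(-4.0, 4.0, 200)
landscape = np.array([[loss(w, b) for b in bs] for w in ws])

print(landscape.shape, landscape.min())
```

For models with more than two parameters, such slices (or low-dimensional projections) are the standard way to visualise the landscape; the geometric analysis in the thesis operates on the full-dimensional surface.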
The energy landscape perspective on loss landscapes proves to be a helpful approach for examining critical questions in the machine learning field. Energy landscape tools are applied to neural network ensemble generation and reveal that different minima of the loss landscape specialise in different sections of the input data, improving classification accuracy when combined. A second application is the evaluation and selection of loss functions, guiding hyperparameter choices. By comparing the appAUC loss function with conventional alternatives, I find that appAUC, while more accurate, is less robust and therefore inferior for real-life problems. Lastly, monotonic sequence basins can be used to group minima; within such groups, conserved weights can be identified to decode algorithmic decision making. Using this approach, I am able to identify and quantify input feature relevance for classification problems, a fundamental step towards trustworthy and accountable machine learning systems.
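The ensemble idea can be sketched as follows; this is my own hedged illustration under simplifying assumptions, not the thesis implementation. The distinct "minima" here are simply small networks trained from different random initialisations, and their predictions are averaged.

```python
import numpy as np

# Synthetic XOR-like classification data (illustrative assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)

def train(seed, steps=2000, lr=0.5):
    """One-hidden-layer network trained by full-batch gradient descent."""
    r = np.random.default_rng(seed)
    W1, b1 = r.normal(size=(2, 8)), np.zeros(8)
    W2, b2 = r.normal(size=8), 0.0
    for _ in range(steps):
        h = np.tanh(X @ W1 + b1)
        p = 1 / (1 + np.exp(-(h @ W2 + b2)))   # sigmoid output
        g = (p - y) / len(y)                   # dL/dlogit for BCE loss
        hg = np.outer(g, W2) * (1 - h ** 2)    # backprop through tanh
        W2, b2 = W2 - lr * (h.T @ g), b2 - lr * g.sum()
        W1, b1 = W1 - lr * (X.T @ hg), b1 - lr * hg.sum(0)
    return W1, b1, W2, b2

def predict(params):
    W1, b1, W2, b2 = params
    h = np.tanh(X @ W1 + b1)
    return 1 / (1 + np.exp(-(h @ W2 + b2)))

# Each seed converges to a (generally different) minimum of the loss
# landscape; averaging their predictions forms the ensemble.
minima = [train(seed) for seed in range(5)]
probs = np.mean([predict(m) for m in minima], axis=0)
print("ensemble accuracy:", np.mean((probs > 0.5) == y))
```

The thesis goes further than this sketch by generating ensemble members systematically from the landscape itself rather than from random restarts.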
For Gaussian processes, I present an analysis of the loss landscape over hyperparameter space for different kernel choices, specifically over changes in the Matérn smoothness parameter $\nu$; varying $\nu$ continuously is equivalent to moving between kernels. I identify fold catastrophes in the landscape as $\nu$ changes around critical points, and highlight the non-optimality of the half-integer parameterisations of $\nu$ that are commonly employed in the field. Towards the end of this dissertation, a comparison of loss landscapes between machine learning models is provided, and suggestions for future work are highlighted.
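As a hedged sketch of the kind of scan involved (my assumptions, not the thesis code), the Matérn kernel for general $\nu$ can be written via the modified Bessel function, and the Gaussian-process negative log marginal likelihood can be evaluated as $\nu$ varies continuously on synthetic data. Half-integer values ($\nu = 1/2, 3/2, 5/2$) are special only in admitting closed-form kernels, not in being optimal.

```python
import numpy as np
from scipy.special import kv, gamma

def matern(r, nu, ell=1.0):
    """Matérn covariance k(r) with smoothness nu and lengthscale ell."""
    r = np.where(r == 0.0, 1e-12, r)  # avoid 0 * inf at r = 0
    z = np.sqrt(2 * nu) * r / ell
    return (2 ** (1 - nu) / gamma(nu)) * z ** nu * kv(nu, z)

def neg_log_marginal_likelihood(nu, x, y, noise=1e-2):
    """GP NLML: 0.5*y^T K^-1 y + 0.5*log|K| + (n/2)*log(2*pi)."""
    r = np.abs(x[:, None] - x[None, :])
    K = matern(r, nu) + noise * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(x) * np.log(2 * np.pi)

# Synthetic 1D data (illustrative assumption).
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 5.0, 40))
y = np.sin(2 * x) + 0.1 * rng.normal(size=40)

# Scan nu continuously; the optimum generally falls between half-integers.
for nu in np.linspace(0.3, 3.0, 10):
    print(f"nu = {nu:.2f}  NLML = {neg_log_marginal_likelihood(nu, x, y):.3f}")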
