
Interpretability of Neural Networks Latent Representations



Abstract

Deep neural networks are complicated objects. They typically involve millions to trillions of operations to turn their input data into a prediction. Since it is not possible for a human user to analyze each of these operations, the models appear as black boxes. The field of interpretability has developed with the aim of making these models more transparent by explaining their predictions. However, the field has historically focused on explaining the output of these models, leaving their internal latent representations unexplored. In this thesis, we describe how interpretability research has recently expanded its scope beyond the output of neural networks. We focus on three types of interpretability methods: feature importance, example-based and concept-based explanations.

Most feature importance methods require a label to select which component of the neural network output to interpret. When interpreting neural network representations, components correspond to neuron activations and there is no clear label to select which component to interpret. Hence, these feature importance methods cannot be directly applied to interpret latent representations. As a remedy, we introduce the Label-Free Feature Importance framework. In this framework, the feature importance is aggregated over all neurons by weighting each neuron's contribution with that neuron's activation. We show that this yields actionable explanations of latent representations.

Many example-based explanations also implicitly rely on a label, as they require the computation of the model loss for the explained example. To circumvent this limitation, we introduce SimplEx, an example-based interpretability method based on the representation spaces of neural networks. SimplEx reconstructs the latent representation of a given example as a weighted combination of latent representations from a corpus of examples. If this corpus corresponds to the model's training data, SimplEx decomposes the representation of a new example based on examples previously seen by the model. This provides a way to perform case-based reasoning in the latent spaces of neural networks.

Concept-based explanations typically assume that human concepts are encoded in a model's latent space if the positives (i.e., examples with the concept) are linearly separable from the negatives (i.e., examples without the concept) in this space. We discuss the limitations of this assumption by illustrating how neural networks can encode concepts while violating linear separability. This motivates our Concept Activation Region paradigm, which introduces a more flexible, kernel-based way to define concept fingerprints in latent spaces. We show that this paradigm leads to explanations that better describe how concepts are represented by neural networks.

Throughout this thesis, we study two properties of interpretability methods: completeness and invariance. The completeness property establishes a link between the explanations and the model prediction they explain. The invariance property ensures that the explanations do not change if the latent space is modified by an orthogonal transformation preserving its geometrical structure. We show that all the above methods are endowed with these properties.

We end with a broad discussion on the robustness of interpretability methods. When a neural network is invariant with respect to a group of input symmetries, it is legitimate to expect these symmetries to be reflected in the explanations of the model. Based on this observation, we introduce geometric robustness metrics to evaluate the faithfulness of interpretability methods with respect to these symmetries. We show that many interpretability methods fall short of these expectations. This leads us to re-emphasize the importance of using interpretability methods with healthy skepticism.
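
To make the label-free aggregation described above more concrete, here is an illustrative sketch, assuming a PyTorch encoder that maps an input vector to a latent vector; the function name and the choice of Gradient x Input as the underlying attribution are assumptions made for illustration, not the thesis implementation.

import torch

def label_free_gradient_x_input(encoder, x):
    # Illustrative sketch (not the thesis code): aggregate Gradient x Input
    # over all latent neurons, weighting each neuron by its own activation.
    x = x.detach().clone().requires_grad_(True)
    h = encoder(x)                                 # latent representation f(x)
    # Inner product of f(x) with a detached copy of itself: differentiating
    # this scalar weights each neuron's gradient by that neuron's activation.
    (h * h.detach()).sum().backward()
    return x.grad * x                              # importance per input feature

For example, encoder could be torch.nn.Sequential(torch.nn.Linear(10, 4), torch.nn.ReLU()) applied to a 10-dimensional input, in which case the returned tensor assigns one importance score to each of the 10 input features.

Similarly, the corpus decomposition performed by SimplEx can be sketched as a small optimization over mixture weights; the softmax parameterization and the optimizer below are assumptions chosen for brevity, not the method as implemented in the thesis. Here h_corpus is a (corpus size x latent dimension) tensor of corpus representations and h_test is the latent representation to decompose.

def corpus_decomposition(h_test, h_corpus, steps=500, lr=0.1):
    # Illustrative sketch of the SimplEx idea: find non-negative weights
    # summing to one so that the weighted mixture of corpus representations
    # approximates the test representation.
    logits = torch.zeros(h_corpus.shape[0], requires_grad=True)
    optimizer = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        weights = torch.softmax(logits, dim=0)     # stays on the simplex
        residual = weights @ h_corpus - h_test     # reconstruction error
        residual.pow(2).sum().backward()
        optimizer.step()
    return torch.softmax(logits, dim=0).detach()   # one weight per corpus example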
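
The geometric robustness idea can likewise be illustrated with a toy check; the score below is a simplified stand-in (cosine similarity between explanations under a cyclic shift), not the formal group-theoretic metrics developed in the thesis. It assumes a model that is invariant to cyclic shifts of its input, in which case a feature-importance explanation would be expected to shift along with the input.

def shift_equivariance_score(explain, x, shift):
    # Illustrative robustness check: for a shift-invariant model, the
    # explanation of a shifted input should match the shifted explanation
    # of the original input; a score of 1.0 means perfect agreement.
    e_of_shifted = explain(torch.roll(x, shifts=shift))
    shifted_e = torch.roll(explain(x), shifts=shift)
    return torch.nn.functional.cosine_similarity(
        e_of_shifted.flatten(), shifted_e.flatten(), dim=0
    )

Here, explain could be any attribution method returning a tensor of the same shape as x, for instance the label-free importance sketched above with a fixed encoder.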

Date

2024-04-01

Advisors

van der Schaar, Mihaela

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights and licensing

Except where otherwise noted, this item's license is described as Attribution 4.0 International (CC BY 4.0)
Sponsorship
Aviva Fellowship