Interpretable Deep Learning: Beyond Feature-Importance with Concept-based Explanations
Deep Neural Network (DNN) models are challenging to interpret because they are highly complex and non-linear. This lack of interpretability (1) inhibits adoption in safety-critical applications, (2) makes it challenging to debug existing models, and (3) prevents us from extracting valuable knowledge. Explainable AI (XAI) research aims to increase the transparency of DNN model behaviour to improve interpretability. Feature importance explanations are the most popular interpretability approach. They show the importance of each input feature (e.g., pixel, patch, word vector) to the model’s prediction. However, we hypothesise that feature importance explanations have two main shortcomings: they cannot describe the complexity of a DNN’s behaviour with sufficient (1) fidelity and (2) richness. Fidelity and richness are essential because different tasks, users, and data types require specific levels of trust and understanding. The goal of this thesis is to showcase the shortcomings of feature importance explanations and to develop explanation techniques that describe DNN behaviour with greater richness. We design an adversarial explanation attack to highlight the infidelity and inadequacy of feature importance explanations. Our attack modifies the parameters of a pre-trained model. It uses fairness as a proxy measure for the fidelity of an explanation method to demonstrate that the apparent importance of a feature does not reveal anything reliable about the fairness of a model. Hence, regulators or auditors should not rely on feature importance explanations to measure or enforce standards of fairness. As one solution, we formulate five levels of semantic richness against which to evaluate explanations, and propose two function decomposition frameworks (DGINN and CME) to extract explanations from DNNs at a semantically higher level than feature importance explanations.
Concept-based approaches provide explanations in terms of atomic human-understandable units (e.g., wheel or door) rather than individual raw features (e.g., pixels or characters). Our function decomposition frameworks can extract specific class representations from 5% of the network parameters and concept representations with an average per-concept F1 score of 86%. Finally, the CME framework makes it possible to compare concept-based explanations, contributing to the scientific rigour of evaluating interpretability methods.
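
To make the reported metric concrete, the following is a minimal sketch of how an average per-concept F1 score could be computed. It assumes binary concept labels (concept present or absent per sample) and an unweighted mean over concepts; the thesis may handle multi-valued concepts or use a different averaging scheme. The function names `f1_score` and `average_per_concept_f1` are hypothetical and are not taken from the DGINN or CME code.

```python
def f1_score(y_true, y_pred):
    """F1 score for one binary concept (1 = concept present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0 when the concept is never
    # present and never predicted.
    return 2 * tp / denom if denom else 0.0


def average_per_concept_f1(true_concepts, pred_concepts):
    """Unweighted mean of the per-concept F1 scores.

    Both arguments are lists of rows, one row per sample, where each
    row holds one binary label per concept.
    """
    n_concepts = len(true_concepts[0])
    scores = []
    for c in range(n_concepts):
        col_true = [row[c] for row in true_concepts]
        col_pred = [row[c] for row in pred_concepts]
        scores.append(f1_score(col_true, col_pred))
    return sum(scores) / len(scores)
```

Averaging per concept rather than pooling all predictions gives each concept equal weight, so a rarely occurring concept that is extracted poorly still lowers the overall score.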