You shouldn’t trust me: Learning models which conceal unfairness from multiple explanation methods.

Published version
Peer-reviewed

Authors

Dimanov, Botty 
Bhatt, Umang 

Abstract

Transparency of algorithmic systems is an important area of research and has been discussed as a way for end-users and regulators to develop appropriate trust in machine learning models. One popular approach, LIME [23], even suggests that model explanations can answer the question “Why should I trust you?”. Here we show a straightforward method for modifying a pre-trained model to manipulate the output of many popular feature importance explanation methods with little change in accuracy, thus demonstrating the danger of trusting such explanation methods. We show how this explanation attack can mask a model’s discriminatory use of a sensitive feature, raising strong concerns about using such explanation methods to check the fairness of a model.
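
The abstract does not give the attack in detail; the following is a minimal, hypothetical sketch of one way such an explanation attack could work, assuming a differentiable classifier on tabular data and a gradient-based feature-importance method. The names (SENSITIVE_IDX, ALPHA, attack_step) and the loss formulation are illustrative assumptions, not the authors' code: the idea is to fine-tune the trained model with an added penalty on the input gradient of the sensitive feature, so that saliency-style explanations assign it near-zero importance while the model's predictions change little.

import torch
import torch.nn.functional as F

SENSITIVE_IDX = 3   # column index of the sensitive feature (illustrative)
ALPHA = 1.0         # weight of the explanation-masking penalty (illustrative)

def attack_step(model, optimizer, x, y):
    # Inputs must carry gradients so we can compute saliency during training.
    x = x.clone().requires_grad_(True)
    logits = model(x)
    task_loss = F.cross_entropy(logits, y)

    # Input gradient of the summed logits: a simplified stand-in for what
    # gradient-based saliency methods report as feature importance
    # (methods typically use the predicted-class logit instead).
    grads = torch.autograd.grad(logits.sum(), x, create_graph=True)[0]

    # Penalize only the *apparent* importance of the sensitive feature;
    # the model may still rely on it, but saliency maps will not show it.
    mask_loss = grads[:, SENSITIVE_IDX].abs().mean()

    loss = task_loss + ALPHA * mask_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return task_loss.item(), mask_loss.item()

Under these assumptions, a few epochs of fine-tuning with the combined loss would be expected to leave the task loss, and hence accuracy, nearly unchanged while driving the reported importance of the sensitive feature toward zero, which is the behaviour the abstract warns about.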

Journal Title

Frontiers in Artificial Intelligence and Applications: ECAI 2020

Conference Name

ECAI 2020 (European Conference on Artificial Intelligence)

Journal ISSN

0922-6389 (print)
1879-8314 (online)

Publisher

IOS Press
Sponsorship

Leverhulme Trust (RC-2015-067)
Alan Turing Institute (Unknown)
Engineering and Physical Sciences Research Council (1778323)