Scaling Ensemble Distribution Distillation to Many Classes with Proxy Targets

Ryabinin, M; Malinin, A; Gales, M

Accepted version

Peer-reviewed

Repository URI

https://www.repository.cam.ac.uk/handle/1810/330661

Repository DOI

https://doi.org/10.17863/CAM.78106

Files

Accepted version (1.55 MB)

Type

Conference Object

Authors

Ryabinin, M

Malinin, A

Gales, Mark

https://orcid.org/0000-0002-5311-8219

Abstract

Ensembles of machine learning models yield improved system performance as well as robust and interpretable uncertainty estimates; however, their inference costs can be prohibitively high. Ensemble Distribution Distillation (EnD2) is an approach that allows a single model to efficiently capture both the predictive performance and uncertainty estimates of an ensemble. For classification, this is achieved by training a Dirichlet distribution over the ensemble members’ output distributions via the maximum likelihood criterion. Although theoretically principled, this work shows that the criterion exhibits poor convergence when applied to large-scale tasks where the number of classes is very high. Specifically, we show that, for the Dirichlet log-likelihood criterion, classes with low probability induce larger gradients than high-probability classes. Hence, during training, the model focuses on the distribution of the ensemble tail-class probabilities rather than on the probabilities of the correct and closely related classes. We propose a new training objective that minimizes the reverse KL-divergence to a Proxy-Dirichlet target derived from the ensemble. This loss resolves the gradient issues of EnD2, as we demonstrate both theoretically and empirically on the ImageNet, LibriSpeech, and WMT17 En-De datasets, which contain 1,000, 5,000, and 40,000 classes, respectively.

Journal Title

Advances in Neural Information Processing Systems

Conference Name

Neural Information Processing Systems (NeurIPS 2021)

Journal ISSN

1049-5258

Publisher DOI

https://doi.org/10.17863/CAM.78106

Sponsorship

Cambridge Assessment (Unknown)

Collections

Cambridge University Research Outputs