
Ranking the information content of distance measures

Published version
Peer-reviewed

Authors

Zeni, Claudio 
Csányi, Gábor 
Laio, Alessandro 

Abstract

Real-world data typically contain a large number of features that are often heterogeneous in nature, relevance, and units of measure. When assessing the similarity between data points, one can build various distance measures using subsets of these features. Finding a small set of features that still retains sufficient information about the dataset is important for the successful application of many statistical learning approaches. We introduce a statistical test that can assess the relative information retained when using two different distance measures, and determine if they are equivalent, independent, or if one is more informative than the other. This ranking can in turn be used to identify the most informative distance measure and, therefore, the most informative set of features, out of a pool of candidates. To illustrate the general applicability of our approach, we show that it reproduces the known importance ranking of policy variables for Covid-19 control, and also identifies compact yet informative descriptors for atomic structures. We further provide initial evidence that the information asymmetry measured by the proposed test can be used to infer relationships of causality between the features of a dataset. The method is general and should be applicable to many branches of science.

Description

Acknowledgements: A.G., C.Z., and A.L. gratefully acknowledge support from the European Union’s Horizon 2020 research and innovation program (grant number 824143, MaX ‘Materials design at the eXascale’ Centre of Excellence). The authors would like to thank M. Carli, D. Doimo, and I. Macocco (SISSA) for discussions, M. Caro (Aalto University) for valuable help with the TurboGap code, and D. Frenkel (University of Cambridge) and N. Bernstein (US Naval Research Laboratory) for useful feedback on the manuscript.

Keywords

feature selection, causality detection, information theory

Journal Title

PNAS Nexus

Journal ISSN

2752-6542

Volume Title

1

Publisher

Oxford University Press

Sponsorship

Horizon 2020 Framework Programme (824143)