Automatic Discovery of the Statistical Types of Variables in a Dataset

Valera, Isabel; Ghahramani, Zoubin

Accepted version

Peer-reviewed

Repository URI

https://www.repository.cam.ac.uk/handle/1810/269733

Repository DOI

https://doi.org/10.17863/CAM.11067

Files

Accepted version (916.39 KB)

Type

Conference Object

Authors

Valera, Isabel

Ghahramani, Zoubin

https://orcid.org/0000-0002-7464-6475

Abstract

A common practice in statistics and machine learning is to assume that the statistical data types (e.g., ordinal, categorical or real-valued) of variables, and usually also the likelihood model, are known. However, as the availability of real-world data increases, this assumption becomes too restrictive. Data are often heterogeneous, complex, and improperly or incompletely documented. Surprisingly, despite their practical importance, there is still a lack of tools to automatically discover the statistical types of, as well as appropriate likelihood (noise) models for, the variables in a dataset. In this paper, we fill this gap by proposing a Bayesian method, which accurately discovers the statistical data types in both synthetic and real data.

Journal Title

INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 70

Conference Name

ICML 2017

Journal ISSN

2640-3498

Volume Title

70

Publisher

PMLR

Publisher DOI

https://doi.org/10.17863/CAM.11067

Rights

http://www.rioxx.net/licenses/all-rights-reserved

Sponsorship

Humboldt Research Fellowship for Postdoctoral Researchers, which funded this research during her stay at the Max Planck Institute for Software Systems; ATI Grant EP/N510129/1; EPSRC Grant EP/N014162/1; Google

Collections

Scholarly Works - Engineering