Principles of Experimental Design for Big Data Analysis

Big datasets are endemic, but are often notoriously difficult to analyse because of their size, heterogeneity and quality. The purpose of this paper is to open a discourse on the potential for modern decision-theoretic optimal experimental design methods, which by their very nature have traditionally been applied prospectively, to improve the analysis of Big Data through retrospective designed sampling in order to answer particular questions of interest. By appealing to a range of examples, it is suggested that this perspective on Big Data modelling and analysis has the potential for wide generality and advantageous inferential and computational properties. We highlight current hurdles and open research questions surrounding efficient computational optimisation in using retrospective designs. In part, this paper is a call to the optimisation and experimental design communities to work together in the field of Big Data analysis.

Keywords

active learning, big data, dimension reduction, experimental design, sub-sampling

Journal Title

Statistical Science

Journal ISSN

0883-4237
2168-8745

Volume

32

Publisher

Institute of Mathematical Statistics

Publisher DOI

https://doi.org/10.1214/16-STS604

Rights

http://www.rioxx.net/licenses/all-rights-reserved

Sponsorship

CCD was supported by the Australian Research Council’s Discovery Early Career Researcher Award funding scheme (DE160100741). CH would like to gratefully acknowledge support from the Medical Research Council (UK), the Oxford-Man Institute, and the EPSRC UK through the i-like Statistics programme grant. CCD, JMM and KM would like to acknowledge support from the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS). Funding from the Australian Research Council for author KM is gratefully acknowledged.

Collections

Scholarly Works - MRC Biostatistics Unit