## A concentration inequality based statistical methodology for inference on covariance matrices and operators

##### Authors

Kashlak, A. B.

##### Advisors

Aston, John A D

Nickl, Richard

##### Date

2017-06-23

##### Awarding Institution

University of Cambridge

##### Author Affiliation

Department of Pure Maths and Mathematical Statistics

##### Qualification

Doctor of Philosophy (PhD)

##### Language

English

##### Type

Thesis

##### Citation

Kashlak, A. B. (2017). A concentration inequality based statistical methodology for inference on covariance matrices and operators (Doctoral thesis). https://doi.org/10.17863/CAM.13757

##### Abstract

In the modern era of high and infinite dimensional data, classical statistical methodology is often rendered inefficient and ineffective when confronted with such big data problems as arise in genomics, medical imaging, speech analysis, and many other areas of research. Many problems manifest when the practitioner is required to take into account the covariance structure of the data during his or her analysis, which takes on the form of either a high dimensional low rank matrix or a finite dimensional representation of an infinite dimensional operator acting on some underlying function space. Thus, novel methodology is required to estimate, analyze, and make inferences concerning such covariances.

In this manuscript, we propose using tools from the concentration of measure literature (a theory that arose in the latter half of the 20th century from connections between geometry, probability, and functional analysis) to construct rigorous descriptive and inferential statistical methodology for covariance matrices and operators. A variety of concentration inequalities are considered, which allow for the construction of nonasymptotic dimension-free confidence sets for the unknown matrices and operators. Given such confidence sets, a wide range of estimation and inferential procedures can be and are subsequently developed.

For high dimensional data, we propose a method to search a concentration inequality based confidence set using a binary search algorithm for the estimation of large sparse covariance matrices. Both sub-Gaussian and sub-exponential concentration inequalities are considered and applied to both simulated data and to a set of gene expression data from a study of small round blue-cell tumours. For infinite dimensional data, which is also referred to as functional data, we use a celebrated result, Talagrand’s concentration inequality, in the Banach space setting to construct confidence sets for covariance operators. From these confidence sets, three different inferential techniques emerge: the first is a k-sample test for equality of covariance operators; the second is a functional data classifier, which makes its decisions based on the covariance structure of the data; the third is a functional data clustering algorithm, which incorporates the concentration inequality based confidence sets into the framework of an expectation-maximization algorithm. These techniques are applied to simulated data and to speech samples from a set of spoken phoneme data.

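
The idea of binary searching a confidence set for a sparse covariance estimate, mentioned above for the high dimensional case, can be illustrated with a minimal sketch. It is not the thesis's procedure: the function names `hard_threshold` and `sparsest_in_ball` are hypothetical, the confidence set is assumed to be a Frobenius-norm ball around the empirical covariance, and the radius is taken as a given input rather than derived from a concentration inequality. The Frobenius norm is chosen here because the distance between the thresholded and empirical matrices is then nondecreasing in the threshold, which is what makes a binary search valid. Calling `sparsest_in_ball(np.cov(X, rowvar=False), radius)` on an n-by-p data matrix `X` returns the most heavily thresholded matrix that remains inside the ball.

```python
import numpy as np


def hard_threshold(cov, t):
    """Zero every off-diagonal entry whose absolute value is at most t."""
    out = cov.copy()
    mask = np.abs(out) <= t
    np.fill_diagonal(mask, False)  # never threshold the variances
    out[mask] = 0.0
    return out


def sparsest_in_ball(cov, radius, tol=1e-8):
    """Binary search for the largest hard threshold whose thresholded matrix
    stays within `radius` (Frobenius norm, an assumption of this sketch)
    of the empirical covariance `cov`."""

    def inside(t):
        return np.linalg.norm(hard_threshold(cov, t) - cov, "fro") <= radius

    lo, hi = 0.0, float(np.max(np.abs(cov)))
    if inside(hi):
        # Even the fully diagonal matrix lies in the ball.
        return hard_threshold(cov, hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if inside(mid):
            lo = mid  # still inside the ball: try a larger threshold
        else:
            hi = mid  # left the ball: shrink the threshold
    return hard_threshold(cov, lo)
```
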
Lastly, we take a closer look at a key tool used in the construction of concentration based confidence sets: Rademacher symmetrization. The symmetrization inequality, which arises in the probability in Banach spaces literature, is shown to be connected with optimal transport theory and specifically the Wasserstein distance. This insight is used to improve the symmetrization inequality resulting in tighter concentration bounds to be used in the construction of nonasymptotic confidence sets. A variety of other applications are considered including tests for data symmetry and tightening inequalities in Banach spaces. An R package for inference on covariance operators is briefly discussed in an appendix chapter.
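
For orientation, the classical Rademacher symmetrization inequality referred to above is usually stated as follows. This is the standard textbook form, not the improved version developed in the thesis, and the notation is assumed here: $X_1, \dots, X_n$ are independent integrable random variables taking values in a separable Banach space with norm $\|\cdot\|$, and $\varepsilon_1, \dots, \varepsilon_n$ are independent Rademacher signs (equal to $\pm 1$ with probability $1/2$ each), independent of the $X_i$.

```latex
\[
  \mathbb{E}\,\Bigl\| \sum_{i=1}^{n} \bigl( X_i - \mathbb{E} X_i \bigr) \Bigr\|
  \;\le\;
  2\,\mathbb{E}\,\Bigl\| \sum_{i=1}^{n} \varepsilon_i X_i \Bigr\|
\]
```

The constant 2 on the right-hand side comes from comparing each $X_i$ with an independent copy of itself and then applying the triangle inequality.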

##### Keywords

Sparsity, Thresholding estimator, Procrustes, Functional Data, Talagrand's Inequality, Log Sobolev Inequality, Sub-Gaussian, Sub-Exponential, Classification, Clustering, Banach Space, Rademacher Symmetrization, Wasserstein Distance, High Dimensional Data

##### Sponsorship

National Security Agency Graduate Fellowship Award

##### Identifiers

This record's DOI: https://doi.org/10.17863/CAM.13757

##### Rights

Attribution-ShareAlike 4.0 International

Licence URL: https://creativecommons.org/licenses/by-sa/4.0/