Modern k-Nearest Neighbour Methods in Entropy Estimation, Independence Testing and Classification
University of Cambridge
Doctor of Philosophy (PhD)
MetadataShow full item record
Berrett, T. B. (2017). Modern k-Nearest Neighbour Methods in Entropy Estimation, Independence Testing and Classification (Doctoral thesis). https://doi.org/10.17863/CAM.13756
Nearest neighbour methods are a classical approach in nonparametric statistics. The k-nearest neighbour classifier can be traced back to the seminal work of Fix and Hodges (1951) and they also enjoy popularity in many other problems including density estimation and regression. In this thesis we study their use in three different situations, providing new theoretical results on the performance of commonly-used nearest neighbour methods and proposing new procedures that are shown to outperform these existing methods in certain settings. The first problem we discuss is that of entropy estimation. Many statistical procedures, including goodness-of-fit tests and methods for independent component analysis, rely critically on the estimation of the entropy of a distribution. In this chapter, we seek entropy estimators that are efficient and achieve the local asymptotic minimax lower bound with respect to squared error loss. To this end, we study weighted averages of the estimators originally proposed by Kozachenko and Leonenko (1987), based on the k-nearest neighbour distances of a sample. A careful choice of weights enables us to obtain an efficient estimator in arbitrary dimensions, given sufficient smoothness, while the original unweighted estimator is typically only efficient in up to three dimensions. A related topic of study is the estimation of the mutual information between two random vectors, and its application to testing for independence. We propose tests for the two different situations of the marginal distributions being known or unknown and analyse their performance. Finally, we study the classical k-nearest neighbour classifier of Fix and Hodges (1951) and provide a new asymptotic expansion for its excess risk. We also show that, in certain situations, a new modification of the classifier that allows k to vary with the location of the test point can provide improvements. This has applications to the field of semi-supervised learning, where, in addition to labelled training data, we also have access to a large sample of unlabelled data.
Nonparametric statistics, Nearest neighbour methods, Entropy Estimation, Independence Testing, Classification
My PhD was funded by a Sims Scholarship.
This record's DOI: https://doi.org/10.17863/CAM.13756