High-dimensional Online Changepoint Detection
The problem of changepoint detection and estimation has a long history, dating back to at least Page (1954, 1955). As modern technological advances, data sets of unprecedented size can be collected at high frequency. This provides statisticians with new challenges in this field. In this thesis we study the online version of the changepoint detection problem in high-dimensional settings. In Chapter 1, we survey the field of changepoint detection. We focus, in particular, on the comparison between the offline problem and the online problem, the contrast between the univariate setting and the high-dimensional setting and inference problems associated with changepoints. In Chapter 2, we introduce a novel method for high-dimensional, online changepoint detection in settings where a multivariate data stream may undergo a change in mean. The procedure works by performing Gaussian likelihood ratio tests against simple alternatives of different scales in each coordinate, and then aggregating test statistics across scales and coordinates. Our algorithm is online in the sense that both its storage requirements and worst-case computational complexity per new observation are independent of the number of previous observations. We prove that the patience, or average run length under the null, of our procedure is at least at the desired nominal level, and provide guarantees on its response delay under the alternative that depend on the sparsity of the vector of mean change. Our procedure shows excellent performance compared to existing methods in the numerical studies. Our algorithm is implemented in the R package ocd, and we also demonstrate its utility on a seismology data set. In Chapter 3, we focus on the problem of inference for high-dimensional online changepoint detection. We propose a confidence interval for the changepoint location. The procedure first identifies coordinates with large signals and then combines univariate confidence intervals constructed from each of these coordinates. We prove that the confidence interval constructed has the desired coverage level and provide a guarantee on the length of the confidence interval. Our procedure also provides an estimate for the effective support of the signal as a byproduct. Simulations confirm the practical effectiveness of our proposal, and we also illustrate its applicability on both US excess deaths data from 2017-2020 and S&P 500 data from the 2007-2008 financial crisis.