Modern Methods in Semiparametric Statistics
Abstract
The ethos that “all models are wrong but some are useful” drives many modern statistical advancements, with methods that are lean with regard to modelling assumptions often out-competing their classical counterparts when deployed on modern datasets. Such principles underpin the success of many modern machine learning methods. It is therefore of interest to look beyond parametric models. Semiparametric models, defined in terms of infinite-dimensional parameters, provide extra flexibility over potentially misspecified parametric models. However, the desire to construct estimators for parameters of interest that are optimal over the broad class of distributions given by these models introduces a number of challenges. This thesis contributes to a growing body of literature that aims to address these challenges.
This thesis consists of four chapters. The first chapter summarises some relevant literature on semiparametric theory. A key concept in semiparametric statistics is that of the semiparametric efficiency bound for a target parameter, defined as the infimum of the asymptotic variances of all its regular estimators. Whilst estimators which attain this efficiency bound can be constructed, many authors tend not to recommend them in practice, citing poor empirical performance. This is the focus of the second chapter. A theoretical hardness result is presented, showing it may not be possible to construct an estimator achieving the semiparametric efficiency bound uniformly over a class of plausible data-generating distributions. This hardness result motivates a class of robust semiparametric estimators, with the robust semiparametric efficiency (ROSE) bound defined to be the infimum of the variances of such estimators. Further, a novel ROSE random forest procedure is developed that achieves this robust efficiency bound, with simulations and real data analyses confirming its improved performance over classical semiparametric efficient estimators. Code to fit ROSE random forests is provided as the R package rose available from https://github.com/elliot-young/rose/.
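As background for the efficiency bound discussed above (a standard result from semiparametric theory, not a quotation from the thesis), the bound can be expressed through influence functions: any regular asymptotically linear estimator $\hat\psi_n$ of a pathwise differentiable parameter $\psi(P)$ with influence function $\varphi$ satisfies
$$
\sqrt{n}\,\big(\hat\psi_n - \psi(P)\big) \;\rightsquigarrow\; N\!\big(0,\, \operatorname{Var}_P \varphi(O)\big),
\qquad
\operatorname{Var}_P \varphi(O) \;\ge\; \operatorname{Var}_P \varphi_{\mathrm{eff}}(O),
$$
where $\varphi_{\mathrm{eff}}$ denotes the efficient influence function. The lower bound $\operatorname{Var}_P \varphi_{\mathrm{eff}}(O)$ is the semiparametric efficiency bound; the hardness result of Chapter 2 concerns whether this bound can be attained uniformly over a class of distributions $P$.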
The third and fourth chapters study estimation for grouped data, where observa- tions are arranged in independent groups but may exhibit within-group dependence. Ex- isting approaches estimate model parameters through weighted least squares or quasi- maximum likelihood approaches, with optimal weights typically estimated by maximising a (restricted) likelihood from random effects modeling or by using generalised estimating equations. A new ‘sandwich loss’ is introduced whose population minimiser coincides with the weights of these approaches when the parametric forms for the conditional covariance are well-specified, but can yield arbitrarily large improvements in fixed effects parameter estimation accuracy when they are not. A corresponding ‘sandwich regression’ methodology is developed and studied in two settings. Firstly in Chapter 3 an empirical sandwich loss is introduced for datasets of small to moderate sample sizes, where the conditional first moment of the response takes a fully parametric form, such as in grouped linear models. This empirical sandwich loss takes the form of a Jackknife estimate of the variance, defined in terms of a finite-dimensional dispersion parameter. Code to perform this sandwich regression is made available from https://github.com/elliot-young/parametric.sand.reg/. In the fourth chapter the partially linear model is studied. The empirical sandwich loss is generalised beyond parametric models, with weights estimated by a new ‘sandwich boosting’ gradient boosting scheme. Sandwich boosting can be performed using the R package sandwich.boost from https://github.com/elliot-young/sandwich.boost/.
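To make the ‘sandwich loss’ idea concrete, the following is a hedged sketch under an assumed grouped linear model $Y_i = X_i\beta + \varepsilon_i$ (the notation here is illustrative rather than the thesis's own). Given working weight matrices $W_i$, the weighted least squares estimator and its sandwich-form asymptotic variance are
$$
\hat\beta_W = \Big(\sum_i X_i^\top W_i X_i\Big)^{-1} \sum_i X_i^\top W_i Y_i,
\qquad
\operatorname{Var}\big(\hat\beta_W\big) \approx
\Big(\sum_i X_i^\top W_i X_i\Big)^{-1}
\Big(\sum_i X_i^\top W_i \Sigma_i W_i X_i\Big)
\Big(\sum_i X_i^\top W_i X_i\Big)^{-1},
$$
where $\Sigma_i = \operatorname{Cov}(\varepsilon_i \mid X_i)$. When $\Sigma_i$ is correctly modelled, the variance-minimising choice is $W_i \propto \Sigma_i^{-1}$, recovering the likelihood-based and generalised estimating equation weights; the sandwich loss instead targets weights minimising an estimate of this sandwich variance directly, which is why it remains competitive, and can do arbitrarily better, when the conditional covariance model is misspecified.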
