Rethinking Machine Learning for Heterogeneous Treatment Effect Estimation
Abstract
The need to estimate the causal effect that a treatment, policy, or other intervention had on an outcome variable arises ubiquitously in fields ranging from economics to marketing and medicine. Historically, population average effects have been the main estimand of interest; however, a growing focus on personalization has led to a recent surge of interest in using machine learning for heterogeneous treatment effect estimation, and the machine learning toolbox for this problem has expanded rapidly over the last decade. Because such effect estimates are often of interest in high-stakes environments, one needs to be able to identify the best method from this ever-growing toolbox for a given application – yet this turns out to be much more difficult than in more standard machine learning problems. In canonical supervised learning settings, choosing between different methods is relatively straightforward, as their prediction performance can simply be evaluated against held-out labels on validation datasets. In the context of treatment effect estimation, however, no such labels exist, because ground truth individualized treatment effects are never observed in practice – which substantially complicates the choice between available methods. Thus, the bottleneck for obtaining actionable personalized effect estimates in practice is no longer a lack of good candidate estimators, but a lack of understanding of when to use which of the many existing methods – and why.
With the ultimate goal of providing new fundamental insights into heterogeneous treatment effect estimation as a machine learning problem, this thesis therefore aims to build a deeper understanding of the machine learning challenges inherent to different classes of effect estimation problems through theoretical and empirical studies. We investigate the success and failure modes of different approaches to effect estimation, model evaluation, and model selection in this context, and provide new methodology that fills identified gaps where necessary. At a higher level, we also use the discovered insights to reevaluate the research priorities that have been set in the machine learning literature on the topic thus far, and to highlight important problem characteristics and research directions that have remained relatively understudied.
We explore four subproblems in detail that fall within the landscape of machine learning for heterogeneous treatment effect estimation. First, we study how to best design machine learning methods for estimating heterogeneous treatment effects, and demonstrate theoretically and empirically that methods that target treatment effects directly often perform better than those that simply predict outcomes under different treatment choices separately. We show that this can be achieved not only by using multi-stage estimators adapted from the statistics literature, but also by designing new deep learning architectures that can be trained in an end-to-end fashion. Second, we take a critical look at model evaluation practices in the machine learning literature on this topic, and demonstrate that conclusions on the relative performance of different estimation strategies depend heavily on the data-generating mechanisms used to compare them empirically – which, due to their simulated nature, do not necessarily reflect likely real-world problem characteristics. We also discuss the (dis)advantages of alternative types of benchmark datasets and evaluation metrics. Third, we study the model selection dilemma that the absence of ground truth labels causes in this context, and carefully design an empirical study to compare different model selection criteria that have been used in the literature. We show that there is a complex interplay between selection strategies, candidate estimators, and the data used for comparing them, and that no model selection criterion always wins. Finally, we study the challenges that arise when we allow for additional problem characteristics often present in real-world medical applications. In particular, we consider heterogeneous treatment effect estimation from time-to-event data in the presence of censoring and competing events, and longitudinal treatment effect estimation problems with informative sampling.
We highlight that these mechanisms further coarsen the observed data relative to the distribution of interest, leading to additional covariate shifts that need to be accounted for, and we investigate strategies for doing so.
Overall, we highlight that while the machine learning literature on heterogeneous treatment effect estimation has thus far focused almost exclusively on tackling confounding-induced covariate shifts, the absence of the ground truth label of interest is an orthogonal problem characteristic that has received much less attention yet equally affects all stages of the machine learning pipeline for heterogeneous treatment effect estimation. We also show that confounding is not the only source of covariate shift in many heterogeneous treatment effect estimation problems encountered in practice, and that distribution shifts arising from different sources of missingness may therefore have to be considered more holistically. Finally, we demonstrate that, across all problem settings we consider, the relative performance of different models is heavily influenced by the (often simulated) data used to compare them. We thus note that the lack of realistic datasets for benchmarking and testing methods in this area remains a crucial problem to tackle in the future, as such datasets would be needed to ensure that methods for heterogeneous treatment effect estimation are truly ready for use in practice.
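The contrast between indirect and direct effect estimation drawn in the first subproblem can be illustrated with a minimal simulation. The sketch below is an assumption-laden toy example, not one of the estimators developed in this thesis: it compares a "T-learner", which fits a separate outcome model per treatment arm and subtracts the predictions, with a "DR-learner"-style direct approach, which regresses a doubly robust pseudo-outcome on the covariates. Treatment is assumed randomized with a known propensity of 0.5, outcomes are linear, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=(n, 1))          # single covariate
e = 0.5                              # known propensity (randomized treatment)
w = rng.binomial(1, e, size=n)       # binary treatment assignment
tau = 1.0 + 2.0 * x[:, 0]            # true heterogeneous treatment effect
y = x[:, 0] + w * tau + rng.normal(scale=0.5, size=n)  # observed outcome

def fit_linear(x_fit, y_fit):
    """Least-squares linear regression; returns a prediction function."""
    X = np.c_[np.ones(len(x_fit)), x_fit]
    beta, *_ = np.linalg.lstsq(X, y_fit, rcond=None)
    return lambda x_new: np.c_[np.ones(len(x_new)), x_new] @ beta

# Indirect ("T-learner"): model each arm's outcome separately, then subtract.
mu1 = fit_linear(x[w == 1], y[w == 1])
mu0 = fit_linear(x[w == 0], y[w == 0])
tau_t = mu1(x) - mu0(x)

# Direct ("DR-learner"-style): regress a doubly robust pseudo-outcome on x,
# so the second-stage model targets the treatment effect itself.
mu_obs = np.where(w == 1, mu1(x), mu0(x))
pseudo = mu1(x) - mu0(x) + (w / e - (1 - w) / (1 - e)) * (y - mu_obs)
tau_dr = fit_linear(x, pseudo)(x)
```

In this correctly specified linear setting both strategies recover the true effect; the thesis's argument concerns the regimes where they diverge, e.g. when the outcome surfaces are complex but the effect itself is simple.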

