
Statistical Methods for Off-Policy Learning



Abstract

Sequential decision problems are ubiquitous and have been studied across many areas of science, engineering, and business. The task of learning good policies from historical records collected under another policy—commonly referred to as off-policy learning—has mainly been studied in two settings: dynamic treatment regimes (DTRs), which focus on controlling for confounding in medical problems with short decision horizons, and Markov decision processes (MDPs) in reinforcement learning (RL), which focus on dimension reduction in closed systems such as games. Many real-world problems bear resemblance to both MDPs and DTRs. Yet the absence of a general methodology compels practitioners to choose one framework or the other, often resulting in inappropriate causal assumptions and statistically inefficient estimation procedures. This has limited the wider application of off-policy learning to many real-world problems.
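For context, the off-policy evaluation target can be written in its standard importance-sampling form (a textbook expression, not notation taken from the thesis itself):

```latex
% Off-policy value of a target policy \pi under trajectories collected
% by a behaviour policy \pi_b (standard textbook formulation):
V(\pi) \;=\; \mathbb{E}_{\pi_b}\!\left[
    \left( \prod_{t=1}^{T} \frac{\pi(A_t \mid H_t)}{\pi_b(A_t \mid H_t)} \right)
    \sum_{t=1}^{T} R_t
\right],
```

where $H_t$ denotes the observed history, $A_t$ the action, and $R_t$ the reward at time $t$.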

Motivated by the dynamic pricing problem encountered in container logistics, this thesis takes steps towards a general off-policy learning methodology by addressing two key challenges. The first contribution is a set of graphical criteria in Chapter 3 that assert when the value of a policy is “identified” in a general decision process that encompasses both DTRs and MDPs. This is based on the theory of causal identification in problems described by acyclic directed mixed graphs (ADMGs). The conditions generalise the existing notion of “sequential ignorability” from the DTR literature by introducing a collection of variables—referred to as the “state”—that fully summarises the system dynamics and allows the decision-maker to discard the rest of the history.
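As a rough sketch of the kind of condition involved (the thesis's graphical criteria for ADMGs are more general than this), sequential ignorability and its state-based counterpart can be written as conditional independence statements:

```latex
% Sequential ignorability (DTR-style): each action is as-good-as-random
% given the full observed history H_t:
Y(\bar{a}) \perp\!\!\!\perp A_t \mid H_t, \qquad t = 1, \dots, T.
% With a summarising "state" S_t \subseteq H_t, conditioning on S_t alone
% suffices, so the decision-maker may discard the rest of the history:
Y(\bar{a}) \perp\!\!\!\perp A_t \mid S_t, \qquad t = 1, \dots, T,
```

where $Y(\bar{a})$ denotes the potential outcome under the action sequence $\bar{a}$.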

The second contribution is a simplified approach to constructing semiparametric estimators in sequential decision problems in Chapter 4. By first computing tangent spaces directly from directed acyclic graphs (DAGs)—possibly with additional time-invariance assumptions—the chapter shows how the efficient influence function can be calculated via the chain rule by using a set of common “building blocks” shared across related decision problems. This approach is based on a discretisation assumption and requires verifying the validity of the conjectured influence function. A separate contribution of this thesis is a novel calculus of influence functions in Chapter 2 that works for Hilbert-valued functionals that can be expressed as a composition of a pathwise and a Hadamard differentiable map. This approach similarly works via the chain rule, but is based on adjoint operators between Hilbert spaces, and does not require any discretisation assumption.
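To illustrate the kind of “building block” involved (a standard single-decision result, not a formula reproduced from the thesis), the efficient influence function of the one-step policy value takes the familiar augmented inverse-propensity form:

```latex
% Efficient influence function of the one-step policy value, with
% outcome regression Q(x, a) = E[Y | X = x, A = a] and behaviour
% policy \pi_b (standard semiparametric-efficiency result):
\phi(X, A, Y) \;=\; \frac{\pi(A \mid X)}{\pi_b(A \mid X)}
    \bigl( Y - Q(X, A) \bigr)
    \;+\; \sum_{a} \pi(a \mid X)\, Q(X, a) \;-\; V(\pi).
```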

At the end of the thesis in Chapter 5, the methodology is applied to the dynamic pricing problem in container logistics. A novel doubly robust estimator that targets the risk-adjusted revenue is presented, and its asymptotic properties are explored in a simulation study. The chapter closes with an empirical case study based on real data from the logistics carrier A.P. Moller - Maersk.
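A minimal sketch of a generic one-step doubly robust (AIPW) value estimator, with illustrative function names and signatures (assumptions for exposition; this is not the thesis's risk-adjusted revenue estimator):

```python
import numpy as np

def doubly_robust_value(X, A, Y, pi_target, pi_behaviour, q_model, actions):
    """One-step doubly robust (AIPW) estimate of a target policy's value.

    Illustrative sketch only. All three model functions are assumed to
    be vectorised over the rows of X:
      pi_target(X, a)    -> P_target(A = a | X)
      pi_behaviour(X, a) -> P_behaviour(A = a | X), e.g. a fitted propensity
      q_model(X, a)      -> fitted outcome regression E[Y | X, A = a]
    """
    # Direct-method term: the outcome model averaged over the target policy.
    dm = sum(pi_target(X, a) * q_model(X, a) for a in actions)
    # Inverse-propensity correction on the observed actions; the estimate
    # stays consistent if either the outcome model or the behaviour-policy
    # model is correct (double robustness).
    weights = pi_target(X, A) / pi_behaviour(X, A)
    return np.mean(dm + weights * (Y - q_model(X, A)))
```

The sample variance of the per-observation terms inside the mean also yields a plug-in standard error, which is how the asymptotic properties of such estimators are typically examined in simulation.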

Date

2025-03-18

Advisors

Zhao, Qingyuan

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights and licensing

Except where otherwise noted, this item's license is described as All rights reserved

Sponsorship

Innovation Fund Denmark and A.P. Moller - Maersk