Aligning Models for Human-Centric Decision Systems


Abstract

When employed for important decisions, machine learning models do not operate in a vacuum. While they have achieved impressive results across a variety of tasks, they are by no means faultless in their decisions and, without any responsibility or accountability on the part of algorithms, human oversight is needed to ensure their safe deployment. In this thesis we explore the development and alignment of machine learning models for this larger decision system, where careful consideration must be given to the interplay between machine learning models and the humans that will use them. We consider a number of pressing core problems for developing such systems. First, behavioural inference over human decision makers in order to understand their goals, diagnose their weaknesses, and inform support policies. In particular, in Chapter 2 we develop the first method for scalable Bayesian inverse reinforcement learning that can be used in large and complicated environments such as medical decision support. Second, the design of systems that moderate the information flow between predictive machine learning models and human users in order to achieve the best overall policy on a task. In Chapter 3 we consider how best to personalise explanations of a machine learning model to a human user in order to optimise overall task performance, while in Chapter 4 we instil models with a sense of local cultural nuance in order to improve their ability to detect violations in a content moderation setting, and propose that a larger reasoning model use them as tools to better assist a human moderator. Third, improving the trustworthiness and alignment of language models in particular, as powerful generative models that can form the backbone of human-centric decision systems given their particularly natural way of interfacing with humans and impressive general abilities. In Chapter 5 we consider the problem of detecting when the output of a language model is not consistent with an internal notion of “truth”, which can occur inconspicuously and is clearly damaging for important decisions. Finally, in Chapter 6 we work on the problem of preference alignment in large language models, diagnosing instabilities in the common reinforcement learning from human feedback paradigm and presenting a new method for improved credit assignment that stabilises and accelerates training. In each case we conduct an investigation of the topic, provide algorithmic solutions for the challenge at hand, and validate our proposals through experiments on both simulations and real-world data. Our results demonstrate consistent improvements over the contemporary state of the art and highlight the importance of taking a holistic approach to decision systems rather than focusing on machine learning methods in isolation.

Date

2024-03-01

Advisors

van der Schaar, Mihaela

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights and licensing

Except where otherwise noted, this item's license is described as All rights reserved

Sponsorship

EPSRC (2482741)
EPSRC iCASE Scholarship with Microsoft Research