Sample-Efficient Reinforcement Learning for Spoken Dialogue Systems
Abstract
Conversational Artificial Intelligence (Conversational AI) platforms such as Siri or Alexa have become deeply integrated into daily life in recent years. Despite this widespread adoption, developing a general open-domain dialogue system capable of engaging in natural conversations with humans remains a challenging task.
The first challenge is that dialogue systems operate in non-deterministic environments. Users behave in diverse ways, so it is difficult to specify a single optimal response to a given user input as a demonstration. In addition, reaching the desired outcome often takes multiple turns, and the consequences of an action may not manifest immediately. Consequently, dialogue management, which determines how to respond to users, is commonly approached as a reinforcement learning (RL) problem. RL algorithms are designed to learn when the outcomes associated with states and actions are stochastic. They also handle delayed rewards, enabling agents to account for the long-term consequences of their actions.
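The role of delayed rewards can be illustrated with the standard discounted return, G_t = r_t + γ·G_{t+1}. The sketch below is a minimal illustration only: the per-turn penalty, the final success reward, and the discount factor are hypothetical values, not figures from the thesis.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every turn t of one dialogue."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each turn can reuse the return of the turn after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical four-turn dialogue: a small penalty per turn and a
# success reward that only arrives at the final turn.
rewards = [-1.0, -1.0, -1.0, 20.0]
returns = discounted_returns(rewards, gamma=0.5)
```

Although the first three turns receive only penalties, their returns are positive because the final success reward is propagated back to them.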
The second challenge is the sample efficiency of reinforcement learning algorithms, particularly when the goal is online learning through interaction with humans. Dialogue managers are expected to train effectively from very few dialogues. Model-based reinforcement learning (MBRL) offers a promising way to improve sample efficiency by constructing an environment model capable of predicting user behaviour. However, a noisy environment model is an unavoidable hurdle in MBRL: incorrect predictions from the environment model can hinder rather than aid training. In this thesis, we address the noisy environment model problem in three distinct stages:
The first idea is to extract valuable information from noisy future predictions. Building upon the classical actor-critic architecture, we introduce the Actor-Double-Critic (ADC) algorithm, which adds a model-based critic to increase the information available to the learner. The model-based critic takes the noisy future predictions as input, and ADC ensembles the outputs of both critics to optimize the policy. Through experiments, we demonstrate the robustness of ADC in noisy environments where accurate modeling of user behavior is challenging. Furthermore, ADC exhibits better sample efficiency and stability than its model-free baseline. This work is among the first applications of model-based reinforcement learning to dialogue systems.
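The idea of ensembling two critics can be sketched as follows. The abstract does not state how ADC combines the critics' outputs, so the weighted average, the `weight` parameter, and all function names here are assumptions made purely for illustration.

```python
def ensemble_value(model_free_value, model_based_value, weight=0.5):
    """Blend two critic estimates; a weighted average is an assumed choice."""
    return (1.0 - weight) * model_free_value + weight * model_based_value


def advantage(reward, next_value, current_value, gamma=0.99):
    """One-step advantage estimate: r + gamma * V(s') - V(s)."""
    return reward + gamma * next_value - current_value


# Hypothetical critic outputs for the current and the next dialogue state.
v_now = ensemble_value(model_free_value=2.0, model_based_value=4.0)
v_next = ensemble_value(model_free_value=1.0, model_based_value=3.0)

# The blended values then feed the usual actor-critic policy update.
adv = advantage(reward=1.0, next_value=v_next, current_value=v_now, gamma=0.5)
```

In this sketch the model-based estimate only shifts the blended value, so a poor prediction is dampened by the model-free critic rather than used on its own.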
Instead of learning which information is useful, as ADC does, the second approach uses domain knowledge to select it. Specifically, actions that lead to repetition or to undesirable termination of the dialogue are excluded from the action space. Drawing inspiration from action masks, a manually crafted component traditionally used in dialogue managers to eliminate unfavorable actions, we introduce the Trainable-Action-Mask (TAM) algorithm. TAM automates the construction of action masks by using an environment model. The simplified environment model in TAM focuses solely on predicting repetitive patterns and unfavorable dialogue terminations. Experimental results indicate that the simplified environment model enables faster learning and that its accuracy is easier to evaluate.
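How an action mask restricts a policy can be sketched with a masked softmax. In TAM the mask is produced by the environment model; in this minimal sketch it is supplied by hand, and the logits, mask values, and function name are hypothetical.

```python
import math


def masked_softmax(logits, mask):
    """Softmax over allowed actions only; masked actions get probability 0."""
    if not any(mask):
        raise ValueError("at least one action must remain allowed")
    # Subtract the largest allowed logit for numerical stability.
    top = max(l for l, allowed in zip(logits, mask) if allowed)
    exps = [math.exp(l - top) if allowed else 0.0
            for l, allowed in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical policy logits for three system actions; the second action
# is assumed to have been flagged as leading to a repetitive dialogue.
logits = [2.0, 1.0, 2.0]
mask = [True, False, True]
probs = masked_softmax(logits, mask)
```

Because the masked action receives zero probability, the policy can never sample it, and the remaining probability mass is shared among the allowed actions.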
The final solution uses domain knowledge to optimise the dialogue policy directly, eliminating the need to construct an environment model. We present Loop-Clipping Policy Optimisation (LCPO), a method that directly re-estimates the advantages of unfavorable actions in the dialogue management policy. LCPO requires no environment model training, introduces no additional hyper-parameters to tune, and is straightforward to implement. LCPO realises online learning on the Cambridge Restaurant Booking task, achieving an 80% success rate within only 260 training dialogues. This efficiency is more than eight times greater than that of its baseline model, PPO (Proximal Policy Optimization). In comparison, the state-of-the-art online learning algorithm, GP-SARSA, requires 680 dialogues to achieve a similar level of performance (Gašić et al., 2011). It is important to highlight that GP-SARSA is a complex algorithm with a time complexity of O(N³), whereas LCPO is a lightweight and versatile algorithm with a time complexity of O(N²).
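The notion of re-estimating the advantages of unfavorable actions can be sketched as below. The abstract does not give LCPO's actual rule, so both the loop test (a turn that leaves the dialogue state unchanged) and the cap at zero are assumptions for illustration only, as are all names and values.

```python
def clip_loop_advantages(states, next_states, advantages):
    """Cap the advantage at zero for turns that did not change the state.

    Assumed rule for illustration; not the rule defined in the thesis.
    """
    clipped = []
    for state, next_state, adv in zip(states, next_states, advantages):
        is_loop = state == next_state
        # A looping turn is never credited with a positive advantage.
        clipped.append(min(adv, 0.0) if is_loop else adv)
    return clipped


# Hypothetical three-turn trajectory in which the second turn loops.
states = ["s0", "s1", "s1"]
next_states = ["s1", "s1", "s2"]
advantages = [0.4, 0.7, -0.2]
clipped = clip_loop_advantages(states, next_states, advantages)
```

The adjusted advantages would then replace the original ones in an otherwise unchanged policy-gradient update such as PPO, which is consistent with the statement that no environment model and no extra hyper-parameters are needed.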
