Clipping Loops for Sample-Efficient Dialogue Policy Optimisation

Wu, Yen-Chen; Rasmussen, Carl Edward

doi:10.18653/v1/2021.naacl-main.267

Published version

Peer-reviewed

Repository URI

https://www.repository.cam.ac.uk/handle/1810/399742

Files

Primary Published version (643.74 KB)

Type

Conference Object

Authors

Wu, Yen-Chen

Rasmussen, Carl Edward

Abstract

Training dialogue agents requires a large number of interactions with users, because agents cannot tell which responses in a lengthy dialogue are bad. In this paper, we propose loop-clipping policy optimisation (LCPO) to eliminate useless responses. LCPO consists of two stages: loop clipping and advantage clipping. First, in loop clipping, we clip useless responses (called loops) out of the dialogue history (called a trajectory). The clipped trajectories are more succinct than the original ones, and the state-value estimates are more accurate. Second, in advantage clipping, we estimate and clip the advantages of useless responses and normal ones separately. The clipped advantage distinguishes useless actions from others and efficiently reduces the probabilities of useless actions. In experiments on the Cambridge Restaurant Dialogue System, LCPO uses only 260 training dialogues to achieve an 80% success rate, while the PPO baseline requires 2160 dialogues. In addition, LCPO receives a score of 3.7/5 in a human evaluation in which the agent interactively collects 100 real-user dialogues during the training phase.
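
The loop-clipping stage can be illustrated with a minimal sketch. The abstract does not say how loops are detected, so the code below assumes that a loop is any stretch of dialogue that ends in a state the agent has already visited, and that states are hashable. The function name `clip_loops` and the `(state, action)` representation are hypothetical and are not taken from the paper.

```python
from typing import Dict, Hashable, List, Tuple

# One dialogue turn: the state the agent was in and the action it took.
# This representation is an assumption made for illustration only.
Step = Tuple[Hashable, str]


def clip_loops(trajectory: List[Step]) -> List[Step]:
    """Remove every segment that leads back to an already-visited state.

    Assumption: a "loop" is a run of turns that starts at some state and
    later returns to that same state, so the turns in between made no
    progress. Only the final visit to each state is kept.
    """
    clipped: List[Step] = []
    position: Dict[Hashable, int] = {}  # state -> index inside `clipped`

    for state, action in trajectory:
        if state in position:
            # The dialogue came back to `state`: discard the earlier visit
            # and everything that followed it.
            cut = position[state]
            for old_state, _ in clipped[cut:]:
                del position[old_state]
            del clipped[cut:]
        position[state] = len(clipped)
        clipped.append((state, action))

    return clipped


# Hypothetical dialogue: the agent asks twice and ends up back in "s0".
dialogue = [("s0", "request_area"), ("s1", "repeat"),
            ("s0", "request_food"), ("s2", "inform")]
shortened = clip_loops(dialogue)
```

In this example the first two turns form a loop that returns to `s0`, so only the last two turns remain in `shortened`. A shorter trajectory of this kind is what the abstract credits with more accurate state-value estimates; advantage clipping is a separate stage and is not sketched here.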

Journal Title

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Conference Name

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Publisher

Association for Computational Linguistics (ACL)

Publisher DOI

https://doi.org/10.18653/v1/2021.naacl-main.267

Rights and licensing

Except where otherwise noted, this item's license is described as Attribution 4.0 International

Collections

University of Cambridge Research Outputs (Articles and Conferences)
