Show simple item record

dc.contributor.author: Su, Pei-Hao
dc.date.accessioned: 2018-05-09T10:50:05Z
dc.date.available: 2018-05-09T10:50:05Z
dc.date.issued: 2018-07-20
dc.date.submitted: 2018-05-08
dc.identifier.uri: https://www.repository.cam.ac.uk/handle/1810/275649
dc.description.abstract: Modelling dialogue management as a reinforcement learning task enables a system to learn to act optimally by maximising a reward function. This reward function is designed to induce the system behaviour required for goal-oriented applications, which usually means fulfilling the user’s goal as efficiently as possible. However, in real-world spoken dialogue systems, the reward is hard to measure, because the goal of the conversation is often known only to the user. Certainly, the system can ask the user if the goal has been satisfied, but this can be intrusive. Furthermore, in practice, the reliability of the user’s response has been found to be highly variable. In addition, due to the sparsity of the reward signal and the large search space, reinforcement learning-based dialogue policy optimisation is often slow. This thesis presents several approaches to address these problems. To better evaluate a dialogue for policy optimisation, two methods are proposed. First, a recurrent neural network-based predictor pre-trained from off-line data is proposed to estimate task success during subsequent on-line dialogue policy learning to avoid noisy user ratings and problems related to not knowing the user’s goal. Second, an on-line learning framework is described where a dialogue policy is jointly trained alongside a reward function modelled as a Gaussian process with active learning. This mitigates the noisiness of user ratings and minimises user intrusion. It is shown that both off-line and on-line methods achieve practical policy learning in real-world applications, while the latter provides a more general joint learning system directly from users. To enhance the policy learning speed, the use of reward shaping is explored and shown to be effective and complementary to the core policy learning algorithm.
Furthermore, as deep reinforcement learning methods have the potential to scale to very large tasks, this thesis also investigates the application to dialogue systems. Two sample-efficient algorithms, trust region actor-critic with experience replay (TRACER) and episodic natural actor-critic with experience replay (eNACER), are introduced. In addition, a corpus of demonstration data is utilised to pre-train the models prior to on-line reinforcement learning to handle the cold start problem. Combining these two methods, a practical approach is demonstrated to effectively learn deep reinforcement learning-based dialogue policies in a task-oriented information seeking domain. Overall, this thesis provides solutions which allow truly on-line and continuous policy learning in spoken dialogue systems.
dc.description.sponsorship: Taiwan Cambridge Scholarship
dc.language.iso: en
dc.rights: All rights reserved
dc.rights: All Rights Reserved [en]
dc.rights.uri: https://www.rioxx.net/licenses/all-rights-reserved/ [en]
dc.subject: Spoken Dialogue Systems
dc.subject: Gaussian Processes
dc.subject: Neural Network
dc.subject: Policy Optimisation
dc.subject: On-line Learning
dc.subject: Reward Estimation
dc.subject: Reward Shaping
dc.subject: Active Learning
dc.subject: Reinforcement Learning
dc.subject: Deep Learning
dc.subject: Deep Reinforcement Learning
dc.subject: Reward Learning
dc.title: Reinforcement Learning and Reward Estimation for Dialogue Policy Optimisation
dc.type: Thesis
dc.type.qualificationlevel: Doctoral
dc.type.qualificationname: Doctor of Philosophy (PhD)
dc.publisher.institution: University of Cambridge
dc.publisher.department: Engineering
dc.date.updated: 2018-05-08T22:43:30Z
dc.identifier.doi: 10.17863/CAM.22901
dc.publisher.college: Queens' College
dc.type.qualificationtitle: PhD in Engineering
cam.supervisor: Young, Steve
cam.thesis.funding: false
rioxxterms.freetoread.startdate: 2018-05-08

