Parallel reward and punishment control in humans and robots: Safe reinforcement learning using the MaxPain algorithm

Elfwing, S; Seymour, B

Accepted version

Peer-reviewed

Repository URI

https://www.repository.cam.ac.uk/handle/1810/286048

Repository DOI

https://doi.org/10.17863/CAM.33366

Files

Accepted version (739 KB)

Type

Article

Authors

Elfwing, S

Seymour, Benjamin

https://orcid.org/0000-0003-1724-5832

Abstract

An important issue in reinforcement learning (RL) systems for autonomous agents is whether it makes sense to have separate systems for predicting rewards and punishments. In robotics, learning and control are typically achieved by a single controller, with punishments coded as negative rewards. However, in biological systems, some evidence suggests that the brain has a separate system for punishment. Although this may in part be due to the biological constraints of implementing negative quantities, it raises the question of whether there is any computational rationale for keeping reward and punishment prediction operationally distinct. Here we outline a basic argument supporting this idea, based on the proposition that learning best-case predictions (as in Q-learning) does not always achieve the safest behaviour. We introduce a modified RL scheme involving a new algorithm, which we call 'MaxPain', that backs up worst-case predictions in parallel and then scales the two predictions in a multi-attribute RL policy, i.e. it independently learns 'what to do' as well as 'what not to do' and then combines this information. We show how this scheme can improve performance in benchmark RL environments, including a grid-world experiment and a delayed version of the mountain car experiment. In particular, we demonstrate how early exploration and learning are substantially improved, leading to much 'safer' behaviour. In conclusion, the results illustrate the importance of independent punishment prediction in RL, and provide a testable framework for better understanding punishment (such as pain) and avoidance in humans, in both health and disease.
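
The parallel scheme described in the abstract can be illustrated with a minimal tabular sketch. The abstract does not state the update rules, so everything below is an assumption made for illustration: punishment is treated as a non-negative 'pain' signal, the worst-case backup bootstraps from the largest predicted pain in the next state, and the two predictions are combined with a linear weight. All function and parameter names are hypothetical and are not taken from the paper.

```python
def q_learning_update(q_reward, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Best-case backup: bootstrap from the highest predicted reward in s_next."""
    target = reward + gamma * max(q_reward[s_next])
    q_reward[s][a] += alpha * (target - q_reward[s][a])


def maxpain_update(q_pain, s, a, pain, s_next, alpha=0.1, gamma=0.9):
    """Worst-case backup (assumed form): pain >= 0, bootstrap from the
    highest predicted pain in s_next."""
    target = pain + gamma * max(q_pain[s_next])
    q_pain[s][a] += alpha * (target - q_pain[s][a])


def combined_preferences(q_reward, q_pain, s, weight=0.5):
    """Scale the two predictions into one preference per action.
    The linear mix is an assumption; weight=1.0 ignores pain entirely."""
    return [weight * r - (1.0 - weight) * p
            for r, p in zip(q_reward[s], q_pain[s])]


def greedy_action(q_reward, q_pain, s, weight=0.5):
    """Pick the action with the highest combined preference."""
    prefs = combined_preferences(q_reward, q_pain, s, weight)
    return max(range(len(prefs)), key=prefs.__getitem__)


# Two states, two actions; both tables start at zero.
q_reward = [[0.0, 0.0], [0.0, 0.0]]
q_pain = [[0.0, 0.0], [0.0, 0.0]]

# Action 0 in state 0 yields reward 1.0 but also pain 2.0.
q_learning_update(q_reward, 0, 0, reward=1.0, s_next=1, alpha=0.5)
maxpain_update(q_pain, 0, 0, pain=2.0, s_next=1, alpha=0.5)

reward_only_choice = greedy_action(q_reward, q_pain, 0, weight=1.0)
balanced_choice = greedy_action(q_reward, q_pain, 0, weight=0.5)
```

In this toy case a reward-only policy (weight 1.0) still prefers the painful action, while the balanced policy avoids it, which is the kind of 'what not to do' information the abstract refers to.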

Keywords

46 Information and Computing Sciences, 4602 Artificial Intelligence, 4611 Machine Learning, Clinical Research, 1.2 Psychological and socioeconomic processes, 1 Underpinning research, Mental health, Generic health relevance

Journal Title

7th Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics, ICDL-EpiRob 2017

Journal ISSN

2161-9484

Volume Title

2018-January

Publisher

IEEE

Publisher DOI

https://doi.org/10.1109/DEVLRN.2017.8329799

Rights

http://www.rioxx.net/licenses/all-rights-reserved

Sponsorship

Wellcome Trust (097490/Z/11/Z)
Arthritis Research UK (21537)

Collections

Cambridge University Research Outputs