Deep Reinforcement Learning for Real-World Humanoid Robot Locomotion Control with Automatic Reward Learning

Published version
Peer-reviewed

Abstract

Humanoid robots possess the potential to solve complex problems across diverse environments, such as nuclear-contaminated zones, epidemic-affected areas, and extraterrestrial missions. However, a humanoid robot is an inherently complex system that integrates multiple disciplines, including perception, mechanical design, materials science, and motion control, each of which requires comprehensive and in-depth investigation. Among these aspects, motion control plays a crucial role, as it directly determines the robot’s motion accuracy, stability, and flexibility. In recent years, with the rapid evolution of graphics-processing-unit-based parallel computing and high-fidelity simulation environments, various deep-reinforcement-learning (DRL)-based approaches have been proposed to achieve precise and robust motion control, owing to DRL’s flexibility and adaptability in uncertain and dynamic environments. However, the inherent complexity and uncertainty of real-world tasks pose substantial challenges for designing effective reward functions for DRL agents. Most current methods rely on manually engineered or externally tuned reward signals and therefore demand considerable domain expertise, substantial human effort, and long convergence times; these issues can even lead to mission failure. This work proposes an automatic reward learning method that derives reward functions for DRL in humanoid robot locomotion control. Specifically, a bilevel optimization framework is developed to enable automatic reward learning during policy learning: the reward learning mechanism in the upper level adaptively constructs and optimizes the reward function, while the DRL framework in the lower level learns the locomotion control policy using the learned reward function. Three sets of experiments verify the effectiveness of the proposed approach: training soft actor-critic and proximal policy optimization agents in MuJoCo environments, training the proximal policy optimization agent in a humanoid robot environment built with Isaac Lab, and transferring the agent to the real-world Unitree G1 humanoid robot through sim-to-real transfer. The experimental results demonstrate that the proposed automatic reward learning method substantially improves learning efficiency and outperforms manually designed reward functions in both simulation and real-world deployment. By improving the success rate of transferring control policies from simulation to real-world humanoid robots, this approach offers a promising pathway toward the deployment of stable and adaptive humanoid robots in practical applications.
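
The bilevel scheme the abstract describes — an upper level that constructs and optimizes the reward function while a lower level learns the control policy under it — can be illustrated with a minimal sketch. The paper's actual reward parameterization, update rules, and environments are not given on this page, so everything below is an assumption for illustration only: a toy point-mass task stands in for humanoid locomotion, a linear-Gaussian REINFORCE learner stands in for SAC/PPO, and random search over linear reward weights stands in for the paper's reward-learning mechanism.

import numpy as np

rng = np.random.default_rng(0)

def step(state, action, dt=0.1):
    # Toy point-mass dynamics: state = [position, velocity], scalar
    # acceleration action; the true task objective is distance covered.
    pos, vel = state
    vel = vel + dt * action
    pos = pos + dt * vel
    return np.array([pos, vel])

def reward_features(state, action):
    # Hypothetical feature terms a learned reward might weight:
    # forward-velocity bonus and an energy (action-magnitude) penalty.
    return np.array([state[1], -action ** 2])

def rollout(theta, w, horizon=50):
    # One episode under the linear-Gaussian policy a ~ N(theta . s, 0.5^2).
    # Returns the shaped return under weights w, the task return
    # (final distance), and a REINFORCE gradient term for theta.
    state = np.zeros(2)
    shaped, grad_logp = 0.0, np.zeros(2)
    for _ in range(horizon):
        mean = theta @ state
        action = mean + rng.normal(scale=0.5)
        grad_logp += (action - mean) / 0.25 * state  # d log pi / d theta
        shaped += w @ reward_features(state, action)
        state = step(state, action)
    return shaped, state[0], grad_logp

def inner_policy_update(theta, w, n_episodes=16, lr=1e-3):
    # Lower level: one REINFORCE step on the learned reward r_w.
    data = [rollout(theta, w) for _ in range(n_episodes)]
    baseline = np.mean([d[0] for d in data])
    grad = np.mean([(d[0] - baseline) * d[2] for d in data], axis=0)
    return theta + lr * grad

def train_and_score(w, inner_iters=30):
    # Train a fresh policy under reward weights w, then score it on the
    # true task objective (mean distance), which the upper level maximizes.
    theta = np.zeros(2)
    for _ in range(inner_iters):
        theta = inner_policy_update(theta, w)
    return np.mean([rollout(theta, w)[1] for _ in range(8)])

# Upper level: simple random search over the reward weights (a stand-in
# for the paper's reward-optimization mechanism, whose exact form is not
# stated on this page).
w = np.array([1.0, 0.1])
best = train_and_score(w)
for _ in range(15):
    candidate = w + rng.normal(scale=0.05, size=2)
    score = train_and_score(candidate)
    if score > best:
        w, best = candidate, score
print("learned reward weights:", w, "task score:", best)

The essential bilevel structure is that each candidate reward is scored only through the policy it induces: the upper level never sees the reward weights directly in its objective, only the task performance of the lower-level policy trained under them.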

Journal Title

Research

Journal ISSN

2518-9247
2639-5274

Volume Title

9

Publisher

American Association for the Advancement of Science (AAAS)

Rights and licensing

Except where otherwise noted, this item's license is described as Attribution 4.0 International
Sponsorship
National Natural Science Foundation of China