Fast upper body pose estimation for human-robot interaction
Abstract
This work describes an upper body pose tracker that estimates 3D pose from video sequences captured by a monocular camera, with applications in human-robot interaction in mind. It introduces a novel model, a mixture of Ornstein-Uhlenbeck processes, that is trained in a reduced-dimensional subspace and designed for analytical tractability. The model acts as a collection of mean-reverting random walks that pull towards more commonly observed poses. Pose tracking with this model can be Rao-Blackwellised, which allows for computational efficiency while still incorporating biomechanical properties of the upper body. The model is used within a recursive Bayesian framework to provide reliable estimates of upper body pose when only a subset of body joints can be detected. Model training data can be extended through a retargeting process, and better pose coverage can be obtained by using Poisson disk sampling in the model training stage. Results on a number of test datasets show that the proposed approach provides pose estimation accuracy comparable with the state of the art in real time (30 fps) and can be extended to the multiple-user case.
As a motivating example, this work also introduces a pantomimic gesture recognition interface. Traditional approaches to gesture recognition for robot control make use of predefined codebooks of gestures, which are mapped directly to the robot behaviours they are intended to elicit. These gesture codewords are typically recognised using algorithms trained on multiple recordings of people performing the predefined gestures. Obtaining these recordings can be expensive and time-consuming, and the codebook of gestures may not be particularly intuitive. This thesis argues that pantomimic gestures, which mimic the intended robot behaviours directly, are potentially more intuitive. It proposes a transfer learning approach to recognition, in which human hand gestures are mapped to recordings of robot behaviour by extracting temporal and spatial features that are inherently present in both pantomimed actions and robot behaviours. A Bayesian bias compensation scheme is introduced to compensate for potential classification bias in features. Results from a quadrotor behaviour selection problem show that good classification accuracy can be obtained when human hand gestures are recognised using behaviour recordings, and that classification using these behaviour recordings is more robust than classification using human hand recordings when users are allowed complete freedom over their choice of input gestures.
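To illustrate the mean-reverting behaviour that the abstract attributes to Ornstein-Uhlenbeck processes, the following minimal sketch simulates a single one-dimensional process using its exact discrete-time transition. It is an illustration only, not the mixture model or training procedure developed in the thesis: the function name `simulate_ou`, the scalar state, and all parameter values (including the 1/30 s time step, chosen to echo the 30 fps figure) are assumptions made for this example.

```python
import math
import random


def simulate_ou(x0, mu, theta, sigma, dt, n_steps, seed=0):
    """Simulate a 1D Ornstein-Uhlenbeck process (hypothetical helper).

    x0: initial value, mu: long-run mean the process reverts to,
    theta: reversion rate (must be > 0), sigma: diffusion strength,
    dt: time step, n_steps: number of transitions to simulate.
    """
    rng = random.Random(seed)
    # Exact transition over a step dt:
    #   x' = mu + (x - mu) * exp(-theta * dt) + noise_std * N(0, 1)
    decay = math.exp(-theta * dt)
    noise_std = sigma * math.sqrt((1.0 - decay ** 2) / (2.0 * theta))
    path = [x0]
    for _ in range(n_steps):
        x = path[-1]
        path.append(mu + (x - mu) * decay + noise_std * rng.gauss(0.0, 1.0))
    return path


# Illustrative values only: a state starting far from the mean, sampled at 30 Hz.
path = simulate_ou(x0=2.0, mu=0.5, theta=1.5, sigma=0.2, dt=1.0 / 30.0, n_steps=90)
print(path[0], path[-1])
```

With `sigma` set to zero the trajectory decays deterministically towards `mu`, which makes the "pull towards commonly observed poses" easy to see in isolation. A mixture, as named in the abstract, would combine several such processes with different means; that part is not sketched here.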