Machine-Learning-Enabled Gestural Interaction in Mixed Reality
Abstract
Mixed Reality (MR) describes the seamless blending of a physical environment with a digitally generated one, creating an integrated space in which real and virtual elements coexist and interact. Commercial headsets such as the Microsoft HoloLens and Meta Quest have opened up new computing and interaction paradigms for users, yet various factors have prevented these products from becoming widely adopted everyday devices; the lack of frictionless interaction is one of them. This thesis focuses on gestural interactions that are performed with bare hands rather than controllers. Gestural interaction in MR is still in its infancy, with ample potential for exploration and development. The work on gestural interaction in this thesis is partitioned into two parts: a mid-air gesture keyboard for text entry and gesture recognition for eliciting commands, analogous to the keyboard and mouse on personal computers. Machine learning has significantly advanced a wide range of technologies; the central hypothesis of this thesis is therefore that machine learning enables fast and accurate gestural interaction systems in Mixed Reality.
The design of machine-learning-enabled gestural interaction systems faces many challenges. Standard machine learning models require large amounts of training data before they can recognize complex patterns. Owing to the limited adoption of MR devices, and because the interaction systems proposed here are novel, acquiring such data can be time-consuming and challenging. Data sparsity therefore poses the first challenge, leading to Research Question 1: How can generative machine learning models be used to synthesize skeleton gestural data? This thesis proposes a novel model, the Imaginative Generative Adversarial Network (GAN), which automatically synthesizes skeleton-based hand gesture data for augmenting the training of gesture classification models. The results demonstrate that the proposed model trains quickly and improves classification accuracy compared to conventional data augmentation strategies. The model is further extended to generate trajectory data for mid-air gesture keyboards; it is compared with other generative models, and the comparative advantages of each are discussed.
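To make the data-augmentation idea concrete, the sketch below shows a minimal conditional GAN that generates skeleton-based hand gesture sequences. It is an illustrative assumption rather than the thesis's Imaginative GAN: the joint count, sequence length, layer sizes, and class names are all hypothetical.

```python
# Minimal sketch (not the thesis's Imaginative GAN): a conditional GAN that
# synthesizes skeleton-based hand gesture sequences for data augmentation.
# Shapes, layer sizes, and the number of gesture classes are assumptions.
import torch
import torch.nn as nn

NUM_JOINTS, COORDS, SEQ_LEN, LATENT, NUM_CLASSES = 21, 3, 64, 64, 14

class Generator(nn.Module):
    """Maps (noise, class label) to a sequence of joint positions."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_CLASSES, LATENT)
        self.rnn = nn.GRU(LATENT * 2, 128, batch_first=True)
        self.out = nn.Linear(128, NUM_JOINTS * COORDS)

    def forward(self, z, labels):
        cond = self.embed(labels).unsqueeze(1).expand(-1, SEQ_LEN, -1)
        z = z.unsqueeze(1).expand(-1, SEQ_LEN, -1)
        h, _ = self.rnn(torch.cat([z, cond], dim=-1))
        return self.out(h)  # (batch, SEQ_LEN, NUM_JOINTS * COORDS)

class Discriminator(nn.Module):
    """Scores whether a (sequence, label) pair looks like real motion data."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_CLASSES, LATENT)
        self.rnn = nn.GRU(NUM_JOINTS * COORDS + LATENT, 128, batch_first=True)
        self.out = nn.Linear(128, 1)

    def forward(self, seq, labels):
        cond = self.embed(labels).unsqueeze(1).expand(-1, seq.size(1), -1)
        h, _ = self.rnn(torch.cat([seq, cond], dim=-1))
        return self.out(h[:, -1])  # logit from the final time step

def train_step(G, D, opt_g, opt_d, real_seq, labels, loss=nn.BCEWithLogitsLoss()):
    """One adversarial update; synthesized sequences can later augment
    the training set of a gesture classifier."""
    z = torch.randn(real_seq.size(0), LATENT)
    fake_seq = G(z, labels)

    # Discriminator: distinguish real from generated sequences.
    opt_d.zero_grad()
    d_loss = (loss(D(real_seq, labels), torch.ones(real_seq.size(0), 1)) +
              loss(D(fake_seq.detach(), labels), torch.zeros(real_seq.size(0), 1)))
    d_loss.backward()
    opt_d.step()

    # Generator: try to fool the discriminator.
    opt_g.zero_grad()
    g_loss = loss(D(fake_seq, labels), torch.ones(real_seq.size(0), 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```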
Designing a mid-air gesture keyboard involves both user interface and user experience design. MR opens up a vast design space through its high degree of interactivity and expansive display area. Consequently, the optimal size of a mid-air gesture keyboard varies among users owing to differences in their gesture motions and preferences. Research Question 2 is therefore: How can machine-learning-based optimization methods adaptively personalize the size of a mid-air gesture keyboard? This thesis proposes a multi-objective Bayesian optimization approach for adapting the layout size of a mid-air gesture keyboard to individual users. The results demonstrate that this process achieves a 14.4% improvement in speed and a 13.8% improvement in accuracy relative to a baseline design with a constant size.
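The following sketch illustrates how Bayesian optimization could adapt a keyboard's size to one user. It scalarizes speed and accuracy into a single utility rather than reproducing the thesis's multi-objective formulation, and measure_user_trial is a hypothetical, simulated stand-in for a real typing trial; the candidate grid and weights are likewise assumptions.

```python
# Minimal sketch of Bayesian optimization over keyboard size; not the thesis's
# exact multi-objective method. A Gaussian process models a scalarized utility
# of (speed, accuracy) and expected improvement picks the next size to try.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def measure_user_trial(width_cm, height_cm):
    """Hypothetical stand-in for a typing trial: a synthetic response that
    peaks near a 30 cm x 12 cm layout, purely so the loop can run."""
    wpm = 25 - 0.02 * (width_cm - 30) ** 2 - 0.05 * (height_cm - 12) ** 2
    acc = 0.95 - 0.002 * abs(width_cm - 30) - 0.004 * abs(height_cm - 12)
    return wpm, acc

def utility(wpm, acc, w_speed=0.5):
    # Weighted-sum scalarization of the two objectives (an assumption).
    return w_speed * (wpm / 30.0) + (1 - w_speed) * acc

def expected_improvement(mu, sigma, best, xi=0.01):
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Candidate keyboard sizes (cm) and a small initial design.
candidates = np.array([[w, h] for w in np.linspace(15, 45, 16)
                              for h in np.linspace(6, 18, 13)])
X, y = [], []
for w, h in [(20, 8), (30, 12), (40, 16)]:
    wpm, acc = measure_user_trial(w, h)
    X.append([w, h]); y.append(utility(wpm, acc))

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(10):                                   # adaptation iterations
    gp.fit(np.array(X), np.array(y))
    mu, sigma = gp.predict(candidates, return_std=True)
    next_size = candidates[np.argmax(expected_improvement(mu, sigma, max(y)))]
    wpm, acc = measure_user_trial(*next_size)
    X.append(list(next_size)); y.append(utility(wpm, acc))

best_width, best_height = X[int(np.argmax(y))]
```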
Furthermore, MR products suffer from inaccurate, high-latency hand tracking and from the lack of haptic feedback. These issues lower text entry accuracy and speed on mid-air gesture keyboards, resulting in a suboptimal user experience. Research Question 3 is therefore: How can machine learning methods accurately decode gesture trajectories into intended text on a mid-air gesture keyboard, thereby facilitating innovative designs that enhance the user experience? This thesis introduces a novel gesture trajectory decoding model that robustly translates users' three-dimensional fingertip gesture trajectories into their intended text. This accurate decoding model enables the investigation of innovative open-loop interaction designs, including the removal of visual feedback and the relaxation of delimitation thresholds.
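The abstract does not specify the decoder's architecture; the sketch below shows one plausible realization as a recurrent encoder over 3D fingertip trajectories trained with CTC loss, so trajectories need not be aligned to individual letters. The vocabulary, hidden sizes, and toy batch are assumptions, not the thesis's model.

```python
# Minimal sketch of a gesture trajectory decoder (illustrative, not the
# thesis's model): a bidirectional GRU maps a 3D fingertip trajectory to
# per-frame character logits, trained with CTC loss.
import torch
import torch.nn as nn

VOCAB = 28  # assumed alphabet: 26 letters + space + CTC blank (index 0)

class TrajectoryDecoder(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(3, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, VOCAB)

    def forward(self, traj):                    # traj: (batch, time, 3) xyz
        h, _ = self.rnn(traj)
        return self.proj(h).log_softmax(-1)     # (batch, time, VOCAB)

model = TrajectoryDecoder()
ctc = nn.CTCLoss(blank=0)

# Toy batch: two 120-frame trajectories and their intended words as indices.
traj = torch.randn(2, 120, 3)
targets = torch.randint(1, VOCAB, (2, 5))
log_probs = model(traj).permute(1, 0, 2)        # CTC expects (time, batch, vocab)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 120),
           target_lengths=torch.full((2,), 5))
loss.backward()
```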
Lastly, gesture recognition is a complex task that requires recognizing noisy and intricate gesture data in real time. Moreover, applying machine learning techniques to gesture recognition can be challenging for those without prior experience. Research Question 4 is therefore: How can machine learning models perform accurate gesture recognition in real time and empower non-experts to use this technology? This thesis proposes a key gesture spotting architecture comprising a novel gesture classifier model and a novel single-time activation algorithm. The architecture is evaluated on four separate skeleton-based hand gesture datasets and achieves high recognition accuracy with early detection. Furthermore, various data processing and augmentation strategies, along with the proposed key gesture spotting architecture, are encapsulated in an interactive application that demonstrated high usability in a user study.
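The single-time activation algorithm is not detailed in the abstract; the sketch below illustrates the general idea of firing a command exactly once when a sliding-window classifier becomes sufficiently confident, then re-arming after a refractory period. The window length, threshold, classifier interface, and "no gesture" class are assumptions for illustration.

```python
# Minimal sketch of single-time activation for key gesture spotting
# (an illustration, not the thesis's exact algorithm).
from collections import deque
import numpy as np

class SingleTimeActivator:
    def __init__(self, classifier, window=32, threshold=0.9, refractory=30):
        self.classifier = classifier       # callable: (window, features) -> class probs
        self.buffer = deque(maxlen=window) # rolling window of skeleton frames
        self.threshold = threshold
        self.refractory = refractory       # frames to wait before re-arming
        self.cooldown = 0

    def step(self, frame):
        """Feed one skeleton frame; return a gesture label at most once per gesture."""
        self.buffer.append(frame)
        if self.cooldown > 0:
            self.cooldown -= 1
            return None
        if len(self.buffer) < self.buffer.maxlen:
            return None
        probs = self.classifier(np.stack(self.buffer))
        label = int(np.argmax(probs))
        if probs[label] >= self.threshold and label != 0:   # class 0 = "no gesture"
            self.cooldown = self.refractory                  # fire exactly once
            return label
        return None

# Hypothetical usage with a pretrained classifier and a hand-tracking stream:
# activator = SingleTimeActivator(my_gesture_model.predict_window)
# for frame in hand_tracking_stream():
#     command = activator.step(frame)
#     if command is not None:
#         dispatch(command)
```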