This article was submitted to Human-Robot Interaction, a section of the journal Frontiers in Robotics and AI
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
To date, endowing robots with an ability to assess the social appropriateness of their actions has not been possible. This has been mainly due to (i) the lack of relevant and labelled data, and (ii) the lack of formulations of this as a lifelong learning problem. In this paper, we address these two issues. We first introduce the Socially Appropriate Domestic Robot Actions dataset (MANNERS-DB), which contains appropriateness labels of robot actions annotated by humans. Second, we train and evaluate a baseline Multi-Layer Perceptron (MLP) and a Bayesian Neural Network (BNN) that estimate the social appropriateness of actions in MANNERS-DB. Finally, we formulate learning the social appropriateness of actions as a continual learning problem, using the uncertainty of the Bayesian Neural Network parameters. The experimental results show that the social appropriateness of robot actions can be predicted with a satisfactory level of precision. To facilitate reproducibility and further progress in this area, MANNERS-DB, the trained models and the relevant code are made publicly available at
Social robots are required to operate in highly challenging environments populated with complex objects, articulated tools, and complicated social settings involving humans, animals and other robots. To operate successfully in these environments, robots should be able to assess whether an action is socially appropriate in a given context. Learning to navigate the jungle of social etiquette, norms, and verbal and visual cues that make up such a social context is not straightforward. Little work has been done on equipping robots with this ability, and even for humans, it takes years to learn to accurately read and recognise the signals involved when determining the social appropriateness of an action.
The social robotics community has studied related problems such as socially appropriate navigation (
To this end, we first introduce the Socially Appropriate Domestic Robot Actions Dataset (MANNERS-DB) that constitutes simulated robot actions in visual domestic scenes of different social configurations (see an example in
An example scene from the simulated living room environment. The robot (in circle) is expected to execute an action that is appropriate to the given social context.
Operating successfully in a social environment is already challenging for most people, let alone robots. The social cues and signals that need to be interpreted and acted upon are numerous and complex. However, some of the social rules and conventions that need to be followed and understood are similar for both humans and robots. A good starting point for this is the survey paper on social signal processing by
In the context of group behaviour,
When it comes to assessing how we use the space and the environment around us in social interactions,
In the field of human-robot interaction, studies have shown that robots are treated differently from humans with respect to appropriate interpersonal distance and invasion of personal space. Evidence suggests that, when introduced to a robot, people prefer it to be positioned in what
We note that the majority of existing work on robot behaviour toward and around people has focused on socially aware motion planning and navigation. Traditional approaches in this area rely on hand-crafted methods such as the work of
Researchers have also examined how and when to engage humans appropriately in HRI situations (
Determining the social appropriateness of an action relies on determining the social context in which that action will be executed. Contextual understanding has been an important area of research in human-computer interaction (HCI) (
Humans excel at continuously learning new skills and new knowledge with new experiences. This has inspired a new problem in machine learning, coined as lifelong learning (
An important challenge in CL is to be able to retain the previously acquired knowledge while learning new ones. This is known as the catastrophic forgetting problem (
In this paper, we use a method that regularizes updates to parameters by looking at their uncertainties, following the approach of
Continual learning is essential for robotics, since robots interacting with the environment and with humans continuously discover new tasks, contexts and interactions. For robots to come into widespread use, they are expected to learn new tasks and skills whenever needed, and to adapt to new experiences or contexts (
There has been substantial work lately on lifelong learning in robots, enabling it across various robot capabilities, ranging from perception to navigation and manipulation (for reviews, see
Although these studies are promising, task-incremental learning within the social-robotics aspect of HRI is less explored. Moreover, adapting a robot’s behaviours to its users or to new contexts is essentially a very practical setting of continual learning (
There exist a couple of datasets for studying socially appropriate navigation in environments populated with objects and humans. For example, the Edinburgh Informatics Forum Pedestrian Database (
Another dataset that is pertinent to our study is the CMU Graphics Lab Motion Capture Database (
Compared to the aforementioned datasets, MANNERS-DB is distinct in that it considers a wider range of actions (cleaning, carrying objects, etc.), modalities (it includes sound) and social settings (it includes children, pets, lying humans, etc.). Our dataset therefore makes it possible to study the social appropriateness of robot actions in a more generic context than navigation.
Decision-making physical robots should provide insight into the uncertainty behind their actions, in particular when interacting with humans. In this work, we model two types of uncertainty, namely, aleatoric uncertainty describing the underlying ambiguity in the data and epistemic uncertainty reflecting the lack of or unfamiliarity with data. The two types were first combined in one model by
Creating a real environment in which social configurations and attributes can be simultaneously controlled and varied is difficult. Therefore, we developed a simulation environment to generate static scenes with various social configurations and attributes. The scenes were then labelled by independent observers.
The factors forming the 29-dimensional input to the learning models.
Feature | Variable type | Range |
---|---|---|
Operating within circle | Int | 0 or 1 |
Radius of action circle | Float | 0.5 → 3 |
Operating in the direction of an arrow | Int | 0 or 1 |
Number of humans | Int | 0 → 9 |
Number of children | Int | 0 → 2 |
Distance to closest child | Float | 0.4 → 6 |
Number of animals | Int | 0 or 1 |
Distance to animal | Float | 0.4 → 6 |
Number of people in a group | Int | 2 → 5 |
Group radius | Float | 0.50 → 1 |
Distance to group | Float | 0 → 6 |
Robot within group? | Int | 0 or 1 |
Robot facing group? | Int | 0 or 1 |
Distance to 3 closest humans | 3 x Float | 0.3 → 5 |
Direction robot to 3 closest humans | 3 x Float | 0.0 → 360.0 |
Direction closest human to robot | Float | 0.0 → 360.0 |
Robot facing 3 closest humans? | 3 x Int | 0 or 1 |
3 closest humans facing robot? | 3 x Int | 0 or 1 |
Number of people on sofa | Int | 0 → 2 |
Playing music? | Int | 0 or 1 |
Total number of agents in scene | Int | 1 → 11 |
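As a concrete illustration of how a scene maps onto the 29-dimensional input described in the table above, the following sketch flattens a scene into a feature vector. The field names and their ordering are assumptions for illustration only, not the dataset's actual schema; only the dimensionality and feature types follow the table.

```python
import numpy as np

# Hypothetical sketch: flatten a scene description into the
# 29-dimensional input vector listed in the feature table.
# Field names and ordering are illustrative assumptions.
def scene_to_vector(scene: dict) -> np.ndarray:
    parts = [
        scene["within_circle"],          # operating within circle (0/1)
        scene["circle_radius"],          # radius of action circle
        scene["along_arrow"],            # operating along an arrow (0/1)
        scene["n_humans"],               # number of humans
        scene["n_children"],             # number of children
        scene["dist_closest_child"],     # distance to closest child
        scene["n_animals"],              # number of animals
        scene["dist_animal"],            # distance to animal
        scene["group_size"],             # number of people in a group
        scene["group_radius"],           # group radius
        scene["dist_group"],             # distance to group
        scene["robot_in_group"],         # robot within group? (0/1)
        scene["robot_facing_group"],     # robot facing group? (0/1)
        *scene["dist_3_closest"],        # 3 floats: distances
        *scene["dir_robot_to_3"],        # 3 floats: directions (degrees)
        scene["dir_closest_to_robot"],   # direction, closest human to robot
        *scene["robot_facing_3"],        # 3 ints: robot facing humans? (0/1)
        *scene["facing_robot_3"],        # 3 ints: humans facing robot? (0/1)
        scene["n_people_sofa"],          # number of people on sofa
        scene["music_playing"],          # playing music? (0/1)
        scene["n_agents_total"],         # total number of agents in scene
    ]
    vec = np.asarray(parts, dtype=np.float32)
    assert vec.shape == (29,)            # 13 + 12 + 4 = 29 features
    return vec
```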
The environment was developed in Unity3D simulation software (
Social appropriateness of robot actions can be investigated in numerous social contexts. In this work, we chose to focus on “visual” robot domestic actions that could potentially occur in a home setting as social robots are envisaged to be incorporated into our homes in the near future. Modelling the social appropriateness in the living room of such a home setting is more challenging and complex than for example a bedroom or a bathroom setting. Therefore, we chose the living room scenario as the context for MANNERS-DB and represented this context using and varying the features defined in
We specifically consider the social appropriateness of the actions listed in
The robot actions investigated in each scene.
Actions within a circle | Actions along an arrow |
---|---|
Vacuum cleaning | Vacuum cleaning |
Mopping the floor | Mopping the floor |
Carry warm food | Carry warm food |
Carry cold food | Carry cold food |
Carry drinks | Carry drinks |
Carry small objects (plates, toys) | Carry small objects (plates, toys) |
Carry big objects (tables, chairs) | Carry big objects (tables, chairs) |
Cleaning (Picking up stuff) | Starting conversation |
The generated scenes were labelled for social appropriateness of the robot actions in the depicted domestic setting using a crowd-sourcing platform (
The annotation task as shown to the annotators on the crowd-sourcing platform. The page includes an image of the scene along with a honey-pot question (bottom-left) and questions around the appropriateness of robot actions.
When collecting subjective opinions, as is done in our work, evaluating inter-rater reliability is necessary to ensure that there is a sufficient level of consistency in the gathered labels (
In
Intra-class correlation values for all actions over all scenes.
Actions within a circle | ICC(1,1) | ICC(1,k)
---|---|---
Vacuum cleaning | 0.317 | 0.848
Mopping the floor | 0.339 | 0.860
Carry warm food | 0.068 | 0.465
Carry cold food | 0.043 | 0.355
Carry drinks | 0.048 | 0.378
Carry small objects (plates, toys) | 0.087 | 0.533
Carry big objects (tables, chairs) | 0.256 | 0.805
Cleaning (picking up stuff) | 0.192 | 0.740

Actions in the direction of an arrow | ICC(1,1) | ICC(1,k)
---|---|---
Vacuum cleaning | 0.267 | 0.814
Mopping the floor | 0.278 | 0.822
Carry warm food | 0.048 | 0.378
Carry cold food | 0.047 | 0.371
Carry drinks | 0.042 | 0.346
Carry small objects (plates, toys) | 0.078 | 0.503
Carry big objects (tables, chairs) | 0.203 | 0.753
Starting conversation | 0.111 | 0.600
By looking at the intra-class correlations in
We also analyzed the reliability of the annotations using Cronbach’s α.
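For readers unfamiliar with the two ICC variants reported above, the following sketch (not the authors' code) computes ICC(1,1) and ICC(1,k) from a one-way random-effects ANOVA over a ratings matrix of shape (targets × raters); ICC(1,k) is exactly the Spearman–Brown adjustment of ICC(1,1), which explains why the average-rater values in the table are much higher than the single-rater ones.

```python
import numpy as np

# Sketch of one-way random-effects intraclass correlations.
# ratings: (n_targets, k_raters) matrix of scores.
def icc1(ratings: np.ndarray):
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    # between-target and within-target mean squares
    msb = k * ((row_means - grand) ** 2).sum() / (n - 1)
    msw = ((ratings - row_means[:, None]) ** 2).sum() / (n * (k - 1))
    icc_single = (msb - msw) / (msb + (k - 1) * msw)   # ICC(1,1)
    icc_average = (msb - msw) / msb                     # ICC(1,k)
    return icc_single, icc_average
```

ICC(1,k) equals k·ICC(1,1) / (1 + (k−1)·ICC(1,1)), so moderate single-rater reliability yields high reliability for the averaged labels used in training.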
We explored the relation between the various factors and the social appropriateness of actions.
Pearson correlation (
We observe that, in group-related contexts, the number of people in the room, as well as in a group, seems to have a slight negative correlation with the appropriateness of different actions. However, most of these correlations are very close to zero, except for intrusive actions such as
The correlations of the distance related features (
Building on the personal spaces of
Average appropriateness of actions with respect to the distance to the closest person in the environment.
We investigate four different scenarios with respect to how the robot and the closest human face each other, see
In this section, we propose a continual learning model for learning social appropriateness of robot actions. For training our model, we use our MANNERS-DB dataset.
We experiment with two approaches, a Multi-Layer Perceptron (MLP) and a Bayesian Neural Network (BNN), as baselines for estimating appropriateness
• Baseline (BNN and MLP):
Our baselines are a conventional MLP and a BNN with the architecture shown in
Neural network architecture for all models. The models take in the representation of the scene as a 29-dimensional vector (
• 2-tasks model (BNN-2CL and MLP2):
For the second experiment, we split the dataset into two parts: we first train on the actions executed within a circle, and then continue training on samples with actions executed along the direction of an arrow. In other words, in this experiment, two tasks of eight actions each are trained on sequentially.
• 16-tasks model (BNN-16CL and MLP16):
In the third experiment, the models are again given data sequentially, separated into one part per action for each of the 16 actions. That is, they are trained over 16 tasks of one action each.
All models share the same architecture, illustrated in
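To make the Bayesian side of the architecture concrete, here is a minimal sketch of a Bayesian linear layer with the reparameterization trick: each weight has a mean μ and a standard deviation σ = softplus(ρ), and a forward pass samples weights from N(μ, σ²). This is an illustrative stand-in in plain NumPy, not the paper's actual implementation.

```python
import numpy as np

# Minimal Bayesian linear layer (illustrative sketch).
class BayesianLinear:
    def __init__(self, n_in, n_out, rng):
        self.mu = rng.normal(0.0, 0.1, size=(n_in, n_out))   # weight means
        self.rho = np.full((n_in, n_out), -3.0)              # sigma = softplus(rho)
        self.rng = rng

    def sigma(self):
        return np.log1p(np.exp(self.rho))   # softplus keeps sigma > 0

    def forward(self, x, sample=True):
        if sample:
            # reparameterization: w = mu + sigma * eps, eps ~ N(0, 1)
            eps = self.rng.standard_normal(self.mu.shape)
            w = self.mu + self.sigma() * eps
        else:
            w = self.mu                     # deterministic: posterior mean
        return x @ w
```

At test time, repeated stochastic forward passes through such layers yield the distribution over predictions from which the uncertainty estimates discussed later are derived.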
Two of the models, BNN-2CL and BNN-16CL, are implemented with active measures to handle catastrophic forgetting. They are extensions of the work of
We would like to note that the continual learning problem in our paper differs slightly from many other CL applications in that the distribution of the input data does not change significantly between two tasks; the labels, however, do. For every task, the model is trained to predict the social appropriateness of a new set of actions. In more traditional applications, such as sequentially learning to classify the handwritten digits of the MNIST dataset, both the input and the labels change: a handwritten five looks different from a four, and they should be assigned to different classes. The approach taken in our work, however, aligns closely with the overall human-like learning approach we have adopted. As humans, we may face situations and contexts that we have seen before but in which we discover a new skill or develop our understanding, i.e., the input features are the same, but what we want to learn to predict changes.
The inherent stochastic nature of BNNs leads to challenges at inference due to the intractability of the marginal probability,
When combining the cost function given by
Here, the first component controls variational approximation; the second component enforces the correctness of the predictions and estimates their uncertainty;
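The second component can be illustrated with a heteroscedastic regression loss in the style of Kendall and Gal: the network outputs a predicted mean and a log-variance s per action, and targets with high predicted noise are down-weighted while the log-variance term penalises claiming high noise everywhere. The sketch below shows this attenuation term only, not the full cost with the variational (KL) component.

```python
import numpy as np

# Heteroscedastic (attenuated) regression loss sketch:
# y_pred_log_var is the predicted log-variance s per output.
def attenuated_loss(y_true, y_pred_mean, y_pred_log_var):
    s = y_pred_log_var
    # residuals are scaled by exp(-s); the +0.5*s term stops the model
    # from trivially inflating the predicted variance
    per_sample = 0.5 * np.exp(-s) * (y_true - y_pred_mean) ** 2 + 0.5 * s
    return per_sample.mean()
```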
When undertaking continual learning, we need to deal with catastrophic forgetting. To prevent this, we use the uncertainty-guided continual learning strategy of
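The core idea of uncertainty-guided updates can be sketched as follows: parameters whose posterior standard deviation is small (i.e., the model is confident about them after earlier tasks) receive proportionally smaller learning rates, protecting previously acquired knowledge. This is a simplified illustration of the principle, with illustrative names, not the exact update rule of the cited strategy.

```python
import numpy as np

# Uncertainty-guided update sketch: scale each parameter's learning
# rate by its (normalised) posterior standard deviation.
def uncertainty_scaled_step(mu, sigma, grad_mu, base_lr=0.01):
    lr = base_lr * sigma / sigma.max()   # confident params -> tiny steps
    return mu - lr * grad_mu
```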
We want to extract rich uncertainty estimates from the models. We do this through epistemic uncertainty (BNNs only), which relates to a lack of data or unfamiliar data, as well as aleatoric uncertainty, which describes the underlying noise in the data. Examples of these two in our work could be high epistemic uncertainty for scenes with features that do not occur often in the training set, and high aleatoric uncertainty for scenes or actions on which annotators had a high level of disagreement. Following the work of
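The standard way to read these two quantities out of a BNN is via T stochastic forward passes: the spread of the predicted means across weight samples gives the epistemic uncertainty, while the average of the predicted variances gives the aleatoric uncertainty. A minimal sketch of this decomposition, assuming the per-pass means and variances have already been collected:

```python
import numpy as np

# Decompose predictive uncertainty from T stochastic forward passes.
# means, variances: arrays of shape (T, ...) from T weight samples.
def decompose_uncertainty(means, variances):
    epistemic = means.var(axis=0)        # spread of predicted means
    aleatoric = variances.mean(axis=0)   # average predicted data noise
    return epistemic, aleatoric
```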
As mentioned, we kept the hyperparameters the same for all experiments to allow for a reasonable comparison in performance. Nevertheless, an extensive hyperparameter search was carried out to validate that this did not lead to a substantial drop in performance. For training, we used a batch size of 64, 200 epochs per task and an initial global learning rate
Training on each task was done sequentially and the models’ weights were saved between tasks. This way, the change in performance, both the ability to predict accurate appropriateness and obtain sensible uncertainty measures, can be investigated with respect to the number of tasks the model has been trained on.
Training and Test Sets. For all three experiments, we split the dataset into training, validation and testing sets. The test set, 100 scenes, is the same for all experiments; the training and validation sets, however, are partitioned differently to facilitate continual learning. The 650 scenes used for training and validation contain 9,584 individual labelled samples. The validation part consists of 1,000 samples for the baseline experiment, 400 samples per task (circle and arrow) for the BNN-2CL and MLP2 models, and 100 per task (each action) for the BNN-16CL and MLP16 models. This means that the size of the training set is approximately 8,500 for the baseline, 4,400 per task for the 2-task models, and 500 per task for the 16-task models. It is worth noting that these differences in training-set size affect the comparative results obtained for each model, as discussed in the next section.
The prediction results from the experiments are presented in
Root-mean-squared error (RMSE) of predictions.
Actions | MLP | MLP-2 | MLP-16 | BNN | BNN-2CL | BNN-16CL
---|---|---|---|---|---|---
*Within a circle* | | | | | |
Vacuum cleaning | 0.493 | 0.877 | 0.941 | 0.467 | 0.501 | 0.767
Mopping the floor | 0.516 | 0.817 | 1.214 | 0.502 | 0.594 | 0.581
Carry warm food | 0.472 | 0.796 | 0.617 | 0.445 | 0.448 | 0.810
Carry cold food | 0.402 | 0.656 | 0.749 | 0.420 | 0.403 | 0.561
Carry drinks | 0.390 | 0.771 | 0.790 | 0.402 | 0.485 | 0.733
Carry small objects (plates, toys) | 0.375 | 0.437 | 0.979 | 0.386 | 0.879 | 0.517
Carry big objects (tables, chairs) | 0.533 | 0.734 | 1.271 | 0.497 | 0.520 | 0.665
Cleaning (picking up stuff) | 0.413 | 0.624 | 1.390 | 0.192 | 0.479 | 0.491
*In direction of arrow* | | | | | |
Vacuum cleaning | 0.547 | 0.573 | 1.014 | 0.555 | 0.591 | 0.750
Mopping the floor | 0.541 | 0.551 | 1.063 | 0.542 | 0.602 | 0.664
Carry warm food | 0.416 | 0.431 | 0.759 | 0.468 | 0.489 | 0.678
Carry cold food | 0.441 | 0.446 | 0.883 | 0.477 | 0.495 | 0.526
Carry drinks | 0.434 | 0.440 | 0.798 | 0.451 | 0.465 | 0.586
Carry small objects (plates, toys) | 0.417 | 0.425 | 0.431 | 0.431 | 0.464 | 0.548
Carry big objects (tables, chairs) | 0.502 | 0.497 | 1.361 | 0.498 | 0.535 | 0.594
Starting conversation | 0.513 | 0.525 | 0.601 | 0.539 | 0.523 | 0.678
Mean over all actions | 0.463 | 0.600 | 0.968 | 0.480 | 0.530 | 0.630
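For reference, the per-action root-mean-squared error reported in the table can be computed as follows (a straightforward sketch of the metric, not the authors' evaluation script):

```python
import numpy as np

# Root-mean-squared error between ground-truth and predicted
# appropriateness scores for one action over the test scenes.
def rmse(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```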
We provide an analysis of one of the continual learning model’s performance (BNN-16CL) in
Per action performance on test data at different stages of continual learning. As expected, when training of a task starts, its loss decreases and the performances of previously trained tasks do not change significantly.
The metrics presented above provide a good indication that all three BNN models perform well on unseen data. In this section we provide a qualitative evaluation of the predictions by taking a closer look at a number of representative scenes from the test set and the corresponding predicted social appropriateness of robot actions.
In
Predictions for test scene with no people.
Next, we take a closer look at a scene with more complex input features. In
To conclude the qualitative evaluation of the predictions, two scenes where the robot executes actions along the direction of an arrow are presented. In
Predictions for test scene with actions along an arrow, with the robot within a group of people.
In
We will now qualitatively evaluate the epistemic and aleatoric uncertainties for some example scenes in the test set. Since the MANNERS-DB dataset does not have significant differences in the input features for different actions, we compute the epistemic uncertainty per scene, averaged over all actions. However, we report the aleatoric uncertainty per action, as it should reflect the disagreement between annotators’ labels on each action. In
Four example scenes and their corresponding epistemic uncertainty estimate from the BNN-16CL model. As scenes 1 and 3 are less frequent in the dataset compared to scenes 2 and 4, we observe high epistemic uncertainty (indicating amount of familiarity) for scenes 1 and 3—see the text for more details.
As we discussed in
The aleatoric uncertainty should indicate annotator disagreement regarding the appropriateness of different robot actions. In our case, the dataset had high annotator agreement, and therefore annotator disagreement was not observed in the aleatoric uncertainty. However, to further validate the models’ capability to capture aleatoric uncertainty, we increased the disagreement between annotators by artificially modifying the labels. In detail, we increased the variance in the labels by changing the annotators’ answers on the first seven actions (for half of the dataset, they are set to one, and for the other half, to five) and leaving the original answer for the eighth action. By doing this, we created a more distinct change in agreement levels between actions. See
Example scene from test set with corresponding variance in annotator labels and aleatoric uncertainty per action. We observe that aleatoric uncertainty is high when the variance of labels is high, and vice versa. In other words, aleatoric uncertainty is able to capture the disagreement between the annotators in evaluations of social appropriateness of robot actions.
In this work, we studied the problem of social appropriateness of domestic robot actions which, to the best of our knowledge, had not been investigated before. To this end, we first introduced a dataset with social appropriateness annotations of robot actions in static scenes generated using a simulation environment. The subjective appropriateness annotations were obtained from multiple people using a crowd-sourcing platform.
Our analysis of the annotations revealed that human annotators do perceive the appropriateness of robot actions differently based on social context. We found, for example, that starting a conversation is perceived as more appropriate if the robot is both close to the human and facing them. We then formulated learning the social appropriateness of actions as a lifelong learning problem. We implemented three Bayesian Neural Networks, two of which employed continual learning. Our experiments demonstrated that all models provided a reasonable level of prediction performance, and that the continual learning models were able to cope well with catastrophic forgetting.
Despite its significant contributions, our work can be extended in various ways. For example, other environments, social settings and robot actions can be considered to study the social appropriateness of robot actions at large. This appears to be especially important for obtaining generalizable estimations from data-hungry learning models. In addition, our work’s simulation-based nature is likely to limit the ability of the models to capture nuances that are important when evaluating the appropriateness of actions in real-world scenarios. An interesting avenue of future research could include similar experiments with real-world social contexts. Moreover, our dataset contains textual annotations provided by users explaining the reasons behind their choices. This rich information can be leveraged for developing explainable models that can provide justifications for their social appropriateness predictions.
Going beyond the aforementioned future directions would entail generating dynamic scenes in which a robot is moving and/or generating scenes from the robot’s first-person perspective, and obtaining relevant annotations for these scenes and movements. How to extend the results obtained from third-person annotations, as done in this paper, to the first-person perspective of the robot would also be an interesting area to explore. Moreover, other CL methods (e.g.,
The original contributions presented in the study are included in the article/Supplementary Material, and the dataset, the trained models and the relevant code are made publicly available at
The studies involving human participants were reviewed and approved by Ethics Committee of the Department of Computer Science and Technology, University of Cambridge. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.
All authors contributed to the formulation of the research problem, the design of the dataset and the methods as well as the writing of the manuscript. JT implemented the methods and conducted the experiments.
The work of JT and HG has been partially supported by the EPSRC under grant ref. EP/R030782/1. SK is supported by the Scientific and Technological Research Council of Turkey (TUBITAK) through the BIDEB 2219 International Postdoctoral Research Scholarship Program and by the BAGEP Award of the Science Academy.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.