Alone versus in-a-group: A multi-modal framework for automatic affect recognition
Publication Date
2019-06-01
Journal Title
ACM Transactions on Multimedia Computing, Communications and Applications
ISSN
1551-6857
Publisher
Association for Computing Machinery (ACM)
Volume
15
Issue
2
Type
Article
This Version
AM (Accepted Manuscript)
Citation
Mou, W., Gunes, H., & Patras, I. (2019). Alone versus in-a-group: A multi-modal framework for automatic affect recognition. ACM Transactions on Multimedia Computing, Communications and Applications, 15 (2)https://doi.org/10.1145/3321509
Abstract
Recognition and analysis of human affect have been researched extensively within the field of computer science over the last two decades. However, most past research in automatic analysis of human affect has focused on the recognition of affect displayed by people in individual settings, and little attention has been paid to the analysis of affect expressed in group settings. In this paper, we first analyze the affect expressed by each individual in terms of the arousal and valence dimensions in both individual and group videos, and then propose methods to recognize the contextual information, i.e., whether a person is alone or in-a-group, by analyzing their face and body behavioral cues. For affect analysis, we first devise affect recognition models separately for individual and group videos and then introduce a cross-condition affect recognition model that is trained by combining the two different types of data. We conduct a set of experiments on two datasets that contain both individual and group videos. Our experiments show that (1) the proposed Volume Quantized Local Zernike Moments Fisher Vector (vQLZM-FV) outperforms other unimodal features in affect analysis; (2) the temporal learning model, Long Short-Term Memory networks (LSTM), works better than the static learning model, Support Vector Machine (SVM); (3) decision fusion helps improve affect recognition, indicating that body behaviors carry emotional information that is complementary rather than redundant to the emotional content in facial behaviors; and (4) it is possible to predict the context, i.e., whether a person is alone or in-a-group, using their non-verbal behavioral cues.
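To make the decision-fusion idea in finding (3) concrete, the sketch below shows one common way to combine per-modality predictions: averaging the class probabilities produced by separate face and body classifiers and taking the winning class. This is an illustrative example only, not the authors' implementation; the class labels, array shapes, and the equal-weighting scheme are assumptions.

```python
# Minimal sketch of decision-level fusion (not the paper's code):
# combine face-based and body-based affect predictions by a weighted
# average of their per-class probabilities, then pick the argmax.
import numpy as np

AROUSAL_CLASSES = ["low", "medium", "high"]  # hypothetical label set


def fuse_decisions(face_probs: np.ndarray,
                   body_probs: np.ndarray,
                   face_weight: float = 0.5) -> np.ndarray:
    """Weighted average of per-modality class probabilities.

    face_probs, body_probs: arrays of shape (n_samples, n_classes)
    holding each classifier's posterior estimates for the same samples.
    Returns the index of the fused winning class per sample.
    """
    fused = face_weight * face_probs + (1.0 - face_weight) * body_probs
    return fused.argmax(axis=1)


# Toy usage: two samples, three arousal classes.
face = np.array([[0.7, 0.2, 0.1],
                 [0.3, 0.4, 0.3]])
body = np.array([[0.5, 0.3, 0.2],
                 [0.1, 0.2, 0.7]])
labels = fuse_decisions(face, body)
print([AROUSAL_CLASSES[i] for i in labels])  # ['low', 'high']
```

If the two modalities are indeed complementary, as the abstract reports, fused predictions of this kind can outperform either single-modality classifier; the relative weight given to face versus body would in practice be tuned on validation data.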
Identifiers
External DOI: https://doi.org/10.1145/3321509
This record's URL: https://www.repository.cam.ac.uk/handle/1810/290132
Rights
Licence:
http://www.rioxx.net/licenses/all-rights-reserved