Multi-modal Human Behaviour Graph Representation Learning for Automatic Depression Assessment
Accepted version
Peer-reviewed
Abstract
Automatic depression assessment (ADA) often relies on crucial cues embedded in human verbal and non-verbal behaviours, which exist across the video, audio, and text modalities. Although these modalities typically take time-series form, current research offers limited exploration of the complex intra-modal temporal dynamics inherent to each modality, and thus fails to extract depression-related cues from a global view. While many methodologies attempt to exploit the multifaceted information encoded across modalities via decision-level or feature-level fusion techniques, they often fall short in effectively representing pairwise inter-modal relationships, which are key to exploiting the distinct complementary relationship between each modality pair. This paper presents a novel graph-based multimodal fusion approach that can conveniently model intra-modal and inter-modal dynamics using a graph representation. It adopts undirected edges to link not only temporally consecutive, pre-extracted features within each modality, but also temporally aligned features across each pair of modalities. This ensures the seamless propagation of global information along the temporal dimension and helps capture pairwise inter-modal dynamics. We conduct experiments on the E-DAIC dataset to demonstrate the effectiveness of our approach, achieving an RMSE of 4.80 and a CCC of 0.563, results that rival the top-performing method. We also evaluate on the AFAR-BSFP dataset to show the generality of our approach. Our code will be made publicly available.
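To make the graph construction described in the abstract concrete, the sketch below illustrates one possible way to build the edge structure in Python/PyTorch: intra-modal edges chain temporally consecutive features within each modality, and inter-modal edges connect temporally aligned features across each modality pair. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper name build_behaviour_graph, the equal sequence length T, the shared feature dimension, and the node ordering are all hypothetical choices.

```python
# Minimal sketch of the described edge structure (illustrative assumptions,
# not the authors' released code). Each modality contributes T nodes; intra-modal
# edges chain consecutive timesteps, inter-modal edges link aligned timesteps.
import itertools

import torch


def build_behaviour_graph(video: torch.Tensor, audio: torch.Tensor, text: torch.Tensor):
    """Build an undirected multimodal behaviour graph.

    Args:
        video, audio, text: pre-extracted feature sequences of shape (T, d),
            assumed temporally aligned and projected to a shared dimension d.

    Returns:
        x: node feature matrix of shape (3 * T, d).
        edge_index: tensor of shape (2, num_edges) listing both edge directions.
    """
    T = video.shape[0]
    offsets = {"video": 0, "audio": T, "text": 2 * T}
    edges = []

    # Intra-modal temporal edges: link consecutive features within each modality.
    for off in offsets.values():
        edges.extend((off + t, off + t + 1) for t in range(T - 1))

    # Inter-modal alignment edges: link temporally aligned features across each pair.
    for off_a, off_b in itertools.combinations(offsets.values(), 2):
        edges.extend((off_a + t, off_b + t) for t in range(T))

    # Undirected graph: store each edge in both directions.
    src, dst = zip(*edges)
    edge_index = torch.tensor([src + dst, dst + src], dtype=torch.long)

    # Stack node features in the same order as the index offsets above.
    x = torch.cat([video, audio, text], dim=0)
    return x, edge_index


# Example usage with random features (T = 100 timesteps, shared dimension 256).
x, edge_index = build_behaviour_graph(
    torch.randn(100, 256), torch.randn(100, 256), torch.randn(100, 256)
)
```

Under these assumptions, each modality forms a temporal chain of nodes, while the cross-modal alignment edges allow a graph neural network to propagate complementary cues between modality pairs at every timestep.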