Learning Person-Specific Cognition From Facial Reactions for Automatic Personality Recognition

This article proposes to recognise the true (self-reported) personality traits from the target subject's cognition simulated from facial reactions. This approach builds on the following two findings in cognitive science: (i) human cognition partially determines expressed behaviour and is directly linked to true personality traits; and (ii) in dyadic interactions, individuals’ nonverbal behaviours are influenced by their conversational partner's behaviours. In this context, we hypothesise that during a dyadic interaction, a target subject's facial reactions are driven by two main factors: their internal (person-specific) cognitive process, and the externalised nonverbal behaviours of their conversational partner. Consequently, we propose to represent the target subject's (defined as the listener) person-specific cognition in the form of a person-specific CNN architecture that has unique architectural parameters and depth, which takes audio-visual non-verbal cues displayed by the conversational partner (defined as the speaker) as input, and is able to reproduce the target subject's facial reactions. Each person-specific CNN is explored by the Neural Architecture Search (NAS) and a novel adaptive loss function, which is then represented as a graph representation for recognising the target subject's true personality. Experimental results not only show that the produced graph representations are well associated with target subjects’ personality traits in both human-human and human-machine interaction scenarios, and outperform the existing approaches with significant advantages, but also demonstrate that the proposed novel strategies help in learning more reliable personality representations.

Siyang Song, Zilong Shao, Shashank Jaiswal, Linlin Shen, Michel Valstar, and Hatice Gunes
Index Terms: True personality recognition, dyadic interaction, person-specific cognition simulation, facial reaction generation, end-to-end graph representation learning, multi-dimensional edge feature.
In some scenarios, the goal is to infer true personality from machine-detectable distal cues, i.e., Automatic Personality Recognition (APR) [23]. While Automatic Personality Perception (APP) approaches predict apparent personality (perception) based on proximal behavioural cues, APR aims to recognise the true personality that impacts the generation of distal behavioural cues. Thus, APP models that were trained as external observers to provide personality perceptions may not be reliable for recognising true personality traits (Problem 1). Moreover, the majority of these APP solutions [8], [19], [20], [24], [25] recognise personality traits from single frames or thin slices of behaviour independently, re-using clip-level personality labels as frame/thin slice-level labels to train ML models that provide a personality prediction for each frame/thin slice. This is problematic, as people with different personality traits may express very similar non-verbal audio-visual behaviours in a single frame or a thin slice. As a result, such strategies may pair the same input pattern with multiple labels during training, making it theoretically impossible to learn a good hypothesis (Problem 2). Although recent approaches [5], [6], [26] address this issue by modelling personality using an entire clip (i.e., recognising personality traits at the clip level), they only select a set of key frames to represent the entire clip. This may ignore the short-term behaviours displayed in the discarded frames (Problem 3), which may contain crucial cues for personality recognition.
In this paper, we propose a novel audio-visual automatic true personality recognition framework that addresses the problems highlighted above. It builds on the premise that true personality influences the cognitive process through which an individual externalises distal cues [23] (e.g., facial reactions). In particular, recent works [27], [28] show that in dyadic and group interactions, subjects' non-verbal behaviours (e.g., facial reactions) are influenced by, and therefore can be predicted from, the behaviours of their conversational partner(s). Therefore, this paper assumes that during a dyadic interaction, the target subject's (listener's) facial reactions are driven by two main factors: (i) the target subject's internal (person-specific) cognition, and (ii) the externalised non-verbal behaviours of the conversational partner (the speaker). Accordingly, we propose to learn a person-specific CNN for each subject, which reproduces the subject's facial reaction in response to the conversational partner. Consequently, the explored person-specific CNN can represent the target subject's cognitive process during facial reaction generation, which is well associated with the subject's true personality (addressing Problem 1). More importantly, each person-specific CNN is explored using the behaviours contained in all available frames of the target video (addressing Problem 3), and thus its architecture and parameters contain clip-level information, which is then encoded as a graph representing the target subject's personality. This allows the GNN-based personality model to be trained by pairing the clip-level representation with the clip-level personality labels (avoiding Problem 2). The pipeline of the proposed approach is illustrated in Fig. 1.

The main contributions of this paper are summarised as follows:
• We propose to use the simulated person-specific cognition of the target subject as the source descriptor to recognise the subject's true personality traits. To the best of our knowledge, this is the first audio-visual approach that uses a person-specific CNN architecture and weights to represent the target subject's cognition, and recognizes the true (self-reported) personality traits from the simulated cognition.
• We propose a novel audio-visual, non-invasive, person-specific cognition simulation strategy which automatically searches for an optimised multi-modal person-specific CNN for each subject to reproduce the subject's facial reactions. The explored person-specific CNN has a unique combination of layers (operations), weights and depth, and plays the role of the target subject's person-specific cognitive process in generating the unique, person-specific facial reactions.
• We propose a novel graph encoding strategy to parameterize the unique architecture and parameters of an explored CNN into a graph representation, where each CNN edge that contains a set of operations (convolution, pooling, etc.) is treated as a vertex, while the existence of edges between vertices in the graph is decided by the CNN's architecture.
• We propose a novel transformer-based feature learning strategy that learns a task-specific multi-dimensional edge feature for each pair of adjacent vertices in the graph representation. To the best of our knowledge, this is the first approach that learns multi-dimensional edge features for graphs representing convolutional neural network architectures.
• We conduct a set of experiments under both human-human and human-machine dyadic interaction settings, which not only validate the superior performance of the proposed approach in recognising true personality traits but also systematically demonstrate the influence of various internal (methodological) and external (subject demographic) factors on the proposed approach.

Compared with our earlier conference version [29], the extended journal version has the following additional contributions and novelties:

Methodologies: First, we introduce a depth searching strategy, allowing each person-specific CNN to have not only unique architectural parameters but also a unique depth. In addition to aligning all person-specific CNNs as graph representations with the same topology, as in the conference version, we further propose to encode CNNs of variable depths as heterogeneous graph representations that have different topologies. Second, we propose a novel end-to-end vertex feature learning strategy to encode task-specific vertex features from the corresponding OPs and LWs, replacing the hand-crafted vertex feature encoding strategy introduced in the conference version. Finally, we propose a novel transformer-based multi-dimensional edge feature learning strategy which employs attention operations to learn salient task-specific relationship cues between vertices.
Experiments: First, we have conducted additional ablation studies for different demographic groups. Second, we have added an experiment that compares the proposed approach (i.e., using the weights and architectural parameters of the explored person-specific CNN as the person-specific cognition representation) with a system that uses the personalized weights of a standard CNN architecture (ResNet) as the person-specific cognition representation. Third, we have conducted additional ablation studies to evaluate the new methodological contributions described above. Finally, we have conducted all experiments on an additional self-reported personality dataset that was collected under human-machine interaction scenarios.
Presentations: First, we have added a detailed overview with a set of formulations to explain the full pipeline of the proposed approach at the beginning of Section III. Second, we have added text and figures to explain the new methodological contributions and new experimental results described above. Third, we additionally provide the pseudocode of the VFE and EFE in the supplementary material. Finally, we provide the detailed settings of all reproduced baselines in the supplementary material.

II. RELATED WORK
This section first reviews previous audio-visual automatic personality analysis approaches in Section II-A. Then, it summarizes biological and psychological studies which found that personality can be reflected by human cognition, providing the theoretical basis for our work, i.e., recognising true personality traits from simulated human cognitive processes (Section II-B).

A. Audio-Visual Automatic Personality Analysis
Early audio-visual automatic personality analysis approaches usually extract hand-crafted features to describe audio-visual human non-verbal behaviours or interpersonal relationships between subjects, including low-level features such as Histogram of Oriented Gradients (HOG) [30] and Local Phase Quantization (LPQ) [31], and mid-level cues such as statistics of mid-level behaviour attributes [8], [32], facial attributes (gazes, head motions, etc.) [33], human posture and gesture cues while speaking [34], co-occurrent patterns of behaviours [35], body skeleton activity [21], Quantised Local Zernike Moments (QLZM) [9], visual focus of attention [21], etc. These hand-crafted features are then fed to traditional machine learning models such as a support vector machine regressor (SVR) or logistic regression to generate apparent personality predictions.
Due to recent advances in deep learning, most existing approaches employ Convolutional Neural Networks (CNNs) to learn task-specific deep features from each frame or a thin video slice. For example, Ventura et al. [7] propose a Descriptor Aggregation Network (DAN) to extract a frame-level feature at multiple spatial resolutions, and use such multi-level visual features to infer personality at the frame level. To learn both personality-related audio and visual cues, two-stream bi-modal networks are proposed in [20] and [19], which first learn frame-level audio-visual features and then combine them at the fully connected layer to provide a frame-level personality prediction. The video-level prediction is then obtained by averaging the predictions of all frames. Principi et al. [18] propose a multi-modal CNN to jointly learn audio and visual information from every image sequence (thin video slice) and audio segment. The extracted features are combined with attribute-specific models to predict personality traits.
Since personality trait models focus on evaluating the aspects of personality that are relatively stable over a long period of time for the target subject [36] (usually much longer than the duration of a single audio-visual clip), frame/thin slice-level behaviours may not be reliable in reflecting personality traits [36]. Consequently, approaches that model personality traits based on clip-level/long-term behaviours have also been investigated. One popular solution is to summarise the frame/thin slice-level features of an entire audio-visual clip into a global statistical descriptor to infer personality [34], [37], e.g., averaging all frame/thin slice-level vectors or using a histogram to represent frame/thin slice-level feature distributions. To consider important dynamic cues, Li et al. [5] first divide the video into 32 slices, and then randomly select a face image and a face-background image from each of them, which are then stacked as the clip-level stream. Subsequently, the video-level prediction is made from these selected frames. Beyan et al. [26] propose to generate multiple dynamic facial images [38], [39], [40] to represent each video segment and then choose the set of dynamic facial images with the highest spatiotemporal saliency as the key frames to construct the video-level representation.
In summary, while modelling personality traits at the frame/segment level is problematic, recent clip-level representations usually fail to utilise the full scale of the available information in the data, as they select a subset of frames or key frames to represent an entire video. To avoid these problems, Song et al. [10] propose a domain adaptation approach to learn a set of intermediate convolution layers from all available data as the person-specific representation for the target subject, which achieved a comparable performance to the state-of-the-art method [5]. However, similar to the approaches described above, it still directly infers apparent personality based on the subjects' observable behaviours. In other words, all the aforementioned studies focused on automatic personality perception analysis.

Fig. 2. The difference between the proposed approach (depicted in orange) and existing approaches (depicted in blue). While existing approaches attempt to directly use the target subject's external non-verbal behavioural data to predict personality perception, our approach learns to recognise true personality by modelling the target subject's internal person-specific cognition.

B. The Relationship Between Personality and Human Cognition
According to previous biological studies [11], [12], personality traits (e.g., Extraversion, Conscientiousness, and Neuroticism) are well associated with human brain structures [50] and activities, such as brain local volumes [51] and gray and white matter [52], which are key factors in deciding and controlling human cognitive processes. For example, Kumari et al. [14] investigated brain fMRI activity based on the "n-back" task, and found that brain responses during cognitive activities are related to the Extraversion and Neuroticism traits. Previous psychological studies have also frequently claimed that people's personality is well associated with their cognitive processes in various daily activities such as risk taking [53], creativity [13], [54], and music learning [13]. An exploratory factor analysis conducted by [13] shows that creativity and primary cognitive processes are correlated with the Extraversion and psychoticism (Neuroticism) traits. Importantly, the relationship between human cognition and personality is relatively stable: a longitudinal study conducted by Schaie et al. [55] showed that some of the personality-cognition relations could last for over 35 years. This finding suggests that human cognition can be a reliable and stable source for recognising personality.
As reviewed in Section II-A, the main difference between our approach and existing approaches (illustrated in Fig. 2) is that existing approaches attempt automatic personality perception directly from the observable non-verbal behaviours of the target subjects, with the ML model acting as an external observer. Instead, our approach draws inspiration from the aforementioned works on the interrelationship between personality and human cognition, and learns to recognise true personality by simulating and modelling target subjects' person-specific cognitive processes.

III. METHODOLOGY
The proposed approach recognises each subject's true personality traits based on three steps: (i) person-specific cognition (CNN) simulation (Section III-A); (ii) person-specific graph representation generation (Section III-B); and (iii) personality recognition based on the produced person-specific graph representation (Section III-C).
Person-Specific Cognition (CNN) Simulation: Our approach starts by simulating and modelling each target subject's (listener's) cognition, individually searching for an optimal person-specific multi-modal CNN $H_L$. The explored person-specific CNN is expected to accurately reproduce the listener's facial reactions $F_L$ in response to the conversational partner's (the speaker's) audio $A_S$ and facial behaviours $F_S$, i.e., the signals that the subject received (explained in Fig. 1(a)) in a dyadic interaction. Mathematically speaking, $H_L$ is obtained by:
$$H_L = \mathrm{NAS}(F_L, A_S, F_S) \tag{1}$$
where NAS denotes the DARTS-based neural architecture search algorithm. Here, $H_L$ is defined by its depth $D^{\mathrm{CNN}}_L$, operation parameters (OPs) $O_L$ and the layer weights (LWs) $W_L$ of its operations:
$$H_L = \{D^{\mathrm{CNN}}_L, O_L, W_L\} \tag{2}$$
In summary, we individually search for a person-specific CNN for each listener by considering the listener's facial reaction as well as the corresponding speaker's audio and facial behaviours. Specifically, in this stage, our goal is to adjust the person-specific CNN to fit the provided listener-speaker dyadic interaction data, so we search and validate the person-specific CNN on the same data. Representative loss curves for the search process are illustrated in Fig. 9.

Person-Specific Graph Representation Generation:
In this paper, we hypothesize that the well-explored CNN $H_L$ represents the person-specific cognition of the target listener, and thus $H_L$ is well associated with the listener's true personality. However, the explored CNN cannot be fed directly to an ML predictor for personality recognition, as no existing ML model can process it directly. Since each CNN can be described by a graph, in which sets of layers that contain parameters are treated as vertices and their connections are treated as edges, we parameterize $H_L$ into a learnable graph representation $G_L(V, E)$ as the corresponding listener's person-specific cognition representation for personality recognition:
$$G_L(V, E) = \mathrm{GE}(H_L) \tag{3}$$
where GE denotes the proposed graph encoding strategy (explained in Fig. 1(b)); $V$ and $E$ represent the vertices and edges of the graph representation $G_L$.

Personality Recognition: Finally, the produced graph representation $G_L$ is fed to a GNN model to recognise the target listener's true personality as:
$$P_L = \mathrm{GNN}(G_L(V, E)) \tag{4}$$
where $P_L$ represents the predicted five personality traits of the target listener.

A. Simulating Person-Specific Cognition
This section explains how we search for a person-specific multi-modal CNN that represents the target listener's cognition. Specifically, we introduce the input and target of the CNN (Section III-A1), the CNN settings that allow each person-specific CNN to accurately simulate the cognition of the target listener (Section III-A2), the loss function for searching and training person-specific CNNs (Section III-A3), and the optimization strategy for the architectural parameters (Section III-A4). The complexity analysis of the person-specific search is provided in the supplementary material, available online.
1) Input and Target: Previous findings [27], [28] suggest that during a dyadic interaction, the listener's facial reactions are driven by two main factors: (i) the listener's person-specific cognition, and (ii) the externalised non-verbal behaviours of the conversational partner (the speaker). Based on this, the person-specific CNN model $H_L$ that represents the cognition of the listener is explored to output the facial reactions $F_L$ of the listener when given the audio signal $A_S$ and facial behaviours $F_S$ of the speaker as the input. This can be formulated as:
$$F_L = H_L(A_S, F_S) \tag{5}$$
Once $H_L$ is obtained, it takes on the role of the corresponding listener's cognitive processor in generating facial reactions during the provided dyadic interaction. Consequently, the learnt $H_L$ is informative for modelling the listener's true personality traits, not only because true personality relates to the listener's cognition but also because true personality is a key factor in governing how non-verbal behaviours are generated and displayed by humans [23]. In this paper, we use the speaker's and listener's facial landmark sequences to represent the input and target facial movements, respectively, and we empirically set the sequence length to 80 frames (around 3 seconds). This duration is long enough to contain a complete facial behaviour/reaction that consists of multiple facial expressions [56], but not so long that it contains several reactions in response to multiple stimuli. The aligned facial landmarks are obtained for each frame using OpenFace 2.0 [57], and are then transformed based on a pre-defined mean face shape in order to keep only facial behaviours without the identity information (as suggested by [58]). Also, we use 64-bin log-mel spectra as the audio representation, where each audio frame is computed with a 40 ms Hanning window and a stride of 40 ms. This way, the number of audio frames for each video is the same as the number of video frames.
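The alignment between audio and video frames described above can be illustrated with a short sketch. It counts the audio frames produced by a 40 ms window with a 40 ms stride and cuts an aligned per-frame sequence into 80-frame segments. The 25 fps video rate, the non-overlapping segmentation and all function names are assumptions made for illustration; the text only states the window, the stride and the segment length.

```python
def num_audio_frames(duration_ms, window_ms=40, stride_ms=40):
    """Number of full analysis windows that fit in a clip of the given length."""
    if duration_ms < window_ms:
        return 0
    return (duration_ms - window_ms) // stride_ms + 1


def num_video_frames(duration_ms, fps=25):
    """Number of video frames at a constant frame rate (25 fps is an assumption)."""
    return duration_ms * fps // 1000


def split_into_segments(sequence, segment_len=80):
    """Cut a per-frame sequence into consecutive, non-overlapping segments.

    Trailing frames that do not fill a whole segment are dropped. Whether the
    original work uses overlapping segments is not stated, so this is a choice.
    """
    usable = len(sequence) - len(sequence) % segment_len
    return [sequence[i:i + segment_len] for i in range(0, usable, segment_len)]


if __name__ == "__main__":
    clip_ms = 10_000  # a hypothetical 10-second clip
    print(num_audio_frames(clip_ms), num_video_frames(clip_ms))
    frames = list(range(num_video_frames(clip_ms)))
    print(len(split_into_segments(frames)))
```

Under these assumptions one 80-frame segment spans 80 × 40 ms = 3.2 s, which matches the "around 3 seconds" stated above.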
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

2) Multi-Modal Cognitive Processor Model: Basic topology:
The basic topology of each person-specific CNN is inspired by the Model Human Processor (MHP) [59] (visualized in Fig. 3). It has a visual encoder and an audio encoder that simulate the human perceptual processor, an audio-visual decoder that simulates the human motor processor, and a fusion module that partially simulates the human cognitive processor, i.e., jointly processing audio-visual cues at multiple levels. Since each person-specific CNN takes audio-visual sequences of the speaker as input and outputs a facial reaction sequence of the listener during its optimisation, we also employ a Long Short-Term Memory network (LSTM) to process the latent feature sequence generated by the two encoders, which aims to simulate the human working memory module of the MHP. The basic topology of each person-specific CNN is also illustrated in Fig. 1(a).
Model Settings for Person-Specific Cognition Simulation: We follow settings similar to those of DARTS [60] and represent each module as a directed acyclic sub-graph that is made up of several cells. Each cell contains a set of nodes that represent latent features as well as a set of CNN edges, where each edge contains a set of operations defined by OPs and LWs. Specifically, in the proposed person-specific CNNs, a node $N_j$ represents a set of feature maps generated from its adjacent parent nodes $N_i, N_{i+1}, \ldots, N_{j-1}$ ($i < j$), where a pair of adjacent nodes ($N_i$ and $N_j$) is connected by a CNN edge $O_{i,j}$ which consists of a set of pre-defined operations $o^k_{i,j}$ (e.g., convolution, pooling, etc.): where each $o^k_{i,j}$ contains a set of layer weights (LWs) $w^k_{i,j}$ (e.g., kernel weights of a convolution layer). In particular, for those operations that do not have learnable LWs (e.g., pooling, identity mapping, etc.), we define their LWs as $w^k_{i,j} = \emptyset$. As a result, (6) can be re-written as: During the propagation, the feature maps in the node $N_j$ are produced from all of its adjacent parent nodes $N_i$ ($i < j$ and $\mathrm{adj}(i, j) = 1$) via all operations of the corresponding CNN edges $O_{i,j}$, which can be formulated as: where $\mathrm{adj}(i, j)$ denotes the connectivity between $N_i$ and $N_j$ (i.e., 0 denotes that $N_i$ and $N_j$ are not connected, while 1 denotes that they are connected). Here, each operation $o^k_{i,j}$ in $O_{i,j}$ is assigned an operation parameter (OP) $\alpha^k_{i,j}$ to represent its importance. This way, when feeding the feature maps contained in the node $N_i$ to a CNN edge $O_{i,j}$, the output can be represented as: This process is also illustrated in Fig. 4(a). To simulate uncertain and complex human cognitive processes of facial reactions, we first set the $n$th ($n > 2$) fusion cell to take four inputs: the outputs of the $n$th visual cell C Visual where is the concatenation operator. Consequently, the input audio and visual signals can be combined and jointly processed at multiple levels (illustrated in Fig. 1(a)). Second, in each cell, we set each node to connect to all of its previous nodes to represent all possible information flows, allowing the extracted features (nodes) to be potentially influenced by the information of multiple previous states (parent nodes) during the CNN propagation (illustrated in Fig. 4(b)). Third, we set each CNN edge to have a set of unique OPs and LWs rather than setting all cells to share the same set of OPs [60], [61]. Finally, since depth is also a key factor that impacts a CNN's cognitive process, we also search for a unique number of cells for the person-specific CNN of each target listener (illustrated in Fig. 1(a)). This way, the person-specific CNN $H_L$ that represents the target listener's cognition can be defined as: where $D^{\mathrm{AVF}}_L$ denotes the depth of the audio, visual and fusion modules, i.e., these three modules are set to have the same number of cells. In summary, compared with training a person-specific CNN with a fixed architecture for each subject [10], which only represents the person-specific cognition of the subject using a set of unique LWs, the person-specific CNN explored by our approach allows the subject's cognition to be represented by not only a set of unique LWs but also unique OPs and depth (i.e., the architecture of the CNN). In other words, the complex human cognition would theoretically be better represented by the CNN explored by our approach (evaluated in Section IV-D).
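The propagation through a cell described above can be sketched in a few lines: each CNN edge returns an importance-weighted sum of its candidate operations, and each node sums the outputs of the edges arriving from its connected parents. Normalising the operation parameters with a softmax follows the usual DARTS convention and is an assumption here, as are the function names and the toy scalar operations that stand in for convolution and pooling.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_edge(x, operations, alphas):
    """Output of one CNN edge: importance-weighted sum of its candidate operations."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, operations))


def node_value(j, node_values, edges):
    """Value of node j: sum of the edge outputs from every connected parent i < j.

    `edges` maps (i, j) to an (operations, alphas) pair; a missing key means
    adj(i, j) = 0.
    """
    total = 0.0
    for i in range(j):
        if (i, j) in edges:
            operations, alphas = edges[(i, j)]
            total += mixed_edge(node_values[i], operations, alphas)
    return total


if __name__ == "__main__":
    # Toy scalar stand-ins for real operations (convolution, identity, zero).
    ops = [lambda x: 2.0 * x, lambda x: x, lambda x: 0.0]
    edges = {(0, 2): (ops, [0.0, 0.0, 0.0]), (1, 2): (ops, [5.0, 0.0, 0.0])}
    print(node_value(2, {0: 1.0, 1: 3.0}, edges))
```

Because every edge keeps its own list of operation parameters, two edges built from the same candidate operations can still behave differently, which mirrors the choice of not sharing OPs across cells.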
Employed Operations (Search Space): In this paper, we pre-define $\upsilon = 5$ operations that have layer weights (LWs), and $\kappa = 5$ operations that do not have LWs, for each edge $O_{i,j}$. Here, all OPs and LWs of the target listener's person-specific CNN are denoted as $\alpha_L$ and $W_L$, respectively. The details of the person-specific CNN settings (e.g., cells, nodes and operations) are provided in Table I. As this is the first work that searches for person-specific CNNs to represent each listener's person-specific cognition as their personality representation, there is no previous study suggesting the optimal operations and search space. The main goal of this paper is to validate the concept that the personalized architecture and weights of the explored person-specific CNN can reflect the corresponding listener's self-reported personality traits, and we found that the standard deep learning operations (convolution, pooling, etc.) that have been frequently used in previous deep learning models (e.g., classification [62], segmentation [63], etc.) already allow most explored person-specific CNNs to accurately reproduce their target listeners' facial reactions. Consequently, we decided to define the search space based on these standard deep learning operations. Although more operations could be employed in the search space, this would further increase the computational cost of the search process. In future work, we will propose a more efficient person-specific CNN search strategy, and specifically investigate the optimal search space for representing personality-related person-specific cognition.
3) Adaptive Loss Function: To supervise the search process of each person-specific CNN (as described by (5)), we first highlight some important aspects of human psychology and behaviour patterns. First, facial reactions of similar emotions or intentions can be displayed through different facial spatio-temporal patterns, which is partially caused by differences in listeners' facial identities, response times and personalities. While differences in facial identities can be partially addressed by projecting the faces of different subjects to a mean face, there is always a time delay before a listener generates a facial reaction in response to the speaker's behaviours. This is because the execution of the corresponding cognitive processes takes some time [59]. Importantly, the duration of the time delay may vary not only across listeners but also for the same listener, depending upon other external factors.
In light of this, we introduce an adaptive factor $\tau$ to model this uncertainty. Let us define an audio-facial input $A_S(t_1, t_2)$ and $F_S(t_1, t_2)$ that represents the speaker's audio-facial non-verbal behaviours expressed from time $t_1$ to $t_2$. We propose the following adaptive loss (A-loss) function to measure the similarity between the predicted listener's facial reaction and the ground truth:
$$\mathcal{L}_{\text{A-loss}}(t_1, t_2, \tau) = \mathcal{L}\big(F^p_L(t_1, t_2), F^g_L(t_1 + \tau, t_2 + \tau)\big) = \sum_{i} \sum_{j} \min\big(\mathcal{L}(x^p_{i,j}, x^g_{i+\tau,j}) + \mathcal{L}(y^p_{i,j}, y^g_{i+\tau,j}),\ \varepsilon\big) \tag{12}$$
where $F^p_L(t_1, t_2)$ denotes the predicted landmarks of the listener's facial reaction corresponding to the input $A_S(t_1, t_2)$ and $F_S(t_1, t_2)$; $F^g_L(t_1 + \tau, t_2 + \tau)$ are the listener's real facial reaction landmarks induced by $A_S(t_1, t_2)$ and $F_S(t_1, t_2)$, where $\tau$ represents the time delay; $(x^p_{i,j}, y^p_{i,j})$ denotes the predicted coordinates of the $j$th facial landmark of the $i$th frame and $(x^g_{i+\tau,j}, y^g_{i+\tau,j})$ is the corresponding ground-truth coordinate. Specifically, $\varepsilon$ is a constant employed to avoid extremely large loss values caused by outliers (e.g., incorrectly detected face regions), which can lead to a misguided CNN search. $\mathcal{L}$ represents the similarity measurement between the prediction and the ground truth. In this paper, $\mathcal{L}$ is defined as the Mean Square Error (MSE).
To implement the proposed adaptive loss (i.e., computing a $\tau$ at each training iteration), in practice we use a sliding time-window to compare the predicted listener's facial reactions with a set of ground-truth candidates (the duration of the ground-truth candidates is longer than the time-window, as illustrated in the last section of Fig. 1(a)). Specifically, we set $R$ ground-truth candidates, i.e., $F^g_L(t_1 + r, t_2 + r)$, $r = 1, 2, \ldots, R$, and choose only the $r = \tau$ that gives the loss $L_{A\text{-}loss}(t_1, t_2, \tau)$ its lowest value:
$$\tau = \arg\min_{r \in \{1, \ldots, R\}} L_{A\text{-}loss}(t_1, t_2, r)$$
As a result, the delay period is automatically adapted for each listener at each training iteration.
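The sliding-window selection of the delay can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the loss is assumed to be a clipped squared error summed over frames and landmarks, and the delay is assumed to range from 0 to `max_delay` frames, as in the implementation details of Section IV-B.

```python
from typing import Optional, Sequence, Tuple

Frame = Sequence[Tuple[float, float]]  # one frame = (x, y) for each landmark


def clipped_landmark_loss(pred: Sequence[Frame], truth: Sequence[Frame], eps: float) -> float:
    """Sum over frames and landmarks of min(squared x error + squared y error, eps)."""
    total = 0.0
    for pred_frame, true_frame in zip(pred, truth):
        for (xp, yp), (xg, yg) in zip(pred_frame, true_frame):
            err = (xp - xg) ** 2 + (yp - yg) ** 2
            total += min(err, eps)  # eps caps the contribution of outliers
    return total


def adaptive_loss(pred: Sequence[Frame], candidates: Sequence[Frame],
                  max_delay: int, eps: float) -> Tuple[float, int]:
    """Slide the prediction over the longer ground truth; return (lowest loss, tau)."""
    window = len(pred)
    if len(candidates) < window + max_delay:
        raise ValueError("ground-truth candidates must cover window + max_delay frames")
    best_loss: Optional[float] = None
    best_tau = 0
    for r in range(max_delay + 1):
        loss = clipped_landmark_loss(pred, candidates[r:r + window], eps)
        if best_loss is None or loss < best_loss:
            best_loss, best_tau = loss, r
    assert best_loss is not None
    return best_loss, best_tau
```

With an 80-frame prediction, 105 candidate frames and `max_delay=25`, the loop compares 26 candidate windows and back-propagates only the loss of the best-aligned one.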

4) Person-Specific CNN Optimization:
To search for an optimal multi-modal CNN (described in Section III-A2) for each listener, we conduct a single-level optimization based on the continuous relaxation algorithm [60]. It adjusts all OPs, LWs and depths of the person-specific CNN at the same time. In comparison with the widely-used bi-level optimization strategy [60], which separately optimizes OPs on the validation set and LWs on the training set (i.e., freezing one while optimizing the other), the proposed single-level strategy allows the OPs, LWs and depths to be optimized simultaneously. This aims to replicate how human cognition operates, with all cognitive processes jointly activated during reaction generation; there is no evidence suggesting that some parts of the human cognitive processors are frozen during reaction generation. In addition, this strategy allows the OPs, LWs and depths to be optimized using all audio-facial frames instead of a sub-segment of them, i.e., the explored CNN is a clip-level representation that does not ignore any frames. The pseudocode of the proposed single-level optimization of OPs, LWs and depths is provided in Algorithm 1.
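To make the difference from bi-level optimization concrete, the toy sketch below updates architecture parameters (OPs) and layer weights (LWs) in the same step on the same data. It is an illustration under strong assumptions, not Algorithm 1: there is a single CNN edge with two hypothetical candidate operations (`w0 * x` and `w1 * x * x`), plain gradient descent replaces Adam, the learning rates are chosen for the toy problem, and depth is not searched.

```python
import math
from typing import List, Sequence, Tuple


def softmax(alpha: Sequence[float]) -> List[float]:
    m = max(alpha)
    exps = [math.exp(a - m) for a in alpha]
    s = sum(exps)
    return [e / s for e in exps]


def mixed_output(alpha, w, x):
    """Continuous relaxation: softmax(OPs)-weighted sum of the candidate operations."""
    p = softmax(alpha)
    ops = [w[0] * x, w[1] * x * x]
    out = sum(pk * ok for pk, ok in zip(p, ops))
    return out, p, ops


def mean_loss(alpha, w, data) -> float:
    return sum((mixed_output(alpha, w, x)[0] - y) ** 2 for x, y in data) / len(data)


def single_level_step(alpha, w, data, lr_alpha: float, lr_w: float):
    """One joint update: OPs and LWs both move, using gradients from the same batch."""
    g_alpha, g_w = [0.0, 0.0], [0.0, 0.0]
    for x, y in data:
        out, p, ops = mixed_output(alpha, w, x)
        d_out = 2.0 * (out - y) / len(data)
        basis = [x, x * x]  # derivative of each operation w.r.t. its own weight
        for k in range(2):
            g_w[k] += d_out * p[k] * basis[k]
            g_alpha[k] += d_out * p[k] * (ops[k] - out)  # softmax chain rule
    new_alpha = [a - lr_alpha * g for a, g in zip(alpha, g_alpha)]
    new_w = [v - lr_w * g for v, g in zip(w, g_w)]
    return new_alpha, new_w


def run_demo(steps: int = 500) -> Tuple[float, float]:
    data = [(x, 2.0 * x) for x in (-1.0, -0.5, 0.5, 1.0, 1.5)]
    alpha, w = [0.0, 0.0], [0.5, 0.5]
    initial = mean_loss(alpha, w, data)
    for _ in range(steps):
        alpha, w = single_level_step(alpha, w, data, lr_alpha=0.05, lr_w=0.05)
    return initial, mean_loss(alpha, w, data)
```

A bi-level variant would instead alternate two separate steps, updating `alpha` on held-out data with `w` frozen and then `w` on training data with `alpha` frozen.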

B. Graph Representation of the Person-Specific CNN
Let us recall that the main hypothesis of this paper is that if a CNN can reproduce the target subject's facial reactions, it represents the person-specific cognition of the subject, which is well associated with the subject's true personality. Consequently, we search for a person-specific CNN for each subject. Since the explored CNN is a directed acyclic graph, we encode each explored CNN as a graph $G(V, E)$ and treat it as the personality representation of the corresponding subject. This process is formulated in (3), where each graph representation $G$ is made up of a set of vertices $V$ and edges $E$. Specifically, we represent each CNN edge $O_{i,j}$ as a vertex $V_{i,j}$ in the corresponding graph representation $G(V, E)$. Meanwhile, the edge presence $A_{i,j,m}$ between $V_{i,j}$ and $V_{j,m}$ in $G(V, E)$ is decided by the relationship between their corresponding CNN edges $O_{i,j}$ and $O_{j,m}$. If $A_{i,j,m} = 1$, the edge feature $E_{i,j,m}$ in $G(V, E)$ is obtained by considering both vertex features $V_{i,j}$ and $V_{j,m}$. These can be formulated as:
$$V_{i,j} = \mathrm{VFE}(O_{i,j}), \qquad E_{i,j,m} = \mathrm{EFE}(V_{i,j}, V_{j,m}) \ \ \text{if } A_{i,j,m} = 1$$
where VFE denotes the proposed vertex feature encoding strategy described in Section III-B1, and EFE denotes the proposed edge feature encoding strategy described in Section III-B2. We additionally provide the pseudocode of the VFE and EFE in the supplementary material, available online.
1) Vertex Feature Encoding: Given a CNN edge $O_{i,j}$, we categorize all its operations into two parts: $\upsilon$ operations $o^w_{i,j}$ that have LWs (e.g., convolution) and $\kappa$ operations $o^n_{i,j}$ that do not have LWs (e.g., pooling). Then, we propose a vertex feature encoding (VFE) strategy to obtain the corresponding vertex $V_{i,j}$ of the graph representation as follows:
Step 1 (LWs alignment): We first note that the number of LWs differs across CNN edges (ranging from hundreds to tens of thousands in our study) because the edges have different numbers of input and output feature maps. Consequently, we follow the idea of [64] and select a fixed number of the most representative weights from each operation $o^k_{i,j}$, which are denoted as $S\omega^k_{i,j}$. For example, we choose the weights of the five kernels with the highest L1 values (sum of absolute weights) from each convolution operation. This way, the LWs representation $LW_{i,j}$ of the CNN edge $O_{i,j}$ (having $K$ operations) is denoted as:
$$LW_{i,j} = [S\omega^1_{i,j}, S\omega^2_{i,j}, \ldots, S\omega^K_{i,j}]$$
where $LW_{i,j}$ has a fixed dimension for all CNN edges.
Step 2 (Fusion of OPs and LWs): Since each OP $\alpha^k_{i,j}$ reflects the importance of the operation $o^k_{i,j}$ (as well as its LWs $\omega^k_{i,j}$), we use OPs to weight the corresponding LWs. For operations that have LWs, their original OPs $\alpha^w_{i,j}$ are projected to an OP-LW weighting vector $\text{OP-LW}_{i,j}$ that has the same dimension as the LWs representation $LW_{i,j}$:
$$\text{OP-LW}_{i,j} = \mathrm{VEN}(\alpha^w_{i,j})$$
where VEN is a Multi-Layer Perceptron (MLP) that has two hidden layers. Then, the OPs and LWs of the operations that have LWs are combined by computing the dot product between the weighting vector $\text{OP-LW}_{i,j}$ and the LWs representation $LW_{i,j}$, which can be denoted as:
$$V^w_{i,j} = \text{OP-LW}_{i,j} \cdot LW_{i,j}$$
Step 3 (Vertex feature generation): Finally, we concatenate the obtained $V^w_{i,j}$ with the OPs $\alpha^n_{i,j}$ of the $\kappa$ operations that do not have LWs to form the final vertex feature:
$$V_{i,j} = [V^w_{i,j}, \alpha^n_{i,j}]$$
Since the dimensions of both $\alpha^n_{i,j}$ and $V^w_{i,j}$ are fixed, all vertex features have the same dimension. This process is illustrated in Fig. 1(b) and depicted in purple.
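The three VFE steps can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' code: kernels are given as flattened lists, the learned VEN is replaced by a hypothetical `repeat_projection` that simply repeats each OP over the weights of its own operation (which corresponds to the hand-crafted OP-LW (W) variant rather than the learned one), and the fusion is taken to be an element-wise product so that the result stays a vector.

```python
from typing import Callable, List, Sequence

Kernel = Sequence[float]  # one flattened convolution kernel


def top_k_kernels_by_l1(kernels: Sequence[Kernel], k: int = 5) -> List[float]:
    """Step 1: keep the k kernels with the largest L1 norm and flatten them."""
    ranked = sorted(kernels, key=lambda ker: sum(abs(v) for v in ker), reverse=True)
    return [v for ker in ranked[:k] for v in ker]


def repeat_projection(alpha_w: Sequence[float], dim: int) -> List[float]:
    """Hypothetical stand-in for VEN: repeat each OP over its operation's weights."""
    if dim % len(alpha_w) != 0:
        raise ValueError("dim must be a multiple of the number of weighted operations")
    block = dim // len(alpha_w)
    return [a for a in alpha_w for _ in range(block)]


def vertex_feature(ops_kernels: Sequence[Sequence[Kernel]],
                   alpha_w: Sequence[float],
                   alpha_n: Sequence[float],
                   project: Callable[[Sequence[float], int], List[float]] = repeat_projection,
                   k: int = 5) -> List[float]:
    # Step 1: fixed-size LWs representation of the CNN edge.
    lw: List[float] = []
    for kernels in ops_kernels:
        lw.extend(top_k_kernels_by_l1(kernels, k))
    # Step 2: project OPs to a weighting vector of the same size and fuse.
    weighting = project(alpha_w, len(lw))
    if len(weighting) != len(lw):
        raise ValueError("weighting vector and LWs representation must match in size")
    v_w = [a * b for a, b in zip(weighting, lw)]
    # Step 3: append the OPs of the operations that have no LWs.
    return v_w + list(alpha_n)
```

In the learned setting, `project` would be replaced by a trainable MLP so that the weighting vector is optimized jointly with the personality recognition model.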
2) Edge Feature Encoding: For a graph representation $G(V, E)$, we define a pair of vertices $V_{i,j}$ and $V_{j,m}$ as connected (i.e., the edge $E_{i,j,m}$ exists, $A_{i,j,m} = 1$) if their corresponding CNN edges $O_{i,j}$ and $O_{j,m}$ are connected to the same node $N_j$ in the CNN (illustrated in Fig. 1(b)). Most existing approaches use only a single binary value (0 or 1) to define the relationship between a pair of vertices in a graph, but such a single-value binary edge feature usually fails to describe all task-related relationship cues, as the relationship between vertices can sometimes be described by multiple attributes. In contrast, we aim to produce a person-specific graph representation that encodes not only the parameters (contained in the vertex features) and the architecture (encoded as the graph topology) of the person-specific CNN, but also the underlying relationships between CNN edges, which may provide additional personality-related cues.
To this end, we propose a novel multi-dimensional edge feature encoding strategy that represents the relationship (edge) between each pair of connected vertices as a multi-dimensional vector. In particular, we propose to produce the edge feature $E_{i,j,m}$ directly from the obtained vertices $V_{i,j}$ and $V_{j,m}$:
$$E_{i,j,m} = \mathrm{ERN}(V_{i,j}, V_{j,m})$$
where ERN is an attention-based edge relationship network. It takes a pair of vertex features $V_{i,j}$ and $V_{j,m}$ as input, and outputs an edge feature $E_{i,j,m}$ that contains the task-specific relationship features related to both $V_{i,j}$ and $V_{j,m}$. The detailed process of the ERN is illustrated in Fig. 6.
It should be noted that both the ERN and the VEN are jointly trained with the personality recognition model in an end-to-end manner. This way, they learn to generate personality-related vertex and edge features from the explored person-specific CNN's parameters and architecture.
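The connectivity rule of Section III-B2 (a graph edge exists between two vertices when their CNN edges meet at a shared node) can be sketched as below. This is only an illustration of the topology construction; the function name and the tuple encoding of CNN edges are assumptions, and the edge features produced by the ERN are not modelled.

```python
from typing import Dict, List, Sequence, Tuple

CnnEdge = Tuple[int, int]  # (i, j): the operations connecting CNN node i to node j


def build_graph_adjacency(cnn_edges: Sequence[CnnEdge]) -> Dict[CnnEdge, List[CnnEdge]]:
    """Each CNN edge O(i, j) becomes a vertex; V(i, j) links to every V(j, m)."""
    adjacency: Dict[CnnEdge, List[CnnEdge]] = {edge: [] for edge in cnn_edges}
    for (i, j) in cnn_edges:
        for (a, m) in cnn_edges:
            if a == j:  # both CNN edges are attached to the same node N_j
                adjacency[(i, j)].append((a, m))
    return adjacency


# Example: a small cell with nodes 0..3.
example = build_graph_adjacency([(0, 1), (0, 2), (1, 2), (2, 3)])
```

Here `example[(0, 1)]` contains only `(1, 2)`, because node 1 is the only node shared by the two CNN edges.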

C. Personality Recognition Model
In this paper, each subject's personality traits are recognised from the graph representation of the subject's person-specific CNN. We formulate personality recognition as a multi-task graph regression problem (jointly recognising five traits). In particular, we employ the residual gated graph convolution neural network (residual GatedGCN) [65] provided by [66] as the personality recognition model to process the produced graph representations, as it is a state-of-the-art GNN model that can process heterogeneous graphs and graphs containing multi-dimensional edge features. We empirically employ a network that consists of six GatedGCN layers (the detailed settings are provided in the supplementary material, available online). Then, two fully connected (FC) layers are attached to the last GatedGCN layer to combine all produced vertex features, where each FC layer is followed by a ReLU activation and dropout (0.3). The size of the output layer is set to 5 to jointly recognise the five personality traits of Extraversion (Ext), Agreeableness (Agr), Openness (Ope), Conscientiousness (Con), and Neuroticism (Neu).
Fig. 6. Detailed process of the ERN. Step 1: learning a pair of 1D representations $F^{conv}_{i,j}$ and $F^{conv}_{j,m}$ from a pair of connected vertex features $V_{i,j}$ and $V_{j,m}$; Step 2: generating cross-vertex attention maps $A^{cross}_{i,j}$ and $A^{cross}_{j,m}$, where $A^{cross}_{i,j}$ emphasizes the part of $F^{conv}_{i,j}$'s information that is correlated with $F^{conv}_{j,m}$, while $A^{cross}_{j,m}$ highlights the part of $F^{conv}_{j,m}$'s information that is correlated with $F^{conv}_{i,j}$; Step 3: generating weighted features $F^{cross}_{i,j}$ and $F^{cross}_{j,m}$; Step 4: concatenating $F^{cross}_{i,j}$ and $F^{cross}_{j,m}$, and producing self-attention maps; and Step 5: generating the final edge feature $E_{i,j,m}$. In this figure, the K, Q, V depicted in purple represent the 'key', 'query' and 'value' of the attention operation.

IV. EXPERIMENTS
This paper evaluates the proposed approach on a human-human dyadic interaction dataset and a human-machine dyadic interaction dataset, which are described in Section IV-A. We then present the implementation details in Section IV-B, followed by the evaluation metrics in Section IV-C. In Section IV-D, we compare the personality recognition performance achieved by the proposed approach with that of existing personality recognition approaches that directly infer personality from subjects' external behaviours, showing the advantages of the proposed strategy, which recognises personality traits from the simulated human cognition. Finally, in Section IV-E we examine the influence of different CNN searching and graph representation encoding settings on personality recognition, where we also specifically compare the person-specific CNNs explored using NAS with person-specific CNNs that have a fixed architecture. Additionally, we conduct a set of experiments to systematically evaluate the sensitivity of our approach to different demographic groups (provided in the supplementary material, available online).

A. Datasets
In this paper, we evaluate our approach in both human-human and human-machine dyadic interaction scenarios. While many existing datasets [41], [42], [43], [44] were built for personality perception prediction studies, some publicly available datasets [1], [46], [47], [48], [67] can also be used for audio-visual true personality recognition studies. However, most of these datasets are not suitable for our study, as our models assume dyadic interaction.
Human-Human Interaction: The NoXi dataset [46] is a multi-lingual human-human dyadic interaction dataset that was designed to generate spontaneous interactions with emphasis on adaptive behaviours in unexpected situations. It consists of 84 sessions in which one participant acts as an Expert and the other acts as a Novice, interacting on a chosen topic of expertise via video conference. The participants were allowed to continue the conversation until it reached a natural end. During the interaction, participants could interrupt each other to change the topic or to induce a mild debate whenever possible. This dataset contains 84 pairs of audio-visual clips (168 clips in total from 89 participants), with participants' ages ranging from 21 to 50 years old. The average and standard deviation of the clips' duration are 18m6s and 6m28s, respectively. All participants provided self-assessments of their Big-Five personality traits using Saucier's Mini-Markers [68].
Human-Machine Interaction: We conducted the human-machine experiments on the Virtual Human Questionnaire (VHQ) database [1]. The VHQ database consists of 165 videos collected from 55 participants, where each participant completed 3 questionnaire interview sessions. During each session, participants were asked to answer a set of questions verbally based on one of three questionnaires: BFI-10 [69], PHQ-9 [70] or GAD-7 [71]. In this database, 55 videos (corresponding to 55 subjects) were recorded under the human-machine dyadic interaction mode. More specifically, a virtual human agent interviewer (Fig. 7), implemented using the ARIA-VALUSPA Platform [72], was projected directly in front of the participant and asked the questions. The self-reported labels of the Big-Five personality traits were obtained by asking participants to fill in the BFI-44 questionnaire online.

B. Implementation Details
Person-Specific CNN Settings: For subjects in the NoXi dataset, the multi-modal person-specific CNNs have 6 cells (a pre-defined convolution cell, 3 down-sampling cells and 2 regular cells) for each encoder and 5 cells (2 regular cells and 3 up-sampling cells) for each decoder. The employed LSTMs have 3 hidden layers. During the neural architecture search, the input speaker's audio-visual signal lasts for 80 frames, the listener's candidate ground-truth consists of 105 frames, and the delay factor $r$ ranges from 0 to 25 frames, i.e., 80 consecutive frames are selected as the final reaction. Since the audio data in the VHQ dataset is very noisy, we only search for a single-modal person-specific CNN for each subject, which takes the virtual human's facial landmarks as the input and aims to reproduce the target subject's facial reactions. We noticed that the virtual human only spoke a set of pre-defined sentences during the interaction. Thus, we also categorized all sentences into 4 classes: depression-related questions, anxiety-related questions, personality-related questions, and other sentences (e.g., the virtual human asks the subject to repeat the answer), and then encoded each class as a one-hot vector (e.g., 1000, 0100, 0010, and 0001). Consequently, the multi-modal CNNs explored on the VHQ dataset are implemented by concatenating the deep-learned virtual human face feature with the proposed sentence categorical feature at the last FC layer of the encoder.
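The sentence categorical feature used on the VHQ dataset amounts to a four-way one-hot code appended to the face feature. The snippet below is a small sketch of that idea; the class labels, function names and the plain-list feature representation are illustrative assumptions, and the real concatenation happens at the last FC layer of the encoder.

```python
from typing import List, Sequence

# Order follows the text: depression-, anxiety-, personality-related, other.
SENTENCE_CLASSES = ("depression", "anxiety", "personality", "other")


def one_hot_sentence(category: str) -> List[float]:
    """Encode a sentence class as a one-hot vector, e.g. 'anxiety' -> 0100."""
    if category not in SENTENCE_CLASSES:
        raise ValueError(f"unknown sentence class: {category!r}")
    return [1.0 if c == category else 0.0 for c in SENTENCE_CLASSES]


def fuse_face_and_sentence(face_feature: Sequence[float], category: str) -> List[float]:
    """Concatenate the learned face feature with the sentence categorical feature."""
    return list(face_feature) + one_hot_sentence(category)
```

For a face feature of dimension d, the fused feature therefore has dimension d + 4.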
Neural Architecture Search: In this paper, all person-specific CNNs for each dataset have the same initial architecture and parameters, where OPs and LWs are initialized with the Xavier strategy [73]. Meanwhile, we used the same training strategy to obtain all person-specific CNNs in each dataset. In particular, during the search for each person-specific CNN, we fed facial landmark sequences to the CNN in the order of their time stamps in the corresponding video, i.e., from the beginning of the video to its end. This not only ensures that the OPs and LWs of the CNN always converge (during searching) to the same set of values for a particular video, but also ensures that the difference between individually explored person-specific CNNs is influenced only by person-specific reactions rather than by the initialization of weights or the order in which the frames are used for searching. During the search, the batch size was set to 60 audio-visual clips, while two Adam optimizers were independently used to jointly adjust OPs and LWs, with learning rates of 0.05 and 0.001, respectively.
Personality Model Training Details: In this paper, we conduct a 12-fold subject-independent cross-validation on the NoXi dataset. For each fold, 154 videos were used for training and hyperparameter optimisation and 14 videos were used for testing (each subject appeared in either the training or the test set, not both). Due to the limited amount of data, we conduct a leave-one-subject-out cross-validation on the VHQ dataset. For each fold, 54 videos were used for training and hyperparameter optimisation, and the remaining video was used for testing. For both the NoXi and VHQ datasets, we report the accuracy on the test sets averaged over all folds. All experiments were conducted on the PyTorch platform using Nvidia V100 GPUs.
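A subject-independent split assigns subjects, not videos, to folds, so that no subject contributes to both training and testing. The following sketch shows one simple way to do this; the round-robin assignment and the function name are assumptions, and it does not reproduce the exact fold composition reported above. Setting `n_folds` to the number of subjects gives leave-one-subject-out cross-validation.

```python
from typing import Hashable, List, Sequence, Tuple


def subject_independent_folds(video_subjects: Sequence[Hashable],
                              n_folds: int) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) per fold; entry v is video v's subject."""
    subjects = sorted(set(video_subjects), key=str)
    if n_folds < 2 or n_folds > len(subjects):
        raise ValueError("n_folds must be between 2 and the number of subjects")
    # Round-robin assignment of whole subjects to folds (an illustrative choice).
    fold_of = {s: k % n_folds for k, s in enumerate(subjects)}
    folds = []
    for f in range(n_folds):
        test = [v for v, s in enumerate(video_subjects) if fold_of[s] == f]
        train = [v for v, s in enumerate(video_subjects) if fold_of[s] != f]
        folds.append((train, test))
    return folds
```

Because every video of a subject lands in the same fold, a subject who appears in several sessions is never split between the training and test sets.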

C. Evaluation Metrics
Two common metrics are used to evaluate the personality recognition performance: the Pearson Correlation Coefficient (PCC) and the mean accuracy measurement (ACC), which have been adopted in relevant challenge events (e.g., the ChaLearn challenge [41]).
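For reference, both metrics can be computed as in the sketch below. The PCC is the standard sample correlation; for ACC, the code assumes the definition used in the ChaLearn First Impressions challenge, i.e., one minus the mean absolute error between predictions and labels, with trait scores normalised to [0, 1]. The function names are illustrative.

```python
import math
from typing import Sequence


def pearson_correlation(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Pearson Correlation Coefficient between predictions and self-reported scores."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(truth) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, truth))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    if var_p == 0.0 or var_t == 0.0:
        raise ValueError("PCC is undefined when one of the inputs is constant")
    return cov / math.sqrt(var_p * var_t)


def mean_accuracy(pred: Sequence[float], truth: Sequence[float]) -> float:
    """ACC, assumed here to be 1 - mean absolute error on scores scaled to [0, 1]."""
    return 1.0 - sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)
```

The two metrics are complementary: ACC rewards predictions that are close to the labels on average, while PCC rewards predictions that preserve the ranking of subjects along a trait.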

D. Comparison to Existing Approaches
To compare the proposed approach with other video-based automatic personality analysis solutions, we reproduced four existing personality computing approaches that have been reported on the ChaLearn dataset [41], namely DCC [20], NJU-LAMDA [19], CR-Net [5], and PALs [10], as well as the spectral representation [74], [75]. The detailed reproduction settings are provided in the supplementary material, available online.
Tables II and III compare the variants of the proposed models with the existing state-of-the-art audio-visual personality analysis approaches (automatic personality perception (APP) solutions) on the NoXi and VHQ datasets. Table IV compares the results achieved by our best systems (graph representations of multi-modal CNNs that are learned using the adaptive loss and independent parameter settings, and constructed using the end-to-end vertex and edge feature learning strategy) with the results achieved by other methods under both interaction scenarios, with statistical significance highlighted. It can be observed that, for both datasets, the predictions produced by the graph representations of the explored multi-modal CNNs are positively correlated with all self-reported personality traits. Specifically, these graph representations achieved PCC > 0.37 for the Con, Ext, and Neu traits on the NoXi dataset, which shows significant advantages over the other listed methods. Meanwhile, the graph representations of the explored multi-modal model (A-MModal (M)) also achieved the best average ACC result and the second best PCC result in recognising the self-reported personality traits under human-computer interaction scenarios; in particular, it generated the best PCC result for the Neu trait (0.363), a relative improvement of more than 8% over the second best method [10]. In addition, we also train a CRNet-based baseline (M-CRNet) that takes the audio-visual signals of both the listener and the speaker to predict the listener's personality. While the proposed approach and M-CRNet both use the listener's and the speaker's data to predict the listener's true personality traits, the results demonstrate that using the representation of the simulated person-specific cognition still provides significant advantages over directly extracting features from external behaviours in recognising all five true personality traits.
It can also be observed from Table V that the explored person-specific CNNs can accurately predict the corresponding target subjects' facial reactions, as evidenced by the promising PCC results (more than 0.75 for all systems) and RMSE results of the facial reaction (facial landmark) generation. This indicates that the simulated person-specific cognition can reproduce similar facial reactions for the majority of frames in the target subject's video. In other words, the explored person-specific CNN can accurately represent the target subject's cognition over time (a video's duration). In contrast, these person-specific CNNs perform poorly in reproducing other subjects' facial reactions, with the PCC between the reproduced reaction landmarks and the ground-truth being less than 0.1, which means that each person-specific CNN cannot reflect other subjects' cognition for facial reactions (i.e., the simulated cognition of the target subject is different from others' cognition). As a result, we assume that the proposed approach can partially encode the target subject's person-specific cognitive process, which is stable over time but different from that of other subjects.
Discussion: In summary, the results presented above indicate that, despite CNNs and humans having different cognitive mechanisms, if a CNN can simulate a subject's cognitive process for generating facial reactions, this CNN's architectural parameters are positively associated with the subject's self-reported personality traits. Compared to existing solutions that directly predict personality traits from the listener's non-verbal behaviours, the proposed approach, which recognises self-reported personality traits from the simulated cognition, seems a more reliable solution. The performance in recognising the Neu and Ext traits is clearly better than for the other traits, which is consistent with what has been frequently claimed by previous studies [14], [50], i.e., the Ext and Neu traits are well associated with human cognition. We also observed that the approaches which predict personality using video-level features (e.g., CR-Net, PALs, Spectral and the proposed approach) have clear advantages over the approaches that infer personality from a single frame or a thin slice (DCC and NJU-LAMDA), demonstrating that long-term information is more reliable for modelling self-reported personality traits. It should be noted that the advantage of our approach in human-machine interaction scenarios is not as clear as in human-human interaction scenarios. This can be explained by the fact that the virtual human used in the VHQ dataset has only limited non-verbal facial behaviours, which are not as rich as those of the real human speakers in the NoXi dataset. Thus, the listeners' facial reactions in human-machine interaction scenarios may be less correlated with the non-verbal behaviours expressed by the virtual human.

E. Ablation Studies
In this section, we explicitly investigate the sensitivity of the proposed approach to different NAS settings (i.e., modality, parameter sharing strategy, loss function and topology alignment) and graph representation learning settings. The statistical significance testing results achieved by the best system and the second best system in each ablation study, as well as the detailed settings for each 'second best system', are provided in the supplementary material, available online.
We first demonstrate the importance of: (i) applying NAS to obtain a unique architecture and parameters for each person-specific CNN; and (ii) encoding person-specific CNNs as graph representations in Fig. 8. Specifically, the system 'NAS+MLP' is obtained by simply concatenating all OPs and LWs of the person-specific CNN as a vector, whose dimension is reduced by CFS [76]. Meanwhile, the system 'Unet+MLP' is obtained using the same strategy as the 'NAS+MLP' system but without NAS, i.e., all person-specific Unets of the system 'Unet+MLP' have the same, fixed architecture. Each person-specific Unet consists of an audio encoder, a visual encoder, a fusion module and a decoder, and each module is made up of a set of ResNet blocks. First, the predictions of all three NAS-based models achieved positive correlations across all traits under both interaction scenarios, demonstrating that the explored person-specific CNNs are indeed positively associated with the target subjects' personality traits. Then, it can be observed from the figure that encoding each person-specific CNN as a graph representation is superior to simply concatenating all OPs and LWs of the person-specific CNN as a vector. This validates that the proposed graph representation is superior to the raw architectures and parameters of the CNN. Finally, person-specific CNNs explored by NAS (NAS+MLP) generated better results than those of Unet-based person-specific CNNs (Unet+MLP), which shows that the CNNs explored by NAS can better simulate personality-related cognition for each subject. This is also evidenced by the better facial reaction generation performance displayed in Tables V and VI. We attribute these results to the fact that the person-specific CNNs explored by NAS have not only unique weights but also unique architectures, which would theoretically give them a better capability to fit complex human cognition. In addition, we provide example neural architecture search loss curves in Fig. 9, showing that despite the limited number of frames in each pair of speaker and listener videos, the person-specific CNN can still be well explored to fit the person-specific facial reactions of the given video, i.e., the training losses converge well.
Modalities: First, it can be observed from the results on the NoXi dataset that the predictions generated by all settings are positively correlated with the self-reported values across all five traits. In general, the multi-modal system achieved better results than the single-modal systems for both the personality recognition and facial reaction generation tasks. This demonstrates that both the non-verbal audio and facial behaviours of the speaker contribute to the listener's facial reactions, where each of them contains some unique aspects in forming reactions. In other words, the person-specific cognition triggered by each modality provides unique and useful clues for personality recognition. Meanwhile, as we can see from the results on the VHQ dataset, even simply adding the encoded virtual human sentence categorical feature (explained in Section IV-B) can improve the recognition performance for the Ope, Con and Neu traits. In comparison to a real human speaker, the facial behaviours of the virtual human may not be the key factor that triggers the listener's facial reactions, and thus the explored person-specific CNNs may not learn good hypotheses of the listeners' cognitive processes. However, the sentence categorical feature provides context information, which offers a controlled condition for a subject's reaction as well as strong supervision for the search of person-specific CNNs. Thus, adding the sentence categorical feature as an extra modality improves the recognition of most traits.
Parameter Sharing Strategies: For the results achieved on both datasets, it is clear that the graph representations of person-specific CNNs explored by the independent parameter (IP) strategy have clear advantages over those achieved by the widely-used parameter sharing (PS) strategy [60], [61] across all five traits. These results validate our assumption that human cognition consists of a set of cognitive processes, each of which undertakes a unique function and can be different from the others. Therefore, each part of the explored CNN should also have its own weights to better simulate a unique cognitive process/function.
Loss Function Settings: As we can see from the results on the NoXi dataset, although most of our systems trained with the standard MSE loss already achieved good performance in recognising the Con, Ext and Neu traits, models trained using the proposed adaptive loss provided further improvements. Specifically, the use of the adaptive loss still brought more than 5.8% average improvement for the Con, Ext and Neu traits, while it achieved similar results in recognising the Ope and Agr traits, with no significant differences. It can be observed from the results on human-machine interactions that the graph representations of person-specific CNNs trained with the adaptive loss better recognised all five traits. Meanwhile, the systems that used the adaptive loss generated better facial reaction results. Since the Neu and Ext traits can be better reflected by human cognition [14], [50], we hypothesize that the proposed adaptive loss can partially address the uncertainty of subjects' response times, allowing the explored CNNs to better simulate the target subjects' facial reaction-related cognition, which is well associated with the Con, Ext and Neu traits.
Depth Settings: We also evaluate the influence of depth settings on person-specific cognition simulation and personality recognition. It can be seen from Tables V and VI that the person-specific CNNs with their unique depths do not show clear advantages in reproducing listeners' facial reactions. These results suggest that even CNNs that were searched using the same depth have a comparable or even better capability to represent the target subjects' cognition. Moreover, as we can see from Fig. 10(a) and (b), the personality recognition results achieved by the heterogeneous graph representations of person-specific CNNs that have various depths are not as good as those of the isomorphic graph representations of person-specific CNNs that have the same depth. This may indicate that differences in topology cannot reflect differences in personality. In addition, the topologies of the heterogeneous graph representations vary considerably, which makes the training process of the corresponding GCNs more difficult.
2) Graph Representation: In this section, we demonstrate the advantages of the proposed end-to-end vertex feature and edge feature learning strategy for constructing graph representations in Fig. 11. Our best settings are: vertex features learned by the strategy proposed in Section III-B1 (denoted as OP-LW (VEN)) and edge features learned by ERNs.
Vertex Feature Settings: We first compare the proposed deep-learned vertex feature (OP-LW (VEN)) with four hand-crafted vertex features: 1. the OP feature: a vector that concatenates all OPs of a CNN edge; 2. the LW representation; 3. the OP-LW (C) feature: a vector that concatenates all OPs and the LW representation of a CNN edge; 4. the OP-LW (W) feature: a vector obtained by concatenating the OPs that do not have LWs with a weighted vector produced by multiplying LWs with their corresponding OPs. It can be seen that graph representations built on the LW and OP-LW (C) vertex features have a similar capability for recognising true personality traits, and both outperformed the graph representations that only use OPs as the vertex feature. This can be explained by the fact that the OP vertex feature ignores all LWs, which are crucial in deciding CNNs' generalization capabilities. We can therefore conclude that personality-related cues reside in both OPs and their LWs. Meanwhile, the OP-LW (C) feature does not show a clear advantage over the LW feature, and its performance also falls short of the OP-LW (W) and OP-LW (VEN) features, demonstrating that simply concatenating OPs and LWs is not a proper way to combine their clues. In other words, the best recognition results for all five traits are achieved by either the OP-LW (VEN) feature or the OP-LW (W) vertex feature. As a result, we conclude that using each OP to weight the corresponding LWs is a better way to combine OPs and LWs. Moreover, the OP-LW (VEN) setting shows significant advantages over OP-LW (W) on the Ope and Con traits under human-human interaction and on the Ope, Con, Ext, and Neu traits under human-machine interaction. Thus, we assume that the OP-LW (VEN) setting allows a better weighting vector to be learned for constructing each vertex, one which considers not only the original OPs but also task-specific information.
Edge Feature Settings: We also compare the proposed end-to-end learned multi-dimensional edge features with the widely-used binary adjacency edge feature (0 or 1). It can be seen that the graph representations equipped with the proposed deep-learned multi-dimensional edge features outperformed the graph representations that only use a binary adjacency matrix to define the connectivity between vertices, with more than 3.4% and 10% average improvements under human-human and human-machine interaction scenarios, respectively. More importantly, the improvements brought by these edge features are significant for some traits (Con and Neu in the human-human interaction setting, and Con, Agr and Neu in the human-machine interaction setting) as well as for the average performance (please see the supplementary material, available online). Such results validate the usefulness of the proposed end-to-end multi-dimensional edge feature learning strategy, which can better describe the relationship between adjacent vertices with multiple task-specific relationship clues, particularly for the Con and Neu traits. In other words, the task-specific multi-dimensional edge features give the produced graph representations a superior message passing mechanism when they are processed by GNNs, resulting in more discriminative latent personality representations.

V. CONCLUSIONS AND FUTURE WORK
This paper presents the first work that recognises true personality traits from the graph representation of an automatically explored person-specific CNN's architecture and parameters, where each CNN simulates the cognition of a target subject in terms of person-specific facial reactions. Our approach is evaluated on datasets of different nature (i.e., they are recorded under human-human versus human-machine dyadic interaction scenarios), and the achieved results suggest the following conclusions: i) the graph representations of person-specific CNNs are positively associated with the target subjects' self-reported personality traits, showing that the CNNs explored by our approach may have their own personalities, which are similar to those of their corresponding subjects; ii) the proposed approach has clear advantages over most existing APP approaches, which predict personality directly from non-verbal behaviours of the target subject, demonstrating that it is reliable to recognise self-reported (true) personality from the simulated cognition of subjects; iii) the graph representations learned by the proposed approach are particularly informative for recognising the Ext and Neu traits under both interaction scenarios; iv) the proposed approach achieved better personality recognition and facial reaction prediction under the human-human interaction scenario than under the human-machine scenario, indicating that nonverbal behaviours expressed by human speakers are more effective at triggering the listeners' personality-related facial reactions; v) many human demographic attributes (e.g., age, gender, education level, and interpersonal relationship) can influence the performance of the proposed approach, with gender and age being the most influential factors. This is because the facial reactions associated with a similar intention or emotion can vary with these factors; and vi) among several technical settings, the proposed adaptive loss function, independent parameters strategy and end-to-end vertex/edge feature learning strategies have largely enhanced the personality recognition performance.
The main limitation of this work is that searching for a unique CNN architecture for each subject takes a relatively long time, i.e., the training and inference durations of the proposed approach are expected to be longer than those of most existing approaches. Therefore, it may not be suitable for applications that require fast personality assessment. Another limitation is that we only used audio-visual modalities and ignored other human signals such as physiological signals (EEG, heart rate, skin temperature, etc.) and verbal information, which contribute important information to one's communication and reactions. As a result, a potential future direction is to accelerate the person-specific cognition simulation algorithm so that it does not require searching for a person-specific CNN for each person from scratch. In addition, further modalities (e.g., verbal and physiological signals) might enable the CNNs to be more similar to the target subjects' cognition in a dyadic interaction, where dialogue response generation can be utilised to predict listeners' verbal responses. All these modalities can in principle be combined via the proposed fusion module, i.e., by combining them at multiple levels, as each is influenced by the others. Some modality-specific issues remain to be resolved, so while this is certainly possible, substantial future research is still needed in this area. Meanwhile, from the application perspective, this work opens up a new avenue of research for predicting and recognizing socio-emotional phenomena (personality, affect, engagement, etc.) from the simulations of person-specific cognitive processes, which will have further implications for relevant fields including neuroscience, and cognitive, behavioural and emotion sciences. Future work will also focus on extending and evaluating our approach to analyze mental health or other human internal states with domain-specific loss functions under clinical settings, i.e., representing them with CNN parameters, or on creating data-driven robot coaches that can express personalized behaviours during dyadic interactions [77], [78].

Fig. 1. The pipeline of the proposed approach. (a) Our approach starts with searching for a person-specific processor (multi-modal CNN) architecture (unique topology, weights and depth) that can reproduce the target listener's facial reactions according to the speaker's audio-visual non-verbal signals (Section III-A); (b) Then, we parameterize the person-specific processor as a graph representation to represent the listener's cognition and feed it to a graph neural network for the target listener's personality recognition (Section III-B). It should be noted that we individually search for a person-specific processor with a unique architecture and parameters for each subject as the person-specific cognition/personality representation.

and $n$-th audio cell $C^{\text{Audio}}_{n}$, the output of the $(n-1)$-th and $(n-2)$-th fusion cells ($C^{\text{Fusion}}_{n-1}$, $C^{\text{Fusion}}_{n-2}$), which can be formulated as:

Fig. 4. The details of the explored person-specific multi-modal CNN.

Fig. 5. Visualization of person-specific CNNs explored on the NoXI dataset, where the initial CNN architectures for all subjects are the same (Epoch 0). After the search, we can see that the explored CNN for each subject is unique and person-specific (Epoch 300).

Algorithm 1: Single-Level Optimization.
Require: A multi-modal CNN that is parametrized by the OPs, LWs and depths of an audio encoder, a visual encoder, a fusion module and a decoder, which are denoted as $A^{t=0}_{V,A,F,D}$, $W^{t=0}_{V,A,F,D}$ and $D^{t=0}_{AVF,D}$.
Ensure: An optimal person-specific multi-modal CNN that can reproduce the target subject's facial reactions, which is parametrized by OPs $A^{\text{Optimal}}_{V,A,F,D}$, LWs $W^{\text{Optimal}}_{V,A,F,D}$ and depths $D^{\text{Optimal}}_{AVF,D}$.
1: repeat
2:   Updating OPs $A^{t}_{V,A,F,D}$ ($t > 1$) on the training set by descending $\nabla_{A} \mathcal{L}_{\text{A-loss}}\big(W^{t-1}_{V,A,F,D} - \eta_{A} \nabla_{W} \mathcal{L}_{\text{A-loss}}(W^{t-1}_{V,A,F,D}, A^{t-1}_{V,A,F,D}, D^{t-1}_{AVF,D}),\ A^{t-1}_{V,A,F,D},\ D^{t-1}_{AVF,D}\big)$.
3:   Updating LWs $W^{t}_{V,A,F,D}$ on the training set by descending $\nabla_{W} \mathcal{L}_{\text{A-loss}}(A^{t}_{V,A,F,D}, W^{t-1}_{V,A,F,D}, D^{t-1}_{AVF,D})$.
4:   Choosing the optimal depths $D^{t}_{AVF,D}$ to achieve the best training loss $\mathcal{L}_{\text{A-loss}}(A^{t}_{V,A,F,D}, W^{t}_{V,A,F,D}, D^{t}_{AVF,D})$.
5: until Convergence
6: $A^{\text{Optimal}}_{V,A,F,D} = A^{\text{Convergence}}_{V,A,F,D}$; $W^{\text{Optimal}}_{V,A,F,D} = W^{\text{Convergence}}_{V,A,F,D}$ and $D^{\text{Optimal}}_{AVF,D} = D^{\text{Convergence}}_{AVF,D}$

examples of person-specific CNNs' optimization processes are visualized in Fig. 5.
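The alternating structure of Algorithm 1 can be illustrated on a toy problem. The sketch below is not the authors' implementation: it replaces the multi-modal CNN with one scalar architecture parameter `a`, one scalar weight `w` and a depth `d` drawn from a small candidate set, uses a made-up quadratic loss in place of the adaptive loss, and estimates gradients by central finite differences. The function names, learning rates and the loss itself are assumptions chosen only so that the loop is runnable.

```python
def toy_loss(a: float, w: float, d: int) -> float:
    """Made-up stand-in for the training loss: fit a*w*d to 6, mild depth penalty."""
    return (a * w * d - 6.0) ** 2 + 0.01 * d


def num_grad(f, x: float, eps: float = 1e-5) -> float:
    """Central finite-difference derivative of a scalar function."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


def single_level_search(a=0.5, w=0.5, d=1, depths=(1, 2, 3),
                        lr_a=0.01, lr_w=0.01, eta=0.01, steps=500):
    for _ in range(steps):
        # Step 2: update the architecture parameter by descending the loss
        # evaluated at look-ahead weights w - eta * dL/dw.
        def lookahead_loss(a_new: float) -> float:
            w_ahead = w - eta * num_grad(lambda w_: toy_loss(a_new, w_, d), w)
            return toy_loss(a_new, w_ahead, d)

        a = a - lr_a * num_grad(lookahead_loss, a)

        # Step 3: update the weight with the new architecture parameter fixed.
        w = w - lr_w * num_grad(lambda w_: toy_loss(a, w_, d), w)

        # Step 4: pick the candidate depth with the lowest training loss.
        d = min(depths, key=lambda d_: toy_loss(a, w, d_))
    return a, w, d


a_opt, w_opt, d_opt = single_level_search()
```

All three quantities are updated on the same (training) objective within one loop iteration, which is what distinguishes this single-level scheme from bi-level searches that use a separate validation loss for the architecture parameters.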

Fig. 6. Illustration of the ERN. Step 1: Learning a pair of 1D representations $F^{\text{conv}}_{i,j}$ and $F^{\text{conv}}_{j,m}$ from a pair of connected vertex features $V_{i,j}$ and $V_{j,m}$; Step 2: generating cross-vertex attention maps $A^{\text{cross}}_{i,j}$ and $A^{\text{cross}}_{j,m}$, where $A^{\text{cross}}_{i,j}$ emphasizes a part of $F^{\text{conv}}_{i,j}$'s information that is correlated with $F^{\text{conv}}_{j,m}$ while $A^{\text{cross}}$

Fig. 7. Examples of a virtual human display (a) and automatically detected (aligned) faces (b) in the VHQ dataset.

Fig. 8. The results achieved by our best system and several baselines.

Fig. 10. The results of different person-specific CNN settings. The definitions of MModal, S, M, PS-, IP- and A- can be found in the captions of Tables II, III, V and VI.

Fig. 11. The results of different vertex and edge feature learning settings. The definitions of the settings can be found in Section IV-E2.

TABLE I
THE OPERATIONS USED IN THIS PAPER

TABLE II
PERSONALITY RECOGNITION RESULTS ON THE NOXI DATASET

TABLE III
PERSONALITY RECOGNITION RESULTS ON THE VHQ DATASET

TABLE IV
STATISTICAL SIGNIFICANCE TESTING RESULTS IN TERMS OF PCC ACHIEVED BY OUR BEST SYSTEM AND THE FIVE REPRODUCED SYSTEMS ON THE NOXI AND THE VHQ DATASETS, WHERE +/− DENOTES THAT THERE IS/THERE IS NO STATISTICALLY SIGNIFICANT DIFFERENCE BETWEEN OUR APPROACH AND THE OTHER APPROACH (THE SIGNIFICANCE LEVEL OF * P < 0.05, ** P < 0.01, *** P < 0.001)

TABLE V
FACIAL REACTIONS PREDICTION RESULTS ON THE NOXI DATASET

TABLE VI
FACIAL REACTIONS PREDICTION RESULTS ON THE VHQ DATASET