Domain-Incremental Continual Learning for Mitigating Bias in Facial Expression and Action Unit Recognition

As Facial Expression Recognition (FER) systems become integrated into our daily lives, these systems need to prioritise making fair decisions instead of aiming at higher individual accuracy scores. Ranging from surveillance systems to diagnosing mental and emotional health conditions of individuals, these systems need to balance the accuracy vs fairness trade-off to make decisions that do not unjustly discriminate against specific under-represented demographic groups. Identifying bias as a critical problem in facial analysis systems, different methods have been proposed that aim to mitigate bias both at data and algorithmic levels. In this work, we propose the novel usage of Continual Learning (CL), in particular, using Domain-Incremental Learning (Domain-IL) settings, as a potent bias mitigation method to enhance the fairness of FER systems while guarding against biases arising from skewed data distributions. We compare different non-CL-based and CL-based methods for their classification accuracy and fairness scores on expression recognition and Action Unit (AU) detection tasks using two popular benchmarks, the RAF-DB and BP4D datasets, respectively. Our experimental results show that CL-based methods, on average, outperform other popular bias mitigation techniques on both accuracy and fairness metrics.


I. INTRODUCTION
Artificial Intelligence (AI) and Machine Learning (ML) systems are increasingly becoming an important part of human life, monitoring and controlling several aspects of our daily lives, with little to no human oversight. From security and surveillance systems that deploy several ML models such as face detection and recognition systems [1], social media platforms that auto-tag pictures of our friends and family [2], recommender systems that track our digital footprints to show us advertisements of products that we might like to indulge in [3], to banking and finance applications that work on credit approvals based on socio-economic backgrounds of individuals, AI systems are ubiquitous, making 'smart' decisions about several critical aspects of our lives [4], [5]. It is thus important to ensure that these systems make fair and unbiased decisions to avoid potentially catastrophic consequences that adversely affect individuals [6]. In this work, we focus on one such popular application of AI in real life: Facial Expression Recognition (FER) systems.
FER systems (see [7], [8], [9] for a survey) aim to analyse facial expressions either by encoding facial muscle activity as Facial Action Units (AUs) [10] or determining the emotional state being expressed by an individual [11], [12]. Analysing large datasets of human faces, annotated for the expressions represented in the images, these models are heavily data-dependent and thus may be prone to biases originating from imbalances in the training data distribution. For a large variety of FER datasets, attributes such as gender, race, age or skin-colour are implicitly encoded in the data which may also be learnt by a (deep) learning model [13]. If these attributes are not balanced across the entire distribution of the dataset, the model may learn to associate such confounding attributes with the task of FER. For example, if the training data has disproportionately more images of males expressing 'Happy' than females, the model may learn to associate gender with the expression, leading to many 'happy female' samples being misclassified.
While the most effective method for preventing biases in FER datasets would be to ensure a balanced and representative data collection, this also turns out to be the most challenging problem. Owing to restrictions with respect to data recording settings, personal preferences, geographic location as well as several social and cultural constraints, it may not always be possible to ensure a balanced data collection. Most recent datasets try to ensure the data collection is fair and unbiased or at the least provide demographic annotations, along with affective labels, that enable researchers to make informed decisions while using these datasets for training ML models [13]. Yet, to ensure fairness despite the inherent imbalances in data distributions, several methods have been proposed that handle these imbalances at the pre-processing, in-processing or post-processing levels [14].
Pre-processing methods focus on strategically sampling training data, that is, given the distribution of data with respect to a selected demographic attribute, samples belonging to under-represented groups are either over-sampled compared to dominant groups [15], or scaled penalties are applied when a model incorrectly classifies these samples [16]. Yet, these methods are not perfect and some bias might still creep in. To handle that, changes to the model architecture or the training regime need to be made. Algorithm-level or in-processing methods achieve this either by explicitly learning domain-specific information such that this can be discounted from the model's learning later [17], or by learning to completely discount domain-specific information by omitting these features from the learnt representations [18]. Post-processing methods, on the other hand, are mostly used to quantify bias in trained algorithms [14] and offer effective tools to evaluate the fairness of an ML model.
Interestingly, the underpinning principle behind all the above-mentioned methods is essentially to focus on learning and adapting to the inherent imbalances in data distribution, either by synthetically balancing it or adjusting the learning algorithm itself to account for these imbalances. This principle is shared by Continual Learning (CL) methods [19], [20] that also aim to balance learning in the model by being sensitive to shifts in data distributions, ensuring that one particular domain or task does not dominate the model's learning. Their ability to continually learn and adapt to novel information, aggregating new knowledge without impacting previously acquired information, may allow them to balance learning across the different learning domains. Domain-Incremental CL settings [21] particularly focus on managing shifts in input data distribution while the task remains the same. This can be considered analogous to solving FER tasks where input data belongs to different domains of gender (male, female) and race (black, white, asian). The challenge for CL models will thus be to maintain performance on FER tasks with respect to one domain while acquiring information about new domains.
Motivated by this notion, we propose the novel use of CL as a learning paradigm that is well-suited for developing fairer FER models that can balance learning with respect to different attributes of gender and race. We formulate expression recognition and AU detection across these different domain attributes as a continual learning problem and compare several popular CL approaches with state-of-the-art bias mitigation approaches. To the best of our knowledge, this is the first application of CL as a bias mitigation strategy for facial affect analysis tasks. Furthermore, we explore the Domain-Incremental CL settings [21] where the models need to solve the facial analysis tasks across different domains, defined by the demographic attributes of gender and race. For each attribute, the data is split into different domains, that is, gender annotations are used to define male and female domains, while race annotations are used to split data into White/Caucasian, Black/African-American, Asian and Latino domains. We primarily focus on regularisation-based CL methods as these do not require setting up additional memory or computational resources, allowing a fair and direct comparison with other learning methods that focus on mitigating bias arising due to imbalances in data distributions. Our experimental results show that CL-based approaches, on average, outperform other bias mitigation strategies, both in terms of accuracy and fairness scores, for both FER and AU detection tasks across the domain-splits.

A. Understanding Bias
Bias, both in human perception and behaviour as well as in ML algorithms, can be characterised as an inclination or prejudice towards a person or a group that may be considered unfair. This may result from an over- or under-exposure of an individual towards a certain group of individuals, usually characterised by their gender, racial identity, social or economic background or age, amongst other factors. This exposure results in people considering individuals that share similar characteristics as themselves as "in-group" members and others different from them as "out-group" members [22]. It is seen that people tend to be biased in favour of in-group members, evaluating them more positively on dimensions of judgement while being negatively biased or prejudiced against out-group members [23].
Understanding how humans consider in-group and out-group members [22] in their immediate surroundings and base their decisions on aspects such as gender, race or age is important to view ML models in the right perspective when applied to real-world settings. Such an understanding will allow researchers to assess what may be considered 'fair' and how to achieve such fairness in algorithms.
1) Bias in Human Perception: Following the Perception-Action model of Empathy [24], an individual's behaviour, particularly their facial expressions and body gestures, stimulates a similar neural activation in the observer, enabling them to empathise with and understand the individual's actions, intentions and emotions. Gutsell et al. [25], through a series of experiments with multiple participants (30 White university students) interacting with in-group (in this case participants with a Caucasian ethnic identity) and out-group (excluded from this circle; in this case, African-Canadian, East-Asian, South-Asian ethnic identities) members, concluded that such perception-action couplings are reserved only for in-group members. In-group identification, that is, identifying other individuals to be sharing similar characteristics as oneself, causes a positive association with them [23]. Furthermore, people have a harder time recognising the faces of out-group members and interpreting their facial expressions [26], [27].
In the case of ML algorithms, we may understand such 'inter-group bias' to result from the imbalances in data distributions where certain groups may be considered to constitute 'in-group' attributes due to their dominance in the data, while under-represented attributes can be considered as members of the 'out-group'. Thus, having witnessed a lot of samples from certain groups, the models are more capable of correctly classifying such samples while performing poorly for the so-called out-group samples.
2) Bias in Machine Learning: Owing to similar reasons as in the case of human perception, namely over- or under-exposure to experiences (or in this case, data) characterised by specific features, ML models are seen to acquire biases that prejudice model performance for one or more data attributes. Facial analysis models particularly are seen to be affected by biases with respect to demographic attributes of gender, race or age, where samples belonging to one group dominate the data distribution. In such situations, the under-represented groups get adversely impacted by the model misclassifying samples from these groups. Buolamwini et al., in their seminal work [28], highlighted how popular face recognition algorithms disproportionately misclassified darker-skinned females, either misgendering them or not being able to detect their faces. Another study by Klare et al. [29] highlighted how face recognition algorithms employed by some law-enforcement agencies significantly underperform for people labelled as black or female compared to other demographics. Such biases in critical systems may lead to unnecessary targeting and exploitation of people from under-represented groups, further disadvantaging their opportunities in society.

B. Mitigating Bias in Facial Analyses
The origins of bias in most ML-based facial analysis algorithms can be traced back to imbalances in data distributions. Collating balanced datasets that enable a fair evaluation of ML models [30], despite being the most effective solution towards mitigating such biases, may not be straightforward to achieve as a varied and diverse subject-pool might not always be available. As a result, several strategies have been proposed for mitigating the effects of bias on the training and evaluation of ML algorithms. We use a similar nomenclature as [14] to discuss these strategies.
1) Pre-Processing Approaches: The simplest of these is achieved by selectively sampling training data in a manner that balances learning. Samples from under-represented domains are over-sampled while dominant domains are under-sampled to balance the training data [18], [31]. As a result, the training set effectively has a balanced data distribution. However, this may not be possible in really small-scale datasets as under-sampling already limited data might not be efficient. An alternative approach is to use data-augmentation techniques to synthetically generate additional data for the under-represented groups [32], [33], [34], to balance the training data distribution.
2) In-Processing Approaches: Another popular approach to mitigate the effects of imbalances in data distributions is to weight model prediction loss differently for the different domain attributes. A weighting factor is applied to the training loss computation based on the occurrence rate for the different classes or domains [15], [16], [35], penalising misclassifications for the under-represented groups more than others. This reduces the effect of these imbalances, mitigating biases in learning.
More recently, several learning strategies have been proposed that, while handling imbalances in data distributions using the above-mentioned techniques, also deal with biases in ML models at the algorithm-level. Howard et al. [5] propose a hierarchical approach that combines outputs from the cloud-based Microsoft Emotion API algorithm with a specialised learner, offering a 17.3% improvement in recognition results on a minority class, in this case, children's facial expressions. Other approaches focus on explicitly separating the decision boundaries with respect to different sensitive domain attributes, ensuring that imbalances in data with respect to these attributes are not perpetuated while training the model, to achieve 'fairness through awareness' [17]. Alternatively, the model can be trained to ignore domain-specific information, making it unaware or blind towards domain differences so that it focuses only on the task at hand. Adversarial learning has been used to achieve such 'fairness through blindness' [18] using a min-max training regime that maximises sensitivity towards the task at hand while minimising learning of domain-specific information. Xu et al. [36] implement a disentangled approach [37] that uses a similar strategy to mitigate bias with respect to sensitive domain attributes of gender and race for FER by ensuring that the feature representations learnt by the model do not contain any domain-specific information. The model is split into two parts with a shared feature extraction sub-network. The first part focuses on facial expression analysis, while the other part consists of separate branches for each domain, designed to suppress domain-specific information.
3) Post-Processing Approaches: Despite several methods proposed for training fair ML systems, as described above, it may not always be possible to completely eradicate bias in the model. In such cases, it is still important to examine whether a model is biased and quantify the bias to mitigate it and make fairer decisions. Post-processing approaches (see [14], [38] for a general discussion) focus on quantifying bias in existing algorithms and attempt to counter the effects on classification tasks.

C. Continual Learning
Learning to detect and manage shifts in data distributions, Continual Learning (CL) methods (see [19], [20] for an overview) can effectively learn with incrementally acquired data, offering an improvement over traditional ML models, especially for real-world application.
Typically, CL models are evaluated on 3 different learning scenarios [21]. The first scenario is termed Task-Incremental Learning (Task-IL), where the model incrementally learns to solve several tasks, explicitly being informed about the task identity. Learning is split into different tasks, each corresponding to learning some sub-tasks or classes. The model is evaluated on its ability to preserve its knowledge across several tasks. The second scenario focuses on Domain-Incremental Learning (Domain-IL), where the task to be learnt by the model does not change but the input data distribution changes. The model still needs to solve the same task, but the inherent data distribution shifts and the model is evaluated on its ability to manage such a shift. The third and the most complex learning scenario is the Class-Incremental Learning (Class-IL) scenario, where the model needs to learn a new class without being given any information on the tasks. The model incrementally learns one class at a time, sequentially receiving input data for only that class.
In recent years, several CL approaches have been proposed that employ (deep) ML architectures and equip them with learning capabilities such that they can incrementally integrate novel information while preserving past knowledge [19]. The most common and straightforward approach to achieve this is by regulating model updates in a manner that enables the preservation of knowledge. Such regularisation-based methods minimise destructive interference by freezing those parts of the model that are most sensitive to previous tasks [39] and updating the rest of the model, selectively. Alternatively, weight-update constraints and penalties are applied that discourage changes in network parameters that deteriorate model performance on previous tasks [40], [41].
A priority or importance term may also be applied to network parameters based on their relevance to a given task, and only those parameters which have lower importance are allowed to be updated [42]. Despite the competitive performance of regularisation-based methods, they become computationally expensive as the number of tasks or classes grows, limiting their performance and applicability in Task-IL (in extreme cases) and Class-IL scenarios.
Other CL-based approaches include rehearsal-based methods [43] that aim to simulate offline batch-learning settings by either physically storing previously encountered data samples, commonly known as Naive Rehearsal (NR) [44], or learning a generative or probabilistic model that learns data statistics to simulate pseudo-samples for previously seen tasks [45], [46], [47]. Yet, as the number of tasks increases, it becomes extremely difficult to train these models. Furthermore, additional memory and computational resources need to be allocated to either store data samples or generate simulated pseudo-samples, making it challenging to implement these approaches.
In this work, we focus on regularisation-based methods evaluated under Domain-IL settings where these models are required to learn to solve expression recognition and AU detection tasks across different domains of gender and race.

III. METHODOLOGY
In order to understand bias in FER algorithms, it is important to determine how the implicit data distribution affects model performance. For this, we need to understand which domain attributes dominate the data and how an algorithm performs with respect to these attributes. In this section, we present the problem formulation, the learning scenario as well as the different methods employed in this work, comparing them with popular CL-based methods.

A. Problem Formulation
We aim to measure the variance in model performance on a specific task with respect to gender and race as the different domain attributes and compare model performances for expression recognition and AU detection. Given a set of input images $x_i$ with task labels $y_i$ and domain labels $d_i$, we wish to determine how the performance of an algorithm $A(x_i \mid y_i, d_i)$ varies with respect to different values of the domain label $d_i$.
To enable a fair comparison between the different bias mitigation methods, for our experiments, we implement the same ResNet architecture-based [48] CNN model for all the methods, consisting of 4 convolutional blocks, each with 2 conv layers and a max-pooling layer, and implementing drop-out with batch-normalisation. The output of the last conv block is connected to three dense layers and a classification layer making model predictions. ReLU activation is used for each conv as well as dense layer. The same architecture is used to implement all the approaches compared in this work (see Fig. 1) with the exception of the Disentangled Approach, for which the results from the original paper [36] are used directly for comparison purposes.
1) The Baseline: For baseline evaluations, we split the dataset into different subsets based on the domain attributes. For example, for gender, the datasets are split into male and female splits and model performance is reported when trained incrementally on these data splits. This is sometimes also referred to as finetuning [49]. The model (see Fig. 1a), without any explicit mechanism to preserve knowledge, is expected to suffer from forgetting old tasks while preference is given to the new tasks.
2) Off-line Training: Providing another baseline evaluation, the above-described Convolutional Neural Network (CNN) model (see Fig. 1a) is trained on all the training data, off-line, at once, but its performance scores are reported individually on domain-specific test-splits. Off-line training provides a fair comparison with traditional ML-based learning models and is a popularly used benchmark for evaluating the performance of CL-based methods.

B. Non-CL-based Bias Mitigation Strategies
Here, we investigate some of the popular bias mitigation strategies that are found in the literature and implement 4 different methods for comparison. We group them under 'non-CL-based' strategies to differentiate them from our baselines as well as the CL approaches.
1) Domain Discriminative Classification (DDC): A popular method for mitigating bias is to focus on achieving 'fairness through awareness' [17] where information about sensitive attributes (or domains) is explicitly learnt in feature encodings. This information later allows models to account for bias in learning by being more 'aware'. One way to achieve this is to create an $N \times M$-way discriminative classifier where $N$ denotes the number of domains and $M$ is the number of classes to be learnt [18]. For example, for FER classifying 7 different expression classes for samples encoding 3 different race labels, a classifier is used with each output unit corresponding to a unique expression-race label pair (in this case, 7 × 3 = 21 label pairs). This allows the model to be more 'aware' of the different domains in order to learn discriminative features for each of them. For our experiments, we use the same model architecture (see Fig. 1b), only replacing the output layer.
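The joint expression-domain output space of DDC can be illustrated with a small index mapping. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the domain-major ordering of output units is an arbitrary choice, and summing joint probabilities over domains to read off the expression is only one possible inference rule, which the description above does not specify.

```python
def joint_label(class_idx, domain_idx, num_classes):
    """Map a (class, domain) pair to one of N x M output units (domain-major order)."""
    return domain_idx * num_classes + class_idx


def split_joint_label(joint_idx, num_classes):
    """Inverse mapping: recover (class, domain) from a joint output index."""
    return joint_idx % num_classes, joint_idx // num_classes


def class_probabilities(joint_probs, num_classes):
    """Assumed inference rule: sum joint probabilities over domains for each class."""
    out = [0.0] * num_classes
    for j, p in enumerate(joint_probs):
        out[j % num_classes] += p
    return out


NUM_CLASSES, NUM_DOMAINS = 7, 3            # 7 expressions, 3 race labels
num_outputs = NUM_CLASSES * NUM_DOMAINS    # size of the replaced output layer
```

With 7 expression classes and 3 race domains, `num_outputs` is 21, matching the number of label pairs in the example above.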
2) Domain Independent Classification (DIC): A major concern with the DDC method is that the network may implicitly learn decision boundaries within the same class across different domains. This may be redundant as, despite the different domain attributions, the class-boundaries may remain the same and the network may be unnecessarily penalised due to incorrect domain predictions even if it predicts the task correctly. Wang et al. [18] offer a solution to this by training separate classifiers for each domain, sharing the feature extraction layers. For our experiments, we make use of the same model architecture (see Fig. 1c), connecting separate dense-layered classifiers for each of the domains. The DIC model consists of different heads, each consisting of the same number of output units but corresponding to different domain attribute labels.
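The shared-features, per-domain-heads structure of DIC can be sketched with plain linear heads. This is an illustrative toy, not the model used in the experiments: the names `linear` and `dic_predict` are hypothetical, each head is reduced to a single linear layer, and selecting the head by a known domain label is an assumption made for the sketch.

```python
def linear(weights, bias, features):
    """One dense layer: weights is a list of rows, one row per output unit."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def dic_predict(features, heads, domain):
    """Route shared features through the head belonging to the given domain."""
    weights, bias = heads[domain]
    logits = linear(weights, bias, features)
    # Return the index of the largest logit as the predicted class.
    return max(range(len(logits)), key=logits.__getitem__)


# Two toy heads with the same number of output units (2 classes each).
heads = {
    "male": ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    "female": ([[0.0, 1.0], [1.0, 0.0]], [0.0, 0.0]),
}
shared_features = [1.0, 0.0]   # stand-in for the shared feature extractor output
```

The same feature vector can yield different predictions depending on which head is used, which is what lets each domain keep its own classifier while the feature extractor is shared.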
3) Strategic Sampling (SS): A simple approach for handling bias arising from imbalanced data distributions is to strategically sample data [15] for each domain-class mapping such that the resultant data distribution 'appears' to be balanced. Samples from under-represented distributions can be sampled more often during training or, equivalently, prediction loss can be appropriately weighted to account for the under-represented classes. For our experiments, samples ($s$) for each of the $N$ domains ($d_i$) are assigned a weight $w_i$ inversely proportional to the rate of occurrence of samples for that domain, scaling the loss function to appropriately account for imbalances in the training set distribution. The scaled cross-entropy loss function is given as:
$$L_{SS} = -\sum_{i=1}^{N} w_i \sum_{s \in d_i} \log p(y_s \mid x_s), \tag{1}$$
where $p(y_s \mid x_s)$ is the probability the model assigns to the ground-truth label $y_s$ of sample $s$ with input $x_s$.
4) Disentangled Feature Learning (DA): Xu et al. [36] implement the Disentangled Feature Learning (DA) approach of [37] for facial expression recognition. This approach ensures that the feature representations learnt by the model do not contain any domain-specific information. The two sub-parts of the model focus on analysing facial expressions while learning to suppress domain-specific information. For our experiments, we use the results from the original paper [36] for comparison.
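Returning to the inverse-frequency weighting described for Strategic Sampling, the idea can be sketched in a few lines of plain Python. The helper names are hypothetical, and the normalisation `n / (k * count)` (which makes the weights average to 1 over the training set) and the averaging over samples are assumptions; the text only states that weights are inversely proportional to the rate of occurrence.

```python
import math
from collections import Counter


def domain_weights(domains):
    """Weight each domain inversely to its frequency (assumed normalisation)."""
    counts = Counter(domains)
    n, k = len(domains), len(counts)
    return {d: n / (k * c) for d, c in counts.items()}


def weighted_cross_entropy(probs, labels, domains, weights):
    """Mean cross-entropy with each sample's loss scaled by its domain weight.

    probs[s][c] is the predicted probability of class c for sample s.
    """
    total = 0.0
    for p, y, d in zip(probs, labels, domains):
        total += -weights[d] * math.log(p[y])
    return total / len(labels)


domains = ["female", "female", "female", "male"]
weights = domain_weights(domains)
```

In this toy split the single 'male' sample receives three times the weight of each 'female' sample, so a misclassification in the under-represented domain costs proportionally more.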

C. Continual Learning Approaches
Domain-Incremental CL deals with scenarios where the structure of the tasks remains the same while the input distribution changes [21]. For our experiments, we model the tasks of expression recognition and AU detection in a domain-incremental manner where the models learn to solve these tasks as the input data distributions change with respect to domain attributes of gender and race. For example, in the case of gender, the models first learn to classify expression classes or predict activated AUs for 'male' samples and then, sequentially, learn to solve these tasks for 'female' samples (or vice-versa), without forgetting the previous task. In our experiments, each approach is implemented using the Baseline CNN architecture as shown in Fig. 1a. All implementations were based on the CL code-benchmarks provided by [21], [44].
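The domain-incremental protocol amounts to grouping samples by their domain label and presenting the groups one after another. A minimal sketch follows, assuming each sample is a hypothetical `(image, label, domain)` triple; the function name and the toy data are illustrative only.

```python
from collections import defaultdict


def split_by_domain(samples, domain_order):
    """Group (image, label, domain) triples into an ordered list of per-domain tasks."""
    groups = defaultdict(list)
    for x, y, d in samples:
        groups[d].append((x, y))
    # Keep only the requested domains, in the requested learning order.
    return [(d, groups[d]) for d in domain_order if d in groups]


toy = [
    ("img0", 3, "male"),
    ("img1", 0, "female"),
    ("img2", 3, "female"),
    ("img3", 5, "male"),
]
tasks = split_by_domain(toy, ["male", "female"])

# A model would then be trained on tasks[0], then tasks[1], and so on,
# while being evaluated on the test split of every domain seen so far.
```

Reversing `domain_order` gives the 'vice-versa' ordering mentioned above without changing anything else.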
1) Elastic Weight Consolidation (EWC): The EWC approach, as proposed by Kirkpatrick et al. [41], imposes a quadratic penalty on parameter updates between old and new tasks in order to avoid forgetting previously learnt information. For each parameter $\theta$, its relevance is calculated with respect to a task's training data $D$, modelled as the posterior distribution $p(\theta \mid D)$. Thus, for two data distributions $D_A$ and $D_B$, corresponding to two independent tasks A and B, according to Bayes' rule, the posterior probability is given as:
$$\log p(\theta \mid D) = \log p(D_B \mid \theta) + \log p(\theta \mid D_A) - \log p(D_B), \tag{2}$$
such that $\log p(\theta \mid D_A)$ embeds all information about previously learnt tasks. As this term is intractable, Laplace approximation is used to approximate it as a Gaussian distribution with its mean given by parameters $\theta^*_A$ (referring to parameters of task A) and the importance of the parameters determined by the diagonal of the Fisher Information Matrix. The loss function for the EWC method thus becomes:
$$L(\theta) = L_B(\theta) + \sum_i \frac{\lambda}{2} F_i \left(\theta_i - \theta^*_{A,i}\right)^2, \tag{3}$$
where $L_B$ is the loss for task B, $\lambda$ is the regularisation coefficient that determines the relevance of old tasks with respect to the new one, $i$ denotes the index of the parameter $\theta$ and $F_i$ is the $i$-th diagonal element of the Fisher Matrix.
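The quadratic EWC penalty can be sketched on flat parameter lists. This is a toy illustration rather than the benchmark implementation: the function names are hypothetical, the Fisher diagonal is estimated here as the mean of squared per-sample gradients (a common empirical approximation, assumed for the sketch), and the penalty uses the λ/2 scaling of Kirkpatrick et al.

```python
def fisher_diagonal(per_sample_grads):
    """Empirical Fisher diagonal: mean of squared per-sample log-likelihood gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def ewc_penalty(theta, theta_star, fisher_diag, lam):
    """Quadratic penalty: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    return 0.5 * lam * sum(
        f * (t - ts) ** 2 for t, ts, f in zip(theta, theta_star, fisher_diag)
    )


def ewc_loss(task_loss, theta, theta_star, fisher_diag, lam):
    """Total loss on the new task B: task loss plus the EWC penalty."""
    return task_loss + ewc_penalty(theta, theta_star, fisher_diag, lam)


fisher = fisher_diagonal([[1.0, 2.0], [3.0, 0.0]])       # toy gradients from task A
penalty = ewc_penalty([1.0, 2.0], [0.0, 2.0], [4.0, 10.0], lam=2.0)
```

Only the first parameter has moved away from its task-A value in this example, so only its Fisher term contributes to the penalty.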
2) EWC-Online: A disadvantage of the EWC method is that as the number of tasks increases, the number of quadratic terms in the regularisation term grows linearly. To handle this, Schwarz et al. [50] proposed a modification to EWC where, instead of many quadratic terms, a single quadratic penalty is applied, determined by a running sum of the Fisher Information Matrices of the previous tasks. Thus, the updated regularisation term of the proposed EWC-online approach is given as:
$$L^{(T)}_{reg}(\theta) = \sum_i \tilde{F}^{(T-1)}_i \left(\theta_i - \hat{\theta}^{(T-1)}_i\right)^2, \tag{4}$$
where $\hat{\theta}^{(T-1)}_i$ is the $i$-th parameter after learning task $T-1$ and $\tilde{F}^{(T-1)}_i$ is the running sum of the $i$-th diagonal elements of the Fisher Matrices of all previous tasks, calculated as:
$$\tilde{F}^{(T)}_i = \gamma \tilde{F}^{(T-1)}_i + F^{(T)}_i, \tag{5}$$
where $\gamma$ controls the contribution of previously learnt tasks.
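The running Fisher estimate that distinguishes EWC-online from EWC can be sketched as a single decayed accumulator. The function names are hypothetical and the sketch assumes the decayed running sum `gamma * previous + current` over Fisher diagonals, with one quadratic penalty around the parameters stored after the most recent task.

```python
def update_running_fisher(running, task_fisher, gamma):
    """Decayed running sum of Fisher diagonals: gamma * previous + current."""
    if running is None:            # first task: nothing to decay yet
        return list(task_fisher)
    return [gamma * r + f for r, f in zip(running, task_fisher)]


def ewc_online_penalty(theta, theta_prev, running_fisher):
    """Single quadratic penalty around the parameters stored after the last task."""
    return sum(
        f * (t - p) ** 2 for t, p, f in zip(theta, theta_prev, running_fisher)
    )


running = None
for task_fisher in ([1.0, 2.0], [4.0, 0.0]):   # toy Fisher diagonals of two tasks
    running = update_running_fisher(running, task_fisher, gamma=0.5)

penalty = ewc_online_penalty([1.0, 1.0], [0.0, 3.0], running)
```

However many tasks have been seen, only one Fisher vector and one parameter snapshot are stored, which is the memory saving over the per-task terms of EWC.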
3) Synaptic Intelligence (SI): Similar to EWC, this approach also penalises changes to relevant weight parameters (synapses) in a manner that new tasks can be learnt without forgetting the old [42]. To alleviate catastrophic forgetting, the importance for solving a learned task is computed for each individual synapse and changes in the most important synapses are discouraged. A modified cost function $L^*_n$ is used with a surrogate loss term which approximates the summed loss functions of all the previous tasks $L^*_o$:
$$L^*_n = L_n + c \sum_k \Omega^n_k \left(\theta^*_k - \theta_k\right)^2, \tag{6}$$
where $L_n$ is the loss for the new task, $\theta_k$ represents the parameters for the new task, $\theta^*_k$ represents the parameters at the end of the previous task, $\Omega^n_k$ is the parameter regulation strength and $c$ is the weighting factor balancing new vs. old learning.
4) Memory Aware Synapses (MAS): Similar to EWC and SI, MAS also calculates the importance of each parameter, although by looking at the sensitivity of the output function instead of the loss [49]. For each new sample, MAS updates the importance of each parameter by evaluating how sensitive the model prediction is to changes in that parameter. Parameters that have the most impact on model predictions are given high importance and changes to these parameters are penalised. Different from EWC and SI, parameter importance is computed using only unlabelled data by measuring changes in the model output. For each new task ($T_n$), in addition to the task-loss ($L_n(\theta)$), changes to parameters important for previous tasks are penalised:
$$L(\theta) = L_n(\theta) + \lambda \sum_{i,j} \Omega_{ij} \left(\theta_{ij} - \theta^*_{ij}\right)^2, \tag{7}$$
where $\lambda$ is the hyperparameter balancing new vs. old task losses, $\Omega_{ij}$ is the importance assigned to parameter $\theta_{ij}$ and $\theta^*$ denotes the old network parameters.
5) Naive Rehearsal (NR): For the Naive Rehearsal (NR) approach, we implement a straightforward rehearsal-based method that combines new data with previously seen data while training the model [44]. A small replay buffer is implemented to randomly store a fraction of previously seen data samples that can be replayed to the model. Each mini-batch of data is constructed using an equal number of samples from the new as well as previously seen data. This interleaving of data pertaining to previously learnt tasks with new data ensures that old knowledge is not overwritten by new data.
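The mini-batch construction for Naive Rehearsal can be sketched with a small replay buffer. This is an illustration under assumptions: the class and function names are hypothetical, and reservoir sampling is used here as one way to 'randomly store a fraction' of past samples, which the description above does not prescribe.

```python
import random


class ReplayBuffer:
    """Fixed-size store of previously seen samples (reservoir sampling, assumed)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            # Replace a stored item with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = sample

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))


def mixed_minibatch(new_samples, buffer, batch_size, rng):
    """Half of the batch from the current domain, half replayed from the buffer."""
    half = batch_size // 2
    new_part = rng.sample(new_samples, min(half, len(new_samples)))
    return new_part + buffer.sample(half)


buffer = ReplayBuffer(capacity=4)
for i in range(10):                          # samples from the first domain
    buffer.add((f"img{i}", "male"))

new_data = [(f"img{i}", "female") for i in range(10, 20)]
batch = mixed_minibatch(new_data, buffer, batch_size=8, rng=random.Random(1))
```

Each batch built this way contains equal numbers of new and replayed samples as long as both sources hold at least half a batch.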

A. Datasets
For evaluating the different bias mitigation strategies and comparing them with CL-based methods, we use two popular benchmark datasets: the RAF-DB dataset for FER in-the-wild and the BP4D dataset recorded for AU detection in controlled settings. These datasets are selected because they (i) differ in their data acquisition settings, (ii) provide labels not only for expression/AU recognition but also for gender and race attributes, and (iii) contain notable imbalances in the data distributions with respect to class and domain attribute labels. These factors make the RAF-DB and the BP4D datasets a good choice for our evaluation.
1) RAF-DB Dataset: The RAF-DB dataset [51] consists of ≈ 15K facial images labelled for six expression classes, namely Surprise, Fear, Disgust, Happy, Sad and Anger, along with Neutral to denote the absence of any expression. Additionally, it provides demographic attribute labels for gender (Male, Female, Unsure) and race (Caucasian, African-American, Asian). For our experiments, we split the dataset using multiple grouping strategies based on the gender and race labels. For gender-based grouping, we exclude images labelled as 'Unsure' and only use the 'Male' and 'Female' samples. As shown in Fig. 2, not only is the dataset imbalanced with respect to the different expression categories, there exist stark imbalances with respect to different demographic attributes as well. The majority of the samples in the training set represent the "Happy" expression class and belong to "Female" and "Caucasian" categories.
2) BP4D Dataset: The BP4D dataset [52] consists of video sequences from 41 subjects performing 8 different affective tasks to elicit emotional reactions.Each video is annotated frame-wise for the occurrence and intensity of the activated AUs.In our experiments, we only use occurrence labels for 12 most frequent AU resulting in ≈ 150K labelled frames, in total.Other than the frame-wise AU labels, demographic attribute labels for gender (Male, Female) and race (Black, White, Latino, Asian) have been provided to us, specifically for research.Fig. 3 shows the data distribution of the BP4D dataset for the 12 AU labels with respect to the gender and race attributes.As can be seen, the majority of the samples in the dataset represent "White" and "Female" attribute labels.

B. Pre-processing and Data-Augmentation
Both the RAF-DB and BP4D datasets provide face-centred RGB images, which are resized to (100 × 100 × 3) and normalised to be used as input for all the models. Training deep neural networks requires a lot of training data for each of the classes to be learnt. Due to the inherent imbalances in the datasets with respect to the different expression classes (RAF-DB) or the AU labels (BP4D), we increase the overall training data through data-augmentation, randomly (p = 0.5) flipping images horizontally to create additional samples. For each experiment, we present the results with and without data-augmentation separately, for clarity.
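The augmentation step can be sketched as follows. This is an illustrative reading rather than the exact pipeline: images are represented as nested lists of pixel rows for simplicity, the function names are hypothetical, and we assume that each flipped image is appended as an extra sample that keeps the label of the original.

```python
import random


def horizontal_flip(image):
    """Mirror an image stored as a list of pixel rows (left becomes right)."""
    return [list(reversed(row)) for row in image]


def augment_with_flips(samples, p=0.5, seed=0):
    """Return the original (image, label) pairs plus randomly flipped copies.

    Each sample contributes a flipped copy with probability p. A horizontal
    flip does not change the expression or AU label, so the label is reused.
    """
    rng = random.Random(seed)
    augmented = list(samples)
    for image, label in samples:
        if rng.random() < p:
            augmented.append((horizontal_flip(image), label))
    return augmented
```

With p = 0.5, roughly half of the training images would gain a mirrored copy, so the augmented training set is about 1.5 times the size of the original under this reading.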

C. Experiment Settings
1) Evaluation Metrics: To compare the different methods on their ability to balance classification performance within individual domain-splits while remaining consistent across the domains, we evaluate them in terms of both their accuracy scores and their fairness. Furthermore, for the CL methods, we also report Catastrophic Forgetting (CF) scores, measuring the ability of the models to maintain performance on previously seen tasks while learning new tasks.

Accuracy (Acc): Accuracy is defined as the fraction of correctly classified samples. Given that TP = True Positives, FP = False Positives, TN = True Negatives and FN = False Negatives, Accuracy (Acc) can be computed as:

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$

In our experiments, we report accuracy scores separately for the different gender and race attributes to highlight differences in model performance for these domains, underlining bias in the models' performance.

Fairness Measure (F): To evaluate the different approaches for their fairness with respect to model performance for gender and race attributes, we use the 'equal opportunity' definition of fairness, as proposed by Hardt et al. [54]. Let $x$, $y$, $\hat{y}$ be the variables denoting the input, the ground truth label and the predicted label, respectively, $s \in S_i$ be the sensitive (domain) attribute (for example, $S_i$ = {male, female}), $f$ be a function computing the accuracy score for a given sensitive attribute $s$, and $d$ be the dominant attribute which has the highest accuracy score. The Fairness Measure $F$ of a model is then defined as the largest accuracy gap among all sensitive attributes, computed as the minimum of the ratios of the accuracy scores of each sensitive attribute with respect to the dominant attribute:

$$F = \min_{s \in S_i} \frac{f(s)}{f(d)}$$
In other words, F is defined as the ratio of the lowest accuracy obtained for any value of a sensitive attribute to the highest accuracy obtained for that sensitive attribute.

Catastrophic Forgetting (CF): Catastrophic forgetting [55] occurs when learning a new task negatively impacts previously learnt information. For our experiments, we also report the CF metric score [56] for the CL methods, measuring the average change in the accuracy scores of the CL model for each previous task right after learning a new task. This is computed from the matrix $A$ of accuracy scores with dimensions ($n \times n$), where $a_{i,j}$ denotes the accuracy on the $i$-th task right after learning the $j$-th task and $n$ is the number of tasks.
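The three metrics can be illustrated with a short sketch. The accuracy and fairness functions follow the definitions given in the text. The forgetting function, in contrast, is only one plausible reading of "the average change in accuracy for each previous task right after learning a new task" and is an assumption, not necessarily the exact formula of [56]; its sign convention (negative values mean accuracy on earlier tasks improved) is chosen to match how CF scores are interpreted in our results. All function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def fairness_measure(acc_by_group):
    """Minimum ratio of each group's accuracy to the dominant group's accuracy."""
    dominant = max(acc_by_group.values())
    return min(acc / dominant for acc in acc_by_group.values())


def forgetting_score(acc):
    """Average drop in accuracy on earlier tasks after each new task.

    acc[i][j] is the accuracy on task i right after learning task j (i <= j);
    entries with i > j are never read.
    ASSUMPTION: forgetting is averaged over every (earlier task, new task)
    pair as acc[i][j-1] - acc[i][j], so a negative score means improvement.
    """
    n = len(acc)
    drops = [acc[i][j - 1] - acc[i][j] for j in range(1, n) for i in range(j)]
    return sum(drops) / len(drops) if drops else 0.0
```

For example, accuracies of 0.80 and 0.76 for two gender groups give a fairness measure of 0.76 / 0.80 = 0.95, while a model whose accuracy on the first task rises from 0.70 to 0.75 after learning the second task obtains a negative forgetting score under this reading.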
2) Implementation Details: All models are trained using the Adam optimiser with a learning rate of $1.0 \times 10^{-4}$ and a batch size of 24. For the experiments with the RAF-DB dataset, all models are trained for 25 epochs, while for the BP4D dataset, due to a higher number of data samples, training converged after only 10 epochs for all the approaches. All experiments are repeated 3 times and the results are averaged across the repetitions to account for the random seeds. All models are implemented using the PyTorch Python library, based on the Continual Learning benchmarks provided by [21], [44].
Tables I and II report the regularisation coefficient values for the CL methods for experiments with the RAF-DB and BP4D datasets, respectively. These values are set by running a separate hyper-parameter search for each model and selecting the best-performing values.

A. Experiment 1: Mitigating Bias in FER
In our experiments, we compare state-of-the-art bias mitigation approaches (see Section III-B) with popular CL-based methods (see Section III-C) on their ability to classify facial expressions.
1) Bias Across Gender Attributes: For the RAF-DB dataset, approximately 53.4% of the samples are labelled as 'Female' while the 'Male' group constitutes about 40.3% of the total samples. The rest of the samples are labelled as 'Unsure' and omitted from our evaluations (see Fig. 2a). As a result, the effective split of the dataset with respect to gender is somewhat balanced, 56.3% Female against 43.7% Male samples. For the non-CL-based methods, the models are trained on the entire dataset and tested individually on the Male and Female subsets. For the CL evaluations, however, the learning is split into two tasks corresponding to expression recognition for the Male (Task 1) followed by expression recognition for the Female (Task 2) subsets. Task-ordering is discussed further in Section VI-A.
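The way a dataset is turned into a sequence of Domain-IL tasks can be sketched as below. The helper name, the dictionary-based sample format and the field names are hypothetical and serve only to illustrate the grouping and ordering described above.

```python
from collections import defaultdict


def domain_incremental_tasks(samples, attribute, order, exclude=()):
    """Group samples by a domain attribute and return them in task order.

    Every task contains all expression classes; only the domain group
    (for example, a gender or race label) changes between tasks.
    """
    groups = defaultdict(list)
    for sample in samples:
        value = sample[attribute]
        if value not in exclude:
            groups[value].append(sample)
    return [(domain, groups[domain]) for domain in order]


# Toy data standing in for labelled face images (hypothetical field names).
toy_samples = [
    {"expression": "Happy", "gender": "Female"},
    {"expression": "Sad", "gender": "Male"},
    {"expression": "Happy", "gender": "Unsure"},
    {"expression": "Fear", "gender": "Female"},
]

# Task 1: Male, Task 2: Female; 'Unsure' samples are left out.
tasks = domain_incremental_tasks(
    toy_samples, "gender", order=["Male", "Female"], exclude=("Unsure",)
)
```

A CL model would then be trained on the tasks in this order, carrying its parameters over from one domain group to the next, whereas a non-CL model would be trained once on the union of all groups.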
Table III presents the experimental results comparing the different methods on their Accuracy as well as Fairness Measure scores. It can be seen that the CL methods, overall, outperform all other methods both in accuracy as well as fairness scores, while the baseline method performs the worst. Furthermore, although the accuracy scores of all the approaches increase when data-augmentation is used, not all of them are able to maintain fairness. CL methods (with the exception of NR), on the other hand, improve upon their fairness scores as well, with SI [42] achieving the highest fairness scores both without and with data-augmentation.
In order to fully appreciate how CL enables the models to retain their performance across the two tasks, it is important to understand how learning each new task impacts the model's performance on previously learnt tasks. Table IV further reports the overall accuracy and CF scores for the CL methods after each task, evaluating the performance of these methods in terms of their ability to classify expressions for both the Male and Female subsets. We observe that, both with and without augmentation, NR [44] achieves the highest overall accuracy score after both tasks are learnt, while EWC experiences the least forgetting. Furthermore, negative CF scores for all CL methods indicate that after learning Task 2, that is, to predict expressions on Female samples, the overall accuracy of the model increased for both Male and Female samples, without any forgetting occurring in the model.
2) Bias Across Race Attributes: The data distribution of the RAF-DB dataset is highly imbalanced with respect to race, with a majority (77.4%) of the samples labelled as Caucasian, while the African-American and Asian subsets correspond to only 7.1% and 15.5% of the samples, respectively (see Fig. 2b). Similar to the evaluations across gender, for the non-CL-based methods, the model is trained on the entire dataset and evaluated individually for the different race attributes. For the CL evaluations, the learning is split into three tasks corresponding to learning to predict expressions for Caucasian (Task 1), African-American (Task 2) and Asian (Task 3) faces. Table V presents the results of the experiments comparing Accuracy and Fairness Measure scores. The imbalances in the data distribution affect all the approaches such that the model accuracy varies across the race groupings. Yet, the CL methods seem to handle this best, achieving comparable accuracy with high fairness scores. Even though the CL approaches are not always the best performing ones, particularly for the Asian subset, all of them achieve high fairness scores, with SI performing the best. This underlines their ability to balance learning across the different tasks. They are able to give preference to being consistent and fair, trading off higher accuracy scores for any individual race label. The NR approach, on the other hand, achieves the highest accuracy scores on Task 1 and Task 2, owing to the explicit replay mechanism, but sacrifices fairness across all groups in the process. Additionally, as RAF-DB is a relatively small dataset, data-augmentation has a positive effect on the accuracy scores of all the models, but the fairness scores do not change significantly.
Table VI reports the accuracy and CF scores for all the CL methods, showing model performance on all previous tasks computed at the end of each new task. We see that all models tend to forget as they learn new tasks, yet the SI method is able to mitigate forgetting the best after having learnt all the tasks. When data-augmentation is used, the individual accuracy scores are enhanced but the CF scores do not improve. Owing to the explicit replay mechanism, the NR method performs the best in terms of mitigating forgetting when using data-augmentation.

B. Experiment 2: Mitigating Bias in AU Detection
As more than one AU may be activated at the same time (for example, AU 1, 2 and 26 together may depict surprise), predicting facial AUs poses a multi-label classification problem. Imbalances in data distributions with respect to the different gender and race attributes become even more prominent, with certain AU classes having many more data samples than the others (see Fig. 3). We compare different bias mitigation strategies (see Section III-B) with CL-based methods (see Section III-C) to understand how they cope with imbalances in data distributions while retaining model performance.
Similar to Experiment 1, we compare the performance of all models (without and with data-augmentation) on detecting activations for the 12 AUs for gender (Male, Female) and race (White, Black, Asian, Latino) groupings.
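Since AU detection is evaluated per label, the accuracy computation differs slightly from the single-label FER case. The following sketch shows per-AU accuracy and its average over labels, as reported for the BP4D experiments. The function names and the 0.5 decision threshold on the model outputs are assumptions made for illustration.

```python
def binarise(scores, threshold=0.5):
    """Turn per-AU activation scores into 0/1 occurrence predictions."""
    return [[1 if s >= threshold else 0 for s in row] for row in scores]


def per_label_accuracy(y_true, y_pred):
    """Accuracy computed separately for each AU (one column per AU)."""
    n_samples = len(y_true)
    n_labels = len(y_true[0])
    return [
        sum(1 for t, p in zip(y_true, y_pred) if t[k] == p[k]) / n_samples
        for k in range(n_labels)
    ]


def mean_accuracy(y_true, y_pred):
    """Accuracy averaged across all AU labels."""
    per_label = per_label_accuracy(y_true, y_pred)
    return sum(per_label) / len(per_label)
```

Each row of `y_true` is a multi-hot vector with one entry per AU, so several AUs can be active for the same frame. Computing the per-label scores separately for each gender or race subset yields the group-wise accuracies from which the fairness measure is derived.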
1) Bias Across Gender Attributes: Similar to Experiment 1, for the non-CL-based methods, we train the individual models on the entire dataset but evaluate them individually for the Male and Female subsets. The BP4D data distribution is skewed in favour of Female samples, which constitute 60.96% of the data, while only 39.04% of the samples belong to the Male subset (see Fig. 3a). For CL methods, the learning is split into two tasks, Task 1: Male and Task 2: Female, learnt incrementally. Table VII compares the different methods on their Accuracy and Fairness Measure scores for both Male and Female subsets. CL methods are shown to outperform other methods in terms of accuracy, yet DA [36] attains the highest Fairness score. Although CL methods perform better than others on individual tasks, they are not able to balance this learning across tasks as well as DA. MAS outperforms other CL methods on accuracy, yet the SI and EWC-Online methods score higher on fairness, without and with data-augmentation, respectively. Data-augmentation, overall, has a positive impact on model accuracy scores but, much like in Experiment 1, it does not significantly impact model fairness. Individual class accuracy between the Male and Female splits does not vary significantly for the 12 AU labels, with AU 2 and AU 12 achieving the lowest and highest accuracy scores, respectively, across the models for both the splits. This can be due to these classes consisting of the lowest and highest number of data samples in the BP4D distribution across both gender and race splits (see Fig. 3).
Comparing different CL methods on their ability to maintain performance across the tasks, Table VIII reports the overall accuracy and CF scores for the models. Owing to the complex multi-label nature of the tasks as well as the high gender disparity in the data distribution, we see a high variation in the performance scores of the different CL methods. While EWC-Online performs the best without employing data-augmentation, achieving a negative CF score, the MAS model performs the best with data-augmentation.
2) Bias Across Race Attributes: The majority of the samples in the BP4D dataset are labelled as White (approximately 46.76%), with other samples corresponding to the Asian (26.08%), Black (16.56%) and Latino (10.6%) groups (see Fig. 3b). For our evaluations, we split the dataset into 4 subsets based on these labels, representing the 4 tasks for the CL models, that is, Task 1: Black, Task 2: Asian, Task 3: White and Task 4: Latino. Task-orderings are discussed further in Section VI-B.
It can be seen in Table IX that CL methods, overall, achieve high accuracy and Fairness scores, with the NR approach outperforming other methods both without and with data-augmentation. While NR achieves the highest accuracy scores for the Black, Asian and White subsets even without data-augmentation, none of the approaches is able to beat the offline baseline for the Latino subset. This may be owing to the extremely low number of samples in the Latino subset: instead of focusing on improving performance on this subset, the CL methods focus on maintaining performance across all the subsets. Data-augmentation has a positive impact on all the models in terms of improvements both in accuracy and fairness scores.
Different CL models handle the high variance in data distribution with respect to racial identity labels with varying levels of success. Table X shows how, at different points during the learning, different models perform better than others, while NR achieves the best accuracy and CF scores after all tasks are learnt, both with and without data-augmentation. The negative CF scores for all the approaches at the end of Task 4 signify that all the CL models were able to mitigate forgetting and that the overall model performance improved as they incrementally learnt new tasks.

VI. DISCUSSION
Our experiments on FER (see Section V-A) and AU detection (see Section V-B) tasks are motivating as they highlight how adopting CL strategies may enable fairer facial affect analysis algorithms. Consistently achieving high accuracy as well as fairness measure scores, CL offers an improvement over existing learning strategies for bias mitigation in ML algorithms. Robustly managing imbalances in data distributions, both without and with data-augmentation, CL methods are better equipped to deal with biases owing to their learning strategy of focusing on one domain group at a time. Here, we discuss each task individually and highlight how CL provides a solution towards fairer facial affect analyses.

A. Facial Expression Recognition
When applied to FER, CL methods aim to sequentially learn to predict expression categories for the different gender and race groups. The models are trained with one group at a time and, as the model experiences samples from other groups, it actively tries to maintain performance on previously seen groups without forgetting. As a result, for both gender and race groups, CL models are able to achieve high fairness scores by balancing performance across the splits, with the SI model performing the best (see Table XI). Selective updates of network parameters in order to mitigate forgetting allow CL models to maintain high accuracy scores across the different gender and race attributes. This makes them distinct from other approaches, as they directly focus on maintaining performance across different domain distributions instead of deciding whether to capture domain-specific features or not. In comparison, non-CL-based methods rely either on becoming 'aware' of domain attributes to predict expressions according to the subjects sharing gender or race attributes, or on learning feature representations that actively 'block' domain discriminative features [36]. Furthermore, for most of the non-CL methods, with the exception of DA, we need to know the domain groupings a priori, which may not always be possible in real-world scenarios. For CL methods, however, as models learn sequentially, there is no need to provide any domain information a priori and learning can be extended to new domains.
One concern when applying CL methods to FER tasks is the class-ordering effect, where model performance is seen to be sensitive to the order in which it learns different expression classes [47]. In our experiments, as we implement the Domain-IL scenario where all classes are learnt at the same time, albeit one domain-group at a time, class-ordering does not play any role in the learning. Instead, we explore whether different task-orderings, that is, learning with different sequences of gender or race group splits, have any effect on the models' ability to maintain performance. For both gender and race domains, we experiment with different orders of learning the tasks, but no significant effect of domain ordering is witnessed on the models' performance. Class-wise accuracies are largely consistent between the different learning settings for the CL models with model accuracy being the worst for

B. Action Unit Detection
Action Unit (AU) detection poses a harder multi-label classification problem where the models need to predict all the AUs activated in a given sample. The inherent class-imbalances in the BP4D dataset are further accentuated by the imbalances with respect to gender and race attributes, making it extremely difficult for CL as well as non-CL models to maintain performance across the different groups. The under-represented classes are reduced to even fewer samples per class when split across gender or race, making it even more difficult for these models to cope with data imbalances. Even though CL-based methods are able to achieve the highest individual accuracy scores (averaged across the 12 AUs) for most of the gender and race groups, this comes at the cost of balancing learning across the different attributes. For the gender splits, Disentangled Feature Learning (DA) achieves the highest fairness scores, despite performing moderately in terms of accuracy on individual splits (see Table XII). Blocking individual domain-specific information allows DA to balance learning across the different splits, resulting in high fairness scores. For CL models, however, the multi-label classification settings cause the models to focus more on overall individual performance rather than on maintaining performance across the gender splits. In the case of race-splits, we see that the NR approach achieves the highest fairness scores, while DA performs second-best. This is due to the memory-intensive rehearsal mechanism that physically stores and replays samples from previously seen domains to retain model performance. Even though regularisation-based approaches target accuracy and trade off fairness in the process, they still perform better than most non-CL-based methods.
Due to the multi-label settings, all classes are learnt together with no ordering of the classes required. Furthermore, domain-ordering, that is, the order in which the gender and race domains should be learnt, does not have any significant effect on model performance for the CL methods. Owing to the highly imbalanced class-distributions, the performance of all models is poor for under-represented classes such as AU 1, 2 and 4, across all gender and race splits. On the other hand, the highest model performances are achieved for dominant classes such as AUs 10 and 12. These results are in line with other AU prediction approaches [35], [57], [58] that report similar differences in performance across these AUs.

[Table fragment; column headers not recoverable]
[42] 0.986 0.946 0.965 0.954
MAS [49] 0.966 0.920 0.967 0.909
NR [53] 0.983 0.966 0.954 0.974

C. Limitations of CL-based Bias Mitigation
Our benchmark experiments with the RAF-DB and BP4D datasets highlight the potential of CL-based models for creating fairer facial expression recognition systems. CL-based models outperform other bias mitigation strategies for evaluations across gender and race domains, managing shifts in data distributions well. However, more work is needed to optimise CL-based models for multi-label settings where they under-perform (see Tables VII and IX). Recent work by Kim et al. [59] proposes a new replay-based strategy, Partitioning Reservoir Sampling (PRS), that aims to tackle continual learning for multi-label classification, balancing both intra- and inter-task imbalances. Yet, they benchmark their approach on classification settings with little-to-no overlap between the tasks. This is not the case for AU detection, where the different domains, as well as the classes within each domain, share feature representations, making it even harder for the models.
Furthermore, as regularisation-based CL models assign importance to different parameters based on their contribution towards previously learnt tasks, shared feature representations make it harder for models to incrementally learn different tasks/domains, as model parameters may contribute to more than one task or domain. Rehearsal-based methods such as NR, on the other hand, require the models to physically store seen samples from previous tasks, interleaving them with new data to maintain performance. As the number of tasks or, in the case of Domain-IL, the number of data-splits across domains such as gender or race increases, storing samples from all the domains becomes extremely expensive both in terms of memory footprint and the computational power needed to train the algorithms.
Additionally, as the tasks increase, models may experience saturation [60], requiring stronger regularisation in the models to be able to preserve past knowledge [61]. The performance of the models also takes a hit when the model needs to reprioritise between giving more importance to the new task and remembering previous tasks. We see this in race-wise splits for both the datasets (see Tables V and IX), where regularisation-based models attain higher accuracy scores for the last split, while the NR method aims to maintain a higher fairness score instead.

VII. CONCLUSION AND FUTURE WORK
In this work, we propose the novel use of Domain-Incremental CL as a potent bias mitigation method for facial analysis tasks. In particular, we highlight how, using Domain-IL settings, regularisation-based CL methods can help develop fairer expression recognition and AU detection algorithms. Our experiments with popular benchmark datasets, RAF-DB for expression recognition and BP4D for AU detection, showcase the superlative performance of CL methods at handling imbalances in data distributions with respect to the demographic attributes of gender and race. In comparison with state-of-the-art bias mitigation approaches, these methods are able to balance learning across different domain splits, not only achieving high accuracy scores but also maintaining fairness across the different splits.
Yet, this proof-of-concept evaluation was limited to regularisation-based methods only and hence further experimentation is needed to fully understand the benefits of using CL as an effective bias mitigation strategy for facial expression and action unit recognition tasks. With harder problems, as in the case of multi-label AU detection, we see that even though regularisation-based methods achieve high accuracy, they do so by sacrificing fairness across different domain attributes. While a simplistic and naive rehearsal mechanism is able to improve model performance, our future work will aim to investigate other, more complex, pseudo-rehearsal [46], [47], [59], [61] or neuro-inspired [60], [62], [63] methods for their bias mitigation abilities.

TABLE I
REGULARISATION COEFFICIENT VALUES FOR FER EXPERIMENTS WITH THE RAF-DB DATASET (×10^3).

TABLE II
REGULARISATION COEFFICIENT VALUES FOR AU DETECTION EXPERIMENTS WITH THE BP4D DATASET (×10^3).

TABLE III
EXPERIMENT 1: GENDER-WISE ACCURACY AND FAIRNESS SCORES ON RAF-DB DATASET. ACCURACY SCORES ARE REPORTED AFTER TRAINING THE MODELS ON BOTH MALE AND FEMALE SUBSETS. BOLD VALUES DENOTE BEST WHILE [bracketed] DENOTE SECOND-BEST VALUES FOR EACH COLUMN.

TABLE IV
EXPERIMENT 1: CATASTROPHIC FORGETTING (CF) AND OVERALL ACCURACY (PREVIOUS TASKS) AFTER EACH TASK FOR GENDER-ORDERED LEARNING ON RAF-DB DATASET. BOLD VALUES DENOTE THE BEST WHILE [bracketed] DENOTE SECOND-BEST VALUES FOR EACH COLUMN.

Bias Across Gender Attributes: For the RAF-DB dataset, approximately 53.4% of the samples are labelled as 'Female' while the 'Male' group constitutes about 40.3%

TABLE VI
EXPERIMENT 1: CATASTROPHIC FORGETTING (CF) AND OVERALL ACCURACY (PREVIOUS TASKS) AFTER EACH TASK FOR RACE-ORDERED LEARNING ON RAF-DB DATASET. BOLD VALUES DENOTE BEST WHILE [bracketed] DENOTE SECOND-BEST VALUES FOR EACH COLUMN.

TABLE X
EXPERIMENT 2: CF AND OVERALL ACCURACY (PREVIOUS TASKS) AFTER EACH TASK FOR RACE-ORDERED LEARNING ON BP4D DATASET. BOLD VALUES DENOTE THE BEST WHILE [bracketed] DENOTE SECOND-BEST VALUES FOR EACH COLUMN.

TABLE XI
EXPERIMENT 1: FAIRNESS MEASURE SCORES ACROSS GENDER AND RACE DISTRIBUTIONS FOR THE RAF-DB DATASET. BOLD VALUES DENOTE BEST WHILE [bracketed] DENOTE SECOND-BEST VALUES FOR EACH COLUMN.

TABLE XII
EXPERIMENT 2: FAIRNESS MEASURE SCORES ACROSS GENDER AND RACE DISTRIBUTIONS FOR THE BP4D DATASET. BOLD VALUES DENOTE BEST WHILE [bracketed] DENOTE SECOND-BEST VALUES FOR EACH COLUMN.