Automatic facial expression analysis

Fitzwilliam College
Tadas Baltrušaitis

Submitted in March 2014 for the Degree of Doctor of Philosophy

Declaration

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text. This dissertation does not exceed the regulation length of 60 000 words, including tables and footnotes.

Summary

Humans spend a large amount of their time interacting with computers of one type or another. However, computers are emotionally blind and indifferent to the affective states of their users. Human-computer interaction which does not consider emotions ignores a whole channel of available information.

Faces contain a large portion of our emotionally expressive behaviour. We use facial expressions to display our emotional states and to manage our interactions. Furthermore, we express and read emotions in faces effortlessly. However, automatic understanding of facial expressions is a very difficult task computationally, especially in the presence of highly variable pose, expression and illumination. My work furthers the field of automatic facial expression tracking by tackling these issues, bringing emotionally aware computing closer to reality.

Firstly, I present an in-depth analysis of the Constrained Local Model (CLM) for facial expression and head pose tracking. I propose a number of extensions that make the location of facial features more accurate.

Secondly, I introduce a 3D Constrained Local Model (CLM-Z) which takes full advantage of depth information available from various range scanners. CLM-Z is robust to changes in illumination and shows better facial tracking performance.

Thirdly, I present the Constrained Local Neural Field (CLNF), a novel instance of CLM that deals with the issues of facial tracking in complex scenes. It achieves this through the use of a novel landmark detector and a novel CLM fitting algorithm. CLNF outperforms state-of-the-art models for facial tracking in the presence of difficult illumination and varying pose.

Lastly, I demonstrate how tracked facial expressions can be used for emotion inference from videos. I also show how the tools developed for facial tracking can be applied to emotion inference in music.

Acknowledgements

Many people have supported me during my time as a PhD student, and I would like to thank them all.

This dissertation would not have been possible without the guidance of my supervisor, Peter Robinson. I thank him for his constant support and for giving me this wonderful opportunity. I am also very grateful to Louis-Philippe Morency, who hosted me during my visit to the Institute for Creative Technologies. His unending energy helped me to stay motivated.

I would like to thank the Rainbow Group and the Computer Laboratory, which provided me with the necessary atmosphere for my work. I enjoyed the daily coffee breaks with my colleagues, especially Leszek, Marwa, Vaiva, Christian, Ntombi, and Ian. I also thank Graham Titmus, our ever-helpful system administrator. I am grateful to Alan Blackwell and Neil Dodgson for keeping me on track during my yearly reports.

My time at the Institute for Creative Technologies at the University of Southern California rekindled my interest in the dissertation topic. I would especially like to thank Julien-Charles, Sylwia, Dimitrios and Geovany for all the wonderful conversations we had.
This work could not have been possible without the financial support of Thales Research and Technology UK. I would like to thank Chris Firth and Mark Ashdown for funding my work and for their support.

My family provided me with an opportunity for growth and education and I am forever indebted to them. I am very grateful that they encouraged me to pursue my education and did not mind me going far away from home.

Finally, I am forever thankful to my soon-to-be wife, Rachael. She has painstakingly proof-read my dissertation and was always supportive and patient during deadlines and my time away from home.

Contents

1 Introduction
  1.1 Contributions
  1.2 Structure of the dissertation
  1.3 Publications
2 Affective Computing
  2.1 Application areas
  2.2 Emotions
    2.2.1 Theories of emotion
    2.2.2 Affect expression and recognition
  2.3 Facial expressions of emotion
    2.3.1 Head pose and eye gaze
  2.4 Facial affect analysis
  2.5 Facial tracking
    2.5.1 Landmark detection and tracking
    2.5.2 Head pose tracking
    2.5.3 Combined landmark and head pose tracking
3 Facial expression and head pose datasets
  3.1 Image datasets
    3.1.1 Multi-PIE
    3.1.2 BU-4DFE subset
  3.2 Image sequence datasets
    3.2.1 ICT-3DHP
    3.2.2 Biwi Kinect Head Pose
    3.2.3 Boston University head pose dataset
4 Constrained local model
  4.1 Introduction
    4.1.1 Deformable model approaches to facial tracking
    4.1.2 Problem formulation
    4.1.3 Structure of discussion
  4.2 Statistical shape model
    4.2.1 Choosing the points
    4.2.2 Model
    4.2.3 Dimensionality of the model
    4.2.4 Placing the model in an image
    4.2.5 Point distribution model fitting
    4.2.6 Model construction
  4.3 Patch experts
    4.3.1 Implementation using convolution
    4.3.2 Modalities to use
    4.3.3 Multi-view patch experts
  4.4 Patch expert training
    4.4.1 Training data
  4.5 Constrained Local Model fitting
    4.5.1 Regularised landmark mean shift
    4.5.2 Non-uniform regularised landmark mean shift
    4.5.3 Multi-scale fitting
  4.6 System overview
    4.6.1 Face detector
    4.6.2 Landmark detection validation
  4.7 Experiments
    4.7.1 Methodology
    4.7.2 Multi-modal patch experts
    4.7.3 Non-uniform regularised landmark mean shift
    4.7.4 Multi-scale fitting
    4.7.5 Head pose estimation
    4.7.6 Conclusions
  4.8 CLM issues
    4.8.1 Illumination
    4.8.2 Pose
    4.8.3 Expression issues
    4.8.4 Discussion
  4.9 General discussion
5 CLM-Z
  5.1 Depth data
    5.1.1 Representation
  5.2 Model
  5.3 Patch experts
  5.4 Fitting
  5.5 Training data
  5.6 Combining rigid and non-rigid tracking
  5.7 Experiments
    5.7.1 Methodology
    5.7.2 Normalisation
    5.7.3 Patch response combination
    5.7.4 Landmark detection in images
    5.7.5 Evaluation on image sequences
    5.7.6 Head pose tracking using depth data
    5.7.7 Head pose tracking on 2D data
  5.8 Conclusion
6 Constrained Local Neural Field
  6.1 Continuous Conditional Neural Field
    6.1.1 Potential functions
    6.1.2 Learning and inference
  6.2 Local Neural Field
    6.2.1 Training
  6.3 Patch expert experiments
    6.3.1 Methodology
    6.3.2 Importance of edge features
    6.3.3 Facial landmark detection under easy illumination
    6.3.4 Facial landmark detection under general illumination
  6.4 General experiments
    6.4.1 Facial landmark detection
    6.4.2 Facial landmark tracking
    6.4.3 Head pose estimation
  6.5 Conclusions
7 Case study: Automatic expression analysis
  7.1 Introduction
  7.2 Background
  7.3 Continuous CRF
    7.3.1 Model definition
    7.3.2 Feature functions
    7.3.3 Learning
    7.3.4 Inference
  7.4 Video features
    7.4.1 Geometric features
    7.4.2 Appearance-based features
    7.4.3 Motion features
  7.5 Audio features
  7.6 Final system
  7.7 Evaluation
    7.7.1 Database
    7.7.2 Methodology
    7.7.3 Results
  7.8 Conclusion
8 Case study: Emotion analysis in music
  8.1 Introduction
  8.2 Background
  8.3 Linear-chain Continuous Conditional Neural Fields
    8.3.1 Model definition
  8.4 Evaluation
    8.4.1 Dataset
    8.4.2 Baselines
    8.4.3 Error metrics
    8.4.4 Design of the experiments
    8.4.5 Results
  8.5 Discussion
9 Conclusions
  9.1 Contributions
    9.1.1 Constrained Local Model extensions
    9.1.2 3D Constrained Local Model
    9.1.3 Continuous Conditional Neural Field
    9.1.4 Emotion inference in continuous space
  9.2 Future work
Bibliography

1 Introduction

Computers are quickly becoming a ubiquitous part of our lives. We spend a great deal of time interacting with computers of one type or another. At the moment the devices we use are indifferent to our affective states. They are emotionally blind. However, successful human-human communication relies on the ability to read affective and emotional signals. Human-computer interaction (HCI) which does not consider the affective states of its users loses a large part of the information available in the interaction.

Recently, affective computing has been widely studied and there is a growing belief that providing computers with the ability to read the affective states of their users would be beneficial (Pantic et al., 2006; Picard, 1997; Robinson and el Kaliouby, 2009). It is believed that in order to make future progress in HCI it is necessary to recognise users' affect. This is informed by the importance of emotion in our daily lives (Cohn, 2006). Affective computing tries to bridge the gap between the emotionally expressive human and the emotionally deficient computer (D'Mello and Calvo, 2013).
There are many application areas that could benefit from the ability to detect affect. These range from interfaces that do not interrupt their users when they are stressed, to online learning systems that adapt their teaching if the student is confused, and video games that adapt their difficulty based on player engagement. Further applications include: assisted living environments that can monitor the users' state and report to medical professionals if the patient is feeling pain; assistive technologies for diagnosing conditions such as depression; and systems that monitor drivers or pilots for boredom.

Reliable automated recognition of human emotions is crucial before the development of affect-sensitive systems is possible (Picard and Klein, 2001). Humans display affective behaviour that is multi-modal, subtle and complex. People are adept at expressing themselves and interpreting others through the use of non-verbal cues such as vocal prosody, facial expressions, eye gaze, various hand gestures, head motion and posture. All of these modalities convey important affective information that humans use to infer the emotional state of each other (Ambady and Rosenthal, 1992).

Of these modalities, the face has received the most attention from both psychologists and affective computing researchers (Zeng et al., 2009). This is not surprising, as faces are the most visible social part of the human body. They reveal emotions (Ekman and Rosenberg, 2005), communicate intent, and help regulate social interaction (Schmidt and Cohn, 2001). Although not strictly part of the face, head gestures play an important part in human communication as well (Bavelas et al., 2000) and have been investigated for affect detection (Ramirez et al., 2011).

We can look at facial expressions and head gestures from two main perspectives: message judgement and sign judgement. Message judgement approaches facial expressions in terms of meaning (emotion, intention, etc.), whereas sign judgement looks at the underlying anatomical structure and does not interpret the message. In order to achieve message judgements we need to be able to read the signs. My work mainly concentrates on sign judgement (reliably tracking faces and head pose), but I do present several case studies of message judgement.

Most of the outlined potential uses of affective computing rely fully, or at least partially, on the ability to automatically analyse human facial expressions. In order to do so, an ability to locate certain areas of the face and to estimate head pose is necessary. The approaches need to work outside the lab and in the wild: in outdoor environments, dimly lit rooms, in the presence of harsh shadows, and in various other noisy environments. The ideal facial tracker should also be person-independent. Furthermore, to be of any use they have to be computationally efficient, especially if large-scale monitoring or analysis of large databases is needed. These requirements combine to present an extremely challenging task for computer vision.

In this dissertation I attempt to bring the state-of-the-art closer to being able to operate in the wild. I do this by extending the Constrained Local Model framework to work with depth data from various range scanners, thus reducing the effect of illumination and leading to better tracking.
I also develop a novel patch expert that can learn complex non-linear relationships between pixel values and landmark locations, leading to more accurate facial tracking, especially under difficult illumination conditions. Finally, while the main goal of the research was to develop methods for more reliable tracking of faces, I also show how the tracked points can be used for emotion recognition, and how some of the methods developed can even be used for emotion prediction from music.

1.1 Contributions

The main contributions of my dissertation can be split into the following parts:

Constrained Local Model extensions

I present a number of extensions to the existing state-of-the-art approach for facial landmark detection. These include a multi-modal, multi-scale formulation together with a novel fitting procedure. I demonstrate the benefits of these extensions on a number of publicly available datasets. Furthermore, I provide a detailed analysis of the issues the model faces.

3D Constrained Local Model

I present an extension of the Constrained Local Model paradigm to include depth information in addition to a regular visible light camera. This approach leads to more accurate and more robust fitting, helping to deal with bad lighting conditions. This was published as Baltrušaitis et al. (2012) in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), and was awarded the publication of the year award by the Cambridge Computer Laboratory Ring.

Continuous Conditional Neural Field

I present a novel Continuous Conditional Neural Field model which is able to learn complex relationships between data and output. It can capture the complex non-linear relationships between pixel values and landmark locations, and exploit their spatial relationships. This allows for more accurate and reliable face tracking, and is especially helpful for illumination-invariant facial tracking. Furthermore, the flexibility of this model is demonstrated through its employment in the task of emotion prediction in music. The description of the model has been published in the 300 Faces in-the-Wild Challenge workshop at the IEEE International Conference on Computer Vision, 2013 (Baltrušaitis et al., 2013b).

Emotion inference in continuous space

I contribute to the growing body of methods for continuous dimensional emotion prediction from audio-visual data by employing a Continuous Conditional Random Field model, which reliably combines multiple modalities and exploits temporal characteristics of the emotional signal. This was published as Baltrušaitis et al. (2013a) at the IEEE International Conference on Automatic Face and Gesture Recognition (FG).

1.2 Structure of the dissertation

I will begin with an overview of affective computing in Chapter 2. I will explain the underlying emotion theories and possible application areas. Special focus will be put on facial expressions and head pose.

Chapter 3 will give an overview of the datasets used to evaluate both rigid and non-rigid facial tracking.

In Chapter 4 I will provide a detailed explanation of the Constrained Local Model (CLM) approach to facial landmark detection, including some in-depth analysis of implementation details. I will also describe several extensions I have developed.

An extension to CLM that uses depth information alongside regular visible light information will be presented in Chapter 5.
A novel Continuous Conditional Neural Field (CCNF) graphical model is introduced in Chapter 6. I will present a particular instance of CCNF which can be used as a novel patch expert that can learn complex non-linear relationships between pixel values and landmark locations. I will also demonstrate how this patch expert can be used to build a CLM landmark detector that outperforms current state-of-the-art detectors in most conditions.

Chapter 7 will demonstrate how such facial expression and head pose tracking can be used to infer emotions in dimensional space while exploiting the temporal properties of the emotional signal.

Another case study will be presented in Chapter 8. Here the CCNF model, developed for landmark detection, is used for emotion prediction in music, outperforming some state-of-the-art approaches.

Finally, Chapter 9 will provide the concluding remarks of the dissertation and outline the current limitations together with future research directions.

1.3 Publications

1. Tadas Baltrušaitis, Laurel D. Riek, and Peter Robinson. Synthesizing Expressions using Facial Feature Point Tracking: How Emotion is Conveyed, in ACM Workshop on Affective Interaction in Natural Environments, October 2010

2. Tadas Baltrušaitis and Peter Robinson. Analysis of Colour Space Transforms for Person Independent AAMs, in ACM / SSPNet 2nd International Symposium on Facial Analysis and Animation, September 2010

3. Tadas Baltrušaitis, Daniel McDuff, Ntombikayise Banda, Marwa Mahmoud, Rana el Kaliouby, Rosalind Picard, and Peter Robinson. Real-time inference of mental states from facial expressions and upper body gestures, in IEEE International Conference on Automatic Face and Gesture Recognition, Facial Expression Recognition and Analysis Challenge, June 2011

4. Geovany A. Ramirez, Tadas Baltrušaitis, and Louis-Philippe Morency. Modeling Latent Discriminative Dynamic of Multi-Dimensional Affective Signals, in 1st International Audio/Visual Emotion Challenge and Workshop in conjunction with ACII, October 2011 (Winner of the video sub-challenge)

5. Marwa Mahmoud, Tadas Baltrušaitis, and Peter Robinson. 3D corpus of spontaneous complex mental states, in International Conference on Affective Computing and Intelligent Interaction (ACII), October 2011

6. Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. 3D Constrained local model for rigid and non-rigid facial tracking, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2012

7. Marwa Mahmoud, Tadas Baltrušaitis, and Peter Robinson. Crowdsourcing in emotion studies across time and culture, in Workshop on Crowdsourcing for Multimedia, ACM Multimedia, October 2012

8. Tadas Baltrušaitis, Ntombikayise Banda, and Peter Robinson. Dimensional Affect Recognition using Continuous Conditional Random Fields, in IEEE International Conference on Automatic Face and Gesture Recognition (FG), April 2013

9. Vaiva Imbrasaitė, Tadas Baltrušaitis, and Peter Robinson. Emotion tracking in music using continuous conditional random fields and baseline feature representation, in ICME 2013 Workshop on Affective Analysis in Multimedia, July 2013

10. Vaiva Imbrasaitė, Tadas Baltrušaitis, and Peter Robinson. What really matters? A study into people's instinctive evaluation metrics for continuous emotion prediction in music, in International Conference on Affective Computing and Intelligent Interaction (ACII), September 2013
11. Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. Constrained Local Neural Fields for robust facial landmark detection in the wild, in 300 Faces in-the-Wild Challenge (300-W), IEEE International Conference on Computer Vision (ICCV), December 2013

2 Affective Computing

Affective computing was first popularised by Rosalind Picard's book "Affective Computing", which called for research into automatic sensing, detection and interpretation of affect and identified its possible uses in human-computer interaction (HCI) contexts (Picard, 1997). Automatic affect sensing has attracted a lot of interest from various fields and research groups, including psychology, cognitive sciences, linguistics, computer vision, speech analysis, and machine learning. Progress in automatic affect recognition depends on progress in all of these seemingly disparate fields.

Following the lead of Picard (1997) I use the terms emotion, mental state and affective state interchangeably, using them to refer to a dynamic state when a person experiences a feeling.

Affective computing has grown and diversified over the past decades. It now encompasses automatic affect sensing, affect synthesis and the design of emotionally intelligent interfaces. It is a field too broad to describe in detail in this dissertation, but I attempt to provide an overview of the field with an emphasis on affect sensing from facial expressions. I first provide motivation for automatic affect inference by giving examples of various application areas. I follow this with a brief overview of the main theories of emotion, with a special focus on facial expressions. This is followed by an overview of affect sensing and facial expression analysis techniques.

2.1 Application areas

There are a number of areas where the automatic detection and synthesis of affect would be beneficial. I give a number of examples of such potential systems, and outline some of the work that already uses automatic affect analysis.

Automatic tracking of attention, boredom and stress would be highly valuable in safety-critical systems where the attentiveness of the operator is crucial. Examples of such systems are air traffic control, nuclear power plant surveillance, and operating a motor vehicle. An automated tracking tool could make these systems more secure and efficient, because early detection of negative affective states could alert the operator or others around them, thus helping to avoid accidents.

Affect sensing systems could also be used to monitor patients in hospitals, or when medical staff are not readily available or are overburdened. They could also be used in assisted living scenarios to monitor patients and inform the medical staff during emergencies. There are some promising developments in medical applications of affective computing. One such development is the automatic detection of pain as proposed by Ashraf et al. (2009). Another promising development is the automatic detection of depression from facial and auditory signals (Cohn et al., 2009).

Automatic detection of affect would not only benefit safety-critical and medical environments; it also has uses in the entertainment industry. One can imagine video games providing players with a more tailored experience if the affective state of the player were known to the game. In addition, affective information could be used to augment limited channels of communication such as text messaging.
One such system was developed by Höök (2009) and is called eMoto. It is a phone with an augmented SMS service where users, besides sending a text message, are allowed to choose its background from colourful and animated shapes. These backgrounds are supposed to represent emotional content along the two axes of arousal and valence.

People with autism spectrum disorder have difficulty understanding the emotional states of others and expressing these states themselves (Baron-Cohen et al., 1985; Picard, 2009). Automatic recognition of affect could help autistic people to express their own affective states (Picard, 2009), by allowing them to express outwardly what is being felt inwardly. It could also be possible to build systems that help these people better understand the affective states of others.

In addition, affect synthesis is beneficial for the creation of believable virtual characters (avatars) (Cassell, 2000) and robotic platforms (Riek and Robinson, 2011), as it allows these agents to act more like humans. Systems that are able to analyse affect can often be used to synthesise it if generative models are used, hence affect synthesis would benefit from better affect analysis.

Another possible application of automatic affect recognition is in furthering our understanding of human behaviour and emotions. These systems could be used to speed up the currently labour-intensive, error-prone and tedious task of labelling emotional data. Of special interest is work by Girard et al. (2013), in which automated tools for facial expression analysis (Action Unit detection) are used to support and inform existing theories of depression.

Another example of an affect system already being used is the work by Affectiva on automatic classification of content preference. McDuff et al. (2013) were successful in determining whether people liked certain advertisements and were likely to watch them again by analysing their smiling behaviour. This work is potentially very useful for the advertising and marketing domains, where new evaluation metrics are constantly sought. The authors collected a dataset in naturalistic environments by using the webcams of the visitors to their website. In total 6729 video segments were collected of people watching a number of selected advertisements. However, even though the authors were using a state-of-the-art tracker, the majority of frames were tracked successfully in only 67% of the videos, demonstrating the need for facial trackers capable of coping with real-life environments.

2.2 Emotions

Emotion research started with Charles Darwin about 140 years ago with his work The Expression of the Emotions in Man and Animals (Darwin, 1872). This created a lot of controversy at the time of its publication due to its contentious claim of the universality of emotions and their evolutionary origins. Emotions have been a popular research topic ever since.

According to some researchers, emotions developed as an evolutionary advantage (Ekman, 1992). It is thought that emotions evolved for their adaptive value in fundamental life tasks (Ekman, 1992), i.e. that they make us act in a way that was advantageous over the course of evolution.

Affective states and their behavioural expressions are an important part of human life. They influence the way we behave, make decisions and communicate with others (Scherer, 2005).
This is because our actions are influenced both by the affective state we are in and by the affective states of people around us.

2.2.1 Theories of emotion

Before talking about the automatic detection of affect one has to understand what affect is. Unfortunately, psychologists themselves have not reached a consensus on the definitions of emotion and affect. The three most popular ways that affect has been conceptualised in psychology research are as follows: discrete categories, dimensional representation, and appraisal-based representation. These theories are a good starting point for understanding affect for the purposes of automatic affect recognition, as they provide information about the ways affect is expressed and interpreted.

Categorical

A popular way to describe emotion is in terms of discrete categories using the language from daily life (Ekman et al., 1982). The most popular example of such a categorisation is the basic emotions proposed by Paul Ekman (Ekman, 1992). These are: happiness, sadness, surprise, fear, anger, and disgust. Ekman suggests that they have evolved in the same way for all mankind and that their recognition and expression are independent of nurture. This is supported by a number of cross-cultural studies performed by Ekman et al. (1982), suggesting that the facial expressions of the basic emotions are perceived in the same way, regardless of culture. Facial expressions representative of these emotions can be seen in Figure 2.1.

Figure 2.1: Facial expressions of the six basic emotions (happiness, sadness, fear, anger, surprise and disgust), taken from Ekman and Friesen (1976).

The problem of using the basic emotions for automatic affect analysis is that they were never intended as an exhaustive list of possible affective states that a person can exhibit (Ekman et al., 1982). What makes them basic is their universal expression and recognition, amongst other criteria (Ekman, 1992). Finally, they are not the emotions that appear most often in everyday life (Rozin and Cohen, 2003).

Despite these shortcomings, basic emotions are very influential in automatic recognition of affect, as the majority of research has focused on detecting specifically these emotions, at least until recently (Zeng et al., 2009). However, there is a growing amount of evidence that these emotions are not very suitable for the purposes of affective computing, as they do not appear very often in HCI scenarios (D'Mello and Calvo, 2013).

There exist alternative categorical representations that include complex emotions. An example of such a categorisation is the taxonomy developed by Baron-Cohen et al. (2004). It is a broad taxonomy including 24 groups of 412 different emotions, created through a linguistic analysis of emotional terms in the English language. In addition to the basic emotions, it includes emotions such as boredom, confusion, interest, frustration, etc. The emotions belonging to some of these categories, such as confusion, thinking and interest, seem to be much more common in everyday human-human and human-computer interactions (D'Mello and Calvo, 2013; Rozin and Cohen, 2003).

Figure 2.2: Facial expressions that could be attributed to certain values in the dimensional emotion space.
Baron-Cohen's taxonomy has been used by a number of researchers in automatic recognition (el Kaliouby and Robinson, 2005; Sobol-Shikler and Robinson, 2010) and in the description of affect (Mahmoud et al., 2011, 2012); however, it is not nearly as popular as the basic emotion categories. Complex emotions might be a more suitable representation; however, they lack the same level of underlying psychological research when compared to the six basic emotions. Furthermore, little is understood about the universality and cultural specificity of complex emotions, although there has been some work to suggest the universality of some of them (Baron-Cohen, 1996).

Dimensional

Another way of describing affect is by using a dimensional representation (Russell and Mehrabian, 1977), in which an affective state is characterised as a point in a multi-dimensional space whose axes represent a small number of affective dimensions. These dimensions attempt to account for similarities and differences in emotional experience (Fontaine et al., 2007). Examples of such affective dimensions are: valence (pleasant vs. unpleasant); power (sense of control, dominance vs. submission); activation (relaxed vs. aroused); and expectancy (anticipation and appraisals of novelty and unpredictability). Fontaine et al. (2007) argue that these four dimensions account for most of the distinctions between everyday emotional experiences, and hence form a good set to analyse. Furthermore, there is some evidence of the cross-cultural generality of these dimensions (Fontaine et al., 2007). Facial expressions which could be associated with certain points in the emotional dimension space can be seen in Figure 2.2.

A dimensional representation allows for more flexibility when analysing emotions compared to categorical representations. However, problems arise when one tries to use only a few dimensions, since some emotions become indistinguishable when projecting high-dimensional emotional states onto lower-dimensional representations. For example, fear becomes indistinguishable from anger if only valence and activation are used. Furthermore, this representation is not intuitive and requires training in order to label expressive behaviour.

Affective computing researchers have started exploring the dimensional representation of emotion as well. It is often treated as a binary classification problem (active vs. passive, positive vs. negative, etc.) (Gunes and Schuller, 2013; Schuller et al., 2011), or even as a four-class one (classification into quadrants of a 2D space). Treating it as a classification problem loses the added flexibility of this representation, hence there has been some recent work treating it as a regression problem (Baltrušaitis et al., 2013a; Imbrasaitė et al., 2013a; Nicolle et al., 2012).

Appraisal based

The third approach for representing emotion, and a very influential one amongst psychologists, is the appraisal theory (Scherer, 2005). In this representation, an emotion is described through the appraisal of the situation that elicited the emotion, thus accounting for individual differences. Unfortunately, this approach does not lend itself well to the purposes of automatic affect recognition.

2.2.2 Affect expression and recognition

Humans express their affective states both consciously and unconsciously. Expressive behaviour is often unintended and in some cases even impossible to control.
Furthermore, it is neither encoded nor decoded at an intentional, conscious level (Ambady and Rosenthal, 1992).

Most humans are competent mind readers and can attribute complex emotional states to other humans (Ambady and Rosenthal, 1992; Baron-Cohen, 1996). Although one's emotional state cannot be directly observed by another person, it can be inferred from expressive behaviour.

The modality of expressive behaviour varies. We reveal our affective states through our facial expressions and head gestures (Ekman et al., 1982). Our bodies reveal our emotional states as well, through various bodily postures and gestures (de Gelder, 2009) and even hand-over-face gestures (Mahmoud et al., 2011). The non-verbal features of our speech, such as prosody (Juslin and Scherer, 2005), also contain emotional information. In addition, we reveal our emotional states through various nonlinguistic vocalisations such as sighs, yawns and laughter (Cowie et al., 2001; Russell et al., 2003).

There have been various conflicting studies trying to measure the relative importance of different modalities (facial expressions, speech, posture) for conveying and interpreting affect. One such study, by Ekman et al. (1980), found that the relative weight given to facial expression, speech, and body cues depends both on the judgement task and on the conditions in which the behaviour occurs. Bugental et al. (1970) suggest that the influence of facial expression, as compared with other sources, depends on the expresser, the perceiver, the message contained in each channel, and previous experience. Another study, by de Gelder and Vroomen (2000), shows that one modality can influence the judgement of another. Furthermore, in a meta-analysis conducted by Ambady and Rosenthal (1992), it was suggested that the accuracy of emotional judgement can decrease when multiple modalities are present (due to information overload). In addition, they suggested that people rely mostly on facial expressions when interpreting emotional states.

These studies provide a complex picture of the affective signals in different modalities and highlight the difficulties facing the multi-modal fusion of affective signals, especially in the case of conflicting emotional information. It is still an open research question what the best approach is for combining different modalities of expressive behaviour.

2.3 Facial expressions of emotion

The face is one of the most important channels of non-verbal communication. Facial expressions figure prominently in research on almost every aspect of emotion (De la Torre and Cohn, 2011). Facial expressions can have non-emotional information associated with them as well: they help with turn taking, convey intent, communicate culture-specific signals (for example winks), and are indicative of certain medical conditions, such as pain or depression. Unsurprisingly, this multi-faceted tool for expression and communication has interested researchers for centuries.

Facial expression of emotion has been a subject of scientific research for more than 150 years. Research began in the nineteenth century with Mécanisme de la Physionomie Humaine by the French neurologist Duchenne de Boulogne (Duchenne de Boulogne, 1862). Duchenne tried to identify the specific muscles representing specific emotions, such as the muscle of reflection and the muscle of aggression.
His work represents a landmark in scientific writing: it was the first time that photography had been used to illustrate a series of experiments.

Duchenne's work was popularised by Charles Darwin in The Expression of the Emotions in Man and Animals (1872), in which photographs from Duchenne's experiments were published (see Figure 2.3 for some examples). Darwin used these photographs of facial expressions to find out if people agreed about the emotion shown by each expression. He showed the photographs to friends during dinner parties and to a number of his naturalist colleagues. However, the use of electrically elicited facial expressions raises doubts about the accuracy and objectivity of the results. Nevertheless, his studies were cutting edge at the time, because of the use of external observers and realistic stimuli such as photographs.

Figure 2.3: Example photographs of facial expressions captured by Duchenne and used in Darwin's experiments. The use of electrical probes (seen here being held by Duchenne and his assistant) helped keep the expression still for long enough, and activated only particular muscles.

A major step in the research on facial expressions came from Paul Ekman with his work on basic emotions (Ekman et al., 1982) and the Facial Action Coding System (FACS) (Ekman and Friesen, 1977). The latter made it possible for researchers to analyse and classify facial expressions in a standardised framework, since FACS allows one to encode all possible, visually discriminable facial expression combinations on the human face. As indicated by Cohn (2006), it is the most widely used system for the analysis of facial expressions to date.

There are two major approaches to the measurement of facial expressions. The first is message judgement, which assumes that the face is a read-out of emotion, or some other social signal, and thus should be interpreted as such by the observer. The second type of measurement is sign judgement, which assumes nothing about the semantics of the expressions and leaves inferences to higher-order decision making (Cohn, 2006). I am more interested in sign judgement, as it has broader applicability to various disciplines, including affective computing, psychology, and expression synthesis.

Message judgement attempts to describe expressions in terms of the emotions they reveal. Basic emotions form the most popular message taxonomy. Basic emotions have specific facial expressions associated with them; for example, anger is characterised by lowered eyebrows and tightened lips, whereas surprise is characterised by raised eyebrows and an open mouth (Ekman et al., 1982). Examples of facial expressions of basic emotions can be seen in Figure 2.1.

Universality (both in terms of recognition and expression) of basic emotions is supported by cross-cultural studies conducted by Ekman et al. (1982). More recently, Matsumoto and Willingham (2009) compared facial expressions of athletes in the 2004 Olympic and Paralympic games. They looked at the expressions of athletes after winning or losing a game. Interestingly, the expressions shown did not differ between sighted, congenitally blind, and non-congenitally blind athletes. Moreover, the authors found no cultural differences.

There also exist facial signals that are not necessarily related to emotions and are more likely to be culturally specific and learned.
One such signal is the eyebrow flash, which might indicate relevance or help establish eye contact (Frith, 2009). Facial expressions can also reliably communicate physical pain (Prkachin and Solomon, 2008) and depression (Girard et al., 2013).

Together with head nods and eye gaze, facial expressions are very important in human-human communication. Head nods and eyebrow raises act as illustrators, serving the function of emphasis during conversation (Ekman, 2004). They also act as regulators during conversation, helping with the initiation and termination of speech. Furthermore, they signal the speaker to continue with what they are saying, through nods, agreement smiles, forward leans, brow raises, etc. (Ekman, 2004).

It is important to note that facial expression understanding is a context-dependent process. Aviezer et al. (2008) conducted studies demonstrating that the same facial expression can mean different things in different contexts (for example, anger may be confused with disgust). It is still an open research question how to incorporate context into expression recognition. Thus it is not enough to just label the expression; in order to get the bigger picture we need both the expression and its context.

2.3.1 Head pose and eye gaze

Head pose and eye gaze play a role in expressing affect and communicating social signals. From a computational point of view it sometimes makes sense to treat them together with facial expression, as they all occur in the same place: the human head. Hence, I provide a brief overview of the affective and social signals conveyed by these modalities.

Head pose is important when detecting certain emotional states such as interest, where the tilting of the head is a key cue (Ekman et al., 1982). Head pose together with facial expressions also plays a role in the expression of pride and shame (Tracy and Matsumoto, 2008). Furthermore, the expression of embarrassment is accompanied by gaze aversion, downward head motion and a nervous smile (Keltner, 1995), demonstrating the importance of analysing these modalities together.

As mentioned before, head nods can act as illustrators and regulators during conversation. In addition, head movements of a listener during a dyadic interaction signal 'yes' or 'no', indicate communicative intentions and help with the synchronisation of interactional rhythm (Hadar et al., 1985). Finally, head direction and eye gaze are also used to indicate the target of a conversation.

Gaze direction is important when evaluating things like attraction, attentiveness, competence, social skills and mental health, as well as the intensity of emotions (Kleinke, 1986). In order to estimate gaze direction, however, we also need to compute head orientation. Lastly, head orientation is used for eye gaze interpretation as well, as demonstrated in Figure 2.4.

Figure 2.4: An example of the Wollaston illusion. Even though the eyes are the same in both images, the perceived gaze direction is affected by the head orientation.

2.4 Facial affect analysis

The previous section outlined how affect can be expressed through facial expressions, head pose and other non-verbal signals. These signals are easily understood by humans, and there has been much progress in making them readable by computers as well. In this section I outline the work done on automated affect recognition from facial expressions and head pose.
Automatic facial expression analysis has been of interest to researchers for over 30 years (Suwa et al., 1978). Most of the initial attempts built systems that relied on very restricted conditions. Faces had to be frontal or in profile, under controlled lighting conditions, and the system often had to know the location of the face or facial landmarks (Samal and Iyengar, 1992). The types of facial expressions analysed were also mainly restricted to acted and exaggerated basic emotions. A huge amount of progress has been made in the field of automatic facial expression analysis since then.

The first development was in the type of data analysed. Instead of looking at still images, there has been a move to analyse much richer image sequences (el Kaliouby and Robinson, 2005; Ramirez et al., 2011; Zeng et al., 2009). In order to be successful at exploiting such complex temporal signals, a number of new statistical learning models have been developed (Lévesque et al., 2013; Song et al., 2012). In addition, there has been a recent move to not only look at the visible light signal (greyscale, RGB, etc.) but also to use 3D information available from various range scanners (Sandbach et al., 2012). This move has been motivated by the difficulty of dealing with varying illumination in visible light signals. Most of the research so far has concentrated on posed expressions collected using high-end scanners (Sandbach et al., 2012). However, datasets of naturalistic expressions of emotion are becoming available as well (Mahmoud et al., 2011; Zhang et al., 2013).

The second major development has been a shift from posed data to evoked or natural expressions. This is a very important step, as spontaneous facial expressions of emotion differ from acted and deliberate ones in several ways: onset and offset speed; amplitude of movement; and offset duration (Schmidt et al., 2006; Valstar et al., 2006, 2007). This means that systems trained on posed data might not generalise to spontaneous expressions. In order to achieve generalisation two developments are required. Firstly, the collection of naturalistic datasets. There is a growing number of such datasets, including SEMAINE (McKeown et al., 2010), parts of MMI (Valstar and Pantic, 2010), CK+ (Lucey et al., 2010), and Cam3D (Mahmoud et al., 2011). Secondly, progress in computer vision and machine learning techniques which can deal with such unconstrained data (more on this in Section 2.5).

A fairly recent trend has been to look at facial expressions beyond the affect they express. One such area is the automatic detection of facial Action Units (AUs) (Valstar, 2008). It analyses the signal conveyed by the expression, not the intended message. The potential benefit of such an approach is that AU detections can later be used for inference of emotional state (Baltrušaitis et al., 2011; el Kaliouby and Robinson, 2005) and of medical conditions such as depression (Girard et al., 2013). Furthermore, such an approach allows one to avoid the difficulties of context-dependent or culture-specific expressions.

A large amount of progress has been made recently due to automatic affect recognition competitions. The first of these was the FERA challenge for AU and basic emotion recognition (Valstar et al., 2011); this has been followed by the three Audio/Visual Emotion Challenges (AVEC) for prediction of emotion in dimensional space using multi-modal signals (Schuller et al., 2011, 2012).
These competitions both develop the state-of-the-art and make comparisons between approaches easier.

One thing most of the systems outlined above have in common is their reliance on facial landmark detection. The facial landmarks can be used directly for affect recognition (Jeni et al., 2012) or in conjunction with appearance-based features (such as Local Binary Patterns, Gabor wavelets, and SIFT features). The advantage of using landmark locations directly is the ease of their temporal analysis: how much the eyebrow moves, the speed of onset and offset. However, the accurate location of landmarks is also necessary when using appearance-based features, as they rely on face registration to a common reference frame (Chew et al., 2011).

2.5 Facial tracking

I use facial tracking as an umbrella term to encompass facial landmark detection, facial landmark tracking and head pose estimation. Facial landmark detection refers to locating a certain number of points of interest in an image of a face. Facial landmark tracking refers to the tracking of a set of interest points in an image sequence, either treating each frame in a sequence as independent or using temporal information. Head pose estimation attempts to compute the location and orientation of the head, either from a single image or from an image sequence. All of these problems are related, and some trackers are able to deal with all of them at once. However, historically these problems were often treated separately. This section provides an overview of the existing facial tracking approaches.

2.5.1 Landmark detection and tracking

Facial landmark tracking is sometimes called non-rigid tracking, as a face is a highly non-rigid object. It is also sometimes called face alignment and face registration. There are three main motivations that spurred research in facial landmark detection and tracking: affective computing, facial recognition and performance-driven animation (Metaxas and Zhang, 2013; Pantic and Bartlett, 2007; Zeng et al., 2009; Zhao et al., 2003). All of these fields rely on accurate landmark detection. It is important for affective computing and facial recognition, as the landmark locations can be used as features, help with face segmentation, and provide locations where appearance features can be computed. In the case of performance-driven animation, the features have to be tracked accurately in order to create believable and realistic animations.

Arguably, the most popular approaches are various deformable model based ones, as they show good results for landmark detection and tracking (Gao et al., 2010). Such approaches include Active Shape Models (Cootes and Taylor, 1992); Active Appearance Models (Cootes et al., 2001); 3D Morphable Models (Blanz and Vetter, 1999); and Constrained Local Models (Cristinacce and Cootes, 2006). A more detailed discussion of these approaches can be found in Section 4.1.1.

There are few approaches which attempt to detect and track facial landmarks using depth data¹, instead of just visible light² images. Several approaches use Iterative Closest Point like algorithms for landmark detection and tracking on depth images (Breidt et al., 2011; Cai et al., 2010). Breidt et al. (2011) use depth information to fit an identity and expression 3D morphable model. Cai et al. (2010) use the intensity to guide their 3D deformable model fitting.
Another noteworthy example is that of Weise et al. (2011), in which a person-specific deformable model is fit to depth and texture streams for performance-based animation.

¹ By depth data I refer to scene geometry, or depth images where pixels represent distance to the object, usually acquired through various range scanners or stereoscopy.
² I use the term visible light to refer to RGB, greyscale intensity images, or any of their transformations, such as gradients of greyscale images.

A slightly different approach to landmark detection uses explicit regression to correct landmark location estimates. Such an approach uses a regressor to predict the shape, instead of fitting a model. This avoids the construction of a loss function that is minimised by deformable model based approaches. This approach has been used in the facial point detector by Valstar et al. (2010), in which regressor estimates are combined with a probabilistic graph-based shape model. The shape model is evaluated after each iteration, correcting the predicted point locations if they do not form a consistent face shape. This is similar to Cao et al. (2012), who use explicit shape regression to minimise the detection error directly.

The above-mentioned approaches are mainly used for landmark detection in images and not for tracking. Most of them can be easily converted to landmark trackers by simply reinitialising the detection procedure in the subsequent frame using the current estimates (a minimal sketch of this strategy is given below). Alternatively, various trackers could be used as well, often by first initialising them with landmark detectors (Liwicki and Zafeiriou, 2011; Patras and Pantic, 2004).
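The detection-to-tracking strategy just mentioned can be made concrete with a short sketch. The following is purely illustrative and is not the tracker implementation used in this dissertation: the face detector and per-frame landmark fitting routines are passed in as hypothetical callables rather than drawn from any particular library.

```python
from typing import Callable, Iterable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]  # a single 2D landmark location


def track_sequence(
    frames: Iterable,
    detect_face: Callable,    # hypothetical: image -> rough initialisation (e.g. from a face bounding box)
    fit_landmarks: Callable,  # hypothetical: (image, initialisation) -> landmark estimate, or None on failure
) -> List[Optional[Sequence[Point]]]:
    """Run a per-frame landmark detector as a tracker: each frame's fit is
    initialised from the previous frame's estimate, falling back to face
    detection when no previous estimate is available."""
    previous: Optional[Sequence[Point]] = None
    tracked: List[Optional[Sequence[Point]]] = []
    for frame in frames:
        # Re-detect the face only when there is no usable previous estimate.
        init = previous if previous is not None else detect_face(frame)
        previous = fit_landmarks(frame, init)
        tracked.append(previous)
    return tracked
```

If the fitting routine signals failure (returning None here), the next frame simply falls back to re-detection; a landmark detection validation step (such as the one discussed in Section 4.6.2) could be used to decide when to trigger that fallback.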
They show good results on standard benchmarks, as well as on an in the wild dataset. However, their approach is very slow (up to 40 seconds per image), and is not suitable for affect analysis in its current state. 40 3 Facial expression and head pose datasets This section provides a description of the facial expression and head pose datasets I used throughout the dissertation. They played a major role in evaluating my proposed methods. I used some of the datasets to evaluate facial landmark detection and others for facial tracking eval- uation. All of these datasets are available publicly, ensuring the repro- ducibility of the results. However, some of them have been adapted to fit my experimental needs. The datasets can be split into two types: image and image sequence. The image datasets are used to evaluate the accuracy of landmark de- tection methods. The image sequence (video) datasets are used to eval- uate facial tracking - head pose estimation and landmark tracking in a sequence. 3.1 Image datasets I used two image-based datasets for the evaluation of landmark detec- tion algorithms. The first one is the Carnegie Mellon University Multi- PIE (pose, illumination, expression) dataset (Gross et al., 2008), which will be referred to as Multi-PIE. The second one is a subset of the Bing- hamton University - 4D Facial Expression dataset (Yin et al., 2008), re- ferred to as BU-4DFE. Both of the datasets are used for algorithm train- ing and form the basis for their evaluation. 41 3. Facial expression and head pose datasets (a) Camera locations (b) Sample images from different cameras Figure 3.1: Multi-PIE dataset pose variations. 3.1.1 Multi-PIE The Multi-PIE dataset (Gross et al., 2008) is one of the most extensive datasets of facial images across pose, expression and illumination. It consists of more than 750,000 images of 337 people recorded in up to four sessions over the span of five months. Photographs of subjects were taken from 15 view points and under 19 illumination conditions while displaying a range of facial expressions. The dataset was originally in- tended for use in face recognition, as it contains images of the same person under different illuminations, poses, and expressions. However, it also proved very popular for evaluating facial landmark detection and facial expression analysis techniques. The subjects in this dataset were captured using 15 cameras at once, whilst going though 19 predefined illumination conditions. This led to 285 images of the same person with the same expression, but at different poses and under different illumination. Camera locations can be seen in Figure 3.1a, and sample images taken by these cameras in Figure 3.1b. In addition to multiple poses, Multi-PIE dataset consists of faces from multiple lighting conditions, as seen in Figure 3.2a. Having access to the same expressions under different lighting conditions allowed me to check how well certain approaches cope with different and unseen illu- mination. Moreover, it allowed me to train landmark detection systems that can work across varying lighting conditions. 42 3.1. Image datasets (a) Variation in illumination (b) Variation in expressions Figure 3.2: The available lighting and expression variations across the Multi-PIE dataset. Notice how the same expression appears under dif- ferent illuminations in Figure 3.2a. As it is a very big dataset, only a subset of the images have been labelled for facial feature points. I had access to 5874 such manual labels. 
These labels consist of 5060 fully frontal (or close to frontal images), and 814 images at ±15, ±30, ±45, ±60, ±75, ±90 degrees of yaw. This number was doubled for training purposes by considering mirrored images as well. An attractive property of the Multi-PIE dataset is that the landmark locations do not change across the illumination conditions, with the ex- ception of possible eye narrowing or blinking following a flash. This allowed me to reuse the ground truth labels from one lighting condi- tion on the others. In this dissertation I explored 4 lighting conditions: frontally lit face, left side lit face, right side lit face, and poorly lit face (Figure 3.3). I restricted myself to 4 out of 19 lighting conditions to save space and reduce computational complexity. The 5874 frontally lit faces are referred to as frontal illumination Multi- PIE, the 17622 side and poorly lit faces difficult illumination Multi-PIE, and the combined 23496 images general illumination Multi-PIE. Finally, for model training and testing purposes the datasets were split into training and testing partitions, with a quarter of subjects allocated 43 3. Facial expression and head pose datasets (a) Frontally lit face (b) Poorly lit (c) Left lit (d) Right lit Figure 3.3: Face images under varying illumination from the Multi- PIE dataset. The left most image indicates an easy lighting condition: frontally lit face; while the other three more difficult ones: dimly lit face, and two face images with a strong side light. to training and three quarters to testing. Since I am interested in per- son independent facial tracking, I ensured that the same subject never appeared in both training and testing. 3.1.2 BU-4DFE subset Binghamton University 4D (3D + time) Facial Expression (BU-4DFE) database is a 3D dynamic facial expression database (Yin et al., 2008). It consists of 3D video sequences of 101 subjects acting out one of the six basic emotions from neutral to apex, and back to neutral. It was col- lected using the Di3D1 dynamic face capturing system, which records sequences of texture images together with 3D models of faces. In my work I did not use BU-4DFE as a video dataset, but took only a subset of the available video frames and used them as a still image dataset. I chose to do this because labelling videos for facial landmark positions would have been a very labour intensive task, but it was man- ageable for a smaller subset of images. I took a subset of 707 frames (each participant with neutral expression and peaks of the 6 basic emotions) and labelled the images with 66 fea- ture points semi-automatically. At first the landmarks were detected using the CLM facial tracker by Saragih et al. (2011), followed by a man- ual inspection and correction of landmark locations. I discarded some 1http://www.di3d.com (accessed Apr. 2012) 44 3.1. Image datasets Figure 3.4: Sample of images extracted from the BU-4DFE video dataset. of the frames because of poor coverage, either by the range scanner, or the cameras (part of the chin missing etc.). Some of the frames were discarded because the corresponding 3D models were too noisy (based on a manual inspection). This led to a total of 554 images together with their corresponding 3D models. In this dissertation, the reduced dataset is referred to as BU-4DFE. Some samples of extracted images can be seen in Figure 3.4. 
Synthetic data generation The great advantage of the BU-4DFE dataset is that each of the colour images has a corresponding 3D model of face geometry (as a triangu- lated point cloud). This means it is easy to manipulate the dataset to provide more training data for facial landmark detection algorithms. This section outlines the steps I took to generate the extra training data from only a limited number of labelled images. Firstly, it was possible to generate extra texture images at various poses. This was done by aligning them to a reference frame using a statistical shape model (Section 4.2.5). Then the 3D model was rotated to a dif- ferent view, where it was rendered using the available texture informa- tion. The following orientations (roll, yaw, pitch) were used: (±75, 0, 0); (±45, 0, 0); (±20, 0, 0); (0, 0, 0); (0, 0,±30); (0, 0, 30). Examples of such synthetic images can be seen in Figure 3.5. The advantage of such data generation is that the landmark labels are consistent across pose, which is often difficult to achieve via manual labelling of each pose (especially for the face outline). However, this technique introduces some artifacts due to missing data from the range scanner. 45 3. Facial expression and head pose datasets Figure 3.5: Sample of synthetically generated intensity and depth im- ages from BU-4DFE at various poses. For the intensity images a basic background was simulated from various images in order to avoid over- fitting during training. Figure 3.6: Sample of synthetically generated intensity images with slight pose variations and varying lighting conditions: left, right and frontal illumination. The second advantage of the availability of 3D data, was the ability to generate synthetic images at various illuminations. This was achieved by rendering the 3D scenes with a light source at different positions. Four lighting conditions were generated: frontal, left, right and poorly lit. The rendering was done with the freeglut OpenGL library using the 3D data, normals and texture from the BU-4DFE dataset. Examples of synthetic images of faces under different illuminations can be seen in Figure 3.6. Finally, access to 3D models allowed for the creation of synthetic depth images similar to those that would be expected from various range scan- ners (such as Microsoft Kinect or Time-of-flight sensors). They were also rendered at various poses, leading to training data that was used for my experiments. Examples of depth images generated can be seen in Fig- ure 3.5. 46 3.2. Image sequence datasets Figure 3.7: Sample stills from one of the ICT-3DHP sequences. All of the synthetic greyscale images, together with the depth images, were used for landmark detector training. In order to avoid overfitting, the dataset was split similarly to Multi-PIE, with a quarter of subjects reserved for training and three quarters for testing (only non-synthetic images were used for testing). Lastly, in all of the experiments the same subject never appeared in both training and testing. 3.2 Image sequence datasets Three image sequence (video) datasets were used for facial tracking evaluation. One dataset was used to evaluate landmark detection in a sequence (tracking), while three datasets were used for evaluating head pose estimation accuracy. Head pose estimation accuracy can also be seen as a proxy metric for landmark detection accuracy. For example, if the correct head pose is estimated from the detected landmarks, it is very likely that the landmarks were detected accurately as well. 
3.2.1 ICT-3DHP One of the head pose datasets was collected by myself, using the Mi- crosoft Kinect sensor. The dataset contains 10 image sequences with both colour and depth information (RGBD), of around 1400 frames each. It is publicly available for research purposes2. The head pose of the individual in each video was labelled using a 2http://projects.ict.usc.edu/3dhp/ (accessed August 2013) 47 3. Facial expression and head pose datasets (a) RGB images (b) Range data Figure 3.8: Sample images from the Biwi Kinect Head Pose database Polhemus Fastrak magnetic position tracker3. The tracker consists of a System Electronics Unit which generates and senses the magnetic fields and computes the position and orientation of a sensor with respect to a source unit. The sensor tracked by the system was attached to a baseball cap worn by each participant. The Fastrack system tracks the position and orientation of a small sensor as it moves through space using elec- tromagnetic fields. It demonstrates good accuracy in position - 1.4mm in RMSE, and orientation 0.12◦ in RMSE at the distance of 1.2 meters from the transmitter. It also has a very high update rate - 120Hz, and low latency - 4ms. The dataset was recorded in an office environment with unconstrained lighting conditions. The person sitting in front of the camera was in- structed to move their head in various ways. Even though this is not a naturalistic dataset, it is still very useful for assessing the accuracy of head pose estimation algorithms. It involves large head motions rang- ing from ±45◦ roll, ±75◦ yaw and ±40◦ pitch. Samples from one of the sequences can be seen in Figure 3.7. 48 3.2. Image sequence datasets 3.2.2 Biwi Kinect Head Pose Biwi Kinect Head Pose Database (Fanelli et al., 2011b) is another head pose dataset used in my work. It contains over 15k frames of 20 people (6 females and 14 males - with 4 people recorded twice) recorded with a Microsoft Kinect sensor while moving their heads around freely. For each frame, depth and colour images are provided. Sample frames from the dataset can be seen in Figure 3.8. The head pose was annotated using an automated system that relies on person specific face range scans4. This resulted in ground truth in the form of the 3D location of the head and its rotation angles. The head pose ranges from±75◦ yaw and±60◦ pitch. The dataset was collected to evaluate a static head pose algorithm proposed by Fanelli et al. (2011b). The frames from the database were converted to 24 sequences of RGBD images (same format as ICT-3D HP dataset). Since the Biwi Kinect Head Pose database was collected with frame-by-frame estimation in mind the resulting image sequences had a number of frames missing. This made the modified Biwi Kinect Head pose database very difficult for tracking based approaches as they rely on temporal information. Because the approaches used in this dissertation are all tracking based, this dataset allowed me to stress test them. Biwi for features In order to create a labelled facial feature point dataset of RGBD se- quences I hand-labelled a subset of the Biwi Kinect Head Pose database. I chose 4 sequences of 772, 572, 395, and 634 frames each and manually labelled every 30th frame of those sequences with 66 feature points for frontal images and 37 feature points for profile images. This led to 82 labelled images in total. This is a particularly challenging dataset for a feature point tracker due to large head pose variations (±75◦ yaw and ±60◦ pitch) and missing frames. 
3http://www.polhemus.com/?page=Motion_Fastrak (accessed August 2013) 4www.faceshift.com 49 3. Facial expression and head pose datasets Figure 3.9: Sample stills from one of the Boston University head pose dataset sequences. Note the visible wire that is connecting the flock of birds tracker to the receiver. 3.2.3 Boston University head pose dataset Lastly, the Boston University head pose dataset was used (Cascia et al., 2000). It contains 45 video sequences, from 5 different people, with 200 frames each. The dataset was labelled using an Ascension Technology ”Flock of Birds” tracker (similar tracker to that used for the ICT-3DHP dataset). In each of the sequences a participant moved their head around freely. Sample stills from one of the sequences in the Boston University dataset can be seen in Figure 3.9. 50 4 Constrained local model 4.1 Introduction A crucial initial step in many affect sensing, face recognition, and hu- man behaviour understanding systems, is the estimation of head pose and detection of certain facial feature points. The detection of eyebrows, corners of eyes, and lips allows us to analyse their structure and mo- tion. Furthermore, it helps with face alignment for appearance based analysis. Facial landmark detection is a very difficult problem for several rea- sons. Firstly, the human face is a non-rigid object; its shape is affected by identity and facial expressions. Secondly, facial appearance is highly affected by lighting conditions; skin tone; facial hair; and various ac- cessories (glasses, hats, scarves etc.). Furthermore, people tend to move their heads when interacting (both as a social signal and general fidget- ing), leading to self-occlusion. A certain amount of occlusion also oc- curs due to hand-over-face gestures which are prevalent during natural communication and interaction with computers (Mahmoud et al., 2011; Pease and Pease, 2006). All of the above means that an algorithm capa- ble of tracking non-rigid objects, with variable shape and appearance, is needed. The algorithm must also cope with a considerable amount of out-of-plane motion, occlusion and lighting variation. 4.1.1 Deformable model approaches to facial tracking Approaches based on deformable models are commonly used for the task of landmark registration. Notable examples include Active Shape 51 4. Constrained local model Models (ASM) (Cootes and Taylor, 1992); Active Appearance Models (AAM) (Cootes et al., 2001); 3D Morphable Models (3DMM) (Blanz and Vetter, 1999); and Constrained Local Models (CLM) (Cristinacce and Cootes, 2006). The problem of fitting a deformable model involves find- ing the parameters of the model that best match a given image. All of the above mentioned approaches depend on a parametrised shape model, which controls the possible shape variations of the non-rigid object (see Figure 4.5 for an example of a shape model). The approaches, however, differ in the way they model object appearance. Approaches based on AAM and 3DMM model the appearance holistically (the whole face together), whereas approaches based on CLM and ASM model the appearance in a local fashion (each feature point has its own appearance model). Given a shape and appearance model, a deformable model fitting pro- cess is used to estimate the parameters that could have produced the appearance of a face in an unseen image. 
The parameters are optimised with respect to an error term which depends on how well the parameters model the appearance of a given image, or how well the current points represent an aligned model. There are two ways of finding the optimal parameters. The first directly minimises an error function through var- ious specialised or general optimisation techniques (Cootes and Taylor, 1992; Cootes et al., 2001; Cristinacce and Cootes, 2006; Saragih et al., 2011; Wang et al., 2008). The second trains a regressor to estimate the model parameter update based on the current state (Cristinacce and Cootes, 2007; Fanelli et al., 2013; Saragih and Goecke, 2007; Sauer et al., 2011). Both of these methods usually employ a regularisation term that penalises complex shapes. One of the most promising deformable models is the Constrained Lo- cal Model (CLM) proposed by Cristinacce and Cootes (2006), and vari- ous extensions that followed (Gu and Kanade, 2008; Saragih et al., 2011; Wang et al., 2008). Recent advances in CLM fitting and construction have led to good results in terms of accuracy, convergence rates, and real-time performance in the task of person-independent facial feature 52 4.1. Introduction tracking. It has outperformed various instances of AAM and ASM for the task of facial expression tracking (Gu and Kanade, 2008; Saragih et al., 2011; Wang et al., 2008). There are several naming conventions in the CLM literature, the origi- nal term Constrained Local Model was coined by Cristinacce and Cootes (2006), but others have used it to refer to a more general problem (Saragih et al., 2011). In my work, CLM refers to a deformable shape model that follows a statistical shape distribution and models the local appearance of each feature point with the help of a local detector (patch expert). CLM is able to achieve more generalisable fitting by modelling the ap- pearance of each feature separately (Saragih et al., 2011). This is partly because CLM models the fact that multiple people can share similar lo- cal features, e.g. nose, eyebrows etc., while having other features that differ. Furthermore, due to the independent modelling, CLM demon- strates robustness in the presence of uneven lighting and occlusion. For example, a strong shadow on the left side of the face would not af- fect fitting of the right side. For these reasons local description based approaches have become more popular than holistic ones. However, there has been some work done recently on creating generic Active Ap- pearance Models which are able to deal with person independence (Tz- imiropoulos et al., 2012). Unfortunately, their generality is still lacking when compared to CLM based approaches (see Section 6.4.1 for a com- parison). For the reasons discussed, I chose CLM as a starting point for my work. Even though there has been a lot of progress in CLM construction and fitting techniques, many open questions still remain. My work on CLM has led to improved facial tracking, which can be used for emotion recognition. 4.1.2 Problem formulation A CLM consists of two parts: a statistical shape model and patch ex- perts (also called local detectors). Both the the shape model and patch 53 4. Constrained local model Initial After 2 iterations After 6 iterations After 18 iterations Figure 4.1: An example of initialising a deformable model, and itera- tively fitting it until convergence. Taken from Cootes and Taylor (1992). 
experts can be trained offline and then used for online landmark detection, which is achieved by fitting the CLM to a given image.

The deformable model is controlled by parameters p and the instance of a model can be described by the locations of its feature points x_i in an image I (usually a greyscale image, but other possible types are described in Section 4.3.2). The CLM fitting algorithms attempt to find the value of p that minimises the following energy function:

E(p) = R(p) + \sum_{i=1}^{n} D_i(x_i; I).    (4.1)

R represents the regularisation term (smoothness term), which penalises overly complex or unlikely shapes, and D_i represents the amount of misalignment the ith landmark is experiencing at location x_i in the image (data term). The value of x_i is controlled by the parameters p through the shape model that is described later (Equation 4.10). An example of iterative minimisation of the above energy function for an Active Shape Model (Cootes and Taylor, 1992) can be seen in Figure 4.1.

Equation 4.1 also admits an alternative probabilistic interpretation of the error function. Under the probabilistic formulation of CLM, the fitting algorithms look for the maximum a posteriori probability (MAP) estimate of the deformable model parameters p:

p(p | \{l_i = 1\}_{i=1}^{n}, I) \propto p(p) \prod_{i=1}^{n} p(l_i = 1 | x_i, I),    (4.2)

where l_i \in \{1, -1\} is a discrete random variable indicating whether the ith feature point is aligned or misaligned, p(p) is the prior probability of the model parameters p, and \prod_{i=1}^{n} p(l_i = 1 | x_i, I) is the joint probability of the feature points being aligned at locations x_i, given an image I. From this formulation it can be clearly seen that all of the local detectors are assumed to be conditionally independent of each other, in contrast to other approaches that model appearance holistically. Equation 4.2 is equivalent to Equation 4.1 if the regularisation and data terms take the following form:

R(p) = -\ln\{p(p)\},    (4.3)

D_i(x_i; I) = -\ln\{p(l_i = 1 | x_i, I)\}.    (4.4)

The probability of a certain feature being aligned at image location x_i is p(l_i = 1 | x_i, I). It is computed from the response maps created by patch experts. In my work I have explored different patch experts in terms of both the modality and the regressor being used. I have also explored the best training practices for the classifiers, thus furthering the understanding of deformable models.

It is possible to minimise the error in Equation 4.1 by using general mathematical optimisation techniques, such as the Newton method or stochastic optimisation approaches. However, these approaches often exhibit slow convergence, especially in the presence of a complex deformable model with a large number of parameters (Saragih et al., 2011). This makes general optimisation techniques unsuitable for real-time, or close to real-time, tracking. Hence, it is more common to use optimisation strategies designed specifically for CLM fitting.

Figure 4.2: Two step CLM fitting: calculation of patch responses in the surrounding region of interest followed by a Point Distribution Model constrained parameter update. Taken from Saragih et al. (2011).

A common approach to CLM fitting is illustrated in Figure 4.2. This approach involves an iteration of two steps. The first step performs an exhaustive local search around each current estimate of a landmark, evaluating the patch expert at each pixel location in an area of interest. This results in response maps around each of the landmarks.
The second step involves an optimisation performed over the resulting re- sponse maps (taking into account the regularisation term). There are numerous optimisation techniques, such as regularised landmark mean shift (Saragih et al., 2011); exhaustive local search (Wang et al., 2007); and convex quadratic fitting (Wang et al., 2008). None of these techniques optimise across the response maps directly, as that is computationally intractable and susceptible to errors due to noisy response maps. In- stead, an approximation over these response maps is used: taking the maximal response value for each landmark (Wang et al., 2007); fitting a Gaussian over the response maps (Wang et al., 2008); fitting Gaussian Mixture Models over response maps (Gu and Kanade, 2008); or using a Kernel Density Estimator (Saragih et al., 2011). 4.1.3 Structure of discussion CLM-based landmark detection consists of three main parts: the shape model, patch experts and the fitting method. My discussion introduces 56 4.2. Statistical shape model Figure 4.3: Examples of hand labelled feature points, of faces with dif- ferent expressions and orientations. each of these parts and gives detailed information on how each of them can be constructed or implemented. Section 4.2 concentrates on the construction and choice of the shape model. Section 4.3 presents the types of patch experts regularly used in CLM fitting and explores some multi-modal extensions. Section 4.5 describes my novel CLM fitting method, together with a multi-scale extension. Section 4.6 outlines how all of these parts, with an addition of a face detector, can be made into a system which can detect and track facial features. Finally, the results of facial tracking on several publicly available datasets are presented in Section 4.7. 4.2 Statistical shape model A very important part of any model-based landmark detection algo- rithm is the shape model. Firstly, the model describes the possible de- formations of a face, i.e. what constitutes a legal and an illegal face shape. Secondly, the model evaluates the plausibility of that shape in order to guide the fitting (by acting as a prior or a regularisation term). Several examples of face shapes that the model should be able to de- scribe can be seen in Figure 4.3. Notice how the positions of feature points are affected by both the location and orientation of the head and expression (called global and local parameters respectively). There exist a number of possible shape models for facial landmark de- 57 4. Constrained local model Figure 4.4: Feature points being tracked in my work. Outline of the head and significant internal locations. tection (Cootes and Taylor, 2004, 1992; Gao et al., 2010; Matthews et al., 2007). Three main variables were chosen for the shape model: the type of the model; the dimensionality of the model; and the projection used for placing the model in the image. The following sections outline the choices I made and the reasons behind them. 4.2.1 Choosing the points In order to build a model of face geometry, the relevant points have to be identified. Cootes and Taylor suggest that landmarks should repre- sent the boundary or significant internal locations of an object (Cootes and Taylor, 1992). Furthermore, good landmarks are points which can be consistently located from one image to another during annotation of the training set (Cootes and Taylor, 2004). Points could be placed at clear corners of object boundaries, or easily located biological land- marks. 
However, as there are rarely enough points to give more than a sparse description of the shape, the boundaries are usually augmented with equally spaced points along them (Cootes and Taylor, 2004).

For the above reasons, I chose to model the outline of the face and the important facial features for emotion recognition: eyebrows (raising, furrowing), lips (smiles, open mouth, yawn, frown etc.), nose (for wrinkling), and eyes (for narrowing or widening). The feature points I used can be seen in Figure 4.4.

Datasets have to be manually or semi-automatically (Sagonas et al., 2013) labelled with the chosen feature points for both training and evaluation. Some of the datasets I used were labelled manually, while others were labelled with help from existing trackers.

4.2.2 Model

The vast majority of deformable shape models use a linear model for non-rigid deformations (Cootes and Taylor, 2004; Gao et al., 2010; Gu and Kanade, 2008; Matthews et al., 2007; Wang et al., 2008). This type of linear model is called a Point Distribution Model (PDM) (Cootes and Taylor, 1992). The PDM is a linear model which parametrises a class of shapes. It can also be used to estimate, given a set of feature points, the likelihood that they represent a valid instance of the model. This is important for model fitting, as it can act as a prior.

The shape of a face that has n landmark points can be described as a single column vector:

X = [X_1, X_2, \dots, X_n, Y_1, Y_2, \dots, Y_n, Z_1, Z_2, \dots, Z_n]^T.    (4.5)

X indicates the collection of points in the model. The vector describing a valid instance of a face using the PDM can be represented in the following way:

X = \bar{X} + \Phi q.    (4.6)

Above, X is a model instance of a particular PDM and \bar{X} is the mean shape of the face (described in the same format as Equation 4.5). The shape is controlled by the m components of linear deformation, described using the 3n × m matrix \Phi, and the m dimensional column vector q, representing the non-rigid deformation parameters. The above definition is in the object coordinate frame. A description of how this can be placed in the image is presented in Section 4.2.4.

Both \bar{X} and \Phi can be learned automatically from hand labelled images using Principal Component Analysis (Cootes and Taylor, 2004), through various non-rigid structure from motion methods (Matthews et al., 2007; Torresani et al., 2008), or even defined manually (Cai et al., 2010).

The probability associated with a particular valid shape described by parameters q can then be expressed as a zero mean Gaussian with covariance matrix \Lambda = diag([\lambda_1; \dots; \lambda_m]) evaluated at q:

p(q) = \mathcal{N}(q; 0, \Lambda) = \frac{1}{\sqrt{(2\pi)^m |\Lambda|}} \exp\{-\tfrac{1}{2} q^T \Lambda^{-1} q\}.    (4.7)

\Lambda is constructed from the training set, based on how much shape variation in the training data is explained by the ith parameter, with \lambda_i corresponding to the q_i parameter. A Gaussian shape likelihood and the CLM formulation in Equation 4.3 lead to the following regularisation term:

R(q) = -\ln\left\{\frac{1}{\sqrt{(2\pi)^m |\Lambda|}} \exp\{-\tfrac{1}{2} q^T \Lambda^{-1} q\}\right\} = \ln\{\sqrt{(2\pi)^m |\Lambda|}\} + \tfrac{1}{2} q^T \Lambda^{-1} q.    (4.8)

As constant terms can be ignored, the regularisation term becomes:

R(q) = \tfrac{1}{2} q^T \Lambda^{-1} q \propto \|q\|^2_{\Lambda^{-1}}.    (4.9)

The notation \|x\|_W is a shorthand for \sqrt{x^T W x}; with W set to an inverse covariance matrix this is the Mahalanobis distance, which measures how far a sample lies from the mean of a multivariate Gaussian distribution by calculating the z-value in each dimension.
Since, in PDM case, the covariance matrix is diagonal, the Mahalanobis distance reduces to a normalised Euclidean distance. 60 4.2. Statistical shape model Mean shape−3λ −1.5λ +1.5λ +3λ Figure 4.5: The modes of variations of a point distribution model of a face, constructed from the Multi-PIE dataset using non-rigid struc- ture from motion and aligned to frontal orientation. The mean shape is shown together with the five biggest principal components - those with highest variance (λ). The principal components are shown top to bottom. Notice how both the variation in identity (morphology) and expression are captured by the model. 61 4. Constrained local model 4.2.3 Dimensionality of the model The previous section defined a 3D PDM (Equation 4.5), 2D versions also exist. 2D models were more popular in the early days of deformable model research due to the inherent difficulty of 3D labelling and un- availability of good 3D sensors. Developments in sensor technology and non-rigid structure from motion approaches increased the popularity of 3D PDMs (Matthews et al., 2007). Theoretically, a PDM could model shape in any dimension, but for pur- poses of face modelling the choice is between a 3D or 2D shape model. This choice will affect how the model is placed in an image, how the model is constructed, and the techniques used for fitting. The differ- ences arise because a 3D model needs to be projected, whereas a 2D one only needs to be scaled. The choice of model dimensionality depends on the problem at hand. For faces which are mainly frontal, a 2D model would suffice, as some variation of out-of-plane rotation could be captured amongst the prin- cipal components of the shape model - Φ. If out-of-plane variation is needed, one could even construct multiple 2D models for desired poses (Cootes et al., 2000). However, this would involve building different PDMs for different orientations and switching between them accord- ingly, leading to extra computation. Furthermore, numerous labelled training samples at various orientations would be required for such model construction, which is a very time-consuming task. In order to track a face at various orientations a 3D model is preferable over a 2D one. This is because the orientation is coded explicitly in the former, whereas it has to be controlled by non-rigid shape parameters in the latter. However, 3D model construction is tricky: in order to build a PDM, hand-labelled 3D samples are needed, and labelling faces on 3D surfaces is challenging (Cootes and Taylor, 2004). The task can be made slightly easier if texture images, corresponding to the range data, are also available. Alternatively, non-rigid structure from motion (NRSFM) approaches can 62 4.2. Statistical shape model be used to create the 3D PDM from labelled feature points at various ori- entations (Matthews et al., 2007; Torresani et al., 2008). However, there are issues facing this approach: labelling the same point at different orientations is tricky especially for the face outline, and hand labelling across pose is very time-consuming. Even though the construction of a 3D model instead of a 2D one, is potentially more difficult, Matthews et al. (2007) argue that 3D models are preferable for the following reasons. Firstly, their parametrisation is more compact – removing the need to model slight rotations, scalings and translations. Secondly, they are more natural – pose and shape are separated. Lastly, it provides an easier way to deal with self occlusions. 
Due to the advantages of 3D models over 2D ones, I chose to use a 3D PDM in my experiments.

4.2.4 Placing the model in an image

In order to place the 3D PDM in an image, it needs to be projected. This can be done by either a weak or a full perspective projection. I chose the former as it simplifies the fitting considerably by removing the additional non-linearities introduced by full perspective projection.

The weak perspective camera model (otherwise known as scaled orthographic projection) assumes that the object of interest lies roughly within a plane and, hence, the projection can be approximated by using the same depth for every point. Use of weak perspective for face tracking is a reasonable approximation because of the relatively small variations of depth along the face plane with respect to the distance to the camera.

The following equation is used to place a single feature point of the 3D PDM in an image using weak perspective projection:

x_i = s \cdot R_{2D} \cdot (\bar{X}_i + \Phi_i q) + t,    (4.10)

where \bar{X}_i = [\bar{x}_i, \bar{y}_i, \bar{z}_i]^T is the mean value of the ith feature, \Phi_i is a 3 × m principal component matrix, and q is an m dimensional vector of parameters controlling the non-rigid shape.

The rigid shape parameters (or global parameters) in Equation 4.10 can be parametrised using 6 scalars. First of all, s is a scaling term that controls how close the face is to the camera (inversely proportional to the average depth, s = f/Z), and t = [t_x, t_y]^T is the translation term. Finally, w = [w_x, w_y, w_z]^T is the rotation term that controls the 2 × 3 matrix R_{2D}, the first two rows of a full 3 × 3 rotation matrix R (Equation 4.11). The rotation matrix is constructed from an axis-angle representation of rotation. Under the axis-angle representation, any 3D rotation can be described using a vector w = [w_x, w_y, w_z]^T = \theta \hat{n}, where the magnitude of the vector (|w| = \theta) describes the size of the rotation in radians around the \hat{n} axis. The axis-angle representation can be converted to a rotation matrix R using Rodrigues' rotation formula (Szeliski, 2010):

R = I + \sin(\theta) [\hat{n}]_\times + (1 - \cos(\theta)) [\hat{n}]_\times^2,    (4.11)

where

[\hat{n}]_\times = \begin{bmatrix} 0 & -\hat{n}_z & \hat{n}_y \\ \hat{n}_z & 0 & -\hat{n}_x \\ -\hat{n}_y & \hat{n}_x & 0 \end{bmatrix}.    (4.12)

The instance of the face in an image is therefore controlled using the parameter vector p = [s, w, t, q], where q represents the local non-rigid deformation, and s, w, t are the global motion (rigid) parameters. It is useful to define the function from the parameter vector p to a 2 × n matrix of landmark locations, where the first and second rows are the x and y coordinates of the landmarks in an image:

P_{wp}(p) = T_{s,w,t}(\bar{X} + \Phi q),    (4.13)

where P_{wp} stands for weak perspective projection, which can be defined with the help of homogeneous coordinates:

T_{s,w,t}(X) = [s \cdot R_{2D} \,|\, t] \cdot \begin{bmatrix} X_1 & X_2 & \dots & X_n \\ Y_1 & Y_2 & \dots & Y_n \\ Z_1 & Z_2 & \dots & Z_n \\ 1 & 1 & \dots & 1 \end{bmatrix}.    (4.14)

Prior

Section 4.2.2 demonstrated how to construct a prior for the non-rigid shape, by assuming that the non-rigid shape parameters q follow a Gaussian distribution. For the rigid shape parameters s, w, t it is common to use a non-informative prior. This can be achieved by defining \tilde{\Lambda}^{-1} = diag([0; 0; 0; 0; 0; 0; \lambda_1^{-1}; \dots; \lambda_m^{-1}]), leading to the following regularisation term:

R(p) = \|p\|^2_{\tilde{\Lambda}^{-1}}.    (4.15)

Note that this leads to \tilde{\Lambda} being undefined, due to the division by zero; however, this does not matter as \tilde{\Lambda} is never used directly.
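To make the projection concrete, the following minimal numpy sketch instantiates the PDM of Equation 4.6 and applies Equations 4.10-4.14. It assumes the [X_1..X_n, Y_1..Y_n, Z_1..Z_n] storage order of Equation 4.5, and the function names (rodrigues, project_weak_perspective) are illustrative rather than those of the actual tracker implementation.

import numpy as np

def rodrigues(w):
    # axis-angle vector w = theta * n_hat -> 3x3 rotation matrix (Equation 4.11)
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    n = w / theta
    n_cross = np.array([[0.0, -n[2], n[1]],
                        [n[2], 0.0, -n[0]],
                        [-n[1], n[0], 0.0]])      # [n]_x, Equation 4.12
    return np.eye(3) + np.sin(theta) * n_cross + (1 - np.cos(theta)) * (n_cross @ n_cross)

def project_weak_perspective(s, w, t, q, X_bar, Phi):
    # map PDM parameters p = [s, w, t, q] to a 2 x n matrix of landmarks (Equation 4.13)
    n_points = X_bar.size // 3
    shape_3d = (X_bar + Phi @ q).reshape(3, n_points)   # Equation 4.6; rows are X, Y, Z
    R_2d = rodrigues(w)[:2, :]                           # first two rows of R
    return s * (R_2d @ shape_3d) + np.asarray(t, dtype=float).reshape(2, 1)

A call such as project_weak_perspective(1.0, np.zeros(3), np.zeros(2), np.zeros(m), X_bar, Phi) simply places the mean shape at unit scale with no rotation or translation, which is a convenient sanity check for a hypothetical X_bar and Phi.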
4.2.5 Point distribution model fitting

A point distribution model (PDM) can be used to do several things: generate realistic examples of a class, guide model fitting, and evaluate the likelihood of the model. It is also possible to find the PDM parameters given an instance of the model. That is, given a set of corresponding points in 2D, it is possible to find which model parameters p represent the instance, which is useful for two main reasons. Firstly, it makes it possible to estimate the pose of a face (orientation and translation) given only the labelled 2D landmarks. This is useful both for evaluating how well an algorithm performs at different orientations, and for picking which images to use when training different patch experts for different orientations. Secondly, it is useful for the generation of synthetic training data from range scans at different orientations.

However, fitting a 3D model to 2D points is not straightforward, and requires an iterative approach. This can be done using the Gauss-Newton algorithm1 with a slight correction for the rotation parameters. For the task of aligning a 3D PDM (X̄, Φ) to the 2D model instance y, the following function needs to be minimised:

1 An algorithm for solving non-linear least squares problems of the form that involves a sum of squared residuals.

p^* = \arg\min_{p} \{\|y - P_{wp}(p)\|_2^2 + r \|p\|^2_{\tilde{\Lambda}^{-1}}\}.    (4.16)

Algorithm 1 Fitting a 3D PDM to new 2D points
Require: feature points y, PDM {X̄, Φ}, regularisation terms r, Λ
  Initialise the shape parameters p to zero
  while not converged(||y - P_wp(p)||_2^2 + r||p||^2_{Λ̃^{-1}}) do
    Linearise ||y - P_wp(p)||_2^2 around p
    Calculate the Jacobian J (Eq. 4.28)
    Solve the linear system for the parameter update ∆p (Eq. 4.19)
    Update the rotation parameters
    Update all other parameters p = p + ∆p
  end while
  return p = [R, T_2D, s, q]

Above, P_{wp}(p) is the scaled orthographic projection of the model described by parameters p (Equation 4.13), and r\|p\|^2_{\tilde{\Lambda}^{-1}} is the regularisation term, which helps avoid overfitting. The parameter r controls the trade-off between penalising unlikely faces and the landmark placement error. The suitable value of r will depend on the noisiness of the data, but I experimentally found that values of 10–30 work well. The solution to Equation 4.16 can be found using Algorithm 1, which is a slightly modified version of the Gauss-Newton method for non-linear least squares problems.

Derivation

If an initial estimate of p is available, it is possible to find a ∆p in the direction of the optimal solution, leading to p^* = p + ∆p. This leads to the next estimate of p, which can be used for the next iteration. In order to find a ∆p in the direction of an optimal value of p, a Taylor series expansion of P_{wp} around the current estimate of p can be used:

\|y - P_{wp}(p^*)\|_2^2 + r\|p^*\|^2_{\tilde{\Lambda}^{-1}} \approx \|y - (P_{wp}(p) + J \Delta p)\|_2^2 + r\|p^*\|^2_{\tilde{\Lambda}^{-1}},    (4.17)

where J = \partial P_{wp}(p) / \partial p is the Jacobian of P_{wp}(p) evaluated at p. The ∆p which brings us closer to the solution is:

\Delta p = \arg\min_{\Delta p} \{\|y - (P_{wp}(p) + J \Delta p)\|_2^2 + r\|p + \Delta p\|^2_{\tilde{\Lambda}^{-1}}\}.    (4.18)

This can be solved for ∆p using Tikhonov regularised linear least squares (also known as ridge regression):

\Delta p = (J^T J + \tilde{\Lambda}^{-1})^{-1} (J^T (y - P_{wp}(p)) - \tilde{\Lambda}^{-1} p).    (4.19)

Jacobian

Intuitively, the Jacobian describes how the function values change based on infinitesimal changes of its parameters. In the case of the PDM, it models the changes of the landmark locations x based on the parameters p.
The computation of the Jacobian is needed for the PDM fitting and for other parts of CLM landmark detection, hence its derivation is explained in detail. As a reminder, the location of a landmark point evaluated at p = [s, w, t, q] is defined as:

x_i = s \cdot R_{2D} \cdot (\bar{X}_i + \Phi_i q) + t = s \cdot R_{2D} \cdot X'_i + t,    (4.20)

where X'_i = [X'_i, Y'_i, Z'_i]^T = \bar{X}_i + \Phi_i q for brevity. The change in the x and y landmark locations, based on changes in the scaling term s, is as follows:

\frac{\partial x_i}{\partial s} = \begin{bmatrix} \partial x_i / \partial s \\ \partial y_i / \partial s \end{bmatrix} = \begin{bmatrix} R_{1,:} \cdot X'_i \\ R_{2,:} \cdot X'_i \end{bmatrix},    (4.21)

where R_{1,:} and R_{2,:} indicate the first and the second row of the rotation matrix R. The change in landmark location based on the translation term t = [t_x, t_y]^T is straightforward:

\frac{\partial x_i}{\partial t^T} = \begin{bmatrix} \partial x_i / \partial t_x & \partial x_i / \partial t_y \\ \partial y_i / \partial t_x & \partial y_i / \partial t_y \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.    (4.22)

The least straightforward part of the rigid parameter Jacobian is the effect of the rotation parameters w = [w_x, w_y, w_z]^T on the landmark locations. First, the landmark locations can be expressed in terms of the current rotation matrix R_{2D} and an infinitesimal rotation R_\Delta:

x_i = s \cdot R_{2D} \cdot R_\Delta \cdot X'_i + t.    (4.23)

The infinitesimal rotation R_\Delta can be approximated using the axis-angle representation of rotation, Rodrigues' formula (Equation 4.11), and the small angle assumptions \sin(\theta) \approx \theta and \cos(\theta) \approx 1 as:

R_\Delta = \begin{bmatrix} 1 & -w_z & w_y \\ w_z & 1 & -w_x \\ -w_y & w_x & 1 \end{bmatrix}.    (4.24)

This leads to:

x_i = s \cdot R_{2D} \cdot \begin{bmatrix} X'_i - w_z Y'_i + w_y Z'_i \\ w_z X'_i + Y'_i - w_x Z'_i \\ -w_y X'_i + w_x Y'_i + Z'_i \end{bmatrix} + t.    (4.25)

This can now be used to derive the changes in landmark locations due to changes in the rotation parameters:

\frac{\partial x_i}{\partial w^T} = \begin{bmatrix} \partial x_i / \partial w_x & \partial x_i / \partial w_y & \partial x_i / \partial w_z \end{bmatrix} = s \cdot R_{2D} \begin{bmatrix} 0 & Z'_i & -Y'_i \\ -Z'_i & 0 & X'_i \\ Y'_i & -X'_i & 0 \end{bmatrix}.    (4.26)

The changes in landmarks due to changes in the non-rigid shape parameters are as follows:

\frac{\partial x_i}{\partial q^T} = s \cdot R_{2D} \Phi_i.    (4.27)

These can be combined to get the full Jacobian of interest:

J = \begin{bmatrix}
\frac{\partial x_1}{\partial s} & \frac{\partial x_1}{\partial w_x} & \frac{\partial x_1}{\partial w_y} & \frac{\partial x_1}{\partial w_z} & \frac{\partial x_1}{\partial t_x} & \frac{\partial x_1}{\partial t_y} & \frac{\partial x_1}{\partial q_1} & \cdots & \frac{\partial x_1}{\partial q_m} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\frac{\partial x_n}{\partial s} & \frac{\partial x_n}{\partial w_x} & \frac{\partial x_n}{\partial w_y} & \frac{\partial x_n}{\partial w_z} & \frac{\partial x_n}{\partial t_x} & \frac{\partial x_n}{\partial t_y} & \frac{\partial x_n}{\partial q_1} & \cdots & \frac{\partial x_n}{\partial q_m} \\
\frac{\partial y_1}{\partial s} & \frac{\partial y_1}{\partial w_x} & \frac{\partial y_1}{\partial w_y} & \frac{\partial y_1}{\partial w_z} & \frac{\partial y_1}{\partial t_x} & \frac{\partial y_1}{\partial t_y} & \frac{\partial y_1}{\partial q_1} & \cdots & \frac{\partial y_1}{\partial q_m} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\frac{\partial y_n}{\partial s} & \frac{\partial y_n}{\partial w_x} & \frac{\partial y_n}{\partial w_y} & \frac{\partial y_n}{\partial w_z} & \frac{\partial y_n}{\partial t_x} & \frac{\partial y_n}{\partial t_y} & \frac{\partial y_n}{\partial q_1} & \cdots & \frac{\partial y_n}{\partial q_m}
\end{bmatrix}.    (4.28)

This Jacobian can now be used to solve for the parameter update ∆p in Algorithm 1. Solving for the parameter update ∆p using Equation 4.18 and adding it to the initial parameter estimate leads to updated shape parameters that are closer to the optimal value. An exception, however, is made for the rotation parameters.

Rotation parameter update

In order to get the final rotation parametrisation after the update, the current rotation matrix can be multiplied with the update rotation matrix R_\Delta, leading to R'_{2D} = R_{2D} R_\Delta. Equation 4.24 is used to compute R_\Delta. The resulting R'_{2D} can now be converted to the axis-angle representation, leading to an updated orientation.

However, because of the approximation used to construct R_\Delta, it is not guaranteed to be orthogonal. It can be made orthogonal using Singular Value Decomposition (SVD). The decomposition of R_\Delta is as follows:

U S V^T = R_\Delta.    (4.29)

Here U and V are orthogonal. The corrected R_\Delta can now be expressed as:

R_{\Delta corrected} = U \cdot \det(U V^T) \cdot V^T.    (4.30)

The matrix determinant \det(U V^T) ensures the correct handedness of the new rotation. This leads to a final rotation matrix R'_{2D} = R_{2D} \cdot R_{\Delta corrected}.
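The block structure of Equation 4.28 can be assembled as in the numpy sketch below, which reuses the hypothetical shape-model layout and the rodrigues helper from the earlier sketch. The explicit per-landmark loop is kept for clarity rather than speed, and none of the names are taken from the actual implementation.

import numpy as np

def pdm_jacobian(s, w, q, X_bar, Phi):
    # Jacobian of the weak-perspective projection (Equations 4.21-4.27), laid out as in
    # Equation 4.28: rows [x_1..x_n, y_1..y_n], columns [s, w_x, w_y, w_z, t_x, t_y, q_1..q_m]
    n = X_bar.size // 3
    m = Phi.shape[1]
    shape_3d = (X_bar + Phi @ q).reshape(3, n)      # columns are X'_i
    R_2d = rodrigues(w)[:2, :]                      # rodrigues from the earlier sketch
    Phi_blocks = Phi.reshape(3, n, m)               # Phi_i = Phi_blocks[:, i, :]
    J = np.zeros((2 * n, 6 + m))
    for i in range(n):
        Xp, Yp, Zp = shape_3d[:, i]
        rows = [i, n + i]
        J[rows, 0] = R_2d @ shape_3d[:, i]          # d x_i / d s   (Equation 4.21)
        skew = np.array([[0.0, Zp, -Yp],
                         [-Zp, 0.0, Xp],
                         [Yp, -Xp, 0.0]])
        J[rows, 1:4] = s * (R_2d @ skew)            # d x_i / d w   (Equation 4.26)
        J[rows, 4:6] = np.eye(2)                    # d x_i / d t   (Equation 4.22)
        J[rows, 6:] = s * (R_2d @ Phi_blocks[:, i, :])  # d x_i / d q (Equation 4.27)
    return J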
4.2.6 Model construction

This section describes how a 3D PDM can be constructed automatically from a set of labelled examples. The shape model consists of three components: the mean model shape X̄, the main modes of variation Φ (principal components), and the variance matrix Λ, which describes the amount of variability explained by each of the principal components.

There are multiple ways of creating a PDM. These include: using Principal Component Analysis on the 3D landmark locations; using non-rigid structure from motion (NRSFM) on the 2D landmark locations (Torresani et al., 2008; Xiao et al., 2006); or even defining the shape variation manually (Cai et al., 2010). Given the 2D locations of n feature points across m images, NRSFM recovers the motion of the non-rigid object relative to the camera. The object can be rotating, translating, or undergoing a linear 3D deformation. NRSFM estimates the transformations affecting the object and the linear model of deformation (Torresani et al., 2008; Xiao et al., 2006).

In my work I used a model constructed using the NRSFM approach from the labels of the Multi-PIE dataset. It is a 3D deformable model with 24 principal components and can be seen in Figure 4.5.

4.3 Patch experts

Patch experts (also called local detectors) are a very important part of the CLM. They evaluate the probability of a landmark being aligned (or alternatively the misalignment error) at a particular pixel location. There have been various patch experts proposed: simple template matching techniques (Cristinacce and Cootes, 2006); logistic regressors (Paquet, 2009); and Support Vector Machines (Jeni et al., 2012; Saragih et al., 2011; Wang et al., 2008).

Figure 4.6: Example of 11 × 11 px SVR patch experts evaluated on a greyscale image of a face at the locations indicated by red circles and a 21 × 21 px area of interest surrounding them. The green bounding boxes represent a particular patch expert response map in the area of interest. The darker response values indicate low probability of alignment, and brighter values indicate high probability of alignment.

Under the probabilistic formulation, patch experts quantify the probability of alignment of a feature i, p(l_i = 1 | x_i, I), at the image location x_i in an image I, based on the surrounding support region (often an m × m grid). The image is usually expressed as greyscale pixel values, but other modalities can be used as well. The evaluation of a patch expert in an area of interest leads to a response map. An example of patch expert response maps can be seen in Figure 4.6.

A very popular patch expert is a Support Vector Regressor (SVR) in combination with a logistic regressor (Jeni et al., 2012; Saragih et al., 2011; Wang et al., 2008). It is defined as follows:

p(l_i | x_i, I) = \frac{1}{1 + e^{d C_i(x_i; I) + c}}.    (4.31)

Here, C_i is the output of the SVR regressor for the ith feature, c is the logistic regressor intercept, and d the regression coefficient. The use of a logistic regressor in addition to the Support Vector Regressor enforces the output to be between 0 and 1. The advantage of this formulation is its computational simplicity and potential for efficient implementation on images using convolution (see the following section). Furthermore, it is easy to train, and there are a number of libraries for efficient SVR training (Chang and Lin, 2011; Fan et al., 2008).

Figure 4.7: Example of an 11 × 11 pixel area of interest (21 × 21 pixel grid) being convolved with 11 × 11 pixel SVR patch expert weights, resulting in an 11 × 11 pixel response. This response can be used to calculate the final response map using logistic regression. The darker response values indicate low probability of alignment, and brighter values indicate high probability of alignment.

The support vector regressor is expressed as:

C_i(x_i; I) = w_i^T P(W(x_i; I)) + b_i,    (4.32)

where {w_i, b_i} are the weights and biases associated with a particular feature's SVR. Here W(x_i; I) represents the support region, a vectorised version of an n × n image patch centred around x_i. Often an 11 × 11 support region is used; it is small enough to enable real-time implementations and large enough to capture interesting information. P is the normalisation function which returns a zero mean and unit L2 norm version of the signal:

P(x) = \frac{x - \bar{x}}{\|x - \bar{x}\|_2}.    (4.33)

The normalisation helps make the patch expert less sensitive to intensity variations due to changing lighting conditions.

4.3.1 Implementation using convolution

In CLM fitting the patch expert is usually evaluated exhaustively in a rectangular area of interest, leading to a response map. Each response calculation has to be fast, as the patch experts for each landmark have to be evaluated at every pixel location in the area of interest. Typical area of interest sizes may range from 11 × 11 to 21 × 21 pixels, depending on the difficulty of the scene, the precision of the initial parameter estimate, and the potential amount of motion in the video sequence. This requires 121–441 patch responses for each of the 66 landmarks to be computed for every iteration of the fitting algorithm. In order to calculate each of the responses, each support region needs to be normalised and multiplied by the SVR weights. This becomes very computationally expensive if an exhaustive local search around each landmark is performed.

Fortunately, the most computationally expensive tasks during patch response computation are equivalent to the normalised cross-correlation problem on an image I with a template T. This can be done by reshaping the SVR weights to an N × N template, flipping it along the horizontal and vertical axes, and calculating the response as follows:

C_i((u, v); I) = \frac{\sum_{x,y} [I(x, y) - \bar{I}_{u,v}][T(x - u, y - v) - \bar{T}]}{\{\sum_{x,y} [I(x, y) - \bar{I}_{u,v}]^2 \sum_{x,y} [T(x - u, y - v) - \bar{T}]^2\}^{1/2}} + b_i.    (4.34)

Here \bar{T} is the mean of the weights and \bar{I}_{u,v} is the mean of I under the template. An example of using convolution as a step in response calculation is in Figure 4.7. The correlation response can be calculated efficiently with the use of fast normalised cross-correlation (Lewis, 1995). If normalisation is ignored, the evaluation of an SVR regressor (without a bias term) across each area of interest is equivalent to convolution, which can be computed more efficiently in the Fourier domain. Normalisation is then performed using integral images. See Lewis (1995) for more details.

Use of fast normalised cross-correlation is not limited to response map computation from SVR patch experts only; it can be used alongside other regressors as well. The outlined optimisation also presents an alternative and interesting view of patch experts. Patch experts can be seen as convolution kernels which produce the desired patch responses.
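The following sketch evaluates Equations 4.31-4.33 naively over an area of interest; the double loop is exactly the cost that the fast normalised cross-correlation route of Equation 4.34 avoids. The weights, bias, and logistic coefficients are assumed to come from a previously trained patch expert, and the function name is illustrative rather than part of any real implementation.

import numpy as np

def patch_response_map(area, weights, bias, d, c):
    # evaluate an SVR patch expert over an area of interest: z-normalise each support
    # patch, apply the SVR (Equation 4.32), then squash with the logistic (Equation 4.31)
    ph, pw = weights.shape
    ah, aw = area.shape
    response = np.zeros((ah - ph + 1, aw - pw + 1))
    w_vec = weights.ravel()
    for v in range(response.shape[0]):
        for u in range(response.shape[1]):
            patch = area[v:v + ph, u:u + pw].astype(float).ravel()
            patch -= patch.mean()
            norm = np.linalg.norm(patch)
            if norm > 0:
                patch /= norm                       # P(x): zero mean, unit L2 norm (Equation 4.33)
            svr = w_vec @ patch + bias              # C_i(x_i; I)
            response[v, u] = 1.0 / (1.0 + np.exp(d * svr + c))
    return response

Replacing the inner loops with a single normalised cross-correlation over the whole area of interest gives the same response map far more cheaply, which is the motivation behind Section 4.3.1.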
4.3.2 Modalities to use

Although there has been work exploring the use of gradient intensity images (Stegmann and Larsen, 2003) and different colour channels (Baltrušaitis and Robinson, 2010; Ionita et al., 2009) for Active Appearance and Active Shape Model fitting, previously published work on CLM concentrates on the use of simple greyscale images (I). It is an interesting question to see if an additional modality helps with the fitting accuracy of CLM.

In my work I explore the use of the squared gradient intensity image as an additional input modality for CLM patch experts. The squared gradient intensity image is defined as follows:

I_\nabla = \left(\frac{\partial I}{\partial x}\right)^2 + \left(\frac{\partial I}{\partial y}\right)^2.    (4.35)

In order to retain the speed gained from using normalised cross-correlation, the feature vectors from the intensity and gradient images cannot simply be combined. Separate regressors for each of them need to be trained, leading to two patch experts:

p(l_i | x_i, I) = \frac{1}{1 + e^{d C_i(x_i; I) + c}},    (4.36)

p(l_i | x_i, I_\nabla) = \frac{1}{1 + e^{d_\nabla C_{\nabla,i}(x_i; I_\nabla) + c_\nabla}}.    (4.37)

In order to benefit from multiple patch experts, their response maps have to be combined. This can be achieved in a number of ways, for example multiplication, arithmetic mean, and geometric mean:

p(l_i | x_i, I, I_\nabla) = p(l_i | x_i, I) \cdot p(l_i | x_i, I_\nabla),    (4.38)

p(l_i | x_i, I, I_\nabla) = \frac{p(l_i | x_i, I) + p(l_i | x_i, I_\nabla)}{2},    (4.39)

p(l_i | x_i, I, I_\nabla) = \sqrt{p(l_i | x_i, I) \cdot p(l_i | x_i, I_\nabla)}.    (4.40)

Experimentally, I found that the multiplication method was the most effective for combining greyscale and gradient intensity based response maps. This led to significantly better results than just using greyscale images. However, the slight disadvantage of using multiple modalities is the increased patch response computation time. The results of experiments using different modality combinations can be found in Section 4.7.2.

4.3.3 Multi-view patch experts

In order to track faces at multiple poses, examples of faces at different poses are needed for patch expert training. However, if a patch expert for a certain feature is trained on all poses, it will not work well due to the complexity of the task. A way to approach this problem is by training separate sets of patch experts for the views of interest. This is similar to View-based Active Appearance Models (Cootes et al., 2000), in which each view has a separate associated AAM. During model fitting, the set of patch experts to be used is chosen based on the current orientation estimate. Furthermore, if a landmark is invisible at a certain orientation, for example one side of the face in a profile image, the occluded points are excluded from model fitting. Section 4.8.2 experimentally demonstrates the benefits of using more views. However, training more views requires more data and increases the training time needed.

4.4 Patch expert training

The purpose of patch expert training (SVR or logistic regression, etc.) is to learn a mapping from the patch support region (feature vector) to a scalar (response). This section describes both the features used for the patch experts and the generation of their associated labels, and provides other details necessary to train patch experts.

Figure 4.8: Example of the training data. The image on the left indicates the sampling areas, and the response maps on the right indicate the ground truth responses expected from applying a patch expert on them, generated using a Gaussian centred on the ground truth location of the feature point.
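Ground truth response maps of the kind shown in Figure 4.8 can be generated with a short sketch such as the one below. The grid size and Gaussian standard deviation are illustrative arguments (σ = 1 matches the value reported in the following subsection), and the z-score helper mirrors the lighting-invariance normalisation described there; none of these names come from the actual training code.

import numpy as np

def ground_truth_response(grid_size, centre, sigma=1.0):
    # ideal training response: isotropic Gaussian with standard deviation sigma
    # centred on the ground truth landmark location (Section 4.4.1)
    ys, xs = np.mgrid[0:grid_size, 0:grid_size]
    sq_dist = (xs - centre[0]) ** 2 + (ys - centre[1]) ** 2
    return np.exp(-0.5 * sq_dist / sigma ** 2) / (2.0 * np.pi * sigma ** 2)

def z_score_patch(patch):
    # z-score normalisation of a training patch for some lighting invariance
    patch = patch.astype(float)
    return (patch - patch.mean()) / (patch.std() + 1e-8)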
4.4.1 Training data

The patch support region is commonly a vectorised and normalised n × n pixel grid, leading to an n² dimensional feature vector. Examples of patch support regions at different scales can be seen in Figure 4.9. I used an 11 × 11 pixel support region for my experiments.

Given a patch support region around the pixel location x_i, the corresponding response is p(l_i | x_i, I). Taking an image I, with landmark i at z_i = [u, v]^T, the probability of it being aligned at x_i is modelled as p(l_i | x_i, I) = \mathcal{N}(x_i; z_i, \sigma I_2), where I_2 is the 2 × 2 identity matrix. That is, the alignment probability can be modelled as an isotropic Gaussian with standard deviation σ in both the x and y dimensions, centred on the ground truth landmark location.

A good selection of σ is very important for training. If σ is too small, the classifier is prone to too many misclassifications. This is because patches displaced from each other by only a single pixel will not be very different, even though their expected outputs might be. If σ is too large, however, the fine-grained accuracy of the regressor is sacrificed, leading to less accurate landmark detection. I experimentally determined that σ = 1 leads to the best results across multiple datasets.

Figure 4.9: Example of extracted support areas around two landmarks (nose tip and eye corner) at different scales. Notice how difficult it becomes to distinguish features from just an 11 × 11 pixel grid when the scaling term increases.

In order to generate training data, areas close to and far away from the ground truth location can be sampled. See Figure 4.8 for examples of such area sampling, together with the expected response maps. In my experiments I used a 9 × 9 pixel area, with a random offset of 0–4 pixels for positive and 10–15 pixels for negative training samples. The ratio of close to far samples was 1 to 20, with many more negative samples.

In order to ensure some lighting invariance, the training data was normalised by taking a z-score of every patch. During the patch response calculation for model fitting, this normalisation is performed implicitly through the use of normalised cross-correlation (see Section 4.3.1).

Patch scaling

For training to be successful, all of the training data has to be in the same reference scale. However, this is not always the case, especially in the less constrained datasets. In order to align all of the training images to the same reference frame, the PDM parameters of their labelled feature points have to be found (see Section 4.2.5 for how to do this). The estimated scaling parameter s can then be used to scale the corresponding face image to the desired scale.

Scale   RMS error   Correlation
0.25    0.074       0.14
0.35    0.075       0.10
0.5     0.076       0.07

Table 4.1: Results comparing SVR patch experts trained on different scales and evaluated on a holdout fold. Notice the decreased performance in terms of root mean square error and Pearson correlation coefficient.

It is important to select a suitable reference scale. If a large scale is used the patch expert can be more accurate, but not as robust; if a small scaling term is used the patch expert loses accuracy. This is demonstrated in my experiments (see Section 4.7.4), which show that CLM landmark detection gains accuracy but loses robustness as the scaling term is increased.
Furthermore, if the scale is too large, an 11× 11 patch expert does not have enough information to build an accurate regressor (this can be seen illustrated in Figure 4.9) and Table 4.1. In my experiments, I used a 3D point distribution model with a pro- jected inter-ocular distance of 65 pixels at s = 1, and 11× 11 pixel patch experts. I found the best reference scales to be {0.25, 0.35, 0.5}. 4.5 Constrained Local Model fitting CLM fitting usually employs a two step strategy (Cristinacce and Cootes, 2006; Gu and Kanade, 2008; Saragih et al., 2011; Wang et al., 2008). The first step evaluates each of the patch experts around the current estimate of its corresponding feature point - leading to a response map around every feature point. The second step iteratively updates the model pa- 78 4.5. Constrained Local Model fitting rameters to maximise Equation 4.2 until a convergence metric is reached. However, instead of optimising on the patch responses directly, an ap- proximation is used. One such approach is the regularised landmark mean-shift (RLMS) (Saragih et al., 2011). 4.5.1 Regularised landmark mean shift The RLMS algorithm, first introduced by Saragih et al. (2011), attempts to find the maximum a posteriori estimate of p in the following equation: p∗ = argmax p {p(p) n ∏ i=1 p(li=1|xi, I)}. (4.41) Treating the locations of the true landmarks as hidden variables, they can be marginalised out of the likelihood that the landmarks are aligned: p(li = 1|xi, I) = ∑ yi∈Ψi p(li = 1|yi, I)N (yi; xi, ρI), (4.42) where Ψi denotes all integer locations where the patch expert is evalu- ated (every pixel in an n×n area of interest around the current estimate). The value of ρ reflects the amount of observational noise expected, and is learned from the data (Saragih et al., 2011). This formulation is equiv- alent to approximating the likelihood of point alignment using a Gaus- sian Kernel Density Estimator (KDE): p(li = 1|xi, I) = ∑ yi∈Ψi piyiN (xi; yi, ρI). (4.43) Here p(li = 1|xi, I) refers to the approximation of the patch response map using KDE, and piyi = p(li = 1|yi, I) is the patch expert response at yi (Section 4.3). Substituting Equation 4.43 into Equation 4.41 leads to: p∗ = argmax p {p(p) n ∏ i=1 ∑ yi∈Ψi piyiN (xi; yi, ρI)}, (4.44) which can be solved using the RLMS algorithm. The approach uses ex- pectation maximisation (Saragih et al., 2011), where the E-step involves 79 4. Constrained local model Algorithm 2 RLMS algorithm Require: Image I , initial parameters p, kernel variance ρ, regularisation term r, and patch experts {di, ci, wi, bi}ni=1 Compute affine transform T from image space to patch space while num iterations do Convert image using the affine transform T Compute patch responses ( Equation 4.31 ) while not converged(p) do Compute mean-shift vectors v (Equation 4.45) Convert them back to image space using T −1 Compute global PDM parameter update ∆p (Equation 4.46) Update global parameters p = p + ∆p Compute all PDM parameter update ∆p (Equation 4.46) Update all parameters p = p + ∆p end while end while return p evaluating the posterior over the candidates, and the M-step finds the parameter update. The pseudocode for RLMS is shown in Algorithm 2. As a prior p(p) for parameters p, RLMS assumes that the non-rigid shape parameters q vary according to a Gaussian distribution (Sec- tion 4.2.2); and the rigid parameters s, w, and t follow a non-informative uniform distribution (Section 4.2.4). 
The RLMS approach relies on mean-shift algorithm, which is a common way to maximise over a kernel density estimate. The mean shift vec- tor v, comprising of mean shifts for every landmark under the current estimate of every feature point xci is defined as: vi = ∑ yi∈Ψi piyiN (xci ; yi, ρI) ∑zi∈Ψi piziN (xci ; zi, ρI) − xci . (4.45) Given the mean shift vector, and incorporating the prior, the parameter update rule is:2 2the RLMS formulation by Saragih et al. (2011) did not have a separate regularisa- tion term r and instead used the Gaussian KDE variance ρ 80 4.5. Constrained Local Model fitting ∆p = −(JT J + rΛ−1)(rΛ−1p− JTv). (4.46) The Jacobian J in the above equation is the same as defined in Equa- tion 4.28. Furthermore, the same correction of the rotation parameters as used in Section 4.2.5 is used to correct rotation parameters for RLMS. As is often the case in deformable model fitting, the update is first per- formed on the rigid (global) motion, followed by an update to all of the parameters. Because the patch experts are trained at a particular scale and orienta- tion, the image needs to be warped to match them. This is done using a 2D affine transform T from the current feature point estimates to the training reference frame. Finally, as the parameter update should hap- pen in the image, and not the reference space, the mean shift vectors are transformed to the image space using T −1. Alternative view Notice the similarity between the parameter update rule in Equation 4.46 and the parameter update rule for PDM fitting in Equation 4.19. They both use Tikhonov regularised linear least squares to determine the pa- rameter update ∆p. If the alignment error y − Pwp(p) in Equation 4.46 is replaced with a mean shift vector v, the result is the same RLMS parameter update rule. Treating a mean-shift vector as a misalignment error leads to a slightly different view of the RLMS algorithm. The mean shift vector points in the direction where the feature point should go, but the motion is restricted by the PDM and the regularisation terms. The mean shift becomes constrained by a subspace (Saragih et al., 2009). This interpre- tation leads to the following RLMS update objective: argmin ∆p {||v− J∆p||22 + r||p + ∆p||2Λ˜−1}. (4.47) 81 4. Constrained local model Figure 4.10: The reliabilities of SVR patch experts, smaller circles repre- sent more reliability (less variance). 4.5.2 Non-uniform Regularised landmark mean shift A problem facing RLMS-based CLM fitting is that each of the patch experts is equally trusted, but this should clearly not be the case. This can be seen illustrated in Figures 4.6 and 6.4, where the response maps of certain features are noisier. To tackle this issue, instead of solving Equation 4.47, I propose minimising the following objective function: arg min ∆p {||J∆p− v||2W + r||p + ∆p||2Λ˜−1}. (4.48) The diagonal weight matrix W allows for weighting of mean-shift vec- tors. Non-linear least squares with Tikhonov Regularisation leads to the following update rule: ∆p = −(JTW J + rΛ−1)(rΛ−1p− JTWv). (4.49) Note that, if we use a non-informative identity W = I, the above col- lapses to the regular RLMS update rule. I call the CLM fitting algorithm that uses this update rule - Non-uniform Regularised Landmark Mean Shift (NU-RLMS). To construct W, the performance of patch experts on training data is used. The correlation scores of each patch expert on the holdout fold of 82 4.6. 
System overview Face Detection Initialisation Fitting Bounding box Shape parameters Shape parameters Image Figure 4.11: The flow diagram for a CLM based landmark detection system. Face detection leads to initial shape parameter (which lead to initial landmark locations), followed by CLM fitting. training data are computed. This leads to W = w ·diag(c1; . . . ; cn; c1; . . . cn), where ci is the correlation coefficient of the ith patch expert on the hold- out test fold. The ith and i + nth elements on the diagonal represent the confidence of the ith patch expert. Patch expert reliability matrix W is computed separately for each scale and view. This is a simple but effec- tive way to estimate the error expected from a particular patch. Example reliabilities are displayed in Figure 4.10. 4.5.3 Multi-scale fitting Patch experts are trained in a particular reference scale (Section 4.4), which is also used to compute the response maps during fitting. There are advantages in both using smaller scales (more robustness) and higher scales (more accuracy) for fitting. I propose a multi-scale approach which combines the benefits of both. That is, instead of using patch experts trained at a single scale, the algorithm can start with patches trained at a lower scale and use higher scales at later iterations. The multi-scale NU-RLMS approach is defined in Algorithm 3. 4.6 System overview CLM and most other deformable model fitting techniques are local. Given initial shape parameters, the fitting algorithm looks for optimal 83 4. Constrained local model Algorithm 3 Multi scale NU-RLMS algorithm Require: Image I , initial parameters p, kernel variance ρ, regularisation term r, and patch experts {di, ci, wi, bi}ni=1 Compute affine transform T from image space to patch space while num iterations do Convert image using the affine transform T Compute patch responses ( Equation 4.31 ) while not converged(p) do Compute mean-shift vectors v (Equation 4.45) Convert them back to image space using T −1 Compute global PDM parameter update ∆p (Equation 4.49) Update global parameters p = p + ∆p Compute all PDM parameter update ∆p (Equation 4.49) Update all parameters p = p + ∆p end while Update {di, ci, wi, bi}ni=1 to higher scale if available end while return p shape parameters within a local area. If the initial parameters are suffi- ciently close to the global optimum, it will be found. This means an extra step is needed to find the initial shape parameters. In the case of fitting on single images, a face detector (Section 4.6.1) provides a bounding box of the face. This bounding area is used to initialise the rigid shape parameters. As the detector does not provide any information about the facial expression, the non-rigid parameters are all initialised to zero. These initial shape parameters are used as a starting point for CLM fitting. This leads to a full landmark detection system from static images (illustrated in Figure 4.11). Tracking in video could be done by dealing with each of the images in a sequence independently – applying landmark detection to each of them. This, however, is inefficient as face detection is often more com- putationally expensive than CLM fitting. Such an approach also ignores the temporal relationships between the shape parameters in an image sequence, as under normal conditions relatively little motion occurs be- 84 4.6. 
System overview Face Detection Initialisation Fitting Bounding box Shape parameters Shape parameters Image sequence Validation Failed Successful Figure 4.12: The flow diagram for a CLM based landmark tracking sys- tem. Face detection leads to initial landmark locations where CLM fit- ting can start. The success of CLM tracking is checked using a validator, if validation succeeds tracking continues on the next image using the current shape parameters. However, if validation fails tracking reini- tialises using a face detector. tween subsequent frames. Temporal relationships can be exploited by using the estimate of the shape parameters from the previous frame. Such initialisation, however, may eventually lead to drift, as the error will build up and the fitting algorithm will no longer converge on valid feature points. In order to combat drift it is necessary to know if the CLM was successful in locating facial features and inform the tracker to reinitialise using a face detector. This, however, requires an extra vali- dation step that estimates if the landmark detection was successful (Sec- tion 4.6.2). A complete facial landmark tracking system is summarised in Figure 4.12. 4.6.1 Face detector Face detection is a mature field in Computer Vision with a number of off-the-shelf face detectors available in various libraries. Arguably, the most popular face detector to date is the Viola-Jones detector (Viola and 85 4. Constrained local model 0 0.25 0.5 0.75 10.2 0.4 0.6 0.8 1 Absolute yaw from frontal (radians) Pr op or tio n de te ct ed Multi−PIE face detection (a) Detection accuracy 0.5 0.75 1 1.25 1.50 200 400 600 800 a c / ap Co un t Multi−PIE scaling prediction (b) Scaling prediction accu- racy (scorr./spred.) −20 −15 −10 −5 0 5 10 15 200 500 1000 1500 t c − tp Co un t Multi−PIE translation prediction X error Y error (c) Error in translation (tcorr. − test.) Figure 4.13: Evaluation of the Haar-cascade detector. Notice how the ac- curacy degrades for non-frontal images. Furthermore, the detector often provides bounding areas that are either too small or too large, and often slightly offset. The errors, however, seem to be Gaussian distributed. Jones, 2004). It is faster than most other approaches and its implemen- tations are readily available in most Computer Vision libraries. Most of the available implementations of the Viola-Jones detector are for frontal faces, but profile ones exist as well. However, the performance of profile models is usually worse (as illustrated in Figure 4.13a). This is possibly due to a lack of available training data, or the detection of profile faces being an inherently more difficult task. The Viola-Jones detector provides a list of bounding boxes of faces de- tected in an image (if multiple faces are detected the biggest is chosen). As most images in the test sets contain a person facing the camera, a frontal face detector is used first, followed by left profile and right pro- file detections in case of failure. Subsequently, the detected bounding box is used to initialise the rigid shape parameters needed for CLM fitting. Some detectors have a systematic bias, so a relationship between the bounding box and the rigid shape parameters (scaling, rotation and translation) needs to be learned for each detector individually. This can be easily done by detecting faces in a number of training images and estimating offset and scaling terms. In my work I used two available implementations of the Viola-Jones 86 4.6. 
System overview face detector: from OpenCV 2.4.0 and Matlab 2012b Vision toolbox. The OpenCV one is used for landmark tracking in videos and the Matlab one is used for landmark detection in images. The Matlab implementation also provides profile detectors. Since neither of the detectors provide any rotation estimates, the rotation vector w was initialised to (0, 0, 0) for frontal face detection and (0,±60◦, 0) for profile detections. The evaluation of the Matlab Viola-Jones model on the Multi-PIE dataset is displayed in Figure 4.13a (using a frontal detector followed by a profile one). The errors of rigid parameter estimation from the bounding box can be seen in Figures 4.13b and 4.13c. 4.6.2 Landmark detection validation In order to combat drift is is necessary to have a way to determine if landmark detection succeeded. I refer to this as validating landmark detections. For examples of correct and incorrect landmark detections, see Figures 4.14, and 4.15. One way to validate landmark detections would be to use the model likelihood (Equation 4.2), however, this measure is not very stable as it differs significantly from image to image and even more from dataset to dataset. Determining the right threshold is very difficult, if not impos- sible. Another way to validate landmark detections is to transform the area surrounded by the landmarks to a pre-defined reference shape. The vectorised resulting image can then be used as a feature vector for a classifier which will act as the validator. The use of the reference shape allows for mapping of any landmark configuration to an image of fixed size. It also makes it possible to reduce the effect of facial expression on facial appearance. As a reference shape the mean shape of the PDM is used (seen in Fig- ure 4.5). This shape is triangulated using Delaunay triangulation. This makes it possible to perform a piece-wise affine warp on each of the 87 4. Constrained local model Figure 4.14: Examples of landmark detections and the corresponding warps onto the reference shape. corresponding triangles in the source and reference shapes. An exam- ple of detected feature points and their warps onto a reference shape can be seen in Figure 4.14. The vectorised version of the reference warp can now be used as input into the validator. Furthermore, the per-pixel warping coefficients can be mostly precalculated, with only a limited set of them (per-triangle ones) needing to be recalculated per frame (Matthews and Baker, 2004). In order to train the classifier on the vectorised reference warp, it needs positive and negative landmark detection examples. Choosing the pos- itive samples is simple, ground truth landmark labels can be used. To generate the negative samples, the ground truth labels can be offset and scaled. I trained three linear SVM validators using the Multi-PIE and the BU- 4DFE datasets. They were trained at three orientations: (0, 0, 0), (0, 30, 0), and (0,−30, 0), so that self occlusions would be easier to deal with. 4.7 Experiments This section presents the experiments that I conducted in order to ex- plore the effect of the CLM extensions described in this chapter: multi- 88 4.7. Experiments modal patch experts, NU-RLMS algorithm, and multi-scale fitting. The benefits of CLM approaches over ASM and AAM have been demon- strated by numerous authors (Cristinacce and Cootes, 2006; Saragih et al., 2011; Wang et al., 2008), hence no comparisons will be made with them in this chapter. 
A detailed comparison with other state-of-the art land- mark detection methods is provided in Chapter 6. Furthermore, this section also demonstrates the usefulness of CLM as a head pose tracker, compared to other dedicated rigid head trackers. 4.7.1 Methodology Training Data For patch expert training I used two datasets: BU-4DFE (Section 3.1.2) and the frontally illuminated subset of the Multi-PIE dataset (Section 3.1.1). A quarter of subjects from each of the datasets were reserved for training whereas the rest were used for testing. For Multi-PIE this resulted in 84 subjects and 1713 images for training, and for BU-4DFE in 22 subjects and 130 images – used for synthetic data generation de- scribed in Section 3.1.2. For all of my experiments I used 106 training samples per view from both the BU-4DFE and Multi-PIE datasets. Approximately 1701 training samples (81 samples from 1 positive and 20 negative areas) came from a single image, resulting in 587 images in total used for training each of the views. If possible, an equal split of BU-4DFE and Multi-PIE images was used. However, as there were insufficient labelled examples of im- ages from the Multi-PIE dataset at certain views, more BU-4DFE images were used in some cases. In order to build a model that could work at different orientations I trained separate patch experts at different orientations. In total, 9 sets of patch experts were trained at the following orientations: (±75, 0, 0); (±45, 0, 0); (±20, 0, 0); (0, 0, 0); and (0, 0,±30). The orientation is de- scribed in degrees of roll, yaw, and pitch respectively. 89 4. Constrained local model Test data For landmark detection experiments I used the remaining 4161 im- ages from the frontally lit Multi-PIE dataset and 424 from the BU-4DFE dataset. For head pose estimation experiments I used three datasets with la- belled head pose ground truth: Boston University, Biwi Kinect, and ICT- 3DHP head pose datasets (Sections 3.2.3, 3.2.2, and 3.2.1 respectively). Initialisation For landmark detection in images, the model parameters were initialised with the use of an off-the-shelf face detector available with Matlab Com- puter Vision toolbox (Section 4.6.1). The procedure of converting the de- tected bounding box to initial shape parameters is described in Section 4.6.1. In the cases where the detector failed (255 out of 4161 images in the Multi-PIE test data, and none in the BU-4DFE data), the rigid shape pa- rameters were initialised by taking the correct values and adding some Gaussian noise. The amount of noise expected was determined by eval- uating the face detector used on the Multi-PIE dataset. This gave a realistic initialisation and allowed for the analysis of the CLM approach, without the results being affected by failed face detection. For video sequence tracking, an OpenCV 2.4.0 the Viola-Jones frontal face detector was used to both initialise and reinitialise tracking if it failed. Model parameters The parameter values I used for CLM fitting are provided in this section in order to assist with the reproducibility of the experiments. For landmark detection during RLMS and NU-RLMS fitting the follow- ing parameter values were used: regularisation term r = 25, mean shift kernel variance ρ = 1.5, number of RLMS or NU-RLMS iterations - 3, 90 4.7. Experiments area of interest for all three iterations 11× 11 pixels, and number of iter- ations for ∆p calculation - 10. The patch training scales used for training and fitting were s = {0.25, 0.35, 0.5}. 
In addition, in all but the modality experiments a single modality approach on greyscale intensity images was used. For head pose estimation all of the same parameters were used, with a couple of exceptions. If landmark detection in the previous frame was successful only two RLMS iterations were used of 9× 9 and 7× 7 pixel areas of interest, respectively. The number of iterations for ∆p calcula- tion was 5. The reason for these changes was to speed up the approach to be real-time. Furthermore, surprisingly the smaller areas of inter- est following successfully tracked frames seemed to lead to better head pose estimation. This may be due to the fact that the face moves lit- tle between neighbouring frames and smaller search regions help avoid local optima. Unless otherwise stated, the experiments in this chapter used the pa- rameter values detailed in this section. Landmark detection and tracking error metric In order to measure fitting accuracy, an error metric was needed. I used the Root Mean Square Error (RMSE) between the detected landmarks and the known ground truth locations: RMSE = √√√√ 1 N N ∑ i=1 ((x′i − xi)2 + (y′i − yi)2). (4.50) Above x′, y′ are the ground truth locations of landmarks; x, y are the detected locations; and N is the number of feature points used. I used a size normalised version of the above error so that it would be possible to compare errors across datasets, and to avoid bias caused by face size. In order to do this, the resulting error was divided by the average of the 91 4. Constrained local model width and height of the ground truth shape: RMSEnormed = √ 1 N ∑ N i=1 ((x ′ i − xi)2 + (y′i − yi)2) 0.5 · (width+ height) . (4.51) Not all of the landmarks were used for RMSE computation. For profile images only the visible points were used for error estimation (e.g. if a person’s face fas turned to the left, the left part of the face was not used). The outline of the face (points 1–17) was also unused because of two reasons. First, the outline is very difficult to label consistently across different views leading to inconsistent ground truth. Second, feature spacing in outline can dominate the error – even if all of the detected points are on the face outline they might not correspond well to the ground truth. Landmark detection error is often visualised in the deformable model fitting community as a convergence vs. error curve (for examples see Figures 4.16, 4.17). The curve is constructed by computing the propor- tion of images in which the error was below a certain value. The closer the curve is to the top left corner of the graph - the better the fitting. The curve can also reveal accuracy and robustness trade-offs between approaches, and is arguably more informative than median or mean error values. To aid the understanding of error values some example landmark de- tections with their RMSE can be seen in Figure 4.15. These examples also give an idea of the RMSE for which the landmark detection could be considered successful and help with the interpretation of the error curves: • RMSE < 0.02, all landmarks were detected very accurately • RMSE < 0.05, detection can be considered successful as all of the features have been located, but not necessarily very accurately • RMSE < 0.1, detection still manages locating most of the facial features, but not all of them 92 4.7. Experiments Figure 4.15: Comparing different RMS errors. Notice how the landmark detection can no longer reliably identify all of the regions of the face, with errors above 0.05. 
• RMSE > 0.1, detection has failed or is very unreliable Finally, RMSE is extremely unlikely to be normally distributed (at best it might be skew-normal). Therefore, in all of the cases where statistics were used to compare landmark detection approaches non-parametric tests were used to compare the medians. Head pose estimation error metric Most approaches to head pose estimation use the mean absolute angular error for one of the three rotation axes (yaw, pitch and roll). Usually an 93 4. Constrained local model 0.01 0.03 0.05 0.07 0.090 0.2 0.4 0.6 0.8 1 Size normalised shape RMS error Pr op or tio n of im ag es Fitting on Multi−PIE Uni−modal Arithmetic mean Multiplication Geometric Mean 0.01 0.02 0.03 0.04 0.050 0.2 0.4 0.6 0.8 1 Fitting on BU−4DFE Size normalised shape RMS error Pr op or tio n of im ag es Uni−modal Arithmetic Mean Multiplication Geometric Mean Figure 4.16: Error fitting curves when using uni-modal and multi-modal patch experts for CLM fitting. The gradient intensity combined with greyscale shows clear benefits for fitting on both datasets. Out of the methods for combining the patch responses, the one that multiplies them together fared best. average error across the orientations is reported as well, resulting in a single statistic which provides insight into the accuracy of competing methods (Murphy-Chutorian and Trivedi, 2009). The main problem of this measure is that error distributions are unlikely to be normal. Thus, the mean error would be negatively influenced by outliers due to drift or occasional miss-classifications. Nevertheless, this metric allows for easy comparison with other researchers’ results. Hence, it was computed in my experiments. In addition to the mean error, the median error was computed too. In head pose tracking, median error is often more informative than the mean, because the latter is strongly affected by outliers. For example, if estimated head pose is off by 100 degrees – it is incorrect; how in- correct it is will affect the mean but not the median error. The median error, thus, reflects the accuracy whereas the mean error reflects the ro- bustness. However, as most authors only provide mean values, using median errors makes it difficult to compare approaches. 94 4.7. Experiments 4.7.2 Multi-modal patch experts I conducted a set of experiments to determine if the use of multi-modal patch experts is helpful for CLM fitting. In addition, this set of exper- iments explored the effect of different patch response map aggregation techniques: multiplication, arithmetic mean or geometric mean. The experiments were carried out on Multi-PIE and BU-4DFE datasets. Results The results of the modality experiments on both of the datasets can be seen in Figure 4.16. A Friedman’s ANOVA was conducted to compare the effect of adding a gradient intensity modality on the RMS errors on the BU-4DFE dataset. There was a significant effect of modality, χ2(3) = 205.2, p < 0.001. Friedman’s ANOVAs were used to follow up the findings (a Bonfer- roni correction to p values was applied). The comparisons revealed that multiplication (Mdn = 0.0197) and geometric mean (Mdn = 0.0202) outperformed the uni-modal version (Mdn = 0.0205) and the arithmetic mean version (Mdn = 0.0204) at significance level p < 0.001. Further- more, combining the response maps using multiplication outperformed the geometric mean version p < 0.001. A Friedman’s ANOVA was conducted to compare the effect of adding a gradient intensity modality on the RMS errors on the Multi-PIE dataset. 
There was a significant effect of modality, χ2(3) = 300.5, p < 0.001. Friedman’s ANOVAs were used to follow up the findings (a Bonferroni correction to p values was applied). The comparisons revealed that all of the multimodal approaches: multiplication (Mdn = 0.0350), arith- metic mean (Mdn = 0.0356) and geometric mean (Mdn = 0.0350) out- performed the uni-modal version (Mdn = 0.0363) at significance level p < 0.001. Furthermore, combining the response maps using multipli- cation outperformed the two other multi-modal approaches p < 0.001. 95 4. Constrained local model 0.01 0.03 0.05 0.07 0.090 0.2 0.4 0.6 0.8 1 Fitting on Multi−PIE Size normalised shape RMS error Pr op or tio n of im ag es RLMS NU−RLMS 0.01 0.02 0.03 0.04 0.050 0.2 0.4 0.6 0.8 1 Fitting on BU−4DFE Size normalised shape RMS error Pr op or tio n of im ag es RLMS NU−RLMS Figure 4.17: Error fitting curves when using RLMS and NU-RLMS algo- rithms. NU-RLMS increases landmark detection accuracy on both of the datasets, although the increased accuracy is marginal on the BU-4DFE dataset. Discussion The results show that the patch responses from an additional patch ex- pert lead to an improved fitting accuracy on both of the datasets. Fi- nally, they demonstrate that the multiplication method for aggregating the patch response maps significantly outperforms the other methods on both of the datasets. 4.7.3 Non-Uniform Regularised Landmark Mean Shift To see the effect of my new NU-RLMS algorithm on fitting accuracy, I conducted a fitting experiment on the Multi-PIE and BU-4DFE datasets. In order to construct the weight matrix W, I used patch expert correla- tions with w = 7. Results The comparison of NU-RLMS with RLMS can be seen in Figure 4.17. A Wilcoxon signed rank test was performed to compare the RMS errors under the different fitting strategies for different datasets. For Multi-PIE fitting there was a significant difference in the errors for RLMS (Mdn = 0.0363) and NU-RLMS (Mdn = 0.0343), z = −19.1, p < 0.001. For 96 4.7. Experiments BU-4DFE dataset there was also a significant difference in the errors for RLMS (Mdn = 0.0232) and NU-RLMS (Mdn = 0.0229), z = −6.26, p < 0.001. Discussion The results indicate that, on both Multi-PIE and BU-4DFE, NU-RLMS is statistically significantly more accurate than RLMS. In conclusion, the above results demonstrate the benefits of not treating each of the patch experts equally and taking their reliability into account. However, the amount of improvement seems to depend on the dataset used, and in the case of BU-4DFE the improvement was quite small. This can possibly be explained by the simplicity of the dataset, leaving very little room for improvement. 4.7.4 Multi-scale fitting I conducted a set of experiments to evaluate how the CLM fitting is affected by the scaling term of the patch expert. Patch experts were trained using the following scales: s = {0.25, 0.35, 0.5}. First, fitting was done using only one of the scales during all three RLMS iterations. For fairness, the area of interest was adjusted for each scale, so that all of the conditions saw the same amount of the image (resulting in an added computational cost for larger scales). Secondly, I wanted to see if a multi-scale approach improved performance over a single-scale approach. The area of interest used for the multi-scale approach was 11× 11 pixels (same as the s = 0.25 case), hence the same computational cost. Results The results of the experiments can be seen in Figure 4.18. 
A Friedman’s ANOVA was conducted to compare the effect of the scal- ing used on the RMS errors on the BU-4DFE dataset. There was a sig- nificant effect of scaling, χ2(3) = 850.9, p < 0.001. Friedman’s ANOVAs 97 4. Constrained local model 0.01 0.03 0.05 0.07 0.09 0.11 0.13 0.150 0.2 0.4 0.6 0.8 1 Size normalised shape RMS error Pr op or tio n of im ag es Fitting on Multi−PIE s=0.25 s=0.35 s=0.5 Multi−Scale 0.01 0.02 0.03 0.04 0.050 0.2 0.4 0.6 0.8 1 Fitting on BU−4DFE Size normalised shape RMS error Pr op or tio n of im ag es s=0.25 s=0.35 s=0.5 Multi−Scale Figure 4.18: Error fitting curves when using differnt scale patch experts for CLM fitting. Notice how the robustness degrades as the scaling term grows larger (more images with high error), but the accuracy improves (more images with low error) leading to a trade-off. Multi-scale formu- lation, however, manages to retain both robustness and accuracy leading to better performance. were used to follow up the findings (a Bonferroni correction to p values was applied). The comparisons revealed significant differences between all of the conditions (p < 0.01). The median RMSE error values for dif- ferent conditions are as follows: scaling of 0.5, Mdn = 0.0208; scaling of 0.35, Mdn = 0.0249; scaling of 0.25 - Mdn = 0.0319; and the multi-scale approach, Mdn = 0.0205. A Friedman’s ANOVA was conducted to compare the effect of the scal- ing used on the RMS errors on the Multi-PIE dataset. There was a signif- icant effect of scaling, χ2(3) = 2954.4, p < 0.001. Friedman’s ANOVAs were used to follow up the findings (a Bonferroni correction to p values was applied). The comparisons revealed significant differences between all of the conditions (p < 0.001). The median RMSE error values for dif- ferent conditions are as follows: scaling of 0.5, Mdn = 0.0374; scaling of 0.35, Mdn = 0.0382; scaling of 0.25, Mdn = 0.0435; and the multi-scale approach, Mdn = 0.0363. 98 4.7. Experiments Discussion The above graphs (Figure 4.18) show the trade-off expected from us- ing patches trained at different scales – lower scale patch experts are more robust but less accurate than higher scale ones. This distinction is particularly clear on the BU-4DFE dataset, where the lowest scale reaches similar accuracy to a multi-scale formulation for RMSE below 0.05. Furthermore, the results also demonstrate that multi-scale based fitting manages to capture both robustness and accuracy. Another interesting result is that different scalings perform best on dif- ferent datasets. On BU-4DFE the s = 0.5 performed best, whereas on the Multi-PIE s = 0.35 led to the best accuracy. This is possibly because BU-4DFE faces are frontal, leading to better initialisation and reducing the need of lower scale search. In Multi-PIE, on the other hand, equiva- lent initialisation is difficult to provide, making initial lower scale search more important. 4.7.5 Head pose estimation As a final test I wanted to see how CLM facial tracking compares to other methods for estimating head pose. This also acted as a proxy evaluation for feature point tracking in video sequences: good feature point detection leads to accurate head pose estimation. I compared the CLM model to several state-of-the-art dedicated head pose trackers: Generalised Adaptive View-based Appearance Model (Morency et al., 2008), and regression forests on depth maps (Fanelli et al., 2011a). The CLM used is described in the Methodology section (Section 4.7.1). The results of head pose tracking experiments can be seen in Table 4.2. 
CLM exhibits similar or superior performance to head pose trackers that rely purely on intensity information. This indicates that CLM can act successfully as a head pose tracker. However, it seems to slightly under perform when compared to trackers that take depth information into account as well (when looking at the mean errors but not median 99 4. Constrained local model Model Yaw Pitch Roll Mean Mdn. Boston University GAVAM (Morency et al., 2008) 3.85 4.55 2.20 3.53 2.12 CLM 4.31 4.00 2.50 3.60 2.26 ICT-3DHP GAVAM (Morency et al., 2008) 6.58 5.01 3.50 5.03 3.08 CLM 5.41 4.32 4.83 4.85 2.45 ICT-3DHP with depth GAVAM (Morency et al., 2008) 3.76 5.24 4.93 4.64 2.91 Reg. for. (Fanelli et al., 2011a) 7.69 10.66 8.72 9.03 5.02 Biwi-HP GAVAM (Morency et al., 2008) 14.16 9.17 12.41 11.91 5.63 CLM 10.32 10.27 9.01 9.87 3.58 Biwi-HP with depth GAVAM (Morency et al., 2008) 6.75 5.53 10.66 7.65 3.92 Reg. for. (Fanelli et al., 2011a) 9.2 8.5 8.0 8.6 NA Table 4.2: Estimating head pose using CLM and other baselines. The datasets tested on were: ICT-3DHP, the Biwi Kinect head pose, and the Boston University dataset. Notice the comparable accuracy of the CLM approach. errors). In sum, CLM can be used for head pose estimation, but if depth data is available other approaches might be preferable. 4.7.6 Conclusions The experimental results demonstrate the error rates expected from CLM landmark detection and head pose tracking. Furthermore, they show the benefits of three proposed extensions to CLM facial tracking accu- racy. First, the CLM approach can be extended to use multiple visible light based channels (greyscale and gradient intensity) for better land- mark detection accuracy. Second, treating each of the patch response maps with different reliability by using NU-RLMS leads to more ac- curate tracking. Finally, using a multi-scale CLM fitting leads to more robust and accurate landmark detection. 100 4.8. CLM issues 4.8 CLM issues Much progress has been made to make CLM fitting and tracking more accurate, including several extensions outlined in the previous sections. However, the CLM approach described in this chapter is not without limitations. There are three main identifiable situations in which CLM landmark detection and tracking fails or is inaccurate: large variations in pose, illumination, and expression. The same factors affect most face tracking and landmark detection approaches, and are notoriously diffi- cult to solve. This section describes the experiments I performed to better understand the limitations of the CLM landmark detector. Specifically, I explored how landmark detection accuracy is affected by pose, illumination, and expression. The observations which follow help us to understand the existing limitations of CLM. 4.8.1 Illumination Even if the face is in the same pose and has the same expression, the captured image will depend very much on the illumination present in the scene (see Figure 3.3). However, landmark detection and pose es- timation approaches should not depend on the illumination, as it does not reveal affective information. Due to the huge effect that illumination has, it is very difficult to build landmark detectors which are completely illumination independent. It is, however, worthwhile making them as robust to illumination changes as possible, especially if they need to work in unconstrained and naturalistic environments. 
CLM based approaches tend to generalise well to unseen faces under the same illumination, however, the landmark detection accuracy de- grades rapidly in unseen illumination (not present in the patch training data). Some examples of a CLM trained on frontally lit faces (such as Figure 3.3a), but tested on different illuminations can be seen in Fig- ure 4.19. Note how landmark detection is affected by the shadows and uneven lighting. 101 4. Constrained local model Figure 4.19: Some common failure cases across different illumination fitting. Note how the strong shadow on the face affects fitting, with the shadow being identified as the nose ridge and tip. I constructed a set of experiments to demonstrate the effect of illumina- tion on the Multi-PIE dataset. For the following experiment, SVR based patch experts were trained on frontally lit faces (using the same data as in Section 4.7.2) and using NU-RLMS multi-scale and multi-modal CLM fitting. The fitting was performed on frontally lit faces and on three difficult lighting conditions (Section 3.1.1) to test the CLMs ability to generalise to unseen illumination (examples of such lighting can be seen in Figure 3.3). The experimental methodology was the same as outlined in Section 4.7, but with an additional test set that included unseen lighting conditions: left, right, and poorly lit. The results of fitting on seen and unseen lighting are given in Figure 4.20a. A Wilcoxon rank sum test was performed on the RMSE errors on the two different test sets. It revealed significantly worse performance of CLM on the general illumination test set (z = 37.7, p < 0.001). 102 4.8. CLM issues 0.01 0.03 0.05 0.07 0.090 0.2 0.4 0.6 0.8 1 Size normalised shape RMS error Pr op or tio n of im ag es Multi−PIE with different illuminations Difficult lighting Frontal lighting (a) 0.01 0.03 0.05 0.07 0.090 0.2 0.4 0.6 0.8 1 Size normalised shape RMS error Pr op or tio n of im ag es Multi−PIE with different illuminations Multi trained − Difficult lighting Multi trained − Frontal lighting Frontal trained − Difficult lighting Frontal trained − Frontal lighting (b) Figure 4.20: Fitting on the Multi-PIE dataset using differently trained SVR experts. (a) Fitting single light trained CLM on different lighting conditions. Observe a huge degradation in accuracy when fitting to unseen lighting. The performance degrades significantly when fitting on unseen illumination. (b) Fitting CLM on different lighting conditions together with extra training at different light conditions. Pay special attention to the dashed blue curve and how fitting on frontal lighting degrades because of more general training. However, if a more general expert is trained, performance generally is improved, as shown by the solid red curve. Even though CLM tries mitigating the effect of lighting variation through using intensity normalised patch experts the above result suggests this does not solve the problem. Given the results, CLM clearly shows lim- ited generalisability across illumination for landmark detection in im- ages. Simple approach to lighting issues A naı¨ve approach to solving the lighting invariance issue would be to use more varied lighting conditions when training the patch experts. For example, left, right and poorly lit faces could be included alongside frontally lit ones during the patch expert training. I conducted an ex- periment to see if this general illumination training helps with fitting on difficult lighting conditions. Instead of training on just frontally lit 103 4. 
Constrained local model faces, the SVR based patch experts were trained on the four lighting conditions (frontal, dim, left and right). The same experimental condi- tions as in previous sections were used. In order for the results not to be affected by the accuracy of the face detector, the bounding box from a detector run on a frontally lit face was used for all four images of different illumination. In Figure 4.20b one can see the results of fitting on different illumina- tions when using frontal and more general training. A Wilcoxon sign rank test revealed that there was an improvement on fitting accuracy on the difficult illumination case when a general illumination training was used (z = 77.5, p < 0.001). However, the performance on the frontal lighting case decreased, when more general patch experts were trained (z = −21.4, p < 0.001). The extra training helps on the difficult lighting condition, however, it still does not reach the performance that is achieved by using frontal trained patches on frontal lit faces. Also, the improved performance on difficult lighting comes at the expense of degraded performance on frontally lit faces. That is, if the CLM is made more robust, it is at the expense of accuracy. A possible reason for this is that a simple linear SVR patch expert can not learn the complex relationships between pixel values under different illuminations and the landmark alignment probabilities. I also wanted to see the effect of the differently trained patch experts on a single lighting condition dataset. The results of using the single and general illumination trained experts on the BU-4DFE dataset can be seen in Figure 4.21. A Wilcoxon sign rank test reveals significantly worse performance of the general illumination patch experts (z = −13.69, p < 0.001). The results confirm that the use of more general patch experts leads to worse fitting performance. The above results highlight two major problems of the SVR patch expert based CLM approach on visible light images. These are an inability to generalise to unseen lighting conditions and a reduced overall accuracy 104 4.8. CLM issues 0.01 0.02 0.03 0.04 0.050 0.2 0.4 0.6 0.8 1 Size normalised shape RMS error Pr op or tio n of im ag es Different illumination training Single light SVR Multi light SVR Figure 4.21: Fitting with single and multi-light patch experts on BU- 4DFE dataset. Note the decreased performance when a more general patch expert is used. 0.01 0.03 0.05 0.07 0.09 0.11 0.13 0.150 0.2 0.4 0.6 0.8 1 Size normalised shape RMS error Pr op or tio n of im ag es Fitting on Multi−PIE 9 − views 7 − views 3 − views 1 − view (a) Fitting on Multi-PIE using different numbers of views. Observe how accu- racy increases with additional views. 0 15 30 45 60 75 900 0.05 0.1 0.15 Absolute degrees away from frontal M ed ia n RM SE Different orientation Multi−PIE (b) Error distribution across different orientations. Observe how error in- creases with off frontal views even with multi-view patch experts. Figure 4.22: Analysis of CLM fitting at different orientations and with additional views if more general training is provided. Ideally, landmark detection would work equally well, or at least comparatively, under different lighting conditions. This is not the case for CLM. 105 4. Constrained local model Figure 4.23: Some common failure cases with across pose CLM land- mark detection. 4.8.2 Pose Another major issue that CLM faces is the degradation of landmark de- tection accuracy on non-frontal images. 
I analysed the fitting results from Section 4.7.2 to see how CLM accuracy depends on the orientation of the face being tracked. Figure 4.22b shows landmark detection accu- racy based on the distance of the pose (in degrees) from a frontal one. It can be seen that landmark detection accuracy degrades with more non-frontal poses. There are multiple reasons for the degradation of results at different orientations. First, with fewer points available for tracking, accurate es- timation becomes more difficult. Second, fewer non-synthetic profile training images were used when compared to frontal training, poten- tially leading to worse performance. Some examples that illustrate the difficulty of fitting on non-frontal images of faces can be seen in Fig- ure 4.22b. Adding extra views An additional test was performed to see if the addition of extra views to training is beneficial to CLM landmark detection. In total nine sets of patch experts were trained: (0,±75, 0), (0,±45, 0), (0,±20, 0), (0, 0,±30), and (0, 0, 0). I performed landmark detection on the Multi-PIE dataset under the fol- lowing conditions: single frontal view; three views - frontal and profiles; 7 views – frontal, up-down, and side views (without (0,±45, 0); and all 9 views. 106 4.8. CLM issues Happy Angry Sad Disg. Afraid Surpr. Neutral 2 4 6 8 ·10−2 Emotion RM SE er ro r Figure 4.24: Error rates on the BU-4DFE dataset on different emotion images. Observe the difference in error in expression of surprise. This is due to the widely opened mouth which CLM finds difficult to detect correctly. A Friedman’s ANOVA was conducted to compare the effect of the addi- tional views on the RMS errors on the Multi-PIE dataset. There was a significant effect of views (χ2(3) = 340.7, p < 0.001). Friedman’s ANOVAs were used to follow up the findings (a Bonferroni correction to p values was applied). The comparisons revealed significant differ- ences (p < 0.001) between all but two conditions: 9-views vs. 7-views and 3-views vs. 1-view . The median RMSE error values for different conditions were as follows: single view - Mdn = 0.0353, 3 views - Mdn = 0.0353, 7 views - Mdn = 0.0346 and 9-views - Mdn = 0.0345. Please note that the small improvement of adding extra views is affected by the fact that the majority of test images are close to frontal. The results of this experiment are shown in Figure 4.22a, where it can be clearly seen that adding extra views is beneficial. However, although it helps with landmark detection across pose it still does not solve it fully. 4.8.3 Expression issues Illumination and head pose variations are not the only things affecting the CLM landmark detection accuracy. CLM fitting accuracy also suffers 107 4. Constrained local model Figure 4.25: Some common failure cases with landmark estimation of surprised expression. Notice how most of the errors come from the inability to reliably detect the lower lip. in the presence of extreme variations of expression. To illustrate this, I analysed the results from Section 4.7.2 on the BU- 4DFE dataset which had ≈ 60 images for each of the basic emotions and of neutral faces. Since the emotional expressions in this dataset are posed based on the Ekman basic emotions, they have similar feature point configurations. This reveals how the fitting accuracy is affected by the type of expression. Results from this analysis can be seen in Figure 4.24. It can be seen that landmark detection accuracy degrades for the expression of surprise, which is a clear outlier. 
Informal analysis reveals that this is mostly due to large mouth opening present in the expression (see Figure 4.25). There are two main reasons for the degradation of performance in ex- pression of surprise. Firstly, the patch experts for lower lip might not be able to capture the many possible variations of appearance: closed lips, open lips with teeth present, and open lips without teeth present. Sec- ondly, a prior imposed on the parameters of the shape model, penalises shapes which are complex and far away from neutral face. This results in worse performance on large expressions. 4.8.4 Discussion The pose and illumination issues outlined above are addressed in the following chapters, suggesting how CLM fitting can be made more ro- bust in the presence of lighting and pose variations. The work I have 108 4.9. General discussion done on this can be split into three parts. The first one deals with a way of approaching lighting and pose difficulties by using depth/range scan- ner data in addition to visual light, leading to a CLM-Z tracker (Chap- ter 5). Second, I describe how a CLM tracker can be combined with a dedicated pose tracker to lead to better head pose tracking accuracies (Section 5.6). Finally, I present my CLNF model for face tracking which exploits an advanced patch expert that addresses all of the issues out- lined, and especially with the illumination generalisation (Chapter 6). 4.9 General discussion In this chapter, I provided a detailed description of CLM. The model can be used for facial landmark detection in images, facial landmark tracking in video sequences and head pose estimation. As most of the work in Chapters 5 and 6 is based on CLM, this chapter also serves as detailed background explanation for this type of deformable model. Also in this chapter, I presented and explored three extensions to the model to make it more accurate and robust: multi-modal patch experts, multi-scale fitting and NU-RLMS. Finally, I outlined the issues facing CLM facial tracking that need to be resolved for it to be useful in real-world environments: failure to gener- alise across illuminations, poor performance across pose, and decreased accuracy for extreme expressions. These problems guided my work, leading to CLM extensions presented in later chapters. 109 5 CLM-Z In this chapter, I present a 3D Constrained Local Model (CLM-Z) which takes advantage of both 3D geometry (depth data) and visible light im- ages to detect facial features in images, track them across video se- quences and estimate head pose. The use of depth data allows the approach to mitigate the effect of illumination and make the tracking more robust to pose variations. An additional advantage of CLM-Z is the option to use only depth information when no visible light signal is available or lighting conditions are inadequate. The benefits of CLM-Z over regular CLM are demonstrated by evaluat- ing it on four publicly available datasets: the Binghamton University 3D dynamic facial expression database (BU-4DFE) (Yin et al., 2008), the Biwi Kinect head pose database (Biwi) (Fanelli et al., 2011b), the Boston Uni- versity head pose database (BU) (Cascia et al., 2000), and my collected dataset ICT-3DHP (Baltrusˇaitis et al., 2012). A more detailed descrip- tion of the datasets can be found in Section 3. The experiments show that the CLM-Z method significantly outperforms existing state-of-the- art approaches both for person-independent facial feature tracking (con- vergence and accuracy) and head pose estimation accuracy. 
Finally, this chapter presents a way to combine CLM landmark detector with a dedicated head pose tracker, leading to better head pose estima- tion accuracy. 111 5. CLM-Z 5.1 Depth data It is possible to recover 3D scene geometry through use of specialised hardware or algorithms. There are many ways to capture such informa- tion, and I briefly describe some of the most popular ones: multi-view stereo, active stereo, and time-of-flight cameras. Multi-view stereo approaches compute the disparity between corre- sponding points in stereo images, which can then be converted to sparse or dense depth maps. This approach requires more than one calibrated camera (or a single moving camera and a static scene). Sometimes, multi-view stereo techniques use more than two cameras leading to more accurate reconstructed scenes. High-end range scanners use this technique to recover very accurate scene representations. Active stereo techniques combine a camera with a projector which projects known light patterns onto the scene. Using suitable known patterns, it is possible to work out the depth of the pixel seen in the devices imag- ing camera. An example of such a system is Microsoft Kinect (Zhang, 2012), which is the first mass-market product to combine an infra-red- based active stereo system with a colour video camera in a single case for a competitive price. The depth sensor operates in infra-red light to avoid interference with the scene which is captured by the colour video camera. Time-of-flight cameras record depth by measuring the time it takes a light signal to travel from the camera to an object and back. Current models cannot record colour images, but only an infra-red intensity image. Time-of-flight cameras have many industrial applications, for example in robotics and automotive industry, and they are also increas- ingly used in computer graphics. 5.1.1 Representation A common way to represent the captured scene geometry is by using a depth map, where each pixel in the depth map represents how far away the object is from the camera plane. Figure 5.1, from Shim and 112 5.1. Depth data Figure 5.1: Samples of the actual scene, the geometry and the recon- structions of the geometry using two range sensing devices. Taken from Shim and Lee (2012) 113 5. CLM-Z Lee (2012), shows examples of depth maps of the same scene produced using an active stereo and a Time-of-flight camera. The main benefit of such a scene imaging is that it is not affected by the illumination which visible light images are particularly sensitive to. Moreover, it provides an alternative view of the scene, allowing for ad- ditional analysis. However, depth maps of scenes are not without issues - there might be gaps appearing in the depth image due to occlusions, reflections and shadows. Hence, they require specialised algorithms for efficient analysis. In my work I used depth data in the form of depth maps collected us- ing two of the above listed techniques: multi-view stereo and Microsoft Kinect sensor. 5.2 Model The CLM-Z model is a special instance of CLM introduced in the pre- vious chapter. It uses the same Point Distribution model which can be described by parameters p = [s, w, q, t]: the scale factor s, object rota- tion w, 2D translation t, and a vector describing non-rigid variation of the shape q. See Section 4.2.2 for more details. 5.3 Patch experts The main difference between CLM and CLM-Z is the patch experts used. 
I introduce a novel patch expert that is based on a depth map and not on a visible light image (greyscale or the gradient intensity of greyscale). The patch expert is similar to the SVR based ones used in the previous chapter, however with one crucial difference: the normalisation function that deals with missing data present in depth data. The depth based patch expert evaluated on a depth map Z at a pixel location xi can be defined as follows: 114 5.3. Patch experts Figure 5.2: Response maps of three patch experts: (A) face outline, (B) nose ridge and (C) part of the chin. SVR patch expert response maps us- ing greyscale intensity contain strong responses along the edges, making it hard to find the actual feature position. By integrating response maps from both intensity and depth images, the CLM-Z approach mitigates the aperture problem. p(li|xi,Z) = 1 1+ edCZ ,i(xi;Z)+c , (5.1) where CZ ,i is the outputs of depth patch regressor (SVR), for the ith fea- ture, c is the logistic regressor intercept, and d the regression coefficient. CZ ,i(xi;Z) = wTZ ,iPZ (W(xi;Z)) + bZ ,i, (5.2) where {wi, bi} are the weights and biases associated with a particular SVR. Here W(xi;Z) is a vectorised version of n × n image patch cen- tered around xi. PZ ignores missing values in the patch when calculating the mean. It then subtracts that mean from the patch and sets the missing values to zero. Finally, the resulting patch is normalised to unit variance. This is crucial when dealing with depth data which can have missing values. For intensity and gradient images PI is used, which normalises the vec- torised patch to zero mean and unit variance. Due to the potential of 115 5. CLM-Z missing data caused by occlusions, reflections, and background elimina- tion, PI is not used on depth data. Instead, a robust PZ is used. Using PI on depth data leads to missing values skewing the normalised patch (especially around the face outline), resulting in decreased performance (see Figure 5.4). Depth patch experts are insensitive to lighting conditions, making them robust. However, they are not as accurate as the intensity based ones for fine grained feature detection, as demonstrated in the experiments section. Therefore, they need to work with the visible light patch experts for best performance. There are several options for combining the response maps from the depth and visible light patch experts. This is similar to the multi-modal patch expert case in Section 4.3.2. There are three main options, as before: arithmetic mean, geometric mean and multiplication. In my ex- periments I found little difference between these methods, so arithmetic mean can be chosen for speed. Example images of intensity, depth and combined response maps (the patch expert function evaluated around the pixels of an initial estimate) can be seen in Figure 5.2. A major issue that CLMs face is the aperture problem, where detection confidence across the edge is better than along it. This is especially apparent for nose ridge and face outline in the case of intensity response maps. Addition of the depth information helps with solving this problem, as the strong edges in both images do not correspond exactly, providing further disambiguation for points along strong edges. 5.4 Fitting For CLM-Z fitting, the same strategy as used in the previous chapter for CLM fitting can be used. However, it requires an additional step of calculating responses from depth patch experts. The CLM-Z fitting approach is summarised in Algorithm 4. 
5.4 Fitting

For CLM-Z fitting, the same strategy as used in the previous chapter for CLM fitting can be used. However, it requires an additional step of calculating responses from the depth patch experts. The CLM-Z fitting approach is summarised in Algorithm 4. For better accuracy, depth based patch experts should not be used during the last RLMS iteration. They are more robust, but less accurate (especially at larger scales).

Algorithm 4 CLM-Z RLMS algorithm
Require: I, Z and p, kernel variance ρ, regularisation term r, patch experts
  Compute affine transform T from image space to patch space
  while num iterations do
    Convert the image to patch space using the affine transform T
    Compute intensity patch responses (Equation 4.31)
    Compute depth patch responses (Equation 5.1)
    Combine the response maps
    while not converged(p) do
      Compute mean-shift vectors v (Equation 4.45)
      Convert them back to image space using T^{-1}
      Compute PDM parameter update ∆p (Equation 4.46)
      Update parameters p = p + ∆p
    end while
  end while
  return p

5.5 Training data

Synthetic depth images can be used to train the patch experts in the same way that visible light images are (except for the different normalisation). The same sampling technique can be used to generate the expected patch expert responses (with the same σ = 1).

For training the depth based patch experts I used the synthetic depth data generated from the BU-4DFE dataset described in Section 3.1.2; an example of such synthetic images with landmark labels can be seen in Figure 5.3. For training the visible light based patch experts the Multi-PIE and BU-4DFE datasets were used (same as in Section 4.7.1). For all of my experiments I used 106 training samples in total from both the BU-4DFE and Multi-PIE datasets for the visible light patch experts, and only BU-4DFE for the depth patch experts. As before, patch experts were trained at the following orientations: (±75, 0, 0); (±45, 0, 0); (±20, 0, 0); (0, 0, 0); (0, 0, ±30); (0, 0, 30).

Figure 5.3: Examples of synthetic depth images used for training. Closer pixels are darker, and black is missing data. Notice how the face outline and certain feature points are difficult to identify from the depth images.

5.6 Combining rigid and non-rigid tracking

Because non-rigid shape based approaches, such as CLM, do not provide an accurate pose estimate on their own (see Section 5.7.6), it is possible to combine a CLM tracker with an existing rigid pose tracker. For the rigid head pose tracker a Generalised Adaptive View-based Appearance Model (GAVAM), introduced by Morency et al. (2008), can be used. The tracker works on image sequences and estimates the translation and orientation of the head in three dimensions with respect to the camera, in addition to providing an uncertainty associated with each estimate.

GAVAM is an adaptive keyframe based differential tracker. It uses 3D scene flow (Vedula et al., 1999) to estimate the motion of the frame from keyframes. The keyframes are collected and adapted using a Kalman filter throughout the video stream. This leads to good tracking accuracy and limited drift. The tracker works on both intensity and depth video streams. It is also capable of working without depth information by approximating the head with an ellipsoid. I introduce three extensions to GAVAM in order to combine rigid and non-rigid tracking, hence improving pose estimation accuracy in both the 2D and 3D cases.

Firstly, I replace the simple ellipsoid model used in 2D tracking with a person specific triangular mesh. The mesh is constructed from the first frame of the tracking sequence using the 3D PDM of the fitted CLM.
Since different projections are assumed by the CLM (weak-perspective) and GAVAM (full perspective), the CLM landmark positions are converted to the GAVAM reference frame as follows:

Z_g = \frac{1}{s} + Z_p, \quad X_g = Z_g \frac{x_i - c_x}{f}, \quad Y_g = Z_g \frac{y_i - c_y}{f},   (5.3)

where f is the camera focal length, c_x and c_y are the camera central points, s is the PDM scaling factor (the inverse average depth for the weak perspective model), Z_p is the Z component of a feature point in the PDM reference frame, x_i, y_i are the feature point coordinates in the image plane, and X_g, Y_g, Z_g are the vertex locations in the GAVAM frame of reference.

Secondly, I use the CLM tracker to provide a better estimate of the initial head pose than is provided by the static head pose detector used in GAVAM. Furthermore, the initial estimate of head distance from the camera used in GAVAM (assuming that the head is 20 cm wide) is replaced with a more stable assumption of an interpupillary distance of 62 mm (Dodgson, 2004), based on the eye corners tracked using the CLM-Z or CLM trackers.

Lastly, an additional hypothesis – the current head pose estimate from CLM-Z (CLM in the 2D case) – is provided to aid the GAVAM tracker with the selection of keyframes to be used for differential tracking.

5.7 Experiments

I performed a number of experiments to analyse and validate the CLM-Z approach. Firstly, the necessity for a new normalisation function was tested (Section 5.7.2). Secondly, the approaches for fusing the depth and visible light based patch expert responses were investigated (Section 5.7.3). Finally, the CLM-Z approach was compared to CLM for the tasks of facial landmark detection (Section 5.7.4), landmark tracking (Section 5.7.5) and head pose estimation (Section 5.7.6).

5.7.1 Methodology

The experimental methodology used in this section is identical to that used in Section 4.7, with the exception of the normalisation experiment (Section 5.7.2), the results of which are from Baltrušaitis et al. (2012). The training data sampling and landmark detection procedures are not described here as they are similar to those used in Section 4.7.

For video based feature point and head pose tracking, a slight addition is made over the regular CLM approach. Depth is used, in addition to greyscale, to check that the fitting has converged (similar to the validation in Section 4.6). Furthermore, if no depth signal is present in the converged area, tracking is assumed to have failed, leading to attempts at reinitialisation using a face detector (Section 4.6).

5.7.2 Normalisation

In order to see the effect the normalisation P_Z has on CLM-Z performance, I conducted experiments on landmark detection and tracking using only depth information. Two sets of patch experts, using the P_I and P_Z normalisation techniques, were trained and then used on the test sets. As test sets, the depth maps from the BU-4DFE and Biwi feature point datasets were used (the initialisation was performed on the visible light images). For BU-4DFE the task was landmark detection, whereas for Biwi it was landmark tracking.

The results on both of the datasets can be seen in Figure 5.4. These results demonstrate the need for robust normalisation for landmark detection and tracking on depth images.

Figure 5.4: The fitting curves of CLM-Z comparing how the use of the specialised depth normalisation affects the landmark tracking accuracy. Note the much higher fitting accuracy on depth images using the normalisation scheme P_Z, as opposed to the zero mean, unit variance one. This is especially evident on the Biwi dataset, which has a much noisier signal due to the Kinect sensor.
Figure 5.5: Using different methods to combine depth and intensity based information (fitting on BU-4DFE; proportion of images against size normalised shape RMS error for multiplication, arithmetic mean and geometric mean). Observe how there is no difference between the ways of combining the response maps.

5.7.3 Patch response combination

An experiment was conducted to assess which of the patch combination methods works best for combining the visible light based response maps with the response maps from depth patch experts for the task of landmark detection. The same techniques that were used to combine greyscale and gradient intensity patch responses were explored (see Section 4.7.2). These are: multiplication, geometric mean, and arithmetic mean.

The experiment was conducted on the BU-4DFE dataset (using both the greyscale and depth images). The results of this experiment can be seen in Figure 5.5. Friedman's ANOVA revealed no significant differences between the methods for the task of landmark detection on this dataset (p > 0.05). This is an interesting result, because for the fusion of greyscale and gradient intensity patch responses the best approach was multiplication of the response maps. However, there seems to be little, if any, difference in the way that the patch responses are fused for CLM-Z (no statistically significant differences). In further experiments, I used the arithmetic mean because it is quicker to calculate.

5.7.4 Landmark detection in images

In order to assess the effect of adding depth information for landmark detection in images, I evaluated the CLM-Z approach on the BU-4DFE dataset.

Landmark detection accuracy was assessed in the following three conditions: intensity and gradient only (CLM); depth only; and intensity and gradient with depth (CLM-Z). In addition, two types of greyscale and gradient intensity patch experts were considered: trained on only frontally lit images (less general but more accurate), and on multiple lighting conditions (more general but less accurate). A description of such training data and its implications is given in Section 4.8.1.

Results

The results of the CLM-Z experiment for facial landmark detection can be seen in Figure 5.6.

Figure 5.6: Comparing landmark detection accuracy on the BU-4DFE database when using CLM, CLM-Z and CLM with only depth data: (a) using frontal light patch experts; (b) using general illumination patch experts. Observe how the addition of depth data improves detection accuracy, especially when more general/robust intensity patch experts are used. Furthermore, using only depth data still leads to a good convergence rate, with 90% of images converged (RMSE < 0.05).

A Friedman's ANOVA was conducted to compare the effect of the channels being used (depth, visible light, depth+visible light) on the RMS errors on the BU-4DFE dataset when using frontal illumination for training. There was a significant effect of the channel, χ2(2) = 136.5, p < 0.001. Friedman's ANOVAs were used to follow up the findings (a Bonferroni correction to p values was applied).
The comparisons revealed that all of the channels – depth (Mdn = 0.0230), visible light (Mdn = 0.0204) and depth with visible light (Mdn = 0.0198) – were significantly different from each other at p < 0.001.

A Friedman's ANOVA was conducted to compare the effect of the channels being used (depth, visible light, depth+visible light) on the RMS errors on the BU-4DFE dataset when using general illumination for training. There was a significant effect of the channel, χ2(2) = 68.8, p < 0.001. Friedman's ANOVAs were used to follow up the findings (a Bonferroni correction to p values was applied). The comparisons revealed that all of the channels – depth (Mdn = 0.0230), visible light (Mdn = 0.0243) and depth with visible light (Mdn = 0.0220) – were significantly different from each other at p < 0.01.

Figure 5.7: Examples of facial expression tracking on the Biwi dataset. Top row CLM, bottom row CLM-Z.

Discussion

The experiment was designed to see the effect of using CLM-Z over the CLM approach with both frontal and general illumination patch experts. In both cases CLM-Z achieved statistically significantly lower error rates.

Secondly, note how performance degrades when using more general patch experts (Figure 5.6b) compared to the specific frontal patch experts (Figure 5.6a). However, the degradation was smaller when the depth signal was available. This illustrates the added benefit of CLM-Z when the visible light signal is less reliable or unpredictable (needing multi-light training).

Finally, the depth modality on its own was still able to track the feature points reasonably well (with 90% of images converging at a 0.05 threshold). This demonstrates the usefulness of depth when there is no intensity information available. Furthermore, using only the depth signal was more accurate in the general illumination patch expert case, demonstrating its effectiveness.

5.7.5 Evaluation on image sequences

The effect of CLM-Z on tracking feature points in a video sequence was also assessed. For this I used a subset of the Biwi head pose dataset that is labelled for feature points (see Section 3.2.2).

Figure 5.8: Using CLM-Z on four Biwi video sequences: (a) with reinitialisation; (b) without reinitialisation. Notice how the benefit of CLM-Z becomes more apparent when no reinitialisation is performed, suggesting it is a more robust approach.

The training and fitting strategies used were the same as for the previous experiments on video tracking (Section 4.7.5). The depth and visible light patches used were the same as in the above sections. For feature tracking in a sequence, the model parameters from the previous frame were used as starting parameters for tracking the next frame. Two experiments were performed, one with a reinitialisation scheme after validation (Section 4.6), and one without, in order to test both accuracy and robustness (a window size of 11 × 11 was used in all iterations).

Results

The results of the first experiment, with reinitialisation, can be seen in Figure 5.8a. A Wilcoxon signed rank test on the effect of the model on RMS errors revealed no significant difference in error rates (Z = −0.98, p > 0.1) between CLM-Z (Mdn = 0.049) and CLM (Mdn = 0.052).
The results of the second experiment, without reinitialisation, can be seen in Figure 5.8b. A Wilcoxon signed rank test on the effect of the model on RMS errors revealed a significant difference in error rates (Z = −2.88, p < 0.01) between CLM-Z (Mdn = 0.047) and CLM (Mdn = 0.056).

Figure 5.9: Examples of facial expression tracking on the ICT-3DHP dataset. Top row CLM, bottom row the CLM-Z approach.

Discussion

The CLM-Z approach managed to generalise well on a dataset not used for training, and improved the performance of a regular CLM when no reinitialisation was used. This was despite the training and testing datasets being quite different: high resolution range scanner data for training, low resolution noisy Kinect data for testing.

During the experiment without any reinitialisation, CLM-Z demonstrated the ability to keep tracking and recover from failure, whereas CLM accuracy decreased. This demonstrates the benefit of CLM-Z over CLM for the task of facial feature tracking in videos.

5.7.6 Head pose tracking using depth data

To measure the performance of CLM-Z as a head pose tracker, and of CLM-Z combined with a dedicated head pose tracker, I evaluated them on two publicly available datasets: the Biwi head pose dataset and ICT-3DHP. Both datasets contain labelled ground truth head pose data and aligned RGBD data.

The ICT-3DHP dataset has 10 video sequences with head motion and very few missing frames; the lighting is also reasonably static. On the other hand, the Biwi head pose dataset was collected with a frame based algorithm in mind, so it has numerous occasions of lost frames and occasional mismatches between colour and depth frames. This makes the dataset especially difficult for tracking based algorithms.

Baseline methods

The first baseline used is the CLM approach described in Chapter 4. This allowed me to evaluate the effect of depth data for the task of head pose estimation.

The second baseline used is that of Random Regression Forests (Fanelli et al., 2011b) (using the implementation provided by the authors), which is a detection based method and provides head pose estimates on a per frame level.

Another baseline used for the Biwi head pose dataset is that from Marras et al. (2013). Their method fuses greyscale and depth data using angle based features. It combines image gradient orientations as extracted from intensity images with the directions of surface normals computed from depth images. Lastly, it is a tracking based approach.

The final baseline was the GAVAM (Morency et al., 2008) tracker, which can use visible light alone, or can include depth information for improved performance.

Results

The results of the head pose tracking experiment are shown in Table 5.1. They include the results from the baseline methods, CLM-Z, and the combined rigid and non-rigid tracker.

Discussion

Firstly, it is clear that tracking based approaches performed much better on the ICT-3DHP dataset. The different tracker results on the Biwi and ICT-3DHP datasets are because the former was not collected with a tracking approach in mind and has a number of frames missing. Nevertheless, CLM tracking approaches managed to show good performance.
Model                                Yaw     Pitch   Roll    Mean    Mdn.
ICT-3DHP
  GAVAM (Morency et al., 2008)       6.58    5.01    3.50    5.03    3.08
  CLM                                5.41    4.32    4.83    4.85    2.45
  CLM with GAVAM                     6.02    5.52    3.84    5.13    2.95
ICT-3DHP with depth
  Reg. for. (Fanelli et al., 2011b)  7.69    10.66   8.72    9.03    5.02
  GAVAM (Morency et al., 2008)       3.76    5.24    4.93    4.64    2.91
  CLM-Z                              4.73    4.10    4.66    4.50    2.35
  CLM-Z with GAVAM                   4.41    5.22    5.36    5.00    2.81
Biwi-HP
  GAVAM (Morency et al., 2008)       14.16   9.17    12.41   11.91   5.63
  CLM                                10.32   10.27   9.01    9.87    3.58
  CLM with GAVAM                     13.19   9.46    10.81   11.15   4.94
Biwi-HP with depth
  Marras et al. (2013)               9.2     9.0     8.0     8.7     NA
  Reg. for. (Fanelli et al., 2011b)  9.2     8.5     8.0     8.6     NA
  GAVAM (Morency et al., 2008)       6.75    5.53    10.66   7.65    3.92
  CLM-Z                              10.52   7.98    8.14    8.88    3.20
  CLM-Z with GAVAM                   6.74    6.07    9.64    7.48    4.02

Table 5.1: Estimating the head pose using the CLM-Z and CLM-Z with GAVAM approaches alongside some other state-of-the-art methods. It can be seen that CLM-Z outperformed CLM in terms of mean and median errors.

CLM-Z outperformed the regular CLM approach on both of the datasets, demonstrating the benefit of the depth signal. Furthermore, the head pose estimation accuracy of CLM-Z is comparable to that of dedicated head pose trackers, indicating the usefulness of CLM-Z as a head pose tracker.

The combination of CLM-Z together with a head pose tracker had little effect on overall performance (with slightly better performance in the 2D case and slightly worse performance in the 3D case when compared to GAVAM). This suggests that a better fusion technique is needed for rigid and non-rigid head pose estimation.

Finally, it is evident that the CLM and CLM-Z approaches perform either as well as, or better than, dedicated head pose trackers (especially when looking at the median error metric). These approaches look even more promising when considering that CLM runs at 18–22 fps, CLM-Z at 11–15 fps, GAVAM at 8–10 fps, and CLM-Z with GAVAM at 4–7 fps on these datasets. The experiments were performed on a 3.06 GHz dual core Intel i3 CPU.

5.7.7 Head pose tracking on 2D data

One extension of CLM proposed in this chapter was its combination with a rigid head pose tracker such as GAVAM. In addition to evaluating it on the datasets which include depth, it was also evaluated on a 2D dataset – the Boston University head pose dataset.

The results of the combined approach can be seen in Table 5.2.

Model                            Yaw     Pitch   Roll    Mean    Mdn.
GAVAM (Morency et al., 2008)     3.79    4.45    2.15    3.47    2.12
CLM                              3.68    4.26    2.50    3.48    2.26
CLM with GAVAM                   2.69    3.84    2.10    2.88    1.83

Table 5.2: Head pose estimation results on the BU dataset, measured in mean absolute error.

The approach that combines both of the trackers outperforms the separate GAVAM and CLM methods in all of the orientation dimensions, which is not the case when depth is available. The combination of rigid and non-rigid trackers seems to benefit the 2D case much more. However, there is a performance cost: CLM runs at 40 fps, GAVAM at 20 fps, and CLM with GAVAM at 15 fps on this dataset. The experiments were performed on a 3.06 GHz dual core Intel i3 CPU.

5.8 Conclusion

In this chapter I presented CLM-Z, a Constrained Local Model approach that fully integrates depth information alongside intensity for facial feature point tracking and detection. This approach was evaluated on publicly available datasets and shows better performance both in terms of convergence and accuracy for feature point tracking from a single image and in a video sequence. The approach is especially helpful when the visible light signal is noisy or unreliable. CLM-Z is especially relevant due to the recent availability of cheap consumer depth sensors, which can be used to improve existing computer vision techniques.
6 Constrained Local Neural Field

A big issue in CLM based landmark detection is the performance of the patch experts, which are rarely more complex than linear SVRs or logistic regressors. Due to their simplicity, they may fail to learn complex non-linear relationships between pixel values and response maps. This is especially true if the more complex task of illumination invariant landmark detection is considered (see Section 4.8.1 and Figures 4.20b and 4.21). Because of its simplicity, it is difficult to expect a linear patch expert to work equally well in different illuminations. However, they are commonly used because they are simple to train and have fast implementations (using the convolution trick described in Section 4.3.1), leading to real-time tracking speeds.

Patch experts do not need to be limited to simple linear SVRs or logistic regressors. However, ones based on more complex regressors (for example RBF kernel SVRs) can be very slow (under 1 fps), making them unusable for applications where large amounts of data need to be processed. This is especially true for affect inference, as some datasets contain hour-long recordings, leading to hundreds of thousands of frames per single recording.

In this chapter, I present a Local Neural Field (LNF) patch expert, which is an instance of the more general Continuous Conditional Neural Field (CCNF). It deals with the issues of learning complex scenes by using a hidden non-linear layer, and by exploiting spatial relationships between pixels. An additional advantage of the new LNF patch expert is that it can also be implemented using simple convolutions for the most expensive part of regression, resulting in close to real-time tracking. Constrained Local Neural Field (CLNF) is the name given to a CLM instance that uses LNF patch experts and the previously introduced NU-RLMS fitting method. An overview of CLNF can be seen in Figure 6.1.

Figure 6.1: Overview of the CLNF model. The LNF patch expert is used to calculate patch response maps, which leads to more reliable responses. Optimisation over the patch responses is performed using the Non-Uniform Regularised Mean-Shift method that takes the reliability of each patch expert into account, leading to more accurate fitting. Only 3 out of 66 patch experts are displayed for clarity.

I demonstrate the benefit of the LNF patch expert by comparing it to state-of-the-art methods on three tasks: detection of facial landmarks in images, tracking landmarks in videos, and head pose estimation in videos. By using just the visible light images, CLNF achieves comparable, or better, performance than CLM-Z and outperforms the CLM in all of these tasks, while still achieving close to real time speeds (≈ 20 fps). Finally, I compare the CLNF approach with other state-of-the-art approaches for facial landmark detection, demonstrating its benefits.
Figure 6.2: Examples of two instances of CCNF: (a) Local Neural Field; (b) linear-chain CCNF. Solid lines represent vertex features (f_k), dashed lines represent edge features (g_k or l_k). The input vector x_i is connected to the relevant output scalar y_i through the vertex features that combine the neural layer (Θ) and the vertex weights α. The outputs are further connected with edge features g_k (similarity) or l_k (sparsity). Only direct links from x_i to y_i are presented here, but extensions are straightforward.

The experiments were conducted on four publicly available datasets. For facial landmark detection I used the Multi-PIE and BU-4DFE datasets. For landmark tracking in videos I used the Biwi feature point dataset. Finally, for head pose estimation I used the Boston University, Biwi and ICT-3DHP datasets. The test sets have extreme pose variation, difficult lighting (Multi-PIE) and uncontrolled lighting (ICT-3DHP and Biwi).

The discussion in this chapter is structured as follows. First, the CCNF model is introduced (Section 6.1). Then the LNF patch expert, a special case of CCNF, is presented in Section 6.2. Finally, the experiments used to validate the new patch experts are presented.

6.1 Continuous Conditional Neural Field

Continuous Conditional Neural Field (CCNF) is an undirected graphical model which models the conditional probability of a continuous valued vector y depending on continuous x. CCNF is a general graphical model which can be used as a patch expert, but also for time series modelling (see Chapter 8). CCNF combines the non-linearity of Conditional Neural Fields (Peng et al., 2009) with the flexibility and continuous output of Continuous Conditional Random Fields (Qin et al., 2008). The model also bears a close resemblance to the Discriminative Random Fields (DRF) model proposed by Kumar and Hebert (2003). DRF, however, is a classification rather than a regression model, and it uses slightly different vertex (association) and edge (interaction) features (potentials) that are more suited for classification.

In the discussion the following notation is used: x = {x_1, x_2, ..., x_n} is a set of observed input variables, y = {y_1, y_2, ..., y_n} is a set of output variables to be predicted, and n is the length of a sequence. The output variable y_i is a scalar (y_i ∈ R) and the input x_i is an m dimensional feature vector (x_i ∈ R^m).

The CCNF model for a particular set of observations is a conditional probability distribution with the probability density function:

P(y|x) = \frac{\exp(\Psi)}{\int_{-\infty}^{\infty} \exp(\Psi)\, dy},   (6.1)

where Ψ is a set of potential functions which control the model and \int_{-\infty}^{\infty} \exp(\Psi)\, dy is the normalisation (partition) function which makes the probability distribution a valid one (by making it integrate to 1). The following section describes the potential functions that can be used in the CCNF model.

Figure 6.2 demonstrates two CCNF instances: LNF and linear-chain CCNF. The former can be used as a patch expert and the latter for time-series modelling. CCNF is flexible enough to be used for such different tasks.
6.1.1 Potential functions

Three types of potential functions are defined for the model: vertex features (f_k) and edge features (g_k and l_k). The CCNF potential function is defined as:

\Psi = \sum_i \sum_{k=1}^{K_1} \alpha_k f_k(y_i, x, \theta_k) + \sum_{i,j} \sum_{k=1}^{K_2} \beta_k g_k(y_i, y_j) + \sum_{i,j} \sum_{k=1}^{K_3} \gamma_k l_k(y_i, y_j),   (6.2)

where the model parameters α = {α_1, α_2, ..., α_{K_1}}, Θ = {θ_1, θ_2, ..., θ_{K_1}}, β = {β_1, β_2, ..., β_{K_2}} and γ = {γ_1, γ_2, ..., γ_{K_3}} are learned and used for inference during testing. The individual potential functions are defined as follows:

f_k(y_i, x, \theta_k) = -(y_i - h(\theta_k, x_i))^2,   (6.3)

h(\theta, x) = \frac{1}{1 + e^{-\theta^T x}},   (6.4)

g_k(y_i, y_j) = -\frac{1}{2} S^{(g_k)}_{i,j} (y_i - y_j)^2,   (6.5)

l_k(y_i, y_j) = -\frac{1}{2} S^{(l_k)}_{i,j} (y_i + y_j)^2.   (6.6)

Vertex features f_k represent the mapping from x_i to y_i through a single layer neural network, where θ_k is the weight vector for a particular neuron k. The corresponding α_k for vertex feature f_k represents the reliability of the k-th neuron. The number of neurons used will depend on the problem and can be determined during cross-validation.

Edge features g_k represent the similarities between observations y_i and y_j, enforcing smoothness between connected nodes. The neighbourhood measure S^{(g_k)} is used to create (and potentially weigh) the similarity connections between nodes.

Edge features l_k represent the sparsity (or inhibition) constraint between connected observations y_i and y_j. They penalise the model if both y_i and y_j are large, but do not if both of them are zero. However, l_k edge features have the unwanted consequence of slightly penalising the model if only one of y_i or y_j is large. This, of course, only works for positive y values, which is the case for response maps. The neighbourhood measure S^{(l_k)} is used to create (and potentially weigh) the sparsity connections between nodes.

6.1.2 Learning and Inference

In this section I describe how to estimate the parameters {α, β, γ, Θ} given training data {x^{(q)}, y^{(q)}}_{q=1}^{M} of M observations (sequences, areas of interest, etc.), where each x^{(q)} = {x_1^{(q)}, x_2^{(q)}, ..., x_n^{(q)}} is a set of inputs (pixel values under the response, features describing facial appearance), and each y^{(q)} = {y_1^{(q)}, y_2^{(q)}, ..., y_n^{(q)}} is a corresponding set of real valued labels (expected response from a patch expert, point in continuous emotional space).

CCNF learning picks the α, β, γ and Θ values which maximise the conditional log-likelihood of the model on the training observations:

L(\alpha, \beta, \gamma, \Theta) = \sum_{q=1}^{M} \log P(y^{(q)} | x^{(q)}),   (6.7)

(\bar{\alpha}, \bar{\beta}, \bar{\gamma}, \bar{\Theta}) = \arg\max_{\alpha, \beta, \gamma, \Theta} L(\alpha, \beta, \gamma, \Theta).   (6.8)

During inference, the value of y that maximises the probability distribution given an observation x = {x_1, x_2, ..., x_n} is found:

y^* = \arg\max_y P(y|x).   (6.9)

This can be computed using the learned parameters α, β, γ and Θ.

Multi-variate Gaussian form

It helps with the derivation of inference and of the partial derivatives used for training if the probability density function (Equation 6.1) is converted into multivariate Gaussian form:

P(y|x) = \frac{1}{(2\pi)^{\frac{n}{2}} |\Sigma|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(y - \mu)^T \Sigma^{-1} (y - \mu)\right),   (6.10)

\Sigma^{-1} = 2(A + B + C).   (6.11)

The diagonal matrix A represents the contribution of the α terms (vertex features) to the covariance matrix, and the symmetric B and C represent
Continuous Conditional Neural Field the contribution of the β,γ terms (edge features): Ai,j =  K1 ∑ k=1 αk, i = j 0, i 6= j , (6.12) Bi,j =  ( K2 ∑ k=1 βk n ∑ r=1 S(gk)i,r )− ( K2 ∑ k=1 βkS (gk) i,j ), i = j − K2 ∑ k=1 βkS (gk) i,j , i 6= j , (6.13) Ci,j =  ( K2 ∑ k=1 γk n ∑ r=1 S(lk)i,r ) + ( K2 ∑ k=1 γkS (lk) i,j ), i = j K2 ∑ k=1 γkS (lk) i,j , i 6= j . (6.14) It is useful to define a vector d, which describes the linear terms in the distribution, and µ which is the mean value of the Gaussian form of the CCNF distribution: d = 2αTh(ΘX). (6.15) µ = Σd, (6.16) where X is a matrix in which the ith column represents xi, and Θ rep- resents the combined neural network weights, and h(M) is an element- wise application of sigmoid (activation function) on each element of M. Thus, h(ΘX) represents the response of each of the gates (neural layers) at each xi. Intuitively d is the contribution from the the vertex features, which con- tribute directly from input features x towards y. Σ on the other hand, controls the influence of the edge features to the output. Finally, µ is the expected value of the distribution, hence it is the value of y that maximises P(y|x): y∗ = arg max y (P(y|x)) = µ = Σd. (6.17) 137 6. Constrained Local Neural Field Having defined all the necessary variables, it is now possible to demon- strate the equivalence between probability density in Equation 6.1 and the multivariate Gaussian in Equation 6.10. First, combining the fea- ture functions from Equations 6.3, 6.5, and 6.6 with the potential Equa- tion 6.2, leads to: Ψ = ∑ i K1 ∑ k=1 αk fk(yi,x,θk) +∑ i,j K2 ∑ k=1 βkgk(yi, yj,x) +∑ i,j K3 ∑ k=1 γklk(yi, yj,x) = −∑ i K1 ∑ k=1 αk(yi − h(θTk xi))2 − 1 2∑i,j K2 ∑ k=1 βkS gk i,j (yi − yj)2 −1 2∑i,j K3 ∑ k=1 γkS (lk) i,j (yi + yj) 2. (6.18) The factor Ψ can now be expressed in terms of A, B and d defined previ- ously. This is done in parts, starting with terms containing α parameters from Equation 6.18: −∑ i K1 ∑ k=1 αk(yi − h(θTk xi))2 = −∑ i K1 ∑ k=1 αk(y2i − 2yih(θTk xi) + h(θTk xi)2) = −∑ i K1 ∑ k=1 αky2i +∑ i K1 ∑ k=1 αk2yih(θTk xi)−∑ i K1 ∑ k=1 αkh(θTk xi) 2 = −yT Ay + yTd−∑ i K1 ∑ k=1 αkh(θTk xi) 2. (6.19) 138 6.1. Continuous Conditional Neural Field Collecting terms with β and γ parameters in Equation 6.18 leads to: −1 2∑i,j K2 ∑ k=1 βkS (gk) i,j (yi − yj)2 − 1 2∑i,j K3 ∑ k=1 γkS (lk) i,j (yi + yj) 2 = −1 2∑i,j K2 ∑ k=1 βkS (gk) i,j (y 2 i − 2yiyj + y2j )− 1 2∑i,j K3 ∑ k=1 γkS (lk) i,j (y 2 i + 2yiyj + y 2 j ) = −1 2∑i,j K2 ∑ k=1 βkS (gk) i,j (y 2 i + y 2 j ) +∑ i,j K2 ∑ k=1 βkS (gk) i,j yiyj+ −1 2∑i,j K2 ∑ k=1 γkS (lk) i,j (y 2 i + y 2 j )−∑ i,j K2 ∑ k=1 γkS (lk) i,j yiyj = − K2 ∑ k=1 βk∑ i,j S(gk)i,j y 2 i + K2 ∑ k=1 βkS (gk) i,j ∑ i,j yiyj+ − K2 ∑ k=1 γk∑ i,j S(lk)i,j y 2 i − K2 ∑ k=1 γkS (lk) i,j ∑ i,j yiyj = −yTBy− yTCy. (6.20) Recall that every S(k) is a symmetric matrix by construction. Combining Equations 6.18, 6.19, and 6.20, and having d = Σ−1µ from the definition of µ in Equation 6.16 results in: Ψ = −yT Ay + yTd− yTBy− yTCy− e = −1 2 (yTΣ−1y) + yΣ−1µ− e. (6.21) Above, e = ∑i ∑ K1 k=1 αkh(θ T k xi) 2, it cancels out eventually, so is not writ- ten out fully. Replacing Ψ in Equation 6.1 with Ψ defined in Equation 6.21 leads to: P(y|x) = exp(Ψ)∫ ∞ −∞ exp(Ψ)dy = = exp(−12(yTΣ−1y) + yΣ−1µ) exp(−e)∫ ∞ −∞{exp(−12(yTΣ−1y) + yΣ−1µ) exp(−e)}dy = exp(−12(yTΣ−1y) + yΣ−1µ)∫ ∞ −∞{exp(−12(yTΣ−1y) + yΣ−1µ)}dy . (6.22) As e does not depend on y, it can be taken out of the integral, leading to it cancelling out. 139 6. 
Constrained Local Neural Field The partition function can be integrated using the integral of an expo- nential with square and linear terms1:∫ ∞ −∞ {exp(−1 2 (yTΣ−1y) + yΣ−1µ)}dy = (2pi) n 2 |Σ−1| 12 exp( 1 2 µΣ−1µ). (6.23) In order to guarantee that the partition function is integrable the follow- ing constraints have to be met: αk > 0 and βk > 0,γk > 0 (Qin et al., 2008). This is because if the constraints are not held Σ−1 is not guaran- teed to be positive semi-definite, and this is required for the integral. Finally, plugging Equations 6.21 and 6.23 into Equation 6.1 leads to: P(y|x) = exp(− 1 2 y TΣ−1y + yΣ−1µ) (2pi) n 2 |Σ−1| 12 exp(12µΣ −1µ) = exp(−12 yTΣ−1y + yΣ−1µ) exp(−12µΣ−1µ) (2pi) n 2 |Σ| 12 = exp(−12 yTΣ−1y + yΣ−1µ− 12µΣ−1µ) (2pi) n 2 |Σ| 12 = 1 (2pi) n 2 |Σ| 12 exp(−1 2 (y−µ)TΣ−1(y−µ)). (6.24) This demonstrates that the CCNF probability density function can be expressed as a multivariate Gaussian. Partial derivatives Having defined the probability distribution, it is now possible to find the partial derivatives of it with respect to the model parameters. These partial derivatives can then be used in a gradient based optimisation method to help in finding locally optimal model parameters faster and more accurately. Firstly, it is more convenient to express the problem in terms of log- likelihood, as this does not affect the minima or maxima of the objective. Applying a logarithm on the CCNF model from Equation 6.24 leads to: 1http://www.weylmann.com/gaussian.pdf 140 6.1. Continuous Conditional Neural Field log(P(y|x)) = −12(y−µ)TΣ−1(y−µ)− log((2pi) n 2 |Σ| 12 ) = −12(y−µ)TΣ−1(y−µ)− (n2 log(2pi) + 12 log |Σ|) = −12(y−µ)TΣ−1(y−µ) + 12 log |Σ−1| − n2 log(2pi) = −12 yTΣ−1y + yTd− 12dTΣd+ 12 log |Σ−1| − n2 log(2pi). (6.25) Recall that d = Σ−1µ, and |Σ| = 1|Σ−1| , where |Σ| denotes the determi- nant of the covariance matrix Σ. Furthermore, because Σ−1 is symmetric by construction, Σ−1 = (Σ−1)T and Σ = ΣT. First, the derivation of partial derivatives with respect to the α parame- ters is demonstrated. Recall that A is only dependent on α, B on β, and C on γ; d, however, depends on both α and θ, hence: ∂Σ−1 ∂αk = ∂2A + 2B + 2C ∂αk = ∂2A ∂αk = 2I, (6.26) ∂di ∂αk = 2h(ΘX)k,i, (6.27) ∂d ∂αk = (2h(ΘX)k,∗)T. (6.28) I is the identity matrix of size n× n, where n is the number of elements in a sequence. Xk,∗ notation refers to a row vector corresponding to the kth row of a matrix X. For brevity, D = h(ΘX) In the derivation below the partial derivative of a matrix inverse ∂M−1 ∂α = 141 6. Constrained Local Neural Field −M−1 ∂M ∂α M−1 is used, to get the partial derivative of Σ. ∂dTΣd ∂αk = ∂dT ∂αk Σd+ dT ∂Σd ∂αk = 2Dk,∗µ+ dT( ∂Σ ∂αk d+ Σ ∂d ∂αk ) = 2Dk,∗µ+ dT ∂Σ ∂αk d+ dTΣ2(Dk,∗)T = 4Dk,∗µ+ dT ∂Σ ∂αk d = 4Dk,∗µ+ dT(−Σ∂Σ −1 ∂αk Σ)d = 4Dk,∗µ− 2dTΣΣd = 4Dk,∗µ− 2µTµ (6.29) ∂ log |Σ−1| ∂αk = 1 |Σ−1| ∂|Σ−1| ∂αk = 1 |Σ−1| |Σ −1| × trace(Σ∂Σ −1 αk ) = 2× trace(ΣI) = 2× trace(Σ) (6.30) These can be combined to get the partial derivative of log-likelihood with respect to α terms: ∂ log(P(y|x)) αk = −yTy + 2yTDTk,∗ − 2D∗,kµ+µTµ+ trace(Σ) (6.31) Now the derivation of the partial derivatives of the likelihood with re- spect to β and γ parameters is shown (they are discussed together as they are very similar): ∂Σ−1 ∂βk = 2B(k), (6.32) ∂Σ−1 ∂γk = 2C(k), (6.33) B(k) =  (∑ n r=1 S (gk) i,r )− S(gk)i,j , i = j −S(gk)i,j , i 6= j , (6.34) C(k) =  (∑ n r=1 S (lk) i,r ) + S (lk) i,j , i = j S(lk)i,j , i 6= j , (6.35) ∂d ∂βk = 0, (6.36) 142 6.1. 
Continuous Conditional Neural Field ∂d ∂γk = 0, (6.37) dTΣd βk = −dT(Σ∂Σ −1 ∂βk Σ)d = −2dTΣB(k)Σd = −2µTB(k)µ, (6.38) dTΣd γk = −dT(Σ∂Σ −1 ∂γk Σ)d = −2dTΣC(k)Σd = −2µTC(k)µ, (6.39) ∂ log |Σ−1| ∂βk = 1 |Σ−1| ∂|Σ−1| ∂βk = 1 |Σ−1| |Σ −1| × trace(Σ∂Σ −1 βk ) = 2× trace(ΣB(k)) = 2×Vec(Σ)TVec(B(k)), (6.40) ∂ log |Σ−1| ∂γk = 1 |Σ−1| ∂|Σ−1| ∂γk = 1 |Σ−1| |Σ −1| × trace(Σ∂Σ −1 γk ) = 2× trace(ΣC(k)) = 2×Vec(Σ)TVec(C(k)). (6.41) Here the matrix trace property - trace(AB) = Vec(A)TVec(B) was used, where Vec refers to the matrix vectorisation operation which stacks up columns of a matrix together to form a single column matrix2. The derivative of inverse matrix as in the case with αk version, was also used. This can now be combined to lead to: ∂ log(P(y|x)) βk = −yTB(k)y +µTB(k)µ+Vec(Σ)TVec(B(k)), (6.42) ∂ log(P(y|x)) γk = −yTC(k)y +µTC(k)µ+Vec(Σ)TVec(C(k)). (6.43) Finally, the partial derivatives of the likelihood with respect to the θ parameters (the neural network weights) are derived. The notation is abused slightly for clarity and brevity, h(A) on a n × m size matrix A produces a n × m matrix with the activation function applied on each element. ∂Σ−1 ∂θi,j = 0, (6.44) 2This leads to much faster calculation of the trace as a multiplication of potentially very big matrices is avoided 143 6. Constrained Local Neural Field If the sigmoid activation function h(z) = 11+e−z is used: ∂h(z) ∂z = h(z)(1− h(z)), (6.45) br = 2 K1 ∑ k=1 αkh(θTk xr), (6.46) ∂br ∂θi,j = 2αih(θTi xr)(1− h(θTi xr))xr,j, (6.47) ∂d ∂θi,j = 2αi{h(θTi X) ◦ (1− h(θTi X))}X∗,j. (6.48) ∂dTΣd ∂θi,j = ∂dT ∂θi,j Σd+ dT ∂Σd ∂θi,j = ∂dT ∂θi,j µ+µT ∂d ∂θi,j = 2µT ∂d ∂θi,j (6.49) Above, ◦ is the Hadamard or element-wise product. These are now combined to get : ∂ log(P(y|x)) θi,j = yT ∂d∂θi,j −µT ∂d ∂θi,j = (y−µ)T(2αi{h(θTi X) ◦ (1− h(θTi X))}X∗,j) (6.50) Above, is basically the update of a single layer neural network (back propagation) with sigmoid activation where the current feed-forward prediction is µ and error is (y−µ). Learning The partial derivatives in Equations 6.31, 6.42, 6.43, and 6.50 can now be used in the CCNF learning algorithm. In order to avoid overfitting L2 norm regularisation terms are added to the likelihood function for each of the parameter types (λα||α||22,λβ||β||22, λβ||γ||22,λθ||Θ||22), with alpha and beta sharing the regularisation weight. The values of λα,λβ,λΘ are determined during cross-validation, as is the number of neural layers. 144 6.2. Local Neural Field I used the constrained BroydenFletcherGoldfarbShanno (BFGS) algo- rithm for finding locally optimal model parameters. I used the standard Matlab implementation of the algorithm. Inference Since the CCNF model can be viewed as a multivariate Gaussian, infer- ring y values that maximise P(y|x) is straightforward. The prediction is the mean value of the distribution: y′ = arg max y (P(y|x)) = µ = Σd. (6.51) 6.2 Local Neural Field In this section I present the novel LNF patch expert, also called the Grid-CCNF. Firstly, it learns complex non-linear relationships between the pixel values and the patch response maps. Secondly, it learns the relationships between nearby pixels in the response map. The two types of spatial relationships captured by the LNF model are: spatial similarity and sparsity. Spatial similarity ensures that pixels nearby should have similar alignment probabilities; sparsity reduces the number of peaks in the response map. LNF is an instance of CCNF presented in the previous section. 
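Since the Gaussian form of Section 6.1.2 reduces inference to a linear solve, the following sketch illustrates how a trained CCNF instance (such as an LNF patch expert) could compute its prediction μ = Σd from already-learned parameters. It is a simplified, dense-matrix illustration under my own naming; the neighbourhood matrices S and the parameters are assumed to be given, and this is not the thesis implementation.

```python
import numpy as np

def ccnf_predict(X, Theta, alpha, beta, gamma, S_g, S_l):
    """Predict y* = mu = Sigma @ d for one observation.

    X:     (m, n) inputs, one column per node x_i
    Theta: (K1, m) neural layer weights, one row per neuron
    alpha: (K1,) vertex feature weights (assumed positive)
    beta, gamma: edge feature weights (assumed positive)
    S_g:   list of (n, n) similarity neighbourhood matrices S^(g_k)
    S_l:   list of (n, n) sparsity neighbourhood matrices S^(l_k)
    """
    n = X.shape[1]
    H = 1.0 / (1.0 + np.exp(-(Theta @ X)))   # (K1, n) neuron activations h(theta_k^T x_i)

    # Vertex contribution: A = (sum_k alpha_k) I and d = 2 alpha^T h(Theta X).
    A = alpha.sum() * np.eye(n)
    d = 2.0 * (alpha @ H)                    # (n,)

    # Edge contributions (Equations 6.13 and 6.14): graph-Laplacian-like terms.
    B = np.zeros((n, n))
    for b_k, S in zip(beta, S_g):
        B += b_k * (np.diag(S.sum(axis=1)) - S)
    C = np.zeros((n, n))
    for g_k, S in zip(gamma, S_l):
        C += g_k * (np.diag(S.sum(axis=1)) + S)

    precision = 2.0 * (A + B + C)            # Sigma^{-1} = 2(A + B + C)
    return np.linalg.solve(precision, d)     # mu = Sigma d
```

The positivity constraints on α, β and γ mentioned above are what keep the precision matrix well behaved, so the solve in the last line is valid.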
Input x is the set of support regions in the area of interest; the output y is the response map. The LNF patch expert with the edge features used is summarised in Figure 6.3.

Figure 6.3: A visualisation of the LNF patch expert. The three types of edge features used are displayed: sparsity is enforced between the central node and the nodes surrounded by the green rectangle, and similarity is enforced between the central node and its neighbours (two similarities illustrated in red and blue).

The LNF patch expert uses two similarity edge features g_k to enforce smoothness on connected nodes. S^(g_1) is defined to return 1 (and otherwise 0) only when the two nodes i and j are direct (horizontal/vertical) neighbours in a grid. S^(g_2) is defined to return 1 (and otherwise 0) when i and j are diagonal neighbours in a grid.

A single sparsity enforcing edge feature l_k is used. A neighbourhood region S^(l) is defined to return 1 only when two nodes i and j are between 3 and 5 edges apart (where edges are counted from the grid layout of the LNF patch expert, Figure 6.2).

LNF without edge features reduces to something similar to a three layer perceptron with a sigmoid activation function followed by a weighted sum of the hidden layers. It is also similar to the first layer of a Convolutional Neural Network (LeCun et al., 2010).

Figure 6.4 demonstrates the advantages of modelling spatial dependencies and input non-linearities by comparing LNF (with and without edge features) and SVR patch experts. Note how the response maps of LNF patch experts with edge features have fewer peaks and are smoother than the ones without edge features. Furthermore, the LNF patch experts are more accurate than the SVR ones.

Figure 6.4: Different response maps from patch experts (columns: input, SVR, LNF without edge features, ground truth, LNF with edge features) next to the actual landmark location (hotter is higher probability). The ideal response (used for training) is labelled ground truth. SVR refers to the usual patch expert used by CLM approaches. Two instances of the LNF model are also shown: one without spatial features, that is without g_k and l_k, and one with. Note how the edge features lead to fewer peaks and a smoother response, improving the patch response convexity. Furthermore, note the noisiness of the SVR response.

6.2.1 Training

In order to collect training data the SVR sampling method can be used (Section 4.4). However, instead of each sample being considered a separate instance, they are grouped into 'sequences' of 9 × 9 samples, leading to a single observation x = {x_1, ..., x_81}.

6.3 Patch expert experiments

I conducted the following experiments to assess the properties of the new LNF patch expert. Firstly, the effect of spatial constraints on landmark detection was assessed. Secondly, LNF patch experts were compared to the SVR ones under easy illumination conditions. Finally, the ability of LNF to generalise across illuminations was evaluated; recall that in Section 4.8.1 it was shown that SVR based patch experts do not generalise well.

6.3.1 Methodology

I used the same data to train the LNF patch experts as was used to train the SVR patch experts in Section 4.7.1. The fitting parameters used for CLNF are: r = 25, ρ = 1.5, w = 5 for landmark detection in images; r = 20, ρ = 1.25, w = 10 for tracking in videos.
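To make the grid structure described in Section 6.2 concrete, the sketch below constructs the three neighbourhood matrices for an s × s response-map grid: S^(g_1) for horizontal/vertical neighbours, S^(g_2) for diagonal neighbours, and S^(l) for pairs of nodes between 3 and 5 edges apart. It is my own illustrative reading of the text; in particular, "edges apart" is interpreted here as city-block (4-connected) graph distance, which is one plausible but unconfirmed assumption. The resulting matrices could be passed to a CCNF-style inference routine such as the sketch in Section 6.1.

```python
import numpy as np

def grid_neighbourhoods(size):
    """Build S^(g1), S^(g2) and S^(l) for a size x size grid of response-map
    nodes, indexed row-major. Distances for S^(l) are taken as city-block
    graph distance (an assumption, not the thesis definition)."""
    n = size * size
    rows, cols = np.divmod(np.arange(n), size)

    # Pairwise row/column offsets between every pair of nodes.
    dr = np.abs(rows[:, None] - rows[None, :])
    dc = np.abs(cols[:, None] - cols[None, :])

    S_g1 = ((dr + dc) == 1).astype(float)            # horizontal/vertical neighbours
    S_g2 = ((dr == 1) & (dc == 1)).astype(float)     # diagonal neighbours
    dist = dr + dc                                   # city-block graph distance
    S_l = ((dist >= 3) & (dist <= 5)).astype(float)  # sparsity links

    return S_g1, S_g2, S_l

# Example: the 9x9 sampling grid used for a single training observation.
S_g1, S_g2, S_l = grid_neighbourhoods(9)
print(S_g1.shape, int(S_g1.sum()), int(S_g2.sum()), int(S_l.sum()))
```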
Constrained Local Neural Field 0.01 0.02 0.03 0.04 0.050 0.2 0.4 0.6 0.8 1 Fitting on BU−4DFE Size normalised shape RMS error Pr op or tio n of im ag es LNF no edge LNF edge 0.01 0.03 0.05 0.07 0.090 0.2 0.4 0.6 0.8 1 Size normalised shape RMS error Pr op or tio n of im ag es Fitting on Multi−PIE LNF no−edge LNF edge Figure 6.5: Comparison of using LNF patch experts with and without edge features. Observe the very slightly improved fitting accuracy with edge features. The training and testing data never overlap, and the same participant never appears in both of the sets. The term general illumination refers to training data from both BU-4DFE and Multi-PIE datasets which includes four different lighting conditions: frontal, left, right and poorly lit. The term frontal illumination is used to refer to frontally lit faces used for training from both BU-4DFE and Multi-PIE datasets. Two Multi-PIE subsets were used for testing: easy lighting and diffi- cult lighting. Easy lighting includes frontally lit faces; difficult lighting includes left, right and poorly lit faces. 6.3.2 Importance of edge features The first experiment explored the effect of modelling the relationships between output responses through the use of edge features in LNF patch experts. This was done by training general illumination LNF patches with and without edge features and comparing their performance on BU-4DFE and Multi-PIE frontal light datasets. Patch experts of three views were trained for the experiment - frontal and two profiles. 148 6.3. Patch expert experiments Results The results of using LNF with and without edge features on the two datasets can be seen in Figure 6.5. Wilcoxon sign rank test revealed significant differences (z = 6.48, p < 0.001) between LNF patch experts which use the edge features (Mdn = 0.0204) versus those which do not (Mdn = 0.0210) on the BU-4DFE. Wilcoxon sign rank test revealed significant differences (z = 3.09, p < 0.01) between LNF patch experts which use the edge features (Mdn = 0.0373) versus those which do not (Mdn = 0.0386) on the Multi-PIE frontally lit dataset. Discussion Edge features provide a very small but statistically significant improve- ment to fitting accuracy both on BU-4DFE and Multi-PIE datasets. These results provide support for the use of LNF patch experts with edge fea- tures for landmark detection in images. Therefore, in all of the further sections LNF refers to LNF patch experts with edge features. 6.3.3 Facial landmark detection under easy illumination I conducted an experiment to see how well the LNF patch experts per- form for landmark detection on frontally lit faces, when only such faces were seen in training. This is a task on which SVR based patch ex- perts perform well, but I wanted to see if LNF can improve on these re- sults. For this experiment both SVR and LNF patch experts were trained on frontal illumination and tested on the Multi-PIE frontal illumination subset and the BU-4DFE dataset. As shown in previous sections using multi-modal SVR patch experts helps with performance. In this section, a uni-modal LNF patch expert is compared with an already improved multi-modal SVR patch expert. 149 6. 
Constrained Local Neural Field 0.01 0.02 0.03 0.04 0.050 0.2 0.4 0.6 0.8 1 Size normalised shape RMS error Pr op or tio n of im ag es Fitting on BU−4DFE LNF SVR 0.01 0.03 0.05 0.07 0.090 0.2 0.4 0.6 0.8 1 Size normalised shape RMS error Pr op or tio n of im ag es Fitting on Multi−PIE LNF SVR Figure 6.6: Fitting on the Multi-PIE dataset, observe how the LNF patch experts outperform the SVR ones. Results Results of this experiment can be found in Figure 6.6. Wilcoxon sign rank test revealed significant differences (z = −5.60, p < 0.001) between LNF patch experts (Mdn = 0.0197) versus SVR ones (Mdn = 0.0204) on the BU-4DFE dataset. Wilcoxon sign rank test revealed significant differences (z = −19.2, p < 0.001) between LNF patch experts (Mdn = 0.0320) versus SVR ones (Mdn = 0.0347) on the Multi-PIE single light dataset. Discussion LNF based patch experts statistically significantly outperformed SVR patch experts on both of the Multi-PIE and BU-4DFE datasets when trained and tested on the same illumination. Furthermore, the LNF patch experts used were uni-modal, but they still outperformed the multi-modal SVR ones which used both intensity and gradient inten- sity images. 6.3.4 Facial landmark detection under general illumination The big problem with SVR patch experts and the main motivation be- hind LNF patch experts is the inability of the former to learn under a 150 6.3. Patch expert experiments 0.01 0.02 0.03 0.04 0.050 0.2 0.4 0.6 0.8 1 Size normalised shape RMS error Pr op or tio n of im ag es Fitting on BU−4DFE LNF SVR 0.01 0.03 0.05 0.07 0.090 0.2 0.4 0.6 0.8 1 Size normalised shape RMS error Pr op or tio n of im ag es Fitting on Multi−PIE LNF SVR Figure 6.7: Fitting on the Multi-PIE dataset and BU-4DFE datasets when training on the general illumination case. Observe a marked improve- ment when using LNF patch experts. number of illumination conditions. In this experiment, I wanted to see how the LNF patch experts affect the landmark detection accuracy on the Multi-PIE difficult lighting subset, Multi-PIE frontal lighting subset and the BU-4DFE dataset. Unless otherwise stated, both the SVR and LNF patch experts are trained on the general illumination training set. Furthermore, I wanted to see the effect of training more general patch experts has on a frontal illumination test set. In the SVR case this results in degraded performance, forcing a trade-off between robustness and accuracy. Results The results of using general illumination trained SVR and LNF patch ex- perts on the Multi-PIE difficult lighting subset and the BU-4DFE dataset can be seen in Figure 6.7. Wilcoxon sign rank test revealed significant differences (z = −9.65, p < 0.001) between LNF patch experts (Mdn = 0.0207) versus SVR ones (Mdn = 0.0229) on the BU-4DFE dataset. Wilcoxon sign rank test revealed significant differences (z = −68.06, p < 0.001) between LNF patch experts (Mdn = 0.0332) versus SVR ones 151 6. Constrained Local Neural Field 0.01 0.03 0.05 0.07 0.090 0.2 0.4 0.6 0.8 1 Size normalised shape RMS error Pr op or tio n of im ag es LNF Fitting on Multi−PIE Frontal trained General trained 0.01 0.03 0.05 0.07 0.090 0.2 0.4 0.6 0.8 1 Size normalised shape RMS error Pr op or tio n of im ag es SVR Fitting on Multi−PIE Frontal trained General trained Figure 6.8: The effect of using patch experts trained on general versus specific illumination cases. Observe how when using SVR patch experts the use of general training has a big negative effect on the specific case. 
This is not the case when using LNF patch experts, which can learn a general model without sacrificing accuracy. (Mdn = 0.0366) on the Multi-PIE difficult light dataset. The effect of different training sets (specific and general illumination) when testing on the frontally lit Multi-PIE subset can be seen in Figure 6.8. Observe the bigger drop in performance when using general train- ing with SVR patch experts: for LNF the drop in accuracy is from Mdn = 0.0320 to Mdn = 0.0335 and for SVR the drop in accuracy is from Mdn = 0.0347 to Mdn = 0.0374 (both statistically significant according to a Wilcoxon sign rank test). This demonstrates that LNF patch experts are better at learning a number of illuminations. Discussion The above experiments show the great benefit of LNF patch experts for a model which needs to work in more general environments. They sub- stantially outperform SVR patch experts when testing on both Multi-PIE and BU-4DFE datasets. Furthermore, there is less degradation of per- formance when using a general model, compared to the SVR case. The novel LNF patch expert tackles the big problem that SVR patch experts face – difficulty of illumination independent landmark detection. LNF 152 6.4. General experiments patch experts make it easier to train a general tracker without sacrificing performance in a specific case. 6.4 General experiments The previous section explored the use of LNF patch experts compared to the SVR based ones, demonstrating their advantages for landmark detection. In this section, I perform a broader set of experiments demon- strating the performance of CLNF (LNF with NU-RLMS) against other state-of-the-art approaches in landmark detection and head pose esti- mation. 6.4.1 Facial landmark detection This section compares the performance of the CLNF model versus other state-of-the-art approaches for landmark detection on the BU-4DFE and the frontal and general illumination Multi-PIE datasets. I explored how CLNF compares to other approaches: when detecting landmarks on the same dataset under the same illumination; and when generalising to unseen lighting conditions and unseen datasets. Baselines One of the baselines used is the Zhu and Ramanan (2012) tree based model. It is a joint face detector, head pose estimator and landmark detector. The tree based model has shown very good performance at locating the face and the landmark features on a number of datasets. I used the trained models and the detection code provided by the authors. Zhu and Ramanan’s approach has been trained on the Multi-PIE dataset using 900 faces from 13 viewpoints. 300 of those faces are frontal, while the remaining 600 are evenly distributed among the remaining view- points. The authors provide two models: a more accurate independent model which takes ≈ 40 seconds per image, and a less accurate fully- shared model which takes ≈ 5 seconds per Multi-PIE image (on a 3.06 GHz dual core Intel i3 CPU). I refer to these models as tree based indepen- 153 6. Constrained Local Neural Field dent and tree based shared respectively. The amount of training data the authors used is comparable to 587 training images used for every view of LNF patch expert, making it a fair comparison. Active Orientation Model (AOM) (Tzimiropoulos et al., 2012) is a gener- ative model of facial shape and appearance. It is similar AAM, however, there are two differences: a different model of appearance and a robust algorithm for model fitting and parameter estimation. 
Therefore, it gen- eralises better to unseen faces and variations. I used the trained model and the landmark detection code provided by the authors. Tzimiropou- los et al. (2012) trained and evaluated their model on close-to-frontal faces (-15 to 15 degrees yaw). They trained their models using 432 im- ages from 54 different subjects. For each subject 8 images were used: 1 image for frontal (0 degrees) neutral expression, 2 images for 2 different viewpoints (-15 and 15 degrees of yaw) displaying neutral expression; and 5 frontal images (0 degrees) displaying the remaining 5 expressions. This is comparable to the number of training samples used for CLNF training. As a final baseline, I used the CLM model introduced in the previous chapters: a multi-modal, multi-scale formulation with NU-RLMS fitting. For CLM the same training data was used as for the CLNF approach. Note this is a more accurate version of the CLM model presented by Saragih et al. (2011), and it is called CLM+ in this section. Methodology As CLNF was compared to other state-of-the art approaches that were not optimised for speed, I used CLNF parameters that make fitting slower, but more accurate. I used bigger areas of interest - 21× 21 pix- els and more NU-RLMS iterations - 4. This ensured a fair comparison. Other parameters used are as follows: r = 25, ρ = 1.5, w = 5. Further- more, even with increased complexity, CLM and CLNF approaches still performed faster that the baselines they were being compared against. With these accuracy tuned parameters CLNF runs at 3 frames per sec- ond on a 3.06GHz dual core Intel i3 CPU. 154 6.4. General experiments 0.01 0.03 0.05 0.07 0.090 0.2 0.4 0.6 0.8 1 Size normalised shape RMS error Pr op or tio n of im ag es Fitting on Multi−PIE CLNF CLM+ Tree based indep. Tree based shared (a) 0.01 0.03 0.05 0.07 0.090 0.2 0.4 0.6 0.8 1 Size normalised shape RMS error Pr op or tio n of im ag es Fitting on frontal Multi−PIE CLNF CLM+ Tree based indep. AOM (b) Figure 6.9: Comparison of CLNF and other landmark detectors on the frontally lit Multi-PIE dataset. a) The performance on the whole dataset. b) The performance on the close to frontal images (from −15◦ to 15◦ yaw). Notice how CLNF outperforms all of the other approaches in terms of accuracy, but the tree based models of Zhu and Ramanan (2012) are better in terms of robustness. Performance on the same dataset and same illumination In order to compare the CLNF method against the other state-of-the-art approaches, I performed landmark detection on the Multi-PIE dataset with frontal lighting. All of the approaches were trained on the frontally lit Multi-PIE dataset, on roughly the same amount of images (potentially on different subjects). CLNF and CLM approaches had BU-4DFE train- ing data as well. The results of the experiments can be see in Figure 6.9. The AOM model was only trained on close to frontal images, so a separate graph display- ing the performance of baselines and CLNF is shown in Figure 6.9b. To see the effect of landmark detectors on the RMSE, in the case of fit- ting on all of the Multi-PIE frontal lit images, a Friedman’s ANOVA was performed. It revealed a significant effect of the method (χ2(3) = 1822.8, p < 0.001). Next, Friedman’s ANOVAs were used to follow up the findings (a Bonferroni correction to p values was applied). The com- parisons revealed that all of the landmark detection methods were sig- nificantly different (p < 0.001) from each other. The rankings from 155 6. 
The rankings from most to least accurate are as follows (according to mean test rankings assigned during the ANOVA): CLNF (Mdn = 0.0312), tree based independent model (Mdn = 0.0362), CLM+ (Mdn = 0.0344), and tree based fully shared model (Mdn = 0.0426).

In order to assess the effect of landmark detectors on the RMSE, in the case of fitting on close to frontal Multi-PIE images, a Friedman's ANOVA was performed. It revealed a significant effect of the method (χ²(3) = 1487.9, p < 0.001). Friedman's ANOVAs were used to follow up the findings (a Bonferroni correction was applied to the p values). The comparisons revealed that all of the landmark detection methods were significantly different (p < 0.001) from each other. This led to the following rankings from most to least accurate (according to test rankings assigned during the ANOVA): CLNF (Mdn = 0.0257), AOM (Mdn = 0.0276), CLM+ (Mdn = 0.0275), and tree based independent model (Mdn = 0.0337).

In sum, CLNF outperforms all of the other approaches in terms of error rates, demonstrating its ability to learn the Multi-PIE dataset well. However, it is also clear that the tree based approach is more robust. This shows the potential of combining the two approaches.
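The statistical protocol above recurs throughout this section: per-image size-normalised shape RMS errors are compared across methods with a Friedman's ANOVA, followed by Bonferroni-corrected pairwise comparisons and a ranking by median error. The sketch below illustrates one way of reproducing it; it is not the code used in this dissertation (the experiments were run in Matlab), and the normalisation by inter-ocular distance and the use of Wilcoxon signed-rank tests for the pairwise follow-ups are my own assumptions for illustration.

```python
import numpy as np
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

def normalised_rms(pred, gt, norm_size):
    """Size-normalised shape RMS error for one image.
    pred, gt: (num_landmarks, 2) arrays; norm_size: e.g. inter-ocular distance."""
    per_point = np.linalg.norm(pred - gt, axis=1)
    return np.sqrt(np.mean(per_point ** 2)) / norm_size

def compare_methods(errors):
    """errors: dict mapping method name -> per-image error array
    (same images, same order, for every method)."""
    names = list(errors)
    # Omnibus test: is there any effect of method on the per-image error?
    chi2, p = friedmanchisquare(*[errors[n] for n in names])
    print(f"Friedman chi2 = {chi2:.1f}, p = {p:.3g}")
    # Pairwise follow-ups with a Bonferroni correction (Wilcoxon signed-rank
    # tests are used here as the paired comparison).
    pairs = list(combinations(names, 2))
    for a, b in pairs:
        _, p_pair = wilcoxon(errors[a], errors[b])
        print(f"{a} vs {b}: corrected p = {min(1.0, p_pair * len(pairs)):.3g}")
    # Rank the methods by median error (lower is better).
    for n in sorted(names, key=lambda m: np.median(errors[m])):
        print(f"{n}: Mdn = {np.median(errors[n]):.4f}")
```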
Performance on the same dataset but different illumination

The following experiment explored the ability of CLNF and the other baselines to generalise to unseen illuminations. All of the detectors were trained on frontal illumination and tested on the difficult illumination Multi-PIE dataset. Furthermore, the results of the multi-illumination trained CLNF (CLNF-ML) were included, as a reference for a detector that generalises well.

The results of CLNF based landmark detection against the previous baselines can be seen in Figure 6.10. It can be clearly seen that the results drop substantially when performing landmark detection on unseen lighting conditions, even on the same dataset. However, if we use multi-light training for CLNF the good results can be preserved. Furthermore, even though it works well on frontal lighting, the holistic AOM fails to generalise to unseen lighting conditions, with only 22% convergence. In contrast, part based models perform much better, with over 50% convergence.

Figure 6.10: Comparison of CLNF and other landmark detectors on the Multi-PIE dataset, but on illumination unseen during training (except for the CLNF-ML case, which is included only for reference). Notice how the performance drops on unseen illumination; however, the drop for a holistic approach (AOM) is the greatest.

To compare the ability of landmark detectors to generalise across illumination, a Friedman's ANOVA was performed on the RMSE of all of the Multi-PIE general illumination images. It revealed a significant effect of the method (χ²(3) = 1210.3, p < 0.001). Friedman's ANOVAs were used to follow up the findings (a Bonferroni correction was applied to the p values). The comparisons revealed that all of the landmark detection methods were significantly different (p < 0.001) except for the CLM+ and fully shared tree model. This led to the following rankings from most to least accurate (according to mean test rankings assigned during the ANOVA): tree based independent model (Mdn = 0.0465), CLNF (Mdn = 0.0538), tree based fully shared model (Mdn = 0.0521), and CLM+ (Mdn = 0.0592).

To compare the ability of landmark detectors to generalise across illumination on frontal images, a Friedman's ANOVA was performed on the RMSE of the frontal Multi-PIE general illumination images. It revealed a significant effect of the method (χ²(4) = 9849.1, p < 0.001). Friedman's ANOVAs were used to follow up the findings (a Bonferroni correction was applied to the p values). The comparisons revealed that all of the landmark detection methods were significantly different (p < 0.001). This led to the following rankings from most to least accurate (according to mean test rankings assigned during the ANOVA): CLNF (Mdn = 0.0411), tree based independent model (Mdn = 0.0423), CLM+ (Mdn = 0.0453), tree based fully shared model (Mdn = 0.0489), and AOM (Mdn = 0.111).

The results indicate that part based approaches are more suitable for generalisation to unseen illuminations. Furthermore, CLNF is not only orders of magnitude faster, but often outperforms the tree based methods of Zhu and Ramanan (2012) in terms of generalisability.

Performance on a different dataset

The final experiment assessed how well the different landmark detectors generalise to an unseen dataset, a crucial requirement for truly robust detectors. All of the detectors were trained on the frontally lit Multi-PIE dataset and evaluated on the BU-4DFE dataset.

The results of this experiment can be seen in Figure 6.11. It can be seen that the CLNF and CLM approaches outperform the tree based method of Zhu and Ramanan (2012) and the person independent AOM of Tzimiropoulos et al. (2012) by a large margin.

Figure 6.11: Comparison of CLNF and other landmark detectors on the BU-4DFE dataset, when none of the detectors were trained on it.

In order to examine the generalisability of landmark detectors across datasets, a Friedman's ANOVA was performed on the RMS errors on the BU-4DFE images. The detectors were significantly different (χ²(3) = 770.6, p < 0.001). Friedman's ANOVAs were used to follow up the findings (a Bonferroni correction was applied to the p values). The comparisons revealed that all of the landmark detectors were significantly different (p < 0.001) from each other, except for CLNF vs. CLM+ and AOM vs. the independent tree based method. The following ordering was produced by the test, from most to least accurate (according to test rankings assigned during the ANOVA): CLNF (Mdn = 0.0223), CLM+ (Mdn = 0.0230), AOM (Mdn = 0.0345) and tree based independent model (Mdn = 0.0349). In conclusion, this experiment illustrates the better generalisation of CLNF and CLM to unseen datasets, highlighting their usefulness for real life scenarios.

Discussion

The experiments have shown that CLNF performs much better than other landmark detectors both when fitting on the same dataset, and when generalising to unseen data and unseen lighting conditions. Even the simpler CLM with my extensions displays better generalisability. This suggests that one can achieve better generalisability by modelling parts of a face rather than the whole face, as is the case with AOM or AAM.
Finally, the CLNF (3 fps) and CLM (5 fps) approaches are faster than the AOM (2 fps) and tree based models (0.03 fps) for landmark detection in images. In all of the cases Matlab implementations were used, without much explicit optimisation, running on a 3.06 GHz dual core Intel i3 CPU. These reported frame-rates are worse than the ones reported for tracking in image sequences, as they are run using Matlab code and with much bigger areas of interest. In conclusion, my CLNF model is both faster and more accurate than the models proposed by Zhu and Ramanan (2012) and Tzimiropoulos et al. (2012).

6.4.2 Facial landmark tracking

The effect of the CLNF model on tracking feature points in a video sequence was also assessed. For this, the subset of the Biwi head pose dataset that is labelled for feature points (see Section 3.2.2) was used. This is the same dataset that was used to evaluate the CLM-Z approach for facial feature tracking.

The training and fitting strategies were the same as in the previous experiments on video tracking (Section 4.7.5), but with slightly different fitting parameters. I used general illumination training, as the lighting in the dataset is varied and uncontrolled. For feature tracking in an image sequence using CLNF the model parameters were as follows: r = 20, ρ = 1.25, and w = 10. The approach was tested with and without reinitialisation strategies to assess both robustness and accuracy.

Baselines

I compared CLNF to the CLM and CLM-Z methods. Note that CLM-Z used both intensity and depth data, hence it was a slightly unfair comparison.

Results and Discussion

The results of the experiment are shown in Figure 6.12. The graphs indicate that the CLNF model is more robust than CLM and CLM-Z both with and without reinitialisation. However, a Friedman's ANOVA did not reveal any significant differences (p > 0.05 in the no-reinitialisation case and p > 0.1 when reinitialisation was used).

Figure 6.12: Error fitting curves on the Biwi Kinect head pose dataset using the CLNF approach versus CLM and CLM-Z, (a) without and (b) with reinitialisation. Normalised by interocular distance.

Figure 6.13: Sample point fitting on the Biwi head pose dataset (frames 0, 100, 168, 276, 390, 477, 634 and 747). Top row is CLM, bottom row is CLNF.

6.4.3 Head pose estimation

In order to assess the ability of CLNF to track head pose I used the same datasets and baselines that were used to assess the performance of CLM-Z (Chapter 5). Furthermore, as one of the biggest advantages of CLNF is its ability to effectively learn illumination variations if they are provided in training, I trained two CLNF versions: CLNF-SL (single frontal light) and CLNF-ML (four lighting conditions).

Results

The results of the head pose estimation are displayed in Table 6.1. It is evident that the CLNF-ML model outperforms the other approaches based both on RGB and RGBD data. These results highlight the importance of capturing lighting variation in training in order to fit on unseen datasets. Furthermore, the results show that CLNF demonstrates comparable, or better, performance than dedicated rigid head pose trackers.

Model                                 Yaw     Pitch   Roll    Mean    Mdn.
ICT-3DHP
  GAVAM (Morency et al., 2008)        6.58    5.01    3.50    5.03    3.08
  CLM                                 5.41    4.32    4.83    4.85    2.45
  CLNF-SL                             4.72    3.51    4.47    4.24    2.41
  CLNF-ML                             4.46    3.21    4.16    3.94    2.23
ICT-3DHP with depth
  GAVAM (Morency et al., 2008)        3.76    5.24    4.93    4.64    2.91
  Reg. for. (Fanelli et al., 2011b)   7.69    10.66   8.72    9.03    5.02
  CLM-Z                               4.73    4.10    4.66    4.50    2.35
Biwi-HP
  GAVAM (Morency et al., 2008)        14.16   9.17    12.41   11.91   5.63
  CLM                                 10.32   10.27   9.01    9.87    3.58
  CLNF-SL                             9.89    9.37    7.84    9.03    3.73
  CLNF-ML                             8.14    7.37    6.68    7.40    3.20
Biwi-HP with depth
  GAVAM (Morency et al., 2008)        6.75    5.53    10.66   7.65    3.92
  Reg. for. (Fanelli et al., 2011b)   9.2     8.5     8.0     8.6     NA
  CLM-Z                               10.52   7.98    8.14    8.88    3.20

Table 6.1: Estimating head pose using the CLNF approach on the ICT-3DHP and the Biwi Kinect head pose datasets (errors in degrees). The clear benefit of the CLNF approach can be seen, even though it used only visible light.

6.5 Conclusions

I presented a Constrained Local Neural Field model for facial landmark detection and tracking. The approach leads to more accurate landmark detection and head pose estimation when compared to a number of state-of-the-art approaches.

The new LNF patch expert can exploit spatial relationships between patch response values, and learn non-linear relationships between pixel values and patch responses. It is able to learn from multiple illuminations and retain accuracy. This becomes important when creating landmark detectors and trackers that are expected to work in unseen environments and on unseen people. The improvement in accuracy comes from the modelling of non-linear relationships, rather than spatial relationships (vertex rather than edge features). However, the edge features have a beneficial effect when modelling time series in Chapter 8, demonstrating the broader applicability of the CCNF model. Finally, it achieves close-to-real-time or real-time performance (≈ 20 fps), bringing us closer to facial tracking in naturalistic environments.

7 Case study: Automatic expression analysis

7.1 Introduction

The previous chapters concentrated on making landmark detection and facial expression tracking more robust to changes in illumination and pose. This chapter will discuss how such tracked feature points could be used for the task of emotion recognition.

Most work in automated emotion recognition so far has focused on analysing the six discrete basic emotions: happiness, sadness, surprise, fear, anger and disgust. However, a single label (or multiple discrete labels from a small set) may not be sufficient for describing the complexity of an affective state. Consequently, there has been a move to analyse emotional signals along a small set of latent dimensions, providing a continuous rather than a categorical view of emotions. Examples of such affective dimensions are power (sense of control); valence (pleasant vs. unpleasant); activation (relaxed vs. aroused); and expectancy (anticipation). Affective computing researchers have started exploring the dimensional representation of emotion (Gunes and Pantic, 2010; Ramirez et al., 2011; Schuller et al., 2011).

The problem of dimensional affect recognition is often posed as a binary classification problem – active vs. passive; or even as a four-class one – classification into quadrants of a 2D space. In my work, however, the problem of dimensional affect recognition is one of regression.
In addition, most of the work so far has concentrated on analysing different modalities in isolation rather than looking for ways to fuse them (Gunes and Pantic, 2010; Zeng et al., 2009). This is partly due to the limited availability of suitably labelled multi-modal datasets and the difficulty of fusion itself: the optimal level at which the features should be fused is still an open research question (Gunes and Pantic, 2010; Lalanne et al., 2009; Zeng et al., 2009).

Conditional Random Fields (CRF) (Lafferty et al., 2001) and various extensions have proven very useful for emotion recognition tasks (Ramirez et al., 2011; Wöllmer et al., 2008). However, conventional CRFs cannot be directly applied to continuous emotion prediction, as they model the output as discrete rather than continuous. I propose the use of Continuous Conditional Random Fields (CCRF) (Qin et al., 2008) in combination with SVRs for the task of continuous emotion recognition.

The CCRF model is applied to the task of continuous dimensional emotion prediction on the AVEC 2012 subset of the SEMAINE dataset (Schuller et al., 2012). The benefits of using this approach for emotion recognition are demonstrated by comparing it to the SVR baseline. Furthermore, the Correlation Aware Continuous Conditional Random Field model (CA-CCRF), which exploits the correlations between the emotion dimensions, is proposed.

My work also demonstrates the benefit of using facial geometry features for spontaneous affect recognition from video sequences. Such features are often ignored in favour of appearance based features, thus losing useful emotional information (Jeni et al., 2012). The reason geometry features are rarely used is the difficulty of acquiring a neutral expression from which facial shape deformation can be measured. My work shows how to extract a neutral expression and demonstrates the utility of geometry alongside appearance for emotion prediction.

The main contributions of this chapter are as follows:

• A fully continuous CCRF emotion prediction model which exploits temporal properties of the emotion signal
• A CA-CCRF model which can exploit correlations between emotional dimensions
• A novel way to fuse multi-modal emotional data
• A demonstration of the utility of facial geometry for continuous affect recognition

The work presented in this chapter is the result of a collaboration with Ntombikayise Banda. I was responsible for the facial expression tracking and emotion modelling. Ntombikayise Banda extracted the appearance and audio features.

7.2 Background

Nicolaou et al. (2010) present a coupled Hidden Markov Model for classification of spontaneous affect based on audio-visual features. The model allows them to capture temporal correlations between different cues and modalities. They also show the benefits of using the likelihoods produced from separate (C)HMMs as input to another classifier. Such fusion is more accurate than picking the label with the maximum likelihood. The problem is framed as a classification rather than a regression one.

Nicolaou et al. (2012) propose the use of the Output-Associative Relevance Vector Machine (OA-RVM) for dimensional and continuous prediction of emotions based on automatically tracked facial feature points. Their proposed regression framework exploits the inter-correlation between the valence and arousal dimensions by including the initial output estimation in their model together with their input features.
In addition, OA-RVM regression attempts to capture the temporal dynamics of the output by employing a window that covers a set of past and future outputs.

Of particular relevance to my dissertation is the work done by Wöllmer et al. (2008). They use Conditional Random Fields (CRF) for discrete emotion recognition by quantising the continuous labels for valence and arousal, based on a selection of acoustic features. In addition, they use Long Short-Term Memory Recurrent Neural Networks to perform regression analysis on these two dimensions. Both of their approaches demonstrate the benefits of including temporal information for the prediction of emotions.

More recently, Ramirez et al. (2011) proposed the use of Latent Dynamic Conditional Random Fields (LDCRF). Their approach attempts to learn the hidden dynamics between input features by incorporating hidden state variables which can model the sub-structure of gesture sequences. Their approach was particularly successful in predicting dimensional emotions from the visual signal. However, the LDCRF model can model only discrete output variables, hence the problem was posed as a classification one.

7.3 Continuous CRF

In my work I model affect continuously, rather than quantising it. Furthermore, my models capture the temporal relationships between each time step, since emotion has temporal properties and is not instantaneous (el Kaliouby et al., 2003). A recent and promising approach which attempts to model such temporal relationships is the Continuous Conditional Random Field (CCRF) (Qin et al., 2008). It is an extension of the classic Conditional Random Field (CRF) (Lafferty et al., 2001) to the continuous case. I adapt the original CCRF model so it can be used for continuous emotion prediction.

7.3.1 Model definition

The CCRF is a discriminative undirected graphical model in which the conditional probability P(y|x) is modelled explicitly. This is in contrast to generative models, where a joint distribution P(y, x) is modelled. Discriminative approaches have shown promising results for sequence labelling and segmentation (Sutton and McCallum, 2006). The graphical model that represents the linear-chain CCRF for emotion prediction is shown in Figure 7.1.

The notation is as follows: $\mathbf{x}^{(q)} = \{x^{(q)}_1, x^{(q)}_2, \ldots, x^{(q)}_n\}$ is a set of observed input variables, $\{y^{(q)}_1, y^{(q)}_2, \ldots, y^{(q)}_n\}$ is a set of output variables to be predicted, and $n$ is the number of frames/time-steps in a sequence. $x^{(q)}_i \in \mathbb{R}^m$ and $y^{(q)}_i \in \mathbb{R}$, where $m$ is the number of predictors used. $q$ indicates the $q$th sequence of interest. The CCRF model for a particular sequence is a conditional probability distribution with the probability density function:

$P(\mathbf{y}|\mathbf{x}) = \frac{\exp(\Psi)}{\int_{-\infty}^{\infty} \exp(\Psi)\, d\mathbf{y}}$   (7.1)

$\Psi = \sum_i \sum_{k=1}^{K_1} \alpha_k f_k(y_i, \mathbf{x}) + \sum_{i,j} \sum_{k=1}^{K_2} \beta_k g_k(y_i, y_j, \mathbf{x})$   (7.2)

Above, $\mathbf{x} = \{x_1, x_2, \ldots, x_n\}$ is the set of input feature vectors (it can be represented as a matrix with per-frame observations as rows) and $\mathbf{y} = \{y_1, y_2, \ldots, y_n\}$ is the unobserved variable. $\int_{-\infty}^{\infty} \exp(\Psi)\, d\mathbf{y}$ is the normalisation (partition) function which makes the probability distribution a valid one (by making it integrate to 1). Following the convention of Qin et al. (2008), $f_k$ refers to vertex features, and $g_k$ to edge features. The model parameters $\alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_{K_1}\}$ and $\beta = \{\beta_1, \beta_2, \ldots, \beta_{K_2}\}$ are used for inference and need to be estimated during learning. This model is very similar to the CCNF introduced in Section 6.1.
7.3.2 Feature functions

Two types of features are defined for the linear-chain CCRF model: vertex features $f_k$ and edge features $g_k$:

$f_k(y_i, \mathbf{x}) = -(y_i - x_{i,k})^2,$   (7.3)

$g_k(y_i, y_j, \mathbf{x}) = -\frac{1}{2} S^{(k)}_{i,j} (y_i - y_j)^2.$   (7.4)

Vertex features $f_k$ represent the dependency between $x_{i,k}$ (the $k$th element of $x_i$) and $y_i$, for example the dependency between a static emotion prediction from a regressor and the actual emotion label. Intuitively, the $\alpha_k$ corresponding to vertex feature $f_k$ represents the reliability of the $k$th predictor. This is particularly useful for multi-modal fusion, as it models the reliability of a particular signal for a particular emotion. For example, the CCRF model could learn that the facial appearance might be more important in predicting valence than the audio signal.

Figure 7.1: Graphical representation of the CCRF model. $x_{i,k}$ represents the $k$th feature of the $i$th observation, and $y_i$ is the unobserved variable to be predicted. Dashed lines represent the connection of observed to unobserved variables ($f_k$ vertex features), so the first predictor is connected using $f_1$, whilst the $k$th predictor is connected using $f_k$. The solid lines show connections between the unobserved variables (edge features): the first connection is controlled by $g_1$, the $k$th connection is controlled by $g_k$. In the CCRF model all the output variables $y_i$ are connected to each other (edge functions can break connections by setting the appropriate $S_{i,j}$ to 0).

Edge features $g_k$ represent the dependencies between observations $y_i$ and $y_j$, for example how related the emotion prediction at time step $j$ is to the one at time step $i$. This is also affected by the similarity measure $S^{(k)}$. As a fully connected model is used, the similarities $S^{(k)}$ allow us to control the strength or existence of such connections. Two types of similarities are defined for emotion modelling:

$S^{(neighbour)}_{i,j} = \begin{cases} 1, & |i - j| = n \\ 0, & \text{otherwise} \end{cases}$   (7.5)

$S^{(distance)}_{i,j} = \exp\left(-\frac{\|x_i - x_j\|}{\sigma}\right)$   (7.6)

By varying $n$ it is possible to construct a family of similarities and to connect the observation $y_i$ not only to $y_{i-1}$, but also to $y_{i-2}$, and so on. By varying $\sigma$ another set of similarities is created. These similarities control how strong the connections between the $y$ terms should be based on how similar the $x$ terms are. This framework enables the easy creation of different similarity measures, which could be used in other applications. The learning phase of CCRF determines which of the similarities are important for the dataset of interest. For example, it can learn that for one emotion dimension neighbour similarities are more important than for others.

Following Radosavljevic et al. (2010) and Qin et al. (2008), the feature functions model the square error between a prediction and a feature. Therefore, each element of the feature vector $x_i$ should already be predicting the unobserved variable $y_i$. This can be achieved using any regression technique, such as Support Vector Regression, linear regression, or neural networks.
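To make the edge features concrete, the sketch below builds the neighbour and distance similarity matrices of Equations 7.5 and 7.6 for a single sequence. It is an illustrative NumPy sketch with hypothetical function names, not the implementation used in this work.

```python
import numpy as np

def neighbour_similarity(n_frames, n):
    """S(neighbour): 1 when |i - j| = n, 0 otherwise (Equation 7.5)."""
    idx = np.arange(n_frames)
    return (np.abs(idx[:, None] - idx[None, :]) == n).astype(float)

def distance_similarity(X, sigma):
    """S(distance): exp(-||x_i - x_j|| / sigma) (Equation 7.6).
    X has one observation (feature vector) per row."""
    diffs = X[:, None, :] - X[None, :, :]
    return np.exp(-np.linalg.norm(diffs, axis=2) / sigma)

# A family of similarities, for example several neighbour and several distance
# measures, gives the K2 edge features whose weights beta are learned in training.
X = np.random.randn(100, 4)                       # 100 frames, 4 predictors
similarities = [neighbour_similarity(len(X), n) for n in range(1, 6)]
similarities += [distance_similarity(X, 2.0 ** -s) for s in range(6, 11)]
```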
7.3.3 Learning

This section describes how to estimate the parameters $\{\alpha, \beta\}$ of a CCRF with quadratic vertex and edge functions. We are given training data $\{\mathbf{x}^{(q)}, \mathbf{y}^{(q)}\}_{q=1}^{M}$ of $M$ sequences, where each $\mathbf{x}^{(q)} = \{x^{(q)}_1, x^{(q)}_2, \ldots, x^{(q)}_n\}$ is a sequence of inputs and each $\mathbf{y}^{(q)} = \{y^{(q)}_1, y^{(q)}_2, \ldots, y^{(q)}_n\}$ is a sequence of real valued outputs. The matrix $X$ denotes the concatenated sequence of inputs.

CCRF learning picks the $\alpha$ and $\beta$ values that optimise the conditional log-likelihood of the CCRF on the training sequences:

$L(\alpha, \beta) = \sum_{q=1}^{M} \log P(\mathbf{y}^{(q)}|\mathbf{x}^{(q)})$   (7.7)

$(\bar{\alpha}, \bar{\beta}) = \arg\max_{\alpha, \beta} L(\alpha, \beta)$   (7.8)

As the problem is convex (Qin et al., 2008), the optimal parameter values can be determined using standard techniques such as stochastic gradient ascent.

The probability density of the CCRF (Equation 7.1) can be expressed as a multivariate Gaussian. This form helps with the explanation of inference and makes the derivation of the partial derivatives of Equation 7.7 easier. The process of converting the CCRF to Gaussian form is very similar to that of the CCNF in Section 6.1:

$P(\mathbf{y}|\mathbf{x}) = \frac{1}{(2\pi)^{\frac{n}{2}} |\Sigma|^{\frac{1}{2}}} \exp\left(-\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{y} - \boldsymbol{\mu})\right),$   (7.9)

$\Sigma^{-1} = 2(A + B).$   (7.10)

The diagonal matrix $A$ represents the contribution of the $\alpha$ terms (vertex features) to the covariance matrix, and the symmetric matrix $B$ represents the contribution of the $\beta$ terms (edge features):

$A_{i,j} = \begin{cases} \sum_{k=1}^{K_1} \alpha_k, & i = j \\ 0, & i \neq j \end{cases},$   (7.11)

$B_{i,j} = \begin{cases} \left(\sum_{k=1}^{K_2} \beta_k \sum_{r=1}^{n} S^{(k)}_{i,r}\right) - \left(\sum_{k=1}^{K_2} \beta_k S^{(k)}_{i,j}\right), & i = j \\ -\sum_{k=1}^{K_2} \beta_k S^{(k)}_{i,j}, & i \neq j \end{cases}.$   (7.12)

A further vector $\mathbf{b}$ is defined, which describes the linear terms in the distribution, and $\boldsymbol{\mu}$ is the mean value of the Gaussian CCRF distribution:

$b_i = 2 \sum_{k=1}^{K_1} \alpha_k X_{i,k},$   (7.13)

$\boldsymbol{\mu} = \Sigma \mathbf{b}.$   (7.14)

The partial derivatives of $\log P(\mathbf{y}|\mathbf{x})$ can now be derived (see Section 6.1 for a similar derivation of the partial derivatives of the CCNF):

$\frac{\partial \log P(\mathbf{y}|\mathbf{x})}{\partial \alpha_k} = -\mathbf{y}^T\mathbf{y} + 2\mathbf{y}^T X^T_{*,k} - 2X_{*,k}\boldsymbol{\mu} + \boldsymbol{\mu}^T\boldsymbol{\mu} + \mathrm{tr}(\Sigma),$   (7.15)

$\frac{\partial \log P(\mathbf{y}|\mathbf{x})}{\partial \beta_k} = -\mathbf{y}^T B^{(k)} \mathbf{y} + \boldsymbol{\mu}^T B^{(k)} \boldsymbol{\mu} + \mathrm{Vec}(\Sigma)^T \mathrm{Vec}(B^{(k)}),$   (7.16)

$B^{(k)}_{i,j} = \begin{cases} \left(\sum_{r=1}^{n} S^{(k)}_{i,r}\right) - S^{(k)}_{i,j}, & i = j \\ -S^{(k)}_{i,j}, & i \neq j \end{cases}.$   (7.17)

In order to guarantee that the partition function is integrable, the following constraints have to hold: $\alpha_k > 0$ and $\beta_k > 0$ (Qin et al., 2008; Radosavljevic et al., 2010). Such constrained optimisation can be achieved by using partial derivatives with respect to $\log \alpha_k$ and $\log \beta_k$ instead of just $\alpha_k$ and $\beta_k$. A regularisation term is also added in order to avoid over-fitting. The regularisation is controlled by the $\lambda_\alpha$ and $\lambda_\beta$ hyper-parameters (determined during cross-validation):

$\frac{\partial \log P(\mathbf{y}|\mathbf{x})}{\partial \log \alpha_k} = \alpha_k \left(\frac{\partial \log P(\mathbf{y}|\mathbf{x})}{\partial \alpha_k} - \lambda_\alpha \alpha_k\right),$   (7.18)

$\frac{\partial \log P(\mathbf{y}|\mathbf{x})}{\partial \log \beta_k} = \beta_k \left(\frac{\partial \log P(\mathbf{y}|\mathbf{x})}{\partial \beta_k} - \lambda_\beta \beta_k\right).$   (7.19)

Using these partial derivatives, a CCRF learning algorithm that uses stochastic gradient ascent can be derived. The pseudocode is provided in Algorithm 5.

Algorithm 5: CCRF learning algorithm
Require: $\{\mathbf{x}^{(q)}, \mathbf{y}^{(q)}, S^{(1)}_q, S^{(2)}_q, \ldots, S^{(K)}_q\}_{q=1}^{M}$
Params: number of iterations $T$, learning rate $\nu$, $\lambda_\alpha$, $\lambda_\beta$
Initialise parameters $\{\alpha, \beta\}$
for $r = 1$ to $T$ do
    for $i = 1$ to $N$ do
        Compute gradients of the current query (Eqs. (7.18), (7.19))
        $\log \alpha_k \leftarrow \log \alpha_k + \nu\, \partial \log P(\mathbf{y}|\mathbf{x}) / \partial \log \alpha_k$
        $\log \beta_k \leftarrow \log \beta_k + \nu\, \partial \log P(\mathbf{y}|\mathbf{x}) / \partial \log \beta_k$
        Update $\{\alpha, \beta\}$
    end for
end for
return $\{\bar{\alpha}, \bar{\beta}\} = \{\alpha, \beta\}$

7.3.4 Inference

Because the CCRF model can be viewed as a multivariate Gaussian, inferring the $\mathbf{y}$ values that maximise $P(\mathbf{y}|\mathbf{x})$ is straightforward. The prediction is the mean value of the distribution:

$\mathbf{y}^* = \arg\max_{\mathbf{y}} P(\mathbf{y}|\mathbf{x}) = \boldsymbol{\mu} = \Sigma \mathbf{b}$   (7.20)
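Since the distribution is a multivariate Gaussian, inference reduces to a handful of matrix operations. The NumPy sketch below assembles A, B and b and returns μ = Σb for one sequence, following Equations 7.10-7.14 and 7.20. It is illustrative only (hypothetical names, and none of the sparse or incremental computations a real implementation might use) and assumes the similarity matrices and learned parameters are already available.

```python
import numpy as np

def ccrf_predict(X, alphas, betas, similarities):
    """CCRF inference for one sequence.

    X            : (n, K1) matrix of per-frame predictor outputs
    alphas       : (K1,) learned vertex weights
    betas        : (K2,) learned edge weights
    similarities : list of K2 (n, n) similarity matrices S^(k)
    Returns the prediction y* = mu = Sigma b (Equations 7.10-7.14, 7.20).
    """
    n = X.shape[0]
    # A: diagonal contribution of the vertex features (Equation 7.11).
    A = np.eye(n) * alphas.sum()
    # B: contribution of the edge features (Equation 7.12).
    B = np.zeros((n, n))
    for beta_k, S_k in zip(betas, similarities):
        B += beta_k * (np.diag(S_k.sum(axis=1)) - S_k)
    # b: linear term (Equation 7.13), Sigma: covariance (Equation 7.10).
    b = 2.0 * X @ alphas
    Sigma = np.linalg.inv(2.0 * (A + B))
    return Sigma @ b
```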
One of the downsides of the similarity constraints in the CCRF is that inference can sometimes lead to an over-smoothed and dampened signal. This is more likely to happen if the input variables $\mathbf{x}$ are very noisy, leading the CCRF learning to trust temporal consistency much more than the $\mathbf{x}$ observations. To combat the over-smoothing, one could use very high $\lambda_\beta$ values to force the training to rely on the $\alpha$ predictions more than on the temporal elements controlled by $\beta$. However, this would come at the cost of retaining a noisy signal. Alternatively, a scaling term $s$ can be learned from the same training data (after the CCRF training is finished). The inference then becomes $\mathbf{y}^* = s \cdot \boldsymbol{\mu}$, leading to a correctly scaled signal.

Furthermore, if multiple CCRF models are to be trained (as is the case for dimensional emotions), the Z-scores of both the input $\mathbf{x}$ and output $\mathbf{y}$ variables can be used. This means that the same learning rate can be used for all of them. Normalisation also helps if one wants to use predictions from other dimensions in a single CCRF, as is done by CA-CCRF.

7.4 Video Features

In order to infer emotional state from the face one needs to track facial feature points and the head pose. In addition, knowing the landmark locations makes it possible to analyse the appearance around them. For tracking faces the CLM-GAVAM tracker (see Section 5.6) can be used.

7.4.1 Geometric features

In order to extract the geometric/shape features of facial expressions one needs to establish the neutral facial expression from which the expression is measured. The geometric configuration of the initial frame is not always reliable, as not all video sequences start with a neutral expression. In order to extract a neutral expression, a PDM which separates the expression and morphology subspaces can be used (Equation 7.21). Such a PDM is needed to decouple shape deformations arising from identity and expression.

The CLM model is described by the parameters $\mathbf{p} = [s, \mathbf{w}, \mathbf{q}_m, \mathbf{q}_e, \mathbf{t}]$, which can be varied to acquire various instances of the model: the scale factor $s$; object rotation $\mathbf{w}$ (axis-angle rotation); 2D translation $\mathbf{t}$; a vector describing the non-rigid variation of the identity shape $\mathbf{q}_m$; and the expression shape $\mathbf{q}_e$ (similar to a model used by Amberg et al. (2008)). The point distribution model (PDM) is:

$\mathbf{x}_i = s \cdot R(\bar{\mathbf{x}}_i + \Phi_i \mathbf{q}_m + \Psi_i \mathbf{q}_e) + \mathbf{t},$   (7.21)

where $\mathbf{x}_i = (x, y)$ denotes the 2D location of the $i$th feature point in an image and $\bar{\mathbf{x}}_i = (X, Y, Z)$ is the mean value of the $i$th element of the PDM in the 3D reference frame. The vector $\Phi_i$ is the $i$th eigenvector obtained from the training set that describes the linear variations of the non-rigid shape of this feature point in morphology space (constructed from the Basel 3DMM dataset (Paysan et al., 2009)). The vector $\Psi_i$ is the $i$th eigenvector obtained from the training set that describes the linear variations of the non-rigid shape in expression space (constructed from BU-4DFE (Yin et al., 2008)).

In order to fit the CLM using the split PDM, the model is first optimised with respect to the morphology parameters $\mathbf{q}_m$, followed by the expression parameters $\mathbf{q}_e$. After a frame is successfully tracked in a video sequence, the morphology parameters are fixed and only the expression parameters are optimised. Such optimisation can be performed using the RLMS or NU-RLMS methods. After the fitting has been performed, the expression parameters $\mathbf{q}_e$ describe the deformations due to expression.
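The sketch below illustrates how the split PDM of Equation 7.21 generates 2D landmark positions from the morphology and expression parameters. It assumes that R projects the 3D points to 2D through the first two rows of the rotation matrix built from the axis-angle vector w, and that the bases are stored as stacked (3N x m) matrices; these conventions, names and shapes are my own illustrative choices, not the dissertation's code.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pdm_points(p, mean_shape, Phi, Psi):
    """Generate 2D landmarks from the split PDM (Equation 7.21).

    p          : dict with scale 's', axis-angle rotation 'w' (3,),
                 translation 't' (2,), morphology 'q_m', expression 'q_e'
    mean_shape : (N, 3) mean 3D positions of the N landmarks
    Phi, Psi   : (3N, m) and (3N, e) bases for morphology and expression
    """
    n_points = mean_shape.shape[0]
    # Non-rigid part: mean shape deformed by identity and expression.
    shape_3d = (mean_shape.reshape(-1)
                + Phi @ p['q_m']
                + Psi @ p['q_e']).reshape(n_points, 3)
    # Rigid part: scaled rotation (first two rows project to 2D) plus translation.
    R = Rotation.from_rotvec(p['w']).as_matrix()[:2, :]
    return p['s'] * shape_3d @ R.T + p['t']
```

In this formulation the geometric feature vector for a tracked frame is simply the fitted q_e, with q_m held fixed once the subject's morphology has been estimated.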
7.4.2 Appearance-based features

In addition to face geometry, appearance captures emotional information as well. Appearance can be described using local binary patterns (LBPs), which have been widely used in facial analysis tasks due to their tolerance of illumination variations and their computational simplicity (Shan et al., 2009). The local binary code, as introduced by Ojala and Pietikainen (1996), can be defined for each pixel with respect to its neighbours as:

$LBP_P(x_c, y_c) = \sum_{n=0}^{P-1} s(i_n - i_c)\, 2^n, \quad s(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases},$   (7.22)

where $(x_c, y_c)$ is the pixel centre position, $P$ represents the number of neighbouring pixels, $i_n$ the intensity value of a neighbouring pixel and $i_c$ the intensity value of the centre pixel.

One extension of the LBP operator seeks to combine motion features with appearance features, thus incorporating the temporal dynamics of an image sequence (Zhao and Pietikainen, 2007). This is achieved by concatenating local binary patterns on three orthogonal planes (LBP-TOP): XY, XT and YT. The operator is expressed as $LBP\text{-}TOP_{P_{XY}, P_{XT}, P_{YT}, R_X, R_Y, R_T}$, where the notation $(P_{XY}, P_{XT}, P_{YT}, R_X, R_Y, R_T)$ denotes a neighbourhood of $P$ points equally sampled on a circle of radius $R$ on the XY, XT and YT planes, respectively. An LBP code is extracted from the XY, XT and YT planes for all pixels, and statistics of the three different planes are obtained and then concatenated into a single histogram. This is demonstrated in Figure 7.2. This technique incorporates spatial domain information through the XY plane, and spatio-temporal co-occurrence statistics through the XT and YT planes. A detailed explanation of the LBP-TOP feature can be found in Zhao and Pietikainen (2007).

Figure 7.2: a) Three planes from which spatio-temporal local features are extracted. b) LBP histogram from each plane. c) Concatenated feature histogram. Taken from Zhao and Pietikainen (2007).

In my work, I used the facial feature points from the CLM-GAVAM tracker to extract frontal faces from an image sequence. In order to extract a frontal face, perspective warping was used from the currently tracked points to the neutral reference frame, also ensuring size uniformity. The extracted faces were divided into a 3×3 non-overlapping grid, and LBP-TOP features were extracted for each block in the grid. Uniform patterns were applied, producing $P(P-1) + 3$ output labels (instead of $2^P$), resulting in a significant dimensionality reduction to a 59-dimensional histogram per image block (for $P = 8$, $R = 3$). A complete feature vector was obtained by concatenating the block histograms for each plane, resulting in a 1593-dimensional vector.
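A minimal sketch of the basic LBP operator of Equation 7.22 for a single-scale, 8-connected neighbourhood is given below; LBP-TOP applies the same idea on the XY, XT and YT planes of an image volume and histograms the codes per block. This is an illustrative implementation (without the circular sampling at radius R or the uniform-pattern mapping used in this chapter), not the code used to produce the features.

```python
import numpy as np

def lbp_8(image):
    """Basic 8-neighbour LBP code for every interior pixel (Equation 7.22).

    image : 2D array of intensities.
    Returns an array of codes in [0, 255] for pixels with a full neighbourhood.
    """
    centre = image[1:-1, 1:-1]
    # Offsets of the 8 neighbours, ordered so that neighbour n contributes 2**n.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(centre, dtype=np.uint8)
    for n, (dy, dx) in enumerate(offsets):
        neighbour = image[1 + dy:image.shape[0] - 1 + dy,
                          1 + dx:image.shape[1] - 1 + dx]
        codes |= ((neighbour >= centre).astype(np.uint8) << n)
    return codes

# A per-block appearance descriptor is then the histogram of these codes,
# e.g. np.histogram(lbp_8(block), bins=256)[0] for each cell of a 3x3 grid.
```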
7.4.3 Motion features

Head gestures are an integral part of human communication, as they convey a range of meanings and emotions. They involve a range of dynamics such as head orientation, rhythmic patterns, amplitude and speed of movement, which act as indicators of affective states.

The CLM-GAVAM tracker estimates 6 degrees of freedom of head pose, corresponding to head rotation and translation. The variation intensity of head motion was tracked by calculating the standard deviation of the rotational and translational parameters, a measure which takes into account the amplitude range and speed of change in head motion. In addition to these statistics, the Euclidean norm of all rotational parameters and that of the translational parameters were added to describe the overall head movement. This resulted in the following 8-dimensional feature vector:

$[\sigma_{r_x}, \sigma_{r_y}, \sigma_{r_z}, \sigma_{t_x}, \sigma_{t_y}, \sigma_{t_z}, \sigma_{r_{xyz}}, \sigma_{t_{xyz}}],$

where $r$ corresponds to the rotation parameters and $t$ to the translation parameters.

7.5 Audio Features

Vocal affect recognition analyses how things are said by extracting non-verbal information from speech. Scherer et al. (1989) state that emotion may produce changes in respiration, phonation and articulation, which in turn affect the acoustic features of the signal. Therefore, variations in acoustic measures contribute to our ability to discriminate between different emotional states. I adopted the prosodic features used by Ozkan et al. (2012). Table 7.1 lists these features and provides motivations for their choice. Details of their extraction algorithms can be found in Ozkan et al. (2012).

Table 7.1: Description of the audio features used in this work.

Energy (in dB): reflects the perceived loudness of the speech signal. Motivation: has been found to have a high, positive correlation with arousal (Pereira, 2000), with increased intensity correlating well with valence (Schröder, 2004).

Articulation rate: is calculated by identifying the number of syllables per second. Motivation: has been found to be positively correlated with arousal (Schröder, 2004).

Fundamental frequency (f0): is the base frequency of the speech signal (that is, the frequency at which the vocal folds are vibrating during voiced speech segments). Motivation: has been found to have a high, positive correlation with arousal (Pereira, 2000); and a positive correlation between lower f0 and power (Schröder, 2004).

Peak slope: is a measure suitable for the identification of breathy to strained voice qualities. Motivation: there is evidence of a positive correlation between 'warm' voice quality and valence (Schröder, 2004).

Spectral stationarity: captures the fluctuations and changes in the voice signal; a measure of the speech monotonicity. Motivation: monotonicity in speech is associated with low activity and negative valence (Davidson et al., 2003).

7.6 Final system

The final emotion prediction system proposed is shown in Figure 7.3. The model depends on the per-time-step predictions from the previous layer. SVR is used, but this could be replaced by any other continuous predictor, such as linear regression or an artificial neural network. The features that were used with each SVR are explained in more detail in Sections 7.4 and 7.5. The CCRF model used is explained in Section 7.3.

Figure 7.3: Final continuous emotion recognition system, which combines support vector regressors with continuous conditional random fields. The number of SVRs used can be varied, and depends on the experiment.

The CCRF can employ any number of SVR predictors, and various combinations of them are explored in the evaluation section. Firstly, a system that just uses a prediction from an audio-visual SVR as its input (K = 1) was tested. Secondly, four SVR predictors (audio, shape, appearance, and pose) of the same dimension (K = 4) were used. Finally, as the emotional dimensions do not form an orthogonal set, the correlations between them were exploited using a Correlation Aware CCRF (CA-CCRF). This was achieved by including SVR predictions from the other dimensions alongside the corresponding SVRs. Both the original and negated SVR predictions (from the valence, arousal, expectancy and power dimensions) were used when training the four CA-CCRFs (K = 32 for each). This allowed me to capture both positive and negative correlations. In order to account for the fact that the dimensions have different scalings and different offsets, the Z-scores of $X^{(q)}_{*,k}$ and $\mathbf{y}^{(q)}$ were used instead of the raw values for training and inference.
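The two-layer structure of the final system can be summarised as follows: per-modality SVR predictions form the columns of the CCRF input matrix, and for CA-CCRF the Z-scored predictions for the other dimensions, together with their negations, are appended as additional columns. The sketch below is purely illustrative: the regressor objects, names and the ccrf_predict helper (from the sketch in Section 7.3.4) are hypothetical, and the exact column layout of the real system may differ.

```python
import numpy as np

def build_ccrf_input(frame_features, svrs, other_dims=None):
    """Stack per-frame regressor predictions into the CCRF input matrix X.

    frame_features : dict modality -> (n_frames, d_modality) feature matrix
    svrs           : dict modality -> fitted regressor with a .predict() method
    other_dims     : optional dict of already Z-scored per-frame predictions for
                     the other emotion dimensions; for CA-CCRF both the original
                     and the negated versions are appended as extra columns.
    """
    cols = [svrs[m].predict(frame_features[m]) for m in sorted(svrs)]
    if other_dims is not None:
        for name in sorted(other_dims):
            cols.append(other_dims[name])
            cols.append(-other_dims[name])
    return np.column_stack(cols)

# Inference then follows the earlier CCRF sketch, e.g.
#   X = build_ccrf_input(features, svrs, other_dims)
#   y_hat = ccrf_predict(X, alphas, betas, similarities)
```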
7.7 Evaluation

7.7.1 Database

The proposed CCRF framework was evaluated using the dataset distributed through the AVEC 2012 Emotion Challenge (Schuller et al., 2012). This dataset forms part of the Solid SAL section of the SEMAINE database, which contains naturalistic dialogues between two human participants, with one of the participants simulating an artificial listener agent. The dataset was, however, partitioned differently from the challenge. The recordings were split into three partitions: training set I (for SVR training), training set II (for CCRF training) and a test set (for evaluation), with 21, 20 and 18 video sessions in each partition, respectively. The interactions were annotated by at least two raters along the dimensions arousal, valence, power and expectancy.

7.7.2 Methodology

The video features were extracted at a frame rate of 50 frames per second and down-sampled by employing a block averaging technique with a block size of 25 frames. The audio features were computed at 100 Hz and down-sampled for alignment purposes. Linear-kernel L2-loss ε-SVRs with L2 regularisation were used as the initial CCRF layers. The training was performed using the Liblinear package (Fan et al., 2008). The SVR hyper-parameters were optimised using five-fold cross-validation on training set I. Prediction labels were generated from each feature-type SVR model for the remaining two partitions for further CCRF training and inference. Training set II was used to determine the CCRF and CA-CCRF parameters $(\bar{\alpha}, \bar{\beta})$ and to cross-validate the regularisation hyper-parameters $\lambda_\alpha$ and $\lambda_\beta$ (over the grid $\{10^{-2}, 10^{0}, 10^{2}, 10^{4}, 10^{6}\}$). Ten edge features ($g_k$) were used for all experiments: 5 neighbour similarities with $n = \{1, 2, \ldots, 5\}$ and 5 distance similarities with $\sigma = \{2^{-6}, 2^{-7}, \ldots, 2^{-11}\}$. The learned $\bar{\beta}$ weights that model the temporal and spatial similarities of the signals, the channel reliability measures $\bar{\alpha}$, and the SVR predictions were used to predict unseen data (the test set). The continuous emotion label predictions were then up-sampled to the original video frame rate through linear interpolation. An example of a CCRF prediction is shown in Figure 7.4.

Figure 7.4: A plot of a standardised CA-CCRF valence prediction against the ground truth from the test partition.

Baseline SVR models were trained using both training sets I and II to ensure that the baseline and the CCRF models were exposed to the same training data. Uni-modal SVR models and a multi-modal SVR model (through early fusion) were trained for comparison with the CCRF framework.

7.7.3 Results

The model's performance was measured using Pearson's correlation coefficient (r), following the AVEC 2012 emotion challenge evaluation strategy. The results were obtained by computing the correlation coefficient between the predicted labels and the ground truth labels per character interaction and per dimension, and calculating the average over all sessions. The following sections present the results of the experiments conducted to evaluate the ability of CCRF to predict emotions.
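The evaluation measure itself is compact: Pearson's r is computed per session (character interaction) and per dimension, and the per-session values are averaged. The short sketch below illustrates it; it is not the challenge's evaluation script, and the names are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

def mean_session_correlation(predictions, ground_truth):
    """Average Pearson's r over sessions for one emotion dimension.

    predictions, ground_truth : lists of 1D arrays, one pair per session.
    """
    rs = [pearsonr(p, g)[0] for p, g in zip(predictions, ground_truth)]
    return float(np.mean(rs))
```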
Feature-type analysis

Figure 7.5: Comparison of the correlation results for each feature SVR model per dimensional affect.

Figure 7.5 illustrates the performance of the feature SVR models for each dimensional emotion. The graph suggests that the appearance based features (temporal LBPs) are a better estimator of valence and arousal, and that the audio features provide better predictions for power and expectancy. Apart from the arousal dimension, the shape (geometry) features do not perform much worse than the appearance ones, highlighting their potential as comparable estimators of emotion. The generally low correlation is indicative of the challenging task of working with naturalistic data and the variety of expressions that can be associated with an affective state.

Model and modality comparisons

Three types of modalities were investigated for both the baseline SVR and CCRF models: audio, video and audio-visual. Table 7.2 presents a comparative view of the three modalities for each model type. The results show that the CCRF model significantly outperforms the baseline SVR in all modalities and dimensions. This attests to the importance of temporal data in the analysis and recognition of emotion, and the success of the CCRF model in capturing these dynamics.

Table 7.2: Correlation results of the baseline SVR and CCRF models evaluated on the test partition.

Model                      Val.    Arous.  Pow.    Expect.  Mean
Video features
  SVR                      0.176   0.234   0.100   0.120    0.158
  CCRF                     0.311   0.294   0.171   0.214    0.248
Audio features
  SVR                      0.062   0.053   0.103   0.104    0.081
  CCRF                     0.064   0.166   0.297   0.277    0.201
Audio-visual features
  SVR                      0.170   0.241   0.132   0.127    0.168
  CCRF                     0.326   0.341   0.273   0.248    0.297

Consistent with other studies (Nicolaou et al., 2012; Ozkan et al., 2012), it can be seen that visual features are better predictors of valence and that audio features perform better for the power dimension. However, in contrast to previous findings, the arousal state seems to be better predicted by visual rather than audio features.

Lastly, the CCRF model succeeded in fusing the audio and visual modalities, with the overall results of the audio-visual CCRF outperforming the individual single-modality CCRFs.

Fusion strength of CCRF

Table 7.3 contrasts the use of a fused audio-visual SVR (K = 1) with the use of several SVR predictors (K = 4) as input to the CCRF model. The results show that fusing within the CCRF framework is better than providing fused predictors, therefore highlighting one of the strengths of the CCRF model: the information gain from using signal dynamics for fusion.

Table 7.3: Investigating the fusion ability of CCRF with fused and non-fused predictor inputs.

CCRF inputs       Val.    Arous.  Pow.    Expect.  Mean
1 fused SVR       0.305   0.239   0.110   0.275    0.232
4 feature SVRs    0.326   0.341   0.273   0.248    0.297

Correlations between dimensions

With reference to Table 7.4, it can be seen that the CA-CCRF model outperforms the regular CCRF for some dimensions. The effect of using CA-CCRF is especially beneficial for the power dimension. This is not surprising, as in the dataset used power correlates with the other dimensions (r = 0.25 with valence, r = 0.43 with arousal and r = −0.46 with expectancy).

Table 7.4: Comparison of the CCRF and CA-CCRF model performances on the test partition.

Model      Val.    Arous.  Pow.    Expect.  Mean
CCRF       0.326   0.341   0.273   0.248    0.297
CA-CCRF    0.343   0.333   0.309   0.218    0.301

7.8 Conclusion

This chapter presented a CCRF model that can be used to model continuous dimensional emotion. It can easily incorporate multiple simple predictors and exploits temporal correlations between time steps and different modalities.
Furthermore, the model can easily be extended to include various other similarity functions that capture the dynamic nature of the signals. It also allows high-order paths to be defined, exploiting long and short range dependencies of time series. The model is also able to exploit correlations between emotional dimensions, leading to better prediction for some dimensions. The compact and simple CCRF design allows for applications in other domains with dynamic properties.

Further work

Future research might benefit from a comparison of using CCNF instead of a CCRF model for emotion prediction from audio-visual signals. This would make the training more straightforward, as only one model would need to be trained. Furthermore, joint learning of the parameters might benefit the training. However, it is unclear how correlations between dimensions could be exploited by a CCNF model.

8 Case study: Emotion analysis in music

8.1 Introduction

So far in this dissertation I have concentrated on describing emotion recognition in humans (mainly from facial expressions and head pose). In this chapter I demonstrate how some of the tools developed in the previous chapters can also be used for emotion prediction in music.

Music surrounds us every day, and people's interaction with it is becoming increasingly digitised: buying digital music albums, streaming music, etc. (BPI, 2013). This introduces a need to develop better tools for music search, playlist generation and the general management of music libraries. People use a number of different descriptors for songs, including emotion (Bainbridge et al., 2003).

In the same way as automatic emotion analysis from faces, the field of emotion recognition in music began by focusing on assigning a single categorical label to an entire piece of music. However, it has been slowly moving towards more complex techniques, with an increasing focus on the dimensional representation of emotion, and continuous emotion tracking. Both of these, especially when combined, require more advanced machine learning techniques. However, until recently, there have been only a few approaches that could tackle this problem (see Section 8.2).

This chapter demonstrates how the Continuous Conditional Neural Field (CCNF) model (Section 6.1) can be applied to the problem of continuous dimensional emotion tracking in music. I compared the performance of CCNF with SVR and CCRF models. The experiments demonstrate that CCNF outperforms the other models in most cases.

The work presented in this chapter is the result of a collaboration with Vaiva Imbrasaitė. I was responsible for the emotion modelling; Vaiva Imbrasaitė extracted the audio features.

8.2 Background

Dimensional emotion representation describes emotion using several axes. In the field of emotion in music, the most commonly used dimensions are arousal and valence (AV). Adding other axes (such as expectancy and power) has also been considered, but it has repeatedly been shown that they add little to the description or the recognition of emotion in music (MacDorman, 2007).

Most of the approaches to dimensional continuous emotion tracking in music have focused on inferring the emotion label over a time window, which is independent of the surrounding music (the bag-of-frames approach) (Korhonen et al., 2006; Panda and Paiva, 2011; Schmidt and Kim, 2010a; Schmidt et al., 2010). However, these approaches failed to exploit the temporal properties of music.
Some research has been done on trying to incorporate temporal information into the feature vector, either by using features extracted over varying window lengths for each second/sample (Schubert, 2004), or by using machine learning techniques adapted for sequential learning. Examples include the sequential stacking algorithm used by Cohen and Carvahlo (2005), and the Kalman filtering or Conditional Random Fields (CRF) used by Schmidt and Kim (2010b, 2011).

8.3 Linear-chain Continuous Conditional Neural Fields

8.3.1 Model definition

It is possible to adapt the CCNF model introduced in Section 6.1 to use temporal rather than spatial relationships. The model is illustrated in Figure 8.1, and is called the linear-chain CCNF. This model is also an extension of the linear-chain CCRF introduced in Section 7.3.

Figure 8.1: The linear-chain CCNF model compared to the linear-chain CCRF one (Section 7.3). The input vector $x_i$ is connected to the relevant output scalar $y_i$ through the vertex features that combine the $h_i$ neural layers (gate functions) and the vertex weights $\alpha$. The outputs are further connected with edge features $g_k$.

The model has two types of features: vertex features $f_k$ and edge features $g_k$. The potential function is defined as:

$\Psi = \sum_i \sum_{k=1}^{K_1} \alpha_k f_k(y_i, \mathbf{x}_i, \theta_k) + \sum_{i,j} \sum_{k=1}^{K_2} \beta_k g_k(y_i, y_j).$   (8.1)

The vertex features $f_k$ represent the mapping from $\mathbf{x}_i$ to $y_i$ through a one-layer neural network, where $\theta_k$ is the weight vector for a particular neuron $k$:

$f_k(y_i, \mathbf{x}_i, \theta_k) = -(y_i - h(\theta_k, \mathbf{x}_i))^2,$   (8.2)

$h(\theta, \mathbf{x}_i) = \frac{1}{1 + e^{-\theta^T \mathbf{x}_i}}.$   (8.3)

The number of vertex features $K_1$ is determined experimentally during cross-validation. The values tried during cross-validation were $K_1 = \{5, 10, 20, 30\}$.

The edge features $g_k$ represent the similarities between observations $y_i$ and $y_j$. The existence and strength of the edge connections is controlled by the neighbourhood measure $S^{(k)}$. In the linear-chain CCNF model, $g_k$ enforces smoothness between neighbouring nodes. A single edge feature is defined, i.e. $K_2 = 1$, and $S^{(1)}$ is defined to be 1 only when the two nodes $i$ and $j$ are neighbours in a chain, and 0 otherwise:

$g_k(y_i, y_j) = -\frac{1}{2} S^{(k)}_{i,j} (y_i - y_j)^2.$   (8.4)

The linear-chain CCNF was used for emotion prediction in music. For training, song samples together with their corresponding dimensional continuous emotion labels were used. The dimensions were trained separately. See Section 6.1 for more details on the model and on its learning and inference.
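The sketch below spells out the neural vertex features of Equations 8.2 and 8.3: each of the K1 gate functions is a logistic neuron over the input vector, and, with the edge features switched off, the model's mean prediction for a frame reduces to the α-weighted average of the gate outputs. With the edge features switched on, these terms instead enter the linear term b of the same Gaussian form used for the CCRF in Section 7.3. The code is an illustrative NumPy sketch with my own naming, not the implementation used in the experiments.

```python
import numpy as np

def gates(Theta, x):
    """Neural layer h(theta_k, x) for all K1 gates (Equation 8.3).

    Theta : (K1, d) weight matrix, one row per gate function
    x     : (d,) input feature vector for one frame
    """
    return 1.0 / (1.0 + np.exp(-Theta @ x))

def ccnf_frame_mean(Theta, alphas, x):
    """Per-frame mean prediction with the edge features switched off:
    the alpha-weighted average of the gate outputs."""
    h = gates(Theta, x)
    return float(alphas @ h / alphas.sum())
```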
8.4 Evaluation

A number of experiments were performed to assess the accuracy of the CCNF model when compared to several other baselines.

8.4.1 Dataset

The dataset used in the experiments is the only publicly available emotion tracking dataset of music extracts labelled on the arousal-valence dimensional space. The data (Speck et al., 2011) has been labelled using Mechanical Turk (MTurk, https://www.mturk.com/, accessed May 2013). The paid participants were asked to label 15-second excerpts with continuous emotion ratings on the AV space, with another 15 seconds given as a practice run for each song. The songs in the dataset cover a wide range of genres: pop, various types of rock, and hip-hop/rap, and are drawn from the "uspop2002" database of popular songs (http://labrosa.ee.columbia.edu/projects/musicsim/uspop2002.html, accessed May 2013). The dataset consists of 240 15-second clips (without the practice run), with µ = 16.9, σ = 2.7 ratings for each clip. In addition, the dataset contains a standard set of features extracted from those musical clips: MFCCs, octave-based spectral contrast, statistical spectrum descriptors (SSD), chromagram, and a set of EchoNest (http://developer.echonest.com/, accessed May 2013) features.

8.4.2 Baselines

Several baselines were used for comparison against the CCNF model. The first baseline was the linear-chain CCRF. A single neighbour similarity feature was defined (as in the case of CCNF). As additional baselines, linear and RBF kernel Support Vector Regressors were used; they have been used extensively for emotion prediction in music.

8.4.3 Error Metrics

Three different evaluation metrics were used in the experiments: correlation, root-mean-square error (RMSE) and Euclidean distance. Both the correlation coefficient and RMSE were calculated in two modes: short and long. Long evaluation metrics were calculated over the span of the whole dataset, essentially concatenating all of the songs into one. Short evaluation metrics were calculated over each song and then averaged over all of the songs. The short correlation metric is non-squared, so as not to hide any potential negative correlation. Short metrics might be better suited for the evaluation of emotion recognition in music, as there has been some evidence that people agree more with algorithms that optimise the short RMSE (Imbrasaitė et al., 2013b). Long metrics are reported as well, since these are usually reported in the literature. The average Euclidean distance was calculated as the distance between the two-dimensional position of the original label and the predicted label in the normalised AV space (each axis normalised to span between 0 and 1). Each metric was calculated for each fold and the average over 5 folds is reported. Observe that lower RMSE and Euclidean distance values correspond to better performance, while the opposite is true for correlation.
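The three evaluation measures can be written down directly; the sketch below computes the long and short variants of RMSE and correlation, and the average Euclidean distance on the normalised AV space. It is an illustrative NumPy/SciPy sketch with hypothetical names, not the evaluation code used in the experiments.

```python
import numpy as np
from scipy.stats import pearsonr

def long_metrics(pred_per_song, gt_per_song):
    """Concatenate all songs into one sequence (long RMSE and correlation)."""
    p, g = np.concatenate(pred_per_song), np.concatenate(gt_per_song)
    return np.sqrt(np.mean((p - g) ** 2)), pearsonr(p, g)[0]

def short_metrics(pred_per_song, gt_per_song):
    """Per-song RMSE and (non-squared) correlation, averaged over songs."""
    rms = [np.sqrt(np.mean((p - g) ** 2)) for p, g in zip(pred_per_song, gt_per_song)]
    corr = [pearsonr(p, g)[0] for p, g in zip(pred_per_song, gt_per_song)]
    return float(np.mean(rms)), float(np.mean(corr))

def mean_euclidean_distance(pred_av, gt_av):
    """Average distance between predicted and labelled points in the AV space,
    with each axis already normalised to [0, 1]."""
    return float(np.mean(np.linalg.norm(pred_av - gt_av, axis=1)))
```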
8.4.4 Design of the experiments

For the purpose of this study, only the non-EchoNest features provided in the MTurk dataset (Section 8.4.1) were used. The features were averaged over a one second window, and the average of the labels for that second was used as the ground truth label. A separate model was trained for each emotional dimension: one for arousal and one for valence.

Features

Four types of features were used: MFCC, chromagram, spectral contrast and SSD. They were concatenated into a single vector. Their Z-scores were calculated on the training set (and the same scalings were used on the test set).

Cross-validation

A 5-fold cross-validation was used for all of the experiments. The dataset was split into two parts: 4/5 for training and 1/5 for testing; this process was repeated 5 times. When splitting the dataset into folds, it was made sure that all of the feature vectors from a single song were in the same fold. The album and artist information was ignored, as it has been shown to have no effect on this particular dataset (Imbrasaitė et al., 2013a). The reported results were averaged over the 5 folds.

For the SVR-based experiments, 2-fold cross-validation (splitting into equal parts) was used on the training dataset to choose the hyper-parameters. These were then used for training on the whole training dataset.

The process for the CCRF-based experiments contained an extra step. The training dataset was split into two parts, one for SVR and one for CCRF, and 2-fold cross-validation was performed on them individually to learn the hyper-parameters.

For the CCNF-based experiments, 2-fold cross-validation was used to pick the hyper-parameters, but the results were averaged over 4 random seed initialisations. The chosen hyper-parameters were used for training on the whole dataset. The model was randomly initialised 20 times (using the best hyper-parameters) and the model with the highest likelihood (Equation 6.7) was picked for testing.

It is important to note that the same folds were used for all of the experiments, and that the testing data were always kept separate from the training process.

8.4.5 Results

CCNF consistently outperforms all of the other methods on all of the evaluation metrics except for the short correlation for valence, where CCRF performs better (Tables 8.1 and 8.2). Not only is the performance improved, but the results are substantially better than those of the other methods.

Table 8.1: Results comparing the CCNF approach to the CCRF and SVR with linear and RBF kernels (long evaluation metrics and average Euclidean distance).

Model      Arousal rms   Arousal corr.   Valence rms   Valence corr.   Euclidean distance
SVR-Lin    0.196         0.634           0.222         0.173           0.130
SVR-RBF    0.194         0.645           0.220         0.211           0.128
CCRF       0.204         0.721           0.223         0.247           0.136
CCNF       0.166         0.739           0.205         0.301           0.116

Table 8.2: Results comparing the CCNF approach to the CCRF and SVR with linear and RBF kernels. The metrics used are short correlation and short root-mean-square error.

Model      Arousal rms   Arousal corr.   Valence rms   Valence corr.
SVR-Lin    0.180         0.012           0.189         0.036
SVR-RBF    0.178         0.011           0.186         0.007
CCRF       0.176         0.049           0.183         0.090
CCNF       0.143         0.072           0.170         0.019

Since neural network-based models are particularly sensitive to the size of the feature vector used, the effect of a smaller feature vector was explored by omitting a class of features. As can be seen from Tables 8.3 and 8.4, not including the chromagram, octave-based spectral contrast or MFCC features improves the results even further (compare with Tables 8.1 and 8.2).

Table 8.3: Results of CCNF with smaller feature vectors.

Model          Arousal rms   Arousal corr.   Valence rms   Valence corr.   Euclidean distance
W/o Chroma     0.167         0.737           0.207         0.285           0.116
W/o Contrast   0.164         0.743           0.208         0.285           0.116
W/o MFCC       0.175         0.707           0.200         0.315           0.117

Table 8.4: Results (short evaluation metrics) of CCNF with smaller feature vectors.

Model          Arousal rms   Arousal corr.   Valence rms   Valence corr.
W/o Chroma     0.144         0.068           0.172         0.046
W/o Contrast   0.143         0.047           0.169         0.040
W/o MFCC       0.150         0.032           0.164         0.089

8.5 Discussion

This chapter introduced an adaptation of CCNF – the linear-chain CCNF. This model is particularly well suited for dimensional continuous emotion tracking.

The results achieved with this linear-chain CCNF are encouraging. It consistently outperformed the other models, both the standard baseline used in the field (SVR) and the more advanced CCRF model. These experiments demonstrate the applicability of CCNF to time-series modelling, alongside its usefulness as a patch expert.

Schmidt and Kim (2010b) used the same dataset for their experiments and had a similar experimental design to the one presented in this chapter. They reported mean Euclidean distances of 0.160-0.169, which is within the same order of magnitude as the best average Euclidean distance of 0.116 achieved in the CCNF experiments. Unfortunately, even though the same dataset is used, the experimental design is slightly different, so concrete conclusions are difficult to draw.
The experiments with smaller feature vectors showed that the size of the feature vector plays a major role in the performance of CCNF. The fact that better results were achieved by omitting a whole class of features suggests that there is some redundancy between the features. Future work could therefore investigate feature selection or sparsity-enforcing techniques.

9 Conclusions

9.1 Contributions

The main goal of this dissertation was to bring facial expression analysis in real-world environments closer to reality. My work has demonstrated multiple ways of making facial tracking techniques work better under varying illumination and pose.

The main contributions of this dissertation are as follows. Firstly, I explored and extended the Constrained Local Model (CLM) for facial tracking in difficult conditions. Secondly, I presented the 3D Constrained Local Model (CLM-Z), a CLM-based tracker that takes full advantage of depth information alongside visible light data. Thirdly, I developed the Constrained Local Neural Field (CLNF), a facial tracking model that is especially suited to conditions where pose and illumination variations are expected. Finally, I demonstrated how these trackers can be used for emotion recognition in dimensional space. A brief description of these contributions follows.

9.1.1 Constrained Local Model extensions

I have presented a detailed analysis of the CLM for facial tracking. I extended it to use a multi-scale formulation and demonstrated how it can take into account the different reliabilities of patch experts by using non-uniform regularised landmark mean shift. Finally, I identified a number of challenges still facing CLM-based landmark detection: changes in illumination, extreme pose and extreme expressions.

9.1.2 3D Constrained Local Model

I introduced CLM-Z, which uses depth information alongside visible light data. I demonstrated how the training data for such a model can be generated synthetically. Furthermore, I presented a novel normalisation function that allows CLM-Z to deal with missing data in the depth signal. The model was extensively evaluated on public datasets, demonstrating its superiority over the regular CLM and a number of other head pose trackers.

9.1.3 Continuous Conditional Neural Field

I introduced the Continuous Conditional Neural Field graphical model. It is a regressor that can learn complex non-linear relationships and exploit some of the temporal and spatial characteristics of a signal. One of its instances, the Local Neural Field, can be used as a patch expert in the CLM framework and is particularly suited to landmark detection under difficult illumination. Another instance, the linear-chain CCNF, can be successfully used for emotion prediction from music.

9.1.4 Emotion inference in continuous space

I demonstrated how the facial landmark and head pose tracker developed throughout this dissertation can be used for emotion inference in continuous space. This was accomplished by applying Continuous Conditional Random Fields to features extracted from the face together with some acoustic features.

9.2 Future work

My work addressed a number of existing issues in the field of facial tracking. It also points to certain areas which could benefit from further research.

First of all, my work did not explicitly address landmark detection under extreme expressions, such as screaming and yawning. Exploring suitable priors or even different shape models might address this problem.
CLM is very dependent on good initialisation. There are few face detectors that deal effectively with faces at various poses, and those that do exist are too slow to be of practical use (Zhu and Ramanan, 2012). Fast face detection in the wild and across pose therefore remains an unsolved problem.

Finally, my work evaluated facial tracking on laboratory-collected data. Further research is needed to see how well the algorithms would generalise to completely unconstrained environments.

Bibliography

Nalini Ambady and Robert Rosenthal. Thin slices of expressive behavior as predictors of interpersonal consequences: a meta-analysis. Psychological Bulletin, 111(2):256–274, 1992.

Brian Amberg, Reinhard Knothe, and Thomas Vetter. Expression invariant 3D face recognition with a morphable model. In IEEE International Conference on Automatic Face and Gesture Recognition, 2008.

Ahmed Bilal Ashraf, Simon Lucey, Jeffrey F. Cohn, Tsuhan Chen, Zara Ambadar, Kenneth M. Prkachin, and Patricia E. Solomon. The painful face: Pain expression recognition using active appearance models. Image and Vision Computing, 27:1788–1796, 2009.

Hillel Aviezer, Ran R. Hassin, Jennifer Ryan, Cheryl Grady, Josh Susskind, Adam Anderson, Morris Moscovitch, and Shlomo Bentin. Angry, disgusted, or afraid? Studies on the malleability of emotion perception. Psychological Science, 19(7):724–732, 2008.

David Bainbridge, Sally Jo Cunningham, and J. Stephen Downie. How people describe their music information needs: A grounded theory analysis of music queries. In International Conference on Music Information Retrieval, 2003.

Tadas Baltrušaitis and Peter Robinson. Analysis of colour space transforms for person independent AAMs. In The ACM / SSPNET 2nd International Symposium on Facial Analysis and Animation, page 21, 2010.

Tadas Baltrušaitis, Daniel McDuff, Ntombikayise Banda, Marwa M. Mahmoud, Rana el Kaliouby, Rosalind W. Picard, and Peter Robinson. Real-time inference of mental states from facial expressions and upper body gestures. In IEEE International Conference on Automatic Face and Gesture Recognition, Facial Expression Recognition and Analysis Challenge, 2011.

Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. 3D Constrained Local Model for Rigid and Non-Rigid Facial Tracking. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2610–2617, 2012.

Tadas Baltrušaitis, Ntombikayise Banda, and Peter Robinson. Dimensional affect recognition using continuous conditional random fields. In IEEE International Conference on Automatic Face and Gesture Recognition, 2013a.

Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. Constrained local neural fields for robust facial landmark detection in the wild. In International Conference on Computer Vision workshops, 2013b.

Simon Baron-Cohen. Reading the mind in the face: A cross-cultural and developmental study. Visual Cognition, 3(1):39–60, Mar 1996.

Simon Baron-Cohen, Alan M. Leslie, and Uta Frith. Does the autistic child have a "theory of mind"? Cognition, 21(1):37–46, 1985.

Simon Baron-Cohen, Ofer Golan, Sally Wheelwright, and Jacqueline J. Hill. Mind reading: the interactive guide to emotions. 2004.

Janet B. Bavelas, Linda Coates, and Trudy Johnson. Listeners as co-narrators. Journal of Personality and Social Psychology, 79(6):941–952, 2000.

Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3D faces.
In SIGGRAPH, pages 187–194, 1999.

BPI. Digital Music Nation. Technical report, 2013.

Martin Breidt, Heinrich H. Bülthoff, and Cristóbal Curio. Robust semantic analysis by synthesis of 3D facial motion. In IEEE International Conference on Automatic Face and Gesture Recognition, 2011.

Michael D. Breitenstein, Daniel Kuettel, Thibaut Weise, and Luc van Gool. Real-time face pose estimation from single range images. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.

Daphne Blunt Bugental, Jaques W. Kaswan, and Leonore R. Love. Perception of contradictory meanings conveyed by verbal and nonverbal channels. Journal of Personality and Social Psychology, 16(4):647–655, 1970.

Qin Cai, David Gallup, Cha Zhang, and Zhengyou Zhang. 3D deformable face tracking with a commodity depth camera. In European Conference on Computer Vision, 2010.

Xudong Cao, Yichen Wei, Fang Wen, and Jian Sun. Face alignment by explicit shape regression. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2887–2894. IEEE, 2012.

Marco La Cascia, Stan Sclaroff, and Vassilis Athitsos. Fast, Reliable Head Tracking under Varying Illumination: An Approach Based on Registration of Texture-Mapped 3D Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(4):322–336, 2000.

Justine Cassell. Nudge nudge wink wink: elements of face-to-face conversation for embodied conversational agents, volume 1, chapter 1, pages 1–27. MIT Press, 2000.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27, 2011.

Sien W. Chew, Patrick Lucey, Simon Lucey, Jason M. Saragih, Jeffrey F. Cohn, and Sridha Sridharan. Person-Independent Facial Expression Detection using Constrained Local Models. 2011.

William W. Cohen and Vitor R. Carvalho. Stacked sequential learning. In International Joint Conference on Artificial Intelligence, 2005.

Jeffrey F. Cohn. Foundations of human computing: Facial expression and emotion. In ACM International Conference on Multimodal Interfaces, pages 233–238, 2006.

Jeffrey F. Cohn, Tomas Simon Kruez, Iain Matthews, Ying Yang, Minh Hoai Nguyen, Margara Tejera Padilla, Feng Zhou, and Fernando De la Torre. Detecting depression from facial actions and vocal prosody. In Affective Computing and Intelligent Interaction, 2009.

Timothy F. Cootes and C. J. Taylor. Statistical Models of Appearance for Computer Vision. 2004.

Timothy F. Cootes and Christopher J. Taylor. Active Shape Models - "Smart Snakes". In British Machine Vision Conference, 1992.

Timothy F. Cootes, Kevin N. Walker, and Christopher J. Taylor. View-based active appearance models. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 227–232, 2000.

Timothy F. Cootes, Gareth Edwards, and Christopher Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23:681–685, 2001.

Roddy Cowie, Ellen Douglas-Cowie, Nicolas Tsapatsoulis, George N. Votsis, Stefanos D. Kollias, Winfried A. Fellenz, and John Gerald Taylor. Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, 18(1):32–80, 2001.

David Cristinacce and Timothy F. Cootes. Feature detection and tracking with constrained local models. In British Machine Vision Conference, 2006.

David Cristinacce and Timothy F. Cootes. Boosted regression active shape models. In British Machine Vision Conference, 2007.

Charles Darwin.
The Expression of the Emotions in Man and Animals. London, John Murray, 1872.

Richard J. Davidson, Klaus R. Scherer, and H. Hill Goldsmith. Handbook of Affective Sciences. 2003.

Beatrice de Gelder. Why bodies? Twelve reasons for including bodily expressions in affective neuroscience. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 364(1535):3475–3484, 2009.

Beatrice de Gelder and Jean Vroomen. The perception of emotions by ear and by eye. Cognition and Emotion, pages 289–311, 2000.

Fernando De la Torre and Jeffrey F. Cohn. Guide to Visual Analysis of Humans: Looking at People, chapter Facial Expression Analysis. Springer, 2011.

Sidney D'Mello and Rafael Calvo. Beyond the basic emotions: What should affective computing compute? In Extended Abstracts of the ACM SIGCHI Conference on Human Factors in Computing Systems, pages 2287–2294, 2013.

Neil A. Dodgson. Variation and extrema of human interpupillary distance. In Stereoscopic Displays and Virtual Reality Systems, pages 36–46, 2004.

Guillaume-Benjamin-Amand Duchenne de Boulogne. Mécanisme de la Physionomie Humaine. Cambridge University Press, 1862. Reprinting of the original 1862 dissertation.

Paul Ekman. An argument for basic emotions. Cognition and Emotion, 6(3):169–200, 1992.

Paul Ekman. Language, knowledge, and representation, chapter Emotional and conversational nonverbal signals, pages 39–50. Kluwer Academic Publishers, 2004.

Paul Ekman and Wallace V. Friesen. Pictures of Facial Affect. Consulting Psychologists Press, 1976.

Paul Ekman and Wallace V. Friesen. Manual for the Facial Action Coding System. Palo Alto: Consulting Psychologists Press, 1977.

Paul Ekman and Erika L. Rosenberg. What the Face Reveals: Basic and Applied Studies of Spontaneous Expression using the Facial Action Coding System. 2005.

Paul Ekman, Wallace V. Friesen, Maureen O'Sullivan, and Klaus R. Scherer. Relative importance of face, body, and speech in judgments of personality and affect. Journal of Personality and Social Psychology, 38:270–277, 1980.

Paul Ekman, Wallace V. Friesen, and Phoebe Ellsworth. Emotion in the Human Face. Cambridge University Press, second edition, 1982.

Rana el Kaliouby and Peter Robinson. Real-Time Inference of Complex Mental States from Facial Expressions and Head Gestures, pages 181–200. Springer US, 2005.

Rana el Kaliouby, Peter Robinson, and Simeon Keates. Temporal context and the recognition of emotion from facial expression. In HCI International Conference, pages 2–6, 2003.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9:1871–1874, 2008.

Gabriele Fanelli, Juergen Gall, and Luc Van Gool. Real Time Head Pose Estimation with Random Regression Forests. In IEEE Conference on Computer Vision and Pattern Recognition, pages 617–624, 2011a.

Gabriele Fanelli, Thibaut Weise, Juergen Gall, and Luc Van Gool. Real time head pose estimation from consumer depth cameras. In Deutsche Arbeitsgemeinschaft für Mustererkennung, 2011b.

Gabriele Fanelli, Matthias Dantone, and Luc Van Gool. Real time 3D face alignment with random forests-based active appearance models. In IEEE International Conference on Automatic Face and Gesture Recognition, 2013.

Johnny R. J. Fontaine, Klaus R. Scherer, Etienne B. Roesch, and Phoebe C. Ellsworth. The world of emotions is not two-dimensional. Psychological Science, 18(12):1050–1057, 2007.

Chris Frith.
Role of facial expressions in social interactions. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1535):3453–3457, 2009.

Xinbo Gao, Ya Su, Xuelong Li, and Dacheng Tao. A review of active appearance models. In IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews, volume 40, pages 145–158, 2010.

Jeffrey M. Girard, Jeffrey F. Cohn, Mohammad H. Mahoor, Seyedmohammad Mavadati, and Dean P. Rosenwald. Social risk and depression: Evidence from manual and automatic facial expression analysis. In IEEE International Conference on Automatic Face and Gesture Recognition, 2013.

Ralph Gross, Iain Matthews, Jeffrey Cohn, Takeo Kanade, and Simon Baker. Multi-PIE. In IEEE International Conference on Automatic Face and Gesture Recognition, 2008.

Leon Gu and Takeo Kanade. A generative shape regularization model for robust face alignment. In IEEE European Conference on Computer Vision, pages 413–426. Springer, 2008.

Hatice Gunes and Maja Pantic. Automatic, dimensional and continuous emotion recognition. International Journal of Synthetic Emotions, 1(1):68–99, 2010.

Hatice Gunes and Björn Schuller. Categorical and dimensional affect analysis in continuous input: Current trends and future directions. Image and Vision Computing, 31(2):120–136, 2013.

Uri Hadar, Timothy J. Steiner, and Frank Clifford Rose. Head movement during listening turns in conversation. Journal of Nonverbal Behavior, 9(4):214–228, 1985.

Kristina Höök. Affective loop experiences: designing for interactional embodiment. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1535), 2009.

Vaiva Imbrasaitė, Tadas Baltrušaitis, and Peter Robinson. Emotion tracking in music using Continuous Conditional Random Fields and relative feature representation. In IEEE International Conference on Multimedia and Expo, 2013a.

Vaiva Imbrasaitė, Tadas Baltrušaitis, and Peter Robinson. What really matters? A study into people's instinctive evaluation metrics for continuous emotion prediction in music. In Affective Computing and Intelligent Interaction, 2013b.

Mircea C. Ionita, Peter Corcoran, and Vasile Buzuloiu. On color texture normalization for active appearance models. IEEE Transactions on Image Processing, 18(6):1372–1378, 2009.

László Jeni, András Lörincz, Tamás Nagy, Zsolt Palotai, Judit Sebök, Zoltán Szabó, and Dániel Takács. 3D shape estimation in video sequences provides high precision evaluation of facial expressions. Image and Vision Computing, 2012.

Patrik Juslin and Klaus R. Scherer. The New Handbook of Methods in Nonverbal Behavior Research, chapter Vocal expression of affect, pages 65–135. 2005.

Dacher Keltner. Signs of appeasement: Evidence for the distinct displays of embarrassment, amusement, and shame. Journal of Personality and Social Psychology, 68(3):441–454, 1995.

Chris L. Kleinke. Gaze and eye contact: a research review. Psychological Bulletin, 100(1):78–100, 1986.

Mark Korhonen, David A. Clausi, and Ed Jernigan. Modeling emotional content of music using system identification. IEEE Transactions on Systems Man and Cybernetics Part B - Cybernetics, 36(3), 2006.

Sanjiv Kumar and Martial Hebert. Discriminative random fields: a discriminative framework for contextual interaction in classification, volume 2, pages 1150–1157, 2003.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira.
Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning, pages 282–289, 2001.

Denis Lalanne, Laurence Nigay, Philippe Palanque, Peter Robinson, Jean Vanderdonckt, and Jean-Francois Ladry. Fusion engines for multimodal input: A survey. In International Conference on Multimodal Interfaces, pages 153–160, 2009.

Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In International Symposium on Circuits and Systems, pages 253–256, 2010.

Julien-Charles Lévesque, Louis-Philippe Morency, and Christian Gagné. Sequential Emotion Recognition using Latent-Dynamic Conditional Neural Fields. 2013.

John P. Lewis. Fast template matching. Vision Interface, 10:120–123, 1995.

Stephan Liwicki and Stefanos Zafeiriou. Fast and robust appearance-based tracking. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 507–513, 2011.

Patrick Lucey, Jeffrey F. Cohn, Takeo Kanade, Jason M. Saragih, Zara Ambadar, and Iain Matthews. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2010.

Karl F. MacDorman, Stuart Ough, and Chin-Chang Ho. Automatic emotion prediction of song excerpts: Index construction, algorithm design, and empirical comparison. Journal of New Music Research, 36(4):281–299, Dec 2007. doi: 10.1080/09298210801927846.

Marwa M. Mahmoud, Tadas Baltrušaitis, Peter Robinson, and Laurel D. Riek. 3D corpus of spontaneous complex mental states. In Affective Computing and Intelligent Interaction, 2011.

Marwa M. Mahmoud, Tadas Baltrušaitis, and Peter Robinson. Crowdsourcing in emotion studies across time and culture. In Proceedings of the ACM Multimedia workshop on Crowdsourcing for multimedia. ACM Press, 2012.

Ioannis Marras, Joan Alabort-i Medina, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. Online Learning and Fusion of Orientation Appearance Models for Robust Rigid Object Tracking. In IEEE International Conference on Automatic Face and Gesture Recognition, 2013.

David Matsumoto and Bob Willingham. Spontaneous facial expressions of emotion of congenitally and noncongenitally blind individuals. Journal of Personality, 96(1):1–10, 2009.

Iain Matthews and Simon Baker. Active appearance models revisited. International Journal of Computer Vision, 60(2):135–164, 2004.

Iain Matthews, Jing Xiao, and Simon Baker. 2D vs. 3D Deformable Face Models: Representational Power, Construction, and Real-Time Fitting. International Journal of Computer Vision, 75(1):93–113, 2007.

Daniel McDuff, Rana el Kaliouby, David Demirdjian, and Rosalind W. Picard. Predicting online media effectiveness based on smile responses gathered over the internet. In IEEE International Conference on Automatic Face and Gesture Recognition, 2013.

Gary McKeown, Michel F. Valstar, Roddy Cowie, and Maja Pantic. The SEMAINE corpus of emotionally coloured character interactions. In IEEE International Conference on Multimedia and Expo, 2010.

Dimitris Metaxas and Shaoting Zhang. A review of motion analysis methods for human nonverbal communication computing. Image and Vision Computing, 31:421–433, 2013.

Louis-Philippe Morency, Jacob Whitehill, and Javier Movellan. Generalized Adaptive View-based Appearance Model: Integrated Framework for Monocular Head Pose Estimation.
In IEEE International Conference on Automatic Face and Gesture Recognition, 2008.

Erik Murphy-Chutorian and Mohan Manubhai Trivedi. Head pose estimation in computer vision: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4), 2009.

Mihalis A. Nicolaou, Hatice Gunes, and Maja Pantic. Audio-visual classification and fusion of spontaneous affective data in likelihood space. In International Conference on Pattern Recognition, pages 3695–3699, 2010.

Mihalis A. Nicolaou, Hatice Gunes, and Maja Pantic. Output-associative RVM regression for dimensional and continuous emotion prediction. Image and Vision Computing, 30(3):186–196, 2012.

Jérémie Nicolle, Vincent Rapp, Kévin Bailly, and Lionel Prevost. Robust continuous prediction of human emotions using multiscale dynamic cues. In ACM International Conference on Multimodal Interaction, pages 501–508, 2012.

Timo Ojala and Matti Pietikainen. A comparative study of texture measures with classification based on feature distributions. Pattern Recognition, 29(1):51–59, 1996.

Derya Ozkan, Stefan Scherer, and Louis-Philippe Morency. Step-wise Emotion Recognition Using Concatenated-HMM. In ACM International Conference on Multimodal Interaction, pages 477–484, 2012.

Renato Panda and Rui Pedro Paiva. Using support vector machines for automatic mood tracking in audio music. In 130th Audio Engineering Society Convention, 2011.

Maja Pantic and Marian Stewart Bartlett. Face Recognition, chapter Machine Analysis of Facial Expressions, pages 377–416. I-Tech Education and Publishing, 2007.

Maja Pantic, Alex Pentland, Anton Nijholt, and Thomas Huang. Human computing and machine understanding of human behavior: A survey. In ACM International Conference on Multimodal Interfaces, pages 239–248, 2006.

Ulrich Paquet. Convexity and Bayesian constrained local models. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1193–1199, 2009.

Ioannis Patras and Maja Pantic. Particle filtering with factorized likelihoods for tracking facial features. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 97–102, 2004.

Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. A 3D Face Model for Pose and Illumination Invariant Face Recognition. In IEEE International Conference on Advanced Video and Signal Based Surveillance, pages 296–301, 2009.

Allan Pease and Barbara Pease. The Definitive Book of Body Language. Orion, 2006.

Jian Peng, Liefeng Bo, and Jinbo Xu. Conditional neural fields. In Advances in Neural Information Processing Systems, pages 1419–1427, 2009.

Cécile Pereira. Dimensions of emotional meaning in speech. In International Speech Communication Association Workshop on Speech and Emotion, pages 25–28, 2000.

Rosalind W. Picard. Affective Computing. The MIT Press, 1997.

Rosalind W. Picard. Future affective technology for autism and emotion communication. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 364(1535):3575–3584, Dec 2009.

Rosalind W. Picard and Jonathan Klein. Computers that recognise and respond to user emotion: Theoretical and practical implications. Interacting with Computers, 14(2):141–169, 2001.

Kenneth M. Prkachin and Patricia E. Solomon. The structure, reliability and validity of pain expression: evidence from patients with shoulder pain. Pain, 139(2):267–274, 2008.

Tao Qin, Tie-yan Liu, Xu-dong Zhang, De-sheng Wang, and Hang Li.
Global ranking using continuous conditional random fields. In Advances in Neural Information Processing Systems, pages 1281–1288, 2008.

Vladan Radosavljevic, Slobodan Vucetic, and Zoran Obradovic. Continuous conditional random fields for regression in remote sensing. In European Conference on Artificial Intelligence, pages 809–814, 2010.

Geovany A. Ramirez, Tadas Baltrušaitis, and Louis-Philippe Morency. Modeling latent discriminative dynamic of multi-dimensional affective signals. In 1st International Audio/Visual Emotion Challenge and Workshop in conjunction with Affective Computing and Intelligent Interaction, 2011.

Laurel D. Riek and Peter Robinson. Using robots to help people habituate to visible disabilities. 2011.

Peter Robinson and Rana el Kaliouby. Computation of emotions in man and machines. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1535):3441–3447, 2009.

Paul Rozin and Adam B. Cohen. High frequency of facial expressions corresponding to confusion, concentration, and worry in an analysis of naturally occurring facial expressions of Americans. Emotion, 3(1):68–75, 2003.

James A. Russell and Albert Mehrabian. Evidence for a three-factor theory of emotions. Journal of Research in Personality, 11(3):273–294, 1977.

James A. Russell, Jo-Anne Bachorowski, and Jose-Miguel Fernandez-Dols. Facial and vocal expressions of emotion. Annual Review of Psychology, 54:329–349, 2003.

Christos Sagonas, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. A semi-automatic methodology for facial landmark annotation. In Workshop on Analysis and Modeling of Faces and Gestures, 2013.

Ashok Samal and Prasana A. Iyengar. Automatic recognition and analysis of human faces and facial expressions: a survey. Pattern Recognition, 25(1):65–77, 1992.

Georgia Sandbach, Stefanos Zafeiriou, Maja Pantic, and Lijun Yin. Static and dynamic 3D facial expression recognition: A comprehensive survey. Image and Vision Computing, 30(10):683–697, 2012.

Jason M. Saragih and Roland Goecke. A Nonlinear Discriminative Approach to AAM Fitting. In International Conference on Computer Vision, 2007.

Jason M. Saragih, Simon Lucey, and Jeffrey F. Cohn. Face alignment through subspace constrained mean-shifts. In IEEE International Conference on Computer Vision, pages 1034–1041, 2009.

Jason M. Saragih, Simon Lucey, and Jeffrey F. Cohn. Deformable Model Fitting by Regularized Landmark Mean-Shift. International Journal of Computer Vision, 91(2):200–215, 2011.

Patrick Sauer, Timothy F. Cootes, and Christopher J. Taylor. Accurate Regression Procedures for Active Appearance Models. In British Machine Vision Conference, 2011.

Klaus R. Scherer. Handbook of Cognition and Emotion, chapter Appraisal Theory, pages 637–663. Wiley-Blackwell, 2005.

Klaus R. Scherer, H. Wagner, and A. Manstead. Handbook of Psychophysiology: Emotion and social behavior, chapter Vocal correlates of emotional arousal and affective disturbance, pages 165–197. 1989.

Erik M. Schmidt and Youngmoo E. Kim. Prediction of time-varying musical mood distributions from audio. In International Society for Music Information Retrieval Conference, pages 465–470, 2010a.

Erik M. Schmidt and Youngmoo E. Kim. Prediction of time-varying musical mood distributions using Kalman filtering. In International Conference on Machine Learning and Applications, pages 655–660, 2010b.

Erik M. Schmidt and Youngmoo E. Kim. Modeling musical emotion dynamics with conditional random fields.
In International Society for Music Information Retrieval Conference, pages 777–782, 2011.

Erik M. Schmidt, Douglas Turnbull, and Youngmoo E. Kim. Feature selection for content-based, time-varying musical emotion regression. In International Society for Music Information Retrieval Conference, pages 267–273, 2010.

Karen L. Schmidt and Jeffrey F. Cohn. Human facial expressions as adaptations: Evolutionary questions in facial expression research. Yearbook of Physical Anthropology, 44:3–24, 2001.

Karen L. Schmidt, Zara Ambadar, Jeffrey F. Cohn, and L. Ian Reed. Movement differences between deliberate and spontaneous facial expressions: Zygomaticus major action in smiling. Journal of Nonverbal Behavior, 30(1):37–52, 2006.

Marc Schröder. Affective Dialogue Systems, chapter Dimensional emotion representation as a basis for speech synthesis with non-extreme emotions, pages 209–221. 2004.

Emery Schubert. Modeling Perceived Emotion With Continuous Musical Features. Music Perception, 21(4), 2004.

Björn Schuller, Michel F. Valstar, Florian Eyben, Gary McKeown, Roddy Cowie, and Maja Pantic. AVEC 2011 - the first international audio/visual emotion challenge. In Affective Computing and Intelligent Interaction, 2011.

Björn Schuller, Michel F. Valstar, Roddy Cowie, and Maja Pantic. AVEC 2012 - the continuous audio/visual emotion challenge - an introduction. In ACM International Conference on Multimodal Interaction, pages 361–362, 2012.

Caifeng Shan, Shaogang Gong, and Peter W. McOwan. Facial expression recognition based on local binary patterns: A comprehensive study. Image and Vision Computing, 27(6):803–816, 2009.

Hyunjung Shim and Seungkyu Lee. Performance evaluation of time-of-flight and structured light depth sensors in radiometric/geometric variations. Optical Engineering, 51(9), 2012.

Tal Sobol-Shikler and Peter Robinson. Classification of complex information: Inference of co-occurring affective states from their expressions in speech. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7):1284–1297, 2010.

Yale Song, Louis-Philippe Morency, and Randall Davis. Multimodal Human Behavior Analysis: Learning Correlation and Interaction Across Modalities. 2012.

Jacquelin A. Speck, Erik M. Schmidt, Brandon G. Morton, and Youngmoo E. Kim. A comparative study of collaborative vs. traditional musical mood annotation, 2011.

Mikkel B. Stegmann and Rasmus Larsen. Multi-band Modelling of Appearance. Image and Vision Computing, 21(1):61–67, 2003.

Charles Sutton and Andrew McCallum. Introduction to Statistical Relational Learning, chapter Introduction to Conditional Random Fields for Relational Learning. MIT Press, 2006.

Motoi Suwa, Noboru Sugie, and Keisuke Fujimura. A Preliminary Note on Pattern Recognition of Human Emotional Expression. In International Joint Conference on Pattern Recognition, pages 408–410, 1978.

Richard Szeliski. Computer Vision: Algorithms and Applications. Springer-Verlag New York Inc, 2010.

Lorenzo Torresani, Aaron Hertzmann, and Chris Bregler. Nonrigid structure-from-motion: estimating shape and motion with hierarchical priors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5):878–892, May 2008.

Jessica L. Tracy and David Matsumoto. The spontaneous expression of pride and shame: evidence for biologically innate nonverbal displays. Proceedings of the National Academy of Sciences of the United States of America, 105(33):11655–11660, 2008.
Georgios Tzimiropoulos, Joan Alabort-i Medina, Stefanos Zafeiriou, and Maja Pantic. Generic Active Appearance Models Revisited. In Asian Conference on Computer Vision, pages 650–663, 2012.

Michel F. Valstar. Timing is everything: A spatio-temporal approach to the analysis of facial actions. PhD thesis, 2008.

Michel F. Valstar and Maja Pantic. Induced disgust, happiness and surprise: an addition to the MMI facial expression database. In International Conference on Language Resources and Evaluation, Workshop on EMOTION, pages 65–70, 2010.

Michel F. Valstar, Maja Pantic, Zara Ambadar, and Jeffrey F. Cohn. Spontaneous vs. posed facial behavior: Automatic analysis of brow actions. In ACM International Conference on Multimodal Interfaces, pages 162–170, 2006.

Michel F. Valstar, Hatice Gunes, and Maja Pantic. How to Distinguish Posed from Spontaneous Smiles using Geometric Features. In ACM International Conference on Multimodal Interfaces, 2007.

Michel F. Valstar, Brais Martinez, Xavier Binefa, and Maja Pantic. Facial point detection using boosted regression and graph models. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2729–2736, 2010.

Michel F. Valstar, Bihan Jiang, Marc Mehu, Maja Pantic, and Klaus R. Scherer. The First Facial Expression Recognition and Analysis Challenge. In IEEE International Conference on Automatic Face and Gesture Recognition, 2011.

Sundar Vedula, Peter Rander, Robert T. Collins, and Takeo Kanade. Three-dimensional scene flow. In IEEE International Conference on Computer Vision, pages 722–729, 1999.

Paul Viola and Michael J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.

Yang Wang, Simon Lucey, and Jeffrey Cohn. Non-rigid object alignment with a mismatch template based on exhaustive local search. In IEEE International Conference on Computer Vision, 2007.

Yang Wang, Simon Lucey, and Jeffrey F. Cohn. Enforcing convexity for improved alignment with constrained local models. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.

Thibaut Weise, Sofien Bouaziz, Hao Li, and Mark Pauly. Realtime performance-based facial animation. In SIGGRAPH, 2011.

Martin Wöllmer, Florian Eyben, Stephan Reiter, Björn Schuller, Cate Cox, Ellen Douglas-Cowie, and Roddy Cowie. Abandoning emotion classes - towards continuous emotion recognition with modelling of long-range dependencies. In Interspeech, pages 597–600, 2008.

Jing Xiao, Simon Baker, Iain Matthews, and Takeo Kanade. Real-time combined 2D + 3D active appearance models. In IEEE Conference on Computer Vision and Pattern Recognition, pages 535–542, 2004.

Jing Xiao, Jinxiang Chai, and Takeo Kanade. A closed-form solution to non-rigid shape and motion recovery. International Journal of Computer Vision, 67(2):233–246, 2006.

Lijun Yin, Xiaochen Chen, Yi Sun, Tony Worm, and Michael Reale. A high-resolution 3D dynamic facial expression database. In IEEE International Conference on Automatic Face and Gesture Recognition, 2008.

Zhihong Zeng, Maja Pantic, Glenn I. Roisman, and Thomas S. Huang. A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1):39–58, 2009.

Xing Zhang, Lijun Yin, Jeffrey F. Cohn, Shaun Canavan, Michael Reale, Andy Horowitz, and Peng Liu. A High-Resolution Spontaneous 3D Dynamic Facial Expression Database. 2013.

Zhengyou Zhang. Microsoft Kinect sensor and its effect.
IEEE MultiMedia, 19(2):4–12, 2012.

Guoying Zhao and Matti Pietikainen. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):915–928, 2007.

Wen-Yi Zhao, Rama Chellappa, P. Jonathon Phillips, and Azriel Rosenfeld. Face recognition: A literature survey. ACM Computing Surveys, 35(4):399–458, 2003.

Xiangxin Zhu and Deva Ramanan. Face detection, pose estimation, and landmark localization in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2879–2886, 2012.