Affect Recognition from Facial Movements and Body Gestures by Hierarchical Deep Spatio-Temporal Features and Fusion Strategy




Bo Sun, Siming Cao, Jun He*, and Lejun Yu
College of Information Science and Technology, Beijing Normal University, Beijing, China
* Corresponding author. hejun@bnu.edu.cn

Abstract

Affect presentation is periodic and multi-modal; it is conveyed through facial movements, body gestures, and other channels. Studies have shown that temporal selection and multi-modal combination may benefit affect recognition. In this article, we therefore propose a spatio-temporal fusion model that extracts spatio-temporal hierarchical features based on selected expressive components. In addition, a multi-modal hierarchical fusion strategy is presented. Our model learns the spatio-temporal hierarchical features from videos with a proposed deep network, which combines a convolutional neural network (CNN) and a bidirectional long short-term memory recurrent neural network (BLSTM-RNN) with principal component analysis (PCA). Our approach handles each video as a video sentence: it first obtains a skeleton through a temporal selection process, then segments key words with a sliding window of a fixed size, and finally obtains the features of the video-skeleton and video-words with the deep network. Our model combines feature-level and decision-level fusion for fusing the multi-modal information. Experimental results showed that our model improved the multi-modal affect recognition accuracy on the face and body (FABO) database from 95.13% in the existing literature to 99.57%, an increase of 4.44 percentage points, and achieved a macro average accuracy (MAA) of up to 99.71%.

Keywords: affect recognition, deep learning, convolutional neural network, bidirectional long short-term memory recurrent neural network, deep spatio-temporal hierarchical feature, multi-modal feature fusion strategy

1. INTRODUCTION

The ability to recognize affect is an important aspect of computer intelligence. It primarily influences a computer's response to an operator or interlocutor, and it has a wide range of applications in entertainment, industry, transportation, medicine, the military, and many other fields. Over the past few decades, several affect recognition methodologies have been proposed. The research has led to two key trends toward greater practicality, i.e., the use of multi-modal information instead of mono-modal information, and of dynamic videos instead of static images. In this line of research, the American psychologists Ekman et al. initially defined six basic categories of emotions, i.e., anger, disgust, fear, happiness, sadness, and surprise [1]. Several years later, they developed the Facial Action Coding System (FACS) [2]. In this system, a facial expression is deemed the result of facial muscle activity and the combination of many action units (AUs), which describe the correspondence between

facial movements and expressions. The two works marked key milestones in the field and have continued to serve as the basis of emotion recognition, especially of facial emotion recognition research. Subsequently, many related algorithms and systems have been proposed [3-8]. Recently, deep learning methods have become widely used in the field of computer vision. The convolutional neural network (CNN) and the bidirectional long short-term memory recurrent neural network (BLSTM-RNN) are state-of-the-art machine-learning techniques in this area. Fan et al. [50] presented a video-based emotion recognition system using CNN-RNN and C3D hybrid networks. Chen et al. [51] explored two simple, yet effective deep-learning-based methods for image emotion analysis. Noroozi et al. [52] applied a CNN to obtain key frames for summarizing videos.

However, human emotion expression manifests in multi-modal, not mono-modal, information, such as facial movements, body gestures, voice utterances, etc. Each single modality is often ambiguous, uncertain, and incomplete. Metallinou et al. [53] thus examined context-sensitive schemes for emotion recognition in a multi-modal, hierarchical approach based on a bidirectional long short-term memory (BLSTM) neural network. In addition, Kret et al. [9] performed a psychological analysis of body movements for body expression recognition. They showed that using only facial expressions can be misleading, whereas combining them with body expressions can improve the accuracy of emotional state recognition. Moreover, Neverova et al. [47] proposed a method for adaptive multi-modal gesture recognition. They showed that fusing multiple modalities leads to a significant increase in recognition rates and that the information items from the individual channels have complementary characteristics.

In recent years, to advance multi-modal affect recognition, many multi-modal emotion recognition competitions have been organized, such as the Emotion Recognition in the Wild Challenge (EmotiW) [18], the Audio/Visual Emotion Challenge (AVEC) [19], and the Multimodal Emotion Recognition Challenge [20]. Furthermore, the idea of combining multiple modalities for affect recognition has generated a new research topic, specifically determining which modalities should be used and how to integrate them effectively. Some researchers initially focused on fusing visual and audio modalities [11-14]. Later, others explored utilizing audio, visual, and physiological signals synchronously for recognizing affects [15-17]. In this regard, Ambady and Rosenthal suggested that the visual channels, i.e., facial movements and body gestures, are the most important cues for the classification of human behavior [21]. Consequently, some researchers have suggested that fusing these cues can produce better affect recognition results, and corresponding studies have been conducted [22-27].

To date, two major fusion strategies exist, namely feature-level fusion and decision-level fusion. Feature-level fusion directly combines the discriminative ability of multiple features and is assumed to be more suitable for modalities that are almost synchronous in the timescale (e.g., speech and lip movements) [40]. Decision-level fusion combines the discriminative results of multiple features. This approach is assumed to be more suitable for modalities that do not occur simultaneously in the timescale (e.g., speech and body gestures) [40].
In studies based on the face and body (FABO) database, Gunes and Piccardi [26] separately applied feature-level fusion and decision-level fusion. Meanwhile, Barros et al. [32] used a fully connected layer to fuse each multi-modal stream, while Chen et al. [27] used only feature-level fusion. A means of

integrating facial movements and body gestures needs further development.

The actions of facial movements or body gestures comprise a dynamic process, which can be described by four temporal phases: neutral, onset, apex, and offset [10]. Compared with affect recognition based on static images, affect recognition based on videos can exploit spatio-temporal information. In studies based on the FABO database [44], Barros et al. [32] selected several apex frames for spatio-temporal feature extraction, and Chen et al. [27] proposed a framework to extract the temporal dynamic features of face and body gestures from the whole video. However, methods for extracting effective spatio-temporal features require further investigation.

To address the above issues, we propose a spatio-temporal fusion model, which not only extracts high-level spatio-temporal hierarchical features but also includes a multi-modal hierarchical fusion strategy. This paper provides the following key contributions:

1) To extract effective spatio-temporal hierarchical features, we propose a temporal selection approach to obtain expressive materials. First, we employ the onset-apex-offset sequence as a video-skeleton. Then, using a sliding window strategy, we obtain several video-words from the video-skeleton. Our experiments confirm that the proposed temporal selection approach is notably more effective than previous methods.

2) Based on these expressive materials, we extract deep spatio-temporal features from the video-skeleton and the video-words using a proposed network, which combines CNN, BLSTM-RNN, and principal component analysis (PCA).

3) We propose a hierarchical fusion method that combines feature-level fusion and decision-level fusion for visual multi-modal affect recognition based on facial movements and body gestures. It showed excellent performance in the conducted experiments.

We evaluated the proposed method on the FABO database. The proposed method performed better than existing state-of-the-art methods for visual multi-modal affect recognition. The remainder of this paper is organized as follows. Section 2 introduces related work. Section 3 describes the details of the proposed methodology. The performed experiments and extensive experimental results are detailed in Section 4. Finally, our conclusions are given in Section 5.

2. RELATED WORK

In this section, we first review some existing methods of mono-modal affect recognition from facial movements or body gestures. Second, we review some existing works on visual multi-modal recognition.

2.1 Mono-modal Affect Recognition

As mentioned above, the original studies on affect recognition were based on single-mode static images, especially face images. To date, many studies have been conducted on facial expressions. Zhong et al. [3] proposed a method to divide a face image into blocks of different scales; similar and special blocks were then selected from among different expressions through learning to identify the most representative areas. Cheon et al. [4] proposed an algorithm for facial expression recognition based on a

differential active appearance model and manifold learning. Liu et al. [5] proposed an improved deep learning method, which can extract a series of representative facial features through repeated learning and training; a strong boosted classifier with statistical properties is then formed. In addition, Bo et al. [48] extracted several CNN features for continuous affect recognition. They also extracted acoustic features, LBP from three orthogonal planes (LBP-TOP), dense SIFT, and CNN-LSTM features to recognize the emotions of film characters [49]. Guo et al. [55] proposed a multi-modality convolutional neural network (CNN) based on visual and geometrical information for micro-emotion recognition. Schwan et al. [56] described an advanced pre-processing algorithm for facial images and a transfer learning mechanism for face emotion recognition.

Compared to research on facial expressions, few body expression studies have been undertaken. This may be because it is difficult to accurately and reliably define the corresponding relationships of various body gestures to emotional categories. Some researchers in psychology, cognitive science, and computer science have studied emotion recognition based on body gestures, and effective systems have been presented for body emotion recognition. Glowinski et al. [7] studied the association between gestures and affective changes. They first coded the upper extremity changes of the human body and then expressed the emotion through specific gestures. Nicolaou et al. [8] studied head movements. They mapped the angle and direction of the head motion to the emotions in the emotional space to produce emotional activations and expectations. Coulson et al. [45] studied the emotions implied in certain gestures. The results showed that the emotional information contained in a gesture is similar to that found in speech. Moreover, Silva et al. [46] developed a gesture-based emotion recognition system that can automatically identify the emotional state of children in a game. Wang et al. [54] proposed a comprehensive emotion classification framework based on spatio-temporal volumes built with human actions.

2.2 Visual Multi-modal Recognition

Mono-modal affect recognition has certain limitations because human affects are not limited to a single mode; rather, they rely on multi-modal expressions. Following the fundamental study of Ambady and Rosenthal [21], some researchers have advocated that fusing facial movements and body gestures for affect recognition can yield good results. Accordingly, corresponding studies have been conducted in recent years [22-27]. Kapoor and Picard [22] fused face information from videos and information gathered by a gesture sensor to determine the emotional states of children during a game. Gunes and Piccardi [24, 26] performed considerable work in multi-modal emotion recognition based on facial movements and body gestures. They created the FABO multi-modal database [64], including body and facial modalities. Based on the FABO database, Shan et al. [25] employed the Bag of Words (BoW) model to extract spatio-temporal features of body gestures and facial expressions from a video. To maximize the correlation between the two modalities, they used canonical correlation analysis (CCA) to combine the facial movement and body gesture features at the feature level for multi-modal affect recognition. Moreover, to describe the dynamics of facial movements and body gestures for multi-modal affect recognition, Chen et al.
[27] extracted MHI-HOG and Image-HOG features. In addition, they applied a temporal normalization method to solve the problem of temporal alignment between these two modalities.

Furthermore, Balomenos et al. [31] used an HMM-based method to identify six basic emotions from gestures (hands clapping, hands on the head, hands moving and lifting along a certain pattern) and facial features. In addition, Barros et al. [32] used a hierarchical feature representation method based on a multichannel convolutional neural network (MCCNN) to integrate multi-modal affect recognition, and consequently achieved a significant improvement in recognition accuracy. They also proposed a neurocomputational model that learned to attend to emotional expressions and to modulate emotion recognition [57].

Despite the above advancements, limitations remain in the existing affect recognition methods. Because human facial expressions and body gestures are dynamic processes, most existing approaches do not include a temporal selection process to obtain the critical sequence for affect recognition. Although Refs. [26, 27, 32] considered it, their approaches did not perform well. Furthermore, differentiated time sequences for different affects were not considered. Moreover, most existing affect recognition methods consider only a single fusion method, either feature-level or decision-level fusion, whereas the relationships between video-words, video-skeletons, and video sentences are hierarchical. In addition, robustness is important for a model [61-63], which has not been considered by previous works.

In the proposed approach, on the other hand, we adopt several temporal selection methods for affect recognition. Experimental results show that the best affect recognition is based on the onset-apex-offset sequence. Furthermore, we handle each video as a video sentence. The proposed model first obtains its skeleton using temporal selection. It then segments video-words with a sliding window of a fixed size on the video-skeleton. Finally, it obtains the features of the video sentences and video-words with a deep network, which combines CNN, BLSTM-RNN, and PCA. In addition, we propose a multi-modal fusion method combining feature-level fusion and decision-level fusion. The experimental results in Section 4 demonstrate that the proposed framework outperformed the other methods with respect to visual multi-modal affect recognition from facial movements and body gestures.

3. METHODOLOGY

This section introduces the details of our method, which is partly inspired by the relationship between words, skeletons, and sentences in Chinese. Specifically, we treat each video as a video sentence, key clips as video-skeletons, and images as video-words. Using deep learning, we present a spatio-temporal fusion-based classification model, which not only extracts the spatio-temporal hierarchical features but also provides a multi-modal hierarchical fusion strategy. Fig. 1 depicts an overview of our model, which consists of three major parts.

1) Data pre-processing. This part performs face detection and alignment, as well as image size normalization.

2) Feature extraction and representation. This part includes temporal selection, a sliding window, and a proposed deep network, which combines CNN, BLSTM-RNN, and PCA. After the temporal processes, our model learns the high-level spatio-temporal hierarchical features with the deep network.

3) Hierarchical-fusion-based classification. This part introduces a multi-modal hierarchical-fusion-based classification strategy, which combines feature-level and decision-level fusion.

Fig. 1. Overview of the proposed model

3.1 Data Pre-processing

The data pre-processing part is composed of face detection and alignment, as well as image size normalization, for the face and body videos, respectively. For the face videos, the first step is to detect and align the face in all frames. We follow the methods of Refs. [33, 34, 35] to extract and track the face. All frames are aligned to a base face through an affine transformation and are resized to a fixed resolution. Examples of this step on the FABO database are shown in Fig. 2.

Fig. 2. Examples of face detection and alignment on the FABO database

Since the network in the next step requires a uniform image size, size normalization is also necessary for the body videos. Based on experiments, we resize each original image directly to the target resolution using bicubic interpolation.
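To make this step concrete, the following minimal sketch aligns a face frame to a base face via an affine warp and resizes body frames with bicubic interpolation using OpenCV. The function names, the landmark-based transform estimation, and the 224 x 224 target size are illustrative assumptions; the paper does not state these exact settings here.

# Illustrative pre-processing sketch (assumed target size and alignment details).
import cv2
import numpy as np

TARGET_SIZE = (224, 224)  # (width, height), an assumed value

def align_face(frame, detected_pts, base_pts, size=TARGET_SIZE):
    """Warp a frame so that its detected facial landmarks match the base-face landmarks."""
    # Estimate a partial affine (similarity) transform between the two landmark sets.
    M, _ = cv2.estimateAffinePartial2D(np.float32(detected_pts), np.float32(base_pts))
    return cv2.warpAffine(frame, M, size)

def resize_body(frame, size=TARGET_SIZE):
    """Directly resize a body frame using bicubic interpolation."""
    return cv2.resize(frame, size, interpolation=cv2.INTER_CUBIC)

In practice, the base-face landmarks would come from the face detection and tracking methods of Refs. [33, 34, 35].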

3.2 Deep Feature Extraction and Representation

A video is composed of many clips or images. For different affects, the most discriminative time sequences may be different, and they contain much more spatio-temporal information than a single image. Thus, representing this information well is a challenge. Here, we apply a state-of-the-art deep learning approach and propose the previously mentioned deep network, whose structure is detailed in Section 3.2.2. Since selecting the most expressive materials is fundamental, two techniques inspired by natural language processing also play a role; they are introduced in Section 3.2.1.

3.2.1 Preparing expressive materials

Affect representation is temporally periodic. Additionally, a given facial movement or body gesture is a physiologically constrained, phase-wise sequence. It contains a sequence of general temporal phases, i.e., neutral, onset, apex, offset, and neutral, as illustrated in Fig. 3. Generally, the neutral phases of different affects are almost identical, and the other phases, which represent the dynamic change, have the greatest impact on the features obtained from a video. Inspired by previous studies [26, 32] and by Chinese language processing, and considering the relationship between a frame image, a clip, and a video, we treat them as a word, a phrase, and a sentence, respectively.

Temporal selection is important for feature extraction, just as extracting the skeleton is fundamental for understanding a sentence. It is based on temporal segmentation. We performed temporal segmentation in our previous work [43] and achieved accuracy rates of 89.52% and 95.20% for facial movements and body gestures, respectively. In the experiments reported here, we used the ground truth temporal labels to select video frames. Some previous approaches use several apex frames for affect recognition, and others use the whole cycle of an affect. After analysis and experiments, we employ the onset-apex-offset sequence as the video-skeleton.

To further extract the most expressive temporal words from the skeleton, a sliding window of a specific size is applied in our model. Based on experience, widths of seven and eight are applied for body gesture and facial movement videos, respectively. With the sliding window strategy, we obtain several video-words from a sequence, and we then map the results through the following function:

P = \frac{1}{L - t + 1} \sum_{i=1}^{L - t + 1} P_i    (1)

where P denotes the probability matrix of the classification results for a video-skeleton, P_i denotes the probability matrix of the classification results for the i-th video-word taken from that video-skeleton, L denotes the length of the video-skeleton, and t denotes the width of the sliding window. If we use Softmax as the classifier for each video-word, P_i obeys a multinomial distribution rather than an exponential distribution [59] or a Gaussian distribution [60, 65]; if we use an SVM as the classifier for each video-word, P_i obeys a binomial distribution.
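As a minimal sketch of this step, the snippet below keeps the onset-apex-offset frames as the video-skeleton, cuts video-words of width t with a sliding window, and averages the per-word class probabilities, which is one plausible instantiation of the mapping in Eq. (1). The function and variable names are illustrative, not taken from the paper.

import numpy as np

def video_skeleton(frames, phase_labels):
    """Keep only the frames labelled onset, apex, or offset (the video-skeleton)."""
    keep = {"onset", "apex", "offset"}
    return [f for f, p in zip(frames, phase_labels) if p in keep]

def video_words(skeleton, t):
    """Cut overlapping sub-sequences (video-words) of width t from the skeleton."""
    return [skeleton[i:i + t] for i in range(len(skeleton) - t + 1)]

def fuse_word_probabilities(word_probs):
    """Average the per-word class-probability vectors into one skeleton-level
    prediction, one plausible reading of Eq. (1)."""
    return np.mean(np.stack(word_probs), axis=0)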

Fig. 3. Sample of the temporal segments from body gestures (top) and facial movements (bottom)

3.2.2 Learning deep features

After the temporal processing, the main challenge becomes determining a means to obtain effective features of a dynamic sequence. Unlike a static image, we must consider both spatial and temporal information, and extracting and connecting them effectively is a critical problem. Because they come from different domains, we introduce a sub-space transformation algorithm and present our deep network structure combining CNN, BLSTM-RNN, and PCA. Additionally, by considering local and global information together, it learns high-level spatio-temporal hierarchical features. Fig. 4 shows the architecture of our proposed network.

Fig. 4. Architecture of the proposed deep network

To obtain the spatial information, as mentioned above, we start from the architecture of AlexNet [36], which has five convolutional layers and three fully connected layers and uses the rectified linear unit (ReLU) as the activation function. AlexNet was designed for the ImageNet Large Scale Visual Recognition Challenge 2012 [37]. Our proposed CNN has six convolutional layers, three fully connected layers, three max-pooling layers, and two dropout layers. For spatial feature extraction, inspired by our experience in affect recognition challenges and some related works [43][49][58], we concatenate the ReLU activations after the first two fully connected layers. To effectively connect the spatial and temporal domains, we apply PCA to perform a feature space transformation and reduce the high dimensionality (retaining 99% of the variance); it is applied prior to temporal information extraction.

To obtain the temporal information, we employ the BLSTM-RNN, which is a state-of-the-art model for sequence analysis. The description of the BLSTM-RNN given herein is based on Refs. [28], [29], and [30]. Given an input feature sequence x = (x_1, ..., x_T), an LSTM computes the hidden vector sequence h = (h_1, ..., h_T) and the output vector sequence y = (y_1, ..., y_T) by iterating the following equations. Given x_t, h_{t-1}, and c_{t-1}, the LSTM updates for time step t as follows:

Input gate:  i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)    (2)

Forget gate: f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)    (3)

Output gate: o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)    (4)

Cell state:  c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)    (5)

h_t = o_t \odot \tanh(c_t)    (6)

Local and global information can be synthesized by learning the bidirectional property with the BLSTM-RNN. The bidirectional property is achieved by using forward and backward links to connect the data sequence in two separate hidden layers. The BLSTM-RNN combines the forward hidden sequence and the backward hidden sequence to generate the output, which is computed as follows:

\overrightarrow{h}_t = \mathcal{H}(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}})    (7)

\overleftarrow{h}_t = \mathcal{H}(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}})    (8)

y_t = W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y    (9)

For temporal feature extraction, we concatenate the ReLU activations after the first two fully connected layers. Prior to obtaining the temporal information, we apply PCA to lower the feature dimensions (retaining 99% of the variance). Finally, we obtain the high-level spatio-temporal hierarchical features for affect recognition. Note that the features obtained from the video-skeleton of facial movements and body gestures are hereafter referred to as Face1 and Body1, respectively. Similarly, the features obtained from the video-words of facial movements and body gestures are hereafter referred to as Face2 and Body2, respectively.
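The following Keras sketch illustrates the overall feature pipeline described above: frame-level CNN features (concatenated ReLU activations of two fully connected layers), PCA keeping 99% of the variance, and a bidirectional LSTM over the reduced per-frame features. The layer counts and sizes are illustrative assumptions and do not reproduce the paper's exact six-convolutional-layer network.

# Sketch of the CNN -> PCA -> BLSTM feature pipeline (assumed layer sizes).
from sklearn.decomposition import PCA
from tensorflow.keras import layers, models

def build_frame_cnn(num_classes, input_shape=(224, 224, 3)):
    """Small CNN in the spirit of AlexNet; the paper's network has six conv layers."""
    inp = layers.Input(shape=input_shape)
    x = inp
    for filters in (64, 128, 256):
        x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    fc1 = layers.Dense(1024, activation="relu", name="fc1")(x)
    fc2 = layers.Dense(512, activation="relu", name="fc2")(layers.Dropout(0.5)(fc1))
    out = layers.Dense(num_classes, activation="softmax")(layers.Dropout(0.5)(fc2))
    return models.Model(inp, out)

def spatial_feature_extractor(cnn):
    """Model returning the concatenated ReLU activations of the two FC layers."""
    concat = layers.Concatenate()([cnn.get_layer("fc1").output,
                                   cnn.get_layer("fc2").output])
    return models.Model(cnn.input, concat)

def build_blstm(feat_dim, num_classes):
    """Bidirectional LSTM over the PCA-reduced per-frame feature sequence."""
    inp = layers.Input(shape=(None, feat_dim))
    x = layers.Bidirectional(layers.LSTM(128))(inp)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inp, out)

# PCA keeping 99% of the variance, fitted on training-set features only.
pca = PCA(n_components=0.99)

A typical usage would be to train the CNN on individual frames, extract and PCA-reduce the per-frame features, and then train the BLSTM on the resulting feature sequences.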

3.3 Hierarchical-fusion-based Classification

Two major fusion strategies currently exist, namely feature-level fusion and decision-level fusion. From the previous theoretical analysis, facial movements and body gestures occur simultaneously in the time scale; however, they are not strictly synchronous. Additionally, considering how phrase meanings and topic models contribute to sentence comprehension, we combine the two types of fusion strategies and propose a hierarchical classification fusion strategy based on the support vector machine (SVM) [38]. After implementing the two kinds of fusion strategies on the FABO database, we found that their results were complementary. Thus, we integrated them into a multi-modal hierarchical classification fusion strategy. Fig. 5 depicts the architecture.

Fig. 5. Architecture of the multi-modal hierarchical classification fusion strategy

3.3.1 Feature-level fusion

In the multi-modal hierarchical classification fusion strategy, two fusion strategies are employed for feature-level fusion, namely a neural network and concatenation, whose architectures are shown in the right inset of Fig. 5.

1) Neural network. As shown in the top-right of the figure, we concatenate the features extracted from the skeletons and words of facial movements and body gestures, respectively. We then input them into a neural network that has only two fully connected layers, take the ReLU activations after the first fully connected layer, and finally apply PCA to reduce the dimensionality and noise.

2) Concatenation. As shown in the bottom-right of Fig. 5, we directly perform PCA on the concatenated features of the two modalities to obtain the second type of fused feature.

The feature dimensions for these data are shown in Table 1 below.

Table 1. Feature dimensions after the feature-level fusion strategy
Fusion strategy | Data | Feature dimension
Fusion 1 | Face1 & Body1 | 3
Fusion 1 | Face2 & Body2 | 4
Fusion 2 | Face1 & Body1 | 7
Fusion 2 | Face2 & Body2 |
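A brief sketch of the two feature-level fusion variants follows, under assumed layer sizes: the concatenation variant simply stacks the face and body features and applies PCA, while the neural-network variant passes the concatenated features through two fully connected layers and keeps the first layer's ReLU activations as the fused feature. The hidden size and function names are illustrative assumptions.

import numpy as np
from sklearn.decomposition import PCA
from tensorflow.keras import layers, models

def fuse_by_concatenation(face_feats, body_feats, variance=0.99):
    """Concatenation variant: stack face and body features, then reduce with PCA."""
    fused = np.concatenate([face_feats, body_feats], axis=1)
    return PCA(n_components=variance).fit_transform(fused)

def build_fusion_network(face_dim, body_dim, num_classes, hidden=256):
    """Neural-network variant: two fully connected layers; the ReLU activations
    of 'fusion_fc1' are later extracted as the fused feature (before PCA)."""
    face_in = layers.Input(shape=(face_dim,))
    body_in = layers.Input(shape=(body_dim,))
    x = layers.Concatenate()([face_in, body_in])
    fc1 = layers.Dense(hidden, activation="relu", name="fusion_fc1")(x)
    out = layers.Dense(num_classes, activation="softmax")(fc1)
    return models.Model([face_in, body_in], out)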

3.3.2 Decision-level fusion

For decision-level fusion, we utilize the weighted voting fusion network [41], whose architecture is shown in the left inset of Fig. 5. First, we take the decision values output by the SVMs trained on the two types of mono-modal features and on the two fused features described above. Then, an optimized fusion network is employed on those values instead of a simple voting combination. For example, let d_{ij} be the decision value of a basic classifier (e.g., an SVM), i.e., the prediction probability of a sample belonging to the j-th affect class given the i-th kind of feature. Given m kinds of features and n affect categories, m x n pre-decision values are generated by the basic classifiers; these can be denoted as D = (d_{ij}). For the input D, a function is defined to estimate P(y = k | D), which represents the probability of the category label y = k, for each of the n possible values of k:

P(y = k \mid D) = \frac{\exp\left(\sum_{i=1}^{m} w_{ik} d_{ik}\right)}{\sum_{j=1}^{n} \exp\left(\sum_{i=1}^{m} w_{ij} d_{ij}\right)}    (10)

Here, W contains the m x n weights w_{ij}. Finally, the output is an n-dimensional vector, which represents the n affect-class probabilities for a sample. The most likely label for the final affect recognition is then chosen with a max-win strategy.
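A compact numpy sketch of this decision-level fusion is given below, following the softmax form of the reconstructed Eq. (10). The exact parameterization of the weighted voting fusion network in [41] may differ, so this is an illustrative reading rather than the reference implementation.

import numpy as np

def weighted_voting_fusion(decision_values, W):
    """Fuse an m x n matrix of per-feature, per-class decision values into n
    class probabilities with learned weights W (m x n), as in Eq. (10)."""
    scores = np.sum(W * decision_values, axis=0)   # weighted sum per class
    scores -= scores.max()                         # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def predict_label(decision_values, W):
    """Max-win strategy: choose the class with the highest fused probability."""
    return int(np.argmax(weighted_voting_fusion(decision_values, W)))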

4. EXPERIMENTS

4.1 Experimental Setup

4.1.1 Experimental data

We conducted the experiments on the FABO database, which contains video recordings of facial expressions and body gestures captured simultaneously by two cameras. It is currently the only bi-modal database having both expression annotations and temporal annotations. Two sample videos from the database are shown in Fig. 6.

Fig. 6. Recordings in the FABO database: sample images from (a) an anger affect video of the face (right) and body (left); and (b) a happiness affect video of the face (right) and body (left)

We chose 244 videos in which the ground truth affect labels of the respective face and body videos were the same. As shown in Table 2, the database contained ten emotional states, including basic and non-basic affects. Each video contained two to four complete expression cycles. We chose the first two expression cycles for the experiment, split each video into two videos according to these cycles, and thus obtained 488 videos. We chose approximately half of them for training and the remaining ones for testing.

Table 2. Number of videos available for each affect state in the FABO database
Basic affects (number) | Non-basic affects (number)
happiness (20) | boredom (26)
anger (48) | puzzlement (41)
surprise (15) | uncertainty (19)
fear (19) | anxiety (17)
disgust (23) |
sadness (16) |

4.1.2 Experimental implementation

We employed the Keras [42] implementation for the proposed deep network. In addition, in the MATLAB environment, we used LIBLINEAR [39], which is a widely used open-source machine learning library.
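Since LIBLINEAR is used for the SVM stage, a minimal Python equivalent looks like the sketch below (scikit-learn's LinearSVC wraps LIBLINEAR). The five-fold cross-validated search over C mirrors the training procedure described in Section 4.2.1; the upper end of the search grid is an assumption, because the exact range is not fully recoverable from the text.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def train_linear_svm(train_features, train_labels):
    """Linear SVM with a five-fold cross-validated search over C
    (powers of two starting from 2^-5; the upper bound is assumed)."""
    param_grid = {"C": [2.0 ** k for k in range(-5, 6)]}
    search = GridSearchCV(LinearSVC(), param_grid, cv=5)
    search.fit(train_features, train_labels)
    return search.best_estimator_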

4.2 Experimental Results

4.2.1 Model training and parameter evaluation

After data pre-processing, we first trained the CNN models on the training set using the Keras [42] implementation. Second, we extracted the activation values of the last two fully connected layers, followed by a PCA step. Third, we trained the BLSTM models on the training features from the previous step. Fourth, we extracted the activation values of the last two fully connected layers, followed by a PCA step. Fifth, we trained the linear SVMs on the various features (described in Section 3.2). The SVM models were trained on the training set, and their parameters were tuned on the validation set through five-fold cross-validation over powers of two starting from 2^{-5}. Finally, we trained the fusion models on the training features and results. The initial parameters of all networks were initialized randomly.

4.2.2 Evaluation measures of experimental results

We extracted high-level spatio-temporal hierarchical features with the method introduced in Section 3. In this section, we compare the recognition rates in percentages under different conditions. As the sample distribution was irregular, we chose the macro average accuracy (MAA) as the primary measure to evaluate the results, and the accuracy (ACC) as the second measure. The calculation formulas are given as follows:

MAA = \frac{1}{N} \sum_{i=1}^{N} P_i    (11)

P_i = \frac{T_i}{T_i + F_i}    (12)

ACC = \frac{\sum_{i=1}^{N} T_i}{\sum_{i=1}^{N} (T_i + F_i)}    (13)

where N denotes the number of affect categories, P_i denotes the precision of the i-th affect category, T_i represents the number of correctly classified samples in the i-th affect category, and F_i is the number of incorrectly classified samples in the i-th affect category.
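Both measures are easy to compute from a confusion matrix; the short helper below follows Eqs. (11)-(13) as reconstructed above and is only an illustrative sketch.

import numpy as np

def maa_and_acc(confusion):
    """Compute MAA and ACC from a confusion matrix (rows: ground truth,
    columns: predictions), following Eqs. (11)-(13)."""
    confusion = np.asarray(confusion, dtype=float)
    correct = np.diag(confusion)            # T_i: correct samples per category
    totals = confusion.sum(axis=1)          # T_i + F_i: all samples per category
    per_class = correct / totals            # P_i, Eq. (12)
    maa = per_class.mean()                  # Eq. (11)
    acc = correct.sum() / totals.sum()      # Eq. (13)
    return maa, acc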

4.2.3 Experiments on temporal selection

We first compared three temporal selection strategies for building the video-skeleton. Specifically, we used 1) the apex frame images alone (five apex frames); 2) the clip of the onset-apex-offset sequence (15 frames); and 3) the whole cycle or video of an expression. Examples of temporal selection on the FABO database from facial movements are shown in Fig. 7.

Fig. 7. Examples of temporal selection on the FABO database from facial movements

Table 3. Results of affect recognition on facial movements with different temporal selections
Temporal selection strategy | MAA (%) | ACC (%)
apex phase | |
onset-apex-offset sequence | |
whole cycle | |

Table 4. Results of affect recognition on body gestures with different temporal selections
Temporal selection strategy | MAA (%) | ACC (%)
apex phase | |
onset-apex-offset sequence | |
whole cycle | |

Tables 3 and 4 show the affect recognition results on facial movements and body gestures with different temporal selections, respectively. The results show that, rather than the whole cycle or the apex phase, the best affect recognition is based on the onset-apex-offset sequence. This finding may suggest that, if an affect video is treated as a sentence or an article, its skeleton is what best elucidates it.

4.2.4 Experiments on the sliding window size

For different affects, the most differentiated time sequences may be different, just as the core words of different sentences likely differ. In other words, to understand an affect video, some primal words should be selected around the skeleton. We thus applied sliding windows of different sizes to obtain expressive

phrases from the skeleton. Examples of sliding windows on the FABO database from facial movements are shown in Fig. 8. After the sliding window process, each subsequence (video-word) contains two phases of information (i.e., onset-apex and apex-offset), which represent different dynamic information items for affect recognition. Moreover, we can equalize the distribution of samples by controlling the movement of the sliding window.

Fig. 8. Examples of a sliding window on the FABO database from facial movements

Table 5. Results of affect recognition on facial movements with different-sized sliding windows
t | MAA (%) | ACC (%)
6 | |
7 | |
8 | |
9 | |
10 | |

Table 6. Results of affect recognition on body gestures with different-sized sliding windows
t | MAA (%) | ACC (%)
6 | |
7 | |
8 | |
9 | |
10 | |

Tables 5 and 6 show the results of affect recognition on facial movements and body gestures according to the sliding window size, respectively. In these tables, t denotes the size of the sliding window, which was set from six to ten. The experimental results show that, for facial movements, the best affect recognition is achieved when t = 8; for body gestures, the best affect recognition is achieved when t = 7. For convenience of description, in the remainder of this paper we use the numbers 1 and 2 to denote experiments performed on the materials without and with a sliding window, respectively.

4.2.5 Mono-modal affect recognition results

In this part, we describe affect recognition from each mono-modal information item separately.

Table 7. Comparison of classification performance (ACC) between deep learning and deep-feature-based SVM for each mono-modality
Feature | SVM (%) | Deep learning (%)
Face1 | |
Face2 | |
Body1 | |
Body2 | |

Table 7 presents a comparison of the classification performance (ACC) between end-to-end deep learning and the deep-feature-based SVM for each mono-modality. It can be seen that the deep-feature-based SVM surpassed end-to-end deep learning on the whole. Thus, the SVM was applied in all subsequent experiments.

Table 8. Results of affect recognition on the video-skeleton (Face1, Body1) and video-words (Face2, Body2) of facial movements and body gestures
Measure | Face1 | Face2 | Body1 | Body2
MAA (%) | | | |
ACC (%) | | | |

Table 8 presents the results of affect recognition on the video-skeleton and video-words of facial movements and body gestures. In terms of MAA and ACC, and especially the former, it is evident that affect recognition on video-words is better than that on the video-skeleton.

4.2.6 Bi-modal affect recognition results

Using the most expressive mono-modal affect features, we performed bi-modal fusion experiments employing different fusion strategies. Fig. 9 shows the differences between the affect recognition results on the video-skeleton of mono-modal and bi-modal information with different fusion strategies. Here, Face1 denotes affect recognition on the video-skeleton of facial movements, and Body1 denotes affect recognition on the video-skeleton of body gestures. Multi1-1 denotes affect recognition on the video-skeleton of bi-modal information with the feature-level fusion strategy (neural network), and Multi2-1 represents affect recognition on the video-skeleton of bi-modal information with the feature-level fusion strategy (concatenation). In addition, Multi3-1 denotes affect recognition on the video-skeleton of bi-modal information with the decision-level fusion strategy (weighted voting fusion network), and Multi4-1 represents affect recognition on the video-skeleton of bi-modal information with the hierarchical classification fusion strategy. From the curves of MAA and ACC, it is apparent that affect recognition on bi-modal information is better than that on mono-modal information. In addition, among the bi-modal fusion strategies, Multi4-1, the proposed hierarchical classification fusion strategy, performs best.

Fig. 9. Comparisons of different affect recognition rates on the video-skeleton of mono-modal and bi-modal information using different fusion strategies

Fig. 10. Affect recognition before the sliding window process: confusion matrices of (a) facial movements; (b) body gestures; (c) the feature-level fusion strategy (neural network); (d) the feature-level fusion strategy (concatenation); (e) the decision-level fusion strategy (weighted voting fusion network); and (f) the hierarchical classification fusion strategy. The rows and columns represent the ground truth and the predicted values, respectively. Numbers 1 to 10 represent the ten affect categories.

Fig. 11. Comparisons of different affect recognition rates on the video-skeleton of mono-modal and bi-modal information with different fusion strategies for each affect category

Fig. 10 shows the confusion matrices for affect recognition before the sliding window process. Fig. 11 depicts comparisons between the affect recognition rates on the video-skeleton of mono-modal and bi-modal information with different fusion strategies for each affect category; numbers 1 to 10 represent the ten affect categories. From these two figures, it is apparent that affect recognition on bi-modal information is almost always better than that on mono-modal information for each affect category, and that feature-level and decision-level fusion provide complementary information regarding affect recognition. Thus, among these multi-modal fusion strategies, Multi4-1 performs best.

Fig. 12. Comparisons of different affect recognition rates on video-words of mono-modal and bi-modal information with different fusion strategies

Fig. 12 compares the results of different affects on video-words of mono-modal and bi-modal information with different fusion strategies. Here, Face2 denotes affect recognition on video-words of facial movements, and Body2 represents affect recognition on video-words of body gestures. In addition, Multi1-2 denotes affect recognition on video-words of bi-modal information with the feature-level fusion strategy (neural network), and Multi2-2 represents affect recognition on video-words of bi-modal information with the feature-level fusion strategy (concatenation). Moreover, Multi3-2 denotes affect recognition on video-words of bi-modal information with the decision-level fusion strategy (weighted voting fusion network), and Multi4-2 represents affect recognition on video-words of bi-modal information with the hierarchical classification fusion strategy. Compared with the corresponding results in Fig. 9, it is apparent that the sliding window process can improve the MAA.

Fig. 13. Affect recognition after the sliding window process: confusion matrices of (a) facial movements; (b) body gestures; (c) the feature-level fusion strategy (neural network); (d) the feature-level fusion strategy (concatenation); (e) the decision-level fusion strategy (weighted voting fusion network); and (f) the hierarchical classification fusion strategy. The rows and columns represent the ground truth and the predicted values, respectively. Numbers 1 to 10 represent the ten affect categories.

Fig. 14. Comparisons of different affect recognition rates on video-words of mono-modal and bi-modal information with different fusion strategies for each affect category

Fig. 13 shows the confusion matrices for affect recognition after the sliding window process. Fig. 14 presents the comparisons between the affect recognition rates on video-words of mono-modal and bi-modal information with different fusion strategies for each affect category; numbers 1 to 10 represent the ten affect categories. From these two figures, in addition to the previous conclusions, it is evident that, compared with Fig. 11, the length distribution of the bars is more balanced. In other words, the MAA is improved.

Fig. 15. Comparisons of different bi-modal affect recognition rates on the video-skeleton, video-words, and the video-skeleton plus video-words

Fig. 15 shows the comparisons between the bi-modal affect recognition rates on the video-skeleton, video-words, and the video-skeleton plus video-words. Video-skeleton denotes affect recognition on the video-skeleton of bi-modal information, and video-words denotes affect recognition on video-words of

bi-modal information. Video-skeleton plus video-words denotes affect recognition on bi-modal information with fusion of these two kinds of features. It is apparent that the video-skeleton plus video-words combination performs best. Fig. 16 compares the affect recognition rates on bi-modal information for each category of affects, which are represented by numbers 1 to 10. From the figure, it is evident that bi-modal affect recognition on the video-skeleton plus video-words provides complementary information and thus performs best. In addition, Fig. 17 shows the confusion matrix of our final results for bi-modal affect recognition.

Fig. 16. Comparisons of different affect recognition rates on bi-modal information for each category of affects

Fig. 17. Confusion matrix of our final results for bi-modal affect recognition

4.2.7 Comparison with relevant state-of-the-art methods on the FABO database

In Table 9, we compare the proposed approach with other relevant methods on the FABO database. The work presented in [26] used half of the data for training and the rest for testing. The authors extracted numerous complex features for facial movements and body gestures and obtained an ACC of 82.6% (frame basis) and 85% (video basis) for the recognition of 12 affects. In the work of [27], the videos in each affect category were randomly separated into three subsets; two of them were chosen as training data and the remaining

subset as testing data each time. The method combined MHI-HOG and Image-HOG features through a temporal normalization method to describe the dynamics of facial movements and body gestures. The recognition rate obtained was not more than 75% with the SVM classifier for the recognition of 10 affects. In the work of [32], each sequence was used as input for the model, and both training and testing data were composed of subsequent images of the same subject. The method selected several apex frames for affect recognition. In addition, a new category, Neutral, was created, composed of the frames present in the remaining temporal phases other than the apex frames. A multichannel convolutional neural network (MCCNN) was used to extract hierarchical features, and average recognition rates of 91.3% were obtained for the 11 affects. [57] proposed a neurocomputational model for emotion recognition. It consisted of a deep architecture that implemented convolutional neural networks to learn the location of emotional expressions in a cluttered scene; it obtained an ACC of 95.13% for 10 affects. As mentioned in Section 3.2.2, our model learns the spatio-temporal hierarchical features from videos using a deep network that combines CNN, BLSTM-RNN, and PCA. For fusing the multi-modal information, our model combines feature-level and decision-level fusion. Our proposed model obtained an MAA of up to 99.71% and an ACC of up to 99.57% with the SVM classifier for the recognition of the 10 affect states.

Table 9. Comparison of bi-modal affect recognition results from our model and other relevant approaches on the FABO database
Literature: [26] | [27] | [32] | [57] | Proposed model
Subjects: 10 | Unknown | Unknown | Unknown | 11
Classes: 12 | 10 | 11 | 10 | 10
Fusion strategy: Feature-level or decision-level | Unknown | Feature-level | Unknown | Feature-level and decision-level
Number of data: – videos | 284 videos | 281 videos | Unknown | 488 videos
Classifier: Various | SVM | Softmax | Unknown | SVM
Evaluation measure: ACC | ACC | ACC | ACC | MAA, ACC
Recognition result: ACC = 82.6% (frame basis), 85% (video basis) | ACC = 75% | ACC = 91.3% | ACC = 95.13% | MAA = 99.71%, ACC = 99.57%

4.2.8 Robustness experiments

Lastly, we conducted experiments to verify the robustness of the proposed model. The test data were images with noise, illumination changes, limb occlusion, or artificial occlusion. Some examples are shown in Table 10.

Table 10. Examples of the problematic images (face and body) used in the robustness experiments
Images with noise: salt-and-pepper noise with densities 0.02, 0.05, 0.1, and 0.15
Images with illumination changes: Light-100, Light-50, Light50, and Light100
Images with limb occlusion
Images with artificial occlusion: with sunglasses and with a mask

First, we conducted experiments on these problematic images without any additional pre-processing. Table 11 presents the results. It shows that our model is not very robust to noise and strong illumination changes. This may be because, on the one hand, the experimental environment of the database is ideal and well controlled and, on the other hand, neither denoising nor blocking had been employed in our model. Next, we added some additional pre-processing to the data pre-processing stage, namely median filtering and histogram equalization.
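As an illustration of this additional pre-processing, the snippet below applies a median filter (against salt-and-pepper noise) followed by histogram equalization (against illumination changes) with OpenCV. The kernel size and the grayscale assumption are illustrative choices, not settings stated in the paper.

import cv2

def robust_preprocess(gray_frame, kernel_size=3):
    """Median filtering followed by histogram equalization on an 8-bit
    grayscale frame (illustrative kernel size)."""
    denoised = cv2.medianBlur(gray_frame, kernel_size)
    return cv2.equalizeHist(denoised)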

Table 12 presents the results on the problematic images with this additional pre-processing. As can be seen, the results improved considerably. Therefore, depending on the characteristics of the data, some additional pre-processing is helpful or even necessary.

Table 11. Results on the problematic images without additional pre-processing
Problematic images | ACC (%)
With 0.02 salt-and-pepper noise |
With 0.05 salt-and-pepper noise |
With 0.1 salt-and-pepper noise |
With 0.15 salt-and-pepper noise |
With Light-100 |
With Light-50 |
With Light50 |
With Light100 |
With limb occlusion | 100
With sunglasses | 100
With mask | 100

Table 12. Results on the problematic images with additional pre-processing
Problematic images | ACC (%)
With 0.15 salt-and-pepper noise | 100
With illumination changes |

5. CONCLUSIONS AND FUTURE WORKS

This paper presented a spatio-temporal fusion model that extracts spatio-temporal hierarchical features and employs a proposed multi-modal hierarchical fusion strategy. Our model showed excellent performance in several conducted experiments. From the results, some guidelines for practical applications can be drawn.

1) Affect recognition based on the video-skeleton and video-words of a video performed well; thus, obtaining effective video-skeletons and video-words is a crucial premise for improving the recognition rate. We obtained them based on the temporal annotations of the database.

2) Each independent mono-modal information item may be misleading, whereas combining the modalities in some way can improve the affect recognition rate. In addition, feature-level and decision-level fusion provide complementary information and should be combined. So the model we proposed is


More information

CS 231A Section 1: Linear Algebra & Probability Review. Kevin Tang

CS 231A Section 1: Linear Algebra & Probability Review. Kevin Tang CS 231A Section 1: Linear Algebra & Probability Review Kevin Tang Kevin Tang Section 1-1 9/30/2011 Topics Support Vector Machines Boosting Viola Jones face detector Linear Algebra Review Notation Operations

More information

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018 Deep Learning Sequence to Sequence models: Attention Models 17 March 2018 1 Sequence-to-sequence modelling Problem: E.g. A sequence X 1 X N goes in A different sequence Y 1 Y M comes out Speech recognition:

More information

Material presented. Direct Models for Classification. Agenda. Classification. Classification (2) Classification by machines 6/16/2010.

Material presented. Direct Models for Classification. Agenda. Classification. Classification (2) Classification by machines 6/16/2010. Material presented Direct Models for Classification SCARF JHU Summer School June 18, 2010 Patrick Nguyen (panguyen@microsoft.com) What is classification? What is a linear classifier? What are Direct Models?

More information

TTIC 31230, Fundamentals of Deep Learning David McAllester, April Vanishing and Exploding Gradients. ReLUs. Xavier Initialization

TTIC 31230, Fundamentals of Deep Learning David McAllester, April Vanishing and Exploding Gradients. ReLUs. Xavier Initialization TTIC 31230, Fundamentals of Deep Learning David McAllester, April 2017 Vanishing and Exploding Gradients ReLUs Xavier Initialization Batch Normalization Highway Architectures: Resnets, LSTMs and GRUs Causes

More information

OBJECT DETECTION AND RECOGNITION IN DIGITAL IMAGES

OBJECT DETECTION AND RECOGNITION IN DIGITAL IMAGES OBJECT DETECTION AND RECOGNITION IN DIGITAL IMAGES THEORY AND PRACTICE Bogustaw Cyganek AGH University of Science and Technology, Poland WILEY A John Wiley &. Sons, Ltd., Publication Contents Preface Acknowledgements

More information

Statistical Filters for Crowd Image Analysis

Statistical Filters for Crowd Image Analysis Statistical Filters for Crowd Image Analysis Ákos Utasi, Ákos Kiss and Tamás Szirányi Distributed Events Analysis Research Group, Computer and Automation Research Institute H-1111 Budapest, Kende utca

More information

Global Scene Representations. Tilke Judd

Global Scene Representations. Tilke Judd Global Scene Representations Tilke Judd Papers Oliva and Torralba [2001] Fei Fei and Perona [2005] Labzebnik, Schmid and Ponce [2006] Commonalities Goal: Recognize natural scene categories Extract features

More information

Agenda. Digit Classification using CNN Digit Classification using SAE Visualization: Class models, filters and saliency 2 DCT

Agenda. Digit Classification using CNN Digit Classification using SAE Visualization: Class models, filters and saliency 2 DCT versus 1 Agenda Deep Learning: Motivation Learning: Backpropagation Deep architectures I: Convolutional Neural Networks (CNN) Deep architectures II: Stacked Auto Encoders (SAE) Caffe Deep Learning Toolbox:

More information

CITS 4402 Computer Vision

CITS 4402 Computer Vision CITS 4402 Computer Vision A/Prof Ajmal Mian Adj/A/Prof Mehdi Ravanbakhsh Lecture 06 Object Recognition Objectives To understand the concept of image based object recognition To learn how to match images

More information

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire

More information

Myoelectrical signal classification based on S transform and two-directional 2DPCA

Myoelectrical signal classification based on S transform and two-directional 2DPCA Myoelectrical signal classification based on S transform and two-directional 2DPCA Hong-Bo Xie1 * and Hui Liu2 1 ARC Centre of Excellence for Mathematical and Statistical Frontiers Queensland University

More information

Spatial Transformer Networks

Spatial Transformer Networks BIL722 - Deep Learning for Computer Vision Spatial Transformer Networks Max Jaderberg Andrew Zisserman Karen Simonyan Koray Kavukcuoglu Contents Introduction to Spatial Transformers Related Works Spatial

More information

Feature Design. Feature Design. Feature Design. & Deep Learning

Feature Design. Feature Design. Feature Design. & Deep Learning Artificial Intelligence and its applications Lecture 9 & Deep Learning Professor Daniel Yeung danyeung@ieee.org Dr. Patrick Chan patrickchan@ieee.org South China University of Technology, China Appropriately

More information

EE-559 Deep learning Recurrent Neural Networks

EE-559 Deep learning Recurrent Neural Networks EE-559 Deep learning 11.1. Recurrent Neural Networks François Fleuret https://fleuret.org/ee559/ Sun Feb 24 20:33:31 UTC 2019 Inference from sequences François Fleuret EE-559 Deep learning / 11.1. Recurrent

More information

DETECTING HUMAN ACTIVITIES IN THE ARCTIC OCEAN BY CONSTRUCTING AND ANALYZING SUPER-RESOLUTION IMAGES FROM MODIS DATA INTRODUCTION

DETECTING HUMAN ACTIVITIES IN THE ARCTIC OCEAN BY CONSTRUCTING AND ANALYZING SUPER-RESOLUTION IMAGES FROM MODIS DATA INTRODUCTION DETECTING HUMAN ACTIVITIES IN THE ARCTIC OCEAN BY CONSTRUCTING AND ANALYZING SUPER-RESOLUTION IMAGES FROM MODIS DATA Shizhi Chen and YingLi Tian Department of Electrical Engineering The City College of

More information

An overview of deep learning methods for genomics

An overview of deep learning methods for genomics An overview of deep learning methods for genomics Matthew Ploenzke STAT115/215/BIO/BIST282 Harvard University April 19, 218 1 Snapshot 1. Brief introduction to convolutional neural networks What is deep

More information

EfficientLow-rank Multimodal Fusion With Modality-specific Factors

EfficientLow-rank Multimodal Fusion With Modality-specific Factors EfficientLow-rank Multimodal Fusion With Modality-specific Factors Zhun Liu, Ying Shen, Varun Bharadwaj, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency Artificial Intelligence Multimodal Sentiment and

More information

DYNAMIC TEXTURE RECOGNITION USING ENHANCED LBP FEATURES

DYNAMIC TEXTURE RECOGNITION USING ENHANCED LBP FEATURES DYNAMIC TEXTURE RECOGNITION USING ENHANCED FEATURES Jianfeng Ren BeingThere Centre Institute of Media Innovation Nanyang Technological University 50 Nanyang Drive, Singapore 637553. Xudong Jiang, Junsong

More information

Based on the original slides of Hung-yi Lee

Based on the original slides of Hung-yi Lee Based on the original slides of Hung-yi Lee Google Trends Deep learning obtains many exciting results. Can contribute to new Smart Services in the Context of the Internet of Things (IoT). IoT Services

More information

How to do backpropagation in a brain

How to do backpropagation in a brain How to do backpropagation in a brain Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto & Google Inc. Prelude I will start with three slides explaining a popular type of deep

More information

Brief Introduction of Machine Learning Techniques for Content Analysis

Brief Introduction of Machine Learning Techniques for Content Analysis 1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview

More information

CS 4700: Foundations of Artificial Intelligence

CS 4700: Foundations of Artificial Intelligence CS 4700: Foundations of Artificial Intelligence Prof. Bart Selman selman@cs.cornell.edu Machine Learning: Neural Networks R&N 18.7 Intro & perceptron learning 1 2 Neuron: How the brain works # neurons

More information

Presented By: Omer Shmueli and Sivan Niv

Presented By: Omer Shmueli and Sivan Niv Deep Speaker: an End-to-End Neural Speaker Embedding System Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, Zhenyao Zhu Presented By: Omer Shmueli and Sivan

More information

Random Coattention Forest for Question Answering

Random Coattention Forest for Question Answering Random Coattention Forest for Question Answering Jheng-Hao Chen Stanford University jhenghao@stanford.edu Ting-Po Lee Stanford University tingpo@stanford.edu Yi-Chun Chen Stanford University yichunc@stanford.edu

More information

Hidden Markov Models Part 1: Introduction

Hidden Markov Models Part 1: Introduction Hidden Markov Models Part 1: Introduction CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Modeling Sequential Data Suppose that

More information

Introduction to Convolutional Neural Networks 2018 / 02 / 23

Introduction to Convolutional Neural Networks 2018 / 02 / 23 Introduction to Convolutional Neural Networks 2018 / 02 / 23 Buzzword: CNN Convolutional neural networks (CNN, ConvNet) is a class of deep, feed-forward (not recurrent) artificial neural networks that

More information

Asaf Bar Zvi Adi Hayat. Semantic Segmentation

Asaf Bar Zvi Adi Hayat. Semantic Segmentation Asaf Bar Zvi Adi Hayat Semantic Segmentation Today s Topics Fully Convolutional Networks (FCN) (CVPR 2015) Conditional Random Fields as Recurrent Neural Networks (ICCV 2015) Gaussian Conditional random

More information

Modeling Complex Temporal Composition of Actionlets for Activity Prediction

Modeling Complex Temporal Composition of Actionlets for Activity Prediction Modeling Complex Temporal Composition of Actionlets for Activity Prediction ECCV 2012 Activity Recognition Reading Group Framework of activity prediction What is an Actionlet To segment a long sequence

More information

Robust Sound Event Detection in Continuous Audio Environments

Robust Sound Event Detection in Continuous Audio Environments Robust Sound Event Detection in Continuous Audio Environments Haomin Zhang 1, Ian McLoughlin 2,1, Yan Song 1 1 National Engineering Laboratory of Speech and Language Information Processing The University

More information

Multiple Wavelet Coefficients Fusion in Deep Residual Networks for Fault Diagnosis

Multiple Wavelet Coefficients Fusion in Deep Residual Networks for Fault Diagnosis Multiple Wavelet Coefficients Fusion in Deep Residual Networks for Fault Diagnosis Minghang Zhao, Myeongsu Kang, Baoping Tang, Michael Pecht 1 Backgrounds Accurate fault diagnosis is important to ensure

More information

Learning Recurrent Neural Networks with Hessian-Free Optimization: Supplementary Materials

Learning Recurrent Neural Networks with Hessian-Free Optimization: Supplementary Materials Learning Recurrent Neural Networks with Hessian-Free Optimization: Supplementary Materials Contents 1 Pseudo-code for the damped Gauss-Newton vector product 2 2 Details of the pathological synthetic problems

More information

Face recognition Computer Vision Spring 2018, Lecture 21

Face recognition Computer Vision Spring 2018, Lecture 21 Face recognition http://www.cs.cmu.edu/~16385/ 16-385 Computer Vision Spring 2018, Lecture 21 Course announcements Homework 6 has been posted and is due on April 27 th. - Any questions about the homework?

More information

Human Pose Tracking I: Basics. David Fleet University of Toronto

Human Pose Tracking I: Basics. David Fleet University of Toronto Human Pose Tracking I: Basics David Fleet University of Toronto CIFAR Summer School, 2009 Looking at People Challenges: Complex pose / motion People have many degrees of freedom, comprising an articulated

More information

CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning

CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning Lei Lei Ruoxuan Xiong December 16, 2017 1 Introduction Deep Neural Network

More information

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs) Multilayer Neural Networks (sometimes called Multilayer Perceptrons or MLPs) Linear separability Hyperplane In 2D: w x + w 2 x 2 + w 0 = 0 Feature x 2 = w w 2 x w 0 w 2 Feature 2 A perceptron can separate

More information

Convolutional Neural Networks

Convolutional Neural Networks Convolutional Neural Networks Books» http://www.deeplearningbook.org/ Books http://neuralnetworksanddeeplearning.com/.org/ reviews» http://www.deeplearningbook.org/contents/linear_algebra.html» http://www.deeplearningbook.org/contents/prob.html»

More information

Hierarchical Boosting and Filter Generation

Hierarchical Boosting and Filter Generation January 29, 2007 Plan Combining Classifiers Boosting Neural Network Structure of AdaBoost Image processing Hierarchical Boosting Hierarchical Structure Filters Combining Classifiers Combining Classifiers

More information

Deep Spatio-Temporal Time Series Land Cover Classification

Deep Spatio-Temporal Time Series Land Cover Classification Bachelor s Thesis Exposé Submitter: Arik Ermshaus Submission Date: Thursday 29 th March, 2018 Supervisor: Supervisor: Institution: Dr. rer. nat. Patrick Schäfer Prof. Dr. Ulf Leser Department of Computer

More information

A Method to Improve the Accuracy of Remote Sensing Data Classification by Exploiting the Multi-Scale Properties in the Scene

A Method to Improve the Accuracy of Remote Sensing Data Classification by Exploiting the Multi-Scale Properties in the Scene Proceedings of the 8th International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences Shanghai, P. R. China, June 25-27, 2008, pp. 183-188 A Method to Improve the

More information

CS 3710: Visual Recognition Describing Images with Features. Adriana Kovashka Department of Computer Science January 8, 2015

CS 3710: Visual Recognition Describing Images with Features. Adriana Kovashka Department of Computer Science January 8, 2015 CS 3710: Visual Recognition Describing Images with Features Adriana Kovashka Department of Computer Science January 8, 2015 Plan for Today Presentation assignments + schedule changes Image filtering Feature

More information

arxiv: v2 [cs.sd] 7 Feb 2018

arxiv: v2 [cs.sd] 7 Feb 2018 AUDIO SET CLASSIFICATION WITH ATTENTION MODEL: A PROBABILISTIC PERSPECTIVE Qiuqiang ong*, Yong Xu*, Wenwu Wang, Mark D. Plumbley Center for Vision, Speech and Signal Processing, University of Surrey, U

More information

WaveNet: A Generative Model for Raw Audio

WaveNet: A Generative Model for Raw Audio WaveNet: A Generative Model for Raw Audio Ido Guy & Daniel Brodeski Deep Learning Seminar 2017 TAU Outline Introduction WaveNet Experiments Introduction WaveNet is a deep generative model of raw audio

More information

Face detection and recognition. Detection Recognition Sally

Face detection and recognition. Detection Recognition Sally Face detection and recognition Detection Recognition Sally Face detection & recognition Viola & Jones detector Available in open CV Face recognition Eigenfaces for face recognition Metric learning identification

More information

Neural Networks 2. 2 Receptive fields and dealing with image inputs

Neural Networks 2. 2 Receptive fields and dealing with image inputs CS 446 Machine Learning Fall 2016 Oct 04, 2016 Neural Networks 2 Professor: Dan Roth Scribe: C. Cheng, C. Cervantes Overview Convolutional Neural Networks Recurrent Neural Networks 1 Introduction There

More information

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions?

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions? Online Videos FERPA Sign waiver or sit on the sides or in the back Off camera question time before and after lecture Questions? Lecture 1, Slide 1 CS224d Deep NLP Lecture 4: Word Window Classification

More information

Multiple Similarities Based Kernel Subspace Learning for Image Classification

Multiple Similarities Based Kernel Subspace Learning for Image Classification Multiple Similarities Based Kernel Subspace Learning for Image Classification Wang Yan, Qingshan Liu, Hanqing Lu, and Songde Ma National Laboratory of Pattern Recognition, Institute of Automation, Chinese

More information

SYMBOL RECOGNITION IN HANDWRITTEN MATHEMATI- CAL FORMULAS

SYMBOL RECOGNITION IN HANDWRITTEN MATHEMATI- CAL FORMULAS SYMBOL RECOGNITION IN HANDWRITTEN MATHEMATI- CAL FORMULAS Hans-Jürgen Winkler ABSTRACT In this paper an efficient on-line recognition system for handwritten mathematical formulas is proposed. After formula

More information

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

Modelling Time Series with Neural Networks. Volker Tresp Summer 2017

Modelling Time Series with Neural Networks. Volker Tresp Summer 2017 Modelling Time Series with Neural Networks Volker Tresp Summer 2017 1 Modelling of Time Series The next figure shows a time series (DAX) Other interesting time-series: energy prize, energy consumption,

More information

Tutorial on Methods for Interpreting and Understanding Deep Neural Networks. Part 3: Applications & Discussion

Tutorial on Methods for Interpreting and Understanding Deep Neural Networks. Part 3: Applications & Discussion Tutorial on Methods for Interpreting and Understanding Deep Neural Networks W. Samek, G. Montavon, K.-R. Müller Part 3: Applications & Discussion ICASSP 2017 Tutorial W. Samek, G. Montavon & K.-R. Müller

More information

Towards a Data-driven Approach to Exploring Galaxy Evolution via Generative Adversarial Networks

Towards a Data-driven Approach to Exploring Galaxy Evolution via Generative Adversarial Networks Towards a Data-driven Approach to Exploring Galaxy Evolution via Generative Adversarial Networks Tian Li tian.li@pku.edu.cn EECS, Peking University Abstract Since laboratory experiments for exploring astrophysical

More information

Online Appearance Model Learning for Video-Based Face Recognition

Online Appearance Model Learning for Video-Based Face Recognition Online Appearance Model Learning for Video-Based Face Recognition Liang Liu 1, Yunhong Wang 2,TieniuTan 1 1 National Laboratory of Pattern Recognition Institute of Automation, Chinese Academy of Sciences,

More information

Final Examination CS540-2: Introduction to Artificial Intelligence

Final Examination CS540-2: Introduction to Artificial Intelligence Final Examination CS540-2: Introduction to Artificial Intelligence May 9, 2018 LAST NAME: SOLUTIONS FIRST NAME: Directions 1. This exam contains 33 questions worth a total of 100 points 2. Fill in your

More information

Multi-Class Sentiment Classification for Short Text Sequences

Multi-Class Sentiment Classification for Short Text Sequences Multi-Class Sentiment Classification for Short Text Sequences TIMOTHY LIU KAIHUI SINGAPORE UNIVERSITY OF TECHNOLOGY AND DESIGN What a selfless and courageous hero... Willing to give his life for a total

More information

TUTORIAL PART 1 Unsupervised Learning

TUTORIAL PART 1 Unsupervised Learning TUTORIAL PART 1 Unsupervised Learning Marc'Aurelio Ranzato Department of Computer Science Univ. of Toronto ranzato@cs.toronto.edu Co-organizers: Honglak Lee, Yoshua Bengio, Geoff Hinton, Yann LeCun, Andrew

More information

Natural Language Understanding. Kyunghyun Cho, NYU & U. Montreal

Natural Language Understanding. Kyunghyun Cho, NYU & U. Montreal Natural Language Understanding Kyunghyun Cho, NYU & U. Montreal 2 Machine Translation NEURAL MACHINE TRANSLATION 3 Topics: Statistical Machine Translation log p(f e) =log p(e f) + log p(f) f = (La, croissance,

More information

Autoregressive Neural Models for Statistical Parametric Speech Synthesis

Autoregressive Neural Models for Statistical Parametric Speech Synthesis Autoregressive Neural Models for Statistical Parametric Speech Synthesis シンワン Xin WANG 2018-01-11 contact: wangxin@nii.ac.jp we welcome critical comments, suggestions, and discussion 1 https://www.slideshare.net/kotarotanahashi/deep-learning-library-coyotecnn

More information

A New OCR System Similar to ASR System

A New OCR System Similar to ASR System A ew OCR System Similar to ASR System Abstract Optical character recognition (OCR) system is created using the concepts of automatic speech recognition where the hidden Markov Model is widely used. Results

More information

Aruna Bhat Research Scholar, Department of Electrical Engineering, IIT Delhi, India

Aruna Bhat Research Scholar, Department of Electrical Engineering, IIT Delhi, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 6 ISSN : 2456-3307 Robust Face Recognition System using Non Additive

More information

Toward Correlating and Solving Abstract Tasks Using Convolutional Neural Networks Supplementary Material

Toward Correlating and Solving Abstract Tasks Using Convolutional Neural Networks Supplementary Material Toward Correlating and Solving Abstract Tasks Using Convolutional Neural Networks Supplementary Material Kuan-Chuan Peng Cornell University kp388@cornell.edu Tsuhan Chen Cornell University tsuhan@cornell.edu

More information

Speaker Representation and Verification Part II. by Vasileios Vasilakakis

Speaker Representation and Verification Part II. by Vasileios Vasilakakis Speaker Representation and Verification Part II by Vasileios Vasilakakis Outline -Approaches of Neural Networks in Speaker/Speech Recognition -Feed-Forward Neural Networks -Training with Back-propagation

More information

Improved Performance in Facial Expression Recognition Using 32 Geometric Features

Improved Performance in Facial Expression Recognition Using 32 Geometric Features Improved Performance in Facial Expression Recognition Using 32 Geometric Features Giuseppe Palestra 1(B), Adriana Pettinicchio 2, Marco Del Coco 2, Pierluigi Carcagnì 2, Marco Leo 2, and Cosimo Distante

More information

Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning

Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning Sangdoo Yun 1 Jongwon Choi 1 Youngjoon Yoo 2 Kimin Yun 3 and Jin Young Choi 1 1 ASRI, Dept. of Electrical and Computer Eng.,

More information

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs) Multilayer Neural Networks (sometimes called Multilayer Perceptrons or MLPs) Linear separability Hyperplane In 2D: w 1 x 1 + w 2 x 2 + w 0 = 0 Feature 1 x 2 = w 1 w 2 x 1 w 0 w 2 Feature 2 A perceptron

More information

Convolutional Neural Network Architecture

Convolutional Neural Network Architecture Convolutional Neural Network Architecture Zhisheng Zhong Feburary 2nd, 2018 Zhisheng Zhong Convolutional Neural Network Architecture Feburary 2nd, 2018 1 / 55 Outline 1 Introduction of Convolution Motivation

More information