Affect Recognition from Facial Movements and Body Gestures by Hierarchical Deep Spatio-Temporal Features and Fusion Strategy




Bo Sun, Siming Cao, Jun He*, and Lejun Yu
College of Information Science and Technology, Beijing Normal University, Beijing, China
* Corresponding author. hejun@bnu.edu.cn

Abstract

Affect presentation is periodic and multi-modal; it is conveyed through facial movements, body gestures, and other channels. Studies have shown that temporal selection and multi-modal combination may benefit affect recognition. In this article, we therefore propose a spatio-temporal fusion model that extracts spatio-temporal hierarchical features based on selected expressive components. In addition, a multi-modal hierarchical fusion strategy is presented. Our model learns the spatio-temporal hierarchical features from videos with a proposed deep network, which combines a convolutional neural network (CNN) and a bidirectional long short-term memory recurrent neural network (BLSTM-RNN) with principal component analysis (PCA). Our approach handles each video as a video sentence: it first obtains a skeleton through a temporal selection process, then segments key words with a sliding window of a fixed size, and finally obtains the features of the video-skeleton and video-words with the deep network. Our model combines feature-level and decision-level fusion for fusing the multi-modal information. Experimental results showed that our model improved the multi-modal affect recognition accuracy on the face and body (FABO) database from 95.13% in the existing literature to 99.57%, an increase of 4.44 percentage points, and achieved a macro average accuracy (MAA) of up to 99.71%.

Keywords: affect recognition, deep learning, convolutional neural network, bidirectional long short-term memory recurrent neural network, deep spatio-temporal hierarchical feature, multi-modal feature fusion strategy

1. INTRODUCTION

The ability to recognize affect is an important aspect of computer intelligence. It primarily influences a computer's response to an operator or interlocutor, and it has a wide range of applications in entertainment, industry, transportation, medicine, the military, and many other fields. Over the past few decades, several affect recognition methodologies have been proposed. The research has led to two key trends toward greater practicality, i.e., the use of multi-modal information instead of mono-modal information, and of dynamic videos instead of static images. In this line of research, the American psychologists Ekman et al. initially defined six basic categories of emotions, i.e., anger, disgust, fear, happiness, sadness, and surprise [1]. Several years later, they developed the Facial Action Coding System (FACS) [2]. In this system, a facial expression is deemed the result of facial muscle activity and the combination of many action units (AUs), which describe the correspondence between

facial movements and expressions. The two works marked key milestones in the field and have continued to serve as the basis of emotion recognition, especially of facial emotion recognition research. Subsequently, many related algorithms and systems have been proposed [3-8]. Recently, deep learning methods have become widely used in the field of computer vision. The convolutional neural network (CNN) and the bidirectional long short-term memory recurrent neural network (BLSTM-RNN) are state-of-the-art machine-learning techniques in this area. Fan et al. [50] presented a video-based emotion recognition system using CNN-RNN and C3D hybrid networks. Chen et al. [51] explored two simple, yet effective deep-learning-based methods for image emotion analysis. Noroozi et al. [52] applied a CNN to obtain key frames for summarizing videos.

However, human emotion expression manifests in multi-modal, not mono-modal, information, such as facial movements, body gestures, voice utterances, etc. Each single modality is often ambiguous, uncertain, and incomplete. Metallinou et al. [53] thus examined context-sensitive schemes for emotion recognition in a multi-modal, hierarchical approach based on a bidirectional long short-term memory (BLSTM) neural network. In addition, Kret et al. [9] performed a psychological analysis of body movements for body expression recognition. They showed that using only facial expressions can be misleading, whereas combining them with body expressions can improve the accuracy of emotional state recognition. Moreover, Neverova et al. [47] proposed a method for adaptive multi-modal gesture recognition. They showed that fusing multiple modalities leads to a significant increase in recognition rates and that the information items from the individual channels have complementary characteristics.

In recent years, to advance multi-modal affect recognition, many multi-modal emotion recognition competitions have been organized, such as the Emotion Recognition in the Wild Challenge (EmotiW) [18], the Audio/Visual Emotion Challenge (AVEC) [19], and the Multimodal Emotion Recognition Challenge [20]. Furthermore, the idea of combining multiple modalities for affect recognition has generated a new research topic, specifically determining which modalities should be used and how to integrate them effectively. Some researchers initially focused on fusing visual and audio modalities [11-14]. Later, others explored utilizing audio, visual, and physiological signals synchronously for recognizing affects [15-17]. In this regard, Ambady and Rosenthal suggested that the visual channels, i.e., facial movements and body gestures, are the most important cues for the classification of human behavior [21]. Consequently, some researchers have suggested that fusing these cues can produce better affect recognition results, and corresponding studies have been conducted [22-27].

To date, two major fusion strategies exist, namely feature-level fusion and decision-level fusion. Feature-level fusion directly combines the discriminative ability of multiple features and is assumed to be more suitable for modalities that are almost synchronous in the timescale (e.g., speech and lip movements) [40]. Decision-level fusion combines the discriminative results of multiple features. This approach is assumed to be more suitable for modalities that do not occur simultaneously in the timescale (e.g., speech and body gestures) [40].
In studies based on the face and body (FABO) database, Gunes and Piccardi [26] separately applied feature-level fusion and decision-level fusion. Meanwhile, Barros et al. [32] used a fully connected layer to fuse each multi-modal stream, while Chen et al. [27] used only feature-level fusion. A means of

integrating facial movements and body gestures needs further development.

The actions of facial movements or body gestures comprise a dynamic process, which can be described by four temporal phases: neutral, onset, apex, and offset [10]. Compared with affect recognition based on static images, affect recognition based on videos can exploit spatio-temporal information. In studies based on the FABO database [44], Barros et al. [32] selected several apex frames for spatio-temporal feature extraction, and Chen et al. [27] proposed a framework to extract the temporal dynamic features of face and body gestures from the whole video. However, methods for extracting effective spatio-temporal features require further investigation.

To address the above issues, we propose a spatio-temporal fusion model, which not only extracts high-level spatio-temporal hierarchical features but also includes a multi-modal hierarchical fusion strategy. This paper provides the following key contributions:

1) To extract effective spatio-temporal hierarchical features, we propose a temporal selection approach to obtain expressive materials. First, we employ the onset-apex-offset sequence as a video-skeleton. Then, using a sliding window strategy, we obtain several video-words from the video-skeleton. Our experiments confirm that the proposed temporal selection approach is notably more effective than previous methods.

2) Based on these expressive materials, we extract deep spatio-temporal features from the video-skeleton and the video-words using a proposed network, which combines CNN, BLSTM-RNN, and principal component analysis (PCA).

3) We propose a hierarchical fusion method that combines feature-level fusion and decision-level fusion for visual multi-modal affect recognition based on facial movements and body gestures. It showed excellent performance in the conducted experiments.

We evaluated the proposed method on the FABO database. The proposed method performed better than existing state-of-the-art methods for visual multi-modal affect recognition. The remainder of this paper is organized as follows. Section 2 introduces related work. Section 3 describes the details of the proposed methodology. The performed experiments and extensive experimental results are detailed in Section 4. Finally, our conclusions are given in Section 5.

2. RELATED WORK

In this section, we first review some existing methods of mono-modal affect recognition from facial movements or body gestures. Second, we review some existing works on visual multi-modal recognition.

2.1 Mono-modal Affect Recognition

As mentioned above, the original studies on affect recognition were based on single-mode static images, especially face images. To date, many studies have been conducted on facial expressions. Zhong et al. [3] proposed a method to divide a face image into blocks of different scales; similar and special blocks were then selected from among different expressions through learning to identify the most representative areas. Cheon et al. [4] proposed an algorithm for facial expression recognition based on a

differential active appearance model and manifold learning. Liu et al. [5] proposed an improved deep learning method, which can extract a series of representative facial features through repeated learning and training; a strong boosted classifier with statistical properties is then formed. In addition, Bo et al. [48] extracted several CNN features for continuous affect recognition. They also extracted acoustic features, LBP from three orthogonal planes (LBP-TOP), dense SIFT, and CNN-LSTM features to recognize the emotions of film characters [49]. Guo et al. [55] proposed a multi-modality convolutional neural network (CNN) based on visual and geometrical information for micro-emotion recognition. Schwan et al. [56] described an advanced pre-processing algorithm for facial images and a transfer learning mechanism for face emotion recognition.

Compared to research on facial expressions, few body expression studies have been undertaken. This may be because it is difficult to accurately and reliably define the corresponding relationships of various body gestures to emotional categories. Some researchers in psychology, cognitive science, and computer science have studied emotion recognition based on body gestures, and effective systems have been presented for body emotion recognition. Glowinski et al. [7] studied the association between gestures and affective changes. They first coded the upper extremity changes of the human body and then expressed the emotion through specific gestures. Nicolaou et al. [8] studied head movements. They mapped the angle and direction of the head motion to the emotions in the emotional space to produce emotional activations and expectations. Coulson et al. [45] studied the emotions implied in certain gestures. The results showed that the emotional information contained in a gesture is similar to that found in speech. Moreover, Silva et al. [46] developed a gesture-based emotion recognition system that can automatically identify the emotional state of children in a game. Wang et al. [54] proposed a comprehensive emotion classification framework based on spatio-temporal volumes built with human actions.

2.2 Visual Multi-modal Recognition

Mono-modal affect recognition has certain limitations because human affects are not limited to a single mode; rather, they rely on multi-modal expressions. Following the fundamental study of Ambady and Rosenthal [21], some researchers have advocated that fusing facial movements and body gestures for affect recognition can yield good results. Accordingly, corresponding studies have been conducted in recent years [22-27]. Kapoor and Picard [22] fused face information from videos and information gathered by a gesture sensor to determine the emotional states of children during a game. Gunes and Piccardi [24, 26] performed considerable work in multi-modal emotion recognition based on facial movements and body gestures. They created the FABO multi-modal database [64], including body and facial modalities. Based on the FABO database, Shan et al. [25] employed the Bag of Words (BoW) model to extract spatio-temporal features of body gestures and facial expressions from a video. To maximize the correlation between the two modalities, they used canonical correlation analysis (CCA) to combine the facial movement and body gesture features at the feature level for multi-modal affect recognition. Moreover, to describe the dynamics of facial movements and body gestures for multi-modal affect recognition, Chen et al.
[27] extracted MHI-HOG and Image-HOG features. In addition, they applied a temporal normalization method to solve the problem of temporal alignment between these two modalities.

Furthermore, Balomenos et al. [31] used an HMM-based method to identify six basic emotions from gestures (hands clapping, hands on the head, hands moving and lifting along a certain pattern) and facial features. In addition, Barros et al. [32] used a hierarchical feature representation method based on a multichannel convolutional neural network (MCCNN) to integrate multi-modal affect recognition, and consequently achieved a significant improvement in recognition accuracy. They also proposed a neurocomputational model that learned to attend to emotional expressions and to modulate emotion recognition [57].

Despite the above advancements, limitations remain in the existing affect recognition methods. Because human facial expressions and body gestures are dynamic processes, most existing approaches do not include a temporal selection process to obtain the critical sequence for affect recognition. Although Refs. [26, 27, 32] considered it, their approaches did not perform well. Furthermore, differentiated time sequences for different affects were not considered. Moreover, most existing affect recognition methods consider only a single fusion method, either feature-level or decision-level fusion, whereas the relationships between video-words, video-skeletons, and video sentences are hierarchical. In addition, robustness is important for a model [61-63], which has not been considered by previous works.

In the proposed approach, on the other hand, we adopt several temporal selection methods for affect recognition. Experimental results show that the best affect recognition is based on the onset-apex-offset sequence. Furthermore, we handle each video as a video sentence. The proposed model first obtains its skeleton using temporal selection. It then segments video-words with a sliding window of a fixed size on the video-skeleton. Finally, it obtains the features of the video sentences and video-words with a deep network, which combines CNN, BLSTM-RNN, and PCA. In addition, we propose a multi-modal fusion method combining feature-level fusion and decision-level fusion. The experimental results in Section 4 demonstrate that the proposed framework outperformed the other methods with respect to visual multi-modal affect recognition from facial movements and body gestures.

3. METHODOLOGY

This section introduces the details of our method, which is partly inspired by the relationship between words, skeletons, and sentences in Chinese. Specifically, we treat each video as a video sentence, key clips as video-skeletons, and images as video-words. Using deep learning, we present a spatio-temporal fusion-based classification model, which not only extracts the spatio-temporal hierarchical features but also provides a multi-modal hierarchical fusion strategy. Fig. 1 depicts an overview of our model, which consists of three major parts.

1) Data pre-processing. This part performs face detection and alignment, as well as image size normalization.

2) Feature extraction and representation. This part includes temporal selection, a sliding window, and a proposed deep network, which combines CNN, BLSTM-RNN, and PCA. After the temporal processes, our model learns the high-level spatio-temporal hierarchical features with the deep network.

3) Hierarchical-fusion-based classification. This part introduces a multi-modal hierarchical-fusion-based classification strategy, which combines feature-level and decision-level fusion.

Fig. 1. Overview of the proposed model

3.1 Data Pre-processing

The data pre-processing part is composed of face detection and alignment, as well as image size normalization, for the face and body videos, respectively. For the face videos, the first step is to detect and align the face in all frames. We follow the methods of Refs. [33, 34, 35] to extract and track the face. All frames are aligned to a base face through an affine transformation and are resized to a fixed resolution. Examples of this step on the FABO database are shown in Fig. 2.

Fig. 2. Examples of face detection and alignment on the FABO database

Since the network in the next step requires a uniform image size, size normalization is also necessary for the body videos. Based on experiments, we resize each original image directly to the target resolution using bicubic interpolation.
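To make this step concrete, the following minimal sketch aligns a face frame to a base face via an affine warp and resizes body frames with bicubic interpolation using OpenCV. The function names, the landmark-based transform estimation, and the 224 x 224 target size are illustrative assumptions; the paper does not state these exact settings here.

# Illustrative pre-processing sketch (assumed target size and alignment details).
import cv2
import numpy as np

TARGET_SIZE = (224, 224)  # (width, height), an assumed value

def align_face(frame, detected_pts, base_pts, size=TARGET_SIZE):
    """Warp a frame so that its detected facial landmarks match the base-face landmarks."""
    # Estimate a partial affine (similarity) transform between the two landmark sets.
    M, _ = cv2.estimateAffinePartial2D(np.float32(detected_pts), np.float32(base_pts))
    return cv2.warpAffine(frame, M, size)

def resize_body(frame, size=TARGET_SIZE):
    """Directly resize a body frame using bicubic interpolation."""
    return cv2.resize(frame, size, interpolation=cv2.INTER_CUBIC)

In practice, the base-face landmarks would come from the face detection and tracking methods of Refs. [33, 34, 35].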

3.2 Deep Feature Extraction and Representation

A video is composed of many clips or images. For different affects, the most discriminative time sequences may be different, and they contain much more spatio-temporal information than a single image. Thus, representing this information well is a challenge. Here, we apply a state-of-the-art deep learning approach and propose the previously mentioned deep network, whose structure is detailed in Section 3.2.2. Since selecting the most expressive materials is fundamental, two techniques inspired by natural language processing also play a role; they are introduced in Section 3.2.1.

3.2.1 Preparing expressive materials

Affect representation is temporally periodic. Additionally, a given facial movement or body gesture is a physiologically constrained, phase-wise sequence. It contains a sequence of general temporal phases, i.e., neutral, onset, apex, offset, and neutral, as illustrated in Fig. 3. Generally, the neutral phases of different affects are almost identical, and the other phases, which represent the dynamic change, have the greatest impact on the features obtained from a video. Inspired by previous studies [26, 32] and by Chinese language processing, and considering the relationship between a frame image, a clip, and a video, we treat them as a word, a phrase, and a sentence, respectively.

Temporal selection is important for feature extraction, just as extracting the skeleton is fundamental for understanding a sentence. It is based on temporal segmentation. We performed temporal segmentation in our previous work [43] and achieved accuracy rates of 89.52% and 95.20% for facial movements and body gestures, respectively. In the experiments reported here, we used the ground truth temporal labels to select video frames. Some previous approaches use several apex frames for affect recognition, and others use the whole cycle of an affect. After analysis and experiments, we employ the onset-apex-offset sequence as the video-skeleton.

To further extract the most expressive temporal words from the skeleton, a sliding window of a specific size is applied in our model. Based on experience, widths of seven and eight are applied for body gesture and facial movement videos, respectively. With the sliding window strategy, we obtain several video-words from a sequence, and we then map the results through the following function:

P = \frac{1}{L - t + 1} \sum_{i=1}^{L - t + 1} P_i    (1)

where P denotes the probability matrix of the classification results for a video-skeleton, P_i denotes the probability matrix of the classification results for the i-th video-word taken from that video-skeleton, L denotes the length of the video-skeleton, and t denotes the width of the sliding window. If we use Softmax as the classifier for each video-word, P_i obeys a multinomial distribution rather than an exponential distribution [59] or a Gaussian distribution [60, 65]; if we use an SVM as the classifier for each video-word, P_i obeys a binomial distribution.
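As a minimal sketch of this step, the snippet below keeps the onset-apex-offset frames as the video-skeleton, cuts video-words of width t with a sliding window, and averages the per-word class probabilities, which is one plausible instantiation of the mapping in Eq. (1). The function and variable names are illustrative, not taken from the paper.

import numpy as np

def video_skeleton(frames, phase_labels):
    """Keep only the frames labelled onset, apex, or offset (the video-skeleton)."""
    keep = {"onset", "apex", "offset"}
    return [f for f, p in zip(frames, phase_labels) if p in keep]

def video_words(skeleton, t):
    """Cut overlapping sub-sequences (video-words) of width t from the skeleton."""
    return [skeleton[i:i + t] for i in range(len(skeleton) - t + 1)]

def fuse_word_probabilities(word_probs):
    """Average the per-word class-probability vectors into one skeleton-level
    prediction, one plausible reading of Eq. (1)."""
    return np.mean(np.stack(word_probs), axis=0)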

Fig. 3. Sample of the temporal segments from body gestures (top) and facial movements (bottom)

3.2.2 Learning deep features

After the temporal processing, the main challenge becomes determining a means to obtain effective features of a dynamic sequence. Unlike a static image, we must consider both spatial and temporal information, and extracting and connecting them effectively is a critical problem. Because they come from different domains, we introduce a sub-space transformation algorithm and present our deep network structure combining CNN, BLSTM-RNN, and PCA. Additionally, by considering local and global information together, it learns high-level spatio-temporal hierarchical features. Fig. 4 shows the architecture of our proposed network.

Fig. 4. Architecture of the proposed deep network

To obtain the spatial information, as mentioned above, we start from the architecture of AlexNet [36], which has five convolutional layers and three fully connected layers and uses the rectified linear unit (ReLU) as the activation function. AlexNet was designed for the ImageNet Large Scale Visual Recognition Challenge 2012 [37]. Our proposed CNN has six convolutional layers, three fully connected layers, three max-pooling layers, and two dropout layers. For spatial feature extraction, inspired by our experience in affect recognition challenges and some related works [43][49][58], we concatenate the ReLU activations after the first two fully connected layers. To effectively connect the spatial and temporal domains, we apply PCA to perform a feature space transformation and reduce the high dimensionality (retaining 99% of the variance); it is applied prior to temporal information extraction.

To obtain the temporal information, we employ the BLSTM-RNN, which is a state-of-the-art model for sequence analysis. The description of the BLSTM-RNN given herein is based on Refs. [28], [29], and [30]. Given an input feature sequence x = (x_1, ..., x_T), an LSTM computes the hidden vector sequence h = (h_1, ..., h_T) and the output vector sequence y = (y_1, ..., y_T) by iterating the following equations. Given x_t, h_{t-1}, and c_{t-1}, the LSTM updates for time step t as follows:

Input gate:  i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)    (2)

Forget gate: f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)    (3)

Output gate: o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)    (4)

Cell state:  c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)    (5)

h_t = o_t \odot \tanh(c_t)    (6)

Local and global information can be synthesized by learning the bidirectional property with the BLSTM-RNN. The bidirectional property is achieved by using forward and backward links to connect the data sequence in two separate hidden layers. The BLSTM-RNN combines the forward hidden sequence and the backward hidden sequence to generate the output, which is computed as follows:

\overrightarrow{h}_t = \mathcal{H}(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}})    (7)

\overleftarrow{h}_t = \mathcal{H}(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}})    (8)

y_t = W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y    (9)

For temporal feature extraction, we concatenate the ReLU activations after the first two fully connected layers. Prior to obtaining the temporal information, we apply PCA to lower the feature dimensions (retaining 99% of the variance). Finally, we obtain the high-level spatio-temporal hierarchical features for affect recognition. Note that the features obtained from the video-skeleton of facial movements and body gestures are hereafter referred to as Face1 and Body1, respectively. Similarly, the features obtained from the video-words of facial movements and body gestures are hereafter referred to as Face2 and Body2, respectively.
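The following Keras sketch illustrates the overall feature pipeline described above: frame-level CNN features (concatenated ReLU activations of two fully connected layers), PCA keeping 99% of the variance, and a bidirectional LSTM over the reduced per-frame features. The layer counts and sizes are illustrative assumptions and do not reproduce the paper's exact six-convolutional-layer network.

# Sketch of the CNN -> PCA -> BLSTM feature pipeline (assumed layer sizes).
from sklearn.decomposition import PCA
from tensorflow.keras import layers, models

def build_frame_cnn(num_classes, input_shape=(224, 224, 3)):
    """Small CNN in the spirit of AlexNet; the paper's network has six conv layers."""
    inp = layers.Input(shape=input_shape)
    x = inp
    for filters in (64, 128, 256):
        x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    fc1 = layers.Dense(1024, activation="relu", name="fc1")(x)
    fc2 = layers.Dense(512, activation="relu", name="fc2")(layers.Dropout(0.5)(fc1))
    out = layers.Dense(num_classes, activation="softmax")(layers.Dropout(0.5)(fc2))
    return models.Model(inp, out)

def spatial_feature_extractor(cnn):
    """Model returning the concatenated ReLU activations of the two FC layers."""
    concat = layers.Concatenate()([cnn.get_layer("fc1").output,
                                   cnn.get_layer("fc2").output])
    return models.Model(cnn.input, concat)

def build_blstm(feat_dim, num_classes):
    """Bidirectional LSTM over the PCA-reduced per-frame feature sequence."""
    inp = layers.Input(shape=(None, feat_dim))
    x = layers.Bidirectional(layers.LSTM(128))(inp)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inp, out)

# PCA keeping 99% of the variance, fitted on training-set features only.
pca = PCA(n_components=0.99)

A typical usage would be to train the CNN on individual frames, extract and PCA-reduce the per-frame features, and then train the BLSTM on the resulting feature sequences.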

3.3 Hierarchical-fusion-based Classification

Two major fusion strategies currently exist, namely feature-level fusion and decision-level fusion. From the previous theoretical analysis, facial movements and body gestures occur simultaneously in the time scale; however, they are not strictly synchronous. Additionally, considering how phrase meanings and topic models contribute to sentence comprehension, we combine the two types of fusion strategies and propose a hierarchical classification fusion strategy based on the support vector machine (SVM) [38]. After implementing the two kinds of fusion strategies on the FABO database, we found that their results were complementary. Thus, we integrated them into a multi-modal hierarchical classification fusion strategy. Fig. 5 depicts the architecture.

Fig. 5. Architecture of the multi-modal hierarchical classification fusion strategy

3.3.1 Feature-level fusion

In the multi-modal hierarchical classification fusion strategy, two fusion strategies are employed for feature-level fusion, namely a neural network and concatenation, whose architectures are shown in the right inset of Fig. 5.

1) Neural network. As shown in the top-right of the figure, we concatenate the features extracted from the skeletons and words of facial movements and body gestures, respectively. We then input them into a neural network that has only two fully connected layers, take the ReLU activations after the first fully connected layer, and finally apply PCA to reduce the dimensionality and noise.

2) Concatenation. As shown in the bottom-right of Fig. 5, we directly perform PCA on the concatenated features of the two modalities to obtain the second type of fused feature.

The feature dimensions for these data are shown in Table 1 below.

Table 1. Feature dimensions after the feature-level fusion strategy
Fusion strategy | Data | Feature dimension
Fusion 1 | Face1 & Body1 | 3
Fusion 1 | Face2 & Body2 | 4
Fusion 2 | Face1 & Body1 | 7
Fusion 2 | Face2 & Body2 |
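A brief sketch of the two feature-level fusion variants follows, under assumed layer sizes: the concatenation variant simply stacks the face and body features and applies PCA, while the neural-network variant passes the concatenated features through two fully connected layers and keeps the first layer's ReLU activations as the fused feature. The hidden size and function names are illustrative assumptions.

import numpy as np
from sklearn.decomposition import PCA
from tensorflow.keras import layers, models

def fuse_by_concatenation(face_feats, body_feats, variance=0.99):
    """Concatenation variant: stack face and body features, then reduce with PCA."""
    fused = np.concatenate([face_feats, body_feats], axis=1)
    return PCA(n_components=variance).fit_transform(fused)

def build_fusion_network(face_dim, body_dim, num_classes, hidden=256):
    """Neural-network variant: two fully connected layers; the ReLU activations
    of 'fusion_fc1' are later extracted as the fused feature (before PCA)."""
    face_in = layers.Input(shape=(face_dim,))
    body_in = layers.Input(shape=(body_dim,))
    x = layers.Concatenate()([face_in, body_in])
    fc1 = layers.Dense(hidden, activation="relu", name="fusion_fc1")(x)
    out = layers.Dense(num_classes, activation="softmax")(fc1)
    return models.Model([face_in, body_in], out)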

3.3.2 Decision-level fusion

For decision-level fusion, we utilize the weighted voting fusion network [41], whose architecture is shown in the left inset of Fig. 5. First, we take the decision values output by the SVMs trained on the two types of mono-modal features and on the two fused features described above. Then, an optimized fusion network is employed on those values instead of a simple voting combination. For example, let d_{ij} be the decision value of a basic classifier (e.g., an SVM), i.e., the prediction probability of a sample belonging to the j-th affect class given the i-th kind of feature. Given m kinds of features and n affect categories, m x n pre-decision values are generated by the basic classifiers; these can be denoted as D = (d_{ij}). For the input D, a function is defined to estimate P(y = k | D), which represents the probability of the category label y = k, for each of the n possible values of k:

P(y = k \mid D) = \frac{\exp\left(\sum_{i=1}^{m} w_{ik} d_{ik}\right)}{\sum_{j=1}^{n} \exp\left(\sum_{i=1}^{m} w_{ij} d_{ij}\right)}    (10)

Here, W contains the m x n weights w_{ij}. Finally, the output is an n-dimensional vector, which represents the n affect-class probabilities for a sample. The most likely label for the final affect recognition is then chosen with a max-win strategy.
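A compact numpy sketch of this decision-level fusion is given below, following the softmax form of the reconstructed Eq. (10). The exact parameterization of the weighted voting fusion network in [41] may differ, so this is an illustrative reading rather than the reference implementation.

import numpy as np

def weighted_voting_fusion(decision_values, W):
    """Fuse an m x n matrix of per-feature, per-class decision values into n
    class probabilities with learned weights W (m x n), as in Eq. (10)."""
    scores = np.sum(W * decision_values, axis=0)   # weighted sum per class
    scores -= scores.max()                         # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def predict_label(decision_values, W):
    """Max-win strategy: choose the class with the highest fused probability."""
    return int(np.argmax(weighted_voting_fusion(decision_values, W)))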

4. EXPERIMENTS

4.1 Experimental Setup

4.1.1 Experimental data

We conducted the experiments on the FABO database, which contains video recordings of facial expressions and body gestures captured simultaneously by two cameras. It is currently the only bi-modal database having both expression annotations and temporal annotations. Two sample videos from the database are shown in Fig. 6.

Fig. 6. Recordings in the FABO database: sample images from (a) an anger affect video of the face (right) and body (left); and (b) a happiness affect video of the face (right) and body (left)

We chose 244 videos in which the ground truth affect labels of the respective face and body videos were the same. As shown in Table 2, the database contained ten emotional states, including basic and non-basic affects. Each video contained two to four complete expression cycles. We chose the first two expression cycles for the experiment, split each video into two videos according to these cycles, and thus obtained 488 videos. We chose approximately half of them for training and the remaining ones for testing.

Table 2. Number of videos available for each affect state in the FABO database
Basic affects (number) | Non-basic affects (number)
happiness (20) | boredom (26)
anger (48) | puzzlement (41)
surprise (15) | uncertainty (19)
fear (19) | anxiety (17)
disgust (23) |
sadness (16) |

4.1.2 Experimental implementation

We employed the Keras [42] implementation for the proposed deep network. In addition, in the MATLAB environment, we used LIBLINEAR [39], which is a widely used open-source machine learning library.
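Since LIBLINEAR is used for the SVM stage, a minimal Python equivalent looks like the sketch below (scikit-learn's LinearSVC wraps LIBLINEAR). The five-fold cross-validated search over C mirrors the training procedure described in Section 4.2.1; the upper end of the search grid is an assumption, because the exact range is not fully recoverable from the text.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def train_linear_svm(train_features, train_labels):
    """Linear SVM with a five-fold cross-validated search over C
    (powers of two starting from 2^-5; the upper bound is assumed)."""
    param_grid = {"C": [2.0 ** k for k in range(-5, 6)]}
    search = GridSearchCV(LinearSVC(), param_grid, cv=5)
    search.fit(train_features, train_labels)
    return search.best_estimator_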

4.2 Experimental Results

4.2.1 Model training and parameter evaluation

After data pre-processing, we first trained the CNN models on the training set using the Keras [42] implementation. Second, we extracted the activation values of the last two fully connected layers, followed by a PCA step. Third, we trained the BLSTM models on the training features from the previous step. Fourth, we extracted the activation values of the last two fully connected layers, followed by a PCA step. Fifth, we trained the linear SVMs on the various features (described in Section 3.2). The SVM models were trained on the training set, and their parameters were tuned on the validation set through five-fold cross-validation over powers of two starting from 2^{-5}. Finally, we trained the fusion models on the training features and results. The initial parameters of all networks were initialized randomly.

4.2.2 Evaluation measures of experimental results

We extracted high-level spatio-temporal hierarchical features with the method introduced in Section 3. In this section, we compare the recognition rates in percentages under different conditions. As the sample distribution was irregular, we chose the macro average accuracy (MAA) as the primary measure to evaluate the results, and the accuracy (ACC) as the second measure. The calculation formulas are given as follows:

MAA = \frac{1}{N} \sum_{i=1}^{N} P_i    (11)

P_i = \frac{T_i}{T_i + F_i}    (12)

ACC = \frac{\sum_{i=1}^{N} T_i}{\sum_{i=1}^{N} (T_i + F_i)}    (13)

where N denotes the number of affect categories, P_i denotes the precision of the i-th affect category, T_i represents the number of correctly classified samples in the i-th affect category, and F_i is the number of incorrectly classified samples in the i-th affect category.
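Both measures are easy to compute from a confusion matrix; the short helper below follows Eqs. (11)-(13) as reconstructed above and is only an illustrative sketch.

import numpy as np

def maa_and_acc(confusion):
    """Compute MAA and ACC from a confusion matrix (rows: ground truth,
    columns: predictions), following Eqs. (11)-(13)."""
    confusion = np.asarray(confusion, dtype=float)
    correct = np.diag(confusion)            # T_i: correct samples per category
    totals = confusion.sum(axis=1)          # T_i + F_i: all samples per category
    per_class = correct / totals            # P_i, Eq. (12)
    maa = per_class.mean()                  # Eq. (11)
    acc = correct.sum() / totals.sum()      # Eq. (13)
    return maa, acc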

4.2.3 Experiments on temporal selection

We first compared three temporal selection strategies for building the video-skeleton. Specifically, we used 1) the apex frame images alone (five apex frames); 2) the clip of the onset-apex-offset sequence (15 frames); and 3) the whole cycle or video of an expression. Examples of temporal selection on the FABO database from facial movements are shown in Fig. 7.

Fig. 7. Examples of temporal selection on the FABO database from facial movements

Table 3. Results of affect recognition on facial movements with different temporal selections
Temporal selection strategy | MAA (%) | ACC (%)
apex phase | |
onset-apex-offset sequence | |
whole cycle | |

Table 4. Results of affect recognition on body gestures with different temporal selections
Temporal selection strategy | MAA (%) | ACC (%)
apex phase | |
onset-apex-offset sequence | |
whole cycle | |

Tables 3 and 4 show the affect recognition results on facial movements and body gestures with different temporal selections, respectively. The results show that, rather than the whole cycle or the apex phase, the best affect recognition is based on the onset-apex-offset sequence. This finding may suggest that, if an affect video is treated as a sentence or an article, its skeleton is what best elucidates it.

4.2.4 Experiments on the sliding window size

For different affects, the most differentiated time sequences may be different, just as the core words of different sentences likely differ. In other words, to understand an affect video, some primal words should be selected around the skeleton. We thus applied sliding windows of different sizes to obtain expressive

phrases from the skeleton. Examples of sliding windows on the FABO database from facial movements are shown in Fig. 8. After the sliding window process, each subsequence (video-word) contains two phases of information (i.e., onset-apex and apex-offset), which represent different dynamic information items for affect recognition. Moreover, we can equalize the distribution of samples by controlling the movement of the sliding window.

Fig. 8. Examples of a sliding window on the FABO database from facial movements

Table 5. Results of affect recognition on facial movements with different-sized sliding windows
t | MAA (%) | ACC (%)
6 | |
7 | |
8 | |
9 | |
10 | |

Table 6. Results of affect recognition on body gestures with different-sized sliding windows
t | MAA (%) | ACC (%)
6 | |
7 | |
8 | |
9 | |
10 | |

Tables 5 and 6 show the results of affect recognition on facial movements and body gestures according to the sliding window size, respectively. In these tables, t denotes the size of the sliding window, which was set from six to ten. The experimental results show that, for facial movements, the best affect recognition is achieved when t = 8; for body gestures, the best affect recognition is achieved when t = 7. For convenience of description, in the remainder of this paper we use the numbers 1 and 2 to denote experiments performed on the materials without and with a sliding window, respectively.

4.2.5 Mono-modal affect recognition results

In this part, we describe affect recognition from each mono-modal information item separately.

Table 7. Comparison of classification performance (ACC) between deep learning and deep-feature-based SVM for each mono-modality
Feature | SVM (%) | Deep learning (%)
Face1 | |
Face2 | |
Body1 | |
Body2 | |

Table 7 presents a comparison of the classification performance (ACC) between end-to-end deep learning and the deep-feature-based SVM for each mono-modality. It can be seen that the deep-feature-based SVM surpassed end-to-end deep learning on the whole. Thus, the SVM was applied in all subsequent experiments.

Table 8. Results of affect recognition on the video-skeleton (Face1, Body1) and video-words (Face2, Body2) of facial movements and body gestures
Measure | Face1 | Face2 | Body1 | Body2
MAA (%) | | | |
ACC (%) | | | |

Table 8 presents the results of affect recognition on the video-skeleton and video-words of facial movements and body gestures. In terms of MAA and ACC, and especially the former, it is evident that affect recognition on video-words is better than that on the video-skeleton.

4.2.6 Bi-modal affect recognition results

Using the most expressive mono-modal affect features, we performed bi-modal fusion experiments employing different fusion strategies. Fig. 9 shows the differences between the affect recognition results on the video-skeleton of mono-modal and bi-modal information with different fusion strategies. Here, Face1 denotes affect recognition on the video-skeleton of facial movements, and Body1 denotes affect recognition on the video-skeleton of body gestures. Multi1-1 denotes affect recognition on the video-skeleton of bi-modal information with the feature-level fusion strategy (neural network), and Multi2-1 represents affect recognition on the video-skeleton of bi-modal information with the feature-level fusion strategy (concatenation). In addition, Multi3-1 denotes affect recognition on the video-skeleton of bi-modal information with the decision-level fusion strategy (weighted voting fusion network), and Multi4-1 represents affect recognition on the video-skeleton of bi-modal information with the hierarchical classification fusion strategy. From the curves of MAA and ACC, it is apparent that affect recognition on bi-modal information is better than that on mono-modal information. In addition, among the bi-modal fusion strategies, Multi4-1, the proposed hierarchical classification fusion strategy, performs best.

Fig. 9. Comparisons of different affect recognition rates on the video-skeleton of mono-modal and bi-modal information using different fusion strategies

Fig. 10. Affect recognition before the sliding window process: confusion matrices of (a) facial movements; (b) body gestures; (c) the feature-level fusion strategy (neural network); (d) the feature-level fusion strategy (concatenation); (e) the decision-level fusion strategy (weighted voting fusion network); and (f) the hierarchical classification fusion strategy. The rows and columns represent the ground truth and the predicted values, respectively. Numbers 1 to 10 represent the ten affect categories.

Fig. 11. Comparisons of different affect recognition rates on the video-skeleton of mono-modal and bi-modal information with different fusion strategies for each affect category

Fig. 10 shows the confusion matrices for affect recognition before the sliding window process. Fig. 11 depicts comparisons between the affect recognition rates on the video-skeleton of mono-modal and bi-modal information with different fusion strategies for each affect category; numbers 1 to 10 represent the ten affect categories. From these two figures, it is apparent that affect recognition on bi-modal information is almost always better than that on mono-modal information for each affect category, and that feature-level and decision-level fusion provide complementary information regarding affect recognition. Thus, among these multi-modal fusion strategies, Multi4-1 performs best.

Fig. 12. Comparisons of different affect recognition rates on video-words of mono-modal and bi-modal information with different fusion strategies

Fig. 12 compares the results of different affects on video-words of mono-modal and bi-modal information with different fusion strategies. Here, Face2 denotes affect recognition on video-words of facial movements, and Body2 represents affect recognition on video-words of body gestures. In addition, Multi1-2 denotes affect recognition on video-words of bi-modal information with the feature-level fusion strategy (neural network), and Multi2-2 represents affect recognition on video-words of bi-modal information with the feature-level fusion strategy (concatenation). Moreover, Multi3-2 denotes affect recognition on video-words of bi-modal information with the decision-level fusion strategy (weighted voting fusion network), and Multi4-2 represents affect recognition on video-words of bi-modal information with the hierarchical classification fusion strategy. Compared with the corresponding results in Fig. 9, it is apparent that the sliding window process can improve the MAA.

Fig. 13. Affect recognition after the sliding window process: confusion matrices of (a) facial movements; (b) body gestures; (c) the feature-level fusion strategy (neural network); (d) the feature-level fusion strategy (concatenation); (e) the decision-level fusion strategy (weighted voting fusion network); and (f) the hierarchical classification fusion strategy. The rows and columns represent the ground truth and the predicted values, respectively. Numbers 1 to 10 represent the ten affect categories.

Fig. 14. Comparisons of different affect recognition rates on video-words of mono-modal and bi-modal information with different fusion strategies for each affect category

Fig. 13 shows the confusion matrices for affect recognition after the sliding window process. Fig. 14 presents the comparisons between the affect recognition rates on video-words of mono-modal and bi-modal information with different fusion strategies for each affect category; numbers 1 to 10 represent the ten affect categories. From these two figures, in addition to the previous conclusions, it is evident that, compared with Fig. 11, the length distribution of the bars is more balanced. In other words, the MAA is improved.

Fig. 15. Comparisons of different bi-modal affect recognition rates on the video-skeleton, video-words, and the video-skeleton plus video-words

Fig. 15 shows the comparisons between the bi-modal affect recognition rates on the video-skeleton, video-words, and the video-skeleton plus video-words. Video-skeleton denotes affect recognition on the video-skeleton of bi-modal information, and video-words denotes affect recognition on video-words of

bi-modal information. Video-skeleton plus video-words denotes affect recognition on bi-modal information with fusion of these two kinds of features. It is apparent that the video-skeleton plus video-words combination performs best. Fig. 16 compares the affect recognition rates on bi-modal information for each category of affects, which are represented by numbers 1 to 10. From the figure, it is evident that bi-modal affect recognition on the video-skeleton plus video-words provides complementary information and thus performs best. In addition, Fig. 17 shows the confusion matrix of our final results for bi-modal affect recognition.

Fig. 16. Comparisons of different affect recognition rates on bi-modal information for each category of affects

Fig. 17. Confusion matrix of our final results for bi-modal affect recognition

4.2.7 Comparison with relevant state-of-the-art methods on the FABO database

In Table 9, we compare the proposed approach with other relevant methods on the FABO database. The work presented in [26] used half of the data for training and the rest for testing. The authors extracted numerous complex features for facial movements and body gestures and obtained an ACC of 82.6% (frame basis) and 85% (video basis) for the recognition of 12 affects. In the work of [27], the videos in each affect category were randomly separated into three subsets; two of them were chosen as training data and the remaining

subset as testing data each time. The method combined MHI-HOG and Image-HOG features through a temporal normalization method to describe the dynamics of facial movements and body gestures. The recognition rate obtained was not more than 75% with the SVM classifier for the recognition of 10 affects. In the work of [32], each sequence was used as input for the model, and both training and testing data were composed of subsequent images of the same subject. The method selected several apex frames for affect recognition. In addition, a new category, Neutral, was created, composed of the frames present in the remaining temporal phases other than the apex frames. A multichannel convolutional neural network (MCCNN) was used to extract hierarchical features, and average recognition rates of 91.3% were obtained for the 11 affects. [57] proposed a neurocomputational model for emotion recognition. It consisted of a deep architecture that implemented convolutional neural networks to learn the location of emotional expressions in a cluttered scene; it obtained an ACC of 95.13% for 10 affects. As mentioned in Section 3.2.2, our model learns the spatio-temporal hierarchical features from videos using a deep network that combines CNN, BLSTM-RNN, and PCA. For fusing the multi-modal information, our model combines feature-level and decision-level fusion. Our proposed model obtained an MAA of up to 99.71% and an ACC of up to 99.57% with the SVM classifier for the recognition of the 10 affect states.

Table 9. Comparison of bi-modal affect recognition results from our model and other relevant approaches on the FABO database
Literature: [26] | [27] | [32] | [57] | Proposed model
Subjects: 10 | Unknown | Unknown | Unknown | 11
Classes: 12 | 10 | 11 | 10 | 10
Fusion strategy: Feature-level or decision-level | Unknown | Feature-level | Unknown | Feature-level and decision-level
Number of data: – videos | 284 videos | 281 videos | Unknown | 488 videos
Classifier: Various | SVM | Softmax | Unknown | SVM
Evaluation measure: ACC | ACC | ACC | ACC | MAA, ACC
Recognition result: ACC = 82.6% (frame basis), 85% (video basis) | ACC = 75% | ACC = 91.3% | ACC = 95.13% | MAA = 99.71%, ACC = 99.57%

4.2.8 Robustness experiments

Lastly, we conducted experiments to verify the robustness of the proposed model. The test data were images with noise, illumination changes, limb occlusion, or artificial occlusion. Some examples are shown in Table 10.

Table 10. Examples of the problematic images (face and body) used in the robustness experiments
Images with noise: salt-and-pepper noise with densities 0.02, 0.05, 0.1, and 0.15
Images with illumination changes: Light-100, Light-50, Light50, and Light100
Images with limb occlusion
Images with artificial occlusion: with sunglasses and with a mask

First, we conducted experiments on these problematic images without any additional pre-processing. Table 11 presents the results. It shows that our model is not very robust to noise and strong illumination changes. This may be because, on the one hand, the experimental environment of the database is ideal and well controlled and, on the other hand, neither denoising nor blocking had been employed in our model. Next, we added some additional pre-processing to the data pre-processing stage, namely median filtering and histogram equalization.
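As an illustration of this additional pre-processing, the snippet below applies a median filter (against salt-and-pepper noise) followed by histogram equalization (against illumination changes) with OpenCV. The kernel size and the grayscale assumption are illustrative choices, not settings stated in the paper.

import cv2

def robust_preprocess(gray_frame, kernel_size=3):
    """Median filtering followed by histogram equalization on an 8-bit
    grayscale frame (illustrative kernel size)."""
    denoised = cv2.medianBlur(gray_frame, kernel_size)
    return cv2.equalizeHist(denoised)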

Table 12 presents the results on the problematic images with this additional pre-processing. As can be seen, the results improved considerably. Therefore, depending on the characteristics of the data, some additional pre-processing is helpful or even necessary.

Table 11. Results on the problematic images without additional pre-processing
Problematic images | ACC (%)
With 0.02 salt-and-pepper noise |
With 0.05 salt-and-pepper noise |
With 0.1 salt-and-pepper noise |
With 0.15 salt-and-pepper noise |
With Light-100 |
With Light-50 |
With Light50 |
With Light100 |
With limb occlusion | 100
With sunglasses | 100
With mask | 100

Table 12. Results on the problematic images with additional pre-processing
Problematic images | ACC (%)
With 0.15 salt-and-pepper noise | 100
With illumination changes |

5. CONCLUSIONS AND FUTURE WORKS

This paper presented a spatio-temporal fusion model that extracts spatio-temporal hierarchical features and employs a proposed multi-modal hierarchical fusion strategy. Our model showed excellent performance in several conducted experiments. From the results, some guidelines for practical applications can be drawn.

1) Affect recognition based on the video-skeleton and video-words of a video performed well; thus, obtaining effective video-skeletons and video-words is a crucial premise for improving the recognition rate. We obtained them based on the temporal annotations of the database.

2) Each independent mono-modal information item may be misleading, whereas combining the modalities in some way can improve the affect recognition rate. In addition, feature-level and decision-level fusion provide complementary information and should be combined. So the model we proposed is


More information

CS 231A Section 1: Linear Algebra & Probability Review. Kevin Tang

CS 231A Section 1: Linear Algebra & Probability Review. Kevin Tang CS 231A Section 1: Linear Algebra & Probability Review Kevin Tang Kevin Tang Section 1-1 9/30/2011 Topics Support Vector Machines Boosting Viola Jones face detector Linear Algebra Review Notation Operations

More information

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018 Deep Learning Sequence to Sequence models: Attention Models 17 March 2018 1 Sequence-to-sequence modelling Problem: E.g. A sequence X 1 X N goes in A different sequence Y 1 Y M comes out Speech recognition:

More information

Material presented. Direct Models for Classification. Agenda. Classification. Classification (2) Classification by machines 6/16/2010.

Material presented. Direct Models for Classification. Agenda. Classification. Classification (2) Classification by machines 6/16/2010. Material presented Direct Models for Classification SCARF JHU Summer School June 18, 2010 Patrick Nguyen (panguyen@microsoft.com) What is classification? What is a linear classifier? What are Direct Models?

More information

TTIC 31230, Fundamentals of Deep Learning David McAllester, April Vanishing and Exploding Gradients. ReLUs. Xavier Initialization

TTIC 31230, Fundamentals of Deep Learning David McAllester, April Vanishing and Exploding Gradients. ReLUs. Xavier Initialization TTIC 31230, Fundamentals of Deep Learning David McAllester, April 2017 Vanishing and Exploding Gradients ReLUs Xavier Initialization Batch Normalization Highway Architectures: Resnets, LSTMs and GRUs Causes

More information

OBJECT DETECTION AND RECOGNITION IN DIGITAL IMAGES

OBJECT DETECTION AND RECOGNITION IN DIGITAL IMAGES OBJECT DETECTION AND RECOGNITION IN DIGITAL IMAGES THEORY AND PRACTICE Bogustaw Cyganek AGH University of Science and Technology, Poland WILEY A John Wiley &. Sons, Ltd., Publication Contents Preface Acknowledgements

More information

Statistical Filters for Crowd Image Analysis

Statistical Filters for Crowd Image Analysis Statistical Filters for Crowd Image Analysis Ákos Utasi, Ákos Kiss and Tamás Szirányi Distributed Events Analysis Research Group, Computer and Automation Research Institute H-1111 Budapest, Kende utca

More information

Global Scene Representations. Tilke Judd

Global Scene Representations. Tilke Judd Global Scene Representations Tilke Judd Papers Oliva and Torralba [2001] Fei Fei and Perona [2005] Labzebnik, Schmid and Ponce [2006] Commonalities Goal: Recognize natural scene categories Extract features

More information

Agenda. Digit Classification using CNN Digit Classification using SAE Visualization: Class models, filters and saliency 2 DCT

Agenda. Digit Classification using CNN Digit Classification using SAE Visualization: Class models, filters and saliency 2 DCT versus 1 Agenda Deep Learning: Motivation Learning: Backpropagation Deep architectures I: Convolutional Neural Networks (CNN) Deep architectures II: Stacked Auto Encoders (SAE) Caffe Deep Learning Toolbox:

More information

CITS 4402 Computer Vision

CITS 4402 Computer Vision CITS 4402 Computer Vision A/Prof Ajmal Mian Adj/A/Prof Mehdi Ravanbakhsh Lecture 06 Object Recognition Objectives To understand the concept of image based object recognition To learn how to match images

More information

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire

More information

Myoelectrical signal classification based on S transform and two-directional 2DPCA

Myoelectrical signal classification based on S transform and two-directional 2DPCA Myoelectrical signal classification based on S transform and two-directional 2DPCA Hong-Bo Xie1 * and Hui Liu2 1 ARC Centre of Excellence for Mathematical and Statistical Frontiers Queensland University

More information

Spatial Transformer Networks

Spatial Transformer Networks BIL722 - Deep Learning for Computer Vision Spatial Transformer Networks Max Jaderberg Andrew Zisserman Karen Simonyan Koray Kavukcuoglu Contents Introduction to Spatial Transformers Related Works Spatial

More information

Feature Design. Feature Design. Feature Design. & Deep Learning

Feature Design. Feature Design. Feature Design. & Deep Learning Artificial Intelligence and its applications Lecture 9 & Deep Learning Professor Daniel Yeung danyeung@ieee.org Dr. Patrick Chan patrickchan@ieee.org South China University of Technology, China Appropriately

More information

EE-559 Deep learning Recurrent Neural Networks

EE-559 Deep learning Recurrent Neural Networks EE-559 Deep learning 11.1. Recurrent Neural Networks François Fleuret https://fleuret.org/ee559/ Sun Feb 24 20:33:31 UTC 2019 Inference from sequences François Fleuret EE-559 Deep learning / 11.1. Recurrent

More information

DETECTING HUMAN ACTIVITIES IN THE ARCTIC OCEAN BY CONSTRUCTING AND ANALYZING SUPER-RESOLUTION IMAGES FROM MODIS DATA INTRODUCTION

DETECTING HUMAN ACTIVITIES IN THE ARCTIC OCEAN BY CONSTRUCTING AND ANALYZING SUPER-RESOLUTION IMAGES FROM MODIS DATA INTRODUCTION DETECTING HUMAN ACTIVITIES IN THE ARCTIC OCEAN BY CONSTRUCTING AND ANALYZING SUPER-RESOLUTION IMAGES FROM MODIS DATA Shizhi Chen and YingLi Tian Department of Electrical Engineering The City College of

More information

An overview of deep learning methods for genomics

An overview of deep learning methods for genomics An overview of deep learning methods for genomics Matthew Ploenzke STAT115/215/BIO/BIST282 Harvard University April 19, 218 1 Snapshot 1. Brief introduction to convolutional neural networks What is deep

More information

EfficientLow-rank Multimodal Fusion With Modality-specific Factors

EfficientLow-rank Multimodal Fusion With Modality-specific Factors EfficientLow-rank Multimodal Fusion With Modality-specific Factors Zhun Liu, Ying Shen, Varun Bharadwaj, Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency Artificial Intelligence Multimodal Sentiment and

More information

DYNAMIC TEXTURE RECOGNITION USING ENHANCED LBP FEATURES

DYNAMIC TEXTURE RECOGNITION USING ENHANCED LBP FEATURES DYNAMIC TEXTURE RECOGNITION USING ENHANCED FEATURES Jianfeng Ren BeingThere Centre Institute of Media Innovation Nanyang Technological University 50 Nanyang Drive, Singapore 637553. Xudong Jiang, Junsong

More information

Based on the original slides of Hung-yi Lee

Based on the original slides of Hung-yi Lee Based on the original slides of Hung-yi Lee Google Trends Deep learning obtains many exciting results. Can contribute to new Smart Services in the Context of the Internet of Things (IoT). IoT Services

More information

How to do backpropagation in a brain

How to do backpropagation in a brain How to do backpropagation in a brain Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto & Google Inc. Prelude I will start with three slides explaining a popular type of deep

More information

Brief Introduction of Machine Learning Techniques for Content Analysis

Brief Introduction of Machine Learning Techniques for Content Analysis 1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview

More information

CS 4700: Foundations of Artificial Intelligence

CS 4700: Foundations of Artificial Intelligence CS 4700: Foundations of Artificial Intelligence Prof. Bart Selman selman@cs.cornell.edu Machine Learning: Neural Networks R&N 18.7 Intro & perceptron learning 1 2 Neuron: How the brain works # neurons

More information

Presented By: Omer Shmueli and Sivan Niv

Presented By: Omer Shmueli and Sivan Niv Deep Speaker: an End-to-End Neural Speaker Embedding System Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, Zhenyao Zhu Presented By: Omer Shmueli and Sivan

More information

Random Coattention Forest for Question Answering

Random Coattention Forest for Question Answering Random Coattention Forest for Question Answering Jheng-Hao Chen Stanford University jhenghao@stanford.edu Ting-Po Lee Stanford University tingpo@stanford.edu Yi-Chun Chen Stanford University yichunc@stanford.edu

More information

Hidden Markov Models Part 1: Introduction

Hidden Markov Models Part 1: Introduction Hidden Markov Models Part 1: Introduction CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Modeling Sequential Data Suppose that

More information

Introduction to Convolutional Neural Networks 2018 / 02 / 23

Introduction to Convolutional Neural Networks 2018 / 02 / 23 Introduction to Convolutional Neural Networks 2018 / 02 / 23 Buzzword: CNN Convolutional neural networks (CNN, ConvNet) is a class of deep, feed-forward (not recurrent) artificial neural networks that

More information

Asaf Bar Zvi Adi Hayat. Semantic Segmentation

Asaf Bar Zvi Adi Hayat. Semantic Segmentation Asaf Bar Zvi Adi Hayat Semantic Segmentation Today s Topics Fully Convolutional Networks (FCN) (CVPR 2015) Conditional Random Fields as Recurrent Neural Networks (ICCV 2015) Gaussian Conditional random

More information

Modeling Complex Temporal Composition of Actionlets for Activity Prediction

Modeling Complex Temporal Composition of Actionlets for Activity Prediction Modeling Complex Temporal Composition of Actionlets for Activity Prediction ECCV 2012 Activity Recognition Reading Group Framework of activity prediction What is an Actionlet To segment a long sequence

More information

Robust Sound Event Detection in Continuous Audio Environments

Robust Sound Event Detection in Continuous Audio Environments Robust Sound Event Detection in Continuous Audio Environments Haomin Zhang 1, Ian McLoughlin 2,1, Yan Song 1 1 National Engineering Laboratory of Speech and Language Information Processing The University

More information

Multiple Wavelet Coefficients Fusion in Deep Residual Networks for Fault Diagnosis

Multiple Wavelet Coefficients Fusion in Deep Residual Networks for Fault Diagnosis Multiple Wavelet Coefficients Fusion in Deep Residual Networks for Fault Diagnosis Minghang Zhao, Myeongsu Kang, Baoping Tang, Michael Pecht 1 Backgrounds Accurate fault diagnosis is important to ensure

More information

Learning Recurrent Neural Networks with Hessian-Free Optimization: Supplementary Materials

Learning Recurrent Neural Networks with Hessian-Free Optimization: Supplementary Materials Learning Recurrent Neural Networks with Hessian-Free Optimization: Supplementary Materials Contents 1 Pseudo-code for the damped Gauss-Newton vector product 2 2 Details of the pathological synthetic problems

More information

Face recognition Computer Vision Spring 2018, Lecture 21

Face recognition Computer Vision Spring 2018, Lecture 21 Face recognition http://www.cs.cmu.edu/~16385/ 16-385 Computer Vision Spring 2018, Lecture 21 Course announcements Homework 6 has been posted and is due on April 27 th. - Any questions about the homework?

More information

Human Pose Tracking I: Basics. David Fleet University of Toronto

Human Pose Tracking I: Basics. David Fleet University of Toronto Human Pose Tracking I: Basics David Fleet University of Toronto CIFAR Summer School, 2009 Looking at People Challenges: Complex pose / motion People have many degrees of freedom, comprising an articulated

More information

CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning

CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning CS 229 Project Final Report: Reinforcement Learning for Neural Network Architecture Category : Theory & Reinforcement Learning Lei Lei Ruoxuan Xiong December 16, 2017 1 Introduction Deep Neural Network

More information

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs) Multilayer Neural Networks (sometimes called Multilayer Perceptrons or MLPs) Linear separability Hyperplane In 2D: w x + w 2 x 2 + w 0 = 0 Feature x 2 = w w 2 x w 0 w 2 Feature 2 A perceptron can separate

More information

Convolutional Neural Networks

Convolutional Neural Networks Convolutional Neural Networks Books» http://www.deeplearningbook.org/ Books http://neuralnetworksanddeeplearning.com/.org/ reviews» http://www.deeplearningbook.org/contents/linear_algebra.html» http://www.deeplearningbook.org/contents/prob.html»

More information

Hierarchical Boosting and Filter Generation

Hierarchical Boosting and Filter Generation January 29, 2007 Plan Combining Classifiers Boosting Neural Network Structure of AdaBoost Image processing Hierarchical Boosting Hierarchical Structure Filters Combining Classifiers Combining Classifiers

More information

Deep Spatio-Temporal Time Series Land Cover Classification

Deep Spatio-Temporal Time Series Land Cover Classification Bachelor s Thesis Exposé Submitter: Arik Ermshaus Submission Date: Thursday 29 th March, 2018 Supervisor: Supervisor: Institution: Dr. rer. nat. Patrick Schäfer Prof. Dr. Ulf Leser Department of Computer

More information

A Method to Improve the Accuracy of Remote Sensing Data Classification by Exploiting the Multi-Scale Properties in the Scene

A Method to Improve the Accuracy of Remote Sensing Data Classification by Exploiting the Multi-Scale Properties in the Scene Proceedings of the 8th International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences Shanghai, P. R. China, June 25-27, 2008, pp. 183-188 A Method to Improve the

More information

CS 3710: Visual Recognition Describing Images with Features. Adriana Kovashka Department of Computer Science January 8, 2015

CS 3710: Visual Recognition Describing Images with Features. Adriana Kovashka Department of Computer Science January 8, 2015 CS 3710: Visual Recognition Describing Images with Features Adriana Kovashka Department of Computer Science January 8, 2015 Plan for Today Presentation assignments + schedule changes Image filtering Feature

More information

arxiv: v2 [cs.sd] 7 Feb 2018

arxiv: v2 [cs.sd] 7 Feb 2018 AUDIO SET CLASSIFICATION WITH ATTENTION MODEL: A PROBABILISTIC PERSPECTIVE Qiuqiang ong*, Yong Xu*, Wenwu Wang, Mark D. Plumbley Center for Vision, Speech and Signal Processing, University of Surrey, U

More information

WaveNet: A Generative Model for Raw Audio

WaveNet: A Generative Model for Raw Audio WaveNet: A Generative Model for Raw Audio Ido Guy & Daniel Brodeski Deep Learning Seminar 2017 TAU Outline Introduction WaveNet Experiments Introduction WaveNet is a deep generative model of raw audio

More information

Face detection and recognition. Detection Recognition Sally

Face detection and recognition. Detection Recognition Sally Face detection and recognition Detection Recognition Sally Face detection & recognition Viola & Jones detector Available in open CV Face recognition Eigenfaces for face recognition Metric learning identification

More information

Neural Networks 2. 2 Receptive fields and dealing with image inputs

Neural Networks 2. 2 Receptive fields and dealing with image inputs CS 446 Machine Learning Fall 2016 Oct 04, 2016 Neural Networks 2 Professor: Dan Roth Scribe: C. Cheng, C. Cervantes Overview Convolutional Neural Networks Recurrent Neural Networks 1 Introduction There

More information

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions?

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions? Online Videos FERPA Sign waiver or sit on the sides or in the back Off camera question time before and after lecture Questions? Lecture 1, Slide 1 CS224d Deep NLP Lecture 4: Word Window Classification

More information

Multiple Similarities Based Kernel Subspace Learning for Image Classification

Multiple Similarities Based Kernel Subspace Learning for Image Classification Multiple Similarities Based Kernel Subspace Learning for Image Classification Wang Yan, Qingshan Liu, Hanqing Lu, and Songde Ma National Laboratory of Pattern Recognition, Institute of Automation, Chinese

More information

SYMBOL RECOGNITION IN HANDWRITTEN MATHEMATI- CAL FORMULAS

SYMBOL RECOGNITION IN HANDWRITTEN MATHEMATI- CAL FORMULAS SYMBOL RECOGNITION IN HANDWRITTEN MATHEMATI- CAL FORMULAS Hans-Jürgen Winkler ABSTRACT In this paper an efficient on-line recognition system for handwritten mathematical formulas is proposed. After formula

More information

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD WHAT IS A NEURAL NETWORK? The simplest definition of a neural network, more properly referred to as an 'artificial' neural network (ANN), is provided

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

Modelling Time Series with Neural Networks. Volker Tresp Summer 2017

Modelling Time Series with Neural Networks. Volker Tresp Summer 2017 Modelling Time Series with Neural Networks Volker Tresp Summer 2017 1 Modelling of Time Series The next figure shows a time series (DAX) Other interesting time-series: energy prize, energy consumption,

More information

Tutorial on Methods for Interpreting and Understanding Deep Neural Networks. Part 3: Applications & Discussion

Tutorial on Methods for Interpreting and Understanding Deep Neural Networks. Part 3: Applications & Discussion Tutorial on Methods for Interpreting and Understanding Deep Neural Networks W. Samek, G. Montavon, K.-R. Müller Part 3: Applications & Discussion ICASSP 2017 Tutorial W. Samek, G. Montavon & K.-R. Müller

More information

Towards a Data-driven Approach to Exploring Galaxy Evolution via Generative Adversarial Networks

Towards a Data-driven Approach to Exploring Galaxy Evolution via Generative Adversarial Networks Towards a Data-driven Approach to Exploring Galaxy Evolution via Generative Adversarial Networks Tian Li tian.li@pku.edu.cn EECS, Peking University Abstract Since laboratory experiments for exploring astrophysical

More information

Online Appearance Model Learning for Video-Based Face Recognition

Online Appearance Model Learning for Video-Based Face Recognition Online Appearance Model Learning for Video-Based Face Recognition Liang Liu 1, Yunhong Wang 2,TieniuTan 1 1 National Laboratory of Pattern Recognition Institute of Automation, Chinese Academy of Sciences,

More information

Final Examination CS540-2: Introduction to Artificial Intelligence

Final Examination CS540-2: Introduction to Artificial Intelligence Final Examination CS540-2: Introduction to Artificial Intelligence May 9, 2018 LAST NAME: SOLUTIONS FIRST NAME: Directions 1. This exam contains 33 questions worth a total of 100 points 2. Fill in your

More information

Multi-Class Sentiment Classification for Short Text Sequences

Multi-Class Sentiment Classification for Short Text Sequences Multi-Class Sentiment Classification for Short Text Sequences TIMOTHY LIU KAIHUI SINGAPORE UNIVERSITY OF TECHNOLOGY AND DESIGN What a selfless and courageous hero... Willing to give his life for a total

More information

TUTORIAL PART 1 Unsupervised Learning

TUTORIAL PART 1 Unsupervised Learning TUTORIAL PART 1 Unsupervised Learning Marc'Aurelio Ranzato Department of Computer Science Univ. of Toronto ranzato@cs.toronto.edu Co-organizers: Honglak Lee, Yoshua Bengio, Geoff Hinton, Yann LeCun, Andrew

More information

Natural Language Understanding. Kyunghyun Cho, NYU & U. Montreal

Natural Language Understanding. Kyunghyun Cho, NYU & U. Montreal Natural Language Understanding Kyunghyun Cho, NYU & U. Montreal 2 Machine Translation NEURAL MACHINE TRANSLATION 3 Topics: Statistical Machine Translation log p(f e) =log p(e f) + log p(f) f = (La, croissance,

More information

Autoregressive Neural Models for Statistical Parametric Speech Synthesis

Autoregressive Neural Models for Statistical Parametric Speech Synthesis Autoregressive Neural Models for Statistical Parametric Speech Synthesis シンワン Xin WANG 2018-01-11 contact: wangxin@nii.ac.jp we welcome critical comments, suggestions, and discussion 1 https://www.slideshare.net/kotarotanahashi/deep-learning-library-coyotecnn

More information

A New OCR System Similar to ASR System

A New OCR System Similar to ASR System A ew OCR System Similar to ASR System Abstract Optical character recognition (OCR) system is created using the concepts of automatic speech recognition where the hidden Markov Model is widely used. Results

More information

Aruna Bhat Research Scholar, Department of Electrical Engineering, IIT Delhi, India

Aruna Bhat Research Scholar, Department of Electrical Engineering, IIT Delhi, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 6 ISSN : 2456-3307 Robust Face Recognition System using Non Additive

More information

Toward Correlating and Solving Abstract Tasks Using Convolutional Neural Networks Supplementary Material

Toward Correlating and Solving Abstract Tasks Using Convolutional Neural Networks Supplementary Material Toward Correlating and Solving Abstract Tasks Using Convolutional Neural Networks Supplementary Material Kuan-Chuan Peng Cornell University kp388@cornell.edu Tsuhan Chen Cornell University tsuhan@cornell.edu

More information

Speaker Representation and Verification Part II. by Vasileios Vasilakakis

Speaker Representation and Verification Part II. by Vasileios Vasilakakis Speaker Representation and Verification Part II by Vasileios Vasilakakis Outline -Approaches of Neural Networks in Speaker/Speech Recognition -Feed-Forward Neural Networks -Training with Back-propagation

More information

Improved Performance in Facial Expression Recognition Using 32 Geometric Features

Improved Performance in Facial Expression Recognition Using 32 Geometric Features Improved Performance in Facial Expression Recognition Using 32 Geometric Features Giuseppe Palestra 1(B), Adriana Pettinicchio 2, Marco Del Coco 2, Pierluigi Carcagnì 2, Marco Leo 2, and Cosimo Distante

More information

Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning

Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning Sangdoo Yun 1 Jongwon Choi 1 Youngjoon Yoo 2 Kimin Yun 3 and Jin Young Choi 1 1 ASRI, Dept. of Electrical and Computer Eng.,

More information

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs) Multilayer Neural Networks (sometimes called Multilayer Perceptrons or MLPs) Linear separability Hyperplane In 2D: w 1 x 1 + w 2 x 2 + w 0 = 0 Feature 1 x 2 = w 1 w 2 x 1 w 0 w 2 Feature 2 A perceptron

More information

Convolutional Neural Network Architecture

Convolutional Neural Network Architecture Convolutional Neural Network Architecture Zhisheng Zhong Feburary 2nd, 2018 Zhisheng Zhong Convolutional Neural Network Architecture Feburary 2nd, 2018 1 / 55 Outline 1 Introduction of Convolution Motivation

More information