Multimodal context analysis and prediction


Valeria Tomaselli (valeria.tomaselli@st.com), Sebastiano Battiato, Giovanni Maria Farinella, Tiziana Rotondo (PhD student)

Outline
Context analysis vs prediction
Multimodality
Objective and Challenges
Some notions on Neural Networks
State of the Art
Our work (preliminary analysis)
Conclusions and Future work

Context Understanding vs Prediction
UNDERSTANDING: it is the first step to build intelligent machines that provide assistive support to people's everyday life. "You've had an accident. I'm calling for help!"
PREDICTION: it will enable many real-world applications: foreseeing abnormal situations before they happen; improving the way machines interact with humans. "You're having an accident. I'm slowing down your car!"

Multi-modality
Humans perceive the context around them through a variety of stimuli. Machines can also exploit data coming from different sensors, such as camera, microphone, accelerometers, gyroscope, etc.

Objective
Each input modality is characterized by distinct statistical properties, and a joint representation can benefit from the combination of multiple modalities. The objective is to merge different modalities to improve context understanding and prediction.

Challenges
Handle multi-modal data: dataset availability; manage different sampling rates; select the best fusion strategy (early fusion vs late fusion).
Prediction: learn a representation space suitable for prediction; achieve performance comparable to context understanding.

Some notions on Neural Networks

Neural Networks (1/2)
Consider a supervised learning problem where we have access to labeled training examples $(x^{(i)}, y^{(i)})$. Neural networks give a way of defining a complex, non-linear form of hypotheses $h_{W,b}(x)$, with parameters $W, b$ that we can fit to our data.
The neuron is a computational unit that outputs $h_{W,b}(x) = f(W^T x) = f\left(\sum_{i=1}^{3} W_i x_i + b\right)$, where $f$ is called the activation function (sigmoid, ReLU, etc.). A neural network puts together several neurons.
Training proceeds in 2 phases:
Feed forward: $a^{(l+1)} = f(W^{(l)} a^{(l)} + b^{(l)})$
Backpropagation: suppose we have a fixed training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$. We define the overall cost function to be
$$J(W, b) = \underbrace{\frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - y^{(i)} \right\|^2}_{\text{average sum-of-squares error term}} + \underbrace{\frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(W_{ji}^{(l)}\right)^2}_{\text{regularization term}}$$
Our goal is to minimize $J(W, b)$ as a function of $W$ and $b$.
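As a concrete illustration of the definitions above, the following is a minimal sketch (not part of the original presentation) of a one-hidden-layer network with sigmoid activations and the regularized sum-of-squares cost $J(W, b)$; the layer sizes and random toy data are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Feed-forward pass: a^(l+1) = f(W^(l) a^(l) + b^(l))."""
    W1, b1, W2, b2 = params
    a1 = x                       # input layer
    a2 = sigmoid(W1 @ a1 + b1)   # hidden layer
    a3 = sigmoid(W2 @ a2 + b2)   # output layer, h_{W,b}(x)
    return a3

def cost(X, Y, params, lam=1e-3):
    """Average sum-of-squares error plus (lambda/2) * sum of squared weights."""
    W1, b1, W2, b2 = params
    m = X.shape[1]
    err = sum(0.5 * np.sum((forward(X[:, i], params) - Y[:, i]) ** 2)
              for i in range(m)) / m
    reg = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return err + reg

# Toy example: 3 inputs, 4 hidden units, 2 outputs, 5 training examples.
rng = np.random.default_rng(0)
params = (rng.normal(size=(4, 3)), np.zeros(4),
          rng.normal(size=(2, 4)), np.zeros(2))
X, Y = rng.normal(size=(3, 5)), rng.normal(size=(2, 5))
print(cost(X, Y, params))
```

Backpropagation would then compute the gradients of this cost with respect to $W$ and $b$ to drive gradient descent.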

Neural Networks (2/2)
Neural networks are the state-of-the-art approach for understanding complex sensory data such as images, videos, speech, audio, voice, music, etc. Common architectures include:
Feed forward neural network
Convolutional neural network (CNN)
Autoencoder neural network
Recurrent neural network (RNN)
Long Short Term Memory (LSTM) neural network

State of the Art

Paper 1: Multimodal Deep Learning (1/2)
Authors: J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, A. Y. Ng, International Conference on Machine Learning, 2011
Input: sequences of video frames and sequences of audio signals.
They use different models for feature learning: they train Restricted Boltzmann Machines for (a) audio and (b) video separately as a baseline; the shallow model (c) is limited, and they find that this model is unable to capture correlations across the modalities; the bimodal deep belief network (DBN) model (d) is trained in a greedy layer-wise fashion by first training models (a) and (b).

Paper 1: Multimodal Deep Learning (2/2)
Then, they use deep autoencoder models. A video-only model is shown in (a), where the model learns to reconstruct both modalities given only video as the input. A similar model can be drawn for the audio-only setting. They also train the (b) bimodal deep autoencoder in a denoising fashion, using an augmented dataset with examples that require the network to reconstruct both modalities given only one.

Paper 2: Anticipating Daily Intention using On-Wrist Motion Triggered Sensing (1/2)
Authors: Tz-Ying Wu, Ting-An Chien, Cheng-Sheng Chan, Chan-Wei Hu, Min Sun, ICCV 2017
Input: sequences of video frames and action sequences.
The authors use this model for predicting daily intentions. Their model consists of an RNN with an LSTM cell encoder and a Policy Network. At each frame, the RNN generates an anticipated intention according to a new embedded representation g and the previous hidden state h of the RNN; the policy generates the motion-trigger decision a for the next frame, based on the motion representation and the hidden state h of the RNN.

Paper 2: Anticipating Daily Intention using On-Wrist Motion Triggered Sensing (2/2)
The joint loss is $L = L_P + \lambda L_A$.
They use an exponential loss to train their RNN-based model. The anticipation loss $L_A$ is defined as
$$L_A = \sum_{t=1}^{T} L_A^t = -\sum_{t=1}^{T} \log\big(p_t(y_{gt})\big)\, e^{\log(0.1)\,\frac{T-t}{T}}$$
where $y_{gt}$ is the ground truth intention and $T$ is the time when the intention is reached.
They define a policy loss function
$$L_P = -\frac{1}{KT} \sum_{k=1}^{K} \sum_{t=1}^{T} \log\big(\pi(a_t^k \mid h_t^k, f_{m,t}^k; W_P)\big)\, R_t^k$$
where $\{a_t^k\}_t$ is the $k$-th sequence of triggered patterns sampled from $\pi(\cdot)$, $K$ is the number of sequences, and $T$ is the time when the intention is reached. $R_t^k$ is the reward of the $k$-th sampled sequence at time $t$, computed from
$$R_t = \begin{cases} p_t(y_{gt}) \cdot R^+ \cdot \frac{1}{n_T}, & \text{if } y = y_{gt} \\ p_t(y_{gt}) \cdot R^- \cdot \frac{1}{n_T}, & \text{if } y \neq y_{gt} \end{cases}$$
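The following is an illustrative sketch, based on our reconstruction of the anticipation loss above (not the authors' code): a frame-wise cross-entropy term whose weight grows exponentially as $t$ approaches $T$, so that early, uncertain frames are penalized less than frames close to the moment the intention is reached.

```python
import numpy as np

def anticipation_loss(probs_gt):
    """probs_gt[t-1] = p_t(y_gt), the predicted probability of the ground-truth
    intention at frame t (t = 1..T). Weighting schedule follows the formula above."""
    T = len(probs_gt)
    t = np.arange(1, T + 1)
    weights = np.exp(np.log(0.1) * (T - t) / T)   # rises towards 1.0 at t = T
    return float(np.sum(-np.log(probs_gt) * weights))

# Example: confidence in the true intention rising over 5 frames.
print(anticipation_loss(np.array([0.2, 0.3, 0.5, 0.7, 0.9])))
```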

Paper 3: Jointly Learning Energy Expenditures and Activities using Egocentric Multimodal Signals (1/2)
Authors: Katsuyuki Nakamura, Serena Yeung, Alexandre Alahi, Li Fei-Fei, CVPR 2017
Input: $V = \{v_1, v_2, \ldots, v_T\}$, a sequence of video frames; $A = \{a_1, a_2, \ldots, a_T\}$, a sequence of triaxial acceleration signals.
The authors train a model to jointly perform activity detection and energy expenditure regression; the heart rate signal is used as self-supervision to derive the energy expenditure.

Paper 3: Jointly Learning Energy Expenditures and Activities using Egocentric Multimodal Signals (2/2)
A multi-task loss is used to jointly optimize activity detection $y_t^a$ and energy expenditure regression $y_t^e$ at each time step.
Training data: $(x_t, y_t^a, y_t^e)$, where $x_t \in \mathbb{R}^d$ is the input feature vector, $y_t^a \in \mathbb{R}^{24}$ is the ground truth activity label, and $y_t^e \in \mathbb{R}$ is the derived energy expenditure ($y_t^e = \alpha\, hr_t + \beta\, weight + \gamma\, hr_t \cdot weight$).
Loss function: $L = L_{act} + \lambda L_{EE}$, where the first term $L_{act}$ is a cross-entropy loss for activity detection and the second term $L_{EE}$ is a Huber loss for energy expenditure regression:
$$L_{EE} = \begin{cases} \frac{1}{2} r^2 & \text{if } |r| \le \delta \\ \delta \left( |r| - \frac{1}{2}\delta \right) & \text{otherwise} \end{cases}$$
where $r = y_t^e - \hat{y}_t^e$.
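A minimal sketch of the Huber loss $L_{EE}$ defined above (quadratic for small residuals, linear for large ones); the value of delta and the example numbers are arbitrary assumptions.

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Elementwise Huber loss: 0.5*r^2 if |r| <= delta, else delta*(|r| - 0.5*delta)."""
    r = y_true - y_pred
    quad = 0.5 * r ** 2
    lin = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quad, lin)

# Example: small residuals are penalized quadratically, the outlier only linearly.
print(huber(np.array([2.0, 2.0, 2.0]), np.array([1.8, 2.3, 6.0])))
```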

Our work (preliminary analysis)

Chosen multi-modal dataset: Stanford-ECM Dataset (from Paper 3)
The Stanford-ECM Dataset comprises 31 hours of egocentric video (113 videos) augmented with heart rate and acceleration data.
The individual videos range from 3 minutes to about 51 minutes in length.
The mobile phone collected egocentric video at 720x1280 resolution and 30 fps, as well as triaxial acceleration at 30 Hz.
A wrist-worn heart rate sensor was used to capture the heart rate every 5 seconds (0.2 Hz).
The dataset is annotated with 25 different activities; it has been conceived for multimodal classification (not prediction).

ECM Dataset Adaptation for Prediction
We identified only 2 types of transitions suitable for training/test: Unknown/Activity and Activity/Unknown.
The dataset has been cut around transitions: 64 frames before and 64 frames after each transition.
We selected 9 activities, besides Unknown: Bicycling, Playing With Children, Walking, Strolling, Food Preparation, Talking Standing, Talking Sitting, Sitting Tasks, Shopping.
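A sketch of the windowing step, under our own assumptions about the data layout (the original adaptation code is not shown in the slides): 64 frames of the chunk before the transition and 64 frames of the chunk after it.

```python
import numpy as np

WINDOW = 64  # frames kept on each side of the transition

def cut_transition(frames, transition_idx):
    """frames: array of shape (num_frames, ...); transition_idx: frame index of the
    Unknown/Activity (or Activity/Unknown) boundary."""
    before = frames[transition_idx - WINDOW:transition_idx]   # chunk before the transition
    after = frames[transition_idx:transition_idx + WINDOW]    # chunk after the transition
    return before, after

# Example with dummy low-resolution frames (300 frames of 8x8x3).
frames = np.zeros((300, 8, 8, 3))
before, after = cut_transition(frames, 150)
print(before.shape, after.shape)  # (64, 8, 8, 3) (64, 8, 8, 3)
```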

Features extraction
Features have been separately computed before and after each transition (Unknown/Activity and Activity/Unknown).
Visual features: features extracted from the pool5 layer of the Inception CNN, pretrained on ImageNet; $x_t^v = CNN_{\theta_c}(v_t)$, a 1024-dimensional feature vector.
Acceleration features (sliding window size = 32): time-domain: mean, standard deviation, skewness, kurtosis, percentiles; frequency-domain: spectral entropy.
Heart rate features: mean and standard deviation.
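A sketch of the hand-crafted acceleration features listed above, written as our own helper (not the presenters' code): per-axis time-domain statistics plus spectral entropy over a 32-sample window. The exact percentiles chosen are an assumption, so the resulting dimensionality differs from the 36-dimensional vector used in the slides.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def accel_features(window):
    """window: array of shape (32, 3) with triaxial acceleration samples."""
    feats = []
    for axis in window.T:                        # process one axis at a time
        feats += [axis.mean(), axis.std(), skew(axis), kurtosis(axis)]
        feats += list(np.percentile(axis, [25, 50, 75]))   # assumed percentiles
        psd = np.abs(np.fft.rfft(axis)) ** 2     # power spectrum of the window
        psd = psd / psd.sum()
        feats.append(float(-np.sum(psd * np.log2(psd + 1e-12))))  # spectral entropy
    return np.array(feats)

# Example: one window of 32 random triaxial samples.
print(accel_features(np.random.default_rng(0).normal(size=(32, 3))).shape)
```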

Temporal Pyramid
We represent features in a temporal pyramid with three levels (level 0, level 1 and level 2). The top level (j = 0) is a histogram (mean) over the full temporal extent of the data, the next level (j = 1) is the concatenation of two histograms obtained by temporally segmenting the video into two halves, and so on.
The feature vector size is 7434: video 1024 x 7, acceleration 36 x 7, heart rate 2 x 7.
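A minimal sketch of this three-level pyramid pooling (our own helper): level 0 averages over the whole sequence, level 1 over two halves, level 2 over four quarters, giving 1 + 2 + 4 = 7 pooled vectors per feature stream, which is where the x 7 factors above come from.

```python
import numpy as np

def temporal_pyramid(feats, levels=3):
    """feats: array of shape (T, D) with one feature vector per time step."""
    pooled = []
    for j in range(levels):
        # 2**j segments at level j; mean-pool each segment over time
        for segment in np.array_split(feats, 2 ** j, axis=0):
            pooled.append(segment.mean(axis=0))
    return np.concatenate(pooled)  # shape (7 * D,) for levels=3

# Example: 64 frames of 1024-dimensional CNN features -> 7168-dimensional vector.
print(temporal_pyramid(np.zeros((64, 1024))).shape)
```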

Baseline
An SVM (Support Vector Machine) is trained on the feature vectors.
Classification: classify the activity from feature vectors extracted from the Activity chunk.
Prediction: predict the activity from feature vectors extracted from the Unknown chunk before the activity.
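A minimal sketch of such an SVM baseline using scikit-learn (the feature matrices and label arrays are placeholders, assumed to be produced by the extraction steps above; the presenters' exact training setup is not shown in the slides).

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_svm(X_train, y_train, kernel="linear"):
    """kernel='linear' or 'rbf', matching the two baselines in the results table."""
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    clf.fit(X_train, y_train)
    return clf

# Classification: feature vectors from the Activity chunk (placeholder names).
# clf = train_svm(X_activity_train, y_train, kernel="rbf")
# print(clf.score(X_activity_test, y_test))

# Prediction: feature vectors from the Unknown chunk preceding the activity.
# clf = train_svm(X_unknown_train, y_train, kernel="rbf")
# print(clf.score(X_unknown_test, y_test))
```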

Baseline Results
Training and test with all the input combinations. Prediction: 10% accuracy drop w.r.t. classification.

Inputs (feature size)                    | Classification (Linear / RBF SVM) | Prediction (Linear / RBF SVM)
Acceleration (252)                       | 30% / 46.67%                      | 30% / 46.67%
Heart rate (14)                          | 33.33% / 30%                      | 33.33% / 36.67%
Video (7168)                             | 76.67% / 76.67%                   | 66.67% / 60%
Video + Acceleration (7420)              | 76.67% / 83.33%                   | 73.33% / 73.33%
Video + Heart rate (7182)                | 76.67% / 76.67%                   | 66.67% / 70%
Acceleration + Heart rate (266)          | 33.33% / 46.67%                   | 40% / 46.67%
Video + Acceleration + Heart rate (7434) | 76.67% / 76.67%                   | 73.33% / 73.33%

Conclusions & Future Work
Conclusions: prediction seems a feasible task; multi-modality improves both classification and prediction.
Future work: fill the gap between classification and prediction accuracies; populate an ST Multimodal dataset, exploiting Bluecoin sensors and video from a smartphone.

Thank you