Multimodal Machine Learning

Size: px

Start display at page:

Download "Multimodal Machine Learning"

Doreen Chase
6 years ago
Views:

1 Multimodal Machine Learning Louis-Philippe (LP) Morency CMU Multimodal Communication and Machine Learning Laboratory [MultiComp Lab] 1

2 CMU Course : Multimodal Machine Learning 2

3 Lecture Objectives What is Multimodal? Multimodal: Core technical challenges Representation learning, translation, alignment, fusion and co-learning Multimodal representation learning Multimodal tensor representation Implicit Alignment Temporal attention Fusion and temporal modeling Multi-view LSTM and memory-based fusion

4 What is Multimodal? 4

5 What is Multimodal? Multimodal distribution Multiple modes, i.e., distinct peaks (local maxima) in the probability density function

6 What is Multimodal? Sensory Modalities

7 Multimodal Communicative Behaviors Verbal Lexicon Words Syntax Part-of-speech Dependencies Pragmatics Discourse acts Vocal Prosody Intonation Voice quality Vocal expressions Laughter, moans Visual Gestures Head gestures Eye gestures Arm gestures Body language Body posture Proxemics Eye contact Head gaze Eye gaze Facial expressions FACS action units Smile, frowning 7

8 What is Multimodal? Modality The way in which something happens or is experienced. Modality refers to a certain type of information and/or the representation format in which information is stored. Sensory modality: one of the primary forms of sensation, as vision or touch; channel of communication. Medium ( middle ) A means or instrumentality for storing or communicating information; system of communication/transmission. Medium is the means whereby this information is delivered to the senses of the interpreter. 8

9 Multiple Communities and Modalities Psychology Medical Speech Vision Language Multimedia Robotics Learning

10 Examples of Modalities Natural language (both spoken or written) Visual (from images or videos) Auditory (including voice, sounds and music) Haptics / touch Smell, taste and self-motion Physiological signals Electrocardiogram (ECG), skin conductance Other modalities Infrared images, depth images, fmri

11 Prior Research on Multimodal Four eras of multimodal research The behavioral era (1970s until late 1980s) The computational era (late 1980s until 2000) The interaction era ( ) The deep learning era (2010s until ) Main focus of this tutorial

12 The McGurk Effect (1976) Hearing lips and seeing voices Nature

13 The McGurk Effect (1976) Hearing lips and seeing voices Nature

14 The Computational Era(Late 1980s until 2000) 1) Audio-Visual Speech Recognition (AVSR)

15 Core Technical Challenges 15

16 Core Challenges in Deep Multimodal ML Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency These challenges are non-exclusive. 16

17 Core Challenge 1: Representation Definition: Learning how to represent and summarize multimodal data in away that exploits the complementarity and redundancy. A Joint representations: Representation Modality 1 Modality 2 17

18 Joint Multimodal Representation Wow! I like it! Joyful tone Joint Representation (Multimodal Space) Tensed voice 18

19 Joint Multimodal Representations Audio-visual speech recognition Bimodal Deep Belief Network Image captioning [Ngiam et al., ICML 2011] [Srivastava and Salahutdinov, NIPS 2012] Multimodal Deep Boltzmann Machine Audio-visual emotion recognition Deep Boltzmann Machine [Kim et al., ICASSP 2013] Depth Video Depth Multimodal Multimodal Representation Visual Verbal Depth Verbal 19

20 Multimodal Vector Space Arithmetic [Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014] 20

21 Core Challenge 1: Representation Definition: Learning how to represent and summarize multimodal data in away that exploits the complementarity and redundancy. A Joint representations: B Coordinated representations: Representation Repres. 1 Repres 2 Modality 1 Modality 2 Modality 1 Modality 2 21

22 Coordinated Representation: Deep CCA Learn linear projections that are maximally correlated: u, v = argmax u,v corr u T X, v T Y View H x View H y X u v Y H x U V W x W y Text Image X Y H y Andrew et al., ICML

23 Fancy algorithm Core Challenge 2: Alignment Definition: Identify the direct relations between (sub)elements from two or more different modalities. Modality 1 t 1 t 2 Modality 2 t 4 A Explicit Alignment The goal is to directly find correspondences between elements of different modalities t 3 t 5 B Implicit Alignment Uses internally latent alignment of modalities in order to better solve a different problem t n t n 23

24 Temporal sequence alignment Applications: - Re-aligning asynchronous data - Finding similar data across modalities (we can estimate the aligned cost) - Event reconstruction from multiple sources

25 Alignment examples (multimodal)

26 Implicit Alignment Karpathy et al., Deep Fragment Embeddings for Bidirectional Image Sentence Mapping, 26

27 Core Challenge 3: Fusion Definition: To join information from two or more modalities to perform a prediction task. A Model-Agnostic Approaches 1) Early Fusion 2) Late Fusion Modality 1 Classifier Modality 1 Classifier Modality 2 Modality 2 Classifier 27

28 Core Challenge 3: Fusion Definition: To join information from two or more modalities to perform a prediction task. B Model-Based (Intermediate) Approaches 1) Deep neural networks 2) Kernel-based methods Multiple kernel learning y 3) Graphical models x 1 A h 1 A h 1 V x 1 V h 2 A x 2 A h 2 V x 2 V h 3 A x 3 A h 3 V x 3 V h 4 A x 4 A h 4 V x 4 V h 5 A x 5 A h 5 V x 5 V Multi-View Hidden CRF 28

where the translation relationship can often be

29 Core Challenge 4: Translation Definition: Process of changing data from one modality to another, where the translation relationship can often be open-ended or subjective. A Example-based B Model-driven 29

30 Core Challenge 4 Translation Visual gestures (both speaker and listener gestures) Transcriptions + Audio streams Marsella et al., Virtual character performance from speech, SIGGRAPH/Eurographics Symposium on Computer Animation, 2013

31 Core Challenge 5: Co-Learning Definition: Transfer knowledge between modalities, including their representations and predictive models. Prediction Modality 1 Modality 2 Help during training 31

32 Core Challenge 5: Co-Learning A Parallel B Non-Parallel C Hybrid 32

33 Taxonomy of Multimodal Research [ ] Representation Joint o Neural networks o Graphical models o Sequential Coordinated o Similarity o Structured Translation Example-based o Retrieval o Combination Model-based o Grammar-based o Encoder-decoder o Online prediction Alignment Explicit o Unsupervised o Supervised Implicit o Graphical models o Neural networks Fusion Model agnostic o Early fusion o Late fusion o Hybrid fusion Model-based o Kernel-based o Graphical models o Neural networks Co-learning Parallel data o Co-training o Transfer learning Non-parallel data Zero-shot learning Concept grounding Transfer learning Hybrid data Bridging Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency, Multimodal Machine Learning: A Survey and Taxonomy

34 Multimodal Applications [ ] Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency, Multimodal Machine Learning: A Survey and Taxonomy

35 Multimodal Representations 35

36 Core Challenge: Representation Definition: Learning how to represent and summarize multimodal data in away that exploits the complementarity and redundancy. A Joint representations: B Coordinated representations: Representation Repres. 1 Repres 2 Modality 1 Modality 2 Modality 1 Modality 2 36

37 Deep Multimodal autoencoders A deep representation learning approach A bimodal auto-encoder Used for Audio-visual speech recognition [Ngiam et al., Multimodal Deep Learning, 2011]

38 Deep Multimodal autoencoders - training Individual modalities can be pretrained RBMs Denoising Autoencoders To train the model to reconstruct the other modality Use both Remove audio [Ngiam et al., Multimodal Deep Learning, 2011]

39 Deep Multimodal autoencoders - training Individual modalities can be pretrained RBMs Denoising Autoencoders To train the model to reconstruct the other modality Use both Remove audio Remove video [Ngiam et al., Multimodal Deep Learning, 2011]

40 Multimodal Encoder-Decoder Visual modality often encoded using CNN Language modality will be decoded using LSTM A simple multilayer perceptron will be used to translate from visual (CNN) to language (LSTM) Text X Image Y 40

41 Multimodal Joint Representation For supervised learning tasks Joining the unimodal representations: Simple concatenation Element-wise multiplication or summation Multilayer perceptron How to explicitly model both unimodal and bimodal interactions? h x e.g. Sentiment softmax Text X h m Image Y h y

42 Multimodal Sentiment Analysis Sentiment Intensity [-3,+3] softmax h m h x h y h z h m = f W h x, h y, h z Text X Image Y Audio Z 42

43 Trimodal Bimodal Unimodal Unimodal, Bimodal and Trimodal Interactions Speaker s behaviors Sentiment Intensity This movie is sick? Ambiguous! This movie is fair Smile Loud voice? Unimodal cues Ambiguous! This movie is sick Smile This movie is sick Frown This movie is sick Loud voice? Resolves ambiguity (bimodal interaction) Still Ambiguous! This movie is sick Smile Loud voice This movie is fair Smile Loud voice Different trimodal interactions! 43

44 Multimodal Tensor Fusion Network (TFN) Models both unimodal and bimodal interactions: h m = h x 1 h y 1 = h x h x h y 1 h y Bimodal Unimodal e.g. Sentiment softmax 1 h m Important! h x h y [Zadeh, Jones and Morency, EMNLP 2017] Text X Image Y 44

h y h z Explicitly models unimodal, bimodal and trimodal interactions!

45 Multimodal Tensor Fusion Network (TFN) h x Can be extended to three modalities: h x h z h x h y h m = h x h y 1 1 h z 1 h y h z h y h z h x h y h z Explicitly models unimodal, bimodal and trimodal interactions! [Zadeh, Jones and Morency, EMNLP 2017] h x Text X h y Image Y Audio Z h z 45

46 Experimental Results MOSI Dataset Improvement over State-Of-The-Art 46

47 Coordinated Multimodal Representations Learn (unsupervised) two or more coordinated representations from multiple modalities. A loss function is defined to bring closer these multiple representations. Similarity metric (e.g., cosine distance) Text X Image Y 47

48 Deep Canonical Correlation Analysis Same objective function as CCA: argmax V,U,W x,w y corr H x, H y View H x 1 Linear projections maximizing correlation 2 Orthogonal projections 3 Unit variance of the projection vectors Andrew et al., ICML 2013 H x View H y U V W x W y Text Image X Y H y 48

correlation and reconstruction error from individual views X Text View H

49 Deep Canonically Correlated Autoencoders (DCCAE) Jointly optimize for DCCA and autoencoders loss functions A trade-off between multi-view correlation and reconstruction error from individual views X Text View H x Y Image Wang et al., ICML 2015 H x View H y U V W x W y Text Image X Y H y 49

50 Implicit alignment 50

51 Machine Translation Given a sentence in one language translate it to another Dog on the beach le chien sur la plage

52 Attention Model for Machine Translation Dog Attention module / gate Hidden state s 0 Context z 0 h 1 h 2 h 3 h 4 h 5 Encoder le chien sur la plage [Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015]

53 Attention Model for Machine Translation Dog on Attention module / gate Hidden state s 1 Context z 1 h 1 h 2 h 3 h 4 h 5 Encoder le chien sur la plage [Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015]

54 Attention Model for Machine Translation Dog on the Attention module / gate Hidden state s 2 h 1 h 2 h 3 h 4 h 5 Context z 2 Encoder le chien sur la plage [Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015]

55 Attention Model for Machine Translation 55

56 Attention Model for Image Captioning Distribution over L locations Output word a 1 a 2 d 1 a 3 d 2 s 0 s 1 s 2 z 1 y 1 z 2 y 2 Expectation over features: D First word 56

57 Attention Model for Image Captioning

58 Attention Model for Video Sequences [Pei, Baltrušaitis, Tax and Morency. Temporal Attention-Gated Model for Robust Sequence Classification, CVPR, 2017 ]

59 Temporal Attention-Gated Model (TAGM) 1 a t Recurrent Attention-Gated Unit h t 1 x + h t x t + ReLU x a t 1 a t a t+1 h t 1 h t h t+1 x t 1 x t x t+1 59

60 Temporal Attention Gated Model (TAGM) [Pei, Baltrušaitis, Tax and Morency. Temporal Attention-Gated Model for Robust Sequence Classification, CVPR, 2017 ]

61 Multimodal Fusion 61

62 Multiple Kernel Learning Pick a family of kernels for each modality and learn which kernels are important for the classification case Generalizes the idea of Support Vector Machines Works as well for unimodal and multimodal data, very little adaptation is needed [Lanckriet 2004]

Multimodal Fusion for Sequential Data Modality-private structure Internal grouping of observations Modality-shared structure Interaction and synchrony p y x A, x V ; θ) = h A,h V p y, h A, h V x A, x

63 Multimodal Fusion for Sequential Data Modality-private structure Internal grouping of observations Modality-shared structure Interaction and synchrony p y x A, x V ; θ) = h A,h V p y, h A, h V x A, x V ; θ Multi-View Hidden Conditional Sentiment Random Field y A A A A A h 1 h 2 h 3 h 4 h 5 x 1 A V V V V V h 1 h 2 h 3 h 4 h 5 A A A A x 2 x 3 x 4 x 5 V V x 1 x 2 x 3 V x 4 V x 5 V We saw the yellowdog Approximate inference using loopy-belief [Song, Morency and Davis, CVPR 2012] 63

64 Sequence Modeling with LSTM y 1 y 2 y 3 y τ LSTM (1) LSTM (2) LSTM (3) LSTM (τ) x 1 x 2 x 3 x τ 64

65 Multimodal Sequence Modeling Early Fusion y 1 y 2 y 3 y τ LSTM (1) LSTM (2) LSTM (3) LSTM (τ) (1) (1) (1) (1) x 1 x 2 x 3 x τ (2) (2) (2) (2) x 1 x 2 x 3 x τ (3) (3) (3) (3) x 1 x 2 x 3 x τ 65

66 Multi-View Long Short-Term Memory (MV-LSTM) y 1 y 2 y 3 y τ MV- LSTM (1) MV- LSTM (2) MV- LSTM (3) MV- LSTM (τ) (1) (1) (1) (1) x 1 x 2 x 3 x τ (2) (2) (2) (2) x 1 x 2 x 3 x τ (3) (3) (3) (3) x 1 x 2 x 3 x τ [Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016] 66

67 Multi-View Long Short-Term Memory Multi-view topologies Multiple memory cells MV- LSTM (1) (1) h t 1 (2) h t 1 (3) h t 1 MVtanh (1) g t (2) g t g (3) c t (1) c t (2) c t (3) h t (1) h t (2) h t (3) x t (1) MVsigm x t (2) x t (3) MVsigm MVsigm [Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016] 67

68 Topologies for Multi-View LSTM Fullyconnected α=1, β=1 MV- LSTM (1) (1) h t 1 (2) h t 1 (3) h t 1 Multi-view topologies (1) g t (2) g t g (3) x t (1) (1) h t 1 (2) h t 1 (3) h t 1 View-specific α=1, β=0 (1) g t x t (1) (1) h t 1 (2) h t 1 (3) h t 1 MVtanh (1) g t x t (1) x t (2) x t (3) α: Memory from current view β: Memory from other views Design parameters x t (1) (1) h t 1 (2) h t 1 (3) h t 1 Coupled α=0, β=1 (1) g t x t (1) (1) h t 1 (2) h t 1 (3) h t 1 Hybrid α=2/3, β=1/3 (1) g t [Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016] 68

69 Multi-View Long Short-Term Memory (MV-LSTM) Multimodal prediction of children engagement [Shyam, Morency, et al. Extending Long Short-Term Memory for Multi-View Structured Learning, ECCV, 2016] 69

70 Memory Based A memory accumulates multimodal information over time. From the representations throughout a source network. No need to modify the structure of the source network, only attached the memory. 70

71 Memory Based [Zadeh et al., Memory Fusion Network for Multi-view Sequential Learning, AAAI 2018] 71

72 Multimodal Machine Learning Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency 72

Tutorial on Multimodal Machine Learning

Tutorial on Multimodal Machine Learning Louis-Philippe (LP) Morency Tadas Baltrusaitis CMU Multimodal Communication and Machine Learning Laboratory [MultiComp Lab] 1 Your Instructors Louis-Philippe Morency