Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018

Deep Learning Sequence to Sequence models: Attention Models 17 March 2018 1

Sequence-to-sequence modelling. Problem: a sequence X_1 ... X_N goes in, and a different sequence Y_1 ... Y_M comes out. E.g. speech recognition: speech goes in, a word sequence comes out (alternately the output may be a phoneme or character sequence). Machine translation: a word sequence goes in, a word sequence comes out. In general N ≠ M, and there is no synchrony between X and Y. 2

Sequence to sequence (Seq2seq): "I ate an apple" goes into the Seq2seq model and "Ich habe einen apfel gegessen" comes out. A sequence goes in, a sequence comes out. There is no notion of synchrony between input and output; there may not even be a notion of alignment, e.g. "I ate an apple" → "Ich habe einen apfel gegessen". 3

Recap: We have dealt with the aligned Y(t) case using CTC (figure: a recurrent net with initial state h_{-1} reading inputs X(t) from t = 0 along the time axis). The input and output sequences happen in the same order, although they may be asynchronous; e.g. in speech recognition the input speech corresponds to the phoneme sequence output. 4

Today: "I ate an apple" → Seq2seq → "Ich habe einen apfel gegessen". A sequence goes in, a sequence comes out, with no notion of synchrony between input and output, and possibly no notion of alignment either. 5

Recap: Predicting text. Simple problem: given a series of symbols (characters or words) w_1 w_2 ... w_n, predict the next symbol (character or word) w_{n+1}. 6

Simple recurrence: text modelling (figure: a recurrent net with initial state h_{-1} reads w_0 w_1 ... w_6 and predicts w_1 w_2 ... w_7). Learn a model that can predict the next character given a sequence of characters, or, at a higher level, words. After observing inputs w_0 ... w_k it predicts w_{k+1}. 7

Language modelling using RNNs. "Four score and seven years ???" "A B R A H A M L I N C O L ??" Problem: given a sequence of words (or characters), predict the next one. 8

Training (figure: the network reads w_0 ... w_6 and at each step outputs a prediction for w_1 ... w_7; a divergence is computed against the true next words). Input: symbols as one-hot vectors; the dimensionality of the vector is the size of the vocabulary, projected down to lower-dimensional embeddings. Output: a probability distribution over symbols, Y(t, i) = P(V_i | w_0 ... w_t), where V_i is the i-th symbol in the vocabulary. Divergence: the probability assigned to the correct next word, Div(Y_target(1 ... T), Y(1 ... T)) = Σ_t Xent(Y_target(t), Y(t)) = -Σ_t log Y(t, w_{t+1}). 9
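To make this training recipe concrete, here is a minimal sketch (not the course's code) of one such training step in PyTorch; the vocabulary size, layer sizes, optimizer settings and the toy data tensor are all assumptions made for illustration.

```python
# Minimal sketch: one training step of an RNN language model.
import torch
import torch.nn as nn

V, E, H = 10000, 128, 256           # assumed vocabulary, embedding and hidden sizes

embed = nn.Embedding(V, E)          # one-hot input projected to a low-dim embedding
rnn   = nn.LSTM(E, H, batch_first=True)
proj  = nn.Linear(H, V)             # hidden state -> distribution over the vocabulary
xent  = nn.CrossEntropyLoss()       # corresponds to -log Y(t, w_{t+1}), averaged over t
opt   = torch.optim.SGD(list(embed.parameters()) + list(rnn.parameters())
                        + list(proj.parameters()), lr=0.1)

words = torch.randint(0, V, (1, 21))            # toy sequence w_0 ... w_20
inputs, targets = words[:, :-1], words[:, 1:]   # predict w_{t+1} from w_0 ... w_t

h, _ = rnn(embed(inputs))           # h: (1, T, H)
logits = proj(h)                    # unnormalized Y(t, i)
loss = xent(logits.reshape(-1, V), targets.reshape(-1))
loss.backward()                     # backpropagate the divergence
opt.step()
```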

Generating Language: The model (figure: a stack of recurrent units reads W_1 W_2 ... W_9 and outputs predictions for W_2 W_3 ... W_10). The hidden units are (one or more layers of) LSTM units, trained via backpropagation from a lot of text. 10

Generating Language: Synthesis. The output at each step is y_t(i) = P(W_t = V_i | W_1 ... W_{t-1}), the probability that the t-th word in the sequence is the i-th word in the vocabulary given all previous t-1 words. On the trained model: provide the first few words as one-hot vectors; after the last input word, the network outputs an N-valued probability distribution over words rather than a one-hot vector. 11

Generating Language: Synthesis (continued). Draw a word from that distribution and set it as the next word in the series. 12

Generating Language: Synthesis (continued). Feed the drawn word as the next input and draw the next word from the output probability distribution. 13

Generating Language: Synthesis (continued). Continue this process until we terminate generation; in some cases, e.g. generating programs, there may be a natural termination. 14

Generating Language: Synthesis (continued). Keep feeding each drawn word back in and drawing the next word from the output distribution. For text generation we will usually end at an <eos> (end of sequence) symbol. The <eos> symbol is a special symbol included in the vocabulary, which indicates the termination of a sequence and occurs only at the final position of sequences. 15
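A hedged sketch of this synthesis loop, reusing the embed/rnn/proj modules from the previous sketch; the <eos> index, the prompt ids and the length cap are made-up values.

```python
# Minimal sketch: sample from the RNN language model until an assumed <eos> index.
import torch

EOS = 0                                   # assumed index of the <eos> symbol
prompt = torch.tensor([[5, 17, 42]])      # "first few words", as made-up indices

with torch.no_grad():
    out, state = rnn(embed(prompt))       # run the prompt through the model
    generated = []
    while True:
        probs = torch.softmax(proj(out[:, -1]), dim=-1)   # distribution over words
        word = torch.multinomial(probs, 1)[0, 0]          # draw a word, don't argmax
        if word.item() == EOS or len(generated) > 100:
            break
        generated.append(word.item())
        out, state = rnn(embed(word.view(1, 1)), state)   # feed the drawn word back in
```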

Returning to our problem: "I ate an apple" → Seq2seq → "Ich habe einen apfel gegessen". Problem: a sequence X_1 ... X_N goes in, and a different sequence Y_1 ... Y_M comes out. Similar to predicting text, but with a difference: the output is in a different language. 16

Modelling the problem Delayed sequence to sequence 17

Modelling the problem Delayed sequence to sequence Delayed self-referencing sequence-to-sequence 18

The simple translation model (figure: "I ate an apple <eos>" feeds into a recurrent net, whose final state initializes a second net that produces the output). The input sequence feeds into a recurrent structure. The input sequence is terminated by an explicit <eos> symbol. The hidden activation at the <eos> stores all information about the sentence. Subsequently a second RNN uses the hidden activation as initial state to produce a sequence of outputs. The output at each time becomes the input at the next time. Output production continues until an <eos> is produced. 19

The simple translation model (continued): the encoder has read the full input "I ate an apple <eos>". 20

The simple translation model (continued): the decoder produces its first output word, "Ich". 21

The simple translation model (continued): "Ich" is fed back as the next input and "habe" is produced. 22

The simple translation model (continued): each drawn word is fed back in turn and "einen" is produced. 23

The simple translation model (continued): the process continues until the full output "Ich habe einen apfel gegessen <eos>" has been produced. 24

We will illustrate with a single hidden layer, but the discussion generalizes to more layers. 25

The simple translation model (continued): the first recurrent net, which reads the input, is the ENCODER; the second, which produces the output, is the DECODER. 26

The simple translation model, a more detailed look: the one-hot word representations may be compressed via embeddings (projection matrices P_1 on the encoder side and P_2 on the decoder side). The embeddings will be learned along with the rest of the net. In the following slides we will not represent the projection matrices. 27
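The encoder-decoder just described can be written down compactly. The following is a minimal sketch under assumed module names and sizes, not the lecture's implementation; it uses teacher forcing at training time (the shifted target sequence is fed to the decoder), matching the training procedure discussed later.

```python
# Minimal sketch of the "simple translation model": an encoder-decoder pair of LSTMs.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=128, hid=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)   # learned input embeddings (P_1)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)   # learned output embeddings (P_2)
        self.encoder = nn.LSTM(emb, hid, batch_first=True)
        self.decoder = nn.LSTM(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, src, tgt_in):
        # Encoder: the final hidden state (at <eos>) summarizes the sentence
        _, state = self.encoder(self.src_emb(src))
        # Decoder: starts from that state; at training time tgt_in is the
        # target sequence shifted right (teacher forcing)
        dec, _ = self.decoder(self.tgt_emb(tgt_in), state)
        return self.out(dec)          # logits over the target vocabulary
```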

What the network actually produces (figure: the decoder's output layer assigns probabilities to "ich", "apfel", "bier", ..., "<eos>"). At each time k the network actually produces a probability distribution over the output vocabulary: y_k(w) = P(O_k = w | O_1, ..., O_{k-1}, I_1, ..., I_N), i.e. the probability given the entire input sequence I_1, ..., I_N and the partial output sequence O_1, ..., O_{k-1} produced until k. At each time a word is drawn from the output distribution, and the drawn word is provided as input at the next time. 28

What the network actually produces (continued): "Ich" is drawn from the first distribution y_1. 29

What the network actually produces (continued): the drawn word "Ich" is provided as input at the next time. 30

What the network actually produces (continued): given the input and "Ich", the distribution for the second output word is produced. 31

What the network actually produces (continued): "habe" is drawn. 32

What the network actually produces (continued): "habe" is fed back as the next input. 33

What the network actually produces (continued): the distribution for the third output word is produced. 34

What the network actually produces (continued): "einen" is drawn. 35

What the network actually produces (continued): the process repeats until the full output "Ich habe einen apfel gegessen <eos>" has been produced, with distributions y_1 ... y_5 along the way. 36

Generating an output from the net: at each time the network produces a probability distribution over words, given the entire input and the previous outputs; at each time a word is drawn from the output distribution and provided as input at the next time; the process continues until an <eos> is generated. 37

The probability of the output: P(O_1, ..., O_L | W_1^in, ..., W_N^in) = y_1(O_1) y_2(O_2) ... y_L(O_L). The objective of drawing: produce the most likely output (that ends in an <eos>), i.e. argmax over O_1, ..., O_L of y_1(O_1) y_2(O_2) ... y_L(O_L). 38

The probability of the output (continued). Objective: argmax over O_1, ..., O_L of y_1(O_1) y_2(O_2) ... y_L(O_L). We cannot just pick the most likely symbol at each time: that may cause the distribution to be more confused at the next time, whereas choosing a different, less likely word could cause the distribution at the next time to be more peaky, resulting in a more likely output overall. 39

Greedy is not good (hypothetical example from English speech recognition: the input is speech, the output must be text; figure: the distributions P(O_2 | O_1, I_1, ..., I_N) and P(O_3 | O_1, O_2, I_1, ..., I_N) over the vocabulary at successive time steps). "Nose" has the highest probability at t=2 and is selected; the model is then very confused at t=3 and assigns low probabilities to many words, so selecting any of them results in low probability for the entire 3-word sequence. "Knows" has slightly lower probability than "nose" but is still high; "he knows" is a reasonable beginning and the model assigns high probabilities to words such as "something", so selecting one of these results in a higher overall probability for the 3-word sequence. 40

Greedy is not good (figure: P(O_2 | O_1, I_1, ..., I_N)). What should we have chosen at t=2? Will selecting "nose" continue to have a bad effect into the distant future? Problem: it is impossible to know a priori which word leads to the more promising future. Should we draw "nose" or "knows"? The effect may not be obvious until several words down the line, or the choice of the wrong word early may cumulatively lead to a poorer overall score over time. 41

Greedy is not good (figure: P(O_1 | I_1, ..., I_N)). What should we have chosen at t=1: "the" or "he"? Problem: it is impossible to know a priori which word leads to the more promising future. Even earlier: choosing the lower-probability "the" instead of "he" at T=0 may have made a choice of "nose" more reasonable at T=1. In general, making a poor choice at any time commits us to a poor future, but we cannot know at that time that the choice was poor. Solution: don't choose. 42

Solution: Multiple choices (figure: the first output forks into "I", "He", "We", "The", ...). Retain all the choices and fork the network, with every possible word as input. 43

Problem: Multiple choices. This will blow up very quickly: for an output vocabulary of size V, after T output steps we'd have forked out V^T branches. 44

Solution: Prune. At each time, retain only the top-K scoring forks: the top K values of P(O_1 | I_1, ..., I_N). 45

Solution: Prune (continued). At each time, retain only the top-K scoring forks. 46

Solution: Prune (continued). Note: the score is based on the product, top-K P(O_1, O_2 | I_1, ..., I_N) = top-K P(O_2 | O_1, I_1, ..., I_N) P(O_1 | I_1, ..., I_N). 47

Solution: Prune (continued). At each time, retain only the top-K scoring forks. 48

Solution: Prune (continued). The extended paths are again scored by the product P(O_2 | O_1, I_1, ..., I_N) P(O_1 | I_1, ..., I_N), and only the top K are kept. 49

Solution: Prune (continued). At each time, retain only the top-K scoring forks. 50

Solution: Prune (continued). In general, at step n we retain the top K values of the product Π_{t=1}^{n} P(O_t | O_1, ..., O_{t-1}, I_1, ..., I_N). 51

Terminate: when the current most likely path overall ends in <eos>; or continue producing more outputs (each of which terminates in <eos>) to get N-best outputs. 52

Termination: <eos> (the example has K = 2). Paths cannot continue once they output an <eos>, so paths may be of different lengths. Select the most likely sequence ending in <eos> across all terminating sequences. 53
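Putting the fork-prune-terminate procedure together, here is a minimal beam-search sketch. It assumes a hypothetical step(prefix) function that returns log P(next word | prefix, input) as a dictionary; a real decoder would batch these calls and cache decoder states.

```python
# Minimal sketch of beam search with top-K pruning and <eos> termination.
import heapq

def beam_search(step, eos, K=2, max_len=50):
    beams = [(0.0, [])]                            # (log prob of prefix, prefix)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, prefix in beams:
            for word, lp in step(prefix).items():  # fork on every possible word
                candidates.append((logp + lp, prefix + [word]))
        # prune: keep only the top-K scoring forks (score = product of probabilities)
        beams = heapq.nlargest(K, candidates, key=lambda c: c[0])
        finished += [b for b in beams if b[1][-1] == eos]
        beams = [b for b in beams if b[1][-1] != eos]
        if not beams or (finished and max(finished)[0] >= max(beams)[0]):
            break                                  # best overall path ends in <eos>
    return max(finished)[1] if finished else max(beams)[1]
```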

Training the system Ich habe einen apfel gegessen <eos> I ate an apple <eos> Ich habe einen apfel gegessen Must learn to make predictions appropriately Given I ate an apple <eos>, produce Ich habe einen apfel gegessen <eos>. 54

Training: Forward pass (figure: the network produces output distributions Y_0 ... Y_5 while reading "I ate an apple <eos>" followed by "Ich habe einen apfel gegessen"). Forward pass: input the source and target sequences, sequentially; the output will be a probability distribution over the target symbol set (vocabulary). 55

Training: Backward pass (figure: a divergence is computed between each output distribution Y_0 ... Y_5 and the target words "Ich habe einen apfel gegessen <eos>"). Backward pass: compute the divergence between the output distributions and the target word sequence. 56

Training: Backward pass (continued). Compute the divergence between the output distributions and the target word sequence; backpropagate the derivatives of the divergence through the network to learn the net. 57

Training: Backward pass (continued). In practice, if we apply SGD, we may randomly sample words from the output to actually use for the backprop and update. Typical usage: randomly select one word from each input training instance (comprising an input-output pair). For each iteration: randomly select a training instance (input, output); forward pass; randomly select a single output y(t) and the corresponding desired output d(t) for backprop. 58

Overall training. Given several training instances (X, D): Forward pass: compute the output of the network for (X, D); note that both X and D are used in the forward pass. Backward pass: compute the divergence between the desired target D and the actual output Y, and propagate the derivatives of the divergence for updates. 59
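A minimal sketch of one such update for the Seq2Seq module sketched earlier (after slide 27), with made-up vocabulary sizes, an assumed start symbol at index 0, and random stand-in data; the slide's optional trick of backpropagating from only a randomly chosen output position is omitted here.

```python
# Minimal sketch: one teacher-forced training update of the encoder-decoder.
import torch
import torch.nn as nn

model = Seq2Seq(src_vocab=8000, tgt_vocab=9000)    # sizes are assumptions
xent = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

src = torch.randint(0, 8000, (1, 5))     # stands in for "I ate an apple <eos>"
tgt = torch.randint(0, 9000, (1, 6))     # stands in for "Ich habe ... gegessen <eos>"

sos = torch.zeros(1, 1, dtype=torch.long)          # assumed start symbol, index 0
logits = model(src, torch.cat([sos, tgt[:, :-1]], dim=1))   # decoder sees shifted target
loss = xent(logits.reshape(-1, 9000), tgt.reshape(-1))      # divergence vs. target D
loss.backward()                          # both encoder and decoder receive gradients
opt.step()
```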

Trick of the trade: Reversing the input (figure: the encoder now reads "<eos> apple an ate I" while the decoder still produces "Ich habe einen apfel gegessen <eos>"). Standard trick of the trade: the input sequence is fed in reverse order; things work better this way. 60

Trick of the trade: Reversing the input (continued). 61

Trick of the trade: Reversing the input (continued). This happens both during training and during the actual decode. 62
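The reversal trick itself is a one-line change, sketched below with made-up token ids; the flipped tensor is what gets fed to the encoder, at both training and decoding time.

```python
# Minimal sketch: reverse the source sequence along the time dimension before encoding.
import torch

src = torch.tensor([[11, 12, 13, 14, 2]])   # "I ate an apple <eos>" as made-up ids
rev = torch.flip(src, dims=[1])             # becomes "<eos> apple an ate I"
```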

Overall training. Given several training instances (X, D): Forward pass: compute the output of the network for (X, D), with the input in reverse order; note that both X and D are used in the forward pass. Backward pass: compute the divergence between the desired target D and the actual output Y, and propagate the derivatives of the divergence for updates. 63

Applications: Machine translation ("My name is Tom" → "Ich heisse Tom" / "Mein Name ist Tom"); automatic speech recognition (speech recording → "My name is Tom"); dialog ("I have a problem" → "How may I help you"); image to text (picture → caption for the picture). 64

Machine Translation Example Hidden state clusters by meaning! From Sequence-to-sequence learning with neural networks, Sutskever, Vinyals and Le 65

Machine Translation Example Examples of translation From Sequence-to-sequence learning with neural networks, Sutskever, Vinyals and Le 66

Human-Machine Conversation: Example. From "A Neural Conversational Model", Oriol Vinyals and Quoc Le. Trained on human-human conversations. Task: human text in, machine response out. 67

Generating Image Captions (figure: a CNN processes the image and its output initializes the decoder). Not really a sequence-to-sequence problem, more an image-to-sequence problem. The initial state is produced by a state-of-the-art CNN-based image classification system; the subsequent model is just the decoder end of a seq-to-seq model. "Show and Tell: A Neural Image Caption Generator", O. Vinyals, A. Toshev, S. Bengio, D. Erhan. 68

Generating Image Captions. Decoding: given an image, process it with the CNN to get the output of the classification layer, then sequentially generate words by drawing from the conditional output distribution P(W_t | W_0, W_1, ..., W_{t-1}, Image). In practice, we can perform the beam search explained earlier. 69
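A minimal sketch of that decoding loop, with an off-the-shelf torchvision CNN standing in for the pretrained classifier and assumed vocabulary sizes and special-token indices; greedy decoding or the beam search above could replace the sampling step.

```python
# Minimal sketch: CNN features initialize the decoder state, then words are drawn.
import torch
import torch.nn as nn
import torchvision

V, E, H = 10000, 256, 512                 # assumed vocabulary / embedding / hidden sizes
cnn = torchvision.models.resnet18(weights=None)   # stand-in for the pretrained classifier
cnn.fc = nn.Identity()                    # use its 512-dim penultimate features
cnn.eval()
embed, rnn, proj = nn.Embedding(V, E), nn.LSTM(E, H, batch_first=True), nn.Linear(H, V)
SOS, EOS = 1, 2                           # assumed special-token indices

image = torch.randn(1, 3, 224, 224)       # stand-in image
with torch.no_grad():
    feat = cnn(image)                                     # (1, 512) image features
    state = (feat.unsqueeze(0), torch.zeros(1, 1, H))     # features as initial hidden state
    word, caption = torch.tensor([[SOS]]), []
    for _ in range(20):
        out, state = rnn(embed(word), state)
        probs = torch.softmax(proj(out[:, -1]), dim=-1)
        word = torch.multinomial(probs, 1)                # draw from P(W_t | W_0..W_{t-1}, Image)
        if word.item() == EOS:
            break
        caption.append(word.item())
```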

Generating Image Captions (continued): the first output distribution (over "a", "boy", "cat", ...) is produced and "A" is drawn. 70

Generating Image Captions (continued): "A" is fed back in and the next distribution is produced. 71

Generating Image Captions (continued): the caption so far is "A boy on". 72

Generating Image Captions (continued): the caption so far is "A boy on a". 73

Generating Image Captions (continued): the caption so far is "A boy on a surfboard". 74

Generating Image Captions (continued): generation ends when <eos> is drawn, giving "A boy on a surfboard <eos>". 75

Training (figure: the CNN processes the image; the decoder is trained on the caption). Training: given several (Image, Caption) pairs. The image network is pretrained on a large corpus, e.g. ImageNet. Forward pass: produce output distributions given the image and the caption. Backward pass: compute the divergence w.r.t. the training caption and backpropagate derivatives. All components of the network, including the final classification layer of the image classification net, are updated; the convolutional portions of the image classifier are not modified (transfer learning). 76

Training (continued; figure: the decoder reads "A boy on a surfboard" and produces an output distribution at each step). 77

Training (continued; figure: the divergence between each output distribution and the target caption "A boy on a surfboard <eos>" is computed and backpropagated). 78

Examples from Vinyals et al. 79

"Translating Videos to Natural Language Using Deep Recurrent Neural Networks", Subhashini Venugopalan, Huijun Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko. North American Chapter of the Association for Computational Linguistics, Denver, Colorado, June 2015. 80

A problem with this framework (figure: the encoder-decoder from before). All the information about the input sequence is embedded into a single vector: the hidden layer at the end of the input sequence. This one node is overloaded with information, particularly if the input is long. 81

A problem with this framework (continued). In reality, all hidden values carry information, some of which may be diluted downstream. 82

A problem with this framework (continued). In reality, all hidden values carry information, some of which may be diluted downstream; and different outputs are related to different inputs (recall that the input and output may not be in sequence). 83

A problem with this framework (continued). We have no way of knowing a priori which input must connect to which output. Connecting everything to everything is infeasible: the inputs and outputs are variable-sized, it would be overparametrized, and a fixed connection pattern ignores the actual asynchronous dependence of the output on the input. 85

Solution: Attention models (figure: the encoder states h_{-1}, h_0, ..., h_3 over "I ate an apple <eos>", and the decoder states s_{-1}, s_0, ...); separating the encoder and decoder in the illustration. 86

Solution: Attention models (continued). Compute a weighted combination of all the hidden outputs into a single vector, Σ_i w_i(t) h_i; the weights vary by output time. 87

Solution: Attention models (continued). Note: the weights vary with the output time. The input to the hidden decoder layer is Σ_i w_i(t) h_i, where the weights w_i(t) are scalars. 88

Solution: Attention models (continued). The input to the hidden decoder layer is Σ_i w_i(t) h_i with w_i(t) = a(h_i, s_{t-1}): we require a time-varying weight that specifies the relationship of the output time to the input time, and the weights are functions of the current output state. 89

Attention models. The weights are a distribution over the input (they sum to 1.0): they must automatically highlight the most important input components for any output. Input to the hidden decoder layer: Σ_i w_i(t) h_i, with w_i(t) = a(h_i, s_{t-1}). 90

Attention models (continued). Raw weight at any time: e_i(t) = g(h_i, s_{t-1}), a function g() that works on the two hidden states. Actual weight: a softmax over the raw weights, w_i(t) = exp(e_i(t)) / Σ_j exp(e_j(t)). 91

Attention models (continued). Typical options for g(): g(h_i, s_{t-1}) = h_i^T s_{t-1}, or g(h_i, s_{t-1}) = h_i^T W_g s_{t-1}, or g(h_i, s_{t-1}) = v_g^T tanh(W_g [h_i ; s_{t-1}]). Then e_i(t) = g(h_i, s_{t-1}) and w_i(t) = exp(e_i(t)) / Σ_j exp(e_j(t)). The parameters W_g and v_g (shown in red on the slide) are to be learned. 92
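The three scoring options, the softmax, and the resulting context vector can be written out directly. The sketch below uses assumed dimensions and names (W_g and v_g follow the slide's notation, with a separate weight for the concatenated input in the additive form); it is an illustration, not the course's reference code.

```python
# Minimal sketch: dot-product, bilinear and additive (MLP) attention scores,
# followed by the softmax that turns raw scores e_i(t) into weights w_i(t).
import torch
import torch.nn as nn

H = 256                                   # encoder and decoder state size (assumed equal)
W_g = nn.Linear(H, H, bias=False)         # learned, used by the bilinear score
mlp = nn.Linear(2 * H, H, bias=False)     # learned, applied to [h_i ; s_{t-1}]
v_g = nn.Linear(H, 1, bias=False)         # learned, used by the additive score

def score_dot(h, s):      return h @ s                    # h_i^T s_{t-1}
def score_bilinear(h, s): return h @ W_g(s)               # h_i^T W_g s_{t-1}
def score_mlp(h, s):      # v_g^T tanh(W_g [h_i ; s_{t-1}])
    return v_g(torch.tanh(mlp(torch.cat([h, s.expand_as(h)], dim=-1)))).squeeze(-1)

h = torch.randn(5, H)         # encoder states h_1 ... h_5
s = torch.randn(H)            # previous decoder state s_{t-1}
e = score_bilinear(h, s)      # raw weights e_i(t), one per input position
w = torch.softmax(e, dim=0)   # attention weights w_i(t); they sum to 1
z = w @ h                     # context vector z_t = sum_i w_i(t) h_i
```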

Converting an input (forward pass): pass the input through the encoder to produce the hidden representations h_i. 93

Converting an input (forward pass): what is s_{-1}? Multiple options. Simplest: s_{-1} = h_N; if s and h are of different sizes, s_{-1} = W_s h_N, where W_s is a learnable parameter. Then compute the weights for the first output. 94

Converting an input (forward pass): compute the weights (for every h_i) for the first output, e_i(0) = g(h_i, s_{-1}) with g(h_i, s_{-1}) = h_i^T W_g s_{-1}, and w_i(0) = exp(e_i(0)) / Σ_j exp(e_j(0)). 95

Converting an input (forward pass): compute the weighted combination of the hidden values, z_0 = Σ_i w_i(0) h_i. 96

Converting an input (forward pass): produce the first output Y_0 from s_0; it will be a distribution over words ("ich", "du", "hat", ...). 97

Converting an input (forward pass): draw a word from the distribution Y_0 ("Ich"). 98

Compute the weights for every h_i for time t = 1: e_i(1) = g(h_i, s_0) with g(h_i, s_0) = h_i^T W_g s_0, and w_i(1) = exp(e_i(1)) / Σ_j exp(e_j(1)). 99

Compute the weighted sum of the hidden input values at t = 1: z_1 = Σ_i w_i(1) h_i. 100

Compute the output at t = 1 (a probability distribution over words, Y_1), using z_1 and the previously drawn word "Ich". 101

Draw a word from the output distribution at t = 1 ("habe"). 102

Compute the weights for every h_i for time t = 2: e_i(2) = g(h_i, s_1) = h_i^T W_g s_1, and w_i(2) = exp(e_i(2)) / Σ_j exp(e_j(2)). 103

Compute the weighted sum of the hidden input values at t = 2: z_2 = Σ_i w_i(2) h_i. 104

Compute the output at t = 2 (a probability distribution over words, Y_2). 105

Draw a word from the distribution ("einen"). 106

Compute the weights for every h_i for time t = 3: e_i(3) = g(h_i, s_2), and w_i(3) = exp(e_i(3)) / Σ_j exp(e_j(3)). 107

Compute the weighted sum of the hidden input values at t = 3: z_3 = Σ_i w_i(3) h_i. 108

Compute the output at t = 3 (a probability distribution over words) and draw a word from it ("apfel"). 109

Continue the process until an end-of-sequence symbol is produced. 110
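A minimal sketch of this decode loop, using the bilinear score from the earlier sketch and assumed module names and sizes; for simplicity s_{-1} is taken to be the last encoder state, and the context z_t is concatenated with the previous word's embedding as the decoder input.

```python
# Minimal sketch: attention-based decoding, one step at a time, until <eos>.
import torch
import torch.nn as nn

V, E, H = 9000, 128, 256
embed = nn.Embedding(V, E)
dec_cell = nn.LSTMCell(E + H, H)          # input: previous word embedding + context z_t
proj = nn.Linear(H, V)
W_g = nn.Linear(H, H, bias=False)
SOS, EOS = 1, 2                           # assumed special-token indices

h_enc = torch.randn(5, H)                 # encoder states h_1 ... h_N (assumed given)
s = h_enc[-1].clone()                     # s_{-1}: simplest choice, the last encoder state
c = torch.zeros(H)
word, outputs = torch.tensor(SOS), []

with torch.no_grad():
    for _ in range(30):
        e = h_enc @ W_g(s)                # e_i(t) = h_i^T W_g s_{t-1}
        w = torch.softmax(e, dim=0)       # attention weights w_i(t), sum to 1
        z = w @ h_enc                     # z_t = sum_i w_i(t) h_i
        inp = torch.cat([embed(word), z]).unsqueeze(0)
        s, c = dec_cell(inp, (s.unsqueeze(0), c.unsqueeze(0)))
        s, c = s.squeeze(0), c.squeeze(0)
        word = torch.multinomial(torch.softmax(proj(s), dim=-1), 1).squeeze()
        if word.item() == EOS:
            break
        outputs.append(word.item())
```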

As before, the objective of drawing is to produce the most likely output (that ends in an <eos>): argmax over O_1, ..., O_L of y_1(O_1) y_2(O_2) ... y_L(O_L). Simply selecting the most likely symbol at each time may result in suboptimal output.

Solution: Multiple choices. Retain all choices and fork the network, with every possible word as input. 112

To prevent blowup: Prune. At each time, retain only the top-K scoring forks: the top K values of P(O_1 | I_1, ..., I_N). 113

To prevent blowup: Prune (continued). At each time, retain only the top-K scoring forks. 114

Decoding. Note: the score is based on the product, top-K P(O_1, O_2 | I_1, ..., I_N) = top-K P(O_2 | O_1, I_1, ..., I_N) P(O_1 | I_1, ..., I_N). At each time, retain only the top-K scoring forks. 115

Decoding (continued). At each time, retain only the top-K scoring forks. 116

Decoding (continued). The extended paths are again scored by the product P(O_2 | O_1, I_1, ..., I_N) P(O_1 | I_1, ..., I_N), and only the top K are kept. 117

Decoding (continued). At each time, retain only the top-K scoring forks. 118

Decoding (continued). In general, at step n we retain the top K values of the product Π_{t=1}^{n} P(O_t | O_1, ..., O_{t-1}, I_1, ..., I_N). 119

Terminate: when the current most likely path overall ends in <eos>; or continue producing more outputs (each of which terminates in <eos>) to get N-best outputs. 120

Termination: <eos> (the example has K = 2). Paths cannot continue once they output an <eos>, so paths may be of different lengths. Select the most likely sequence ending in <eos> across all terminating sequences. 121

What does the attention learn? (figure: the attention weights for the second output, z_1 = Σ_i w_i(1) h_i, with e_i(1) = g(h_i, s_0) = h_i^T W_g s_0 and w_i(1) = exp(e_i(1)) / Σ_j exp(e_j(1))). The key component of this model is the attention weight: it captures the relative importance of each position in the input to the current output. 122

Alignments example (Bahdanau et al.): a plot of w_i(t), with input position i on one axis and output step t on the other; the colour shows the value (white is larger). Note how the most important input words for any output word get automatically highlighted. The general trend is somewhat linear because the word order is roughly similar in both languages. 123
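Such an alignment plot is just an image of the attention-weight matrix. A small sketch with random stand-in weights (the real w_i(t) would come from the trained model):

```python
# Minimal sketch: visualize attention weights as an alignment heatmap (white = large).
import numpy as np
import matplotlib.pyplot as plt

src = ["I", "ate", "an", "apple", "<eos>"]
tgt = ["Ich", "habe", "einen", "apfel", "gegessen", "<eos>"]
attn = np.random.dirichlet(np.ones(len(src)), size=len(tgt))  # each row sums to 1

plt.imshow(attn, cmap="gray", aspect="auto")
plt.xticks(range(len(src)), src)
plt.yticks(range(len(tgt)), tgt)
plt.xlabel("input position i"); plt.ylabel("output step t")
plt.show()
```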

Translation Examples Bahdanau et al. 2016 124

Training the network. We have seen how a trained network can be used to compute outputs, i.e. convert one sequence to another. Let's consider training. 125

Training (figure: the attention-based encoder-decoder produces an output distribution over "ich", "du", "hat", ... at each step). Given training (source sequence, target sequence) pairs. Forward pass: pass the input sequence through the encoder and the actual target sequence through the decoder; at each time the output is a probability distribution over words. 126

Backward pass: compute a divergence between the target output and the output distributions; backpropagate derivatives through the network. 127

Backward pass (continued): backpropagation also updates the parameters of the attention function g(). 128

Various extensions. Attention: local attention vs. global attention, e.g. "Effective Approaches to Attention-based Neural Machine Translation", Luong et al., 2015. Other variants: bidirectional processing of the input sequence (bidirectional networks in the encoder), e.g. "Neural Machine Translation by Jointly Learning to Align and Translate", Bahdanau et al., 2016. 129

Some impressive results: attention-based models are currently responsible for the state of the art in many sequence-conversion systems. Machine translation: input is speech in the source language, output is speech in the target language. Speech recognition: input is a speech audio feature vector sequence, output is a transcribed word or character sequence. 130

Attention models in image captioning. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", Xu et al., 2016. The encoder network is a convolutional neural network; the filter outputs at each location are the equivalent of the h_i in the regular sequence-to-sequence model. 131
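A minimal sketch of that equivalence, with assumed shapes: the flattened CNN feature map plays the role of the encoder states h_i, and attention is a softmax over spatial locations.

```python
# Minimal sketch: spatial attention over CNN feature-map locations.
import torch

fmap = torch.randn(1, 512, 14, 14)        # CNN feature map: 512 filters over a 14x14 grid
h = fmap.flatten(2).squeeze(0).t()        # h_i for i = 1..196, each a 512-dim vector
s = torch.randn(512)                      # current decoder state (same size here, assumed)
w = torch.softmax(h @ s, dim=0)           # one attention weight per image location
z = w @ h                                 # context vector attended over the image
```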

In closing: we have looked at various forms of sequence-to-sequence models, which are generalizations of recurrent neural network formalisms. For more details, please refer to the papers; post on Piazza if you have questions. This will appear in HW4: speech recognition with attention models. 132