CSC321 Lecture 10: Training RNNs

CSC321 Lecture 10: Training RNNs. Roger Grosse and Nitish Srivastava. February 23, 2015.

Overview. Last time, we saw that RNNs can perform some interesting computations. Now let's look at how to train them. We'll use backprop through time, which is really just backprop. There are only two new ideas we need to think about: weight constraints, and exploding and vanishing gradients.

Backprop through time. First, some general advice. There are two ways to think about backpropagation: (1) computationally, in terms of the messages passed between units in the network; (2) abstractly, as a way of computing partial derivatives ∂C/∂w_ij. When implementing the algorithm or reasoning about the running time, think about the messages being passed. When reasoning about the qualitative behavior of gradient descent, think about what the partial derivatives really mean, and forget about the message passing! This is the viewpoint we'll take when we look at weight constraints and exploding/vanishing gradients.

Backprop through time. Consider this RNN, where we have the weight constraint that w_1 = w_2 = w_3 = w. Assume for simplicity that all units are linear. We want to compute ∂C/∂w, which tells us how the cost changes when we make a small change to w. Changing w corresponds to changing w_1, w_2, and w_3. Since they all affect different outputs, their contributions to the loss sum together:

∂C/∂w = ∂C/∂w_1 + ∂C/∂w_2 + ∂C/∂w_3.

Each of these terms can be computed with standard backprop.
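
To make the tied-weight rule concrete, here is a minimal sketch in Python (a toy linear model with made-up inputs, targets, and weight value, not the network from the slide's figure). It sums the per-copy partials ∂C/∂w_i and checks the result against a finite-difference estimate of ∂C/∂w for the shared weight:

```python
import numpy as np

# Toy setup (illustrative values, not from the lecture): three uses of a tied
# weight w, one per time step, all linear, with squared-error cost.
x = np.array([0.5, -1.0, 2.0])   # inputs at t = 1, 2, 3
t = np.array([1.0, 0.0, 1.0])    # targets
w = 0.8                          # the shared (tied) weight

def cost(w):
    y = w * x                    # y_t = w * x_t (all units linear)
    return 0.5 * np.sum((y - t) ** 2)

# Per-copy partials dC/dw_i, computed as if each use of w were a separate weight.
y = w * x
per_copy = (y - t) * x           # dC/dw_i = (y_i - t_i) * x_i
print("sum of per-copy partials:", per_copy.sum())

# Finite-difference estimate of dC/dw for the shared weight.
eps = 1e-6
print("finite difference:       ", (cost(w + eps) - cost(w - eps)) / (2 * eps))
```

The two printed numbers agree (up to finite-difference error), which is exactly the statement ∂C/∂w = ∂C/∂w_1 + ∂C/∂w_2 + ∂C/∂w_3.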

Backprop through time. Now let's consider the hidden-to-hidden weights. As before, we constrain w_1 = w_2 = w. The partial derivative ∂C/∂w tells us how a small change to w affects the cost. Conceptually, this case is more complicated than the previous one, since changes to w_1 and w_2 will interact nonlinearly. However, we assume the changes are small enough that these nonlinear interactions are negligible. As before, the contributions sum together:

∂C/∂w = ∂C/∂w_1 + ∂C/∂w_2.

Exploding and vanishing gradients. Geoff referred to the problem of exploding and vanishing gradients. This metaphor reflects the process by which gradients are computed during backpropagation:

∂C/∂h_1 = (∂h_2/∂h_1)(∂h_3/∂h_2) ··· (∂y/∂h_N)(∂C/∂y).

These terms multiply together. If they're all larger than 1, the gradient explodes. If they're all smaller, it vanishes.
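
As a quick numerical illustration (my own toy numbers, not from the slides): multiplying many per-step factors that are each slightly above or slightly below 1 already produces the explosion or the vanishing.

```python
import numpy as np

# Product of N per-step derivatives that are all slightly above or below 1.
N = 100
for factor in (1.1, 0.9):
    product = np.prod(np.full(N, factor))
    print(f"factor {factor}: product over {N} steps = {product:.3e}")
# factor 1.1 -> ~1.4e+04 (explodes); factor 0.9 -> ~2.7e-05 (vanishes).
```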

Exploding and vanishing gradients. But I said earlier that we should reason in terms of what partial derivatives really mean, not in terms of backprop computations. Let's apply this to vanishing/exploding gradients. Consider a network which has one logistic hidden unit at each time step, and no inputs or outputs. This is an example of an iterated function.

Exploding and vanishing gradients. Consider the following iterated function: x_{t+1} = x_t^2 + 0.15. We can determine the behavior of repeated iterations visually, and the behavior of the system can be summarized with a phase plot.
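
The plots themselves are not reproduced in this transcription, so here is a small numerical stand-in (the starting points and iteration count are arbitrary choices): iterate the map from a few initial values and see where they end up.

```python
# Iterate x_{t+1} = x_t**2 + 0.15. The fixed points solve x = x**2 + 0.15,
# i.e. x ≈ 0.184 (a sink) and x ≈ 0.816 (a source).
def iterate(x0, steps=50):
    x = x0
    for _ in range(steps):
        x = x ** 2 + 0.15
        if x > 1e6:              # diverging; stop before floating-point overflow
            return float("inf")
    return x

for x0 in (0.0, 0.5, 0.8, 0.82, 0.9):
    print(f"x0 = {x0:4.2f} -> x_50 = {iterate(x0):.6g}")
# Starting points below the source (~0.816) settle at the sink (~0.184);
# starting points above it blow up.
```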

Question 1: Exploding and vanishing gradients. Now consider an RNN with a single logistic hidden unit with weight 6 and bias -3, and no inputs or outputs. This corresponds to the iteration h_t = f(h_{t-1}) = σ(6h_{t-1} - 3). This function is shown on the right. Let r = h_1 denote the initial activation. The fixed points are h = 0.07, 0.5, 0.93. (1) Draw the phase plot for this system. (2) Approximately sketch the value of h_100 as a function of r. (3) If C is some cost function evaluated at time 100, for which values of r do you expect the gradient ∂C/∂r to vanish or explode?

Question 1: Exploding and vanishing gradients. The gradient explodes for r = 0.5 and vanishes elsewhere.
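
A numerical sketch of this answer (my own unrolling code, with arbitrary test values of r): iterate h_t = σ(6h_{t-1} - 3) for 100 steps and accumulate dh_100/dr with the chain rule.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unroll(r, steps=100):
    h, grad = r, 1.0
    for _ in range(steps):
        h = sigmoid(6.0 * h - 3.0)
        grad *= 6.0 * h * (1.0 - h)   # dh_t/dh_{t-1} = 6 * sigma'(6*h_{t-1} - 3)
    return h, grad

for r in (0.2, 0.49, 0.5, 0.51, 0.8):
    h100, dh100_dr = unroll(r)
    print(f"r = {r:4.2f}: h_100 = {h100:.3f}, dh_100/dr = {dh100_dr:.3e}")
# Away from r = 0.5, h_100 saturates near the sinks 0.07 or 0.93 and the
# gradient vanishes; exactly at the unstable fixed point r = 0.5 it explodes
# (the product is 1.5**100, about 4e17).
```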

Question 1: Exploding and vanishing gradients. Some observations: Fixed points of f correspond to points where f crosses the line x_{t+1} = x_t. Fixed points with f'(x_t) > 1 correspond to sources; fixed points with f'(x_t) < 1 correspond to sinks. Note that iterated functions can behave in really complicated ways! (E.g., the Mandelbrot set.)

Exploding and vanishing gradients. The real problem isn't backprop; it's the fact that long-range dependencies are really complicated! The memorization task exemplifies how the issue can arise in RNNs: the network must read and memorize the input sequence, then spit it out again. This forces it to make good use of its hidden unit capacity. Two strategies for dealing with exploding/vanishing gradients: (1) minimize the long-distance dependencies; (2) keep the network's transformation close to the identity function.

Long Short-Term Memory. Replace each single unit in an RNN by a memory block:

c_{t+1} = c_t · (forget gate) + (new input) · (input gate).

This prevents vanishing gradients: if the forget gate and input gate are mostly 1, the cell state effectively adds up the inputs, so the gradient ∂C/∂inputs does not decay as it is sent back. These designs are called constant error carousels. The model can of course do more interesting things, like selectively adding up inputs by turning the input gate on and off. Its default behavior is to maintain the same values of the memory cells, which is the simplest form of long-distance dependency.
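
Here is a minimal sketch of the memory-cell update above, with the gates treated as fixed numbers rather than learned functions of the inputs and hidden state (so it illustrates the constant error carousel, not a full LSTM):

```python
def run_cell(inputs, forget_gate=1.0, input_gate=1.0):
    """Iterate c_{t+1} = forget_gate * c_t + input_gate * input_t."""
    c = 0.0
    dcT_dc0 = 1.0                    # d c_T / d c_0, accumulated step by step
    for g in inputs:
        c = forget_gate * c + input_gate * g
        dcT_dc0 *= forget_gate       # each step multiplies this gradient by the forget gate
    return c, dcT_dc0

inputs = [0.5, -0.2, 0.1, 0.7]
print(run_cell(inputs, forget_gate=1.0))   # cell sums the inputs; gradient stays exactly 1
print(run_cell(inputs, forget_gate=0.5))   # leaky cell; gradient decays as 0.5**T
```

With the forget gate at 1, the gradient sent back through the cell is multiplied by 1 at every step, which is the sense in which it "does not decay".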

Exploding and vanishing gradients. Dealing with exploding/vanishing gradients:
(1) Long Short-Term Memory (LSTM): it is easy to preserve parts of the hidden state over time, which simplifies the form of the dependencies between distant time steps.
(2) Reverse the input or output sequence: then at least some predictions only require short-range dependencies, and the network can learn to predict these first, before learning the harder ones.
(3) Clip the gradients so that their norm is smaller than some maximum value (a minimal sketch follows below). This throws away information, but at least the weight updates are better behaved.
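
For the last strategy, a minimal sketch of clipping by norm (the threshold of 5 is an arbitrary choice, not a value from the lecture):

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    """If ||grad|| exceeds max_norm, rescale grad so its norm equals max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])          # an "exploded" gradient with norm 50
print(clip_by_norm(g))               # rescaled to norm 5: [ 3. -4.]
```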

Neural Machine Translation. We'd like to translate, e.g., English to French sentences, and we have pairs of translated sentences to train on. What's wrong with the following setup? The sentences might not be the same length, and the words might not align.

Neural Machine Translation. Instead, the network first reads and memorizes the sentence. When it sees the END token, it starts outputting the translation. Demo from Montreal: http://104.131.78.120/
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio. EMNLP 2014.
Sequence to Sequence Learning with Neural Networks. Ilya Sutskever, Oriol Vinyals, and Quoc Le. NIPS 2014.
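
To make the read-then-emit control flow concrete, here is a structural sketch with a tiny vanilla RNN and random, untrained weights (the vocabulary, sizes, and greedy decoding are my own toy choices; this is not the architecture of the papers cited above):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["le", "chat", "noir", "the", "cat", "black", "END"]
V, H = len(vocab), 8
emb   = rng.normal(size=(V, H)) * 0.1    # token embeddings
W_hh  = rng.normal(size=(H, H)) * 0.1    # hidden-to-hidden weights
W_out = rng.normal(size=(H, V)) * 0.1    # hidden-to-vocabulary weights

def step(h, token_id):
    return np.tanh(emb[token_id] + W_hh @ h)   # simple vanilla-RNN update

def translate(source_tokens, max_len=10):
    h = np.zeros(H)
    for tok in source_tokens + ["END"]:        # encoder: read and memorize the sentence
        h = step(h, vocab.index(tok))
    out, prev = [], vocab.index("END")
    for _ in range(max_len):                   # decoder: emit until it produces END
        h = step(h, prev)
        prev = int(np.argmax(W_out.T @ h))     # greedy choice of the next word
        if vocab[prev] == "END":
            break
        out.append(vocab[prev])
    return out

print(translate(["the", "black", "cat"]))      # gibberish (or nothing): the weights are untrained
```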

Handwriting generation. Input: character sequence. Targets: real values describing the motion of the pen. Training data: pairs of text and tracked motion. http://www.cs.toronto.edu/~graves/handwriting.html