CSC321 Lecture 15: Exploding and Vanishing Gradients

Roger Grosse

Overview
Yesterday, we saw how to compute the gradient descent update for an RNN using backprop through time. The updates are mathematically correct, but unless we're very careful, gradient descent completely fails because the gradients explode or vanish. The problem is, it's hard to learn dependencies over long time windows. Today's lecture is about what causes exploding and vanishing gradients, and how to deal with them. Or, equivalently, how to learn long-term dependencies.

Why Gradients Explode or Vanish
Recall the RNN for machine translation. It reads an entire English sentence, and then has to output its French translation. A typical sentence length is 20 words. This means there's a gap of 20 time steps between when it sees information and when it needs it. The derivatives need to travel over this entire pathway.

Why Gradients Explode or Vanish
Recall backprop through time, writing ā for the error signal ∂L/∂a:

Activations:
  L̄ = 1
  ȳ^(t) = L̄ ∂L/∂y^(t)
  r̄^(t) = ȳ^(t) φ'(r^(t))
  h̄^(t) = r̄^(t) v + z̄^(t+1) w
  z̄^(t) = h̄^(t) φ'(z^(t))

Parameters:
  ū = Σ_t z̄^(t) x^(t)
  v̄ = Σ_t r̄^(t) h^(t)
  w̄ = Σ_t z̄^(t+1) h^(t)

Why Gradients Explode or Vanish
Consider a univariate version of the encoder network. Backprop updates:
  h̄^(t) = z̄^(t+1) w
  z̄^(t) = h̄^(t) φ'(z^(t))
Applying this recursively:
  h̄^(1) = w^(T-1) φ'(z^(2)) ··· φ'(z^(T)) h̄^(T)
where the factor multiplying h̄^(T) is the Jacobian ∂h^(T)/∂h^(1).
With linear activations: ∂h^(T)/∂h^(1) = w^(T-1)
  Exploding: w = 1.1, T = 50 gives ∂h^(T)/∂h^(1) = 117.4
  Vanishing: w = 0.9, T = 50 gives ∂h^(T)/∂h^(1) = 0.00515
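
To make the scalar example concrete, here is a tiny Python sketch (my own, not from the lecture) that just evaluates the power of w; the quoted numbers 117.4 and 0.00515 are what you get from w raised to the 50th power:

    # Minimal sketch (not from the lecture): in the linear scalar case the
    # gradient scales like a power of w over T time steps.
    T = 50
    for w in (1.1, 0.9):
        print(f"w = {w}: w**T = {w**T:.5f}")
    # w = 1.1: 117.39085 (explodes)   w = 0.9: 0.00515 (vanishes)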

Why Gradients Explode or Vanish
More generally, in the multivariate case, the Jacobians multiply:
  ∂h^(T)/∂h^(1) = ∂h^(T)/∂h^(T-1) ··· ∂h^(2)/∂h^(1)
Matrices can explode or vanish just like scalar values, though it's slightly harder to make precise. Contrast this with the forward pass: the forward pass has nonlinear activation functions which squash the activations, preventing them from blowing up. The backward pass is linear, so it's hard to keep things stable. There's a thin line between exploding and vanishing.
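
The same blow-up and decay can be seen with matrices. As a small illustration of my own (not from the slides), multiply T per-step Jacobian stand-ins, here scaled random orthogonal matrices, and look at the norm of the product:

    # Illustration (my own): product of T per-step Jacobians.  Scaled orthogonal
    # matrices make the 2-norm of the product exactly scale**T, mirroring the
    # scalar example.
    import numpy as np

    rng = np.random.default_rng(0)
    T, d = 50, 10
    for scale in (1.1, 0.9):
        product = np.eye(d)
        for _ in range(T):
            Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
            product = (scale * Q) @ product
        print(f"scale = {scale}: ||product||_2 = {np.linalg.norm(product, 2):.5f}")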

Why Gradients Explode or Vanish
We just looked at exploding/vanishing gradients in terms of the mechanics of backprop. Now let's think about it conceptually. The Jacobian ∂h^(T)/∂h^(1) means: how much does h^(T) change when you change h^(1)? Each hidden layer computes some function of the previous hidden units and the current input:
  h^(t) = f(h^(t-1), x^(t))
This function gets iterated:
  h^(4) = f(f(f(h^(1), x^(2)), x^(3)), x^(4))
Let's study iterated functions as a way of understanding what RNNs are computing.

Iterated Functions
Iterated functions are complicated. Consider:
  f(x) = 3.5 x (1 - x)

Iterated Functions
An aside: remember the Mandelbrot set? That's based on an iterated quadratic map over the complex plane:
  z_n = z_{n-1}^2 + c
The set consists of the values of c for which the iterates stay bounded.
CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=321973

Iterated Functions
Consider the following iterated function:
  x_{t+1} = x_t^2 + 0.15
We can determine the behavior of repeated iterations visually; the behavior of the system can be summarized with a phase plot.

Iterated Functions
Some observations:
  Fixed points of f correspond to points where f crosses the line x_{t+1} = x_t.
  Fixed points with |f'(x_t)| > 1 correspond to sources.
  Fixed points with |f'(x_t)| < 1 correspond to sinks.
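
As a quick numerical check of these observations (my own sketch, not from the lecture): for f(x) = x^2 + 0.15 the fixed points are roughly 0.184 and 0.816, and f'(x) = 2x, so the first is a sink and the second a source. Starting points below the source get pulled to the sink; starting points just above it run off to infinity.

    # Sketch (my own): iterate f(x) = x**2 + 0.15 and watch the fixed points.
    # Fixed points solve x**2 - x + 0.15 = 0, i.e. x ~ 0.184 (sink) and x ~ 0.816 (source).
    import numpy as np

    print("fixed points:", np.roots([1.0, -1.0, 0.15]))

    def iterate(x, n=30):
        for _ in range(n):
            x = x * x + 0.15
        return x

    for x0 in (0.30, 0.81, 0.82):
        print(f"start {x0}: after 30 iterations x = {iterate(x0):.4f}")
    # 0.30 and 0.81 settle at the sink near 0.1838; 0.82 (just above the source) blows up.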

Why Gradients Explode or Vanish
Let's imagine an RNN's behavior as a dynamical system, which has various attractors (figure credit: Geoffrey Hinton, Coursera). Within one of the colored regions, the gradients vanish because even if you move a little, you still wind up at the same attractor. If you're on the boundary, the gradient blows up because moving slightly moves you from one attractor to the other.

Why Gradients Explode or Vanish
Consider an RNN with tanh activation function.
[Figure: the function computed by the network.]

Why Gradients Explode or Vanish
Cliffs make it hard to estimate the true cost gradient. The slide plots the loss and cost functions with respect to the bias parameter for the hidden units. Generally, the gradients will explode on some inputs and vanish on others. In expectation, the cost may be fairly smooth.

Keeping Things Stable
One simple solution: gradient clipping. Clip the gradient g so that it has a norm of at most η:
  if ||g|| > η:  g ← η g / ||g||
The gradients are biased, but at least they don't blow up.
(Figure: Goodfellow et al., Deep Learning)
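
A minimal NumPy sketch of norm-based clipping (my own illustration; the threshold and the example gradient are made up):

    # Minimal sketch (my own): rescale the gradient so its norm is at most eta.
    import numpy as np

    def clip_gradient(g, eta):
        norm = np.linalg.norm(g)
        if norm > eta:
            g = eta * g / norm
        return g

    g = np.array([30.0, 40.0])            # norm 50
    print(clip_gradient(g, eta=5.0))      # rescaled to norm 5: [3. 4.]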

Keeping Things Stable
Another trick: reverse the input sequence. This way, there's only one time step between the first word of the input and the first word of the output. The network can first learn short-term dependencies between early words in the sentence, and then long-term dependencies between later words.

Keeping Things Stable
Really, we're better off redesigning the architecture, since the exploding/vanishing problem highlights a conceptual problem with vanilla RNNs. The hidden units are a kind of memory. Therefore, their default behavior should be to keep their previous value. I.e., the function at each time step should be close to the identity function. It's hard to implement the identity function if the activation function is nonlinear! If the function is close to the identity, the gradient computations are stable: the Jacobians ∂h^(t+1)/∂h^(t) are close to the identity matrix, so we can multiply them together and things don't blow up.

Keeping Things Stable
Identity RNNs: use the ReLU activation function, and initialize all the weight matrices to the identity matrix. Negative activations are clipped to zero, but for positive activations, units simply retain their value in the absence of inputs. This allows learning much longer-term dependencies than vanilla RNNs. It was able to learn to classify MNIST digits, input as a sequence one pixel at a time!
Le et al., 2015. A simple way to initialize recurrent networks of rectified linear units.
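
A minimal NumPy sketch of the identity-RNN idea (my own illustration of the recipe in Le et al., 2015; the sizes, the small input-weight scale, and the name irnn_step are assumptions):

    # Sketch (my own): identity recurrent weights + ReLU.  With zero input,
    # positive hidden units keep their value exactly, step after step.
    import numpy as np

    rng = np.random.default_rng(0)
    hidden_size, input_size = 8, 4
    W_hh = np.eye(hidden_size)                                    # recurrent weights: identity
    W_xh = 0.01 * rng.standard_normal((hidden_size, input_size))  # small random input weights
    b = np.zeros(hidden_size)

    def irnn_step(h_prev, x):
        return np.maximum(0.0, W_hh @ h_prev + W_xh @ x + b)      # ReLU

    h = np.ones(hidden_size)
    for _ in range(100):
        h = irnn_step(h, np.zeros(input_size))
    print(h)   # still all ones: the default behavior is to remember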

Long Short-Term Memory
Another architecture which makes it easy to remember information over long time periods is called Long Short-Term Memory (LSTM). What's with the name? The idea is that a network's activations are its short-term memory and its weights are its long-term memory. The LSTM architecture wants the short-term memory to last for a long time period. It's composed of memory cells which have controllers saying when to store or forget information.

Long Short-Term Memory
Replace each single unit in an RNN by a memory block:
  c_{t+1} = c_t * (forget gate) + (new input) * (input gate)
  i = 0, f = 1: remember the previous value
  i = 1, f = 1: add to the previous value
  i = 0, f = 0: erase the value
  i = 1, f = 0: overwrite the value
Setting i = 0, f = 1 gives the reasonable default behavior of just remembering things.
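
To spell out the table above (a tiny sketch of my own, not from the slides), here is the cell update for the four gate settings, with a made-up current value and new input:

    # Sketch (my own): c_next = f * c_prev + i * g for the four gate settings.
    c_prev, g = 2.0, 5.0    # current cell value and candidate new input (made up)
    for i, f in [(0, 1), (1, 1), (0, 0), (1, 0)]:
        print(f"i={i}, f={f}: c_next = {f * c_prev + i * g}")
    # i=0,f=1 -> 2.0 (remember)   i=1,f=1 -> 7.0 (add)
    # i=0,f=0 -> 0.0 (erase)      i=1,f=0 -> 5.0 (overwrite)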

Long Short-Term Memory
In each step, we have a vector of memory cells c, a vector of hidden units h, and vectors of input, output, and forget gates i, o, and f. There's a full set of connections from all the inputs and hiddens to the block input and all of the gates:
  [i_t]   [ σ  ]
  [f_t] = [ σ  ]  ( W [y_t; h_{t-1}] )
  [o_t]   [ σ  ]
  [g_t]   [tanh]
  c_t = f_t ∘ c_{t-1} + i_t ∘ g_t
  h_t = o_t ∘ tanh(c_t)
where ∘ denotes elementwise multiplication. Exercise: show that if f_{t+1} = 1, i_{t+1} = 0, and o_t = 0, the gradients for the memory cell get passed through unmodified, i.e. c̄_t = c̄_{t+1}.
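
Here is a minimal NumPy sketch of one step of these equations (my own illustration; the shapes, the gate ordering inside W, and the name lstm_step are assumptions, and y denotes the input as on the slide):

    # Sketch (my own): one LSTM step following the equations above (no biases).
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(y, h_prev, c_prev, W):
        d = h_prev.size
        pre = W @ np.concatenate([y, h_prev])   # all four pre-activations at once
        i = sigmoid(pre[0*d:1*d])               # input gate
        f = sigmoid(pre[1*d:2*d])               # forget gate
        o = sigmoid(pre[2*d:3*d])               # output gate
        g = np.tanh(pre[3*d:4*d])               # candidate cell input
        c = f * c_prev + i * g                  # c_t = f ∘ c_{t-1} + i ∘ g
        h = o * np.tanh(c)                      # h_t = o ∘ tanh(c_t)
        return h, c

    rng = np.random.default_rng(0)
    input_size, hidden_size = 3, 5
    W = rng.standard_normal((4 * hidden_size, input_size + hidden_size))
    h, c = np.zeros(hidden_size), np.zeros(hidden_size)
    h, c = lstm_step(rng.standard_normal(input_size), h, c, W)
    print(h.shape, c.shape)   # (5,) (5,)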

Long Short-Term Memory
Sound complicated? ML researchers thought so, so LSTMs were hardly used for about a decade after they were proposed. In 2013 and 2014, researchers used them to get impressive results on challenging and important problems like speech recognition and machine translation. Since then, they've been one of the most widely used RNN architectures. There have been many attempts to simplify the architecture, but nothing was conclusively shown to be simpler and better. You never have to think about the complexity, since frameworks like TensorFlow provide nice black box implementations.

Long Short-Term Memory
Visualizations: http://karpathy.github.io/2015/05/21/rnn-effectiveness/