Learning Long Term Dependencies with Gradient Descent is Difficult


Learning Long-Term Dependencies with Gradient Descent is Difficult. IEEE Trans. on Neural Networks, 1994. Yoshua Bengio, Patrice Simard, Paolo Frasconi. Presented by: Matt Grimes, Ayse Naz Erkan.

Recurrent Neural Networks. Definition: a neural network with feedback. Provides memory and context-sensitive processing of streams. Illustration from Introduction to Machine Learning, E. Alpaydin, 2005.

Training algorithms for recurrent networks. Backpropagation through time (Rumelhart, Hinton, Williams, 1986): applies backprop to recurrent networks by unfolding them through N timesteps and treating them like static networks (stores the activations of units while going forward in time). Others: forward-propagation algorithms, which are computationally more expensive but local in time, so they can be applied to online learning. A variant for constrained recurrent networks (Gori, Bengio, Mori, 1989; Mozer, 1989) is local in time, with a computational requirement proportional to the number of weights, but has limited storage capabilities with general sequences.
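
To make the unfolding concrete, here is a minimal BPTT sketch (my own illustration, not code from the paper) for the single recurrent neuron x_t = tanh(w x_{t-1} + h_t) used later in the talk; the squared-error cost at the final step and the specific values of w and h are illustrative assumptions.

```python
import numpy as np

def bptt_single_neuron(w, h, x0=0.0):
    """Return dC/dw for C = 0.5 * (x_T - target)^2 on one input sequence h[0..T-1]."""
    T = len(h)
    x = np.empty(T + 1)
    a = np.empty(T + 1)
    x[0] = x0
    # Forward pass: unfold through time and store the activations.
    for t in range(1, T + 1):
        a[t] = w * x[t - 1] + h[t - 1]
        x[t] = np.tanh(a[t])
    target = 1.0
    dC_dx = x[T] - target              # gradient of the cost at the final state
    dC_dw = 0.0
    # Backward pass through the unfolded (static) network.
    for t in range(T, 0, -1):
        dC_da = dC_dx * (1.0 - x[t] ** 2)   # tanh'(a_t) = 1 - x_t^2
        dC_dw += dC_da * x[t - 1]           # the same weight w is reused at every timestep
        dC_dx = dC_da * w                   # propagate the error to the previous state
    return dC_dw

print(bptt_single_neuron(w=1.25, h=np.full(20, 0.1)))
```

The backward loop is the part that matters for the rest of the talk: the factor w · tanh'(a_t) multiplies the error at every step, which is exactly where the vanishing-gradient behaviour discussed later comes from.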

From the dynamical-systems perspective: three basic requirements for being able to learn relevant state information.

Feature wishlist I. Be able to store information for an arbitrary duration, e.g. be able to learn constants such as g = 9.8 m/s².

Feature wishlist II. Be resistant to noise and irrelevant inputs (e.g. noisy observations of x, v, ...).

Feature wishlist III. System parameters should be trainable in a reasonable amount of time, e.g. fitting x_t = (1/2) g t².

Problem. Storage in recurrent neural nets via gradient-based backprop is problematic due to the trade-off between robust storage and efficient learning.

Minimal task problem, a simple case: classify a sequence based on its first L values; the decision is given at the end. Therefore, the system must remember its state for an arbitrary amount of time (1st requirement). The values after the first L are all Gaussian noise, so they are irrelevant for determining the class of the input sequence (2nd requirement).
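
For concreteness, here is one way such training sequences could be generated (my own sketch; the ±1 prefix, the noise scale s, and the lengths are illustrative assumptions, and in the paper the first L input values are themselves trainable parameters rather than fixed).

```python
import numpy as np

def make_sequence(label, L=3, T=20, s=0.2, seed=None):
    """Class information is carried only by the first L values; the rest is noise."""
    rng = np.random.default_rng(seed)
    h = rng.normal(0.0, s, size=T)           # Gaussian noise everywhere (irrelevant values)
    h[:L] += 1.0 if label == 1 else -1.0     # informative prefix: +1 for class 1, -1 for class 0
    return h

print(make_sequence(label=1, seed=0)[:6])
print(make_sequence(label=0, seed=1)[:6])
```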

Minimal task problem. We are interested in the learning capabilities on this latching problem. The big question is: can the latching subsystem propagate error information back to the subsystem that feeds it? The ability to learn h_0, ..., h_L is the measure of effectiveness of the gradient information (3rd requirement). (Figure: a subsystem that computes class information feeds the latching subsystem; h_t is the result of the computation that extracts class information, and x_t is the latching subsystem's state.)

A simple recurrent network candidate solution: a single recurrent neuron with trainable inputs h_t. Activation a_t = w x_{t-1} + h_t, output x_t = tanh(a_t), parameter w. (Figure: x_{t-1} is delayed, multiplied by w, summed with h_t to give a_t, then passed through tanh to give x_t.)

Simple recurrent network candidate solution. An attractor is an invariant point, a_t = a_{t-1}. For the autonomous case (h_t = 0) the map is a_t = w tanh(a_{t-1}), so attractors sit at the solutions of a = w tanh(a); nonzero attractors exist when |w| > 1.
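
A short derivation (my own sketch, consistent with the claim above) of when nonzero attractors exist for the autonomous map:

```latex
% Fixed points of the autonomous map  a_t = w\tanh(a_{t-1}).
\begin{align*}
  f(a) &= w\tanh(a), & f(0) &= 0, & f'(a) &= w\,(1-\tanh^2 a), & f'(0) &= w.
\end{align*}
% If |w| \le 1:  |f(a)| = |w|\,|\tanh a| < |a| for all a \ne 0, so a = 0 is the
% unique fixed point and it is attracting.
% If w > 1:  f'(0) = w > 1 makes a = 0 unstable; since f(a) > a near 0^+ while
% f is bounded by w, there is a crossing a^* > 0 with f(a^*) = a^* and
% 0 < f'(a^*) < 1, and by symmetry -a^* is also a fixed point.
% These \pm a^* are the nonzero attractors used to latch one bit of state.
```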

Simple recurrent network candidate solution. Non-autonomous case (input h_t ≠ 0): there is an escape threshold h*. As long as |h_t| < h*, the neuron won't escape the attractor: it stays latched near the same attractor. If |h_t| > h*, the system may be pushed to another attractor.
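
A small numerical illustration of this latching behaviour (my own; w = 1.25 and the input magnitudes are illustrative choices, and for this particular recursion a sustained input of magnitude roughly 0.08 marks the escape point).

```python
import numpy as np

def run(w, h_sequence, x0=0.8):
    """Iterate x_t = tanh(w * x_{t-1} + h_t) and return the final state."""
    x = x0
    for h in h_sequence:
        x = np.tanh(w * x + h)
    return x

w = 1.25
print(run(w, np.zeros(50)))          # no input: settles near the positive attractor (~0.71)
print(run(w, np.full(50, -0.05)))    # |h| below the escape threshold: stays latched positive
print(run(w, np.full(50, -0.50)))    # |h| above the threshold: pushed to the negative attractor
```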

Experimental results for the minimal task problem. (Plot 1: convergence density over w0 (initial weight) vs. s (noise variance), with L = 3, T = 20.) Bigger noise s = harder to learn; worse initial w = harder to learn. (Plot 2: frequency of training convergence vs. T, with s = 0.2.) Longer noise sequence = harder to learn.

Discrete dynamical systems. The basin of attraction of an attractor X: the set of points a that converge to X under repeated application of the update map M. (Figure: a, M(a), M²(a), M³(a), M⁴(a) approaching X.)

Discrete dynamical systems. The map M is contracting on a set A if, for any two points a1, a2 in A, ‖M(a1) − M(a2)‖ < ‖a1 − a2‖.

Discrete dynamical systems. X is a hyperbolic attractor that contracts on its reduced attracting set Γ if, for all a in Γ, all eigenvalues of M′(a) have absolute value less than 1. (Figure: the reduced attracting set inside the basin of attraction of X.)

Discrete dynamical systems. The sphere of uncertainty around a point, if contained in the reduced attracting set Γ, will shrink at each timestep under no input.

Discrete dynamical systems. A system in state a is robustly latched to X if no admissible input u will bump it out of X's basin of attraction. (Figure: the state a and the perturbed state a + u, both still near X.)

Discrete dynamical systems. Let λ be the maximum shrinkage rate. M^t(a + u) will remain within a radius d of M^t(a) as long as ‖u_t‖ < (1 − λ) d.

Discrete dynamical systems. The shrinkage rate at a point is |M′|; within the reduced attracting set Γ, |M′| < 1. Since ∂a_t/∂a_{t−1} = M′(a_{t−1}), we have ∂a_t/∂a_0 = ∏_{τ=1}^{t} M′(a_{τ−1}), so ∂a_t/∂a_0 vanishes as t → ∞.
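
The vanishing of ∂a_t/∂a_0 is easy to see numerically; the sketch below (my own, with w = 1.25 as an illustrative value) multiplies the one-dimensional Jacobians M′(a_{τ−1}) along a trajectory that settles onto the attractor.

```python
import numpy as np

w = 1.25
a = 1.0                     # start inside the basin of the positive attractor of M(a) = w*tanh(a)
jacobian_product = 1.0      # accumulates da_t/da_0 = prod_tau M'(a_{tau-1})
for t in range(1, 51):
    deriv = w * (1.0 - np.tanh(a) ** 2)    # M'(a_{t-1}); |M'| < 1 near the attractor
    jacobian_product *= deriv
    a = w * np.tanh(a)
    if t in (1, 10, 25, 50):
        print(t, jacobian_product)         # shrinks roughly geometrically toward 0
```

Near the attractor M′(a*) ≈ 0.62 for w = 1.25, so the product decays roughly like 0.62^t.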

Effect of the gradient. The derivative of the cost C_t at time t with respect to the parameters W of the system is ∂C_t/∂W = Σ_{τ ≤ t} (∂C_t/∂a_t)(∂a_t/∂a_τ)(∂a_τ/∂W). From the previous analysis we know that, for terms with τ ≪ t, ∂a_t/∂a_τ → 0, so those terms contribute almost nothing. This means that even though a change in the early state variables could move the system to a different basin of attraction, this is not reflected in the gradient information.

Effect of the gradient. (Figure: subsystem A feeds subsystem B.) If B has not been trained enough to latch, then the gradient of B's output with respect to A's early output will be zero: A's early outputs can't influence B's error at time T, because B does not remember them. On the other hand, if B has latched, then the terms in the gradient that refer to the earlier states (the times when real input came, in other words what really matters for training the classifier) have vanished according to the theorem. So training A is very difficult.

Alternatives to gradient descent: stochastic search (simulated annealing, multi-grid random search); curvature-weighted gradient descent (quasi-Newton, time-weighted quasi-Newton).

Simulated annealing. Metropolis criterion: accept an uphill step that increases the cost by ΔC with probability P(accept) = exp(−ΔC / T). There are many annealing approaches, e.g. dynamic selection within a hyperrectangle whose dimensions are updated to maintain the acceptance rate; a deterministic annealing example comes from Matthew Brand and Aaron Hertzmann.
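
A minimal sketch of the Metropolis rule quoted above (my own toy cost and cooling schedule, not the paper's experimental setup, and without the adaptive hyperrectangle):

```python
import math
import random

def simulated_annealing(cost, x0, step=0.5, T0=1.0, cooling=0.99, iters=2000, seed=0):
    rng = random.Random(seed)
    x, c, T = x0, cost(x0), T0
    for _ in range(iters):
        candidate = x + rng.uniform(-step, step)
        c_new = cost(candidate)
        # Downhill moves are always accepted; uphill moves with probability exp(-dC / T).
        if c_new <= c or rng.random() < math.exp(-(c_new - c) / T):
            x, c = candidate, c_new
        T *= cooling                 # gradually lower the temperature
    return x, c

# Toy one-dimensional cost with several local minima.
print(simulated_annealing(lambda x: (x - 3.0) ** 2 + math.sin(5.0 * x), x0=-5.0))
```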

Multi-grid random search. The problem isn't local minima, it's long plateaus (small gradients). Multi-grid random search is essentially the annealing search above without the uphill steps: it only accepts candidate points that reduce the cost.
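
A rough sketch of that idea (my own; the shrinking sampling width stands in for the paper's grid-refinement schedule, which is not reproduced exactly here):

```python
import math
import random

def downhill_random_search(cost, x0, width=4.0, shrink=0.999, iters=5000, seed=0):
    rng = random.Random(seed)
    best_x, best_c = x0, cost(x0)
    for _ in range(iters):
        candidate = best_x + rng.uniform(-width, width)
        c = cost(candidate)
        if c < best_c:                 # never accept uphill moves
            best_x, best_c = candidate, c
        width *= shrink                # progressively refine the search region
    return best_x, best_c

print(downhill_random_search(lambda x: (x - 3.0) ** 2 + math.sin(5.0 * x), x0=-5.0))
```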

Quasi-Newton methods. The Newton update Δw = −H⁻¹ g uses the Hessian H for a more accurate step. It is costly, O(NW²) (N = number of examples, W = number of weights), and it could point toward a maximum if H is not positive definite. (Figure: from w0, the gradient step −g vs. the Newton step −H⁻¹g.)

Quasi-Newton methods. An alternative (Becker, LeCun 1989): use only the diagonal elements of the Hessian, computable in O(NW) time, giving a separate equation for each weight: Δw_i = −(∂²C_p/∂w_i²)⁻¹ ∂C_p/∂w_i.

Quasi-Newton methods. Some tweaks: ε is a learning rate, μ is insurance against zero curvature: Δw_i = −ε (∂C_p/∂w_i) / (|∂²C_p/∂w_i²| + μ). When curvature is low, take bigger jumps.
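
A sketch of this per-weight update on a toy separable cost (the cost, ε and μ are my own illustrative choices):

```python
import numpy as np

def pseudo_newton_step(w, grad, diag_hess, epsilon=0.1, mu=1e-2):
    """delta w_i = -epsilon * (dC/dw_i) / (|d^2C/dw_i^2| + mu), applied elementwise."""
    return w - epsilon * grad / (np.abs(diag_hess) + mu)

# Toy cost C(w) = sum_i 0.5 * c_i * w_i^2 with curvatures spanning four orders of magnitude.
c = np.array([100.0, 1.0, 0.01])
w = np.array([1.0, 1.0, 1.0])
for _ in range(20):
    grad = c * w                      # dC/dw_i
    w = pseudo_newton_step(w, grad, diag_hess=c)
print(w)
# All three coordinates shrink at comparable rates; plain gradient descent with the same
# epsilon would diverge along the high-curvature direction and crawl along the flat one.
```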

Quasi-Newton methods. Apply this to our recurrent net by unfolding the recurrent net into a deep static net, treating the unfolded weights w_{it} (weight i at timestep t) as different variables, and keeping them close to each other via a soft constraint: Δw_i = −Σ_t ε (∂C_p/∂w_{it}) / (∂²C_p/∂w_{it}² + μ).

Quasi-Newton methods. Compare with backprop/gradient descent: the time-weighted update is Δw_i = −Σ_t ε (∂C_p/∂w_{it}) / (∂²C_p/∂w_{it}² + μ), whereas plain gradient descent uses Δw_i = −ε Σ_t ∂C_p/∂w_{it}.
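
To see what the extra per-timestep curvature term buys, here is a toy comparison of the two updates for a single shared weight (the per-timestep gradients, curvatures, ε and μ are all illustrative numbers of my own):

```python
import numpy as np

epsilon, mu = 0.1, 1e-2
g = np.array([1e-6, 1e-4, 1e-2, 1.0])   # dC_p/dw_{it}: contributions from early timesteps have vanished
h = np.array([1e-6, 1e-4, 1e-2, 1.0])   # d^2C_p/dw_{it}^2: curvatures vanish along with them

delta_gd  = -epsilon * g.sum()                          # plain gradient descent on the shared weight
delta_tqn = -epsilon * (g / (np.abs(h) + mu)).sum()     # time-weighted pseudo-Newton

print(delta_gd, delta_tqn)
# Dividing each timestep's gradient by its own (small) curvature raises the relative
# weight of the early-time terms instead of letting the latest timestep dominate.
```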

Discrete error propagation. We're shoehorning a continuous solution (gradient descent) into a binary problem (latching). (Figure: a continuous cost C as a function of y next to a discrete step function y(x).) How can we backpropagate error signals across discrete functions?

Discrete error propagation. Let y = sign(x). A desired change Δy in the output determines a desired change Δx in the input: Δx is nonzero (pushing x toward the opposite sign) when Δy = ±2, and Δx = 0 when Δy = 0. Problem: the backpropagated gradient −∂C/∂y is continuous, so it is not necessarily one of {−2, 0, 2}.

Discrete error propagation. Given the continuous gradient g = −∂C/∂y, translate it into a discrete Δy using a stochastic function: Δy = +2 with probability (g − MIN)/(MAX − MIN), and Δy = −2 with probability (MAX − g)/(MAX − MIN). With MIN = −2 and MAX = 2, E[Δy] = g, so averages of Δy quickly converge to g.
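
A small simulation of this stochastic discretization (my own; it assumes MIN = −2 and MAX = 2, for which the expectation equals g exactly):

```python
import random

MIN, MAX = -2.0, 2.0

def discrete_delta_y(g, rng):
    g = max(MIN, min(MAX, g))                  # clip g into [MIN, MAX]
    p_plus = (g - MIN) / (MAX - MIN)           # probability of +2
    return 2.0 if rng.random() < p_plus else -2.0

rng = random.Random(0)
g = 0.37
samples = [discrete_delta_y(g, rng) for _ in range(100000)]
print(sum(samples) / len(samples))             # close to g = 0.37
```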

Results. Three problems: the latch problem; the 2-sequence problem (classify a noisy sequence whose first L values hover around a constant, i.e. the latch problem with non-constant first L values); and the parity problem (is the sum of a sequence even or odd?).

Results. Simulated annealing: good, best overall error; bad, requires roughly 10^1.8 (about 60) times more iterations than the other methods. Multi-grid stochastic search: good, comparable speed to gradient-based methods; bad, comparable susceptibility to local minima.

Results. Standard backprop: vanishing gradient problem, local minima. Time-weighted quasi-Newton: the same problems as backprop, just less severe. Discrete error propagation: usually the fastest, with an error rate comparable to the non-annealing methods; particularly good at the parity problem (robust to local minima?).

End Questions?

Appendix: derivation of the Newton update. Expand the error around w0: E(w) ≈ E(w0) + (dE/dw|_{w0})ᵀ (w − w0) + (1/2)(w − w0)ᵀ H (w − w0). Setting the derivative to zero gives dE/dw|_{w0} + H (w − w0) = 0, hence w = w0 − H⁻¹ dE/dw|_{w0}.
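
A quick numerical check of this result on a toy quadratic (my own example): a single Newton step from any starting point lands exactly on the minimum.

```python
import numpy as np

H = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # positive definite Hessian
w_star = np.array([1.0, -2.0])      # true minimum of E(w) = 0.5 (w - w*)^T H (w - w*)
w0 = np.array([5.0, 5.0])           # starting point

grad = H @ (w0 - w_star)            # dE/dw at w0
w1 = w0 - np.linalg.solve(H, grad)  # Newton step: w <- w - H^{-1} dE/dw
print(w1)                           # equals w_star up to floating-point error
```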