Long Short-Term Memory
Sepp Hochreiter, Jürgen Schmidhuber
Presented by Derek Jones
Table of Contents
1. Introduction
2. Previous Work
3. Issues in Learning Long-Term Dependencies
4. Constant Error Flow
5. Long Short-Term Memory
6. Experiments
Motivation
The Problem: algorithms such as back-propagation through time (BPTT) and real-time recurrent learning (RTRL) suffer from vanishing or exploding gradients as error signals are propagated backwards through time.
- Vanishing gradients can make learning long-term dependencies infeasible, requiring prohibitively large amounts of computation time to train the network.
- Exploding gradients can cause oscillating weight updates, thereby preventing the optimization procedure from making substantial progress.
Motivation
The Solution: a novel memory cell architecture, Long Short-Term Memory (LSTM).
- LSTM is designed to overcome the vanishing and/or exploding gradients that arise when the error signal is differentiated w.r.t. the network parameters and backpropagated through time.
- LSTM networks are able to bridge arbitrarily long time lags, even in the presence of noisy inputs.
Previous Work
- Second-order nets: use multiplicative units to protect error flow from unwanted perturbations. They do not solve long time lag problems and have an update cost of $O(W^2)$.
- Simple weight guessing: solves many simplistic problems, but becomes infeasible for realistic problems that involve large numbers of parameters or high weight precision.
- Adaptive sequence chunkers: have the capacity to bridge arbitrarily long time lags, provided there is local predictability across the subsequences causing the time lags. Performance deteriorates in the presence of input noise.
Issues in Learning Long-Term Dependencies
Basic problem: gradients backpropagated over many stages tend to either vanish or explode.
Issues in Learning Long-Term Dependencies
Consider a simple RNN with no inputs and no nonlinear activation function:
$$h^{(t)} = W^\top h^{(t-1)}, \qquad \text{so} \qquad h^{(t)} = (W^t)^\top h^{(0)}$$
If $W$ admits the eigendecomposition $W = V \,\mathrm{diag}(\lambda)\, V^{-1}$, then
$$W^t = (V \,\mathrm{diag}(\lambda)\, V^{-1})^t = V \,\mathrm{diag}(\lambda)^t\, V^{-1}$$
Issues in Learning Long-Term Dependencies
- Any eigenvalue $\lambda_i$ with $|\lambda_i| > 1$ will explode when raised to the power $t$.
- Any eigenvalue $\lambda_i$ with $|\lambda_i| < 1$ will vanish.
Thus the product $W^t$ will either vanish or explode, depending on the magnitudes of the eigenvalues of $W$.
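As a quick numerical illustration (a sketch, not from the paper), the following NumPy snippet scales a random weight matrix to a chosen spectral radius and iterates the linear recurrence above; the hidden-state norm decays or blows up exactly as the eigenvalue analysis predicts.

```python
# A quick numerical sketch: scale a random W to a chosen spectral radius
# and iterate the linear recurrence h^(t) = W^T h^(t-1).
import numpy as np

rng = np.random.default_rng(0)

def final_norm(spectral_radius, steps=50, n=8):
    """Return ||h^(t)|| after `steps` iterations of h <- W^T h."""
    W = rng.normal(size=(n, n))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    h = np.ones(n)
    for _ in range(steps):
        h = W.T @ h
    return np.linalg.norm(h)

print(final_norm(0.9))  # decays towards 0: contributions from h^(0) vanish
print(final_norm(1.1))  # grows exponentially: the signal explodes
```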
Issues in Learning Long-Term Dependencies
- Possible solution: carefully choose the weights to alleviate the vanishing and exploding gradient problems.
- Possible solution: restrict the network to specific regions of parameter space.
However, both may prevent the network from learning memories that are robust to small perturbations.
Issues in Learning Long-Term Dependencies
- Long-term interactions receive exponentially smaller gradients than short-term interactions.
- This prevents the standard RNN from learning long-term dependencies, limiting the usefulness of such architectures for practical problems.
How can we address this problem of vanishing and exploding gradients?
Constant Error Flow
How do we avoid vanishing or exploding error signals? Require constant error flow through a self-connected unit $j$:
$$f_j'(\mathrm{net}_j(t))\, w_{jj} = 1.0$$
This means $f_j$ has to be linear, and unit $j$'s activation has to remain constant:
$$y^j(t+1) = f_j(\mathrm{net}_j(t+1)) = f_j(w_{jj}\, y^j(t)) = y^j(t)$$
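A minimal sketch of why this condition matters: error flowing back through the self-loop is multiplied by $f_j'(\mathrm{net}_j(t))\,w_{jj}$ at every step, so any product other than 1.0 compounds exponentially (here $f'$ is treated as a constant purely for illustration).

```python
# Constant error carousel condition: the backpropagated error is scaled
# by f'(net) * w once per time step it travels through the self-loop.
def backprop_through_self_loop(w, f_prime, steps=100, delta=1.0):
    """Locally backpropagated error after `steps` passes through the loop,
    treating f'(net) as constant for illustration."""
    for _ in range(steps):
        delta *= f_prime * w
    return delta

print(backprop_through_self_loop(w=1.0, f_prime=1.0))   # 1.0: constant error flow
print(backprop_through_self_loop(w=0.9, f_prime=1.0))   # ~2.7e-05: vanishes
print(backprop_through_self_loop(w=1.0, f_prime=0.25))  # ~6.2e-61: squashing f kills it
```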
Constant Error Flow: The Constant Error Carousel
There are several issues with simply enforcing this condition.
Input weight conflict: the same incoming weight has to be used both for storing certain inputs and for ignoring others, so it often receives conflicting update signals. This lack of context makes learning difficult; we need a more context-sensitive mechanism for controlling write operations through input weights.
Constant Error Flow
Output weight conflict: assume some output unit $j$ is switched on and stores a previous input. The outgoing weight $w_{kj}$ will attract conflicting weight update signals: while $j$ is stable early in training, it may suddenly start to cause avoidable errors by attempting to participate in reducing more difficult, long time lag errors. We need a more context-sensitive mechanism for controlling read operations through output weights.
Constant Error Flow
- Input and output weight conflicts occur for short as well as long time lags.
- As the time lag increases, stored information must be protected from unwanted perturbation for longer and longer periods.
- More and more already-correct outputs also require protection against unwanted perturbation.
Long Short-Term Memory
Figure 1: A simple repeating RNN cell
Long Short-Term Memory
Figure 2: An example repeating LSTM cell
Long Short-Term Memory: How Does It Work?
- Instead of having a single layer, the LSTM cell has four.
- Input and output gates allow the LSTM to add or remove information from the cell state.
- The gates themselves are each a single layer with a sigmoid activation, so the resulting output lies between 0 (let nothing through) and 1 (let everything through).
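A minimal sketch of one such gate, following these slides' presentation rather than the paper's exact notation (the weight names are illustrative):

```python
# One gate layer: an affine map over [h_{t-1}; x_t] followed by a sigmoid.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(W, b, h_prev, x):
    """Each output component is in (0, 1): 0 lets nothing through,
    1 lets everything through."""
    return sigmoid(W @ np.concatenate([h_prev, x]) + b)
```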
Long Short-Term Memory: The Forget Layer
Figure 3: The Forget Layer
Long Short-Term Memory: The Forget Layer
- The current input $x_t$ and the previous hidden state $h_{t-1}$ are concatenated and pushed through a sigmoid activation: values near 0 mean forget everything, values near 1 mean remember everything from the input set of information.
- The result is then multiplied component-wise with the previous cell state $c_{t-1}$.
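A sketch of this step; the weight names W_f and b_f are illustrative, not the paper's notation:

```python
# Forget step: a sigmoid gate over [h_{t-1}; x_t] scales the old cell state.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_step(W_f, b_f, h_prev, x, c_prev):
    """f_t = sigmoid(W_f [h_{t-1}; x_t] + b_f), then scale the old cell state."""
    f_t = sigmoid(W_f @ np.concatenate([h_prev, x]) + b_f)
    return f_t * c_prev  # component-wise: near 0 forgets, near 1 remembers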
Long Short-Term Memory: Storing New Information
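The figure for this step is not reproduced here. In the standard presentation, storing new information combines an input gate $i_t$, which decides which entries to write, with a tanh layer proposing candidate values $\tilde{c}_t$. A hedged sketch, with illustrative weight names W_i, b_i, W_c, b_c:

```python
# Input-gate step: decide what to write and propose candidate values.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_step(W_i, b_i, W_c, b_c, h_prev, x):
    """i_t = sigmoid(W_i [h_{t-1}; x_t] + b_i) gates which entries to write;
    c_tilde = tanh(W_c [h_{t-1}; x_t] + b_c) proposes the candidate values."""
    z = np.concatenate([h_prev, x])
    return sigmoid(W_i @ z + b_i), np.tanh(W_c @ z + b_c)
```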
Long Short-Term Memory: Updating the Old Cell State
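Again, the figure is not reproduced; in the standard presentation the new cell state simply combines the forget and input steps above:

```python
def cell_update(f_t, i_t, c_tilde, c_prev):
    """c_t = f_t * c_{t-1} + i_t * c_tilde (all component-wise): keep what
    the forget gate preserves, add what the input gate admits."""
    return f_t * c_prev + i_t * c_tilde
```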
Long Short-Term Memory: Computing the Output
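As with the previous two steps, the figure is not reproduced; in the standard presentation the output is a gated read-out of the cell state. A sketch with illustrative weight names W_o, b_o:

```python
# Output step: an output gate decides what part of the cell state to expose.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_step(W_o, b_o, h_prev, x, c_t):
    """o_t = sigmoid(W_o [h_{t-1}; x_t] + b_o); the new hidden state is
    h_t = o_t * tanh(c_t), a gated read-out of the cell state."""
    o_t = sigmoid(W_o @ np.concatenate([h_prev, x]) + b_o)
    return o_t * np.tanh(c_t)
```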
Experiment 1: Embedded Reber Grammar
- Task: read strings one symbol at a time and predict the next symbol at each time step $t$.
- Training/testing: 512 total strings, divided into training and testing partitions using a 50-50 split.
- Goals: provides a common benchmark on which RTRL and BPTT do not fail completely, and shows the usefulness of the gates.
Experiment 1: Embedded Reber Grammar
Figure 4: Experiment 1 results
Experiment 6a: Temporal Order Problem
- Task: classify sequences into 1 of 4 categories that depend on the temporal order of two class-relevant symbols (X or Y).
- Each string begins with an E symbol and ends with a B. All but two of the intermediate symbols are drawn randomly from the set {a, b, c, d} and carry no class information.
- Two positions are randomly chosen to hold the symbols that carry the class information ($2^2 = 4$ classes).
Experiment 6b: Temporal Order Problem
- Task: classify sequences into 1 of 8 categories that depend on the temporal order of three class-relevant symbols (X or Y).
- Each string begins with an E symbol and ends with a B. All but three of the intermediate symbols are drawn randomly from the set {a, b, c, d} and carry no class information.
- Three positions are randomly chosen to hold the symbols that carry the class information ($2^3 = 8$ classes).
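To make the task concrete, here is a hedged sketch of a generator for task 6a; the sequence length and exact positioning are assumptions of the sketch, not the paper's parameters:

```python
# Illustrative generator for task 6a (length and layout are assumptions).
import numpy as np

rng = np.random.default_rng(0)

def make_sequence_6a(length=30):
    """Fillers from {a,b,c,d}; X/Y at two random positions determine
    the class: XX, XY, YX, YY -> 4 classes."""
    seq = list(rng.choice(list("abcd"), size=length))
    p1, p2 = sorted(rng.choice(np.arange(length), size=2, replace=False))
    relevant = rng.choice(list("XY"), size=2)
    seq[p1], seq[p2] = relevant
    label = {"XX": 0, "XY": 1, "YX": 2, "YY": 3}["".join(relevant)]
    return "E" + "".join(seq) + "B", label

print(make_sequence_6a())
```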
Experiment 6 (a & b): Temporal Order Problem
Figure 5: Experiment 6 results
Questions?