Long Short-Term Memory

Long Short-Term Memory. Sepp Hochreiter, Jürgen Schmidhuber. Presented by Derek Jones.

Table of Contents: 1. Introduction 2. Previous Work 3. Issues in Learning Long-Term Dependencies 4. Constant Error Flow 5. Long Short-Term Memory 6. Experiments

Motivation. The Problem: algorithms such as back-propagation through time (BPTT) and real-time recurrent learning (RTRL) suffer from vanishing or exploding gradients as error signals are propagated backwards through time. Vanishing gradients can make learning long time-lag dependencies infeasible, requiring prohibitively large amounts of computation time to train the network. Exploding gradients may cause oscillating weight updates, preventing the optimization procedure from making substantial progress.

Motivation. The Solution: a novel memory cell architecture, Long Short-Term Memory (LSTM). LSTM is designed to overcome the vanishing and exploding gradients that arise when the error signal is backpropagated through time and differentiated with respect to the network parameters. LSTM networks are able to bridge arbitrarily long time lags, even in the presence of noisy inputs.

Previous Work. Second-order nets: use multiplicative units to protect error flow from unwanted perturbations; they do not solve long time-lag problems and have an update cost of O(W^2), where W is the number of weights. Simple weight guessing: solves many simplistic benchmark problems, but becomes infeasible for realistic problems involving many parameters or high weight precision. Adaptive sequence chunkers: can bridge arbitrarily long time lags given local predictability across the subsequences that induce the lags, but performance deteriorates in the presence of input noise.

Issues in Learning Long-Term Dependencies. Basic problem: gradients backpropagated over many stages tend to either vanish or explode.

Issues in Learning Long-Term Dependencies. Consider a simple RNN with no inputs x and no nonlinear activation: h^(t) = W^T h^(t-1), so h^(t) = (W^t)^T h^(0). If W has the eigendecomposition W = Q Λ Q^T with Q orthogonal, this simplifies to h^(t) = Q^T Λ^t Q h^(0); more generally, W^t = (V diag(λ) V^(-1))^t = V diag(λ)^t V^(-1).

Issues in Learning Long-Term Dependencies. Any eigenvalue λ_i of Λ with |λ_i| > 1 causes the corresponding component to explode, and any eigenvalue with |λ_i| < 1 causes it to vanish. In the scalar case, the product w^t likewise either vanishes or explodes depending on the magnitude of w.
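To make this concrete, here is a minimal numerical sketch (the 2x2 weight matrix and the 50-step horizon are illustrative assumptions, not values from the slides):

```python
import numpy as np

# Hypothetical recurrent weight matrix with one eigenvalue below 1
# and one above 1 (illustrative values only).
W = np.array([[0.5, 0.0],
              [0.0, 1.5]])

h = np.array([1.0, 1.0])  # initial hidden state h^(0)

for t in range(1, 51):
    h = W.T @ h  # linear recurrence h^(t) = W^T h^(t-1)
    if t % 10 == 0:
        print(f"t={t:2d}  |h_1|={abs(h[0]):.2e}  |h_2|={abs(h[1]):.2e}")

# The component tied to eigenvalue 0.5 vanishes (~1e-15 after 50 steps),
# while the component tied to eigenvalue 1.5 explodes (~6e8).
```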

Issues in Learning Long-Term Dependencies. Partial solutions: carefully choosing the weights can help alleviate the vanishing and exploding gradient problems, as can restricting the network to specific regions of parameter space. However, both may prevent the network from learning memories that are robust to small perturbations.

Issues in Learning Long-Term Dependencies. Gradients assign exponentially smaller weight to long-term interactions than to short-term ones. This prevents the standard RNN from learning long-term dependencies, limiting the usefulness of such architectures for practical problems. How can the vanishing and exploding gradient problem be addressed?

Constant Error Flow. How to avoid vanishing or exploding error signals? Hint: require the following: f'_j(net_j(t)) * w_jj = 1.0. Integrating, this means f_j has to be linear and unit j's activation has to remain constant: y_j(t+1) = f_j(net_j(t+1)) = f_j(w_jj * y_j(t)) = y_j(t).
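As a toy illustration of this constant error carousel condition (the scalar values below are assumptions for the sketch, not from the paper):

```python
# Constant error carousel (CEC), illustrated with scalars.
# With a linear activation f(x) = x and self-connection weight w_jj = 1.0,
# the local factor f'(net_j(t)) * w_jj equals 1, so the error signal
# neither vanishes nor explodes no matter how many steps it is carried.

def f(x):          # linear activation
    return x

def f_prime(x):    # its derivative
    return 1.0

w_jj = 1.0          # self-connection weight required by the CEC
y = 0.7             # some stored activation y_j(0)

error_scale = 1.0   # product of f'(net) * w_jj factors over time
for t in range(1000):
    net = w_jj * y
    y = f(net)                           # activation stays constant
    error_scale *= f_prime(net) * w_jj   # stays exactly 1.0

print(y, error_scale)   # 0.7 1.0 -- try w_jj = 0.9 or 1.1 instead, which
                        # drive error_scale toward 0 or toward a huge value
```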

Constant Error Flow: Constant Error Carousel. There are several issues with simply requiring this condition. Input weight conflict: the same incoming weight has to be used both for storing certain inputs and for ignoring others, so it often receives conflicting update signals. This lack of context makes learning difficult; a more context-sensitive mechanism is needed for controlling write operations through the input weights.

Constant Error Flow. Output weight conflict: assume some unit j is switched on and stores a previous input; its outgoing weight w_kj will attract conflicting weight-update signals. While stable early in training, j may suddenly start to cause avoidable errors by attempting to participate in reducing more difficult, long time-lag errors. A more context-sensitive mechanism is needed for controlling read operations through the output weights.

Constant Error Flow. Input and output weight conflicts can occur for short as well as long time lags. As the time lag increases, stored information must be protected from unwanted perturbation for longer and longer periods, and more and more already-correct outputs require protection against unwanted perturbation.

Long Short-Term Memory. Figure 1: A simple repeating RNN cell.

Long Short-Term Memory. Figure 2: An example repeating LSTM cell.

Long Short-Term Memory: How does it work? Instead of a single repeating layer, the LSTM cell has four. Gates allow the LSTM to add or remove information from the cell state. Each gate is a single layer with sigmoid activation, so its output lies between 0 (let nothing through) and 1 (let everything through).

Long Short-Term Memory: The Forget Layer. Figure 3: The forget layer.

Long Short-Term Memory: The Forget Layer. The current input x_t and the previous hidden state h_(t-1) are concatenated and passed through a sigmoid activation; values near 0 mean forget, values near 1 mean keep everything from the corresponding component of the stored information. The result is then multiplied component-wise with the previous cell state c_(t-1).
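A minimal sketch of this forget-gate step, assuming illustrative layer sizes and randomly initialized parameters (names such as W_f and b_f follow common modern notation, not the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_input, n_hidden = 3, 4                      # illustrative sizes

# Forget-gate parameters (randomly initialized for the sketch).
W_f = rng.normal(scale=0.1, size=(n_hidden, n_input + n_hidden))
b_f = np.zeros(n_hidden)

x_t    = rng.normal(size=n_input)             # current input x_t
h_prev = rng.normal(size=n_hidden)            # previous hidden state h_{t-1}
c_prev = rng.normal(size=n_hidden)            # previous cell state c_{t-1}

z_t = np.concatenate([x_t, h_prev])           # concatenate [x_t, h_{t-1}]
f_t = sigmoid(W_f @ z_t + b_f)                # forget gate, each entry in (0, 1)

c_partial = f_t * c_prev                      # component-wise scaling of c_{t-1}
print(f_t, c_partial)
```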

Long Short-Term Memory: Storing New Information

Long Short-Term Memory: Updating the Old Cell State

Long Short-Term Memory: Computing the Output
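The three slides above walk through storing new information, updating the cell state, and computing the output without reproducing the equations, so here is a hedged sketch of one full step of the standard modern LSTM cell (with forget gate); the parameter names and shapes are illustrative assumptions, continuing the forget-gate sketch above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a standard LSTM cell (modern formulation with forget gate)."""
    W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o = params
    z = np.concatenate([x_t, h_prev])        # [x_t, h_{t-1}]

    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate (storing new information)
    g_t = np.tanh(W_c @ z + b_c)             # candidate cell values
    c_t = f_t * c_prev + i_t * g_t           # updated cell state
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(c_t)                 # new hidden state / output
    return h_t, c_t

# Illustrative shapes and random parameters.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
params = []
for _ in range(4):
    params += [rng.normal(scale=0.1, size=(n_hid, n_in + n_hid)), np.zeros(n_hid)]

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):         # run over a short input sequence
    h, c = lstm_step(x, h, c, params)
print(h, c)
```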

Experiment 1: Embedded Reber Grammar. Task: read strings one symbol at a time and predict the next possible symbol at each time step t. Training/testing: 512 strings in total, split 50-50 into training and test partitions. Goals: provides a common benchmark on which RTRL and BPTT do not fail completely, and demonstrates the usefulness of the gate units.

Experiment 1: Embedded Reber Grammar. Figure 4: Experiment 1 results.

Experiment 6a: Temporal Order Problem. Task: classify sequences into one of 4 categories determined by the temporal order of two class-relevant symbols (each X or Y). Each string begins with an E symbol and ends with a B. All other intermediate symbols are randomly chosen from {a, b, c, d} and carry no class information; two positions are randomly chosen to hold the class-relevant symbols.

Experiment 6b: Temporal Order Problem. Task: classify sequences into one of 8 categories determined by the temporal order of three class-relevant symbols (each X or Y). Each string begins with an E symbol and ends with a B. All other intermediate symbols are randomly chosen from {a, b, c, d} and carry no class information; three positions are randomly chosen to hold the class-relevant symbols.
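A rough sketch of how such temporal-order sequences could be generated (the sequence length, the helper name make_sequence, and the label encoding are illustrative assumptions, not the paper's exact setup):

```python
import random

DISTRACTORS = list("abcd")   # symbols carrying no class information
RELEVANT    = list("XY")     # symbols whose temporal order defines the class

def make_sequence(n_relevant=2, length=20, rng=random):
    """Generate one temporal-order string and its class label.

    n_relevant=2 gives 4 classes (Experiment 6a), n_relevant=3 gives 8 (6b).
    """
    body = [rng.choice(DISTRACTORS) for _ in range(length)]
    positions = sorted(rng.sample(range(length), n_relevant))
    chosen = [rng.choice(RELEVANT) for _ in range(n_relevant)]
    for pos, sym in zip(positions, chosen):
        body[pos] = sym
    # The label encodes the ordered combination of relevant symbols,
    # e.g. (X, Y) and (Y, X) are different classes.
    label = sum(RELEVANT.index(s) << k for k, s in enumerate(chosen))
    return "E" + "".join(body) + "B", label

random.seed(0)
for _ in range(3):
    print(make_sequence(n_relevant=2))
```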

Experiment 6 (a & b): Temporal Order Problem. Figure 5: Experiment 6 results.

Questions?