Long Short-Term Memory (LSTM)


Long Short-Term Memory (LSTM): A brief introduction
Daniel Renshaw, 24th November 2014

Context and notation

Just to give the LSTM something to do: neural network language modelling.

Vocabulary of size V
$x_t \in \mathbb{R}^V$: true word in position t (one-hot)
$y_t \in \mathbb{R}^V$: predicted word in position t (a distribution over the vocabulary)
Assume all sentences are zero-padded to length L.

Context and notation

Model: $y_{t+1} = p(x_{t+1} \mid x_t, x_{t-1}, \ldots, x_1)$ for $1 \le t < L$.

Minimize the cross-entropy objective:

$$J = \sum_{t=1}^{L-1} H(y_{t+1}, x_{t+1}) = -\sum_{t=1}^{L-1} \sum_{i=1}^{V} x_{t+1,i} \log(y_{t+1,i})$$

$\sigma(\cdot)$ is some sigmoid-like function (e.g. logistic or tanh); $b$ is a bias vector and $W$ a weight matrix.
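To make the objective concrete, here is a minimal NumPy sketch (mine, not from the slides) that scores a predicted distribution against a one-hot target; the names y_pred and x_true are illustrative assumptions.

    import numpy as np

    def cross_entropy(y_pred, x_true):
        # H(y, x) = -sum_i x_i * log(y_i); with a one-hot target only one term survives
        return -np.sum(x_true * np.log(y_pred + 1e-12))

    V = 5
    x_true = np.zeros(V)
    x_true[2] = 1.0                                 # the true word is vocabulary item 2
    y_pred = np.array([0.1, 0.1, 0.6, 0.1, 0.1])    # the model's predicted distribution
    print(cross_entropy(y_pred, x_true))            # approximately -log(0.6)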

Multi-Layer Perceptron (MLP)

[Diagram: output y, hidden layer h_t, embeddings e_{t-3}, e_{t-2}, e_{t-1}, inputs x_{t-3}, x_{t-2}, x_{t-1}]

$y_{t+1} = \mathrm{softmax}(W_{yh} h_t)$
$h_t = \sigma(W_{he} [e_{t-1}; e_{t-2}; e_{t-3}] + b_h)$
$e_t = W_{ex} x_t$
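As an illustration only, a NumPy sketch of one step of this feed-forward language model; the function name mlp_lm_step and the use of tanh for the generic σ are my assumptions.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def mlp_lm_step(x_prev_words, W_ex, W_he, W_yh, b_h):
        # x_prev_words: the previous three one-hot word vectors [x_{t-1}, x_{t-2}, x_{t-3}]
        e = [W_ex @ x for x in x_prev_words]              # embed each context word
        h_t = np.tanh(W_he @ np.concatenate(e) + b_h)     # hidden layer over the concatenated window
        return softmax(W_yh @ h_t)                        # distribution over the next word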

Recurrent Neural Network (RNN)

[Diagram: unrolled RNN with outputs y_t, y_{t+1}, y_{t+2}; hidden states h_{t-1}, h_t, h_{t+1}; embeddings e_{t-1}, e_t, e_{t+1}; inputs x_{t-1}, x_t, x_{t+1}]

$y_{t+1} = \mathrm{softmax}(W_{yh} h_t)$
$h_t = \sigma(W_{he} e_t + W_{hh} h_{t-1} + b_h)$
$e_t = W_{ex} x_t$
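The corresponding recurrent step, again as an illustrative NumPy sketch (tanh stands in for σ; all names are mine):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def rnn_step(x_t, h_prev, W_ex, W_he, W_hh, W_yh, b_h):
        e_t = W_ex @ x_t                                 # embed the current word
        h_t = np.tanh(W_he @ e_t + W_hh @ h_prev + b_h)  # mix the new input with the previous state
        y_next = softmax(W_yh @ h_t)                     # prediction for the next word
        return h_t, y_next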

Vanishing gradients

Error gradients pass through a nonlinearity at every step. Unless the weights are large, the error signal will degrade:

$\delta_h = \sigma'(\cdot)\, W_{(h+1)h}\, \delta_{h+1}$

[Image from https://theclevermachine.wordpress.com]
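A small numerical illustration of this recursion (mine, not from the slides): the same weight matrix and a logistic-style derivative factor of 0.25 are applied repeatedly to an error signal, and its norm collapses or explodes depending on the weight scale.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    for scale in (0.5, 8.0):                      # stand-ins for "small" and "large" weights
        W = rng.normal(0.0, scale / np.sqrt(n), size=(n, n))
        delta = rng.normal(size=n)                # error signal at the final step
        for _ in range(50):                       # 50 backward steps through the recursion
            delta = 0.25 * (W.T @ delta)          # 0.25 is the maximum derivative of the logistic function
        print(scale, np.linalg.norm(delta))       # tiny for the small scale, enormous for the large one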

Vanishing gradients

Gradients may vanish or explode. This can affect any 'deep' network, e.g. fine-tuning a non-recurrent deep neural network.

[Image from Alex Graves' textbook]

Constant Error Carousel

Allow the network to propagate errors without modification: there is no nonlinearity in the recursion.

[Diagram: unrolled network with outputs y, read-outs m, internal memory states h, embeddings e, inputs x; the h_{t-1} to h_t link is an identity connection, all other links are dense matrix multiplications]

$y_{t+1} = \mathrm{softmax}(W_{ym} m_t)$
$m_t = \sigma(h_t)$
$h_t = h_{t-1} + \sigma(W_{he} e_t + W_{hm} m_{t-1} + b_h)$
$e_t = W_{ex} x_t$
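An illustrative NumPy sketch of the carousel step (my naming; tanh for σ). The point is that h_t is updated additively, so the backward path through h carries the error signal without attenuation:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def cec_step(x_t, h_prev, m_prev, W_ex, W_he, W_hm, W_ym, b_h):
        e_t = W_ex @ x_t
        h_t = h_prev + np.tanh(W_he @ e_t + W_hm @ m_prev + b_h)  # additive update: no nonlinearity on h_{t-1}
        m_t = np.tanh(h_t)                                        # squashed read-out of the memory
        y_next = softmax(W_ym @ m_t)
        return h_t, m_t, y_next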

LSTM v1: input and output gates

Attenuate the input and output signals.

[Diagram: as before, with a logistic input gate i_t (bias b_i) and a logistic output gate o_t (bias b_o)]

$y_{t+1} = \mathrm{softmax}(W_{ym} m_t)$
$m_t = o_t \odot \sigma(h_t)$
$o_t = \mathrm{logistic}(W_{oe} e_t + W_{om} m_{t-1} + b_o)$
$h_t = h_{t-1} + i_t \odot \sigma(W_{he} e_t + W_{hm} m_{t-1} + b_h)$
$i_t = \mathrm{logistic}(W_{ie} e_t + W_{im} m_{t-1} + b_i)$
$e_t = W_{ex} x_t$
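A NumPy sketch of the gated step (mine; the parameters are passed in a dict p whose keys mirror the weight names above, which is an assumption of this illustration):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_v1_step(x_t, h_prev, m_prev, p):
        e_t = p['W_ex'] @ x_t
        i_t = logistic(p['W_ie'] @ e_t + p['W_im'] @ m_prev + p['b_i'])                # input gate
        o_t = logistic(p['W_oe'] @ e_t + p['W_om'] @ m_prev + p['b_o'])                # output gate
        h_t = h_prev + i_t * np.tanh(p['W_he'] @ e_t + p['W_hm'] @ m_prev + p['b_h'])  # gated additive write
        m_t = o_t * np.tanh(h_t)                                                       # gated read-out
        y_next = softmax(p['W_ym'] @ m_t)
        return h_t, m_t, y_next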

LSTM v2: forget (remember) gate

The model controls when the memory, h_t, is reduced. The forget gate would arguably be better named a remember gate, since f_t near 1 keeps the memory.

[Diagram: as before, with an additional logistic forget gate f_t (bias b_f)]

$y_{t+1} = \mathrm{softmax}(W_{ym} m_t)$
$m_t = o_t \odot \sigma(h_t)$
$o_t = \mathrm{logistic}(W_{oe} e_t + W_{om} m_{t-1} + b_o)$
$h_t = f_t \odot h_{t-1} + i_t \odot \sigma(W_{he} e_t + W_{hm} m_{t-1} + b_h)$
$i_t = \mathrm{logistic}(W_{ie} e_t + W_{im} m_{t-1} + b_i)$
$f_t = \mathrm{logistic}(W_{fe} e_t + W_{fm} m_{t-1} + b_f)$
$e_t = W_{ex} x_t$
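A tiny numerical aside (mine): with no new input, the forget gate multiplies the stored memory at every step, so an activation of exactly 1 preserves it indefinitely while anything smaller erases it geometrically.

    import numpy as np

    h = np.ones(4)                     # some value currently stored in the memory cell
    for f in (1.0, 0.9, 0.5):          # candidate forget-gate activations
        print(f, (f ** 50) * h[0])     # what is left of the memory after 50 steps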

LSTM v3: peepholes

Allow the gates to additionally see the internal memory state. The peephole weight matrices are diagonal; all other matrices are dense.

[Diagram: as before, with peephole connections from h into the three gates (diagonal matrix multiplications)]

$y_{t+1} = \mathrm{softmax}(W_{ym} m_t)$
$m_t = o_t \odot \sigma(h_t)$
$o_t = \mathrm{logistic}(W_{oe} e_t + W_{om} m_{t-1} + W_{oh} h_t + b_o)$
$h_t = f_t \odot h_{t-1} + i_t \odot \sigma(W_{he} e_t + W_{hm} m_{t-1} + b_h)$
$i_t = \mathrm{logistic}(W_{ie} e_t + W_{im} m_{t-1} + W_{ih} h_{t-1} + b_i)$
$f_t = \mathrm{logistic}(W_{fe} e_t + W_{fm} m_{t-1} + W_{fh} h_{t-1} + b_f)$
$e_t = W_{ex} x_t$
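A sketch of the gate computations with peepholes (my naming): the peephole weights are stored as vectors and applied elementwise, which is what multiplying by a diagonal matrix amounts to. Note that, as in the equations above, the output gate peeks at the current h_t while the input and forget gates peek at h_{t-1}.

    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gates_with_peepholes(e_t, m_prev, h_prev, h_t, p):
        # w_ih, w_fh, w_oh are vectors (diagonal peephole matrices); the W_* matrices are dense
        i_t = logistic(p['W_ie'] @ e_t + p['W_im'] @ m_prev + p['w_ih'] * h_prev + p['b_i'])
        f_t = logistic(p['W_fe'] @ e_t + p['W_fm'] @ m_prev + p['w_fh'] * h_prev + p['b_f'])
        o_t = logistic(p['W_oe'] @ e_t + p['W_om'] @ m_prev + p['w_oh'] * h_t + p['b_o'])
        return i_t, f_t, o_t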

LSTM v4: output projection layer

Reduces the dimensionality of the recursive messages. This can speed up training without affecting result quality.

[Diagram: as before, with a projection W_{mm} applied to the gated output]

$y_{t+1} = \mathrm{softmax}(W_{ym} m_t)$
$m_t = W_{mm} (o_t \odot \sigma(h_t))$
$o_t = \mathrm{logistic}(W_{oe} e_t + W_{om} m_{t-1} + W_{oh} h_t + b_o)$
$h_t = f_t \odot h_{t-1} + i_t \odot \sigma(W_{he} e_t + W_{hm} m_{t-1} + b_h)$
$i_t = \mathrm{logistic}(W_{ie} e_t + W_{im} m_{t-1} + W_{ih} h_{t-1} + b_i)$
$f_t = \mathrm{logistic}(W_{fe} e_t + W_{fm} m_{t-1} + W_{fh} h_{t-1} + b_f)$
$e_t = W_{ex} x_t$
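Putting the four refinements together, a minimal NumPy sketch of the full step (my naming and parameter layout; tanh for σ; peepholes kept as elementwise vectors). A sketch under those assumptions, not a reference implementation:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_v4_step(x_t, h_prev, m_prev, p):
        e_t = p['W_ex'] @ x_t
        i_t = logistic(p['W_ie'] @ e_t + p['W_im'] @ m_prev + p['w_ih'] * h_prev + p['b_i'])
        f_t = logistic(p['W_fe'] @ e_t + p['W_fm'] @ m_prev + p['w_fh'] * h_prev + p['b_f'])
        h_t = f_t * h_prev + i_t * np.tanh(p['W_he'] @ e_t + p['W_hm'] @ m_prev + p['b_h'])
        o_t = logistic(p['W_oe'] @ e_t + p['W_om'] @ m_prev + p['w_oh'] * h_t + p['b_o'])
        m_t = p['W_mm'] @ (o_t * np.tanh(h_t))       # output projection shrinks the recursive message
        y_next = softmax(p['W_ym'] @ m_t)
        return h_t, m_t, y_next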

Gradients no longer vanish

[Image from Alex Graves' textbook]

LSTM implementations

RNNLIB (Alex Graves): http://sourceforge.net/p/rnnl/
PyLearn2 (experimental code, in sandbox/rnn/models/rnn.py)
Theano, e.g.:

    # dot, sigmoid, tanh, softmax come from theano.tensor; remaining weight arguments elided
    def lstm_step(x_t, m_tm1, h_tm1, w_xe, ..., b_o):
        e_t = dot(x_t, w_xe)
        i_t = sigmoid(dot(e_t, w_ei) + dot(m_tm1, w_mi) + h_tm1 * w_ci + b_i)    # input gate (peephole on h_{t-1})
        f_t = sigmoid(dot(e_t, w_ef) + dot(m_tm1, w_mf) + h_tm1 * w_cf + b_f)    # forget gate (peephole on h_{t-1})
        h_t = f_t * h_tm1 + i_t * tanh(dot(e_t, w_eh) + dot(m_tm1, w_mh) + b_h)  # memory cell update
        o_t = sigmoid(dot(e_t, w_eo) + dot(m_tm1, w_mo) + h_t * w_co + b_o)      # output gate (peephole on h_t)
        m_t = dot(o_t * tanh(h_t), w_mm)                                         # projected output
        y_t = softmax(dot(m_t, w_my))
        return m_t, h_t, y_t

Further thoughts

Sequences vs. hierarchies vs. plain 'deep' networks
Other solutions to vanishing gradients:
  Clockwork RNN
  Different training algorithms (e.g. Hessian-Free optimization)
  Rectified linear units (ReLU)? $\sigma(x) = \max(0, x)$; constant gradient when the unit is active (see the small sketch below)
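For the last bullet, a tiny sketch (mine) of the rectifier and its piecewise-constant gradient:

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def relu_grad(x):
        return (x > 0).astype(float)   # gradient is exactly 1 wherever the unit is active

    x = np.array([-2.0, -0.5, 0.5, 3.0])
    print(relu(x), relu_grad(x))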