
Long Short-Term Memory (LSTM). M1 Yuichiro Sawai, Computational Linguistics Lab. January 15, 2015 @ Deep Lunch

Why LSTM? Often used in many recent RNN-based systems: machine translation, program execution. Can capture long-term dependency.

Sequence to Sequence Learning with Neural Networks [Sutskever+14]. Machine translation (English to French). Achieved state-of-the-art results while making minimal assumptions about the data. Example: "There is a cat under the chair." → encoder LSTM-RNN → decoder LSTM-RNN → "Il y a un chat sous la chaise."

Learning to Execute [Zaremba+14]. Trained an RNN to execute code written in a Python-like language. Example: the program "j=8584; for x in range(8): j+=920; b=(1500+j); print((b+7567))" is fed to the LSTM-RNN, which outputs 25011.
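
The slide's snippet is valid Python, so the target value can be checked directly; a minimal sketch:

    # Runnable version of the slide's example program.
    j = 8584
    for x in range(8):
        j += 920
    b = (1500 + j)
    print((b + 7567))   # prints 25011, the output the LSTM-RNN must produce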

Review: Feed-Forward Neural Network. Input layer: x (IN). Hidden layer: h = σ(W_x x). Output layer: y = exp(W_o h)/Z (OUT, softmax).
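
A minimal numpy sketch of this forward pass (layer sizes and the use of a logistic σ are assumptions for illustration):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(0)
    x   = rng.normal(size=3)          # input vector (IN)
    W_x = rng.normal(size=(4, 3))     # input-to-hidden weights
    W_o = rng.normal(size=(5, 4))     # hidden-to-output weights

    h = sigmoid(W_x @ x)              # hidden layer: h = sigma(W_x x)
    scores = np.exp(W_o @ h)
    y = scores / scores.sum()         # output layer: y = exp(W_o h) / Z  (softmax)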

Review: Recurrent Neural Network. Input layer: x (IN). Hidden layer with a self-loop: h = σ(W_x x + W_h h). Output layer: y = exp(W_o h)/Z (OUT).
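
The same sketch with the recurrent self-loop added: the previous hidden state is fed back through W_h at every step (sizes again illustrative):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(0)
    W_x = rng.normal(size=(4, 3))     # input-to-hidden weights
    W_h = rng.normal(size=(4, 4))     # hidden-to-hidden weights (self-loop)
    W_o = rng.normal(size=(5, 4))     # hidden-to-output weights

    h  = np.zeros(4)                  # initial hidden state
    xs = rng.normal(size=(6, 3))      # a sequence of 6 input vectors
    for x in xs:
        h = sigmoid(W_x @ x + W_h @ h)    # h = sigma(W_x x + W_h h)
        scores = np.exp(W_o @ h)
        y = scores / scores.sum()         # y = exp(W_o h) / Z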

Review: Training of an RNN. Find good values for W_x, W_h, W_o. Each arrow in the network (input-to-hidden, hidden-to-hidden, hidden-to-output) has a corresponding weight matrix that has to be optimized.

Review: Gradient Descent. Iteratively update the weights by moving along the gradient: W_new = W_old − α ∂E/∂W, where E is the error function (cross-entropy for multi-class classification) and α is the learning rate. How is ∂E/∂W calculated? Back-propagation!
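
A minimal sketch of this update with toy numbers (the gradient here stands in for whatever back-propagation returns):

    import numpy as np

    alpha  = 0.1                                  # learning rate
    W_old  = np.array([[0.5, -0.2], [0.1, 0.3]])  # current weights (toy values)
    dE_dW  = np.array([[0.4,  0.1], [-0.2, 0.0]]) # gradient of the error E w.r.t. W
    W_new  = W_old - alpha * dE_dW                # step against the gradient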

Review: Gradient Calculation. Gradients are calculated from errors (δ) propagated backward. Forward: h = σ(W_x x), y = exp(W_o h)/Z. Backward propagation of error: δ_o = y − y_correct and ∂E/∂W_o = δ_o h^T at the output; δ_h = σ'(W_x x) ⊙ (W_o^T δ_o) and ∂E/∂W_x = δ_h x^T at the hidden layer. Note that the error gets multiplied by the weight.
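
These two backward steps, spelled out for the one-hidden-layer network above (a sketch assuming a logistic σ, so σ' = h(1 − h), and a one-hot target):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(0)
    x   = rng.normal(size=3)
    W_x = rng.normal(size=(4, 3))
    W_o = rng.normal(size=(5, 4))
    y_correct = np.eye(5)[2]                     # one-hot target

    # Forward pass
    h = sigmoid(W_x @ x)
    scores = np.exp(W_o @ h)
    y = scores / scores.sum()

    # Backward pass: errors (deltas) propagated from the output toward the input
    delta_o = y - y_correct                      # softmax + cross-entropy
    dE_dWo  = np.outer(delta_o, h)               # dE/dW_o = delta_o h^T
    delta_h = (h * (1 - h)) * (W_o.T @ delta_o)  # sigma'(W_x x), elementwise with W_o^T delta_o
    dE_dWx  = np.outer(delta_h, x)               # dE/dW_x = delta_h x^T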

Review: Unfolding the RNN through Time. [Figure: the hidden layer's self-loop is unrolled into a chain of copies of the network for t = 0, 1, 2, each with its own input and output layer.]

Review: Back-Propagation Through Time (BPTT). [Figure: output-layer errors δ_0, δ_1, δ_2 are propagated backward through the unrolled network for t = 0, 1, 2.]

Problem: Vanishing Gradients. Remember: the error is multiplied by the weight at each time step, and the weights are often smaller than 1. Propagating the error from δ_10 back to δ_0 multiplies it by W_h at every one of the 10 steps, so errors (and thus gradients) get exponentially smaller!
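
A tiny numeric illustration of this exponential shrinkage, with a single scalar recurrent weight of 0.5 and a linear unit (purely illustrative values):

    # Error propagated back through 10 steps of a scalar 'RNN' with recurrent weight 0.5.
    w_h = 0.5
    delta = 1.0
    for t in range(10):
        delta *= w_h          # each step multiplies the error by the recurrent weight
    print(delta)              # 0.0009765625 -- roughly a thousand times smaller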

Why does it Matter? The network cannot learn long-term dependencies. A long-term dependency emerges when the input signal and the teacher signal are far apart (more than about 10 time steps).

Long-Term Dependency. You are presented with the following 6 sequences consisting of As, Bs, and Xs: 1: AXXXXXXXXXXA, 2: AXXXXXXXXXXA, 3: BXXXXXXXXXXB, 4: AXXXXXXXXXXA, 5: BXXXXXXXXXXB, 6: AXXXXXXXXXXA. Can you guess what comes next? AXXXXXXXXXX? (A simplified version of Task 2a from [Hochreiter+97].)
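
A small generator for this toy task as I read the slide (the exact format, a random first letter followed by ten X distractors and the same letter as target, is an assumption):

    import random

    def make_sequence(n_distractors=10):
        first = random.choice("AB")
        inputs = first + "X" * n_distractors     # what the network reads
        target = first                           # what it must predict at the end
        return inputs, target

    random.seed(0)
    for _ in range(6):
        print(make_sequence())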

Unfortunately, an RNN trained with BPTT fails. The RNN must find out by itself that it must store the first letter in the hidden layer. (Note that manually setting the weights to do this is easy.) Guess: B, Correct: A. The first time step is the only place where the RNN could reprogram itself (update its weights) to store the first letter in the hidden layer, but by the time the error δ has been propagated back that far, it is too small!

Solutions. Alternative training techniques: Hessian-free optimization [Martens+11]. Alternative architectures (training remains the same): Long Short-Term Memory (LSTM).

Why does the Gradient Vanish? Because the error (δ) is multiplied by scaling factors: δ_i = σ'(y_i) W_ji δ_j (for the connection from unit i to unit j). What if we set W_ii = 1 and σ(x) = x? → Constant Error Carrousel (CEC): the error can circulate in the unit's self-loop (weight 1.0) indefinitely. The CEC enables error propagation over an arbitrary number of time steps.
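
The same scalar experiment as above, but with the CEC choice W_ii = 1 and σ(x) = x, so σ'(x) = 1: the circulating error is unchanged after any number of steps:

    # With a self-loop weight of 1.0 and a linear activation (sigma'(x) = 1),
    # the error in the CEC stays constant no matter how far back it is propagated.
    w_ii = 1.0
    delta = 1.0
    for t in range(100):
        delta *= 1.0 * w_ii   # sigma'(x) * W_ii = 1 * 1
    print(delta)              # still 1.0 after 100 steps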

LSTM Unit. A CEC equipped with an input gate and an output gate is an LSTM unit. The gates control the error flow depending on the context. [Figure: IN → tanh input squashing → multiply by input gate y_in ∈ [0,1] = logsig(W_in ·) → CEC self-loop with weight 1.0 → multiply by output gate y_out ∈ [0,1] = logsig(W_out ·) → OUT.]
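
A sketch of one time step of such a unit in the form this slide describes (input and output gates only, no forget gate); the weight shapes, the gates reading only the current input, and the output squashing function are assumptions:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(0)
    n_in, n_cell = 3, 4
    W     = rng.normal(size=(n_cell, n_in))      # cell-input weights
    W_in  = rng.normal(size=(n_cell, n_in))      # input-gate weights
    W_out = rng.normal(size=(n_cell, n_in))      # output-gate weights

    c = np.zeros(n_cell)                         # CEC states (self-loop weight 1.0)
    x = rng.normal(size=n_in)                    # current input

    y_in  = sigmoid(W_in @ x)                    # input gate, values in [0, 1]
    y_out = sigmoid(W_out @ x)                   # output gate, values in [0, 1]
    c     = 1.0 * c + y_in * np.tanh(W @ x)      # CEC keeps its value; gated new input is added
    out   = y_out * np.tanh(c)                   # gated, squashed output of the LSTM unit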

Entire Picture of an LSTM-RNN. [Figure: input layer → hidden layer (made of LSTM units, with a recurrent connection) → output layer.]

LSTM Variants. Forget gates (explicitly reset the value in the CEC), peephole connections, multiple CECs in one LSTM unit, etc. Common to all variants: the CEC and the gates.

Summary. An RNN trained with BPTT fails to learn long-term dependencies due to vanishing gradients. LSTM overcomes this problem with the Constant Error Carrousel (CEC). LSTM has input and output gates.

References
Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, Vol. 3, pp. 115-143, 2003.
Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. IEEE Press, 2001.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, Vol. 9, No. 8, pp. 1735-1780, 1997.
James Martens and Ilya Sutskever. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1033-1040, 2011.
Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. LSTM neural networks for language modeling. In INTERSPEECH, pp. 194-197, 2012.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104-3112, 2014.
Wojciech Zaremba and Ilya Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.