Long Short-Term Memory (LSTM). Yuichiro Sawai (M1), Computational Linguistics Lab. January 15, 2015 @ Deep Lunch.
Why LSTM? Often used in many recent RNN-based systems: machine translation, program execution. Can capture long-term dependencies.
Sequence to Sequence Learning with Neural Networks [Sutskever+14]. Machine translation (English to French). Achieved state-of-the-art results while making few assumptions about the data. Example: "There is a cat under the chair." → LSTM-RNN encoder → LSTM-RNN decoder → "Il y a un chat sous la chaise."
Learning to Execute [Zaremba+14]. Trained an RNN to execute code written in a Python-like language. Example input program:
j=8584
for x in range(8): j+=920
b=(1500+j)
print((b+7567))
The LSTM-RNN predicts the output: 25011.
Review: Feed-Forward Neural Network. Input layer: x. Hidden layer: h = σ(W_x x). Output layer: y = exp(W_o h) / Z (softmax).
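A minimal NumPy sketch of this forward pass, following the slide's notation (weight matrices W_x and W_o, no bias terms, softmax written as exp(...) / Z):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())   # subtract max for numerical stability
    return e / e.sum()        # division by Z, the normalizing constant

def feedforward(x, W_x, W_o):
    """Forward pass of the two-layer network on the slide."""
    h = sigmoid(W_x @ x)      # hidden layer: h = sigma(W_x x)
    y = softmax(W_o @ h)      # output layer: y = exp(W_o h) / Z
    return y, h
```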
Review: Recurrent Neural Network. Same as the feed-forward network, but the hidden layer has a self-loop: h = σ(W_x x + W_h h). Output layer: y = exp(W_o h) / Z.
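The same sketch extended with the self-loop, reusing the sigmoid and softmax helpers defined above; the hidden state h is carried from one time step to the next:

```python
def rnn_forward(xs, W_x, W_h, W_o, h0):
    """Run the recurrent hidden layer over an input sequence xs."""
    h, ys = h0, []
    for x in xs:
        h = sigmoid(W_x @ x + W_h @ h)  # h_t = sigma(W_x x_t + W_h h_{t-1})
        ys.append(softmax(W_o @ h))     # y_t = exp(W_o h_t) / Z
    return ys, h
```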
Review: Training of an RNN. Find good values for W_x, W_h, W_o. Each arrow in the network diagram has a corresponding weight that has to be optimized.
Review: Gradient Descent. Iteratively update each weight by stepping along the negative gradient: W_new = W_old - α ∂E/∂W. E: error function (cross-entropy for multi-class classification). α: learning rate. (Figure: error curve E(W), with W_old stepping toward the best W.) How is ∂E/∂W calculated? Back-propagation!
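A minimal sketch of this update loop; the compute_gradient argument is a hypothetical placeholder standing in for back-propagation, covered on the next slide:

```python
def train(W, data, compute_gradient, alpha=0.1, n_steps=100):
    """Plain gradient descent: repeatedly step against the gradient of the error E."""
    for _ in range(n_steps):
        grad = compute_gradient(W, data)  # dE/dW, obtained by back-propagation
        W = W - alpha * grad              # W_new = W_old - alpha * dE/dW
    return W
```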
Review: Gradient Calculation. Gradients are calculated from errors (δ) propagated backward. Forward: h = σ(W_x x), y = exp(W_o h)/Z. Backward: δ_o = y - y_correct, ∂E/∂W_o = δ_o h^T; δ_h = σ'(W_x x) ⊙ (W_o^T δ_o), ∂E/∂W_x = δ_h x^T. Note that the error gets multiplied by the weights as it propagates backward.
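A sketch of these formulas for the feed-forward network above (assuming a sigmoid hidden layer, so σ'(W_x x) = h(1 - h), and reusing the helpers defined earlier):

```python
def backprop(x, y_correct, W_x, W_o):
    """Gradients of the cross-entropy error for the two-layer network."""
    h = sigmoid(W_x @ x)
    y = softmax(W_o @ h)
    delta_o = y - y_correct                      # output error: delta_o = y - y_correct
    grad_W_o = np.outer(delta_o, h)              # dE/dW_o = delta_o h^T
    delta_h = (h * (1 - h)) * (W_o.T @ delta_o)  # delta_h = sigma'(W_x x) * (W_o^T delta_o)
    grad_W_x = np.outer(delta_h, x)              # dE/dW_x = delta_h x^T
    return grad_W_o, grad_W_x
```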
Review: Unfolding the RNN through Time. (Figure: the network unrolled for t = 0, 1, 2; the hidden layer's self-loop becomes a connection from each hidden layer to the next time step's hidden layer.)
Review: Back-Propagation Through Time (BPTT). (Figure: the unrolled network for t = 0, 1, 2, with errors δ_0, δ_1, δ_2 propagated backward from the output layers through the hidden layers.)
Problem: Vanishing Gradients. Remember: the error is multiplied by the weights at each time step (and weights are often smaller than 1). Propagating the error from δ_10 at t = 10 back to δ_0 at t = 0 multiplies it by W_h ten times, so errors (and thus gradients) shrink exponentially!
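A toy numerical illustration of this shrinkage (the weight 0.5 and the sigmoid-derivative bound 0.25 are assumed for illustration, not taken from the slides):

```python
delta = 1.0
w_h, sigma_prime = 0.5, 0.25    # recurrent weight and a typical sigmoid derivative
for t in range(10):
    delta *= w_h * sigma_prime  # one backward step through time
print(delta)                    # about 9.3e-10 after 10 steps: the gradient has vanished
```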
Why does it Matter? The network cannot learn long-term dependencies. A long-term dependency arises when the input signal and the teacher signal are far apart (more than about 10 time steps).
Long-Term Dependency. You are presented with the following 6 sequences consisting of As, Bs, and Xs:
1: AXXXXXXXXXXA
2: AXXXXXXXXXXA
3: BXXXXXXXXXXB
4: AXXXXXXXXXXA
5: BXXXXXXXXXXB
6: AXXXXXXXXXXA
Can you guess what comes next? AXXXXXXXXXX? (A simplified version of Task 2a from [Hochreiter+97].)
Unfortunately, an RNN trained with BPTT fails. The RNN must find out by itself that it has to store the first letter in the hidden layer. (Note that manually setting the weights to do this is easy.) Guess: B. Correct: A. The first time step is the only place where the RNN could reprogram itself (update its weights) to store the first letter in the hidden layer, but by the time the error δ has propagated back that far, it is too small!
Solutions. Alternative training techniques: Hessian-free optimization [Martens+11]. Alternative architectures (training stays the same): Long Short-Term Memory (LSTM).
Why does the Gradient Vanish? Because the error (δ) is multiplied by scaling factors: δ_i = σ'(y_i) W_ji δ_j (for the connection from unit i to unit j). What if we set W_ii = 1 and σ(x) = x? → the Constant Error Carrousel (CEC): with a self-connection of weight 1.0 and a linear activation, the error can circulate in the unit indefinitely. The CEC enables error propagation over an arbitrary number of time steps.
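Repeating the toy calculation above with the CEC's choices W_ii = 1 and σ(x) = x (so σ'(x) = 1) shows that the error no longer decays:

```python
delta = 1.0
w_self, sigma_prime = 1.0, 1.0   # CEC: self-weight 1.0, linear activation
for t in range(100):
    delta *= w_self * sigma_prime
print(delta)                     # still 1.0 after 100 steps: constant error flow
```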
LSTM Unit. A CEC equipped with an input gate and an output gate is an LSTM unit. The gates control the error flow depending on the context. (Figure: the input passes through tanh and is multiplied by the input gate y_in ∈ [0,1] (logistic sigmoid with weights W_in), enters the CEC with self-weight 1.0, and is multiplied by the output gate y_out ∈ [0,1] (logistic sigmoid with weights W_out) before being output.)
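A simplified sketch of one time step of such a unit (input and output gates only, no forget gate). For brevity the gates here see only the current input x; in a full LSTM-RNN the gates and the cell input would also receive recurrent connections:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_unit_step(x, c_prev, W, W_in, W_out):
    """One step of an original-style LSTM unit: CEC plus input and output gates."""
    g = np.tanh(W @ x)          # squashed input to the cell
    y_in = sigmoid(W_in @ x)    # input gate, in [0, 1]
    y_out = sigmoid(W_out @ x)  # output gate, in [0, 1]
    c = c_prev + y_in * g       # CEC: previous cell value kept with weight 1.0
    h = y_out * np.tanh(c)      # gated output of the unit
    return h, c
```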
Entire Picture of an LSTM-RNN. (Figure: input layer → hidden layer with LSTM units and a recurrent connection → output layer.)
LSTM Variants. Forget gates, which explicitly reset the value in the CEC; peephole connections; multiple CECs in a single LSTM unit; etc. Common to all: the CEC and gates.
Summary. An RNN trained with BPTT fails to learn long-term dependencies due to vanishing gradients. LSTM overcomes this problem with the Constant Error Carrousel (CEC). LSTM also has input and output gates.
References
Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. Learning precise timing with LSTM recurrent networks. The Journal of Machine Learning Research, Vol. 3, pp. 115-143, 2003.
Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, Vol. 1. IEEE Press, 2001.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, Vol. 9, No. 8, pp. 1735-1780, 1997.
James Martens and Ilya Sutskever. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1033-1040, 2011.
Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. LSTM neural networks for language modeling. In INTERSPEECH, pp. 194-197, 2012.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104-3112, 2014.
Wojciech Zaremba and Ilya Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.