Long Short-Term Memory (LSTM)
M1 Yuichiro Sawai, Computational Linguistics Lab.
January 15, Deep Lunch



2 Why LSTM?
- Often used in recent RNN-based systems
  - Machine translation
  - Program execution
- Can capture long-term dependencies

3 Sequence to Sequence Learning with Neural Networks [Sutskever+14]
- Machine translation (English to French)
- Achieved state-of-the-art results while making few assumptions about the data
- Example: "There is a cat under the chair." -> LSTM-RNN (encoder) -> LSTM-RNN (decoder) -> "Il y a un chat sous la chaise."

4 Learning to Execute [Zaremba+14]
- Trained an RNN to execute code written in a Python-like language
- Example input to the LSTM-RNN:
    j=8584
    for x in range(8):
      j+=920
    b=(1500+j)
    print((b+7567))
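Since the sample program on this slide happens to be valid Python, the value the network is expected to predict can be obtained by simply running it. A minimal sketch (the wrapper name `run_sample` is an illustrative assumption, not something from the slides):

```python
def run_sample():
    """Execute the sample program from the slide and return the value it prints."""
    j = 8584
    for x in range(8):
        j += 920
    b = (1500 + j)
    return b + 7567


print(run_sample())  # 25011
```

The network sees the program only as a character sequence and must produce this number without an interpreter.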

5 Review: Feed-Forward Neural Network
- Output layer (OUT): $y = \exp(W_o h) / Z$, where $Z$ normalizes the outputs to sum to 1
- Hidden layer: $h = \sigma(W_x x)$
- Input layer (IN): $x$
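The forward pass of this slide can be sketched in plain Python. The weight values and layer sizes below are arbitrary illustrative assumptions, and `matvec`, `softmax`, and `feed_forward` are hypothetical helper names.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(a):
    """exp(a_k) / Z, where Z is the sum of the exponentials."""
    m = max(a)  # subtracting the max leaves the result unchanged but avoids overflow
    exps = [math.exp(x - m) for x in a]
    Z = sum(exps)
    return [e / Z for e in exps]


def feed_forward(W_x, W_o, x):
    h = [sigmoid(a) for a in matvec(W_x, x)]  # hidden layer: h = sigma(W_x x)
    y = softmax(matvec(W_o, h))               # output layer: y = exp(W_o h) / Z
    return h, y


# Toy weights (illustrative only): 2 inputs, 3 hidden units, 2 output classes.
W_x = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
W_o = [[0.2, -0.5, 0.1], [-0.4, 0.3, 0.6]]
h, y = feed_forward(W_x, W_o, [1.0, 2.0])
print(y, sum(y))
```

The outputs form a probability distribution over the classes, which is what the cross-entropy error later in the deck assumes.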

6 Review: Recurrent Neural Network
- Output layer (OUT): $y_t = \exp(W_o h_t) / Z$
- Hidden layer with self-loop: $h_t = \sigma(W_x x_t + W_h h_{t-1})$
- Input layer (IN): $x_t$
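The self-loop means the hidden layer at one time step feeds into the hidden layer at the next. A minimal sketch of that recurrence, reading the slide's formula with explicit time indices (the weights, sizes, and the name `rnn_step` are illustrative assumptions):

```python
import math


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(W_x, W_h, x_t, h_prev):
    """h_t = sigmoid(W_x x_t + W_h h_{t-1}), applied elementwise."""
    a = [p + q for p, q in zip(matvec(W_x, x_t), matvec(W_h, h_prev))]
    return [1.0 / (1.0 + math.exp(-v)) for v in a]


# Illustrative weights: 2 inputs, 2 hidden units.
W_x = [[0.5, -0.2], [0.1, 0.4]]
W_h = [[0.3, 0.0], [-0.1, 0.2]]

h = [0.0, 0.0]  # initial hidden state
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = []
for x_t in inputs:  # the same weights are reused at every time step
    h = rnn_step(W_x, W_h, x_t, h)
    states.append(h)
print(states)
```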

7 Review: Training of RNN
- Find good values for $W_x$, $W_h$, $W_o$.
- Each arrow (input to hidden: $W_x$; hidden to hidden: $W_h$; hidden to output: $W_o$) has a corresponding weight that has to be optimized.

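8 Review: Gradient Descent
- Iteratively update the weights by moving against the gradient of the error: $W_{new} = W_{old} - \alpha \, \partial E / \partial W$
  - $E$: error function (cross-entropy for multi-class classification)
  - $\alpha$: learning rate
- [Figure: $W_{old}$, $W_{new}$, and the best $W$ marked on the error curve.]
- How is $\partial E / \partial W$ calculated? Back-propagation!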
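The update rule is easiest to see on a one-dimensional toy problem. The sketch below assumes the error function E(w) = (w - 3)^2, chosen only for illustration. Each step moves against the gradient, so the update subtracts alpha times the gradient.

```python
def gradient_descent(grad, w, alpha=0.1, steps=100):
    """Repeatedly apply w <- w - alpha * dE/dw."""
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w


# Toy error function E(w) = (w - 3)^2, whose gradient is 2 (w - 3).
w_best = gradient_descent(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_best)  # approaches 3.0, the minimizer of E
```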

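9 Review: Gradient Calculation
Gradients are calculated from errors ($\delta$) propagated backward.
- Output layer: $y = \exp(W_o h) / Z$, $\delta_o = y - y_{correct}$, $\partial E / \partial W_o = \delta_o h^T$
- Hidden layer: $h = \sigma(W_x x)$, $\delta_h = \sigma'(W_x x) \odot (W_o^T \delta_o)$, $\partial E / \partial W_x = \delta_h x^T$
- During backward propagation, the error gets multiplied by the weight.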
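These formulas can be checked numerically. The sketch below computes the two gradients for a small feed-forward network and compares one entry against a finite-difference estimate. All weights, the input, and the target class are illustrative assumptions, and the error is taken to be cross-entropy with a one-hot target.

```python
import math


def forward(W_x, W_o, x):
    a = [sum(w * v for w, v in zip(row, x)) for row in W_x]
    h = [1.0 / (1.0 + math.exp(-v)) for v in a]
    z = [sum(w * v for w, v in zip(row, h)) for row in W_o]
    Z = sum(math.exp(v) for v in z)
    y = [math.exp(v) / Z for v in z]
    return h, y


def loss(W_x, W_o, x, target):
    """Cross-entropy for a one-hot target given as a class index."""
    _, y = forward(W_x, W_o, x)
    return -math.log(y[target])


def gradients(W_x, W_o, x, target):
    h, y = forward(W_x, W_o, x)
    # delta_o = y - y_correct (y_correct is one-hot)
    d_o = [y_k - (1.0 if k == target else 0.0) for k, y_k in enumerate(y)]
    # dE/dW_o = delta_o h^T
    g_o = [[d * h_j for h_j in h] for d in d_o]
    # delta_h = sigma'(W_x x) * (W_o^T delta_o), using sigma' = h (1 - h)
    d_h = [h_j * (1.0 - h_j) * sum(W_o[k][j] * d_o[k] for k in range(len(d_o)))
           for j, h_j in enumerate(h)]
    # dE/dW_x = delta_h x^T
    g_x = [[d * x_i for x_i in x] for d in d_h]
    return g_x, g_o


W_x = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
W_o = [[0.2, -0.5, 0.1], [-0.4, 0.3, 0.6]]
x, target = [1.0, 2.0], 1
g_x, g_o = gradients(W_x, W_o, x, target)

# Central finite-difference check on one entry of W_x.
eps = 1e-6
W_plus = [row[:] for row in W_x]
W_plus[0][1] += eps
W_minus = [row[:] for row in W_x]
W_minus[0][1] -= eps
numeric = (loss(W_plus, W_o, x, target) - loss(W_minus, W_o, x, target)) / (2 * eps)
print(g_x[0][1], numeric)
```

The two printed values should agree to several decimal places, which is a common sanity check for a hand-written backward pass.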

10 Review: Unfolding RNN through Time
[Figure: the RNN unfolded over time steps t = 0, 1, 2, with an input, hidden, and output layer at each step; the self-loop links the hidden layers of consecutive steps.]

11 Review: Back Propagation through Time (BPTT)
[Figure: the unfolded RNN at t = 0, 1, 2, with errors $\delta_0$, $\delta_1$, $\delta_2$ at the output layer propagated backward through the hidden layers.]

12 Problem: Vanishing Gradients
- Remember: the error is multiplied by a weight at each time step (weights are often smaller than 1).
- [Figure: the unfolded RNN from t = 0 to t = 10; the error is passed back through $W_h$ at every step between $\delta_{10}$ and $\delta_0$.]
- Errors (and thus gradients) get smaller exponentially!
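The shrinkage is easy to reproduce for a single self-connected sigmoid unit. The sketch below assumes a recurrent weight of 0.9 and a pre-activation of 0 at every step (both illustrative), so each backward step multiplies the error by 0.25 * 0.9.

```python
import math


def sigmoid_prime(a):
    s = 1.0 / (1.0 + math.exp(-a))
    return s * (1.0 - s)


def backpropagated_error(delta_T, w_h, pre_activations):
    """Push an error back through one self-connected unit, one step at a time."""
    delta = delta_T
    history = [delta]
    for a in reversed(pre_activations):
        delta = sigmoid_prime(a) * w_h * delta
        history.append(delta)
    return history


# Illustrative values: weight 0.9, pre-activation 0 at all 10 steps.
history = backpropagated_error(1.0, 0.9, [0.0] * 10)
print(history[-1])  # 0.225 ** 10, about 3.3e-7
```

After only 10 steps the error is over six orders of magnitude smaller, so the weight update it produces at the earliest step is negligible.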

13 Why does it Matter?
- An RNN trained this way cannot learn long-term dependencies.
- A long-term dependency arises when the input signal and the teacher signal are far apart (more than 10 time steps).

14 Long-Term Dependency
You are presented with the following 6 sequences consisting of As, Bs, and Xs.
  1: AXXXXXXXXXXA
  2: AXXXXXXXXXXA
  3: BXXXXXXXXXXB
  4: AXXXXXXXXXXA
  5: BXXXXXXXXXXB
  6: AXXXXXXXXXXA
Can you guess what comes next?
  AXXXXXXXXXX?
(Simplified version of Task 2a from [Hochreiter+97])
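The toy task can be written as a small data generator: the target is always the first symbol, separated from the prediction point by a run of Xs. This is a sketch of the simplified task on this slide only, and `make_sequence` is a hypothetical name.

```python
def make_sequence(first, gap=10):
    """Return (input, target): the first symbol, `gap` Xs, then the symbol to predict."""
    if first not in ("A", "B"):
        raise ValueError("first symbol must be 'A' or 'B'")
    return first + "X" * gap, first


inputs, target = make_sequence("A")
print(inputs, "->", target)  # AXXXXXXXXXX -> A
```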

15 Unfortunately, RNN with BPTT fails.
- The RNN must find out by itself that it must store the first letter in the hidden layer. (Note that manually setting the weights to do this is easy.)
- [Figure: input A X X X ...; guess: B, correct: A; the error $\delta$ is propagated back toward the first time step.]
- The first time step is the only place where the RNN could reprogram itself (update its weights) to store the first letter in the hidden layer, but the error arriving there is too small!

16 Solutions
- Alternative training techniques
  - Hessian-free optimization [Martens+11]
- Alternative architectures (training remains the same)
  - Long Short-Term Memory (LSTM)

17 Why does Gradient Vanish?
- Because the error ($\delta$) is multiplied by scaling factors: $\delta_i = \sigma'(y_i) W_{ji} \delta_j$ (connection from unit i to unit j)
- What if we set $W_{ii} = 1$ and $\sigma(x) = x$? -> Constant Error Carrousel (CEC)
- [Figure: a unit between IN and OUT with a self-connection of weight 1.0; the error can circulate here indefinitely.]
- The CEC enables error propagation over an arbitrary number of time steps.
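The effect of the two choices (self-weight 1 and identity activation) can be compared directly with an ordinary sigmoid unit. In the sketch below, the sigmoid case assumes a derivative of 0.25 and a self-weight of 0.9, which are illustrative values only.

```python
def circulate(delta, weight, derivative, steps):
    """Apply delta <- derivative * weight * delta for the given number of steps."""
    for _ in range(steps):
        delta = derivative * weight * delta
    return delta


# Sigmoid unit at pre-activation 0 (derivative 0.25) with self-weight 0.9.
print(circulate(1.0, 0.9, 0.25, 100))  # shrinks towards 0
# CEC: linear unit (derivative 1) with self-weight 1.0.
print(circulate(1.0, 1.0, 1.0, 100))   # stays exactly 1.0
```

With both factors equal to 1, the scaling per step is exactly 1, so the error neither vanishes nor explodes however many steps it travels.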

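18 LSTM Unit
- A CEC with an input gate and an output gate is an LSTM unit.
- Gates control the error flow depending on the context.
- [Figure: IN -> tanh (weights $W$) -> mul. by $y_{in} \in [0,1]$ -> CEC (self-weight 1.0) -> mul. by $y_{out} \in [0,1]$ -> OUT; input gate: logsig with weights $W_{in}$; output gate: logsig with weights $W_{out}$.]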
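The diagram on this slide can be read as a forward step for a single unit. This is a sketch of one reading, not a reference implementation: it assumes both gates read the same input vector as the unit, that the output is the gated cell state with no extra squashing, and it has no forget gate. The weights and the name `lstm_unit_step` are illustrative.

```python
import math


def logsig(a):
    return 1.0 / (1.0 + math.exp(-a))


def dot(w, v):
    return sum(p * q for p, q in zip(w, v))


def lstm_unit_step(w, w_in, w_out, x, s_prev):
    """One time step of a single LSTM unit with input and output gates."""
    y_in = logsig(dot(w_in, x))    # input gate, in [0, 1]
    y_out = logsig(dot(w_out, x))  # output gate, in [0, 1]
    s = 1.0 * s_prev + y_in * math.tanh(dot(w, x))  # CEC: self-weight fixed at 1.0
    return s, y_out * s


# Illustrative weights: both gates look only at the second input component.
w, w_in, w_out = [2.0, 0.0], [0.0, 10.0], [0.0, 10.0]
s = 0.0
s, out = lstm_unit_step(w, w_in, w_out, [1.0, 1.0], s)  # gates open: store a value
stored = s
for _ in range(10):
    s, out = lstm_unit_step(w, w_in, w_out, [1.0, -1.0], s)  # gates nearly closed
print(stored, s, out)
```

With the gates nearly closed, the cell state stays close to the stored value for 10 steps while the output remains near zero, which is the behaviour the A/B task needs.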

19 Entire Picture of LSTM-RNN
[Figure: input layer -> hidden layer (with LSTM units and a recurrent connection) -> output layer.]

20 LSTM Variants
- Forget gates: explicitly reset the value in the CEC
- Peephole connections
- Multiple CECs in an LSTM unit
- etc.
Common to all variants: CEC and gates

21 Summary
- RNN with BPTT fails to learn long-term dependencies due to vanishing gradients.
- LSTM overcomes this problem by having a Constant Error Carrousel (CEC).
- LSTM has input and output gates.

22 References
- Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. Learning precise timing with LSTM recurrent networks. The Journal of Machine Learning Research, Vol. 3.
- Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, Vol. 1. IEEE Press.
- Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, Vol. 9, No. 8.
- James Martens and Ilya Sutskever. Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML-11).
- Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. LSTM neural networks for language modeling. In INTERSPEECH.
- Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems.
- Wojciech Zaremba and Ilya Sutskever. Learning to execute. arXiv preprint.
