Learning Long-Term Dependencies with Gradient Descent is Difficult


1 Learning Long-Term Dependencies with Gradient Descent is Difficult. Y. Bengio, P. Simard & P. Frasconi, IEEE Trans. Neural Nets, 1994. June 23, 2016, ICML, New York City, Back-to-the-future Workshop. Yoshua Bengio, Montreal Institute for Learning Algorithms, Université de Montréal

2 Simple Experiments from 1992 (while I was at MIT). Two categories of sequences. Can a single tanh unit learn to store, for T time steps, 1 bit of information given by the sign of the initial input? [Plot: Prob(success | seq. length T).]

3 How to store 1 bit? Dynamics with multiple basins of attraction in some dimensions. Some subspace of the state can store 1 or more bits of information if the dynamical system has multiple basins of attraction in some dimensions. [Figure: two basins, labelled Bit=0 and Bit=1, separated by the basins' boundary.]

4 Robustly storing 1 bit in the presence of bounded noise. With spectral radius > 1, noise can kick the state out of the attractor: UNSTABLE. Not so with radius < 1: CONTRACTIVE → STABLE. [Figure: two panels showing the domain of a_t, a region Γ with boundary β, and the areas where M > 1 and M < 1.]

5 Storing Reliably ⇒ Vanishing gradients. Reliably storing bits of information requires spectral radius < 1. The product of T matrices whose spectral radius is < 1 is a matrix whose spectral radius converges to 0 at an exponential rate in T. If the spectral radius of the Jacobian is < 1 ⇒ propagated gradients vanish.
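The link between reliable storage and vanishing gradients can be illustrated with the single tanh unit from the earlier slide. The sketch below is only an illustration under stated assumptions: the recurrence x_t = tanh(w * x_{t-1}), the weight w = 2.0, the starting value, and the function name `latch_gradient` are all chosen here for the example and are not taken from the talk. In this scalar case the "Jacobian" is just the derivative w * (1 - x_t^2), which drops below 1 once the state settles into one of the two attractors.

```python
import math

def latch_gradient(w=2.0, x0=0.1, steps=50):
    """Iterate x_t = tanh(w * x_{t-1}) and accumulate d x_T / d x_0.

    Hypothetical toy setup: w > 1 gives two stable fixed points (one per
    stored bit); the sign of x0 decides which basin the state falls into.
    """
    x = x0
    grad = 1.0  # d x_0 / d x_0
    history = []
    for _ in range(steps):
        x = math.tanh(w * x)
        # Chain rule: d x_t / d x_{t-1} = w * (1 - x_t**2)
        grad *= w * (1.0 - x * x)
        history.append(grad)
    return x, history

final_state, grads = latch_gradient()
for t in (1, 5, 10, 25, 50):
    print(f"T={t:2d}  d x_T / d x_0 = {grads[t - 1]:.3e}")
```

The state latches onto the positive attractor (the bit is stored), while the sensitivity of the final state to the initial one shrinks roughly geometrically once the unit saturates.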

6 Vanishing or Exploding Gradients. Hochreiter's 1991 MSc thesis (in German) had independently discovered that backpropagated gradients in RNNs tend to either vanish or explode as sequence length increases.

7 Why it hurts gradient-based learning. Long-term dependencies get a weight that is exponentially smaller (in T) compared to short-term dependencies. Becomes exponentially smaller for longer time differences, when spectral radius < 1.
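The weighting argument can be written out as follows. This is a sketch in notation assumed here rather than recovered from the slide: a_t is the recurrent state, C_T the cost at time T, W the recurrent parameters, and λ an assumed bound on the norm of each one-step Jacobian.

```latex
\frac{\partial C_T}{\partial W}
  = \sum_{\tau \le T}
    \frac{\partial C_T}{\partial a_T}\,
    \frac{\partial a_T}{\partial a_\tau}\,
    \frac{\partial a_\tau}{\partial W},
\qquad
\frac{\partial a_T}{\partial a_\tau}
  = \prod_{k=\tau+1}^{T} \frac{\partial a_k}{\partial a_{k-1}} .

\text{If } \left\lVert \frac{\partial a_k}{\partial a_{k-1}} \right\rVert \le \lambda < 1
\text{ for all } k, \text{ then }
\left\lVert \frac{\partial a_T}{\partial a_\tau} \right\rVert \le \lambda^{\,T-\tau} .
```

Terms with a large gap T - τ (long-term dependencies) therefore contribute exponentially less to the total gradient than terms with τ close to T.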

8 Dealing with Gradient Explosion by Gradient Norm Clipping (Mikolov thesis 2012; Pascanu, Mikolov, Bengio, ICML 2013). [Figure: error surface.]
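Gradient norm clipping rescales the whole gradient vector whenever its norm exceeds a threshold, keeping its direction. A minimal sketch, assuming the gradient is a flat list of floats; the function name `clip_by_global_norm` and the threshold value are illustrative choices, not taken from the cited works.

```python
import math

def clip_by_global_norm(grads, threshold):
    """Rescale grads so that their L2 norm is at most threshold.

    Returns the (possibly rescaled) gradient and the original norm.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grads], norm
    # Small gradients pass through unchanged.
    return list(grads), norm

clipped, original_norm = clip_by_global_norm([3.0, 4.0], threshold=1.0)
print(original_norm, clipped)
```

Because only the length is changed, an exploding gradient still points in the same descent direction but takes a bounded step.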


9 Conference version (1993) of the 1994 paper by the same authors had a predecessor of GRU and targetprop (The problem of learning long-term dependencies in recurrent networks, Bengio, Frasconi & Simard ICNN 1993). Flip-flop unit to store 1 bit, with a gating signal to control when to write. Pseudo-backprop through it by a form of targetprop.

10 Bypassing nonlinearities to learn longer term dependencies. Delays (Lin et al & Giles 1995). Multiple time scales (El Hihi & Bengio NIPS 1995). [Figure: a recurrent network with input x, state s and output o, unfolded in time over x_{t-1}, x_t, x_{t+1}, with weights W_1 and W_3.]

11 Fighting the vanishing gradient: LSTM & GRU. Create a path where gradients can flow for longer with a self-loop. Corresponds to an eigenvalue of the Jacobian slightly less than 1. (Hochreiter 1991): first version of the LSTM, called Neural Long-Term Storage with self-loop. LSTM is now heavily used (Hochreiter & Schmidhuber 1997). GRU: light-weight version (Cho et al 2014). [Figure: LSTM cell with input, input gate, forget gate, output gate, state self-loop and output.]
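The effect of a self-loop with an eigenvalue slightly less than 1 can be seen on a stripped-down linear cell. The sketch below is an assumption-laden toy, not an LSTM: the recurrence c_t = forget * c_{t-1} + x_t, the constant `forget` value, and the function names are hypothetical and chosen only to show how the sensitivity to the initial state decays.

```python
def run_self_loop(c0, inputs, forget):
    """Linear self-loop c_t = forget * c_{t-1} + x_t (toy stand-in for a cell state)."""
    c = c0
    for x in inputs:
        c = forget * c + x
    return c

def sensitivity_to_initial_state(forget, steps, eps=1e-3):
    """Finite-difference estimate of d c_T / d c_0, which equals forget**steps here."""
    inputs = [0.0] * steps
    base = run_self_loop(1.0, inputs, forget)
    bumped = run_self_loop(1.0 + eps, inputs, forget)
    return (bumped - base) / eps

for forget in (0.5, 0.9, 0.99):
    print(forget, sensitivity_to_initial_state(forget, steps=100))
```

With a loop weight of 0.99 a usable fraction of the gradient survives 100 steps, whereas with 0.5 it is numerically negligible.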

12 Fast Forward 20 years: Attention Mechanisms for Memory Access. Neural Turing Machines (Graves et al 2014) and Memory Networks (Weston et al 2014). Use a content-based attention mechanism (Bahdanau et al 2014) to control the read and write access into a memory. The attention mechanism outputs a softmax over memory locations. [Figure: memory with write and read arrows.]
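A content-based read can be sketched as a softmax over similarity scores between a key and each memory slot, followed by a weighted average of the slots. The example assumes dot-product similarity and a `sharpness` scaling factor; both are illustrative choices made here (the cited models use their own scoring functions), as are the function names.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_read(memory, key, sharpness=1.0):
    """Return (weights over memory locations, weighted read vector)."""
    # Assumed similarity: scaled dot product between the key and each slot.
    scores = [sharpness * sum(k * v for k, v in zip(key, slot)) for slot in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    read = [sum(w * slot[d] for w, slot in zip(weights, memory)) for d in range(dim)]
    return weights, read

memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, read = attention_read(memory, key=[1.0, 0.0], sharpness=5.0)
print(weights, read)
```

Because the weights are a differentiable function of the key, gradients can reach any memory location directly, regardless of how many time steps ago it was written.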

13 Large Memory Networks: Sparse Access Memory for Long-Term Dependencies. A mental state stored in an external memory can stay for arbitrarily long durations, until it is overwritten (partially or not). Forgetting = vanishing gradient. Memory = higher-dimensional state, avoiding or reducing the need for forgetting/vanishing. [Figure: passive copy and access.]

14 Designing the RNN Architecture (Zhang et al 2016). Recurrent depth: max path length divided by sequence length. Feedforward depth: max length from input to nearest output. Skip coefficient: shortest path length divided by sequence length.

15 [Slide reproduces figures, tables and text excerpts from Zhang et al 2016; the table values did not survive extraction except where noted.]

Figure 2: Left: (a) the architectures for sh, st, bu and td, with their (d_r, d_f) equal to (1, 2), (1, 3), (1, 3) and (2, 3), respectively. The longest path in td is colored in red. (b) The 9 architectures denoted by their (d_f, d_r) with d_r = 1, 2, 3 and d_f = 2, 3, 4. We only plot the hidden states within 1 time step (which also have a period of 1) in both (a) and (b). Right: (a) Various architectures that we consider in Section 4.4. From top to bottom are baseline s = 1, and s = 2, s = 3. (b) Proposed architectures that we consider in Section 4.5, where we take k = 3 as an example. The shortest paths in (a) and (b) that correspond to the recurrent skip coefficients are colored in blue.

Impact of change in recurrent depth. Table 1: Left: test BPCs of sh, st, bu, td for tanh RNNs and LSTMs (datasets: PennTreebank with tanh RNN and tanh RNN-SMALL; text8 with tanh RNN-LARGE, LSTM-SMALL and LSTM-LARGE). Right: test BPCs of tanh RNNs with recurrent depth d_r = 1, 2, 3 and feedforward depth d_f = 2, 3, 4 respectively.

Recurrent Depth is Non-trivial. To investigate the first question, we compare 4 similar connecting architectures: 1-layer (shallow) sh, 2-layers stacked st, 2-layers stacked with an extra bottom-up connection bu, and 2-layers

Impact of change in skip coefficient. Sequential MNIST dataset: each MNIST image is reshaped into a sequence, turning the digit classification task into a sequence classification one with long-term dependencies [25, 24]. pMNIST: a slight modification of the dataset is to permute the image sequences by a fixed random order beforehand (permuted MNIST). Results in [25] have shown that both tanh RNNs and LSTMs did not achieve satisfying performance, which also highlights the difficulty of this task.

For all of our experiments we use Adam [26] for optimization, and conduct a grid search on the learning rate in {10^-2, ..., 10^-5}. For tanh RNNs, the parameters are initialized with samples from a uniform distribution. For LSTM networks we adopt a similar initialization scheme, while the forget gate biases are chosen by the grid search on {-5, -3, -1, 0, 1, 3, 5}. We employ early stopping and the batch size was set to 50.

Table 2: Results for MNIST/pMNIST. Top-left: test accuracies with different s for tanh RNN (s = 1, 5, 9, 13, 21). Top-right: test accuracies with different s for LSTM (s = 1, 3, 5, 7, 9). Bottom: compared to previous results (iRNN [25]: 97.0 on MNIST, 82.0 on pMNIST; uRNN [24]: 95.1 on MNIST, 91.4 on pMNIST; also LSTM [24], RNN(tanh) [25] and stanh(s = 21, 11)). Bottom-right: test accuracies for architectures (1), (2), (3) and (4) for tanh RNN.

Table 2, bottom-left panel, shows that our simple architecture improves upon the uRNN on pMNIST, and achieves almost the same performance as LSTM on the MNIST dataset with only 25% of the number of parameters [24]. Note that obtaining good performance on sequential MNIST requires a

16 New Ideas to Help Information Propagation. Unitary matrices: all eigenvalues of the matrix have modulus 1 (Arjovsky, Shah & Bengio ICML 2016). Zoneout: randomly choose to simply copy the state unchanged (Krueger et al 2016, submitted).
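The zoneout idea can be sketched as a per-unit random choice between keeping the previous state and accepting the newly computed one. This is a minimal illustration under assumptions made here: states are plain lists of floats, the choice is made independently per unit, and the names `zoneout_step` and `zoneout_prob` are hypothetical rather than taken from the cited paper.

```python
import random

def zoneout_step(prev_state, candidate_state, zoneout_prob, rng):
    """With probability zoneout_prob, copy a unit's previous value unchanged;
    otherwise take the candidate value computed by the recurrent update."""
    return [
        prev if rng.random() < zoneout_prob else new
        for prev, new in zip(prev_state, candidate_state)
    ]

rng = random.Random(0)
prev_state = [0.1, 0.2, 0.3, 0.4]
candidate_state = [0.9, 0.8, 0.7, 0.6]
print(zoneout_step(prev_state, candidate_state, zoneout_prob=0.5, rng=rng))
```

A copied unit passes its value, and hence its gradient, through the time step with a factor of exactly 1, which is what makes the copy useful for propagating information.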
