Slide credit from Hung-Yi Lee & Richard Socher

Annabel Gibson
5 years ago

1 Slide credit from Hung-Yi Lee & Richard Socher

2 Review: Recurrent Neural Network

3 Recurrent Neural Network
Idea: condition the neural network on all previous words and tie the weights at each time step.
Assumption: temporal information matters.

4 RNN Language Modeling
[Figure: an RNN unrolled over four time steps, with an input layer, a hidden layer (context vector), and an output layer (word probability distribution). The inputs are the vectors of START, "wreck", "a", and "nice"; the corresponding outputs are P(next word is "wreck"), P(next word is "a"), P(next word is "nice"), and P(next word is "beach").]
Idea: pass the information from the previous hidden layer to leverage all contexts.

5 RNNLM Formulation
At each time step, the hidden state is computed from the vector of the current word and the previous hidden state, and the output layer gives the probability of the next word.
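The equations on this slide did not survive extraction. One common way to write the RNNLM step, using the x_t, s_t, o_t notation of the later slides, is sketched below. The weight names U, W, V, the activation symbol \sigma, and the vocabulary index j are assumptions for illustration, not recovered from the slide.

```latex
\begin{aligned}
s_t &= \sigma(W s_{t-1} + U x_t) \\
o_t &= \mathrm{softmax}(V s_t) \\
P(w_{t+1} = v_j \mid w_1, \dots, w_t) &= o_{t,j}
\end{aligned}
```

Here x_t is the vector of the current word and o_{t,j} is the j-th entry of the output distribution.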

6 Recurrent Neural Network Definition
Activation function: tanh, ReLU
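A minimal NumPy sketch of one common form of the recurrence, s_t = act(W s_{t-1} + U x_t) with o_t = softmax(V s_t). The names U, W, V, the toy sizes, and the one-hot inputs are assumptions chosen for illustration; the slide itself only names the activation choices.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability.
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, U, W, V, s0, act=np.tanh):
    """Run a simple RNN over a list of input vectors xs.

    The same U, W, V are reused at every time step (tied weights).
    """
    s = s0
    states, outputs = [], []
    for x in xs:
        s = act(W @ s + U @ x)           # new hidden state
        states.append(s)
        outputs.append(softmax(V @ s))   # distribution over the vocabulary
    return states, outputs

# Toy example: vocabulary of 5 words, 3 hidden units (hypothetical sizes).
rng = np.random.default_rng(0)
vocab, hidden = 5, 3
U = rng.normal(scale=0.1, size=(hidden, vocab))
W = rng.normal(scale=0.1, size=(hidden, hidden))
V = rng.normal(scale=0.1, size=(vocab, hidden))
xs = [np.eye(vocab)[i] for i in [0, 3, 1]]   # one-hot word vectors
states, outputs = rnn_forward(xs, U, W, V, np.zeros(hidden))
```

Passing `act=lambda z: np.maximum(0.0, z)` swaps tanh for ReLU.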

7 Model Training
All model parameters can be updated by gradient descent, using the cost between the predicted outputs and the targets (y_{t-1}, y_t, y_{t+1} in the figure).

8 Outline
Language Modeling
  N-gram Language Model
  Feed-Forward Neural Language Model
  Recurrent Neural Network Language Model (RNNLM)
Recurrent Neural Network
  Definition
  Training via Backpropagation through Time (BPTT)
  Training Issue
Applications
  Sequential Input
  Sequential Output
    Aligned Sequential Pairs (Tagging)
    Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)

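9 Backpropagation
\frac{\partial C}{\partial w_{ij}^{l}} = a_j^{l-1} \, \delta_i^{l}, where w_{ij}^{l} connects unit j of layer l-1 to unit i of layer l.
Forward pass: compute a_j^{l-1} (equal to x_j when l = 1).
Backward pass: compute the error signal \delta_i^{l}.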

10 Backpropagation
Backward pass: the error signal \delta_i^{l} = \partial C / \partial z_i^{l} is propagated from the output layer L back to layer l.
\delta^{L} = \sigma'(z^{L}) \odot \nabla_{y} C
\delta^{l} = \sigma'(z^{l}) \odot (W^{l+1})^{T} \delta^{l+1}
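The forward and backward passes can be sketched in NumPy as follows. This is a sketch under stated assumptions rather than the slide's exact setup: sigmoid activations, no bias terms, and the cost C = 0.5 * ||y - y_target||^2 are all choices made here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws):
    """Return pre-activations z^l and activations a^l (with a^0 = x)."""
    a, zs, acts = x, [], [x]
    for W in Ws:
        z = W @ a
        a = sigmoid(z)
        zs.append(z)
        acts.append(a)
    return zs, acts

def backprop(x, y_target, Ws):
    """Gradients of C = 0.5 * ||y - y_target||^2 w.r.t. each weight matrix."""
    zs, acts = forward(x, Ws)
    sp = lambda z: sigmoid(z) * (1.0 - sigmoid(z))   # sigma'(z)
    delta = sp(zs[-1]) * (acts[-1] - y_target)       # error signal at the output layer
    grads = [None] * len(Ws)
    for l in range(len(Ws) - 1, -1, -1):
        # Gradient = error signal of this layer times activation of the layer below.
        grads[l] = np.outer(delta, acts[l])
        if l > 0:
            # Propagate the error signal one layer down.
            delta = sp(zs[l - 1]) * (Ws[l].T @ delta)
    return grads

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
x, y = rng.normal(size=3), np.array([0.0, 1.0])
grads = backprop(x, y, Ws)
```

Each entry of `grads` has the same shape as the corresponding weight matrix, so it can be used directly in a gradient descent update.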

11 Backpropagation through Time (BPTT)
Unfold the network through time.
Input: init, x_1, x_2, ..., x_t
Output: o_t
Target: y_t
[Figure: the unfolded network init, s_1, ..., s_{t-2}, s_{t-1}, s_t with inputs x_1, ..., x_{t-2}, x_{t-1}, x_t, output o_t, target y_t, and cost C.]

12 Backpropagation through Time (BPTT)
Unfold the network through time.
Input: init, x_1, x_2, ..., x_t
Output: o_t
Target: y_t
[Figure: the same unfolded network; the error signal is propagated backward from s_t through s_{t-1} and s_{t-2} to s_1 and init.]

13 Backpropagation through Time (BPTT)
Unfold the network through time.
Input: init, x_1, x_2, ..., x_t
Output: o_t
Target: y_t
[Figure: the same unfolded network.]

14 Backpropagation through Time (BPTT)
Unfold the network through time.
Input: init, x_1, x_2, ..., x_t
Output: o_t
Target: y_t
[Figure: the unfolded network; the weight between units i and j at every time step is a pointer to the same memory.]
Weights are tied together.

15 Backpropagation through Time (BPTT)
Unfold the network through time.
Input: init, x_1, x_2, ..., x_t
Output: o_t
Target: y_t
[Figure: the unfolded network with units i, j, and k marked at each time step.]
Weights are tied together.

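16 BPTT
Forward pass: compute s_1, s_2, s_3, s_4.
Backward pass: propagate the error for C^{(4)}, then for C^{(3)}, C^{(2)}, and C^{(1)}.
[Figure: the unfolded network init, s_1, s_2, s_3, s_4 with inputs x_1, ..., x_4, outputs o_1, ..., o_4, targets y_1, ..., y_4, and costs C^{(1)}, ..., C^{(4)}.]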
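The two passes can be sketched in NumPy as below. Several details are assumptions made for illustration and are not on the slide: a tanh hidden layer, a linear output o_t = V s_t, and a squared-error cost C^(t) = 0.5 * ||o_t - y_t||^2. The sketch also accumulates the contributions of all C^(t) in a single backward sweep instead of one sweep per cost; the summed gradient is the same.

```python
import numpy as np

def bptt(xs, ys, U, W, V, s0):
    """Gradients of sum_t C^(t) for s_t = tanh(W s_{t-1} + U x_t), o_t = V s_t."""
    # Forward pass: compute and store s_1, ..., s_T (states[0] is init).
    states = [s0]
    for x in xs:
        states.append(np.tanh(W @ states[-1] + U @ x))

    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    ds_next = np.zeros_like(s0)   # gradient flowing back from step t+1

    # Backward pass: from the last time step to the first.
    for t in range(len(xs), 0, -1):
        s, s_prev = states[t], states[t - 1]
        do = V @ s - ys[t - 1]            # dC^(t)/do_t
        dV += np.outer(do, s)
        ds = V.T @ do + ds_next           # total gradient reaching s_t
        dz = (1.0 - s ** 2) * ds          # back through tanh
        dW += np.outer(dz, s_prev)        # tied weights: contributions are summed
        dU += np.outer(dz, xs[t - 1])
        ds_next = W.T @ dz                # pass the error to step t-1
    return dU, dW, dV

rng = np.random.default_rng(0)
U = rng.normal(scale=0.5, size=(3, 2))
W = rng.normal(scale=0.5, size=(3, 3))
V = rng.normal(scale=0.5, size=(2, 3))
xs = [rng.normal(size=2) for _ in range(4)]
ys = [rng.normal(size=2) for _ in range(4)]
dU, dW, dV = bptt(xs, ys, U, W, V, np.zeros(3))
```

Because U, W, and V are shared across time steps, each gradient is a sum over all steps, which is why the accumulators use `+=`.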

17 RNN Training Issue
The gradient is a product of Jacobian matrices, each associated with a step in the forward computation.
The same matrix is multiplied at each time step during backprop.
The gradient becomes very small or very large quickly: the vanishing or exploding gradient problem.
Bengio et al., "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. on Neural Networks, [link]
Pascanu et al., "On the difficulty of training recurrent neural networks," in ICML, [link]
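A small numerical illustration of the repeated multiplication. As a simplifying assumption, the activation derivative is ignored (a linear recurrence), and the recurrent matrix is a scaled orthogonal matrix so that the gradient norm changes by exactly the scale factor at every step. The function name and sizes are hypothetical.

```python
import numpy as np

def backprop_norms(W, steps):
    """Norm of a gradient vector after repeated multiplication by W^T."""
    g = np.ones(W.shape[0]) / np.sqrt(W.shape[0])   # unit-norm starting gradient
    norms = []
    for _ in range(steps):
        g = W.T @ g
        norms.append(np.linalg.norm(g))
    return norms

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # random orthogonal matrix

small = backprop_norms(0.5 * Q, 20)   # norm shrinks like 0.5**k: vanishing
large = backprop_norms(1.5 * Q, 20)   # norm grows like 1.5**k: exploding
```

After 20 steps the first norm is about 0.5**20 and the second about 1.5**20, so a modest change in the scale of the matrix separates a vanishing gradient from an exploding one.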

18 Rough Error Surface
[Figure: cost plotted over the parameters w_1 and w_2.]
The error surface is either very flat or very steep.
Bengio et al., "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. on Neural Networks, [link]
Pascanu et al., "On the difficulty of training recurrent neural networks," in ICML, [link]

19 Possible Solutions: Recurrent Neural Network

20 Exploding Gradient: Clipping
Idea: control the gradient value to avoid exploding.
Parameter setting: clipping thresholds from half to ten times the average gradient norm can still yield convergence.
[Figure: cost surface over w_1 and w_2, showing the clipped gradient.]
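One common variant of clipping rescales the gradient whenever its norm exceeds a threshold, keeping its direction. The sketch below assumes this norm-based variant; the function name and the threshold value are hypothetical.

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale grad so its L2 norm is at most threshold (direction unchanged)."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

# A gradient of norm 5 is scaled down to norm 1; a small gradient is left alone.
clipped = clip_by_norm(np.array([3.0, 4.0]), threshold=1.0)
untouched = clip_by_norm(np.array([0.3, 0.4]), threshold=1.0)
```

In a training loop the clipped gradient replaces the raw gradient in the parameter update.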

21 Vanishing Gradient: Initialization + ReLU
IRNN:
  initialize all W as the identity matrix I
  use ReLU for activation functions
Le et al., "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units," arXiv, [link]
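A minimal sketch of the IRNN recipe: identity recurrent matrix plus ReLU. The slide does not say how the input weights or biases are set; the small Gaussian input weights (`input_scale`) and zero biases below are assumptions, as are the function names.

```python
import numpy as np

def init_irnn(input_size, hidden_size, rng, input_scale=0.001):
    """Identity recurrent weights; small random input weights and zero bias (assumed)."""
    W = np.eye(hidden_size)                                            # recurrent weights = I
    U = rng.normal(scale=input_scale, size=(hidden_size, input_size))  # input weights
    b = np.zeros(hidden_size)
    return U, W, b

def irnn_step(x, s_prev, U, W, b):
    """One recurrent step with ReLU activation."""
    return np.maximum(0.0, W @ s_prev + U @ x + b)

U, W, b = init_irnn(input_size=4, hidden_size=3, rng=np.random.default_rng(0))
s = np.array([0.5, 0.0, 2.0])
s_next = irnn_step(np.zeros(4), s, U, W, b)   # with zero input, the state is copied
```

With this initialization and no input, a non-negative hidden state is carried forward unchanged, so at the start of training the backpropagated error neither shrinks nor grows through the recurrent connection.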


22 Vanishing Gradient: Gating Mechanism
RNN models temporal sequence information and can handle long-term dependencies in theory.
Example: "I grew up in France ... I speak fluent French."
Issue: RNN cannot handle such long-term dependencies in practice due to vanishing gradient.
Solution: apply the gating mechanism to directly encode the long-distance information.

23 Extension: Recurrent Neural Network

24 Bidirectional RNN
h = [\overrightarrow{h}; \overleftarrow{h}] represents (summarizes) the past and future around a single token.
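A minimal NumPy sketch of the concatenation: one RNN reads the sequence left to right, a second reads it right to left, and the two hidden states at each position are joined. The tanh recurrence, the parameter names, and the toy sizes are assumptions for illustration.

```python
import numpy as np

def run_rnn(xs, U, W):
    """Hidden states of a simple tanh RNN over the list xs."""
    s = np.zeros(W.shape[0])
    out = []
    for x in xs:
        s = np.tanh(W @ s + U @ x)
        out.append(s)
    return out

def bidirectional(xs, fwd, bwd):
    """fwd and bwd are (U, W) pairs for the forward and backward RNNs."""
    h_fwd = run_rnn(xs, *fwd)                # summarizes the past of each token
    h_bwd = run_rnn(xs[::-1], *bwd)[::-1]    # summarizes the future of each token
    return [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]

rng = np.random.default_rng(0)
dim, hidden = 2, 3
fwd = (rng.normal(scale=0.5, size=(hidden, dim)), rng.normal(scale=0.5, size=(hidden, hidden)))
bwd = (rng.normal(scale=0.5, size=(hidden, dim)), rng.normal(scale=0.5, size=(hidden, hidden)))
xs = [rng.normal(size=dim) for _ in range(3)]
hs = bidirectional(xs, fwd, bwd)   # one vector of size 2 * hidden per token
```

The backward states are reversed again after the pass so that position t in `hs` pairs the forward summary of tokens up to t with the backward summary of tokens from t onward.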

25 Deep Bidirectional RNN
Each memory layer passes an intermediate representation to the next.

26 Concluding Remarks
Recurrent Neural Networks
  Definition
  Issue: Vanishing/Exploding Gradient
  Solution:
    Exploding Gradient: Clipping
    Vanishing Gradient: Initialization, ReLU, Gated RNNs
Extension
  Bidirectional
  Deep RNN
