Slide credit: Hung-Yi Lee & Richard Socher
Review: Recurrent Neural Network
Recurrent Neural Network
Idea: condition the neural network on all previous words and tie the weights at each time step.
Assumption: temporal information matters.
RNN Language Modeling
[Figure: an RNN language model unrolled over the input "START wreck a nice"; at each step the hidden layer produces an output word probability distribution over the next word: P(next word is "wreck"), P(next word is "a"), P(next word is "nice"), P(next word is "beach").]
Idea: pass the information from the previous hidden layer to leverage all contexts.
RNNLM Formulation
At each time step, given the vector of the current word $x_t$, compute the probability of the next word:
$s_t = \sigma(W_s s_{t-1} + W_x x_t)$
$o_t = \mathrm{softmax}(W_o s_t)$
$P(\text{next word} = v_j \mid x_t, \dots, x_1) = o_{t,j}$
Recurrent Neural Network Definition
$s_t = \sigma(W_s s_{t-1} + W_x x_t)$, $o_t = \mathrm{softmax}(W_o s_t)$
$\sigma$: tanh, ReLU
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
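As a concrete reference, here is a minimal NumPy sketch of one time step under this definition (the weight names W_x, W_s, W_o are ours, matching the formulas above, not from the slides):

```python
import numpy as np

def rnn_step(x_t, s_prev, W_x, W_s, W_o):
    """One RNN time step: new state from current input and previous state."""
    s_t = np.tanh(W_x @ x_t + W_s @ s_prev)              # sigma = tanh (ReLU also possible)
    z = W_o @ s_t
    o_t = np.exp(z - z.max()) / np.exp(z - z.max()).sum()  # softmax over the vocabulary
    return s_t, o_t
```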
Model Training
All model parameters can be updated by gradient-based training on the cost between each predicted output and its target ($y_{t-1}, y_t, y_{t+1}, \dots$).
[Figure: the unrolled RNN, with each predicted output compared against its target.]
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
Outline
- Language Modeling
  - N-gram Language Model
  - Feed-Forward Neural Language Model
  - Recurrent Neural Network Language Model (RNNLM)
- Recurrent Neural Network
  - Definition
  - Training via Backpropagation through Time (BPTT)
  - Training Issue
- Applications
  - Sequential Input
  - Sequential Output
    - Aligned Sequential Pairs (Tagging)
    - Unaligned Sequential Pairs (Seq2Seq/Encoder-Decoder)
Backpropagation (Review)
[Figure: forward pass through layer $l$: activations $a_j^{l-1}$ feed unit $i$ of layer $l$ through weights $w_{ij}^l$; backward pass: error signals $\delta_i^l$ flow in the reverse direction.]
Backward pass: the error signal of layer $l$ is computed from the error signal of the layer above,
$\delta^l = \sigma'(z^l) \odot (W^{l+1})^T \delta^{l+1}$,
starting from $\delta^L = \sigma'(z^L) \odot \nabla_y C$ at the output layer $L$.
[Figure: error signals propagated from the cost $C(y)$ at layer $L$ back through layers $L-1, \dots, l$.]
Backpropagation through Time (BPTT)
Unfold the recurrent network over time. Input: init, $x_1, x_2, \dots, x_t$; output: $o_t$; target: $y_t$.
[Figure: the unrolled network: states $s_1, \dots, s_t$ computed from the inputs and the initial state, with the cost $C(y)$ evaluated on $o_t$ and its error signal propagated back through every time step.]
The unrolled network is a deep feed-forward network in which every time step points to the same memory, i.e., the weights are tied together; each tied weight therefore accumulates the gradients from all of its copies.
BPTT
Forward pass: compute $s_1, s_2, s_3, s_4$ (and the outputs $o_1, \dots, o_4$).
Backward pass: propagate the error signal for $C^{(4)}$, then for $C^{(3)}$, $C^{(2)}$, and $C^{(1)}$, back through the unrolled network.
[Figure: inputs $x_1, \dots, x_4$, states $s_1, \dots, s_4$ starting from init, outputs $o_1, \dots, o_4$, targets $y_1, \dots, y_4$, and per-step costs $C^{(1)}, \dots, C^{(4)}$.]
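A minimal NumPy sketch of this procedure, assuming the tanh/softmax definition above and a per-step cross-entropy cost (the naive variant that re-walks the history once per cost term, mirroring the slide's backward pass for $C^{(4)}, \dots, C^{(1)}$):

```python
import numpy as np

def bptt(xs, ys, s0, W_x, W_s, W_o):
    """Forward pass over all steps, then accumulate the gradient of each
    per-step cross-entropy cost C(t) into the tied weights."""
    T = len(xs)
    ss, os = [s0], []
    for t in range(T):                            # forward: compute s_1, ..., s_T
        ss.append(np.tanh(W_x @ xs[t] + W_s @ ss[-1]))
        z = W_o @ ss[-1]
        os.append(np.exp(z - z.max()) / np.exp(z - z.max()).sum())
    dW_x, dW_s, dW_o = (np.zeros_like(W) for W in (W_x, W_s, W_o))
    for t in reversed(range(T)):                  # backward: for C(T), ..., C(1)
        dz = os[t] - ys[t]                        # softmax + cross-entropy gradient
        dW_o += np.outer(dz, ss[t + 1])
        ds = W_o.T @ dz                           # error signal entering s_t
        for k in range(t, -1, -1):                # propagate back through time
            dpre = ds * (1.0 - ss[k + 1] ** 2)    # through tanh
            dW_x += np.outer(dpre, xs[k])         # tied weights: gradients summed
            dW_s += np.outer(dpre, ss[k])
            ds = W_s.T @ dpre
    return dW_x, dW_s, dW_o
```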
RNN Training Issue
The gradient is a product of Jacobian matrices, each associated with a step in the forward computation. Because backprop multiplies by the same matrix at each time step, the gradient quickly becomes very small or very large: the vanishing or exploding gradient problem.
Bengio et al., "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. on Neural Networks, 1994. [link]
Pascanu et al., "On the difficulty of training recurrent neural networks," in ICML, 2013. [link]
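The effect is easy to reproduce with a toy experiment (the matrix scales 0.5 and 1.5 are arbitrary illustrations of a spectral radius below and above 1):

```python
import numpy as np

rng = np.random.default_rng(0)
for scale in (0.5, 1.5):                      # spectral radius below vs. above 1
    W = scale * np.eye(10) + 0.01 * rng.standard_normal((10, 10))
    delta = np.ones(10)                       # error signal at the last time step
    for _ in range(50):                       # 50 time steps of backprop
        delta = W.T @ delta                   # same Jacobian factor every step
    print(scale, np.linalg.norm(delta))       # tiny (vanishing) vs. huge (exploding)
```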
Rough Error Surface
[Figure: the cost as a function of two weights $w_1, w_2$: a surface with sharp cliffs next to wide plateaus.]
The error surface is either very flat or very steep.
Bengio et al., "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. on Neural Networks, 1994. [link]
Pascanu et al., "On the difficulty of training recurrent neural networks," in ICML, 2013. [link]
Possible Solutions (Recurrent Neural Network)
Exploding Gradient: Clipping
Idea: control (clip) the gradient value to avoid exploding.
[Figure: on the error surface over $w_1, w_2$, the clipped gradient takes a bounded step instead of jumping off the cliff.]
Parameter setting: clipping thresholds from half to ten times the average gradient norm can still yield convergence.
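A minimal sketch of norm clipping (the function name and threshold parameter are illustrative):

```python
import numpy as np

def clip_gradient(grad, threshold):
    """If the gradient norm exceeds the threshold, rescale it:
    the direction is kept, only the magnitude is capped."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad
```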
Vanishing Gradient: Initialization + ReLU
IRNN: initialize the recurrent weight matrix $W$ as the identity matrix $I$ and use ReLU for the activation function.
Le et al., "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units," arXiv, 2015. [link]
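A minimal sketch of the IRNN recipe (the 0.001 scale for the input weights is an assumption for illustration, not taken from the slides):

```python
import numpy as np

def irnn_init(input_size, hidden_size, rng):
    W_s = np.eye(hidden_size)                  # recurrent weights = identity
    W_x = 0.001 * rng.standard_normal((hidden_size, input_size))  # small Gaussian (scale assumed)
    return W_x, W_s

def irnn_step(x_t, s_prev, W_x, W_s):
    return np.maximum(0.0, W_x @ x_t + W_s @ s_prev)   # ReLU activation
```

With identity recurrence and ReLU, a state that stays positive is initially copied forward unchanged, so the gradient neither shrinks nor grows along that path.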
Vanishing Gradient: Gating Mechanism
An RNN models temporal sequence information and can, in theory, handle long-term dependencies such as "I grew up in France … I speak fluent French."
Issue: in practice, an RNN cannot handle such long-term dependencies due to the vanishing gradient.
Solution: apply a gating mechanism to directly encode the long-distance information.
http://colah.github.io/posts/2015-08-understanding-lstms/
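As an illustration of gating, here is a simplified GRU-style update (reset gate omitted; the names are ours, not from the slides). When the gate $z_t$ is near 0, the previous state is copied through almost unchanged, which is what preserves long-distance information:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(x_t, s_prev, W_z, U_z, W_h, U_h):
    """Simplified GRU-style update: the gate decides how much to rewrite."""
    z_t = sigmoid(W_z @ x_t + U_z @ s_prev)        # update gate in (0, 1)
    s_cand = np.tanh(W_h @ x_t + U_h @ s_prev)     # candidate new state
    return (1.0 - z_t) * s_prev + z_t * s_cand     # z_t ~ 0 copies the old state
```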
Extensions (Recurrent Neural Network)
Bidirectional RNN
$h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$ represents (summarizes) the past and future around a single token.
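A minimal sketch of the bidirectional combination (fwd_step and bwd_step stand in for two independently parameterized RNN step functions, such as rnn_step above with separate weights):

```python
import numpy as np

def bidirectional_states(xs, s0, fwd_step, bwd_step):
    """Concatenate a left-to-right and a right-to-left RNN state per token."""
    fwd, s = [], s0
    for x in xs:                          # forward RNN summarizes the past
        s = fwd_step(x, s)
        fwd.append(s)
    bwd, s = [], s0
    for x in reversed(xs):                # backward RNN summarizes the future
        s = bwd_step(x, s)
        bwd.append(s)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```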
Deep Bidirectional RNN
Each memory layer passes an intermediate representation as input to the next layer.
Concluding Remarks
- Recurrent Neural Networks: definition
- Issue: vanishing/exploding gradient
- Solutions:
  - Exploding gradient: clipping
  - Vanishing gradient: initialization, ReLU, gated RNNs
- Extensions: bidirectional, deep RNN