Recurrent Neural Networks 2. CS 287 (Based on Yoav Goldberg's notes)

1 Recurrent Neural Networks 2 CS 287 (Based on Yoav Goldberg's notes)

2 Review: Representation of Sequence
Many tasks in NLP involve sequences $w_1, \ldots, w_n$.
Represent the sequence as a matrix $X$ of dense vectors (following YG, with slight abuse of notation):
$x_1 = x^0_1 W^0, \ldots, x_n = x^0_n W^0$
Would like a fixed-dimensional representation.
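
A minimal sketch of the lookup $x_i = x^0_i W^0$, where $x^0_i$ is a one-hot row vector and $W^0$ is the embedding matrix; the vocabulary size and embedding dimension below are toy values assumed for illustration:

```python
import numpy as np

V, d_in = 10, 4                      # toy vocabulary size and embedding dim (assumed)
rng = np.random.default_rng(0)
W0 = rng.normal(size=(V, d_in))      # embedding matrix W^0

def embed(word_id):
    """x_i = x^0_i W^0, with x^0_i a one-hot row vector (equivalent to a row lookup)."""
    x0 = np.zeros((1, V))
    x0[0, word_id] = 1.0
    return x0 @ W0                   # shape (1, d_in)

sequence = [3, 1, 7]                                 # w_1, ..., w_n as integer ids
X = np.vstack([embed(w) for w in sequence])          # matrix of dense vectors, shape (n, d_in)
```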

3 Review: Sequence Recurrence
Can map from a dense sequence to a dense representation: $x_1, \ldots, x_n \Rightarrow s_1, \ldots, s_n$.
For all $i \in \{1, \ldots, n\}$: $s_i = R(s_{i-1}, x_i; \theta)$, where $\theta$ is shared by all applications of $R$.
Example:
$s_4 = R(s_3, x_4) = R(R(s_2, x_3), x_4) = R(R(R(R(s_0, x_1), x_2), x_3), x_4)$
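
A sketch of this recurrence using the Elman tanh layer that appears later in the lecture (toy dimensions assumed); the point is that the same parameters $\theta = (W^s, W^x, b)$ are reused at every step:

```python
import numpy as np

d_in, d_hid = 4, 8
rng = np.random.default_rng(1)
Ws = rng.normal(scale=0.1, size=(d_hid, d_hid))     # theta = (W^s, W^x, b), shared across steps
Wx = rng.normal(scale=0.1, size=(d_in, d_hid))
b = np.zeros(d_hid)

def R(s_prev, x):
    """One step of the recurrence s_i = R(s_{i-1}, x_i; theta)."""
    return np.tanh(s_prev @ Ws + x @ Wx + b)

def encode(X):
    """Map x_1, ..., x_n to s_n by repeated application of R with the same theta."""
    s = np.zeros(d_hid)                             # s_0
    for x in X:
        s = R(s, x)
    return s                                        # fixed-dimensional representation

X = rng.normal(size=(5, d_in))                      # a length-5 sequence of dense vectors
s_n = encode(X)
```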

4 Review: BPTT (Acceptor)
Run forward propagation. Run backward propagation. Update all weights (shared).
[Figure: unrolled acceptor RNN, $s_0, x_1 \rightarrow s_1, x_2 \rightarrow s_2, x_3 \rightarrow s_3 \rightarrow \ldots \rightarrow s_n \rightarrow \hat{y}$]

14 Review: Issues
Can be inefficient, but batching and GPUs help.
The model is much deeper than previous approaches. This matters a lot; it is the focus of the next class.
The model has a variable size for each sentence, so we have to be a bit more clever in Torch.

15 Quiz
Consider a ReLU version of the Elman RNN with function $R$ defined as $NN(x, s) = \mathrm{ReLU}(s W^s + x W^x + b)$.
We use this RNN with an acceptor architecture over the sequence $x_1, \ldots, x_5$.
Assume we have computed the gradient for the final layer, $\frac{\partial L}{\partial s_5}$.
What is the symbolic gradient of the previous state, $\frac{\partial L}{\partial s_4}$?
What is the symbolic gradient of the first state, $\frac{\partial L}{\partial s_1}$?

16 Answer
$\frac{\partial L}{\partial s_{4,i}} = \sum_j \frac{\partial s_{5,j}}{\partial s_{4,i}} \frac{\partial L}{\partial s_{5,j}}$ (Chain rule)
$\frac{\partial s_{5,j}}{\partial s_{4,i}} = \begin{cases} W^s_{i,j} & s_{5,j} > 0 \\ 0 & \text{o.w.} \end{cases}$ (ReLU gradient rule)
$\frac{\partial L}{\partial s_{4,i}} = \sum_j 1(s_{5,j} > 0)\, W^s_{i,j}\, \frac{\partial L}{\partial s_{5,j}}$ (Indicator notation)

17 Answer
$\frac{\partial L}{\partial s_{1,j_1}} = \sum_{j_2} \frac{\partial s_{2,j_2}}{\partial s_{1,j_1}} \cdots \sum_{j_5} \frac{\partial s_{5,j_5}}{\partial s_{4,j_4}} \frac{\partial L}{\partial s_{5,j_5}}$ (Multiple applications of the chain rule)
$= \sum_{j_2 \ldots j_5} 1(s_{2,j_2} > 0)\, W^s_{j_1,j_2} \cdots 1(s_{5,j_5} > 0)\, W^s_{j_4,j_5}\, \frac{\partial L}{\partial s_{5,j_5}}$ (Combine and multiply)
$= \sum_{j_2 \ldots j_5} \prod_{k=2}^{5} 1(s_{k,j_k} > 0)\, W^s_{j_{k-1},j_k}\, \frac{\partial L}{\partial s_{5,j_5}}$ (Product of weights)
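
A small numeric check of this answer under assumed toy dimensions: the backward pass for the ReLU RNN is just repeated multiplication by $W^s$ with columns masked by the indicators $1(s_k > 0)$:

```python
import numpy as np

d_in, d_hid, T = 3, 5, 5
rng = np.random.default_rng(2)
Ws = rng.normal(scale=0.5, size=(d_hid, d_hid))
Wx = rng.normal(scale=0.5, size=(d_in, d_hid))
b = rng.normal(scale=0.1, size=d_hid)
X = rng.normal(size=(T, d_in))

# Forward pass, saving the states s_1, ..., s_5 (states[0] is s_0).
states = [np.zeros(d_hid)]
for t in range(T):
    states.append(np.maximum(0.0, states[-1] @ Ws + X[t] @ Wx + b))

g5 = rng.normal(size=d_hid)            # stand-in for dL/ds_5 from the final layer

# dL/ds_4: one application of the chain rule with the ReLU indicator.
mask5 = (states[5] > 0).astype(float)
g4 = (Ws * mask5) @ g5                 # g4[i] = sum_j 1(s_{5,j} > 0) W^s_{i,j} dL/ds_{5,j}

# dL/ds_1: product of the masked weight matrices for k = 5, 4, 3, 2.
g = g5
for k in range(5, 1, -1):
    mask_k = (states[k] > 0).astype(float)
    g = (Ws * mask_k) @ g              # multiply by the Jacobian ds_k / ds_{k-1}
g1 = g                                 # dL/ds_1
```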

18 The Promise of RNNs
Hope: learn the long-range interactions of language from data.
For acceptors this means maintaining early state.
Example: "How can you not see this movie?" vs. "You should not see this movie."
The memory interaction here is at $s_1$, but the gradient signal is at $s_n$.

19 Long-Term Gradients
Gradients go through multiplicative layers. This is fine for the later layers, but causes issues for the early layers.
For instance, consider the quiz with hardtanh:
$\sum_{j_2 \ldots j_5} \prod_{k=2}^{5} 1(1 > s_{k,j_k} > 0)\, W^s_{j_{k-1},j_k}\, \frac{\partial L}{\partial s_{5,j_5}}$
The multiplicative effect tends to produce very large (exploding) or very small (vanishing) gradients.
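
A toy illustration (the dimensions and weight scales are assumptions, not from the slides) of how repeated multiplication drives long-range gradients to explode or vanish depending on the scale of the weights:

```python
import numpy as np

d_hid, T = 50, 100
rng = np.random.default_rng(3)

for scale in (0.05, 0.2):                        # small vs. large weight scale (assumed)
    Ws = rng.normal(scale=scale, size=(d_hid, d_hid))
    g = np.ones(d_hid)                           # stand-in for a gradient at the last step
    for _ in range(T):                           # backprop T steps: multiply by W^s each time
        g = Ws @ g
    print(f"scale={scale}: gradient norm after {T} steps = {np.linalg.norm(g):.3e}")
```

With the smaller scale the norm collapses toward zero (vanishing); with the larger one it blows up (exploding).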

20 Exploding Gradients
The easier case; it can be handled heuristically. It occurs when there is no saturation but exponential blowup:
$\sum_{j_2 \ldots j_n} \prod_{k=2}^{n} W^s_{j_{k-1},j_k}$
In these cases, there are reasonable short-term gradients but bad long-term gradients.
Two practical heuristics (sketched below):
gradient clipping, i.e. bounding any gradient by a maximum value;
gradient normalization, i.e. renormalizing the RNN gradients if they are above a fixed norm value.
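
A minimal sketch of the two heuristics; the threshold values are arbitrary assumptions:

```python
import numpy as np

def clip_gradient(grad, max_value=5.0):
    """Gradient clipping: bound every gradient entry by a maximum absolute value."""
    return np.clip(grad, -max_value, max_value)

def renormalize_gradient(grad, max_norm=5.0):
    """Gradient normalization: rescale the whole gradient if its norm exceeds a threshold."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad
```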

21 Vanishing Gradients
Occurs when combining small weights (or from saturation):
$\sum_{j_2 \ldots j_n} \prod_{k=2}^{n} W^s_{j_{k-1},j_k}$
Again, this mainly affects long-term gradients. However, the fix is not as simple as boosting the gradients.

22 LSTM (Hochreiter and Schmidhuber, 1997)

23 LSTM
Formally:
$R(s_{i-1}, x_i) = [c_i, h_i]$
$c_i = j \odot i + f \odot c_{i-1}$
$h_i = \tanh(c_i) \odot o$
$i = \tanh(x W^{xi} + h_{i-1} W^{hi} + b^i)$
$j = \sigma(x W^{xj} + h_{i-1} W^{hj} + b^j)$
$f = \sigma(x W^{xf} + h_{i-1} W^{hf} + b^f)$
$o = \tanh(x W^{xo} + h_{i-1} W^{ho} + b^o)$
(Eeks.)

25 Unreasonable Effectiveness of RNNs

27 PANDARUS: Alas, I think he shall be come approached and the day When little srain would be attain'd into being never fed, And who is but a chain and subjects of his death, I should not sleep.
Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states.
DUKE VINCENTIO: Well, your wit is in the care of side and that.

28 Music Composition composing-music-with-recurrent-neural-networks/

29 Today's Lecture
Build up to LSTMs.
Note: the development is ahistorical; later papers come first.
Main idea: getting around the vanishing gradient issue.

30 Contents
Highway Networks
Gated Recurrent Unit (GRU)
LSTM

31 Review: Deep ReLU Networks
$NN_{layer}(x) = \mathrm{ReLU}(x W^1 + b^1)$
Given an input $x$, we can build arbitrarily deep fully-connected networks.
Deep Neural Network (DNN): $NN_{layer}(NN_{layer}(NN_{layer}(\ldots NN_{layer}(x))))$

32 Deep Networks
$NN_{layer}(x) = \mathrm{ReLU}(x W^1 + b^1)$
[Figure: deep stack $x \rightarrow h_1 \rightarrow h_2 \rightarrow \ldots \rightarrow h_{n-1} \rightarrow h_n$]
Deep networks can have similar issues with vanishing gradients:
$\frac{\partial L}{\partial h_{n-1,j_{n-1}}} = \sum_{j_n} 1(h_{n,j_n} > 0)\, W_{j_{n-1},j_n}\, \frac{\partial L}{\partial h_{n,j_n}}$

34 Review: NLM Skip Connections

35 Thought Experiment: Additive Skip-Connections
$NN_{sl1}(x) = \frac{1}{2}\left(\mathrm{ReLU}(x W^1 + b^1) + x\right)$
[Figure: deep stack $x \rightarrow h_1 \rightarrow h_2 \rightarrow h_3 \rightarrow \ldots \rightarrow h_n$, with skip-connections]

36 Exercise: Gradients
Exercise: What is the gradient $\frac{\partial L}{\partial h_{n-1}}$ with skip-connections? Inductively, what happens?
[Figure: deep stack $x \rightarrow h_1 \rightarrow h_2 \rightarrow h_3 \rightarrow \ldots \rightarrow h_n$, with skip-connections]

37 Exercise
We now have the average of two terms, one with no saturation condition or multiplicative term:
$\frac{\partial L}{\partial h_{n-1,j_{n-1}}} = \frac{1}{2}\left(\sum_{j_n} 1(h_{n,j_n} > 0)\, W_{j_{n-1},j_n}\, \frac{\partial L}{\partial h_{n,j_n}}\right) + \frac{1}{2}\left(\frac{\partial L}{\partial h_{n,j_{n-1}}}\right)$

38 Thought Experiment 2: Dynamic Skip-Connections
$NN_{sl2}(x) = (1 - t)\, \mathrm{ReLU}(x W^1 + b^1) + t\, x$
$t = \sigma(x W^t + b^t)$
$W^1 \in \mathbb{R}^{d_{hid} \times d_{hid}} \quad W^t \in \mathbb{R}^{d_{hid} \times 1}$
[Figure: deep stack $x \rightarrow h_1 \rightarrow h_2 \rightarrow h_3 \rightarrow \ldots \rightarrow h_n$, with gated skip-connections]
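
A sketch of this layer under assumed toy dimensions; note that $t$ is a single scalar per example because $W^t \in \mathbb{R}^{d_{hid} \times 1}$:

```python
import numpy as np

d_hid = 8
rng = np.random.default_rng(4)
W1 = rng.normal(scale=0.3, size=(d_hid, d_hid))
b1 = np.zeros(d_hid)
Wt = rng.normal(scale=0.3, size=(d_hid, 1))      # W^t in R^{d_hid x 1}, so t is a scalar
bt = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_sl2(x):
    """NN_sl2(x) = (1 - t) ReLU(x W^1 + b^1) + t x, with scalar t = sigmoid(x W^t + b^t)."""
    h = np.maximum(0.0, x @ W1 + b1)             # candidate layer
    t = sigmoid(x @ Wt + bt)                     # dynamic update proportion in (0, 1)
    return (1.0 - t) * h + t * x

x = rng.normal(size=d_hid)
out = nn_sl2(x)
```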

39 Dynamic Skip-Connections: Intuition
$NN_{sl2}(x) = (1 - t)\, \mathrm{ReLU}(x W^1 + b^1) + t\, x$
$t = \sigma(x W^t + b^t)$
We are no longer directly computing the next layer $h_i$. Instead we compute a candidate $h$ and a dynamic update proportion $t$.
This seems like a small change from a DNN. Why does it matter?

40 Dynamic Skip-Connections Gradient
$NN_{sl2}(x) = (1 - t)\, \mathrm{ReLU}(x W^1 + b^1) + t\, x$
$t = \sigma(x W^t + b^t)$
The $t$ values are saved on the forward pass. $t$ allows for a direct flow of gradients to earlier layers.
$\frac{\partial L}{\partial h_{n-1,j_{n-1}}} = (1 - t)\left(\sum_{j_n} 1(h_{n,j_n} > 0)\, W_{j_{n-1},j_n}\, \frac{\partial L}{\partial h_{n,j_n}}\right) + t\left(\frac{\partial L}{\partial h_{n,j_{n-1}}}\right) + \ldots$
(What is the $\ldots$?)

41 Dynamic Skip-Connections
$NN_{sl2}(x) = (1 - t)\, \mathrm{ReLU}(x W^1 + b^1) + t\, x$
$t = \sigma(x W^t + b^t)$
$\frac{\partial L}{\partial h_{n-1,j_{n-1}}} = (1 - t)\left(\sum_{j_n} 1(h_{n,j_n} > 0)\, W_{j_{n-1},j_n}\, \frac{\partial L}{\partial h_{n,j_n}}\right) + t\left(\frac{\partial L}{\partial h_{n,j_{n-1}}}\right) + \ldots$
Note: $h$ and $W^t$ are also receiving gradients through the sigmoid! (Backprop is fun.)

42 Highway Network (Srivastava et al., 2015)
Now add a combination at each dimension; $\odot$ is point-wise multiplication.
$NN_{highway}(x) = (1 - t) \odot h + t \odot x$
$h = \mathrm{ReLU}(x W^1 + b^1)$
$t = \sigma(x W^t + b^t)$
$W^t, W^1 \in \mathbb{R}^{d_{hid} \times d_{hid}} \quad b^t, b^1 \in \mathbb{R}^{1 \times d_{hid}}$
$h$: transform (e.g. a standard MLP layer)
$t$: carry (dimension-specific dynamic skipping)
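
A sketch of the highway layer with a per-dimension carry gate, again under assumed toy dimensions:

```python
import numpy as np

d_hid = 8
rng = np.random.default_rng(5)
W1 = rng.normal(scale=0.3, size=(d_hid, d_hid))
b1 = np.zeros(d_hid)
Wt = rng.normal(scale=0.3, size=(d_hid, d_hid))  # W^t is now d_hid x d_hid: one gate per dimension
bt = np.zeros(d_hid)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway(x):
    """NN_highway(x) = (1 - t) . h + t . x point-wise, with h the transform and t the carry."""
    h = np.maximum(0.0, x @ W1 + b1)             # transform (a standard MLP layer)
    t = sigmoid(x @ Wt + bt)                     # carry gate, one value per dimension
    return (1.0 - t) * h + t * x

x = rng.normal(size=d_hid)
out = highway(x)
```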

43 Highway Gradients
The $t$ values are saved on the forward pass. $t_j$ determines the update of dimension $j$.
$\frac{\partial L}{\partial h_{n-1,j_{n-1}}} = \left(\sum_{j_n} (1 - t_{j_n})\, 1(h_{n,j_n} > 0)\, W_{j_{n-1},j_n}\, \frac{\partial L}{\partial h_{n,j_n}}\right) + t_{j_{n-1}}\left(\frac{\partial L}{\partial h_{n,j_{n-1}}}\right) + \ldots$

44 Intuition: Gating
This is known as the gating operation $t \odot x$: it allows the vector $t$ to mask or gate $x$.
True gating would have $t \in \{0, 1\}^{d_{hid}}$.
We approximate this with the sigmoid, $t = \sigma(x W^t + b^t)$.

45 Contents
Highway Networks
Gated Recurrent Unit (GRU)
LSTM

46 Back To Acceptor RNNs
Acceptor RNNs are deep networks with shared weights, built from Elman tanh layers: $NN(x, s) = \tanh(s W^s + x W^x + b)$.
[Figure: unrolled acceptor RNN, $s_1, x_2 \rightarrow s_2, x_3 \rightarrow s_3, x_4 \rightarrow s_4 \rightarrow \ldots \rightarrow s_n$]

47 RNN with Skip-Connections
Can replace the layer with a modified highway layer:
$R(s_{i-1}, x_i) = (1 - t) \odot h + t \odot s_{i-1}$
$h = \tanh(x W^x + s_{i-1} W^s + b)$
$t = \sigma(x W^{xt} + s_{i-1} W^{st} + b^t)$
$W^{xt}, W^x \in \mathbb{R}^{d_{in} \times d_{hid}} \quad W^{st}, W^s \in \mathbb{R}^{d_{hid} \times d_{hid}} \quad b^t, b \in \mathbb{R}^{1 \times d_{hid}}$

48 RNN with Skip-Connections
[Figure: unrolled RNN with skip-connections, $s_0, x_1 \rightarrow s_1, x_2 \rightarrow s_2, x_3 \rightarrow s_3, x_4 \rightarrow s_4 \rightarrow \ldots \rightarrow s_n$]

49 Final Idea: Stopping Flow
For many tasks, it is useful to halt propagation. We can do this by applying a reset gate $r$:
$h = \tanh(x W^x + (r \odot s_{i-1}) W^s + b)$
$r = \sigma(x W^{xr} + s_{i-1} W^{sr} + b^r)$

50 Gated Recurrent Unit (GRU) (Cho et al., 2014)
$R(s_{i-1}, x_i) = (1 - t) \odot h + t \odot s_{i-1}$
$h = \tanh(x W^x + (r \odot s_{i-1}) W^s + b)$
$r = \sigma(x W^{xr} + s_{i-1} W^{sr} + b^r)$
$t = \sigma(x W^{xt} + s_{i-1} W^{st} + b^t)$
$W^{xt}, W^{xr}, W^x \in \mathbb{R}^{d_{in} \times d_{hid}} \quad W^{st}, W^{sr}, W^s \in \mathbb{R}^{d_{hid} \times d_{hid}} \quad b^t, b \in \mathbb{R}^{1 \times d_{hid}}$
$t$: dynamic skip-connections
$r$: reset gating
$s$: hidden state
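
A sketch of one GRU step in the slides' notation (toy dimensions and random weights assumed), run as an acceptor over a short sequence:

```python
import numpy as np

d_in, d_hid = 4, 8
rng = np.random.default_rng(6)

def mat(*shape):
    return rng.normal(scale=0.3, size=shape)

Wx, Ws, b    = mat(d_in, d_hid), mat(d_hid, d_hid), np.zeros(d_hid)
Wxr, Wsr, br = mat(d_in, d_hid), mat(d_hid, d_hid), np.zeros(d_hid)
Wxt, Wst, bt = mat(d_in, d_hid), mat(d_hid, d_hid), np.zeros(d_hid)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(s_prev, x):
    """One GRU step in the slides' notation: r is the reset gate, t the carry gate."""
    r = sigmoid(x @ Wxr + s_prev @ Wsr + br)        # reset gate
    t = sigmoid(x @ Wxt + s_prev @ Wst + bt)        # dynamic skip / carry gate
    h = np.tanh(x @ Wx + (r * s_prev) @ Ws + b)     # candidate state with reset applied
    return (1.0 - t) * h + t * s_prev               # convex combination of new and old state

s = np.zeros(d_hid)                                 # s_0
for x in rng.normal(size=(5, d_in)):                # run the acceptor over a short sequence
    s = gru_step(s, x)
```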

51 Example: Language Modeling
In non-Markovian language modeling, we treat the corpus as one long sequence; $r$ allows resetting the state after a sequence.
consumers may want to move their telephones a little closer to the tv set </s> <unk> <unk> watching abc s monday night football can now vote during <unk> for the greatest play in N years from among four or five <unk> <unk> </s> two weeks ago viewers of several nbc <unk> consumer segments started calling a N number for advice on various <unk> issues </s> and the new syndicated reality show hard copy records viewers opinions for possible airing on the next day s show </s>

52 Contents
Highway Networks
Gated Recurrent Unit (GRU)
LSTM

53 LSTM
The Long Short-Term Memory network uses the same idea.
The model has three gates: input, output, and forget.
It was developed first, but is a bit harder to follow.
It seems to be better at language modeling, and comparable to the GRU on other tasks.

54 LSTM's State
The state $s_i$ is made of two components:
$c_i$: cell
$h_i$: hidden
$R(s_{i-1}, x_i) = [c_i, h_i]$

55 LSTM Development: Input and Forget Gates
The cell $c$ is updated with two separate gates, not a convex combination.
$R(s_{i-1}, x_i) = [c_i, h_i]$
$c_i = j \odot h + f \odot c_{i-1}$
$h = \tanh(x W^{xi} + h_{i-1} W^{hi} + b^i)$
$j = \sigma(x W^{xj} + h_{i-1} W^{hj} + b^j)$
$f = \sigma(x W^{xf} + h_{i-1} W^{hf} + b^f)$
$c_i$: cell
$j$: input gate
$f$: forget gate

56 LSTM Development: Hidden
The hidden state $h_i$ squashes $c_i$.
$R(s_{i-1}, x_i) = [c_i, h_i]$
$h_i = \tanh(c_i)$
$c_i = j \odot h + f \odot c_{i-1}$
$h = \tanh(x W^{xi} + h_{i-1} W^{hi} + b^i)$
$j = \sigma(x W^{xj} + h_{i-1} W^{hj} + b^j)$
$f = \sigma(x W^{xf} + h_{i-1} W^{hf} + b^f)$
$c_i$: cell
$h_i$: hidden
$j$: input gate
$f$: forget gate

57 Long Short-Term Memory
Finally, an output gate $o$ is applied to $h$.
$R(s_{i-1}, x_i) = [c_i, h_i]$
$c_i = j \odot i + f \odot c_{i-1}$
$h_i = \tanh(c_i) \odot o$
$i = \tanh(x W^{xi} + h_{i-1} W^{hi} + b^i)$
$j = \sigma(x W^{xj} + h_{i-1} W^{hj} + b^j)$
$f = \sigma(x W^{xf} + h_{i-1} W^{hf} + b^f)$
$o = \tanh(x W^{xo} + h_{i-1} W^{ho} + b^o)$
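
A sketch of one LSTM step in the slides' notation (toy dimensions assumed); the output gate uses tanh here to match the slide, whereas many implementations use a sigmoid:

```python
import numpy as np

d_in, d_hid = 4, 8
rng = np.random.default_rng(7)

def mat(*shape):
    return rng.normal(scale=0.3, size=shape)

Wxi, Whi, bi = mat(d_in, d_hid), mat(d_hid, d_hid), np.zeros(d_hid)   # candidate i
Wxj, Whj, bj = mat(d_in, d_hid), mat(d_hid, d_hid), np.zeros(d_hid)   # input gate j
Wxf, Whf, bf = mat(d_in, d_hid), mat(d_hid, d_hid), np.zeros(d_hid)   # forget gate f
Wxo, Who, bo = mat(d_in, d_hid), mat(d_hid, d_hid), np.zeros(d_hid)   # output gate o

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(c_prev, h_prev, x):
    """One LSTM step in the slides' notation; the state s_i is the pair [c_i, h_i]."""
    i = np.tanh(x @ Wxi + h_prev @ Whi + bi)        # candidate update
    j = sigmoid(x @ Wxj + h_prev @ Whj + bj)        # input gate
    f = sigmoid(x @ Wxf + h_prev @ Whf + bf)        # forget gate
    o = np.tanh(x @ Wxo + h_prev @ Who + bo)        # output gate (tanh as on the slide;
                                                    # many implementations use a sigmoid here)
    c = j * i + f * c_prev                          # additive cell update
    h = np.tanh(c) * o                              # squashed cell, gated by o
    return c, h

c, h = np.zeros(d_hid), np.zeros(d_hid)             # c_0, h_0
for x in rng.normal(size=(5, d_in)):
    c, h = lstm_step(c, h, x)
```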

58 Output Gate?
The output gate feels the most ad hoc (why tanh?).
Luckily, Jozefowicz et al. (2015) find the output gate is not too important.

60 Accuracy Results (Jozefowicz et al., 2015)
[Figure: accuracy of RNN models on three different tasks; the LSTM models include versions without the forget, input, and output gates.]

61 English LM Results (Jozefowicz et al., 2015)
[Figure: NLL (perplexity) results.]
