Structured Neural Networks (I)


1 Structured Neural Networks (I)
CS 690N, Spring 2018: Advanced Natural Language Processing
Brendan O'Connor
College of Information and Computer Sciences, University of Massachusetts Amherst

2 Structured neural networks?
- How to deal with arbitrary-length inputs? Documents, sentences, long-distance history.
- Build structure directly into network architectures:
  - Convolutional
  - Recurrent
- Dynamic autodiff frameworks make training easy(-ish) (PyTorch, DyNet).

3 Averaging network
Continuous Bag-of-Words:
$\mathrm{CBOW}(f_1, \ldots, f_k) = \frac{1}{k} \sum_{i=1}^{k} v(f_i)$
- Use the averaged representation as input to, e.g., a softmax classifier.
- Example: FastText document classifier (Joulin et al. 2016)
  - Pre-trained word embeddings
  - Bag-of-words, bag-of-ngrams
  - Hashing (ngram embeddings randomly shared)
  - Hierarchical softmax speed trick
  - With >100k sentiment-labeled training docs, performs better than explicit-feature logistic regression
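The CBOW average on this slide can be sketched in a few lines of NumPy. This is an illustrative sketch, not FastText itself: the embedding table `E`, classifier weights `W` and `b`, and all sizes are random stand-ins assumed for the example.

```python
import numpy as np

def cbow(feature_ids, embeddings):
    """Average the embedding vectors v(f_i) of the k input features."""
    return embeddings[feature_ids].mean(axis=0)

def softmax(scores):
    shifted = scores - scores.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
vocab_size, dim, num_classes = 10, 4, 3   # toy sizes (assumed)
E = rng.normal(size=(vocab_size, dim))    # embedding table (random stand-in)
W = rng.normal(size=(dim, num_classes))   # softmax classifier weights (random stand-in)
b = np.zeros(num_classes)

doc_a = np.array([1, 5, 7])   # a three-word document as word ids
doc_b = np.array([7, 1, 5])   # the same words in a different order
rep_a = cbow(doc_a, E)
rep_b = cbow(doc_b, E)
probs = softmax(rep_a @ W + b)  # class distribution for doc_a
```

Because the representation is an average, `rep_a` and `rep_b` coincide: word order is discarded, which is what makes this a bag-of-words model.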

4 Convolutional NN
[Diagram: Yoav Goldberg. Sliding three-word windows over "the quick brown fox jumped over the lazy dog" (the quick brown, quick brown fox, ..., the lazy dog); each window passes through MUL+tanh with weights W (convolution), followed by a max over positions (pooling).]
- Sentence representation independent of sentence length:
  - Sliding window of concatenated word embeddings
  - Feedforward transform, then elementwise max across positions
- The final sentence representation could be used in various ways, e.g. classification (Kim 2014).
- Use joint training.
- Only learns local dependencies (like n-grams).
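A minimal NumPy sketch of the window, transform, and max-pool pipeline described on this slide. The function name `conv_max_pool`, the window size of 3, and the random weights are assumptions for illustration; the sentence is assumed to have at least `window` words.

```python
import numpy as np

def conv_max_pool(X, W, b, window=3):
    """X: (n_words, d_emb) word embeddings; W: (window * d_emb, d_out).

    Returns a (d_out,) sentence vector whose size does not depend on n_words.
    """
    n = X.shape[0]
    # Concatenate the embeddings inside each window of consecutive words.
    windows = np.stack([X[i:i + window].reshape(-1) for i in range(n - window + 1)])
    hidden = np.tanh(windows @ W + b)   # the "MUL+tanh" step at every position
    return hidden.max(axis=0)           # elementwise max across positions

rng = np.random.default_rng(0)
d_emb, d_out, window = 4, 6, 3                    # toy sizes (assumed)
W = rng.normal(size=(window * d_emb, d_out))
b = np.zeros(d_out)
long_sentence = rng.normal(size=(9, d_emb))       # 9 words give 7 windows, as in the diagram
short_sentence = rng.normal(size=(5, d_emb))      # 5 words give 3 windows
rep_long = conv_max_pool(long_sentence, W, b, window)
rep_short = conv_max_pool(short_sentence, W, b, window)
```

Both sentences map to vectors of the same length `d_out`, so a single downstream classifier can consume either.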

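5 Recurrent NN
[Figure 6: Graphical representation of an RNN (unrolled), with states $s_0, \ldots, s_5$, inputs $x_1, \ldots, x_5$ and outputs $y_1, \ldots, y_5$.]
Simple ("vanilla") RNN (Elman 1990):
$s_i = R_{\mathrm{SRNN}}(s_{i-1}, x_i) = g(x_i W^x + s_{i-1} W^s + b)$
$y_i = O_{\mathrm{SRNN}}(s_i) = s_i$
$s_i, y_i \in \mathbb{R}^{d_s}, \quad x_i \in \mathbb{R}^{d_x}, \quad W^x \in \mathbb{R}^{d_x \times d_s}, \quad W^s \in \mathbb{R}^{d_s \times d_s}, \quad b \in \mathbb{R}^{d_s}$
Other local models: LSTM and GRU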
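The simple RNN recurrence can be written directly from the slide's definitions. In this sketch the nonlinearity g is assumed to be tanh, and the dimensions and random parameters are placeholders.

```python
import numpy as np

def srnn_step(s_prev, x, Wx, Ws, b):
    """One step of the simple RNN: s_i = g(x_i W^x + s_{i-1} W^s + b), with g = tanh (assumed)."""
    return np.tanh(x @ Wx + s_prev @ Ws + b)

def run_rnn(xs, s0, Wx, Ws, b):
    """Unroll the recurrence over a sequence; the outputs y_i equal the states s_i."""
    states = []
    s = s0
    for x in xs:
        s = srnn_step(s, x, Wx, Ws, b)
        states.append(s)
    return np.stack(states)

rng = np.random.default_rng(0)
d_x, d_s, n = 2, 3, 5                 # toy sizes (assumed)
Wx = rng.normal(size=(d_x, d_s))
Ws = rng.normal(size=(d_s, d_s))
b = np.zeros(d_s)
xs = rng.normal(size=(n, d_x))        # inputs x_1 .. x_5
s0 = np.zeros(d_s)
states = run_rnn(xs, s0, Wx, Ws, b)   # rows are s_1 .. s_5
```

The same parameters `Wx`, `Ws`, `b` are reused at every position, which is what lets the network accept inputs of any length.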

6 RNN Uses
Acceptor: predict and calculate the loss once, from the final output $y_5$ only.
[Figure 7: Acceptor RNN Training Graph]
Transducer: predict and calculate a loss at every output $y_1, \ldots, y_5$, then sum the losses.
[Figure: Transducer RNN training graph]
[Diagrams: Yoav Goldberg]
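The difference between the acceptor and transducer setups comes down to where the loss is attached. The sketch below assumes a cross-entropy loss over a softmax output layer; the random `states` array stands in for RNN states $s_1, \ldots, s_5$, and the gold labels are made up for illustration.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stabilize before exponentiating
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
n, d_s, n_labels = 5, 4, 3                   # toy sizes (assumed)
states = rng.normal(size=(n, d_s))           # stand-in for RNN states s_1 .. s_5
W_out = rng.normal(size=(d_s, n_labels))     # output layer (random stand-in)
log_probs = log_softmax(states @ W_out)      # shape (n, n_labels)

# Acceptor: one gold label for the whole sequence, predicted from the final state only.
gold_label = 2
acceptor_loss = -log_probs[-1, gold_label]

# Transducer: one gold label per position; the per-step losses are summed.
gold_tags = np.array([0, 1, 1, 2, 0])
step_losses = -log_probs[np.arange(n), gold_tags]
transducer_loss = step_losses.sum()
```

In the acceptor case the gradient reaches earlier timesteps only through the recurrence; in the transducer case every position also receives a direct loss signal.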

7 RNN Uses
Encoder-decoder
[Figure 9: Encoder-Decoder RNN Training Graph. An encoder RNN ($R_E, O_E$) with states $s^e_0, \ldots, s^e_5$ and a decoder RNN ($R_D, O_D$) with states $s^d_0, \ldots, s^d_4$; a loss is predicted and calculated at each output $y_1, \ldots, y_5$, and the losses are summed. Diagram: Yoav Goldberg]

8 Language Modelling: Review
Language models aim to represent the history of observed text $(w_1, \ldots, w_{t-1})$ succinctly in order to predict the next word ($w_t$):
- With count-based n-gram LMs we approximate the history with just the previous n words.
- Neural n-gram LMs embed the same fixed n-gram history in a continuous space and thus capture correlations between histories.
- With Recurrent Neural Network LMs we drop the fixed n-gram history and compress the entire history in a fixed-length vector, enabling long-range correlations to be captured.
[Figure: an RNN LM unrolled over inputs $w_0, \ldots, w_3$ (starting from <s>), with hidden states $h_1, \ldots, h_4$ and a predicted distribution $\hat{p}_1, \ldots, \hat{p}_4$ over the vocabulary at each step; the sampled words are "There", "he", "built", "a".]
[Slide: Phil Blunsom]

9 Capturing Long Range Dependencies
If an RNN language model is to outperform an n-gram model, it must discover and represent long-range dependencies:
$p(\text{sandcastle} \mid \text{Alice went to the beach. There she built a})$
While a simple RNN LM can represent such dependencies in theory, can it learn them?
[Slide: Phil Blunsom]

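10 RNNs: Exploding and Vanishing Gradients
Consider the path of partial derivatives linking a change in $\mathrm{cost}_4$ to changes in $h_1$:
$h_n = g(V[x_n; h_{n-1}] + c)$
$\hat{p}_n = \mathrm{softmax}(W h_n + b)$
[Figure: unrolled RNN with hidden states $h_0, \ldots, h_4$ and predictions $\hat{p}_n$; $\mathrm{cost}_4$ is computed from $\hat{p}_4$ and the word $w_4$.]
[Slide: Phil Blunsom]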

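11 RNNs: Exploding and Vanishing Gradients
Consider the path of partial derivatives linking a change in $\mathrm{cost}_N$ to changes in $h_1$:
$h_n = g(V[x_n; h_{n-1}] + c)$
$\hat{p}_n = \mathrm{softmax}(W h_n + b)$
$\frac{\partial \mathrm{cost}_N}{\partial h_1} = \frac{\partial \mathrm{cost}_N}{\partial \hat{p}_N} \frac{\partial \hat{p}_N}{\partial h_N} \left( \prod_{n \in \{N, \ldots, 2\}} \frac{\partial h_n}{\partial h_{n-1}} \right)$
[Slide: Phil Blunsom]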

12 RNNs: Exploding and Vanishing Gradients
Consider the path of partial derivatives linking a change in $\mathrm{cost}_N$ to changes in $h_1$:
$h_n = g(V[x_n; h_{n-1}] + c)$
$\hat{p}_n = \mathrm{softmax}(W h_n + b)$
$\frac{\partial \mathrm{cost}_N}{\partial h_1} = \frac{\partial \mathrm{cost}_N}{\partial \hat{p}_N} \frac{\partial \hat{p}_N}{\partial h_N} \left( \prod_{n \in \{N, \ldots, 2\}} \frac{\partial h_n}{\partial h_{n-1}} \right)$
[Slide: Phil Blunsom]

13 RNNs: Exploding and Vanishing Gradients
Consider the path of partial derivatives linking a change in $\mathrm{cost}_N$ to changes in $h_1$:
$h_n = g(\underbrace{V_x x_n + V_h h_{n-1} + c}_{z_n})$
[Slide: Phil Blunsom]

14 RNNs: Exploding and Vanishing Gradients
Consider the path of partial derivatives linking a change in $\mathrm{cost}_N$ to changes in $h_1$:
$h_n = g(\underbrace{V_x x_n + V_h h_{n-1} + c}_{z_n})$
$\frac{\partial h_n}{\partial z_n} = \mathrm{diag}(g'(z_n)), \qquad \frac{\partial z_n}{\partial h_{n-1}} = V_h$
[Slide: Phil Blunsom]

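15 RNNs: Exploding and Vanishing Gradients
Consider the path of partial derivatives linking a change in $\mathrm{cost}_N$ to changes in $h_1$:
$h_n = g(\underbrace{V_x x_n + V_h h_{n-1} + c}_{z_n})$
$\frac{\partial h_n}{\partial z_n} = \mathrm{diag}(g'(z_n)), \qquad \frac{\partial z_n}{\partial h_{n-1}} = V_h$
$\frac{\partial h_n}{\partial h_{n-1}} = \frac{\partial h_n}{\partial z_n} \frac{\partial z_n}{\partial h_{n-1}} = \mathrm{diag}(g'(z_n)) \, V_h$
[Slide: Phil Blunsom]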

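16 RNNs: Exploding and Vanishing Gradients
$\frac{\partial \mathrm{cost}_N}{\partial h_1} = \frac{\partial \mathrm{cost}_N}{\partial \hat{p}_N} \frac{\partial \hat{p}_N}{\partial h_N} \left( \prod_{n \in \{N, \ldots, 2\}} \mathrm{diag}(g'(z_n)) \, V_h \right)$
The core of the recurrent product is the repeated multiplication of $V_h$. If the largest eigenvalue of $V_h$ is:
- 1, then the gradient will propagate,
- > 1, the product will grow exponentially (explode),
- < 1, the product shrinks exponentially (vanishes).
[Slide: Phil Blunsom]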
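The three eigenvalue cases can be checked numerically. The sketch below makes two simplifying assumptions that are not on the slide: the diag(g'(z_n)) factors are replaced by the identity, and $V_h$ is taken to be a random symmetric matrix so that its eigenvalues are real and the norm of its powers is governed exactly by the largest absolute eigenvalue.

```python
import numpy as np

def jacobian_product_norm(V_h, steps):
    """Spectral norm of V_h multiplied by itself `steps` times.

    The diag(g'(z_n)) factors are set to the identity here (a simplification).
    """
    product = np.eye(V_h.shape[0])
    for _ in range(steps):
        product = product @ V_h
    return np.linalg.norm(product, 2)

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8))
A = (A + A.T) / 2                                  # symmetric: real eigenvalues
rho = np.max(np.abs(np.linalg.eigvalsh(A)))        # largest absolute eigenvalue

norms = {}
for target in (0.9, 1.0, 1.1):
    V_h = A * (target / rho)                       # rescale so the largest eigenvalue is `target`
    norms[target] = jacobian_product_norm(V_h, steps=50)
    print(target, norms[target])
```

After 50 steps the norm has shrunk by orders of magnitude for 0.9, stays at about 1 for 1.0, and has grown by orders of magnitude for 1.1. With a tanh nonlinearity, $g'(z) \le 1$, so the omitted diagonal factors can only push the product further toward vanishing.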


17 LSTM (Long short-term memory)
Goals:
1. Be able to remember for longer distances
2. Stable backpropagation during training
Augment individual timesteps with a number of specialized vectors and gating functions. (Simpler alternative: GRU. But LSTM is most standard.)
Main state:
- c: Memory cell
- h: Hidden state
Update system:
- g: proposed new values
- f, i, o: Forget, Input, Output gates; they control acceptance of g into the new state
$c_j = c_{j-1} \odot f + g \odot i$
$h_j = \tanh(c_j) \odot o$
$i = \sigma(x_j W^{xi} + h_{j-1} W^{hi})$
$f = \sigma(x_j W^{xf} + h_{j-1} W^{hf})$
$o = \sigma(x_j W^{xo} + h_{j-1} W^{ho})$
$g = \tanh(x_j W^{xg} + h_{j-1} W^{hg})$
Christopher Olah: Understanding LSTM Networks, colah.github.io/posts/understanding-lstms/
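The LSTM update equations on this slide translate almost line for line into NumPy. This sketch follows the slide in omitting bias terms; the parameter dictionary layout (`"Wxi"`, `"Whi"`, ...) and the toy dimensions are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(c_prev, h_prev, x, p):
    """One LSTM timestep following the slide's equations (no bias terms, as on the slide)."""
    i = sigmoid(x @ p["Wxi"] + h_prev @ p["Whi"])   # input gate
    f = sigmoid(x @ p["Wxf"] + h_prev @ p["Whf"])   # forget gate
    o = sigmoid(x @ p["Wxo"] + h_prev @ p["Who"])   # output gate
    g = np.tanh(x @ p["Wxg"] + h_prev @ p["Whg"])   # proposed new values
    c = c_prev * f + g * i                          # keep part of the old cell, admit part of g
    h = np.tanh(c) * o                              # expose a gated view of the cell
    return c, h

rng = np.random.default_rng(0)
d_x, d_h = 3, 4                                     # toy sizes (assumed)
params = {}
for gate in "ifog":
    params["Wx" + gate] = rng.normal(scale=0.5, size=(d_x, d_h))
    params["Wh" + gate] = rng.normal(scale=0.5, size=(d_h, d_h))

c, h = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(6, d_x)):                 # run over a 6-step input sequence
    c, h = lstm_step(c, h, x, params)
```

Note that the cell update is additive: when f is close to 1 and i is close to 0, `c` is carried forward nearly unchanged, which is the mechanism behind the "remember for longer distances" goal.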

18 [LSTM diagram]
- Memory component ("cell"): $c_{j-1} \to c_j$
- Hidden state: $h_{j-1} \to h_j$
- Input: $x_j$
- Legend: main information flow, gating function

19 [LSTM diagram]
- Memory component ("cell"): $c_{j-1} \to c_j$
- Hidden state: $h_{j-1} \to h_j$
- Input: $x_j$
- Proposed state: $g = \tanh(x_j W^{xg} + h_{j-1} W^{hg})$
- Legend: main information flow, gating function


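21 [LSTM diagram]
- Memory component ("cell"): $c_{j-1} \to c_j$
- Hidden state: $h_{j-1} \to h_j$
- Input: $x_j$
- Proposed state: $g = \tanh(x_j W^{xg} + h_{j-1} W^{hg})$
- Gates, each in $[0,1]^D$:
  $i = \sigma(x_j W^{xi} + h_{j-1} W^{hi})$
  $f = \sigma(x_j W^{xf} + h_{j-1} W^{hf})$
  $o = \sigma(x_j W^{xo} + h_{j-1} W^{ho})$
- Legend: main information flow, gating function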

22 [LSTM diagram]
- Memory component ("cell"): $c_j = c_{j-1} \odot f + g \odot i$
- Hidden state: $h_{j-1} \to h_j$
- Input: $x_j$
- Proposed state: $g = \tanh(x_j W^{xg} + h_{j-1} W^{hg})$
- Gates, each in $[0,1]^D$:
  $i = \sigma(x_j W^{xi} + h_{j-1} W^{hi})$
  $f = \sigma(x_j W^{xf} + h_{j-1} W^{hf})$
  $o = \sigma(x_j W^{xo} + h_{j-1} W^{ho})$
- Legend: main information flow, gating function

23 [LSTM diagram]
- Memory component ("cell"): $c_j = c_{j-1} \odot f + g \odot i$
- Hidden state: $h_j = \tanh(c_j) \odot o$
- Input: $x_j$
- Proposed state: $g = \tanh(x_j W^{xg} + h_{j-1} W^{hg})$
- Gates, each in $[0,1]^D$:
  $i = \sigma(x_j W^{xi} + h_{j-1} W^{hi})$
  $f = \sigma(x_j W^{xf} + h_{j-1} W^{hf})$
  $o = \sigma(x_j W^{xo} + h_{j-1} W^{ho})$
- Legend: main information flow, gating function

24 Note many LSTM variants (peephole or not; $c_{i+1}$ uses $(1-f)$ or not).
[Diagram: Gers and Schmidhuber 2001]
LSTMs have a poor reputation for understandability, yet they do something right; they are usually just used as a black box.

25 PANDARUS: Alas, I think he shall be come approached and the day When little srain would be attain'd into being never fed, And who is but a chain and subjects of his death, I should not sleep Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states First, the devishin it son? MONTANO: 'Tis true as full Squellen the rest me, my passacre and nothink my fairs,' done to vision of actious to thy to love, brings gods! THUR: Will comfited our flight offend make thy love; Brothere is oats at on thes:'--why, cross and so her shouldestruck at one their hearina in all go to lives of Costag, To his he tyrant of you our the fill we hath trouble an over me? KING JOHN: Great though I gain; for talk to mine and to the Christ: a right him out

26 Structure awareness

27 LSTMs are used as a generic, sequence-aware model within language modeling, translation, generation, classification and tagging.
Various LSTM-analyzing-text visualizations.
Question: can they learn interactions we know are in natural language?
Thursday: Linzen et al.!
