Better Conditional Language Modeling. Chris Dyer

1 Better Conditional Language Modeling Chris Dyer

2 Conditional LMs
A conditional language model assigns probabilities to sequences of words, $w = (w_1, w_2, \ldots, w_\ell)$, given some conditioning context, $x$.
As with unconditional models, it is again helpful to use the chain rule to decompose this probability:
$$p(w \mid x) = \prod_{t=1}^{\ell} p(w_t \mid x, w_1, w_2, \ldots, w_{t-1})$$
What is the probability of the next word, given the history of previously generated words and conditioning context $x$?
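
As a rough illustration of the chain-rule decomposition, the sketch below scores a sequence by summing per-step log-probabilities. The table `TOY_TABLE` and the function `next_word_probs` are hypothetical stand-ins for a trained model and are not part of the lecture.

```python
import math

# Hypothetical toy model: p(w_t | x, w_<t) is looked up in a hand-written table.
# A real conditional LM would compute this distribution with a neural network.
TOY_TABLE = {
    ("bier", ()): {"beer": 0.9, "</s>": 0.1},
    ("bier", ("beer",)): {"beer": 0.2, "</s>": 0.8},
}

def next_word_probs(x, history):
    """Return the assumed distribution over the next word given x and w_<t."""
    return TOY_TABLE[(x, tuple(history))]

def sequence_log_prob(x, words):
    """log p(w | x) = sum_t log p(w_t | x, w_1, ..., w_{t-1})."""
    total = 0.0
    for t, word in enumerate(words):
        total += math.log(next_word_probs(x, words[:t])[word])
    return total

log_p = sequence_log_prob("bier", ["beer", "</s>"])
```

Working in log space keeps the product of many small probabilities from underflowing.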

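3 Kalchbrenner and Blunsom 2013
Encoder:
$$c = \mathrm{embed}(x), \qquad s = Vc$$
Recurrent decoder:
$$h_t = g(W[h_{t-1}; w_{t-1}] + s + b)$$
$$u_t = P h_t + b'$$
$$p(w_t \mid x, w_{<t}) = \mathrm{softmax}(u_t)$$
Here $h_{t-1}$ is the recurrent connection, $w_{t-1}$ is the embedding of the previous word, $s$ carries the source sentence, and $b$ is a learnt bias.
Recall the unconditional RNN: $h_t = g(W[h_{t-1}; w_{t-1}] + b)$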
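
A minimal sketch of one decoder step, assuming $g = \tanh$ and random toy parameters (neither is specified on the slide). The helper names `matvec`, `softmax`, and `decoder_step` are illustrative only.

```python
import math
import random

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(u):
    top = max(u)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in u]
    total = sum(exps)
    return [e / total for e in exps]

def decoder_step(W, b, P, b_out, s, h_prev, w_prev_emb):
    """h_t = tanh(W[h_{t-1}; w_{t-1}] + s + b); p = softmax(P h_t + b')."""
    concat = h_prev + w_prev_emb  # list concatenation plays the role of [h; w]
    pre = matvec(W, concat)
    h_t = [math.tanh(z + s_k + b_k) for z, s_k, b_k in zip(pre, s, b)]
    u_t = [z + o for z, o in zip(matvec(P, h_t), b_out)]
    return h_t, softmax(u_t)

random.seed(0)
hidden, emb, vocab = 3, 2, 4  # toy sizes, chosen arbitrarily

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

W, P = rand_matrix(hidden, hidden + emb), rand_matrix(vocab, hidden)
b, b_out = [0.0] * hidden, [0.0] * vocab
s = [0.5, -0.5, 0.1]  # stands in for s = Vc, the encoded source sentence
h1, probs = decoder_step(W, b, P, b_out, s, [0.0] * hidden, [1.0, 0.0])
```

The only difference from the unconditional RNN is the extra term `s` added inside the nonlinearity at every step.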

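4 K&B 2013: RNN Decoder
[Figure: the decoder unrolled over four steps, with inputs $x_1, \ldots, x_4$ (starting from <s>), hidden states $h_0, \ldots, h_4$, a softmax at each step, and outputs "tom likes beer </s>". Every step is conditioned on the source encoding $s$.]
$$p(\text{tom} \mid s, \langle s\rangle) \times p(\text{likes} \mid s, \langle s\rangle, \text{tom}) \times p(\text{beer} \mid s, \langle s\rangle, \text{tom}, \text{likes}) \times p(\langle/s\rangle \mid s, \langle s\rangle, \text{tom}, \text{likes}, \text{beer})$$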

5 K&B 2013: Encoder
How should we define $c = \mathrm{embed}(x)$?
The simplest model possible: $c = \sum_i x_i$
[Figure: the word vectors $x_1, \ldots, x_6$ are summed into a single vector.]

6 K&B 2013: Problems
The bag of words assumption is really bad (part 1):
  Alice saw Bob. / Bob saw Alice.
  I would like some fresh bread with aged cheese. / I would like some aged bread with fresh cheese.
We are putting a lot of information inside a single vector (part 2).
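
The first problem can be made concrete with a small sketch: under the additive encoder $c = \sum_i x_i$, two sentences with the same words in a different order receive the same encoding. The embedding values in `EMB` are made up for illustration.

```python
# Hypothetical 2-dimensional word embeddings, chosen by hand for illustration.
EMB = {"Alice": [1.0, 0.0], "saw": [0.0, 1.0], "Bob": [0.5, 0.5]}

def embed_sum(sentence):
    """c = sum_i x_i: add up the embeddings of all words in the sentence."""
    c = [0.0, 0.0]
    for word in sentence.split():
        c = [c_k + x_k for c_k, x_k in zip(c, EMB[word])]
    return c

c1 = embed_sum("Alice saw Bob")
c2 = embed_sum("Bob saw Alice")
```

Both calls return the same vector, so a decoder conditioned on it cannot tell who saw whom.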

7 Sutskever et al. (2014)
LSTM encoder:
  $(c_0, h_0)$ are parameters
  $(c_i, h_i) = \mathrm{LSTM}(x_i, c_{i-1}, h_{i-1})$
  The encoding is $(c_\ell, h_\ell)$ where $\ell = |x|$.
LSTM decoder:
  $w_0 = \langle s\rangle$
  $(c_{t+\ell}, h_{t+\ell}) = \mathrm{LSTM}(w_{t-1}, c_{t+\ell-1}, h_{t+\ell-1})$
  $u_t = P h_{t+\ell} + b$
  $p(w_t \mid x, w_{<t}) = \mathrm{softmax}(u_t)$

8-12 Sutskever et al. (2014)
[Figure, built up over several slides: the encoder reads the source "Aller Anfang ist schwer STOP" and produces the encoding $c$; the decoder, starting from START, then emits "Beginnings are difficult STOP" one word per step.]

13 Sutskever et al. (2014)
Good:
  RNNs deal naturally with sequences of various lengths
  LSTMs in principle can propagate gradients a long distance
  Very simple architecture!
Bad:
  The hidden state has to remember a lot of information!

14-15 Sutskever et al. (2014): Tricks
Read the input sequence "backwards": +4 BLEU
[Figure: the same encoder-decoder diagram, with source "Aller Anfang ist schwer STOP" and output "Beginnings are difficult STOP".]

16 Sutskever et al. (2014): Tricks
Use an ensemble of $J$ independently trained models.
  Ensemble of 2 models: +3 BLEU
  Ensemble of 5 models: +4.5 BLEU
Decoder:
  $(c^{(j)}_{t+\ell}, h^{(j)}_{t+\ell}) = \mathrm{LSTM}^{(j)}(w_{t-1}, c^{(j)}_{t+\ell-1}, h^{(j)}_{t+\ell-1})$
  $u^{(j)}_t = P h^{(j)}_{t+\ell} + b^{(j)}$
  $u_t = \frac{1}{J} \sum_{j'=1}^{J} u^{(j')}_t$
  $p(w_t \mid x, w_{<t}) = \mathrm{softmax}(u_t)$
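
The averaging step can be sketched as follows: the per-model scores $u^{(j)}_t$ are averaged before a single softmax. The logit values are invented for the example, and `ensemble_probs` is an illustrative name.

```python
import math

def softmax(u):
    top = max(u)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in u]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_probs(per_model_logits):
    """Average the logits u_t^(j) over the J models, then apply one softmax."""
    J = len(per_model_logits)
    vocab = len(per_model_logits[0])
    u = [sum(logits[k] for logits in per_model_logits) / J for k in range(vocab)]
    return softmax(u)

# Two hypothetical models that disagree about the first two words.
logits_a = [2.0, 0.0, -1.0]
logits_b = [0.0, 2.0, -1.0]
p = ensemble_probs([logits_a, logits_b])
```

Averaging scores before the softmax is what the slide's equations describe; averaging the output probabilities instead would be a different (also common) choice.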

17 Sutskever et al. (2014): Tricks
Use beam search: +1 BLEU
[Figure: a beam search tree for the source x = "Bier trinke ich" (gloss: beer drink I), expanded over steps $w_0, \ldots, w_3$. Each partial hypothesis is labelled with a log-probability. Node labels recovered from the figure, in extraction order: drink (-6.93), drink (-6.28), <s> (0), beer (-1.82), I (-2.11), I (-5.80), beer (-8.66), like (-7.31), beer (-3.04), drink (-2.87), wine (-5.12).]

18 Sutskever et al. (2014): Tricks Use beam search: +1 BLEU Make the beam really big: -1 BLEU (Koehn and Knowles, 2017)
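
A minimal beam search sketch, under simplifying assumptions: the hypothetical `toy_step` model conditions only on the previous word, and finished hypotheses are compared by raw log-probability with no length normalization. It shows a case where a beam of size 2 finds a better hypothesis than greedy search (beam size 1).

```python
import math

def beam_search(step_fn, beam_size, max_len, bos="<s>", eos="</s>"):
    """Keep the beam_size highest-scoring partial hypotheses at each step."""
    beam = [([bos], 0.0)]  # (words so far, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, score in beam:
            for word, prob in step_fn(words).items():
                candidates.append((words + [word], score + math.log(prob)))
        candidates.sort(key=lambda cand: cand[1], reverse=True)
        beam = []
        for words, score in candidates[:beam_size]:
            # Hypotheses ending in eos leave the beam and are kept aside.
            (finished if words[-1] == eos else beam).append((words, score))
        if not beam:
            break
    return max(finished + beam, key=lambda cand: cand[1])

def toy_step(words):
    """Hypothetical next-word distributions; not taken from the lecture."""
    table = {
        "<s>": {"I": 0.6, "beer": 0.4},
        "I": {"drink": 0.5, "like": 0.5},
        "beer": {"drink": 0.9, "</s>": 0.1},
        "drink": {"</s>": 1.0},
        "like": {"</s>": 1.0},
    }
    return table[words[-1]]

greedy_words, greedy_score = beam_search(toy_step, beam_size=1, max_len=5)
best_words, best_score = beam_search(toy_step, beam_size=2, max_len=5)
```

Greedy search commits to "I" because it is the most probable first word, while the wider beam keeps "beer" alive long enough to discover the higher-probability continuation.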

19-22 Conditioning with vectors
We are compressing a lot of information in a finite-sized vector.
  "You can't cram the meaning of a whole %&!$# sentence into a single $&!#* vector!" (Prof. Ray Mooney)
Gradients have a long way to travel. Even LSTMs forget!
What is to be done?

23 Solving the vector bottleneck
Represent a source sentence as a matrix
Generate a target sentence from a matrix
This will:
  Solve the capacity problem
  Solve the gradient flow problem

24 Sentences as matrices (instead of vectors)
Problem with the fixed-size vector model:
  Sentences are of different sizes but vectors are of the same size
Solution: use matrices instead
  Fixed number of rows, but number of columns depends on the number of words
  Usually $|f| = \#\text{cols}$

25-28 Sentences as matrices
[Figure, built up over several slides: three matrices with the same number of rows and one column per word, for the sentences "Ich möchte ein Bier", "Mach's gut", and "Die Wahrheiten der Menschen sind die unwiderlegbaren Irrtümer".]
Question: How do we build these matrices?

29 Sentences as matrices
With concatenation:
  Each word type is represented by an n-dimensional vector
  Take all of the vectors for the sentence and concatenate them into a matrix
  Simplest possible model
  So simple, no one has bothered to publish how well/badly it works!

30-32 [Figure, built up over several slides: the word vectors $x_1, \ldots, x_4$ for "Ich möchte ein Bier" are placed side by side as the columns of a matrix.]
$f_i = x_i$, $F \in \mathbb{R}^{n \times |f|}$

33 Sentences as matrices
With bidirectional RNNs:
  By far the most widely used matrix representation, due to Bahdanau et al. (2015)
  One column per word
  Each column (word) has two halves concatenated together:
    a "forward representation", i.e., a word and its left context
    a "reverse representation", i.e., a word and its right context
  Implementation: bidirectional RNNs (GRUs or LSTMs) read f from left to right and from right to left; concatenate the representations

34-41 [Figure, built up over several slides: a forward RNN reads the word vectors $x_1, \ldots, x_4$ of "Ich möchte ein Bier" from left to right, producing $\overrightarrow{h}_1, \ldots, \overrightarrow{h}_4$; a reverse RNN reads them from right to left, producing $\overleftarrow{h}_1, \ldots, \overleftarrow{h}_4$. The two states for each word are concatenated into one column of $F$.]
$f_i = [\overleftarrow{h}_i; \overrightarrow{h}_i]$, $F \in \mathbb{R}^{2n \times |f|}$
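
A minimal sketch of building $F$ with a bidirectional encoder. It swaps in a plain tanh RNN for the GRUs or LSTMs named in the lecture, uses random toy weights, and stores $F$ as a list of columns; `rnn_states` and `encode_bidirectional` are illustrative names.

```python
import math
import random

def rnn_states(W, U, inputs):
    """Plain tanh RNN: h_i = tanh(W h_{i-1} + U x_i), starting from h_0 = 0."""
    h = [0.0] * len(W)
    states = []
    for x in inputs:
        h = [math.tanh(sum(w * h_k for w, h_k in zip(W_row, h)) +
                       sum(u * x_k for u, x_k in zip(U_row, x)))
             for W_row, U_row in zip(W, U)]
        states.append(h)
    return states

def encode_bidirectional(fwd, bwd, xs):
    """Return the columns f_i = [reverse state_i ; forward state_i] of F."""
    forward = rnn_states(*fwd, xs)
    backward = rnn_states(*bwd, xs[::-1])[::-1]  # re-align to word order
    return [b + f for b, f in zip(backward, forward)]

rng = random.Random(0)
n, d = 3, 2  # hidden size per direction and embedding size (toy values)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

fwd = (rand_matrix(n, n), rand_matrix(n, d))  # (W, U) of the forward RNN
bwd = (rand_matrix(n, n), rand_matrix(n, d))  # (W, U) of the reverse RNN
xs = rand_matrix(4, d)                        # four word vectors
F_cols = encode_bidirectional(fwd, bwd, xs)
```

Each column has $2n$ entries and the number of columns equals the number of source words, matching $F \in \mathbb{R}^{2n \times |f|}$.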

42 Sentences as matrices
Where are we in 2018?
There are lots of ways to construct $F$:
  More exotic architectures are coming out daily
  Increasingly common goal: get rid of the $O(|f|)$ sequential processing steps (i.e., RNNs) during training
  Syntactic information can help (Sennrich & Haddow, 2016; Nadejde et al., 2017), but many more integration strategies are possible
  Try something with phrase types instead of word types? Multi-word expressions are a pain in the neck.

43 Conditioning on matrices
We have a matrix $F$ representing the input; now we need to generate from it.
Bahdanau et al. (2015) and Luong et al. (2015) concurrently proposed using attention for translating from matrix-encoded sentences.
High-level idea:
  Generate the output sentence word by word using an RNN
  At each output position $t$, the RNN receives two inputs (in addition to any recurrent inputs):
    a fixed-size vector embedding of the previously generated output symbol $e_{t-1}$
    a fixed-size vector encoding a "view" of the input matrix
  How do we get a fixed-size vector from a matrix that changes over time?
    Bahdanau et al.: do a weighted sum of the columns of $F$ (i.e., words) based on how important they are at the current time step (i.e., just a matrix-vector product $F a_t$)
    The weighting of the input columns at each time step ($a_t$) is called attention

44-49 Recall RNNs
[Figure, built up over several slides: an RNN decoder unrolled over time. It emits "I'd", feeds that word back in as the next input, and then emits "like".]

51-64 [Figure, built up over several slides: attention-based decoding of "Ich möchte ein Bier". At each output step $t$ the decoder computes attention weights $a_t$ over the source columns and a context vector $c_t = F a_t$ (shown for $c_1 = F a_1$ and $c_2 = F a_2$), then emits the next word, producing "I'd like a beer STOP". The attention history $a_1^\top, a_2^\top, \ldots, a_5^\top$ is collected one row per output step.]

65 Attention
How do we know what to attend to at each time step?
That is, how do we compute $a_t$?

66-70 Computing attention
At each time step (one time step = one output word), we want to be able to attend to different words in the source sentence.
We need a weight for every column: this is an $|f|$-length vector $a_t$.
Here is a simplified version of Bahdanau et al.'s solution:
  Use an RNN to predict model output; call the hidden states $s_t$ ($s_t$ has a fixed dimensionality, call it $m$).
  At time $t$, compute the expected input embedding (the query key) $r_t = V s_{t-1}$ ($V$ is a learned parameter).
  Take the dot product with every column in the source matrix to compute the attention energy: $u_t = F^\top r_t$ (called $e_t$ in the paper). Since $F$ has $|f|$ columns, $u_t$ has $|f|$ rows.
  Exponentiate and normalize to 1: $a_t = \mathrm{softmax}(u_t)$ (called $\alpha_t$ in the paper).
  Finally, the input source vector for time $t$ is $c_t = F a_t$.
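
The simplified dot-product attention can be sketched in a few lines. $F$ is stored as a list of columns, and the numbers are toy values chosen so the result is easy to check; `attend` is an illustrative name.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(u):
    top = max(u)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in u]
    total = sum(exps)
    return [e / total for e in exps]

def attend(F_cols, V, s_prev):
    """Return (a_t, c_t) for the simplified dot-product attention."""
    r = matvec(V, s_prev)                                   # r_t = V s_{t-1}
    u = [sum(f * q for f, q in zip(col, r)) for col in F_cols]  # u_t = F^T r_t
    a = softmax(u)                                          # a_t = softmax(u_t)
    c = [sum(a_i * col[k] for a_i, col in zip(a, F_cols))
         for k in range(len(F_cols[0]))]                    # c_t = F a_t
    return a, c

F_cols = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three source words
V = [[1.0, 0.0], [0.0, 1.0]]                   # identity, for readability
a, c = attend(F_cols, V, [2.0, 0.0])
```

With these values the query points along the first coordinate, so the first and third columns receive equal weight and the second receives less.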

71-73 Computing attention
Nonlinear additive attention model
In the actual model, Bahdanau et al. replace the dot product between the columns of $F$ and $r_t$ with an MLP:
  $u_t = F^\top r_t$ (simple model)
  $u_t = v^\top \tanh(WF + r_t)$ (Bahdanau et al.)
Here, $W$ and $v$ are learned parameters of appropriate dimension, and $+$ broadcasts over the $|f|$ columns in $WF$.
This can learn more complex interactions.
It is unclear if the added complexity is necessary for good performance.
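
A sketch of the additive (MLP) energies, with the broadcast written out as a loop over columns. The parameter values are toy choices for checking the arithmetic, and `additive_energies` is an illustrative name.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def additive_energies(F_cols, W, v, r):
    """u_t = v^T tanh(WF + r_t), with r_t added to every column of WF."""
    X_cols = [matvec(W, col) for col in F_cols]  # X = WF, column by column
    return [sum(v_k * math.tanh(x_k + r_k) for v_k, x_k, r_k in zip(v, x, r))
            for x in X_cols]

F_cols = [[0.0, 0.0], [1.0, 0.0]]  # two source words
W = [[1.0, 0.0], [0.0, 1.0]]       # identity, for readability
v = [1.0, 1.0]
u = additive_energies(F_cols, W, v, [0.0, 0.0])
```

The energies would then be passed through a softmax exactly as in the dot-product version.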

74-77 Putting it all together
$F = \mathrm{EncodeAsMatrix}(f)$   (Part 1 of lecture)
$e_0 = \langle s\rangle$
$s_0 = w$   (learned initial state; Bahdanau uses $U \overleftarrow{h}_1$)
$t = 0$
$X = WF$   ($WF$ doesn't depend on output decisions, so it is computed once, outside the loop)
while $e_t \neq \langle/s\rangle$:
    $t = t + 1$
    $r_t = V s_{t-1}$
    $u_t = v^\top \tanh(X + r_t)$
    $a_t = \mathrm{softmax}(u_t)$
    $c_t = F a_t$
    $s_t = \mathrm{RNN}(s_{t-1}, [\mathbf{e}_{t-1}; c_t])$   ($\mathbf{e}_{t-1}$ is a learned embedding of $e_{t-1}$)
    $y_t = \mathrm{softmax}(P s_t + b)$   ($P$ and $b$ are learned parameters)
    $e_t \mid e_{<t} \sim \mathrm{Categorical}(y_t)$
The lines computing $r_t$, $u_t$, $a_t$, and $c_t$ compute attention (part 2 of lecture).
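
The decoding loop can be sketched end to end as below. This is a toy under stated assumptions, not the lecture's implementation: the weights are random rather than trained, the recurrent cell is a plain tanh layer, a `max_len` cap is added so the loop always terminates, and the names `decode`, `params`, and `rand_matrix` are illustrative.

```python
import math
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(u):
    top = max(u)
    exps = [math.exp(x - top) for x in u]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def decode(F_cols, params, vocab, rng, max_len=20):
    """Sample an output sequence from a toy attention decoder."""
    W, V, v, R, P, b, E, s0 = params
    X_cols = [matvec(W, col) for col in F_cols]  # X = WF, computed once
    s, prev, output, attention = s0, "<s>", [], []
    while prev != "</s>" and len(output) < max_len:
        r = matvec(V, s)                                     # r_t = V s_{t-1}
        u = [sum(v_k * math.tanh(x_k + r_k) for v_k, x_k, r_k in zip(v, x, r))
             for x in X_cols]                                # u_t = v^T tanh(X + r_t)
        a = softmax(u)                                       # a_t
        c = [sum(a_i * col[k] for a_i, col in zip(a, F_cols))
             for k in range(len(F_cols[0]))]                 # c_t = F a_t
        # Assumed cell: s_t = tanh(R [s_{t-1}; e_{t-1}; c_t])
        s = [math.tanh(z) for z in matvec(R, s + E[prev] + c)]
        y = softmax([z + b_k for z, b_k in zip(matvec(P, s), b)])  # y_t
        prev = rng.choices(vocab, weights=y)[0]              # e_t ~ Categorical(y_t)
        output.append(prev)
        attention.append(a)
    return output, attention

rng = random.Random(0)
vocab = ["I'd", "like", "a", "beer", "</s>"]
d_f, m, k, d_e = 4, 3, 3, 2  # column size of F, state, attention, embedding sizes
params = (
    rand_matrix(rng, k, d_f),                       # W
    rand_matrix(rng, k, m),                         # V
    [rng.uniform(-0.5, 0.5) for _ in range(k)],     # v
    rand_matrix(rng, m, m + d_e + d_f),             # R (assumed RNN weights)
    rand_matrix(rng, len(vocab), m),                # P
    [0.0] * len(vocab),                             # b
    {w: [rng.uniform(-0.5, 0.5) for _ in range(d_e)] for w in vocab + ["<s>"]},
    [0.0] * m,                                      # s_0
)
F_cols = rand_matrix(rng, 4, d_f)  # stands in for EncodeAsMatrix(f), 4 words
output, attention = decode(F_cols, params, vocab, rng)
```

The returned `attention` list is the attention history: one weight vector over the source words per generated output word.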

78 Attention in MT: Results Add attention to seq2seq translation: +11 BLEU

80 A word about gradients

81-83 [Figure: the attention decoder diagram from the earlier slides (source "Ich möchte ein Bier", output "I'd like a beer", attention history $a_1^\top, \ldots, a_5^\top$), shown again for the discussion of gradients.]

84 Attention and translation
Cho's question: does a translator read and memorize the input sentence/document and then generate the output?
  Compressing the entire input sentence into a vector basically says "memorize the sentence"
  Common-sense experience says translators refer back and forth to the input (also backed up by eye-tracking studies)
Should humans be a model for machines?

85 Summary
Attention provides the ability to establish information flow directly from distant positions
Attention is closely related to "pooling" operations in convnets (and other architectures)
The traditional attention model seems to care only about content
  No obvious bias in favor of diagonals, short jumps, fertility, etc.
  Some work has begun to add other "structural" biases (Luong et al., 2015; Cohn et al., 2016), but there are lots more opportunities
  Factorization into keys and values (Miller et al., 2016; Ba et al., 2016; Gulcehre et al., 2016)
Attention weights provide interpretation you can look at

86 Questions?
