Conditional Language modeling with attention

Isaac Richardson
5 years ago

1 Conditional Language modeling with attention Oxford Deep NLP 조수현

2 Review
- Conditional language model: assigns probabilities to a sequence of words given some conditioning context x
- What is the probability of the next word, given the history of previously generated words and the conditioning context x?
- $p(w \mid x) = \prod_{t=1}^{l} p(w_t \mid x, w_1, w_2, \ldots, w_{t-1})$
- $w = (w_1, w_2, \ldots, w_l)$: sequence of words
- x: conditioning context
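
As a minimal sketch of the chain-rule factorization on this slide, the snippet below multiplies per-step conditional probabilities to get the probability of the whole sequence, and also sums their logarithms, which is the usual form in practice. The values in `step_probs` are made-up assumptions for illustration, not outputs of any model.

```python
import math

# Hypothetical values of p(w_t | x, w_1, ..., w_{t-1}) for a three-word
# output sequence. The numbers are invented for illustration only.
step_probs = [0.5, 0.4, 0.25]

# Chain rule: p(w | x) is the product of the per-step conditionals.
seq_prob = math.prod(step_probs)

# Summing log-probabilities gives the same quantity in log space and
# avoids underflow for long sequences.
seq_log_prob = sum(math.log(p) for p in step_probs)
```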

3 Basic language modeling with RNN in machine translation
- Problem 1: a lot of information has to be compressed into a finite-sized vector
- Problem 2: gradients are hard to propagate during learning
- Solution: use an attention mechanism

4 Solving vector problem in translation
Approach:
1. Represent the input sentence as a matrix
2. Generate the target sentence conditioned on that matrix
Benefits:
1. Solves the capacity problem (more capacity to hold longer sentences)
2. Solves the gradient flow problem

5 Solving vector problem in translation
- Sentence 1, Sentence 2 and Sentence 3 have different lengths, so they call for representations of different sizes
- Use a matrix: the number of rows is fixed, the number of columns equals the number of words
- Q: How do we build the matrix?

6 Forming matrix: concatenation (method 1)
- Each word type is represented by an n-dimensional vector
- Take all the vectors for the sentence and concatenate them into a matrix F
- The simplest method: the i-th column of F is the i-th word vector
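
The concatenation method can be sketched in a few lines. This is only an illustration under stated assumptions: the `embeddings` table, its 3-dimensional values, and the helper name `sentence_matrix` are hypothetical and do not come from the slides.

```python
# Hypothetical 3-dimensional word vectors; the values are made up.
embeddings = {
    "the": [0.1, 0.3, -0.2],
    "cat": [0.7, -0.1, 0.4],
    "sleeps": [-0.5, 0.2, 0.9],
}


def sentence_matrix(words, table):
    """Build F as a list of rows; column i of F is the vector of word i."""
    vectors = [table[w] for w in words]  # one n-dimensional vector per word
    n = len(vectors[0])                  # fixed number of rows
    return [[vec[r] for vec in vectors] for r in range(n)]


# The row count stays fixed while the column count follows sentence length.
F_short = sentence_matrix(["the", "cat"], embeddings)
F_long = sentence_matrix(["the", "cat", "sleeps", "the"], embeddings)
```

Both matrices have three rows, while the number of columns matches the number of words in each sentence.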

7 Forming matrix: convolutional nets (method 2)
- Start from the simple concatenated matrix
- Apply convolutional networks to obtain a context-dependent matrix
- Eventually end up with a single fixed-size vector representation
- A Convolutional Encoder Model for Neural Machine Translation (Gehring et al., 2016)

8 Forming matrix: bidirectional RNN (method 3)
[Figure: forward RNN states + backward RNN states, concatenated]

9 Forming matrix: bidirectional RNN (method 3)
- Most widely used matrix representation
- One column per word
- A GRU or LSTM can be used in place of a plain RNN
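
A minimal sketch of the bidirectional encoding idea, assuming a plain tanh RNN cell with weight matrices `W` (hidden to hidden) and `U` (input to hidden). The function names and the zero initial state are assumptions for illustration; a GRU or LSTM cell would replace `rnn_step` in practice.

```python
import math


def rnn_step(W, U, h, x):
    """One plain-RNN update: h_new = tanh(W h + U x)."""
    return [
        math.tanh(
            sum(W[i][k] * h[k] for k in range(len(h)))
            + sum(U[i][k] * x[k] for k in range(len(x)))
        )
        for i in range(len(W))
    ]


def run_rnn(W, U, inputs):
    """Return the hidden state after each input, starting from zeros."""
    h = [0.0] * len(W)
    states = []
    for x in inputs:
        h = rnn_step(W, U, h, x)
        states.append(h)
    return states


def bidirectional_columns(inputs, fwd_params, bwd_params):
    """Concatenate forward and backward states: one column per word."""
    forward = run_rnn(*fwd_params, inputs)
    # Run right-to-left, then flip back so position j lines up with word j.
    backward = run_rnn(*bwd_params, inputs[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]
```

Each returned column has twice the hidden size, because it holds the forward state followed by the backward state for the same word.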

10 Other ways
- Very little systematic work so far; much remains undiscovered
- Use a CNN to learn local/grammatical relationships
- Word embedding + syntactic information
  - Embedding + POS tagging (linguistic information)
- Phrase types rather than word types: multi-word expressions are "a pain in the neck" (= troublesome)

11 Generate sentence with basic RNN model in machine translation
- Predict each word by sampling
- Feed the representation of the sampled word in at the next time step

12 Generate sentence with attention model in machine translation
- The representation vector is generated from F using attention
- Idea: generate the output sentence with an RNN, one word at a time
- At each time step there are 2 inputs:
  - The output from the previous time step (a fixed-size vector embedding)
  - A fixed-size vector encoding a view of the input matrix
- $a_t$ (attention): weighting of the input columns at each time step
- $F a_t$: a weighted sum of the columns of F, based on how important each column is at the current time step

13 Generate sentence with attention model in machine translation
- Start with the start symbol and the encoded sentence
- Compare the hidden state with the columns of the matrix
- This gives the attention weighting

14 Generate sentence with attention model in machine translation
- Build the context vector by multiplying each column by its attention weight and summing
- Feed in the context vector
- Run the RNN as usual and compute the hidden state
- Sample a word from the vocabulary
- Use the sampled word at the next time step

15 Generate sentence with attention model in machine translation
- Use the hidden state $h_1$ to compute attention
- Construct $a_2$: find how much weight to give to each column
- Feed in the context vector

16 Generate sentence with attention model in machine translation
- Continue the process until the stop symbol is generated
- By keeping track of the attention weights over time, we can look at the history of what the model paid attention to when producing a particular output

17 Computing attention
- At each time step, we want to attend to different words in the source sentence, so we need a weight for every column
- $s_i$: hidden state
- $h_j = [\overrightarrow{h}_j^\top ; \overleftarrow{h}_j^\top]^\top$ (the columns of F): summarizes information about the preceding and following words
- $e_{ij} = a(s_{i-1}, h_j)$; in matrix form $u_t = F^\top r_i = F^\top V s_{i-1}$, with $r_i = V s_{i-1}$
  - $e_{ij}$ indicates how important $h_j$ is (un-normalized attention weight)
- $\alpha_{ij}\ (= a_t) = \dfrac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$: normalization
- $c_i\ (= c_t) = \sum_{j=1}^{T_x} \alpha_{ij} h_j$: weighted sum of the $h_j$
- $s_i = f(s_{i-1}, y_{i-1}, c_i)$
- $p(y_i \mid y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)$
- The linear model is simple but does not work well in practice
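
The linear scoring on this slide can be sketched as follows: project the previous decoder state with V, take dot products with every source column, normalize with a softmax, and form the context as the weighted sum of columns. This is a sketch under assumptions: `F_cols` stores F as a list of columns, and the names `softmax` and `dot_attention` are hypothetical helpers, not part of the original slides.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to 1 (shifted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_attention(F_cols, V, s_prev):
    """Return (alpha, c) for source columns F_cols and decoder state s_prev."""
    # r = V s_{i-1}: project the decoder state into the column space.
    r = [sum(V[i][k] * s_prev[k] for k in range(len(s_prev)))
         for i in range(len(V))]
    # e_j = h_j . r: one un-normalized score per source column.
    e = [sum(h[i] * r[i] for i in range(len(r))) for h in F_cols]
    alpha = softmax(e)
    # c = sum_j alpha_j h_j: weighted sum of the columns.
    dim = len(F_cols[0])
    c = [sum(alpha[j] * F_cols[j][i] for j in range(len(F_cols)))
         for i in range(dim)]
    return alpha, c
```

Columns that align well with the projected state receive larger weights and therefore contribute more to the context vector.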

18 Computing attention
- At each time step, we want to attend to different words in the source sentence, so we need a weight for every column
- $s_i$: hidden state
- $h_j = [\overrightarrow{h}_j^\top ; \overleftarrow{h}_j^\top]^\top$ (the columns of F): summarizes information about the preceding and following words
- $e_{ij} = v^\top \tanh(W h_j + r_i)$; in matrix form $u_t = v^\top \tanh(W F + r_i)$, with $r_i = V s_{i-1}$ added to every column
  - $v$, $W$: learned parameters
  - $e_{ij}$ indicates how important $h_j$ is (un-normalized attention weight)
- $\alpha_{ij}\ (= a_t) = \dfrac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$: normalization
- $c_i\ (= c_t) = \sum_{j=1}^{T_x} \alpha_{ij} h_j$: weighted sum of the $h_j$
- $s_i = f(s_{i-1}, y_{i-1}, c_i)$
- $p(y_i \mid y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)$
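
The nonlinear scoring on this slide replaces the dot product with a small one-hidden-layer network. The sketch below computes only the un-normalized scores; the helper name `mlp_scores` and the list-of-columns layout for F are assumptions made for illustration.

```python
import math


def mlp_scores(F_cols, W, v, r):
    """Score each source column h_j as v . tanh(W h_j + r)."""
    scores = []
    for h in F_cols:
        # Hidden layer: W h_j plus the projected decoder state r.
        hidden = [
            math.tanh(sum(W[i][k] * h[k] for k in range(len(h))) + r[i])
            for i in range(len(W))
        ]
        # Output layer: a single scalar score per column.
        scores.append(sum(v[i] * hidden[i] for i in range(len(v))))
    return scores
```

The scores would then be passed through the same softmax normalization and weighted sum as in the linear case.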

19 Putting it all together

20 Putting it all together

21 Putting it all together

22 Model variant
- Compute the attention weights and context vector as a function of the previous hidden state of the RNN
- Do not feed the context vector into the hidden state at time t
- Use information from the context only when deciding what to generate
- More time at test time, less time in training

23 Summary
- Good for interpretability: attention weights provide an interpretation you can look at
- Attention is closely related to pooling in convnets
- Bahdanau's attention model only seems to care about content
- Some work has begun to add other structural biases (Luong et al., 2015)

24 Solving gradient flow problem
- Situation: each column is multiplied by a scalar (its attention weight), i.e. the columns are weighted by the attention mask
- Assumption: a large error on the cross-entropy loss (a problem with the parameters of the model)
- Errors are back-propagated down to the representation of each word
- Words that receive more attention get a stronger gradient
- The connection at each time step is much more direct
- This helps with the forgetting problem of LSTMs
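
A small sketch of why more attended words get a stronger gradient. It assumes the context is $c = \sum_j \alpha_j h_j$ and, as a simplification, treats the attention weights as constants, so the gradient path through the weights themselves is ignored. The helper name `grad_wrt_columns` is hypothetical.

```python
def grad_wrt_columns(alpha, grad_c):
    """Gradient of the loss w.r.t. each column h_j, given dLoss/dc.

    With c = sum_j alpha_j * h_j and alpha held fixed, the chain rule
    gives dLoss/dh_j = alpha_j * dLoss/dc.
    """
    return [[a * g for g in grad_c] for a in alpha]


# Made-up attention weights and upstream gradient for illustration.
grads = grad_wrt_columns([0.7, 0.2, 0.1], [1.0, -2.0])
```

The column with weight 0.7 receives a gradient seven times larger than the column with weight 0.1, and it does so directly, without passing through the recurrent chain.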

25 Image caption generation with attention
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (Xu et al., 2015)
- Move over the whole image and compute representations that are functions of local fields
- The matrix grows one column at a time: $F = [a_1]$, then $F = [a_1\ a_2]$, then $F = [a_1\ a_2\ a_3]$

26 Hard attention vs. soft attention
Soft attention: differentiable (Bahdanau et al., 2014)
- Deterministic
- The attention term and the loss function are differentiable functions of the inputs
- All gradients exist, so standard back-propagation can be used
- $c_t = F a_t$ (weighted average)
Hard attention: not differentiable (Xu et al., 2015)
- The correct answer is not known
- $s_t \sim \mathrm{Categorical}(a_t)$, $c_t = F_{:,s_t}$
- Sample a column: sample N sequences of attention decisions from the model
- Gradient = gradient of the probability of the sequence
- Reinforcement learning: reward function = log probability of the word
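
The difference between the two context computations can be sketched as below. Soft attention averages all columns with weights $a_t$; hard attention samples a single column index from a categorical distribution with those weights. The function names and the list-of-columns layout are assumptions; the sampling uses Python's standard `random.Random.choices`.

```python
import random


def soft_context(F_cols, a):
    """Deterministic context: weighted average of all columns (c = F a)."""
    dim = len(F_cols[0])
    return [sum(a[j] * F_cols[j][i] for j in range(len(a)))
            for i in range(dim)]


def hard_context(F_cols, a, rng):
    """Stochastic context: sample one column index s ~ Categorical(a)."""
    s = rng.choices(range(len(a)), weights=a, k=1)[0]
    return s, F_cols[s]


# Made-up columns and attention weights for illustration.
columns = [[1.0, 0.0], [0.0, 1.0]]
weights = [0.5, 0.5]
c_soft = soft_context(columns, weights)
s, c_hard = hard_context(columns, weights, random.Random(0))
```

The soft context is a smooth function of the weights, whereas the hard context is always exactly one of the columns, which is why it cannot be trained with plain back-propagation.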

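27 Hard attention (continued)
$L = -\log p(w \mid x)$
$\;\; = -\log \sum_{s} p(w, s \mid x)$
$\;\; = -\log \sum_{s} p(s \mid x)\, p(w \mid x, s)$
$\;\; \le -\sum_{s} p(s \mid x) \log p(w \mid x, s)$ (Jensen's inequality)
$\;\; \approx -\frac{1}{N} \sum_{i=1}^{N} \log p(w \mid x, s^{(i)})$, with $s^{(i)} \sim p(s \mid x)$ (MC approximation)
Jensen's inequality: $f\left(\sum_{i=1}^{n} p_i x_i\right) \le \sum_{i=1}^{n} p_i f(x_i)$ for convex $f$ (here $f = -\log$)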
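
The bound and its Monte Carlo estimate can be checked numerically on a toy case. Everything below is an assumption made for illustration: three possible attention decisions, made-up values for $p(s \mid x)$ and $p(w \mid x, s)$, and an estimate formed by sampling decisions from $p(s \mid x)$ and averaging the negative log-probabilities.

```python
import math
import random

# Made-up distributions over three attention decisions s.
p_s = [0.6, 0.3, 0.1]           # p(s | x)
p_w_given_s = [0.5, 0.2, 0.05]  # p(w | x, s)

# Exact loss: -log sum_s p(s | x) p(w | x, s).
exact_loss = -math.log(sum(ps * pw for ps, pw in zip(p_s, p_w_given_s)))

# Jensen upper bound: -sum_s p(s | x) log p(w | x, s).
bound = -sum(ps * math.log(pw) for ps, pw in zip(p_s, p_w_given_s))

# Monte Carlo estimate of the bound from N sampled attention decisions.
rng = random.Random(0)
N = 1000
samples = rng.choices(range(len(p_s)), weights=p_s, k=N)
mc_estimate = -sum(math.log(p_w_given_s[s]) for s in samples) / N
```

The bound is never below the exact loss, and the sampled estimate approaches the bound as N grows.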

28 Result

29 Result
Conclusion:
- Significant performance improvements
- More interpretability
- Better gradient flow
- Better capacity

30 Q&A
