Conditional Language Modeling with Attention 2017.08.25 Oxford Deep NLP 조수현
Review
Conditional language model: assigns probabilities to a sequence of words given some conditioning context x.
What is the probability of the next word, given the history of previously generated words and the conditioning context x?
$p(w \mid x) = \prod_{t=1}^{l} p(w_t \mid x, w_1, w_2, \dots, w_{t-1})$
$w = w_1, w_2, \dots, w_l$ : sequence of words
$x$ : conditioning context
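A minimal sketch of this chain-rule factorization in code (the toy vocabulary, the uniform placeholder distribution, and the function names are illustrative assumptions, not from the lecture):

import numpy as np

def sentence_log_prob(words, x, next_word_prob):
    # Chain rule: log p(w | x) = sum over t of log p(w_t | x, w_1 .. w_{t-1})
    history, total = [], 0.0
    for w in words:
        total += np.log(next_word_prob(x, history, w))   # p(w_t | x, history)
        history.append(w)
    return total

# Toy stand-in distribution: uniform over a 4-word vocabulary (placeholder).
vocab = ["the", "cat", "sat", "</s>"]
uniform = lambda x, history, w: 1.0 / len(vocab)
print(sentence_log_prob(["the", "cat", "sat", "</s>"], x="<source sentence>", next_word_prob=uniform))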
Basic language modeling with an RNN in machine translation
Problem 1: a lot of information must be compressed into a finite-sized vector.
Problem 2: gradient flow through the long sequence is troublesome.
Solution: use an attention mechanism.
Solving the vector problem in translation
1. Represent the input sentence as a matrix
2. Condition on that matrix to generate the target sentence
This solves the capacity problem (more capacity to hold longer sentences) and the gradient flow problem.
Solving the vector problem in translation
Sentences come in different lengths ("Seeeeeeeeentence 1", "Seeeeeentence 2", "Seentence 3"), so a single fixed-size vector must hold very different amounts of information.
Instead, use a matrix: the number of rows is fixed, the number of columns is the number of words.
Q: How do we build this matrix?
Forming the matrix: concatenation (method 1)
Each word type is represented by an n-dimensional vector.
Take all the vectors for the sentence and concatenate them into a matrix.
The simplest method: the i-th column of F is the i-th word vector.
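A minimal sketch of method 1 (the embedding table and the example words are random placeholders, not trained vectors):

import numpy as np

# Hypothetical embedding table: each word type gets an n-dimensional vector.
n = 4
words = ["this", "is", "a", "sentence"]
embeddings = {w: np.random.randn(n) for w in words}

def encode_concat(sentence):
    # Method 1: the i-th column of F is simply the i-th word's embedding vector.
    return np.stack([embeddings[w] for w in sentence], axis=1)   # shape (n, number of words)

F = encode_concat(words)
print(F.shape)   # (4, 4): rows fixed, one column per word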
Forming the matrix: convolutional nets (method 2)
Start from the simple concatenation matrix.
Apply convolutional networks to obtain a context-dependent matrix.
With enough layers this can eventually end up as a single fixed-size vector representation.
A Convolutional Encoder Model for Neural Machine Translation (Gehring et al., 2016)
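A rough sketch of one convolutional layer over the concatenation matrix, assuming a simple width-3 filter with random placeholder weights (this is an illustration of the idea, not the Gehring et al. architecture):

import numpy as np

def conv1d_same(F, W, b):
    # Width-k 1D convolution across the columns of F (same padding): every output
    # column now depends on a local window of neighbouring words.
    n_out, n_in, k = W.shape
    pad = k // 2
    Fp = np.pad(F, ((0, 0), (pad, pad)))
    T = F.shape[1]
    out = np.empty((n_out, T))
    for t in range(T):
        window = Fp[:, t:t + k]                              # local receptive field, shape (n_in, k)
        out[:, t] = np.tensordot(W, window, axes=([1, 2], [0, 1])) + b
    return np.tanh(out)

n, T, n_out, k = 4, 6, 4, 3
F = np.random.randn(n, T)                                    # concatenation matrix from method 1
W, b = 0.1 * np.random.randn(n_out, n, k), np.zeros(n_out)   # random placeholder filter
print(conv1d_same(F, W, b).shape)                            # (4, 6): one column per word, now context-dependent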
Forming the matrix: bidirectional RNN (method 3)
Run a forward RNN and a backward RNN over the sentence, then concatenate their hidden states at each position.
Forming the matrix: bidirectional RNN (method 3)
The most widely used matrix representation.
One column per word.
GRUs or LSTMs can be used instead of a vanilla RNN.
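A minimal sketch of method 3 with a vanilla RNN and random placeholder weights (a real system would use a GRU or LSTM and trained parameters):

import numpy as np

def birnn_encode(X, Wf, Uf, Wb, Ub):
    # Run a forward and a backward RNN over the word vectors and concatenate
    # their hidden states: one column of F per word.
    d, T = X.shape
    h = Wf.shape[0]
    fwd, bwd = np.zeros((h, T)), np.zeros((h, T))
    hf, hb = np.zeros(h), np.zeros(h)
    for t in range(T):                         # left-to-right pass
        hf = np.tanh(Wf @ X[:, t] + Uf @ hf)
        fwd[:, t] = hf
    for t in reversed(range(T)):               # right-to-left pass
        hb = np.tanh(Wb @ X[:, t] + Ub @ hb)
        bwd[:, t] = hb
    return np.concatenate([fwd, bwd], axis=0)  # F has 2h rows and T columns

d, h, T = 4, 5, 6
X = np.random.randn(d, T)                      # word embeddings for one sentence
params = [0.1 * np.random.randn(h, d), 0.1 * np.random.randn(h, h),
          0.1 * np.random.randn(h, d), 0.1 * np.random.randn(h, h)]
print(birnn_encode(X, *params).shape)          # (10, 6)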
Other ways
Very little systematic work; much is still unexplored.
Use a CNN to learn local / grammatical relationships.
Word embeddings + syntactic information, e.g. embeddings + POS tags (linguistic information).
Phrase types rather than word types: multi-word expressions such as "a pain in the neck" (= troublesome) are hard to capture word by word.
Generating a sentence with the basic RNN model in machine translation
Predict each word by sampling, then feed the representation of the sampled word in at the next time step.
Generating a sentence with the attention model in machine translation
The target sentence is generated from the representation matrix F using attention.
Idea: generate the output sentence with an RNN, one word at a time.
At each time step there are two inputs:
- the output from the previous time step (a fixed-size vector embedding)
- a fixed-size vector encoding a "view" of the input matrix
$a_t$ (attention): a weighting of the input columns at each time step
$F a_t$: a weighted sum of the columns of F, based on how important they are at the current time step (see the sketch below)
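A minimal sketch of the weighted-sum view of attention; the scores u_t below are random placeholders, and how they are actually computed is covered on the "Computing attention" slides:

import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

F = np.random.randn(8, 5)        # encoder matrix: 5 source words, 8-dimensional columns
u_t = np.random.randn(5)         # unnormalized scores (placeholder)
a_t = softmax(u_t)               # attention weights: one per input column, sums to 1
c_t = F @ a_t                    # F a_t: a weighted sum of the columns of F
print(a_t.sum(), c_t.shape)      # 1.0 (8,)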
Generating a sentence with the attention model in machine translation
Start with the start symbol and the encoded sentence (the matrix F).
Compare the hidden state with the columns of the matrix to obtain the attention weighting.
Generating a sentence with the attention model in machine translation
Make the context vector by multiplying each column by its weight and summing.
Feed in the context vector, run the RNN as usual, and compute the hidden state.
Sample a word from the vocabulary and use the sampled word at the next time step.
Generating a sentence with the attention model in machine translation
Use the hidden state ($h_1$) to compute the next attention: find how much weight to give to each column, construct $a_2$, and feed in the resulting context vector.
Generating a sentence with the attention model in machine translation
Repeat this process until the stop symbol is generated.
By keeping track of the attention weights over time, we can look at the history of what the model paid attention to when producing a particular output.
Computing attention
At each time step we want to attend to different words in the source sentence, so we need a weight for every column.
$s_i$ : decoder hidden state
$h_j = [\overrightarrow{h}_j^\top ; \overleftarrow{h}_j^\top]^\top$ (the columns of F): summarizes information about the preceding and following words
1. $e_{ij}$ (= $u_t$) $= a(s_{i-1}, h_j)$; with a linear model, $e_{ij} = h_j^\top V s_{i-1}$ (equivalently $u_t = F^\top r_i$ with $r_i = V s_{i-1}$). $e_{ij}$ indicates how important $h_j$ is: the unnormalized attention weight.
2. $\alpha_{ij}$ (= $a_t$) $= \exp(e_{ij}) / \sum_{k=1}^{T_x} \exp(e_{ik})$ : normalization
3. $c_i$ (= $c_t$) $= \sum_{j=1}^{T_x} \alpha_{ij} h_j$ : weighted sum of the $h_j$
$s_i = f(s_{i-1}, y_{i-1}, c_i)$
$p(y_i \mid y_1, \dots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)$
The linear model is simple but does not work well in practice.
Computing attention
Same setup as before, but with a nonlinear (MLP) attention-energy model:
1. $e_{ij}$ (= $u_t$) $= v^\top \tanh(W h_j + r_i)$, with $r_i = V s_{i-1}$; $v$, $W$, $V$ are learned parameters. $e_{ij}$ indicates how important $h_j$ is: the unnormalized attention weight.
2. $\alpha_{ij}$ (= $a_t$) $= \exp(e_{ij}) / \sum_{k=1}^{T_x} \exp(e_{ik})$ : normalization
3. $c_i$ (= $c_t$) $= \sum_{j=1}^{T_x} \alpha_{ij} h_j$ : weighted sum of the $h_j$
$s_i = f(s_{i-1}, y_{i-1}, c_i)$
$p(y_i \mid y_1, \dots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)$
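A minimal sketch of both attention-energy models above, with random placeholder parameters (dimensions and names are illustrative):

import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def linear_energies(F, s_prev, V):
    # Linear model: e_j = h_j^T V s_{i-1}, i.e. u = F^T (V s_{i-1})
    return F.T @ (V @ s_prev)

def mlp_energies(F, s_prev, W, V, v):
    # Nonlinear model: e_j = v^T tanh(W h_j + V s_{i-1})
    r = V @ s_prev
    return np.array([v @ np.tanh(W @ F[:, j] + r) for j in range(F.shape[1])])

dim_h, dim_s, k, T = 8, 6, 7, 5
F = np.random.randn(dim_h, T)              # encoder matrix, one column per source word
s_prev = np.random.randn(dim_s)            # previous decoder hidden state s_{i-1}
V_lin = np.random.randn(dim_h, dim_s)
W, V, v = np.random.randn(k, dim_h), np.random.randn(k, dim_s), np.random.randn(k)

alpha_lin = softmax(linear_energies(F, s_prev, V_lin))
alpha_mlp = softmax(mlp_energies(F, s_prev, W, V, v))
c = F @ alpha_mlp                          # context vector c_i
print(alpha_lin.shape, alpha_mlp.shape, c.shape)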
Putting it all together
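A minimal sketch putting the pieces together: encode, attend, update the decoder state, and pick words until the stop symbol. All parameters are random placeholders and decoding is greedy here rather than by sampling; this illustrates the loop, not a trained translation model.

import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def decode_with_attention(F, E, W_s, U_s, C_s, W_o, V_att, max_len=10, bos=0, eos=1):
    # At each step: compare the decoder state with the columns of F (attention),
    # form the context vector, update the RNN state, and pick the next word.
    hid = W_s.shape[0]
    s = np.zeros(hid)                       # decoder hidden state
    y = bos                                 # start symbol
    output = []
    for _ in range(max_len):
        a = softmax(F.T @ (V_att @ s))      # attention weights over source columns
        c = F @ a                           # context vector c_t = F a_t
        s = np.tanh(W_s @ E[y] + U_s @ s + C_s @ c)       # RNN update, context fed in
        probs = softmax(W_o @ np.concatenate([s, c]))     # distribution over target vocab
        y = int(np.argmax(probs))           # greedy choice (a real system could sample)
        if y == eos:
            break
        output.append(y)
    return output

# Toy sizes and random parameters (placeholders, not trained weights).
dim_f, T, emb, hid, vocab = 8, 5, 6, 7, 12
F = np.random.randn(dim_f, T)
E = np.random.randn(vocab, emb)
W_s, U_s, C_s = [0.1 * np.random.randn(hid, n) for n in (emb, hid, dim_f)]
W_o = 0.1 * np.random.randn(vocab, hid + dim_f)
V_att = 0.1 * np.random.randn(dim_f, hid)
print(decode_with_attention(F, E, W_s, U_s, C_s, W_o, V_att))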
Model variant
Compute the attention weights and the context vector as a function of the previous hidden state of the RNN, and do not feed the context vector into the hidden state at time t.
Information from the context is used only when deciding what to generate.
More time at test, less time in training. (A sketch of one such step follows.)
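A minimal sketch of this variant for a single decoding step, with hypothetical names and random placeholder parameters:

import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def variant_step(F, s, y_emb, W_s, U_s, W_o, V_att):
    # Attention and context come from the *previous* hidden state; the context
    # is used only for the output distribution, not for the state update.
    a = softmax(F.T @ (V_att @ s))                      # weights from previous state
    c = F @ a
    s_new = np.tanh(W_s @ y_emb + U_s @ s)              # note: no context term here
    probs = softmax(W_o @ np.concatenate([s_new, c]))   # context only at prediction
    return s_new, probs

dim_f, T, emb, hid, vocab = 8, 5, 6, 7, 12
F, s, y_emb = np.random.randn(dim_f, T), np.zeros(hid), np.random.randn(emb)
W_s, U_s = 0.1 * np.random.randn(hid, emb), 0.1 * np.random.randn(hid, hid)
W_o, V_att = 0.1 * np.random.randn(vocab, hid + dim_f), 0.1 * np.random.randn(dim_f, hid)
print(variant_step(F, s, y_emb, W_s, U_s, W_o, V_att)[1].shape)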
Summary
Good for interpretability: the attention weights provide an interpretation you can look at.
Attention is closely related to pooling in convnets.
Bahdanau's attention model only seems to care about content; some work has begun to add other structural biases (Luong et al., 2015).
https://medium.com/@ozinkegliyin/six-challenges-for-neural-machine-translation-8a780ead92ab
Solving the gradient flow problem
Situation: each column is multiplied by a scalar (its attention weight) from the attention mask.
Assume there is a large error in the cross-entropy loss (a problem with the model's parameters).
The error is back-propagated down to the representations of the words, with a stronger gradient on the words that received more attention.
This gives a much more direct connection to each time step and helps with the forgetting problem of LSTMs.
Image caption generation with attention
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (Xu et al., 2015)
Move over the whole image and compute representations that are functions of local fields, building up the matrix column by column: $F = [a_1]$, $F = [a_1\ a_2]$, $F = [a_1\ a_2\ a_3]$, ...
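A minimal sketch of how the columns of F can come from local image regions; the spatial grid, feature depth, and random feature map below are placeholders standing in for a convnet's output:

import numpy as np

H, W, D = 4, 4, 8                        # toy spatial grid and feature depth (placeholders)
feature_map = np.random.randn(H, W, D)   # what a convnet might produce for one image
F = feature_map.reshape(H * W, D).T      # one column a_i per spatial location
print(F.shape)                           # (8, 16): attention now selects image regions instead of words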
Hard attention vs. soft attention
Soft attention (Bahdanau et al., 2014): deterministic and differentiable. The attention term and the loss function are differentiable functions of the inputs, so all gradients exist and standard back-propagation can be used. $c_t = F a_t$ (a weighted average).
Hard attention (Xu et al., 2015): not differentiable; we do not know the correct attention decision. $s_t \sim \mathrm{Categorical}(a_t)$, $c_t = F_{:, s_t}$ (sample a column). Sample N sequences of attention decisions from the model; the gradient involves the probability of each sequence. This amounts to reinforcement learning with the log probability of the word as the reward function. (A sketch contrasting the two follows.)
https://stackoverflow.com/questions/35549588/soft-attention-vs-hard-attention
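A minimal sketch of the two context-vector computations side by side (F and a_t are random placeholders):

import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

rng = np.random.default_rng(0)
F = rng.standard_normal((8, 5))          # encoder / annotation matrix
a_t = softmax(rng.standard_normal(5))    # attention distribution over columns

# Soft attention (differentiable): expected context, a weighted average of columns.
c_soft = F @ a_t

# Hard attention (not differentiable): sample one column index and take that column.
s_t = rng.choice(len(a_t), p=a_t)        # s_t ~ Categorical(a_t)
c_hard = F[:, s_t]                       # c_t = F[:, s_t]
print(c_soft.shape, s_t, c_hard.shape)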
Hard attention (continued)
$\mathcal{L} = \log p(w \mid x) = \log \sum_s p(w, s \mid x) = \log \sum_s p(s \mid x)\, p(w \mid x, s)$
$\geq \sum_s p(s \mid x) \log p(w \mid x, s)$   (Jensen's inequality: $f(\sum_i p_i x_i) \geq \sum_i p_i f(x_i)$ for concave $f$)
$\approx \frac{1}{N} \sum_{i=1}^{N} \log p(w \mid x, s_i)$, with $s_i \sim p(s \mid x)$   (MC approximation)
http://suhak.tistory.com/221
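A minimal sketch of the Monte-Carlo estimate of the lower bound above; the three-way attention decision, its distribution, and the fixed log-likelihoods are toy placeholders:

import numpy as np

def mc_lower_bound(sample_s, log_p_w_given, N=100):
    # (1/N) * sum_i log p(w | x, s_i), with s_i ~ p(s | x)
    return np.mean([log_p_w_given(sample_s()) for _ in range(N)])

rng = np.random.default_rng(0)
p_s = np.array([0.2, 0.5, 0.3])            # toy p(s | x) over 3 attention decisions
log_p_w = np.array([-2.0, -1.0, -3.0])     # toy log p(w | x, s)
estimate = mc_lower_bound(lambda: rng.choice(3, p=p_s), lambda s: log_p_w[s], N=1000)
exact_bound = float(p_s @ log_p_w)         # the quantity the MC estimate approximates
print(round(estimate, 3), round(exact_bound, 3))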
Result
Results and conclusion
Significant performance improvements, more interpretability, better gradient flow, and better capacity.
Q&A