CS23: Lecture 8 - Word2Vec applications + Recurrent Neural Networks with Attention
Today's outline

We will learn how to:
- Generalize results with word vectors
- Augment an RNN with attention mechanisms

I. Word Vector Representation
  i. Training
  ii. Operations
  iii. Applications: debiasing / restaurant reviews
II. Attention
  i. Machine Translation
  ii. Image Captioning
Word Vector Representation

Vocabulary: V_eng = ["a", "abs", ..., "football", ..., "zone", "zoo"]

One-hot representation: v_football = [0, ..., 0, 1, 0, ..., 0], with a single 1 at the index of "football" in V_eng.

Word-vector representation: e_football = [0.3, 0.84, ..., 0.57], a dense, low-dimensional vector.

How do you get the vector representation?
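A toy illustration of the two representations, before we turn to training. This is a minimal sketch: the vocabulary is the slide's excerpt and the embedding values/dimension are illustrative.

```python
import numpy as np

# Toy vocabulary excerpted from the slide.
vocab = ["a", "abs", "football", "zone", "zoo"]

def one_hot(word):
    """v_word: all zeros except a 1 at the word's index in the vocabulary."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

v_football = one_hot("football")          # [0. 0. 1. 0. 0.]
e_football = np.array([0.3, 0.84, 0.57])  # dense, low-dimensional word vector
```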
Word Vector Representation: Training

1. Data? Slide a window over raw text, e.g. "Stanford is going to beat Cal next week", and collect pairs of a target word (x) and a nearby word (y): (Stanford, is), (is, Stanford), (is, going), (going, is), (going, to), ...

2. Architecture? Feed the one-hot vector v of the target word through a first layer z^[1] = W^[1] v + b^[1] (this layer holds the word vectors e_"a", e_"abs", ..., e_"zone", e_"zoo"), then a softmax over the vocabulary. For the input "football" the network outputs a probability for every word: p("a" | "football"), p("abs" | "football"), p("abyss" | "football"), ..., p("zebra" | "football"), p("zone" | "football"), p("zoo" | "football").

3. Loss? Cross-entropy: L = -Σ_i y_i log(ŷ_i), where y is the one-hot vector of the nearby word.
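A minimal numpy sketch of this pipeline, assuming a full softmax over the vocabulary (real word2vec implementations speed this up with negative sampling or a hierarchical softmax). All names and sizes here are illustrative.

```python
import numpy as np

sentence = "Stanford is going to beat Cal next week".lower().split()
vocab = sorted(set(sentence))
V = len(vocab)
idx = {w: i for i, w in enumerate(vocab)}

# 1. Data: (target word x, nearby word y) pairs from a +/-1 window.
window = 1
pairs = [(sentence[t], sentence[t + d])
         for t in range(len(sentence))
         for d in range(-window, window + 1)
         if d != 0 and 0 <= t + d < len(sentence)]

# 2. Architecture: one-hot -> embedding (W1) -> scores (W2) -> softmax.
n_e = 10                                    # embedding size (illustrative)
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(n_e, V))   # embedding matrix
W2 = rng.normal(scale=0.1, size=(V, n_e))

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

# 3. Loss: cross-entropy L = -sum_i y_i log(y_hat_i), minimized with SGD.
lr = 0.1
for x_word, y_word in pairs * 100:          # 100 passes over the pairs
    j = idx[x_word]
    e = W1[:, j].copy()                     # embedding lookup (= W1 @ one_hot(x))
    y_hat = softmax(W2 @ e)
    dz = y_hat.copy()
    dz[idx[y_word]] -= 1.0                  # gradient of cross-entropy w.r.t. scores
    dW1_col = W2.T @ dz                     # gradient w.r.t. the embedding column
    W2 -= lr * np.outer(dz, e)
    W1[:, j] -= lr * dW1_col
```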
e "a" e "abs" " e "zone" e "zoo" = e football W [] v football = Word Vector Representation: Embedding matrix
Word Vector Representation: Visualization and Operations

Scatter plot of word vectors (t-SNE projection), where related words cluster together: {dog, cat, lion, zoo, parc}, {Vegas, casino}, {France, Paris, Italia, Rome, baguette}, {goal, ball, football}, {maths, deep-learning, computer-science}.

Operations on vectors: man is to king as woman is to x. Solving e_king - e_man + e_woman ≈ e_x gives x = queen.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean: Distributed Representations of Words and Phrases and their Compositionality
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean: Efficient Estimation of Word Representations in Vector Space
Laurens van der Maaten, Geoffrey Hinton: Visualizing Data using t-SNE
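A sketch of the analogy operation, assuming `E` is a dict mapping words to trained vectors (e.g. loaded from published word2vec or GloVe embeddings); nothing here is specific to one library.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c, E):
    """Find x such that a is to b as c is to x, by nearest cosine neighbor."""
    target = E[b] - E[a] + E[c]
    return max((w for w in E if w not in {a, b, c}),
               key=lambda w: cosine(E[w], target))

# analogy("man", "king", "woman", E)  # -> "queen" with good embeddings
```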
Word Vector Representation: Bias

Sexist bias: man is to king as woman is to queen, but also man - woman ≈ computer programmer - homemaker.

Tolga Bolukbasi et al.: Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings
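A hedged sketch of the "neutralize" step from Bolukbasi et al.: project the component of a gender-neutral word's vector off the bias direction. `E` (word → vector) is assumed; the paper estimates the direction from several gendered pairs, here a single pair for brevity.

```python
import numpy as np

def neutralize(word, E):
    g = E["woman"] - E["man"]     # crude one-pair estimate of the bias direction
    g = g / np.linalg.norm(g)
    e = E[word]
    return e - (e @ g) * g        # remove the component along the bias axis
```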
Word Vector Representation: Application

Sentiment analysis on restaurant reviews.

1. Data? Labeled reviews:
Review (x): "Tonight was awesome" → Label (y): Positive
Review (x): "Worst entrée ever..." → Label (y): Negative

2. Architecture? Look up the word vector of each word in the review (e_"tonight", e_"was", e_"awesome"), AVERAGE them into a single vector e, then apply a sigmoid unit: ŷ = σ(W e + b); predict Positive if ŷ > 0.5 (here ŷ = 0.74). This generalizes very well because of the word vector representations.

3. Loss? L = -[y log(ŷ) + (1 - y) log(1 - ŷ)]
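A sketch of this classifier, assuming `E` maps words to trained vectors and (W, b) are learned logistic-regression parameters; names are illustrative.

```python
import numpy as np

def predict(review, E, W, b):
    words = [w for w in review.lower().split() if w in E]
    e_avg = np.mean([E[w] for w in words], axis=0)    # AVERAGE of word vectors
    return 1.0 / (1.0 + np.exp(-(W @ e_avg + b)))     # sigmoid; Positive if > 0.5

def loss(y, y_hat):
    # Binary cross-entropy: L = -[y log(y_hat) + (1 - y) log(1 - y_hat)]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```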
Word Vector Representation: Application

To remember:
- In NLP, words are often represented by meaningful vectors
- These vectors are trained with a neural network
- We can do operations on these vectors
- They can be biased, depending on the dataset used to train them
- They have great generalization power
Attention: Motivation

Neural Machine Translation (figure): an encoder RNN reads the source sentence "... is in the kitchen <eos>" into hidden states h_1, ..., h_5; a decoder RNN, initialized from the last encoder state, generates the French translation "... dans la cuisine <eos>" ("... in the kitchen") through states s_1, ..., s_6.
Attention: Motivation

(Same encoder-decoder figure, with the source sentence "is in the kitchen <eos>" fed in reverse order.) Inverting the input works better?
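A minimal sketch of the bottleneck discussed on the next slide: a vanilla RNN encoder compresses the whole source sentence into one fixed-length vector. `rnn_step` and all shapes are illustrative, not a specific library API.

```python
import numpy as np

def rnn_step(h, x, Wh, Wx, b):
    return np.tanh(Wh @ h + Wx @ x + b)

def encode(source_vectors, Wh, Wx, b, n_h):
    h = np.zeros(n_h)
    for x in source_vectors:     # produces h_1, ..., h_Tx; only h_Tx is kept
        h = rnn_step(h, x, Wh, Wx, b)
    return h                     # everything the decoder gets to see
```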
Neural Machine Translation with Attention: Problems & Ideas

Some problems:
- The encoder squeezes all the information in the source sentence into a single fixed-length vector
- Yet specific parts of the source sentence seem more useful for predicting specific parts of the output sentence
- Performance degrades on long sentences

Ideas:
- We'd like to spread the information encoded from the source sentence across the encoder states and selectively retrieve the relevant parts at each prediction of the output sentence
- Why don't we use every hidden state h_j from the encoding part?
Attention: Motivation

(Figure: the same encoder-decoder, but each decoder state now receives a weighted combination of all encoder annotations h_1, ..., h_5 rather than only the last one.)

c_1 = Σ_{j=1}^{T_x} α_{1,j} h_j
Attention: Motivation

A context vector c_i = Σ_{j=1}^{T_x} α_{i,j} h_j is generated at each step i during decoding:

- h_4 contains information about the input sentence up to "is", with a stronger focus on the parts closest to "is"
- α_{2,3} = f(s_1, h_3) = the probability that the target word y_2 is "translated from" the source word x_3 (a score passed through a normalizing function)
- c_1 = the expected annotation over all annotations h_j, with probabilities α_{1,j}
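A numpy sketch of this computation: score each annotation h_j against the previous decoder state, normalize the scores into weights α, and take the weighted sum. `score` stands in for the slide's function f.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def context(s_prev, H, score):
    """H has one row per annotation h_j; returns c_i = sum_j alpha_{i,j} h_j."""
    e = np.array([score(s_prev, h) for h in H])   # one score per h_j
    alpha = softmax(e)                            # alpha_{i,j}, sums to 1
    return alpha @ H                              # weighted sum of annotations
```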
Neural Machine Translation with Attention: Architecture

Decoding step 1 (figure): the attention mechanism combines the encoder annotations h_1, ..., h_5 into a context vector c_1, which feeds the first decoder state s_1.

c_i = the expected annotation over all annotations, with probabilities α_{i,j} = f(s_{i-1}, h_j)
Neural Machine Translation with Attention: Architecture

Decoding step 2 (figure): a fresh set of weights α_{2,j} = f(s_1, h_j) is computed from the previous decoder state s_1; the resulting context vector c_2 feeds the decoder state s_2.
Neural Machine Translation with Attention: Architecture

Decoding step 3 (figure): attention over h_1, ..., h_5 with s_2 gives c_3; the decoder state s_3 emits "dans".
Neural Machine Translation with Attention: Architecture

Decoding step 4 (figure): attention with s_3 gives c_4; the decoder state s_4 emits "la".
Neural Machine Translation with Attention: Architecture

How to train this? The full unrolled model (figure): at every decoding step i, an attention mechanism computes c_i from h_1, ..., h_5 and the previous state s_{i-1}, and the decoder produces s_i. Training works the same as for plain machine translation, except that derivatives are also taken with respect to the attention parameters.
Neural Machine Translation with Attention: Training

What are the parameters?

Embedding: W^[1] = embedding matrix

Encoder (LSTM):
f_t = σ(W_f [h_{t-1}, x_t] + b_f)
i_t = σ(W_i [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C [h_{t-1}, x_t] + b_C)
C_t = f_t ∘ C_{t-1} + i_t ∘ C̃_t
o_t = σ(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t ∘ tanh(C_t)

Attention:
c_i = Σ_{j=1}^{T_x} α_{ij} h_j
α_{ij} = exp(e_ij) / Σ_{k=1}^{T_x} exp(e_ik)
e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)

Decoder (LSTM): the same LSTM equations, operating on the decoder states s_t.

Loss function (summed over the batch and output time steps):
L = -log P(y | x; θ) + λ Σ_i (1 - Σ_t α_{t,i})²
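A vectorized sketch of the attention equations above, with the additive score e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j). Shapes are illustrative: H is (T_x, n_h), s_prev is (n_s,), Wa is (n_a, n_s), Ua is (n_a, n_h), va is (n_a,).

```python
import numpy as np

def attention(s_prev, H, Wa, Ua, va):
    # Additive scores e_i1 ... e_iTx, one per encoder annotation.
    e = np.tanh(Wa @ s_prev + (Ua @ H.T).T) @ va
    alpha = np.exp(e - e.max()); alpha /= alpha.sum()   # softmax -> alpha_ij
    c = alpha @ H                                       # context vector c_i
    return c, alpha
```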
Neural Machine Translation with Attention: Training

Visualizing Attention (figure: the matrix of attention weights α_{ij} aligns each generated target word with the source words it attends to).

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate
Image Captioning with Attention

Recap, Neural Machine Translation (figure): an encoder reads "... is in the kitchen <eos>"; a decoder generates "... dans la cuisine <eos>".
Image Captioning with Attention

Image Captioning (figure): the same encoder-decoder pattern, but the encoder input is an image and the decoder generates the caption word by word.
Image Captioning with Attention

Image captioning with no attention (figure): a CNN encodes the whole image into a single vector, which initializes the decoder RNN; the decoder generates the caption "a bird flying over a ...".

Vinyals et al.: Show and Tell: A Neural Image Caption Generator
Image Captioning with Attention

Image captioning with attention (figure): the CNN encoder produces a set of annotation vectors {a_1, a_2, ..., a_L}, one per spatial location of the image. The decoder state is initialized from their average; at each step, an attention mechanism computes a context vector c_i over a_1, ..., a_L before producing s_i and the next caption word, e.g. "A bird is flying ...".

Vinyals et al.: Show and Tell: A Neural Image Caption Generator
Kelvin Xu et al.: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
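A sketch of visual attention in the style of Xu et al.: the CNN yields L annotation vectors (one per spatial location of its last feature map), the decoder is initialized from their average, and the same attention mechanism as in translation re-weights them at every step. All sizes are made up, and `score` again stands in for the scoring function.

```python
import numpy as np

L_locs, n_a = 196, 512             # e.g. a 14x14 feature map with 512 channels
A = np.random.randn(L_locs, n_a)   # {a_1, ..., a_L} from the CNN encoder
s0 = A.mean(axis=0)                # slide: initialize decoding from the average

def visual_context(s_prev, A, score):
    e = np.array([score(s_prev, a) for a in A])         # one score per location
    alpha = np.exp(e - e.max()); alpha /= alpha.sum()   # softmax over locations
    return alpha @ A               # c_i: where the model "looks" at this step
```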