Variational Attention for Sequence-to-Sequence Models

Size: px

Start display at page:

Download "Variational Attention for Sequence-to-Sequence Models"

Theodora Simon
5 years ago
Views:

1 Variational Attention for Sequence-to-Sequence Models Hareesh Bahuleyan, 1 Lili Mou, 1 Olga Vechtomova, Pascal Poupart University of Waterloo March, 2018

2 Outline 1 Introduction 2 Background & Motivation 3 Variational Attention 4 Experiments 5 Conclusion

3 Outline 1 Introduction 2 Background & Motivation 3 Variational Attention 4 Experiments 5 Conclusion

4 Variational Autoencoder (VAE) Kingma and Welling (2013) A combination of (neural) autoencoders and variational inference Compared with traditional variational inference Takes use of neural networks as a powerful density estimator Compared with traditional autoencoders Imposes a probabilistic distribution on latent representations

5 Why VAE? Learns the distribution of data Instead of the MAP sample Implicitly capture the diversity of data Generating samples from scratch Image generation, sentence generation, etc. Controlling the generated samples Not quite feasible in deterministic AE Regularization

6 Variational Encoder-Decoder (VED) Encoder-decoder frameworks Machine translation Dialog systems Summarization VAE VED (Variational encoder-decoder)

7 VAE/VED in NLP Seq2Seq models as encoders & decoders Attention mechanism serving as dynamic alignment However, we observe the bypassing phenomenon

8 Our Proposal Variational attention mechanism Modeling attention probability/vector as random variables Imposing some prior and posterior distributions over attention Experimental results show that the variational space is more effective when combined with variational attention.

9 Outline 1 Introduction 2 Background & Motivation 3 Variational Attention 4 Experiments 5 Conclusion

10 Variational Inference Z Hidden variables Y Observable variables [ { }] log p θ (y (n) p θ (y (n), z) ) = E z qφ (z y (n) ) log q φ (z y (n) ) + KL(q φ (z y (n) ) p(z y)) [ ] E z qφ (z y (n) ) log p θ (y (n) z) ( KL q φ (z y ) p(z)) (n) = L (n) (θ, φ) loss = E[reconstruction loss] + KL divergence

11 Variational Autoencoder loss = E[reconstruction loss] + KL divergence Prior p(z) = N (0, I) Posterior q φ (z y) = N (µ, diag{σ}) where µ, σ = NN(y)

12 Variational Encoder-Decoder Encoder-Decoder: Transforming X to Y Attempt#1: Condition any distribution on X (Zhang et al., 2016; Cao and Clark, 2017; Serban et al., 2017) Doubts: The posterior contains Y Cannot have fine-grained (word/character-level) variational modeling

13 Variational Encoder-Decoder Encoder-Decoder: Transforming X to Y Attempt#2 (Cao and Clark, 2017): Assuming Y is some function of X i.e., Y = Y (X), then q(z y) = q(z Y (x)) = q(z x)

14 Attention Mechanism At each step of decoding Attention probability Attention vector α ji = exp{ α ji } x i =1 exp{ α ji } x a j = α ji h (src) i i=1

15 Attention Mechanism At each step of decoding Attention probability Attention vector α ji = exp{ α ji } x i =1 exp{ α ji } x a j = α ji h (src) i i=1 (Bypassing phenomenon)

16 Bypassing Phenomenon A deterministic connection bypasses variational space Hidden state initialization: Input: the men are playing musical instruments (a) VAE w/o hidden state init. (Avg entropy: 2.52) the men are playing musical instruments the men are playing video games the musicians are playing musical instruments the women are playing musical instruments (b) VAE w/ hidden state init. (Avg entropy: 2.01) the men are playing musical instruments the men are playing musical instruments the men are playing musical instruments the man is playing musical instruments

17 Outline 1 Introduction 2 Background & Motivation 3 Variational Attention 4 Experiments 5 Conclusion

18 Intuition General idea: Model attention as random variables

19 Lower Bound L (n) j (θ, φ) [ ] ( ) =E z,a qφ (z,a x (n) ) log p θ (y (n) z, a) KL q φ (z, a x (n) ) p(z, a) [ ] =E (z) z q φ (z x(n) ),a q (a) log p φ (a x(n) ) θ (y (n) z, a) ( ) ( ) KL q (z) φ (z x(n) ) p(z) KL q (a) φ (a x(n) ) p(a)

20 Prior Standard norm: p(a j ) = N (0, I) Norm centered at h (src) : where h (src) = 1 x x i=1 h(src) i p(a j ) = N ( h (src), I)

21 Posterior Attention probability Attention vector α ji = exp{ α ji } x i =1 exp{ α ji } a (det) j = Posterior: N (µ aj, σ aj ), where x i=1 α ji h (src) i µ aj a det j, σ aj = NN(a det j )

22 Training Objective J (n) (θ, φ) = J rec (θ, φ, y (n) ) ( ) + λ KL [KL q (z) φ (z) p(z) y ( ) ] + γ a KL q (a) φ (a j) p(a j ) j=1

23 Geometric Interpretation (a) (b) (c) (d)

24 Outline 1 Introduction 2 Background & Motivation 3 Variational Attention 4 Experiments 5 Conclusion

25 Task Question generation (Du et al., 2017) Given some information (a sentence), to generate a related question Dataset from the Stanford Question Answering Dataset (Rajpurkar et al., 2016, SQuAD)

26 Settings KL-annealing: logistic schema Word dropout: 25% Hyperparaemters tuned on VAE and adopted to all VED models

27 Overall Performance Model Inference BLEU-1 BLEU-2 BLEU-3 BLEU-4 Entropy Dist-1 Dist-2 DED (w/o Attn) Du et al. (2017) MAP DED (w/o Attn) MAP DED+DAttn MAP VED+DAttn, VED+DAttn (2-stage training) VED+VAttn-0 VED+VAttn- h MAP Sampling MAP Sampling MAP Sampling MAP Sampling

28 Learning Curves.

29 Effect of γ a J (n) (θ, φ) = J rec (θ, φ, y (n) ) ( ) + λ KL [KL q (z) φ (z) p(z) +γ a y j=1 ( )] KL q (a) φ (a j) p(a j ).

30 Case Study Source when the british forces evacuated at the close of the war in 1783, they transported 3,000 freedmen for resettlement in nova scotia. Reference in what year did the american revolutionary war end? VED+DAttn VED+Vattn- h how many people evacuated in newfoundland? how many people evacuated in newfoundland? what did the british forces seize in the war? how many people lived in nova scotia? where did the british forces retreat? when did the british forces leave the war? Source downstream, more than 200,000 people were evacuated from mianyang by june 1 in anticipation of the dam bursting. Reference how many people were evacuated downstream? VED+DAttn VED+VAttn- h how many people evacuated from the mianyang basin? how many people evacuated from the mianyang basin? how many people evacuated from the mianyang basin? how many people evacuated from the tunnel? how many people evacuated from the dam? how many people were evacuated from fort in the dam?

31 Outline 1 Introduction 2 Background & Motivation 3 Variational Attention 4 Experiments 5 Conclusion

32 Conclusion Our work Address the bypassing phenomenon Propose a variational attention Future work Probabilistic modeling of attention probability Lesson learned Design philosophy of VAE/VED

33 References Kris Cao and Stephen Clark Latent variable dialogue models and their diversity. In EACL. pages Xinya Du, Junru Shao, and Claire Cardie Learning to ask: Neural question generation for reading comprehension. In ACL. pages Diederik P Kingma and Max Welling Auto-encoding variational Bayes. arxiv preprint arxiv: Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP. pages Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C Courville, and Yoshua Bengio A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI. pages Biao Zhang, Deyi Xiong, jinsong su, Qun Liu, Rongrong Ji, Hong Duan, and Min Zhang Variational neural discourse relation recognizer. In EMNLP. pages

34 Thanks for listening Question?

Leveraging Context Information for Natural Question Generation

Leveraging Context Information for Natural Question Generation Linfeng Song 1, Zhiguo Wang 2, Wael Hamza 2, Yue Zhang 3 and Daniel Gildea 1 1 Department of Computer Science, University of Rochester, Rochester,