Deep Latent-Variable Models of Natural Language

1 Deep Latent-Variable Models of Natural Language Yoon Kim, Sam Wiseman, Alexander Rush Tutorial

4 Goal of Latent-Variable Modeling Probabilistic models provide a declarative language for specifying prior knowledge and structural relationships in the context of unknown variables. Makes it easy to specify: known interactions in the data, uncertainty about unknown factors, constraints on model properties. 4/153

6 Latent-Variable Modeling in NLP Long and rich history of latent-variable models of natural language. Major successes include, among many others: statistical alignment for translation, document clustering and topic modeling, unsupervised part-of-speech tagging and parsing. 5/153

7 Goals of Deep Learning Toolbox of methods for learning rich, non-linear data representations through numerical optimization. Makes it easy to fit: highly-flexible predictive models, transferable feature representations, structurally-aligned network architectures. 6/153

9 Deep Learning in NLP Current dominant paradigm for NLP. Major successes include, among many others: text classification, neural machine translation, NLU tasks (QA, NLI, etc.). 7/153

10 Tutorial: Deep Latent-Variable Models for NLP How should a contemporary ML/NLP researcher reason about latent variables? What unique challenges come from modeling text with latent variables? What techniques have been explored and shown to be effective in recent papers? We explore these through the lens of variational inference. 8/153

12 Tutorial Take-Aways Goals Background 1 A collection of deep latent-variable models for NLP 2 An understanding of a variational objective 3 A toolkit of algorithms for optimization 4 A formal guide to advanced techniques 5 A survey of example applications 6 Code samples and techniques for practical use 9/153

13 Tutorial Non-Goals Not covered (for time, not relevance): many classical latent-variable approaches; undirected graphical models such as MRFs; non-likelihood-based models such as GANs; sampling-based inference such as MCMC; details of deep learning architectures. 10/153

15 What are deep networks? Deep networks are parameterized non-linear functions; they transform an input z into features h using parameters π. Important examples: the multilayer perceptron, h = MLP(z; π) = V σ(W z + b) + a, with π = {V, W, a, b}; and the recurrent neural network, which maps a sequence of inputs z_{1:T} into a sequence of features h_{1:T}, h_t = RNN(h_{t-1}, z_t; π) = σ(U z_t + V h_{t-1} + b), with π = {V, U, b}. 12/153
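
A minimal PyTorch sketch of these two building blocks (the sizes and parameter values below are illustrative, not from the tutorial):

import torch

def mlp(z, V, W, a, b):
    # h = V sigma(W z + b) + a
    return V @ torch.sigmoid(W @ z + b) + a

def rnn_step(h_prev, z_t, U, V, b):
    # h_t = sigma(U z_t + V h_{t-1} + b)
    return torch.sigmoid(U @ z_t + V @ h_prev + b)

d_in, d_hid = 5, 4
W, V = torch.randn(d_hid, d_in), torch.randn(d_hid, d_hid)
a, b = torch.zeros(d_hid), torch.zeros(d_hid)
h = mlp(torch.randn(d_in), V, W, a, b)

U = torch.randn(d_hid, d_in)
h_t = rnn_step(torch.zeros(d_hid), torch.randn(d_in), U, V, b)
print(h.shape, h_t.shape)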

17 What are latent variable models? Latent variable models give us a joint distribution p(x, z; θ), where x is our observed data, z is a collection of latent variables, and θ are the deterministic parameters of the model, such as the neural network parameters. Data consists of N i.i.d. samples, p(x^{(1:N)}, z^{(1:N)}; θ) = ∏_{n=1}^N p(x^{(n)} | z^{(n)}; θ) p(z^{(n)}; θ). 13/153

20 Probabilistic Graphical Models A directed PGM shows the conditional independence structure. By the chain rule, a latent variable model over observations can be represented as p(x^{(1:N)}, z^{(1:N)}; θ) = ∏_{n=1}^N p(x^{(n)} | z^{(n)}; θ) p(z^{(n)}; θ). Specific models may factor further. [Plate diagram: θ → z^{(n)} → x^{(n)}, plate over N] 14/153

21 Posterior For models p(x, z; θ), we'll be interested in the posterior over latent variables z: p(z | x; θ) = p(x, z; θ) / p(x; θ). Why? z will often represent interesting information about our data (e.g., the cluster x^{(n)} lives in, or how similar x^{(n)} and x^{(n+1)} are). Learning the parameters θ of the model often requires calculating posteriors as a subroutine. Intuition: if I know a likely z^{(n)} for x^{(n)}, I can learn by maximizing p(x^{(n)} | z^{(n)}; θ). 15/153

23 Problem Statement: Two Views Deep & LV are naturally complementary: rich function approximators with modular parts; declarative methods for specifying model constraints. Deep & LV are frustratingly incompatible: deep networks make posterior inference intractable; latent variable objectives complicate backpropagation. 16/153

26 A Language Model Our goal is to model a sentence x_1 ... x_T. Context: RNN language models are remarkable at this task, x_{1:T} ∼ RNNLM(x_{1:T}; θ), defined as p(x_{1:T}) = ∏_{t=1}^T p(x_t | x_{<t}) = ∏_{t=1}^T softmax(W h_t)_{x_t}, where h_t = RNN(h_{t-1}, x_{t-1}; θ). [Plate diagram: θ → x^{(n)}_1 ... x^{(n)}_T, plate over N] 18/153

28 A Collection of Model Archetypes Focus: semi-supervised or unsupervised learning, i.e. don't just learn the probabilities, but the process. Range of choices in selecting z: 1 Discrete LVs z (clustering) 2 Continuous LVs z (dimensionality reduction) 3 Structured LVs z (structured learning) 19/153

31 Model 1: Discrete Clustering Process: "In an old house in Paris that was covered with vines lived twelve little girls in two straight lines." → Cluster 23. Discrete latent variable models induce a clustering over sentences x^{(n)}. Example uses: document/sentence clustering [Willett 1988; Aggarwal and Zhai 2012]; mixture-of-experts text generation models [Jacobs et al. 1991; Garmash and Monz 2016; Lee et al. 2016]. 21/153

33 Model 1: Discrete - Mixture of Categoricals Generative process: 1 Draw cluster z ∈ {1, ..., K} from a categorical with parameter µ. 2 Draw T words x_t from a categorical with word distribution π_z. Parameters: θ = {µ ∈ Δ^{K-1}, K × V stochastic matrix π}. Gives rise to the Naive Bayes distribution: p(x, z; θ) = p(z; µ) p(x | z; π) = µ_z ∏_{t=1}^T Cat(x_t; π_z) = µ_z ∏_{t=1}^T π_{z, x_t}. 22/153
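
A minimal sketch of this generative process and its joint probability in PyTorch (K, V, T and the parameter values are made up for illustration):

import torch

torch.manual_seed(0)
K, V, T = 3, 10, 5                                  # clusters, vocab size, sentence length
mu = torch.softmax(torch.randn(K), dim=0)           # cluster prior p(z)
pi = torch.softmax(torch.randn(K, V), dim=1)        # K x V stochastic matrix, p(x_t | z)

# Generative process: draw a cluster, then draw T words i.i.d. given that cluster.
z = torch.distributions.Categorical(probs=mu).sample()
x = torch.distributions.Categorical(probs=pi[z]).sample((T,))

# Joint probability p(x, z; theta) = mu_z * prod_t pi_{z, x_t}
log_joint = mu[z].log() + pi[z, x].log().sum()
print(z.item(), x.tolist(), log_joint.item())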

34 Model 1: Graphical Model View [Plate diagram: µ → z^{(n)} → x^{(n)}_1 ... x^{(n)}_T ← π, plate over N] ∏_{n=1}^N p(x^{(n)}, z^{(n)}; µ, π) = ∏_{n=1}^N p(z^{(n)}; µ) p(x^{(n)} | z^{(n)}; π) = ∏_{n=1}^N µ_{z^{(n)}} ∏_{t=1}^T π_{z^{(n)}, x^{(n)}_t}. 23/153

35 Deep Model 1: Discrete - Mixture of RNNs Generative process: 1 Draw cluster z ∈ {1, ..., K} from a categorical. 2 Draw words x_{1:T} from an RNNLM with parameters π_z. p(x, z; θ) = µ_z RNNLM(x_{1:T}; π_z). [Plate diagram: µ → z^{(n)}, π → x^{(n)}_1 ... x^{(n)}_T, plate over N] 24/153

36 Differences Between the Two Dependence structure: mixture of categoricals: x_t independent of the other x_j given z; mixture of RNNs: x_t fully dependent. Interesting question: how will this affect the learned latent space? Number of parameters: mixture of categoricals: K V; mixture of RNNs: K d² + V d for an RNN with d hidden dims. 25/153

38 Posterior For both discrete models, we can apply Bayes' rule: p(z | x; θ) = p(z) p(x | z) / p(x) = p(z) p(x | z) / ∑_{k=1}^K p(z = k) p(x | z = k). For the mixture of categoricals, the posterior uses word counts under each π_k. For the mixture of RNNs, the posterior requires running the RNN over x for each k. 26/153
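
Continuing the Naive Bayes sketch above, this posterior can be computed by enumerating z (using the illustrative quantities mu, pi, x defined there):

import torch

log_joint_all = mu.log() + pi[:, x].log().sum(dim=1)    # log p(x, z=k) for every k, shape (K,)
log_evidence = torch.logsumexp(log_joint_all, dim=0)    # log p(x)
posterior = (log_joint_all - log_evidence).exp()        # p(z | x), sums to 1
print(posterior.tolist())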

42 Model 2: Continuous / Dimensionality Reduction Process: "In an old house in Paris that was covered with vines lived twelve little girls in two straight lines." Find a lower-dimensional, well-behaved continuous representation of a sentence. Latent variables in R^d make distance/similarity easy. Examples: recent work in text generation assumes a latent vector per sentence [Bowman et al. 2016; Yang et al. 2017; Hu et al. 2017]; certain sentence embeddings (e.g., Skip-Thought vectors [Kiros et al. 2015]) can be interpreted in this way. 28/153

44 Model 2: Continuous Mixture Generative process: 1 Draw continuous latent variable z from a Normal with parameter µ. 2 For each t, draw word x_t from a categorical with parameter softmax(W z). Parameters: θ = {µ ∈ R^d, π}, with π = {W ∈ R^{V×d}}. Intuition: µ is a global distribution; z captures the local word distribution of the sentence. 29/153

45 Graphical Model View [Plate diagram: µ → z^{(n)} → x^{(n)}_1 ... x^{(n)}_T ← π, plate over N] Gives rise to the joint distribution: ∏_{n=1}^N p(x^{(n)}, z^{(n)}; θ) = ∏_{n=1}^N p(z^{(n)}; µ) p(x^{(n)} | z^{(n)}; π). 30/153

46 Deep Model 2: Continuous Mixture of RNNs Generative process: 1 Draw latent variable z ∼ N(µ, I). 2 Draw each token x_t from a conditional RNNLM; the RNN is also conditioned on the latent z: p(x, z; π, µ, I) = p(z; µ, I) p(x | z; π) = N(z; µ, I) CRNNLM(x_{1:T}; π, z), where CRNNLM(x_{1:T}; π, z) = ∏_{t=1}^T softmax(W h_t)_{x_t} and h_t = RNN(h_{t-1}, [x_{t-1}; z]; π). 31/153

49 Graphical Model View [Plate diagram: µ → z^{(n)}, π → x^{(n)}_1 ... x^{(n)}_T, plate over N] 32/153

50 Posterior For continuous models, Bayes' rule is harder to compute: p(z | x; θ) = p(z; µ) p(x | z; π) / ∫ p(z; µ) p(x | z; π) dz. The shallow and deep Model 2 variants mirror the Model 1 variants exactly, but with continuous z. The integral is intractable (in general) for both shallow and deep variants. 33/153

53 Model 3: Structure Learning Process: "In an old house in Paris that was covered with vines lived twelve little girls in two straight lines." Structured latent variable models are used to infer unannotated structure: unsupervised POS tagging [Brown et al. 1992; Merialdo 1994; Smith and Eisner 2005]; unsupervised dependency parsing [Klein and Manning 2004; Headden III et al. 2009]. Or when structure is useful for interpreting our data: segmentation of documents into topical passages [Hearst 1997]; alignment [Vogel et al. 1996]. 35/153

54 Model 3: Structured - Hidden Markov Model Generative process: 1 For each t, draw z_t ∈ {1, ..., K} from a categorical with parameter µ_{z_{t-1}}. 2 Draw observed token x_t from a categorical with parameter π_{z_t}. Parameters: θ = {K × K stochastic matrix µ, K × V stochastic matrix π}. Gives rise to the joint distribution: p(x, z; θ) = ∏_{t=1}^T p(z_t | z_{t-1}; µ) p(x_t | z_t; π) = ∏_{t=1}^T µ_{z_{t-1}, z_t} π_{z_t, x_t}. 36/153
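
A minimal sketch of sampling from this HMM and scoring the joint in PyTorch (sizes and parameters are illustrative; a uniform initial state distribution is assumed since the slide leaves it implicit):

import torch

torch.manual_seed(0)
K, V, T = 4, 12, 6
mu = torch.softmax(torch.randn(K, K), dim=1)    # K x K transition matrix, rows sum to 1
pi = torch.softmax(torch.randn(K, V), dim=1)    # K x V emission matrix
start = torch.full((K,), 1.0 / K)               # assumed uniform start distribution

zs, xs, log_joint = [], [], torch.tensor(0.0)
for t in range(T):
    trans = start if t == 0 else mu[zs[-1]]
    z_t = torch.distributions.Categorical(probs=trans).sample()
    x_t = torch.distributions.Categorical(probs=pi[z_t]).sample()
    log_joint = log_joint + trans[z_t].log() + pi[z_t, x_t].log()
    zs.append(z_t)
    xs.append(x_t)
print([z.item() for z in zs], [x.item() for x in xs], log_joint.item())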

56 Graphical Model View [Diagram: µ → z_1 → z_2 → z_3 → z_4; z_t → x_t ← π; plate over N] p(x, z; θ) = ∏_{t=1}^T p(z_t | z_{t-1}; µ) p(x_t | z_t; π) = ∏_{t=1}^T µ_{z_{t-1}, z_t} π_{z_t, x_t}. 37/153

57 Further Extension: Factorial HMM [Diagram: L latent chains z_{l,1} ... z_{l,4}, all feeding into x_1 ... x_4; plate over N] p(x, z; θ) = ∏_{l=1}^L ∏_{t=1}^T p(z_{l,t} | z_{l,t-1}) ∏_{t=1}^T p(x_t | z_{1:L,t}). 38/153

58 Deep Model 3: Deep HMM Parameterize the transition and emission distributions with neural networks (c.f. Tran et al. [2016]). Model the transition distribution as p(z_t | z_{t-1}) = softmax(MLP(z_{t-1}; µ)) and the emission distribution as p(x_t | z_t) = softmax(MLP(z_t; π)). Note: K × K transition parameters for the standard HMM vs. O(K d + d²) for the deep version. 39/153

61 Posterior For structured models, Bayes' rule may be tractable: p(z | x; θ) = p(z; µ) p(x | z; π) / ∑_{z'} p(z'; µ) p(x | z'; π). Unlike the previous models, z contains interdependent parts. For both shallow and deep Model 3 variants, it's possible to calculate p(x; θ) exactly with a dynamic program. For some structured models, like the Factorial HMM, the dynamic program may still be intractable. 41/153

65 Learning with Maximum Likelihood: Find model parameters θ that maximize the likelihood of the data, θ* = argmax_θ ∑_{n=1}^N log p(x^{(n)}; θ). 44/153

66 Learning Deep Models L(θ) = ∑_{n=1}^N log p(x^{(n)}; θ). [Diagram: θ → x, plate over N] The dominant framework is gradient-based optimization: θ^{(i)} = θ^{(i-1)} + η ∇_θ L(θ), with ∇_θ L(θ) calculated via backpropagation. Tactics: mini-batch based training, adaptive learning rates [Duchi et al. 2011; Kingma and Ba 2015]. 45/153

67 Learning Deep Latent-Variable Models: Marginalization The likelihood requires summing out the latent variables, p(x; θ) = ∑_{z ∈ Z} p(x, z; θ) (= ∫ p(x, z; θ) dz if z is continuous). In general, it is hard to optimize the log-likelihood of the training set, L(θ) = ∑_{n=1}^N log ∑_{z ∈ Z} p(x^{(n)}, z; θ). [Diagram: θ → z^{(n)} → x^{(n)}, plate over N] 46/153

70 High-level: decompose the objective into a lower bound and a gap, L(θ) = LB(θ, λ) + GAP(θ, λ) for some λ. [Diagram: L(θ) split into GAP(θ, λ) and LB(θ, λ)] This provides a framework for deriving a rich set of optimization algorithms. 48/153

71 Marginal Likelihood: Decomposition For any¹ distribution q(z | x; λ) over z, L(θ) = E_q[log p(x, z; θ) / q(z | x; λ)] + KL[q(z | x; λ) || p(z | x; θ)], where the first term is the ELBO (evidence lower bound) and the second is the posterior gap. Since the KL is always non-negative, L(θ) ≥ ELBO(θ, λ). ¹Technical condition: supp(q(z)) ⊆ supp(p(z | x; θ)). 49/153

72 Evidence Lower Bound: Proof log p(x; θ) = E_q[log p(x)] (expectation over z) = E_q[log p(x, z) / p(z | x)] (mult/div by p(z | x), combine numerator) = E_q[log (p(x, z) / q(z | x)) · (q(z | x) / p(z | x))] (mult/div by q(z | x)) = E_q[log p(x, z) / q(z | x)] + E_q[log q(z | x) / p(z | x)] (split log) = E_q[log p(x, z; θ) / q(z | x; λ)] + KL[q(z | x; λ) || p(z | x; θ)]. 50/153
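
As a sanity check, this decomposition can be verified numerically on a tiny discrete model (the probability tables below are made up for illustration):

import torch

torch.manual_seed(0)
K = 3
p_z = torch.softmax(torch.randn(K), dim=0)     # prior p(z)
p_x_given_z = torch.rand(K)                    # p(x | z = k) for one fixed observation x
q = torch.softmax(torch.randn(K), dim=0)       # an arbitrary q(z | x)

log_px = (p_z * p_x_given_z).sum().log()                       # log p(x)
elbo = (q * ((p_z * p_x_given_z) / q).log()).sum()             # E_q[log p(x, z) / q(z | x)]
post = p_z * p_x_given_z / (p_z * p_x_given_z).sum()           # exact posterior p(z | x)
kl = (q * (q / post).log()).sum()                              # KL[q || p(z | x)]
print(log_px.item(), (elbo + kl).item())                       # equal, and elbo <= log_px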

77 Evidence Lower Bound over Observations ELBO(θ, λ; x) = E_{q(z | x; λ)}[log p(x, z; θ) / q(z | x; λ)]. The ELBO is a function of the generative model parameters θ and the variational parameters λ. Over the dataset, ∑_{n=1}^N log p(x^{(n)}; θ) ≥ ∑_{n=1}^N ELBO(θ, λ; x^{(n)}) = ∑_{n=1}^N E_{q(z | x^{(n)}; λ)}[log p(x^{(n)}, z; θ) / q(z | x^{(n)}; λ)] = ELBO(θ, λ; x^{(1:N)}) = ELBO(θ, λ). 51/153

78 Setup: Selecting a Variational Family Just as with p and θ, we can select any form of q and λ that satisfies the ELBO conditions. Different choices of q will lead to different algorithms. We will explore several forms of q: the exact posterior, a point estimate (MAP), amortized families, and mean field (later). 52/153

79 Example Variational Family: Full Posterior Form [Diagram: local variational parameters λ^{(n)} → z^{(n)}, matched to the model's z^{(n)} via KL(q(z | x) || p(z | x))] λ = [λ^{(1)}, ..., λ^{(N)}] is a concatenation of local variational parameters λ^{(n)}, e.g. q(z^{(n)} | x^{(n)}; λ) = q(z^{(n)} | x^{(n)}; λ^{(n)}) = N(λ^{(n)}, 1). 53/153

80 Example Family: Amortized Parameterization [Kingma and Welling 2014] [Diagram: a global inference network λ maps each x^{(n)} to a distribution over z^{(n)}, matched to the model via KL(q(z | x) || p(z | x))] λ parameterizes a global network (encoder/inference network) that is run over x^{(n)} to produce the local variational distribution, e.g. q(z^{(n)} | x^{(n)}; λ) = N(µ^{(n)}, 1), with µ^{(n)} = enc(x^{(n)}; λ). 54/153

82 Maximizing the Evidence Lower Bound Central quantity of interest: almost all methods are maximizing the aggregate ELBO objective, argmax_{θ,λ} ELBO(θ, λ) = argmax_{θ,λ} ∑_{n=1}^N ELBO(θ, λ; x^{(n)}) = argmax_{θ,λ} ∑_{n=1}^N E_q[log p(x^{(n)}, z^{(n)}; θ) / q(z^{(n)} | x^{(n)}; λ)]. 56/153

84 Maximizing ELBO: Model Parameters argmax_θ E_q[log p(x, z; θ) / q(z | x; λ)] = argmax_θ E_q[log p(x, z; θ)]. Intuition: a maximum likelihood problem where the latent variables are drawn from q(z | x; λ). 57/153

85 Model Estimation: Gradient Ascent on Model Parameters Easy: the gradient with respect to θ, ∇_θ ELBO(θ, λ; x) = ∇_θ E_q[log p(x, z; θ)] = E_q[∇_θ log p(x, z; θ)]. Since q does not depend on θ, the gradient moves inside the expectation. Estimate with samples from q; the term ∇_θ log p(x, z; θ) is easy to evaluate. (In practice a single sample is often sufficient.) In special cases, we can evaluate the expectation exactly. 58/153

87 Maximizing ELBO: Variational Distribution argmax_λ ELBO(θ, λ) = argmax_λ (log p(x; θ) − KL[q(z | x; λ) || p(z | x; θ)]) = argmin_λ KL[q(z | x; λ) || p(z | x; θ)]. Intuition: q should approximate the posterior p(z | x). However, this may be difficult if q or p is a deep model. 59/153

88 Variational Model: Gradient Ascent on λ? Hard: the gradient with respect to λ, ∇_λ ELBO(θ, λ; x) = ∇_λ E_q[log p(x, z; θ) / q(z | x; λ)] ≠ E_q[∇_λ log p(x, z; θ) / q(z | x; λ)]. We cannot naively move the gradient inside the expectation, since q depends on λ. This section covers the strategies used in practice: 1 Exact gradient 2 Sampling: score function, reparameterization 3 Conjugacy: closed-form, coordinate ascent. 60/153

91 Strategy 1: Exact Gradient ∇_λ ELBO(θ, λ; x) = ∇_λ E_{q(z | x; λ)}[log p(x, z; θ) / q(z | x; λ)] = ∇_λ ∑_{z ∈ Z} q(z | x; λ) log (p(x, z; θ) / q(z | x; λ)). Naive enumeration is linear in |Z|; depending on the structure of q and p, potentially faster with dynamic programming. Applicable mainly to Models 1 and 3 (discrete and structured), or Model 2 with a point estimate. 62/153

93 Example: Model 1 - Naive Bayes Let q(z | x; λ) = Cat(ν) where ν = enc(x; λ). Then ∇_λ ELBO(θ, λ; x) = ∇_λ E_{q(z | x; λ)}[log p(x, z; θ) / q(z | x; λ)] = ∇_λ ∑_{z ∈ Z} q(z | x; λ) log (p(x, z; θ) / q(z | x; λ)) = ∇_λ ∑_{z ∈ Z} ν_z log (p(x, z; θ) / ν_z). 63/153
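
A minimal sketch of this exact ELBO and its gradient via autodiff, reusing the illustrative Naive Bayes quantities (mu, pi, x, K, V, T) from the earlier sketch; the bag-of-words encoder is a hypothetical stand-in for enc(x; λ):

import torch

enc = torch.nn.Linear(V, K)                                 # toy inference network
x_bow = torch.zeros(V).index_add_(0, x, torch.ones(T))      # bag-of-words features of x
nu = torch.softmax(enc(x_bow), dim=0)                       # q(z | x; lambda) = Cat(nu)

log_joint_all = mu.log() + pi[:, x].log().sum(dim=1)        # log p(x, z=k) for all k
elbo = (nu * (log_joint_all - nu.log())).sum()              # exact ELBO by enumerating z
elbo.backward()                                             # gradient w.r.t. the encoder parameters
print(elbo.item(), enc.weight.grad.norm().item())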

96 Strategy 2: Sampling ∇_λ ELBO(θ, λ; x) = ∇_λ E_q[log p(x, z; θ) − log q(z | x; λ)] = ∇_λ E_q[log p(x, z; θ)] − ∇_λ E_q[log q(z | x; λ)]. How can we approximate this gradient with sampling? The naive algorithm fails to provide a non-zero gradient: with z^{(1)}, ..., z^{(J)} ∼ q(z | x; λ), ∇_λ (1/J) ∑_{j=1}^J log p(x, z^{(j)}; θ) = 0. Instead, manipulate the expression so we can move ∇_λ inside E_q before sampling. 65/153

101 Strategy 2a: Sampling - Score Function Gradient Estimator First term. Use the basic identity ∇_λ q = q ∇_λ log q (policy-gradient style training [Williams 1992]): ∇_λ E_q[log p(x, z; θ)] = ∑_z ∇_λ q(z | x; λ) log p(x, z; θ) = ∑_z q(z | x; λ) ∇_λ log q(z | x; λ) log p(x, z; θ) = E_q[log p(x, z; θ) ∇_λ log q(z | x; λ)]. 69/153

106 Strategy 2a: Sampling - Score Function Gradient Estimator Second term. Need the additional identity ∑_z ∇_λ q = ∇_λ ∑_z q = ∇_λ 1 = 0: ∇_λ E_q[log q(z | x; λ)] = ∑_z ∇_λ (q(z | x; λ) log q(z | x; λ)) = ∑_z log q(z | x; λ) q(z | x; λ) ∇_λ log q(z | x; λ) + ∑_z ∇_λ q(z | x; λ) = E_q[log q(z | x; λ) ∇_λ log q(z | x; λ)]. 74/153

107 Strategy 2a: Sampling - Score Function Gradient Estimator Putting these together, ∇_λ ELBO(θ, λ; x) = ∇_λ E_q[log p(x, z; θ) / q(z | x; λ)] = E_q[log (p(x, z; θ) / q(z | x; λ)) ∇_λ log q(z | x; λ)] = E_q[R_{θ,λ}(z) ∇_λ log q(z | x; λ)]. 75/153

108 Strategy 2a: Sampling - Score Function Gradient Estimator Estimate with samples z^{(1)}, ..., z^{(J)} ∼ q(z | x; λ): E_q[R_{θ,λ}(z) ∇_λ log q(z | x; λ)] ≈ (1/J) ∑_{j=1}^J R_{θ,λ}(z^{(j)}) ∇_λ log q(z^{(j)} | x; λ). Intuition: if a sample z^{(j)} has high reward R_{θ,λ}(z^{(j)}), increase the probability of z^{(j)} by moving along the gradient ∇_λ log q(z^{(j)} | x; λ). 76/153
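
A minimal sketch of this estimator for the same illustrative Naive Bayes model and toy encoder as above (no control variate; a fresh forward pass recomputes nu):

import torch

J = 16
nu = torch.softmax(enc(x_bow), dim=0)                       # q(z | x; lambda)
samples = torch.distributions.Categorical(probs=nu).sample((J,))
log_q = nu[samples].log()                                   # log q(z^{(j)} | x; lambda)
reward = (log_joint_all[samples] - log_q).detach()          # R(z^{(j)}), treated as a constant
surrogate = (reward * log_q).mean()                         # its gradient is the score-function estimate
enc.zero_grad()
surrogate.backward()
print(enc.weight.grad.norm().item())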

109 Strategy 2a: Sampling - Score Function Gradient Estimator Essentially reinforcement learning with reward R_{θ,λ}(z). The score function gradient is generally applicable regardless of what form q takes (we only need to evaluate ∇_λ log q). This generality comes at a cost, since the reward is "black-box": an unbiased estimator, but high variance. In practice we need a variance-reducing control variate B. (More on this later.) 77/153

110 Example: Model 1 - Naive Bayes Let q(z | x; λ) = Cat(ν) where ν = enc(x; λ), and sample z^{(1)}, ..., z^{(J)} ∼ q(z | x; λ). Then ∇_λ ELBO(θ, λ; x) = E_q[log (p(x, z; θ) / q(z | x; λ)) ∇_λ log q(z | x; λ)] ≈ (1/J) ∑_{j=1}^J log (p(x, z^{(j)}; θ) / ν_{z^{(j)}}) ∇_λ log ν_{z^{(j)}}. Computational complexity: O(J) vs. O(|Z|). 78/153

112 Strategy 2b: Sampling - Reparameterization Suppose we can sample from q by applying a deterministic, differentiable transformation g to noise from a base density: ɛ ∼ U, z = g(ɛ, λ). Gradient calculation (first term): ∇_λ E_{z ∼ q(z | x; λ)}[log p(x, z; θ)] = ∇_λ E_{ɛ ∼ U}[log p(x, g(ɛ, λ); θ)] = E_{ɛ ∼ U}[∇_λ log p(x, g(ɛ, λ); θ)] ≈ (1/J) ∑_{j=1}^J ∇_λ log p(x, g(ɛ^{(j)}, λ); θ), where ɛ^{(1)}, ..., ɛ^{(J)} ∼ U. 79/153

114 Exact Gradient Sampling Conjugacy Strategy 2b: Sampling Reparameterization Unbiased, like the score function gradient estimator, but empirically lower variance. In practice, single sample is often sufficient. Cannot be used out-of-the-box for discrete z. 80/153

115 Strategy 2: Continuous Latent Variable RNN Choose the variational family to be an amortized diagonal Gaussian, q(z | x; λ) = N(µ, σ²) with µ, σ² = enc(x; λ). Then we can sample from q(z | x; λ) by drawing ɛ ∼ N(0, I) and setting z = µ + σ ⊙ ɛ. [Diagram: the encoder λ maps x to (µ, σ); noise ɛ combines with them to give z, which conditions the decoder over x_1 ... x_T] 81/153
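
A minimal sketch of a reparameterized gradient through a diagonal Gaussian q (the encoder, decoder, and reconstruction term below are illustrative stand-ins, not the tutorial's architecture):

import torch

d_x, d_z = 8, 2
x_feat = torch.randn(d_x)                       # pretend features of a sentence
enc_net = torch.nn.Linear(d_x, 2 * d_z)         # amortized encoder -> (mu, log sigma^2)
dec_net = torch.nn.Linear(d_z, d_x)             # stand-in decoder for p(x | z)

mu, logvar = enc_net(x_feat).chunk(2, dim=-1)
sigma = (0.5 * logvar).exp()
eps = torch.randn(d_z)
z = mu + sigma * eps                            # z = g(eps, lambda): differentiable in lambda

# toy reconstruction term standing in for log p(x | z; theta); KL term omitted for brevity
recon = -((dec_net(z) - x_feat) ** 2).sum()
recon.backward()                                # gradients flow into enc_net through z
print(enc_net.weight.grad.norm().item())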

116 Strategy 2b: Sampling - Reparameterization (Recall R_{θ,λ}(z) = log p(x, z; θ) / q(z | x; λ).) Score function: ∇_λ ELBO(θ, λ; x) = E_{z ∼ q}[R_{θ,λ}(z) ∇_λ log q(z | x; λ)]. Reparameterization: ∇_λ ELBO(θ, λ; x) = E_{ɛ ∼ N(0,I)}[∇_λ R_{θ,λ}(g(ɛ, λ; x))], where g(ɛ, λ; x) = µ + σ ⊙ ɛ. Informally, reparameterization gradients differentiate through R_{θ,λ}(·) and thus have more knowledge about the structure of the objective function. 82/153

118 Strategy 3: Conjugacy For certain choices of p and q, we can compute parts of the objective exactly in closed form. Recall that argmax_λ ELBO(θ, λ; x) = argmin_λ KL[q(z | x; λ) || p(z | x; θ)]. 84/153

120 Strategy 3a: Conjugacy - Tractable Posterior Suppose we can tractably calculate p(z | x; θ). Then KL[q(z | x; λ) || p(z | x; θ)] is minimized when q(z | x; λ) = p(z | x; θ): the E-step of the Expectation Maximization algorithm [Dempster et al. 1977]. 85/153

121 Example: Model 1 - Naive Bayes p(z | x; θ) = p(x, z; θ) / ∑_{z'=1}^K p(x, z'; θ), so λ is given by the parameters of the categorical distribution, i.e. λ = [p(z = 1 | x; θ), ..., p(z = K | x; θ)]. 86/153

122 Example: Model 3 - HMM [Diagram: µ → z_1 → ... → z_4; z_t → x_t ← π; plate over N] p(x, z; θ) = p(z_0) ∏_{t=1}^T p(z_t | z_{t-1}; µ) p(x_t | z_t; π). 87/153

123 Example: Model 3 - HMM Run the forward/backward dynamic program to calculate the posterior marginals p(z_t, z_{t+1} | x; θ); the variational parameters λ ∈ R^{T × K²} store these edge marginals. They are enough to calculate q(z; λ) = p(z | x; θ) (i.e. the exact posterior) for any sequence z. 88/153

125 Connection: Gradient Ascent on the Log Marginal Likelihood Why not perform gradient ascent directly on the log marginal likelihood log p(x; θ) = log ∑_z p(x, z; θ)? This is the same as optimizing the ELBO with exact posterior inference (i.e. EM). The gradients of the model parameters are given by (where q(z | x; λ) = p(z | x; θ)): ∇_θ log p(x; θ) = E_{q(z | x; λ)}[∇_θ log p(x, z; θ)]. 89/153

127 Connection: Gradient Ascent on the Log Marginal Likelihood Practically, this means we don't have to manually perform posterior inference in the E-step: we can just calculate log p(x; θ) and call backpropagation. Example: in a deep HMM, just implement the forward algorithm to calculate log p(x; θ) and backpropagate using autodiff; no need to implement the backward algorithm (or vice versa). (See Eisner [2016]: "Inside-Outside and Forward-Backward Algorithms Are Just Backprop".) 90/153
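
A minimal sketch of that idea for the HMM: the forward algorithm in log space, with autodiff providing gradients of log p(x; θ) (sizes are illustrative; the observed sequence xs and the uniform start distribution are assumptions):

import math
import torch

K, V, T = 4, 12, 6
trans_logits = torch.randn(K, K, requires_grad=True)
emit_logits = torch.randn(K, V, requires_grad=True)
xs = torch.randint(V, (T,))                         # assumed observed token sequence

log_mu = torch.log_softmax(trans_logits, dim=1)     # log transition matrix
log_pi = torch.log_softmax(emit_logits, dim=1)      # log emission matrix
log_start = torch.full((K,), -math.log(K))          # assumed uniform initial distribution

# Forward recursion: alpha[k] = log p(x_{1:t}, z_t = k)
alpha = log_start + log_pi[:, xs[0]]
for t in range(1, T):
    alpha = torch.logsumexp(alpha.unsqueeze(1) + log_mu, dim=0) + log_pi[:, xs[t]]
log_marginal = torch.logsumexp(alpha, dim=0)        # log p(x; theta)
log_marginal.backward()                             # no hand-written backward algorithm needed
print(log_marginal.item(), trans_logits.grad.abs().sum().item())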

128 Strategy 3b: Conditional Conjugacy Let p(z | x; θ) be intractable, but suppose p(x, z; θ) is conditionally conjugate, meaning each complete conditional p(z_t | x, z_{-t}; θ) is in the exponential family. Restrict the family of distributions q so that it factorizes over the z_t, i.e. q(z; λ) = ∏_{t=1}^T q(z_t; λ_t) (the mean field family). Further choose each q(z_t; λ_t) to be in the same family as p(z_t | x, z_{-t}; θ). 91/153

129 Strategy 3b: Conditional Conjugacy [Diagram: local factors q(z^{(n)}_t; λ^{(n)}_t) matched to the model posterior via KL(q(z) || p(z | x))] q(z; λ) = ∏_{t=1}^T q(z_t; λ_t). 92/153

130 Mean Field Family Optimize the ELBO via coordinate ascent, i.e. iterate over λ_1, ..., λ_T, solving argmin_{λ_t} KL[∏_{t=1}^T q(z_t; λ_t) || p(z | x; θ)]. The coordinate ascent updates take the form q(z_t; λ_t) ∝ exp(E_{q(z_{-t}; λ_{-t})}[log p(x, z; θ)]), where E_{q(z_{-t}; λ_{-t})}[log p(x, z; θ)] = ∑_{z_{-t}} (∏_{j ≠ t} q(z_j; λ_j)) log p(x, z; θ). Since p(z_t | x, z_{-t}) was assumed to be in the exponential family, the updates above can be derived in closed form. 93/153

131 Example: Model 3 - Factorial HMM [Diagram: L latent chains z_{l,1} ... z_{l,4}, all feeding into x_1 ... x_4; plate over N] p(x, z; θ) = ∏_{l=1}^L ∏_{t=1}^T p(z_{l,t} | z_{l,t-1}; θ) ∏_{t=1}^T p(x_t | z_{1:L,t}; θ). 94/153

132 Example: Model 3 - Factorial HMM [Diagram: the factorial HMM with the factor currently being updated highlighted] Update each factor in turn, e.g. q(z_{1,1}; λ_{1,1}) ∝ exp(E_{q(z_{-(1,1)}; λ_{-(1,1)})}[log p(x, z; θ)]), then q(z_{2,1}; λ_{2,1}) ∝ exp(E_{q(z_{-(2,1)}; λ_{-(2,1)})}[log p(x, z; θ)]), and so on. 95/153

134 Exact Inference: Example: Model 3 - Factorial HMM Naive: K states, L levels ⇒ an HMM with K^L states ⇒ O(T K^{2L}). Smarter: O(T L K^{L+1}). Mean field: Gaussian emissions, O(T L K²) [Ghahramani and Jordan 1996]; categorical emissions need more variational approximations, but are ultimately O(L K V T) [Nepal and Yates 2013]. 97/153

137 Gumbel-Softmax Flows IWAE 1 Gumbel-Softmax: Extend reparameterization to discrete variables. 2 Flows: Optimize a tighter bound by making the variational family q more flexible. 3 Importance Weighting: Optimize a tighter bound through importance sampling. 99/153

139 Challenges of Discrete Variables Review: we can always use the score function estimator, ∇_λ ELBO(x, θ, λ) = E_q[log (p(x, z; θ) / q(z | x; λ)) ∇_λ log q(z | x; λ)] = E_q[(log (p(x, z; θ) / q(z | x; λ)) − B) ∇_λ log q(z | x; λ)], since E_q[B ∇_λ log q(z | x; λ)] = 0 (because E_q[∇ log q] = ∑ q ∇ log q = ∑ ∇q = 0). The control variate B must not depend on z, but can depend on x; it can be estimated with another neural net [Mnih and Gregor 2014] by minimizing (log (p(x, z; θ) / q(z | x; λ)) − B(x; ψ))². 101/153

141 Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017] The Gumbel-Max trick [Papandreou and Yuille 2011]: let p(z_k = 1; α) = α_k / ∑_{j=1}^K α_j, where z = [0, 0, ..., 1, ..., 0] is a one-hot vector. We can sample from p(z; α) by 1 drawing independent Gumbel noise ɛ = ɛ_1, ..., ɛ_K with ɛ_k = −log(−log u_k), u_k ∼ U(0, 1), and 2 adding ɛ_k to log α_k and finding the argmax, i.e. i = argmax_k [log α_k + ɛ_k], z_i = 1. 102/153

144 Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017] Reparameterization: z = argmax_{s ∈ Δ^{K-1}} s^⊤(log α + ɛ) = g(ɛ, α), so z = g(ɛ, α) is a deterministic function applied to stochastic noise. Let's try applying this (recalling R_{θ,λ}(z) = log p(x, z; θ) / q(z | x; λ)), with q(z_k = 1 | x; λ) = α_k / ∑_{j=1}^K α_j and α = enc(x; λ): ∇_λ E_{q(z | x; λ)}[R_{θ,λ}(z)] = ∇_λ E_{ɛ ∼ Gumbel}[R_{θ,λ}(g(ɛ, α))] = E_{ɛ ∼ Gumbel}[∇_λ R_{θ,λ}(g(ɛ, α))]. 103/153

147 Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017] But this won't work, because z = g(ɛ, α) = argmax_{s ∈ Δ^{K-1}} s^⊤(log α + ɛ) has zero gradients almost everywhere, so ∇_λ R_{θ,λ}(z) = 0. Gumbel-Softmax trick: replace the argmax with a softmax, z = softmax((log α + ɛ) / τ), i.e. z_k = exp((log α_k + ɛ_k)/τ) / ∑_{j=1}^K exp((log α_j + ɛ_j)/τ), where τ is a temperature term. Then ∇_λ E_{q(z | x; λ)}[R_{θ,λ}(z)] ≈ E_{ɛ ∼ Gumbel}[∇_λ R_{θ,λ}(softmax((log α + ɛ)/τ))]. 104/153
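
A minimal sketch contrasting the hard Gumbel-Max sample with its Gumbel-Softmax relaxation (the logits and temperature are illustrative):

import torch

torch.manual_seed(0)
K, tau = 5, 0.5
log_alpha = torch.randn(K, requires_grad=True)

u = torch.rand(K)
eps = -torch.log(-torch.log(u))                              # Gumbel(0, 1) noise

hard = torch.zeros(K)
hard[(log_alpha + eps).argmax()] = 1.0                       # Gumbel-Max: one-hot, no gradient path

soft = torch.softmax((log_alpha + eps) / tau, dim=0)         # Gumbel-Softmax relaxation
soft.sum().backward()                                        # gradients reach log_alpha
print(hard.tolist(), [round(s, 3) for s in soft.tolist()])
# torch.nn.functional.gumbel_softmax provides this (hard=True gives a straight-through variant).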

150 Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017] Approaches a discrete distribution as τ → 0 (anneal τ during training). Reparameterizable by construction; differentiable with non-zero gradients. [Figure from Maddison et al. [2017]] 105/153

151 Gumbel-Softmax: Discrete Reparameterization [Jang et al. 2017; Maddison et al. 2017] See Maddison et al. [2017] on whether we can use the original categorical densities p(z), q(z), or need to use the relaxed densities p_GS(z), q_GS(z). Requires that p(x | z; θ) makes sense for non-discrete z (e.g. attention). A lower-variance but biased gradient estimator; the variance grows as τ → 0. 106/153

153 Flows [Rezende and Mohamed 2015; Kingma et al. 2016] Recall log p(x; θ) = ELBO(θ, λ; x) + KL[q(z | x; λ) || p(z | x; θ)]. The bound is tight when the variational posterior equals the true posterior: q(z | x; λ) = p(z | x; θ) ⇒ log p(x; θ) = ELBO(θ, λ; x). We want to make q(z | x; λ) as flexible as possible: can we do better than just a Gaussian? 108/153

155 Flows [Rezende and Mohamed 2015; Kingma et al. 2016] Idea: transform a sample from a simple initial variational distribution, z_0 ∼ q(z | x; λ) = N(µ, σ²) with µ, σ² = enc(x; λ), into a more complex one, z_K = f_K ∘ ... ∘ f_2 ∘ f_1(z_0; λ), where the f_i(z_{i-1}; λ) are invertible transformations (whose parameters are absorbed into λ). 109/153

157 Flows [Rezende and Mohamed 2015; Kingma et al. 2016] The sample from the final variational posterior is given by z_K. Its density is given by the change of variables formula: log q_K(z_K | x; λ) = log q(z_0 | x; λ) − ∑_{k=1}^K log |det ∂f_k/∂z_{k-1}|, where the first term is the log density of the Gaussian and the sum is over log determinants of Jacobians. The determinant calculation is O(N³) in general, but can be made faster depending on the parameterization of f_k. 110/153

158 Flows [Rezende and Mohamed 2015; Kingma et al. 2016] We can still use reparameterization to obtain gradients. Letting F(z) = f_K ∘ ... ∘ f_1(z), ∇_λ ELBO(θ, λ; x) = ∇_λ E_{q_K(z_K | x; λ)}[log p(x, z; θ) / q_K(z_K | x; λ)] = ∇_λ E_{q(z_0 | x; λ)}[log (p(x, F(z_0); θ) / q(z_0 | x; λ)) + log |det ∂F/∂z_0|] = E_{ɛ ∼ N(0,I)}[∇_λ (log (p(x, F(z_0); θ) / q(z_0 | x; λ)) + log |det ∂F/∂z_0|)]. 111/153

159 Flows [Rezende and Mohamed 2015; Kingma et al. 2016] Examples of f_k(z_{k-1}; λ): Normalizing flows [Rezende and Mohamed 2015], f_k(z_{k-1}) = z_{k-1} + u_k h(w_k^⊤ z_{k-1} + b_k); Inverse autoregressive flows [Kingma et al. 2016], f_k(z_{k-1}) = z_{k-1} ⊙ σ_k + µ_k, with σ_{k,d} = sigmoid(NN(z_{k-1,<d})) and µ_{k,d} = NN(z_{k-1,<d}). (In the IAF case the Jacobian is triangular, so the determinant is just the product of the diagonal entries.) 112/153
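
A minimal sketch of one normalizing-flow step of the first kind above and its log-density correction (dimensions and parameter values are illustrative):

import torch

d = 2
z0 = torch.randn(d)
u, w, b = torch.randn(d), torch.randn(d), torch.randn(())

def flow_step(z):
    # f(z) = z + u * tanh(w^T z + b)
    return z + u * torch.tanh(w @ z + b)

z1 = flow_step(z0)

# log |det df/dz| = log |1 + u^T psi(z)|, with psi(z) = (1 - tanh^2(w^T z + b)) w
psi = (1 - torch.tanh(w @ z0 + b) ** 2) * w
log_det = torch.log(torch.abs(1 + u @ psi))

base = torch.distributions.Normal(torch.zeros(d), torch.ones(d))
log_q0 = base.log_prob(z0).sum()
log_q1 = log_q0 - log_det                    # change-of-variables density of z1
print(z1.tolist(), log_q1.item())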

160 Flows [Rezende and Mohamed 2015; Kingma et al. 2016] [Figure from Rezende and Mohamed [2015]] 113/153

162 Importance Weighted Autoencoder (IWAE) [Burda et al. 2015] Flows are one way of tightening the ELBO by making the variational family more flexible. Not the only way: we can obtain a tighter lower bound on log p(x; θ) by using multiple importance samples. Consider I_K = (1/K) ∑_{k=1}^K p(x, z^{(k)}; θ) / q(z^{(k)} | x; λ), where z^{(1:K)} ∼ q(z | x; λ). Note that I_K is an unbiased estimator of p(x; θ): E_{q(z^{(1:K)} | x; λ)}[I_K] = p(x; θ). 115/153

164 Importance Weighted Autoencoder (IWAE) [Burda et al. 2015] Any unbiased estimator of p(x; θ) can be used to obtain a lower bound, using Jensen's inequality: p(x; θ) = E_{q(z^{(1:K)} | x; λ)}[I_K] ⇒ log p(x; θ) ≥ E_{q(z^{(1:K)} | x; λ)}[log I_K] = E_{q(z^{(1:K)} | x; λ)}[log (1/K) ∑_{k=1}^K p(x, z^{(k)}; θ) / q(z^{(k)} | x; λ)]. Moreover, one can also show [Burda et al. 2015]: log p(x; θ) ≥ E[log I_K] ≥ E[log I_{K-1}], and lim_{K→∞} E[log I_K] = log p(x; θ) under mild conditions. 116/153

165 Importance Weighted Autoencoder (IWAE) [Burda et al. 2015] The IWAE objective is E_{q(z^{(1:K)} | x; λ)}[log (1/K) ∑_{k=1}^K p(x, z^{(k)}; θ) / q(z^{(k)} | x; λ)]. Note that with K = 1 we recover the ELBO, and the ratios p(x, z^{(k)}; θ) / q(z^{(k)} | x; λ) can be interpreted as importance weights. If q(z | x; λ) is reparameterizable, we can use the reparameterization trick to optimize E[log I_K] directly; otherwise, we need score function gradient estimators [Mnih and Rezende 2016]. 117/153
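
A minimal sketch of the K-sample bound for a toy Gaussian model and Gaussian q (all of the distributions below are illustrative):

import math
import torch

torch.manual_seed(0)
K, d = 8, 2
x = torch.randn(d)                                                   # a fake observation
prior = torch.distributions.Normal(torch.zeros(d), torch.ones(d))    # p(z)
q = torch.distributions.Normal(torch.zeros(d) + 0.5, torch.ones(d))  # a crude q(z | x)

z = q.rsample((K,))                                                  # K reparameterized samples
log_lik = torch.distributions.Normal(z, torch.ones(d)).log_prob(x).sum(-1)   # log p(x | z^{(k)})
log_w = prior.log_prob(z).sum(-1) + log_lik - q.log_prob(z).sum(-1)  # log importance weights

elbo = log_w.mean()                                                  # average single-sample ELBO estimate
iwae = torch.logsumexp(log_w, dim=0) - math.log(K)                   # log (1/K) sum_k w_k
print(elbo.item(), iwae.item())                                      # iwae >= elbo, by Jensen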

168 Sentence VAE Example [Bowman et al. 2016] Generative model (Model 2): draw z ∼ N(0, I); draw x_t | z ∼ CRNNLM(θ, z). Variational model (amortized): deep diagonal Gaussians, q(z | x; λ) = N(µ, σ²) with h_T = RNN(x; ψ), µ = W_1 h_T, σ² = exp(W_2 h_T), λ = {W_1, W_2, ψ}. 120/153

169 Sentence VAE Example [Bowman et al. 2016] [Figure from Bowman et al. [2016]; diagram of the amortized encoder λ producing z from x and the decoder generating x_1 ... x_T from z] 121/153

170 Issue 1: Posterior Collapse ELBO(θ, λ) = E_{q(z | x; λ)}[log p(x, z; θ) / q(z | x; λ)] = E_{q(z | x; λ)}[log p(x | z; θ)] − KL[q(z | x; λ) || p(z)], i.e. a reconstruction likelihood term minus a regularizer term. [Table comparing RNN LM and RNN VAE on LL/ELBO, reconstruction, and KL; on the Yahoo corpus, from Yang et al. [2017]] 122/153

171 Issue 1: Posterior Collapse x and z become independent, and p(x, z; θ) reduces to a non-latent-variable language model. Chen et al. [2017]: if it is possible to model p*(x) without making use of z, then the ELBO optimum is at p*(x) = p(x | z; θ) = p(x; θ), q(z | x; λ) = p(z), and KL[q(z | x; λ) || p(z)] = 0. 123/153

172 Mitigating Posterior Collapse Use less powerful likelihood models [Miao et al. 2016; Yang et al. 2017], or word dropout [Bowman et al. 2016]. [Table comparing RNN LM, RNN VAE, Word Drop, and CNN VAE on LL/ELBO, reconstruction, and KL; on the Yahoo corpus, from Yang et al. [2017]] 124/153

173 Mitigating Posterior Collapse Gradually anneal the multiplier on the KL term, i.e. optimize E_{q(z | x; λ)}[log p(x | z; θ)] − β KL[q(z | x; λ) || p(z)], where β goes from 0 to 1 as training progresses. [Figure from Bowman et al. [2016]] 125/153
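
A minimal sketch of a linear annealing schedule applied to a per-example objective (the warm-up length and the two loss terms are placeholders):

def beta_schedule(step, warmup_steps=10000):
    # beta rises linearly from 0 to 1 over the first warmup_steps updates
    return min(1.0, step / warmup_steps)

def annealed_elbo(reconstruction_ll, kl, step):
    # reconstruction_ll = E_q[log p(x | z)], kl = KL[q(z | x) || p(z)]
    return reconstruction_ll - beta_schedule(step) * kl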

174 Mitigating Posterior Collapse Other approaches: use auxiliary losses (e.g. train z as part of a topic model) [Dieng et al. 2017; Wang et al. 2018]; use a von Mises-Fisher distribution with a fixed concentration parameter [Guu et al. 2017; Xu and Durrett 2018]; combine stochastic/amortized variational inference [Kim et al. 2018]; add skip connections [Dieng et al. 2018]. In practice, it is often necessary to combine various methods. 126/153

175 Issue 2: Evaluation The ELBO always lower bounds log p(x; θ), so we can calculate an upper bound on perplexity efficiently. When reporting the ELBO, we should also separately report KL[q(z | x; λ) || p(z)] to give an indication of how much the latent variable is being used. 127/153

176 Issue 2: Evaluation We can also evaluate log p(x; θ) with importance sampling: p(x; θ) = E_{q(z | x; λ)}[p(x | z; θ) p(z) / q(z | x; λ)] ≈ (1/K) ∑_{k=1}^K p(x | z^{(k)}; θ) p(z^{(k)}) / q(z^{(k)} | x; λ), so log p(x; θ) ≈ log (1/K) ∑_{k=1}^K p(x | z^{(k)}; θ) p(z^{(k)}) / q(z^{(k)} | x; λ). 128/153

177 Qualitative Evaluation Evaluate samples from the prior/variational posterior; interpolation in latent space. [Figure from Bowman et al. [2016]] 129/153

179 Encoder/Decoder [Sutskever et al. 2014; Cho et al. 2014] Given: source information s = s_1, ..., s_M. Generative process: draw x_{1:T} | s ∼ CRNNLM(θ, enc(s)). 131/153

180 Latent, Per-token Experts [Yang et al. 2018] Generative process: For t = 1, ..., T: Draw z_t | x_{<t}, s ∼ softmax(U h_t). Draw x_t | z_t, x_{<t}, s ∼ softmax(W tanh(Q_{z_t} h_t); θ). Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics [Plate diagram: latents z_1^(n), ..., z_T^(n) over tokens x_1^(n), ..., x_T^(n), for documents n = 1, ..., N.] If U ∈ R^{K×d}, we have K experts; this increases the flexibility of the per-token distribution. 132/153

181 Case-Study: Latent Per-token Experts [Yang et al. 2018] Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Learning: the z_t are independent given x_{<t}, so we can marginalize at each time-step (Method 3: Conjugacy): argmax_θ log p(x | s; θ) = argmax_θ Σ_{t=1}^T log Σ_{k=1}^K p(z_t = k | s, x_{<t}; θ) p(x_t | z_t = k, x_{<t}, s; θ). Test-time: argmax_{x_{1:T}} Π_{t=1}^T Σ_{k=1}^K p(z_t = k | s, x_{<t}; θ) p(x_t | z_t = k, x_{<t}, s; θ). 133/153
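A sketch of the exact per-time-step marginalization for this mixture-of-softmaxes model: the K expert distributions are mixed under the prior softmax(U h_t), so no sampling is needed. Tensor shapes and names (K, d, vocab) are illustrative assumptions.

```python
# Per-token marginal p(x_t | x_<t, s) for K softmax experts, summed out exactly.
import torch
import torch.nn.functional as F

def mixture_of_softmaxes(h_t, U, Q, W):
    # h_t: (batch, d), U: (K, d), Q: (K, d, d), W: (vocab, d)
    pi = F.softmax(h_t @ U.t(), dim=-1)                      # (batch, K) expert weights
    proj = torch.tanh(torch.einsum('kde,be->bkd', Q, h_t))   # (batch, K, d)
    expert_probs = F.softmax(proj @ W.t(), dim=-1)           # (batch, K, vocab)
    return torch.einsum('bk,bkv->bv', pi, expert_probs)      # marginal over experts
```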

182 Case-Study: Latent, Per-token Experts [Yang et al. 2018] PTB language modeling results (s is constant): Model | PPL, with rows Merity et al. [2018] and Softmax-mixture [Yang et al. 2018]. Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Dialogue generation results (s is context): Model | BLEU Prec | BLEU Rec, with rows No mixture and Softmax-mixture [Yang et al. 2018]. 134/153

183 Attention [Bahdanau et al. 2015] Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Decoding with an attention mechanism: x_t | x_{<t}, s ∼ softmax(W [h_t, Σ_{m=1}^M α_{t,m} enc(s)_m]). 135/153

184 Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Copy Attention [Gu et al. 2016; Gulcehre et al. 2016] Copy attention models copying words directly from s. Generative process: For t = 1, ..., T: Set α_t to be the attention weights. Draw z_t | x_{<t}, s ∼ Bern(MLP([h_t, enc(s)])). If z_t = 0: Draw x_t | z_t, x_{<t}, s ∼ softmax(W h_t). Else: Draw x_t ∈ {s_1, ..., s_M} | z_t, x_{<t}, s ∼ Cat(α_t). 136/153

185 Copy Attention Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Learning: Can maximize the log per-token marginal [Gu et al. 2016], as with per-token experts: max_θ log p(x_1, ..., x_T | s; θ) = max_θ Σ_{t=1}^T log Σ_{z∈{0,1}} p(z_t = z | s, x_{<t}; θ) p(x_t | z, x_{<t}, s; θ). Test-time: argmax_{x_{1:T}} Π_{t=1}^T Σ_{z∈{0,1}} p(z_t = z | s, x_{<t}; θ) p(x_t | z, x_{<t}, s; θ). 137/153
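A hedged sketch of the per-token marginal above for copy attention: the binary gate z_t is summed out analytically, mixing the generation softmax with the attention mass on source positions whose token equals x_t. The names (gen_logits, attn, p_copy, src) are illustrative, not from the papers' code.

```python
# Per-token log marginal for copy attention, summing out the copy gate z_t.
import torch
import torch.nn.functional as F

def copy_marginal_logprob(gen_logits, attn, p_copy, src, x_t):
    # gen_logits: (batch, vocab); attn: (batch, M); p_copy: (batch,)
    # src: (batch, M) source token ids; x_t: (batch,) target token id
    p_gen = F.softmax(gen_logits, dim=-1).gather(1, x_t.unsqueeze(1)).squeeze(1)
    copy_mass = (attn * (src == x_t.unsqueeze(1)).float()).sum(dim=1)
    marginal = (1.0 - p_copy) * p_gen + p_copy * copy_mass
    return torch.log(marginal + 1e-12)
```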

186 Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Attention as a Latent Variable [Deng et al. 2018] Generative process: For t = 1, ..., T: Set α_t to be the attention weights. Draw z_t | x_{<t}, s ∼ Cat(α_t). Draw x_t | z_t, x_{<t}, s ∼ softmax(W [h_t, enc(s)_{z_t}]; θ). 138/153

187 Attention as a Latent Variable [Deng et al. 2018] Marginal likelihood under the latent attention model: p(x_{1:T} | s; θ) = Π_{t=1}^T Σ_{m=1}^M α_{t,m} softmax(W [h_t, enc(s)_m]; θ)_{x_t}. Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Standard attention likelihood: p(x_{1:T} | s; θ) = Π_{t=1}^T softmax(W [h_t, Σ_{m=1}^M α_{t,m} enc(s)_m]; θ)_{x_t}. 139/153
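A sketch contrasting the two per-token likelihoods above: latent attention mixes M per-position softmaxes under α_t, while standard attention applies a single softmax to the attention-weighted context. The names (W, h_t, enc_s, alpha) are illustrative.

```python
# Latent attention vs. standard attention, per time step t.
import torch
import torch.nn.functional as F

def latent_attention_prob(h_t, enc_s, alpha, W):
    # h_t: (B, d); enc_s: (B, M, d); alpha: (B, M); W: (vocab, 2d)
    B, M, d = enc_s.size()
    inp = torch.cat([h_t.unsqueeze(1).expand(B, M, d), enc_s], dim=-1)  # (B, M, 2d)
    per_pos = F.softmax(inp @ W.t(), dim=-1)                            # (B, M, vocab)
    return torch.einsum('bm,bmv->bv', alpha, per_pos)  # sum_m alpha_{t,m} softmax(...)

def standard_attention_prob(h_t, enc_s, alpha, W):
    ctx = torch.einsum('bm,bmd->bd', alpha, enc_s)       # attention-weighted context
    return F.softmax(torch.cat([h_t, ctx], dim=-1) @ W.t(), dim=-1)
```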

188 Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Attention as a Latent Variable [Deng et al. 2018] Learning Strategy #1: Maximize the log marginal via enumeration, as above. Learning Strategy #2: Maximize the ELBO with AVI: max_{λ,θ} E_{q(z_t; λ)}[log p(x_t | x_{<t}, z_t, s)] − KL[q(z_t; λ) || p(z_t | x_{<t}, s)]. q(z_t | x; λ) approximates p(z_t | x_{1:T}, s; θ); implemented with a BLSTM. q isn't reparameterizable, so gradients are obtained using REINFORCE + a baseline. 140/153
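A minimal sketch of the score-function (REINFORCE) surrogate with a baseline for the non-reparameterizable categorical q(z_t; λ): backpropagating through it gives (reward − baseline) ∇_λ log q(z_t) for the inference network while still passing the usual gradient to θ; the exact KL term would be added separately. The baseline is assumed to be, for example, a running mean of the reward.

```python
# REINFORCE surrogate loss with a baseline control variate.
import torch

def reinforce_surrogate(log_q_z, reward, baseline):
    # log_q_z: (batch,) log q(z_t^(sample); lambda) for the sampled z_t
    # reward:  (batch,) log p(x_t | x_<t, z_t, s; theta) for the same samples
    # baseline: scalar control variate, e.g. a running mean of reward
    advantage = (reward - baseline).detach()
    # advantage * log_q_z drives the lambda gradient; reward drives the theta gradient
    return (advantage * log_q_z + reward).mean()
```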

189 Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Attention as a Latent Variable [Deng et al. 2018] Test-time: Calculate p(x_t | x_{<t}, s; θ) by summing out z_t. MT results on IWSLT-2014: Model | PPL | BLEU, with rows Standard Attn, Latent Attn (marginal), and Latent Attn (ELBO). 141/153

190 Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Encoder/Decoder with Structured Latent Variables At least two EMNLP 2018 papers augment encoder/decoder text generation models with structured latent variables: 1 Lee et al. [2018] generate x_{1:T} by iteratively refining sequences of words z_{1:T}. 2 Wiseman et al. [2018] generate x_{1:T} conditioned on a latent template or plan z_{1:S}. 142/153

191 5 Encoder/Decoder with Latent Variables 6 Latent Summaries and Topics 143/153

192 Summary as a Latent Variable [Miao and Blunsom 2016] Generative process for a document x = x_1, ..., x_T: Draw a latent summary z_1, ..., z_M ∼ RNNLM(θ). Draw x_1, ..., x_T | z_{1:M} ∼ CRNNLM(θ, z). Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Posterior: p(z_{1:M} | x_{1:T}; θ) = p(summary | document; θ). 144/153

194 Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Summary as a Latent Variable [Miao and Blunsom 2016] Learning: Maximize the ELBO with an amortized family: max_{λ,θ} E_{q(z_{1:M}; λ)}[log p(x_{1:T} | z_{1:M}; θ)] − KL[q(z_{1:M}; λ) || p(z_{1:M}; θ)]. q(z_{1:M}; λ) approximates p(z_{1:M} | x_{1:T}; θ); also implemented with encoder/decoder RNNs. q(z_{1:M}; λ) is not reparameterizable, so gradients use REINFORCE + baselines. 145/153

195 Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Summary as a Latent Variable [Miao and Blunsom 2016] Semi-supervised Training: Can also use documents without corresponding summaries in training. Train q(z_{1:M}; λ) ≈ p(z_{1:M} | x_{1:T}; θ) with labeled examples. Infer a summary z for an unlabeled document with q. Use the inferred z to improve the model p(x_{1:T} | z_{1:M}; θ). Allows for outperforming strictly supervised models! 146/153
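A hedged, pseudocode-style sketch of this semi-supervised recipe. The callables (log_p_x_given_z, log_q_z_given_x, sample_q, log_p_z) stand in for the generative and inference networks and are assumptions, not Miao and Blunsom's actual interface.

```python
# Semi-supervised objective: supervise p and q on labeled pairs, and use
# q-inferred summaries (via a single-sample ELBO) on unlabeled documents.
def semi_supervised_loss(x, z, log_p_x_given_z, log_q_z_given_x, sample_q, log_p_z):
    if z is not None:
        # labeled (document, summary) pair: train both p and q directly
        return -(log_p_x_given_z(x, z) + log_q_z_given_x(z, x))
    # unlabeled document: infer a summary with q, then improve the generative model
    z_hat, log_q = sample_q(x)
    return -(log_p_x_given_z(x, z_hat) + log_p_z(z_hat) - log_q)
```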

196 Topic Models [Blei et al. 2003] Sentence VAE Encoder/Decoder with Latent Variables Latent Summaries and Topics Generative process: for each document x^(n) = x_1^(n), ..., x_T^(n): Draw a topic distribution z_top^(n) ∼ Dir(α). For t = 1, ..., T: Draw a topic z_t^(n) ∼ Cat(z_top^(n)). Draw x_t^(n) ∼ Cat(β_{z_t^(n)}). 147/153
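A minimal NumPy sketch of this generative story (ancestral sampling only, no inference). The inputs alpha, beta, T, and seed are illustrative, with the rows of beta assumed to be topic-word distributions.

```python
# Sample one document from the LDA generative process.
import numpy as np

def generate_document(alpha, beta, T, seed=0):
    # alpha: (K,) Dirichlet prior; beta: (K, V) array, rows are topic-word distributions
    rng = np.random.default_rng(seed)
    z_top = rng.dirichlet(alpha)                      # document topic distribution
    doc = []
    for _ in range(T):
        z_t = rng.choice(len(alpha), p=z_top)         # draw topic z_t ~ Cat(z_top)
        x_t = rng.choice(beta.shape[1], p=beta[z_t])  # draw word x_t ~ Cat(beta_{z_t})
        doc.append(x_t)
    return doc
```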
